# Web Crawler
## Overview
This web crawler crawls a given website and generates a report for all the internal and external links found during the crawl.
### Repository mirrors
- **Code Flow:** https://codeflow.dananglin.me.uk/apollo/web-crawler
- **GitHub:** https://github.com/dananglin/web-crawler
## Requirements
- **Go:** A minimum version of Go 1.23.0 is required to build or install the web crawler. You can download the latest version [here](https://go.dev/dl/).
## Build the application
Clone this repository to your local machine.
```
git clone https://github.com/dananglin/web-crawler.git
```
Build the application.
- Build with go
```
go build -o crawler .
```
- Or build with [mage](https://magefile.org/) if you have it installed.
```
mage build
```
## Run the application
Run the application, specifying the website that you want to crawl.
### Format
`./crawler [FLAGS] URL`
### Examples
- Crawl the [Crawler Test Site](https://crawler-test.com).
```
./crawler https://crawler-test.com
```
- Crawl the site using 3 concurrent workers and stop the crawl after discovering a maximum of 100 unique pages.
```
./crawler --max-workers 3 --max-pages 100 https://crawler-test.com
```
- Crawl the site and print out a CSV report.
```
./crawler --max-workers 3 --max-pages 100 --format csv https://crawler-test.com
```
- Crawl the site and save the report to a CSV file.
```
mkdir -p reports
./crawler --max-workers 3 --max-pages 100 --format csv --file reports/report.csv https://crawler-test.com
```
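Conceptually, the `--max-workers` and `--max-pages` options describe a bounded worker pool: a fixed number of workers pull URLs from a queue and stop once the page limit is reached. The sketch below illustrates that idea only; it is not the crawler's actual implementation, and the fetching/link-extraction step is elided.

```go
package main

import (
	"fmt"
	"sync"
)

// crawl processes the seed URLs with at most maxWorkers concurrent workers,
// deduplicating URLs and stopping new work once maxPages unique pages have
// been recorded. Illustrative sketch only.
func crawl(seeds []string, maxWorkers, maxPages int) []string {
	var (
		mu      sync.Mutex
		visited = make(map[string]bool)
		order   []string
	)

	jobs := make(chan string, len(seeds))
	for _, s := range seeds {
		jobs <- s
	}
	close(jobs)

	var wg sync.WaitGroup
	for i := 0; i < maxWorkers; i++ {
		wg.Add(1)
		go func() {
			defer wg.Done()
			for url := range jobs {
				mu.Lock()
				if visited[url] || len(visited) >= maxPages {
					mu.Unlock()
					continue
				}
				visited[url] = true
				order = append(order, url)
				mu.Unlock()
				// A real crawler would fetch the page here and
				// enqueue any newly discovered internal links.
			}
		}()
	}
	wg.Wait()

	return order
}

func main() {
	pages := crawl([]string{
		"https://example.com/a",
		"https://example.com/b",
		"https://example.com/a", // duplicate, skipped
	}, 2, 10)
	fmt.Println(len(pages)) // 2 unique pages recorded
}
```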
## Flags
You can configure the application with the following flags.
| Name | Description | Default |
|------|-------------|---------|
| `max-workers` | The maximum number of concurrent workers. | 2 |
| `max-pages` | The maximum number of pages the crawler can discover before stopping the crawl. | 10 |
| `format` | The format of the generated report.<br>Currently supports `text` and `csv`. | text |
| `file` | The file to save the generated report to.<br>Leave this empty to print to the screen instead. | |
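For readers curious how a flag set like this looks in Go, the sketch below declares the same flags with the standard library's `flag` package, using the defaults from the table. The `config` type and `parseArgs` helper are hypothetical names for illustration, not the crawler's actual code.

```go
package main

import (
	"flag"
	"fmt"
)

// config is a hypothetical struct holding the parsed options.
type config struct {
	maxWorkers int
	maxPages   int
	format     string
	file       string
	url        string
}

// parseArgs declares the documented flags with their documented defaults,
// then expects exactly one positional argument: the URL to crawl.
func parseArgs(args []string) (config, error) {
	var cfg config
	fs := flag.NewFlagSet("crawler", flag.ContinueOnError)
	fs.IntVar(&cfg.maxWorkers, "max-workers", 2, "maximum number of concurrent workers")
	fs.IntVar(&cfg.maxPages, "max-pages", 10, "maximum number of pages to discover before stopping")
	fs.StringVar(&cfg.format, "format", "text", "report format: text or csv")
	fs.StringVar(&cfg.file, "file", "", "file to save the report to; empty prints to the screen")

	if err := fs.Parse(args); err != nil {
		return cfg, err
	}
	if fs.NArg() != 1 {
		return cfg, fmt.Errorf("usage: crawler [FLAGS] URL")
	}
	cfg.url = fs.Arg(0)

	return cfg, nil
}

func main() {
	cfg, err := parseArgs([]string{"--max-workers", "3", "--format", "csv", "https://crawler-test.com"})
	if err != nil {
		fmt.Println(err)
		return
	}
	fmt.Println(cfg.url, cfg.maxWorkers, cfg.maxPages, cfg.format)
}
```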