Web Crawler

Overview

This web crawler crawls a given website and generates a report for all the internal and external links found during the crawl.
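
A link counts as internal when it points at the same host as the site being crawled, and external otherwise. The sketch below shows one way to make that distinction using Go's standard net/url package; the function name is illustrative and this is not the crawler's actual source.

package main

import (
	"fmt"
	"net/url"
)

// isInternal reports whether link points to the same host as base.
// Illustrative sketch only: absolute URLs are assumed, and relative links
// would first need resolving against base with baseURL.ResolveReference.
func isInternal(base, link string) (bool, error) {
	baseURL, err := url.Parse(base)
	if err != nil {
		return false, err
	}
	linkURL, err := url.Parse(link)
	if err != nil {
		return false, err
	}
	return baseURL.Hostname() == linkURL.Hostname(), nil
}

func main() {
	internal, _ := isInternal("https://crawler-test.com", "https://crawler-test.com/links")
	external, _ := isInternal("https://crawler-test.com", "https://example.org")
	fmt.Println(internal, external) // true false
}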

Requirements

  • Go: A minimum version of Go 1.23.0 is required to build or install the web crawler. You can download the latest version from https://go.dev/dl/.
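
You can check which version of Go is installed on your machine with:

go version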

Build the application

Clone this repository to your local machine.

git clone https://github.com/dananglin/web-crawler.git

Then build the application in one of two ways:

  • Build with go
    go build -o crawler .
    
  • Or build with mage if you have it installed.
    mage build
    

Run the application

Run the application specifying the website that you want to crawl.

Format

./crawler [FLAGS] URL

Examples

  • Crawl the Crawler Test Site.
    ./crawler https://crawler-test.com
    
  • Crawl the site using 3 concurrent workers and stop the crawl after discovering a maximum of 100 unique pages (see the worker-pool sketch after this list).
    ./crawler --max-workers 3 --max-pages 100 https://crawler-test.com
    
  • Crawl the site and print out a JSON report (a report-writing sketch for the JSON and CSV formats also follows this list).
    ./crawler --max-workers 3 --max-pages 100 --format json https://crawler-test.com
    
  • Crawl the site and save the report to a CSV file.
    mkdir -p reports
    ./crawler --max-workers 3 --max-pages 100 --format csv --file reports/report.csv https://crawler-test.com
    
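The --max-workers flag bounds how many pages are fetched at the same time. Below is a minimal sketch of that pattern, using a buffered channel as a semaphore; it illustrates the technique only and is not the crawler's actual source.

package main

import (
	"fmt"
	"sync"
)

// crawlAll visits every page with at most maxWorkers goroutines in flight.
func crawlAll(pages []string, maxWorkers int) {
	sem := make(chan struct{}, maxWorkers) // semaphore bounding concurrency
	var wg sync.WaitGroup

	for _, page := range pages {
		wg.Add(1)
		sem <- struct{}{} // blocks while maxWorkers fetches are in flight
		go func(p string) {
			defer wg.Done()
			defer func() { <-sem }()
			fmt.Println("crawling", p) // the real crawler would fetch and parse here
		}(page)
	}

	wg.Wait()
}

func main() {
	crawlAll([]string{"/", "/about", "/links", "/contact"}, 3)
}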
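
Likewise, here is a sketch of how the json and csv report formats might be produced with the standard library's encoding/json and encoding/csv packages. The pageCount record type is a hypothetical stand-in; the real report structure may differ.

package main

import (
	"encoding/csv"
	"encoding/json"
	"os"
	"strconv"
)

// pageCount is a hypothetical record pairing a URL with how often it was seen.
type pageCount struct {
	URL   string `json:"url"`
	Count int    `json:"count"`
}

// writeJSON prints the results as an indented JSON array.
func writeJSON(out *os.File, results []pageCount) error {
	enc := json.NewEncoder(out)
	enc.SetIndent("", "  ")
	return enc.Encode(results)
}

// writeCSV prints the results as CSV with a header row.
func writeCSV(out *os.File, results []pageCount) error {
	w := csv.NewWriter(out)
	if err := w.Write([]string{"url", "count"}); err != nil {
		return err
	}
	for _, r := range results {
		if err := w.Write([]string{r.URL, strconv.Itoa(r.Count)}); err != nil {
			return err
		}
	}
	w.Flush()
	return w.Error()
}

func main() {
	results := []pageCount{{URL: "https://crawler-test.com/links", Count: 3}}
	_ = writeJSON(os.Stdout, results)
	_ = writeCSV(os.Stdout, results)
}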

Flags

You can configure the application with the following flags.

Name         Description                                                          Default
max-workers  The maximum number of concurrent workers.                            2
max-pages    The maximum number of pages to discover before stopping the crawl.   10
format       The format of the generated report: text, csv or json.               text
file         The file to save the report to. Leave empty to print to the screen.
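
As an illustration, flags like these could be declared with Go's standard flag package roughly as shown below; the variable names and usage strings are assumptions rather than the crawler's actual source.

package main

import (
	"flag"
	"fmt"
)

func main() {
	// Defaults mirror the table above; names and usage text are illustrative.
	maxWorkers := flag.Int("max-workers", 2, "maximum number of concurrent workers")
	maxPages := flag.Int("max-pages", 10, "maximum number of pages to discover")
	format := flag.String("format", "text", "report format: text, csv or json")
	file := flag.String("file", "", "file to save the report to; empty prints to the screen")
	flag.Parse()

	// The URL is the remaining positional argument: ./crawler [FLAGS] URL
	url := flag.Arg(0)
	fmt.Println(*maxWorkers, *maxPages, *format, *file, url)
}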