A simple command-line tool for checking the validity of links on a small webpage.
Make sure you have Rust installed, then:
git clone https://github.com/matildasmeds/link_checker
cd link_checker
cargo build --release
./target/release/link_checker https://www.example.com

Replace the URL with the specific URL you want to check!
You can, of course, run it with the good old `cargo run` as well:
cargo run https://www.example.com
We use Tokio as the async runtime, with shared mutable state based on `Arc` and `Mutex`.
A recursive function, `visit_url()`, visits links and scrapes the HTML bodies for more links, but it only follows links that are on the same domain as the starting URL.
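A minimal sketch of how such a recursive crawl can look, assuming `reqwest`, `scraper`, and `url` as dependencies; the name `visit_url` matches the description above, but the exact signature, fields, and error handling here are illustrative rather than copied from the repository:

```rust
use std::collections::HashSet;
use std::sync::{Arc, Mutex};

use reqwest::Client;
use scraper::{Html, Selector};
use url::Url;

type Visited = Arc<Mutex<HashSet<Url>>>;

fn visit_url(
    client: Client,
    url: Url,
    root_domain: String,
    visited: Visited,
) -> std::pin::Pin<Box<dyn std::future::Future<Output = ()> + Send>> {
    // Async recursion needs a boxed future.
    Box::pin(async move {
        // Skip URLs we have already checked (shared state behind Arc<Mutex<_>>).
        if !visited.lock().unwrap().insert(url.clone()) {
            return;
        }

        let body = match client.get(url.clone()).send().await {
            Ok(resp) => match resp.text().await {
                Ok(text) => text,
                Err(_) => return,
            },
            Err(e) => {
                eprintln!("FAILED {url}: {e}");
                return;
            }
        };

        // Scrape the HTML body for more links. The parsed document is dropped
        // before the awaits below, so the future stays Send.
        let links: Vec<Url> = {
            let document = Html::parse_document(&body);
            let selector = Selector::parse("a[href]").unwrap();
            document
                .select(&selector)
                .filter_map(|a| a.value().attr("href"))
                .filter_map(|href| url.join(href).ok())
                .collect()
        };

        for link in links {
            // Only recurse into links on the same domain as the starting URL.
            if link.domain() == Some(root_domain.as_str()) {
                visit_url(client.clone(), link, root_domain.clone(), visited.clone()).await;
            }
        }
    })
}
```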
For links outside the domain, we make a HEAD request instead, because that is enough to validate that the endpoint exists. It also works well for websites such as StackOverflow, Wikipedia and LinkedIn, which tend to reject scrapers: instead of 403s (or LinkedIn's custom 999 status), we get 200s or a 405, and either way we know the link exists.
We treat 405 (Method Not Allowed) as a valid link since the server is confirming the resource exists.
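A hedged sketch of that external-link check under the same assumptions: a HEAD request whose success (or 405) is taken as proof that the endpoint exists. The helper name `external_link_ok` is made up for illustration:

```rust
use reqwest::{Client, StatusCode};

async fn external_link_ok(client: &Client, url: &str) -> bool {
    match client.head(url).send().await {
        // 405 Method Not Allowed still confirms the resource exists;
        // the server just refuses to serve HEAD for it.
        Ok(resp) => {
            resp.status().is_success() || resp.status() == StatusCode::METHOD_NOT_ALLOWED
        }
        Err(_) => false,
    }
}
```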
- The checked links are kept in memory with no bound. If the collection exceeds what the process can allocate, the program will crash.
- We don't validate that `#section` fragments actually exist on the target pages.
- While we limit the number of parallel requests with a semaphore (see the sketch after this list), there is no delay or rate limiting between requests.
- No retries, so in case of temporary glitches, you might need to run the program again.
- Error handling is on the simple side.
- There are no configuration options at the moment. The code can be easily adapted for that, if needed.
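For reference, this is roughly what a semaphore-based cap on parallel requests looks like with `tokio::sync::Semaphore`; the limit of 10, the URL list, and the task body are placeholders, not values taken from this repository:

```rust
use std::sync::Arc;
use tokio::sync::Semaphore;

#[tokio::main]
async fn main() {
    // At most 10 requests in flight at any time (placeholder value).
    let semaphore = Arc::new(Semaphore::new(10));
    let urls = vec!["https://www.example.com/a", "https://www.example.com/b"];

    let mut handles = Vec::new();
    for url in urls {
        // Wait for a free permit before spawning the next task.
        let permit = semaphore.clone().acquire_owned().await.unwrap();
        handles.push(tokio::spawn(async move {
            // ... perform the HTTP request for `url` here ...
            println!("checking {url}");
            drop(permit); // releasing the permit lets the next task start
        }));
    }
    for handle in handles {
        handle.await.unwrap();
    }
}
```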