A Python tool that crawls a website and identifies broken links (HTTP 404 errors), helping webmasters and content managers find and fix dead links on their sites.
- Domain-specific crawling: Stays within the specified domain and its subdomains (see the sketch after this list)
- Configurable crawl limit: Set the maximum number of pages to crawl
- Concurrent link checking: Efficiently checks for 404 errors using multiple threads
- Progress updates: Shows status updates during crawling and link checking
- Detailed reporting: Generates both console output and a text file report
- URL relationship mapping: Identifies which pages contain broken links
- Respectful crawling: Configurable delay between requests to reduce server load
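The domain restriction mentioned above boils down to a hostname comparison. Here is a minimal sketch; the helper name is_same_site and the exact matching rule are assumptions for illustration, not necessarily what the tool does internally:

```python
from urllib.parse import urlparse

def is_same_site(url: str, base_domain: str) -> bool:
    """Return True if url is on base_domain or one of its subdomains.

    Hypothetical helper for illustration; the crawler's actual rule may differ.
    """
    host = urlparse(url).netloc.lower()
    base = base_domain.lower()
    return host == base or host.endswith("." + base)

# With base_domain "example.org":
#   https://www.example.org/about  -> True  (subdomain)
#   https://othersite.com/page     -> False (different domain)
```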
- Python 3.6+
- Required packages:
  - requests
  - beautifulsoup4
- Clone this repository:
  git clone https://github.com/manvirheer/DeadLinkChecker.git
  cd DeadLinkChecker
- Install the required dependencies:
  pip install -r requirements.txt
Basic usage:
python website_crawler.py --start-url "https://www.example.org" --org-name "Example Organization"
| Argument | Default | Description |
|---|---|---|
| --start-url | https://www.example.org | URL to start crawling from |
| --org-name | Example | Organization name for reporting |
| --max-pages | 100 | Maximum number of pages to crawl |
| --concurrent | 5 | Number of concurrent requests for link checking |
| --delay | 0.1 | Delay between requests in seconds |
| --progress-interval | 500 | Number of links to check before progress update |
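For reference, the options in this table map naturally onto Python's argparse. The sketch below mirrors the documented names and defaults, but the actual definition inside website_crawler.py may be organized differently:

```python
import argparse

def parse_args():
    # Hypothetical CLI definition that mirrors the documented options and defaults.
    parser = argparse.ArgumentParser(description="Crawl a website and report 404 links.")
    parser.add_argument("--start-url", default="https://www.example.org",
                        help="URL to start crawling from")
    parser.add_argument("--org-name", default="Example",
                        help="Organization name for reporting")
    parser.add_argument("--max-pages", type=int, default=100,
                        help="Maximum number of pages to crawl")
    parser.add_argument("--concurrent", type=int, default=5,
                        help="Number of concurrent requests for link checking")
    parser.add_argument("--delay", type=float, default=0.1,
                        help="Delay between requests in seconds")
    parser.add_argument("--progress-interval", type=int, default=500,
                        help="Number of links to check before progress update")
    return parser.parse_args()
```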
Crawl a website with more pages:
python website_crawler.py --start-url "https://www.example.org" --org-name "Example" --max-pages 500
Be more respectful to the server with longer delays:
python website_crawler.py --start-url "https://www.example.org" --delay 0.5
Use more concurrent threads for faster checking (higher concurrency sends more simultaneous requests to the server, so use it with care):
python website_crawler.py --start-url "https://www.example.org" --concurrent 10
Get more frequent progress updates:
python website_crawler.py --start-url "https://www.example.org" --progress-interval 100
- Initialization: Sets up the crawler with the specified domain and parameters
- Crawling process (see the sketch after this list):
  - Starts from the provided URL
  - Extracts all links on each page
  - Follows links that belong to the same domain or its subdomains
  - Builds a map of page relationships
- Link checking:
  - Tests all unique links found during crawling
  - Records only 404 (Not Found) errors
  - Provides progress updates at specified intervals
- Reporting:
  - Displays a summary of findings in the console
  - Lists all 404 links found
  - Shows which pages contain those broken links
  - Saves a detailed report to a text file
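To make these steps concrete, here is a heavily simplified sketch of a crawl-then-check loop built on requests, BeautifulSoup, and a thread pool. The function names, error handling, and link-filtering rule are assumptions for illustration; the real website_crawler.py may differ.

```python
import time
from collections import deque
from concurrent.futures import ThreadPoolExecutor
from urllib.parse import urljoin, urlparse

import requests
from bs4 import BeautifulSoup

def crawl(start_url, max_pages=100, delay=0.1):
    """Breadth-first crawl that records which page links to which URL."""
    base = urlparse(start_url).netloc
    queue, visited = deque([start_url]), set()
    sources = {}  # URL -> set of pages that link to it
    while queue and len(visited) < max_pages:
        page = queue.popleft()
        if page in visited:
            continue
        visited.add(page)
        try:
            html = requests.get(page, timeout=10).text
        except requests.RequestException:
            continue
        for anchor in BeautifulSoup(html, "html.parser").find_all("a", href=True):
            url = urljoin(page, anchor["href"])
            sources.setdefault(url, set()).add(page)
            if urlparse(url).netloc.endswith(base):  # stay on the domain/subdomains
                queue.append(url)
        time.sleep(delay)  # polite pause between page fetches
    return visited, sources

def find_404s(urls, concurrent=5):
    """Check each unique URL and return only those answering 404."""
    def status(url):
        try:
            return url, requests.head(url, allow_redirects=True, timeout=10).status_code
        except requests.RequestException:
            return url, None

    with ThreadPoolExecutor(max_workers=concurrent) as pool:
        return [url for url, code in pool.map(status, urls) if code == 404]
```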
Console output:
Starting crawl of Example Organization website from https://www.example.org
Crawling (1/100): https://www.example.org
Crawling (2/100): https://www.example.org/about
...
Crawling completed. Visited 100 pages.
Checking 1245 unique links for 404 errors...
Checked 500/1245 links...
Checked 1000/1245 links...
Checked 1245/1245 links...
===== Example Organization Website Crawl Report =====
Starting URL: https://www.example.org
Base domain: www.example.org
Pages crawled: 100
Total unique URLs found: 1245
404 errors found: 23
----- 404 Links -----
[404] https://www.example.org/outdated-page
[404] https://www.example.org/products/discontinued-item
...
----- Pages Containing 404 Links -----
Source: https://www.example.org/products
→ [404] https://www.example.org/products/discontinued-item
...
Report saved to example_organization_404_report.txt
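The report file name in this example is derived from the organization name (lowercased, with spaces replaced by underscores). A rough sketch of how such a report could be written follows; the function name and exact file format are assumptions rather than the tool's actual output code:

```python
def write_report(org_name, start_url, broken_links, sources):
    """Write 404 URLs and the pages that reference them to a text file.

    `broken_links` is a list of 404 URLs and `sources` maps each URL to the
    pages linking to it (both assumed, as produced by the crawl sketch above).
    """
    filename = f"{org_name.lower().replace(' ', '_')}_404_report.txt"
    with open(filename, "w", encoding="utf-8") as report:
        report.write(f"===== {org_name} Website Crawl Report =====\n")
        report.write(f"Starting URL: {start_url}\n")
        report.write(f"404 errors found: {len(broken_links)}\n\n")
        for url in broken_links:
            report.write(f"[404] {url}\n")
            for page in sorted(sources.get(url, [])):
                report.write(f"    found on: {page}\n")
    return filename
```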
Please use this tool responsibly:
- Respect websites' robots.txt files and terms of service (one way to check robots.txt is sketched below)
- Use reasonable delays between requests to avoid overloading servers
- Consider running during off-peak hours for high-traffic websites
- Obtain permission before crawling private or commercial websites extensively
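The guidance above asks users to respect robots.txt, but the documentation does not state that the tool enforces it. If you want to add that check yourself, the standard library's urllib.robotparser is one option; a minimal sketch:

```python
from urllib.parse import urlparse
from urllib.robotparser import RobotFileParser

def allowed_by_robots(url: str, user_agent: str = "DeadLinkChecker") -> bool:
    """Return True if the host's robots.txt permits fetching this URL."""
    parts = urlparse(url)
    parser = RobotFileParser(f"{parts.scheme}://{parts.netloc}/robots.txt")
    try:
        parser.read()
    except OSError:
        return True  # robots.txt unreachable; assume fetching is allowed
    return parser.can_fetch(user_agent, url)
```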
Contributions are welcome! Please feel free to submit a Pull Request.
- Fork the repository
- Create your feature branch (git checkout -b feature/amazing-feature)
- Commit your changes (git commit -m 'Add some amazing feature')
- Push to the branch (git push origin feature/amazing-feature)
- Open a Pull Request