DeadLinkChecker

A Python tool that crawls websites to identify broken links (404 errors), helping webmasters and content managers keep their sites free of dead links.

Features

  • Domain-specific crawling: Stays within the specified domain and its subdomains (sketched below)
  • Configurable crawl limit: Set the maximum number of pages to crawl
  • Concurrent link checking: Efficiently checks for 404 errors using multiple threads
  • Progress updates: Shows status updates during crawling and link checking
  • Detailed reporting: Generates both console output and a text file report
  • URL relationship mapping: Identifies which pages contain broken links
  • Respectful crawling: Configurable delay between requests to reduce server load
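
The first feature, staying on the original domain and its subdomains, comes down to a hostname comparison, and link extraction is a straightforward BeautifulSoup pass over each fetched page. The sketch below illustrates both ideas using the two declared dependencies; the function names are illustrative and not necessarily those used in website_crawler.py.

    # A minimal sketch of domain-restricted crawling and link extraction.
    # Function names here are illustrative, not the actual implementation.
    import time
    from urllib.parse import urljoin, urlparse

    import requests
    from bs4 import BeautifulSoup

    def in_scope(url, base_domain):
        """Return True if url lies on base_domain or one of its subdomains."""
        host = urlparse(url).netloc.lower()
        base = base_domain.lower()
        return host == base or host.endswith("." + base)

    def extract_links(page_url, delay=0.1):
        """Fetch one page and return absolute URLs for every <a href> it contains."""
        time.sleep(delay)  # configurable delay between requests to reduce server load
        response = requests.get(page_url, timeout=10)
        soup = BeautifulSoup(response.text, "html.parser")
        return [urljoin(page_url, a["href"]) for a in soup.find_all("a", href=True)]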

Requirements

  • Python 3.6+
  • Required packages:
    • requests
    • beautifulsoup4
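
The installation steps below install these packages from the repository's requirements.txt; based on the list above, that file needs to name at least the two libraries:

    requests
    beautifulsoup4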

Installation

  1. Clone this repository:

    git clone https://github.com/manvirheer/DeadLinkChecker.git
    cd DeadLinkChecker
  2. Install the required dependencies:

    pip install -r requirements.txt

Usage

Basic usage:

python website_crawler.py --start-url "https://www.example.org" --org-name "Example Organization"

Command-line Arguments

Argument              Default                   Description
--start-url           https://www.example.org   URL to start crawling from
--org-name            Example                   Organization name for reporting
--max-pages           100                       Maximum number of pages to crawl
--concurrent          5                         Number of concurrent requests for link checking
--delay               0.1                       Delay between requests in seconds
--progress-interval   500                       Number of links to check before a progress update
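
For reference, the sketch below shows how a parser for these options could be defined with argparse; the flag names and defaults are taken from the table, but the actual parser in website_crawler.py may differ in its help text and details.

    # A sketch of an argparse setup matching the table above.
    import argparse

    parser = argparse.ArgumentParser(description="Crawl a website and report 404 links.")
    parser.add_argument("--start-url", default="https://www.example.org",
                        help="URL to start crawling from")
    parser.add_argument("--org-name", default="Example",
                        help="Organization name for reporting")
    parser.add_argument("--max-pages", type=int, default=100,
                        help="Maximum number of pages to crawl")
    parser.add_argument("--concurrent", type=int, default=5,
                        help="Number of concurrent requests for link checking")
    parser.add_argument("--delay", type=float, default=0.1,
                        help="Delay between requests in seconds")
    parser.add_argument("--progress-interval", type=int, default=500,
                        help="Number of links to check before a progress update")
    args = parser.parse_args()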

Examples

Crawl a website with more pages:

python website_crawler.py --start-url "https://www.example.org" --org-name "Example" --max-pages 500

Be more respectful to the server with longer delays:

python website_crawler.py --start-url "https://www.example.org" --delay 0.5

Use more concurrent threads for faster checking (be careful with this):

python website_crawler.py --start-url "https://www.example.org" --concurrent 10

Get more frequent progress updates:

python website_crawler.py --start-url "https://www.example.org" --progress-interval 100

How It Works

  1. Initialization: Sets up the crawler with the specified domain and parameters
  2. Crawling process:
    • Starts from the provided URL
    • Extracts all links on each page
    • Follows links that belong to the same domain or subdomains
    • Builds a map of page relationships
  3. Link checking (sketched below):
    • Tests all unique links found during crawling
    • Records only 404 (Not Found) errors
    • Provides progress updates at specified intervals
  4. Reporting:
    • Displays a summary of findings in the console
    • Lists all 404 links found
    • Shows which pages contain those broken links
    • Saves a detailed report to a text file
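
The sketch below illustrates the link-checking step, assuming a thread pool supplies the concurrency; the helper names, the use of HEAD requests, and the progress format are illustrative rather than the exact implementation in website_crawler.py.

    # A rough sketch of concurrent 404 checking with a thread pool.
    import time
    from concurrent.futures import ThreadPoolExecutor, as_completed

    import requests

    def check_link(url, delay=0.1, timeout=10):
        """Return (url, status_code); status is None when the request itself fails."""
        time.sleep(delay)  # configurable delay to reduce server load
        try:
            # HEAD keeps the check lightweight; a GET fallback may be needed
            # for servers that reject HEAD requests.
            response = requests.head(url, allow_redirects=True, timeout=timeout)
            return url, response.status_code
        except requests.RequestException:
            return url, None

    def find_404s(urls, concurrent=5, progress_interval=500):
        """Check every unique URL and return only those that answered 404."""
        broken = []
        with ThreadPoolExecutor(max_workers=concurrent) as pool:
            futures = [pool.submit(check_link, url) for url in urls]
            for done, future in enumerate(as_completed(futures), start=1):
                url, status = future.result()
                if status == 404:
                    broken.append(url)
                if done % progress_interval == 0:
                    print(f"Checked {done}/{len(futures)} links...")
        return broken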

Sample Output

Console output:

Starting crawl of Example Organization website from https://www.example.org
Crawling (1/100): https://www.example.org
Crawling (2/100): https://www.example.org/about
...
Crawling completed. Visited 100 pages.
Checking 1245 unique links for 404 errors...
Checked 500/1245 links...
Checked 1000/1245 links...
Checked 1245/1245 links...

===== Example Organization Website Crawl Report =====
Starting URL: https://www.example.org
Base domain: www.example.org
Pages crawled: 100
Total unique URLs found: 1245
404 errors found: 23

----- 404 Links -----
[404] https://www.example.org/outdated-page
[404] https://www.example.org/products/discontinued-item
...

----- Pages Containing 404 Links -----
Source: https://www.example.org/products
  → [404] https://www.example.org/products/discontinued-item
...

Report saved to example_organization_404_report.txt

Ethical Use

Please use this tool responsibly:

  • Respect websites' robots.txt files and terms of service (a quick check is sketched after this list)
  • Use reasonable delays between requests to avoid overloading servers
  • Consider running during off-peak hours for high-traffic websites
  • Obtain permission before crawling private or commercial websites extensively
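
If you want to confirm that a start URL is allowed before launching a crawl, Python's standard-library urllib.robotparser makes a quick manual check easy. This is a user-side sketch, not a feature of DeadLinkChecker itself:

    # A user-side robots.txt check using only the standard library.
    from urllib.robotparser import RobotFileParser

    robots = RobotFileParser()
    robots.set_url("https://www.example.org/robots.txt")
    robots.read()

    start_url = "https://www.example.org"
    if robots.can_fetch("*", start_url):
        print("Crawling appears to be allowed for", start_url)
    else:
        print("robots.txt disallows crawling", start_url)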

License

MIT License

Contributing

Contributions are welcome! Please feel free to submit a Pull Request.

  1. Fork the repository
  2. Create your feature branch (git checkout -b feature/amazing-feature)
  3. Commit your changes (git commit -m 'Add some amazing feature')
  4. Push to the branch (git push origin feature/amazing-feature)
  5. Open a Pull Request
