A Python tool that crawls a website and identifies broken links (HTTP 404 errors), helping webmasters and content managers find and fix dead links on their sites.
- Domain-specific crawling: Stays within the specified domain and its subdomains (see the sketch after this list)
- Configurable crawl limit: Set the maximum number of pages to crawl
- Concurrent link checking: Efficiently checks for 404 errors using multiple threads
- Progress updates: Shows status updates during crawling and link checking
- Detailed reporting: Generates both console output and a text file report
- URL relationship mapping: Identifies which pages contain broken links
- Respectful crawling: Configurable delay between requests to reduce server load
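The domain restriction mentioned above boils down to a hostname comparison. Here is a minimal sketch; the helper name is_same_site and the exact matching rule are assumptions for illustration, not necessarily what the tool does internally:

```python
from urllib.parse import urlparse

def is_same_site(url: str, base_domain: str) -> bool:
    """Return True if url is on base_domain or one of its subdomains.

    Hypothetical helper for illustration; the crawler's actual rule may differ.
    """
    host = urlparse(url).netloc.lower()
    base = base_domain.lower()
    return host == base or host.endswith("." + base)

# With base_domain "example.org":
#   https://www.example.org/about  -> True  (subdomain)
#   https://othersite.com/page     -> False (different domain)
```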
- Python 3.6+
- Required packages:
  - requests
  - beautifulsoup4
- Clone this repository:
  git clone https://github.com/manvirheer/DeadLinkChecker.git
  cd DeadLinkChecker
- Install the required dependencies:
  pip install -r requirements.txt
Basic usage:
python website_crawler.py --start-url "https://www.example.org" --org-name "Example Organization"
| Argument | Default | Description |
|---|---|---|
| --start-url | https://www.example.org | URL to start crawling from |
| --org-name | Example | Organization name for reporting |
| --max-pages | 100 | Maximum number of pages to crawl |
| --concurrent | 5 | Number of concurrent requests for link checking |
| --delay | 0.1 | Delay between requests in seconds |
| --progress-interval | 500 | Number of links to check before progress update |
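For reference, the options in this table map naturally onto Python's argparse. The sketch below mirrors the documented names and defaults, but the actual definition inside website_crawler.py may be organized differently:

```python
import argparse

def parse_args():
    # Hypothetical CLI definition that mirrors the documented options and defaults.
    parser = argparse.ArgumentParser(description="Crawl a website and report 404 links.")
    parser.add_argument("--start-url", default="https://www.example.org",
                        help="URL to start crawling from")
    parser.add_argument("--org-name", default="Example",
                        help="Organization name for reporting")
    parser.add_argument("--max-pages", type=int, default=100,
                        help="Maximum number of pages to crawl")
    parser.add_argument("--concurrent", type=int, default=5,
                        help="Number of concurrent requests for link checking")
    parser.add_argument("--delay", type=float, default=0.1,
                        help="Delay between requests in seconds")
    parser.add_argument("--progress-interval", type=int, default=500,
                        help="Number of links to check before progress update")
    return parser.parse_args()
```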
Crawl a website with more pages:
python website_crawler.py --start-url "https://www.example.org" --org-name "Example" --max-pages 500
Be more respectful to the server with longer delays:
python website_crawler.py --start-url "https://www.example.org" --delay 0.5
Use more concurrent threads for faster checking (higher concurrency sends more simultaneous requests to the server, so use it with care):
python website_crawler.py --start-url "https://www.example.org" --concurrent 10
Get more frequent progress updates:
python website_crawler.py --start-url "https://www.example.org" --progress-interval 100
- Initialization: Sets up the crawler with the specified domain and parameters
- Crawling process (see the sketch after this list):
  - Starts from the provided URL
  - Extracts all links on each page
  - Follows links that belong to the same domain or its subdomains
  - Builds a map of page relationships
- Link checking:
  - Tests all unique links found during crawling
  - Records only 404 (Not Found) errors
  - Provides progress updates at specified intervals
- Reporting:
  - Displays a summary of findings in the console
  - Lists all 404 links found
  - Shows which pages contain those broken links
  - Saves a detailed report to a text file
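To make these steps concrete, here is a heavily simplified sketch of a crawl-then-check loop built on requests, BeautifulSoup, and a thread pool. The function names, error handling, and link-filtering rule are assumptions for illustration; the real website_crawler.py may differ.

```python
import time
from collections import deque
from concurrent.futures import ThreadPoolExecutor
from urllib.parse import urljoin, urlparse

import requests
from bs4 import BeautifulSoup

def crawl(start_url, max_pages=100, delay=0.1):
    """Breadth-first crawl that records which page links to which URL."""
    base = urlparse(start_url).netloc
    queue, visited = deque([start_url]), set()
    sources = {}  # URL -> set of pages that link to it
    while queue and len(visited) < max_pages:
        page = queue.popleft()
        if page in visited:
            continue
        visited.add(page)
        try:
            html = requests.get(page, timeout=10).text
        except requests.RequestException:
            continue
        for anchor in BeautifulSoup(html, "html.parser").find_all("a", href=True):
            url = urljoin(page, anchor["href"])
            sources.setdefault(url, set()).add(page)
            if urlparse(url).netloc.endswith(base):  # stay on the domain/subdomains
                queue.append(url)
        time.sleep(delay)  # polite pause between page fetches
    return visited, sources

def find_404s(urls, concurrent=5):
    """Check each unique URL and return only those answering 404."""
    def status(url):
        try:
            return url, requests.head(url, allow_redirects=True, timeout=10).status_code
        except requests.RequestException:
            return url, None

    with ThreadPoolExecutor(max_workers=concurrent) as pool:
        return [url for url, code in pool.map(status, urls) if code == 404]
```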
Console output:
Starting crawl of Example Organization website from https://www.example.org
Crawling (1/100): https://www.example.org
Crawling (2/100): https://www.example.org/about
...
Crawling completed. Visited 100 pages.
Checking 1245 unique links for 404 errors...
Checked 500/1245 links...
Checked 1000/1245 links...
Checked 1245/1245 links...
===== Example Organization Website Crawl Report =====
Starting URL: https://www.example.org
Base domain: www.example.org
Pages crawled: 100
Total unique URLs found: 1245
404 errors found: 23
----- 404 Links -----
[404] https://www.example.org/outdated-page
[404] https://www.example.org/products/discontinued-item
...
----- Pages Containing 404 Links -----
Source: https://www.example.org/products
→ [404] https://www.example.org/products/discontinued-item
...
Report saved to example_organization_404_report.txt
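The report file name in this example is derived from the organization name (lowercased, with spaces replaced by underscores). A rough sketch of how such a report could be written follows; the function name and exact file format are assumptions rather than the tool's actual output code:

```python
def write_report(org_name, start_url, broken_links, sources):
    """Write 404 URLs and the pages that reference them to a text file.

    `broken_links` is a list of 404 URLs and `sources` maps each URL to the
    pages linking to it (both assumed, as produced by the crawl sketch above).
    """
    filename = f"{org_name.lower().replace(' ', '_')}_404_report.txt"
    with open(filename, "w", encoding="utf-8") as report:
        report.write(f"===== {org_name} Website Crawl Report =====\n")
        report.write(f"Starting URL: {start_url}\n")
        report.write(f"404 errors found: {len(broken_links)}\n\n")
        for url in broken_links:
            report.write(f"[404] {url}\n")
            for page in sorted(sources.get(url, [])):
                report.write(f"    found on: {page}\n")
    return filename
```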
Please use this tool responsibly:
- Respect websites' robots.txt files and terms of service (one way to check robots.txt is sketched below)
- Use reasonable delays between requests to avoid overloading servers
- Consider running during off-peak hours for high-traffic websites
- Obtain permission before crawling private or commercial websites extensively
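The guidance above asks users to respect robots.txt, but the documentation does not state that the tool enforces it. If you want to add that check yourself, the standard library's urllib.robotparser is one option; a minimal sketch:

```python
from urllib.parse import urlparse
from urllib.robotparser import RobotFileParser

def allowed_by_robots(url: str, user_agent: str = "DeadLinkChecker") -> bool:
    """Return True if the host's robots.txt permits fetching this URL."""
    parts = urlparse(url)
    parser = RobotFileParser(f"{parts.scheme}://{parts.netloc}/robots.txt")
    try:
        parser.read()
    except OSError:
        return True  # robots.txt unreachable; assume fetching is allowed
    return parser.can_fetch(user_agent, url)
```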
Contributions are welcome! Please feel free to submit a Pull Request.
- Fork the repository
- Create your feature branch (git checkout -b feature/amazing-feature)
- Commit your changes (git commit -m 'Add some amazing feature')
- Push to the branch (git push origin feature/amazing-feature)
- Open a Pull Request