PyScraperX

PyScraperX is a resilient, asynchronous web scraping framework built in Python. It is designed for scheduling and concurrently executing multiple scraping tasks while providing a real-time frontend UI for monitoring job status, managing schedules, and resubmitting failed jobs.

Core Features

  • Async Execution: Leverages asyncio and aiohttp for high-performance, non-blocking network I/O, enabling hundreds of endpoints to be scraped concurrently.
  • Background Threads: Core scraping operations run in a background thread, keeping the UI and the scheduler responsive.
  • Real-time Control Panel UI: Built on a FastAPI backend, the control panel provides a live view of all scraping jobs and lets the admin monitor job statuses, previous/next run times per job, and performance metrics.
  • Job Control & Retry Logic: The admin can manually resubmit failed jobs, or filter by permanently deleted status and submit a batch restart.
  • Configuration Driven: Fine-grained control over run intervals, target URL paths, and retry policies via the .env.local file.
  • Mock API Included: A pre-built mock API inside the api/ directory makes it easy to test the framework's overall functionality out of the box.

Dashboard Preview

The Admin UI provides a comprehensive, real-time view of all scraping jobs:

Dashboard Overview

Drill down into individual jobs to see run history and error messages:

Job Details

Core Architecture

PyScraperX uses a two-thread model:

  1. Main Thread:
    • Runs the FastAPI server (report/server.py) and launches the frontend UI.
    • Runs the Job Scheduler (job_scheduler.py), which uses the schedule package to trigger tasks at predefined intervals.
  2. Background Thread:
    • Hosts a dedicated asyncio event loop (scheduler.py). All network-bound scraping tasks are dispatched to this loop (a minimal sketch of this pattern follows the list).
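A minimal sketch of this two-thread dispatch pattern (illustrative only, not the project's actual code; run_all stands in for ScraperEngine.run_all and scrape_job for the scheduled callback):

import asyncio
import threading

# Background thread: hosts the dedicated asyncio event loop used for scraping.
background_loop = asyncio.new_event_loop()

def _run_background_loop() -> None:
    asyncio.set_event_loop(background_loop)
    background_loop.run_forever()

threading.Thread(target=_run_background_loop, name="scraper-loop", daemon=True).start()

async def run_all() -> None:
    # Stand-in for ScraperEngine.run_all(): scrape all endpoints concurrently.
    await asyncio.sleep(0.1)

def scrape_job() -> None:
    # Called by the scheduler in the main thread; the coroutine runs on the
    # background loop, so the scheduler and UI are never blocked by network I/O.
    future = asyncio.run_coroutine_threadsafe(run_all(), background_loop)
    future.add_done_callback(lambda f: f.exception())  # surface errors without blocking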

Workflow

  1. Scheduling: The JobScheduler in the main thread triggers the scrape_job function at the configured interval.
  2. Dispatch: scrape_job uses asyncio.run_coroutine_threadsafe to safely dispatch the ScraperEngine.run_all() coroutine to the background event loop.
  3. Concurrent Scraping: The ScraperEngine gathers all active WebScraper instances and executes them using asyncio.gather.
  4. Fetching JSON Data: Each WebScraper uses a shared aiohttp session for async HTTP requests and, after validating the data with pydantic, stores the JSON response in a separate SQLite database (database.py). A sketch of this fetch-and-validate step follows the list.
  5. Global State Management: Each scraper periodically updates its status in the thread-safe StateManager.
  6. UI Reports: The FastAPI /api/jobs endpoint reads from the StateManager to provide the latest health status and run details to the frontend, which polls the endpoint to keep the dashboard up-to-date.
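Steps 3 and 4 might look roughly like the following (a simplified sketch rather than the project's code; the Record model is a made-up stand-in for the project's pydantic schemas):

import asyncio
import aiohttp
from pydantic import BaseModel

class Record(BaseModel):
    # Hypothetical schema; the real models are defined by the project.
    id: int
    payload: dict

async def fetch_one(session: aiohttp.ClientSession, url: str) -> list[Record]:
    # Fetch JSON from one endpoint and validate it before it is persisted.
    async with session.get(url, raise_for_status=True) as resp:
        data = await resp.json()
        items = data if isinstance(data, list) else [data]
        return [Record(**item) for item in items]

async def run_all(urls: list[str]) -> None:
    # One shared aiohttp session; asyncio.gather runs all fetches concurrently.
    async with aiohttp.ClientSession() as session:
        results = await asyncio.gather(
            *(fetch_one(session, url) for url in urls),
            return_exceptions=True,  # one failing endpoint does not cancel the rest
        )
    for url, result in zip(urls, results):
        status = f"failed ({result})" if isinstance(result, Exception) else f"{len(result)} records"
        print(f"{url}: {status}")

Passing return_exceptions=True is what keeps a batch resilient: a single failing endpoint is reported as a failed job instead of aborting the whole run.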

Getting Started

Prerequisites

  • Python 3.9+
  • pip

1. Installation

Clone the repository, create a new virtual environment, and install the required dependencies from requirements.txt:

pip install -r requirements.txt

2. Configuration

  • Endpoints: Add the URLs you want to scrape to a text file, e.g. endpoints.txt, one per line.
  • Environment: Customize the run schedule, maximum retries, the path to the endpoints file, and other settings in the .env.local file (an illustrative example follows).
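For illustration only (the variable names below are hypothetical; check the project's .env.local for the actual keys), the two files might look like:

endpoints.txt:

http://localhost:50001/items
http://localhost:50001/users

.env.local:

# hypothetical keys, for illustration only
ENDPOINTS_FILE=endpoints.txt
SCRAPE_INTERVAL_MINUTES=5
MAX_RETRIES=3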

3. Run the Mock API (Optional)

The api/ folder contains a minimal mock API that returns JSON responses for testing. Serve it with:

uvicorn api.app:app --port 50001 --reload
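For reference, a mock endpoint of this kind can be as small as the sketch below (illustrative only; the actual routes and payloads in api/app.py may differ):

from fastapi import FastAPI

app = FastAPI()

@app.get("/items")
def items() -> list[dict]:
    # Static JSON payload that a scraper can fetch and validate during testing.
    return [
        {"id": 1, "payload": {"name": "example"}},
        {"id": 2, "payload": {"name": "sample"}},
    ]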

4. Run the Scraper Service

Start the main application from main.py. This initiates all scheduled jobs and launches the Admin UI:

python main.py

Usage

Once the service is running, access the UI by navigating to http://localhost:8000 in your web browser.

Additional Features

  • When the main application starts, a logger is instantiated that writes logs to the logs/ folder, created in the current working directory. A new log file is created on every run.
  • Each .sqlite database file is stored with a unique identifier inside the dbs/ folder in the current working directory. If a database already exists for an endpoint, new records are appended to it. The sketch below illustrates one way such a storage pattern can work.
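As an illustration of this storage pattern (a sketch only; the actual schema and file naming in database.py may differ), each endpoint could map to its own database file keyed by a hash of its URL:

import hashlib
import json
import sqlite3
from pathlib import Path

def db_path_for(url: str, dbs_dir: Path = Path("dbs")) -> Path:
    # Derive a stable, unique identifier per endpoint (hypothetical scheme).
    dbs_dir.mkdir(exist_ok=True)
    return dbs_dir / f"{hashlib.sha1(url.encode()).hexdigest()[:12]}.sqlite"

def append_records(url: str, records: list[dict]) -> None:
    # Create the table on first use, then append; existing rows are preserved.
    with sqlite3.connect(db_path_for(url)) as conn:
        conn.execute(
            "CREATE TABLE IF NOT EXISTS records ("
            "fetched_at TEXT DEFAULT CURRENT_TIMESTAMP, data TEXT)"
        )
        conn.executemany(
            "INSERT INTO records (data) VALUES (?)",
            [(json.dumps(r),) for r in records],
        )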

Disclaimer

Important Legal Disclaimer

PyScraperX is provided for educational and informational purposes only.

By using this framework, you acknowledge and agree that you are solely responsible for complying with all applicable laws, regulations, website terms of service, and any other relevant policies when performing web scraping activities.

The creators and contributors of PyScraperX are not responsible for any misuse or illegal activities conducted by users of this software. It is your responsibility to ensure that your scraping activities are lawful and ethical. Always review a website's robots.txt file and terms of service before scraping.

Use PyScraperX responsibly and at your own risk.

License

This project is licensed under the MIT License - see the LICENSE file for details.
