PyScraperX is a resilient, asynchronous web scraping framework built in Python. It is designed to schedule and concurrently execute multiple scraping tasks while providing a real-time frontend UI panel for monitoring and scheduling jobs and resubmitting failed ones.
- Async Execution: Leverages `asyncio` and `aiohttp` for high-performance, non-blocking network I/O, enabling hundreds of endpoints to be scraped concurrently (see the sketch after this list).
- Background Threads: Core scraping operations run in a background thread, ensuring that the UI and the scheduler remain responsive.
- Real-time Control Panel UI: A FastAPI backend provides a live view of all scraping jobs and allows the admin to monitor job statuses, previous/next run times per job, and performance metrics.
- Job Control & Retry Logic: The admin can manually resubmit failed jobs, or filter by permanently deleted status and submit a batch restart.
- Configuration Driven: Fine-grained control over run intervals, target URL paths, and retry policies via a `.env.local` file.
- Mock API Included: A pre-built mock API inside the `api/` directory makes it easy to test the overall functionality of the framework out of the box.
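As a rough illustration of the async execution model, the snippet below fetches many endpoints concurrently over a shared `aiohttp` session. It is a standalone sketch rather than the framework's own code; the `fetch`/`scrape_all` helpers and the mock-API URLs are purely illustrative.

```python
import asyncio
import aiohttp

async def fetch(session: aiohttp.ClientSession, url: str) -> dict:
    # Non-blocking GET: while this request waits on the network,
    # the event loop is free to service the other requests.
    async with session.get(url, timeout=aiohttp.ClientTimeout(total=10)) as resp:
        resp.raise_for_status()
        return await resp.json()

async def scrape_all(urls: list) -> list:
    # One shared session reuses connections across all requests.
    async with aiohttp.ClientSession() as session:
        tasks = [fetch(session, url) for url in urls]
        # return_exceptions=True keeps one failing endpoint from cancelling the rest.
        return await asyncio.gather(*tasks, return_exceptions=True)

if __name__ == "__main__":
    urls = [f"http://localhost:50001/items/{i}" for i in range(100)]
    results = asyncio.run(scrape_all(urls))
    print(sum(1 for r in results if not isinstance(r, Exception)), "requests succeeded")
```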
The Admin UI provides a comprehensive, real-time view of all scraping jobs:
Drill down into individual jobs to see run history and error messages:
Two-thread model:
- Main Thread:
  - Runs the FastAPI server (`report/server.py`) and launches the frontend UI.
  - Runs the Job Scheduler (`job_scheduler.py`), which utilizes the `schedule` package to trigger tasks at predefined intervals.
- Background Thread:
  - Hosts a dedicated `asyncio` event loop (`scheduler.py`). All network-bound scraping tasks are dispatched to this loop (see the sketch below).
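The pattern of hosting a dedicated event loop in a background thread can be reduced to a few lines. This is a minimal, generic sketch assuming illustrative names (`background_loop`, `scrape_everything`), not the framework's actual code in `scheduler.py`.

```python
import asyncio
import threading

# Create the loop up front, then let a daemon thread drive it forever.
background_loop = asyncio.new_event_loop()
threading.Thread(target=background_loop.run_forever, daemon=True).start()

async def scrape_everything() -> str:
    await asyncio.sleep(1)  # stands in for real network-bound scraping work
    return "done"

# From the main thread, schedule a coroutine on the background loop.
# The returned concurrent.futures.Future can be polled or waited on.
future = asyncio.run_coroutine_threadsafe(scrape_everything(), background_loop)
print(future.result())  # blocks the main thread only because we ask for the result
```

Because the loop lives in its own thread, the main thread stays free to serve the FastAPI app and run the scheduler.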
- Scheduling: The `JobScheduler` in the main thread triggers the `scrape_job` function based on the configured interval.
- Dispatch: `scrape_job` uses `asyncio.run_coroutine_threadsafe` to safely dispatch the `ScraperEngine.run_all()` coroutine to the background event loop (see the sketch after this list).
- Concurrent Scraping: The `ScraperEngine` gathers all active `WebScraper` instances and executes them using `asyncio.gather`.
- Fetching JSON Data: Each `WebScraper` uses a shared `aiohttp` session for async HTTP requests and stores the JSON response data in a separate SQLite database (`database.py`) after validating the data with `pydantic`.
- Global State Management: Each scraper periodically updates its status in the thread-safe `StateManager`.
- UI Reports: The FastAPI `/api/jobs` endpoint reads from the `StateManager` to provide the latest health status and run details to the frontend, which polls the endpoint to keep the dashboard up to date.
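The following condensed sketch shows how these steps can fit together: the `schedule` package triggers a job on the main thread, the coroutine is handed to the background loop, and the result is validated with `pydantic` before a thread-safe state store is updated. All names here (`Item`, `run_all`, `scrape_job`, `StateManager`) are simplified stand-ins for the framework's actual classes.

```python
import asyncio
import threading
import time
from typing import Dict

import schedule
from pydantic import BaseModel, ValidationError

# Background event loop, as in the architecture sketch above.
loop = asyncio.new_event_loop()
threading.Thread(target=loop.run_forever, daemon=True).start()

class Item(BaseModel):
    # Illustrative schema; the real model depends on each endpoint's payload.
    id: int
    name: str

class StateManager:
    # Minimal thread-safe status store standing in for the framework's own.
    def __init__(self) -> None:
        self._lock = threading.Lock()
        self._status: Dict[str, str] = {}

    def update(self, job: str, status: str) -> None:
        with self._lock:
            self._status[job] = status

    def snapshot(self) -> Dict[str, str]:
        with self._lock:
            return dict(self._status)

state = StateManager()

async def run_all() -> None:
    # Placeholder for the engine's concurrent scraping pass; in practice the
    # payload would come from aiohttp requests gathered with asyncio.gather.
    raw = {"id": 1, "name": "example"}
    try:
        Item.model_validate(raw)  # pydantic v2; use Item.parse_obj(raw) on v1
        state.update("example-job", "success")
    except ValidationError:
        state.update("example-job", "failed")

def scrape_job() -> None:
    # Runs on the main thread under `schedule`; hand the coroutine to the
    # background loop without blocking the scheduler.
    asyncio.run_coroutine_threadsafe(run_all(), loop)

schedule.every(15).minutes.do(scrape_job)

while True:
    schedule.run_pending()
    time.sleep(1)
```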
- Python 3.9+
- pip
Clone the repository and install the required dependencies in a new virtualenv from `requirements.txt`:
```bash
pip install -r requirements.txt
```
- Endpoints: Add the URLs you want to scrape to a text file, e.g. `endpoints.txt`, one per line.
- Environment: Customize the run schedule, max retries, the path to the `endpoints.txt` file, and other settings in the `.env.local` file (an illustrative example follows).
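A hypothetical configuration could look like the snippet below. The `.env.local` variable names here are purely illustrative; the actual keys the framework reads are defined in its source, so treat this as a shape, not a reference.

```text
# endpoints.txt -- one target URL per line
http://localhost:50001/items/1
http://localhost:50001/items/2

# .env.local -- variable names are illustrative only
ENDPOINTS_FILE=endpoints.txt
RUN_INTERVAL_MINUTES=15
MAX_RETRIES=3
```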
The `api/` folder contains a minimal mock API that returns JSON responses for testing. Serve it with:
```bash
uvicorn api.app:app --port 50001 --reload
```
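For orientation, a mock endpoint of this kind can be as small as the following FastAPI app. This is a generic sketch, not the actual contents of `api/app.py`.

```python
from fastapi import FastAPI

app = FastAPI()

@app.get("/items/{item_id}")
def read_item(item_id: int) -> dict:
    # Return a deterministic JSON payload so scraper runs are easy to verify.
    return {"id": item_id, "name": f"item-{item_id}"}
```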
Start the main application from `main.py`. This will initiate all scheduled jobs and launch the Admin UI:
```bash
python main.py
```
Once the service is running, access the UI by navigating to http://localhost:8000 in your web browser.
- Upon starting the main application, a logger is instantiated that writes logs to the `logs/` folder, created in the current working directory. A new log file is created on every run.
- Each `.sqlite` database file is stored with a unique identifier inside the `dbs/` folder in the current working directory. Records are appended if the database already exists for an endpoint.
By using this framework, you acknowledge and agree that you are solely responsible for complying with all applicable laws, regulations, website terms of service, and any other relevant policies when performing web scraping activities.
The creators and contributors of PyScraperX are not responsible for any misuse or illegal activities conducted by users of this software. It is your responsibility to ensure that your scraping activities are lawful and ethical. Always review a website's `robots.txt` file and terms of service before scraping.
This project is licensed under the MIT License - see the LICENSE file for details.