Universal Agentic Web Scraper

A powerful web scraping platform that combines Claude 3.7's reasoning with Firecrawl's web extraction capabilities, built with Django and FastAPI on the backend and Next.js on the frontend.

System Architecture

The backend combines Django's admin and ORM capabilities with FastAPI's high-performance async request handling, giving the Universal Agentic Web Scraper both a managed data layer and fast, non-blocking API endpoints.

┌──────────────────────────────────────────────────────────┐
│                   Client Applications                    │
└────────────────────────────┬─────────────────────────────┘
                             │
┌────────────────────────────▼─────────────────────────────┐
│                       API Gateway                        │
└───────┬─────────────────────────────────────────┬────────┘
        │                                         │
┌───────▼────────┐                       ┌────────▼───────┐
│  Django Admin  │                       │  FastAPI API   │
│  & Web Portal  │                       │    Services    │
└───────┬────────┘                       └────────┬───────┘
        │                                         │
┌───────▼─────────────────────────────────────────▼────────┐
│                  Shared Database Layer                   │
└───────┬─────────────────────────────────────────┬────────┘
        │                                         │
┌───────▼───────┐                        ┌────────▼────────┐
│  External API │                        │   Task Queue    │
│  Integrations │                        │   (Celery/RQ)   │
└───────────────┘                        └─────────────────┘
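
One common way to wire this pairing is to expose a single ASGI application that serves both frameworks. The sketch below assumes only the module path and app name from the uvicorn core.asgi:app command later in this README; the core.settings module and the /django mount path are assumptions, and the repository's actual wiring may differ.

# core/asgi.py -- a minimal sketch, not the repository's actual file.
# Only the module path (core.asgi) and app name (app) come from the
# `uvicorn core.asgi:app` command in this README; everything else is assumed.
import os

from django.core.wsgi import get_wsgi_application
from fastapi import FastAPI
from fastapi.middleware.wsgi import WSGIMiddleware

os.environ.setdefault("DJANGO_SETTINGS_MODULE", "core.settings")  # assumed settings path

# Django handles the admin and web portal as a WSGI app.
django_app = get_wsgi_application()

# FastAPI handles the high-performance async API services.
app = FastAPI(title="Universal Agentic Web Scraper")
# app.include_router(...) calls for /api/auth, /api/scraper, etc. would go here.

# Mount the Django portal alongside the API (the /django path is an assumption).
app.mount("/django", WSGIMiddleware(django_app))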

Agent Memory Architecture

The system uses a memory-augmented approach with a central scratchpad that coordinates between different information sources:

┌─────────────────────────┐
│                         │
│        Scratchpad       │
│    (Memory & Planning)  │
│                         │
└───────┬─────┬───────────┘
        │     │
        │     │
┌───────▼─────▼───────────┐
│                         │
│   Claude 3.7 (Agent)    │
│  w/ Extended Thinking   │
│                         │
└───────┬─────┬───────────┘
        │     │
        │     │
┌───────▼─────┴───────────┐
│                         │
│  Company    Internet    │
│  Domain     Search      │
│                         │
└─────────────────────────┘
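
Conceptually, the scratchpad pairs this memory layer with vector storage so entries can be tagged by source and retrieved semantically. The sketch below uses Chroma (listed under Key Technologies); the collection name, IDs, and metadata keys are illustrative assumptions rather than the repository's actual schema.

# A minimal sketch of source-aware memory with semantic search, relying on
# Chroma's default embedding model; names and metadata keys are illustrative.
import chromadb

client = chromadb.Client()  # in-memory client; the real system may persist to disk
scratchpad = client.get_or_create_collection("scratchpad")

# Execution phase: store findings tagged with where they came from.
scratchpad.add(
    ids=["entry-1", "entry-2"],
    documents=[
        "The company's pricing page lists three subscription tiers.",
        "A news article reports the company expanded into new markets.",
    ],
    metadatas=[{"source": "company_domain"}, {"source": "internet_search"}],
)

# Analysis phase: semantic search, optionally filtered by source.
results = scratchpad.query(
    query_texts=["What pricing plans are offered?"],
    n_results=2,
    where={"source": "company_domain"},
)
print(results["documents"])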

Features

  • Intelligent Web Scraping: Leverage Claude 3.7's intelligence to plan and execute web scraping tasks
  • Powerful Extraction: Use Firecrawl for robust web extraction, handling JavaScript, anti-bot measures, and more
  • Advanced Memory System: Enhanced scratchpad with vector storage for semantic search and session management
  • Source-Aware Storage: Metadata tagging to track information origins (company domain vs internet search)
  • Extended Thinking: Claude 3.7's enhanced reasoning with transparent thought processes
  • Asynchronous Jobs: Long-running jobs are processed in the background using Celery
  • RESTful API: Access all functionality through a well-documented FastAPI interface
  • Admin Interface: Manage users, jobs, and system settings through the Django admin portal
  • Modern Frontend: Sleek React/Next.js frontend for intuitive user interaction

Getting Started

Prerequisites

  • Python 3.9+
  • Node.js 18+ (for frontend)
  • Redis (for Celery)
  • API keys for Firecrawl and Anthropic

Installation

  1. Clone the repository:
git clone https://github.com/Amankrah/aws.git
cd aws
  2. Create a virtual environment:
python -m venv venv
# On Windows:
venv\Scripts\activate
# On Unix/MacOS:
source venv/bin/activate
  3. Install backend dependencies:
pip install -r requirements.txt
  4. Install frontend dependencies:
cd frontend
npm install
cd ..
  5. Create a .env file in the project root:
FIRECRAWL_API_KEY=your_firecrawl_api_key
ANTHROPIC_API_KEY=your_anthropic_api_key
DEBUG=True
SECRET_KEY=your_secret_key
DATABASE_URL=sqlite:///db.sqlite3
  6. Run migrations:
python manage.py migrate
  7. Create a superuser:
python manage.py createsuperuser

Development

Starting the backend

  1. Start the Redis server (required for Celery):
# Windows (using WSL or Docker)
docker run -p 6379:6379 redis

# Unix/MacOS
redis-server
  2. Start the Celery worker:
celery -A core worker --loglevel=info
  3. Start the Django/FastAPI server:
uvicorn core.asgi:app --reload

Starting the frontend

cd frontend
npm run dev

Accessing the application

With the default configuration, the FastAPI backend (including its auto-generated interactive docs at /docs) runs at http://localhost:8000 and the Next.js development server at http://localhost:3000.

API Endpoints

The API provides the following main endpoints (a hypothetical client example follows this list):

  • Authentication: /api/auth/
    • Register, get current user, refresh API keys
  • Web Scraping: /api/scraper/
    • Start scraping jobs, get job status and results
  • Scratchpad: /api/scratchpad/
    • Store and retrieve temporary data
    • Perform semantic searches
    • Filter by source or session
    • Track operation history
  • Jobs: /api/jobs/
    • List and manage scraping jobs
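
As a hypothetical example of driving these endpoints from Python: the payload fields, authentication header, and response keys below are assumptions, since the README does not document the request/response schemas; the interactive docs that FastAPI serves (typically at /docs) are the authoritative reference.

# Hypothetical client flow: start a scraping job, then poll for its result.
# Field names ("job_id", "status") and the Bearer auth header are assumptions.
import time

import requests

BASE_URL = "http://localhost:8000"  # uvicorn's default bind address
HEADERS = {"Authorization": "Bearer your_api_key"}

# Start a scraping job.
response = requests.post(
    f"{BASE_URL}/api/scraper/",
    json={"url": "https://example.com", "question": "What does this company sell?"},
    headers=HEADERS,
)
response.raise_for_status()
job_id = response.json().get("job_id")

# Poll the job until it finishes.
while True:
    job = requests.get(f"{BASE_URL}/api/jobs/{job_id}", headers=HEADERS).json()
    if job.get("status") in {"completed", "failed"}:
        print(job)
        break
    time.sleep(5)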

Agent-Scratchpad Coordination

The system coordinates between Claude and the Scratchpad in multiple phases:

  1. Planning Phase: Claude generates scraping plans that are stored in the scratchpad
  2. Execution Phase: Scraping results are saved to the scratchpad with source metadata
  3. Analysis Phase: Claude retrieves relevant information from the scratchpad using semantic search
  4. Synthesis Phase: Claude generates final answers based on all collected information

This coordination allows the agent to effectively utilize both company-specific information and general internet search results.
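
The loop below is a self-contained sketch of these four phases. The Scratchpad class and its store/search methods are stand-ins for the real services (scratchpad_service, claude_service, firecrawl_service) and do not reflect their actual APIs; the real system uses Claude for planning and synthesis and vector search for retrieval.

# A toy end-to-end pass through the four phases; all names are illustrative.
from dataclasses import dataclass, field


@dataclass
class Scratchpad:
    entries: list = field(default_factory=list)

    def store(self, content: str, source: str) -> None:
        # Source-aware storage: tag each entry with its origin.
        self.entries.append({"content": content, "source": source})

    def search(self, query: str) -> list:
        # Stand-in for the real semantic (vector) search.
        return [e for e in self.entries if query.lower() in e["content"].lower()]


def run_job(question: str, pad: Scratchpad) -> str:
    # 1. Planning: the agent's plan is written to the scratchpad.
    pad.store(f"Plan: crawl the company domain, then search the web for: {question}",
              source="plan")
    # 2. Execution: scraping results are saved with source metadata.
    pad.store("Company site: Acme sells workflow automation software.",
              source="company_domain")
    pad.store("Review site: Acme's software is popular with small agencies.",
              source="internet_search")
    # 3. Analysis: retrieve the relevant entries.
    context = pad.search("acme")
    # 4. Synthesis: the agent would compose the final answer from this context.
    return f"Synthesized answer from {len(context)} scratchpad entries."


print(run_job("What does Acme sell?", Scratchpad()))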

Project Structure

scraper_project/
├── core/                      # Shared core functionality
├── apps/                      # Django applications
│   ├── users/                 # User management
│   ├── scratchpad/            # Enhanced scratchpad with vector storage
│   ├── crawl_jobs/            # Crawl job management
│   └── admin_portal/          # Admin interface
├── api/                       # FastAPI application
├── services/                  # Business logic services
│   ├── claude_service.py      # Claude 3.7 service with extended thinking
│   ├── scratchpad_service.py  # Memory management with semantic search
│   ├── firecrawl_service.py   # Web extraction capabilities
│   └── extraction_service.py  # Structured data extraction
├── tasks/                     # Async task definitions
│   ├── crawl_tasks.py         # Web crawling tasks
│   ├── agent_tasks.py         # Agent reasoning tasks
│   └── extraction_tasks.py    # Data extraction tasks
└── utils/                     # Utility functions
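
For a sense of how tasks/crawl_tasks.py might hand work to Celery: the celery -A core worker command above implies the Celery app lives in the core package, so the app is defined inline here only to keep the sketch self-contained, and the task name, arguments, and broker URL are illustrative assumptions.

# Hypothetical background task; in the repo the Celery app presumably lives
# in the `core` package (per `celery -A core worker`), not in this module.
from celery import Celery

app = Celery("core", broker="redis://localhost:6379/0")  # Redis per Prerequisites


@app.task(bind=True, max_retries=3)
def run_crawl_job(self, job_id: int, url: str) -> dict:
    """Run a long-running crawl in the background and return its result."""
    try:
        # The real task would call the Firecrawl service and persist results.
        return {"job_id": job_id, "url": url, "status": "completed"}
    except Exception as exc:
        # Back off and retry transient failures (network errors, rate limits).
        raise self.retry(exc=exc, countdown=30)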

Key Technologies

  • Django: Core application framework and ORM
  • FastAPI: High-performance API endpoints
  • Next.js: React framework for the frontend
  • Celery: Asynchronous task processing
  • Claude 3.7: Advanced AI reasoning with extended thinking
  • Firecrawl: Powerful web extraction
  • Chroma/HuggingFace: Vector storage for semantic search
  • PostgreSQL: Optional production database (the sample .env above defaults to SQLite)

Contributing

Contributions are welcome! Please feel free to submit a Pull Request.

License

This project is licensed under the MIT License - see the LICENSE file for details.

Acknowledgements
