A lightweight web scraping library with LLM extraction capabilities. It pairs Playwright-based web scraping with AI-powered content extraction through either the OpenAI or the OpenRouter API.
- Configurable web scraping with Playwright
- Support for both headless and visible browser modes
- Content cleaning and preprocessing
- LLM-based information extraction
- Support for both OpenAI and OpenRouter APIs
- Customizable schema definitions with type specifications:
  - String fields
  - Array fields
  - Object fields with nested properties
- Ad blocking and media handling
- Automatic handling of srcset attributes
- HTML minification support
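To make the srcset item above more concrete, here is a minimal sketch of one way a `srcset` value could be reduced to a single image URL. It is an illustration only: `pick_largest_from_srcset` is a hypothetical helper, not part of scrapeneatly, and the library's own handling may differ. The sketch splits candidates on commas, so it assumes the URLs themselves contain no commas.

```python
from typing import Optional


def pick_largest_from_srcset(srcset: str) -> Optional[str]:
    """Return the URL of the largest srcset candidate, or None if there is none.

    Hypothetical helper for illustration; not part of scrapeneatly.
    """
    best_url, best_size = None, -1.0
    for candidate in srcset.split(","):
        parts = candidate.strip().split()
        if not parts:
            continue  # skip empty candidates such as a trailing comma
        url = parts[0]
        size = 1.0  # a candidate without a descriptor is treated as 1x
        if len(parts) > 1:
            try:
                # Drop the trailing unit: "640w" -> 640.0, "2x" -> 2.0
                size = float(parts[1][:-1])
            except ValueError:
                continue  # skip candidates with a malformed descriptor
        if size > best_size:
            best_url, best_size = url, size
    return best_url


print(pick_largest_from_srcset("a.jpg 320w, b.jpg 1280w, c.jpg 640w"))
# prints b.jpg
```

Picking the largest candidate is a design choice suited to product images, where the highest resolution is usually wanted; a bandwidth-sensitive pipeline might pick the smallest instead.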
pip install aiohttp beautifulsoup4 fake-useragent playwright pydantic tiktoken openai lxml
pip install scrapeneatly
import asyncio

from scrapeneatly import scrape_product


async def main():
    # Define what you want to extract
    fields = {
        "title": {
            "description": "Product title",
            "type": "string"
        },
        "images": {
            "description": "Product images",
            "type": "array",
            "items": {"type": "string"}
        }
    }

    result = await scrape_product(
        url="https://example.com/product",
        fields_to_extract=fields,
        provider="openai",  # or "openrouter"
        api_key="your-api-key",
        # Optional, for OpenRouter: uncomment together with provider="openrouter"
        # model="anthropic/claude-2",
    )

    # The result reports whether extraction succeeded
    if result["success"]:
        print(result["data"])


if __name__ == "__main__":
    asyncio.run(main())
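Because `scrape_product` is a coroutine, several product pages can be scraped concurrently. The following is a minimal sketch of that pattern, not a feature of the library: `scrape_many` is a hypothetical helper, the concurrency limit of 3 is arbitrary, and the `"error"` key it adds to failed results is an assumption rather than a documented part of the result format.

```python
import asyncio
from typing import Any, Awaitable, Callable, Dict, List


async def scrape_many(
    scrape_fn: Callable[..., Awaitable[dict]],
    urls: List[str],
    max_concurrency: int = 3,
    **scrape_kwargs: Any,
) -> Dict[str, dict]:
    """Call scrape_fn once per URL, with at most max_concurrency calls in flight.

    Hypothetical helper for illustration; not part of scrapeneatly.
    """
    semaphore = asyncio.Semaphore(max_concurrency)

    async def scrape_one(url: str) -> dict:
        async with semaphore:
            try:
                return await scrape_fn(url=url, **scrape_kwargs)
            except Exception as exc:
                # Assumed failure shape: mirrors the "success" key from the
                # example above; "error" is our own addition.
                return {"success": False, "error": str(exc)}

    # gather preserves input order, so results line up with urls
    results = await asyncio.gather(*(scrape_one(url) for url in urls))
    return dict(zip(urls, results))
```

Inside an async function this could be called as `await scrape_many(scrape_product, urls, fields_to_extract=fields, provider="openai", api_key="your-api-key")`. The semaphore keeps the number of simultaneous browser sessions and LLM requests bounded, which helps avoid rate limits.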
fields = {
    "price": {
        "description": "Product price",
        "type": "string"
    },
    "variants": {
        "description": "Product variants",
        "type": "array",
        "items": {
            "type": "object",
            "properties": {
                "color": {"type": "string"},
                "size": {"type": "string"}
            }
        }
    }
}
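Since LLM output does not always match the requested shape, it can be worth checking the extracted data against the same field definitions. The sketch below shows one way to do that for the three types used here (`string`, `array`, `object`). `check_against_fields` is a hypothetical helper, not part of scrapeneatly, and it assumes the extracted data is a plain dict keyed by field name.

```python
# Hypothetical helper for illustration; not part of scrapeneatly.
_PYTHON_TYPES = {"string": str, "array": list, "object": dict}


def _check_value(value, spec, path):
    """Recursively compare one value with its field spec; return a list of problems."""
    expected = _PYTHON_TYPES.get(spec.get("type"))
    if expected is None:
        return []  # unknown or missing type: nothing to check
    if not isinstance(value, expected):
        return [f"{path}: expected {spec['type']}, got {type(value).__name__}"]
    problems = []
    if spec["type"] == "array" and "items" in spec:
        for i, item in enumerate(value):
            problems.extend(_check_value(item, spec["items"], f"{path}[{i}]"))
    elif spec["type"] == "object":
        # Only properties that are present are checked; absent ones are allowed
        for key, sub_spec in spec.get("properties", {}).items():
            if key in value:
                problems.extend(_check_value(value[key], sub_spec, f"{path}.{key}"))
    return problems


def check_against_fields(data, fields):
    """Return human-readable problems; an empty list means the data matches."""
    problems = []
    for name, spec in fields.items():
        if name not in data:
            problems.append(f"{name}: missing")
        else:
            problems.extend(_check_value(data[name], spec, name))
    return problems


example_fields = {
    "price": {"description": "Product price", "type": "string"},
    "variants": {
        "description": "Product variants",
        "type": "array",
        "items": {
            "type": "object",
            "properties": {"color": {"type": "string"}, "size": {"type": "string"}},
        },
    },
}
example_data = {"price": "19.99", "variants": [{"color": "red", "size": "M"}, {"color": 3}]}
print(check_against_fields(example_data, example_fields))
# prints ['variants[1].color: expected string, got int']
```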
# OpenRouter: pass an OpenRouter model ID (run inside an async function)
result = await scrape_product(
    url="your_url",
    fields_to_extract=fields,
    provider="openrouter",
    api_key="your-openrouter-key",
    model="google/gemini-2.0-flash-001"
)
# OpenAI: no model argument is passed here (run inside an async function)
result = await scrape_product(
    url="your_url",
    fields_to_extract=fields,
    provider="openai",
    api_key="your-openai-api-key",
)
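The calls above pass the API key as a string literal. In real projects it is usually safer to read the key from the environment so that it never lands in source control. A minimal sketch follows; `load_api_key` is a hypothetical helper, and the variable names `OPENAI_API_KEY` and `OPENROUTER_API_KEY` are common conventions assumed here, not something scrapeneatly is known to read on its own.

```python
import os

# Assumed environment variable names; adjust to match your own setup.
_ENV_VARS = {"openai": "OPENAI_API_KEY", "openrouter": "OPENROUTER_API_KEY"}


def load_api_key(provider: str) -> str:
    """Look up the API key for a provider in the environment.

    Hypothetical helper for illustration; not part of scrapeneatly.
    """
    env_var = _ENV_VARS.get(provider)
    if env_var is None:
        raise ValueError(f"Unknown provider: {provider!r}")
    key = os.environ.get(env_var)
    if not key:
        raise RuntimeError(f"Set the {env_var} environment variable")
    return key
```

The returned value can then be passed as `api_key=load_api_key("openai")` in the `scrape_product` call.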
Contributions are welcome! Please feel free to submit a Pull Request.
This project is licensed under the MIT License - see the LICENSE file for details.