Botanica is a Python-based framework designed to help journalists, newsrooms and data enthusiasts build and manage their data pipelines, or bots. It simplifies data collection by offering a structured, template-driven ecosystem for:
- Developing and organizing multiple data scrapers.
- Automating scraper execution using GitHub Actions for scheduled tasks.
- Streamlining data uploads to cloud storage (e.g., AWS S3).
- Maintaining a version-controlled and shareable collection of bots.
Botanica aims to lower the technical barriers to creating robust, automated data gathering workflows. Note: This is a new project and a work in progress.
Botanica is designed to be used as a GitHub template. This allows you to quickly create your own repository pre-configured with the Botanica framework, ready for you to add your custom bots.
To get started:
- Click the "Use this template" button located at the top of the Botanica GitHub repository page.
- Choose "Create a new repository".
- Give your new repository a name (e.g., `my-botanica-scrapers`).
- Choose whether to include all branches (usually just the main branch is needed).
- Click "Create repository from template".
You will now have your own copy of Botanica in your GitHub account. You can then clone it to your local machine and follow the steps below to create and deploy your bots.
- Unified system: Manage all your scraper projects in one place without needing separate repositories.
- Automated scheduling: Use GitHub Actions to run tasks on a regular basis, from hourly to daily updates.
- Cloud integration: Easily upload processed data to AWS S3 or other cloud services.
- Modular design: Each scraper is self-contained, allowing for easy addition, removal or modification.
- Secure handling: Sensitive information like API keys and credentials is securely managed through GitHub Secrets.
- Flexible and scalable: Easily extend Botanica with new scrapers, configurations and data workflows.
Botanica's structure is designed to keep things organized and straightforward. Each scraper lives in its own folder within `bots/`, and the utilities and workflows are kept in their own dedicated directories:
```
botanica/
├── bots/
│   └── example/
│       ├── src/
│       │   └── data/             # Where processed data is stored
│       ├── config.example.json   # Template for bot-specific settings
│       ├── requirements.txt      # Dependencies for the bot
│       └── main.py               # Main script for the bot
├── .github/
│   └── workflows/
│       └── example-scraper.yml   # GitHub Action for automating tasks
├── utils/
│   ├── s3_upload.py              # Utility script for S3 uploads
│   └── logs/                     # Directory for logs or temp files
└── .env.example                  # Example environment variables template
```
- Add a new bot: To create a new bot, run the `create_bot.py` script and follow the instructions below for customizing the scripts, configuration files and dependencies it needs.
- Set up automation: Use the GitHub Actions workflow template to schedule your bot. Define when and how often it should run.
- Process and upload: When a bot runs, it processes data and saves it in the `src/data` directory. From there, the utility scripts can help you upload the results to cloud storage like AWS S3.
- Manage configurations: Use your local environment or `.env` files and GitHub Actions secrets for sensitive information, and `config.json` files for bot-specific settings. Example templates are included to make setup easier (a short sketch of this pattern follows below).
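To make that split concrete, here is a minimal sketch of the pattern, assuming a hypothetical `load_settings` helper that is not part of Botanica itself; the AWS variable names match the secrets described later in this README, while any other keys in `config.json` are up to each bot.

```python
# Minimal sketch of the configuration pattern (hypothetical helper, not part
# of Botanica): secrets come from the environment (a local .env file or
# GitHub Actions secrets), non-sensitive settings come from config.json.
import json
import os
from pathlib import Path


def load_settings(config_path: str = "config.json") -> dict:
    """Merge bot-specific settings with credentials taken from the environment."""
    settings = json.loads(Path(config_path).read_text())

    # Credentials are never committed; GitHub Actions injects them as
    # environment variables from the repository secrets.
    settings["aws"] = {
        "access_key_id": os.environ.get("AWS_ACCESS_KEY_ID"),
        "secret_access_key": os.environ.get("AWS_SECRET_ACCESS_KEY"),
        "region": os.environ.get("AWS_DEFAULT_REGION", "us-east-1"),
    }
    return settings


if __name__ == "__main__":
    print(load_settings().get("output_directory", "src/data"))
```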
Creating and deploying a new bot in Botanica is designed to be quick and easy. Follow the steps below to get your new data scraper up and running.
Before creating a new bot, ensure your environment is set up:
- Python 3.9+ installed
- Required packages installed (`pip install -r requirements.txt`)
- AWS credentials configured if using cloud uploads
- Run `create_bot.py`: From the root of the project, run the following command: `python create_bot.py`
- Follow the prompts:
  - Bot name: Enter a unique name for your new bot (e.g., `weather_scraper`).
  - AWS profile name: Specify an AWS profile if you want to use a specific one (optional).
  - Users/entities: Provide a list of entities or usernames separated by commas (optional).
  - Cron schedule: Define the GitHub Actions cron schedule for your bot (e.g., `0 8 * * *` for daily at 8 AM UTC).
What this does:
- Copies the `template_bot` directory and sets up a new bot in `./bots/<your_bot_name>`.
- Updates `config.json` with your specified details.
- Generates a GitHub Actions workflow file (e.g., `.github/workflows/<your_bot_name>.yml`) pre-configured with your bot's name and schedule, based on `.github/workflows/template_workflow.yml`.
- Appends a log entry to the `README.md` with your new bot (this feature might be simplified in the script).
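Conceptually, the scaffolding boils down to copying the template and filling in placeholders. The snippet below is only a simplified illustration of that idea, not the actual `create_bot.py`; the placeholder tokens match the ones shown in the workflow template later in this README.

```python
# Simplified illustration of the scaffolding step; the real create_bot.py in
# the repository is the source of truth (it also handles the prompts, the
# config.json details and the README log entry).
import shutil
from pathlib import Path


def scaffold_bot(bot_name: str, cron_schedule: str) -> None:
    # 1. Copy the template bot into ./bots/<bot_name>
    bot_dir = Path("bots") / bot_name
    shutil.copytree("template_bot", bot_dir)

    # 2. Render the bot's GitHub Actions workflow from the shared template
    template = Path(".github/workflows/template_workflow.yml").read_text()
    workflow = (
        template.replace("%%BOT_NAME%%", bot_name)
                .replace("%%CRON_SCHEDULE%%", cron_schedule)
    )
    Path(f".github/workflows/{bot_name}.yml").write_text(workflow)


if __name__ == "__main__":
    scaffold_bot("weather_scraper", "0 8 * * *")
```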
- Navigate to the new bot directory: `cd bots/<your_bot_name>`
- Edit `main.py` (a rough outline is sketched after this list):
  - Replace the placeholder scraping logic with your specific data extraction code. Modify the section marked `CUSTOM SCRAPING LOGIC STARTS HERE`.
  - Use the existing structure for data processing, storage and upload, so you don't need to worry about file handling or cloud integration.
- Edit `config.json` (optional):
  - Modify any parameters such as `output_directory`, `timeseries_file`, or `retry_attempts` as needed.
  - Update other settings like query parameters, API endpoints, or user lists.
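For orientation, the sketch below shows one way a bot's `main.py` might be laid out. Only the marker comment and the `src/data` output location come from this README; the `fetch_items` helper, the example URL, and the `requests` dependency are purely illustrative stand-ins for your own scraping logic.

```python
# Illustrative outline of a bot's main.py -- the scraping logic and helper
# names here are placeholders, not Botanica APIs.
import json
from datetime import datetime, timezone
from pathlib import Path

import requests  # hypothetical dependency; add it to the bot's requirements.txt

DATA_DIR = Path(__file__).parent / "src" / "data"


def fetch_items() -> list:
    # --- CUSTOM SCRAPING LOGIC STARTS HERE ---
    # Replace this request with your own extraction code.
    response = requests.get("https://example.com/api/items", timeout=30)
    response.raise_for_status()
    return response.json()


def main() -> None:
    DATA_DIR.mkdir(parents=True, exist_ok=True)
    items = fetch_items()

    # Write a timestamped snapshot into src/data; the GitHub Actions workflow
    # later commits this back to the repository and/or uploads it to S3.
    stamp = datetime.now(timezone.utc).strftime("%Y%m%dT%H%M%SZ")
    (DATA_DIR / f"items_{stamp}.json").write_text(json.dumps(items, indent=2))


if __name__ == "__main__":
    main()
```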
- Navigate to the bot's directory: `cd bots/<your_bot_name>`
- Install any additional dependencies: If your bot requires packages not included in the main project's `requirements.txt` or the base `template_bot/requirements.txt`, add them to your bot's `requirements.txt` file and install: `pip install -r requirements.txt`
- Run the bot: Test your bot locally to ensure everything is working as expected: `python main.py`
The `create_bot.py` script (as of recent updates) now automatically generates the initial GitHub Actions workflow file for your bot (e.g., `.github/workflows/<your_bot_name>.yml`) based on the template found at `.github/workflows/template_workflow.yml`.
- Verify the generated workflow:
  - Open the newly created `.github/workflows/<your_bot_name>.yml`.
  - Check that the `BOT_NAME` and `cron` schedule match what you intended.
  - The workflow is designed to use GitHub Secrets for AWS credentials (`AWS_ACCESS_KEY_ID`, `AWS_SECRET_ACCESS_KEY`, `AWS_DEFAULT_REGION`).
- Set up GitHub Secrets: For the workflow to run successfully, especially if it involves AWS services like S3 uploads, you must configure the necessary secrets in your GitHub repository:
  - Go to your repository on GitHub.
  - Click on Settings > Secrets and variables > Actions.
  - Click New repository secret for each of the following (if your bot needs them):
    - `AWS_ACCESS_KEY_ID`: Your AWS access key ID.
    - `AWS_SECRET_ACCESS_KEY`: Your AWS secret access key.
    - `AWS_DEFAULT_REGION`: The AWS region your resources are in (e.g., `us-east-1`).
  - Ensure these secret names match exactly what's used in the workflow file's `env` section for the `Run the scraper` step.
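How these secrets reach your code: the workflow exports them as environment variables, and boto3 (or the repository's `utils/s3_upload.py`, whose exact interface isn't documented here) picks them up automatically. The helper below is a hypothetical sketch of that pattern, with an illustrative bucket name.

```python
# Hypothetical upload helper in the spirit of utils/s3_upload.py. boto3 reads
# AWS_ACCESS_KEY_ID, AWS_SECRET_ACCESS_KEY and AWS_DEFAULT_REGION from the
# environment -- exactly what the workflow's env section provides.
from pathlib import Path

import boto3


def upload_directory(local_dir: str, bucket: str, prefix: str = "") -> None:
    """Upload every file under local_dir to s3://bucket/prefix/..."""
    s3 = boto3.client("s3")  # credentials and region come from the environment
    for path in Path(local_dir).rglob("*"):
        if path.is_file():
            key = f"{prefix}/{path.relative_to(local_dir)}".lstrip("/")
            s3.upload_file(str(path), bucket, key)


if __name__ == "__main__":
    # Bucket name is illustrative; point this at your own bucket.
    upload_directory("bots/weather_scraper/src/data", "my-botanica-data")
```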
- Push changes to GitHub: After customizing your bot and verifying the workflow, commit and push your changes:

  ```bash
  git add .
  git commit -m "Add new bot: <your_bot_name> and its workflow"
  git push origin main  # Or your primary branch
  ```
- Check GitHub Actions:
  - Navigate to the Actions tab in your GitHub repository.
  - Confirm that your new bot's workflow is listed and has successfully run. If there are any errors, check the logs for troubleshooting.
- Check the output:
  - Verify that the processed data is correctly stored in the output directory you specified.
  - Ensure the data has been uploaded to S3 (or your cloud storage), and confirm accessibility if needed (a small verification sketch follows below).
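If you prefer to script that last check instead of clicking through the AWS console, a few lines of boto3 will list what actually landed in the bucket; the bucket name and prefix below are placeholders.

```python
# Quick sanity check that the bot's output reached S3 (placeholder bucket/prefix).
import boto3

s3 = boto3.client("s3")
response = s3.list_objects_v2(Bucket="my-botanica-data", Prefix="weather_scraper/")
for obj in response.get("Contents", []):
    print(obj["Key"], obj["Size"], obj["LastModified"])
```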
The `create_bot.py` script now generates a bot-specific workflow file (e.g., `<your_bot_name>.yml`) in the `.github/workflows/` directory. This generated file is based on `.github/workflows/template_workflow.yml`. Please refer to `template_workflow.yml` for the current standard structure.
The template includes:
- Manual trigger (`workflow_dispatch`).
- Scheduled execution based on the cron you provide.
- Python setup and dependency installation.
- Execution of the bot's `main.py`.
- Steps to commit and push data changes back to the repository.
- Use of GitHub Secrets for credentials.
```yaml
# Example content of .github/workflows/template_workflow.yml
# (This is a conceptual snippet; the actual template may evolve)
name: "%%BOT_NAME%%"  # Placeholder for the bot's name

on:
  workflow_dispatch:
  schedule:
    - cron: '%%CRON_SCHEDULE%%'  # Placeholder for the cron schedule

jobs:
  run-bot:
    runs-on: ubuntu-latest
    env:
      BOT_NAME: "%%BOT_NAME%%"
      BOT_PATH: "./bots/%%BOT_NAME%%"
      PYTHONPATH: "${{ github.workspace }}"
    steps:
      - name: Checkout code
        uses: actions/checkout@v4
      # ... other steps for python setup, install, run, commit, push ...
      # Refer to the actual .github/workflows/template_workflow.yml for full details.
```
- Automated cleanup: Add a feature to automatically archive or delete old files from S3.
- Comprehensive error handling: Improve the bot's codebase to handle more specific errors and retry failed requests.
- Bot monitoring: A monitoring system that alerts if a bot encounters repeated failures.
- Automate inventory collection: Keep an updated list of bots, metadata and output paths.