Housefly is an interactive learning project designed to teach web scraping through structured challenges. Each chapter includes a companion website built specifically to be scraped, allowing you to practice in a controlled environment.
- Realistic Web Scraping Challenges – Work with purpose-built websites.
- Structured Learning – Progress through guided exercises.
- Automated Solution Checking – Verify your scrapers against expected outputs.
- Clone the Repository
git clone https://github.com/jonaylor89/housefly.git
cd housefly
- Navigate to Chapter 1
Each chapter contains a simple website to scrape, along with an expected.txt file defining the correct output.
- Write Your Scraper
Implement your solution inside the corresponding solution{number}/ directory.
- Check Your Answer
Run the validation script to compare your scraper’s output against expected.txt:
# npx playwright install (optional, needed for some exercises)
npm run ca 1
- Add Env Vars (Optional)
Some challenges require third-party APIs (e.g. OpenAI). For those, fill in the provided .env.template file and rename it to .env:
mv .env.template .env
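As a rough illustration of the "Write Your Scraper" step above, here is a minimal sketch of what a solution file such as solution1/index.ts might contain. It assumes the data to extract is the text of every `<h1>` element and that the result is one line per heading; both are assumptions for illustration, not the actual Chapter 1 task. The function name `extractHeadings` and the sample HTML are hypothetical.

```typescript
// Hypothetical sketch, not the official Chapter 1 solution.
// Assumption: the target data is the text of every <h1> element on the page.
function extractHeadings(html: string): string[] {
  // Match each <h1>...</h1> pair, including content that spans several lines.
  const pattern = /<h1[^>]*>([\s\S]*?)<\/h1>/gi;
  const headings: string[] = [];
  let match: RegExpExecArray | null;
  while ((match = pattern.exec(html)) !== null) {
    const text = (match[1] ?? "")
      .replace(/<[^>]+>/g, "") // drop nested tags such as <em>
      .replace(/\s+/g, " ") // collapse runs of whitespace
      .trim();
    if (text.length > 0) {
      headings.push(text);
    }
  }
  return headings;
}

// Made-up HTML standing in for a chapter website.
const sampleHtml =
  "<html><body><h1>Welcome</h1><p>intro</p><h1> <em>Chapter</em> 1 </h1></body></html>";

// One heading per line: "Welcome\nChapter 1"
const sampleOutput = extractHeadings(sampleHtml).join("\n");
```

In a real solution the HTML would come from the chapter's website (for example via `fetch` or Playwright) rather than a hard-coded string, and a proper HTML parser is usually more robust than a regular expression.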
housefly/
├── apps/
│ ├── chapter1/ # Website for Chapter 1
│ │ ├── index.html
│ │ ├── package.json
│ ├── chapter2/
│ ├── chapter3/
│ ├── solution1/ # Place your Chapter 1 solution here
│ │ ├── expected.(txt, csv, json)
│ │ ├── index.ts
│ │ ├── package.json
├── scripts/
│ ├── check_answers.sh # Script to validate solutions
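
The validation step compares a scraper's output with the expected file. The sketch below shows one way such a comparison could work for text output; it is an assumption for illustration only, and the actual check_answers.sh script may behave differently. The helper names `normalize` and `firstMismatch` are hypothetical.

```typescript
// Hypothetical helpers, not part of the Housefly repository.

// Split text into lines, ignoring Windows line endings, trailing
// whitespace on each line, and blank lines at the end of the output.
function normalize(text: string): string[] {
  const lines = text
    .replace(/\r\n/g, "\n")
    .split("\n")
    .map((line) => line.replace(/\s+$/, ""));
  while (lines.length > 0 && lines[lines.length - 1] === "") {
    lines.pop();
  }
  return lines;
}

// Return the 1-based number of the first differing line,
// or null when the two outputs match after normalization.
function firstMismatch(actual: string, expected: string): number | null {
  const a = normalize(actual);
  const e = normalize(expected);
  const length = Math.max(a.length, e.length);
  for (let i = 0; i < length; i++) {
    if (a[i] !== e[i]) {
      return i + 1;
    }
  }
  return null;
}
```

Reporting the first mismatching line, rather than a plain pass/fail, makes it easier to see where a scraper's output drifts from the expected file.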
Pull requests and suggestions are welcome! Feel free to open issues for bug reports or feature requests.
MIT License
This project is for educational purposes. Scraping websites that don't want to be scraped can violate their terms of service and could get you in trouble, particularly if done at an industrial scale.