Housefly: A Hands-On Web Scraping Playground

Housefly is an interactive learning project designed to teach web scraping through structured challenges. Each chapter includes a companion website built specifically to be scraped, allowing you to practice in a controlled environment.

Features

Realistic Web Scraping Challenges – Work with purpose-built websites.
Structured Learning – Progress through guided exercises.
Automated Solution Checking – Verify your scrapers against expected outputs.

Getting Started

Clone the Repository

git clone https://github.com/jonaylor89/housefly.git
cd housefly

Navigate to Chapter 1

Each chapter contains a simple website to scrape, along with an expected.txt file defining the correct output.

Write Your Scraper

Implement your solution inside the corresponding apps/solution{number}/ directory (for example, apps/solution1/ for Chapter 1).
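As a starting point, a scraper for a simple static page might look like the sketch below. It is only an illustration: the function names and the choice of extracting <h1> text are assumptions, not taken from the Housefly repository, and the actual chapters may expect different fields or output formats. It relies on the global fetch available in Node.js 18 and later.

```typescript
// Hypothetical sketch: extractHeadings, scrape, and the <h1> target are
// illustrative assumptions, not part of the Housefly codebase.

// Extract the text content of every <h1> element from an HTML string.
// A regular expression is enough for a toy page; real pages usually call
// for a proper HTML parser.
function extractHeadings(html: string): string[] {
  const pattern = /<h1[^>]*>([\s\S]*?)<\/h1>/gi;
  const headings: string[] = [];
  let match: RegExpExecArray | null;
  while ((match = pattern.exec(html)) !== null) {
    // Strip nested tags, collapse runs of whitespace, and trim the ends.
    const text = match[1]
      .replace(/<[^>]+>/g, "")
      .replace(/\s+/g, " ")
      .trim();
    headings.push(text);
  }
  return headings;
}

// Fetch a page and return its headings, one per line.
async function scrape(url: string): Promise<string> {
  const response = await fetch(url);
  if (!response.ok) {
    throw new Error("Request failed with status " + response.status);
  }
  const html = await response.text();
  return extractHeadings(html).join("\n");
}
```

Calling scrape with the address where the chapter website is served and printing the result would give text that can then be compared with the expected output file.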

Check Your Answer

Run the validation script to compare your scraper’s output against expected.txt:

# npx playwright install (optional, needed for some exercises)
npm run ca 1
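To see roughly what such a check involves, the sketch below compares a scraper's output with the expected text line by line. It is not the repository's check_answers.sh script. The function names are hypothetical, and ignoring line-ending and trailing-whitespace differences is an assumption about what a forgiving check might do.

```typescript
// Hypothetical sketch of an output check; not the logic of check_answers.sh.

// Split text into lines, ignoring Windows line endings and trailing whitespace.
function normalize(text: string): string[] {
  return text
    .replace(/\r\n/g, "\n")
    .replace(/\s+$/, "")
    .split("\n")
    .map((line) => line.replace(/\s+$/, ""));
}

// Return the 1-based number of the first differing line,
// or null when the two outputs match after normalization.
function firstMismatch(actual: string, expected: string): number | null {
  const actualLines = normalize(actual);
  const expectedLines = normalize(expected);
  const length = Math.max(actualLines.length, expectedLines.length);
  for (let i = 0; i < length; i++) {
    if (actualLines[i] !== expectedLines[i]) {
      return i + 1;
    }
  }
  return null;
}
```

Reporting the first mismatching line number, rather than a bare pass or fail, makes it easier to see where a scraper's output starts to diverge.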

Add Env Vars (Optional)

Some of the challenges require third-party APIs (e.g., OpenAI). For those, fill in the provided .env.template file and rename it to .env:

mv .env.template .env
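For readers unfamiliar with .env files, the sketch below shows one way the KEY=value lines in such a file could be read. It is an illustration only: the repository may load .env differently (for example through a library such as dotenv), the helper names are hypothetical, and OPENAI_API_KEY is an assumed variable name rather than one confirmed by .env.template.

```typescript
// Hypothetical sketch: parseEnv and requireVar are illustrative helpers,
// and OPENAI_API_KEY is an assumed variable name.

// Parse KEY=value lines from the contents of a .env file.
// Blank lines and lines starting with "#" are skipped.
function parseEnv(contents: string): Record<string, string> {
  const vars: Record<string, string> = {};
  for (const rawLine of contents.split(/\r?\n/)) {
    const line = rawLine.trim();
    if (line === "" || line.charAt(0) === "#") continue;
    const separator = line.indexOf("=");
    if (separator === -1) continue; // not a KEY=value line
    const key = line.slice(0, separator).trim();
    let value = line.slice(separator + 1).trim();
    // Drop one pair of matching surrounding quotes, if present.
    const first = value.charAt(0);
    const last = value.charAt(value.length - 1);
    if (value.length >= 2 && (first === '"' || first === "'") && last === first) {
      value = value.slice(1, -1);
    }
    vars[key] = value;
  }
  return vars;
}

// Look up a required variable and fail loudly when it is missing or empty.
function requireVar(vars: Record<string, string>, name: string): string {
  const value = vars[name];
  if (!value) {
    throw new Error("Missing " + name + "; fill it in inside .env");
  }
  return value;
}
```

Failing early with a clear message when a key is missing tends to be easier to debug than an authentication error from the third-party API later on.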

Project Structure

housefly/
├── apps/
│   ├── chapter1/  # Website for Chapter 1
│   │   ├── index.html
│   │   ├── package.json
│   ├── chapter2/
│   ├── chapter3/
│   ├── solution1/  # Place your Chapter 1 solution here
│   │   ├── expected.(txt, csv, json)
│   │   ├── index.ts
│   │   ├── package.json
├── scripts/
│   ├── check_answers.sh  # Script to validate solutions

Contributing

Pull requests and suggestions are welcome! Feel free to open issues for bug reports or feature requests.

License

MIT License

Ready to Start Scraping?

👉 Try Housefly Now

Disclaimer

This project is for educational purposes. Scraping websites that prohibit it can violate their terms of service and may get you in trouble, especially at industrial scale.

