Web scraping can be harmful when things go wrong. While developing a web scraper, it is easy to send huge numbers of requests to a server that is not under your control, whether because of a bug or simply the nature of development, especially in a team. Instead of flooding external servers with requests during development, we should flood our own. That is what clobbopus does, in two steps:

- download external pages based on configured params
- serve them on a local server
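In practice that means pointing the scraper at localhost instead of the real domain. A minimal sketch, assuming the default port of 3000 from the config below (the local URL layout shown here is an assumption, not documented behavior):

```bash
# Before: every test run hits the real site
curl "https://www.sample.com/results/result/without"

# After: the same page is served from the local mirror
# (port 3000 is the default from clobbopus.yml; the local
# path layout here is an assumption)
curl "http://localhost:3000/result/without"
```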
Run the following commands in order.
Download and install:

```bash
curl -L https://github.com/ropfoo/clobbopus/releases/download/Latest/install.sh -o install.sh && bash install.sh
```
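If the install script succeeds, the two executables used in the steps below should now be in the working directory; a quick check (the file names are taken from the run commands later in this guide):

```bash
ls clobbopus_data clobbopus_server
```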
Create a config file called clobbopus.yml:

```yaml
port: 3000 # default
dist: pages # default
domains:
  sample:
    url: "www.sample.com/results/"
    params:
      - result/with?query=test&page=~1-7~ # range from page=1 to page=7
      - result/without
```
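The `~1-7~` placeholder describes a page range. Assuming each param is appended to the domain url and the range expands to one URL per page, the fetch step should request something like this (a sketch of the expansion, not actual clobbopus output):

```bash
# URLs implied by the sample config above
for page in $(seq 1 7); do
  echo "www.sample.com/results/result/with?query=test&page=${page}"
done
echo "www.sample.com/results/result/without"
```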
Get the initial page data:

```bash
./clobbopus_data
```
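The downloaded pages should land in the directory set by dist in clobbopus.yml (pages by default), so a quick sanity check is:

```bash
ls pages/
```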
Run the server:

```bash
./clobbopus_server
```
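The mirror should now be listening on the configured port. A simple smoke test, assuming the server answers plain HTTP on port 3000:

```bash
curl -I http://localhost:3000/
```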
Dockerfile:

```dockerfile
FROM alpine:latest

RUN apk update && apk add bash curl

WORKDIR /app

RUN curl -L https://github.com/ropfoo/clobbopus/releases/download/Latest/install.sh -o install.sh && bash install.sh

# update with your path
COPY ./clobbopus/clobbopus.yml .

CMD ./clobbopus_data; ./clobbopus_server
```
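To try the image on its own, standard Docker commands are enough (the image tag clobbopus is my choice, not prescribed):

```bash
# build from the repo root so the -f path matches the compose file below
docker build -f ./clobbopus/Dockerfile -t clobbopus .
docker run -p 3000:3000 clobbopus
```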
docker-compose.yml:

```yaml
version: "3.8"
services:
  clobbopus:
    build:
      context: .
      dockerfile: ./clobbopus/Dockerfile
    ports:
      - 3000:3000
    volumes:
      - ./clobbopus:/app/clobbopus
```
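Then build and start the service with Compose:

```bash
docker compose up --build
# or, with the standalone binary:
# docker-compose up --build
```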