twds-crawler

This repository contains the code for a highly scalable web crawler for towardsdatascience.com, built with Python, Selenium, Docker, Kubernetes, and the infrastructure of the Google Cloud Platform. It was written for a data science class as a way to get hands-on experience with some of the most common technologies for large-scale web and big data processing.
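As a rough illustration of how a Python crawler can talk to a Selenium hub running in a cluster, here is a minimal sketch. It is not taken from TWDS_Crawler.py or TWDS_Crawler_Cluster.py. The host name `selenium-hub`, port 4444, the `/wd/hub` path, and the helper names `hub_url` and `fetch_title` are assumptions made for this example.

```python
from urllib.parse import urlunsplit


def hub_url(host: str = "selenium-hub", port: int = 4444) -> str:
    """Build the WebDriver endpoint of a Selenium hub.

    The default host and port are assumptions for this sketch, not values
    read from the repository's Kubernetes manifests.
    """
    return urlunsplit(("http", f"{host}:{port}", "/wd/hub", "", ""))


def fetch_title(page_url: str, hub: str) -> str:
    """Open page_url in a remote Chrome session and return the page title."""
    # Imported lazily so this module loads even where Selenium is not installed.
    from selenium import webdriver

    options = webdriver.ChromeOptions()
    options.add_argument("--headless")
    # The hub forwards the session to one of its registered browser nodes.
    driver = webdriver.Remote(command_executor=hub, options=options)
    try:
        driver.get(page_url)
        return driver.title
    finally:
        # Always release the browser slot on the node.
        driver.quit()
```

With a hub reachable under that name, `fetch_title("https://towardsdatascience.com", hub_url())` would request a Chrome session from the hub and return the title of the loaded page.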

Documentation

A more detailed description of the implementation can be found in my medium.com article.

Troubleshooting

I also documented some of the challenges I ran into in trouble-shooting.md.

Folders and files

.github/workflows
Dockerfile
README.md
StartCrawler.sh
StartSelenium.sh
TWDS_Crawler.py
TWDS_Crawler_Cluster.py
requirements.txt
selenium-hub-deployment.yaml
selenium-hub-svc.yaml
selenium-node-chrome-deployment.yaml
selenium-node-firefox-deployment.yaml
trouble-shooting.md

Postiii/twds-crawler
