README

TRAWL4-alpha: A low memory footprint web crawler with a MySQL backend

How do I get set up?

Clone this repository and follow these simple steps:

First, create an empty database (the crawler will create the tables automatically):

echo "CREATE DATABASE your_db_name" | mysql

Then modify lib/db/connect.js to suit your MySQL setup (user, password, and database name).

Example: mysql://user:password@localhost/your_db_name
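To make the parts of that connection URL explicit, here is a minimal sketch that splits it with Node's built-in URL class. The helper name parseMysqlUrl is hypothetical and is not taken from lib/db/connect.js; the fallback to port 3306 is an assumption based on MySQL's default port.

```javascript
// Hypothetical helper: split a MySQL connection URL into its parts.
// It only illustrates the URL shape; it is not the code in lib/db/connect.js.
function parseMysqlUrl(connectionUrl) {
  const url = new URL(connectionUrl);
  if (url.protocol !== 'mysql:') {
    throw new Error('Expected a mysql:// URL, got ' + url.protocol);
  }
  return {
    host: url.hostname,
    // Assumption: fall back to MySQL's default port when none is given.
    port: url.port ? Number(url.port) : 3306,
    // Credentials are percent-encoded inside URLs, so decode them.
    user: decodeURIComponent(url.username),
    password: decodeURIComponent(url.password),
    // Drop the leading slash from the path to get the database name.
    database: url.pathname.replace(/^\//, ''),
  };
}

const settings = parseMysqlUrl('mysql://user:password@localhost/your_db_name');
console.log(settings);
```

If the password contains characters such as @ or /, percent-encode them in the URL (for example %40 for @) so the URL still parses.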

Now the obligatory

npm i

or

yarn

You're set!

Now run the crawler with:

npm run demo

Press Ctrl+C to stop crawling, then wait about 2 seconds for the script to finish its exit routines.
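For readers curious how a Ctrl+C exit routine can work in Node, the following is a rough sketch, not the crawler's actual code. The function name onInterrupt and the cleanup callback are hypothetical; the only assumption about the runtime is that Ctrl+C is delivered as a SIGINT event on the process object.

```javascript
// Hypothetical sketch: run a cleanup routine once on Ctrl+C (SIGINT), then exit.
// `proc` defaults to the real process object; it is a parameter so the
// behaviour can be exercised with a stand-in event emitter.
function onInterrupt(cleanup, proc = process) {
  let stopping = false;
  proc.on('SIGINT', async () => {
    if (stopping) return; // ignore repeated Ctrl+C while cleanup is running
    stopping = true;
    try {
      await cleanup(); // e.g. flush pending writes, close the DB connection
    } finally {
      proc.exit(0);
    }
  });
}

// Usage (commented out so the snippet has no side effects when loaded):
// onInterrupt(async () => { /* save crawl state here */ });
```

The guard flag matters because an impatient second Ctrl+C would otherwise start the cleanup twice.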

Running npm run demo again resumes the crawl.

The script auto-restarts itself every 100 URLs to work around a memory leak in cheerio.
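The restart rule can be pictured as a simple counter. This is only a sketch under assumptions: the name createRestartCounter is hypothetical, and how the real script actually restarts itself is not described here.

```javascript
// Hypothetical sketch of a "restart after N URLs" rule.
const RESTART_EVERY = 100; // the figure stated in this README

function createRestartCounter(limit = RESTART_EVERY) {
  let processed = 0;
  // Call once per crawled URL; returns true once the limit is reached.
  return function shouldRestart() {
    processed += 1;
    return processed >= limit;
  };
}

// Usage: a crawl loop would check the counter after each URL and, when it
// returns true, save its state and let a supervisor start a fresh process.
const shouldRestart = createRestartCounter();
```

Restarting the whole process is a blunt but reliable way to release leaked memory, since the operating system reclaims everything when the process exits.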

See lib/constants.js for more settings, including crawl delay, in-memory LRU cache size, user agent, and others.
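To show what the LRU cache size setting controls, here is a minimal LRU (least recently used) cache built on a Map. It is an illustration only: the class name LruCache is hypothetical and the crawler's own cache may be implemented differently.

```javascript
// Hypothetical sketch of a fixed-size LRU cache.
// A Map iterates in insertion order, so the first key is the least recently used.
class LruCache {
  constructor(maxSize) {
    if (!Number.isInteger(maxSize) || maxSize < 1) {
      throw new RangeError('maxSize must be a positive integer');
    }
    this.maxSize = maxSize;
    this.entries = new Map();
  }

  has(key) {
    return this.entries.has(key);
  }

  get(key) {
    if (!this.entries.has(key)) return undefined;
    // Re-insert the entry so it becomes the most recently used.
    const value = this.entries.get(key);
    this.entries.delete(key);
    this.entries.set(key, value);
    return value;
  }

  set(key, value) {
    if (this.entries.has(key)) {
      this.entries.delete(key);
    } else if (this.entries.size >= this.maxSize) {
      // Evict the least recently used entry (the first key in the Map).
      const oldestKey = this.entries.keys().next().value;
      this.entries.delete(oldestKey);
    }
    this.entries.set(key, value);
  }
}
```

A larger cache size saves more database lookups for recently seen URLs at the cost of memory, which is the trade-off the setting exposes.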

Be a good citizen

Please don't abuse the demo configuration. Write your own (for example ./config/my_config.js) and run it with:

node runner.js --preset my_config
