Bootstrap the index with archives and data-dumps #105

Open

walking-octopus opened this issue Jul 1, 2023 · 0 comments

Labels
enhancement New feature or request

Comments

walking-octopus commented Jul 1, 2023

Introduction

Mwmbl currently lacks a proper index. However, we do know that most searched-for content originates from relatively few websites, most of which offer first-party or third-party data dumps.

To bootstrap the index, perhaps a manual importer could be written for a common format like ZIM, which would reconstruct the URLs, process the HTML like any other web page, and add everything to the index.
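As a rough illustration, here is what the core of such an importer might look like in Python, assuming the python-libzim bindings (`Archive`, `get_entry_by_path`, and `main_entry` are the library's actual API). The `index_page` hook, the archive filename, and the base URL are hypothetical placeholders, not Mwmbl's real interfaces:

```python
# Sketch of a ZIM importer, assuming python-libzim and BeautifulSoup
# (pip install libzim beautifulsoup4).
from libzim.reader import Archive
from bs4 import BeautifulSoup

def index_page(url: str, title: str, text: str) -> None:
    # Hypothetical stand-in for Mwmbl's real indexing entry point;
    # just prints for demonstration.
    print(url, title, text[:80])

def import_entry(archive: Archive, path: str, base_url: str) -> None:
    """Reconstruct the URL for one ZIM entry and index its text."""
    entry = archive.get_entry_by_path(path)
    html = bytes(entry.get_item().content).decode("utf-8")
    text = BeautifulSoup(html, "html.parser").get_text(separator=" ", strip=True)
    # URL reconstruction is illustrative: ZIM entry paths map onto the
    # site's article paths, modulo the namespace scheme of the archive.
    index_page(url=base_url + entry.path, title=entry.title, text=text)

if __name__ == "__main__":
    archive = Archive("wikipedia_en_all_nopic.zim")  # illustrative filename
    # Start from the archive's main entry; a full importer would walk
    # every article entry rather than a single path.
    import_entry(archive, archive.main_entry.get_item().path,
                 "https://en.wikipedia.org/wiki/")
```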

Candidates

  • Wikipedia, for example, discourages crawling, but offers a complete archive, available either as a giant full database dump or, more practically, as a compressed ZIM file, which is essentially a bundle of ZSTD-compressed HTML with a full-text search index. The entire Wikipedia (without images or edit history) weighs only around 30 GB.
  • Other wikis and reference sites, like Simple English Wikipedia, Wiktionary, MDN, HowToWiki, iFixit, ArchWiki, Gutenberg, and ExplainXKCD, are also available as ZIM files.
  • StackExchange offers either full community data dumps or third-party ZIM files. The entire unfiltered StackOverflow weighs only around 100 GB.
  • Reddit has been comprehensively scraped by Pushshift, with data spanning up to 2022. This data is quite valuable for search: given the degrading quality of results elsewhere, most people are now forced to append site:reddit.com to nearly every query. But the database is rather large (growing to terabytes if the latest data is included), would need plenty of filtering (see the sketch after this list), and might sit in a bit of a legal gray zone, though probably less so for a search engine.
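On the filtering point: the Pushshift dumps are zstd-compressed NDJSON, so they can be filtered as a stream without ever being decompressed to disk. A minimal sketch, assuming the zstandard package; the field names match Pushshift submission records, while the score threshold is an illustrative filtering choice:

```python
# Stream-filter a Pushshift submissions dump (zstd-compressed NDJSON)
# without decompressing it to disk. Assumes: pip install zstandard.
import io
import json
import zstandard

def filter_submissions(dump_path: str, min_score: int = 10):
    # Pushshift dumps are written with a long zstd window, hence the
    # explicit max_window_size.
    dctx = zstandard.ZstdDecompressor(max_window_size=2**31)
    with open(dump_path, "rb") as f:
        reader = io.TextIOWrapper(dctx.stream_reader(f), encoding="utf-8")
        for line in reader:
            post = json.loads(line)
            # Keep reasonably upvoted link posts; tune to taste.
            if post.get("score", 0) >= min_score and post.get("url"):
                yield post["url"], post.get("title", "")
```

Streaming this way keeps memory flat regardless of dump size, which matters when the uncompressed data runs to terabytes.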

Additionally, there's a great >1 TB web-text corpus, Google's Colossal Clean Crawled Corpus (C4), which contains plenty of websites whose text can simply be fed into the indexing pipeline.
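C4 can likewise be consumed without downloading the full corpus: the Hugging Face datasets library can stream it record by record from the allenai/c4 dataset on the Hub. A minimal sketch; the record limit is illustrative:

```python
# Stream Google's C4 corpus from the Hugging Face Hub without
# downloading the full >1 TB dataset. Assumes: pip install datasets.
from datasets import load_dataset

def stream_c4(limit: int = 1000):
    # streaming=True yields records lazily instead of fetching all shards.
    ds = load_dataset("allenai/c4", "en", split="train", streaming=True)
    for i, record in enumerate(ds):
        if i >= limit:
            break
        # Each C4 record carries "url", "text", and "timestamp" fields.
        yield record["url"], record["text"]

if __name__ == "__main__":
    for url, text in stream_c4(limit=3):
        print(url, text[:80])
```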

daoudclarke added the enhancement label on Dec 23, 2023
ghost commented Jan 1, 2024

Is this still wanted, or even possible at this stage, to bootstrap a completely new index?
