Bootstrap the index with archives and data-dumps #105

Open

walking-octopus opened this issue Jul 1, 2023 · 0 comments

Labels
enhancement New feature or request

Comments

walking-octopus commented Jul 1, 2023

Introduction

Mwmbl currently lacks a proper index. However, we do know that most searched-for content originates from relatively few websites, most of which offer first-party or third-party data dumps.

To bootstrap the index, perhaps a manual importer could be written for a common format like ZIM, which would reconstruct the URLs, process the HTML like any other web page, and add everything to the index.
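As a rough illustration, here is what the core of such an importer might look like in Python, assuming the python-libzim bindings (`Archive`, `get_entry_by_path`, and `main_entry` are the library's actual API). The `index_page` hook, the archive filename, and the base URL are hypothetical placeholders, not Mwmbl's real interfaces:

```python
# Sketch of a ZIM importer, assuming python-libzim and BeautifulSoup
# (pip install libzim beautifulsoup4).
from libzim.reader import Archive
from bs4 import BeautifulSoup

def index_page(url: str, title: str, text: str) -> None:
    # Hypothetical stand-in for Mwmbl's real indexing entry point;
    # just prints for demonstration.
    print(url, title, text[:80])

def import_entry(archive: Archive, path: str, base_url: str) -> None:
    """Reconstruct the URL for one ZIM entry and index its text."""
    entry = archive.get_entry_by_path(path)
    html = bytes(entry.get_item().content).decode("utf-8")
    text = BeautifulSoup(html, "html.parser").get_text(separator=" ", strip=True)
    # URL reconstruction is illustrative: ZIM entry paths map onto the
    # site's article paths, modulo the namespace scheme of the archive.
    index_page(url=base_url + entry.path, title=entry.title, text=text)

if __name__ == "__main__":
    archive = Archive("wikipedia_en_all_nopic.zim")  # illustrative filename
    # Start from the archive's main entry; a full importer would walk
    # every article entry rather than a single path.
    import_entry(archive, archive.main_entry.get_item().path,
                 "https://en.wikipedia.org/wiki/")
```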

Candidates

  • Wikipedia, for example, discourages crawling, but offers a complete archive, available either as a giant full database dump or, more practically, as a compressed ZIM file, which is essentially a bundle of ZSTD-compressed HTML with a full-text search index. The entire Wikipedia (without images or edit history) weighs only around 30 GB.
  • Other wikis and reference sites, like Simple English Wikipedia, Wiktionary, MDN, HowToWiki, iFixit, ArchWiki, Gutenberg, and ExplainXKCD, are also available as ZIM files.
  • StackExchange offers either full community data dumps or third-party ZIM files. The entire unfiltered StackOverflow weighs only around 100 GB.
  • Reddit has been comprehensively scraped by Pushshift, with data spanning up to 2022. This data is quite valuable for search: given the degrading quality of results elsewhere, most people are now forced to append site:reddit.com to nearly every query. But the database is rather large (growing to terabytes if the latest data is included), would need plenty of filtering (see the sketch after this list), and might sit in a bit of a legal gray zone, though probably less so for a search engine.
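On the filtering point: the Pushshift dumps are zstd-compressed NDJSON, so they can be filtered as a stream without ever being decompressed to disk. A minimal sketch, assuming the zstandard package; the field names match Pushshift submission records, while the score threshold is an illustrative filtering choice:

```python
# Stream-filter a Pushshift submissions dump (zstd-compressed NDJSON)
# without decompressing it to disk. Assumes: pip install zstandard.
import io
import json
import zstandard

def filter_submissions(dump_path: str, min_score: int = 10):
    # Pushshift dumps are written with a long zstd window, hence the
    # explicit max_window_size.
    dctx = zstandard.ZstdDecompressor(max_window_size=2**31)
    with open(dump_path, "rb") as f:
        reader = io.TextIOWrapper(dctx.stream_reader(f), encoding="utf-8")
        for line in reader:
            post = json.loads(line)
            # Keep reasonably upvoted link posts; tune to taste.
            if post.get("score", 0) >= min_score and post.get("url"):
                yield post["url"], post.get("title", "")
```

Streaming this way keeps memory flat regardless of dump size, which matters when the uncompressed data runs to terabytes.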

Additionally, there's a great >1 TB web-text corpus, Google's Colossal Clean Crawled Corpus (C4), which contains plenty of websites whose text can simply be fed into the indexing pipeline.
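C4 can likewise be consumed without downloading the full corpus: the Hugging Face datasets library can stream it record by record from the allenai/c4 dataset on the Hub. A minimal sketch; the record limit is illustrative:

```python
# Stream Google's C4 corpus from the Hugging Face Hub without
# downloading the full >1 TB dataset. Assumes: pip install datasets.
from datasets import load_dataset

def stream_c4(limit: int = 1000):
    # streaming=True yields records lazily instead of fetching all shards.
    ds = load_dataset("allenai/c4", "en", split="train", streaming=True)
    for i, record in enumerate(ds):
        if i >= limit:
            break
        # Each C4 record carries "url", "text", and "timestamp" fields.
        yield record["url"], record["text"]

if __name__ == "__main__":
    for url, text in stream_c4(limit=3):
        print(url, text[:80])
```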

daoudclarke added the enhancement label on Dec 23, 2023
ghost commented Jan 1, 2024

Is this still wanted, or even possible at this stage, to bootstrap a completely new index?
