Introduction
Mwmbl currently lacks a proper index. However, we do know that most searched-for content originates from relatively few websites, most of which offer first-party or third-party data dumps.
To bootstrap the index, perhaps a manual importer could be written for a common format like ZIM, which could reconstruct the URLs, process the HTML like any other web page, and add everything to the index?
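To make this concrete, here is a minimal sketch of what such an importer could look like, assuming the openZIM python-libzim reader bindings; the entry-id iteration and the path-to-URL mapping are assumptions and would need adjusting to the actual ZIM layout and to Mwmbl's indexing pipeline:

```python
"""Rough sketch of a ZIM importer (assumes the python-libzim reader bindings)."""
from libzim.reader import Archive


def iter_html_entries(zim_path: str):
    """Yield (path, title, html) for every HTML entry in the archive."""
    zim = Archive(zim_path)
    # NOTE: iterating by internal entry id is an assumption; the exact
    # iteration API differs between libzim versions.
    for entry_id in range(zim.all_entry_count):
        entry = zim._get_entry_by_id(entry_id)
        if entry.is_redirect:
            continue
        item = entry.get_item()
        if not item.mimetype.startswith("text/html"):
            continue
        html = bytes(item.content).decode("utf-8", errors="replace")
        yield entry.path, entry.title, html


def reconstruct_url(path: str, base: str = "https://en.wikipedia.org/wiki/") -> str:
    # Hypothetical mapping: a Wikipedia ZIM entry path like "A/Foo_bar"
    # roughly corresponds to https://en.wikipedia.org/wiki/Foo_bar.
    return base + path.split("/", 1)[-1]


if __name__ == "__main__":
    for path, title, html in iter_html_entries("wikipedia_en_all_nopic.zim"):
        # Hand the (url, title, html) triple to the normal page-processing
        # code, exactly as if the page had been crawled.
        print(reconstruct_url(path), title, len(html))
        break
```

Each yielded triple could then go through the same HTML cleaning and tokenisation as a crawled page.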
Candidates
Wikipedia, for example, discourages crawling, but offers a complete archive, available either as a giant full database dump or, more practically, as a compressed ZIM file, which is essentially a bundle of Zstandard-compressed HTML with a full-text search index. The entire Wikipedia (without images or edit history) weighs only around 30 GB.
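As a bonus, that embedded full-text index can be queried directly through the same bindings, which might help with spot-checking an import; a rough example, assuming python-libzim's search module (exact API names may differ between versions):

```python
# Querying the full-text index embedded in a ZIM file
# (assumes python-libzim's search module; names may differ by version).
from libzim.reader import Archive
from libzim.search import Query, Searcher

zim = Archive("wikipedia_en_all_nopic.zim")
searcher = Searcher(zim)
search = searcher.search(Query().set_query("binary search tree"))
for path in search.getResults(0, 10):  # first 10 matching entry paths
    print(path)
```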
Other wikis and sites, like Simple English Wikipedia, Wiktionary, MDN, HowToWiki, iFixit, ArchWiki, Gutenberg, and ExplainXKCD, are also available as ZIM files.
StackExchange likewise has either full official community data dumps or third-party ZIM files. The entire unfiltered StackOverflow dump weighs only around 100 GB.
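The official dumps are per-site 7z archives of XML files (Posts.xml, Users.xml, and so on), so a streaming XML parser is enough to pull out question titles and bodies; a rough sketch, assuming the public dump's row-per-post attribute schema and a hypothetical file name:

```python
# Streaming parse of a Stack Exchange Posts.xml dump (one <row .../> per post).
# Attribute names (PostTypeId, Title, Body) follow the public dump schema.
import xml.etree.ElementTree as ET


def iter_questions(posts_xml_path: str):
    """Yield (title, body_html) for every question in Posts.xml."""
    for _, elem in ET.iterparse(posts_xml_path, events=("end",)):
        if elem.tag != "row":
            continue
        post_type = elem.get("PostTypeId")
        title, body = elem.get("Title", ""), elem.get("Body", "")
        elem.clear()  # drop the element's contents to keep memory usage down
        if post_type == "1":  # 1 = question, 2 = answer
            yield title, body


for title, body in iter_questions("stackoverflow.com-Posts.xml"):
    print(title)
    break
```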
Reddit has been comprehensively scraped by Pushshift, with data spanning up to 2022. This data is quite valuable for search, given that many people now feel forced to append `site:reddit.com` to their every search because of the degrading quality of results elsewhere. However, the dataset is rather large (growing to terabytes if the latest data is included), would need plenty of filtering, and might sit in a bit of a legal gray zone, though probably less so for a search engine.
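The Pushshift monthly dumps are zstd-compressed newline-delimited JSON, so they can at least be streamed without decompressing whole files; a sketch assuming the zstandard package and the usual submission fields (title, selftext, permalink), with a hypothetical dump file name:

```python
# Streaming a Pushshift submissions dump (zstd-compressed NDJSON).
# Field names (title, selftext, permalink) are the usual Pushshift schema.
import io
import json
import zstandard


def iter_submissions(dump_path: str):
    with open(dump_path, "rb") as fh:
        # Pushshift files use a long zstd window, hence max_window_size.
        dctx = zstandard.ZstdDecompressor(max_window_size=2**31)
        reader = io.TextIOWrapper(dctx.stream_reader(fh), encoding="utf-8")
        for line in reader:
            post = json.loads(line)
            url = "https://www.reddit.com" + post.get("permalink", "")
            yield url, post.get("title", ""), post.get("selftext", "")


for url, title, text in iter_submissions("RS_2022-06.zst"):
    print(url, title)
    break
```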
Additionally, there is Google's Colossal Clean Crawled Corpus (C4), a great >1 TB web-text corpus covering a large number of websites, whose text could simply be fed into the indexing pipeline.
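C4 is mirrored on the Hugging Face hub as allenai/c4 with text, url and timestamp columns, so it could in principle be streamed straight into the pipeline without downloading the full corpus; a sketch assuming the datasets library:

```python
# Streaming C4 from the Hugging Face hub (allenai/c4, "en" config).
# Each record carries "text", "url" and "timestamp" fields.
from datasets import load_dataset

c4 = load_dataset("allenai/c4", "en", split="train", streaming=True)
for record in c4:
    # record["url"] and record["text"] map directly onto the indexing
    # pipeline's notion of a crawled page.
    print(record["url"], record["text"][:80])
    break
```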