nhirakawa/TextProcessor: A text processor to transform raw text for insertion in inverted index

TextProcessor

TextProcessor is a basic text processor for a search engine. It takes an input file to read and a file of stopwords to remove from the resulting text. After reading the input, it passes the text through a tokenizer, which removes periods and then splits on all whitespace and all punctuation other than periods. The resulting token list goes to the stopword remover, which reads in the stopwords and removes them from the list. Finally, the remaining words are passed to a stemmer, which truncates them according to the Porter Stemmer rules, and the stemmed words are printed.
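
The three stages described above can be sketched roughly as follows. This is a minimal illustration, not the code in this repository: the function names (tokenize, remove_stopwords, stem_step_1a, process) are hypothetical, lowercasing is an assumption the description does not mention, and the stemmer applies only step 1a of the Porter algorithm (plural suffixes) as a stand-in for the full rule set.

```python
import re
import string

# Every ASCII punctuation character except the period acts as a separator.
_SEPARATORS = string.punctuation.replace(".", "")
_SPLIT_RE = re.compile(r"[\s" + re.escape(_SEPARATORS) + r"]+")


def tokenize(text):
    """Drop periods, then split on whitespace and the remaining punctuation."""
    without_periods = text.replace(".", "")
    # Lowercasing is an assumption; the description does not cover case handling.
    return [tok.lower() for tok in _SPLIT_RE.split(without_periods) if tok]


def remove_stopwords(tokens, stopwords):
    """Return the tokens that do not appear in the stopword collection."""
    stopword_set = set(stopwords)
    return [tok for tok in tokens if tok not in stopword_set]


def stem_step_1a(word):
    """Apply only step 1a of the Porter algorithm (a stand-in, not the full stemmer)."""
    if word.endswith("sses"):
        return word[:-2]  # caresses -> caress
    if word.endswith("ies"):
        return word[:-2]  # ponies -> poni
    if word.endswith("ss"):
        return word       # caress -> caress
    if word.endswith("s"):
        return word[:-1]  # cats -> cat
    return word


def process(text, stopwords):
    """Run the tokenizer, the stopword remover, and the partial stemmer in order."""
    tokens = tokenize(text)
    kept = remove_stopwords(tokens, stopwords)
    return [stem_step_1a(tok) for tok in kept]


if __name__ == "__main__":
    print(process("The ponies chased two cats. The cats ran!", ["the", "two"]))
```

Removing periods before splitting means an abbreviation such as "U.S.A." collapses into the single token "usa" instead of being split into three one-letter tokens.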

To Run:

Run $ python main.py to process text files

By default, TextProcessor reads input.txt and removes the stopwords listed in stopwords.txt

Command line arguments coming soon
