Nanogram is a simple library for basic text tokenization and ngram generation. Most libraries that do this kind of thing are very academic and/or bundle it with lots of other NLP algorithms. If you just need to clean up some text, filter out some stop words, and then get some ngrams for generating word counts or to feed into other algorithms, then Nanogram is for you.
Here's the general idea:
# Basically just split on whitespace with no sanitizing
tokenizer = Nanogram::Tokenizer.new
tokenizer.ngrams(1, "Hello! Here are some ngrams.")
# => ["Hello!", "Here", "are", "some", "ngrams."]
# A more complex example
tokenizer = Nanogram::Tokenizer.new
# Set up some sanitizing rules:
# Downcase text, replace punctuation with spaces and strip whitespace from the ends
tokenizer.sanitizer.downcase.gsub(/[^a-z0-9']/,' ').strip
# Add some stop words (using file from http://www.infochimps.com/datasets/list-of-english-stopwords)
tokenizer.filters << Nanogram::Filters::StopWords.load("english_stopwords.tsv")
# And filter out 1-character words
tokenizer.filters << Nanogram::Filters::Proc.new { |text| text.length == 1 }
# Generate 1-grams and 2-grams
tokenizer.ngrams(1..2, "Hello! Here are some ngrams.")
# => ["hello", "here", "ngrams", "hello here", "here ngrams"]
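Under the hood, generating ngrams is just sliding a window of size N over the token list. Nanogram's internals aren't shown here, but the grouping step in the example above can be reproduced in plain Ruby with `each_cons` (an illustrative sketch, not Nanogram's actual code):

```ruby
# Slide a window of each requested size over the token list,
# joining each window back into a space-separated ngram.
def ngrams_of(tokens, sizes)
  Array(sizes).flat_map do |n|
    tokens.each_cons(n).map { |group| group.join(' ') }
  end
end

tokens = ["hello", "here", "ngrams"]
ngrams_of(tokens, 1..2)
# => ["hello", "here", "ngrams", "hello here", "here ngrams"]
```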
Here are the key things that happen:
- Sanitize: take raw text and remove stuff you don't want (e.g. punctuation)
- Split: take sanitized text and split it up on some token (usually whitespace)
- Filter: remove any of the tokens generated by splitting (e.g. stop words)
- Ngram: collect tokens into groups of N
The Tokenizer manages these steps. You can control the Sanitize step with a Sanitizer, the Split step with the splitter attribute on the Tokenizer, and the Filter step by attaching filters to the Tokenizer.
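As a rough mental model of how those four steps compose, here's a self-contained sketch in plain Ruby. The class and method names are invented for illustration and don't match Nanogram's real API:

```ruby
# Minimal pipeline sketch mirroring the four steps above:
# Sanitize -> Split -> Filter -> Ngram. Illustrative only.
class TinyTokenizer
  attr_accessor :splitter

  def initialize
    @splitter = /\s+/   # Split step: whitespace by default
    @filters  = []      # Filter step: predicates that drop tokens
  end

  def add_filter(&block)
    @filters << block
  end

  def ngrams(sizes, text)
    sanitized = text.downcase.gsub(/[^a-z0-9']/, ' ').strip  # Sanitize
    tokens = sanitized.split(@splitter)                      # Split
    tokens = tokens.reject { |t| @filters.any? { |f| f.call(t) } }  # Filter
    Array(sizes).flat_map do |n|                             # Ngram
      tokens.each_cons(n).map { |g| g.join(' ') }
    end
  end
end

t = TinyTokenizer.new
stop = %w[here are some]
t.add_filter { |tok| stop.include?(tok) }
t.add_filter { |tok| tok.length == 1 }
t.ngrams(1..2, "Hello! Here are some ngrams.")
# => ["hello", "ngrams", "hello ngrams"]
```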
Here are some recommended places to get English stop words:
- http://www.infochimps.com/datasets/list-of-english-stopwords
- http://www.ranks.nl/resources/stopwords.html
MIT License (see LICENSE)