Nanogram is a simple library for basic text tokenization and ngram generation. Most libraries that do this kind of thing are very academic and/or bundle it with lots of other NLP algorithms. If you just need to clean up some text, filter out some stop words, and then get some ngrams for generating word counts or to feed into other algorithms, then Nanogram is for you.
Here's the general idea:
# Basically just split on whitespace with no sanitizing
tokenizer = Nanogram::Tokenizer.new
tokenizer.ngrams(1, "Hello! Here are some ngrams.")
# => ["Hello!", "Here", "are", "some", "ngrams."]
# A more complex example
tokenizer = Nanogram::Tokenizer.new
# Set up some sanitizing rules:
# Downcase text, replace punctuation with spaces and strip whitespace from the ends
tokenizer.sanitizer.downcase.gsub(/[^a-z0-9']/,' ').strip
# Add some stop words (using file from http://www.infochimps.com/datasets/list-of-english-stopwords)
tokenizer.filters << Nanogram::Filters::StopWords.load("english_stopwords.tsv")
# And filter out 1-character words
tokenizer.filters << Nanogram::Filters::Proc.new { |text| text.length == 1 }
# Generate 1-grams and 2-grams
tokenizer.ngrams(1..2, "Hello! Here are some ngrams.")
# => ["hello", "here", "ngrams", "hello here", "here ngrams"]
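Under the hood, generating ngrams is just sliding a window of size N over the token list. Nanogram's internals aren't shown here, but the grouping step in the example above can be reproduced in plain Ruby with `each_cons` (an illustrative sketch, not Nanogram's actual code):

```ruby
# Slide a window of each requested size over the token list,
# joining each window back into a space-separated ngram.
def ngrams_of(tokens, sizes)
  Array(sizes).flat_map do |n|
    tokens.each_cons(n).map { |group| group.join(' ') }
  end
end

tokens = ["hello", "here", "ngrams"]
ngrams_of(tokens, 1..2)
# => ["hello", "here", "ngrams", "hello here", "here ngrams"]
```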
Here are the key things that happen:
- Sanitize: take raw text and remove stuff you don't want (e.g. punctuation)
- Split: take sanitized text and split it up on some token (usually whitespace)
- Filter: remove any of the tokens generated by splitting (e.g. stop words)
- Ngram: collect tokens into groups of N
The Tokenizer manages these steps. You can control the Sanitize step with a Sanitizer, the Split step with the splitter attribute on the Tokenizer, and the Filter step by attaching filters to the Tokenizer.
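As a rough mental model of how those four steps compose, here's a self-contained sketch in plain Ruby. The class and method names are invented for illustration and don't match Nanogram's real API:

```ruby
# Minimal pipeline sketch mirroring the four steps above:
# Sanitize -> Split -> Filter -> Ngram. Illustrative only.
class TinyTokenizer
  attr_accessor :splitter

  def initialize
    @splitter = /\s+/   # Split step: whitespace by default
    @filters  = []      # Filter step: predicates that drop tokens
  end

  def add_filter(&block)
    @filters << block
  end

  def ngrams(sizes, text)
    sanitized = text.downcase.gsub(/[^a-z0-9']/, ' ').strip  # Sanitize
    tokens = sanitized.split(@splitter)                      # Split
    tokens = tokens.reject { |t| @filters.any? { |f| f.call(t) } }  # Filter
    Array(sizes).flat_map do |n|                             # Ngram
      tokens.each_cons(n).map { |g| g.join(' ') }
    end
  end
end

t = TinyTokenizer.new
stop = %w[here are some]
t.add_filter { |tok| stop.include?(tok) }
t.add_filter { |tok| tok.length == 1 }
t.ngrams(1..2, "Hello! Here are some ngrams.")
# => ["hello", "ngrams", "hello ngrams"]
```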
Here are some recommended places to get English stop words:
- http://www.infochimps.com/datasets/list-of-english-stopwords
- http://www.ranks.nl/resources/stopwords.html
MIT License (see LICENSE)