A very small, very simple text tokenizer and ngram generator.

License

Notifications You must be signed in to change notification settings

hayesdavis/nanogram

Folders and files

NameName
Last commit message
Last commit date

Latest commit

 

History

4 Commits
 
 
 
 
 
 
 
 
 
 

Repository files navigation

nanogram

Nanogram is a simple library for basic text tokenization and ngram generation. Most libraries for this kind of thing are very academic and/or bundle it with lots of other NLP algorithms. If you just need to clean up some text, filter out some stop words and then get some ngrams for generating word counts or feeding into other algorithms, then nanogram is for you.

General Usage

Here's the general idea:

# Basically just split on whitespace with no sanitizing
tokenizer = Nanogram::Tokenizer.new
tokenizer.ngrams(1, "Hello! Here are some ngrams.")
# => ["Hello!", "Here", "are", "some", "ngrams."] 

# A more complex example
tokenizer = Nanogram::Tokenizer.new
# Setup some sanitizing rules:
# Downcase text, replace punctuation with spaces and strip whitespace from the ends
tokenizer.sanitizer.downcase.gsub(/[^a-z0-9']/,' ').strip
# Add some stop words (using file from http://www.infochimps.com/datasets/list-of-english-stopwords)
tokenizer.filters << Nanogram::Filters::StopWords.load("english_stopwords.tsv")
# And filter out 1 character words
tokenizer.filters << Nanogram::Filters::Proc.new{|text| text.length == 1}
# Generate 1 grams and 2 grams
tokenizer.ngrams(1..2, "Hello! Here are some ngrams.")
# => ["hello", "here", "ngrams", "hello here", "here ngrams"] 
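Since ngrams come back as a plain Array of Strings, building the word counts mentioned above is just a tally. Here's a sketch; the `ngrams` array stands in for the output of `Tokenizer#ngrams`:

```ruby
# Sample ngram output (hypothetical data, shaped like Tokenizer#ngrams returns)
ngrams = ["hello", "here", "ngrams", "hello here", "here ngrams", "hello"]

# Tally occurrences of each ngram
counts = Hash.new(0)
ngrams.each { |gram| counts[gram] += 1 }

counts["hello"]       # => 2
counts["here ngrams"] # => 1
```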

How it works

Here are the key things that happen:

  • Sanitize: take raw text and remove stuff you don't want (e.g. punctuation, etc)
  • Split: take sanitized text and split it up on some token (usually whitespace)
  • Filter: remove any of the tokens generated by splitting (e.g. stop words)
  • Ngram: collect tokens into groups of N

The Tokenizer manages these steps. You can control the sanitize step with a Sanitizer, the Split step with the splitter attribute on the Tokenizer and the Filter step by attaching filters to the Tokenizer.
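The four steps can be sketched in plain Ruby. This is not nanogram's actual implementation, just the same pipeline spelled out by hand; the stop word list and regex are illustrative assumptions matching the example above:

```ruby
# Hypothetical stop word list for illustration
STOP_WORDS = %w[are some]

def ngrams(n_range, text)
  # Sanitize: downcase and replace unwanted characters with spaces
  sanitized = text.downcase.gsub(/[^a-z0-9']/, ' ').strip
  # Split: break the sanitized text on whitespace
  tokens = sanitized.split(/\s+/)
  # Filter: drop stop words and 1-character tokens
  tokens.reject! { |t| t.length == 1 || STOP_WORDS.include?(t) }
  # Ngram: collect consecutive tokens into groups of N for each N
  n_range.flat_map do |n|
    tokens.each_cons(n).map { |group| group.join(' ') }
  end
end

ngrams(1..2, "Hello! Here are some ngrams.")
# => ["hello", "here", "ngrams", "hello here", "here ngrams"]
```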

Stop Words

Here are some recommended places to get English stop words:

License

MIT License (see LICENSE)
