The following two-step abstraction is provided by the package:
- The vocabulary object is first built from the entire corpus with the help of the `vocab()`, `update_vocab()` and `prune_vocab()` functions.
- Then, the vocabulary is passed alongside the corpus to a variety of corpus pre-processing functions. Most of the `mlvocab` functions accept the `nbuckets` argument for partial or full hashing of the corpus.

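
The two-step workflow above might look roughly like this in R. This is a minimal sketch and is not taken from the package documentation: the toy `corpus` is made up, and both the argument order (corpus first, vocabulary second) and the use of a named list of token vectors as input are assumptions to check against the package help pages, particularly since the API may still change.

```r
library(mlvocab)

# Toy corpus (hypothetical data): a named list of pre-tokenized documents.
corpus <- list(
  a = c("the", "quick", "brown", "fox"),
  b = c("the", "lazy", "dog")
)

# Step 1: build the vocabulary from the entire corpus.
v <- vocab(corpus)

# Step 2: pass the corpus together with the vocabulary to the
# pre-processing functions (argument order is assumed here).
ix <- tix_seq(corpus, v)  # integer term-index sequences, one per document
m  <- dtm(corpus, v)      # document-term matrix
```

Building the vocabulary once and reusing it in every later call keeps the term-to-index mapping consistent across all derived sequences and matrices.
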
Current functionality includes:
- term index sequences: `tix_seq()`, `tix_mat()` and `tix_df()` produce integer sequences suitable for direct consumption by various sequence models.
- term matrices: `dtm()`, `tdm()` and `tcm()` create document-term, term-document and term-co-occurrence matrices respectively.
- subsetting embedding matrices: given pre-trained word vectors, `prune_embeddings()` creates smaller embedding matrices, treating missing and unknown vocabulary terms with grace.
- tfidf weighting: `tfidf()` computes various versions of term frequency, inverse document frequency weighting of `dtm` and `tdm` matrices.

The package is in alpha state. API changes are likely.