spacyr: an R wrapper for spaCy

This package is an R wrapper to the spaCy "industrial strength natural language processing" library from http://spacy.io.

Prerequisites

Python must be installed on your system.
spaCy must be installed on your system. Follow these instructions.
You need (of course) to install this package:
```
devtools::install_github("kbenoit/spacyr")
```

Examples

The tag() function calls spaCy to both tokenize and tag the texts, and returns a special class of tokenizedText object (see quanteda) that has both tokens and tags. The approach to tokenizing taken by spaCy is inclusive: it includes all tokens without restrictions. The default method for tag() is the Google tagset for parts-of-speech.

require(spacyr)
#> Loading required package: spacyr
#> Loading required package: quanteda
#> quanteda version 0.9.9.2
#> 
#> Attaching package: 'quanteda'
#> The following object is masked from 'package:base':
#> 
#>     sample
# set this for Ken's macOS system, because homebrew Python is doing the work
options(PYTHON_PATH = "/usr/local/bin")

# show tag on some sample sentences
head(data_sentences)
#> [1] "They can at any moment have peace simply by laying down their arms and submitting to the national authority under the Constitution."                                                                                                                                                    
#> [2] "But our laws have provided no means by which this could be accomplished, or by which the losses of the regiments when once sent to the front could be repaired."                                                                                                                        
#> [3] "The negotiation with France has been conducted by our minister with zeal and ability, and in all respects to my entire satisfaction."                                                                                                                                                   
#> [4] "So again, if you have specific plans to cut costs, cover more people, and increase choice - tell America what you'd do differently."                                                                                                                                                    
#> [5] "They are trying to shake the will of our country and our friends, but the United States of America will never be intimidated by thugs and assassins."                                                                                                                                   
#> [6] "Some expansion in peacetime medical research and other programs of the Public Health Service is provided for in the appropriation estimates for these purposes totaling approximately 87 million dollars for the fiscal year 1947 which are submitted under provisions of existing law."
taggedsents <- tag(data_sentences[1:6])
taggedsents
#> tokenizedText_tagged object from 2 documents (tagset = Google).
#> text1 :
#>  [1] "They_PRON"          "can_VERB"           "at_ADP"            
#>  [4] "any_DET"            "moment_NOUN"        "have_VERB"         
#>  [7] "peace_NOUN"         "simply_ADV"         "by_ADP"            
#> [10] "laying_VERB"        "down_PART"          "their_ADJ"         
#> [13] "arms_NOUN"          "and_CONJ"           "submitting_VERB"   
#> [16] "to_ADP"             "the_DET"            "national_ADJ"      
#> [19] "authority_NOUN"     "under_ADP"          "the_DET"           
#> [22] "Constitution_PROPN" "._PUNCT"           
#> 
#> text2 :
#>  [1] "But_CONJ"          "our_ADJ"           "laws_NOUN"        
#>  [4] "have_VERB"         "provided_VERB"     "no_DET"           
#>  [7] "means_NOUN"        "by_ADP"            "which_ADJ"        
#> [10] "this_DET"          "could_VERB"        "be_VERB"          
#> [13] "accomplished_VERB" ",_PUNCT"           "or_CONJ"          
#> [16] "by_ADP"            "which_ADJ"         "the_DET"          
#> [19] "losses_NOUN"       "of_ADP"            "the_DET"          
#> [22] "regiments_NOUN"    "when_ADV"          "once_ADP"         
#> [25] "sent_VERB"         "to_ADP"            "the_DET"          
#> [28] "front_NOUN"        "could_VERB"        "be_VERB"          
#> [31] "repaired_VERB"     "._PUNCT"          
#> 
#> text3 :
#>  [1] "The_DET"           "negotiation_NOUN"  "with_ADP"         
#>  [4] "France_PROPN"      "has_VERB"          "been_VERB"        
#>  [7] "conducted_VERB"    "by_ADP"            "our_ADJ"          
#> [10] "minister_NOUN"     "with_ADP"          "zeal_NOUN"        
#> [13] "and_CONJ"          "ability_NOUN"      ",_PUNCT"          
#> [16] "and_CONJ"          "in_ADP"            "all_DET"          
#> [19] "respects_NOUN"     "to_ADP"            "my_ADJ"           
#> [22] "entire_ADJ"        "satisfaction_NOUN" "._PUNCT"          
#> 
#> text4 :
8817

#>  [1] "So_ADV"          "again_ADV"       ",_PUNCT"        
#>  [4] "if_ADP"          "you_PRON"        "have_VERB"      
#>  [7] "specific_ADJ"    "plans_NOUN"      "to_PART"        
#> [10] "cut_VERB"        "costs_NOUN"      ",_PUNCT"        
#> [13] "cover_VERB"      "more_ADJ"        "people_NOUN"    
#> [16] ",_PUNCT"         "and_CONJ"        "increase_VERB"  
#> [19] "choice_NOUN"     "-_PUNCT"         "tell_VERB"      
#> [22] "America_PROPN"   "what_NOUN"       "you_PRON"       
#> [25] "'d_VERB"         "do_VERB"         "differently_ADV"
#> [28] "._PUNCT"        
#> 
#> text5 :
#>  [1] "They_PRON"        "are_VERB"         "trying_VERB"     
#>  [4] "to_PART"          "shake_VERB"       "the_DET"         
#>  [7] "will_NOUN"        "of_ADP"           "our_ADJ"         
#> [10] "country_NOUN"     "and_CONJ"         "our_ADJ"         
#> [13] "friends_NOUN"     ",_PUNCT"          "but_CONJ"        
#> [16] "the_DET"          "United_PROPN"     "States_PROPN"    
#> [19] "of_ADP"           "America_PROPN"    "will_VERB"       
#> [22] "never_ADV"        "be_VERB"          "intimidated_VERB"
#> [25] "by_ADP"           "thugs_NOUN"       "and_CONJ"        
#> [28] "assassins_NOUN"   "._PUNCT"         
#> 
#> text6 :
#>  [1] "Some_DET"           "expansion_NOUN"     "in_ADP"            
#>  [4] "peacetime_NOUN"     "medical_ADJ"        "research_NOUN"     
#>  [7] "and_CONJ"           "other_ADJ"          "programs_NOUN"     
#> [10] "of_ADP"             "the_DET"            "Public_PROPN"      
#> [13] "Health_PROPN"       "Service_PROPN"      "is_VERB"           
#> [16] "provided_VERB"      "for_ADP"            "in_ADP"            
#> [19] "the_DET"            "appropriation_NOUN" "estimates_NOUN"    
#> [22] "for_ADP"            "these_DET"          "purposes_NOUN"     
#> [25] "totaling_VERB"      "approximately_ADV"  "87_NUM"            
#> [28] "million_NUM"        "dollars_NOUN"       "for_ADP"           
#> [31] "the_DET"            "fiscal_ADJ"         "year_NOUN"         
#> [34] "1947_NUM"           "which_ADJ"          "are_VERB"          
#> [37] "submitted_VERB"     "under_ADP"          "provisions_NOUN"   
#> [40] "of_ADP"             "existing_ADJ"       "law_NOUN"          
#> [43] "._PUNCT"

Note that while the printed structure appears to append the token and its tag, in fact the structure of the object records these separately:

str(taggedsents)
#> List of 2
#>  $ tokens:List of 6
#>   ..$ text1: chr [1:23] "They" "can" "at" "any" ...
#>   ..$ text2: chr [1:32] "But" "our" "laws" "have" ...
#>   ..$ text3: chr [1:24] "The" "negotiation" "with" "France" ...
#>   ..$ text4: chr [1:28] "So" "again" "," "if" ...
#>   ..$ text5: chr [1:29] "They" "are" "trying" "to" ...
#>   ..$ text6: chr [1:43] "Some" "expansion" "in" "peacetime" ...
#>  $ tags  :List of 6
#>   ..$ text1: chr [1:23] "PRON" "VERB" "ADP" "DET" ...
#>   ..$ text2: chr [1:32] "CONJ" "ADJ" "NOUN" "VERB" ...
#>   ..$ text3: chr [1:24] "DET" "NOUN" "ADP" "PROPN" ...
#>   ..$ text4: chr [1:28] "ADV" "ADV" "PUNCT" "ADP" ...
#>   ..$ text5: chr [1:29] "PRON" "VERB" "VERB" "PART" ...
#>   ..$ text6: chr [1:43] "DET" "NOUN" "ADP" "NOUN" ...
#>  - attr(*, "tagset")= chr "google"
#>  - attr(*, "class")= chr [1:2] "tokenizedTexts_tagged" "list"

To get a summary of the parts of speech for each document, use the data.frame returned by the summary() method for this new object class:

summary(taggedsents)
#>       ADJ ADP ADV CONJ DET NOUN PART PRON PROPN PUNCT VERB NUM
#> text1   2   4   1    1   3    4    1    1     1     1    4   0
#> text2   3   5   1    2   5    5    0    0     0     2    9   0
#> text3   3   5   0    2   2    6    0    0     1     2    3   0
#> text4   2   1   3    1   0    5    1    2     1     5    7   0
#> text5   2   3   1    3   2    5    1    1     3     2    6   0
#> text6   5   8   1    1   5   11    0    0     3     1    5   3

Alternatively the Penn Treebank part-of-speech tagset can be applied:

taggedsents2 <- tag(data_sentences[1:6], tagset = "penn")
summary(taggedsents2)
#>       . CC DT IN JJ MD NN NNP NNS PRP PRP. RB RP VB VBG X. VBN VBP WDT WRB
#> text1 1  1  3  4  1  1  3   1   1   1    1  1  1  1   2  0   0   0   0   0
#> text2 1  2  5  5  0  2  1   0   4   0    1  0  0  2   0  1   4   1   2   1
#> text3 1  2  2  5  1  0  5   1   1   0    2  0  0  0   0  1   2   0   0   0
#> text4 1  1  0  1  1  1  1   1   3   2    0  3  0  4   0  3   0   2   0   0
#> text5 1  3  2  3  0  1  2   3   3   1    2  1  0  2   1  1   1   1   0   0
#> text6 1  1  5  8  4  0  6   3   5   0    0  1  0  0   1  0   2   1   1   0
#>       VBZ HYPH JJR TO WP CD
#> text1   0    0   0  0  0  0
#> text2   0    0   0  0  0  0
#> text3   1    0   0  0  0  0
#> text4   0    1   1  1  1  0
#> text5   0    0   0  1  0  0
#> text6   1    0   0  0  0  3

Many of the standard methods from quanteda work on the new tagged token objects:

docnames(taggedsents)
#> [1] "text1" "text2" "text3" "text4" "text5" "text6"
ndoc(taggedsents)
#> [1] 6
ntoken(taggedsents)
#> text1 text2 text3 text4 text5 text6 
#>    23    32    24    28    29    43
ntype(taggedsents)
#> text1 text2 text3 text4 text5 text6 
#>    22    26    22    25    24    37

Comments and feedback

We welcome your comments and feedback. Please file issues on the issues page, and/or send me comments at kbenoit@lse.ac.uk.

Plans moving ahead include finding much more efficient methods of calling spaCy from R than the current use of system2().

Name		Name	Last commit message	Last commit date
Latest commit History 4 Commits
R		R
data		data
inst		inst
man		man
.Rbuildignore		.Rbuildignore
.gitignore		.gitignore
.travis.yml		.travis.yml
DESCRIPTION		DESCRIPTION
NAMESPACE		NAMESPACE
README.Rmd		README.Rmd
README.md		README.md

Provide feedback

Saved searches

Use saved searches to filter your results more quickly

Uh oh!

Repository files navigation

spacyr: an R wrapper for spaCy

Prerequisites

Examples

Comments and feedback

About

Uh oh!

Releases

Packages

Languages

adamobeng/spacyr

Folders and files

Latest commit

History

Repository files navigation

spacyr: an R wrapper for spaCy

Prerequisites

Examples

Comments and feedback

About

Resources

Uh oh!

Stars

Watchers

Forks

Releases

Packages 0

Languages

Packages