Replace 'XML' by 'xml2' · Issue #7 · trinker/textreadr · GitHub

Closed
jeroen opened this issue Apr 25, 2017 · 2 comments

Comments

jeroen commented Apr 25, 2017

The XML package is super old and no longer maintained. You should switch to xml2. It shouldn't be difficult, but let me know if you need help.
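For anyone making the same switch, here is a rough sketch of how common XML calls tend to map onto xml2. The sample document is invented for illustration, and the XML counterparts in the comments are approximate equivalents, not a claim about what textreadr was calling.

```r
library(xml2)

# Hypothetical sample document, made up for this sketch.
doc <- read_xml("<notes><note id='a'>first</note><note id='b'>second</note></notes>")
# read_xml() stands in for XML::xmlParse() / XML::xmlTreeParse()

# Select nodes with XPath; roughly XML::getNodeSet(doc, "//note")
nodes <- xml_find_all(doc, "//note")

# Text content, vectorised over the node set; roughly sapply(nodes, XML::xmlValue)
values <- xml_text(nodes)

# Attribute values; roughly sapply(nodes, XML::xmlGetAttr, "id")
ids <- xml_attr(nodes, "id")

print(values)
print(ids)
```

The main practical difference is that the xml2 accessors are already vectorised over a node set, so the `sapply()` wrappers usually disappear.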

Owner
trinker commented Apr 25, 2017

Agreed. Didn't realize XML is no longer maintained. This should get the user what they're after.

#' Read in .docx Content
#'
#' Read in the content from a .docx file.
#'
#' @param file The path to the .docx file.
#' @param skip The number of lines to skip.
#' @param remove.empty logical.  If \code{TRUE} empty elements in the vector are
#' removed.
#' @param trim logical.  If \code{TRUE} leading and trailing white space is
#' removed.
#' @param \dots ignored.
#' @return Returns a character vector.
#' @keywords docx
#' @export
#' @author Bryan Goodrich and Tyler Rinker <tyler.rinker@@gmail.com>.
#' @examples
#' \dontrun{
#' url <- "https://github.com/trinker/textreadr/raw/master/inst/docs/Yasmine_Interview_Transcript.docx"
#' file <- download(url)
#' (txt <- read_docx(file))
#' }
read_docx <- function (file, skip = 0, remove.empty = TRUE, trim = TRUE, ...) {

    ## create temp dir
    tmp <- tempfile()
    if (!dir.create(tmp)) stop("Temporary directory could not be established.")

    ## clean up
    on.exit(unlink(tmp, recursive=TRUE))

    ## unzip docx
    xmlfile <- file.path(tmp, "word", "document.xml")
    utils::unzip(file, exdir = tmp)

    ## read in the unzipped docx
    doc <- xml2::read_xml(xmlfile)

    ## extract the content
    nodeSet <- xml2::xml_find_all(doc, "//w:p")
    pvalues <- xml2::xml_text(nodeSet)

    ## formatting
    if (isTRUE(remove.empty)) pvalues <- pvalues[!grepl("^\\s*$", pvalues)]
    if (skip > 0) pvalues <- pvalues[-seq(skip)]
    if (isTRUE(trim)) pvalues <- trimws(pvalues)

    pvalues

}
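The extraction step can be tried without a real .docx. The sketch below runs the same `//w:p` query on a small hand-written fragment that stands in for `word/document.xml` (the fragment is an assumption for illustration, not taken from an actual file). The `w:` prefix resolves because `xml_find_all()` by default uses the namespaces declared in the document.

```r
library(xml2)

# Hand-written stand-in for word/document.xml; not from a real .docx.
xml_string <- paste0(
  '<w:document xmlns:w="http://schemas.openxmlformats.org/wordprocessingml/2006/main">',
  '<w:body>',
  '<w:p><w:r><w:t>First paragraph.</w:t></w:r></w:p>',
  '<w:p></w:p>',
  '<w:p><w:r><w:t xml:space="preserve">Second </w:t></w:r>',
  '<w:r><w:t>paragraph.</w:t></w:r></w:p>',
  '</w:body></w:document>'
)

doc <- read_xml(xml_string)

# One node per paragraph; the text of all runs inside a paragraph is concatenated.
nodeSet <- xml_find_all(doc, "//w:p")
paras <- xml_text(nodeSet)

print(paras)
```

The empty `w:p` element comes back as an empty string, which is the case the `remove.empty` argument is meant to handle.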

Owner
trinker commented Apr 25, 2017

Note: the other read_ functions should also get a trim argument, and possibly a remove.empty argument.
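One way this could be shared across readers is a small internal helper that applies the same three steps in the same order as read_docx above. This is only a sketch: `clean_lines` is a hypothetical name and signature, not an existing textreadr function.

```r
# Hypothetical helper; the name and arguments are assumptions for illustration.
clean_lines <- function(x, skip = 0, remove.empty = TRUE, trim = TRUE) {
    # drop elements that are empty or whitespace only
    if (isTRUE(remove.empty)) x <- x[!grepl("^\\s*$", x)]
    # drop the first `skip` remaining elements
    if (skip > 0) x <- x[-seq_len(skip)]
    # strip leading and trailing white space
    if (isTRUE(trim)) x <- trimws(x)
    x
}

raw <- c("Title", "", "  indented line  ", "   ", "last line")
cleaned <- clean_lines(raw, skip = 1)
print(cleaned)
```

Because empty elements are removed before skipping, `skip` counts non-empty lines here; a reader that needs to skip raw lines would have to reorder the steps.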
