Replace 'XML' by 'xml2' #7

jeroen · 2017-04-25T10:29:33Z

The XML package is super old and no longer maintained. You should switch to xml2. It shouldn't be difficult, but let me know if you need help.

trinker · 2017-04-25T18:48:17Z

Agreed. Didn't realize XML is no longer maintained. This should get the user what they're after.

#' Read in .docx Content
#'
#' Read in the content from a .docx file.
#'
#' @param file The path to the .docx file.
#' @param skip The number of lines to skip.
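#' @param remove.empty logical.  If \code{TRUE} empty elements in the vector are
#' removed.
#' @param trim logical.  If \code{TRUE} leading and trailing white space is
#' removed from each element.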
#' @param \dots ignored.
#' @return Returns a character vector.
#' @keywords docx
#' @export
#' @author Bryan Goodrich and Tyler Rinker <tyler.rinker@@gmail.com>.
#' @examples
#' \dontrun{
#' url <- "https://github.com/trinker/textreadr/raw/master/inst/docs/Yasmine_Interview_Transcript.docx"
#' file <- download(url)
#' (txt <- read_docx(file))
#' }
read_docx <- function (file, skip = 0, remove.empty = TRUE, trim = TRUE, ...) {

    ## create temp dir
    tmp <- tempfile()
    if (!dir.create(tmp)) stop("Temporary directory could not be established.")

    ## clean up
    on.exit(unlink(tmp, recursive=TRUE))

    ## unzip docx
    xmlfile <- file.path(tmp, "word", "document.xml")
    utils::unzip(file, exdir = tmp)

    ## read in the unzipped docx
    doc <- xml2::read_xml(xmlfile)

    ## extract the content
    nodeSet <- xml2::xml_find_all(doc, "//w:p")
    pvalues <- xml2::xml_text(nodeSet)

    ## formatting
    if (isTRUE(remove.empty)) pvalues <- pvalues[!grepl("^\\s*$", pvalues)]
    if (skip > 0) pvalues <- pvalues[-seq(skip)]
    if (isTRUE(trim)) pvalues <- trimws(pvalues)

    pvalues

}

trinker · 2017-04-25T18:49:09Z

Note: the other read_ functions should also get a trim argument, and possibly a remove.empty argument.
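
One way to share that behavior across the read_ functions might be a small post-processing helper. The sketch below is only an illustration: `clean_lines` is a hypothetical name, not an existing textreadr function, and it assumes each reader produces a character vector and that the steps run in the same order as in read_docx above (remove empty, then skip, then trim).

```r
## Hypothetical helper (not part of textreadr): shared post-processing
## for the character vectors produced by read_* functions.
clean_lines <- function(x, skip = 0, remove.empty = TRUE, trim = TRUE) {

    ## drop elements that are empty or contain only white space
    if (isTRUE(remove.empty)) x <- x[!grepl("^\\s*$", x)]

    ## drop the first `skip` remaining elements (never more than exist)
    if (skip > 0) x <- x[-seq_len(min(skip, length(x)))]

    ## strip leading and trailing white space
    if (isTRUE(trim)) x <- trimws(x)

    x
}

## usage: skip the title line, drop blanks, trim what is left
clean_lines(c("  Title ", "", "   ", "First line", " Second line  "), skip = 1)
```

Each reader could then end with a call such as `clean_lines(pvalues, skip, remove.empty, trim)`, so the arguments behave the same everywhere.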

trinker closed this as completed in 7cd7e06 Apr 30, 2017
