8000 GitHub - ZhangChengX/PDFBoT: Extracting Body Text from Academic PDF Documents for Text Mining.
[go: up one dir, main page]
More Web Proxy on the site http://driver.im/
Skip to content
This repository was archived by the owner on Jun 3, 2024. It is now read-only.

ZhangChengX/PDFBoT

Repository files navigation

PDFBoT API

PDFBoT is a tool for accurately extracting body text of the article in PDF format.

Publications

Please cite the following paper if you are using our tool. Thanks!

Environment

MacOS with python3.6+ installed

Dependencies

  • pdf2htmlEX
  • bs4
  • flask

Usage

Command to run PDFBoT API on local computer

python3 main.py

Once the builtin server is launched, head over to http://127.0.0.1:5000/. Follow the instrucation to upload PDF file to extract text.

Run the source code to extract the text

step 1 convert the PDF to HTML by the function "pdftohtml_test", in main.py L72.

  • The input is the url of the PDF document.
  • The output is the path of the corresponding HTML file.

step 2 extract body text from HTML file by the function "getTextFrom2HTML", in extractTextFrom2colHTML.py.

  • The input is the path of the HTML file.
  • The output is the a list of string, each string in the list represents one paragraph of the article in document.

About

Extracting Body Text from Academic PDF Documents for Text Mining.

Resources

License

Stars

Watchers

Forks

Releases

No releases published

Packages

No packages published
0