ImageMagick and Tesseract failures · Issue #4 · deranjer/goEDMS · GitHub

ImageMagick and Tesseract failures #4

Penquincoder opened this issue Feb 3, 2020 · 5 comments
Open

Labels: bug (Something isn't working)

Comments

@Penquincoder

Just set this up and ran into some bugs while trying to import PDF files:

System: CentOS 7 3.10.0-1062.9.1.el7.x86_64
Method: Download release goEDMS_0.1.8_Linux_x86_64.tar.gz
SESTATUS: permissive

I modified serverConfig.toml to use the correct paths to convert and tesseract:

$ which tesseract
/bin/tesseract

$ which convert
/bin/convert

[ingress]
    IngressPath = 'staging'

[ocr]
    TesseractBin = "/bin/tesseract"
    MagickBin = "/bin/convert"
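
As a rough cross-check of the paths above, the same lookup that `which` performs can be sketched in Go. This is only an illustration, not goEDMS code: `checkBinary` is a hypothetical helper, and the two paths are simply the values from the [ocr] section.

```go
package main

import (
	"fmt"
	"os/exec"
)

// checkBinary reports whether path resolves to an executable file.
// When path contains a slash, exec.LookPath checks that file directly
// instead of searching $PATH.
func checkBinary(name, path string) error {
	resolved, err := exec.LookPath(path)
	if err != nil {
		return fmt.Errorf("%s: %q is not usable: %w", name, path, err)
	}
	fmt.Printf("%s resolved to %s\n", name, resolved)
	return nil
}

func main() {
	// Values copied from the [ocr] section; the key names are used here
	// only as labels in the output.
	bins := map[string]string{
		"TesseractBin": "/bin/tesseract",
		"MagickBin":    "/bin/convert",
	}
	for name, path := range bins {
		if err := checkBinary(name, path); err != nil {
			fmt.Println("config problem:", err)
		}
	}
}
```

If both entries resolve, the configured paths themselves are probably not the cause of the failure.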

I copied existing PDFs to /opt/goEDMS/staging and received the following errors in goedms.log for all of the PDFs to ingest:

{"level":"info","time":"2020-02-02T23:02:09-06:00","message":"Converting PDF To image for OCR/opt/goEDMS/staging/bill.pdf"}
{"level":"info","time":"2020-02-02T23:02:09-06:00","message":"Creating temp image for OCR at: /opt/goEDMS/temp/bill.png"}
{"level":"error","time":"2020-02-02T23:02:09-06:00","message":"Unable to convert PDF Using Magick: /opt/goEDMS/staging/bill.pdfexit status 1"}
{"level":"error","time":"2020-02-02T23:02:09-06:00","message":"OCR Processing failed on file: /opt/goEDMS/staging/bill.pdf: exit status 1"}  

No documents appear in the web-gui.

Owner
deranjer commented Feb 3, 2020

Can you install the latest version and try again? I've updated the error reporting to get a much more detailed error response.

Author
Penquincoder commented Feb 4, 2020

Thanks for adding additional debugging. I'm going to say there's a definite bug here, with potential for destruction of data:

  • Delete old goEdms directory
  • Download release goEDMS_0.1.9_Linux_x86_64.tar.gz
  • Extract to /opt/goEdms
  • Create /opt/goEdms/staging directory
  • Run ./goEdms
  • mv PDF into /opt/goEdms/staging

Log

{"level":"debug","time":"2020-02-03T19:01:38-06:00","message":"Starting processing for file: /opt/goEdms/staging/20280201.pdf"}
{"level":"debug","time":"2020-02-03T19:01:38-06:00","message":"Working on current file: 20280201.pdf"}        
{"level":"info","time":"2020-02-03T19:01:38-06:00","message":"Text processed from PDF without OCR: 20280201.pdf"}           
{"level":"info","time":"2020-02-03T19:01:38-06:00","message":"No record found, assume no duplicate hash: not found"}
{"level":"debug","time":"2020-02-03T19:01:38-06:00","message":"Adding full text for search to bleve: Creation date: 2020-01-28<OCR'd text>"}
{"level":"-","time":"2020-02-03T19:02:38-06:00","message":"wake, now=2020-02-03T19:02:38-06:00"}
{"level":"-","time":"2020-02-03T19:02:38-06:00","message":"run, now=2020-02-03T19:02:38-06:00, entry=1, next=2020-02-03T19:03:38-06:00"}
{"level":"info","time":"2020-02-03T19:02:38-06:00","message":"Starting Ingress Job on folder:/opt/goEdms/staging"}
{"level":"debug","time":"2020-02-03T19:02:38-06:00","message":"Starting processing for file: /opt/goEdms/staging"}
{"level":"warn","time":"2020-02-03T19:02:38-06:00","message":"Unable to get information for file, won't process: /opt/goEdms/staging: stat /opt/goEdms/staging: no such file or directory"}

Based on the error, I checked the /opt/goEdms directory. Sure enough, /staging/ no longer exists! It appears that goEDMS is deleting the staging directory itself instead of just the processed files.

Changing the IngressDeleteOnProcess option in serverConfig.toml doesn't affect the outcome. The /staging/ directory is still deleted entirely whether the option is set to true or false.

No documents appear in the web gui.

Owner
deranjer commented Feb 4, 2020

Okay, let me do some testing and get back to you.

deranjer added the bug label Feb 4, 2020

Owner
deranjer commented Feb 5, 2020

So for some reason goEDMS is saying that 'staging' is a file, not a folder. I'm not sure why that is. I'm adding a few more logging statements and checks to ensure that the root ingress folder is not deleted. I'll hopefully push a new build today for you to try out.

Owner
deranjer commented Feb 5, 2020

Okay, version 0.2.0 is out. Please try that and let me know.

deranjer self-assigned this Feb 5, 2020