fix: [ISSUE] indexing pdf with scans inside failed with timeout, tesseract vs llama3.2-vision? · Issue #1194 · Zipstack/unstract · GitHub

fix: [ISSUE] indexing pdf with scans inside failed with timeout, tesseract vs llama3.2-vision? #1194


Open
gaetanquentin opened this issue Mar 17, 2025 · 0 comments
Labels
bug Something isn't working

Comments

@gaetanquentin

Describe the bug

When uploading and indexing a large PDF that contains scanned pages, Tesseract is used for OCR but is too slow, and the request times out:

Tesseract is still running when the extractor times out.

unstract-backend                | 172.28.0.1 - - [17/Mar/2025:09:57:30 +0000] "GET /api/v1/socket/?EIO=4&transport=websocket HTTP/1.1" 400 25 "-" "Mozilla/5.0 (X11; Linux x86_64) AppleWebKit/537.36 (KHTML, like Gecko) Chrome/133.0.0.0 Safari/537.36"
unstract-x2text-service         | [2025-03-17 09:57:30 +0000] [1] [CRITICAL] WORKER TIMEOUT (pid:7)
unstract-x2text-service         | [2025-03-17 09:57:30 +0000] [7] [ERROR] Error handling request /api/v1/x2text/process
unstract-x2text-service         | Traceback (most recent call last):
unstract-x2text-service         |   File "/app/.venv/lib/python3.9/site-packages/gunicorn/workers/sync.py", line 134, in handle
unstract-x2text-service         |     self.handle_request(listener, req, client, addr)
unstract-x2text-service         |   File "/app/.venv/lib/python3.9/site-packages/gunicorn/workers/sync.py", line 177, in handle_request
unstract-x2text-service         |     respiter = self.wsgi(environ, resp.start_response)
unstract-x2text-service         |   File "/app/.venv/lib/python3.9/site-packages/flask/app.py", line 1498, in __call__
unstract-x2text-service         |     return self.wsgi_app(environ, start_response)
unstract-x2text-service         |   File "/app/.venv/lib/python3.9/site-packages/flask/app.py", line 1473, in wsgi_app
unstract-x2text-service         |     response = self.full_dispatch_request()
unstract-x2text-service         |   File "/app/.venv/lib/python3.9/site-packages/flask/app.py", line 880, in full_dispatch_request
unstract-x2text-service         |     rv = self.dispatch_request()
unstract-x2text-service         |   File "/app/.venv/lib/python3.9/site-packages/flask/app.py", line 865, in dispatch_request
unstract-x2text-service         |     return self.ensure_sync(self.view_functions[rule.endpoint])(**view_args)  # type: ignore[no-any-return]
unstract-x2text-service         |   File "/app/app/authentication_middleware.py", line 16, in wrapper
unstract-x2text-service         |     return func(*args, **kwargs)
unstract-x2text-service         |   File "/app/app/controllers/controller.py", line 120, in process
unstract-x2text-service         |     response = requests.request(
unstract-x2text-service         |   File "/app/.venv/lib/python3.9/site-packages/requests/api.py", line 59, in request
unstract-x2text-service         |     return session.request(method=method, url=url, **kwargs)
unstract-x2text-service         |   File "/app/.venv/lib/python3.9/site-packages/requests/sessions.py", line 589, in request
unstract-x2text-service         |     resp = self.send(prep, **send_kwargs)
unstract-x2text-service         |   File "/app/.venv/lib/python3.9/site-packages/requests/sessions.py", line 703, in send
unstract-x2text-service         |     r = adapter.send(request, **kwargs)
unstract-x2text-service         |   File "/app/.venv/lib/python3.9/site-packages/requests/adapters.py", line 667, in send
unstract-x2text-service         |     resp = conn.urlopen(
unstract-x2text-service         |   File "/app/.venv/lib/python3.9/site-packages/urllib3/connectionpool.py", line 789, in urlopen
unstract-x2text-service         |     response = self._make_request(
unstract-x2text-service         |   File "/app/.venv/lib/python3.9/site-packages/urllib3/connectionpool.py", line 536, in _make_request
unstract-x2text-service         |     response = conn.getresponse()
unstract-x2text-service         |   File "/app/.venv/lib/python3.9/site-packages/urllib3/connection.py", line 464, in getresponse
unstract-x2text-service         |     httplib_response = super().getresponse()
unstract-x2text-service         |   File "/usr/local/lib/python3.9/http/client.py", line 1377, in getresponse
unstract-x2text-service         |     response.begin()
unstract-x2text-service         |   File "/usr/local/lib/python3.9/http/client.py", line 320, in begin
unstract-x2text-service         |     version, status, reason = self._read_status()
unstract-x2text-service         |   File "/usr/local/lib/python3.9/http/client.py", line 281, in _read_status
unstract-x2text-service         |     line = str(self.fp.readline(_MAXLINE + 1), "iso-8859-1")
unstract-x2text-service         |   File "/usr/local/lib/python3.9/socket.py", line 716, in readinto
unstract-x2text-service         |     return self._sock.recv_into(b)
unstract-x2text-service         |   File "/app/.venv/lib/python3.9/site-packages/gunicorn/workers/base.py", line 204, in handle_abort
unstract-x2text-service         |     sys.exit(1)
unstract-x2text-service         | SystemExit: 1
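
The `WORKER TIMEOUT` line comes from gunicorn, which kills a sync worker that has been silent for longer than its `timeout` setting (30 seconds by default). The traceback shows the worker blocked in `requests.request(...)`, waiting for a response from the text extractor. One possible workaround, sketched below, is to raise that timeout. This assumes the x2text service can be started with a gunicorn config file; the file name, the `X2TEXT_GUNICORN_TIMEOUT` environment variable, and the 900-second value are illustrative assumptions, not settings documented by Unstract.

```python
# gunicorn.conf.py (hypothetical file, not taken from the Unstract repository)
import os

# gunicorn restarts a worker that has been silent for more than `timeout` seconds.
# The default is 30, which is likely too short for OCR on a large scanned PDF.
# X2TEXT_GUNICORN_TIMEOUT is an assumed variable name used only for this sketch.
timeout = int(os.environ.get("X2TEXT_GUNICORN_TIMEOUT", "900"))

# Seconds a worker is given to finish in-flight requests after a restart signal.
graceful_timeout = 30

# More than one worker keeps the service responsive while one request is slow.
workers = 2
```

Raising the timeout only hides the latency; the OCR step itself stays as slow as before.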

To reproduce

LLM profile:

  • Name: ollama-deepseek-r1
  • LLM: ollama-deepseek-r1
  • Embedding Model: ollama-emb-deepseek-r1
  • Vector Database: pg-vdb-1
  • Text Extractor: unstructured-io-1

Expected behavior

Indexing completes successfully.

Environment details

  • Version: latest, with the optional profile

Additional context

Question

Is there a way to replace the old Tesseract, which is not GPU-accelerated, with the llama 3.2 vision model?
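
To make the question concrete, here is a minimal sketch of what transcribing one scanned page with a vision model served by Ollama could look like. It is not an Unstract adapter and does not reflect how Unstract wires in text extractors. It uses Ollama's `/api/generate` endpoint with a base64-encoded image. The endpoint URL, the prompt, the timeout, and the helper names `build_ocr_request` and `ocr_page` are assumptions made for illustration.

```python
import base64
import json
import urllib.request

# Assumption: Ollama is running locally on its default port.
OLLAMA_URL = "http://localhost:11434/api/generate"


def build_ocr_request(image_bytes: bytes, model: str = "llama3.2-vision") -> dict:
    """Build a non-streaming Ollama generate payload for one page image."""
    return {
        "model": model,
        "prompt": "Transcribe all text in this scanned page. Output only the text.",
        # Ollama expects images as base64-encoded strings.
        "images": [base64.b64encode(image_bytes).decode("ascii")],
        "stream": False,
    }


def ocr_page(image_bytes: bytes, timeout_s: float = 600.0) -> str:
    """Send one page image to Ollama and return the model's text response."""
    body = json.dumps(build_ocr_request(image_bytes)).encode("utf-8")
    req = urllib.request.Request(
        OLLAMA_URL, data=body, headers={"Content-Type": "application/json"}
    )
    with urllib.request.urlopen(req, timeout=timeout_s) as resp:
        return json.loads(resp.read())["response"]
```

A caller would render each PDF page to an image first and pass the bytes to `ocr_page`. Whether this is faster or more accurate than Tesseract on a given document would need to be measured.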

gaetanquentin added the bug (Something isn't working) label on Mar 17, 2025