fix: [ISSUE] indexing pdf with scans inside failed with timeout, tesseract vs llama3.2-vision? #1194

gaetanquentin · 2025-03-17T10:49:16Z

Describe the bug

Uploading and indexing a large PDF that contains scanned pages fails: Tesseract is used for OCR, but it is too slow and the request times out.

Tesseract is still running when the extractor hits the timeout:

unstract-backend                | 172.28.0.1 - - [17/Mar/2025:09:57:30 +0000] "GET /api/v1/socket/?EIO=4&transport=websocket HTTP/1.1" 400 25 "-" "Mozilla/5.0 (X11; Linux x86_64) AppleWebKit/537.36 (KHTML, like Gecko) Chrome/133.0.0.0 Safari/537.36"
unstract-x2text-service         | [2025-03-17 09:57:30 +0000] [1] [CRITICAL] WORKER TIMEOUT (pid:7)
unstract-x2text-service         | [2025-03-17 09:57:30 +0000] [7] [ERROR] Error handling request /api/v1/x2text/process
unstract-x2text-service         | Traceback (most recent call last):
unstract-x2text-service         |   File "/app/.venv/lib/python3.9/site-packages/gunicorn/workers/sync.py", line 134, in handle
unstract-x2text-service         |     self.handle_request(listener, req, client, addr)
unstract-x2text-service         |   File "/app/.venv/lib/python3.9/site-packages/gunicorn/workers/sync.py", line 177, in handle_request
unstract-x2text-service         |     respiter = self.wsgi(environ, resp.start_response)
unstract-x2text-service         |   File "/app/.venv/lib/python3.9/site-packages/flask/app.py", line 1498, in __call__
unstract-x2text-service         |     return self.wsgi_app(environ, start_response)
unstract-x2text-service         |   File "/app/.venv/lib/python3.9/site-packages/flask/app.py", line 1473, in wsgi_app
unstract-x2text-service         |     response = self.full_dispatch_request()
unstract-x2text-service         |   File "/app/.venv/lib/python3.9/site-packages/flask/app.py", line 880, in full_dispatch_request
unstract-x2text-service         |     rv = self.dispatch_request()
unstract-x2text-service         |   File "/app/.venv/lib/python3.9/site-packages/flask/app.py", line 865, in dispatch_request
unstract-x2text-service         |     return self.ensure_sync(self.view_functions[rule.endpoint])(**view_args)  # type: ignore[no-any-return]
unstract-x2text-service         |   File "/app/app/authentication_middleware.py", line 16, in wrapper
unstract-x2text-service         |     return func(*args, **kwargs)
unstract-x2text-service         |   File "/app/app/controllers/controller.py", line 120, in process
unstract-x2text-service         |     response = requests.request(
unstract-x2text-service         |   File "/app/.venv/lib/python3.9/site-packages/requests/api.py", line 59, in request
unstract-x2text-service         |     return session.request(method=method, url=url, **kwargs)
unstract-x2text-service         |   File "/app/.venv/lib/python3.9/site-packages/requests/sessions.py", line 589, in request
unstract-x2text-service         |     resp = self.send(prep, **send_kwargs)
unstract-x2text-service         |   File "/app/.venv/lib/python3.9/site-packages/requests/sessions.py", line 703, in send
unstract-x2text-service         |     r = adapter.send(request, **kwargs)
unstract-x2text-service         |   File "/app/.venv/lib/python3.9/site-packages/requests/adapters.py", line 667, in send
unstract-x2text-service         |     resp = conn.urlopen(
unstract-x2text-service         |   File "/app/.venv/lib/python3.9/site-packages/urllib3/connectionpool.py", line 789, in urlopen
unstract-x2text-service         |     response = self._make_request(
unstract-x2text-service         |   File "/app/.venv/lib/python3.9/site-packages/urllib3/connectionpool.py", line 536, in _make_request
unstract-x2text-service         |     response = conn.getresponse()
unstract-x2text-service         |   File "/app/.venv/lib/python3.9/site-packages/urllib3/connection.py", line 464, in getresponse
unstract-x2text-service         |     httplib_response = super().getresponse()
unstract-x2text-service         |   File "/usr/local/lib/python3.9/http/client.py", line 1377, in getresponse
unstract-x2text-service         |     response.begin()
unstract-x2text-service         |   File "/usr/local/lib/python3.9/http/client.py", line 320, in begin
unstract-x2text-service         |     version, status, reason = self._read_status()
unstract-x2text-service         |   File "/usr/local/lib/python3.9/http/client.py", line 281, in _read_status
unstract-x2text-service         |     line = str(self.fp.readline(_MAXLINE + 1), "iso-8859-1")
unstract-x2text-service         |   File "/usr/local/lib/python3.9/socket.py", line 716, in readinto
unstract-x2text-service         |     return self._sock.recv_into(b)
unstract-x2text-service         |   File "/app/.venv/lib/python3.9/site-packages/gunicorn/workers/base.py", line 204, in handle_abort
unstract-x2text-service         |     sys.exit(1)
unstract-x2text-service         | SystemExit: 1
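
The traceback shows the gunicorn sync worker being aborted (`handle_abort` then `sys.exit(1)`) while it is blocked in `requests.request`, waiting for the downstream extractor to answer. Gunicorn kills workers that stay silent longer than its `timeout` setting, which is 30 seconds unless the service overrides it. One possible workaround is a longer worker timeout. The following is a minimal sketch of a gunicorn config file, assuming the x2text-service container can be started with a custom config (not confirmed in this issue); `X2TEXT_GUNICORN_TIMEOUT` is a hypothetical variable name, not an Unstract setting.

```python
# gunicorn.conf.py: sketch only. Whether the x2text-service image
# loads a file like this is an assumption, not something the logs confirm.
import os

# Hypothetical environment variable used for illustration.
_TIMEOUT_ENV = "X2TEXT_GUNICORN_TIMEOUT"


def worker_timeout(default: int = 900) -> int:
    """Return the worker timeout in seconds, falling back to `default`
    when the variable is unset, non-numeric, or not positive."""
    raw = os.environ.get(_TIMEOUT_ENV, "")
    return int(raw) if raw.isdigit() and int(raw) > 0 else default


# Real gunicorn setting: workers silent for more than this many
# seconds are killed and restarted.
timeout = worker_timeout()
```

Raising the timeout only hides the slowness of the OCR step; it does not make Tesseract faster.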

To reproduce

LLM profile:
Name: ollama-deepseek-r1
LLM: ollama-deepseek-r1
Embedding Model: ollama-emb-deepseek-r1
Vector Database: pg-vdb-1
Text Extractor: unstructured-io-1

Expected behavior

Indexing completes successfully.

Environment details

Version: latest, with the optional profile

Additional context

Question

Is there a way to replace the old Tesseract OCR, which is not GPU-accelerated, with the llama3.2-vision model?
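
To make the question concrete, here is a rough sketch of what OCR through a vision model could look like when calling Ollama's `/api/generate` endpoint directly with a base64-encoded page image. This is not an Unstract text extractor adapter, and whether Unstract can plug in something like it is exactly what is being asked. The endpoint URL, the prompt, and the helper names `build_ocr_payload` and `ocr_page` are assumptions for illustration; rendering PDF pages to images is left out.

```python
import base64
import json
import urllib.request

# Assumed local Ollama endpoint; adjust to the actual deployment.
OLLAMA_URL = "http://localhost:11434/api/generate"


def build_ocr_payload(image_bytes: bytes, model: str = "llama3.2-vision") -> dict:
    """Build a non-streaming /api/generate request body for one page image."""
    return {
        "model": model,
        "prompt": "Transcribe all text in this scanned page. Output only the text.",
        # Ollama expects images as base64-encoded strings.
        "images": [base64.b64encode(image_bytes).decode("ascii")],
        "stream": False,
    }


def ocr_page(image_bytes: bytes, timeout_s: float = 300.0) -> str:
    """Send one page image to Ollama and return the model's text response."""
    body = json.dumps(build_ocr_payload(image_bytes)).encode("utf-8")
    req = urllib.request.Request(
        OLLAMA_URL,
        data=body,
        headers={"Content-Type": "application/json"},
    )
    with urllib.request.urlopen(req, timeout=timeout_s) as resp:
        return json.loads(resp.read())["response"]
```

Calling `ocr_page` once per page keeps each request short, which would also avoid a single long-running call like the one that timed out above.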


gaetanquentin added the bug label on Mar 17, 2025
