Batch services fails to mark job started when the batch has more than one failed job group · Issue #14864 · hail-is/hail · GitHub
SELECT (jobs.cancelled OR job_groups_cancelled.id IS NOT NULL) AND NOT jobs.always_run
INTO cur_job_cancel
FROM jobs
LEFT JOIN job_groups_cancelled ON job_groups_cancelled.id = jobs.batch_id
WHERE batch_id = in_batch_id AND job_id = in_job_id
LOCK IN SHARE MODE;
This has started happening more regularly after 61955df, which added functionality to cancel the query's stage (job group) if one partition (job therein) failed.
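The row fan-out behind the 1172 error can be reproduced outside MySQL. Below is a minimal sketch using Python's built-in sqlite3 module. The two-table schema is a simplified assumption (in particular the job_group_id column and the sample rows are hypothetical), and the EXISTS variant is only one possible way to get a single row; it is not necessarily the fix adopted in hail, and it keeps the original batch-wide cancellation check as-is.

```python
import sqlite3

# Simplified, assumed schema: only the columns needed to show the fan-out.
conn = sqlite3.connect(":memory:")
conn.executescript("""
CREATE TABLE jobs (
    batch_id INTEGER, job_id INTEGER,
    cancelled INTEGER, always_run INTEGER
);
-- job_group_id is a hypothetical column for this illustration
CREATE TABLE job_groups_cancelled (id INTEGER, job_group_id INTEGER);
INSERT INTO jobs VALUES (1, 1, 0, 0);
-- two cancelled job groups in the same batch
INSERT INTO job_groups_cancelled VALUES (1, 2), (1, 3);
""")

# Same shape as the SELECT in mark_job_started, minus INTO and the lock clause.
# The LEFT JOIN matches one row per cancelled job group in the batch.
join_rows = conn.execute("""
SELECT (jobs.cancelled OR job_groups_cancelled.id IS NOT NULL) AND NOT jobs.always_run
FROM jobs
LEFT JOIN job_groups_cancelled ON job_groups_cancelled.id = jobs.batch_id
WHERE batch_id = ? AND job_id = ?
""", (1, 1)).fetchall()

# Possible alternative: EXISTS yields at most one row per job,
# no matter how many job groups in the batch are cancelled.
exists_rows = conn.execute("""
SELECT (jobs.cancelled OR EXISTS (
    SELECT 1 FROM job_groups_cancelled
    WHERE job_groups_cancelled.id = jobs.batch_id
)) AND NOT jobs.always_run
FROM jobs
WHERE batch_id = ? AND job_id = ?
""", (1, 1)).fetchall()

print(len(join_rows), len(exists_rows))
```

In MySQL, SELECT ... INTO a single variable requires the query to return at most one row, so the join form raises error 1172 as soon as a second cancelled job group exists for the batch.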
Version
0.2.134
Relevant log output
Traceback (most recent call last):
File "/usr/local/lib/python3.9/dist-packages/aiohttp/web_protocol.py", line 480, in _handle_request
resp = await request_handler(request)
File "/usr/local/lib/python3.9/dist-packages/aiohttp/web_app.py", line 569, in _handle
return await handler(request)
File "/usr/local/lib/python3.9/dist-packages/aiohttp/web_middlewares.py", line 117, in impl
return await handler(request)
File "/usr/local/lib/python3.9/dist-packages/gear/csrf.py", line 27, in check_csrf_token
return await handler(request)
File "/usr/local/lib/python3.9/dist-packages/gear/metrics.py", line 28, in monitor_endpoints_middleware
response = await prom_async_time(REQUEST_TIME.labels(endpoint=endpoint, verb=verb), handler(request)) # type: ignore
File "/usr/local/lib/python3.9/dist-packages/prometheus_async/aio/_decorators.py", line 55, in measure
rv = await future
File "/usr/local/lib/python3.9/dist-packages/aiohttp_session/__init__.py", line 191, in factory
response = await handler(request)
File "/usr/local/lib/python3.9/dist-packages/batch/driver/main.py", line 190, in wrapped
return await fun(request, instance)
File "/usr/local/lib/python3.9/dist-packages/batch/utils.py", line 45, in wrapped
return await fun(request, *args, **kwargs)
File "/usr/local/lib/python3.9/dist-packages/batch/driver/main.py", line 399, in job_started
return await asyncio.shield(job_started_1(request, instance))
File "/usr/local/lib/python3.9/dist-packages/batch/driver/main.py", line 388, in job_started_1
await mark_job_started(request.app, batch_id, job_id, attempt_id, instance, start_time, resources)
File "/usr/local/lib/python3.9/dist-packages/batch/driver/job.py", line 295, in mark_job_started
rv = await db.execute_and_fetchone(
File "/usr/local/lib/python3.9/dist-packages/gear/database.py", line 55, in wrapper
return await f(*args, **kwargs)
File "/usr/local/lib/python3.9/dist-packages/gear/database.py", line 325, in execute_and_fetchone
return await tx.execute_and_fetchone(sql, args, query_name)
File "/usr/local/lib/python3.9/dist-packages/gear/database.py", line 247, in execute_and_fetchone
await cursor.execute(sql, args)
File "/usr/local/lib/python3.9/dist-packages/aiomysql/cursors.py", line 239, in execute
await self._query(query)
File "/usr/local/lib/python3.9/dist-packages/aiomysql/cursors.py", line 457, in _query
await conn.query(q)
File "/usr/local/lib/python3.9/dist-packages/aiomysql/connection.py", line 469, in query
await self._read_query_result(unbuffered=unbuffered)
File "/usr/local/lib/python3.9/dist-packages/aiomysql/connection.py", line 683, in _read_query_result
await result.read()
File "/usr/local/lib/python3.9/dist-packages/aiomysql/connection.py", line 1164, in read
first_packet = await self.connection._read_packet()
File "/usr/local/lib/python3.9/dist-packages/aiomysql/connection.py", line 652, in _read_packet
packet.raise_for_error()
File "/usr/local/lib/python3.9/dist-packages/pymysql/protocol.py", line 219, in raise_for_error
err.raise_mysql_exception(self._data)
File "/usr/local/lib/python3.9/dist-packages/pymysql/err.py", line 150, in raise_mysql_exception
raise errorclass(errno, errval)
pymysql.err.OperationalError: (1172, 'Result consisted of more than one row')
What happened?
The SQL procedure mark_job_started fails if a batch has more than one cancelled job group. The relevant statement is at hail/batch/sql/rename-job-groups-tables.sql, lines 717 to 722 in 8fe952b.