Batch services fails to mark job started when the batch has more than one failed job group · Issue #14864 · hail-is/hail

Batch services fails to mark job started when the batch has more than one failed job group #14864


Open
ehigham opened this issue Apr 28, 2025 · 0 comments
ehigham commented Apr 28, 2025

What happened?

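The SQL procedure mark_job_started fails if a batch has more than one cancelled job group. The query below joins job_groups_cancelled on the batch id alone, so it appears to return one row per cancelled job group in the batch, and SELECT ... INTO rejects a multi-row result (MySQL error 1172, as in the log output).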

SELECT (jobs.cancelled OR job_groups_cancelled.id IS NOT NULL) AND NOT jobs.always_run
INTO cur_job_cancel
FROM jobs
LEFT JOIN job_groups_cancelled ON job_groups_cancelled.id = jobs.batch_id
WHERE batch_id = in_batch_id AND job_id = in_job_id
LOCK IN SHARE MODE;

This has been happening more regularly since 61955df, which added functionality to cancel a query's stage (job group) if one of its partitions (a job in that group) fails.
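The row multiplication described above can be reproduced in isolation. The following is a minimal sketch, not the Hail schema or the maintainers' fix: it uses an in-memory SQLite database instead of MySQL, the table definitions are simplified assumptions (in particular the job_group_id column and the key on job_groups_cancelled are guesses), and the helpers build_db and fetch_rows are hypothetical names. It shows the original join returning one row per cancelled job group, and one possible rewrite with EXISTS that always returns a single row.

```python
import sqlite3

# Assumed, simplified schema: only the columns the query touches.
SCHEMA = """
CREATE TABLE jobs (
    batch_id   INTEGER NOT NULL,
    job_id     INTEGER NOT NULL,
    cancelled  INTEGER NOT NULL DEFAULT 0,
    always_run INTEGER NOT NULL DEFAULT 0,
    PRIMARY KEY (batch_id, job_id)
);
CREATE TABLE job_groups_cancelled (
    id           INTEGER NOT NULL,  -- batch id (as implied by the join in the issue)
    job_group_id INTEGER NOT NULL,  -- assumption: one row per cancelled job group
    PRIMARY KEY (id, job_group_id)
);
"""

# The query from the issue, minus INTO and LOCK IN SHARE MODE (MySQL-only syntax).
JOIN_QUERY = """
SELECT (jobs.cancelled OR job_groups_cancelled.id IS NOT NULL) AND NOT jobs.always_run
FROM jobs
LEFT JOIN job_groups_cancelled ON job_groups_cancelled.id = jobs.batch_id
WHERE batch_id = ? AND job_id = ?
"""

# One possible single-row rewrite: test for existence instead of joining.
EXISTS_QUERY = """
SELECT (jobs.cancelled OR EXISTS (
    SELECT 1 FROM job_groups_cancelled
    WHERE job_groups_cancelled.id = jobs.batch_id
)) AND NOT jobs.always_run
FROM jobs
WHERE batch_id = ? AND job_id = ?
"""


def build_db(cancelled_job_groups):
    """Create batch 1 with job 1 and the given cancelled job group ids."""
    conn = sqlite3.connect(":memory:")
    conn.executescript(SCHEMA)
    conn.execute("INSERT INTO jobs (batch_id, job_id) VALUES (1, 1)")
    conn.executemany(
        "INSERT INTO job_groups_cancelled (id, job_group_id) VALUES (1, ?)",
        [(group_id,) for group_id in cancelled_job_groups],
    )
    return conn


def fetch_rows(conn, query):
    """Run the query for batch 1, job 1 and return every row."""
    return conn.execute(query, (1, 1)).fetchall()


if __name__ == "__main__":
    conn = build_db([1, 2])
    print("join rows:  ", fetch_rows(conn, JOIN_QUERY))
    print("exists rows:", fetch_rows(conn, EXISTS_QUERY))
```

The EXISTS form keeps the batch-wide predicate of the original query and only removes the duplicate rows. Whether the check should instead be limited to the job group that contains the job is a separate question that this sketch does not address.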

Version

0.2.134

Relevant log output

Traceback (most recent call last):
  File "/usr/local/lib/python3.9/dist-packages/aiohttp/web_protocol.py", line 480, in _handle_request
    resp = await request_handler(request)
  File "/usr/local/lib/python3.9/dist-packages/aiohttp/web_app.py", line 569, in _handle
    return await handler(request)
  File "/usr/local/lib/python3.9/dist-packages/aiohttp/web_middlewares.py", line 117, in impl
    return await handler(request)
  File "/usr/local/lib/python3.9/dist-packages/gear/csrf.py", line 27, in check_csrf_token
    return await handler(request)
  File "/usr/local/lib/python3.9/dist-packages/gear/metrics.py", line 28, in monitor_endpoints_middleware
    response = await prom_async_time(REQUEST_TIME.labels(endpoint=endpoint, verb=verb), handler(request))  # type: ignore
  File "/usr/local/lib/python3.9/dist-packages/prometheus_async/aio/_decorators.py", line 55, in measure
    rv = await future
  File "/usr/local/lib/python3.9/dist-packages/aiohttp_session/__init__.py", line 191, in factory
    response = await handler(request)
  File "/usr/local/lib/python3.9/dist-packages/batch/driver/main.py", line 190, in wrapped
    return await fun(request, instance)
  File "/usr/local/lib/python3.9/dist-packages/batch/utils.py", line 45, in wrapped
    return await fun(request, *args, **kwargs)
  File "/usr/local/lib/python3.9/dist-packages/batch/driver/main.py", line 399, in job_started
    return await asyncio.shield(job_started_1(request, instance))
  File "/usr/local/lib/python3.9/dist-packages/batch/driver/main.py", line 388, in job_started_1
    await mark_job_started(request.app, batch_id, job_id, attempt_id, instance, start_time, resources)
  File "/usr/local/lib/python3.9/dist-packages/batch/driver/job.py", line 295, in mark_job_started
    rv = await db.execute_and_fetchone(
  File "/usr/local/lib/python3.9/dist-packages/gear/database.py", line 55, in wrapper
    return await f(*args, **kwargs)
  File "/usr/local/lib/python3.9/dist-packages/gear/database.py", line 325, in execute_and_fetchone
    return await tx.execute_and_fetchone(sql, args, query_name)
  File "/usr/local/lib/python3.9/dist-packages/gear/database.py", line 247, in execute_and_fetchone
    await cursor.execute(sql, args)
  File "/usr/local/lib/python3.9/dist-packages/aiomysql/cursors.py", line 239, in execute
    await self._query(query)
  File "/usr/local/lib/python3.9/dist-packages/aiomysql/cursors.py", line 457, in _query
    await conn.query(q)
  File "/usr/local/lib/python3.9/dist-packages/aiomysql/connection.py", line 469, in query
    await self._read_query_result(unbuffered=unbuffered)
  File "/usr/local/lib/python3.9/dist-packages/aiomysql/connection.py", line 683, in _read_query_result
    await result.read()
  File "/usr/local/lib/python3.9/dist-packages/aiomysql/connection.py", line 1164, in read
    first_packet = await self.connection._read_packet()
  File "/usr/local/lib/python3.9/dist-packages/aiomysql/connection.py", line 652, in _read_packet
    packet.raise_for_error()
  File "/usr/local/lib/python3.9/dist-packages/pymysql/protocol.py", line 219, in raise_for_error
    err.raise_mysql_exception(self._data)
  File "/usr/local/lib/python3.9/dist-packages/pymysql/err.py", line 150, in raise_mysql_exception
    raise errorclass(errno, errval)
pymysql.err.OperationalError: (1172, 'Result consisted of more than one row')
ehigham added the needs-triage label Apr 28, 2025
ehigham self-assigned this Apr 28, 2025
ehigham added the bug and batch labels and removed the needs-triage label Apr 28, 2025