_jvm_initializer_task silently aborts on errors · Issue #14910 · hail-is/hail · GitHub
_jvm_initializer_task silently aborts on errors #14910
Closed
@jmarshall

Description


What happened?

We have also experienced the "connecting to JVM" problem, in which jobs stall, described in this Zulip thread, and have investigated it in some depth.

In our observations, jobs were stalled at this line in Worker.borrow_jvm(), waiting to get an item from a queue that will never become non-empty:

return await jvmpool.borrow_jvm()

…which in turn waits on…

return await self.queue.get()

No item will ever appear on the queue because the background task _jvm_initializer_task has raised an exception and exited. This background task is the only thing that adds items to the queue (there is also return_broken_jvm() but we have not seen that being used in our logs).
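The stall can be reproduced in isolation with plain asyncio. The following is a minimal sketch, not Hail's code: `initializer` and `demo` are hypothetical stand-ins for `_jvm_initializer_task` and a borrowing worker, and the timeout exists only so that the demonstration terminates.

```python
import asyncio


async def initializer(queue: asyncio.Queue, max_items: int) -> None:
    # Hypothetical stand-in for _jvm_initializer_task: the only producer.
    created = 0
    while True:
        # Fails once max_items items exist, which kills the task silently.
        assert created < max_items
        await queue.put(f"jvm-{created}")
        created += 1


async def demo() -> tuple[list[str], bool, bool]:
    queue: asyncio.Queue = asyncio.Queue()
    task = asyncio.create_task(initializer(queue, max_items=2))

    # The first two borrows succeed.
    borrowed = [await queue.get(), await queue.get()]

    try:
        # The third borrow can never succeed: the producer is already dead.
        # Without the timeout this await would hang forever.
        await asyncio.wait_for(queue.get(), timeout=0.1)
        stalled = False
    except asyncio.TimeoutError:
        stalled = True

    # The failure is only visible if someone inspects the task.
    producer_died = task.done() and isinstance(task.exception(), AssertionError)
    return borrowed, stalled, producer_died


if __name__ == "__main__":
    print(asyncio.run(demo()))
```

Nothing in the consumer path looks at the task object, so from the consumer's point of view a dead producer is indistinguishable from a slow one.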

In our case, the task was terminating abnormally due to a failing assertion, one or the other of these two:

assert self.queue.qsize() < self.max_jvms
assert self.total_jvms_including_borrowed < self.max_jvms

This in turn was because the task was still creating new JVMs even when max_jvms JVMs had already been created. We believe this is fixed by #14909.

However, there are additional bugs in this background task that are not addressed by that PR:

  • If _jvm_initializer_task crashes, nothing is logged and the main code never notices. This causes jobs to stall while trying to borrow a JVM. Exceptions encountered should be logged, and the main code should check that the background task completed successfully.

  • JVM.create_container_and_connect() takes pains to raise a JVMCreationError if it fails. There is also error handling code that will cause a job attempt to fail if it receives a JVMCreationError. However, JVMCreationError is never propagated from _jvm_initializer_task back to the worker that was trying to acquire that JVM.
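One possible shape for addressing both points is sketched below. It is an illustration under stated assumptions, not a proposed patch: `PoolSketch`, `borrow`, `_on_done`, and the `create_jvm` callable are hypothetical names, and `JVMCreationError` here is a local stand-in for the real class. Creation failures are placed on the queue so that a waiting borrower receives them, and a done-callback logs any unexpected crash and wakes borrowers instead of leaving them blocked.

```python
import asyncio
import logging

log = logging.getLogger("jvm_pool_sketch")


class JVMCreationError(Exception):
    """Local stand-in for the real exception of the same name."""


class PoolSketch:
    def __init__(self, create_jvm, count: int):
        # Must be constructed inside a running event loop.
        self.queue: asyncio.Queue = asyncio.Queue()
        self._create_jvm = create_jvm
        self._crash = None
        self._task = asyncio.create_task(self._initializer(count))
        self._task.add_done_callback(self._on_done)

    async def _initializer(self, count: int) -> None:
        for _ in range(count):
            try:
                jvm = await self._create_jvm()
            except JVMCreationError as err:
                # Expected failure: log it and hand it to one borrower.
                log.error("JVM creation failed", exc_info=err)
                self.queue.put_nowait(err)
            else:
                self.queue.put_nowait(jvm)

    def _on_done(self, task: asyncio.Task) -> None:
        if task.cancelled():
            return
        exc = task.exception()
        if exc is not None:
            # Unexpected crash: log it and wake any blocked borrower.
            log.error("initializer task crashed", exc_info=exc)
            self._crash = exc
            self.queue.put_nowait(exc)

    async def borrow(self):
        item = await self.queue.get()
        if isinstance(item, Exception):
            if item is self._crash:
                # Keep the crash visible to every later borrower.
                self.queue.put_nowait(item)
            raise item
        return item


async def demo() -> list[str]:
    outcomes = iter(["jvm-0", JVMCreationError("container failed"), RuntimeError("bug")])

    async def create_jvm():
        outcome = next(outcomes)
        if isinstance(outcome, Exception):
            raise outcome
        return outcome

    pool = PoolSketch(create_jvm, count=3)
    results = []
    for _ in range(4):
        try:
            results.append(await asyncio.wait_for(pool.borrow(), timeout=1))
        except Exception as err:
            results.append(type(err).__name__)
    return results


if __name__ == "__main__":
    print(asyncio.run(demo()))
```

In this sketch a borrower either gets a JVM or an exception; it never waits on a queue that has lost its producer. Whether a crashed initializer should instead be restarted is a separate design question that the sketch does not address.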

Version

0.2.134
