_jvm_initializer_task silently aborts on errors

What happened?

We have also experienced the "connecting to JVM" problem that stalls jobs, as described in this Zulip thread, and have investigated it in some depth.

In our observations, jobs were stalled in this line in Worker.borrow_jvm(), waiting to get an item from a queue that will never become non-empty:

return await jvmpool.borrow_jvm()

…which in turn waits on…

return await self.queue.get()

No item will ever appear on the queue because the background task _jvm_initializer_task has raised an exception and exited. This background task is the only thing that adds items to the queue (there is also return_broken_jvm() but we have not seen that being used in our logs).
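The stall can be reproduced in isolation with plain asyncio. The following is a minimal sketch, not the Hail code: initializer and demo are hypothetical names, and the timeout exists only so the example terminates (the real borrow_jvm() has none, so it waits forever).

```python
import asyncio


async def initializer(queue: asyncio.Queue) -> None:
    # Hypothetical stand-in for _jvm_initializer_task: it fails before
    # ever putting an item on the queue.
    raise AssertionError("simulated failed assertion")


async def demo() -> str:
    queue: asyncio.Queue = asyncio.Queue()
    task = asyncio.create_task(initializer(queue))
    try:
        # Without this timeout the await would never return, because the
        # only producer has already exited.
        await asyncio.wait_for(queue.get(), timeout=0.1)
        return "got item"
    except asyncio.TimeoutError:
        # The task is finished and holds an exception that nothing in the
        # consumer path ever looked at.
        return f"stalled; done={task.done()}, exception={task.exception()!r}"


if __name__ == "__main__":
    print(asyncio.run(demo()))
```

The consumer has no way of telling "no JVM is ready yet" apart from "no JVM will ever be ready" unless it also inspects the task.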

In our case, the task was terminating abnormally because one of these two assertions failed:

assert self.queue.qsize() < self.max_jvms
assert self.total_jvms_including_borrowed < self.max_jvms

These assertions failed because the task was still creating new JVMs even after max_jvms JVMs had already been created. We believe this is fixed by #14909.

However, there are additional bugs in this background task that are not addressed by that PR:

If _jvm_initializer_task crashes, nothing is logged and the main code never notices. This causes jobs to stall while trying to borrow a JVM. Exceptions encountered should be logged, and the main code should check that the background task completed successfully.
JVM.create_container_and_connect() takes pains to raise a JVMCreationError if it fails. There is also error handling code that causes a job attempt to fail if it receives a JVMCreationError. However, JVMCreationError is never propagated from _jvm_initializer_task back to the worker that was trying to acquire that JVM.
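One possible shape for a fix is sketched below. This is an illustration under assumptions, not the actual Hail implementation: PoolSketch, InitializerCrashed and the create_jvm factory are hypothetical names, and JVMCreationError here is a local stand-in for the real class. The idea is that the background task hands failures to borrowers through the same queue it uses for JVMs, and logs anything unexpected before exiting.

```python
import asyncio
import logging

log = logging.getLogger("jvm_pool_sketch")


class JVMCreationError(Exception):
    """Local stand-in for the real JVMCreationError."""


class InitializerCrashed(Exception):
    """Hypothetical marker: the background task died unexpectedly."""


class PoolSketch:
    def __init__(self, create_jvm, max_jvms: int):
        # create_jvm is a hypothetical async factory; must be constructed
        # while an event loop is running.
        self.create_jvm = create_jvm
        self.max_jvms = max_jvms
        self.queue: asyncio.Queue = asyncio.Queue()
        self.task = asyncio.create_task(self._initializer())

    async def _initializer(self) -> None:
        try:
            for _ in range(self.max_jvms):
                try:
                    item = await self.create_jvm()
                except JVMCreationError as err:
                    # Hand the failure to whichever worker borrows next.
                    item = err
                await self.queue.put(item)
        except Exception as err:
            # Anything else is a bug in the task itself: log it, then
            # leave a marker so that borrowers fail instead of stalling.
            log.exception("JVM initializer task crashed")
            await self.queue.put(InitializerCrashed(repr(err)))

    async def borrow_jvm(self):
        item = await self.queue.get()
        if isinstance(item, InitializerCrashed):
            # Put the marker back so every later borrower sees it too.
            self.queue.put_nowait(item)
            raise item
        if isinstance(item, JVMCreationError):
            raise item
        return item
```

With this arrangement a worker calling borrow_jvm() either gets a JVM, gets the JVMCreationError for the JVM it was waiting on, or gets an InitializerCrashed error; in none of these cases does it wait on a queue that can never be filled. Cancellation is deliberately not caught, since asyncio.CancelledError does not derive from Exception.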

Version

0.2.134
