Description
What happened?
We have also experienced the "connecting to JVM" stalling jobs problem described in this Zulip thread and have investigated it in some depth.
In our observations, jobs were stalled in this line in Worker.borrow_jvm(), waiting to get an item from a queue that will never become non-empty:
return await jvmpool.borrow_jvm()
…which in turn waits on…
return await self.queue.get()
No item will ever appear on the queue because the background task _jvm_initializer_task has raised an exception and exited. This background task is the only thing that adds items to the queue (there is also return_broken_jvm(), but we have not seen that being used in our logs).
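The stall can be reproduced in isolation. The following is a minimal sketch, not Hail's actual code: `TinyPool`, `_initializer_task` and `borrow` are hypothetical stand-ins, and the failure is simulated. It only shows that a consumer awaiting `queue.get()` waits indefinitely once the sole producer task has died, and that the task's exception stays unnoticed unless something inspects the task.

```python
import asyncio


class TinyPool:
    """Hypothetical stand-in for the JVM pool; not Hail's actual class."""

    def __init__(self):
        self.queue = asyncio.Queue()
        # The only producer for the queue. Nothing awaits or inspects this task.
        self.initializer = asyncio.create_task(self._initializer_task())

    async def _initializer_task(self):
        # Simulates the background task dying before it enqueues anything.
        raise AssertionError('simulated failure in the initializer')

    async def borrow(self):
        # Blocks until an item is available, which never happens here.
        return await self.queue.get()


async def demo():
    pool = TinyPool()
    try:
        # A timeout is used only so that the demonstration terminates.
        await asyncio.wait_for(pool.borrow(), timeout=0.1)
        stalled = False
    except asyncio.TimeoutError:
        stalled = True
    # The task has already finished with an exception nobody looked at.
    crashed = pool.initializer.done() and isinstance(
        pool.initializer.exception(), AssertionError
    )
    return stalled, crashed


if __name__ == '__main__':
    print(asyncio.run(demo()))
```

Without the `wait_for` timeout, `borrow()` would never return, which matches the stalled jobs we observed.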
In our case, the task was abending due to a failing assertion — one or the other of these two:
assert self.queue.qsize() < self.max_jvms
assert self.total_jvms_including_borrowed < self.max_jvms
and in turn this was because the task was still creating new JVMs even when max_jvms JVMs had already been created. We believe this is fixed by #14909.
However, there are additional bugs in this background task that are not addressed by that PR:
- If _jvm_initializer_task crashes, nothing is logged and the main code never notices. This causes jobs to stall while trying to borrow a JVM. Exceptions encountered should be logged, and the main code should check that the background task completed successfully.
- JVM.create_container_and_connect() takes pains to raise a JVMCreationError if it fails. There is also error handling code that will cause a job attempt to fail if it receives a JVMCreationError. However, JVMCreationError is never propagated from the _jvm_initializer_task task back to the worker that was trying to acquire that JVM.
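One possible shape for addressing both points is sketched below. This is an illustration under stated assumptions, not a patch against Hail: `PoolSketch`, `create_jvm`, `_on_done` and the local `JVMCreationError` class are hypothetical stand-ins, and the real pool has more state than shown. The sketch logs any crash of the background task through a done callback, and passes creation failures through the queue so the borrower re-raises them instead of waiting.

```python
import asyncio
import logging

log = logging.getLogger('jvm_pool_sketch')


class JVMCreationError(Exception):
    """Local stand-in for the error type named in this report."""


class PoolSketch:
    def __init__(self, create_jvm, max_jvms):
        self.create_jvm = create_jvm  # hypothetical async factory
        self.max_jvms = max_jvms
        self.queue = asyncio.Queue()
        self.task = asyncio.create_task(self._initializer())
        self.task.add_done_callback(self._on_done)

    def _on_done(self, task):
        # Runs when the background task finishes for any reason.
        if task.cancelled():
            return
        exc = task.exception()
        if exc is not None:
            log.error('JVM initializer task crashed', exc_info=exc)
            # Wake one waiting borrower with the failure.
            self.queue.put_nowait(exc)

    async def _initializer(self):
        for _ in range(self.max_jvms):
            try:
                jvm = await self.create_jvm()
            except JVMCreationError as err:
                # Hand the failure to whoever borrows next.
                await self.queue.put(err)
            else:
                await self.queue.put(jvm)

    async def borrow_jvm(self):
        item = await self.queue.get()
        if isinstance(item, BaseException):
            raise item
        return item


async def demo():
    calls = 0

    async def flaky_create():
        nonlocal calls
        calls += 1
        if calls == 2:
            raise JVMCreationError('container failed to start')
        return f'jvm-{calls}'

    pool = PoolSketch(flaky_create, max_jvms=2)
    first = await pool.borrow_jvm()
    try:
        await pool.borrow_jvm()
        second = 'no error'
    except JVMCreationError as err:
        second = str(err)
    return first, second


if __name__ == '__main__':
    print(asyncio.run(demo()))
```

In this sketch a crash wakes only one waiting borrower; a fuller version would need to fail every waiter and every later call to `borrow_jvm()`, or restart the background task.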
Version
0.2.134