Description
I often see the following error when using KubernetesJobTask:
[28:luigi-interface:224@18:33] [ERROR] [pid 28] Worker Worker(salt=476402868, workers=20, host=something-host-1541289600-zntd7, username=root, pid=1) failed Something(local_execution=False)
Traceback (most recent call last):
File "/opt/conda/lib/python3.6/site-packages/luigi/worker.py", line 205, in run
new_deps = self._run_get_new_deps()
File "/opt/conda/lib/python3.6/site-packages/luigi/worker.py", line 142, in _run_get_new_deps
task_gen = self.task.run()
File "/usr/src/app/phpipeline/luigi_util.py", line 231, in run
super(PHKubernetesJobTask, self).run()
File "/opt/conda/lib/python3.6/site-packages/luigi/contrib/kubernetes.py", line 355, in run
self.__track_job()
File "/opt/conda/lib/python3.6/site-packages/luigi/contrib/kubernetes.py", line 199, in __track_job
while not self.__verify_job_has_started():
File "/opt/conda/lib/python3.6/site-packages/luigi/contrib/kubernetes.py", line 261, in __verify_job_has_started
assert len(pods) > 0, "No pod scheduled by " + self.uu_name
AssertionError: No pod scheduled by something-20181104183346-b51d371a5bfd4197
[1:luigi-interface:570@18:33] [INFO] Informed scheduler that task Something_42b6a6d55a has status FAILED
It is hard to reproduce, but it seems that sometimes the pod needs a bit more time to be created; the task does not wait for it and raises the error above. The task is marked FAILED, but the pod is still created and keeps running outside Luigi's control.
When the task is retried and succeeds, its pod runs under Luigi's control, but the uncontrolled pod from the failed attempt is still there, so we end up with two pods doing the same work.
I've managed to fix it in the most naive way. When getting pods to verify that they have started, instead of:
pods = self.__get_pods()
I do the following:
from time import sleep

for _ in range(3):
    pods = self._KubernetesJobTask__get_pods()
    if pods:
        break
    # No pod returned yet: sleep to give it time to be created, then retry.
    sleep(15)
This is not a beautiful way of fixing the issue, but it works (though it will certainly fail if the pod needs more than 45 seconds to be scheduled).
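For a slightly longer-lived workaround than editing the loop inline, the same retry could live in a subclass that runs before Luigi's own check. This is only a sketch, assuming the private hooks visible in the traceback (`__verify_job_has_started`, `__get_pods`) keep their current name-mangled names; `PatientKubernetesJobTask`, `max_pod_checks` and `pod_check_interval` are made-up names, and a concrete task would still provide its usual job definition on top of it:

```python
from time import sleep

from luigi.contrib.kubernetes import KubernetesJobTask


class PatientKubernetesJobTask(KubernetesJobTask):
    """Retries the pod lookup before delegating to Luigi's own startup check."""

    # Illustrative knobs, not Luigi settings.
    max_pod_checks = 3
    pod_check_interval = 15  # seconds between pod lookups

    def _KubernetesJobTask__verify_job_has_started(self):
        # Give the scheduler a few chances to create the pod before the
        # upstream check runs its "No pod scheduled by ..." assertion.
        for _ in range(self.max_pod_checks):
            if self._KubernetesJobTask__get_pods():
                break
            sleep(self.pod_check_interval)
        return super()._KubernetesJobTask__verify_job_has_started()
```

This keeps the retry out of the library file while reusing the same name-mangled accessor as the snippet above, but it has the same limitation: the attempt count and sleep interval would still need tuning for slow clusters.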