KubernetesJobTask "No pod scheduled" error #2570
Closed
@StasDeep

Description

I often see the following error when using KubernetesJobTask:

[28:luigi-interface:224@18:33] [ERROR] [pid 28] Worker Worker(salt=476402868, workers=20, host=something-host-1541289600-zntd7, username=root, pid=1) failed    Something(local_execution=False)
Traceback (most recent call last):
  File "/opt/conda/lib/python3.6/site-packages/luigi/worker.py", line 205, in run
    new_deps = self._run_get_new_deps()
  File "/opt/conda/lib/python3.6/site-packages/luigi/worker.py", line 142, in _run_get_new_deps
    task_gen = self.task.run()
  File "/usr/src/app/phpipeline/luigi_util.py", line 231, in run
    super(PHKubernetesJobTask, self).run()
  File "/opt/conda/lib/python3.6/site-packages/luigi/contrib/kubernetes.py", line 355, in run
    self.__track_job()
  File "/opt/conda/lib/python3.6/site-packages/luigi/contrib/kubernetes.py", line 199, in __track_job
    while not self.__verify_job_has_started():
  File "/opt/conda/lib/python3.6/site-packages/luigi/contrib/kubernetes.py", line 261, in __verify_job_has_started
    assert len(pods) > 0, "No pod scheduled by " + self.uu_name
AssertionError: No pod scheduled by something-20181104183346-b51d371a5bfd4197
[1:luigi-interface:570@18:33] [INFO] Informed scheduler that task   Something_42b6a6d55a   has status   FAILED

It is hard to reproduce, but it seems the pod sometimes needs a bit more time to be created; the task does not wait for it and ends up raising the error. The task gets FAILED status, but the pod is still created and runs outside Luigi's control.

When the task is restarted and does not fail, a pod runs under Luigi's control, but the uncontrolled pod is still there, so we end up with two pods doing the same work.
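Judging by the traceback, the check in luigi's __verify_job_has_started is a single query with no retry. The following is a rough reconstruction of the relevant lines (not the verbatim luigi source):

def __verify_job_has_started(self):
    # A single query with no retry: if the Kubernetes job controller
    # has not created the pod yet, the list is empty and the assertion
    # fires immediately.
    pods = self.__get_pods()
    assert len(pods) > 0, "No pod scheduled by " + self.uu_name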

I've managed to fix it in the most naive way. When getting the pods to verify that they have started, instead of:

pods = self.__get_pods()

I do the following:

from time import sleep

for _ in range(3):
    pods = self._KubernetesJobTask__get_pods()

    if pods:
        break

    # No pod returned yet: sleep to give Kubernetes time to create it.
    sleep(15)

This is not a beautiful way of fixing the issue, but it works (though it will surely fail if the pod needs more than 45 seconds to be created).
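A more general form of the same workaround would bound the wait by a deadline rather than a fixed retry count. Below is a minimal sketch of that idea; the _wait_for_pods helper and its timeout/interval parameters are names I made up for illustration, not anything that exists in luigi:

import time

def _wait_for_pods(get_pods, timeout=120, interval=15):
    """Poll get_pods() until it returns a non-empty list or the deadline passes.

    Returns whatever get_pods() last returned, which may still be empty
    if the timeout expired.
    """
    deadline = time.monotonic() + timeout
    pods = get_pods()
    while not pods and time.monotonic() < deadline:
        # The job controller may take a while to create the pod after the
        # Job object is accepted, so keep polling until the deadline.
        time.sleep(interval)
        pods = get_pods()
    return pods

# Usage, replacing the fixed-count loop above:
# pods = _wait_for_pods(self._KubernetesJobTask__get_pods)

With a deadline, the total wait is explicit and configurable instead of being the implicit product of the retry count and the sleep interval.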
