Description
I often see the following error when using KubernetesJobTask:
[28:luigi-interface:224@18:33] [ERROR] [pid 28] Worker Worker(salt=476402868, workers=20, host=something-host-1541289600-zntd7, username=root, pid=1) failed Something(local_execution=False)
Traceback (most recent call last):
File "/opt/conda/lib/python3.6/site-packages/luigi/worker.py", line 205, in run
new_deps = self._run_get_new_deps()
File "/opt/conda/lib/python3.6/site-packages/luigi/worker.py", line 142, in _run_get_new_deps
task_gen = self.task.run()
File "/usr/src/app/phpipeline/luigi_util.py", line 231, in run
super(PHKubernetesJobTask, self).run()
File "/opt/conda/lib/python3.6/site-packages/luigi/contrib/kubernetes.py", line 355, in run
self.__track_job()
File "/opt/conda/lib/python3.6/site-packages/luigi/contrib/kubernetes.py", line 199, in __track_job
while not self.__verify_job_has_started():
File "/opt/conda/lib/python3.6/site-packages/luigi/contrib/kubernetes.py", line 261, in __verify_job_has_started
assert len(pods) > 0, "No pod scheduled by " + self.uu_name
AssertionError: No pod scheduled by something-20181104183346-b51d371a5bfd4197
[1:luigi-interface:570@18:33] [INFO] Informed scheduler that task Something_42b6a6d55a has status FAILED
It is hard to reproduce, but it seems that sometimes the pod needs a bit more time to be created; the task does not wait for it and raises the error above. The task is marked FAILED, but the pod is still created and keeps running outside Luigi's control.
When the task is retried and succeeds, its pod runs under Luigi's control, but the uncontrolled pod from the failed attempt is still there, so we end up with two pods doing the same work.
I've managed to fix it in the most naive way. When getting pods to verify that they have started, instead of:
pods = self.__get_pods()
I do the following:
from time import sleep

for _ in range(3):
    pods = self._KubernetesJobTask__get_pods()
    if pods:
        break
    # No pod returned yet: sleep to give it time to be created, then retry.
    sleep(15)
This is not a beautiful way of fixing the issue, but it works (though it will certainly fail if the pod needs more than 45 seconds to be scheduled).
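For a slightly longer-lived workaround than editing the loop inline, the same retry could live in a subclass that runs before Luigi's own check. This is only a sketch, assuming the private hooks visible in the traceback (`__verify_job_has_started`, `__get_pods`) keep their current name-mangled names; `PatientKubernetesJobTask`, `max_pod_checks` and `pod_check_interval` are made-up names, and a concrete task would still provide its usual job definition on top of it:

```python
from time import sleep

from luigi.contrib.kubernetes import KubernetesJobTask


class PatientKubernetesJobTask(KubernetesJobTask):
    """Retries the pod lookup before delegating to Luigi's own startup check."""

    # Illustrative knobs, not Luigi settings.
    max_pod_checks = 3
    pod_check_interval = 15  # seconds between pod lookups

    def _KubernetesJobTask__verify_job_has_started(self):
        # Give the scheduler a few chances to create the pod before the
        # upstream check runs its "No pod scheduled by ..." assertion.
        for _ in range(self.max_pod_checks):
            if self._KubernetesJobTask__get_pods():
                break
            sleep(self.pod_check_interval)
        return super()._KubernetesJobTask__verify_job_has_started()
```

This keeps the retry out of the library file while reusing the same name-mangled accessor as the snippet above, but it has the same limitation: the attempt count and sleep interval would still need tuning for slow clusters.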