Description
What happened + What you expected to happen
We use RayJob to run machine-learning jobs on Kubernetes. Each RayJob has a submitter pod whose main task is to run ray job submit. In some cases, after a job finishes normally or fails, the dashboard already marks the job as ended or completed, but the submitter pod keeps running. As a result, the RayJob as a whole never finishes and continues to occupy resources. We expected the submitter pod to exit once the job reaches a terminal state.
We traced this to the submitter pod not exiting. The pod contains only one process, ray job submit. We have not yet worked out how to reproduce the issue, but we have captured some stack traces of that process, for example the following ones.
Versions / Dependencies
Ray: 2.40
Python: 3.10
Reproduction script
We have not yet found a reliable way to reproduce the issue.
Issue Severity
Medium: It is a significant difficulty but I can work around it.
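One possible shape for such a workaround, sketched here under assumptions rather than taken from our deployment: a small watchdog that polls the job status and returns as soon as the job reaches a terminal state, so the submitter's exit does not depend on ray job submit returning. The helper name wait_for_terminal_status, its parameters, and the dashboard address in the comments are hypothetical; the status names are the terminal states reported by the Ray Jobs API.

```python
import time
from typing import Callable, Optional

# Terminal job states reported by the Ray Jobs API.
TERMINAL_STATUSES = {"SUCCEEDED", "FAILED", "STOPPED"}


def wait_for_terminal_status(
    get_status: Callable[[], str],
    poll_interval_s: float = 5.0,
    timeout_s: Optional[float] = None,
    sleep: Callable[[float], None] = time.sleep,
) -> str:
    """Poll get_status() until it returns a terminal status, then return it.

    Hypothetical helper: get_status is any zero-argument callable that
    returns the current job status as a string. Raises TimeoutError if
    timeout_s elapses before a terminal status is seen.
    """
    deadline = None if timeout_s is None else time.monotonic() + timeout_s
    while True:
        status = get_status()
        if status in TERMINAL_STATUSES:
            return status
        if deadline is not None and time.monotonic() >= deadline:
            raise TimeoutError(f"job still in state {status!r} after {timeout_s}s")
        sleep(poll_interval_s)


# Assumed wiring with the Ray Jobs SDK (not executed here; the address and
# job_id are placeholders):
#
#   import sys
#   from ray.job_submission import JobSubmissionClient
#
#   client = JobSubmissionClient("http://127.0.0.1:8265")
#   status = wait_for_terminal_status(
#       lambda: client.get_job_status(job_id).value
#   )
#   sys.exit(0 if status == "SUCCEEDED" else 1)
```

Injecting the status lookup as a callable keeps the polling loop testable without a running cluster. This only masks the symptom; it does not explain why ray job submit itself fails to exit.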