[Core] ray job submit may hang in some scenarios · Issue #54120 · ray-project/ray · GitHub
[Core] ray job submit may hang in some scenarios #54120
Open
@pxp531

Description

What happened + What you expected to happen

  • We use RayJob to run machine learning jobs on Kubernetes. A submitter pod runs ray job submit. In some cases, after a job finishes normally or fails, the dashboard already shows the job as ended or completed, but the submitter pod keeps running. As a result, the RayJob as a whole never finishes and continues to occupy resources.

  • We observed that the cause is the submitter pod not exiting. The only process inside the submitter pod is ray job submit. We have not yet found a documented way to reproduce the issue, but we captured some stack traces, for example the following ones.

[Three images attached to the original issue showing the captured stack traces; not reproduced here]
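
A possible interim workaround, sketched here under assumptions and not taken from the report itself, is to stop relying on the blocking ray job submit process and instead poll the job status with a hard deadline, so the submitter exits once the job reaches a terminal state. The helper name wait_for_terminal and its parameters are hypothetical; the terminal state names match the values of Ray's JobStatus enum.

```python
import time
from typing import Callable

# Terminal states, matching the values of Ray's JobStatus enum.
TERMINAL_STATES = {"SUCCEEDED", "FAILED", "STOPPED"}


def wait_for_terminal(
    get_status: Callable[[], object],
    timeout_s: float,
    poll_interval_s: float = 5.0,
    sleep: Callable[[float], None] = time.sleep,
) -> str:
    """Poll get_status() until it reports a terminal state.

    get_status is a hypothetical callable supplied by the caller; it may
    return a plain string or an enum whose .value is the state name.
    Raises TimeoutError if no terminal state is seen before the deadline.
    """
    deadline = time.monotonic() + timeout_s
    while True:
        raw = get_status()
        # Accept both enum members and plain strings.
        status = str(getattr(raw, "value", raw))
        if status in TERMINAL_STATES:
            return status
        if time.monotonic() >= deadline:
            raise TimeoutError(f"job still {status} after {timeout_s}s")
        sleep(poll_interval_s)
```

With Ray installed, get_status could be something like `lambda: client.get_job_status(submission_id)`, where client is a `ray.job_submission.JobSubmissionClient` pointed at the dashboard address and the job was submitted without waiting. Whether this avoids the hang depends on where the submitter actually blocks, which the stack traces above would need to confirm.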

Versions / Dependencies

Ray: 2.40
Python: 3.10

Reproduction script

No reliable reproduction scenario has been found yet.

Issue Severity

Medium: It is a significant difficulty but I can work around it.

Metadata

Assignees

No one assigned

    Labels

    bug, core, stability, triage

    Type

    No type

    Projects

    No projects

    Milestone

    No milestone

    Relationships

    None yet

    Development

    No branches or pull requests