Description
Currently, when a job is PENDING, the JobStatus message provides no details beyond "the job may be waiting for resources" (see https://github.com/ray-project/ray/pull/28654/files/7ccabf2b606f325f1d4e793421e5651e86a9885e#r1010817973).
Ideally, the status would include both the resources the job requested and the resources currently available to it, similar to how this is done for Ray Serve replicas here:
ray/python/ray/serve/_private/deployment_state.py (line 1566 at commit 41dd69a)
We might need some kind of wrapper for the JobSupervisor actor that periodically submits updates to the JobStatus, because the available resources are expected to change (for example, as the cluster scales up and adds nodes).
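A rough sketch of what such a periodic update could look like, in plain Python. All names here (`describe_pending`, `poll_pending_status`, and the `get_available`, `publish`, and `is_pending` callbacks) are hypothetical and only illustrate the idea; they are not existing Ray job APIs, and the message wording is made up.

```python
import time
from typing import Callable, Dict


def describe_pending(requested: Dict[str, float], available: Dict[str, float]) -> str:
    """Build a PENDING message listing each resource that is short."""
    shortfalls = []
    for name, amount in sorted(requested.items()):
        have = available.get(name, 0.0)
        if have < amount:
            shortfalls.append(f"{name}: requested {amount:g}, available {have:g}")
    if not shortfalls:
        return "Job is pending; requested resources appear to be available."
    return "Job is pending; waiting for resources (" + "; ".join(shortfalls) + ")."


def poll_pending_status(
    requested: Dict[str, float],
    get_available: Callable[[], Dict[str, float]],  # hypothetical: returns current free resources
    publish: Callable[[str], None],                 # hypothetical: writes the message to the job status
    is_pending: Callable[[], bool],                 # hypothetical: True while the job is still PENDING
    interval_s: float = 5.0,
    sleep: Callable[[float], None] = time.sleep,
) -> None:
    """Refresh the status message until the job leaves PENDING."""
    while is_pending():
        publish(describe_pending(requested, get_available()))
        sleep(interval_s)
```

On a running cluster, `get_available` could plausibly be `ray.available_resources`, which returns a dict of currently free resources; how `publish` would write into the JobStatus is left open here.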
Use case
When a job with num_cpus or num_gpus specified is PENDING for a while, there's no way for the user to find out exactly why it's pending without looking at the logs or checking ray status for the details about the internal JobSupervisor actor. This information should ideally be available in the JobStatus message for the job itself.