-
Notifications
You must be signed in to change notification settings - Fork 777
Add provision to provide local-queue for the training job in SDKv1 an… #2636
New issue
Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.
By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.
Already on GitHub? Sign in to your account
base: release-1.9
Are you sure you want to change the base?
Add provision to provide local-queue for the training job in SDKv1 an… #2636
Conversation
[APPROVALNOTIFIER] This PR is NOT APPROVED This pull-request has been approved by: The full list of commands accepted by this bot can be found here.
Needs approval from an approver in each of these files:
Approvers can indicate their approval by writing |
6b4ff3e
to
8c27ed6
Compare
Please review : @astefanutti @ChristianZaccaria @andreyvelich |
logger.debug( | ||
f"Kueue queue '{queue_name}' exists in namespace '{namespace}'" | ||
) | ||
except ApiException as e: |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
What happens if Kueue is not installed?
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
If Kueue is completely uninstalled, including the CRDs (LocalQueue), the error is i.e.,:
Kueue queue 'user-queue' not found in namespace 'demo'
Looks as expected.
If Kueue has been enabled and then disabled (keeping the LocalQueue CRD installed), the error is i.e.,:
ApiException: (500)
Reason: Internal Server Error
HTTP response headers: HTTPHeaderDict({'Audit-Id': '1ba137b4-a501-418d-a053-5fe648fb0f65', 'Cache-Control': 'no-cache, private', 'Content-Type': 'application/json', 'Strict-Transport-Security': 'max-age=31536000; includeSubDomains; preload', 'X-Kubernetes-Pf-Flowschema-Uid': 'd16caeb2-bdf1-44ae-bde7-5807bee26c25', 'X-Kubernetes-Pf-Prioritylevel-Uid': 'c42df38f-421f-49c6-99a6-d7666b465418', 'Date': 'Wed, 07 May 2025 10:40:10 GMT', 'Content-Length': '677'})
HTTP response body: {"kind":"Status","apiVersion":"v1","metadata":{},"status":"Failure","message":"Internal error occurred: failed calling webhook "mpytorchjob.kb.io": failed to call webhook: Post "https://kueue-webhook-service.redhat-ods-applications.svc:443/mutate-kubeflow-org-v1-pytorchjob?timeout=10s\": no endpoints available for service "kueue-webhook-service"","reason":"InternalError","details":{"causes":[{"message":"failed calling webhook "mpytorchjob.kb.io": failed to call webhook: Post "[https://kueue-webhook-service.redhat-ods-applications.svc:443/mutate-kubeflow-org-v1-pytorchjob?timeout=10s](https://kueue-webhook-service.redhat-ods-applications.svc/mutate-kubeflow-org-v1-pytorchjob?timeout=10s)": no endpoints available for service "kueue-webhook-service""}]},"code":500}
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Perhaps we could add a non-blocking warning to say that Kueue is disabled or the LocalQueue CRD is not installed, hence the job won't be managed by Kueue.
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
As you rightly mentioned in review thread :
If Kueue has been enabled and then disabled (keeping the LocalQueue CRD installed), the error is i.e.,:
ApiException: (500)
Reason: Internal Server Error
Maybe based on status code we can add this warning log once status code matches 404/500, WDYT?
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
It seems from the failure message that this is caused because the Kueue webhooks are not properly removed when Kueue is disabled.
To rephrase my original question: is 404 also returned when the LocalQueue CRD is not installed?
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Perhaps we could add a non-blocking warning to say that Kueue is disabled or the LocalQueue CRD is not installed, hence the job won't be managed by Kueue.
Right, a warning could make sense in the case queue_name
is explicitly provided and Kueue is not installed.
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
IIUC , 404 mean either:
- The specific queue doesn't exist
- The LocalQueue CRD is not installed (which would cause a 404 when trying to access the CRD)
500 mean:
- Kueue is disabled or having internal issues
- Webhook issues (as you mentioned)
403 (Forbidden) mean :
- Status code in this context indicates that the user/service account doesn't have the necessary permissions to access the Kueue queue.
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
I have made changes and added checks accordingly, please review !
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Great work!!
Tested manually on OCP 4.17 with VAP enforced onto PyTorchJobs:
- Adding non-existing
queue-name
label raises an error✅ - Adding non-existing
queue-name
tolocal_queue
field raises an error✅ - Using conflicting Kueue queue labels between the
label
andlocal_queue
fields raises an error✅ - Job without
queue-name
can't be created while VAP is enforced✅ - Job with
queue-name
can be created while VAP is enforced, whether we use thelabels
or thelocal_queue
parameters, or both at the same time✅ - Job is created when labels has other keys, and
queue-name
is provided✅ - Job can be created without
queue-name
while VAP is NOT enforced✅
logger.debug( | ||
f"Kueue queue '{queue_name}' exists in namespace '{namespace}'" | ||
) | ||
except ApiException as e: |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
If Kueue is completely uninstalled, including the CRDs (LocalQueue), the error is i.e.,:
Kueue queue 'user-queue' not found in namespace 'demo'
Looks as expected.
If Kueue has been enabled and then disabled (keeping the LocalQueue CRD installed), the error is i.e.,:
ApiException: (500)
Reason: Internal Server Error
HTTP response headers: HTTPHeaderDict({'Audit-Id': '1ba137b4-a501-418d-a053-5fe648fb0f65', 'Cache-Control': 'no-cache, private', 'Content-Type': 'application/json', 'Strict-Transport-Security': 'max-age=31536000; includeSubDomains; preload', 'X-Kubernetes-Pf-Flowschema-Uid': 'd16caeb2-bdf1-44ae-bde7-5807bee26c25', 'X-Kubernetes-Pf-Prioritylevel-Uid': 'c42df38f-421f-49c6-99a6-d7666b465418', 'Date': 'Wed, 07 May 2025 10:40:10 GMT', 'Content-Length': '677'})
HTTP response body: {"kind":"Status","apiVersion":"v1","metadata":{},"status":"Failure","message":"Internal error occurred: failed calling webhook "mpytorchjob.kb.io": failed to call webhook: Post "https://kueue-webhook-service.redhat-ods-applications.svc:443/mutate-kubeflow-org-v1-pytorchjob?timeout=10s\": no endpoints available for service "kueue-webhook-service"","reason":"InternalError","details":{"causes":[{"message":"failed calling webhook "mpytorchjob.kb.io": failed to call webhook: Post "[https://kueue-webhook-service.redhat-ods-applications.svc:443/mutate-kubeflow-org-v1-pytorchjob?timeout=10s](https://kueue-webhook-service.redhat-ods-applications.svc/mutate-kubeflow-org-v1-pytorchjob?timeout=10s)": no endpoints available for service "kueue-webhook-service""}]},"code":500}
@ChristianZaccaria: changing LGTM is restricted to collaborators In response to this:
Instructions for interacting with me using PR comments are available here. If you have questions or suggestions related to my behavior, please file an issue against the kubernetes/test-infra repository. |
…d the related unit tests Signed-off-by: Abhijeet Dhumal <abhijeetdhumal652@gmail.com>
8c27ed6
to
df1341a
Compare
What this PR does / why we need it:
queue_name
parameter to bothtrain
andcreate_job
methodsNeither queue_name nor Kueue label provided
✅ : No label is set, no validation performed.Only queue_name provided
✅ : Label is set to queue_name value and queue existence is validated.Only Kueue label in labels provided
✅ : Label value is used and queue existence is validated.queue_name provided but doesn't exist in given namespace
🟥 : Raises value-error stating the provided local-queue doesn't exists in specified namespaceBoth queue_name and Kueue label provided, same value
✅ : No conflict; label is set and queue existence is validated.Both queue_name and Kueue label provided, different values
🟥 : Error raised due to conflict.Labels is None, lqueue_name provided
✅ : Labels dict is created, label is set, and queue existence is validated.Labels has other keys, queue_name provided
✅ : Other labels are preserved, Kueue label is set/overwritten, and queue existence is validated.Checklist: