8000 Add provision to provide local-queue for the training job in SDKv1 an… by abhijeet-dhumal · Pull Request #2636 · kubeflow/trainer · GitHub
[go: up one dir, main page]
More Web Proxy on the site http://driver.im/
Skip to content

Add provision to provide local-queue for the training job in SDKv1 an… #2636

New issue

Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.

By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.

Already on GitHub? Sign in to your account

Draft
wants to merge 1 commit into
base: release-1.9
Choose a base branch
from

Conversation

abhijeet-dhumal
Copy link
@abhijeet-dhumal abhijeet-dhumal commented May 5, 2025

What this PR does / why we need it:

  1. Added queue_name parameter to both train and create_job methods
  2. Implemented validation logic for:
    • Neither queue_name nor Kueue label provided ✅ : No label is set, no validation performed.
    • Only queue_name provided ✅ : Label is set to queue_name value and queue existence is validated.
    • Only Kueue label in labels provided ✅ : Label value is used and queue existence is validated.
    • queue_name provided but doesn't exist in given namespace 🟥 : Raises value-error stating the provided local-queue doesn't exists in specified namespace
    • Both queue_name and Kueue label provided, same value ✅ : No conflict; label is set and queue existence is validated.
    • Both queue_name and Kueue label provided, different values 🟥 : Error raised due to conflict.
    • Labels is None, lqueue_name provided ✅ : Labels dict is created, label is set, and queue existence is validated.
    • Labels has other keys, queue_name provided ✅ : Other labels are preserved, Kueue label is set/overwritten, and queue existence is validated.
Screenshot 2025-05-05 at 9 10 03 PM
client.create_job(
    name="pytorch-trainer",
    namespace=<namespace>,
    train_func=<train_func>,  # The training function
    num_workers=2,
    resources_per_worker={"gpu": <
8000
num_gpus>},
    base_image=<training_image>,
    labels={"kueue.x-k8s.io/queue-name":"test-local-queue"},
    queue_name="test-local-queue",
)

Checklist:

  • Docs included if any changes are user facing

Copy link

[APPROVALNOTIFIER] This PR is NOT APPROVED

This pull-request has been approved by:
Once this PR has been reviewed and has the lgtm label, please assign jinchihe for approval. For more information see the Kubernetes Code Review Process.

The full list of commands accepted by this bot can be found here.

Needs approval from an approver in each of these files:

Approvers can indicate their approval by writing /approve in a comment
Approvers can cancel approval by writing /approve cancel in a comment

@google-oss-prow google-oss-prow bot requested review from jinchihe and kuizhiqing May 5, 2025 12:28
@abhijeet-dhumal abhijeet-dhumal force-pushed the add-kueue-validation-for-training-job branch 3 times, most recently from 6b4ff3e to 8c27ed6 Compare May 5, 2025 15:36
@abhijeet-dhumal
Copy link
Author
abhijeet-dhumal commented May 6, 2025

Please review : @astefanutti @ChristianZaccaria @andreyvelich

@abhijeet-dhumal abhijeet-dhumal marked this pull request as ready for review May 6, 2025 06:44
logger.debug(
f"Kueue queue '{queue_name}' exists in namespace '{namespace}'"
)
except ApiException as e:
Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

What happens if Kueue is not installed?

Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

If Kueue is completely uninstalled, including the CRDs (LocalQueue), the error is i.e.,:

Kueue queue 'user-queue' not found in namespace 'demo'

Looks as expected.

If Kueue has been enabled and then disabled (keeping the LocalQueue CRD installed), the error is i.e.,:

ApiException: (500)
Reason: Internal Server Error
HTTP response headers: HTTPHeaderDict({'Audit-Id': '1ba137b4-a501-418d-a053-5fe648fb0f65', 'Cache-Control': 'no-cache, private', 'Content-Type': 'application/json', 'Strict-Transport-Security': 'max-age=31536000; includeSubDomains; preload', 'X-Kubernetes-Pf-Flowschema-Uid': 'd16caeb2-bdf1-44ae-bde7-5807bee26c25', 'X-Kubernetes-Pf-Prioritylevel-Uid': 'c42df38f-421f-49c6-99a6-d7666b465418', 'Date': 'Wed, 07 May 2025 10:40:10 GMT', 'Content-Length': '677'})
HTTP response body: {"kind":"Status","apiVersion":"v1","metadata":{},"status":"Failure","message":"Internal error occurred: failed calling webhook "mpytorchjob.kb.io": failed to call webhook: Post "https://kueue-webhook-service.redhat-ods-applications.svc:443/mutate-kubeflow-org-v1-pytorchjob?timeout=10s\": no endpoints available for service "kueue-webhook-service"","reason":"InternalError","details":{"causes":[{"message":"failed calling webhook "mpytorchjob.kb.io": failed to call webhook: Post "[https://kueue-webhook-service.redhat-ods-applications.svc:443/mutate-kubeflow-org-v1-pytorchjob?timeout=10s](https://kueue-webhook-service.redhat-ods-applications.svc/mutate-kubeflow-org-v1-pytorchjob?timeout=10s)": no endpoints available for service "kueue-webhook-service""}]},"code":500}

Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Perhaps we could add a non-blocking warning to say that Kueue is disabled or the LocalQueue CRD is not installed, hence the job won't be managed by Kueue.

Copy link
Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

As you rightly mentioned in review thread :

If Kueue has been enabled and then disabled (keeping the LocalQueue CRD installed), the error is i.e.,:
ApiException: (500)
Reason: Internal Server Error

Maybe based on status code we can add this warning log once status code matches 404/500, WDYT?

Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

It seems from the failure message that this is caused because the Kueue webhooks are not properly removed when Kueue is disabled.

To rephrase my original question: is 404 also returned when the LocalQueue CRD is not installed?

Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Perhaps we could add a non-blocking warning to say that Kueue is disabled or the LocalQueue CRD is not installed, hence the job won't be managed by Kueue.

Right, a warning could make sense in the case queue_name is explicitly provided and Kueue is not installed.

Copy link
Author
@abhijeet-dhumal abhijeet-dhumal May 12, 2025

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

IIUC , 404 mean either:

  • The specific queue doesn't exist
  • The LocalQueue CRD is not installed (which would cause a 404 when trying to access the CRD)

500 mean:

  • Kueue is disabled or having internal issues
  • Webhook issues (as you mentioned)

403 (Forbidden) mean :

  • Status code in this context indicates that the user/service account doesn't have the necessary permissions to access the Kueue queue.

Copy link
Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

I have made changes and added checks accordingly, please review !

Copy link
Contributor
@ChristianZaccaria ChristianZaccaria left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Great work!!

Tested manually on OCP 4.17 with VAP enforced onto PyTorchJobs:

  • Adding non-existing queue-name label raises an error✅
  • Adding non-existing queue-name to local_queue field raises an error✅
  • Using conflicting Kueue queue labels between the label and local_queue fields raises an error✅
  • Job without queue-name can't be created while VAP is enforced✅
  • Job with queue-name can be created while VAP is enforced, whether we use the labels or the local_queue parameters, or both at the same time✅
  • Job is created when labels has other keys, and queue-name is provided✅
  • Job can be created without queue-name while VAP is NOT enforced✅

logger.debug(
f"Kueue queue '{queue_name}' exists in namespace '{namespace}'"
)
except ApiException as e:
Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

If Kueue is completely uninstalled, including the CRDs (LocalQueue), the error is i.e.,:

Kueue queue 'user-queue' not found in namespace 'demo'

Looks as expected.

If Kueue has been enabled and then disabled (keeping the LocalQueue CRD installed), the error is i.e.,:

ApiException: (500)
Reason: Internal Server Error
HTTP response headers: HTTPHeaderDict({'Audit-Id': '1ba137b4-a501-418d-a053-5fe648fb0f65', 'Cache-Control': 'no-cache, private', 'Content-Type': 'application/json', 'Strict-Transport-Security': 'max-age=31536000; includeSubDomains; preload', 'X-Kubernetes-Pf-Flowschema-Uid': 'd16caeb2-bdf1-44ae-bde7-5807bee26c25', 'X-Kubernetes-Pf-Prioritylevel-Uid': 'c42df38f-421f-49c6-99a6-d7666b465418', 'Date': 'Wed, 07 May 2025 10:40:10 GMT', 'Content-Length': '677'})
HTTP response body: {"kind":"Status","apiVersion":"v1","metadata":{},"status":"Failure","message":"Internal error occurred: failed calling webhook "mpytorchjob.kb.io": failed to call webhook: Post "https://kueue-webhook-service.redhat-ods-applications.svc:443/mutate-kubeflow-org-v1-pytorchjob?timeout=10s\": no endpoints available for service "kueue-webhook-service"","reason":"InternalError","details":{"causes":[{"message":"failed calling webhook "mpytorchjob.kb.io": failed to call webhook: Post "[https://kueue-webhook-service.redhat-ods-applications.svc:443/mutate-kubeflow-org-v1-pytorchjob?timeout=10s](https://kueue-webhook-service.redhat-ods-applications.svc/mutate-kubeflow-org-v1-pytorchjob?timeout=10s)": no endpoints available for service "kueue-webhook-service""}]},"code":500}

Copy link

@ChristianZaccaria: changing LGTM is restricted to collaborators

In response to this:

Great work!!

Tested manually on OCP 4.17 with VAP enforced onto PyTorchJobs:

  • Adding non-existing queue-name label raises an error✅
  • Adding non-existing queue-name to local_queue field raises an error✅
  • Using conflicting Kueue queue labels between the label and local_queue fields raises an error✅
  • Job without queue-name can't be created while VAP is enforced✅
  • Job with queue-name can be created while VAP is enforced, whether we use the labels or the local_queue parameters, or both at the same time✅
  • Job is created when labels has other keys, and queue-name is provided✅
  • Job can be created without queue-name while VAP is NOT enforced✅

Instructions for interacting with me using PR comments are available here. If you have questions or suggestions related to my behavior, please file an issue against the kubernetes/test-infra repository.

…d the related unit tests

Signed-off-by: Abhijeet Dhumal <abhijeetdhumal652@gmail.com>
@abhijeet-dhumal abhijeet-dhumal force-pushed the add-kueue-validation-for-training-job branch from 8c27ed6 to df1341a Compare May 12, 2025 13:22
@abhijeet-dhumal abhijeet-dhumal marked this pull request as draft May 14, 2025 05:49
Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment
Projects
None yet
Development

Successfully merging this pull request may close these issues.

3 participants
0