8000 Fix training client error logs by efazal · Pull Request #2586 · kubeflow/trainer · GitHub
[go: up one dir, main page]
More Web Proxy on the site http://driver.im/
Skip to content

Fix training client error logs #2586

New issue

Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.

By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.

Already on GitHub? Sign in to your account

Open
wants to merge 2 commits into
base: release-1.9
Choose a base branch
from

Conversation

efazal
Copy link
@efazal efazal commented Apr 7, 2025

What this PR does / why we need it:
This PR has a fix to enhance the error log messages to give detailed exceptions for job operations in training client.

Checklist:

  • Docs included if any changes are user facing

Copy link
Contributor
@astefanutti astefanutti left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Just left one comment for the particular case of TimeourError, otherwise looks good to me.

This is really useful in the general Exception case.

Thanks.

@@ -567,13 +567,13 @@ def create_job(
constants.JOB_PARAMETERS[job_kind]["plural"],
job,
)
except multiprocessing.TimeoutError:
except multiprocessing.TimeoutError as e:
raise TimeoutError(
Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Maybe we can leave the message as is for TimeoutError as the message is clear enough in that particular case.

Copy link
Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

done

Copy link

[APPROVALNOTIFIER] This PR is NOT APPROVED

This pull-request has been approved by: astefanutti
Once this PR has been reviewed and has the lgtm label, please assign jinchihe for approval. For more information see the Kubernetes Code Review Process.

The full list of commands accepted by this bot can be found here.

Needs approval from an approver in each of these files:

Approvers can indicate their approval by writing /approve in a comment
Approvers can cancel approval by writing /approve cancel in a comment

@astefanutti
Copy link
Contributor
astefanutti commented Apr 7, 2025

@efazal could you please sign the commit?

Copy link
Author
@efazal efazal left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Updated PR, and signed off as well. Please review again. Thanks.

@IRONICBo
Copy link
Contributor
IRONICBo commented Apr 8, 2025

Updated PR, and signed off as well. Please review again. Thanks.

Hi @efazal , The DCO CI check looks like it failed. You can rebase your commits so that they can be signed off here.

You can refer to the CI tips here.

efazal added 2 commits April 8, 2025 10:28
…ns in training client

Signed-off-by: Esa Fazal <efazal@redhat.com>
@efazal
Copy link
Author
efazal commented Apr 8, 2025

Hi @IRONICBo , I updated the PR and the signoff appears in all the commits.

@@ -571,9 +571,9 @@ def create_job(
raise TimeoutError(
f"Timeout to create {job_kind}: {namespace}/{job.metadata.name}"
)
except Exception:
except Exception as e:
Copy link
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Thank you for doing this @efazal!
@astefanutti @efazal I am wondering do we really need to duplicate error in our exception since Kubernetes Python client traceback the error:

Traceback (most recent call last):
  File "/Users/avelichk/go/src/github.com/kubeflow/trainer/sdk/kubeflow/trainer/api/trainer_client.py", line 461, in delete_job
    self.custom_api.delete_namespaced_custom_object(
  File "/Users/avelichk/miniconda3/envs/trainer/lib/python3.11/site-packages/kubernetes/client/api/custom_objects_api.py", line 916, in delete_namespaced_custom_object
    return self.delete_namespaced_custom_object_with_http_info(group, version, namespace, plural, name, **kwargs)  # noqa: E501
           ^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^
  File "/Users/avelichk/miniconda3/envs/trainer/lib/python3.11/site-packages/kubernetes/client/api/custom_objects_api.py", line 1043, in delete_namespaced_custom_object_with_http_info
    return self.api_client.call_api(
           ^^^^^^^^^^^^^^^^^^^^^^^^^
  File "/Users/avelichk/miniconda3/envs/trainer/lib/python3.11/site-packages/kubernetes/client/api_client.py", line 348, in call_api
    return self.__call_api(resource_path, method,
           ^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^
  File "/Users/avelichk/miniconda3/envs/trainer/lib/python3.11/site-packages/kubernetes/client/api_client.py", line 180, in __call_api
    response_data = self.request(
                    ^^^^^^^^^^^^^
  File "/Users/avelichk/miniconda3/envs/trainer/lib/python3.11/site-packages/kubernetes/client/api_client.py", line 415, in request
    return self.rest_client.DELETE(url,
           ^^^^^^^^^^^^^^^^^^^^^^^^^^^^
  File "/Users/avelichk/miniconda3/envs/trainer/lib/python3.11/site-packages/kubernetes/client/rest.py", line 270, in DELETE
    return self.request("DELETE", url,
           ^^^^^^^^^^^^^^^^^^^^^^^^^^^
  File "/Users/avelichk/miniconda3/envs/trainer/lib/python3.11/site-packages/kubernetes/client/rest.py", line 238, in request
    raise ApiException(http_resp=r)
kubernetes.client.exceptions.ApiException: (404)
Reason: Not Found
HTTP response headers: HTTPHeaderDict({'Audit-Id': '43086818-3937-4afa-a693-1f2c2b284c27', 'Cache-Control': 'no-cache, private', 'Content-Type': 'application/json', 'X-Kubernetes-Pf-Flowschema-Uid': 'c15b1bfd-44dc-4a76-98a4-761930436ba4', 'X-Kubernetes-Pf-Prioritylevel-Uid': '19ffa315-03b4-4735-85e4-93b62200c787', 'Date': 'Tue, 08 Apr 2025 14:09:19 GMT', 'Content-Length': '250'})
HTTP response body: {"kind":"Status","apiVersion":"v1","metadata":{},"status":"Failure","message":"trainjobs.trainer.kubeflow.org \"not-exists\" not found","reason":"NotFound","details":{"name":"not-exists","group":"trainer.kubeflow.org","kind":"trainjobs"},"code":404}



During handling of the above exception, another exception occurred:

Traceback (most recent call last):
  File "/Users/avelichk/workspace/test-trainer.py", line 22, in <module>
    TrainerClient().delete_job(name="not-exists")
  File "/Users/avelichk/go/src/github.com/kubeflow/trainer/sdk/kubeflow/trainer/api/trainer_client.py", line 473, in delete_job
    raise RuntimeError(
RuntimeError: Failed to delete TrainJob: kubeflow-andrey/not-exists

Copy link
Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Hi @andreyvelich, its interesting it shows you the traceback when the exception occurred. However, recently when we using the training client, we experienced the following error log without any traceback or details of the error:

image

After printing the exception itself, we were able to see the exact cause:

image

Copy link
Member
@andreyvelich andreyvelich Apr 8, 2025

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

@efazal Do you know if you configure sys.tracebacklimit = 0 as part of your application ?
https://docs.python.org/3/library/sys.html#sys.tracebacklimit

Copy link
Author
@efazal efazal Apr 8, 2025

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

@andreyvelich not that i'm aware of. But I did a quick check anyway and don't seem to find that attribute set anywhere in the application.

Copy link
Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

@andreyvelich I think it's worth mentioning that, traceback's are printed in most other cases, following is an example to confirm that the tracebacklimit is not set to zero anywhere in our application:
image

When creating a job however, it doesn't print a traceback.

Comment on lines 575 to +576
raise RuntimeError(
f"Failed to create {job_kind}: {namespace}/{job.metadata.name}"
f"Failed to create {job_kind}: {namespace}/{job.metadata.name} due to {str(e)}"
Copy link
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

I think, we should modify all our error message to propagate traceback into __cause__ parameter for Exception object:

except Exception as e:
    raise RuntimeError(f"Failed to get {job_kind}: {namespace}/{name}") from e

In that case, users of SDK can process to traceback as follows:

try:
  get_job(...)
except Exception as e:
    print(e.__cause__)

Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment
Labels
Projects
None yet
Development

Successfully merging this pull request may close these issues.

4 participants
0