-
Notifications
You must be signed in to change notification settings - Fork 783
Fix training client error logs #2586
New issue
Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.
By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.
Already on GitHub? Sign in to your account
base: release-1.9
Are you sure you want to change the base?
Conversation
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Just left one comment for the particular case of TimeourError, otherwise looks good to me.
This is really useful in the general Exception case.
Thanks.
@@ -567,13 +567,13 @@ def create_job( | |||
constants.JOB_PARAMETERS[job_kind]["plural"], | |||
job, | |||
) | |||
except multiprocessing.TimeoutError: | |||
except multiprocessing.TimeoutError as e: | |||
raise TimeoutError( |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Maybe we can leave the message as is for TimeoutError as the message is clear enough in that particular case.
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
done
[APPROVALNOTIFIER] This PR is NOT APPROVED This pull-request has been approved by: astefanutti The full list of commands accepted by this bot can be found here.
Needs approval from an approver in each of these files:
Approvers can indicate their approval by writing |
@efazal could you please sign the commit? |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Updated PR, and signed off as well. Please review again. Thanks.
Signed-off-by: Esa Fazal <efazal@redhat.com>
…ns in training client Signed-off-by: Esa Fazal <efazal@redhat.com>
Hi @IRONICBo , I updated the PR and the signoff appears in all the commits. |
@@ -571,9 +571,9 @@ def create_job( | |||
raise TimeoutError( | |||
f"Timeout to create {job_kind}: {namespace}/{job.metadata.name}" | |||
) | |||
except Exception: | |||
except Exception as e: |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Thank you for doing this @efazal!
@astefanutti @efazal I am wondering do we really need to duplicate error in our exception since Kubernetes Python client traceback the error:
Traceback (most recent call last):
File "/Users/avelichk/go/src/github.com/kubeflow/trainer/sdk/kubeflow/trainer/api/trainer_client.py", line 461, in delete_job
self.custom_api.delete_namespaced_custom_object(
File "/Users/avelichk/miniconda3/envs/trainer/lib/python3.11/site-packages/kubernetes/client/api/custom_objects_api.py", line 916, in delete_namespaced_custom_object
return self.delete_namespaced_custom_object_with_http_info(group, version, namespace, plural, name, **kwargs) # noqa: E501
^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^
File "/Users/avelichk/miniconda3/envs/trainer/lib/python3.11/site-packages/kubernetes/client/api/custom_objects_api.py", line 1043, in delete_namespaced_custom_object_with_http_info
return self.api_client.call_api(
^^^^^^^^^^^^^^^^^^^^^^^^^
File "/Users/avelichk/miniconda3/envs/trainer/lib/python3.11/site-packages/kubernetes/client/api_client.py", line 348, in call_api
return self.__call_api(resource_path, method,
^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^
File "/Users/avelichk/miniconda3/envs/trainer/lib/python3.11/site-packages/kubernetes/client/api_client.py", line 180, in __call_api
response_data = self.request(
^^^^^^^^^^^^^
File "/Users/avelichk/miniconda3/envs/trainer/lib/python3.11/site-packages/kubernetes/client/api_client.py", line 415, in request
return self.rest_client.DELETE(url,
^^^^^^^^^^^^^^^^^^^^^^^^^^^^
File "/Users/avelichk/miniconda3/envs/trainer/lib/python3.11/site-packages/kubernetes/client/rest.py", line 270, in DELETE
return self.request("DELETE", url,
^^^^^^^^^^^^^^^^^^^^^^^^^^^
File "/Users/avelichk/miniconda3/envs/trainer/lib/python3.11/site-packages/kubernetes/client/rest.py", line 238, in request
raise ApiException(http_resp=r)
kubernetes.client.exceptions.ApiException: (404)
Reason: Not Found
HTTP response headers: HTTPHeaderDict({'Audit-Id': '43086818-3937-4afa-a693-1f2c2b284c27', 'Cache-Control': 'no-cache, private', 'Content-Type': 'application/json', 'X-Kubernetes-Pf-Flowschema-Uid': 'c15b1bfd-44dc-4a76-98a4-761930436ba4', 'X-Kubernetes-Pf-Prioritylevel-Uid': '19ffa315-03b4-4735-85e4-93b62200c787', 'Date': 'Tue, 08 Apr 2025 14:09:19 GMT', 'Content-Length': '250'})
HTTP response body: {"kind":"Status","apiVersion":"v1","metadata":{},"status":"Failure","message":"trainjobs.trainer.kubeflow.org \"not-exists\" not found","reason":"NotFound","details":{"name":"not-exists","group":"trainer.kubeflow.org","kind":"trainjobs"},"code":404}
During handling of the above exception, another exception occurred:
Traceback (most recent call last):
File "/Users/avelichk/workspace/test-trainer.py", line 22, in <module>
TrainerClient().delete_job(name="not-exists")
File "/Users/avelichk/go/src/github.com/kubeflow/trainer/sdk/kubeflow/trainer/api/trainer_client.py", line 473, in delete_job
raise RuntimeError(
RuntimeError: Failed to delete TrainJob: kubeflow-andrey/not-exists
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Hi @andreyvelich, its interesting it shows you the traceback when the exception occurred. However, recently when we using the training client, we experienced the following error log without any traceback or details of the error:
After printing the exception itself, we were able to see the exact cause:
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
@efazal Do you know if you configure sys.tracebacklimit = 0
as part of your application ?
https://docs.python.org/3/library/sys.html#sys.tracebacklimit
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
@andreyvelich not that i'm aware of. But I did a quick check anyway and don't seem to find that attribute set anywhere in the application.
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
@andreyvelich I think it's worth mentioning that, traceback's are printed in most other cases, following is an example to confirm that the tracebacklimit is not set to zero anywhere in our application:
When creating a job however, it doesn't print a traceback.
raise RuntimeError( | ||
f"Failed to create {job_kind}: {namespace}/{job.metadata.name}" | ||
f"Failed to create {job_kind}: {namespace}/{job.metadata.name} due to {str(e)}" |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
I think, we should modify all our error message to propagate traceback into __cause__
parameter for Exception object:
except Exception as e:
raise RuntimeError(f"Failed to get {job_kind}: {namespace}/{name}") from e
In that case, users of SDK can process to traceback as follows:
try:
get_job(...)
except Exception as e:
print(e.__cause__)
What this PR does / why we need it:
This PR has a fix to enhance the error log messages to give detailed exceptions for job operations in training client.
Checklist: