-
Notifications
You must be signed in to change notification settings - Fork 40
doc(components/*): document /v1/states API health field behavior #922
New issue
Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.
By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.
Already on GitHub? Sign in to your account
base: main
Are you sure you want to change the base?
Conversation
Codecov ReportAll modified and coverable lines are covered by tests ✅
Additional details and impacted files@@ Coverage Diff @@
## main #922 +/- ##
==========================================
- Coverage 64.70% 64.65% -0.05%
==========================================
Files 284 284
Lines 23259 23259
==========================================
- Hits 15049 15038 -11
- Misses 7430 7436 +6
- Partials 780 785 +5 ☔ View full report in Codecov by Sentry. 🚀 New features to boost your workflow:
|
Need to update #922. Signed-off-by: Gyuho Lee <gyuhol@nvidia.com>
…health state on egress check issues Need to update #922. Signed-off-by: Gyuho Lee <gyuhol@nvidia.com>
…health state on egress check issues Need to update #922. Signed-off-by: Gyuho Lee <gyuhol@nvidia.com>
- kubelet -> kubelet - containerd-pod -> containerd - docker-container -> docker c.f., #922 Signed-off-by: Gyuho Lee <gyuhol@nvidia.com>
78ab940
to
731c6fa
Compare
…health state on egress check issues Need to update #922. Signed-off-by: Gyuho Lee <gyuhol@nvidia.com>
Signed-off-by: Gyuho Lee <gyuhol@nvidia.com>
731c6fa
to
b576eb7
Compare
// - [apiv1.HealthStateTypeHealthy] when bad environment variables are detected (treated as informational) | ||
// - [apiv1.HealthStateTypeHealthy] when no bad environment variables are found | ||
// | ||
// This component ALWAYS reports "healthy" regardless of findings, as bad environment |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
It's a bit weird that why we keep the component if we always return Healthy...
// /v1/states API Health Field Behavior: | ||
// The [apiv1.HealthState.Health] field in the /v1/states API response is set as follows: | ||
// - [apiv1.HealthStateTypeHealthy] when NVIDIA components are unavailable (no NVML, no GPU detected) | ||
// - [apiv1.HealthStateTypeUnhealthy] when there's an error retrieving clock speed from any GPU |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
We should report degraded not Unhealthy.
// | ||
// Suggested Actions: | ||
// This component does not set the [apiv1.HealthState.SuggestedActions] field. | ||
// ECC errors are reported for monitoring purposes and may require hardware inspection if persistent. |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
So should we report Degraded first then report Unhealthy with HW inspection if persistent?
// - [apiv1.HealthStateTypeHealthy] when GPU doesn't support fabric manager | ||
// - [apiv1.HealthStateTypeHealthy] when NVSwitch is not detected (fabric manager not needed) | ||
// - [apiv1.HealthStateTypeHealthy] when nv-fabricmanager executable is not found | ||
// - [apiv1.HealthStateTypeUnhealthy] when fabric manager executable exists but service is not active |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Should we also be able to identify the specific error by SXID? IMO if the fm service is inactive, should we report Degraded?
// - [apiv1.HealthStateTypeHealthy] when NVIDIA components are unavailable (no NVML, no GPU detected) | ||
// - [apiv1.HealthStateTypeUnhealthy] when there's an error getting GPM supported status from any GPU | ||
// - [apiv1.HealthStateTypeHealthy] when GPM is not supported by any GPU | ||
// - [apiv1.HealthStateTypeUnhealthy] when there's an error getting GPM metrics from any GPU |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Would we report Degraded if we just collect the metrics?
// - [apiv1.HealthStateTypeHealthy] when all GPUs' GSP firmware modes are successfully retrieved | ||
// | ||
// Suggested Actions: | ||
// This component does not set the [apiv1.HealthState.SuggestedActions] field. |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Shall we return HW inspection? or should introduce the Human Inspection?
// - [apiv1.HealthStateTypeHealthy] when no event bucket is available | ||
// - [apiv1.HealthStateTypeHealthy] when no clock events are found in the evaluation window | ||
// - [apiv1.HealthStateTypeHealthy] when HW slowdown events frequency is below threshold | ||
// - [apiv1.HealthStateTypeUnhealthy] when HW slowdown events frequency exceeds threshold |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
It has similar issues as ib flapping. We should consider introducing similar policy for this case. It should be Degraded first, then Unhealthy.
// - [apiv1.HealthStateTypeUnhealthy] when there's an error getting memory information from any GPU | ||
// - [apiv1.HealthStateTypeUnhealthy] when there's an error calculating memory usage percentage |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
the component just collects the metrics, so we should always report Healthy?
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Yes. We used to have some threshold, but not in main branch. Used for reporting metrics.
@eahydra Let's review this in the Google sheet, and I will update this accordingly. |
Fix #595.