doc(components/*): document /v1/states API health field behavior by gyuho · Pull Request #922 · leptonai/gpud · GitHub

doc(components/*): document /v1/states API health field behavior #922


Open
gyuho wants to merge 1 commit into main

Conversation

gyuho (Member) commented Jun 12, 2025

Fix #595.

gyuho added this to the v0.5.0 milestone Jun 12, 2025
gyuho self-assigned this Jun 12, 2025
codecov bot commented Jun 12, 2025

Codecov Report

All modified and coverable lines are covered by tests ✅

Project coverage is 64.65%. Comparing base (4d5491f) to head (b576eb7).

Additional details and impacted files
@@            Coverage Diff             @@
##             main     #922      +/-   ##
==========================================
- Coverage   64.70%   64.65%   -0.05%     
==========================================
  Files         284      284              
  Lines       23259    23259              
==========================================
- Hits        15049    15038      -11     
- Misses       7430     7436       +6     
- Partials      780      785       +5     


gyuho added a commit that referenced this pull request Jun 13, 2025
Need to update #922.

Signed-off-by: Gyuho Lee <gyuhol@nvidia.com>
gyuho added a commit that referenced this pull request Jun 13, 2025
…health state on egress check issues

Need to update #922.

Signed-off-by: Gyuho Lee <gyuhol@nvidia.com>
gyuho added a commit that referenced this pull request Jun 13, 2025
…health state on egress check issues

Need to update #922.

Signed-off-by: Gyuho Lee <gyuhol@nvidia.com>
gyuho added a commit that referenced this pull request Jun 13, 2025
- kubelet -> kubelet
- containerd-pod -> containerd
- docker-container -> docker

c.f., #922

Signed-off-by: Gyuho Lee <gyuhol@nvidia.com>
gyuho force-pushed the document-all-health-states branch from 78ab940 to 731c6fa Compare June 14, 2025 10:33
gyuho added a commit that referenced this pull request Jun 14, 2025
…health state on egress check issues

Need to update #922.

Signed-off-by: Gyuho Lee <gyuhol@nvidia.com>
gyuho added a commit that referenced this pull request Jun 17, 2025
…state on egress check issues (#925)

Need to update #922.

---------

Signed-off-by: Gyuho Lee <gyuhol@nvidia.com>
Signed-off-by: Gyuho Lee <gyuhox@gmail.com>
Signed-off-by: Gyuho Lee <gyuhol@nvidia.com>
gyuho force-pushed the document-all-health-states branch from 731c6fa to b576eb7 Compare June 17, 2025 13:23
// - [apiv1.HealthStateTypeHealthy] when bad environment variables are detected (treated as informational)
// - [apiv1.HealthStateTypeHealthy] when no bad environment variables are found
//
// This component ALWAYS reports "healthy" regardless of findings, as bad environment
Collaborator

It's a bit odd that we keep this component if it always returns Healthy...

// /v1/states API Health Field Behavior:
// The [apiv1.HealthState.Health] field in the /v1/states API response is set as follows:
// - [apiv1.HealthStateTypeHealthy] when NVIDIA components are unavailable (no NVML, no GPU detected)
// - [apiv1.HealthStateTypeUnhealthy] when there's an error retrieving clock speed from any GPU
Collaborator

We should report Degraded, not Unhealthy.

//
// Suggested Actions:
// This component does not set the [apiv1.HealthState.SuggestedActions] field.
// ECC errors are reported for monitoring purposes and may require hardware inspection if persistent.
Collaborator

So should we report Degraded first, then Unhealthy with HW inspection if the errors persist?

// - [apiv1.HealthStateTypeHealthy] when GPU doesn't support fabric manager
// - [apiv1.HealthStateTypeHealthy] when NVSwitch is not detected (fabric manager not needed)
// - [apiv1.HealthStateTypeHealthy] when nv-fabricmanager executable is not found
// - [apiv1.HealthStateTypeUnhealthy] when fabric manager executable exists but service is not active
Collaborator

Should we also be able to identify the specific error by SXID? IMO, if the FM service is inactive, shouldn't we report Degraded?
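
For reference, the four fabric manager rules quoted above can be read as an ordered decision. The following is only a minimal sketch under stated assumptions: the `fabricManagerInputs` struct, the `evaluateFabricManager` function, and the local `HealthStateType` constants are hypothetical stand-ins, not the gpud API, and the final "service is active, so Healthy" branch is an assumption that the quoted lines do not state. The state for an inactive service is a parameter because the doc comment says Unhealthy while the review asks about Degraded.

```go
package main

import "fmt"

// HealthStateType is a local stand-in for the apiv1 health type
// (hypothetical, not the gpud API).
type HealthStateType string

const (
	Healthy   HealthStateType = "Healthy"
	Degraded  HealthStateType = "Degraded"
	Unhealthy HealthStateType = "Unhealthy"
)

// fabricManagerInputs holds hypothetical probe results; the real component
// gathers this information in its own way.
type fabricManagerInputs struct {
	GPUSupportsFM    bool // the GPU supports fabric manager
	NVSwitchDetected bool // an NVSwitch is present
	ExecutableFound  bool // the nv-fabricmanager executable exists
	ServiceActive    bool // the fabric manager service is active
}

// evaluateFabricManager applies the documented rules in order.
// inactiveState is reported when the executable exists but the service is
// not active, so both the documented and the proposed behavior can be tried.
func evaluateFabricManager(in fabricManagerInputs, inactiveState HealthStateType) HealthStateType {
	switch {
	case !in.GPUSupportsFM:
		return Healthy
	case !in.NVSwitchDetected:
		return Healthy // fabric manager not needed
	case !in.ExecutableFound:
		return Healthy
	case !in.ServiceActive:
		return inactiveState
	default:
		return Healthy // assumption: an active service is healthy
	}
}

func main() {
	in := fabricManagerInputs{GPUSupportsFM: true, NVSwitchDetected: true, ExecutableFound: true}
	fmt.Println(evaluateFabricManager(in, Unhealthy)) // documented behavior
	fmt.Println(evaluateFabricManager(in, Degraded))  // behavior proposed in review
}
```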

// - [apiv1.HealthStateTypeHealthy] when NVIDIA components are unavailable (no NVML, no GPU detected)
// - [apiv1.HealthStateTypeUnhealthy] when there's an error getting GPM supported status from any GPU
// - [apiv1.HealthStateTypeHealthy] when GPM is not supported by any GPU
// - [apiv1.HealthStateTypeUnhealthy] when there's an error getting GPM metrics from any GPU
Collaborator

Would we report Degraded if we just collect the metrics?

// - [apiv1.HealthStateTypeHealthy] when all GPUs' GSP firmware modes are successfully retrieved
//
// Suggested Actions:
// This component does not set the [apiv1.HealthState.SuggestedActions] field.
Collaborator

Shall we return HW inspection, or should we introduce Human Inspection?

// - [apiv1.HealthStateTypeHealthy] when no event bucket is available
// - [apiv1.HealthStateTypeHealthy] when no clock events are found in the evaluation window
// - [apiv1.HealthStateTypeHealthy] when HW slowdown events frequency is below threshold
// - [apiv1.HealthStateTypeUnhealthy] when HW slowdown events frequency exceeds threshold
Collaborator

This has similar issues to IB flapping. We should consider introducing a similar policy for this case: Degraded first, then Unhealthy.
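
One way to picture the "Degraded first, then Unhealthy" idea is a small stateful policy that escalates only after the threshold has been exceeded for several consecutive evaluation windows. This is a sketch, not the policy gpud uses: `slowdownPolicy`, its `EscalateAfter` field, and the local `HealthStateType` constants are hypothetical names, and treating a frequency equal to the threshold as Healthy is an assumption.

```go
package main

import "fmt"

// HealthStateType is a local stand-in for the apiv1 health type
// (hypothetical, not the gpud API).
type HealthStateType string

const (
	Healthy   HealthStateType = "Healthy"
	Degraded  HealthStateType = "Degraded"
	Unhealthy HealthStateType = "Unhealthy"
)

// slowdownPolicy is a hypothetical escalation policy for HW slowdown events.
type slowdownPolicy struct {
	Threshold     float64 // event frequency above which a window counts as bad
	EscalateAfter int     // consecutive bad windows before reporting Unhealthy
	consecutive   int     // bad windows seen in a row so far
}

// Observe records the event frequency of one evaluation window and returns
// the state to report: Healthy at or below the threshold, Degraded while the
// problem is recent, Unhealthy once it has persisted for EscalateAfter windows.
func (p *slowdownPolicy) Observe(freq float64) HealthStateType {
	if freq <= p.Threshold {
		p.consecutive = 0 // a good window resets the escalation
		return Healthy
	}
	p.consecutive++
	if p.consecutive >= p.EscalateAfter {
		return Unhealthy
	}
	return Degraded
}

func main() {
	p := &slowdownPolicy{Threshold: 1.0, EscalateAfter: 3}
	for _, freq := range []float64{0.5, 2.0, 2.0, 2.0, 0.0} {
		fmt.Println(freq, p.Observe(freq))
	}
}
```

With `EscalateAfter: 3`, the first two windows over the threshold report Degraded and the third reports Unhealthy; a single window at or below the threshold resets the count.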

Comment on lines +6 to +7
// - [apiv1.HealthStateTypeUnhealthy] when there's an error getting memory information from any GPU
// - [apiv1.HealthStateTypeUnhealthy] when there's an error calculating memory usage percentage
Collaborator

The component just collects metrics, so should we always report Healthy?

Member Author

Yes. We used to have a threshold, but it is no longer in the main branch. The component is now used only for reporting metrics.

Member Author
gyuho commented Jun 18, 2025

@eahydra Let's review this in the Google sheet, and I will update this accordingly.

gyuho modified the milestones: v0.5.0, v0.5.1 Jun 18, 2025
Labels
None yet
Projects
None yet
Development

Successfully merging this pull request may close these issues.

Write GPUd integration document and clarify API stability and explain core APIs (/states, /events, /metrics)
2 participants