doc(components/*): document /v1/states API health field behavior #922

gyuho · 2025-06-12T13:01:14Z

codecov · 2025-06-12T13:07:37Z

Codecov Report

All modified and coverable lines are covered by tests ✅

Project coverage is 64.65%. Comparing base (4d5491f) to head (b576eb7).

Additional details and impacted files

@@            Coverage Diff             @@
##             main     #922      +/-   ##
==========================================
- Coverage   64.70%   64.65%   -0.05%     
==========================================
  Files         284      284              
  Lines       23259    23259              
==========================================
- Hits        15049    15038      -11     
- Misses       7430     7436       +6     
- Partials      780      785       +5

Need to update #922. Signed-off-by: Gyuho Lee <gyuhol@nvidia.com>

…health state on egress check issues Need to update #922. Signed-off-by: Gyuho Lee <gyuhol@nvidia.com>

- kubelet -> kubelet - containerd-pod -> containerd - docker-container -> docker c.f., #922 Signed-off-by: Gyuho Lee <gyuhol@nvidia.com>

…state on egress check issues (#925) Need to update #922. --------- Signed-off-by: Gyuho Lee <gyuhol@nvidia.com> Signed-off-by: Gyuho Lee <gyuhox@gmail.com>

Signed-off-by: Gyuho Lee <gyuhol@nvidia.com>

eahydra · 2025-06-17T06:59:39Z

components/accelerator/nvidia/bad-envs/component.go

+//   - [apiv1.HealthStateTypeHealthy] when bad environment variables are detected (treated as informational)
+//   - [apiv1.HealthStateTypeHealthy] when no bad environment variables are found
+//
+// This component ALWAYS reports "healthy" regardless of findings, as bad environment


It seems odd to keep this component if it always returns Healthy.

eahydra · 2025-06-17T07:04:24Z

components/accelerator/nvidia/clock-speed/component.go

+// /v1/states API Health Field Behavior:
+// The [apiv1.HealthState.Health] field in the /v1/states API response is set as follows:
+//   - [apiv1.HealthStateTypeHealthy] when NVIDIA components are unavailable (no NVML, no GPU detected)
+//   - [apiv1.HealthStateTypeUnhealthy] when there's an error retrieving clock speed from any GPU


We should report Degraded, not Unhealthy.

eahydra · 2025-06-17T07:13:31Z

components/accelerator/nvidia/ecc/component.go

+//
+// Suggested Actions:
+// This component does not set the [apiv1.HealthState.SuggestedActions] field.
+// ECC errors are reported for monitoring purposes and may require hardware inspection if persistent.


So should we report Degraded first, and then Unhealthy with a HW inspection suggested action if the errors persist?

eahydra · 2025-06-17T07:17:29Z

components/accelerator/nvidia/fabric-manager/component.go

+//   - [apiv1.HealthStateTypeHealthy] when GPU doesn't support fabric manager
+//   - [apiv1.HealthStateTypeHealthy] when NVSwitch is not detected (fabric manager not needed)
+//   - [apiv1.HealthStateTypeHealthy] when nv-fabricmanager executable is not found
+//   - [apiv1.HealthStateTypeUnhealthy] when fabric manager executable exists but service is not active


Should we also be able to identify the specific error by SXID? And IMO, if the fabric manager service is inactive, shouldn't we report Degraded instead?
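For reference while discussing this, the four documented cases quoted above can be sketched as a single decision function. This is only an illustration: `fmInputs`, `fabricManagerHealth`, and the local `healthState` type are hypothetical stand-ins rather than the component's actual code, the order of the checks is assumed, and the final "service active means Healthy" case is an assumption that the quoted lines do not state.

```go
package main

import "fmt"

// healthState is a local stand-in for the apiv1 health state type.
// The string values are illustrative, not the real API constants.
type healthState string

const (
	healthy   healthState = "Healthy"
	unhealthy healthState = "Unhealthy"
)

// fmInputs holds hypothetical results of the individual fabric manager checks.
type fmInputs struct {
	gpuSupportsFM    bool // GPU supports fabric manager
	nvswitchDetected bool // NVSwitch is present
	executableFound  bool // nv-fabricmanager executable exists
	serviceActive    bool // fabric manager service is active
}

// fabricManagerHealth mirrors the documented bullets: every "not applicable"
// case is Healthy, and only an installed but inactive service is Unhealthy.
func fabricManagerHealth(in fmInputs) healthState {
	if !in.gpuSupportsFM || !in.nvswitchDetected || !in.executableFound {
		return healthy
	}
	if !in.serviceActive {
		return unhealthy
	}
	// Assumption: an active service is reported as Healthy.
	return healthy
}

func main() {
	inactive := fmInputs{gpuSupportsFM: true, nvswitchDetected: true, executableFound: true}
	fmt.Println(fabricManagerHealth(inactive))
	fmt.Println(fabricManagerHealth(fmInputs{gpuSupportsFM: true}))
}
```

Switching the inactive-service case to Degraded, as suggested above, would change only the second branch of this sketch.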

eahydra · 2025-06-17T07:47:13Z

components/accelerator/nvidia/gpm/component.go

+//   - [apiv1.HealthStateTypeHealthy] when NVIDIA components are unavailable (no NVML, no GPU detected)
+//   - [apiv1.HealthStateTypeUnhealthy] when there's an error getting GPM supported status from any GPU
+//   - [apiv1.HealthStateTypeHealthy] when GPM is not supported by any GPU
+//   - [apiv1.HealthStateTypeUnhealthy] when there's an error getting GPM metrics from any GPU


Should we report Degraded rather than Unhealthy here, given that this component only collects metrics?

eahydra · 2025-06-17T07:52:51Z

components/accelerator/nvidia/gsp-firmware-mode/component.go

+//   - [apiv1.HealthStateTypeHealthy] when all GPUs' GSP firmware modes are successfully retrieved
+//
+// Suggested Actions:
+// This component does not set the [apiv1.HealthState.SuggestedActions] field.


Should we return HW inspection here, or should we introduce a new Human Inspection action?

eahydra · 2025-06-17T07:57:01Z

components/accelerator/nvidia/hw-slowdown/component.go

+//   - [apiv1.HealthStateTypeHealthy] when no event bucket is available
+//   - [apiv1.HealthStateTypeHealthy] when no clock events are found in the evaluation window
+//   - [apiv1.HealthStateTypeHealthy] when HW slowdown events frequency is below threshold
+//   - [apiv1.HealthStateTypeUnhealthy] when HW slowdown events frequency exceeds threshold


This has issues similar to IB flapping. We should consider introducing a similar policy for this case: report Degraded first, then Unhealthy.
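A rough sketch of what a "Degraded first, then Unhealthy" policy could look like, assuming two frequency thresholds instead of the single one documented above. Everything here is hypothetical: `escalate`, the threshold parameters, the example values, and the local `healthState` type are placeholders for discussion, not the component's current behavior or the real `apiv1` constants.

```go
package main

import "fmt"

// healthState is a local stand-in for the apiv1 health state type.
// The string values are illustrative, not the real API constants.
type healthState string

const (
	healthy   healthState = "Healthy"
	degraded  healthState = "Degraded"
	unhealthy healthState = "Unhealthy"
)

// escalate maps an event frequency (events per minute within the evaluation
// window) to a health state using two hypothetical thresholds: a frequency
// above degradedAt reports Degraded, and one above unhealthyAt reports
// Unhealthy. Anything at or below degradedAt stays Healthy.
func escalate(eventsPerMinute, degradedAt, unhealthyAt float64) healthState {
	switch {
	case eventsPerMinute > unhealthyAt:
		return unhealthy
	case eventsPerMinute > degradedAt:
		return degraded
	default:
		return healthy
	}
}

func main() {
	// Example thresholds chosen only for illustration.
	for _, freq := range []float64{0.1, 0.7, 2.5} {
		fmt.Printf("%.1f events/min -> %s\n", freq, escalate(freq, 0.6, 2.0))
	}
}
```

With a two-threshold shape like this, the existing threshold could plausibly become either boundary; which one is a policy decision for the review.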

eahydra · 2025-06-17T08:02:27Z

components/accelerator/nvidia/memory/component.go

+//   - [apiv1.HealthStateTypeUnhealthy] when there's an error getting memory information from any GPU
+//   - [apiv1.HealthStateTypeUnhealthy] when there's an error calculating memory usage percentage


This component just collects metrics, so shouldn't we always report Healthy?

Yes. We used to have a threshold, but it is no longer in the main branch. The component is now used only for reporting metrics.

gyuho · 2025-06-18T02:33:36Z

@eahydra Let's review this in the Google sheet, and I will update this accordingly.

gyuho added this to the v0.5.0 milestone Jun 12, 2025

gyuho self-assigned this Jun 12, 2025

This was referenced Jun 13, 2025

fix(infiniband): set hw inspection when ibstat drop/flaps #924

Merged

feat(network/latency): disable periodic check, set "degraded" health state on egress check issues #925

Merged

gyuho added a commit that referenced this pull request Jun 13, 2025

fix(infiniband): set hw inspection when ibstat drop/flaps (#924)

9410081

Need to update #922. Signed-off-by: Gyuho Lee <gyuhol@nvidia.com>

gyuho added a commit that referenced this pull request Jun 13, 2025

feat(network/latency): lower frequency to 20 minutes, set "degraded" …

2ed30a3

…health state on egress check issues Need to update #922. Signed-off-by: Gyuho Lee <gyuhol@nvidia.com>

gyuho mentioned this pull request Jun 13, 2025

feat(kubelet, containerd, docker): rename component IDs, names #923

Merged

gyuho added a commit that referenced this pull request Jun 13, 2025

feat(network/latency): lower frequency to 20 minutes, set "degraded" …

99a42c5

…health state on egress check issues Need to update #922. Signed-off-by: Gyuho Lee <gyuhol@nvidia.com>

gyuho added a commit that referenced this pull request Jun 13, 2025

feat(kubelet, containerd, docker): rename component IDs, names (#923)

59fca24

- kubelet -> kubelet - containerd-pod -> containerd - docker-container -> docker c.f., #922 Signed-off-by: Gyuho Lee <gyuhol@nvidia.com>

gyuho force-pushed the document-all-health-states branch from 78ab940 to 731c6fa Compare June 14, 2025 10:33

gyuho added a commit that referenced this pull request Jun 14, 2025

feat(network/latency): lower frequency to 20 minutes, set "degraded" …

82c366a

…health state on egress check issues Need to update #922. Signed-off-by: Gyuho Lee <gyuhol@nvidia.com>

doc(components/*): document /v1/states API health field behavior

b576eb7

Signed-off-by: Gyuho Lee <gyuhol@nvidia.com>

gyuho force-pushed the document-all-health-states branch from 731c6fa to b576eb7 Compare June 17, 2025 13:23

eahydra reviewed Jun 17, 2025

gyuho modified the milestones: v0.5.0, v0.5.1 Jun 18, 2025
