[SLI Metrics] RayClusterHeadPodReady by owenowenisme · Pull Request #3571 · ray-project/kuberay · GitHub

[SLI Metrics] RayClusterHeadPodReady #3571


Open: wants to merge 3 commits into master

Conversation

owenowenisme
Contributor
@owenowenisme owenowenisme commented May 9, 2025

Why are these changes needed?

Manual test:

❯ curl -s localhost:8080/metrics | grep  kuberay_cluster_head_pod_ready
# HELP kuberay_cluster_head_pod_ready Describes whether the ray cluster is ready to serve requests
# TYPE kuberay_cluster_head_pod_ready gauge
kuberay_cluster_head_pod_ready{condition="false",name="raycluster-kuberay",namespace="default"} 1
kuberay_cluster_head_pod_ready{condition="true",name="raycluster-kuberay",namespace="default"} 0
❯ curl -s localhost:8080/metrics | grep  kuberay_cluster_head_pod_ready
# HELP kuberay_cluster_head_pod_ready Describes whether the ray cluster is ready to serve requests
# TYPE kuberay_cluster_head_pod_ready gauge
kuberay_cluster_head_pod_ready{condition="false",name="raycluster-kuberay",namespace="default"} 0
kuberay_cluster_head_pod_ready{condition="true",name="raycluster-kuberay",namespace="default"} 1

Related issue number

Checks

  • I've made sure the tests are passing.
  • Testing Strategy
    • Unit tests
    • Manual tests
    • This PR is not tested :(

@owenowenisme
Contributor Author

@win5923 @troychiu
Could you please take a look at my implementation first, since this is my first SLI metrics PR?
I'll add unit tests to this PR after you approve the implementation.

Comment on lines 58 to 61
func (c *RayClusterMetricsManager) ObserveRayClusterHeadPodReady(name, namespace string, ready bool) {
	c.rayClusterHeadPodReady.DeleteLabelValues(name, namespace, strconv.FormatBool(!ready))
	c.rayClusterHeadPodReady.WithLabelValues(name, namespace, strconv.FormatBool(ready)).Set(1)
}
Contributor
@win5923 win5923 May 11, 2025


Another approach is to follow the convention used by kube-state-metrics, where a status condition is represented with a condition label (e.g. "true", "false", "unknown") and the metric value is either 0 or 1. In this model, each resource emits multiple series for the same status, but only one of them has a value of 1 at any given time, indicating the current state.

Users can also determine when a given resource changed state, which provides valuable insight for auditing, debugging, and historical analysis.

However, this approach comes with a trade-off: it increases metric cardinality, since each resource now generates multiple labeled series, one per condition value.
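The convention described above can be sketched as follows. This is a minimal, dependency-free illustration with hypothetical names; a real implementation would register a `GaugeVec` with prometheus/client_golang rather than use a plain map:

```go
package main

import (
	"fmt"
	"strconv"
)

// seriesKey identifies one labeled series: {name, namespace, condition}.
type seriesKey struct {
	name, namespace, condition string
}

// headPodReady holds the gauge value for every emitted series.
var headPodReady = map[seriesKey]float64{}

// observeHeadPodReady follows the kube-state-metrics convention: one series
// per condition value ("true"/"false"), with exactly one of them set to 1.
func observeHeadPodReady(name, namespace string, ready bool) {
	for _, cond := range []bool{true, false} {
		key := seriesKey{name, namespace, strconv.FormatBool(cond)}
		if cond == ready {
			headPodReady[key] = 1
		} else {
			headPodReady[key] = 0
		}
	}
}

func main() {
	observeHeadPodReady("raycluster-kuberay", "default", true)
	for _, cond := range []string{"false", "true"} {
		k := seriesKey{"raycluster-kuberay", "default", cond}
		fmt.Printf("kuberay_cluster_head_pod_ready{condition=%q,name=%q,namespace=%q} %g\n",
			cond, k.name, k.namespace, headPodReady[k])
	}
}
```

Running this prints both series, with only the `condition="true"` series at 1, matching the second `curl` output in the PR description. The key difference from the delete-and-set approach is that the inactive series stays visible at 0 instead of disappearing, so state transitions remain queryable over time.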

Contributor


I think the current approach also works well — it's simple and straightforward, so perhaps we can continue with it as is.

Contributor Author


I changed it to the way you recommended, so we don't need to change it in the future.

Signed-off-by: You-Cheng Lin <mses010108@gmail.com>
@owenowenisme force-pushed the SLI-Metrics/rayClusterHeadPodReady branch from 6e29a16 to b8b8f0c on May 12, 2025 at 13:06
@owenowenisme requested a review from win5923 on May 12, 2025 at 14:57
@owenowenisme marked this pull request as ready for review on May 12, 2025 at 14:57
Contributor
@win5923 left a comment

LGTM, can you also update the description?

@owenowenisme
Contributor Author

cc @kevin85421 for review

Member
@kevin85421 left a comment

Based on our previous discussion, we've decided to use kube-state-metrics for this instead. cc @troychiu @win5923?

https://docs.google.com/document/d/1zNiE7lVZYjhrxlTbh1UXOVpR6hh1GIeSfCfE9Lt5v6Y/edit?tab=t.0


@win5923
Contributor
win5923 commented May 12, 2025

> Based on our previous discussion, we've decided to use kube-state-metrics for this instead. cc @troychiu @win5923?
>
> https://docs.google.com/document/d/1zNiE7lVZYjhrxlTbh1UXOVpR6hh1GIeSfCfE9Lt5v6Y/edit?tab=t.0

Yes, but I saw Troy mention on Slack that we could do this first, so I thought we were supposed to keep this metric.

@win5923
Contributor
win5923 commented May 12, 2025

By default, kube-state-metrics does not expose pod labels to Prometheus. To include them, we also need to configure the --metric-labels-allowlist flag. This allows us to perform aggregations and use labels to identify, for example, which pod is the head pod and which RayCluster it belongs to.

In this case, I think kuberay_cluster_head_pod_ready can leverage kube_pod_status_ready.
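As a sketch of how that could look (assuming kube-state-metrics v2+ with the allowlist flag below, and KubeRay's ray.io/cluster and ray.io/node-type pod labels; the exact labels used by a given deployment are an assumption here):

```promql
# kube-state-metrics flag to expose the relevant pod labels:
#   --metric-labels-allowlist=pods=[ray.io/cluster,ray.io/node-type]
#
# Head-pod readiness per RayCluster, joining kube_pod_status_ready with the
# pod's labels (kube-state-metrics sanitizes ray.io/cluster to
# label_ray_io_cluster):
max by (label_ray_io_cluster, namespace) (
  kube_pod_status_ready{condition="true"}
  * on (namespace, pod) group_left(label_ray_io_cluster)
    kube_pod_labels{label_ray_io_node_type="head"}
)
```

Since kube_pod_labels always has a value of 1, the multiplication preserves the readiness value while attaching the RayCluster label to the result.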

@troychiu
Contributor

Yeah, I think our conclusion is to use kube-state-metrics.

4 participants