10000 [RayService] don't update serveConfigV2 in current ray cluster if ray… by fscnick · Pull Request #3559 · ray-project/kuberay · GitHub
[go: up one dir, main page]
More Web Proxy on the site http://driver.im/
Skip to content

[RayService] don't update serveConfigV2 in current ray cluster if ray… #3559

New issue

Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.

By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.

Already on GitHub? Sign in to your account

Open
wants to merge 7 commits into
base: master
Choose a base branch
from

Conversation

fscnick
Copy link
Contributor
@fscnick fscnick commented May 7, 2025

Why are these changes needed?

When the serveConfigV2 and RayCluster has changed at the same time, it would also update the current active ray cluster with the new serveConfigV2. It might be possible to cause a failure if something in the runtime environment is different.

Related issue number

Closes #3423

Checks

  • I've made sure the tests are passing.
  • Testing Strategy
    • Unit tests
    • Manual tests
    • This PR is not tested :(

@fscnick
Copy link
Contributor Author
fscnick commented May 7, 2025

This is the screenshot of reproduction.

Before applying the fix:
without fix

After applying the fix: (the script periodically check if the new serveConfigV2 has been updated or not until the ray cluster has been terminated.)
with fix

Attach the reproduction script for your reference.
issue-3423-reproduce.sh.txt

@fscnick
Copy link
Contributor Author
fscnick commented May 7, 2025

@kevin85421 PTAL

fscnick added 2 commits May 8, 2025 00:46
… cluster spec has changed at the same time.

Signed-off-by: fscnick <fscnick.dev@gmail.com>
Signed-off-by: fscnick <fscnick.dev@gmail.com>
@fscnick fscnick force-pushed the update-serve-config-v2-not-affecting-running-cluster-2 branch from e54f5ee to c02c81f Compare May 7, 2025 16:49
Comment on lines 169 to 180
if err != nil &&
(strings.Contains(err.Error(), "exit code 52") ||
strings.Contains(err.Error(), "exit code 6")) {
// which is "curl: (52) Empty reply from server" or "curl: (6) Couldn't resolve host"
// it implies the ray cluster is terminated.
break
}

// Except the errors above, it should not occur.
g.Expect(err).NotTo(HaveOccurred())

g.Expect(stdout.String()).NotTo(ContainSubstring("\"price\": 456"), "new price should not be updated on the old ray cluster")
Copy link
Contributor Author
@fscnick fscnick May 7, 2025

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

This test condition is not that straight forward. It would keep looping because it should check serveConfigV2 is not modified until it terminated.
There are a few of explanations about these conditions.

  • check the request to dashboard is unreachable, which implies the ray cluster terminated. it will exit the loop. (code 52 or 6 might happen). This is test pass if the following condition is not happened.
  • check the serveConfigV2 is not contain updated price until the ray cluster terminated. If it contains the updated price, test fails immediately.

@fscnick fscnick marked this pull request as ready for review May 9, 2025 13:08
@fscnick
Copy link
Contributor Author
fscnick commented May 9, 2025

Hi @kevin85421 , the test is ready. PTAL

@kevin85421
Copy link
Member

cc @rueian would you mind reviewing this PR? Thanks.

// Except the errors above, it should not occur.
g.Expect(err).NotTo(HaveOccurred())

g.Expect(stdout.String()).NotTo(ContainSubstring("\"price\": 456"), "new price should not be updated on the old ray cluster")
Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

The implementation looks good to me, but I am afraid that this g.Expect(stdout.String()).NotTo(ContainSubstring("\"price\": 456"), "new price should not be updated on the old ray cluster") could never hit due to timing races. Could we simply wait for the old cluster to be destroyed and verify we don't reconcileServe on it by checking there are no new UpdatedServeApplications or FailedToUpdateServeApplications events on the cluster?

@@ -82,3 +86,97 @@ func TestRayServiceInPlaceUpdate(t *testing.T) {
g.Expect(stdout.String()).To(Equal("9 pizzas please!"))
}, TestTimeoutShort).Should(Succeed())
}

func TestRayServiceInPlaceUpdateWithRayClusterSpec(t *testing.T) {
Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Can we have another test for zero time upgrade enabled?

Copy link
Contributor Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

It looks like enabled already.

There is no upgradeStrategy in RayServiceSampleYamlApplyConfiguration()

screenshot:
zero downtime 01
zero downtime 02
zero downtime 03

Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Okay, then can we have a test for zero downtime upgrade being disabled?

Copy link
Contributor Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

I'm a bit of confusing about what it should do if zero downtime upgrade is disabled.

If it is disabled, it seems the new RayCluster wouldn't be created. The serveConfigV2 will be update on the existed cluster. Which might be an potential issue that this pr is trying to solve. Because ShouldPrepareNewCluster(...) wouldn't be true under these conditions.

However, if serveConfigV2 is not allow to update on the existed cluster under zero downtime upgrade is disabled with RayClusterSpec and serveConfigV2 modified at the same time. It might violate the design https://docs.ray.io/en/master/cluster/kubernetes/user-guides/rayservice.html#step-7-in-place-update-for-ray-serve-applications.

Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

If it is disabled, it seems the new RayCluster wouldn't be created. The serveConfigV2 will be update on the existed cluster.

I think that is the correct behavior. Is there any existing test guarantee that already?

Copy link
Contributor Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

It seems no such test so add a test without zero downtime upgrade.

…lications happen or not

Signed-off-by: fscnick <fscnick.dev@gmail.com>
@fscnick fscnick requested a review from rueian May 11, 2025 11:48
Comment on lines 156 to 162
g.Eventually(func() string {
_, err := GetRayCluster(test, activeRayClusterBeforeUpdate.Namespace, activeRayClusterBeforeUpdate.Name)
if err == nil {
return ""
}
return err.Error()
}, TestTimeoutMedium).Should(ContainSubstring(fmt.Sprintf("rayclusters.ray.io \"%s\" not found", activeRayClusterBeforeUpdate.Name)))
Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Suggested change
g.Eventually(func() string {
_, err := GetRayCluster(test, activeRayClusterBeforeUpdate.Namespace, activeRayClusterBeforeUpdate.Name)
if err == nil {
return ""
}
return err.Error()
}, TestTimeoutMedium).Should(ContainSubstring(fmt.Sprintf("rayclusters.ray.io \"%s\" not found", activeRayClusterBeforeUpdate.Name)))
g.Eventually(func() bool {
_, err := GetRayCluster(test, activeRayClusterBeforeUpdate.Namespace, activeRayClusterBeforeUpdate.Name)
return k8sApiErrors.IsNotFound(err)
}, TestTimeoutMedium).Should(BeTrue())


// event UpdatedServeApplications or FailedToUpdateServeApplications should not occur on activeRayClusterBeforeUpdate after RayService update.
for _, event := range events {
if event.CreationTimestamp.Before(updateTimestamp) {
Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

shouldn't we test the LastTimestamp?

Copy link
Contributor Author
@fscnick fscnick May 12, 2025

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

LastTimestamp might be later than CreationTimestamp which might need to consider more situation. It would be simple to choose CreationTimestamp. Kindly let me know if there are any other concern.
lasttimestamp

CreationTimestamp definition:
https://github.com/kubernetes/apimachinery/blob/202cba0f14e50d9b4106b34623f8c8056db3f86e/pkg/apis/meta/v1/types.go#L179-L188

FirstTimestamp and LastTimestamp definition:
https://github.com/kubernetes/api/blob/f7e72be095ee5a3e60d9df2f79b6591ee329734a/core/v1/types.go#L7205-L7211

Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Yes, LastTimestamp could be later than CreationTimestamp because k8s will aggregate the same event into the old one, therefore if you skip events by the CreationTimestamp, you might skip all the events.

How about you first check if the reason is what we want to test, and then test the LastTimestamp?

Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Oh, the LastTimestamp should record the event we update to the new cluster. How about test the Count = 2?

Copy link
Contributor Author
@fscnick fscnick May 12, 2025

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

I'm not sure if it is okay to test the Count == 2. It seems message are excluded from aggregation key so that it is not knowing that belong to which RayCluster. However, if it happens more than 2 or less than 2, it definitely goes wrong. But it seems no other information in the event to tell which is which.

@@ -90,6 +90,22 @@ func CurlRayServicePod(
return ExecPodCmd(t, curlPod, curlPodContainerName, cmd)
}

func CurlRayClusterDashboard(
Copy li 9E7A nk
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

I guess this and the ExecPodCmdReturnError can be removed.

@@ -128,3 +131,15 @@ func getWhitespaceStr(size int) string {
}
return whiteSpaceStr
}

func RayServiceEvents(t Test, rayService *rayv1.RayService) ([]corev1.Event, error) {
Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

This by itself seems too small to be a helper.

fscnick added 2 commits May 12, 2025 20:54
Signed-off-by: fscnick <fscnick.dev@gmail.com>
Signed-off-by: fscnick <fscnick.dev@gmail.com>

// event UpdatedServeApplications or FailedToUpdateServeApplications should not occur on activeRayClusterBeforeUpdate after RayService update.
for _, event := range events {
if event.CreationTimestamp.Before(updateTimestamp) {
Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Yes, LastTimestamp could be later than CreationTimestamp because k8s will aggregate the same event into the old one, therefore if you skip events by the CreationTimestamp, you might skip all the events.

How about you first check if the reason is what we want to test, and then test the LastTimestamp?

g.Expect(err).NotTo(HaveOccurred())

updateTimestamp := rayService.Status.LastUpdateTime

Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Add a curl test to make sure the new serve config is updated on the new cluster.

Copy link
Contributor Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Add it after checking the old RayCluster has terminated.

Signed-off-by: fscnick <fscnick.dev@gmail.com>
@fscnick fscnick force-pushed the update-serve-config-v2-not-affecting-running-cluster-2 branch from 3c44f1d to 78c45cc Compare May 12, 2025 18:39
Signed-off-by: fscnick <fscnick.dev@gmail.com>
@fscnick fscnick requested a review from rueian May 13, 2025 12:30
Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment
Labels
None yet
Projects
None yet
Development

Successfully merging this pull request may close these issues.

[Bug] KubeRay applies serve config to the running ray cluster when image is updated
3 participants
0