-
Notifications
You must be signed in to change notification settings - Fork 536
[RayService] don't update serveConfigV2 in current ray cluster if ray… #3559
New issue
Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.
By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.
Already on GitHub? Sign in to your account
base: master
Are you sure you want to change the base?
[RayService] don't update serveConfigV2 in current ray cluster if ray… #3559
Conversation
This is the screenshot of reproduction. After applying the fix: (the script periodically check if the new serveConfigV2 has been updated or not until the ray cluster has been terminated.) Attach the reproduction script for your reference. |
@kevin85421 PTAL |
… cluster spec has changed at the same time. Signed-off-by: fscnick <fscnick.dev@gmail.com>
Signed-off-by: fscnick <fscnick.dev@gmail.com>
e54f5ee
to
c02c81f
Compare
if err != nil && | ||
(strings.Contains(err.Error(), "exit code 52") || | ||
strings.Contains(err.Error(), "exit code 6")) { | ||
// which is "curl: (52) Empty reply from server" or "curl: (6) Couldn't resolve host" | ||
// it implies the ray cluster is terminated. | ||
break | ||
} | ||
|
||
// Except the errors above, it should not occur. | ||
g.Expect(err).NotTo(HaveOccurred()) | ||
|
||
g.Expect(stdout.String()).NotTo(ContainSubstring("\"price\": 456"), "new price should not be updated on the old ray cluster") |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
This test condition is not that straight forward. It would keep looping because it should check serveConfigV2 is not modified until it terminated.
There are a few of explanations about these conditions.
- check the request to dashboard is unreachable, which implies the ray cluster terminated. it will exit the loop. (code 52 or 6 might happen). This is test pass if the following condition is not happened.
- check the serveConfigV2 is not contain updated price until the ray cluster terminated. If it contains the updated price, test fails immediately.
Hi @kevin85421 , the test is ready. PTAL |
cc @rueian would you mind reviewing this PR? Thanks. |
// Except the errors above, it should not occur. | ||
g.Expect(err).NotTo(HaveOccurred()) | ||
|
||
g.Expect(stdout.String()).NotTo(ContainSubstring("\"price\": 456"), "new price should not be updated on the old ray cluster") |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
The implementation looks good to me, but I am afraid that this g.Expect(stdout.String()).NotTo(ContainSubstring("\"price\": 456"), "new price should not be updated on the old ray cluster")
could never hit due to timing races. Could we simply wait for the old cluster to be destroyed and verify we don't reconcileServe
on it by checking there are no new UpdatedServeApplications
or FailedToUpdateServeApplications
events on the cluster?
@@ -82,3 +86,97 @@ func TestRayServiceInPlaceUpdate(t *testing.T) { | |||
g.Expect(stdout.String()).To(Equal("9 pizzas please!")) | |||
}, TestTimeoutShort).Should(Succeed()) | |||
} | |||
|
|||
func TestRayServiceInPlaceUpdateWithRayClusterSpec(t *testing.T) { |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Can we have another test for zero time upgrade enabled?
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
It looks like enabled already.
There is no upgradeStrategy
in RayServiceSampleYamlApplyConfiguration()
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Okay, then can we have a test for zero downtime upgrade being disabled?
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
I'm a bit of confusing about what it should do if zero downtime upgrade is disabled.
If it is disabled, it seems the new RayCluster wouldn't be created. The serveConfigV2 will be update on the existed cluster. Which might be an potential issue that this pr is trying to solve. Because ShouldPrepareNewCluster(...) wouldn't be true under these conditions.
However, if serveConfigV2 is not allow to update on the existed cluster under zero downtime upgrade is disabled with RayClusterSpec and serveConfigV2 modified at the same time. It might violate the design https://docs.ray.io/en/master/cluster/kubernetes/user-guides/rayservice.html#step-7-in-place-update-for-ray-serve-applications.
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
If it is disabled, it seems the new RayCluster wouldn't be created. The serveConfigV2 will be update on the existed cluster.
I think that is the correct behavior. Is there any existing test guarantee that already?
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
It seems no such test so add a test without zero downtime upgrade.
…lications happen or not Signed-off-by: fscnick <fscnick.dev@gmail.com>
g.Eventually(func() string { | ||
_, err := GetRayCluster(test, activeRayClusterBeforeUpdate.Namespace, activeRayClusterBeforeUpdate.Name) | ||
if err == nil { | ||
return "" | ||
} | ||
return err.Error() | ||
}, TestTimeoutMedium).Should(ContainSubstring(fmt.Sprintf("rayclusters.ray.io \"%s\" not found", activeRayClusterBeforeUpdate.Name))) |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
g.Eventually(func() string { | |
_, err := GetRayCluster(test, activeRayClusterBeforeUpdate.Namespace, activeRayClusterBeforeUpdate.Name) | |
if err == nil { | |
return "" | |
} | |
return err.Error() | |
}, TestTimeoutMedium).Should(ContainSubstring(fmt.Sprintf("rayclusters.ray.io \"%s\" not found", activeRayClusterBeforeUpdate.Name))) | |
g.Eventually(func() bool { | |
_, err := GetRayCluster(test, activeRayClusterBeforeUpdate.Namespace, activeRayClusterBeforeUpdate.Name) | |
return k8sApiErrors.IsNotFound(err) | |
}, TestTimeoutMedium).Should(BeTrue()) |
|
||
// event UpdatedServeApplications or FailedToUpdateServeApplications should not occur on activeRayClusterBeforeUpdate after RayService update. | ||
for _, event := range events { | ||
if event.CreationTimestamp.Before(updateTimestamp) { |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
shouldn't we test the LastTimestamp
?
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
LastTimestamp
might be later than CreationTimestamp
which might need to consider more situation. It would be simple to choose CreationTimestamp
. Kindly let me know if there are any other concern.
CreationTimestamp
definition:
https://github.com/kubernetes/apimachinery/blob/202cba0f14e50d9b4106b34623f8c8056db3f86e/pkg/apis/meta/v1/types.go#L179-L188
FirstTimestamp
and LastTimestamp
definition:
https://github.com/kubernetes/api/blob/f7e72be095ee5a3e60d9df2f79b6591ee329734a/core/v1/types.go#L7205-L7211
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Yes, LastTimestamp could be later than CreationTimestamp because k8s will aggregate the same event into the old one, therefore if you skip events by the CreationTimestamp, you might skip all the events.
How about you first check if the reason is what we want to test, and then test the LastTimestamp?
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Oh, the LastTimestamp should record the event we update to the new cluster. How about test the Count = 2?
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
I'm not sure if it is okay to test the Count == 2. It seems message are excluded from aggregation key so that it is not knowing that belong to which RayCluster. However, if it happens more than 2 or less than 2, it definitely goes wrong. But it seems no other information in the event to tell which is which.
@@ -90,6 +90,22 @@ func CurlRayServicePod( | |||
return ExecPodCmd(t, curlPod, curlPodContainerName, cmd) | |||
} | |||
|
|||
func CurlRayClusterDashboard( |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
I guess this and the ExecPodCmdReturnError can be removed.
ray-operator/test/support/events.go
Outdated
@@ -128,3 +131,15 @@ func getWhitespaceStr(size int) string { | |||
} | |||
return whiteSpaceStr | |||
} | |||
|
|||
func RayServiceEvents(t Test, rayService *rayv1.RayService) ([]corev1.Event, error) { |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
This by itself seems too small to be a helper.
Signed-off-by: fscnick <fscnick.dev@gmail.com>
Signed-off-by: fscnick <fscnick.dev@gmail.com>
|
||
// event UpdatedServeApplications or FailedToUpdateServeApplications should not occur on activeRayClusterBeforeUpdate after RayService update. | ||
for _, event := range events { | ||
if event.CreationTimestamp.Before(updateTimestamp) { |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Yes, LastTimestamp could be later than CreationTimestamp because k8s will aggregate the same event into the old one, therefore if you skip events by the CreationTimestamp, you might skip all the events.
How about you first check if the reason is what we want to test, and then test the LastTimestamp?
g.Expect(err).NotTo(HaveOccurred()) | ||
|
||
updateTimestamp := rayService.Status.LastUpdateTime | ||
|
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Add a curl test to make sure the new serve config is updated on the new cluster.
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Add it after checking the old RayCluster has terminated.
Signed-off-by: fscnick <fscnick.dev@gmail.com>
3c44f1d
to
78c45cc
Compare
Signed-off-by: fscnick <fscnick.dev@gmail.com>
Why are these changes needed?
When the serveConfigV2 and RayCluster has changed at the same time, it would also update the current active ray cluster with the new serveConfigV2. It might be possible to cause a failure if something in the runtime environment is different.
Related issue number
Closes #3423
Checks