-
Notifications
You must be signed in to change notification settings - Fork 61
e2e: add workload health validation to DR ops and refactor cluster name handling #2071
New issue
Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.
By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.
Already on GitHub? Sign in to your account
Conversation
This change converts GetCluster function from e2e/env package to types package as a method on *Env type, encapsulating env functionality by grouping cluster retrieval logic with the Env type it operates on. Signed-off-by: Parikshith <parikshithb@gmail.com>
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Awesome! see comment about simplying error hanlding.
Can you share results with this change? how much time we wait for health in each step, and logs showing that we wait for health. If we don't have these logs we need to add them to make it easier to debug when wait times out.
1e68575
to
e61af30
Compare
|
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Let's understand why the application needs 2m:30s to become healthy after relocate when using appset-deploy-cephfs. We may need to increase relocate timeout to avoid failures on slow setups.
Updated GetCurrentCluster function to return types.Cluster object by looking up the cluster name from PlacementDecision and retrieving the corresponding cluster from the env. Updated the function comment to reflect that it now returns the cluster object. Updated callers to access the full cluster object rather than name string across deployers and DR actions to handle the new return type. Signed-off-by: Parikshith <parikshithb@gmail.com>
Upated getTargetCluster function to return types.Cluster object by looking up the target cluster name and retrieving the cluster from the env. Updated variable naming from targetCluster to targetClusterName. Updated all callers in Failover and Relocate functions to use targetCluster.Name when passing cluster names. Signed-off-by: Parikshith <parikshithb@gmail.com>
Updated failoverRelocate and failoverRelocateDiscoveredApps functions to accept types.Cluster objects instead of cluster name strings for currentCluster and targetCluster parameters. This eliminates redundant GetCluster calls within failoverRelocateDiscoveredApps since the cluster objects are now passed directly from the callers. Updated function calls in Failover and Relocate to pass cluster objects instead of cluster names Signed-off-by: Parikshith <parikshithb@gmail.com>
Updated waitAndUpdateDRPC function to accept types.Cluster object instead of cluster name string, maintaining consistency with other functions that now work with cluster objects. Updated callers in failoverRelocate for managed and disapp to pass cluster objects instead of cluster names. Signed-off-by: Parikshith <parikshithb@gmail.com>
Add debug log when starting to wait for workload health to improve debugging visibility when health checks timeout. Standardize error message format to use namespace/appName and fix method call to getAppName() instead of GetName(). Signed-off-by: Parikshith <parikshithb@gmail.com>
Ran e2e including the workload validation after ops with drpolicies having 5 min and 1m(default) scheduling interval locally: 5m drpolicy used in config:
Run 2:
1m interval ci run result:
In both scenaios, it is not exceeding the timeout (600s), it varies but seems relocate for appset is faster in 5m interval and failover is bit slower when compared with above runs on 1m interval policy. |
The times looks very similar, did you change the drPolicy in the config? The validation logs should show the drpolicy. |
Yup, made sure I updated the drpolicy in config, logged in our logs:
|
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Looks good! one commit message should be fixed.
Add WaitWorkloadHealth checks to all DR operations except unprotect(due to RamenDR#2077) to ensure workload is healthy after operations complete. Updated DiscoveredApp deployer to use the cluster variable instead of ctx.Env().C1 for consistency. These changes ensure workloads are fully operational before considering workload deployments and different DR operations successful. Signed-off-by: Parikshith <parikshithb@gmail.com>
The Health method was incorrectly returning nil (success) even when deployments were not ready, causing WaitWorkloadHealth to immediately succeed without waiting. Now returns proper error with replica status when deployment is not healthy. Signed-off-by: Parikshith <parikshithb@gmail.com>
The Health() method was logging when a deployment is ready, but this is redundant since the caller (WaitWorkloadHealth) already logs both the "waiting" and "healthy" status messages. This eliminates duplicate log entries for the same event. Signed-off-by: Parikshith <parikshithb@gmail.com>
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Thanks!
@parikshithb please send a ramenctl PR to consume this change. |
Consuming fix for validating workload health after DR operations: RamenDR/ramen#2071 Issue fixed in ramen e2e: RamenDR/ramen#2018 Signed-off-by: Parikshith <parikshithb@gmail.com>
Consuming fix for validating workload health after DR operations: RamenDR/ramen#2071 Issue fixed in ramen e2e: RamenDR/ramen#2018 Signed-off-by: Parikshith <parikshithb@gmail.com>
This PR adds workload health validation to all DR operations and deployer methods to ensure workloads are fully operational before considering operations successful. The implementation required refactoring to work with cluster objects instead of cluster name strings.
Changes Made
Workload Health Validation:
Cluster Handling Refactoring:
Moved GetCluster from standalone function to method on *Env type, updated functions to return and accept types.Cluster objects instead of cluster name strings, and refactored failoverRelocate functions to eliminate redundant cluster lookups.
Fixes #2018