e2e: add workload health validation to DR ops and refactor cluster name handling #2071

parikshithb · 2025-05-30T11:48:08Z

This PR adds workload health validation to all DR operations and deployer methods to ensure workloads are fully operational before considering operations successful. The implementation required refactoring to work with cluster objects instead of cluster name strings.

Changes Made

Workload Health Validation:

Added health validation after all deployment operations (ApplicationSet & Subscription)
Added health validation after DR protection enable operation for both regular and discovered apps
Added health validation after failover/relocate operations to ensure workloads are healthy on target clusters

Cluster Handling Refactoring:
Moved GetCluster from standalone function to method on *Env type, updated functions to return and accept types.Cluster objects instead of cluster name strings, and refactored failoverRelocate functions to eliminate redundant cluster lookups.

Fixes #2018

This change converts GetCluster function from e2e/env package to types package as a method on *Env type, encapsulating env functionality by grouping cluster retrieval logic with the Env type it operates on. Signed-off-by: Parikshith <parikshithb@gmail.com>
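For readers without the code open, here is a minimal sketch of what a cluster lookup looks like as a method on `*Env`. The `Cluster` and `Env` types are simplified stand-ins, and the `Hub` and `C2` fields and the error text are assumptions for illustration, not the actual definitions in the types package.

```go
package main

import "fmt"

// Cluster and Env are simplified stand-ins for the e2e types.
// Field names other than C1 are assumptions for illustration.
type Cluster struct {
	Name string
}

type Env struct {
	Hub Cluster
	C1  Cluster
	C2  Cluster
}

// GetCluster returns the cluster with the given name, or an error if the
// name does not match any cluster in the env.
func (e *Env) GetCluster(name string) (Cluster, error) {
	for _, c := range []Cluster{e.Hub, e.C1, e.C2} {
		if c.Name == name {
			return c, nil
		}
	}
	return Cluster{}, fmt.Errorf("cluster %q not found in env", name)
}

func main() {
	env := &Env{
		Hub: Cluster{Name: "hub"},
		C1:  Cluster{Name: "dr1"},
		C2:  Cluster{Name: "dr2"},
	}
	cluster, err := env.GetCluster("dr2")
	if err != nil {
		fmt.Println("error:", err)
		return
	}
	fmt.Println("found cluster", cluster.Name)
}
```

Keeping the lookup on the env means callers that already hold an env can resolve a name once and pass the resulting cluster object along.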

nirs

Awesome! See comment about simplifying error handling.

Can you share results with this change? How much time do we wait for health in each step, and are there logs showing that we wait for health? If we don't have these logs we need to add them to make it easier to debug when the wait times out.

e2e/dractions/retry.go

parikshithb · 2025-06-02T13:51:03Z

Workload validation after deploy takes <10 seconds, except for appset apps, which take <1 minute:

2025-06-02T16:51:12.772+0530	INFO	appset-deploy-rbd-busybox	deployers/appset.go:23	Deploying applicationset app "e2e-appset-deploy-rbd-busybox/busybox" in cluster "dr1"
2025-06-02T16:51:13.043+0530	DEBUG	appset-deploy-rbd-busybox	deployers/retry.go:46	Waiting until workload "e2e-appset-deploy-rbd-busybox/busybox" is healthy in cluster "dr1"
2025-06-02T16:52:03.144+0530	DEBUG	appset-deploy-rbd-busybox	workloads/deploy.go:95	Deployment "e2e-appset-deploy-rbd-busybox/busybox" is ready in cluster "dr1"
2025-06-02T16:52:03.144+0530	DEBUG	appset-deploy-rbd-busybox	deployers/retry.go:51	Workload "e2e-appset-deploy-rbd-busybox/busybox" is healthy in cluster "dr1"
2025-06-02T16:52:03.144+0530	INFO	appset-deploy-rbd-busybox	deployers/appset.go:49	Workload deployed

2025-06-02T16:51:12.773+0530	INFO	appset-deploy-cephfs-busybox	deployers/appset.go:23	Deploying applicationset app "e2e-appset-deploy-cephfs-busybox/busybox" in cluster "dr1"
2025-06-02T16:51:13.078+0530	DEBUG	appset-deploy-cephfs-busybox	deployers/retry.go:46	Waiting until workload "e2e-appset-deploy-cephfs-busybox/busybox" is healthy in cluster "dr1"
2025-06-02T16:52:03.207+0530	DEBUG	appset-deploy-cephfs-busybox	workloads/deploy.go:95	Deployment "e2e-appset-deploy-cephfs-busybox/busybox" is ready in cluster "dr1"
2025-06-02T16:52:03.207+0530	DEBUG	appset-deploy-cephfs-busybox	deployers/retry.go:51	Workload "e2e-appset-deploy-cephfs-busybox/busybox" is healthy in cluster "dr1"
2025-06-02T16:52:03.207+0530	INFO	appset-deploy-cephfs-busybox	deployers/appset.go:49	Workload deployed

2025-06-02T16:51:12.773+0530	INFO	subscr-deploy-cephfs-busybox	deployers/subscr.go:41	Deploying subscription app "e2e-subscr-deploy-cephfs-busybox/busybox" in cluster "dr1"
2025-06-02T16:51:24.473+0530	DEBUG	subscr-deploy-cephfs-busybox	deployers/retry.go:46	Waiting until workload "e2e-subscr-deploy-cephfs-busybox/busybox" is healthy in cluster "dr1"
2025-06-02T16:51:34.494+0530	DEBUG	subscr-deploy-cephfs-busybox	workloads/deploy.go:95	Deployment "e2e-subscr-deploy-cephfs-busybox/busybox" is ready in cluster "dr1"
2025-06-02T16:51:34.494+0530	DEBUG	subscr-deploy-cephfs-busybox	deployers/retry.go:51	Workload "e2e-subscr-deploy-cephfs-busybox/busybox" is healthy in cluster "dr1"
2025-06-02T16:51:34.494+0530	INFO	subscr-deploy-cephfs-busybox	deployers/subscr.go:78	Workload deployed

2025-06-02T16:51:12.773+0530	INFO	subscr-deploy-rbd-busybox	deployers/subscr.go:41	Deploying subscription app "e2e-subscr-deploy-rbd-busybox/busybox" in cluster "dr1"
2025-06-02T16:51:29.503+0530	DEBUG	subscr-deploy-rbd-busybox	deployers/retry.go:46	Waiting until workload "e2e-subscr-deploy-rbd-busybox/busybox" is healthy in cluster "dr1"
2025-06-02T16:51:39.524+0530	DEBUG	subscr-deploy-rbd-busybox	workloads/deploy.go:95	Deployment "e2e-subscr-deploy-rbd-busybox/busybox" is ready in cluster "dr1"
2025-06-02T16:51:39.524+0530	DEBUG	subscr-deploy-rbd-busybox	deployers/retry.go:51	Workload "e2e-subscr-deploy-rbd-busybox/busybox" is healthy in cluster "dr1"
2025-06-02T16:51:39.524+0530	INFO	subscr-deploy-rbd-busybox	deployers/subscr.go:78	Workload deployed

2025-06-02T16:51:12.880+0530	INFO	disapp-deploy-cephfs-busybox	deployers/disapp.go:47	Deploying discovered app "e2e-disapp-deploy-cephfs-busybox/busybox" in cluster "dr1"
2025-06-02T16:51:16.080+0530	DEBUG	disapp-deploy-cephfs-busybox	deployers/retry.go:46	Waiting until workload "e2e-disapp-deploy-cephfs-busybox/busybox" is healthy in cluster "dr1"
2025-06-02T16:51:16.108+0530	DEBUG	disapp-deploy-cephfs-busybox	workloads/deploy.go:95	Deployment "e2e-disapp-deploy-cephfs-busybox/busybox" is ready in cluster "dr1"
2025-06-02T16:51:16.108+0530	DEBUG	disapp-deploy-cephfs-busybox	deployers/retry.go:51	Workload "e2e-disapp-deploy-cephfs-busybox/busybox" is healthy in cluster "dr1"
2025-06-02T16:51:16.108+0530	INFO	disapp-deploy-cephfs-busybox	deployers/disapp.go:66	Workload deployed

2025-06-02T16:51:12.880+0530	INFO	disapp-deploy-rbd-busybox	deployers/disapp.go:47	Deploying discovered app "e2e-disapp-deploy-rbd-busybox/busybox" in cluster "dr1"
2025-06-02T16:51:15.983+0530	DEBUG	disapp-deploy-rbd-busybox	deployers/retry.go:46	Waiting until workload "e2e-disapp-deploy-rbd-busybox/busybox" is healthy in cluster "dr1"
2025-06-02T16:51:16.010+0530	DEBUG	disapp-deploy-rbd-busybox	workloads/deploy.go:95	Deployment "e2e-disapp-deploy-rbd-busybox/busybox" is ready in cluster "dr1"
2025-06-02T16:51:16.010+0530	DEBUG	disapp-deploy-rbd-busybox	deployers/retry.go:51	Workload "e2e-disapp-deploy-rbd-busybox/busybox" is healthy in cluster "dr1"
2025-06-02T16:51:16.010+0530	INFO	disapp-deploy-rbd-busybox	deployers/disapp.go:66	Workload deployed

In protect and unprotect, workload validation completes in milliseconds for all apps.
In failover as well, workload validation completes in milliseconds for all apps:

2025-06-02T16:52:46.792+0530	INFO	disapp-deploy-rbd-busybox	dractions/actions.go:163	Failing over workload "e2e-disapp-deploy-rbd-busybox/busybox" from cluster "dr1" to cluster "dr2"
2025-06-02T16:56:23.021+0530	DEBUG	disapp-deploy-rbd-busybox	deployers/retry.go:46	Waiting until workload "e2e-disapp-deploy-rbd-busybox/busybox" is healthy in cluster "dr2"
2025-06-02T16:56:23.036+0530	DEBUG	disapp-deploy-rbd-busybox	workloads/deploy.go:95	Deployment "e2e-disapp-deploy-rbd-busybox/busybox" is ready in cluster "dr2"
2025-06-02T16:56:23.036+0530	DEBUG	disapp-deploy-rbd-busybox	deployers/retry.go:51	Workload "e2e-disapp-deploy-rbd-busybox/busybox" is healthy in cluster "dr2"
2025-06-02T16:56:23.036+0530	INFO	disapp-deploy-rbd-busybox	dractions/actions.go:171	Workload failed over

2025-06-02T16:52:51.975+0530	INFO	disapp-deploy-cephfs-busybox	dractions/actions.go:163	Failing over workload "e2e-disapp-deploy-cephfs-busybox/busybox" from cluster "dr1" to cluster "dr2"
2025-06-02T16:56:51.023+0530	DEBUG	disapp-deploy-cephfs-busybox	deployers/retry.go:46	Waiting until workload "e2e-disapp-deploy-cephfs-busybox/busybox" is healthy in cluster "dr2"
2025-06-02T16:56:51.028+0530	DEBUG	disapp-deploy-cephfs-busybox	workloads/deploy.go:95	Deployment "e2e-disapp-deploy-cephfs-busybox/busybox" is ready in cluster "dr2"
2025-06-02T16:56:51.028+0530	DEBUG	disapp-deploy-cephfs-busybox	deployers/retry.go:51	Workload "e2e-disapp-deploy-cephfs-busybox/busybox" is healthy in cluster "dr2"
2025-06-02T16:56:51.028+0530	INFO	disapp-deploy-cephfs-busybox	dractions/actions.go:171	Workload failed over

2025-06-02T16:53:15.129+0530	INFO	subscr-deploy-rbd-busybox	dractions/actions.go:163	Failing over workload "e2e-subscr-deploy-rbd-busybox/busybox" from cluster "dr1" to cluster "dr2"
2025-06-02T16:56:43.056+0530	DEBUG	subscr-deploy-rbd-busybox	deployers/retry.go:46	Waiting until workload "e2e-subscr-deploy-rbd-busybox/busybox" is healthy in cluster "dr2"
2025-06-02T16:56:43.060+0530	DEBUG	subscr-deploy-rbd-busybox	workloads/deploy.go:95	Deployment "e2e-subscr-deploy-rbd-busybox/busybox" is ready in cluster "dr2"
2025-06-02T16:56:43.060+0530	DEBUG	subscr-deploy-rbd-busybox	deployers/retry.go:51	Workload "e2e-subscr-deploy-rbd-busybox/busybox" is healthy in cluster "dr2"
2025-06-02T16:56:43.060+0530	INFO	subscr-deploy-rbd-busybox	dractions/actions.go:171	Workload failed over

2025-06-02T16:53:38.470+0530	INFO	appset-deploy-rbd-busybox	dractions/actions.go:163	Failing over workload "e2e-appset-deploy-rbd-busybox/busybox" from cluster "dr1" to cluster "dr2"
2025-06-02T17:00:09.649+0530	DEBUG	appset-deploy-rbd-busybox	deployers/retry.go:46	Waiting until workload "e2e-appset-deploy-rbd-busybox/busybox" is healthy in cluster "dr2"
2025-06-02T17:00:09.682+0530	DEBUG	appset-deploy-rbd-busybox	workloads/deploy.go:95	Deployment "e2e-appset-deploy-rbd-busybox/busybox" is ready in cluster "dr2"
2025-06-02T17:00:09.682+0530	DEBUG	appset-deploy-rbd-busybox	deployers/retry.go:51	Workload "e2e-appset-deploy-rbd-busybox/busybox" is healthy in cluster "dr2"
2025-06-02T17:00:09.682+0530	INFO	appset-deploy-rbd-busybox	dractions/actions.go:171	Workload failed over

2025-06-02T16:53:38.588+0530	INFO	appset-deploy-cephfs-busybox	dractions/actions.go:163	Failing over workload "e2e-appset-deploy-cephfs-busybox/busybox" from cluster "dr1" to cluster "dr2"
2025-06-02T17:00:09.654+0530	DEBUG	appset-deploy-cephfs-busybox	deployers/retry.go:46	Waiting until workload "e2e-appset-deploy-cephfs-busybox/busybox" is healthy in cluster "dr2"
2025-06-02T17:00:09.681+0530	DEBUG	appset-deploy-cephfs-busybox	workloads/deploy.go:95	Deployment "e2e-appset-deploy-cephfs-busybox/busybox" is ready in cluster "dr2"
2025-06-02T17:00:09.682+0530	DEBUG	appset-deploy-cephfs-busybox	deployers/retry.go:51	Workload "e2e-appset-deploy-cephfs-busybox/busybox" is healthy in cluster "dr2"
2025-06-02T17:00:09.682+0530	INFO	appset-deploy-cephfs-busybox	dractions/actions.go:171	Workload failed over

2025-06-02T16:53:40.677+0530	INFO	subscr-deploy-cephfs-busybox	dractions/actions.go:163	Failing over workload "e2e-subscr-deploy-cephfs-busybox/busybox" from cluster "dr1" to cluster "dr2"
2025-06-02T16:58:08.215+0530	DEBUG	subscr-deploy-cephfs-busybox	deployers/retry.go:46	Waiting until workload "e2e-subscr-deploy-cephfs-busybox/busybox" is healthy in cluster "dr2"
2025-06-02T16:58:08.223+0530	DEBUG	subscr-deploy-cephfs-busybox	workloads/deploy.go:95	Deployment "e2e-subscr-deploy-cephfs-busybox/busybox" is ready in cluster "dr2"
2025-06-02T16:58:08.223+0530	DEBUG	subscr-deploy-cephfs-busybox	deployers/retry.go:51	Workload "e2e-subscr-deploy-cephfs-busybox/busybox" is healthy in cluster "dr2"
2025-06-02T16:58:08.223+0530	INFO	subscr-deploy-cephfs-busybox	dractions/actions.go:171	Workload failed over

During relocate, workload validation took 15 seconds for subscr rbd, ~80 seconds for appset rbd, and ~2.2 minutes for appset cephfs (which might reach the relocate deadline):

2025-06-02T16:56:23.057+0530	INFO	disapp-deploy-rbd-busybox	dractions/actions.go:196	Relocating workload "e2e-disapp-deploy-rbd-busybox/busybox" from cluster "dr2" to cluster "dr1"
2025-06-02T16:58:21.134+0530	DEBUG	disapp-deploy-rbd-busybox	deployers/retry.go:46	Waiting until workload "e2e-disapp-deploy-rbd-busybox/busybox" is healthy in cluster "dr1"
2025-06-02T16:58:21.149+0530	DEBUG	disapp-deploy-rbd-busybox	workloads/deploy.go:95	Deployment "e2e-disapp-deploy-rbd-busybox/busybox" is ready in cluster "dr1"
2025-06-02T16:58:21.149+0530	DEBUG	disapp-deploy-rbd-busybox	deployers/retry.go:51	Workload "e2e-disapp-deploy-rbd-busybox/busybox" is healthy in cluster "dr1"
2025-06-02T16:58:21.149+0530	INFO	disapp-deploy-rbd-busybox	dractions/actions.go:204	Workload relocated

2025-06-02T16:56:43.085+0530	INFO	subscr-deploy-rbd-busybox	dractions/actions.go:196	Relocating workload "e2e-subscr-deploy-rbd-busybox/busybox" from cluster "dr2" to cluster "dr1"
2025-06-02T16:58:43.363+0530	DEBUG	subscr-deploy-rbd-busybox	deployers/retry.go:46	Waiting until workload "e2e-subscr-deploy-rbd-busybox/busybox" is healthy in cluster "dr1"
2025-06-02T16:58:58.382+0530	DEBUG	subscr-deploy-rbd-busybox	workloads/deploy.go:95	Deployment "e2e-subscr-deploy-rbd-busybox/busybox" is ready in cluster "dr1"
2025-06-02T16:58:58.383+0530	DEBUG	subscr-deploy-rbd-busybox	deployers/retry.go:51	Workload "e2e-subscr-deploy-rbd-busybox/busybox" is healthy in cluster "dr1"
2025-06-02T16:58:58.383+0530	INFO	subscr-deploy-rbd-busybox	dractions/actions.go:204	Workload relocated

2025-06-02T16:56:51.055+0530	INFO	disapp-deploy-cephfs-busybox	dractions/actions.go:196	Relocating workload "e2e-disapp-deploy-cephfs-busybox/busybox" from cluster "dr2" to cluster "dr1"
2025-06-02T17:00:49.462+0530	DEBUG	disapp-deploy-cephfs-busybox	deployers/retry.go:46	Waiting until workload "e2e-disapp-deploy-cephfs-busybox/busybox" is healthy in cluster "dr1"
2025-06-02T17:00:49.468+0530	DEBUG	disapp-deploy-cephfs-busybox	workloads/deploy.go:95	Deployment "e2e-disapp-deploy-cephfs-busybox/busybox" is ready in cluster "dr1"
2025-06-02T17:00:49.468+0530	DEBUG	disapp-deploy-cephfs-busybox	deployers/retry.go:51	Workload "e2e-disapp-deploy-cephfs-busybox/busybox" is healthy in cluster "dr1"
2025-06-02T17:00:49.468+0530	INFO	disapp-deploy-cephfs-busybox	dractions/actions.go:204	Workload relocated

2025-06-02T16:58:08.244+0530	INFO	subscr-deploy-cephfs-busybox	dractions/actions.go:196	Relocating workload "e2e-subscr-deploy-cephfs-busybox/busybox" from cluster "dr2" to cluster "dr1"
2025-06-02T17:03:14.749+0530	DEBUG	subscr-deploy-cephfs-busybox	deployers/retry.go:46	Waiting until workload "e2e-subscr-deploy-cephfs-busybox/busybox" is healthy in cluster "dr1"
2025-06-02T17:03:14.756+0530	DEBUG	subscr-deploy-cephfs-busybox	workloads/deploy.go:95	Deployment "e2e-subscr-deploy-cephfs-busybox/busybox" is ready in cluster "dr1"
2025-06-02T17:03:14.756+0530	DEBUG	subscr-deploy-cephfs-busybox	deployers/retry.go:51	Workload "e2e-subscr-deploy-cephfs-busybox/busybox" is healthy in cluster "dr1"
2025-06-02T17:03:14.756+0530	INFO	subscr-deploy-cephfs-busybox	dractions/actions.go:204	Workload relocated

2025-06-02T17:00:09.785+0530	INFO	appset-deploy-rbd-busybox	dractions/actions.go:196	Relocating workload "e2e-appset-deploy-rbd-busybox/busybox" from cluster "dr2" to cluster "dr1"
2025-06-02T17:05:08.509+0530	DEBUG	appset-deploy-rbd-busybox	deployers/retry.go:46	Waiting until workload "e2e-appset-deploy-rbd-busybox/busybox" is healthy in cluster "dr1"
2025-06-02T17:06:33.653+0530	DEBUG	appset-deploy-rbd-busybox	workloads/deploy.go:95	Deployment "e2e-appset-deploy-rbd-busybox/busybox" is ready in cluster "dr1"
2025-06-02T17:06:33.653+0530	DEBUG	appset-deploy-rbd-busybox	deployers/retry.go:51	Workload "e2e-appset-deploy-rbd-busybox/busybox" is healthy in cluster "dr1"
2025-06-02T17:06:33.653+0530	INFO	appset-deploy-rbd-busybox	dractions/actions.go:204	Workload relocated

2025-06-02T17:00:09.790+0530	INFO	appset-deploy-cephfs-busybox	dractions/actions.go:196	Relocating workload "e2e-appset-deploy-cephfs-busybox/busybox" from cluster "dr2" to cluster "dr1"
2025-06-02T17:07:08.992+0530	DEBUG	appset-deploy-cephfs-busybox	deployers/retry.go:46	Waiting until workload "e2e-appset-deploy-cephfs-busybox/busybox" is healthy in cluster "dr1"
2025-06-02T17:09:24.374+0530	DEBUG	appset-deploy-cephfs-busybox	workloads/deploy.go:95	Deployment "e2e-appset-deploy-cephfs-busybox/busybox" is ready in cluster "dr1"
2025-06-02T17:09:24.374+0530	DEBUG	appset-deploy-cephfs-busybox	deployers/retry.go:51	Workload "e2e-appset-deploy-cephfs-busybox/busybox" is healthy in cluster "dr1"
2025-06-02T17:09:24.374+0530	INFO	appset-deploy-cephfs-busybox	dractions/actions.go:204	Workload relocated

dr.log

nirs

Let's understand why the application needs 2m:30s to become healthy after relocate when using appset-deploy-cephfs. We may need to increase relocate timeout to avoid failures on slow setups.
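To recheck a wait time from the logs, the timestamps of the "Waiting until workload" and "is healthy" lines can be subtracted. The sketch below does this for the appset-deploy-cephfs relocate entries quoted above; the helper name `waitDuration` is hypothetical and not part of the e2e code.

```go
package main

import (
	"fmt"
	"time"
)

// logLayout matches timestamps such as 2025-06-02T17:07:08.992+0530.
const logLayout = "2006-01-02T15:04:05.000-0700"

// waitDuration returns the time elapsed between two log timestamps.
func waitDuration(start, end string) (time.Duration, error) {
	s, err := time.Parse(logLayout, start)
	if err != nil {
		return 0, err
	}
	e, err := time.Parse(logLayout, end)
	if err != nil {
		return 0, err
	}
	return e.Sub(s), nil
}

func main() {
	// Timestamps copied from the appset-deploy-cephfs relocate log.
	d, err := waitDuration(
		"2025-06-02T17:07:08.992+0530", // waiting until workload is healthy
		"2025-06-02T17:09:24.374+0530", // workload is healthy
	)
	if err != nil {
		fmt.Println("error:", err)
		return
	}
	fmt.Println(d) // prints 2m15.382s
}
```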

e2e/util/placement.go

Updated GetCurrentCluster function to return types.Cluster object by looking up the cluster name from PlacementDecision and retrieving the corresponding cluster from the env. Updated the function comment to reflect that it now returns the cluster object. Updated callers to access the full cluster object rather than name string across deployers and DR actions to handle the new return type. Signed-off-by: Parikshith <parikshithb@gmail.com>

Updated getTargetCluster function to return types.Cluster object by looking up the target cluster name and retrieving the cluster from the env. Updated variable naming from targetCluster to targetClusterName. Updated all callers in Failover and Relocate functions to use targetCluster.Name when passing cluster names. Signed-off-by: Parikshith <parikshithb@gmail.com>

Updated failoverRelocate and failoverRelocateDiscoveredApps functions to accept types.Cluster objects instead of cluster name strings for currentCluster and targetCluster parameters. This eliminates redundant GetCluster calls within failoverRelocateDiscoveredApps since the cluster objects are now passed directly from the callers. Updated function calls in Failover and Relocate to pass cluster objects instead of cluster names. Signed-off-by: Parikshith <parikshithb@gmail.com>

Updated waitAndUpdateDRPC function to accept types.Cluster object instead of cluster name string, maintaining consistency with other functions that now work with cluster objects. Updated callers in failoverRelocate for managed and disapp to pass cluster objects instead of cluster names. Signed-off-by: Parikshith <parikshithb@gmail.com>

Add debug log when starting to wait for workload health to improve debugging visibility when health checks timeout. Standardize error message format to use namespace/appName and fix method call to getAppName() instead of GetName(). Signed-off-by: Parikshith <parikshithb@gmail.com>
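As a rough illustration of the logging pattern this commit describes, the sketch below logs once before polling and once on success, and includes the last health error when the wait times out. It is a simplified stand-in, not the code in retry.go: the function signature, the use of the standard `log` package, and the plain deadline loop are all assumptions.

```go
package main

import (
	"errors"
	"fmt"
	"log"
	"time"
)

// errNotReady is a hypothetical error used by the demo health check.
var errNotReady = errors.New("deployment not ready")

// waitWorkloadHealth polls health until it returns nil or the timeout
// expires. It logs when the wait starts and when the workload is healthy,
// so a timed-out run shows where it was waiting.
func waitWorkloadHealth(name, cluster string, interval, timeout time.Duration, health func() error) error {
	log.Printf("Waiting until workload %q is healthy in cluster %q", name, cluster)
	deadline := time.Now().Add(timeout)
	for {
		err := health()
		if err == nil {
			log.Printf("Workload %q is healthy in cluster %q", name, cluster)
			return nil
		}
		// Give up if the next poll would start after the deadline.
		if time.Now().Add(interval).After(deadline) {
			return fmt.Errorf("timeout waiting for workload %q in cluster %q: %w", name, cluster, err)
		}
		time.Sleep(interval)
	}
}

func main() {
	calls := 0
	// Demo health check that succeeds on the third poll.
	health := func() error {
		calls++
		if calls < 3 {
			return errNotReady
		}
		return nil
	}
	err := waitWorkloadHealth("e2e-appset-deploy-rbd-busybox/busybox", "dr1",
		10*time.Millisecond, time.Second, health)
	fmt.Println("error:", err, "polls:", calls)
}
```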

parikshithb · 2025-06-04T14:33:23Z

Ran e2e locally, including the workload validation after ops, with drpolicies having 5m and 1m (default) scheduling intervals:

5m drpolicy used in config:

    name: dr-policy-5m
    resourceVersion: "67851"
    uid: 870a6b4d-8c3b-432d-af79-ca1cd393735c
  spec:
    drClusters:
    - dr1
    - dr2
    replicationClassSelector: {}
    schedulingInterval: 5m
    volumeGroupSnapshotClassSelector: {}
    volumeSnapshotClassSelector: {}
  status:
    async:
      peerClasses:
      - clusterIDs:
        - 2bccec29-379e-4973-86c8-c893bda3f69d
        - 39a0947e-ae41-4cca-9c73-74055bd0a8e6
        storageClassName: rook-ceph-block
        storageID:
        - rook-ceph-block-dr1-1
        - rook-ceph-block-dr2-1
      - clusterIDs:
        - 2bccec29-379e-4973-86c8-c893bda3f69d
        - 39a0947e-ae41-4cca-9c73-74055bd0a8e6
        storageClassName: rook-cephfs-fs1
        storageID:
        - rook-cephfs-fs1-dr1-1
        - rook-cephfs-fs1-dr2-1
    conditions:
    - lastTransitionTime: "2025-06-04T13:25:51Z"
      message: drpolicy validated
      observedGeneration: 2
      reason: Succeeded
      status: "True"
      type: Validated

Time taken by individual tests with the 5m policy (ran twice):
Run 1:

--- PASS: TestDR (6.06s)
    --- PASS: TestDR/disapp-deploy-cephfs-busybox (490.70s)
        --- PASS: TestDR/disapp-deploy-cephfs-busybox/Deploy (1.77s)
        --- PASS: TestDR/disapp-deploy-cephfs-busybox/Enable (90.08s)
        --- PASS: TestDR/disapp-deploy-cephfs-busybox/Failover (151.74s)
        --- PASS: TestDR/disapp-deploy-cephfs-busybox/Relocate (181.95s)
        --- PASS: TestDR/disapp-deploy-cephfs-busybox/Disable (56.17s)
        --- PASS: TestDR/disapp-deploy-cephfs-busybox/Undeploy (8.98s)
    --- PASS: TestDR/disapp-deploy-rbd-busybox (612.37s)
        --- PASS: TestDR/disapp-deploy-rbd-busybox/Deploy (1.87s)
        --- PASS: TestDR/disapp-deploy-rbd-busybox/Enable (95.10s)
        --- PASS: TestDR/disapp-deploy-rbd-busybox/Failover (206.97s)
        --- PASS: TestDR/disapp-deploy-rbd-busybox/Relocate (241.89s)
        --- PASS: TestDR/disapp-deploy-rbd-busybox/Disable (57.17s)
        --- PASS: TestDR/disapp-deploy-rbd-busybox/Undeploy (9.37s)
    --- PASS: TestDR/subscr-deploy-rbd-busybox (651.87s)
        --- PASS: TestDR/subscr-deploy-rbd-busybox/Deploy (15.12s)
        --- PASS: TestDR/subscr-deploy-rbd-busybox/Enable (90.08s)
        --- PASS: TestDR/subscr-deploy-rbd-busybox/Failover (270.29s)
        --- PASS: TestDR/subscr-deploy-rbd-busybox/Relocate (210.14s)
        --- PASS: TestDR/subscr-deploy-rbd-busybox/Disable (60.18s)
        --- PASS: TestDR/subscr-deploy-rbd-busybox/Undeploy (6.06s)
    --- PASS: TestDR/subscr-deploy-cephfs-busybox (682.06s)
        --- PASS: TestDR/subscr-deploy-cephfs-busybox/Deploy (15.12s)
        --- PASS: TestDR/subscr-deploy-cephfs-busybox/Enable (90.08s)
        --- PASS: TestDR/subscr-deploy-cephfs-busybox/Failover (300.51s)
        --- PASS: TestDR/subscr-deploy-cephfs-busybox/Relocate (210.13s)
        --- PASS: TestDR/subscr-deploy-cephfs-busybox/Disable (60.18s)
        --- PASS: TestDR/subscr-deploy-cephfs-busybox/Undeploy (6.05s)
    --- PASS: TestDR/appset-deploy-rbd-busybox (971.87s)
        --- PASS: TestDR/appset-deploy-rbd-busybox/Deploy (5.06s)
        --- PASS: TestDR/appset-deploy-rbd-busybox/Enable (90.07s)
        --- PASS: TestDR/appset-deploy-rbd-busybox/Failover (420.27s)
        --- PASS: TestDR/appset-deploy-rbd-busybox/Relocate (390.25s)
        --- PASS: TestDR/appset-deploy-rbd-busybox/Disable (60.18s)
        --- PASS: TestDR/appset-deploy-rbd-busybox/Undeploy (6.05s)
    --- PASS: TestDR/appset-deploy-cephfs-busybox (975.85s)
        --- PASS: TestDR/appset-deploy-cephfs-busybox/Deploy (10.06s)
        --- PASS: TestDR/appset-deploy-cephfs-busybox/Enable (120.09s)
        --- PASS: TestDR/appset-deploy-cephfs-busybox/Failover (450.28s)
        --- PASS: TestDR/appset-deploy-cephfs-busybox/Relocate (335.21s)
        --- PASS: TestDR/appset-deploy-cephfs-busybox/Disable (54.17s)
        --- PASS: TestDR/appset-deploy-cephfs-busybox/Undeploy (6.04s)
PASS

Run 2:

--- PASS: TestDR (6.06s)
    --- PASS: TestDR/disapp-deploy-cephfs-busybox (550.59s)
        --- PASS: TestDR/disapp-deploy-cephfs-busybox/Deploy (1.78s)
        --- PASS: TestDR/disapp-deploy-cephfs-busybox/Enable (90.09s)
        --- PASS: TestDR/disapp-deploy-cephfs-busybox/Failover (151.78s)
        --- PASS: TestDR/disapp-deploy-cephfs-busybox/Relocate (241.73s)
        --- PASS: TestDR/disapp-deploy-cephfs-busybox/Disable (56.20s)
        --- PASS: TestDR/disapp-deploy-cephfs-busybox/Undeploy (9.00s)
    --- PASS: TestDR/subscr-deploy-rbd-busybox (556.93s)
        --- PASS: TestDR/subscr-deploy-rbd-busybox/Deploy (10.10s)
        --- PASS: TestDR/subscr-deploy-rbd-busybox/Enable (65.42s)
        --- PASS: TestDR/subscr-deploy-rbd-busybox/Failover (295.17s)
        --- PASS: TestDR/subscr-deploy-rbd-busybox/Relocate (150.09s)
        --- PASS: TestDR/subscr-deploy-rbd-busybox/Disable (30.10s)
        --- PASS: TestDR/subscr-deploy-rbd-busybox/Undeploy (6.04s)
    --- PASS: TestDR/disapp-deploy-rbd-busybox (610.94s)
        --- PASS: TestDR/disapp-deploy-rbd-busybox/Deploy (1.78s)
        --- PASS: TestDR/disapp-deploy-rbd-busybox/Enable (90.09s)
        --- PASS: TestDR/disapp-deploy-rbd-busybox/Failover (211.90s)
        --- PASS: TestDR/disapp-deploy-rbd-busybox/Relocate (241.95s)
        --- PASS: TestDR/disapp-deploy-rbd-busybox/Disable (56.17s)
        --- PASS: TestDR/disapp-deploy-rbd-busybox/Undeploy (9.05s)
    --- PASS: TestDR/subscr-deploy-cephfs-busybox (646.86s)
        --- PASS: TestDR/subscr-deploy-cephfs-busybox/Deploy (10.10s)
        --- PASS: TestDR/subscr-deploy-cephfs-busybox/Enable (95.23s)
        --- PASS: TestDR/subscr-deploy-cephfs-busybox/Failover (295.18s)
        --- PASS: TestDR/subscr-deploy-cephfs-busybox/Relocate (180.12s)
        --- PASS: TestDR/subscr-deploy-cephfs-busybox/Disable (60.18s)
        --- PASS: TestDR/subscr-deploy-cephfs-busybox/Undeploy (6.05s)
    --- PASS: TestDR/appset-deploy-cephfs-busybox (797.77s)
        --- PASS: TestDR/appset-deploy-cephfs-busybox/Deploy (10.06s)
        --- PASS: TestDR/appset-deploy-cephfs-busybox/Enable (35.04s)
        --- PASS: TestDR/appset-deploy-cephfs-busybox/Failover (330.21s)
        --- PASS: TestDR/appset-deploy-cephfs-busybox/Relocate (360.21s)
        --- PASS: TestDR/appset-deploy-cephfs-busybox/Disable (56.18s)
        --- PASS: TestDR/appset-deploy-cephfs-busybox/Undeploy (6.07s)
    --- PASS: TestDR/appset-deploy-rbd-busybox (973.87s)
        --- PASS: TestDR/appset-deploy-rbd-busybox/Deploy (10.06s)
        --- PASS: TestDR/appset-deploy-rbd-busybox/Enable (60.05s)
        --- PASS: TestDR/appset-deploy-rbd-busybox/Failover (510.32s)
        --- PASS: TestDR/appset-deploy-rbd-busybox/Relocate (330.20s)
        --- PASS: TestDR/appset-deploy-rbd-busybox/Disable (57.19s)
        --- PASS: TestDR/appset-deploy-rbd-busybox/Undeploy (6.05s)
PASS

Time taken with the 1m policy, run locally:

--- PASS: TestDR (6.06s)
    --- PASS: TestDR/subscr-deploy-rbd-busybox (502.03s)
        --- PASS: TestDR/subscr-deploy-rbd-busybox/Deploy (15.08s)
        --- PASS: TestDR/subscr-deploy-rbd-busybox/Enable (90.11s)
        --- PASS: TestDR/subscr-deploy-rbd-busybox/Failover (210.52s)
        --- PASS: TestDR/subscr-deploy-rbd-busybox/Relocate (130.10s)
        --- PASS: TestDR/subscr-deploy-rbd-busybox/Disable (50.15s)
        --- PASS: TestDR/subscr-deploy-rbd-busybox/Undeploy (6.07s)
    --- PASS: TestDR/disapp-deploy-rbd-busybox (521.56s)
        --- PASS: TestDR/disapp-deploy-rbd-busybox/Deploy (1.73s)
        --- PASS: TestDR/disapp-deploy-rbd-busybox/Enable (95.09s)
        --- PASS: TestDR/disapp-deploy-rbd-busybox/Failover (206.72s)
        --- PASS: TestDR/disapp-deploy-rbd-busybox/Relocate (151.92s)
        --- PASS: TestDR/disapp-deploy-rbd-busybox/Disable (57.18s)
        --- PASS: TestDR/disapp-deploy-rbd-busybox/Undeploy (8.92s)
    --- PASS: TestDR/disapp-deploy-cephfs-busybox (611.66s)
        --- PASS: TestDR/disapp-deploy-cephfs-busybox/Deploy (1.74s)
        --- PASS: TestDR/disapp-deploy-cephfs-busybox/Enable (95.08s)
        --- PASS: TestDR/disapp-deploy-cephfs-busybox/Failover (176.76s)
        --- PASS: TestDR/disapp-deploy-cephfs-busybox/Relocate (271.83s)
        --- PASS: TestDR/disapp-deploy-cephfs-busybox/Disable (57.19s)
        --- PASS: TestDR/disapp-deploy-cephfs-busybox/Undeploy (9.06s)
    --- PASS: TestDR/subscr-deploy-cephfs-busybox (681.98s)
        --- PASS: TestDR/subscr-deploy-cephfs-busybox/Deploy (15.08s)
        --- PASS: TestDR/subscr-deploy-cephfs-busybox/Enable (90.11s)
        --- PASS: TestDR/subscr-deploy-cephfs-busybox/Failover (270.36s)
        --- PASS: TestDR/subscr-deploy-cephfs-busybox/Relocate (240.16s)
        --- PASS: TestDR/subscr-deploy-cephfs-busybox/Disable (60.19s)
        --- PASS: TestDR/subscr-deploy-cephfs-busybox/Undeploy (6.08s)
    --- PASS: TestDR/appset-deploy-rbd-busybox (778.81s)
        --- PASS: TestDR/appset-deploy-rbd-busybox/Deploy (5.07s)
        --- PASS: TestDR/appset-deploy-rbd-busybox/Enable (80.11s)
        --- PASS: TestDR/appset-deploy-rbd-busybox/Failover (330.21s)
        --- PASS: TestDR/appset-deploy-rbd-busybox/Relocate (305.20s)
        --- PASS: TestDR/appset-deploy-rbd-busybox/Disable (52.17s)
        --- PASS: TestDR/appset-deploy-rbd-busybox/Undeploy (6.05s)
    --- PASS: TestDR/appset-deploy-cephfs-busybox (962.92s)
        --- PASS: TestDR/appset-deploy-cephfs-busybox/Deploy (5.07s)
        --- PASS: TestDR/appset-deploy-cephfs-busybox/Enable (85.11s)
        --- PASS: TestDR/appset-deploy-cephfs-busybox/Failover (360.23s)
        --- PASS: TestDR/appset-deploy-cephfs-busybox/Relocate (465.30s)
        --- PASS: TestDR/appset-deploy-cephfs-busybox/Disable (41.15s)
        --- PASS: TestDR/appset-deploy-cephfs-busybox/Undeploy (6.05s)
PASS

CI run result with the 1m interval:

    --- PASS: TestDR/disapp-deploy-rbd-busybox (529.69s)
        --- PASS: TestDR/disapp-deploy-rbd-busybox/Deploy (3.12s)
        --- PASS: TestDR/disapp-deploy-rbd-busybox/Enable (91.75s)
        --- PASS: TestDR/disapp-deploy-rbd-busybox/Failover (214.73s)
        --- PASS: TestDR/disapp-deploy-rbd-busybox/Relocate (149.96s)
        --- PASS: TestDR/disapp-deploy-rbd-busybox/Disable (57.51s)
        --- PASS: TestDR/disapp-deploy-rbd-busybox/Undeploy (12.63s)
    --- PASS: TestDR/subscr-deploy-rbd-busybox (540.13s)
        --- PASS: TestDR/subscr-deploy-rbd-busybox/Deploy (40.72s)
        --- PASS: TestDR/subscr-deploy-rbd-busybox/Enable (94.11s)
        --- PASS: TestDR/subscr-deploy-rbd-busybox/Failover (211.76s)
        --- PASS: TestDR/subscr-deploy-rbd-busybox/Relocate (142.75s)
        --- PASS: TestDR/subscr-deploy-rbd-busybox/Disable (43.00s)
        --- PASS: TestDR/subscr-deploy-rbd-busybox/Undeploy (7.78s)
    --- PASS: TestDR/disapp-deploy-cephfs-busybox (832.37s)
        --- PASS: TestDR/disapp-deploy-cephfs-busybox/Deploy (3.06s)
        --- PASS: TestDR/disapp-deploy-cephfs-busybox/Enable (182.78s)
        --- PASS: TestDR/disapp-deploy-cephfs-busybox/Failover (301.96s)
        --- PASS: TestDR/disapp-deploy-cephfs-busybox/Relocate (274.89s)
        --- PASS: TestDR/disapp-deploy-cephfs-busybox/Disable (55.91s)
        --- PASS: TestDR/disapp-deploy-cephfs-busybox/Undeploy (13.77s)
    --- PASS: TestDR/subscr-deploy-cephfs-busybox (976.13s)
        --- PASS: TestDR/subscr-deploy-cephfs-busybox/Deploy (30.69s)
        --- PASS: TestDR/subscr-deploy-cephfs-busybox/Enable (154s)
        --- PASS: TestDR/subscr-deploy-cephfs-busybox/Failover (450.86s)
        --- PASS: TestDR/subscr-deploy-cephfs-busybox/Relocate (271.02s)
        --- PASS: TestDR/subscr-deploy-cephfs-busybox/Disable (60.42s)
        --- PASS: TestDR/subscr-deploy-cephfs-busybox/Undeploy (8.43s)
    --- PASS: TestDR/appset-deploy-rbd-busybox (989.20s)
        --- PASS: TestDR/appset-deploy-rbd-busybox/Deploy (40.33s)
        --- PASS: TestDR/appset-deploy-rbd-busybox/Enable (94.84s)
        --- PASS: TestDR/appset-deploy-rbd-busybox/Failover (399.47s)
        --- PASS: TestDR/appset-deploy-rbd-busybox/Relocate (387.44s)
        --- PASS: TestDR/appset-deploy-rbd-busybox/Disable (59.85s)
        --- PASS: TestDR/appset-deploy-rbd-busybox/Undeploy (7.27s)
    --- PASS: TestDR/appset-deploy-cephfs-busybox (1166.96s)
        --- PASS: TestDR/appset-deploy-cephfs-busybox/Deploy (35.31s)
        --- PASS: TestDR/appset-deploy-cephfs-busybox/Enable (159.85s)
        --- PASS: TestDR/appset-deploy-cephfs-busybox/Failover (450.46s)
        --- PASS: TestDR/appset-deploy-cephfs-busybox/Relocate (465.89s)
        --- PASS: TestDR/appset-deploy-cephfs-busybox/Disable (47.14s)
        --- PASS: TestDR/appset-deploy-cephfs-busybox/Undeploy (8.30s)
PASS

In both scenarios the tests do not exceed the timeout (600s). The times vary, but relocate for appset seems faster with the 5m interval, and failover is a bit slower, compared with the above runs using the 1m interval policy.

e2e/dractions/actions.go

nirs · 2025-06-04T15:01:53Z

Ran e2e including the workload validation after ops with drpolicies having 5 min and 1m(default) scheduling interval locally:

5m drpolicy used in config:
    name: dr-policy-5m
    resourceVersion: "67851"
    uid: 870a6b4d-8c3b-432d-af79-ca1cd393735c
  spec:
    drClusters:
    - dr1
    - dr2
    replicationClassSelector: {}
    schedulingInterval: 5m

The times look very similar, did you change the drPolicy in the config? The validation logs should show the drpolicy.

parikshithb · 2025-06-04T15:06:41Z


Yup, I made sure to update the drpolicy in the config; it shows up in our logs:

2025-06-04T18:57:20.107+0530	INFO	validate/validate.go:145	Validated clusters ["dr1", "dr2"] in DRPolicy "dr-policy-5m"

nirs

Looks good! One commit message should be fixed.

e2e/deployers/appset.go

Add WaitWorkloadHealth checks to all DR operations except unprotect (due to RamenDR#2077) to ensure the workload is healthy after operations complete. Updated DiscoveredApp deployer to use the cluster variable instead of ctx.Env().C1 for consistency. These changes ensure workloads are fully operational before workload deployments and DR operations are considered successful. Signed-off-by: Parikshith <parikshithb@gmail.com>

The Health method was incorrectly returning nil (success) even when deployments were not ready, causing WaitWorkloadHealth to succeed immediately without waiting. It now returns a proper error with the replica status when the deployment is not healthy. Signed-off-by: Parikshith <parikshithb@gmail.com>

The Health() method was logging when a deployment is ready, but this is redundant since the caller (WaitWorkloadHealth) already logs both the "waiting" and "healthy" status messages. This eliminates duplicate log entries for the same event. Signed-off-by: Parikshith <parikshithb@gmail.com>

nirs

Thanks!

nirs · 2025-06-05T11:35:29Z

@parikshithb please send a ramenctl PR to consume this change.

Consuming fix for validating workload health after DR operations: RamenDR/ramen#2071 Issue fixed in ramen e2e: RamenDR/ramen#2018 Signed-off-by: Parikshith <parikshithb@gmail.com>

parikshithb requested review from nirs and raghavendra-talur as code owners May 30, 2025 11:48

parikshithb removed the request for review from raghavendra-talur May 30, 2025 11:48

nirs reviewed Jun 1, 2025

e2e/dractions/retry.go

parikshithb force-pushed the validate_wl branch 2 times, most recently from 1e68575 to e61af30 Compare June 2, 2025 09:37

parikshithb requested a review from nirs June 2, 2025 13:53

nirs approved these changes Jun 3, 2025

parikshithb marked this pull request as draft June 4, 2025 12:05

nirs self-requested a review June 4, 2025 12:26

nirs reviewed Jun 4, 2025

e2e/util/placement.go

parikshithb added 5 commits June 4, 2025 18:14

parikshithb force-pushed the validate_wl branch from e63949a to 1f1cc61 Compare June 4, 2025 13:45

parikshithb requested a review from nirs June 4, 2025 14:34

nirs reviewed Jun 4, 2025

e2e/dractions/actions.go

parikshithb force-pushed the validate_wl branch from 1f1cc61 to f4ac770 Compare June 4, 2025 15:16

parikshithb mentioned this pull request Jun 4, 2025

e2e: Move cleanup logic from ramenctl to e2e #2077

Closed

parikshithb force-pushed the validate_wl branch from f4ac770 to 1ace917 Compare June 4, 2025 15:21

parikshithb marked this pull request as ready for review June 4, 2025 15:25

parikshithb requested a review from nirs June 4, 2025 15:30

nirs reviewed Jun 4, 2025

e2e/deployers/appset.go

parikshithb added 3 commits June 5, 2025 13:58

parikshithb force-pushed the validate_wl branch from 1ace917 to bc89c14 Compare June 5, 2025 08:29

parikshithb requested a review from nirs June 5, 2025 09:36

nirs approved these changes Jun 5, 2025

nirs merged commit 7ca6bd4 into RamenDR:main Jun 5, 2025
23 checks passed

parikshithb mentioned this pull request Jun 5, 2025

deps: update ramen/e2e to latest version RamenDR/ramenctl#197

Merged
