Prevent flickering between Completed/CleaningUp and WaitingForReadiness progression by ELENAGER · Pull Request #1983 · RamenDR/ramen · GitHub

Prevent flickering between Completed/CleaningUp and WaitingForReadiness progression #1983

Draft · ELENAGER wants to merge 1 commit into main from DFBUGS-612-new

Conversation

@ELENAGER ELENAGER (Member) commented Apr 2, 2025

We need to distinguish between the situation when the VRG conditions are not yet ready, which prevents us from moving to the failover/relocate Completed state, and the situation when there is a real problem with the VRG conditions (for example, a condition doesn't exist or its generation doesn't match the VRG generation).
In case of a real problem we must not change the progression; we just return an error and wait for the next reconciliation.
Fixes: https://issues.redhat.com/browse/DFBUGS-612
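
Roughly, the intended distinction looks like this (a minimal sketch only; the exact checkReadiness signature and receiver are illustrative, assuming isVRGConditionMet returns (bool, error) where the error covers a missing condition or a generation mismatch):

// Sketch only: separate "not ready yet" from "cannot evaluate readiness".
func (d *DRPCInstance) checkReadiness(homeCluster string) (bool, error) {
    dataReady, err := d.isVRGConditionMet(homeCluster, VRGConditionTypeDataReady)
    if err != nil {
        // Real problem: missing condition or generation mismatch.
        // Keep the current progression, return the error and wait
        // for the next reconciliation.
        return false, err
    }

    // The condition exists and matches the VRG generation; its value
    // tells us whether the progression may move to Completed.
    return dataReady, nil
}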

@ELENAGER ELENAGER force-pushed the DFBUGS-612-new branch 2 times, most recently from 110062c to 6324972 on April 2, 2025 15:12

clusterDataReady, err := d.isVRGConditionMet(homeCluster, VRGConditionTypeClusterDataProtected)

return err == nil && clusterDataReady

@nirs nirs (Member) commented:

How can err be non-nil here? We return at line 1199 in that case.

@ELENAGER ELENAGER (Member, Author) commented Apr 3, 2025:

It is a different error: at line 1199 we checked the VRGConditionTypeDataReady condition, while here it is the VRGConditionTypeClusterDataProtected condition, which can also be absent or have a mismatched generation.

@nirs nirs (Member) commented:

Right, but we drop this error silently again.

@ELENAGER ELENAGER (Member, Author) commented:

@nirs both errors are logged now
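
For illustration, the resulting pattern could look roughly like this (a sketch only; the log messages and the ClusterDataProtected handling are assumptions, with d.log being the controller's logr.Logger):

dataReady, err := d.isVRGConditionMet(homeCluster, VRGConditionTypeDataReady)
if err != nil {
    // Log instead of dropping the error, then report "not ready".
    d.log.Info("Failed to check DataReady condition", "error", err.Error())

    return false
}

clusterDataReady, err := d.isVRGConditionMet(homeCluster, VRGConditionTypeClusterDataProtected)
if err != nil {
    d.log.Info("Failed to check ClusterDataProtected condition", "error", err.Error())

    return false
}

return dataReady && clusterDataReady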

dataReady, err := d.isVRGConditionMet(homeCluster, VRGConditionTypeDataReady)
if err != nil || !dataReady {
    return false
}

@nirs nirs (Member) commented:

We need to log the error or pass it to the caller. Dropping an error silently without any check is not an option; how are we going to debug the error?

Since we don't use dataReady after this check, we should keep dataReady and err in the scope of the if block:

if dataReady, err := d.isVRGConditionMet(homeCluster, VRGConditionTypeDataReady); err != nil || !dataReady {
    return false
}

This avoids confusion about the value of err when we return below.

@ELENAGER ELENAGER (Member, Author) commented:

Done

Signed-off-by: Elena Gershkovich <elenage@il.ibm.com>

@nirs nirs (Member) left a comment:

Looks good now. Does it work? Can you share logs showing the new error messages and the expected progression?

Also maybe compare the output of:

kubectl get drpc -A -o wide -w --context hub

without and with this change.

For example, we want to be sure that the cases which are now handled as errors (e.g. missing VRG, missing condition, stale condition), and which do not update the progression, will not cause the progression to never show some values.

d.isVRGConditionMet(homeCluster, VRGConditionTypeClusterDataProtected)
dataReady, err := d.isVRGConditionMet(homeCluster, VRGConditionTypeDataReady)
if err != nil {
    d.log.Info("readyToSwitchOver", "Error", err.Error())

@nirs nirs (Member) commented:

We can use d.log.Error() now to keep errors easy to find. log.Info() is fine if you are sure that none of the possible errors is really an error, but in that case why do we return an error at all?
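
For comparison, the two logr call shapes look roughly like this (the message text here is only an example):

// Error level: logged as an error, so failures are easy to find.
d.log.Error(err, "readyToSwitchOver: failed to evaluate VRG condition")

// Info level: appropriate if the failure is an expected, transient state.
d.log.Info("readyToSwitchOver: cannot evaluate VRG condition", "error", err.Error())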

@ELENAGER ELENAGER (Member, Author) commented Apr 3, 2025:

We do not return an error here; we return "not ready to switch over to the cluster". The changes in this function are collateral damage: I wanted to separate the error from "not ready" only in the checkReadiness function, because the progression flickering happens only there.

@nirs nirs (Member) commented:

OK, so the error from d.isVRGConditionMet(homeCluster, VRGConditionTypeDataReady) is never really an error, but rather the reason why we cannot tell whether we are ready or not?

And this is new code added by this change. If this is really never an error but an expected condition, maybe we should return an enum instead of (bool, error)?

For example:

type readyStatus string

const (
    statusReady    readyStatus = "ready"
    statusNotReady readyStatus = "not-ready"
    statusUnknown  readyStatus = "unknown"
)

We did something similar in isVRConditionMet.
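
For example, a hypothetical wrapper returning such an enum might look like this (illustrative only, not taken from the actual code):

// vrgReadyStatus is a hypothetical helper wrapping the existing lookup,
// mapping (bool, error) to the three-valued status suggested above.
func (d *DRPCInstance) vrgReadyStatus(cluster, condType string) readyStatus {
    met, err := d.isVRGConditionMet(cluster, condType)
    if err != nil {
        // Missing VRG, missing condition, or stale generation.
        return statusUnknown
    }

    if !met {
        return statusNotReady
    }

    return statusReady
}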

@ELENAGER ELENAGER (Member, Author) commented Apr 3, 2025

While changing the VRG generation on dr2 manually after Failover (by editing the VRG spec), the ramen hub operator log shows a generation mismatch:

(ramen) [egershko@dhcp53-116 ramen]$ kubectl logs ramen-hub-operator-659658dfcf-z8hq8 -n ramen-system --context hub | grep "generation mismatch"
2025-04-03T15:33:36.558Z        INFO    controllers.DRPlacementControl  controller/drplacementcontrol.go:97     Process placement     {"DRPC": {"name":"deployment-rbd-drpc","namespace":"deployment-rbd"}, "rid": "c844d1b9-e067-4af5-9671-e691ceaf87b5", "error": "generation mismatch in DataReady condition on cluster dr2"}
2025-04-03T15:33:36.954Z        INFO    controllers.DRPlacementControl  controller/drplacementcontrol.go:97     Process placement     {"DRPC": {"name":"deployment-rbd-drpc","namespace":"deployment-rbd"}, "rid": "5ae516e9-24af-4f33-974e-d0506f427b04", "error": "generation mismatch in DataReady condition on cluster dr2"}
2025-04-03T15:33:37.250Z        INFO    controllers.DRPlacementControl  controller/drplacementcontrol.go:97     Process placement     {"DRPC": {"name":"deployment-rbd-drpc","namespace":"deployment-rbd"}, "rid": "bdc5906c-61d7-4a7b-a7da-dc4ea53ddebd", "error": "generation mismatch in DataReady condition on cluster dr2"}

But the Completed progression of DRPC stays unchanged:

(ramen) [egershko@dhcp53-116 ramen]$ kubectl get drpc -A -o wide -w --context hub
NAMESPACE        NAME                  AGE   PREFERREDCLUSTER   FAILOVERCLUSTER   DESIREDSTATE   CURRENTSTATE   PROGRESSION   START TIME             DURATION          PEER READY
deployment-rbd   deployment-rbd-drpc   11m   dr1                dr2               Failover       FailedOver     Completed     2025-04-03T15:23:08Z   2m26.997683793s   True
deployment-rbd   deployment-rbd-drpc   13m   dr1                dr2               Failover       FailedOver     Completed     2025-04-03T15:23:08Z   2m26.997683793s   True
deployment-rbd   deployment-rbd-drpc   14m   dr1                dr2               Failover       FailedOver     Completed     2025-04-03T15:23:08Z   2m26.997683793s   True

While changing schedulingInterval in the VRG to a random value, an error is seen in the VRG condition:

status:
  conditions:
  - lastTransitionTime: "2025-04-03T15:48:50Z"
    message: 'Failed to label PVCs for consistency groups: failed to label PVC deployment-rbd/busybox-pvc
      for consistency group (no VolumeGroupReplicationClass found to match provisioner
      and schedule)'
    observedGeneration: 9
    reason: Error
    status: "False"
    type: DataReady

and the Completed progression of the DRPC changes to WaitForReadiness:

(ramen) [egershko@dhcp53-116 ramen]$ kubectl get drpc -A -o wide -w --context hub
NAMESPACE        NAME                  AGE   PREFERREDCLUSTER   FAILOVERCLUSTER   DESIREDSTATE   CURRENTSTATE   PROGRESSION   START TIME             DURATION          PEER READY
deployment-rbd   deployment-rbd-drpc   11m   dr1                dr2               Failover       FailedOver     Completed     2025-04-03T15:23:08Z   2m26.997683793s   True
deployment-rbd   deployment-rbd-drpc   13m   dr1                dr2               Failover       FailedOver     Completed     2025-04-03T15:23:08Z   2m26.997683793s   True
deployment-rbd   deployment-rbd-drpc   14m   dr1                dr2               Failover       FailedOver     Completed     2025-04-03T15:23:08Z   2m26.997683793s   True
deployment-rbd   deployment-rbd-drpc   16m   dr1                dr2               Failover       FailedOver     Completed     2025-04-03T15:23:08Z   2m26.997683793s   True
deployment-rbd   deployment-rbd-drpc   18m   dr1                dr2               Failover       FailedOver     WaitForReadiness   2025-04-03T15:23:08Z   2m26.997683793s   True

@nirs nirs (Member) commented Apr 3, 2025

Looks good!

Let's get a review from others with a deeper understanding of the possible outcome of this change.

@ELENAGER ELENAGER changed the title from "Prevent flickering between Completed and WaitingForReadiness progression" to "Prevent flickering between Completed/CleaningUp and WaitingForReadiness progression" on Apr 3, 2025