Prevent flickering between Completed/CleaningUp and WaitingForReadiness progression by ELENAGER · Pull Request #1983 · RamenDR/ramen · GitHub

Prevent flickering between Completed/CleaningUp and WaitingForReadiness progression #1983

Draft · ELENAGER wants to merge 1 commit into main from DFBUGS-612-new

Conversation

@ELENAGER ELENAGER (Member) commented Apr 2, 2025

We need to distinguish between the situation when the VRG conditions are not yet ready, which prevents us from moving to the failover/relocate Completed state, and the situation when there is a real problem with the VRG conditions (for example, a condition doesn't exist or its generation doesn't match the VRG generation).
In case of a real problem we must not change the progression; we just return an error and wait for the next reconciliation.
Fixes: https://issues.redhat.com/browse/DFBUGS-612
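
Roughly, the intended distinction looks like this (a minimal sketch only; the exact checkReadiness signature and receiver are illustrative, assuming isVRGConditionMet returns (bool, error) where the error covers a missing condition or a generation mismatch):

// Sketch only: separate "not ready yet" from "cannot evaluate readiness".
func (d *DRPCInstance) checkReadiness(homeCluster string) (bool, error) {
    dataReady, err := d.isVRGConditionMet(homeCluster, VRGConditionTypeDataReady)
    if err != nil {
        // Real problem: missing condition or generation mismatch.
        // Keep the current progression, return the error and wait
        // for the next reconciliation.
        return false, err
    }

    // The condition exists and matches the VRG generation; its value
    // tells us whether the progression may move to Completed.
    return dataReady, nil
}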

@ELENAGER ELENAGER force-pushed the DFBUGS-612-new branch 2 times, most recently from 110062c to 6324972 on April 2, 2025 15:12

clusterDataReady, err := d.isVRGConditionMet(homeCluster, VRGConditionTypeClusterDataProtected)

return err == nil && clusterDataReady

@nirs nirs (Member) commented:

How can err be non-nil here? We return at line 1199 in that case.

@ELENAGER ELENAGER (Member, Author) commented Apr 3, 2025:

It is a different error: at line 1199 we checked the VRGConditionTypeDataReady condition, while here it is the VRGConditionTypeClusterDataProtected condition, which can also be absent or have a mismatched generation.

@nirs nirs (Member) commented:

Right, but we drop this error silently again.

@ELENAGER ELENAGER (Member, Author) commented:

@nirs both errors are logged now
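
For illustration, the resulting pattern could look roughly like this (a sketch only; the log messages and the ClusterDataProtected handling are assumptions, with d.log being the controller's logr.Logger):

dataReady, err := d.isVRGConditionMet(homeCluster, VRGConditionTypeDataReady)
if err != nil {
    // Log instead of dropping the error, then report "not ready".
    d.log.Info("Failed to check DataReady condition", "error", err.Error())

    return false
}

clusterDataReady, err := d.isVRGConditionMet(homeCluster, VRGConditionTypeClusterDataProtected)
if err != nil {
    d.log.Info("Failed to check ClusterDataProtected condition", "error", err.Error())

    return false
}

return dataReady && clusterDataReady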

dataReady, err := d.isVRGConditionMet(homeCluster, VRGConditionTypeDataReady)
if err != nil || !dataReady {
    return false
}

@nirs nirs (Member) commented:

We need to log the error or pass it to the caller. Dropping an error silently without any check is not an option; how are we going to debug the error?

Since we don't use dataReady after this check, we should keep dataReady and err in the scope of the if block:

if dataReady, err := d.isVRGConditionMet(homeCluster, VRGConditionTypeDataReady); err != nil || !dataReady {
    return false
}

This avoids confusion about the value of err when we return below.

@ELENAGER ELENAGER (Member, Author) commented:

Done

Signed-off-by: Elena Gershkovich <elenage@il.ibm.com>

@nirs nirs (Member) left a comment:

Looks good now. Does it work? Can you share logs showing the new error messages and the expected progression?

Also maybe compare the output of:

kubectl get drpc -A -o wide -w --context hub

without and with this change.

For example, we want to be sure that the cases which are now handled as errors (e.g. missing VRG, missing condition, stale condition), and which do not update the progression, will not cause the progression to never show some values.

d.isVRGConditionMet(homeCluster, VRGConditionTypeClusterDataProtected)
dataReady, err := d.isVRGConditionMet(homeCluster, VRGConditionTypeDataReady)
if err != nil {
    d.log.Info("readyToSwitchOver", "Error", err.Error())

@nirs nirs (Member) commented:

We can use d.log.Error() now to keep errors easy to find. log.Info() is fine if you are sure that none of the possible errors is really an error, but in that case why do we return an error at all?
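
For comparison, the two logr call shapes look roughly like this (the message text here is only an example):

// Error level: logged as an error, so failures are easy to find.
d.log.Error(err, "readyToSwitchOver: failed to evaluate VRG condition")

// Info level: appropriate if the failure is an expected, transient state.
d.log.Info("readyToSwitchOver: cannot evaluate VRG condition", "error", err.Error())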

@ELENAGER ELENAGER (Member, Author) commented Apr 3, 2025:

We do not return an error here; we return "not ready to switch over to the cluster". The changes in this function are collateral damage: I wanted to separate the error from "not ready" only in the checkReadiness function, because the progression flickering happens only there.

@nirs nirs (Member) commented:

OK, so the error from d.isVRGConditionMet(homeCluster, VRGConditionTypeDataReady) is never really an error, but rather the reason why we cannot tell whether we are ready or not?

And this is new code added by this change. If this is really never an error but an expected condition, maybe we should return an enum instead of (bool, error)?

For example:

type readyStatus string

const (
    statusReady    readyStatus = "ready"
    statusNotReady readyStatus = "not-ready"
    statusUnknown  readyStatus = "unknown"
)

We did something similar in isVRConditionMet.
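
For example, a hypothetical wrapper returning such an enum might look like this (illustrative only, not taken from the actual code):

// vrgReadyStatus is a hypothetical helper wrapping the existing lookup,
// mapping (bool, error) to the three-valued status suggested above.
func (d *DRPCInstance) vrgReadyStatus(cluster, condType string) readyStatus {
    met, err := d.isVRGConditionMet(cluster, condType)
    if err != nil {
        // Missing VRG, missing condition, or stale generation.
        return statusUnknown
    }

    if !met {
        return statusNotReady
    }

    return statusReady
}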

@ELENAGER ELENAGER (Member, Author) commented Apr 3, 2025

While changing the VRG generation on dr2 manually after Failover (by editing the VRG spec), the ramen hub operator log shows a generation mismatch:

(ramen) [egershko@dhcp53-116 ramen]$ kubectl logs ramen-hub-operator-659658dfcf-z8hq8 -n ramen-system --context hub | grep "generation mismatch"
2025-04-03T15:33:36.558Z        INFO    controllers.DRPlacementControl  controller/drplacementcontrol.go:97     Process placement     {"DRPC": {"name":"deployment-rbd-drpc","namespace":"deployment-rbd"}, "rid": "c844d1b9-e067-4af5-9671-e691ceaf87b5", "error": "generation mismatch in DataReady condition on cluster dr2"}
2025-04-03T15:33:36.954Z        INFO    controllers.DRPlacementControl  controller/drplacementcontrol.go:97     Process placement     {"DRPC": {"name":"deployment-rbd-drpc","namespace":"deployment-rbd"}, "rid": "5ae516e9-24af-4f33-974e-d0506f427b04", "error": "generation mismatch in DataReady condition on cluster dr2"}
2025-04-03T15:33:37.250Z        INFO    controllers.DRPlacementControl  controller/drplacementcontrol.go:97     Process placement     {"DRPC": {"name":"deployment-rbd-drpc","namespace":"deployment-rbd"}, "rid": "bdc5906c-61d7-4a7b-a7da-dc4ea53ddebd", "error": "generation mismatch in DataReady condition on cluster dr2"}

But the Completed progression of DRPC stays unchanged:

(ramen) [egershko@dhcp53-116 ramen]$ kubectl get drpc -A -o wide -w --context hub
NAMESPACE        NAME                  AGE   PREFERREDCLUSTER   FAILOVERCLUSTER   DESIREDSTATE   CURRENTSTATE   PROGRESSION   START TIME             DURATION          PEER READY
deployment-rbd   deployment-rbd-drpc   11m   dr1                dr2               Failover       FailedOver     Completed     2025-04-03T15:23:08Z   2m26.997683793s   True
deployment-rbd   deployment-rbd-drpc   13m   dr1                dr2               Failover       FailedOver     Completed     2025-04-03T15:23:08Z   2m26.997683793s   True
deployment-rbd   deployment-rbd-drpc   14m   dr1                dr2               Failover       FailedOver     Completed     2025-04-03T15:23:08Z   2m26.997683793s   True

While changing schedulingInterval in the VRG to a random value, an error is seen in the VRG condition:

status:
  conditions:
  - lastTransitionTime: "2025-04-03T15:48:50Z"
    message: 'Failed to label PVCs for consistency groups: failed to label PVC deployment-rbd/busybox-pvc
      for consistency group (no VolumeGroupReplicationClass found to match provisioner
      and schedule)'
    observedGeneration: 9
    reason: Error
    status: "False"
    type: DataReady

and the Completed progression of the DRPC changes to WaitForReadiness:

(ramen) [egershko@dhcp53-116 ramen]$ kubectl get drpc -A -o wide -w --context hub
NAMESPACE        NAME                  AGE   PREFERREDCLUSTER   FAILOVERCLUSTER   DESIREDSTATE   CURRENTSTATE   PROGRESSION   START TIME             DURATION          PEER READY
deployment-rbd   deployment-rbd-drpc   11m   dr1                dr2               Failover       FailedOver     Completed     2025-04-03T15:23:08Z   2m26.997683793s   True
deployment-rbd   deployment-rbd-drpc   13m   dr1                dr2               Failover       FailedOver     Completed     2025-04-03T15:23:08Z   2m26.997683793s   True
deployment-rbd   deployment-rbd-drpc   14m   dr1                dr2               Failover       FailedOver     Completed     2025-04-03T15:23:08Z   2m26.997683793s   True
deployment-rbd   deployment-rbd-drpc   16m   dr1                dr2               Failover       FailedOver     Completed     2025-04-03T15:23:08Z   2m26.997683793s   True
deployment-rbd   deployment-rbd-drpc   18m   dr1                dr2               Failover       FailedOver     WaitForReadiness   2025-04-03T15:23:08Z   2m26.997683793s   True

@nirs nirs (Member) commented Apr 3, 2025

Looks good!

Let's get a review from others with a deeper understanding of the possible outcome of this change.

@ELENAGER ELENAGER changed the title from "Prevent flickering between Completed and WaitingForReadiness progression" to "Prevent flickering between Completed/CleaningUp and WaitingForReadiness progression" on Apr 3, 2025