-
Notifications
You must be signed in to change notification settings - Fork 61
Prevent flickering between Completed/CleaningUp and WaitingForReadiness progression #1983
New issue
Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.
By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.
Already on GitHub? Sign in to your account
base: main
Are you sure you want to change the base?
Conversation
110062c
to
6324972
Compare
|
||
clusterDataReady, err := d.isVRGConditionMet(homeCluster, VRGConditionTypeClusterDataProtected) | ||
|
||
return err == nil && clusterDataReady |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
How err can be non nil here? we return in line 1199 in this case.
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
It is another error - in line 1199 we checked VRGConditionTypeDataReady condition and here is VRGConditionTypeClusterDataProtected condition, it can be absent or generation can be wrong also
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Right but we drop this error silently again.
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
@nirs both errors are logged now
dataReady, err := d.isVRGConditionMet(homeCluster, VRGConditionTypeDataReady) | ||
if err != nil || !dataReady { | ||
return false | ||
} |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
We need to log the error or pass it to caller. Dropping an error silently without any check is not an option. How are we going to debug the error?
Since we we don't use dataReady after this check, we should keep dataReady and err in the scope of the if block:
if dataReady, err := d.isVRGConditionMet(homeCluster, VRGConditionTypeDataReady); err != nil || !dataReady {
return false
}
This avoids the confusion about err value when we return below.
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Done
Signed-off-by: Elena Gershkovich <elenage@il.ibm.com>
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Looks good now. Does it work? Can you share logs showing the new error logs and expected progression?
Also maybe compare output of:
kubectl get drpoc -A -o wide -w --context hub
Witout and with this change.
For example we want to be sure that case which are handled as errors now (e.g. missing vrg, missing condition, stale condition) and do not update the progression, will not cause the progression to never show some values.
d.isVRGConditionMet(homeCluster, VRGConditionTypeClusterDataProtected) | ||
dataReady, err := d.isVRGConditionMet(homeCluster, VRGConditionTypeDataReady) | ||
if err != nil { | ||
d.log.Info("readyToSwitchOver", "Error", err.Error()) |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
We can use d.log.Error() now to keep errors easy to find. log.Info() is good if you are sure that all possible errors are not really an error, but in this case why do we return an error?
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
We do not return error here, we return "not ready to switch over to cluster". Changes in this function is collateral damage, I wanted to separate error from "not ready" only in function checkReadiness, because progression flickering happens only there.
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Ok, so the error from d.isVRGConditionMet(homeCluster, VRGConditionTypeDataReady)
is never an error but more the reason why we cannot tell if we are ready or not?
And this is new code added by this change. If this is really never an error but expected condition, maybe we need to return an enum instead of bool, error?
For example:
type readyStatus string
const statusReady readyStatus("ready")
const statusNotReady readyStatus("not-ready")
const statusUnknown readyStatus("unknown")
We did something similar in isVRConditionMet.
While changing VRG generation on dr2 after Failover manually (by changing VRG spec) the ramen hub operator log shows generation mismatch:
But the Completed progression of DRPC stays unchanged:
While changing schedulingInterval in VRG to random value, error is seen in VRG condition:
and Completed progression of DRPC is changed to WaitForReadiness
|
Looks good! Lets get review from others with dipper understanding of the possible outcome of this change. |
We need to distinguish between situation, when VRG conditions are not ready and preventing us from going to failover/relocate completed state and situation, when we have real problem in VRG conditions (for example, condition doesn't exist or condition generation doesn't match VRG generation.
In case of real problem we must not change progression, just return error and wait for reconciliation.
Fixes: https://issues.redhat.com/browse/DFBUGS-612