-
Notifications
You must be signed in to change notification settings - Fork 8
Support "transactional" Get-then-Update and Get-then-Delete #621
New issue
Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.
By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.
Already on GitHub? Sign in to your account
Conversation
10a3d64
to
0897277
Compare
@codyjrivera @Catoverflow Sorry that this PR accidentally contains some changes for formatting the proof of VRS controllers (mostly on removing whitespace); I turned on format-on-save in my dev env so... |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Looks good. Obviously the predicate being hardcoded isn't ideal, but good work!
Ideally, more people should check the mathematics of course.
@@ -385,6 +408,157 @@ where | |||
return Ok(Action::requeue(Duration::from_secs(60))); | |||
} | |||
10000
|
|||
// transactional_get_then_delete_by_retry retries get and then delete upon conflict errors to simulate atomic operations. |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
More of a basic Kubernetes question than a comment on this code, but do real-life examples of this 'read->update' loop have a timeout, or would they just contend for a resource forever?
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
They limit the number of retries (and also have back-off when a try fails).
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Is there some other way that would be facilitated? Or are we content, as a simplifier for the purposes of Anvil, with possible starvation?
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
If we limit the number of retries then there is no way to have a refinement mapping from the retry code to the transactional spec API.
We could implement back-off later, which does not affect the liveness argument.
with possible starvation?
What do you mean by starvation here? You mean the controller always retries and never terminates?
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Understandable, I was just curious.
I meant 'always retries'.
let current_obj = s.resources[req.key()]; | ||
// Step 2: if the object exists, perform a check using a predicate on object | ||
// The predicate: Is the current object owned by req.owner_ref? | ||
// TODO: the predicate should be provided by clients instead of the hardcoded one |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Long shot: might there be a way to specify this predicate at compile/typechecking time, rather than either hardcoding it or trying to pass in a higher-order function?
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
I hope there is a way without using dyn
. Need to ask @utaal
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
I think we need to have lemma like lemma_always_key_of_object_in_matched_ok_create_resp_message_is_same_as_key_of_pending_req
, do we want to add those in another PR and then we can migrate to use this API?
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Let's add the lemmas when we need them
7902839
to
1991ea7
Compare
Signed-off-by: Xudong Sun <xudongs3@illinois.edu>
Signed-off-by: Xudong Sun <xudongs3@illinois.edu>
Signed-off-by: Xudong Sun <xudongs3@illinois.edu>
Signed-off-by: Xudong Sun <xudongs3@illinois.edu>
Signed-off-by: Xudong Sun <xudongs3@illinois.edu>
Signed-off-by: Xudong Sun <xudongs3@illinois.edu>
Signed-off-by: Xudong Sun <xudongs3@illinois.edu>
Signed-off-by: Xudong Sun <xudongs3@illinois.edu>
Signed-off-by: Xudong Sun <xudongs3@illinois.edu>
Signed-off-by: Xudong Sun <xudongs3@illinois.edu>
Signed-off-by: Xudong Sun <xudongs3@illinois.edu>
Signed-off-by: Xudong Sun <xudongs3@illinois.edu>
Signed-off-by: Xudong Sun <xudongs3@illinois.edu>
Signed-off-by: Xudong Sun <xudongs3@illinois.edu>
1991ea7
to
999b343
Compare
This PR introduces "transactional" APIs: Get-then-Update and Get-then-Delete. This PR breaks some proofs of the VRS controller. Fixing the broken proof is left for future work. **Spec of Get-then-Update**: The API server checks whether the object exists and whether the object is owned by some owner reference, and if so the API server updates the object using the current resource version and uid. These operations happen in a single step. Get-then-Update will never fail due to Conflict error because it applies update using the current resource version and uid. **Implementation of Get-then-Update**: Kubernetes does not provide such a transactional API, so Get-then-Update is implemented by retrying get and update. It first gets the object, checks its owner, then updates the object using the resource version and uid from the get result. If the update fails due to Conflict error (e.g., some other controller updates the object between our get and update), it retries get and update until it succeeds or other error happens. Note that the termination of the implementation depends on fairness. The spec and implementation of Get-then-Delete are similar to Get-then-Update. **Why the implementation refines the spec**: There is currently no machine-checked proof for the refinement. The proof will require the fairness condition that guarantees the implementation's termination. The intuition of the proof is to show that for any possible trace of the implementation, if the implementation (re)tries "Get-then-Update" for N times, the first N-1 tries will not cause any changes to the cluster state (as they failed due to Conflict error) so the first N-1 tries are mapped to no-ops to the trace of the spec. For the Nth try of "Get-then-Update", there are two cases to consider: (1) the object gets deleted between the get and update, and the Nth try fails with ObjectNotFound error. In this case, we map the implementation's get to a no-op before the deletion, and the implementation's update to the spec's atomic get-then-update after the deletion. (2) the object still exists when the update happens, then it implies no other controllers touches the object between the get and update (otherwise the Nth try will also fail with Conflict error). In this case, we map the implementation's "Get-then-Update" to the spec's atomic Get-then-Update. **Limitation**: Currently, Get-then-Update and Get-then-Delete's expressiveness is limited: they only check the object's owner before issuing update or delete, instead of checking arbitrary conditions of the object. To support arbitrary condition check, the controller's `reconcile_core` needs to return a function (e.g., `Fn(DynamicObject) -> bool`) as a field of the request struct. Rust requires the function to be wrapped in a `Box<dyn ...>`, but `dyn` is not supported by Verus for now. --------- Signed-off-by: Xudong Sun <xudongs3@illinois.edu>
This PR introduces "transactional" APIs: Get-then-Update and Get-then-Delete. This PR breaks some proofs of the VRS controller. Fixing the broken proof is left for future work.
Spec of Get-then-Update: The API server checks whether the object exists and whether the object is owned by some owner reference, and if so the API server updates the object using the current resource version and uid. These operations happen in a single step. Get-then-Update will never fail due to Conflict error because it applies update using the current resource version and uid.
Implementation of Get-then-Update: Kubernetes does not provide such a transactional API, so Get-then-Update is implemented by retrying get and update. It first gets the object, checks its owner, then updates the object using the resource version and uid from the get result. If the update fails due to Conflict error (e.g., some other controller updates the object between our get and update), it retries get and update until it succeeds or other error happens. Note that the termination of the implementation depends on fairness.
The spec and implementation of Get-then-Delete are similar to Get-then-Update.
Why the implementation refines the spec: There is currently no machine-checked proof for the refinement. The proof will require the fairness condition that guarantees the implementation's termination. The intuition of the proof is to show that for any possible trace of the implementation, if the implementation (re)tries "Get-then-Update" for N times, the first N-1 tries will not cause any changes to the cluster state (as they failed due to Conflict error) so the first N-1 tries are mapped to no-ops to the trace of the spec. For the Nth try of "Get-then-Update", there are two cases to consider: (1) the object gets deleted between the get and update, and the Nth try fails with ObjectNotFound error. In this case, we map the implementation's get to a no-op before the deletion, and the implementation's update to the spec's atomic get-then-update after the deletion. (2) the object still exists when the update happens, then it implies no other controllers touches the object between the get and update (otherwise the Nth try will also fail with Conflict error). In this case, we map the implementation's "Get-then-Update" to the spec's atomic Get-then-Update.
Limitation: Currently, Get-then-Update and Get-then-Delete's expressiveness is limited: they only check the object's owner before issuing update or delete, instead of checking arbitrary conditions of the object. To support arbitrary condition check, the controller's
reconcile_core
needs to return a function (e.g.,Fn(DynamicObject) -> bool
) as a field of the request struct. Rust requires the function to be wrapped in aBox<dyn ...>
, butdyn
is not supported by Verus for now.