Support "transactional" Get-then-Update and Get-then-Delete #621

marshtompsxd · 2025-05-10T17:22:27Z

This PR introduces "transactional" APIs: Get-then-Update and Get-then-Delete. It breaks some proofs of the VRS controller; fixing the broken proofs is left for future work.

Spec of Get-then-Update: The API server checks whether the object exists and whether it is owned by the owner reference given in the request, and if so the API server updates the object using the current resource version and uid. These operations happen in a single step. Get-then-Update never fails with a Conflict error because it applies the update using the current resource version and uid.
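A minimal sketch of this atomic step, written as plain Rust rather than as the actual Verus state machine. The types `Object`, `ClusterState`, and `ApiError`, the `NotOwned` variant, and the use of a bare owner uid in place of a full owner reference are all assumptions made for illustration.

```rust
use std::collections::HashMap;

// Hypothetical, simplified object: only the fields the prose mentions.
#[derive(Clone, Debug, PartialEq)]
struct Object {
    uid: u64,
    resource_version: u64,
    owner_uid: Option<u64>, // uid of the owning object, if any
    spec: String,
}

#[derive(Debug, PartialEq)]
enum ApiError {
    ObjectNotFound,
    NotOwned, // hypothetical stand-in for the ownership-check failure
}

// Hypothetical cluster state: stored objects plus a version counter.
struct ClusterState {
    resources: HashMap<String, Object>,
    next_resource_version: u64,
}

// One atomic step: the existence check, the ownership check, and the update
// all happen inside a single transition, so no Conflict is possible.
fn get_then_update(
    s: &mut ClusterState,
    key: &str,
    owner_uid: u64,
    new_spec: &str,
) -> Result<Object, ApiError> {
    let rv = s.next_resource_version;
    let current = s.resources.get_mut(key).ok_or(ApiError::ObjectNotFound)?;
    if current.owner_uid != Some(owner_uid) {
        return Err(ApiError::NotOwned);
    }
    // The update reuses the current uid and only bumps the resource version.
    current.spec = new_spec.to_string();
    current.resource_version = rv;
    let updated = current.clone();
    s.next_resource_version = rv + 1;
    Ok(updated)
}
```

Because the function reads and writes the state in one call, there is no window in which another writer could invalidate the resource version.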

Implementation of Get-then-Update: Kubernetes does not provide such a transactional API, so Get-then-Update is implemented by retrying get and update. It first gets the object, checks its owner, then updates the object using the resource version and uid from the get result. If the update fails with a Conflict error (e.g., some other controller updates the object between our get and update), it retries get and update until the update succeeds or a different error occurs. Note that the termination of the implementation depends on fairness.
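The retry loop can be sketched as follows. This is not the code in controller_runtime.rs: the `Api` trait, the `MockServer`, and the error variants are hypothetical, and the mock reports a uid mismatch as Conflict purely for simplicity.

```rust
#[derive(Clone, Debug, PartialEq)]
struct Object {
    uid: u64,
    resource_version: u64,
    owner_uid: Option<u64>,
    spec: String,
}

#[derive(Debug, PartialEq)]
enum ApiError {
    ObjectNotFound,
    Conflict,
    NotOwned, // hypothetical stand-in for the ownership-check failure
}

// Hypothetical stand-in for the Kubernetes client used by the shim layer.
trait Api {
    fn get(&mut self, key: &str) -> Result<Object, ApiError>;
    fn update(&mut self, key: &str, obj: Object) -> Result<Object, ApiError>;
}

// Get, check the owner, update with the fetched uid and resource version,
// and retry only when the update reports Conflict.
fn get_then_update_by_retry<A: Api>(
    api: &mut A,
    key: &str,
    owner_uid: u64,
    new_spec: &str,
) -> Result<Object, ApiError> {
    loop {
        let current = api.get(key)?;
        if current.owner_uid != Some(owner_uid) {
            return Err(ApiError::NotOwned);
        }
        let desired = Object { spec: new_spec.to_string(), ..current };
        match api.update(key, desired) {
            Err(ApiError::Conflict) => continue, // lost a race: start over
            other => return other,               // success or any other error
        }
    }
}

// Mock server: the first `interferences` updates lose a race with another writer.
struct MockServer {
    obj: Option<Object>,
    interferences: u32,
}

impl Api for MockServer {
    fn get(&mut self, _key: &str) -> Result<Object, ApiError> {
        self.obj.clone().ok_or(ApiError::ObjectNotFound)
    }

    fn update(&mut self, _key: &str, obj: Object) -> Result<Object, ApiError> {
        let stored = self.obj.as_mut().ok_or(ApiError::ObjectNotFound)?;
        if self.interferences > 0 {
            self.interferences -= 1;
            stored.resource_version += 1; // simulated concurrent update
        }
        if obj.uid != stored.uid || obj.resource_version != stored.resource_version {
            return Err(ApiError::Conflict);
        }
        *stored = Object { resource_version: stored.resource_version + 1, ..obj };
        Ok(stored.clone())
    }
}
```

The loop has no retry bound, which matches the description above: it terminates only if the interfering writes eventually stop, which is the fairness assumption.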

The spec and implementation of Get-then-Delete are similar to Get-then-Update.

Why the implementation refines the spec: There is currently no machine-checked proof of the refinement. The proof will require the fairness condition that guarantees the implementation's termination. The intuition is to show that, for any possible trace of the implementation, if the implementation tries "Get-then-Update" N times, the first N-1 tries do not change the cluster state (they fail with Conflict errors), so they are mapped to no-ops in the trace of the spec. For the Nth try, there are two cases to consider: (1) The object is deleted between the get and the update, and the Nth try fails with an ObjectNotFound error. In this case, we map the implementation's get to a no-op before the deletion, and the implementation's update to the spec's atomic Get-then-Update after the deletion. (2) The object still exists when the update happens, which implies that no other controller touched the object between the get and the update (otherwise the Nth try would also fail with a Conflict error). In this case, we map the implementation's "Get-then-Update" to the spec's atomic Get-then-Update.
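The mapping described above can be illustrated with a toy step-by-step translation from implementation traces to spec traces. It is only an illustration of the intuition, not a proof; the `ImplStep` and `SpecStep` enums are hypothetical, and only the cases discussed in the paragraph are modeled (in case 2, the get is treated as a no-op and the successful update as the atomic step).

```rust
// Hypothetical events in a trace of the retry-based implementation.
#[derive(Clone, Debug, PartialEq)]
enum ImplStep {
    Get,            // the controller's get
    UpdateConflict, // an update that failed with Conflict
    UpdateOk,       // the final, successful update
    UpdateNotFound, // the final update, failing with ObjectNotFound
    OtherWrite,     // another controller updates the object
    OtherDelete,    // another controller deletes the object
}

// Hypothetical events in a trace of the transactional spec.
#[derive(Clone, Debug, PartialEq)]
enum SpecStep {
    NoOp,
    AtomicGetThenUpdateOk,
    AtomicGetThenUpdateNotFound,
    OtherWrite,
    OtherDelete,
}

// Gets and conflicting updates leave the cluster state unchanged, so they
// become no-ops; the final update becomes the spec's single atomic step.
fn refine_step(step: &ImplStep) -> SpecStep {
    match step {
        ImplStep::Get | ImplStep::UpdateConflict => SpecStep::NoOp,
        ImplStep::UpdateOk => SpecStep::AtomicGetThenUpdateOk,
        ImplStep::UpdateNotFound => SpecStep::AtomicGetThenUpdateNotFound,
        ImplStep::OtherWrite => SpecStep::OtherWrite,
        ImplStep::OtherDelete => SpecStep::OtherDelete,
    }
}

// Map a whole trace and drop the no-ops (stuttering steps).
fn refine_trace(trace: &[ImplStep]) -> Vec<SpecStep> {
    trace
        .iter()
        .map(refine_step)
        .filter(|s| *s != SpecStep::NoOp)
        .collect()
}
```

For example, a trace with one failed try followed by case (1), namely Get, OtherWrite, UpdateConflict, Get, OtherDelete, UpdateNotFound, maps to OtherWrite, OtherDelete, AtomicGetThenUpdateNotFound: the spec's atomic step lands after the deletion.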

Limitation: Currently, the expressiveness of Get-then-Update and Get-then-Delete is limited: they only check the object's owner before issuing the update or delete, instead of checking arbitrary conditions on the object. To support arbitrary condition checks, the controller's reconcile_core needs to return a function (e.g., Fn(DynamicObject) -> bool) as a field of the request struct. Rust requires the function to be wrapped in a Box<dyn ...>, but dyn is not supported by Verus for now.
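For reference, in plain Rust (outside Verus) a request struct carrying a client-supplied condition might look like the sketch below. `GetThenUpdateRequest`, its fields, and the simplified `DynamicObject` are hypothetical, and the closure borrows the object instead of taking it by value.

```rust
// Hypothetical, simplified stand-in for the real DynamicObject.
struct DynamicObject {
    owner_uid: Option<u64>,
    labels: Vec<String>,
}

// Hypothetical request struct. The boxed trait object lets one concrete
// struct type hold any closure, and it is exactly the part that needs `dyn`.
struct GetThenUpdateRequest {
    key: String,
    condition: Box<dyn Fn(&DynamicObject) -> bool>,
}

// What the shim layer would evaluate between the get and the update.
fn should_update(req: &GetThenUpdateRequest, current: &DynamicObject) -> bool {
    (req.condition)(current)
}
```

A caller could then pass any closure, for example `Box::new(move |o: &DynamicObject| o.owner_uid == Some(owner))`, without the request type changing.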

marshtompsxd · 2025-05-12T17:31:12Z

@codyjrivera @Catoverflow Sorry that this PR accidentally contains some changes for formatting the proof of VRS controllers (mostly on removing whitespace); I turned on format-on-save in my dev env so...

codyjrivera

Looks good. Obviously the predicate being hardcoded isn't ideal, but good work!

Ideally, more people should check the mathematics of course.

codyjrivera · 2025-05-12T18:09:48Z

src/v2/shim_layer/controller_runtime.rs

@@ -385,6 +408,157 @@ where
    return Ok(Action::requeue(Duration::from_secs(60)));
 }

+// transactional_get_then_delete_by_retry retries get and then delete upon conflict errors to simulate atomic operations.


More of a basic Kubernetes question than a comment on this code, but do real-life examples of this 'read->update' loop have a timeout, or would they just contend for a resource forever?

They limit the number of retries (and also have back-off when a try fails).

See this https://github.com/kubernetes/kubernetes/blob/0e64c6443f8e1f760c92a64304925986d4519a77/staging/src/k8s.io/client-go/util/retry/util.go#L68

Is there some other way that would be facilitated? Or are we content, as a simplifier for the purposes of Anvil, with possible starvation?

If we limit the number of retries then there is no way to have a refinement mapping from the retry code to the transactional spec API.

We could implement back-off later, which does not affect the liveness argument.

> with possible starvation?

What do you mean by starvation here? You mean the controller always retries and never terminates?

Understandable, I was just curious.

I meant 'always retries'.

codyjrivera · 2025-05-12T18:13:10Z

src/v2/kubernetes_cluster/spec/api_server/state_machine.rs

+            let current_obj = s.resources[req.key()];
+            // Step 2: if the object exists, perform a check using a predicate on object
+            // The predicate: Is the current object owned by req.owner_ref?
+            // TODO: the predicate should be provided by clients instead of the hardcoded one


Long shot: might there be a way to specify this predicate at compile/typechecking time, rather than either hardcoding it or trying to pass in a higher-order function?

I hope there is a way without using dyn. Need to ask @utaal
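One dyn-free direction, sketched here as an assumption rather than a confirmed answer: make each predicate its own type behind a trait and make the request generic over it, so the predicate is resolved at compile time. The trait `ObjectPredicate`, the type `OwnedBy`, and the generic `GetThenUpdateRequest<P>` are hypothetical, and whether this fits Verus and reconcile_core has not been checked.

```rust
// Hypothetical, simplified stand-in for the real DynamicObject.
struct DynamicObject {
    owner_uid: Option<u64>,
}

// Each predicate is a distinct type implementing this trait.
trait ObjectPredicate {
    fn holds(&self, obj: &DynamicObject) -> bool;
}

// The currently hardcoded check, expressed as one such predicate.
struct OwnedBy {
    owner_uid: u64,
}

impl ObjectPredicate for OwnedBy {
    fn holds(&self, obj: &DynamicObject) -> bool {
        obj.owner_uid == Some(self.owner_uid)
    }
}

// The request is generic over the predicate type, so no Box<dyn ...> appears.
struct GetThenUpdateRequest<P: ObjectPredicate> {
    key: String,
    predicate: P,
}

fn check<P: ObjectPredicate>(req: &GetThenUpdateRequest<P>, current: &DynamicObject) -> bool {
    req.predicate.holds(current)
}
```

The cost of this design is that the type parameter propagates to every type that embeds the request, which may or may not be acceptable for the request enum.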

I think we need lemmas like lemma_always_key_of_object_in_matched_ok_create_resp_message_is_same_as_key_of_pending_req. Do we want to add those in another PR, and then migrate to this API?

Let's add the lemmas when we need them

Signed-off-by: Xudong Sun <xudongs3@illinois.edu>


marshtompsxd changed the title ~~Support "transactional" Get-Then-Update to address fairness issues on version races~~ Support "transactional" Get-Then-Update API May 10, 2025

marshtompsxd force-pushed the xudong/fictional-transaction branch 2 times, most recently from 10a3d64 to 0897277 Compare May 12, 2025 00:07

marshtompsxd changed the title ~~Support "transactional" Get-Then-Update API~~ Support "transactional" Get-Then-Update and Get-Then-Delete May 12, 2025

marshtompsxd changed the title ~~Support "transactional" Get-Then-Update and Get-Then-Delete~~ Support "transactional" Get-then-Update and Get-then-Delete May 12, 2025

marshtompsxd marked this pull request as ready for review May 12, 2025 17:29

marshtompsxd requested review from codyjrivera and Catoverflow May 12, 2025 17:29

codyjrivera reviewed May 12, 2025


marshtompsxd force-pushed the xudong/fictional-transaction branch from 7902839 to 1991ea7 Compare May 13, 2025 15:01

marshtompsxd added this pull request to the merge queue May 15, 2025

github-merge-queue bot removed this pull request from the merge queue due to failed status checks May 15, 2025

marshtompsxd added 14 commits May 15, 2025 16:03

API for GetThenUpdate

daf0f45

Signed-off-by: Xudong Sun <xudongs3@illinois.edu>

Support Get-then-Update

e3403b6

Signed-off-by: Xudong Sun <xudongs3@illinois.edu>

Add transactional error

8fbbc80

Signed-off-by: Xudong Sun <xudongs3@illinois.edu>

Implement transactional Get-Then-Update with retry

57266dc

Signed-off-by: Xudong Sun <xudongs3@illinois.edu>

Add transactional get-then-update API to the state machine

d201bec

Signed-off-by: Xudong Sun <xudongs3@illinois.edu>

Fix bug in shim layer

a3e4f25

Signed-off-by: Xudong Sun <xudongs3@illinois.edu>

Fix error type of wrong uid

864975c

Signed-off-by: Xudong Sun <xudongs3@illinois.edu>

Mark broken proofs as external

f4ed934

Signed-off-by: Xudong Sun <xudongs3@illinois.edu>

Doc the new API and TODO

955ea9a

Signed-off-by: Xudong Sun <xudongs3@illinois.edu>

Fix conflicts

b2bbf7d

Signed-off-by: Xudong Sun <xudongs3@illinois.edu>

Add get-then-delete

c5e1211

Signed-off-by: Xudong Sun <xudongs3@illinois.edu>

Remove updated comments

78ed0e5

Signed-off-by: Xudong Sun <xudongs3@illinois.edu>

Another broken proof

70c5c9e

Signed-off-by: Xudong Sun <xudongs3@illinois.edu>

Fix vd failures

999b343

Signed-off-by: Xudong Sun <xudongs3@illinois.edu>

marshtompsxd force-pushed the xudong/fictional-transaction branch from 1991ea7 to 999b343 Compare May 15, 2025 21:08

marshtompsxd added this pull request to the merge queue May 15, 2025

Merged via the queue into main with commit 67e7f2c May 15, 2025
9 checks passed
