8000 Support "transactional" Get-then-Update and Get-then-Delete by marshtompsxd · Pull Request #621 · anvil-verifier/anvil · GitHub
[go: up one dir, main page]
More Web Proxy on the site http://driver.im/
Skip to content

Support "transactional" Get-then-Update and Get-then-Delete #621

New issue

Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.

By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.

Already on GitHub? Sign in to your account

Merged
merged 14 commits into from
May 15, 2025

Conversation

marshtompsxd
Copy link
Collaborator
@marshtompsxd marshtompsxd commented May 10, 2025

This PR introduces "transactional" APIs: Get-then-Update and Get-then-Delete. This PR breaks some proofs of the VRS controller. Fixing the broken proof is left for future work.

Spec of Get-then-Update: The API server checks whether the object exists and whether the object is owned by some owner reference, and if so the API server updates the object using the current resource version and uid. These operations happen in a single step. Get-then-Update will never fail due to Conflict error because it applies update using the current resource version and uid.

Implementation of Get-then-Update: Kubernetes does not provide such a transactional API, so Get-then-Update is implemented by retrying get and update. It first gets the object, checks its owner, then updates the object using the resource version and uid from the get result. If the update fails due to Conflict error (e.g., some other controller updates the object between our get and update), it retries get and update until it succeeds or other error happens. Note that the termination of the implementation depends on fairness.

The spec and implementation of Get-then-Delete are similar to Get-then-Update.

Why the implementation refines the spec: There is currently no machine-checked proof for the refinement. The proof will require the fairness condition that guarantees the implementation's termination. The intuition of the proof is to show that for any possible trace of the implementation, if the implementation (re)tries "Get-then-Update" for N times, the first N-1 tries will not cause any changes to the cluster state (as they failed due to Conflict error) so the first N-1 tries are mapped to no-ops to the trace of the spec. For the Nth try of "Get-then-Update", there are two cases to consider: (1) the object gets deleted between the get and update, and the Nth try fails with ObjectNotFound error. In this case, we map the implementation's get to a no-op before the deletion, and the implementation's update to the spec's atomic get-then-update after the deletion. (2) the object still exists when the update happens, then it implies no other controllers touches the object between the get and update (otherwise the Nth try will also fail with Conflict error). In this case, we map the implementation's "Get-then-Update" to the spec's atomic Get-then-Update.

Limitation: Currently, Get-then-Update and Get-then-Delete's expressiveness is limited: they only check the object's owner before issuing update or delete, instead of checking arbitrary conditions of the object. To support arbitrary condition check, the controller's reconcile_core needs to return a function (e.g., Fn(DynamicObject) -> bool) as a field of the request struct. Rust requires the function to be wrapped in a Box<dyn ...>, but dyn is not supported by Verus for now.

@marshtompsxd marshtompsxd changed the title Support "transactional" Get-Then-Update to address fairness issues on version races Support "transactional" Get-Then-Update API May 10, 2025
@marshtompsxd marshtompsxd force-pushed the xudong/fictional-transaction branch 2 times, most recently from 10a3d64 to 0897277 Compare May 12, 2025 00:07
@marshtompsxd marshtompsxd changed the title Support "transactional" Get-Then-Update API Support "transactional" Get-Then-Update and Get-Then-Delete May 12, 2025
@marshtompsxd marshtompsxd changed the title Support "transactional" Get-Then-Update and Get-Then-Delete Support "transactional" Get-then-Update and Get-then-Delete May 12, 2025
@marshtompsxd marshtompsxd marked this pull request as ready for review May 12, 2025 17:29
@marshtompsxd
Copy link
Collaborator Author

@codyjrivera @Catoverflow Sorry that this PR accidentally contains some changes for formatting the proof of VRS controllers (mostly on removing whitespace); I turned on format-on-save in my dev env so...

Copy link
Collaborator
@codyjrivera codyjrivera left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Looks good. Obviously the predicate being hardcoded isn't ideal, but good work!

Ideally, more people should check the mathematics of course.

@@ -385,6 +408,157 @@ where
return Ok(Action::requeue(Duration::from_secs(60)));
}
10000
// transactional_get_then_delete_by_retry retries get and then delete upon conflict errors to simulate atomic operations.
Copy link
Collaborator

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

More of a basic Kubernetes question than a comment on this code, but do real-life examples of this 'read->update' loop have a timeout, or would they just contend for a resource forever?

Copy link
Collaborator Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

They limit the number of retries (and also have back-off when a try fails).

Copy link
Collaborator Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Copy link
Collaborator

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Is there some other way that would be facilitated? Or are we content, as a simplifier for the purposes of Anvil, with possible starvation?

Copy link
Collaborator Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

If we limit the number of retries then there is no way to have a refinement mapping from the retry code to the transactional spec API.

We could implement back-off later, which does not affect the liveness argument.

with possible starvation?

What do you mean by starvation here? You mean the controller always retries and never terminates?

Copy link
Collaborator

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Understandable, I was just curious.

I meant 'always retries'.

let current_obj = s.resources[req.key()];
// Step 2: if the object exists, perform a check using a predicate on object
// The predicate: Is the current object owned by req.owner_ref?
// TODO: the predicate should be provided by clients instead of the hardcoded one
Copy link
Collaborator

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Long shot: might there be a way to specify this predicate at compile/typechecking time, rather than either hardcoding it or trying to pass in a higher-order function?

Copy link
Collaborator Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

I hope there is a way without using dyn. Need to ask @utaal

Copy link
Collaborator

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

I think we need to have lemma like lemma_always_key_of_object_in_matched_ok_create_resp_message_is_same_as_key_of_pending_req, do we want to add those in another PR and then we can migrate to use this API?

Copy link
Collaborator Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Let's add the lemmas when we need them

@marshtompsxd marshtompsxd force-pushed the xudong/fictional-transaction branch from 7902839 to 1991ea7 Compare May 13, 2025 15:01
@marshtompsxd marshtompsxd added this pull request to the merge queue May 15, 2025
@github-merge-queue github-merge-queue bot removed this pull request from the merge queue due to failed status checks May 15, 2025
Signed-off-by: Xudong Sun <xudongs3@illinois.edu>
Signed-off-by: Xudong Sun <xudongs3@illinois.edu>
Signed-off-by: Xudong Sun <xudongs3@illinois.edu>
Signed-off-by: Xudong Sun <xudongs3@illinois.edu>
Signed-off-by: Xudong Sun <xudongs3@illinois.edu>
Signed-off-by: Xudong Sun <xudongs3@illinois.edu>
Signed-off-by: Xudong Sun <xudongs3@illinois.edu>
Signed-off-by: Xudong Sun <xudongs3@illinois.edu>
Signed-off-by: Xudong Sun <xudongs3@illinois.edu>
Signed-off-by: Xudong Sun <xudongs3@illinois.edu>
Signed-off-by: Xudong Sun <xudongs3@illinois.edu>
Signed-off-by: Xudong Sun <xudongs3@illinois.edu>
Signed-off-by: Xudong Sun <xudongs3@illinois.edu>
Signed-off-by: Xudong Sun <xudongs3@illinois.edu>
@marshtompsxd marshtompsxd force-pushed the xudong/fictional-transaction branch from 1991ea7 to 999b343 Compare May 15, 2025 21:08
@marshtompsxd marshtompsxd added this pull request to the merge queue May 15, 2025
Merged via the queue into main with commit 67e7f2c May 15, 2025
9 checks passed
Catoverflow pushed a commit that referenced this pull request May 17, 2025
This PR introduces "transactional" APIs: Get-then-Update and
Get-then-Delete. This PR breaks some proofs of the VRS controller.
Fixing the broken proof is left for future work.

**Spec of Get-then-Update**: The API server checks whether the object
exists and whether the object is owned by some owner reference, and if
so the API server updates the object using the current resource version
and uid. These operations happen in a single step. Get-then-Update will
never fail due to Conflict error because it applies update using the
current resource version and uid.

**Implementation of Get-then-Update**: Kubernetes does not provide such
a transactional API, so Get-then-Update is implemented by retrying get
and update. It first gets the object, checks its owner, then updates the
object using the resource version and uid from the get result. If the
update fails due to Conflict error (e.g., some other controller updates
the object between our get and update), it retries get and update until
it succeeds or other error happens. Note that the termination of the
implementation depends on fairness.

The spec and implementation of Get-then-Delete are similar to
Get-then-Update.

**Why the implementation refines the spec**: There is currently no
machine-checked proof for the refinement. The proof will require the
fairness condition that guarantees the implementation's termination. The
intuition of the proof is to show that for any possible trace of the
implementation, if the implementation (re)tries "Get-then-Update" for N
times, the first N-1 tries will not cause any changes to the cluster
state (as they failed due to Conflict error) so the first N-1 tries are
mapped to no-ops to the trace of the spec. For the Nth try of
"Get-then-Update", there are two cases to consider: (1) the object gets
deleted between the get and update, and the Nth try fails with
ObjectNotFound error. In this case, we map the implementation's get to a
no-op before the deletion, and the implementation's update to the spec's
atomic get-then-update after the deletion. (2) the object still exists
when the update happens, then it implies no other controllers touches
the object between the get and update (otherwise the Nth try will also
fail with Conflict error). In this case, we map the implementation's
"Get-then-Update" to the spec's atomic Get-then-Update.

**Limitation**: Currently, Get-then-Update and Get-then-Delete's
expressiveness is limited: they only check the object's owner before
issuing update or delete, instead of checking arbitrary conditions of
the object. To support arbitrary condition check, the controller's
`reconcile_core` needs to return a function (e.g., `Fn(DynamicObject) ->
bool`) as a field of the request struct. Rust requires the function to
be wrapped in a `Box<dyn ...>`, but `dyn` is not supported by Verus for
now.

---------

Signed-off-by: Xudong Sun <xudongs3@illinois.edu>
Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment
Labels
None yet
Projects
None yet
Development

Successfully merging this pull request may close these issues.

3 participants
0