8000 [core] Fix race condition b/w object eviction & repinning for recovery. by israbbani · Pull Request #53934 · ray-project/ray · GitHub
[go: up one dir, main page]
More Web Proxy on the site http://driver.im/
Skip to content

[core] Fix race condition b/w object eviction & repinning for recovery. #53934

New issue

Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.

By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.

Already on GitHub? Sign in to your account

Open
wants to merge 12 commits into
base: master
Choose a base branch
from

Conversation

israbbani
Copy link
Contributor
@israbbani israbbani commented Jun 18, 2025

This PR has two components:

Fixing the Race Condition b/w Object Eviction and Repining for Object Recovery

The sequence of events documents the life of an object called A.

  1. Owner sends object eviction notifications to all locations as part of Object A going out of scope. (done through worker<->worker pubsub)
  2. Workers that have a copy of Object A mark it for asynchronous deletion.
  3. Owner decides to recover the object and repin a copy of object A that has already been marked for asynchronous deletion.
  4. Owner submits a task that has Object A as an argument.
  5. The task executor tries to fetch Object A and the pull manager hangs forever.

The fix is to check to see if an object is pending deletion before pinning it. If it is, the owner cannot pin the object and has to either try a different location or trigger reconstruction by resubmitting the task that created the object.

(The Joys of) Writing a Unit Test in Node Manager

There are a few problems with writing unit tests in the NodeManager

  1. Despite heroics from @rueian and @dayshah recently, the NodeManager still had a few dependencies that needed to be put behind interfaces so fakes could be injected.
  2. The gRPC handlers are all private to the class so they cannot be called directly. Instead the unit tests need to create a gRPC client.
  3. The NodeManager has a lot of dependencies and the dependencies have interdependencies so we currently dedicate 200 lines to constructing a test fixture.
  4. There is a lack of usable fakes e.g. for the PlasmaClient for writing unit tests.

Follow Ups

  1. Create a shim between the NodeManager and it's gRPC handlers so they can be unit tested directly without creating gRPC clients.
  2. Create a utility function that lets you construct a NodeManager for testing optionally overriding interesting dependencies (otherwise they will default to a Noop fake).
  3. Move the fakes created in this PR into their own (more complete implementations).
  4. Create a Fake for SubscriberInterface.

The sequence of events:

1. The owner sends object eviction notifications to all locations as part
of an object going out of scope.
2. The remote copies are deleted asynchronously.
3. The owner can decide to recover the object and repin the copy
that has already been marked for asynchronous deletion.
4. The pull manager hangs forever.

Signed-off-by: irabbani <irabbani@anyscale.com>
@israbbani israbbani added the go add ONLY when ready to merge, run all tests label Jun 18, 2025
Signed-off-by: irabbani <irabbani@anyscale.com>
1. Adding an interface for LocalObjectManager
2. Adding a unit test

Signed-off-by: irabbani <irabbani@anyscale.com>
Signed-off-by: Ibrahim Rabbani <irabbani@anyscale.com>
Signed-off-by: irabbani <irabbani@anyscale.com>
Signed-off-by: irabbani <irabbani@anyscale.com>
// Therefore, a new object cannot be marked for deletion while this function is
// executing.
if (*result_it == nullptr ||
local_object_manager_.ObjectPendingDeletion(*object_id_it)) {
Copy link
Contributor Author
@israbbani israbbani Jun 25, 2025

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

This is the fix and logic change.

@israbbani israbbani requested a review from a team June 25, 2025 22:17
@israbbani israbbani marked this pull request as ready for review June 25, 2025 22:17
Copy link
Contributor
@dayshah dayshah left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Create a utility function that lets you construct a NodeManager for testing optionally overriding interesting dependencies (otherwise they will default to a Noop fake).

Instead of this we should just have mocks and you can override implementations of a mock per test or even per invocation in a test. More flexible and less code

Also node_manager_test failing build bc warnings

absl::flat_hash_set<ObjectID> objects_ids_in_plasma_;
absl::flat_hash_map<ObjectID, std::pair<std::vector<uint8_t>, std::vector<uint8_t>>>
objects_in_plasma_;
};
Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

imo prefer mocks in mock/ to no-op fakes, more similar to the rest of the code too

Copy link
Contributor Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

I disagree. I am hoping to remove as much of our mock code as possible and replace it with fake implementations. Mocks encourage intercepting calls and break encapsulation and overfit the tests to the implementations rather than the API.

Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

I conceptually agree, intercepting calls and actually using the mocks everywhere makes the tests brittle. The reason i prefer the mocks is because they give you the default no-op implementations by default and give you the option of overriding / intercepting whatever's necessary. It'll generally result in less verbose test files and it's up to the reviewers to make sure people aren't writing bad tests that overfit to implementations.

I'm open to going with shared no-op implementations instead of mocks. But that's effectively the same thing except that devs will be more discouraged to mock bc it means writing more code?

But this is probably a broader team decision + a shift in the way we do things now. cc @jjyao + @edoakes

Copy link
Collaborator

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

I much prefer to make the tests more verbose in exchange for reducing our use of mocks. Test code being verbose is not really a strong detractor to me, in fact it's often preferable because it makes the tests more explicit/self-documenting.

I agree that in theory mocks can be "held correctly" to avoid writing low quality tests, but they tend to encourage bad practices and are also often error-prone because so much of the behavior is implicit/hidden.

Let's walk down the path of more explicit fakes for a bit and see where it takes us... from past experience, it will take some up-front work but then lead to more maintainable & robust tests. If I'm wrong, it isn't hard to go back.


grpc::ClientContext context;
auto stub = rpc::NodeManagerService::NewStub(channel);
auto status = stub->PinObjectIDs(&context, request, &reply);
Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

just make the test a friend and call HandlePinObjectIDs with the request and reply, this would simplify things a lot and this complexity in a test isn't worth it

Also idk if handlers should really be private, this is a node manager only thing. Gcs handlers are public and that makes sense... the handlers are expected to be triggered extrenally.

Copy link
Contributor Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Making the test a friend breaks encapsulation. Tests should be written against the public API of the class as much as possible. The short-term pain of making code testable is worth the long-term gain of not having brittle tests and it'll force us to improve code quality.

Agreed that the handlers should be public. I'm planning on changing this in a follow up.

Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

can we just make them public here and avoid merging the grpc logic in test code

Copy link
Contributor Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

That's fair. I made this specific method public to make this a little easier. Ideally, the gRPC handler should be a thin shim around the actual public function that does the work. It'll make testing the logic independent of gRPC requests, responses, callbacks etc.

Copy link
Collaborator

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Ideally, the gRPC handler should be a thin shim around the actual public function that does the work.

Agreed. Was going to say the same.


std::thread io_thread{[&] {
auto work_guard = boost::asio::make_work_guard(io_service_);
io_service_.run();
Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

ideally shouldn't just let the io_service_ run on its own thread in a test. You can just call run_one on the test service as needed.

Copy link
Contributor Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

ideally shouldn't just let the io_service_ run on its own thread in a test. You can just call run_one on the test service as needed.

Super agree. Can you say a little more about "You can just call run_one on the test service as needed."?

Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

https://think-async.com/Asio/asio-1.16.0/doc/asio/reference/io_context/run_one.html

just runs one handler at a time, and it'll run on that thread. Don't need the work_guard either. Just do run_one's on the main test thread

Copy link
Collaborator

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Super super agree. We should ban this.

Here is a test I wrote awhile back that does this:

Copy link
Contributor Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Turns out I was being lazy. I don't think I need the io_service to be running since I'm calling a function that doesn't require it (and none of the periodical_runner_ stuff depends on it).

@israbbani
Copy link
Contributor Author

Also node_manager_test failing build bc warnings

This is interesting. CI builds with clang. I'm using gcc on my devbox. I'll change it to clang. Good catch!

Signed-off-by: irabbani <irabbani@anyscale.com>
@dayshah
Copy link
Contributor
dayshah commented Jun 26, 2025

Also node_manager_test failing build bc warnings

This is interesting. CI builds with clang. I'm using gcc on my devbox. I'll change it to clang. Good catch!

ci also builds the wheel steps with gcc though lol 😭

- making HandlePinObjectIDs public so it can be tested directly
- removing io_service_

Signed-off-by: irabbani <irabbani@anyscale.com>
Signed-off-by: irabbani <irabbani@anyscale.com>
@israbbani israbbani requested review from edoakes and dayshah June 26, 2025 22:49
Copy link
Contributor
@dayshah dayshah left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

🚢

RAY_LOG(DEBUG) << "Freeing " << objects_pending_deletion_.size()
<< " out-of-scope objects";
// TODO(irabbani): CORE-1640 will modify as much as the plasma API as is
// reasonable to remove usage of vectors in favor of sets.
Copy link
Contributor
@dayshah dayshah Jun 27, 2025

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

probably no jira tickets in oss code, make it a gh issue or don't keep in code

const ObjectID &generator_id = ObjectID::Nil()) override {
}

// NOOP
Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

above is noop too?

Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment
Labels
go add ONLY when ready to merge, run all tests
Projects
None yet
Development

Successfully merging this pull request may close these issues.

3 participants
0