[core] Fix race condition b/w object eviction & repinning for recovery. #53934

israbbani · 2025-06-18T20:54:17Z

This PR has two components:

Fixing the Race Condition b/w Object Eviction and Repining for Object Recovery

The sequence of events documents the life of an object called A.

Owner sends object eviction notifications to all locations as part of Object A going out of scope. (done through worker<->worker pubsub)
Workers that have a copy of Object A mark it for asynchronous deletion.
Owner decides to recover the object and repin a copy of object A that has already been marked for asynchronous deletion.
Owner submits a task that has Object A as an argument.
The task executor tries to fetch Object A and the pull manager hangs forever.

The fix is to check to see if an object is pending deletion before pinning it. If it is, the owner cannot pin the object and has to either try a different location or trigger reconstruction by resubmitting the task that created the object.

(The Joys of) Writing a Unit Test in Node Manager

There are a few problems with writing unit tests in the NodeManager

Despite heroics from @rueian and @dayshah recently, the NodeManager still had a few dependencies that needed to be put behind interfaces so fakes could be injected.
The gRPC handlers are all private to the class so they cannot be called directly. Instead the unit tests need to create a gRPC client.
The NodeManager has a lot of dependencies and the dependencies have interdependencies so we currently dedicate 200 lines to constructing a test fixture.
There is a lack of usable fakes e.g. for the PlasmaClient for writing unit tests.

Follow Ups

Create a shim between the NodeManager and it's gRPC handlers so they can be unit tested directly without creating gRPC clients.
Create a utility function that lets you construct a NodeManager for testing optionally overriding interesting dependencies (otherwise they will default to a Noop fake).
Move the fakes created in this PR into their own (more complete implementations).
Create a Fake for SubscriberInterface.

The sequence of events: 1. The owner sends object eviction notifications to all locations as part of an object going out of scope. 2. The remote copies are deleted asynchronously. 3. The owner can decide to recover the object and repin the copy that has already been marked for asynchronous deletion. 4. The pull manager hangs forever. Signed-off-by: irabbani <irabbani@anyscale.com>

Signed-off-by: irabbani <irabbani@anyscale.com>

1. Adding an interface for LocalObjectManager 2. Adding a unit test Signed-off-by: irabbani <irabbani@anyscale.com>

Signed-off-by: Ibrahim Rabbani <irabbani@anyscale.com>

Signed-off-by: irabbani <irabbani@anyscale.com>

israbbani · 2025-06-25T22:16:19Z

src/ray/raylet/node_manager.cc

+      // Therefore, a new object cannot be marked for deletion while this function is
+      // executing.
+      if (*result_it == nullptr ||
+          local_object_manager_.ObjectPendingDeletion(*object_id_it)) {


This is the fix and logic change.

dayshah

Create a utility function that lets you construct a NodeManager for testing optionally overriding interesting dependencies (otherwise they will default to a Noop fake).

Instead of this we should just have mocks and you can override implementations of a mock per test or even per invocation in a test. More flexible and less code

Also node_manager_test failing build bc warnings

src/ray/raylet/local_object_manager_interface.h

dayshah · 2025-06-26T01:02:52Z

src/ray/raylet/test/node_manager_test.cc

+                absl::flat_hash_set<ObjectID> objects_ids_in_plasma_;

            
+  absl::flat_hash_map<ObjectID, std::pair<std::vector<uint8_t>, std::vector<uint8_t>>>
+      objects_in_plasma_;
+};


imo prefer mocks in mock/ to no-op fakes, more similar to the rest of the code too

I disagree. I am hoping to remove as much of our mock code as possible and replace it with fake implementations. Mocks encourage intercepting calls and break encapsulation and overfit the tests to the implementations rather than the API.

I conceptually agree, intercepting calls and actually using the mocks everywhere makes the tests brittle. The reason i prefer the mocks is because they give you the default no-op implementations by default and give you the option of overriding / intercepting whatever's necessary. It'll generally result in less verbose test files and it's up to the reviewers to make sure people aren't writing bad tests that overfit to implementations.

I'm open to going with shared no-op implementations instead of mocks. But that's effectively the same thing except that devs will be more discouraged to mock bc it means writing more code?

But this is probably a broader team decision + a shift in the way we do things now. cc @jjyao + @edoakes

I much prefer to make the tests more verbose in exchange for reducing our use of mocks. Test code being verbose is not really a strong detractor to me, in fact it's often preferable because it makes the tests more explicit/self-documenting.

I agree that in theory mocks can be "held correctly" to avoid writing low quality tests, but they tend to encourage bad practices and are also often error-prone because so much of the behavior is implicit/hidden.

Let's walk down the path of more explicit fakes for a bit and see where it takes us... from past experience, it will take some up-front work but then lead to more maintainable & robust tests. If I'm wrong, it isn't hard to go back.

dayshah · 2025-06-26T01:06:12Z

src/ray/raylet/test/node_manager_test.cc

+
+    grpc::ClientContext context;
+    auto stub = rpc::NodeManagerService::NewStub(channel);
+    auto status = stub->PinObjectIDs(&context, request, &reply);


just make the test a friend and call HandlePinObjectIDs with the request and reply, this would simplify things a lot and this complexity in a test isn't worth it

Also idk if handlers should really be private, this is a node manager only thing. Gcs handlers are public and that makes sense... the handlers are expected to be triggered extrenally.

Making the test a friend breaks encapsulation. Tests should be written against the public API of the class as much as possible. The short-term pain of making code testable is worth the long-term gain of not having brittle tests and it'll force us to improve code quality.

Agreed that the handlers should be public. I'm planning on changing this in a follow up.

can we just make them public here and avoid merging the grpc logic in test code

That's fair. I made this specific method public to make this a little easier. Ideally, the gRPC handler should be a thin shim around the actual public function that does the work. It'll make testing the logic independent of gRPC requests, responses, callbacks etc.

Ideally, the gRPC handler should be a thin shim around the actual public function that does the work.

Agreed. Was going to say the same.

dayshah · 2025-06-26T01:07:33Z

src/ray/raylet/test/node_manager_test.cc

+
+  std::thread io_thread{[&] {
+    auto work_guard = boost::asio::make_work_guard(io_service_);
+    io_service_.run();


ideally shouldn't just let the io_service_ run on its own thread in a test. You can just call run_one on the test service as needed.

ideally shouldn't just let the io_service_ run on its own thread in a test. You can just call run_one on the test service as needed.

Super agree. Can you say a little more about "You can just call run_one on the test service as needed."?

https://think-async.com/Asio/asio-1.16.0/doc/asio/reference/io_context/run_one.html

just runs one handler at a time, and it'll run on that thread. Don't need the work_guard either. Just do run_one's on the main test thread

Super super agree. We should ban this.

Here is a test I wrote awhile back that does this:

ray/src/ray/common/test/client_connection_test.cc

Line 126 in 5957aab

io_service_.run();

Turns out I was being lazy. I don't think I need the io_service to be running since I'm calling a function that doesn't require it (and none of the periodical_runner_ stuff depends on it).

israbbani · 2025-06-26T02:26:00Z

Also node_manager_test failing build bc warnings

This is interesting. CI builds with clang. I'm using gcc on my devbox. I'll change it to clang. Good catch!

Signed-off-by: irabbani <irabbani@anyscale.com>

dayshah · 2025-06-26T05:26:08Z

Also node_manager_test failing build bc warnings

This is interesting. CI builds with clang. I'm using gcc on my devbox. I'll change it to clang. Good catch!

ci also builds the wheel steps with gcc though lol 😭

- making HandlePinObjectIDs public so it can be tested directly - removing io_service_ Signed-off-by: irabbani <irabbani@anyscale.com>

Signed-off-by: irabbani <irabbani@anyscale.com>

dayshah

🚢

dayshah · 2025-06-27T00:21:45Z

src/ray/raylet/local_object_manager.cc

+    RAY_LOG(DEBUG) << "Freeing " << objects_pending_deletion_.size()
+                   << " out-of-scope objects";
+    // TODO(irabbani): CORE-1640 will modify as much as the plasma API as is
+    // reasonable to remove usage of vectors in favor of sets.


probably no jira tickets in oss code, make it a gh issue or don't keep in code

dayshah · 2025-06-27T00:23:21Z

src/ray/raylet/test/node_manager_test.cc

+                                const ObjectID &generator_id = ObjectID::Nil()) override {
+  }
+
+  // NOOP


above is noop too?

israbbani added the go add ONLY when ready to merge, run all tests label Jun 18, 2025

israbbani added 8 commits June 18, 2025 21:12

formatting

7ff93f7

Signed-off-by: irabbani <irabbani@anyscale.com>

Merge branch 'master' into irabbani/CORE-1618

c559ba0

Merge branch 'master' into irabbani/CORE-1618

998db18

Merge branch 'master' into irabbani/CORE-1618

dbda0e4

[core] Making node manager more testable by

9b75f87

1. Adding an interface for LocalObjectManager 2. Adding a unit test Signed-off-by: irabbani <irabbani@anyscale.com>

Merge branch 'master' into irabbani/CORE-1618

330b2ab

Signed-off-by: Ibrahim Rabbani <irabbani@anyscale.com>

format

4617dc2

Signed-off-by: irabbani <irabbani@anyscale.com>

merge conflicts

fd5c49b

Signed-off-by: irabbani <irabbani@anyscale.com>

israbbani commented Jun 25, 2025

View reviewed changes

israbbani requested a review from a team June 25, 2025 22:17

israbbani marked this pull request as ready for review June 25, 2025 22:17

dayshah reviewed Jun 26, 2025

View reviewed changes

Fixing clang build

3bdbddf

Signed-off-by: irabbani <irabbani@anyscale.com>

israbbani mentioned this pull request Jun 26, 2025

[core][test] fix flaky data races in NodeManagerTest #54129

Open

8 tasks

israbbani added 2 commits June 26, 2025 18:59

A few updates to the unit test:

a190477

- making HandlePinObjectIDs public so it can be tested directly - removing io_service_ Signed-off-by: irabbani <irabbani@anyscale.com>

removing io_service.stop()

1e84ad2

Signed-off-by: irabbani <irabbani@anyscale.com>

israbbani requested review from edoakes and dayshah June 26, 2025 22:49

dayshah approved these changes Jun 27, 2025

View reviewed changes

Provide feedback

Saved searches

Use saved searches to filter your results more quickly

Uh oh!

[core] Fix race condition b/w object eviction & repinning for recovery. #53934

[core] Fix race condition b/w object eviction & repinning for recovery. #53934

Uh oh!

Uh oh!

Uh oh!

Uh oh!

Uh oh!

Uh oh!

Uh oh!

Uh oh!

Uh oh!

Uh oh!

Uh oh!

Uh oh!

Uh oh!

Uh oh!

Uh oh!

Uh oh!

Uh oh!

Uh oh!

Uh oh!

Uh oh!

Uh oh!

Uh oh!

Uh oh!

Uh oh!

Uh oh!

[core] Fix race condition b/w object eviction & repinning for recovery. #53934

Are you sure you want to change the base?

[core] Fix race condition b/w object eviction & repinning for recovery. #53934

Uh oh!

Conversation

Uh oh!

Fixing the Race Condition b/w Object Eviction and Repining for Object Recovery

(The Joys of) Writing a Unit Test in Node Manager

Follow Ups

Uh oh!

Uh oh!

Choose a reason for hiding this comment

Uh oh!

Uh oh!

Choose a reason for hiding this comment

Uh oh!

Uh oh!

Choose a reason for hiding this comment

Uh oh!

Choose a reason for hiding this comment

Uh oh!

Choose a reason for hiding this comment

Uh oh!

Choose a reason for hiding this comment

Uh oh!

Choose a reason for hiding this comment

Uh oh!

Choose a reason for hiding this comment

Uh oh!

Choose a reason for hiding this comment

Uh oh!

Choose a reason for hiding this comment

Uh oh!

Choose a reason for hiding this comment

Uh oh!

Choose a reason for hiding this comment

Uh oh!

Choose a reason for hiding this comment

Uh oh!

Choose a reason for hiding this comment

Uh oh!

Choose a reason for hiding this comment

Uh oh!

Choose a reason for hiding this comment

Uh oh!

Uh oh!

Uh oh!

Choose a reason for hiding this comment

Uh oh!

Uh oh!

Choose a reason for hiding this comment

Uh oh!

Choose a reason for hiding this comment

Uh oh!

Uh oh!