10000 [core] Fix GCS subscribers map race condition by dayshah · Pull Request #53781 · ray-project/ray · GitHub
[go: up one dir, main page]
More Web Proxy on the site http://driver.im/
Skip to content

[core] Fix GCS subscribers map race condition #53781

New issue

Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.

By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.

Already on GitHub? Sign in to your account

Merged
merged 4 commits into from
Jun 16, 2025

Conversation

dayshah
Copy link
Contributor
@dayshah dayshah commented Jun 12, 2025

Problem

The GCS's pubsub service is created with it's own io context. This means that all handlers will be posted to it's own thread.

auto &io_context = io_context_provider_.GetIOContext<GcsPublisher>();
pubsub_handler_ = std::make_unique<InternalPubSubHandler>(io_context, *gcs_publisher_);
rpc_server_.RegisterService(
std::make_unique<rpc::InternalPubSubGrpcService>(io_context, *pubsub_handler_));

RemoveSubscriberFrom is called from the main thread on the GCS though. And sender_to_subscribers_ is a map that is modified there.

sender_to_subscribers_.erase(iter);

sender_to_subscribers_ is also modified in HandleGcsSubscriberCommandBatch which will run on the GCSPublisher thread.

iter = sender_to_subscribers_.insert({sender_id, {}}).first;

Obvious races here.

Solution

We can just post the RemoveSubscriberFrom call onto the GCSPublisher io_context so everything runs on a single thread. This keeps us from trying to make the pub sub handler thread safe. All usage of the pub sub handler should always be on the GCSPublisher io_context.

Signed-off-by: dayshah <dhyey2019@gmail.com>
@Copilot Copilot AI review requested due to automatic review settings June 12, 2025 21:44
@dayshah dayshah requested a review from a team as a code owner June 12, 2025 21:44
@dayshah dayshah added the go add ONLY when ready to merge, run all tests label Jun 12, 2025
Copy link
Contributor
@Copilot Copilot AI left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Pull Request Overview

This PR addresses a race condition in the GCS subscribers map by ensuring that all modifications occur on the GCSPublisher IOContext.

  • Updated pubsub_handler documentation to clarify thread safety and IOContext expectations.
  • Modified GcsServer event listeners to post RemoveSubscriberFrom calls onto the GCSPublisher thread, thereby eliminating race conditions.

Reviewed Changes

Copilot reviewed 2 out of 2 changed files in this pull request and generated 1 comment.

File Description
src/ray/gcs/gcs_server/pubsub_handler.h Added documentation clarifying that the pubsub handler is not thread safe.
src/ray/gcs/gcs_server/gcs_server.cc Updated subscriber removal calls to post on the GCSPublisher IOContext.

@dayshah dayshah marked this pull request as draft June 12, 2025 22:40
@dayshah dayshah marked this pull request as ready for review June 13, 2025 06:11
Copy link
Collaborator
@edoakes edoakes left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

expanding more usage of io-service-post-as-thread-safety makes me sad, but LGTM as patch fix

@edoakes
Copy link
Collaborator
edoakes commented Jun 13, 2025

Actually, can we enforce this more explicitly by moving the Post call inside the implementation of RemoveSubscriberFrom?

void InternalPubSubHandler::RemoveSubscriberFrom(const std::string &sender_id) {

It's only ever called in the contexts you've fixed here.

Signed-off-by: dayshah <dhyey2019@gmail.com>
@dayshah
Copy link
Contributor Author
dayshah commented Jun 13, 2025

Actually, can we enforce this more explicitly by moving the Post call inside the implementation of RemoveSubscriberFrom?

void InternalPubSubHandler::RemoveSubscriberFrom(const std::string &sender_id) {

It's only ever called in the contexts you've fixed here.

Done, I made it dispatch inside the function so that if it is ever called from the same io_context, it'll execute immediately and not post.

I think there's some argument here on what level you want the thread safety to live at. Now it's less clear at the caller site that this thing won't be executed immediately, but you get the benefit of safety that nobody could make this mistake again.

Signed-off-by: dayshah <dhyey2019@gmail.com>
@dayshah
Copy link
Contributor Author
dayshah commented Jun 13, 2025

Actually, can we enforce this more explicitly by moving the Post call inside the implementation of RemoveSubscriberFrom?

void InternalPubSubHandler::RemoveSubscriberFrom(const std::string &sender_id) {

It's only ever called in the contexts you've fixed here.

Done, I made it dispatch inside the function so that if it is ever called from the same io_context, it'll execute immediately and not post.

I think there's some argument here on what level you want the thread safety to live at. Now it's less clear at the caller site that this thing won't be executed immediately, but you get the benefit of safety that nobody could make this mistake again.

Actually going with calling it Async and always posting to make the behavior clear and consistent. This should only ever be called externally. Internally should just erase from the map directly. Added a comment noting that.

@edoakes edoakes merged commit 861b319 into ray-project:master Jun 16, 2025
5 checks passed
@dayshah dayshah deleted the remove-subscriber-race branch June 16, 2025 14:41
elliot-barn pushed a commit that referenced this pull request Jun 18, 2025
## Problem

The GCS's pubsub service is created with it's own io context. This means
that all handlers will be posted to it's own thread.

https://github.com/ray-project/ray/blob/a1b9dc627c7ab0b31ae5df03892f4d49192463e4/src/ray/gcs/gcs_server/gcs_server.cc#L614-L617

`RemoveSubscriberFrom` is called from the main thread on the GCS though.
And `sender_to_subscribers_` is a map that is modified there.

https://github.com/ray-project/ray/blob/a1b9dc627c7ab0b31ae5df03892f4d49192463e4/src/ray/gcs/gcs_server/pubsub_handler.cc#L124

`sender_to_subscribers_` is also modified in
`HandleGcsSubscriberCommandBatch` which will run on the GCSPublisher
thread.

https://github.com/ray-project/ray/blob/a1b9dc627c7ab0b31ae5df03892f4d49192463e4/src/ray/gcs/gcs_server/pubsub_handler.cc#L82

Obvious races here.

## Solution
We can just post the `RemoveSubscriberFrom` call onto the GCSPublisher
io_context so everything runs on a single thread. This keeps us from
trying to make the pub sub handler thread safe. All usage of the pub sub
handler should always be on the GCSPublisher io_context.

---------

Signed-off-by: dayshah <dhyey2019@gmail.com>
Signed-off-by: elliot-barn <elliot.barnwell@anyscale.com>
Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment
Labels
go add ONLY when ready to merge, run all tests
Projects
None yet
Development

Successfully merging this pull request may close these issues.

2 participants
0