-
Notifications
You must be signed in to change notification settings - Fork 6.4k
[core] Fix GCS subscribers map race condition #53781
New issue
Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.
By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.
Already on GitHub? Sign in to your account
Conversation
Signed-off-by: dayshah <dhyey2019@gmail.com>
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Pull Request Overview
This PR addresses a race condition in the GCS subscribers map by ensuring that all modifications occur on the GCSPublisher IOContext.
- Updated pubsub_handler documentation to clarify thread safety and IOContext expectations.
- Modified GcsServer event listeners to post RemoveSubscriberFrom calls onto the GCSPublisher thread, thereby eliminating race conditions.
Reviewed Changes
Copilot reviewed 2 out of 2 changed files in this pull request and generated 1 comment.
File | Description |
---|---|
src/ray/gcs/gcs_server/pubsub_handler.h | Added documentation clarifying that the pubsub handler is not thread safe. |
src/ray/gcs/gcs_server/gcs_server.cc | Updated subscriber removal calls to post on the GCSPublisher IOContext. |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
expanding more usage of io-service-post-as-thread-safety makes me sad, but LGTM as patch fix
Actually, can we enforce this more explicitly by moving the ray/src/ray/gcs/gcs_server/pubsub_handler.cc Line 116 in 149e152
It's only ever called in the contexts you've fixed here. |
Signed-off-by: dayshah <dhyey2019@gmail.com>
Done, I made it dispatch inside the function so that if it is ever called from the same io_context, it'll execute immediately and not post. I think there's some argument here on what level you want the thread safety to live at. Now it's less clear at the caller site that this thing won't be executed immediately, but you get the benefit of safety that nobody could make this mistake again. |
Signed-off-by: dayshah <dhyey2019@gmail.com>
Actually going with calling it Async and always posting to make the behavior clear and consistent. This should only ever be called externally. Internally should just erase from the map directly. Added a comment noting that. |
## Problem The GCS's pubsub service is created with it's own io context. This means that all handlers will be posted to it's own thread. https://github.com/ray-project/ray/blob/a1b9dc627c7ab0b31ae5df03892f4d49192463e4/src/ray/gcs/gcs_server/gcs_server.cc#L614-L617 `RemoveSubscriberFrom` is called from the main thread on the GCS though. And `sender_to_subscribers_` is a map that is modified there. https://github.com/ray-project/ray/blob/a1b9dc627c7ab0b31ae5df03892f4d49192463e4/src/ray/gcs/gcs_server/pubsub_handler.cc#L124 `sender_to_subscribers_` is also modified in `HandleGcsSubscriberCommandBatch` which will run on the GCSPublisher thread. https://github.com/ray-project/ray/blob/a1b9dc627c7ab0b31ae5df03892f4d49192463e4/src/ray/gcs/gcs_server/pubsub_handler.cc#L82 Obvious races here. ## Solution We can just post the `RemoveSubscriberFrom` call onto the GCSPublisher io_context so everything runs on a single thread. This keeps us from trying to make the pub sub handler thread safe. All usage of the pub sub handler should always be on the GCSPublisher io_context. --------- Signed-off-by: dayshah <dhyey2019@gmail.com> Signed-off-by: elliot-barn <elliot.barnwell@anyscale.com>
Problem
The GCS's pubsub service is created with it's own io context. This means that all handlers will be posted to it's own thread.
ray/src/ray/gcs/gcs_server/gcs_server.cc
Lines 614 to 617 in a1b9dc6
RemoveSubscriberFrom
is called from the main thread on the GCS though. Andsender_to_subscribers_
is a map that is modified there.ray/src/ray/gcs/gcs_server/pubsub_handler.cc
Line 124 in a1b9dc6
sender_to_subscribers_
is also modified inHandleGcsSubscriberCommandBatch
which will run on the GCSPublisher thread.ray/src/ray/gcs/gcs_server/pubsub_handler.cc
Line 82 in a1b9dc6
Obvious races here.
Solution
We can just post the
RemoveSubscriberFrom
call onto the GCSPublisher io_context so everything runs on a single thread. This keeps us from trying to make the pub sub handler thread safe. All usage of the pub sub handler should always be on the GCSPublisher io_context.