dispatcher: fix panic for closed chan #1387
Conversation
Hello @hongyunyan, I'm Gemini Code Assist[^1]! I'm currently reviewing this pull request and will post my feedback shortly. In the meantime, here's a summary to help you and other reviewers quickly get up to speed!
Summary of Changes
Hello! Gemini here, providing a summary of this pull request to help everyone quickly understand the changes.
The primary goal of this PR, based on the title, is to fix a panic that occurs when interacting with a closed channel inside the `ConflictDetector` in the downstream adapter's MySQL sink. The changes refactor the closing logic of the `ConflictDetector` to prevent this issue.
Highlights
- Panic Fix: The core issue addressed is a panic related to using a closed channel, likely when the `ConflictDetector` is shutting down or being closed concurrently with operations trying to add events.
- Refactor Closing Logic: The mechanism for closing the `ConflictDetector` has been refactored. A new method `closeCache` is introduced to specifically handle the closing of the transaction cache channels, separating this from the draining of the `notifiedNodes` channel (see the sketch after this list).
- Removed Redundant Checks: Checks for the `closed` state within the `Add` and `sendToCache` methods have been removed. This suggests the new closing strategy handles concurrency differently, or that these checks were not effectively preventing the panic.
- Renamed Field: The `closed` atomic boolean field has been renamed to `cacheClosed` for better clarity, indicating which specific part of the detector's state it tracks.
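To make the new structure concrete, here is a minimal, hedged sketch of what a `closeCache` method along these lines could look like. Everything except the `resolvedTxnCaches` and `cacheClosed` names is invented for illustration (including the `txnCache` stand-in type and its `out()` channel), and the real implementation in conflict_detector.go may order or guard these steps differently.

```go
package causality

import "sync/atomic"

// txnCache is a hypothetical stand-in for the real resolved-transaction
// cache; the only assumption is that it owns an output channel.
type txnCache struct {
	outCh chan any
}

func (c *txnCache) out() chan any { return c.outCh }

// ConflictDetector is reduced here to the fields relevant to closing.
type ConflictDetector struct {
	resolvedTxnCaches []*txnCache
	cacheClosed       atomic.Bool
}

// closeCache closes every per-cache output channel at most once and
// records that fact in cacheClosed.
func (d *ConflictDetector) closeCache() {
	if d.cacheClosed.CompareAndSwap(false, true) {
		for _, cache := range d.resolvedTxnCaches {
			close(cache.out())
		}
	}
}
```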
Changelog
- downstreamadapter/sink/mysql/causality/conflict_detector.go
  - Renamed the `closed` atomic boolean field to `cacheClosed` (line 45).
  - Changed the deferred call in the `Run` method from `d.Close()` to `d.closeCache()` (line 66).
  - Removed the `if d.closed.Load()` check and early return in the `Add` method (lines 84-86).
  - Removed the `if d.closed.Load()` check and early return in the `sendToCache` method (lines 109-111).
  - Introduced a new private method `closeCache()` to handle closing the `resolvedTxnCaches` channels and setting `cacheClosed` (lines 117-125).
  - Modified the `Close()` method to first drain `notifiedNodes` and then call the new `closeCache()` method (lines 128-130). A sketch of this new ordering follows the list.
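The last two changelog entries describe a specific ordering: `Run` now defers `closeCache`, and `Close` drains `notifiedNodes` before closing the caches. A hedged sketch of that shape is below; the drainable-channel wrapper and the loop body are stand-ins written for this example (the review only tells us that `notifiedNodes` has `In()` and `CloseAndDrain()`), so the real code will differ in detail.

```go
package causality

// drainableChann is a stand-in for the real notifiedNodes wrapper; only
// the In and CloseAndDrain behaviours mentioned in the review are modelled.
type drainableChann struct {
	ch chan struct{}
}

func (c *drainableChann) In() chan<- struct{}  { return c.ch }
func (c *drainableChann) Out() <-chan struct{} { return c.ch }

// CloseAndDrain closes the channel and discards anything still buffered.
func (c *drainableChann) CloseAndDrain() {
	close(c.ch)
	for range c.ch {
	}
}

type detector struct {
	notifiedNodes *drainableChann
}

// closeCache closes the transaction caches; see the earlier sketch.
func (d *detector) closeCache() {}

// Run's cleanup now defers closeCache instead of the full Close (line 66).
func (d *detector) Run() {
	defer d.closeCache()
	for range d.notifiedNodes.Out() {
		// dispatch notified transactions to the caches (elided).
	}
}

// Close first drains notifiedNodes so Run's loop can exit, then closes
// the caches, matching the order described for lines 128-130.
func (d *detector) Close() {
	d.notifiedNodes.CloseAndDrain()
	d.closeCache()
}
```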
Using Gemini Code Assist
The full guide for Gemini Code Assist can be found on our documentation page; here are some quick tips.
Invoking Gemini
You can request assistance from Gemini at any point in your pull request by creating an issue comment (i.e. a comment on the pull request page) using either `/gemini <command>` or `@gemini-code-assist <command>`. Below is a summary of the supported commands.
| Feature | Command | Description |
|---|---|---|
| Code Review | `/gemini review` | Performs a code review for the current pull request in its current state. |
| Pull Request Summary | `/gemini summary` | Provides a summary of the current pull request in its current state. |
| Comment | `@gemini-code-assist` | Responds in comments when explicitly tagged, both in issue comments and review comments. |
| Help | `/gemini help` | Displays a list of available commands. |
Customization
To customize the Gemini Code Assist for GitHub experience, repository maintainers can create a configuration file and/or provide a custom code review style guide (such as PEP-8 for Python) by creating and adding files to a `.gemini/` folder in the base of the repository. Detailed instructions can be found here.
Limitations & Feedback
Gemini Code Assist is currently in preview and may make mistakes. Please leave feedback on any instances where its feedback is incorrect or counterproductive. You can react with 👍 and 👎 on @gemini-code-assist comments to provide feedback.
You can also get AI-powered code generation, chat, and code reviews directly in the IDE at no cost with the Gemini Code Assist IDE Extension.
A channel closed, a panic arose,
A bug to fix, as everyone knows.
Refactor the close,
Where the data flows,
And the program happily goes.
Footnotes
[^1]: Review the Privacy Notices, Generative AI Prohibited Use Policy, Terms of Service, and learn how to configure Gemini Code Assist in GitHub here. Gemini can make mistakes, so double-check it and use code with caution.
Code Review
This pull request aims to fix a panic related to closed channels in the `ConflictDetector`. The changes involve renaming a flag from `closed` to `cacheClosed` for clarity, refactoring the channel closing logic into a new `closeCache` method, and adjusting the sequence of operations in the main `Close` method. Additionally, checks for this closed state have been removed from the `Add` and `sendToCache` methods.
While the renaming and refactoring improve clarity and structure, the removal of the closed-state checks in `Add` and `sendToCache` is a significant concern. These removals could potentially lead to panics if these methods are called concurrently with the `Close` operation, specifically when attempting to send data to channels (`notifiedNodes.In()` or `cache.out()`) that are already closed or are in the process of closing. The original checks might have been imperfect, but removing them without a more robust replacement could exacerbate the issue or introduce new ones.
My review focuses on these potential race conditions and the safety of operations during the shutdown sequence.
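As background for the concern above: in Go, a send on a closed channel always panics, regardless of buffering, which is why an unguarded send racing with shutdown is dangerous. A small, self-contained illustration (not code from this repository):

```go
package main

import "fmt"

func main() {
	// Recover so the demonstration prints the panic instead of crashing.
	defer func() {
		if r := recover(); r != nil {
			fmt.Println("recovered:", r) // prints: recovered: send on closed channel
		}
	}()

	ch := make(chan int, 1) // buffering does not help: the send still panics
	close(ch)

	ch <- 1 // panics with "send on closed channel"
}
```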
Summary of Findings
- Potential Panic in `Add` Method During Shutdown: The `Add` method no longer checks if the `ConflictDetector` is closing. If called concurrently with or after `Close()`, it can set up callbacks that attempt to send to `d.notifiedNodes.In()`, which may already be closed, leading to a panic. This is a critical issue.
- Potential Panic in `sendToCache` Method During Shutdown: The `sendToCache` method no longer checks if the caches are closed (via `d.cacheClosed`). If called after `d.closeCache()` has closed the relevant `cache.out()` channel, the `cache.add(event)` operation will panic. This is a critical issue.
- Clarity of Fix for Original Panic: The PR aims to 'fix panic for closed chan', but the removal of existing (though potentially imperfect) safety checks without clear, more robust replacements is concerning. The exact nature of the original panic, and how these specific changes address it without introducing new risks, needs clarification.
Merge Readiness
The pull request addresses an important issue regarding panics with closed channels. However, the current changes, particularly the removal of safety checks in `Add` and `sendToCache`, appear to introduce significant risks of new panics during concurrent shutdown operations.
Due to these critical concerns, I recommend that the identified potential panic scenarios be thoroughly addressed before this PR is merged. I am unable to approve the pull request in its current state; further review and approval from other maintainers will be necessary after these issues are resolved.
```go
func (d *ConflictDetector) Add(event *commonEvent.DMLEvent) {
	if d.closed.Load() {
		return
	}
	hashes := ConflictKeys(event)
	node := d.slots.AllocNode(hashes)
```
The check `if d.closed.Load()` (which would now correspond to a general detector-closed state) has been removed from the `Add` method. This is concerning because `Add` sets up `node.OnNotified`, which eventually sends to `d.notifiedNodes.In()`.
In the `Close` method, `d.notifiedNodes.CloseAndDrain()` is called. If `Add` is invoked concurrently with or after `Close` (specifically, after `d.notifiedNodes` has been closed), the attempt to send to `d.notifiedNodes.In()` via the `OnNotified` callback will cause a panic.
Could you clarify the reasoning for removing this check? Was the original panic elsewhere, or is there an external synchronization mechanism in the `Sink` that prevents `Add` from being called once `Close` has started on the `ConflictDetector`?
If `Add` can indeed be called during shutdown, a check against a flag indicating the detector is closing (set at the beginning of `ConflictDetector.Close()`) seems necessary to prevent this panic. The `cacheClosed` flag is set too late in the `Close()` sequence to protect `notifiedNodes`.
```go
func (d *ConflictDetector) sendToCache(event *commonEvent.DMLEvent, id int64) bool {
	if d.closed.Load() {
		return false
	}
	cache := d.resolvedTxnCaches[id]
	ok := cache.add(event)
	return ok
```
Similar to the `Add` method, the check `if d.closed.Load()` (which would now relate to `d.cacheClosed.Load()`) has been removed from `sendToCache`. This method calls `cache.add(event)`, which internally pushes to `cache.out()`. The `cache.out()` channels are closed within `d.closeCache()`.
If `sendToCache` is called after `d.closeCache()` has closed the specific `cache.out()` channel (or after `d.cacheClosed` becomes true, indicating all such channels are closed), the `cache.add(event)` call will panic when it tries to push to the closed channel.
What was the rationale for removing this check? If `sendToCache` can be called while `closeCache` is executing or after it has completed, this check (or a similar one using `d.cacheClosed.Load()`) appears crucial to prevent panics.
Reinstating a check like `if d.cacheClosed.Load() { return false }` at the beginning of this method would prevent attempts to add events to caches whose underlying channels are already closed.
@gemini-code-assist[bot]: adding LGTM is restricted to approvers and reviewers in OWNERS files. In response to this:
Instructions for interacting with me using PR comments are available here. If you have questions or suggestions related to my behavior, please file an issue against the kubernetes-sigs/prow repository.
[APPROVALNOTIFIER] This PR is APPROVED
This pull-request has been approved by: lidezhu
The full list of commands accepted by this bot can be found here. The pull request process is described here.
Needs approval from an approver in each of these files:
Approvers can indicate their approval by writing `/approve` in a comment.
[LGTM Timeline notifier] Timeline:
@hongyunyan: The following test failed, say `/retest` to rerun all failed tests:
Full PR test history. Your PR dashboard. Instructions for interacting with me using PR comments are available here. If you have questions or suggestions related to my behavior, please file an issue against the kubernetes-sigs/prow repository. I understand the commands that are listed here.
What problem does this PR solve?
Issue Number: close #1386
What is changed and how it works?
Check List
Tests
Questions
Will it cause performance regression or break compatibility?
Do you need to update user documentation, design documentation or monitoring documentation?
Release note