8000 dispatcher: fix panic for closed chan by hongyunyan · Pull Request #1387 · pingcap/ticdc · GitHub
[go: up one dir, main page]
More Web Proxy on the site http://driver.im/
Skip to content

dispatcher: fix panic for closed chan #1387

New issue

Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.

By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.

Already on GitHub? Sign in to your account

Merged
merged 4 commits into from
May 29, 2025
Merged

Conversation

hongyunyan
Copy link
Collaborator
@hongyunyan hongyunyan commented May 29, 2025

What problem does this PR solve?

Issue Number: close #1386

What is changed and how it works?

Check List

Tests

  • Unit test
  • Integration test
  • Manual test (add detailed scripts or steps below)
  • No code

Questions

Will it cause performance regression or break compatibility?
Do you need to update user documentation, design documentation or monitoring documentation?

Release note

Please refer to [Release Notes Language Style Guide](https://pingcap.github.io/tidb-dev-guide/contribute-to-tidb/release-notes-style-guide.html) to write a quality release note.

If you don't think this PR needs a release note then fill it with `None`.

@ti-chi-bot ti-chi-bot bot added do-not-merge/needs-linked-issue release-note Denotes a PR that will be considered when it comes time to generate release notes. size/S Denotes a PR that changes 10-29 lines, ignoring generated files. labels May 29, 2025
Copy link
@gemini-code-assist gemini-code-assist bot left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Hello @hongyunyan, I'm Gemini Code Assist1! I'm currently reviewing this pull request and will post my feedback shortly. In the meantime, here's a summary to help you and other reviewers quickly get up to speed!

Summary of Changes

Hello! Gemini here, providing a summary of this pull request to help everyone quickly understand the changes.

The primary goal of this PR, based on the title, is to fix a panic that occurs due to interacting with a closed channel within the ConflictDetector in the downstream adapter's MySQL sink. The changes involve refactoring the closing logic of the ConflictDetector to prevent this issue.

Highlights

  • Panic Fix: The core issue addressed is a panic related to using a closed channel, likely when the ConflictDetector is shutting down or being closed concurrently with operations trying to add events.
  • Refactor Closing Logic: The mechanism for closing the ConflictDetector has been refactored. A new method closeCache is introduced to specifically handle the closing of the transaction cache channels, separating this from the draining of the notifiedNodes channel.
  • Removed Redundant Checks: Checks for the closed state within the Add and sendToCache methods have been removed. This suggests the new closing strategy handles concurrency differently, or these checks were not effectively preventing the panic.
  • Renamed Field: The closed atomic boolean field has been renamed to cacheClosed for better clarity, indicating what specific part of the detector's state it tracks.

Changelog

  • downstreamadapter/sink/mysql/causality/conflict_detector.go
    • Renamed the closed atomic boolean field to cacheClosed (line 45).
    • Changed the deferred call in the Run method from d.Close() to d.closeCache() (line 66).
    • Removed the if d.closed.Load() check and early return in the Add method (lines 84-86).
    • Removed the if d.closed.Load() check and early return in the sendToCache method (lines 109-111).
    • Introduced a new private method closeCache() to handle closing the resolvedTxnCaches channels and setting cacheClosed (lines 117-125).
    • Modified the Close() method to first drain notifiedNodes and then call the new closeCache() method (lines 128-130).
Using Gemini Code Assist

The full guide for Gemini Code Assist can be found on our documentation page, here are some quick tips.

Invoking Gemini

You can request assistance from Gemini at any point in your pull request via creating an issue comment (i.e. comment on the pull request page) using either /gemini <command> or @gemini-code-assist <command>. Below is a summary of the supported commands.

Feature Command Description
Code Review /gemini review Performs a code review for the current pull request in its current state.
Pull Request Summary /gemini summary Provides a summary of the current pull request in its current state.
Comment @gemini-code-assist Responds in comments when explicitly tagged, both in issue comments and review comments.
Help /gemini help Displays a list of available commands.

Customization

To customize Gemini Code Assist for GitHub experience, repository maintainers can create a configuration file and/or provide a custom code review style guide (such as PEP-8 for Python) by creating and adding files to a .gemini/ folder in the base of the repository. Detailed instructions can be found here.

Limitations & Feedback

Gemini Code Assist is currently in preview and may make mistakes. Please leave feedback on any instances where its feedback is incorrect or counter productive. You can react with 👍 and 👎 on @gemini-code-assist comments to provide feedback.

You can also get AI-powered code generation, chat, as well as code reviews directly in the IDE at no cost with the Gemini Code Assist IDE Extension.


A channel closed, a panic arose,
A bug to fix, as everyone knows.
Refactor the close,
Where the data flows,
And the program happily goes.

Footnotes

  1. Review the Privacy Notices, Generative AI Prohibited Use Policy, Terms of Service, and learn how to configure Gemini Code Assist in GitHub here. Gemini can make mistakes, so double check it and use code with caution.

@ti-chi-bot ti-chi-bot bot 8000 added do-not-merge/needs-triage-completed size/M Denotes a PR that changes 30-99 lines, ignoring generated files. and removed do-not-merge/needs-linked-issue size/S Denotes a PR that changes 10-29 lines, ignoring generated files. labels May 29, 2025
Copy link
@gemini-code-assist gemini-code-assist bot left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Code Review

This pull request aims to fix a panic related to closed channels in the ConflictDetector. The changes involve renaming a flag from closed to cacheClosed for clarity, refactoring the channel closing logic into a new closeCache method, and adjusting the main Close method's sequence of operations. Additionally, checks for this closed state have been removed from the Add and sendToCache methods.

While the renaming and refactoring improve clarity and structure, the removal of the closed-state checks in Add and sendToCache is a significant concern. These removals could potentially lead to panics if these methods are called concurrently with the Close operation, specifically when attempting to send data to channels (notifiedNodes.In() or cache.out()) that are already closed or are in the process of closing. The original checks might have been imperfect, but removing them without a more robust replacement could exacerbate the issue or introduce new ones.

My review focuses on these potential race conditions and the safety of operations during the shutdown sequence.

Summary of Findings

  • Potential Panic in Add Method During Shutdown: The Add method no longer checks if the ConflictDetector is closing. If called concurrently with or after Close(), it can set up callbacks that attempt to send to d.notifiedNodes.In(), which may already be closed, leading to a panic. This is a critical issue.
  • Potential Panic in sendToCache Method During Shutdown: The sendToCache method no longer checks if the caches are closed (via d.cacheClosed). If called after d.closeCache() has closed the relevant cache.out() channel, the cache.add(event) operation will panic. This is a critical issue.
  • Clarity of Fix for Original Panic: The PR aims to 'fix panic for closed chan', but the removal of existing (though potentially imperfect) safety checks without clear, more robust replacements is concerning. The exact nature of the original panic and how these specific changes address it without introducing new risks needs clarification.

Merge Readiness

The pull request addresses an important issue regarding panics with closed channels. However, the current changes, particularly the removal of safety checks in Add and sendToCache, appear to introduce significant risks of new panics during concurrent shutdown operations.

Due to these critical concerns, I recommend that the identified potential panic scenarios be thoroughly addressed before this PR is merged. I am unable to approve the pull request in its current state; further review and approval from other maintainers will be necessary after these issues are resolved.

Comment on lines 83 to 86
func (d *ConflictDetector) Add(event *commonEvent.DMLEvent) {
if d.closed.Load() {
return
}
hashes := ConflictKeys(event)
node := d.slots.AllocNode(hashes)

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

critical

The check if d.closed.Load() (which would now correspond to a general detector closed state) has been removed from the Add method. This is concerning because Add sets up node.OnNotified, which eventually sends to d.notifiedNodes.In().

In the Close method, d.notifiedNodes.CloseAndDrain() is called. If Add is invoked concurrently with or after Close (specifically, after d.notifiedNodes has been closed), the attempt to send to d.notifiedNodes.In() via the OnNotified callback will cause a panic.

Could you clarify the reasoning for removing this check? Was the original panic elsewhere, or is there an external synchronization mechanism in the Sink that prevents Add from being called once Close has started on the ConflictDetector?

If Add can indeed be called during shutdown, a check against a flag indicating the detector is closing (set at the beginning of ConflictDetector.Close()) seems necessary to prevent this panic. The cacheClosed flag is set too late in the Close() sequence to protect notifiedNodes.

Comment on lines 105 to 108
func (d *ConflictDetector) sendToCache(event *commonEvent.DMLEvent, id int64) bool {
if d.closed.Load() {
return false
}
cache := d.resolvedTxnCaches[id]
ok := cache.add(event)
return ok

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

critical

Similar to the Add method, the check if d.closed.Load() (which would now relate to d.cacheClosed.Load()) has been removed from sendToCache. This method calls cache.add(event), which internally pushes to cache.out(). The cache.out() channels are closed within d.closeCache().

If sendToCache is called after d.closeCache() has closed the specific cache.out() channel (or after d.cacheClosed becomes true, indicating all such channels are closed), the cache.add(event) call will panic when it tries to push to the closed channel.

What was the rationale for removing this check? If sendToCache can be called while closeCache is executing or after it has completed, this check (or a similar one using d.cacheClosed.Load()) appears crucial to prevent panics.

Reinstating a check like if d.cacheClosed.Load() { return false } at the beginning of this method would prevent attempts to add events to caches whose underlying channels are already closed.

Copy link
ti-chi-bot bot commented May 29, 2025

@gemini-code-assist[bot]: adding LGTM is restricted to approvers and reviewers in OWNERS files.

In response to this:

Code Review

This pull request aims to fix a panic related to closed channels in the ConflictDetector. The changes involve renaming a flag from closed to cacheClosed for clarity, refactoring the channel closing logic into a new closeCache method, and adjusting the main Close method's sequence of operations. Additionally, checks for this closed state have been removed from the Add and sendToCache methods.

While the renaming and refactoring improve clarity and structure, the removal of the closed-state checks in Add and sendToCache is a significant concern. These removals could potentially lead to panics if these methods are called concurrently with the Close operation, specifically when attempting to send data to channels (notifiedNodes.In() or cache.out()) that are already closed or are in the process of closing. The original checks might have been imperfect, but removing them without a more robust replacement could exacerbate the issue or introduce new ones.

My review focuses on these potential race conditions and the safety of operations during the shutdown sequence.

Summary of Findings

  • Potential Panic in Add Method During Shutdown: The Add method no longer checks if the ConflictDetector is closing. If called concurrently with or after Close(), it can set up callbacks that attempt to send to d.notifiedNodes.In(), which may already be closed, leading to a panic. This is a critical issue.
  • Potential Panic in sendToCache Method During Shutdown: The sendToCache method no longer checks if the caches are closed (via d.cacheClosed). If called after d.closeCache() has closed the relevant cache.out() channel, the cache.add(event) operation will panic. This is a critical issue.
  • Clarity of Fix for Original Panic: The PR aims to 'fix panic for closed chan', but the removal of existing (though potentially imperfect) safety checks without clear, more robust replacements is concerning. The exact nature of the original panic and how these specific changes address it without introducing new risks needs clarification.

Merge Readiness

The pull request addresses an important issue regarding panics with closed channels. However, the current changes, particularly the removal of safety checks in Add and sendToCache, appear to introduce significant risks of new panics during concurrent shutdown operations.

Due to these critical concerns, I recommend that the identified potential panic scenarios be thoroughly addressed before this PR is merged. I am unable to approve the pull request in its current state; further review and approval from other maintainers will be necessary after these issues are resolved.

Instructions for interacting with me using PR comments are available here. If you have questions or suggestions related to my behavior, please file an issue against the kubernetes-sigs/prow repository.

@ti-chi-bot ti-chi-bot bot added size/S Denotes a PR that changes 10-29 lines, ignoring generated files. and removed size/M Denotes a PR that changes 30-99 lines, ignoring generated files. labels May 29, 2025
@ti-chi-bot ti-chi-bot bot added the lgtm label May 29, 2025
Copy link
ti-chi-bot bot commented May 29, 2025

[APPROVALNOTIFIER] This PR is APPROVED

This pull-request has been approved by: lidezhu

The full list of commands accepted by this bot can be found here.

The pull request process is described here

Needs approval from an approver in each of these files:

Approvers can indicate their approval by writing /approve in a comment
Approvers can cancel approval by writing /approve cancel in a comment

Copy link
ti-chi-bot bot commented May 29, 2025

[LGTM Timeline notifier]

Timeline:

  • 2025-05-29 10:49:07.026783028 +0000 UTC m=+102368.088800649: ☑️ agreed by lidezhu.

@ti-chi-bot ti-chi-bot bot added the approved label May 29, 2025
Copy link
ti-chi-bot bot commented May 29, 2025

@hongyunyan: The following test failed, say /retest to rerun all failed tests or /retest-required to rerun all mandatory failed tests:

Test name Commit Details Required Rerun command
pull-cdc-pulsar-integration-light 2658d70 link false /test pull-cdc-pulsar-integration-light

Full PR test history. Your PR dashboard.

Instructions for interacting with me using PR comments are available here. If you have questions or suggestions related to my behavior, please file an issue against the kubernetes-sigs/prow repository. I understand the commands that are listed here.

@ti-chi-bot ti-chi-bot bot merged commit f7d02d8 into pingcap:master May 29, 2025
13 of 15 checks passed
Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment
Labels
approved lgtm release-note Denotes a PR that will be considered when it comes time to generate release notes. size/S Denotes a PR that changes 10-29 lines, ignoring generated files.
Projects
None yet
Development

Successfully merging this pull request may close these issues.

panic in sink
2 participants
0