dispatcher: fix panic for closed chan #1387

hongyunyan · 2025-05-29T09:47:19Z

What problem does this PR solve?

Issue Number: close #1386

What is changed and how it works?

Check List

Tests

Unit test
Integration test
Manual test (add detailed scripts or steps below)
No code

Questions

Will it cause performance regression or break compatibility?

Do you need to update user documentation, design documentation or monitoring documentation?

Release note

Please refer to [Release Notes Language Style Guide](https://pingcap.github.io/tidb-dev-guide/contribute-to-tidb/release-notes-style-guide.html) to write a quality release note.

If you don't think this PR needs a release note then fill it with `None`.

gemini-code-assist

Hello @hongyunyan, I'm Gemini Code Assist¹! I'm currently reviewing this pull request and will post my feedback shortly. In the meantime, here's a summary to help you and other reviewers quickly get up to speed!

Summary of Changes

Hello! Gemini here, providing a summary of this pull request to help everyone quickly understand the changes.

The primary goal of this PR, based on the title, is to fix a panic that occurs due to interacting with a closed channel within the ConflictDetector in the downstream adapter's MySQL sink. The changes involve refactoring the closing logic of the ConflictDetector to prevent this issue.

Highlights

Panic Fix: The core issue addressed is a panic related to using a closed channel, likely when the ConflictDetector is shutting down or being closed concurrently with operations trying to add events.
Refactor Closing Logic: The mechanism for closing the ConflictDetector has been refactored. A new method closeCache is introduced to specifically handle the closing of the transaction cache channels, separating this from the draining of the notifiedNodes channel.
Removed Redundant Checks: Checks for the closed state within the Add and sendToCache methods have been removed. This suggests the new closing strategy handles concurrency differently, or these checks were not effectively preventing the panic.
Renamed Field: The closed atomic boolean field has been renamed to cacheClosed for better clarity, indicating what specific part of the detector's state it tracks.

Changelog

downstreamadapter/sink/mysql/causality/conflict_detector.go
- Renamed the closed atomic boolean field to cacheClosed (line 45).
- Changed the deferred call in the Run method from d.Close() to d.closeCache() (line 66).
- Removed the if d.closed.Load() check and early return in the Add method (lines 84-86).
- Removed the if d.closed.Load() check and early return in the sendToCache method (lines 109-111).
- Introduced a new private method closeCache() to handle closing the resolvedTxnCaches channels and setting cacheClosed (lines 117-125).
- Modified the Close() method to first drain notifiedNodes and then call the new closeCache() method (lines 128-130).

Using Gemini Code Assist

The full guide for Gemini Code Assist can be found on our documentation page, here are some quick tips.

Invoking Gemini

You can request assistance from Gemini at any point in your pull request via creating an issue comment (i.e. comment on the pull request page) using either /gemini <command> or @gemini-code-assist <command>. Below is a summary of the supported commands.

Feature	Command	Description
Code Review	`/gemini review`	Performs a code review for the current pull request in its current state.
Pull Request Summary	`/gemini summary`	Provides a summary of the current pull request in its current state.
Comment	@gemini-code-assist	Responds in comments when explicitly tagged, both in issue comments and review comments.
Help	`/gemini help`	Displays a list of available commands.

Customization

To customize Gemini Code Assist for GitHub experience, repository maintainers can create a configuration file and/or provide a custom code review style guide (such as PEP-8 for Python) by creating and adding files to a .gemini/ folder in the base of the repository. Detailed instructions can be found here.

Limitations & Feedback

Gemini Code Assist is currently in preview and may make mistakes. Please leave feedback on any instances where its feedback is incorrect or counter productive. You can react with 👍 and 👎 on @gemini-code-assist comments to provide feedback.

You can also get AI-powered code generation, chat, as well as code reviews directly in the IDE at no cost with the Gemini Code Assist IDE Extension.

A channel closed, a panic arose,
A bug to fix, as everyone knows.
Refactor the close,
Where the data flows,
And the program happily goes.

Review the Privacy Notices, Generative AI Prohibited Use Policy, Terms of Service, and learn how to configure Gemini Code Assist in GitHub here. Gemini can make mistakes, so double check it and use code with caution. ↩

gemini-code-assist

Code Review

This pull request aims to fix a panic related to closed channels in the ConflictDetector. The changes involve renaming a flag from closed to cacheClosed for clarity, refactoring the channel closing logic into a new closeCache method, and adjusting the main Close method's sequence of operations. Additionally, checks for this closed state have been removed from the Add and sendToCache methods.

While the renaming and refactoring improve clarity and structure, the removal of the closed-state checks in Add and sendToCache is a significant concern. These removals could potentially lead to panics if these methods are called concurrently with the Close operation, specifically when attempting to send data to channels (notifiedNodes.In() or cache.out()) that are already closed or are in the process of closing. The original checks might have been imperfect, but removing them without a more robust replacement could exacerbate the issue or introduce new ones.

My review focuses on these potential race conditions and the safety of operations during the shutdown sequence.

Summary of Findings

Potential Panic in Add Method During Shutdown: The Add method no longer checks if the ConflictDetector is closing. If called concurrently with or after Close(), it can set up callbacks that attempt to send to d.notifiedNodes.In(), which may already be closed, leading to a panic. This is a critical issue.
Potential Panic in sendToCache Method During Shutdown: The sendToCache method no longer checks if the caches are closed (via d.cacheClosed). If called after d.closeCache() has closed the relevant cache.out() channel, the cache.add(event) operation will panic. This is a critical issue.
Clarity of Fix for Original Panic: The PR aims to 'fix panic for closed chan', but the removal of existing (though potentially imperfect) safety checks without clear, more robust replacements is concerning. The exact nature of the original panic and how these specific changes address it without introducing new risks needs clarification.

Merge Readiness

The pull request addresses an important issue regarding panics with closed channels. However, the current changes, particularly the removal of safety checks in Add and sendToCache, appear to introduce significant risks of new panics during concurrent shutdown operations.

Due to these critical concerns, I recommend that the identified potential panic scenarios be thoroughly addressed before this PR is merged. I am unable to approve the pull request in its current state; further review and approval from other maintainers will be necessary after these issues are resolved.

gemini-code-assist · 2025-05-29T09:49:16Z

downstreamadapter/sink/mysql/causality/conflict_detector.go

 func (d *ConflictDetector) Add(event *commonEvent.DMLEvent) {
-	if d.closed.Load() {
-		return
-	}
 	hashes := ConflictKeys(event)
 	node := d.slots.AllocNode(hashes)



The check if d.closed.Load() (which would now correspond to a general detector closed state) has been removed from the Add method. This is concerning because Add sets up node.OnNotified, which eventually sends to d.notifiedNodes.In().

In the Close method, d.notifiedNodes.CloseAndDrain() is called. If Add is invoked concurrently with or after Close (specifically, after d.notifiedNodes has been closed), the attempt to send to d.notifiedNodes.In() via the OnNotified callback will cause a panic.

Could you clarify the reasoning for removing this check? Was the original panic elsewhere, or is there an external synchronization mechanism in the Sink that prevents Add from being called once Close has started on the ConflictDetector?

If Add can indeed be called during shutdown, a check against a flag indicating the detector is closing (set at the beginning of ConflictDetector.Close()) seems necessary to prevent this panic. The cacheClosed flag is set too late in the Close() sequence to protect notifiedNodes.

gemini-code-assist · 2025-05-29T09:49:16Z

downstreamadapter/sink/mysql/causality/conflict_detector.go

 func (d *ConflictDetector) sendToCache(event *commonEvent.DMLEvent, id int64) bool {
-	if d.closed.Load() {
-		return false
-	}
 	cache := d.resolvedTxnCaches[id]
 	ok := cache.add(event)
 	return ok


Similar to the Add method, the check if d.closed.Load() (which would now relate to d.cacheClosed.Load()) has been removed from sendToCache. This method calls cache.add(event), which internally pushes to cache.out(). The cache.out() channels are closed within d.closeCache().

If sendToCache is called after d.closeCache() has closed the specific cache.out() channel (or after d.cacheClosed becomes true, indicating all such channels are closed), the cache.add(event) call will panic when it tries to push to the closed channel.

What was the rationale for removing this check? If sendToCache can be called while closeCache is executing or after it has completed, this check (or a similar one using d.cacheClosed.Load()) appears crucial to prevent panics.

Reinstating a check like if d.cacheClosed.Load() { return false } at the beginning of this method would prevent attempts to add events to caches whose underlying channels are already closed.

ti-chi-bot · 2025-05-29T09:49:19Z

@gemini-code-assist[bot]: adding LGTM is restricted to approvers and reviewers in OWNERS files.

In response to this:

Code Review

This pull request aims to fix a panic related to closed channels in the ConflictDetector. The changes involve renaming a flag from closed to cacheClosed for clarity, refactoring the channel closing logic into a new closeCache method, and adjusting the main Close method's sequence of operations. Additionally, checks for this closed state have been removed from the Add and sendToCache methods.

While the renaming and refactoring improve clarity and structure, the removal of the closed-state checks in Add and sendToCache is a significant concern. These removals could potentially lead to panics if these methods are called concurrently with the Close operation, specifically when attempting to send data to channels (notifiedNodes.In() or cache.out()) that are already closed or are in the process of closing. The original checks might have been imperfect, but removing them without a more robust replacement could exacerbate the issue or introduce new ones.

My review focuses on these potential race conditions and the safety of operations during the shutdown sequence.

Summary of Findings

Potential Panic in Add Method During Shutdown: The Add method no longer checks if the ConflictDetector is closing. If called concurrently with or after Close(), it can set up callbacks that attempt to send to d.notifiedNodes.In(), which may already be closed, leading to a panic. This is a critical issue.

Potential Panic in sendToCache Method During Shutdown: The sendToCache method no longer checks if the caches are closed (via d.cacheClosed). If called after d.closeCache() has closed the relevant cache.out() channel, the cache.add(event) operation will panic. This is a critical issue.

Clarity of Fix for Original Panic: The PR aims to 'fix panic for closed chan', but the removal of existing (though potentially imperfect) safety checks without clear, more robust replacements is concerning. The exact nature of the original panic and how these specific changes address it without introducing new risks needs clarification.

Merge Readiness

The pull request addresses an important issue regarding panics with closed channels. However, the current changes, particularly the removal of safety checks in Add and sendToCache, appear to introduce significant risks of new panics during concurrent shutdown operations.

Due to these critical concerns, I recommend that the identified potential panic scenarios be thoroughly addressed before this PR is merged. I am unable to approve the pull request in its current state; further review and approval from other maintainers will be necessary after these issues are resolved.

Instructions for interacting with me using PR comments are available here. If you have questions or suggestions related to my behavior, please file an issue against the kubernetes-sigs/prow repository.

ti-chi-bot · 2025-05-29T10:49:07Z

[APPROVALNOTIFIER] This PR is APPROVED

This pull-request has been approved by: lidezhu

The full list of commands accepted by this bot can be found here.

The pull request process is described here

Needs approval from an approver in each of these files:

~~OWNERS~~ [lidezhu]

Approvers can indicate their approval by writing /approve in a comment
Approvers can cancel approval by writing /approve cancel in a comment

ti-chi-bot · 2025-05-29T10:49:07Z

[LGTM Timeline notifier]

Timeline:

2025-05-29 10:49:07.026783028 +0000 UTC m=+102368.088800649: ☑️ agreed by lidezhu.

ti-chi-bot · 2025-05-29T11:11:05Z

@hongyunyan: The following test failed, say /retest to rerun all failed tests or /retest-required to rerun all mandatory failed tests:

Test name	Commit	Details	Required	Rerun command
pull-cdc-pulsar-integration-light	`2658d70`	link	false	`/test pull-cdc-pulsar-integration-light`

Full PR test history. Your PR dashboard.

Instructions for interacting with me using PR comments are available here. If you have questions or suggestions related to my behavior, please file an issue against the kubernetes-sigs/prow repository. I understand the commands that are listed here.

update

336c966

ti-chi-bot bot added do-not-merge/needs-linked-issue release-note Denotes a PR that will be considered when it comes time to generate release notes. size/S Denotes a PR that changes 10-29 lines, ignoring generated files. labels May 29, 2025

gemini-code-assist bot reviewed May 29, 2025

View reviewed changes

update

ad0c5c0

ti-chi-bot bot 8000 added do-not-merge/needs-triage-completed size/M Denotes a PR that changes 30-99 lines, ignoring generated files. and removed do-not-merge/needs-linked-issue size/S Denotes a PR that changes 10-29 lines, ignoring generated files. labels May 29, 2025

gemini-code-assist bot suggested changes May 29, 2025

View reviewed changes

hongyunyan added 2 commits May 29, 2025 17:59

update

579bfde

update

2658d70

hongyunyan removed the do-not-merge/needs-triage-completed label May 29, 2025

ti-chi-bot bot added size/S Denotes a PR that changes 10-29 lines, ignoring generated files. and removed size/M Denotes a PR that changes 30-99 lines, ignoring generated files. labels May 29, 2025

lidezhu approved these changes May 29, 2025

View reviewed changes

ti-chi-bot bot added the lgtm label May 29, 2025

ti-chi-bot bot added the approved label May 29, 2025

ti-chi-bot bot merged commit f7d02d8 into pingcap:master May 29, 2025
13 of 15 checks passed

Provide feedback

Saved searches

Use saved searches to filter your results more quickly

Uh oh!

dispatcher: fix panic for closed chan #1387

dispatcher: fix panic for closed chan #1387

Uh oh!

Uh oh!

Uh oh!

Uh oh!

Uh oh!

Uh oh!

Code Review

Summary of Findings

Merge Readiness

Uh oh!

Uh oh!

Uh oh!

Uh oh!

Uh oh!

Uh oh!

dispatcher: fix panic for closed chan #1387

dispatcher: fix panic for closed chan #1387

Uh oh!

Conversation

Uh oh!

What problem does this PR solve?

What is changed and how it works?

Check List

Tests

Questions

Will it cause performance regression or break compatibility?

Do you need to update user documentation, design documentation or monitoring documentation?

Release note

Uh oh!

Choose a reason for hiding this comment

Summary of Changes

Highlights

Changelog

Footnotes

Uh oh!

Choose a reason for hiding this comment

Code Review

Summary of Findings

Merge Readiness

Uh oh!

Choose a reason for hiding this comment

Uh oh!

Choose a reason for hiding this comment

Uh oh!

Code Review

Summary of Findings

Merge Readiness

Uh oh!

Uh oh!

[LGTM Timeline notifier]

Uh oh!

Uh oh!

Uh oh!

Uh oh!

Uh oh!