8000 WARN level exception thrown during shut down of ClusteredMediaDriver · Issue #1784 · aeron-io/aeron · GitHub
[go: up one dir, main page]
More Web Proxy on the site http://driver.im/
Skip to content

WARN level exception thrown during shut down of ClusteredMediaDriver #1784

New issue

Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.

By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.

Already on GitHub? Sign in to your account

Open
zamhassam opened this issue Apr 24, 2025 · 0 comments
Open
8A26

Comments

@zamhassam
Copy link

Hi,

I get the below AeronException from time to time after upgrading form 1.43.0 to 1.47.4. This seems to be in relation to the controlSession set up between the ConsensusModule and Archive.

io.aeron.archive.client.ArchiveEvent: WARN - controlSessionId=536070027 (responseStreamId=120 responseChannel=aeron:ipc?mtu=1408|term-length=65536|session-id=1698090729|alias=cm-archive-ctrl-resp-cluster-0|sparse=true) terminated: request publication image unavailable: image.correlationId=37 sessionId=1698090729 streamId=10 channel=aeron:ipc?term-length=64k

I managed to replicate this in your existing test io.aeron.cluster.ClusterNodeRestartTest#shouldRestartServiceFromSnapshot. This isn't treated as a failure as its category is AeronException.Category.WARN. This only happens after quite a few runs of the test. However it happens more frequently in the system I am working on. I believe this is down to timings.

In the system I am working on, our shutdown procedure is as follows. We will use the ClusterTool to initiate SHUTDOWN which will trigger the ConsensusModule to enqueue a snapshot to all clustered service container nodes before instructing them to shutdown. In order to give time to the clustered service container nodes to do their snapshots and process the shutdown command, we wait two seconds, and then close the ClusteredMediaDriver. Adding that same two seconds into the test causes the Exception to occur more regularly. I have attached the debug logging from the ClusteredMediaDriver when the issue happens in our set up.

I have a suspicion this is because the clean up in io.aeron.cluster.ConsensusModuleAgent#onClose is guarded by !ctx.ownsAeronClient() but using the ClusteredMediaDriver will mean the ConsensusModule always owns its Aeron client.

I've sent a PR with a test and a potential fix. Please let me know your thoughts on the above and if you need any further information from me. The test has a two second wait in it which seems yucky. Maybe you can tell me me if there is a better wait to do this.
#1783

[79850.459638138] ARCHIVE: CMD_IN_START_RECORDING2 [85/85]: controlSessionId=453874119 correlationId=218 streamId=107 sourceLocation=LOCAL autoStop=TRUE channel=aeron:ipc?session-id=968304786|alias=snapshot
[79850.459851243] ARCHIVE: CMD_OUT_RESPONSE [44/44]: controlSessionId=453874119 correlationId=218 relevantId=219 code=OK version=68352 errorMessage=
[79850.460469380] ARCHIVE: RECORDING_SIGNAL [52/52]: controlSessionId=453874119 correlationId=218 recordingId=15 subscriptionId=219 position=0 signal=START
[79850.463470211] ARCHIVE: CMD_IN_LIST_RECORDING [32/32]: controlSessionId=453874119 correlationId=222 recordingId=3
[79850.464688549] ARCHIVE: RECORDING_SIGNAL [52/52]: controlSessionId=453874119 correlationId=218 recordingId=15 subscriptionId=219 position=1504 signal=STOP
[79850.465015521] ARCHIVE: CMD_IN_STOP_RECORDING_SUBSCRIPTION [32/32]: controlSessionId=453874119 correlationId=224 subscriptionId=130
[79850.465099427] ARCHIVE: CMD_OUT_RESPONSE [44/44]: controlSessionId=453874119 correlationId=224 relevantId=0 code=OK version=68352 errorMessage=
[79850.465164559] ARCHIVE: RECORDING_SIGNAL [52/52]: controlSessionId=453874119 correlationId=129 recordingId=3 subscriptionId=130 position=230144 signal=STOP
[79850.475171601] ARCHIVE: CMD_OUT_RESPONSE [44/44]: controlSessionId=453874119 correlationId=-1 relevantId=-1 code=OK version=68352 errorMessage=
[79850.487428401] ARCHIVE: CMD_IN_STOP_RECORDING_BY_IDENTITY [32/32]: controlSessionId=453874119 correlationId=235 recordingId=3
[79850.487461955] ARCHIVE: CMD_OUT_RESPONSE [44/44]: controlSessionId=453874119 correlationId=235 relevantId=0 code=OK version=68352 errorMessage=
[79850.966562052] ARCHIVE: CONTROL_SESSION_STATE_CHANGE [67/67]: controlSessionId=453874119 ACTIVE -> INACTIVE reason="control publication not connected"
[79850.966614191] ARCHIVE: CONTROL_SESSION_STATE_CHANGE [40/40]: controlSessionId=453874119 INACTIVE -> DONE reason="inactive"
Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment
Labels
None yet
Projects
None yet
Development

No branches or pull requests

1 participant
0