WARN level exception thrown during shut down of ClusteredMediaDriver #1784

zamhassam · 2025-04-24T13:46:50Z

Hi,

I get the below AeronException from time to time after upgrading form 1.43.0 to 1.47.4. This seems to be in relation to the controlSession set up between the ConsensusModule and Archive.

io.aeron.archive.client.ArchiveEvent: WARN - controlSessionId=536070027 (responseStreamId=120 responseChannel=aeron:ipc?mtu=1408|term-length=65536|session-id=1698090729|alias=cm-archive-ctrl-resp-cluster-0|sparse=true) terminated: request publication image unavailable: image.correlationId=37 sessionId=1698090729 streamId=10 channel=aeron:ipc?term-length=64k

I managed to replicate this in your existing test io.aeron.cluster.ClusterNodeRestartTest#shouldRestartServiceFromSnapshot. This isn't treated as a failure as its category is AeronException.Category.WARN. This only happens after quite a few runs of the test. However it happens more frequently in the system I am working on. I believe this is down to timings.

In the system I am working on, our shutdown procedure is as follows. We will use the ClusterTool to initiate SHUTDOWN which will trigger the ConsensusModule to enqueue a snapshot to all clustered service container nodes before instructing them to shutdown. In order to give time to the clustered service container nodes to do their snapshots and process the shutdown command, we wait two seconds, and then close the ClusteredMediaDriver. Adding that same two seconds into the test causes the Exception to occur more regularly. I have attached the debug logging from the ClusteredMediaDriver when the issue happens in our set up.

I have a suspicion this is because the clean up in io.aeron.cluster.ConsensusModuleAgent#onClose is guarded by !ctx.ownsAeronClient() but using the ClusteredMediaDriver will mean the ConsensusModule always owns its Aeron client.

I've sent a PR with a test and a potential fix. Please let me know your thoughts on the above and if you need any further information from me. The test has a two second wait in it which seems yucky. Maybe you can tell me me if there is a better wait to do this.
#1783

[79850.459638138] ARCHIVE: CMD_IN_START_RECORDING2 [85/85]: controlSessionId=453874119 correlationId=218 streamId=107 sourceLocation=LOCAL autoStop=TRUE channel=aeron:ipc?session-id=968304786|alias=snapshot
[79850.459851243] ARCHIVE: CMD_OUT_RESPONSE [44/44]: controlSessionId=453874119 correlationId=218 relevantId=219 code=OK version=68352 errorMessage=
[79850.460469380] ARCHIVE: RECORDING_SIGNAL [52/52]: controlSessionId=453874119 correlationId=218 recordingId=15 subscriptionId=219 position=0 signal=START
[79850.463470211] ARCHIVE: CMD_IN_LIST_RECORDING [32/32]: controlSessionId=453874119 correlationId=222 recordingId=3
[79850.464688549] ARCHIVE: RECORDING_SIGNAL [52/52]: controlSessionId=453874119 correlationId=218 recordingId=15 subscriptionId=219 position=1504 signal=STOP
[79850.465015521] ARCHIVE: CMD_IN_STOP_RECORDING_SUBSCRIPTION [32/32]: controlSessionId=453874119 correlationId=224 subscriptionId=130
[79850.465099427] ARCHIVE: CMD_OUT_RESPONSE [44/44]: controlSessionId=453874119 correlationId=224 relevantId=0 code=OK version=68352 errorMessage=
[79850.465164559] ARCHIVE: RECORDING_SIGNAL [52/52]: controlSessionId=453874119 correlationId=129 recordingId=3 subscriptionId=130 position=230144 signal=STOP
[79850.475171601] ARCHIVE: CMD_OUT_RESPONSE [44/44]: controlSessionId=453874119 correlationId=-1 relevantId=-1 code=OK version=68352 errorMessage=
[79850.487428401] ARCHIVE: CMD_IN_STOP_RECORDING_BY_IDENTITY [32/32]: controlSessionId=453874119 correlationId=235 recordingId=3
[79850.487461955] ARCHIVE: CMD_OUT_RESPONSE [44/44]: controlSessionId=453874119 correlationId=235 relevantId=0 code=OK version=68352 errorMessage=
[79850.966562052] ARCHIVE: CONTROL_SESSION_STATE_CHANGE [67/67]: controlSessionId=453874119 ACTIVE -> INACTIVE reason="control publication not connected"
[79850.966614191] ARCHIVE: CONTROL_SESSION_STATE_CHANGE [40/40]: controlSessionId=453874119 INACTIVE -> DONE reason="inactive"

The text was updated successfully, but these errors were encountered:

zamhassam mentioned this issue Apr 24, 2025

WARN level exception thrown during shut down of ClusteredMediaDriver #1783

Open

Provide feedback

Saved searches

Use saved searches to filter your results more quickly

WARN level exception thrown during shut down of ClusteredMediaDriver #1784

WARN level exception thrown during shut down of ClusteredMediaDriver #1784

WARN level exception thrown during shut down of ClusteredMediaDriver #1784

WARN level exception thrown during shut down of ClusteredMediaDriver #1784

Comments