WARN level exception thrown during shut down of ClusteredMediaDriver · Issue #1784 · aeron-io/aeron · GitHub
Hi,
I get the AeronException below from time to time after upgrading from 1.43.0 to 1.47.4. This seems to be related to the controlSession set up between the ConsensusModule and the Archive.
io.aeron.archive.client.ArchiveEvent: WARN - controlSessionId=536070027 (responseStreamId=120 responseChannel=aeron:ipc?mtu=1408|term-length=65536|session-id=1698090729|alias=cm-archive-ctrl-resp-cluster-0|sparse=true) terminated: request publication image unavailable: image.correlationId=37 sessionId=1698090729 streamId=10 channel=aeron:ipc?term-length=64k
I managed to replicate this in your existing test io.aeron.cluster.ClusterNodeRestartTest#shouldRestartServiceFromSnapshot. It isn't treated as a failure because its category is AeronException.Category.WARN. It only shows up after quite a few runs of the test, but it happens more frequently in the system I am working on. I believe this is down to timing.
In the system I am working on, our shutdown procedure is as follows. We use the ClusterTool to initiate SHUTDOWN, which triggers the ConsensusModule to enqueue a snapshot to all clustered service container nodes before instructing them to shut down. To give the clustered service container nodes time to take their snapshots and process the shutdown command, we wait two seconds and then close the ClusteredMediaDriver. Adding that same two-second wait to the test causes the exception to occur more regularly. I have attached the debug logging from the ClusteredMediaDriver when the issue happens in our setup.
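One possible alternative to the fixed two-second wait in this procedure is a bounded poll that returns as soon as a condition holds and gives up at a deadline. The sketch below uses only the JDK. The class name AwaitShutdown and the `terminated` flag are hypothetical stand-ins for "the service container has taken its snapshot and terminated"; nothing here is Aeron API, and how that condition would actually be observed in a real cluster is left open.

```java
import java.time.Duration;
import java.util.concurrent.atomic.AtomicBoolean;
import java.util.concurrent.locks.LockSupport;
import java.util.function.BooleanSupplier;

public final class AwaitShutdown
{
    /**
     * Polls the condition until it returns true or the timeout elapses.
     *
     * @return true if the condition was met, false if the deadline passed first.
     */
    public static boolean await(final BooleanSupplier condition, final Duration timeout)
    {
        final long deadlineNs = System.nanoTime() + timeout.toNanos();
        while (!condition.getAsBoolean())
        {
            if (System.nanoTime() - deadlineNs >= 0)
            {
                return false;
            }
            // Back off for roughly a millisecond between polls.
            LockSupport.parkNanos(1_000_000L);
        }
        return true;
    }

    public static void main(final String[] args) throws InterruptedException
    {
        // Hypothetical flag that a service would set once it has terminated.
        final AtomicBoolean terminated = new AtomicBoolean(false);
        final Thread service = new Thread(() -> terminated.set(true));
        service.start();

        // Wait at most two seconds, but return as soon as the flag is set.
        final boolean done = await(terminated::get, Duration.ofSeconds(2));
        System.out.println("terminated=" + done);
        service.join();
    }
}
```

The two-second budget is kept as an upper bound, so a slow snapshot still gets the same grace period while a fast one no longer pays for the full wait.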
I suspect this is because the clean-up in io.aeron.cluster.ConsensusModuleAgent#onClose is guarded by !ctx.ownsAeronClient(), but using the ClusteredMediaDriver means the ConsensusModule always owns its Aeron client.
I've sent a PR with a test and a potential fix: #1783. Please let me know your thoughts on the above and whether you need any further information from me. The test has a two-second wait in it, which seems yucky. Maybe you can tell me if there is a better way to do this.
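To make the suspicion about the guard concrete, here is a toy model of it. This is not the Aeron source: the class OwnedClientCloseSketch, the method onClose(boolean) and the action strings are all hypothetical, and the only thing taken from the description above is that some clean-up runs solely when the agent does not own its Aeron client.

```java
import java.util.ArrayList;
import java.util.List;

public final class OwnedClientCloseSketch
{
    /**
     * Hypothetical stand-in for an agent's close path. Returns the clean-up
     * actions that would run for the given ownership setting.
     */
    static List<String> onClose(final boolean ownsAeronClient)
    {
        final List<String> actions = new ArrayList<>();
        if (!ownsAeronClient)
        {
            // Guarded clean-up: only reached when the client is shared.
            actions.add("close archive control session");
        }
        // Unguarded step that runs in both cases (assumed for illustration).
        actions.add("close context");
        return actions;
    }

    public static void main(final String[] args)
    {
        System.out.println("shared client: " + onClose(false));
        System.out.println("owned client:  " + onClose(true));
    }
}
```

In this model the owned-client case never performs the guarded step, which is the situation the ClusteredMediaDriver is suspected of always being in.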