10000 [Java] Make AeronCluster track connection status. by wojciech-adaptive · Pull Request #1791 · aeron-io/aeron · GitHub
[go: up one dir, main page]
More Web Proxy on the site http://driver.im/
Skip to content

[Java] Make AeronCluster track connection status. #1791

New issue

Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.

By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.

Already on GitHub? Sign in to your account

Conversation

wojciech-adaptive
Copy link
Contributor

Currently, AeronCluster does very little to track its connection status. All applications that want to react to, for example, losing network connectivity to the leader or the cluster shutting down, need to implement custom logic. This change provides such logic out of the box.

If connection is lost for more than newLeaderTimeoutNs, or a message can't be sent to a new leader for newLeaderConnectionTimeoutNs, the client will close itself.

For applications which call pollEgress/controlledPollEgress and check isClosed afterward no changes are needed. For applications which use the ingress publication or the egress subscription directly, two methods were added: trackIngressPublicationResult and pollStateChanges.

For the new leader timeout we currently use twice the default leader heartbeat timeout, in future we would ideally make the cluster announce the actual value.

Changed the semantics of pollEgress/controlledPollEgress return value, hopefully not a breaking change.

The state diagram is:

stateDiagram-v2
    [*] --> CONNECTED
    CONNECTED --> AWAIT_NEW_LEADER : ingress publication NOT_CONNECTED
    CONNECTED --> AWAIT_NEW_LEADER : ingress publication CLOSED
    CONNECTED --> PENDING_CLOSE : ingress publication MAX_POSITION_EXCEEDED
    CONNECTED --> AWAIT_NEW_LEADER : egress image closed
    CONNECTED --> AWAIT_NEW_LEADER_CONNECTION : NewLeaderEvent
    CONNECTED --> PENDING_CLOSE : SessionEvent CLOSED
    CONNECTED --> CLOSED : close
    AWAIT_NEW_LEADER --> AWAIT_NEW_LEADER_CONNECTION : NewLeaderEvent
    AWAIT_NEW_LEADER --> CLOSED : timeout
    AWAIT_NEW_LEADER --> CLOSED : close
    AWAIT_NEW_LEADER_CONNECTION --> CONNECTED : ingress offer successful
    AWAIT_NEW_LEADER_CONNECTION --> CLOSED : timeout
    AWAIT_NEW_LEADER_CONNECTION --> CLOSED : close
    PENDING_CLOSE --> CLOSED
    CLOSED --> [*]

@wojciech-adaptive wojciech-adaptive marked this pull request as ready for review May 2, 2025 15:32
* @return this for a fluent API.
*/
public Context newLeaderTimeoutNs(final long newLeaderTimeoutNs)
{
Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

How much work would it be to embed the leader heartbeat timeout into the detail string of the login response or add it as an extra field (SessionEvent in cluster codec). It would be nice not to require the extra configuration. (Not a stopper for this fix, but would be good if it would fit into the next release).

Copy link
Contributor Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

I'll have a look. For the detail string do you mean something like leaderHeartbeatTimeoutNs=5000|foo=bar? Having a field would be nicer, but in this case the codec is shared with other events, so for them it would be a waste of space, especially if we decide to add more information in future.

Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

I think my preference would be for an additional field. The only place this is used is session open/close so the redundant information wouldn't be too much overhead. Having thought about it more I think that a separate field is better than encoding it in a string. SBE has support for adding fields message versioning and compatibility where as arbitrary string encoding does not.

SBE should handle this (due to block length being encoded on the message) so older clients should safely ignore this field (but we can test for that).

@mikeb01
Copy link
Contributor
mikeb01 commented May 4, 2025

Overall looks good, right direction.

Currently, AeronCluster does very little to track its connection status.
All applications that want to react to, for example, losing network
connectivity to the leader or the cluster shutting down, need to implement
custom logic. This change provides such logic out of the box.

If connection is lost for more than newLeaderTimeoutNs, or a message can't
be sent to a new leader for messageTimeoutNs, the client will
close itself.

For applications which call pollEgress/controlledPollEgress and check
isClosed afterward no changes are needed. For applications which use
the ingress publication or the egress subscription directly, two methods
were added: trackIngressPublicationResult and pollStateChanges.

For the new leader timeout we currently use twice the default leader heartbeat
timeout, in future we would ideally make the cluster announce the actual value.

Changed the semantics of pollEgress/controlledPollEgress return value,
hopefully not a breaking change.

The state diagram is:

stateDiagram-v2
    [*] --> CONNECTED
    CONNECTED --> AWAIT_NEW_LEADER : ingress publication NOT_CONNECTED
    CONNECTED --> AWAIT_NEW_LEADER : ingress publication CLOSED
    CONNECTED --> PENDING_CLOSE : ingress publication MAX_POSITION_EXCEEDED
    CONNECTED --> AWAIT_NEW_LEADER : egress image closed
    CONNECTED --> AWAIT_NEW_LEADER_CONNECTION : NewLeaderEvent
    CONNECTED --> PENDING_CLOSE : SessionEvent CLOSED
    CONNECTED --> CLOSED : close
    AWAIT_NEW_LEADER --> AWAIT_NEW_LEADER_CONNECTION : NewLeaderEvent
    AWAIT_NEW_LEADER --> CLOSED : timeout
    AWAIT_NEW_LEADER --> CLOSED : close
    AWAIT_NEW_LEADER_CONNECTION --> CONNECTED : ingress offer successful
    AWAIT_NEW_LEADER_CONNECTION --> CLOSED : timeout
    AWAIT_NEW_LEADER_CONNECTION --> CLOSED : close
    PENDING_CLOSE --> CLOSED
    CLOSED --> [*]
@wojciech-adaptive wojciech-adaptive force-pushed the cluster-client-connection-tracking branch from c6ed271 to a9d85ce Compare May 19, 2025 15:43
@mikeb01
Copy link
Contributor
mikeb01 commented May 20, 2025

Only the newLeaderTimeoutNs() to resolve. I think it would be worth finishing that before release. However, this PR could be merged and handle that change as a separate PR.

@wojciech-adaptive wojciech-adaptive merged commit d11fb36 into aeron-io:master May 20, 2025
45 checks passed
@wojciech-adaptive wojciech-adaptive deleted the cluster-client-connection-tracking branch May 20, 2025 09:37
Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment
Labels
None yet
Projects
None yet
Development

Successfully merging this pull request may close these issues.

2 participants
0