Feature request: Add metric to track the number of ungraceful connection resets

We lack a metric that counts connection resets, i.e., ungraceful shutdowns. We log a client_session_end event, but we neither log the reason nor expose a metric indicating that a connection was shut down without the client successfully sending an X (Terminate) message. In the PostgreSQL ecosystem, basically any backend connection reset or connection close is worth investigating. In a well-behaved environment this rate would be 0; anything higher indicates either a specific failure (e.g., a node crashed) or, if the error rate is sustained, something systemic.

One of the key takeaways from a recent, very difficult and time-consuming connection reset investigation is that this kind of information would have made problem determination much faster. Because this detail was missing, there was general disbelief that what the traces showed was really what was happening, so we kept looking for an alternate explanation. A metric like this would be a strong signal to turn over the tcpdump rock sooner rather than later. We looked in the crdb logs but didn't see anything obvious there. Annotating client_session_end to indicate an ungraceful shutdown would have helped, too.
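
To make the request concrete, here is a rough sketch of the behavior being asked for. It is not how crdb's pgwire layer is actually structured: `drainSession`, `classify`, `closeReason`, and `ungracefulCloses` are hypothetical names, and the counter is a plain atomic rather than a registered metric. It assumes the connection is already past startup, so every frontend message is a 1-byte type followed by a 4-byte big-endian length that includes itself. A session that ends on a read error before an X (Terminate) message is counted as ungraceful, and the returned reason is the kind of value that could annotate client_session_end.

```go
package main

import (
	"bytes"
	"encoding/binary"
	"fmt"
	"io"
	"sync/atomic"
)

// closeReason is a hypothetical label that could be attached to client_session_end.
type closeReason string

const (
	closeGraceful   closeReason = "terminate"
	closeUngraceful closeReason = "ungraceful"
)

// ungracefulCloses is a hypothetical counter. A real server would register
// this with its metrics registry instead of using a package-level variable.
var ungracefulCloses atomic.Int64

// drainSession reads post-startup frontend messages until it sees
// Terminate ('X') or the connection fails. Any read error or malformed
// length before Terminate is counted as an ungraceful close.
func drainSession(conn io.Reader) closeReason {
	var hdr [5]byte // 1-byte message type + 4-byte length
	for {
		if _, err := io.ReadFull(conn, hdr[:]); err != nil {
			ungracefulCloses.Add(1)
			return closeUngraceful
		}
		if hdr[0] == 'X' {
			return closeGraceful
		}
		// The length field counts itself, so the payload is length-4 bytes.
		payload := int64(binary.BigEndian.Uint32(hdr[1:])) - 4
		if payload < 0 {
			ungracefulCloses.Add(1)
			return closeUngraceful
		}
		if _, err := io.CopyN(io.Discard, conn, payload); err != nil {
			ungracefulCloses.Add(1)
			return closeUngraceful
		}
	}
}

// classify runs drainSession over an in-memory byte stream, for illustration.
func classify(raw []byte) closeReason {
	return drainSession(bytes.NewReader(raw))
}

func main() {
	// An empty Query message followed by Terminate.
	graceful := []byte{'Q', 0, 0, 0, 5, 0, 'X', 0, 0, 0, 4}
	// The same Query, but the stream ends without Terminate.
	reset := []byte{'Q', 0, 0, 0, 5, 0}

	fmt.Println(classify(graceful))
	fmt.Println(classify(reset))
	fmt.Println("ungraceful closes:", ungracefulCloses.Load())
}
```

In a well-behaved environment the counter stays at 0, so alerting on any increase would be a reasonable starting point.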

Jira issue: CRDB-30246
