Description
We lack a metric that counts connection resets, i.e., ungraceful shutdowns. We log a client_session_end event, but we neither log the reason nor expose a metric indicating that a connection was shut down without the client successfully sending an X (Terminate) message. In the PostgreSQL ecosystem, basically any backend connection reset or connection close is worth investigating. In a well-behaved environment this rate would be 0; anything higher indicates either a specific failure (e.g., a node crashed) or, if the error rate is sustained, something systemic.
One of the key takeaways from a recent, very difficult and time-consuming connection reset investigation is that this kind of information would have enabled much faster problem determination. Because this detail was missing, there was general disbelief that the traces showed what they appeared to show, so we kept looking for an alternate explanation. A metric like this would be a strong signal to turn over the tcpdump rock sooner rather than later. We looked in the crdb logs but didn't see anything obvious there. An annotation on client_session_end indicating an ungraceful shutdown would have helped, too.
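To make the request concrete, here is a minimal sketch of the classification being asked for. It is not crdb code. It assumes we have the bytes a client sent after the startup handshake, framed as regular PostgreSQL frontend messages (a 1-byte type followed by a 4-byte big-endian length that includes itself). The names frame, classify_session_end, and record_session_end, the "reason" field, and the "graceful"/"ungraceful" labels are all hypothetical and chosen only for illustration.

```python
import struct
from collections import Counter
from typing import Dict, Iterator, Tuple

TERMINATE = b"X"  # type byte of the pgwire frontend Terminate message


def frame(msg_type: bytes, payload: bytes = b"") -> bytes:
    """Build one frontend message: type byte + int32 length + payload."""
    return msg_type + struct.pack("!I", len(payload) + 4) + payload


def iter_messages(stream: bytes) -> Iterator[Tuple[bytes, bytes]]:
    """Yield (type, payload) for each complete message; stop at a truncated one."""
    pos = 0
    while pos + 5 <= len(stream):
        msg_type = stream[pos:pos + 1]
        (length,) = struct.unpack("!I", stream[pos + 1:pos + 5])
        end = pos + 1 + length
        if length < 4 or end > len(stream):
            return  # malformed or cut off mid-message
        yield msg_type, stream[pos + 5:end]
        pos = end


def classify_session_end(stream: bytes) -> str:
    """'graceful' only if the last complete message was Terminate."""
    last_type = None
    for msg_type, _payload in iter_messages(stream):
        last_type = msg_type
    return "graceful" if last_type == TERMINATE else "ungraceful"


def record_session_end(counts: Counter, stream: bytes) -> Dict[str, str]:
    """Bump the per-reason counter and return a hypothetical annotated event."""
    reason = classify_session_end(stream)
    counts[reason] += 1
    return {"event": "client_session_end", "reason": reason}


if __name__ == "__main__":
    counts = Counter()
    query = frame(b"Q", b"SELECT 1\x00")
    print(record_session_end(counts, query + frame(b"X")))  # client sent Terminate
    print(record_session_end(counts, query))                # connection just went away
    print(dict(counts))
```

In a healthy environment the "ungraceful" count in a sketch like this would stay at 0, which is what would make a non-zero value a useful alerting signal.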
Jira issue: CRDB-30246