Description
Describe the problem
To reduce the overhead of unary Batch RPC calls, BatchRequests can be sent over BatchStream streaming RPC calls. The client-side readers of these streams are goroutines, and the streams are stored in a pool. When a client attempts to send a batch request, we take a stream from the pool, send the request over the stream, return the result, and return the stream to the pool.
During a recent large-scale test, we turned off this pool as part of a larger investigation related to goroutine scheduling latency and were surprised by the reduction in the number of goroutines. For instance, on a busy server in this cluster, we saw over 2000 streams in the pool, with only a small percentage of them actively in use.
This raises a number of questions:
- Does the number of goroutines align with the request rate in this cluster, or could this be an artifact of some problem with our queue management strategy for a large cluster?
- Would there be a benefit to capping the total size of this pool?
- Would there be a benefit in making `defaultPooledStreamIdleTimeout` configurable for some workloads?
- Should we have metrics around the use of this pool?
Jira issue: CRDB-51498
Epic CRDB-50448