server: investigate impact of unbounded stream pool

Describe the problem

To reduce the overhead of unary Batch RPC calls, BatchRequests can be sent over BatchStream streaming RPC calls. The client-side readers of these streams are goroutines that are stored in a pool. When a client attempts to send a batch request, we take a stream from the pool, send the request over the stream, return the result, and return the stream to the pool.
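For readers less familiar with the pattern, here is a minimal sketch of the take/send/return flow described above. It is not the actual implementation: `stream`, `streamPool`, `sendBatch`, and `idleCount` are hypothetical names, and the real pool also runs a reader goroutine per stream, which this sketch omits.

```go
package main

import "sync"

// stream is a hypothetical stand-in for a client-side BatchStream.
// The real type lives in the RPC layer and is not shown in this issue.
type stream struct{ id int }

// send pretends to issue a request over the stream and return its response.
func (s *stream) send(req string) string { return "resp:" + req }

// streamPool is an unbounded pool: nothing limits how many idle streams
// it holds (assumption made for illustration).
type streamPool struct {
	mu     sync.Mutex
	idle   []*stream
	nextID int
}

// get returns an idle stream if one exists, otherwise opens a new one.
func (p *streamPool) get() *stream {
	p.mu.Lock()
	defer p.mu.Unlock()
	if n := len(p.idle); n > 0 {
		s := p.idle[n-1]
		p.idle = p.idle[:n-1]
		return s
	}
	p.nextID++
	return &stream{id: p.nextID}
}

// put returns a stream to the pool for reuse.
func (p *streamPool) put(s *stream) {
	p.mu.Lock()
	defer p.mu.Unlock()
	p.idle = append(p.idle, s)
}

// sendBatch takes a stream, sends the request, and returns the stream.
func (p *streamPool) sendBatch(req string) string {
	s := p.get()
	defer p.put(s)
	return s.send(req)
}

// idleCount reports how many streams are currently parked in the pool.
func (p *streamPool) idleCount() int {
	p.mu.Lock()
	defer p.mu.Unlock()
	return len(p.idle)
}
```

In a sketch like this, sequential calls keep reusing one stream, while N overlapping calls leave N streams parked afterwards. The idle count therefore tracks peak concurrency rather than the steady-state request rate, at least until something evicts idle streams.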

During a recent large-scale test, we turned off this pool as part of a larger investigation related to goroutine scheduling latency and were surprised by the reduction in the number of goroutines. For instance, on a busy server in this cluster, we saw over 2000 streams in the pool, with only a small percentage of them being actively used.

This raises a number of questions:

Does the number of goroutines align with the request rate in this cluster, or could this be an artifact of some problem with our queue management strategy for a large cluster?
Would there be a benefit to capping the total size of this pool?
Would there be a benefit in making defaultPooledStreamIdleTimeout configurable for some workloads?
Should we have metrics around the use of this pool?

Jira issue: CRDB-51498

Epic CRDB-50448
