tower-balance: Observations from testing

We've recently been running tests on the Linkerd proxy that exercise the load
balancer in larger clusters (of 30+ endpoints). At the same time, I've been
exploring the existing endpoint-weighting scheme.

In doing so, I've realized that the balancer is currently O(n), though it is
intended to be effectively O(1). Furthermore, the existing weighting scheme
is complex to instrument in practice, and is of questionable value in its
current form.

All of this leads me to believe that we should drastically simplify the
balancer:

- Do not attempt to unify (Weighted) P2C and Round-Robin under one
  implementation. Each strategy benefits from being able to use its own data
  structure. For now, I propose that we simply drop the Round-Robin logic. It
  can easily be added later if it's desirable.
- The balancer cannot be responsible for driving the readiness of all of its
  constituent services. The P2C balancer is intended to sample two
  endpoints. In the current implementation, we always poll all unready
  inner services, which leads to poor behavior as the balancer scales.
  Something like #283 (Spawn ready: drives a service's readiness on an
  executor) is necessary to relax the balancer's readiness-polling
  guarantees.
- We initially decided that all endpoint-service errors should be treated as
  fatal to the balancer. It now seems more appropriate to let the balancer
  handle these failures by dropping the failed service from the balancer.
- The Load and Instrument traits are not balancer-specific and should
  become more generally useful abstractions in a dedicated crate.
- The balancer should expose a Layer that layers over inner layers that
  produce Discover-typed results.
- The Pool implementation is factored inconveniently, especially in light of
  how tower-layer has evolved: it requires direct access to a balancer
  implementation, accessing its discover field directly. I think that it
  should probably be implemented as a Discover proxy-type that is constructed
  with a Watch<Load>. The pool doesn't rely on any specific balancer behavior,
  so it shouldn't dictate use of a specific balancer implementation.
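
To make the constant-time selection and the "drop the failed service" behavior proposed above concrete, here is a minimal sketch. It is not the tower-balance implementation: Endpoint, XorShift, p2c_pick, and evict are hypothetical names invented for illustration, load is a plain integer standing in for a Load metric, and readiness polling is left out entirely.

```rust
/// Hypothetical stand-in for an endpoint service. `load` plays the role of a
/// load metric such as the number of pending requests.
#[derive(Debug, Clone, PartialEq)]
pub struct Endpoint {
    pub name: &'static str,
    pub load: u32,
}

/// Tiny xorshift64 generator so the sketch needs no external crates.
pub struct XorShift(u64);

impl XorShift {
    pub fn new(seed: u64) -> Self {
        // A zero state would stay at zero forever, so substitute a constant.
        XorShift(if seed == 0 { 0x9E37_79B9_7F4A_7C15 } else { seed })
    }

    /// Returns a pseudo-random index in `0..n`. `n` must be non-zero.
    pub fn below(&mut self, n: usize) -> usize {
        self.0 ^= self.0 << 13;
        self.0 ^= self.0 >> 7;
        self.0 ^= self.0 << 17;
        (self.0 % n as u64) as usize
    }
}

/// Power of two choices: sample two distinct endpoints and return the index
/// of the less loaded one. Only two endpoints are inspected per call, no
/// matter how many the balancer holds.
pub fn p2c_pick(endpoints: &[Endpoint], rng: &mut XorShift) -> Option<usize> {
    match endpoints.len() {
        0 => None,
        1 => Some(0),
        n => {
            let a = rng.below(n);
            // Draw from the remaining n - 1 slots, then skip over `a` so the
            // two samples are always distinct.
            let mut b = rng.below(n - 1);
            if b >= a {
                b += 1;
            }
            if endpoints[b].load < endpoints[a].load {
                Some(b)
            } else {
                Some(a)
            }
        }
    }
}

/// Drop a failed endpoint from the set instead of failing the whole balancer.
/// `swap_remove` is O(1) because P2C does not care about ordering.
pub fn evict(endpoints: &mut Vec<Endpoint>, idx: usize) -> Endpoint {
    endpoints.swap_remove(idx)
}
```

A flat Vec is enough here precisely because P2C only needs random indexing and unordered removal, which is one reason a dedicated data structure per strategy is attractive. A Round-Robin balancer would want an ordered structure instead.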

I have a series of changes that I would like to pursue to this end:

- #283 (Spawn ready: drives a service's readiness on an executor) enables
  endpoint stacks to be driven to readiness without being actively polled,
  i.e. by the balancer.
- #285 (Extract tower-load from tower-balance) extracts the Load & Instrument
  traits into a dedicated crate and removes the current Weight implementation
  (which is not really what we want).
- I have another (followup) branch that removes the choose trait/module,
  leaving only a P2CBalance implementation.
- I could use some help figuring out the path forward for Pool.
