Description
I found two settings useful when conducting experiments on the system's fault tolerance and stability:
- Control the worker failure rate using an argument to the manager. For example, we may want to kill workers periodically, say every 60 seconds. At each check, we compute the actual failure rate since the beginning of the run, counting both unexpected and intentional failures, compare it to the target, and decide how many workers to kill to maintain the desired failure rate.
- Control the maximum number of workers we want to maintain. We already have the argument `wait-for-workers`, which delays task scheduling until the desired number of workers are connected. But in some scenarios we may also want to avoid using more workers than a set limit.
It would be great if the manager could natively support these two arguments, as that would make experiments more convenient.
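
To make the request concrete, here is a rough sketch of the two calculations in Python. It is not the manager's existing code or API: the function names `workers_to_kill` and `excess_workers`, their parameters, and the choice to express the failure rate as failures per minute are all assumptions made for illustration.

```python
import math


def workers_to_kill(target_per_minute, elapsed_seconds, failures_so_far, connected_workers):
    """Hypothetical helper: how many workers to kill at this periodic check.

    Assumes the failure rate is measured in failures per minute and that
    failures_so_far counts every failure since the start of the run,
    whether unexpected or intentional.
    """
    if target_per_minute <= 0 or elapsed_seconds <= 0:
        return 0
    # Failures we should have seen by now if the target rate had been met.
    expected = math.floor(target_per_minute * elapsed_seconds / 60.0)
    deficit = expected - failures_so_far
    # Never return a negative count, and never more than the workers connected.
    return max(0, min(deficit, connected_workers))


def excess_workers(max_workers, connected_workers):
    """Hypothetical helper: how many workers exceed the configured cap."""
    return max(0, connected_workers - max_workers)


if __name__ == "__main__":
    # Target of 2 failures/minute, 300 s into the run, 7 failures so far, 5 workers up.
    print(workers_to_kill(2.0, 300, 7, 5))
    # Cap of 8 workers with 11 currently connected.
    print(excess_workers(8, 11))
```

If unexpected failures already push the actual rate above the target, the first helper returns 0, so the manager would simply skip the intentional kills for that interval.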