[Data] Fix ActorPool autoscaler to properly scale up by alexeykudinkin · Pull Request #53983 · ray-project/ray · GitHub

[Data] Fix ActorPool autoscaler to properly scale up #53983


Merged
merged 71 commits into ray-project:master on Jun 26, 2025

Conversation

alexeykudinkin
Contributor
@alexeykudinkin alexeykudinkin commented Jun 20, 2025

Why are these changes needed?

Currently, the ActorPool autoscaler doesn't actually scale up until all of its num_actors * max_tasks_in_flight task slots are full. This is a seriously limiting factor:

  • It requires the AP to accumulate a backlog of 4x its # of actors before autoscaling even starts
  • 3 out of 4 tasks sit in the queue, meaning their execution won't start until previous ones complete

Changes

  1. Revisited the Actor Pool autoscaling protocol to base it on utilization, defined as # of submitted tasks / (number of running actors * max_concurrency); see the sketch after this list
  2. Added AutoscalingConfig to make all autoscaling configuration explicitly available inside DataContext
  3. Set the default upscaling threshold to 200% (currently set to 400%)
  4. Updated tests
  5. Added docs
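
A minimal sketch of the new utilization-based protocol, assuming hypothetical names and the defaults described in this PR (not the exact Ray Data implementation):

UPSCALING_THRESHOLD = 2.0    # 200% per this PR (previously effectively 400%)
DOWNSCALING_THRESHOLD = 0.5  # 50%, per DEFAULT_ACTOR_POOL_UTIL_DOWNSCALING_THRESHOLD below

def actor_pool_util(num_submitted_tasks: int,
                    num_running_actors: int,
                    max_concurrency: int) -> float:
    """util = # of submitted tasks / (number of running actors * max_concurrency)."""
    if num_running_actors == 0:
        return 0.0
    return num_submitted_tasks / (num_running_actors * max_concurrency)

def decide(util: float) -> str:
    if util >= UPSCALING_THRESHOLD:
        return "upscale"
    if util <= DOWNSCALING_THRESHOLD:
        return "downscale"
    return "no-op"

# e.g. 16 submitted tasks, 2 running actors, max_concurrency=4 -> util 2.0 -> "upscale"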

Related issue number

Checks

  • I've signed off every commit (by using the -s flag, i.e., git commit -s) in this PR.
  • I've run scripts/format.sh to lint the changes in this PR.
  • I've included any doc changes needed for https://docs.ray.io/en/master/.
    • I've added any new APIs to the API Reference. For example, if I added a
      method in Tune, I've added it in doc/source/tune/api/ under the
      corresponding .rst file.
  • I've made sure the tests are passing. Note that there might be a few flaky tests, see the recent failures at https://flakey-tests.ray.io/
  • Testing Strategy
    • Unit tests
    • Release tests
    • This PR is not tested :(

@alexeykudinkin alexeykudinkin requested a review from a team as a code owner June 20, 2025 23:33
@alexeykudinkin alexeykudinkin added the go add ONLY when ready to merge, run all tests label Jun 24, 2025
@alexeykudinkin alexeykudinkin requested a review from a team as a code owner June 25, 2025 02:31
Tidying up

Signed-off-by: Alexey Kudinkin <ak@anyscale.com>

Ignore free slots when autoscaling

Signed-off-by: Alexey Kudinkin <ak@anyscale.com>
Contributor
@raulchen raulchen left a comment

overall LGTM

from dataclasses import dataclass
from typing import Optional

from ray.data._internal.execution.interfaces.execution_options import ExecutionResources
from ray.util.annotations import DeveloperAPI


@dataclass
class ScalingConfig:
Contributor

I slightly prefer the old name ScalingAction. Also consider naming it ActorPoolScalingAction to make it more explicit. And we may also add a ClusterScalingAction.

Contributor Author
@alexeykudinkin alexeykudinkin Jun 26, 2025

Yeah, I've realized that this is rather an intent/request; renamed it to APAutoscalingRequest.

def current_in_flight_tasks(self) -> int:
"""Number of current in-flight tasks."""
def num_tasks_in_flight(self) -> int:
"""Number of current in-flight tasks (running + pending tasks)."""
Contributor

Suggested change
"""Number of current in-flight tasks (running + pending tasks)."""
"""Number of current in-flight tasks (tasks that have been submitted to the actor pool)."""

# For every running task we'd allow 1 more task to be enqueued
compute_strategy.max_tasks_in_flight_per_actor
or data_context.max_tasks_in_flight_per_actor
or max_actor_concurrency * 2
Contributor

nit, define a constant for the magic number 2.

Contributor Author

Captured it in the comment above; not really sure what value the constant would add.

Member

+1 to defining a constant. It's easier to understand if you're skimming the code and not reading all of the comments.
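
A sketch of the suggested refactor; the constant name and the max_tasks_in_flight variable are hypothetical, mirroring the fragment quoted above:

# For every running task, allow one more task to be enqueued.
DEFAULT_ENQUEUED_TASKS_PER_RUNNING_TASK = 2

max_tasks_in_flight = (
    compute_strategy.max_tasks_in_flight_per_actor
    or data_context.max_tasks_in_flight_per_actor
    or max_actor_concurrency * DEFAULT_ENQUEUED_TASKS_PER_RUNNING_TASK
)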

# Make sure after scaling up the actor pool size won't exceed its max size
target_num_actors = min(
    config.delta, max(self.max_size() - self.current_size(), 0)
)

logger.info(
Contributor

This can be too verbose; let's use debug level. Same for the scale-down.

Contributor Author

SG

# Make sure after scaling down actor pool size won't fall below its
# min size
target_num_actors = min(
    abs(config.delta), max(self.current_size() - self.min_size(), 0)
)
Contributor

We can allow the autoscaling policy to downscale below the min, i.e., when inputs are complete.

Contributor Author

Fair point. I was actually thinking to just shut down the actor pool once processing is done, but now that I'm thinking about it, it's better to incrementally release workers as soon as they can be released.
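
A sketch of that idea; the _inputs_done flag and helper names are hypothetical, not from this PR:

def _downscale_floor(self) -> int:
    # Once all inputs are consumed, the pool may shrink all the way to 0;
    # otherwise, respect the configured minimum size.
    return 0 if self._inputs_done else self.min_size()

def _max_downscale_delta(self) -> int:
    return max(self.current_size() - self._downscale_floor(), 0)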

@@ -338,6 +380,11 @@ class DataContext:
call is made with a S3 URI.
wait_for_min_actors_s: The default time to wait for minimum requested
actors to start before raising a timeout, in seconds.
max_tasks_in_flight_per_actor: Max number of tasks that could be enqueued
into the individual Actor's queue. Note that running tasks are not counted
Contributor

Suggested change
into the individual Actor's queue. Note that running tasks are not counted
into the individual Actor's local queue. Note that running tasks are not counted

@@ -338,6 +380,11 @@ class DataContext:
call is made with a S3 URI.
wait_for_min_actors_s: The default time to wait for minimum requested
actors to start before raising a timeout, in seconds.
max_tasks_in_flight_per_actor: Max number of tasks that could be enqueued
into the individual Actor's queue. Note that running tasks are not counted
towards this limit. This setting allows Actors to start pulling and buffering
Contributor

Note that running tasks are not counted towards this limit.

This is the total number, so running tasks are also counted, right?

Contributor Author

Yeah, this is a remnant of when this parameter didn't overlap with max_concurrency. Will update.

into the individual Actor's queue. Note that running tasks are not counted
towards this limit. This setting allows Actors to start pulling and buffering
blocks for the tasks waiting in the queue to make sure tasks could start
executing immediately once taken from the queue.
Contributor

I find it a little bit hard to understand for regular users.
We can just say this setting allows Actors to pipeline task execution with block transfer.

Contributor Author

Ok, let me take a stab at rephrasing
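
For reference, a usage sketch of the new setting (this PR adds max_tasks_in_flight_per_actor to DataContext; the value 4 is just illustrative):

import ray

ctx = ray.data.DataContext.get_current()
# Let each actor enqueue up to 4 tasks so it can pre-fetch their input
# blocks, pipelining block transfer with task execution.
ctx.max_tasks_in_flight_per_actor = 4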

 - Make validation whether request can be applied private (removed dup)
 - Log debugging warnings internally

Signed-off-by: Alexey Kudinkin <ak@anyscale.com>
…aled down to 0 (upon completion)

Signed-off-by: Alexey Kudinkin <ak@anyscale.com>

return num_actors
return None
Contributor

nit, why return None instead of 0?

Contributor Author

Just to more clearly signal that no action was taken (rather than attempted and failed, e.g. in the case of downscaling).

DEFAULT_ACTOR_POOL_UTIL_DOWNSCALING_THRESHOLD: float = env_float(
"RAY_DATA_DEFAULT_ACTOR_POOL_UTIL_DOWNSCALING_THRESHOLD",
0.5,
)
Contributor

This might be too low; we can keep an eye on it and revisit later.
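
For reference, a sketch of overriding this default via the environment variable quoted above, assuming (as with other env-derived defaults) that it's read when Ray Data is first imported:

import os

# Downscale more eagerly: shrink the pool once utilization drops below 75%
# instead of the default 50%.
os.environ["RAY_DATA_DEFAULT_ACTOR_POOL_UTIL_DOWNSCALING_THRESHOLD"] = "0.75"

import ray.data  # must come after the env var is set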

Member
@bveeramani bveeramani left a comment

LGTM.

@alexeykudinkin I'm running the release tests here: https://buildkite.com/ray-project/release/builds/46720#.

Would you mind skimming it and making sure there's no major regression before merging?


return num_actors
return None
Member

Return 0? It's not obvious when this method returns 0 vs. None.

Suggested change
return None
return 0

Member

Also, maybe document the expected return value here or in the base class? I think it's implicit right now.

Contributor Author

Answered this comment from Hao above (None signals that no action was taken).
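
A sketch of the documented contract being requested; the method name and wording are illustrative:

from typing import Optional

def _apply_upscaling(self, delta: int) -> Optional[int]:
    """Attempt to add `delta` actors to the pool.

    Returns:
        The number of actors actually added, or None if no scaling action
        was taken at all (as opposed to attempted but clamped to zero).
    """
    ...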

Comment on lines +106 to 108
def get_pool_util(self) -> float:
"""Calculate the utilization of the given actor pool."""
...
Member

Should this be the responsibility of the autoscaler? You can compute this just using public actor pool methods, and I think the actor pool interface is already really bloated

Contributor Author

Yeah, I went back and forth on this. Ultimately I decided that the scope of the AP should include defining what utilization actually means (in terms of # of actors used, concurrency slots, or whatnot), as these are intrinsic actor-pool details. The autoscaler, on the other hand, operates at a higher level (e.g., if util > threshold then upscale).
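
A sketch of how that split might look on the pool side, with hypothetical accessor names:

class ActorPool:
    def get_pool_util(self) -> float:
        """Utilization: tasks submitted to this pool per concurrency slot
        across its currently running actors."""
        running = self.num_running_actors()  # hypothetical accessor
        if running == 0:
            return 0.0
        return self.num_tasks_in_flight() / (
            running * self.max_actor_concurrency()  # hypothetical accessor
        )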


Signed-off-by: Alexey Kudinkin <ak@anyscale.com>
@alexeykudinkin alexeykudinkin merged commit 5525301 into ray-project:master Jun 26, 2025
4 of 5 checks passed
minerharry pushed a commit to minerharry/ray that referenced this pull request Jun 27, 2025
Labels
go add ONLY when ready to merge, run all tests
5 participants