[data] fix repartitioning empty datasets #54107

raulchen · 2025-06-25T21:04:23Z

Fix the following error when repartitioning an empty dataset:

    first_block_schema = reduce_metadata_schema[0].schema
    IndexError: list index out of range

Signed-off-by: Hao Chen <chenh1024@gmail.com>

Copilot

Pull Request Overview

This PR fixes the repartitioning behavior for empty datasets and refines the handling of block schemas during repartitioning.

Added a new test to ensure empty datasets are repartitioned correctly with both shuffling options.
Updated the block schema handling in the repartition task scheduler, including a default fallback for empty results.
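
The failure mode and the kind of fallback described above can be sketched in plain Python. This is a minimal illustration, not the code from the PR: `FakeBlockMetadata`, `first_schema_unguarded`, and `first_schema_guarded` are hypothetical names, and the `None` default is an assumption about what a reasonable fallback might look like.

```python
from dataclasses import dataclass
from typing import Any, List, Optional


@dataclass
class FakeBlockMetadata:
    """Hypothetical stand-in for per-block metadata; not Ray's actual class."""

    num_rows: int
    schema: Optional[Any]


def first_schema_unguarded(metadata: List[FakeBlockMetadata]) -> Optional[Any]:
    # Mirrors the failing pattern: indexing [0] raises IndexError
    # when the list is empty, i.e. when no blocks were produced.
    return metadata[0].schema


def first_schema_guarded(
    metadata: List[FakeBlockMetadata], default: Optional[Any] = None
) -> Optional[Any]:
    # Assumed fallback: return a default schema when there are no blocks.
    if not metadata:
        return default
    return metadata[0].schema
```

With an empty list, the unguarded helper raises `IndexError: list index out of range`, while the guarded helper returns the default.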

Reviewed Changes

Copilot reviewed 2 out of 2 changed files in this pull request and generated 1 comment.

| File | Description |
| --- | --- |
| python/ray/data/tests/test_repartition_e2e.py | Added test case for empty dataset repartitioning with both shuffled and non-shuffled options. |
| python/ray/data/_internal/planner/exchange/split_repartition_task_scheduler.py | Refined block schema detection logic and improved error messaging for unknown block schemas. |

Comments suppressed due to low confidence (1)

python/ray/data/tests/test_repartition_e2e.py:210

[nitpick] Consider adding a docstring to this test function to clarify its purpose and expected behavior when repartitioning an empty dataset.

def test_repartition_empty_datasets(ray_start_regular_shared_2_cpus, shuffle):

python/ray/data/_internal/planner/exchange/split_repartition_task_scheduler.py

bveeramani · 2025-06-26T18:35:28Z

python/ray/data/tests/test_repartition_e2e.py

+    num_partitions = 5
+    ds_empty = ray.data.range(100).filter(lambda row: False)


Nit: Consider using the simplest inputs to make the intent clearer

Suggested change

-    num_partitions = 5
-    ds_empty = ray.data.range(100).filter(lambda row: False)
+    num_partitions = 1
+    ds_empty = ray.data.range(1).filter(lambda row: False)

[data] fix repartitioning empty datasets

8d13efc

Signed-off-by: Hao Chen <chenh1024@gmail.com>

Copilot AI review requested due to automatic review settings June 25, 2025 21:04

raulchen requested a review from a team as a code owner June 25, 2025 21:04

Copilot AI reviewed Jun 25, 2025

raulchen added the "go" label (add ONLY when ready to merge, run all tests) Jun 26, 2025

raulchen enabled auto-merge (squash) June 26, 2025 00:33

bveeramani approved these changes Jun 26, 2025

raulchen merged commit 7f8510e into ray-project:master Jun 26, 2025
6 checks passed
