[data] fix repartitioning empty datasets by raulchen · Pull Request #54107 · ray-project/ray · GitHub

[data] fix repartitioning empty datasets #54107

Merged
raulchen merged 1 commit into ray-project:master on Jun 26, 2025

@raulchen (Contributor) commented Jun 25, 2025

Fix the following error when repartitioning an empty dataset:

    first_block_schema = reduce_metadata_schema[0].schema
IndexError: list index out of range

Signed-off-by: Hao Chen <chenh1024@gmail.com>
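
The traceback above comes from indexing `[0]` into a list that is empty when the dataset has no blocks. A minimal sketch of that failure mode and one possible guard follows. It is not the actual patch: `BlockMeta` and `first_block_schema` are hypothetical names used only for illustration, and returning `None` as the fallback is an assumption.

```python
from dataclasses import dataclass
from typing import Any, List, Optional


@dataclass
class BlockMeta:
    """Hypothetical stand-in for per-block metadata; not Ray's real class."""

    schema: Optional[Any]


def first_block_schema(reduce_metadata_schema: List[BlockMeta]) -> Optional[Any]:
    """Return the first block's schema, or None when there are no blocks."""
    if not reduce_metadata_schema:
        # With no blocks the list is empty, so an unguarded
        # reduce_metadata_schema[0] would raise IndexError.
        return None
    return reduce_metadata_schema[0].schema
```

With this guard an empty input yields a fallback value instead of raising, while non-empty input behaves as before.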
@Copilot Copilot AI review requested due to automatic review settings June 25, 2025 21:04
@raulchen raulchen requested a review from a team as a code owner June 25, 2025 21:04
@Copilot Copilot AI (Contributor) left a comment

Pull Request Overview

This PR fixes the repartitioning behavior for empty datasets and refines the handling of block schemas during repartitioning.

  • Added a new test to ensure empty datasets are repartitioned correctly with both shuffling options.
  • Updated the block schema handling in the repartition task scheduler, including a default fallback for empty results.

Reviewed Changes

Copilot reviewed 2 out of 2 changed files in this pull request and generated 1 comment.

| File | Description |
| --- | --- |
| python/ray/data/tests/test_repartition_e2e.py | Added test case for empty dataset repartitioning with both shuffled and non-shuffled options. |
| python/ray/data/_internal/planner/exchange/split_repartition_task_scheduler.py | Refined block schema detection logic and improved error messaging for unknown block schemas. |
Comments suppressed due to low confidence (1)

python/ray/data/tests/test_repartition_e2e.py:210

  • [nitpick] Consider adding a docstring to this test function to clarify its purpose and expected behavior when repartitioning an empty dataset.
def test_repartition_empty_datasets(ray_start_regular_shared_2_cpus, shuffle):

@raulchen raulchen added the "go" label (add ONLY when ready to merge, run all tests) Jun 26, 2025
@raulchen raulchen enabled auto-merge (squash) June 26, 2025 00:33
Comment on lines +212 to +213
num_partitions = 5
ds_empty = ray.data.range(100).filter(lambda row: False)
Member
Nit: Consider using the simplest inputs to make the intent clearer.

Suggested change
- num_partitions = 5
- ds_empty = ray.data.range(100).filter(lambda row: False)
+ num_partitions = 1
+ ds_empty = ray.data.range(1).filter(lambda row: False)

@raulchen raulchen merged commit 7f8510e into ray-project:master Jun 26, 2025
6 checks passed
minerharry pushed a commit to minerharry/ray that referenced this pull request Jun 27, 2025