ray: fix handling large chunks #53535
base: master
Conversation
Signed-off-by: Raghav Anand <ranand@figma.com>
Pull Request Overview
This PR fixes how large chunked PyArrow arrays are handled when the first chunk exceeds the maximum allowed size by ensuring empty slices aren’t appended, and adds a test to validate this behavior.
- Prevent empty slices in `_try_combine_chunks_safe` when the initial chunk alone exceeds `max_chunk_size` (see the sketch after this list).
- Introduce a `MIN_NUM_CHUNKS_TO_TRIGGER_COMBINE_CHUNKS` constant and import it in tests.
- Add `test_defragment_large_table` to cover the large-chunk defragmentation logic.
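For context, a minimal sketch of the guard the first bullet describes. Only the function name `_try_combine_chunks_safe` and the empty-slice symptom come from the PR; the slicing loop, variable names, and the way `max_chunk_size` is applied below are assumptions, not Ray's actual implementation.

```python
import pyarrow as pa

def _try_combine_chunks_safe_sketch(
    chunked: pa.ChunkedArray, max_chunk_size: int
) -> pa.ChunkedArray:
    """Illustrative only: regroup chunks so each output slice stays under
    max_chunk_size bytes, skipping empty slices. Without the `current` check
    in the flush condition, an oversized *first* chunk would flush a
    zero-length slice, which is the kind of failure this PR guards against."""
    slices = []
    current, current_bytes = [], 0
    for chunk in chunked.chunks:
        if current and current_bytes + chunk.nbytes > max_chunk_size:
            # Flush only when something has actually been accumulated.
            slices.append(pa.concat_arrays(current))
            current, current_bytes = [], 0
        current.append(chunk)
        current_bytes += chunk.nbytes
    if current:
        slices.append(pa.concat_arrays(current))
    return pa.chunked_array(slices)
```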
Reviewed Changes
Copilot reviewed 2 out of 2 changed files in this pull request and generated 1 comment.
| File | Description |
| --- | --- |
| python/ray/data/_internal/arrow_ops/transform_pyarrow.py | Add a guard to skip appending an empty slice when combining chunks. |
| python/ray/data/tests/test_transform_pyarrow.py | Import the new constant and add a test for defragmenting large tables. |
Comments suppressed due to low confidence (1)
python/ray/data/tests/test_transform_pyarrow.py:63
- The test calls `try_combine_chunked_columns`, but this function is not imported. Add `try_combine_chunked_columns` to the import list from `transform_pyarrow`.
data = try_combine_chunked_columns(table)
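A sketch of the import the comment asks for; the module path follows from the file path above, but grouping it with the new constant in a single import statement is an assumption about how the test file is organized.

```python
# Hypothetical import block for python/ray/data/tests/test_transform_pyarrow.py;
# the comment only requires adding try_combine_chunked_columns.
from ray.data._internal.arrow_ops.transform_pyarrow import (
    MIN_NUM_CHUNKS_TO_TRIGGER_COMBINE_CHUNKS,
    try_combine_chunked_columns,
)
```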
@@ -49,6 +50,20 @@ def test_try_defragment_table():
    assert dt == t


def test_defragment_large_table():
    big = pa.array(list(range(800_000_000)), type=pa.int32())  # ~2GiB
Constructing a Python list of 800 million ints can OOM or be very slow. Consider using numpy.arange
or a PyArrow streaming interface to build the array more efficiently.
Suggested change:
- big = pa.array(list(range(800_000_000)), type=pa.int32())  # ~2GiB
+ big = pa.array(np.arange(800_000_000), type=pa.int32())  # ~2GiB
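One further refinement, which is my own note rather than part of the thread: passing `dtype=np.int32` to `np.arange` avoids materializing an int64 intermediate (about 6.4 GB for 800 million elements) before the cast down to `int32`.

```python
import numpy as np
import pyarrow as pa

# Build roughly 3 GB of int32 data directly, without an int64 intermediate.
big = pa.array(np.arange(800_000_000, dtype=np.int32))
```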
That looks like a good suggestion actually
I don't think the original code or your fix actually solved this? For example, chunks = [3gb, 1gb, 1gb]. After the function, it will give
a63aeb4 to 7f3a56f (Compare)
@wingkitlee0 are you saying that if the chunk is > 2GiB and is not a
Interesting. I may be seeing some unrelated error then. But logically, if the resultant array has a chunk that is larger than the "max chunk size", shouldn't it be split? Also, there are a few places where I read about the 2GB limit, e.g., apache/arrow#44944 (comment)
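To make the question above concrete, here is a small, hypothetical check (not from the PR) for whether a column still contains a chunk above some byte threshold after combining; the helper name and the scaled-down sizes are illustrative.

```python
import pyarrow as pa

def oversized_chunks(chunked: pa.ChunkedArray, max_bytes: int) -> list:
    """Return the chunks whose data exceeds max_bytes (hypothetical helper)."""
    return [c for c in chunked.chunks if c.nbytes > max_bytes]

# Small sizes standing in for the 3GB/1GB/1GB example above.
col = pa.chunked_array([pa.array(range(n), type=pa.int32()) for n in (768, 256, 256)])
print([c.nbytes for c in col.chunks])              # per-chunk sizes in bytes
print(len(oversized_chunks(col, max_bytes=2048)))  # chunks still above the stand-in limit
```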
This pull request has been automatically marked as stale because it has not had recent activity. You can always ask for help on our discussion forum or Ray's public Slack channel. If you'd like to keep this open, just leave any comment, and the stale label will be removed.
@ranandfigma will you be able to take a look at the CI failures? If not, I can take over and land this.
@ranandfigma I'm addressing this and a few other issues in this PR:
Ray wasn't correctly handling large chunked PyArrow arrays when the first chunk is already past the max size. This triggers a PyArrow internal error, since the list of slices ends up containing an empty array. Fix this by checking for a non-empty slice before appending.
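A minimal sketch of the failing scenario described above, scaled down and constructed by hand; the table layout and the call shown in the comment are assumptions based on the test in this PR, not a copy of it.

```python
import numpy as np
import pyarrow as pa

# A column whose *first* chunk is already large relative to the size threshold,
# followed by smaller chunks (scaled down from the multi-GiB sizes in the test).
first = pa.array(np.arange(1_000_000, dtype=np.int32))
rest = [pa.array(np.arange(1_000, dtype=np.int32)) for _ in range(10)]
table = pa.table({"id": pa.chunked_array([first, *rest])})

# The PR's test then defragments/combines the table, roughly:
#     data = try_combine_chunked_columns(table)
# and checks the data round-trips without the empty-slice PyArrow error.
```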
Checks
- I've signed off every commit (i.e., using git commit -s) in this PR.
- I've run scripts/format.sh to lint the changes in this PR.
- If I've added a new method in Tune, I've added it in doc/source/tune/api/ under the corresponding .rst file.