ray: fix handling large chunks by ranandfigma · Pull Request #53535 · ray-project/ray · GitHub

ray: fix handling large chunks #53535


Open

ranandfigma wants to merge 2 commits into ray-project:master from fix_large_chunk_coalesce

Conversation

ranandfigma

Ray wasn't correctly handling large chunked PyArrow arrays when the first chunk is already past the max size. In that case the list of slices ends up containing an empty slice, which triggers a PyArrow internal error. Fix this by checking that a slice is non-empty before appending it.
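
For context, a simplified standalone sketch of the guard being described (loosely modeled on _try_combine_chunks_safe; the helper name and size cap below are illustrative, not Ray's actual code):

import pyarrow as pa

MAX_CHUNK_SIZE = 2**31 - 1  # illustrative ~2 GiB cap, not the exact constant Ray uses

def combine_chunks_safe(arr: pa.ChunkedArray) -> pa.ChunkedArray:
    """Group chunks into size-bounded runs, then concatenate each run."""
    slices = []       # runs of chunks to concatenate
    current = []      # chunks accumulated for the current run
    current_size = 0
    for chunk in arr.chunks:
        if current_size + chunk.nbytes > MAX_CHUNK_SIZE:
            # The fix: flush only when the current run is non-empty. Without this
            # guard, a first chunk that already exceeds the cap pushes an empty
            # run into `slices`, and concatenating an empty run fails later.
            if current:
                slices.append(current)
            current, current_size = [], 0
        current.append(chunk)
        current_size += chunk.nbytes
    if current:
        slices.append(current)
    return pa.chunked_array([pa.concat_arrays(run) for run in slices])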

Checks

  • I've signed off every commit (by using the -s flag, i.e., git commit -s) in this PR.
  • I've run scripts/format.sh to lint the changes in this PR.
  • I've included any doc changes needed for https://docs.ray.io/en/master/.
    • I've added any new APIs to the API Reference. For example, if I added a
      method in Tune, I've added it in doc/source/tune/api/ under the
      corresponding .rst file.
  • I've made sure the tests are passing. Note that there might be a few flaky tests, see the recent failures at https://flakey-tests.ray.io/ - Ran pyarrow tests
  • Testing Strategy
    • Unit tests

Signed-off-by: Raghav Anand <ranand@figma.com>
@ranandfigma ranandfigma marked this pull request as ready for review June 3, 2025 23:15
@Copilot Copilot AI review requested due to automatic review settings June 3, 2025 23:15
@ranandfigma ranandfigma requested a review from a team as a code owner June 3, 2025 23:15
Contributor
@Copilot Copilot AI left a comment

Pull Request Overview

This PR fixes how large chunked PyArrow arrays are handled when the first chunk exceeds the maximum allowed size by ensuring empty slices aren’t appended, and adds a test to validate this behavior.

  • Prevent empty slices in _try_combine_chunks_safe when the initial chunk alone exceeds max_chunk_size.
  • Introduce MIN_NUM_CHUNKS_TO_TRIGGER_COMBINE_CHUNKS constant and import it in tests.
  • Add test_defragment_large_table to cover large-chunk defragmentation logic.

Reviewed Changes

Copilot reviewed 2 out of 2 changed files in this pull request and generated 1 comment.

File | Description
python/ray/data/_internal/arrow_ops/transform_pyarrow.py | Add guard to skip appending empty slice when combining chunks.
python/ray/data/tests/test_transform_pyarrow.py | Import new constant and add a test for defragmenting large tables.
Comments suppressed due to low confidence (1)

python/ray/data/tests/test_transform_pyarrow.py:63

  • The test calls try_combine_chunked_columns but this function is not imported. Add try_combine_chunked_columns to the import list from transform_pyarrow.
data = try_combine_chunked_columns(table)

@@ -49,6 +50,20 @@ def test_try_defragment_table():
assert dt == t


def test_defragment_large_table():

big = pa.array(list(range(800_000_000)), type=pa.int32()) # ~2GiB
Copilot AI Jun 3, 2025

Constructing a Python list of 800 million ints can OOM or be very slow. Consider using numpy.arange or a PyArrow streaming interface to build the array more efficiently.

Suggested change
big = pa.array(list(range(800_000_000)), type=pa.int32()) # ~2GiB
big = pa.array(np.arange(800_000_000), type=pa.int32()) # ~2GiB

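Taken together with the suppressed comment above, the top of the test module would need roughly the following (a sketch: the import path and the two names come from this PR's diff; numpy is only needed if the suggestion is adopted):

import numpy as np
import pyarrow as pa

from ray.data._internal.arrow_ops.transform_pyarrow import (
    MIN_NUM_CHUNKS_TO_TRIGGER_COMBINE_CHUNKS,
    try_combine_chunked_columns,
)

# Inside test_defragment_large_table, the column can then be built without
# materializing 800 million Python ints first:
big = pa.array(np.arange(800_000_000, dtype=np.int32))  # ~3.2 GB of int32 values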

alexeykudinkin
Contributor

That looks like a good suggestion actually

@wingkitlee0
Contributor

> first chunk is already past the max size

I don't think the original code or your fix actually solved this? For example, chunks = [3gb, 1gb, 1gb]. After the function, it will give slices = [[3gb], [1gb, 1gb]]. Before the return, there is a call pa.concat_arrays([3gb]), which should fail if the data types are not large_*. It seems that some splitting logic is needed in this case?
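
To make the example concrete, a size-only simulation of the grouping behavior being described (pure Python, illustrative threshold, not Ray's actual code):

GB = 1024 ** 3
MAX_CHUNK_SIZE = 2 * GB  # illustrative cap

def group_chunk_sizes(chunk_sizes):
    """Mimic the run-building step using byte sizes only."""
    slices, current, current_size = [], [], 0
    for size in chunk_sizes:
        if current and current_size + size > MAX_CHUNK_SIZE:
            slices.append(current)
            current, current_size = [], 0
        current.append(size)
        current_size += size
    if current:
        slices.append(current)
    return slices

print(group_chunk_sizes([3 * GB, 1 * GB, 1 * GB]))
# -> [[3 GiB], [1 GiB, 1 GiB]]: the oversized first chunk stays in a run of its
# own that is still over the cap, which is the splitting concern raised here.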

Signed-off-by: Raghav Anand <ranand@figma.com>
@ranandfigma ranandfigma force-pushed the fix_large_chunk_coalesce branch from a63aeb4 to 7f3a56f Compare June 5, 2025 02:29
@ranandfigma
Author

> first chunk is already past the max size

> I don't think the original code or your fix actually solved this? For example, chunks = [3gb, 1gb, 1gb]. After the function, it will give slices = [[3gb], [1gb, 1gb]]. Before the return, there is a call pa.concat_arrays([3gb]), which should fail if the data types are not large_*. It seems that some splitting logic is needed in this case?

@wingkitlee0 are you saying that if the chunk is > 2GiB and is not a large_ type, we will fail anyway? That wasn't my experience when testing out this fix (it seemed to work), and similarly when I wrote the test. Or are you saying that if we see a chunk that's > 2GiB, and of type that is not large_, we should split it up? I'm not immediately understanding why that would be necessary.

@wingkitlee0
Contributor

Interesting. I may be seeing some unrelated error then. But logically, if the resultant array has a chunk that is larger than the "max chunk size", shouldn't it be split? Also there are a few places I read about the 2gb limit, e.g., apache/arrow#44944 (comment)
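
For reference on the limit mentioned: Arrow's string and binary types store offsets as int32, so a single non-large array is capped near 2 GiB of data, while the large_* variants use int64 offsets. A minimal illustration (unrelated to this PR's diff):

import pyarrow as pa

# string/binary use 32-bit offsets (per-array data capped near 2 GiB);
# large_string/large_binary use 64-bit offsets and avoid that cap.
s = pa.array(["a", "b"], type=pa.string())
ls = s.cast(pa.large_string())
print(s.type, ls.type)  # string large_string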

github-actions bot

This pull request has been automatically marked as stale because it has not had
any activity for 14 days. It will be closed in another 14 days if no further activity occurs.
Thank you for your contributions.

You can always ask for help on our discussion forum or Ray's public slack channel.

If you'd like to keep this open, just leave any comment, and the stale label will be removed.

@github-actions github-actions bot added the stale The issue is stale. It will be closed within 7 days unless there are further conversation label Jun 20, 2025
@alexeykudinkin alexeykudinkin removed the stale The issue is stale. It will be closed within 7 days unless there are further conversation label Jun 20, 2025
@alexeykudinkin alexeykudinkin self-assigned this Jun 20, 2025
@alexeykudinkin
Contributor

@ranandfigma will you be able to take a look at the CI failures? If not, I can take over and land this

alexeykudinkin
alexeykudinkin previously approved these changes Jun 23, 2025

@alexeykudinkin alexeykudinkin self-requested a review June 23, 2025 23:24
@alexeykudinkin alexeykudinkin dismissed their stale review June 23, 2025 23:25

Approved by mistake

@alexeykudinkin
Contributor

@ranandfigma I'm addressing this and a few other issues in this PR:

#53971

3 participants