ray: fix handling large chunks #53535
base: master
Conversation
Signed-off-by: Raghav Anand <ranand@figma.com>
Pull Request Overview
This PR fixes how large chunked PyArrow arrays are handled when the first chunk exceeds the maximum allowed size by ensuring empty slices aren’t appended, and adds a test to validate this behavior.
- Prevent empty slices in `_try_combine_chunks_safe` when the initial chunk alone exceeds `max_chunk_size` (see the sketch after this list).
- Introduce a `MIN_NUM_CHUNKS_TO_TRIGGER_COMBINE_CHUNKS` constant and import it in tests.
- Add `test_defragment_large_table` to cover the large-chunk defragmentation logic.
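For context, a minimal sketch of the guard the first bullet describes. Only the function name `_try_combine_chunks_safe` and the empty-slice symptom come from the PR; the slicing loop, variable names, and the way `max_chunk_size` is applied below are assumptions, not Ray's actual implementation.

```python
import pyarrow as pa

def _try_combine_chunks_safe_sketch(
    chunked: pa.ChunkedArray, max_chunk_size: int
) -> pa.ChunkedArray:
    """Illustrative only: regroup chunks so each output slice stays under
    max_chunk_size bytes, skipping empty slices. Without the `current` check
    in the flush condition, an oversized *first* chunk would flush a
    zero-length slice, which is the kind of failure this PR guards against."""
    slices = []
    current, current_bytes = [], 0
    for chunk in chunked.chunks:
        if current and current_bytes + chunk.nbytes > max_chunk_size:
            # Flush only when something has actually been accumulated.
            slices.append(pa.concat_arrays(current))
            current, current_bytes = [], 0
        current.append(chunk)
        current_bytes += chunk.nbytes
    if current:
        slices.append(pa.concat_arrays(current))
    return pa.chunked_array(slices)
```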
Reviewed Changes
Copilot reviewed 2 out of 2 changed files in this pull request and generated 1 comment.
| File | Description |
| --- | --- |
| python/ray/data/_internal/arrow_ops/transform_pyarrow.py | Add a guard to skip appending an empty slice when combining chunks. |
| python/ray/data/tests/test_transform_pyarrow.py | Import the new constant and add a test for defragmenting large tables. |
Comments suppressed due to low confidence (1)
python/ray/data/tests/test_transform_pyarrow.py:63
- The test calls `try_combine_chunked_columns`, but this function is not imported. Add `try_combine_chunked_columns` to the import list from `transform_pyarrow`.
data = try_combine_chunked_columns(table)
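A sketch of the import the comment asks for; the module path follows from the file path above, but grouping it with the new constant in a single import statement is an assumption about how the test file is organized.

```python
# Hypothetical import block for python/ray/data/tests/test_transform_pyarrow.py;
# the comment only requires adding try_combine_chunked_columns.
from ray.data._internal.arrow_ops.transform_pyarrow import (
    MIN_NUM_CHUNKS_TO_TRIGGER_COMBINE_CHUNKS,
    try_combine_chunked_columns,
)
```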
@@ -49,6 +50,20 @@ def test_try_defragment_table():
    assert dt == t


def test_defragment_large_table():
    big = pa.array(list(range(800_000_000)), type=pa.int32())  # ~2GiB
Constructing a Python list of 800 million ints can OOM or be very slow. Consider using numpy.arange
or a PyArrow streaming interface to build the array more efficiently.
Suggested change:
- big = pa.array(list(range(800_000_000)), type=pa.int32())  # ~2GiB
+ big = pa.array(np.arange(800_000_000), type=pa.int32())  # ~2GiB
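One further refinement, which is my own note rather than part of the thread: passing `dtype=np.int32` to `np.arange` avoids materializing an int64 intermediate (about 6.4 GB for 800 million elements) before the cast down to `int32`.

```python
import numpy as np
import pyarrow as pa

# Build roughly 3 GB of int32 data directly, without an int64 intermediate.
big = pa.array(np.arange(800_000_000, dtype=np.int32))
```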
That looks like a good suggestion actually
I don't think the original code or your fix actually solved this? For example, chunks = [3gb, 1gb, 1gb]. After the function, it will give
a63aeb4 to 7f3a56f (Compare)
@wingkitlee0 are you saying that if the chunk is > 2GiB and is not a
Interesting. I may be seeing some unrelated error then. But logically, if the resultant array has a chunk that is larger than the "max chunk size", shouldn't it be split? Also, there are a few places where I read about the 2GB limit, e.g., apache/arrow#44944 (comment)
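To make the question above concrete, here is a small, hypothetical check (not from the PR) for whether a column still contains a chunk above some byte threshold after combining; the helper name and the scaled-down sizes are illustrative.

```python
import pyarrow as pa

def oversized_chunks(chunked: pa.ChunkedArray, max_bytes: int) -> list:
    """Return the chunks whose data exceeds max_bytes (hypothetical helper)."""
    return [c for c in chunked.chunks if c.nbytes > max_bytes]

# Small sizes standing in for the 3GB/1GB/1GB example above.
col = pa.chunked_array([pa.array(range(n), type=pa.int32()) for n in (768, 256, 256)])
print([c.nbytes for c in col.chunks])              # per-chunk sizes in bytes
print(len(oversized_chunks(col, max_bytes=2048)))  # chunks still above the stand-in limit
```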
This pull request has been automatically marked as stale because it has not had recent activity. You can always ask for help on our discussion forum or Ray's public Slack channel. If you'd like to keep this open, just leave any comment, and the stale label will be removed.
@ranandfigma will you be able to take a look at the CI failures? If not, I can take over and land this.
@ranandfigma I'm addressing this and a few other issues in this PR:
Ray wasn't correctly handling large chunked PyArrow arrays when the first chunk is already past the max size. This triggers a PyArrow internal error, since the list of slices ends up containing an empty array. Fix this by checking for a non-empty slice before appending.
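A minimal sketch of the failing scenario described above, scaled down and constructed by hand; the table layout and the call shown in the comment are assumptions based on the test in this PR, not a copy of it.

```python
import numpy as np
import pyarrow as pa

# A column whose *first* chunk is already large relative to the size threshold,
# followed by smaller chunks (scaled down from the multi-GiB sizes in the test).
first = pa.array(np.arange(1_000_000, dtype=np.int32))
rest = [pa.array(np.arange(1_000, dtype=np.int32)) for _ in range(10)]
table = pa.table({"id": pa.chunked_array([first, *rest])})

# The PR's test then defragments/combines the table, roughly:
#     data = try_combine_chunked_columns(table)
# and checks the data round-trips without the empty-slice PyArrow error.
```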
Checks
- I've signed off every commit (i.e., using git commit -s) in this PR.
- I've run scripts/format.sh to lint the changes in this PR.
- If I've added a new method in Tune, I've added it in doc/source/tune/api/ under the corresponding .rst file.