[Data] Fix bug where pandas blocks don't use tensor extension #51868
base: master
@@ -1430,22 +1430,3 @@ def _is_boolean(self):
 TensorArray._add_arithmetic_ops()
 TensorArray._add_comparison_ops()
 TensorArray._add_logical_ops()
-
-
-@PublicAPI(stability="beta")
I assume no one uses this, but it's technically a public API, so I'm not sure if we should keep it just to be safe.
 if isinstance(column_values, np.ndarray):
     # No copy/conversion needed, just keep it verbatim.
     return column_values

-elif isinstance(column_values, list):
+elif isinstance(column_values, (list, pd.Series)):
pd.Series should be handled by the next branch (L#198)
This pull request has been automatically marked as stale because it has not had recent activity. It will be closed in 14 days if no further activity occurs. Thank you for your contributions.
You can always ask for help on our discussion forum or Ray's public slack channel. If you'd like to keep this open, just leave any comment, and the stale label will be removed.
This pull request has been automatically closed because there has been no more activity in the 14 days since it was marked stale. Please feel free to reopen or open a new pull request if you'd still like this to be addressed. Again, you can always ask for help on our discussion forum or Ray's public slack channel. Thanks again for your contribution!
Why are these changes needed?
If you return a pandas DataFrame containing Torch tensors from your map_batches function, Ray Data won't use the tensor extension type, and this can cause Ray Data to incorrectly estimate the size of resulting blocks.
This PR fixes that bug by making PandasBlockBuilder use the same code path used to construct blocks from NumPy batches.
Related issue number
Checks
[...] git commit -s) in this PR.
[...] scripts/format.sh to lint the changes in this PR.
[...] method in Tune, I've added it in doc/source/tune/api/ under the corresponding .rst file.
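The size-estimation problem described under "Why are these changes needed?" can be illustrated without Ray. The sketch below is an assumption-laden illustration, not Ray Data's actual accounting: it uses NumPy arrays as stand-ins for Torch tensors and pandas' shallow `memory_usage` as a stand-in for a size estimate. It shows that an object-dtype column reports only one pointer per row, while stacking the same data into a single ndarray (roughly what a NumPy-batch code path would see) exposes the full payload. The names `tensors`, `column`, and `stacked` are hypothetical.

import numpy as np
import pandas as pd

# Stand-ins for tensors: ten 256x256 float32 arrays.
tensors = [np.zeros((256, 256), dtype=np.float32) for _ in range(10)]
payload_bytes = sum(t.nbytes for t in tensors)

# Build a 1-D object array element by element so NumPy does not try to
# interpret the nested arrays as extra dimensions.
column = np.empty(len(tensors), dtype=object)
for i, t in enumerate(tensors):
    column[i] = t

df = pd.DataFrame({"image": column})

# Shallow accounting sees only one pointer per row, not the tensor data.
shallow_bytes = df["image"].memory_usage(index=False, deep=False)

# Stacking into one contiguous ndarray makes the full payload visible.
stacked = np.stack(tensors)

print(df["image"].dtype, shallow_bytes, payload_bytes, stacked.nbytes)

How Ray Data measures block size internally is not shown in this excerpt, so treat the numbers here only as a demonstration of why an object column can hide the data it points to.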