-
Notifications
You must be signed in to change notification settings - Fork 6.5k
[Data] Make from_items
lineage serializable
#52026
New issue
Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.
By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.
Already on GitHub? Sign in to your account
base: master
Are you sure you want to change the base?
Conversation
Signed-off-by: Balaji Veeramani <bveeramani@berkeley.edu>
Signed-off-by: Balaji Veeramani <bveeramani@berkeley.edu>
Signed-off-by: Balaji Veeramani <bveeramani@berkeley.edu>
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
LGTM, some minor comments
class FromBlocks(LogicalOperator): | ||
"""Logical operator for in-heap-memory (not object store) input data. | ||
|
||
If you want to create a Dataset from in-object-store data, use the `InputData` |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Let's reconcile naming consistent for operators, FromBlocks
and InputData
are too disjoint so far.
I'd suggest something along the lines FromBlocks
and FromBlockRefs
(or along these lines)
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Will address in follow-up since the rename might create a large diff
input_metadata: List[BlockMetadata], | ||
): | ||
super().__init__(self.__class__.__name__, [], len(input_blocks)) | ||
assert len(input_blocks) == len(input_metadata), ( |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Let's just pass a list of tuples
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Will address in follow-up since this PR maintains the current interface
python/ray/data/_internal/logical/operators/from_blocks_operator.py
Outdated
Show resolved
Hide resolved
Signed-off-by: Balaji Veeramani <bveeramani@berkeley.edu>
Signed-off-by: Balaji Veeramani <bveeramani@berkeley.edu>
Signed-off-by: Balaji Veeramani <bveeramani@berkeley.edu>
This pull request has been automatically marked as stale because it has not had You can always ask for help on our discussion forum or Ray's public slack channel. If you'd like to keep this open, just leave any comment, and the stale label will be removed. |
This pull request has been automatically closed because there has been no more activity in the 14 days Please feel free to reopen or open a new pull request if you'd still like this to be addressed. Again, you can always ask for help on our discussion forum or Ray's public slack channel. Thanks again for your contribution! |
Why are these changes needed?
This PR makes the following changes:
AbstractFrom
and renamesAbstractFrom
toFromBlocks
:AbstractFrom
for some APIs likefrom_pandas
but not for other APIs likefrom_dask
ray.put
of input blocks to execution time rather than when the user-facing API is calledfrom_items
These changes are needed so that you can tune over datasets that look like this:
ray.data.from_items([{"path": ...}, ...]).map(load_path)
Related issue number
Checks
git commit -s
) in this PR.scripts/format.sh
to lint the changes in this PR.method in Tune, I've added it in
doc/source/tune/api/
under thecorresponding
.rst
file.