Description
- The Object Store can transfer specified files between nodes.
- The Object Store can manage the metadata of those files, including:
  - reference count: when the reference count drops to zero, all copies of the file on the cluster are deleted.
  - locations: the nodes that hold the file, so that a raylet can pull it.
- The local file is fate-sharing with its metadata in the local plasma store. One option is to register a callback that is called when plasma deletes the metadata and that deletes the corresponding local file.
- When disk space is low (that is, when an attempt to pull a remote file fails), try to evict file metadata from the Object Store.
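The reference-count and fate-sharing behavior described above can be sketched in plain Python. This is only an illustration, not Ray's implementation: `FileEntry`, `FileMetadataTable`, and the `on_delete` callback are hypothetical names, and the sketch assumes for simplicity that every node stores the file under the same path.

```python
from dataclasses import dataclass, field
from typing import Callable, Dict, Set


@dataclass
class FileEntry:
    """Hypothetical per-file metadata record (not a real Ray type)."""
    path: str
    ref_count: int = 0
    locations: Set[str] = field(default_factory=set)


class FileMetadataTable:
    """Toy stand-in for the file metadata the Object Store would manage."""

    def __init__(self, on_delete: Callable[[str, str], None]) -> None:
        # on_delete(node_id, path) plays the role of the callback that runs
        # when plasma deletes the metadata on a node.
        self._entries: Dict[str, FileEntry] = {}
        self._on_delete = on_delete

    def put(self, file_id: str, node_id: str, path: str) -> None:
        """Register a file on the node that created it and take one reference."""
        entry = self._entries.setdefault(file_id, FileEntry(path=path))
        entry.locations.add(node_id)
        entry.ref_count += 1

    def add_location(self, file_id: str, node_id: str) -> None:
        """Record that another node has pulled a copy of the file."""
        self._entries[file_id].locations.add(node_id)

    def add_ref(self, file_id: str) -> None:
        self._entries[file_id].ref_count += 1

    def remove_ref(self, file_id: str) -> None:
        """Drop one reference; at zero, delete the file on every known node."""
        entry = self._entries[file_id]
        entry.ref_count -= 1
        if entry.ref_count == 0:
            for node_id in sorted(entry.locations):
                self._on_delete(node_id, entry.path)
            del self._entries[file_id]

    def __contains__(self, file_id: str) -> bool:
        return file_id in self._entries


deleted = []
table = FileMetadataTable(on_delete=lambda node, path: deleted.append((node, path)))
table.put("f1", "node-a", "/tmp/data.bin")  # created on node-a, ref_count = 1
table.add_location("f1", "node-b")          # node-b pulled a copy
table.add_ref("f1")                         # a second holder, ref_count = 2
table.remove_ref("f1")                      # ref_count = 1, nothing deleted yet
table.remove_ref("f1")                      # ref_count = 0, delete on both nodes
print(deleted)
```

In the sketch, deletion is driven entirely by the metadata table, which mirrors the idea that the local file lives exactly as long as its metadata does.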
API
# In CoreWorker A:
# Ray provides a special class, RayFile, which contains only the metadata of the file.
file = RayFile(
    # The user must ensure that the file exists. We can support s3, local files, ...
    path="/path/to/an/existing/file",
    # Whether to copy the file into the temporary directory that Ray creates for each job
    # (this directory is fate-sharing with the job). Default is True.
    enable_copy=False,
)
file_ref = ray.put(file)
# In CoreWorker B:
# This is a synchronous operation. When it completes, the file has arrived locally and is available.
# This RayFile instance holds a plasma pointer (as numpy.ndarray does), ensuring that as long as
# this instance exists, the metadata of this file stays in the local plasma store and the local file is not deleted.
file = ray.get(file_ref)
# The file is read-only.
with open(file.path, "r") as f:
    ...
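To make the intended meaning of `enable_copy` concrete, here is a minimal, Ray-free sketch. It is an assumption about how the option could behave, not the proposed implementation: `RayFileSketch`, `stage_for_job`, and the `job_dir` argument are hypothetical names introduced only for illustration.

```python
import os
import shutil
import tempfile
from dataclasses import dataclass, replace


@dataclass(frozen=True)
class RayFileSketch:
    """Hypothetical stand-in for the proposed RayFile: metadata only, no contents."""
    path: str
    enable_copy: bool = True


def stage_for_job(file: RayFileSketch, job_dir: str) -> RayFileSketch:
    """Return the metadata to store, copying the file into job_dir if requested."""
    # The user is responsible for the file existing; fail early if it does not.
    if not os.path.isfile(file.path):
        raise FileNotFoundError(file.path)
    if not file.enable_copy:
        # Keep pointing at the user's original file.
        return file
    # Copy into the per-job directory, which would be fate-sharing with the job.
    os.makedirs(job_dir, exist_ok=True)
    staged_path = os.path.join(job_dir, os.path.basename(file.path))
    shutil.copyfile(file.path, staged_path)
    return replace(file, path=staged_path)


with tempfile.TemporaryDirectory() as root:
    src = os.path.join(root, "data.txt")
    with open(src, "w") as f:
        f.write("training data")
    job_dir = os.path.join(root, "job_0001")  # hypothetical per-job directory

    copied = stage_for_job(RayFileSketch(src), job_dir)
    in_place = stage_for_job(RayFileSketch(src, enable_copy=False), job_dir)

    copied_in_job_dir = os.path.dirname(copied.path) == job_dir
    in_place_unchanged = in_place.path == src
    with open(copied.path) as f:
        staged_text = f.read()
```

With `enable_copy=True` the stored metadata points at the copy inside the job directory, so removing that directory at job exit also removes the file. With `enable_copy=False` the metadata keeps pointing at the user's original path.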
Use case
Given scenario
- Ray cluster description:
  - Node Group A: without GPUs, responsible for processing and generating training data. These files are large, generally no smaller than 1 GB.
  - Node Group B: with GPUs, responsible for training the model. The training processes need to read data from Node Group A.
- The files generated by A are consumed by B. Currently, there are two main solutions:
  - Transit through storage services (s3, NFS, NAS, ...)
    - cons: additional expense, and the overall system becomes more complex.
  - Use RPC to pull remote files
    - cons: users need to maintain the metadata themselves, and the code complexity is high. We also want users on Ray to rely on Ray as much as possible.