[Core][Object Store] Object Store to manage files in the cluster #32694
Open
@Catch-Bull

Description

  1. Object Store can transfer specified files between different nodes.
  2. Object Store can manage the metadata of those specified files (a possible record shape is sketched after this list), including:
    • reference count: delete all copies of the file in the cluster when the reference count drops to zero.
    • locations: so a raylet knows which nodes it can pull the specified file from.
    • fate-sharing: the local file is fate-shared with its metadata in the local plasma store. Maybe we can register a callback that is invoked when plasma deletes the metadata and deletes the corresponding local file.
  3. When disk space is low (e.g. when an attempt to pull a remote file fails), try to evict file metadata from the Object Store.
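
For illustration only, the per-file metadata described above might look something like the record below. All names here (FileMetadata, on_delete, ...) are hypothetical and not part of any existing Ray API.

from dataclasses import dataclass, field
from typing import Callable, List, Optional

@dataclass
class FileMetadata:
    # Hypothetical shape of the per-file record kept in the local plasma store.
    path: str                                           # local path of the materialized file
    size_bytes: int                                     # useful when choosing what to evict
    locations: List[str] = field(default_factory=list)  # IDs of nodes holding a copy
    reference_count: int = 0                            # delete everywhere when this reaches zero
    # Callback plasma could invoke when it deletes the metadata,
    # e.g. to unlink the corresponding local file (fate-sharing).
    on_delete: Optional[Callable[[str], None]] = None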

API

# In CoreWorker A:

# Ray provides a special class, RayFile, which contains only the metadata of the file.
file = RayFile(
    # The user must ensure that the file exists; we can support s3, local files, ...
    path="/path/to/an/existing/file",
    # Whether to copy the file into the temporary directory Ray creates for each job
    # (that directory is fate-shared with the job). Default is True.
    enable_copy=False,
)
file_ref = ray.put(file)

# In CoreWorker B:

# This is a synchronous operation. When it completes, the file has arrived locally and is available.
# This RayFile instance holds a reference into plasma (like a numpy.ndarray does), ensuring that as long as
# this instance exists, the file's metadata stays in the local plasma store and the local file is not deleted.
file = ray.get(file_ref)
# The file is read-only.
with open(file.path, "r") as f:
    ...
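
To show how the two halves connect, here is a sketch of passing a file between nodes with ordinary Ray tasks. It assumes the proposed RayFile class above, which does not exist in Ray today; everything else is the standard Ray API.

import ray

ray.init()

@ray.remote
def produce():
    # Runs on some node: generate a large file locally, then hand it to Ray.
    path = "/tmp/training_data.bin"
    with open(path, "wb") as f:
        f.write(b"\x00" * 1024)
    return RayFile(path=path)  # proposed class, hypothetical today

@ray.remote(num_gpus=1)
def train(file):
    # Ray resolves the reference before the task starts, so by the time this
    # body runs, the file has already been pulled to this node's local disk.
    with open(file.path, "rb") as f:
        return len(f.read())

file_ref = produce.remote()
print(ray.get(train.remote(file_ref)))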

Use case

Given scenario

  • Ray cluster description:
    1. Node Group A: no GPUs; responsible for processing and generating training data. These files are large, generally no smaller than 1 GB.
    2. Node Group B: has GPUs; responsible for training models. The training processes need to read data produced by Node Group A.
  • The files generated by A are consumed by B. Currently, there are two main solutions:
    1. Transfer through a storage service (s3, NFS, NAS, ...)
      • cons: additional cost, and the overall system becomes more complex.
    2. Use RPC to pull remote files
      • cons: users have to maintain the metadata themselves, and the code complexity is high. Ideally, users on Ray should need to depend on Ray alone.

    Labels

    P3: Issue moderate in impact or severity
    core: Issues that should be addressed in Ray Core
    core-object-store
    enhancement: Request for new feature and/or capability
    pending-cleanup: This issue is pending cleanup. It will be removed in 2 weeks after being assigned.
