[Train] Reporting metrics/checkpoints from multiple workers · Issue #33360 · ray-project/ray · GitHub
Open
@matthewdeng

Description


Summary

This issue tracks potential improvements for reporting/checkpointing across multiple Ray Train workers for deep learning workloads.

Context

Currently, Ray Train requires each worker to report/checkpoint at the same frequency, because reporting doubles as a synchronization mechanism across workers. This adheres to the SPMD (single program, multiple data) pattern, where each worker runs the same script. Documentation can be found here.
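To make the barrier semantics concrete, here is a small pure-Python simulation (no Ray required) where a stand-in `report()` function plays the role of `ray.train.report()`: every worker must call it at the same frequency, and each call blocks until all workers have reported for that step. The helper names and worker/step counts are illustrative, not part of the Ray API.

```python
import threading

# Simulation of Ray Train's current contract: report() acts as a
# synchronization barrier, so EVERY worker must call it at the same
# frequency. (Pure-Python stand-in for ray.train.report(); no Ray.)
NUM_WORKERS = 4
NUM_STEPS = 3

barrier = threading.Barrier(NUM_WORKERS)
lock = threading.Lock()
reported = []  # (rank, step) pairs, one per report() call

def report(rank, metrics):
    # Record the metrics, then block until all workers have
    # reported for this step.
    with lock:
        reported.append((rank, metrics["step"]))
    barrier.wait()

def worker(rank):
    for step in range(NUM_STEPS):
        loss = 1.0 / (step + 1)  # fake training metric
        report(rank, {"step": step, "loss": loss})

threads = [threading.Thread(target=worker, args=(r,)) for r in range(NUM_WORKERS)]
for t in threads:
    t.start()
for t in threads:
    t.join()

# Every worker reported at every step: 4 workers x 3 steps = 12 calls.
print(len(reported))
```

If any single rank skipped a `report()` call, the others would block at the barrier indefinitely, which is the source of the confusion this issue tracks.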

However, this has turned out to be unintuitive, confusing, or even contradictory to user expectations (#33042). As a result, we should explore options for improving this experience.

Proposal

One potential option is to only allow reporting from the rank 0 worker.
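As a rough sketch, the proposal would make reporting a no-op on non-zero ranks, removing the barrier requirement. This is a pure-Python illustration under stated assumptions (in real Ray Train the rank would come from something like `ray.train.get_context().get_world_rank()`); `report_rank0_only` is a hypothetical helper, not a proposed API name.

```python
# Sketch of the proposed alternative: only the rank 0 worker reports,
# and the call is a no-op (no barrier) on every other rank.
NUM_WORKERS = 4
NUM_STEPS = 3

reported = []

def report_rank0_only(rank, metrics):
    if rank != 0:
        return  # non-zero ranks skip reporting entirely
    reported.append(metrics)

# Simulate each worker's training loop sequentially.
for rank in range(NUM_WORKERS):
    for step in range(NUM_STEPS):
        report_rank0_only(rank, {"step": step, "loss": 1.0 / (step + 1)})

# Only rank 0's NUM_STEPS reports are recorded.
print(len(reported))
```

Under this model, mismatched reporting frequencies across ranks could no longer deadlock training, at the cost of requiring metrics/checkpoints to be aggregated on (or reachable from) rank 0.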

Related Issues

#31409
#31434


    Labels

    P2: Important issue, but not time-critical
    pending-cleanup: This issue is pending cleanup. It will be removed in 2 weeks after being assigned.
    ray-team-created: Ray Team created
    train: Ray Train related issue