Description
Summary
This issue tracks potential improvements to reporting/checkpointing across multiple Ray Train workers for deep learning workloads.
Context
Currently, Ray Train requires every worker to report/checkpoint at the same frequency; the report call doubles as a synchronization barrier across workers. This adheres to the SPMD pattern, where each worker runs the same script. Documentation can be found here.
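For illustration, a minimal sketch of the current contract, assuming the Ray 2.x `ray.train.report` API and a hypothetical `train_loop_per_worker` function (the config values and metric names below are placeholders):

```python
from ray import train
from ray.train import ScalingConfig
from ray.train.torch import TorchTrainer


def train_loop_per_worker(config):
    for epoch in range(config["num_epochs"]):
        loss = 0.0  # placeholder for the real training step / loss
        # Under the current contract, every worker must make this call,
        # and at the same frequency, because report() also acts as a
        # synchronization point across workers.
        train.report({"epoch": epoch, "loss": loss})


trainer = TorchTrainer(
    train_loop_per_worker,
    train_loop_config={"num_epochs": 3},
    scaling_config=ScalingConfig(num_workers=2),
)
trainer.fit()
```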
However, this behavior has turned out to be unintuitive and confusing, and it sometimes contradicts user expectations (#33042). As a result, we should explore options for improving this experience.
Proposal
One potential option is to allow reporting only from the rank 0 worker, as sketched below.
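A hypothetical sketch of what user code might look like under this proposal. It assumes `ray.train.get_context().get_world_rank()` (which exists in current Ray) and assumes the framework would no longer require all workers to reach `report()`; under today's semantics this pattern is exactly what causes hangs or mismatches:

```python
from ray import train


def train_loop_per_worker(config):
    world_rank = train.get_context().get_world_rank()
    for epoch in range(config["num_epochs"]):
        loss = 0.0  # placeholder for the real training step / loss
        # Under the proposal, only the rank 0 worker reports metrics and
        # checkpoints; all other workers simply continue training.
        if world_rank == 0:
            train.report({"epoch": epoch, "loss": loss})
```

This would match the mental model many users already have (rank 0 owns logging and checkpointing), at the cost of needing a different mechanism for whatever synchronization `report()` currently provides.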