Description
Summary
This issue tracks potential improvements to reporting/checkpointing across multiple Ray Train workers for deep learning workloads.
Context
Currently, Ray Train requires every worker to report/checkpoint at the same frequency; the report call doubles as a synchronization barrier across workers. This adheres to the SPMD pattern, where each worker runs the same script. Documentation can be found here.
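For illustration, a minimal sketch of the current contract, assuming the Ray 2.x `ray.train.report` API and a hypothetical `train_loop_per_worker` function (the config values and metric names below are placeholders):

```python
from ray import train
from ray.train import ScalingConfig
from ray.train.torch import TorchTrainer


def train_loop_per_worker(config):
    for epoch in range(config["num_epochs"]):
        loss = 0.0  # placeholder for the real training step / loss
        # Under the current contract, every worker must make this call,
        # and at the same frequency, because report() also acts as a
        # synchronization point across workers.
        train.report({"epoch": epoch, "loss": loss})


trainer = TorchTrainer(
    train_loop_per_worker,
    train_loop_config={"num_epochs": 3},
    scaling_config=ScalingConfig(num_workers=2),
)
trainer.fit()
```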
However, this behavior has turned out to be unintuitive and confusing, and it sometimes contradicts user expectations (#33042). As a result, we should explore options for improving this experience.
Proposal
One potential option is to allow reporting only from the rank 0 worker, as sketched below.
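A hypothetical sketch of what user code might look like under this proposal. It assumes `ray.train.get_context().get_world_rank()` (which exists in current Ray) and assumes the framework would no longer require all workers to reach `report()`; under today's semantics this pattern is exactly what causes hangs or mismatches:

```python
from ray import train


def train_loop_per_worker(config):
    world_rank = train.get_context().get_world_rank()
    for epoch in range(config["num_epochs"]):
        loss = 0.0  # placeholder for the real training step / loss
        # Under the proposal, only the rank 0 worker reports metrics and
        # checkpoints; all other workers simply continue training.
        if world_rank == 0:
            train.report({"epoch": epoch, "loss": loss})
```

This would match the mental model many users already have (rank 0 owns logging and checkpointing), at the cost of needing a different mechanism for whatever synchronization `report()` currently provides.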