This repository was archived by the owner on Dec 9, 2024. It is now read-only.
Hi
I'm trying to evaluate the performance of each worker independently in a cluster of multiple machines that all train the same model. My goal is to record each worker's training performance.
With every setup and config I try, I always get the same time for all workers (probably because of synchronization). So even if one of my workers is a machine that is 4x faster, it still records the same time as the slowest machine in the cluster.
Does anyone have any idea how I can do that?
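To illustrate the effect, here is a minimal sketch that does not use my actual training setup. It simulates workers with Python threads, uses `time.sleep` as a stand-in for the local training step, and uses `threading.Barrier` as a stand-in for the synchronization point. The function name `run_workers` and the timings are made up for illustration. It shows why the end-to-end time looks identical on every worker, and that timing only the part before the synchronization point might expose the per-worker difference.

```python
import threading
import time


def run_workers(compute_seconds, steps=3):
    """Simulate synchronous training.

    compute_seconds[i] is the (hypothetical) local step time of worker i.
    Returns a list of (compute_time, total_time) tuples, one per worker.
    """
    n = len(compute_seconds)
    barrier = threading.Barrier(n)
    results = [None] * n

    def worker(idx):
        compute_total = 0.0
        barrier.wait()  # make all workers start together
        start = time.perf_counter()
        for _ in range(steps):
            t0 = time.perf_counter()
            time.sleep(compute_seconds[idx])  # stand-in for local compute
            compute_total += time.perf_counter() - t0
            barrier.wait()  # stand-in for the synchronous update
        total = time.perf_counter() - start
        results[idx] = (compute_total, total)

    threads = [threading.Thread(target=worker, args=(i,)) for i in range(n)]
    for t in threads:
        t.start()
    for t in threads:
        t.join()
    return results


if __name__ == "__main__":
    # Worker 0 is assumed to be 4x faster than worker 1.
    for idx, (compute, total) in enumerate(run_workers([0.02, 0.08])):
        print(f"worker {idx}: compute={compute:.3f}s total={total:.3f}s")
```

In this simulation the `total` values are nearly the same for both workers because the fast worker waits at the barrier, while the `compute` values differ. Whether the same split is possible in a real framework depends on where it lets you place timers around the synchronization step.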