Tensorflow distributed training question: Do we have to manually copy ckpt files from worker 0 to ps servers? #3812
In general, the `tf.train.Saver` code assumes that the worker and parameter server jobs share the same filesystem. For example, you could use a shared NFS mount, or you could use the support for Google Cloud Storage; support for more filesystems, such as HDFS (#2218), is in development. In the special case that you are using the same parameter server hosts when restoring, you can try passing `sharded=True` to the saver constructor, and each parameter server will write its shard of the parameters to its local filesystem. It should then be able to restore from that path as well.

Great, I did miss this point. After trying, I used rsync to work around. Now I will adjust to the recommended way.
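For concreteness, here is a minimal sketch of the sharded-saver setup described above. The cluster addresses and variables are placeholders; only the saver construction is the point:

```python
import tensorflow as tf

# Hypothetical cluster: two parameter servers and one worker.
cluster = tf.train.ClusterSpec({
    "ps": ["ps0.example.com:2222", "ps1.example.com:2222"],
    "worker": ["worker0.example.com:2222"],
})

# Variables are placed round-robin on the parameter servers.
with tf.device(tf.train.replica_device_setter(cluster=cluster)):
    weights = tf.Variable(tf.zeros([784, 10]), name="weights")
    global_step = tf.Variable(0, name="global_step", trainable=False)

# sharded=True makes each parameter server save/restore its own shard
# of the checkpoint on its local filesystem, so no manual copying of
# checkpoint files is needed, provided the same ps hosts are used when
# restoring.
saver = tf.train.Saver(sharded=True)
```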
@mrry I tried changing the device of this node from the Variable's device to the chief's device, and it could restore the whole cluster correctly. Is this a proposed solution for this issue, or might it cause other errors?
@coderXiangLi I'm not sure exactly what change you're suggesting, but it doesn't look like a general fix. We might consider adding an API to place the save and/or restore ops on a different device, but we can't hardcode it to be on the chief.
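For illustration, a sketch of the kind of change under discussion: building the saver under an explicit device scope so its save/restore ops are pinned to one task rather than colocated with each variable. The device name is an assumption, and, as noted above, this is not a general fix:

```python
import tensorflow as tf

# "/job:worker/task:0" is assumed here to be the chief; whether this
# scope actually overrides the default colocation of save/restore ops
# with their variables is the open question in the comment above.
with tf.device("/job:worker/task:0"):
    saver = tf.train.Saver()
```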
Operating System: Linux XNLPEXP2 4.4.0-24-generic #43-Ubuntu SMP Wed Jun 8 19:27:37 UTC 2016 x86_64 x86_64 x86_64 GNU/Linux (Azure VM, 8 cores, 56 GB memory)
Installed from binary pip package: pip 8.1.2 from /home/xubixiong/.local/lib/python2.7/site-packages (python 2.7)
TensorFlow version: 0.10.0rc0
When trying to restore training from the latest checkpoint, I found that I have to copy the checkpoint files from worker 0 to the ps servers. Is this the right and recommended way, or did I do something wrong?
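For context, a minimal sketch of the restore step in question, with hypothetical paths and addresses. With an unsharded saver, worker 0 writes all checkpoint files locally, so the parameter servers cannot read them unless the files are copied over or the checkpoint directory is on a shared filesystem:

```python
import tensorflow as tf

v = tf.Variable(tf.zeros([10]), name="v")  # placeholder model state
saver = tf.train.Saver()  # unsharded: worker 0 writes all ckpt files

with tf.Session("grpc://worker0.example.com:2222") as sess:
    ckpt = tf.train.latest_checkpoint("/tmp/train_logs")
    if ckpt is not None:
        # The restore reads happen on the hosts holding the variables
        # (the ps servers here), which is why they need access to the
        # checkpoint files that worker 0 wrote.
        saver.restore(sess, ckpt)
```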