Tensorflow distributed training question: Do we have to manually copy ckpt files from worker 0 to ps servers? · Issue #3812 · tensorflow/tensorflow


Closed
bixiongxu opened this issue Aug 15, 2016 · 4 comments

Comments

@bixiongxu

Operating System: Linux XNLPEXP2 4.4.0-24-generic #43-Ubuntu SMP Wed Jun 8 19:27:37 UTC 2016 x86_64 GNU/Linux (Azure VM, 8 cores, 56 GB memory)
Installed from binary pip package: pip 8.1.2 from /home/xubixiong/.local/lib/python2.7/site-packages (python 2.7)
TensorFlow version: 0.10.0rc0

When I try to resume training from the latest checkpoint, I find that I have to copy the checkpoint files from worker 0 to the ps servers.

Is this the right and recommended way, or did I do something wrong?

@mrry
Contributor
mrry commented Aug 15, 2016

In general, the tf.train.Saver code assumes that the worker and parameter server jobs share the same filesystem. For example, you could use a shared NFS mount, or you could use the built-in support for Google Cloud Storage; support for more filesystems, such as HDFS (#2218), is in development.

In the special case that you are using the same parameter server hosts when restoring, you can try passing sharded=True to the Saver constructor; each parameter server will then write its shard of the parameters to its local filesystem, and it should be able to restore from that path as well.
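A minimal sketch of the sharded-saver approach described above, using the TF 0.x-era tf.train API that matches this thread. The cluster spec, host names, and checkpoint path are placeholders, not details from the issue, and the snippet assumes a running cluster, so it is illustrative rather than directly runnable:

```python
import tensorflow as tf  # TF 0.x/1.x-era API (tf.compat.v1 in TF 2.x)

# Hypothetical cluster layout; the same ps hosts must be reused on restore.
cluster = tf.train.ClusterSpec({
    "ps": ["ps0:2222", "ps1:2222"],
    "worker": ["worker0:2222"],
})

# Place variables on the parameter servers in round-robin fashion.
with tf.device(tf.train.replica_device_setter(cluster=cluster)):
    w = tf.Variable(tf.zeros([10]), name="w")

# sharded=True makes each parameter server write its own shard of the
# variables to its *local* filesystem, so no shared mount is needed --
# provided the same ps hosts (with their local files) are used on restore.
saver = tf.train.Saver(sharded=True)

with tf.Session("grpc://worker0:2222") as sess:
    sess.run(tf.initialize_all_variables())  # initializer name in TF 0.10
    saver.save(sess, "/tmp/model.ckpt")      # one shard per ps task
    saver.restore(sess, "/tmp/model.ckpt")
```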

@bixiongxu
Author

Great, I did miss this point. Until now I had used rsync as a workaround; I will switch to the recommended way.
Thank you very much.
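The rsync workaround the author mentions could look something like the following sketch; the host names and checkpoint directory are hypothetical, not taken from the thread, and it assumes SSH access from worker 0 to each ps host:

```shell
#!/bin/sh
# Copy the checkpoint files from worker 0 to each ps host.
# CKPT_DIR and the ps host names below are placeholders.
CKPT_DIR=/tmp/train_logs
for host in ps0 ps1; do
  rsync -av "$CKPT_DIR"/ "$host:$CKPT_DIR"/
done
```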


@sherrym sherrym closed this as completed Aug 16, 2016
@coderXiangLi
coderXiangLi commented Aug 16, 2016

@mrry
I just found the method _AddRestoreOps in Saver.

I tried changing the device of that node from the Variable's device to the chief's device, and the whole cluster then restores correctly:

  new_device = '/job:worker/task:0'
  with ops.device(graph_util.set_cpu0(new_device) if v.device else None):
  # was: with ops.device(graph_util.set_cpu0(v.device) if v.device else None):

Is this a proper solution to this issue, or might it cause other errors?

@mrry
Contributor
mrry commented Aug 16, 2016

@coderXiangLi I'm not sure exactly what change you're suggesting, but it doesn't look like a general fix. We might consider adding an API to place the save and/or restore ops on a different device, but we can't hardcode them to /job:worker/task:0, as that would cause far more copies than necessary (from the worker task to the PS tasks, for example).
