Tensorflow distributed training question: Do we have to manually copy ckpt files from worker 0 to ps servers? #3812
In general, the `tf.train.Saver` code assumes that the worker and parameter server jobs share the same filesystem. For example, you could use a shared NFS mount, or you could use the support for Google Cloud Storage; support for more filesystems, such as HDFS (#2218), is in development. In the special case that you are using the same parameter server hosts when restoring, you can try passing `sharded=True` to the saver constructor, and each parameter server will write its shard of the parameters to its local filesystem. It should then be able to restore from that path as well.

Great, I did miss this point. After trying, I used rsync to work around. Now I will adjust to the recommended way.
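For concreteness, here is a minimal sketch of the sharded-saver setup described above. The cluster addresses and variables are placeholders; only the saver construction is the point:

```python
import tensorflow as tf

# Hypothetical cluster: two parameter servers and one worker.
cluster = tf.train.ClusterSpec({
    "ps": ["ps0.example.com:2222", "ps1.example.com:2222"],
    "worker": ["worker0.example.com:2222"],
})

# Variables are placed round-robin on the parameter servers.
with tf.device(tf.train.replica_device_setter(cluster=cluster)):
    weights = tf.Variable(tf.zeros([784, 10]), name="weights")
    global_step = tf.Variable(0, name="global_step", trainable=False)

# sharded=True makes each parameter server save/restore its own shard
# of the checkpoint on its local filesystem, so no manual copying of
# checkpoint files is needed, provided the same ps hosts are used when
# restoring.
saver = tf.train.Saver(sharded=True)
```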
@mrry I tried changing the device of this node from the Variable's device to the chief's device, and it could restore the whole cluster correctly. Is this a proposed solution for this issue, or might it cause other errors?
@coderXiangLi I'm not sure exactly what change you're suggesting, but it doesn't look like a general fix. We might consider adding an API to place the save and/or restore ops on a different device, but we can't hardcode it to be on the chief.
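For illustration, a sketch of the kind of change under discussion: building the saver under an explicit device scope so its save/restore ops are pinned to one task rather than colocated with each variable. The device name is an assumption, and, as noted above, this is not a general fix:

```python
import tensorflow as tf

# "/job:worker/task:0" is assumed here to be the chief; whether this
# scope actually overrides the default colocation of save/restore ops
# with their variables is the open question in the comment above.
with tf.device("/job:worker/task:0"):
    saver = tf.train.Saver()
```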
Operating System: Linux XNLPEXP2 4.4.0-24-generic #43-Ubuntu SMP Wed Jun 8 19:27:37 UTC 2016 x86_64 x86_64 x86_64 GNU/Linux (Azure VM, 8 cores, 56 GB memory)
Installed from binary pip package: pip 8.1.2 from /home/xubixiong/.local/lib/python2.7/site-packages (python 2.7)
TensorFlow version: 0.10.0rc0
When trying to restore training from the latest checkpoint, I found that I have to copy the checkpoint files from worker 0 to the ps servers. Is this the right and recommended way, or did I do something wrong?
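For context, a minimal sketch of the restore step in question, with hypothetical paths and addresses. With an unsharded saver, worker 0 writes all checkpoint files locally, so the parameter servers cannot read them unless the files are copied over or the checkpoint directory is on a shared filesystem:

```python
import tensorflow as tf

v = tf.Variable(tf.zeros([10]), name="v")  # placeholder model state
saver = tf.train.Saver()  # unsharded: worker 0 writes all ckpt files

with tf.Session("grpc://worker0.example.com:2222") as sess:
    ckpt = tf.train.latest_checkpoint("/tmp/train_logs")
    if ckpt is not None:
        # The restore reads happen on the hosts holding the variables
        # (the ps servers here), which is why they need access to the
        # checkpoint files that worker 0 wrote.
        saver.restore(sess, ckpt)
```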