Support exponential moving average in the default trainer. by yizhongw · Pull Request #2406 · allenai/allennlp · GitHub
This repository was archived by the owner on Dec 16, 2022. It is now read-only.

Support exponential moving average in the default trainer. #2406

Merged
merged 11 commits into allenai:master from yizhongw:ema-trainer on Jan 24, 2019

Conversation

yizhongw
Contributor

No description provided.

@yizhongw
Contributor Author

@matt-gardner Could you give me any hints on why the checks failed? The documentation looks good to me, but it still reported undocumented module: allennlp.training.exponential_moving_average.

Parameters
----------
model: ``Model``, required.
    An AllenNLP model to be optimized.
Contributor

I am not convinced by this API. Why not just pass the named parameters into the constructor (instead of the model)? Then you can get rid of that option in the method calls.

Contributor Author

I was trying to leave room for applying EMA to only a subset of the parameters. But since that might never be used, I will change it to what you suggested.
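For reference, a minimal sketch of what the suggested constructor API could look like, with the named parameters passed in directly (the class layout, names, and default decay here are illustrative, not the final API):

from typing import Iterable, Tuple

import torch


class ExponentialMovingAverage:
    def __init__(self,
                 parameters: Iterable[Tuple[str, torch.Tensor]],
                 decay: float = 0.9999) -> None:
        # Materialize the iterable once so it can be traversed on every update.
        self._parameters = list(parameters)
        self._decay = decay
        # Shadow copies of the parameters, updated after each training step.
        self._shadows = {name: param.data.clone() for name, param in self._parameters}

The trainer would then construct it as ExponentialMovingAverage(model.named_parameters()), and the named_parameters argument can be dropped from the individual method calls.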

self._average_values[name] = param.data.clone()
self._backup_values[name] = param.data.clone()

def apply(self, num_updates: int = None, named_parameters: Iterable = None) -> None:
Contributor

I know this is what TensorFlow calls it, but I can't say that I like the name apply here. If it were me, I would call it something like update_averages.

Contributor Author

Ok, will do.

The optional `num_updates` parameter allows one to tweak the decay rate
dynamically. If passed, the actual decay rate used is:

`min(decay, (1 + num_updates) / (10 + num_updates))`
Contributor

where does the second equation here come from?

Contributor Author

You can refer to this TensorFlow implementation.
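In code, the dynamic decay described in that docstring boils down to something like the following (a small sketch; _effective_decay is an illustrative helper name, and self._decay is the configured decay):

def _effective_decay(self, num_updates: int = None) -> float:
    # Early in training the ratio (1 + n) / (10 + n) is well below the
    # configured decay, so the average adapts quickly to the still-changing
    # weights; as num_updates grows the ratio approaches 1 and the configured
    # decay takes over.
    if num_updates is None:
        return self._decay
    return min(self._decay, (1 + num_updates) / (10 + num_updates))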

named_parameters = self._model.named_parameters()
for name, param in named_parameters:
    new_average_value = (1.0 - decay) * param.data + decay * self._average_values[name]
    self._average_values[name] = new_average_value.clone()
Contributor

why do you need clone here? I would have written something like

self._average_values[name] = (1.0 - decay) * param.data.detach() + decay * self._average_values[name]

is there a reason why two steps is better?

Contributor Author

You're right, the clone is not necessary. I will change that.

if named_parameters is None:
    named_parameters = self._model.named_parameters()
for name, param in named_parameters:
    self._backup_values[name] = param.data.clone()
Contributor

why do you need clone here?

Contributor Author

clone() might be necessary here, to make sure that updating the parameters will not affect the corresponding entries in _backup_values?

if named_parameters is None:
    named_parameters = self._model.named_parameters()
for name, param in named_parameters:
    param.data = self._backup_values[name].clone()
Contributor

and here?

Contributor Author

Same as above. Using clone() is safer?
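For context, here is a rough sketch of the backup-and-restore pair being discussed, using the names from the snippets above (the method names are illustrative; the clone() calls are exactly what this thread is debating):

def assign_average_value(self) -> None:
    # Back up the current parameter values, then load the moving averages
    # into the model (e.g. before validation or before saving a checkpoint).
    for name, param in self._model.named_parameters():
        self._backup_values[name] = param.data.clone()
        param.data = self._average_values[name].clone()

def restore(self) -> None:
    # Put the backed-up training values back so optimization resumes
    # from the un-averaged weights.
    for name, param in self._model.named_parameters():
        param.data = self._backup_values[name].clone()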

from allennlp.models.model import Model


class ExponentialMovingAverage:
Contributor

are there other kinds of moving averages we might care about?

should there be a MovingAverage base class and then the trainer can be agnostic to what type of moving average it's using, rather than hardcoding this one?

Contributor Author

TensorFlow has a moving_averages.py file, but it seems that they only implemented the exponential moving average.
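To make the proposal concrete, here is a rough sketch of what the abstract version could look like, assuming the base class owns the shadow copies and subclasses only define the update rule (names are illustrative and may differ from what was ultimately merged):

from typing import Dict, Iterable, Tuple

import torch


class MovingAverage:
    """Keeps a 'shadow' copy of each named parameter; subclasses decide how
    the shadows are updated after each training step."""

    def __init__(self, parameters: Iterable[Tuple[str, torch.Tensor]]) -> None:
        self._parameters = list(parameters)
        self._shadows: Dict[str, torch.Tensor] = {
            name: param.data.clone() for name, param in self._parameters
        }

    def apply(self, num_updates: int = None) -> None:
        raise NotImplementedError


class ExponentialMovingAverage(MovingAverage):
    def __init__(self,
                 parameters: Iterable[Tuple[str, torch.Tensor]],
                 decay: float = 0.9999) -> None:
        super().__init__(parameters)
        self._decay = decay

    def apply(self, num_updates: int = None) -> None:
        decay = self._decay
        if num_updates is not None:
            # Ramp the decay up early in training (see the docstring above).
            decay = min(decay, (1 + num_updates) / (10 + num_updates))
        for name, param in self._parameters:
            # Move each shadow a small step toward the current parameter value.
            self._shadows[name] = decay * self._shadows[name] + (1 - decay) * param.data

The trainer can then be written against MovingAverage and stay agnostic to the specific averaging scheme.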

@matt-gardner
Contributor

@yizhongw, to answer your question about the docs: you need to add a reference to the module in our API docs, here. Ping me on slack if you need help figuring out how this works.

@joelgrus
Contributor

@yizhongw do you want me to take a crack at the abstract MovingAverage version?

@yizhongw
Contributor Author

@joelgrus Yes, sure! Then I can build on top of that.

@joelgrus
Contributor

@yizhongw ok, take a look at what I pushed up, I'm not 100% sure I got all the clone()s and datas right, but the tests pass.

@yizhongw
Contributor Author

@joelgrus Awesome! Thanks for implementing all the things I needed to do. I checked the clone()s and they all look good. I will run my full model with this trainer to make sure everything works, but I think this code could already be checked in after fixing the pylint and doc errors.

@yizhongw yizhongw force-pushed the ema-trainer branch 3 times, most recently from 37f4067 to 54c74e5 Compare January 23, 2019 00:49
@joelgrus
Contributor

since I now wrote a lot of this code, probably someone else should review it

@joelgrus joelgrus requested a review from DeNeutoy January 24, 2019 18:33
Contributor
@DeNeutoy left a comment


LGTM, modulo the in-place copy.

decay = self._decay

for name, parameter in self._parameters:
    self._shadows[name] = decay * self._shadows[name] + (1.0 - decay) * parameter.clone()
Contributor

You can do this in place:

self._shadows[name].mul_(decay).add_((1 - decay) * parameter.data)

This is actually important, I think, because copying large parameters every step is expensive; e.g., all the optimizers use in-place operations.

@@ -513,6 +535,11 @@ def _save_checkpoint(self, epoch: Union[int, str]) -> None:
The epoch of training. If the checkpoint is saved in the middle
of an epoch, the parameter is a string with the epoch and timestamp.
"""
# If moving averagew are used for parameters, we save
Contributor

averages
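For readers following along, the intent of that block is roughly the following (a hedged sketch using a hypothetical standalone helper, and assuming the moving-average object exposes the assign_average_value()/restore() pair discussed earlier; the actual trainer method differs in its surrounding details):

import torch


def save_checkpoint_with_averages(model: torch.nn.Module,
                                  moving_average,  # e.g. a MovingAverage, or None
                                  checkpoint_path: str) -> None:
    # If moving averages are used for parameters, swap the averaged values into
    # the model so that the saved weights are the averaged ones...
    if moving_average is not None:
        moving_average.assign_average_value()

    torch.save(model.state_dict(), checkpoint_path)

    # ...then restore the original training values so training can continue.
    if moving_average is not None:
        moving_average.restore()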

@DeNeutoy
Contributor

Oh, another idea:

it's annoying to have to keep these EMAs on the same device as the parameters, because you're essentially tripling the memory requirement of the model. What if you added an accumulate_on_cpu option, where you compute the update equation on the GPU but then do self._shadow[name].copy_(update_value, non_blocking=True)? That way we wouldn't have to accumulate on the GPU, and I don't think we would have to wait for the GPU -> CPU copies this method would require, because they can be non-blocking. What do you think?
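A rough sketch of what that option could look like for the ExponentialMovingAverage.apply sketched earlier, with a hypothetical accumulate_on_cpu flag (one caveat: GPU-to-CPU copies are only truly asynchronous when the destination tensor is in pinned memory):

def apply(self, num_updates: int = None) -> None:
    decay = self._decay
    if num_updates is not None:
        decay = min(decay, (1 + num_updates) / (10 + num_updates))
    for name, parameter in self._parameters:
        if self._accumulate_on_cpu:
            # Compute the new average on the parameter's device (typically the
            # GPU), then ship it to the CPU-resident shadow without blocking
            # the training loop on the device-to-host copy.
            update = decay * self._shadows[name].to(parameter.device) + (1 - decay) * parameter.data
            self._shadows[name].copy_(update, non_blocking=True)
        else:
            # Default: shadows live on the same device as the parameters and
            # are updated in place.
            self._shadows[name].mul_(decay).add_((1 - decay) * parameter.data)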

@joelgrus
Contributor

sounds like a good idea, let me make an attempt at it

@joelgrus joelgrus merged commit 4465a6e into allenai:master Jan 24, 2019
@yizhongw yizhongw deleted the ema-trainer branch January 24, 2019 22:18