Support exponential moving average in the default trainer. #2406
Conversation
@matt-gardner Could you give me any hints on why the checks failed? The documentation looks good to me, but it still reported errors.
```
Parameters
----------
model: ``Model``, required.
    An AllenNLP model to be optimized.
```
I am not convinced about this API. Why not just pass in the named parameters in the constructor (instead of the model) and then you can get rid of that option in the method calls.
I was trying to keep open the option of applying EMA to only a subset of the parameters. But since this might never be used, I will change it to what you suggested.
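For concreteness, here is a minimal sketch of the constructor-based API being suggested. The method names (`update_averages`, `assign`, `restore`) follow the naming discussed later in this thread; all of it is illustrative rather than the PR's final code:

```python
from typing import Dict, Iterable, Tuple

import torch


class ExponentialMovingAverage:
    """Takes the named parameters once in the constructor,
    instead of passing a whole ``Model`` around."""

    def __init__(self,
                 named_parameters: Iterable[Tuple[str, torch.nn.Parameter]],
                 decay: float = 0.9999) -> None:
        self._decay = decay
        # Materialize the iterable: model.named_parameters() returns a
        # generator, which would be exhausted after a single pass.
        self._parameters = list(named_parameters)
        self._shadows: Dict[str, torch.Tensor] = {
            name: param.data.clone() for name, param in self._parameters
        }
        self._backups: Dict[str, torch.Tensor] = {
            name: param.data.clone() for name, param in self._parameters
        }

    def update_averages(self) -> None:
        # shadow <- decay * shadow + (1 - decay) * param
        for name, param in self._parameters:
            self._shadows[name] = (self._decay * self._shadows[name]
                                   + (1.0 - self._decay) * param.data)

    def assign(self) -> None:
        # Back up the live weights, then load the averaged ones.
        for name, param in self._parameters:
            self._backups[name] = param.data.clone()
            param.data.copy_(self._shadows[name])

    def restore(self) -> None:
        # Put the live (non-averaged) weights back.
        for name, param in self._parameters:
            param.data.copy_(self._backups[name])
```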
```python
            self._average_values[name] = param.data.clone()
            self._backup_values[name] = param.data.clone()

    def apply(self, num_updates: int = None, named_parameters: Iterable = None) -> None:
```
I know this is what TensorFlow calls it, but I can't say that I like the name `apply` here. If it were me I would call it something like `update_averages` or something.
Ok, will do.
The optional `num_updates` parameter allows one to tweak the decay rate
dynamically. If passed, the actual decay rate used is:

`min(decay, (1 + num_updates) / (10 + num_updates))`
where does the second equation here come from?
You can refer to this TensorFlow implementation.
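For intuition: the second term is a warm-up schedule. Early in training the averages track the raw parameters closely (small effective decay), and the effective decay only approaches the configured cap as `num_updates` grows. A quick numeric check (the formula is from the docstring above; the script is just illustration):

```python
decay = 0.9999
for num_updates in (0, 9, 99, 999, 9999):
    effective = min(decay, (1 + num_updates) / (10 + num_updates))
    print(num_updates, round(effective, 5))
# 0    0.1      -> the average is dominated by the current parameters
# 9    0.52632
# 99   0.91743
# 999  0.99108
# 9999 0.9991   -> still below the 0.9999 cap; the min() binds even later
```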
```python
            named_parameters = self._model.named_parameters()
        for name, param in named_parameters:
            new_average_value = (1.0 - decay) * param.data + decay * self._average_values[name]
            self._average_values[name] = new_average_value.clone()
```
why do you need `clone` here? I would have written something like

```python
self._average_values[name] = (1.0 - decay) * param.data.detach() + decay * self._average_values[name]
```

Is there a reason why two steps is better?
Yes, this is not necessary. I will change that.
```python
        if named_parameters is None:
            named_parameters = self._model.named_parameters()
        for name, param in named_parameters:
            self._backup_values[name] = param.data.clone()
```
why do you need clone here?
`clone()` might be necessary here, to make sure that updates to the parameters will not affect the corresponding entries in `_backup_values`?
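The concern is justified. `param.data` is a live tensor, so storing it without `clone()` would make the backup alias the parameter's storage, and the optimizer's in-place updates would silently change the "backup" too. A standalone demonstration (not PR code):

```python
import torch

param = torch.ones(3)
backup_alias = param         # no clone: shares storage with the parameter
backup_copy = param.clone()  # clone: independent storage

param.add_(1.0)              # an in-place update, as optimizers perform

print(backup_alias)  # tensor([2., 2., 2.])  -- the "backup" changed too
print(backup_copy)   # tensor([1., 1., 1.])  -- preserved
```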
```python
        if named_parameters is None:
            named_parameters = self._model.named_parameters()
        for name, param in named_parameters:
            param.data = self._backup_values[name].clone()
```
and here?
Same as above. Using `clone()` is safer?
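To see where backup and restore fit, here is a hedged sketch of the intended usage in a training loop, reusing the `ExponentialMovingAverage` sketch from earlier in this thread (the loop itself is illustrative, not the trainer's code):

```python
import torch

model = torch.nn.Linear(4, 1)
optimizer = torch.optim.SGD(model.parameters(), lr=0.1)
ema = ExponentialMovingAverage(model.named_parameters(), decay=0.99)

for step in range(100):
    batch = torch.randn(8, 4)
    loss = model(batch).pow(2).mean()
    optimizer.zero_grad()
    loss.backward()
    optimizer.step()
    ema.update_averages()   # fold the freshly updated weights into the shadows

ema.assign()                # back up live weights, swap in the averages
with torch.no_grad():
    val_loss = model(torch.randn(8, 4)).pow(2).mean()  # evaluate smoothed weights
ema.restore()               # put the live weights back to continue training
```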
```python
from allennlp.models.model import Model


class ExponentialMovingAverage:
```
are there other kinds of moving averages we might care about? should there be a `MovingAverage` base class, and then the trainer can be agnostic to what type of moving average it's using, rather than hardcoding this one?
TensorFlow has a `moving_averages.py` file, but it seems that they only implemented the exponential moving average.
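For reference, a minimal sketch of what such a `MovingAverage` base class could look like, with the exponential version as one subclass. The names and signatures here are guesses, not the final PR API:

```python
from typing import Dict, Iterable, Optional, Tuple

import torch


class MovingAverage:
    """Base class: the trainer only needs these three operations."""

    def __init__(self, named_parameters: Iterable[Tuple[str, torch.nn.Parameter]]) -> None:
        self._parameters = list(named_parameters)
        self._shadows: Dict[str, torch.Tensor] = {
            name: param.data.clone() for name, param in self._parameters
        }
        self._backups: Dict[str, torch.Tensor] = {}

    def update_averages(self, num_updates: Optional[int] = None) -> None:
        raise NotImplementedError

    def assign(self) -> None:
        for name, param in self._parameters:
            self._backups[name] = param.data.clone()
            param.data.copy_(self._shadows[name])

    def restore(self) -> None:
        for name, param in self._parameters:
            param.data.copy_(self._backups[name])


class ExponentialMovingAverage(MovingAverage):
    def __init__(self, named_parameters, decay: float = 0.9999) -> None:
        super().__init__(named_parameters)
        self._decay = decay

    def update_averages(self, num_updates: Optional[int] = None) -> None:
        decay = self._decay
        if num_updates is not None:
            # Warm-up schedule from the docstring discussed above.
            decay = min(decay, (1 + num_updates) / (10 + num_updates))
        for name, param in self._parameters:
            self._shadows[name] = (decay * self._shadows[name]
                                   + (1.0 - decay) * param.data)
```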
@yizhongw do you want me to take a crack at the abstract `MovingAverage` base class?
@joelgrus Yes, sure! Then I could modify my implementation based on that.
@yizhongw ok, take a look at what I pushed up. I'm not 100% sure I got all the `clone()`s and `.data`s right, but the tests pass.
@joelgrus Awesome! Thanks for implementing all those things for me. I checked the `clone()`s and all of them look good. I will run my full model with this trainer to make sure everything works, but I think this code could already be checked in after fixing the pylint and doc errors.
Force-pushed from 37f4067 to 54c74e5.
since I now wrote a lot of this code, probably someone else should review it
LGTM, modulo the in-place copy.
allennlp/training/moving_average.py (Outdated)
```python
        decay = self._decay

        for name, parameter in self._parameters:
            self._shadows[name] = decay * self._shadows[name] + (1.0 - decay) * parameter.clone()
```
You can do this in-place:

```python
self._shadows[name].mul_(decay).add_((1 - decay) * parameter.data)
```

This is actually important, I think, because copying large parameters every step is expensive; e.g. all the optimizers use in-place operations.
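A sketch of the whole update written with the suggested in-place ops, with `_shadows` and `_parameters` shaped as in the snippet above. In recent PyTorch, `add_(other, alpha=...)` even avoids the temporary that `(1 - decay) * parameter.data` would allocate; treat the exact call style as an assumption about the PyTorch version:

```python
import torch


def update_averages_inplace(shadows, named_parameters, decay=0.9999):
    # shadow <- decay * shadow + (1 - decay) * param, without allocating
    # a fresh tensor per parameter on every training step.
    with torch.no_grad():
        for name, param in named_parameters:
            shadows[name].mul_(decay).add_(param.data, alpha=1.0 - decay)


# Tiny usage check:
model = torch.nn.Linear(4, 2)
shadows = {name: p.data.clone() for name, p in model.named_parameters()}
update_averages_inplace(shadows, list(model.named_parameters()), decay=0.99)
```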
allennlp/training/trainer.py (Outdated)

```
@@ -513,6 +535,11 @@ def _save_checkpoint(self, epoch: Union[int, str]) -> None:
        The epoch of training. If the checkpoint is saved in the middle
        of an epoch, the parameter is a string with the epoch and timestamp.
        """
        # If moving averagew are used for parameters, we save
```
Typo: `averagew` should be `averages`.
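The pattern that the truncated comment seems to describe: swap the averaged weights in before serializing, then swap the live weights back so training continues unaffected. A hedged standalone sketch (the helper name and file naming are hypothetical; the trainer's real checkpoint logic also saves optimizer and training state):

```python
import os

import torch


def save_checkpoint_with_averages(model, moving_average, serialization_dir, epoch):
    # Hypothetical helper: checkpoint the *averaged* weights, since those
    # are what we want at inference time, then restore the live weights.
    if moving_average is not None:
        moving_average.assign()
    path = os.path.join(serialization_dir, "model_state_epoch_{}.th".format(epoch))
    torch.save(model.state_dict(), path)
    if moving_average is not None:
        moving_average.restore()
```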
Oh, another idea: it's annoying to have to keep these EMAs on the same device as the parameters, because you're essentially tripling the memory requirement of the model. What if you added a …
sounds like a good idea, let me make an attempt at it
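The suggestion above is cut off, but given the memory concern it presumably involves keeping the averaged values on a cheaper device than the parameters. Purely as a guess at the idea (not the PR's code), the shadows could live on CPU at the cost of a transfer per update:

```python
import torch


def make_cpu_shadows(named_parameters):
    # Keep the running averages in CPU memory so the GPU holds only the
    # live parameters; updates then pay a device-to-host transfer instead.
    return {name: param.data.to("cpu").clone() for name, param in named_parameters}


def update_cpu_shadows(shadows, named_parameters, decay=0.9999):
    with torch.no_grad():
        for name, param in named_parameters:
            shadows[name].mul_(decay).add_(param.data.to("cpu"), alpha=1.0 - decay)
```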