Incorrect Masked Huber Loss calculation #14

Open
bobbiesbob opened this issue Feb 20, 2018 · 6 comments

Comments

@bobbiesbob

In line 57 of masked_huber_loss.lua, the comment says 1 is for impossible features.

It is actually 0 for impossible features.

So line 65 should actually be (batch_size * feature_size) / self.mask_sum:sum()

Lines 58-60 should also be changed.

@lifrordi
Owner
lifrordi commented May 11, 2018

No, the code is correct: there are two variables, mask and mask_multiplier, with different semantics for 0 and 1.
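
For readers less familiar with the Torch code, a minimal sketch of the two conventions, with hypothetical names and values (not the actual contents of masked_huber_loss.lua):

```lua
-- Illustrative sketch only: variable names and values are hypothetical,
-- not copied from masked_huber_loss.lua.
require 'torch'

local feature_size = 6

-- mask as supplied to the criterion: 1 = possible hand, 0 = impossible hand
local mask = torch.Tensor({{1, 1, 0, 0, 0, 0},
                           {1, 1, 1, 1, 1, 0}})

-- an inverted copy, in which 1 marks an impossible hand (the convention the
-- comment on line 57 describes)
local inverted_mask = mask:clone():mul(-1):add(1)

print(mask:sum())           -- 7: possible hands in the batch
print(inverted_mask:sum())  -- 5: impossible hands in the batch
```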

@dmorrill10

I got a question about this too, so to clarify:

  • mask_sum has the number of possible hands, since it's the sum over columns of mask.
  • mask_multiplier = (feature_size - mask_sum) / feature_size is then the number of impossible hands divided by the number of total hands.
  • The loss gradients, dloss_doutput, are divided by mask_multiplier, so the adjusted gradient is the original gradient multiplied by the number of total hands, divided by the number of impossible hands.

Why is the gradient not divided by the number of possible hands?
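
To make the question concrete, here is a small numeric sketch of the scaling described in the bullets above (the values are made up; names follow the thread):

```lua
-- Numeric sketch of the quantities in the bullets above (illustrative values only).
require 'torch'

local batch_size, feature_size = 2, 6
local mask_sum = torch.Tensor({{2}, {5}})  -- possible hands per example, as described above

-- mask_multiplier = (feature_size - mask_sum) / feature_size
local mask_multiplier = mask_sum:clone():mul(-1):add(feature_size):div(feature_size)
-- mask_multiplier is now {4/6, 1/6}: the fraction of impossible hands per example

-- dividing dloss_doutput by mask_multiplier scales each row by
-- feature_size / (#impossible hands), not feature_size / (#possible hands)
local dloss_doutput = torch.ones(batch_size, feature_size)
local rescaled = torch.cdiv(dloss_doutput, mask_multiplier:expandAs(dloss_doutput))
print(rescaled)  -- rows scaled by 1.5 and 6 respectively
```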

@JaysenStark

@dmorrill10 I have exactly the same question as you; did you figure it out?

@dmorrill10

@JaysenStark no, not yet.

@KK666-AI
KK666-AI commented May 9, 2019

@dmorrill10 @lifrordi I think the correct loss should be loss = avg( sum( |pred_i - actual_i| ) / mask_i ), where the sum runs over one sample's hands and mask_i is that sample's number of possible hands. That is, each sample should first get its own average loss, and the batch loss should then be the average of those per-sample losses.

The implementation is confusing because mask_multiplier is effectively used to reweight the batch loss. With stochastic gradient descent, this mask_multiplier changes the scale of the derivative, which makes it hard for an optimizer such as Adam to choose an appropriate learning rate for the next iteration.
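
A minimal sketch of the proposed per-sample normalization, assuming a mask with 1 for possible hands; the helper name is hypothetical, and plain absolute error stands in for the Huber term for brevity:

```lua
-- Sketch of the proposed loss: average each sample's masked error over its own
-- number of possible hands, then average over the batch.
require 'torch'

local function per_sample_masked_loss(pred, target, mask)
  local err = torch.abs(pred - target)             -- |pred_i - actual_i|, elementwise
  err:cmul(mask)                                   -- zero out impossible hands
  local per_sample = err:sum(2):cdiv(mask:sum(2))  -- divide by each sample's possible-hand count
  return per_sample:mean()                         -- average over the batch
end

local pred   = torch.Tensor({{0.2, 0.4, 0.0}, {0.1, 0.3, 0.5}})
local target = torch.Tensor({{0.0, 0.5, 0.0}, {0.0, 0.3, 0.4}})
local mask   = torch.Tensor({{1, 1, 0},       {1, 1, 1}})
print(per_sample_masked_loss(pred, target, mask))
```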

@KK666-AI
KK666-AI commented May 9, 2019

I got a question about this too, so to clarify:

  • mask_sum has the number of possible hands, since it's the sum over columns of mask.
  • mask_multiplier = (feature_size - mask_sum) / feature_size is then the number of impossible hands divided by the number of total hands.
  • The loss gradients, dloss_doutput, are divided by mask_multiplier, so the adjusted gradient is the original gradient multiplied by the number of total hands, divided by the number of impossible hands.

Why is the gradient not divided by the number of possible hands?

The gradient doesn't need to be divided by the number of possible hands because the loss is already normalized by this number; since the gradient is the derivative of that normalized loss, the normalization carries through automatically.
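
In symbols: if one sample's loss is already averaged over its k possible hands, the factor 1/k carries straight through to the gradient (a worked statement of the point above, with ℓ standing for the per-hand Huber term):

```latex
L = \frac{1}{k} \sum_{i \in \text{possible}} \ell(y_i - t_i)
\qquad\Longrightarrow\qquad
\frac{\partial L}{\partial y_i} = \frac{1}{k}\, \ell'(y_i - t_i)
```

so the backward pass does not need a second division by the number of possible hands.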
