Open
Description
This is more of a sanity-check question than an issue. I have trained the routing transformer encoder-decoder in the past and was really impressed by the speed: I got about 4 iter/sec training on sequences of length 12,000. Now I am training a language model with a depth equal to the encoder/decoder depth of my old model, keeping all other parameters the same, and the training rate for the LM has fallen below 1 iter/sec. I was wondering if this is to be expected, or whether there may be something wrong that I need to look into. Thank you for your help.
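For concreteness, here is a rough sketch of the two setups I am comparing, assuming the `routing_transformer` package API (`RoutingTransformerEncDec` / `RoutingTransformerLM`). The hyperparameter values below are placeholders, not my actual settings:

```python
import torch
from routing_transformer import RoutingTransformerEncDec, RoutingTransformerLM

# Old model: encoder-decoder, ~4 iter/sec on ~12k-token sequences.
# All values are placeholders, not my real hyperparameters.
enc_dec = RoutingTransformerEncDec(
    dim = 512,
    enc_num_tokens = 20000,
    enc_depth = 6,
    enc_heads = 8,
    enc_max_seq_len = 12288,   # placeholder near the ~12k length I use
    dec_num_tokens = 20000,
    dec_depth = 6,
    dec_heads = 8,
    dec_max_seq_len = 12288
)

# New model: language model with depth equal to the enc/dec depth above,
# everything else kept the same -- training rate is now below 1 iter/sec.
lm = RoutingTransformerLM(
    num_tokens = 20000,
    dim = 512,
    depth = 6,
    heads = 8,
    max_seq_len = 12288,
    causal = True
)
```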