Open
Description
This is more of a sanity-check question than an issue. I have trained the routing transformer encoder-decoder in the past and was really impressed by the speed: I got about 4 iter/sec training on sequences of length 12,000. Now I am training a language model with a depth equal to the encoder/decoder depth of my old model, keeping all other parameters the same, and the training rate for the LM has fallen below 1 iter/sec. I was wondering if this is to be expected, or whether there may be something wrong that I need to look into. Thank you for your help.
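For concreteness, here is a rough sketch of the two setups I am comparing, assuming the `routing_transformer` package API (`RoutingTransformerEncDec` / `RoutingTransformerLM`). The hyperparameter values below are placeholders, not my actual settings:

```python
import torch
from routing_transformer import RoutingTransformerEncDec, RoutingTransformerLM

# Old model: encoder-decoder, ~4 iter/sec on ~12k-token sequences.
# All values are placeholders, not my real hyperparameters.
enc_dec = RoutingTransformerEncDec(
    dim = 512,
    enc_num_tokens = 20000,
    enc_depth = 6,
    enc_heads = 8,
    enc_max_seq_len = 12288,   # placeholder near the ~12k length I use
    dec_num_tokens = 20000,
    dec_depth = 6,
    dec_heads = 8,
    dec_max_seq_len = 12288
)

# New model: language model with depth equal to the enc/dec depth above,
# everything else kept the same -- training rate is now below 1 iter/sec.
lm = RoutingTransformerLM(
    num_tokens = 20000,
    dim = 512,
    depth = 6,
    heads = 8,
    max_seq_len = 12288,
    causal = True
)
```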