This repository was archived by the owner on Dec 16, 2022. It is now read-only.

Turn BidirectionalLM into a more-general LanguageModel class #2264

Merged

nelson-liu merged 49 commits into allenai:master from the unidirectional_lm branch on Jan 8, 2019

Conversation

@nelson-liu (Contributor) commented on Jan 2, 2019

Fixes #2255

This PR replaces the BidirectionalLM class with a more-general LanguageModel that can be used in either the unidirectional/forward setting or the bidirectional setting.

It also accordingly replaces the BidirectionalLanguageModelTokenEmbedder with a LanguageModelTokenEmbedder.

It also fixes a bug in the experiment_unsampled.jsonnet config that was preventing a test from actually being unsampled.

TODO:

  • test the unidirectional case
  • properly deprecate BidirectionalLM and BidirectionalLanguageModelTokenEmbedder
  • check docs for accuracy
  • fix user-facing training configs
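
To make the unification concrete, here is a minimal sketch (illustrative only, not the actual AllenNLP code; names and shapes are placeholders) of what a single class covering both settings amounts to: the `bidirectional` flag simply adds a backward-direction loss alongside the forward one.

```python
import torch

def lm_loss(states: torch.Tensor,
            targets: torch.Tensor,
            project: torch.nn.Linear,  # maps hidden dim -> vocab size
            bidirectional: bool) -> torch.Tensor:
    # states: (batch, seq_len, dim) contextualized token representations
    # targets: (batch, seq_len) gold token ids
    loss_fn = torch.nn.CrossEntropyLoss()
    # Forward LM: the state at position i predicts token i + 1.
    fwd_logits = project(states[:, :-1, :])
    loss = loss_fn(fwd_logits.reshape(-1, fwd_logits.size(-1)),
                   targets[:, 1:].reshape(-1))
    if bidirectional:
        # Backward LM: the state at position i predicts token i - 1.
        # (The real model would first split the contextualizer output
        # into its forward and backward halves.)
        bwd_logits = project(states[:, 1:, :])
        loss = loss + loss_fn(bwd_logits.reshape(-1, bwd_logits.size(-1)),
                              targets[:, :-1].reshape(-1))
    return loss
```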

@nelson-liu (Contributor, Author)

I don't think I'll have this finished in time for the upcoming minor release, so it'd be great to merge #2253 before this PR and the next release.

@brendan-ai2 (Contributor)

I like this idea. My main concern would be minimizing disruption, e.g. keeping trained models working. Please ping me when you're ready for a review!

@nelson-liu (Contributor, Author)

OK @brendan-ai2, this should be ready to look at! No rush, though. I'm also planning to take this code and actually run it on some data to ensure that the perplexities look more-or-less correct.

One question (for anyone): if we deprecate a class, should we keep its tests around, or is that unnecessary?

@nelson-liu (Contributor, Author)

To summarize why the diff looks so scary, and hopefully make it easier to review:

  • Moved most of the code in models/bidirectional_lm.py to models/shuffled_sentence_lm.py.
  • Kept the BidirectionalLM class around in models/bidirectional_lm.py for backwards compatibility, but it just forwards its arguments to ShuffledSentenceLM while emitting a deprecation warning (sketched below).
  • Moved most of the code in bidirectional_language_model_token_embedder.py to shuffled_sentence_language_model_token_embedder.py.
  • Kept the BidirectionalLanguageModelTokenEmbedder class around in bidirectional_language_model_token_embedder.py for backwards compatibility, but it just forwards its arguments to ShuffledSentenceLanguageModelTokenEmbedder while emitting a deprecation warning.
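
For reference, the shim pattern in the second and fourth bullets is roughly the following (a sketch only; argument handling in the real classes is more involved):

```python
import warnings

class ShuffledSentenceLM:
    """Stand-in for the new, more general model class."""
    def __init__(self, *args, **kwargs) -> None:
        pass

class BidirectionalLM(ShuffledSentenceLM):
    """Deprecated: kept only so existing configs and archives keep working."""
    def __init__(self, *args, **kwargs) -> None:
        warnings.warn(
            "BidirectionalLM is deprecated; use ShuffledSentenceLM instead.",
            DeprecationWarning,
        )
        super().__init__(*args, **kwargs)
```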

You probably want to diff what's currently in models/shuffled_sentence_lm.py against models/bidirectional_lm.py as it was before this PR to see what actually changed. Similarly, manually diffing shuffled_sentence_language_model_token_embedder.py against bidirectional_language_model_token_embedder.py makes the change much more reviewable.

@matt-gardner (Contributor)

@nelson-liu, on your deprecation question: @schmmd and I talked about this a bit a while ago, and we think it'd be good to eventually turn allennlp-internal warnings into test failures. So any test that emits a deprecation warning would either have to catch/suppress that warning or fail. If you're moving the functionality of a class to a new class, you should also move any applicable tests to the new class. Whether you keep a test on the deprecated class (one that probably only checks that it emits a deprecation warning) is up to you.
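
A sketch of what such a test might look like, assuming pytest (names are placeholders): with warnings escalated to errors elsewhere (e.g. `filterwarnings = error` in the pytest configuration), this becomes the one place where the deprecation warning is expected rather than fatal.

```python
import warnings
import pytest

class BidirectionalLM:
    """Stand-in for the deprecated class."""
    def __init__(self) -> None:
        warnings.warn("BidirectionalLM is deprecated.", DeprecationWarning)

def test_deprecated_class_warns() -> None:
    # pytest.warns fails the test if no matching warning is raised,
    # and captures it if one is, so this passes even under -W error.
    with pytest.warns(DeprecationWarning):
        BidirectionalLM()
```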

@matt-peters (Contributor)

Probably should have looked at this earlier, but I was out over the holidays, so apologies for the late feedback. What is the reasoning for explicitly adding ShuffledSentence to the names? There is nothing inside the BidirectionalLM class that restricts it to shuffled sentences vs. longer contexts, and I have used the calypso version to train on both long contexts and the shuffled Billion Word Benchmark.

The assumptions about shuffling require changes in the data preparation / iterator and in the statefulness (or not) of the contextual encoder, not in the model itself. So it seems to me the abstractions should be a LanguageModel (with a bidirectional option), plus a StatefulIterator that reads very long contexts, chunks them into smaller pieces, and does the alignment from batch to batch?
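
A rough sketch of the chunking and batch-to-batch alignment such a stateful iterator would need (hypothetical code, not an AllenNLP API): the corpus is laid out as parallel rows, and each batch's window picks up exactly where the previous batch's window ended, so the encoder's hidden state remains valid across batches.

```python
from typing import Iterator, List

def stateful_batches(token_ids: List[int], batch_size: int,
                     window: int) -> Iterator[List[List[int]]]:
    # Split the long token stream into `batch_size` contiguous rows.
    row_len = len(token_ids) // batch_size
    rows = [token_ids[i * row_len:(i + 1) * row_len]
            for i in range(batch_size)]
    # Yield aligned fixed-length windows; row r of batch t continues
    # row r of batch t - 1, so hidden state can be carried over.
    for start in range(0, row_len, window):
        yield [row[start:start + window] for row in rows]
```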

@nelson-liu (Contributor, Author)

That's a good point @matt-peters, will edit.

@nelson-liu (Contributor, Author)

@brendan-ai2, this is ready for another look, FYI.

@nelson-liu changed the title from "Turn BidirectionalLM into a more-general SentenceShuffledLM" to "Turn BidirectionalLM into a more-general LanguageModel class" on Jan 5, 2019
@brendan-ai2 (Contributor) left a comment

Thanks for all the edits! Really glad to have this PR. :) One final comment, LGTM after that!

@nelson-liu merged commit 088f0bb into allenai:master on Jan 8, 2019
@nelson-liu deleted the unidirectional_lm branch on January 8, 2019 at 00:42
@nelson-liu (Contributor, Author)

Thanks for your thorough comments, @brendan-ai2 ! Much appreciated.

WrRan pushed a commit to WrRan/allennlp that referenced this pull request on Jan 8, 2019
Turn BidirectionalLM into a more-general LanguageModel class (allenai#2264)

DeNeutoy pushed a commit that referenced this pull request Jan 31, 2019
* Fix bug in uniform_unit_scaling #2239 (#2273)

* Fix type annotation for .forward(...) in tutorial (#2122)

* Add a Contributions section to README.md (#2277)

* script for doing archive surgery (#2223)

* script for doing archive surgery

* simplify script

* Fix spelling in tutorial README (#2283)

* fix #2285 (#2286)

* Update the `find-lr` subcommand help text. (#2289)

* Update the elmo command help text.

* Update the find-lr subcommand help text.

* Add __repr__ to Vocabulary (#2293)

As it currently stands, the following is logged during training:

```
2019-01-06 10:46:21,832 - INFO - allennlp.common.from_params - instantiating class <class 'allennlp.models.language_model.LanguageModel'> from params {'bidirectional': False, 'contextualizer': {'bidirectional': False, 'dropout': 0.5, 'hidden_size': 200, 'input_size': 200, 'num_layers': 2, 'type': 'lstm'}, 'dropout': 0.5, 'text_field_embedder': {'token_embedders': {'tokens': {'embedding_dim': 200, 'type': 'embedding'}}}} and extras {'vocab': <allennlp.data.vocabulary.Vocabulary object at 0x7ff7811665f8>}
```

Note that the `Vocabulary` does not provide any useful information, since it doesn't have `__repr__` defined. This provides a fix.

* Update the base image in the Dockerfiles. (#2298)

* Don't deprecate bidirectional-language-model name (#2297)

* bump version number to v0.8.1

* Bump version numbers to v0.8.2-unreleased

* Turn BidirectionalLM into a more-general LanguageModel class (#2264)

* more help info

* typo fix

* add option '--inplace', '--force'

* clearer help text
matt-gardner pushed a commit that referenced this pull request Mar 18, 2019
* move some utilities from allennlp/scripts to allennlp/allennlp/tools

* make pylint happy

* add modules to API doc
reiyw pushed a commit to reiyw/allennlp that referenced this pull request Nov 12, 2019
TalSchuster pushed a commit to TalSchuster/allennlp-MultiLang that referenced this pull request Feb 20, 2020