This repository was archived by the owner on Dec 16, 2022. It is now read-only.

Turn BidirectionalLM into a more-general LanguageModel class #2264

Merged

nelson-liu merged 49 commits into allenai:master from the unidirectional_lm branch on Jan 8, 2019

Conversation

@nelson-liu (Contributor) commented on Jan 2, 2019

Fixes #2255

This PR replaces the BidirectionalLM class with a more-general LanguageModel that can be used in either the unidirectional/forward setting or the bidirectional setting.

It also accordingly replaces the BidirectionalLanguageModelTokenEmbedder with a LanguageModelTokenEmbedder.

It also fixes a bug in the experiment_unsampled.jsonnet config that was preventing a test from actually being unsampled.

TODO:

  • test the unidirectional case
  • properly deprecate BidirectionalLM and BidirectionalLanguageModelTokenEmbedder
  • check docs for accuracy
  • fix user-facing training configs
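
To make the unification concrete, here is a minimal sketch (illustrative only, not the actual AllenNLP code; names and shapes are placeholders) of what a single class covering both settings amounts to: the `bidirectional` flag simply adds a backward-direction loss alongside the forward one.

```python
import torch

def lm_loss(states: torch.Tensor,
            targets: torch.Tensor,
            project: torch.nn.Linear,  # maps hidden dim -> vocab size
            bidirectional: bool) -> torch.Tensor:
    # states: (batch, seq_len, dim) contextualized token representations
    # targets: (batch, seq_len) gold token ids
    loss_fn = torch.nn.CrossEntropyLoss()
    # Forward LM: the state at position i predicts token i + 1.
    fwd_logits = project(states[:, :-1, :])
    loss = loss_fn(fwd_logits.reshape(-1, fwd_logits.size(-1)),
                   targets[:, 1:].reshape(-1))
    if bidirectional:
        # Backward LM: the state at position i predicts token i - 1.
        # (The real model would first split the contextualizer output
        # into its forward and backward halves.)
        bwd_logits = project(states[:, 1:, :])
        loss = loss + loss_fn(bwd_logits.reshape(-1, bwd_logits.size(-1)),
                              targets[:, :-1].reshape(-1))
    return loss
```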

@nelson-liu (Contributor, Author)

I don't think I'll have this finished in time for the upcoming minor release, so it'd be great to merge #2253 before this PR and the next release.

@brendan-ai2 (Contributor)

I like this idea. My main concern would be minimizing disruption, e.g. keeping trained models working. Please ping me when you're ready for a review!

@nelson-liu (Contributor, Author)

OK @brendan-ai2, this should be ready to look at! No rush, though. I'm also planning to take this code and actually run it on some data to ensure that the perplexities look more-or-less correct.

One question (for anyone): if we deprecate a class, should we keep its tests around, or is that unnecessary?

@nelson-liu (Contributor, Author)

To summarize why the diff looks so scary, and hopefully make it easier to review:

  • Moved most of the code in models/bidirectional_lm.py to models/shuffled_sentence_lm.py.
  • Kept the BidirectionalLM class around in models/bidirectional_lm.py for backwards compatibility, but it just forwards its arguments to ShuffledSentenceLM while emitting a deprecation warning (sketched below).
  • Moved most of the code in bidirectional_language_model_token_embedder.py to shuffled_sentence_language_model_token_embedder.py.
  • Kept the BidirectionalLanguageModelTokenEmbedder class around in bidirectional_language_model_token_embedder.py for backwards compatibility, but it just forwards its arguments to ShuffledSentenceLanguageModelTokenEmbedder while emitting a deprecation warning.
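
For reference, the shim pattern in the second and fourth bullets is roughly the following (a sketch only; argument handling in the real classes is more involved):

```python
import warnings

class ShuffledSentenceLM:
    """Stand-in for the new, more general model class."""
    def __init__(self, *args, **kwargs) -> None:
        pass

class BidirectionalLM(ShuffledSentenceLM):
    """Deprecated: kept only so existing configs and archives keep working."""
    def __init__(self, *args, **kwargs) -> None:
        warnings.warn(
            "BidirectionalLM is deprecated; use ShuffledSentenceLM instead.",
            DeprecationWarning,
        )
        super().__init__(*args, **kwargs)
```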

You probably want to diff what's currently in models/shuffled_sentence_lm.py against models/bidirectional_lm.py as it was before this PR to see what actually changed. Similarly, manually diffing shuffled_sentence_language_model_token_embedder.py against bidirectional_language_model_token_embedder.py makes the change much more reviewable.

@matt-gardner (Contributor)

@nelson-liu, on your deprecation question: @schmmd and I talked about this a bit a while ago, and we think it'd be good to eventually turn allennlp-internal warnings into test failures. So any test that emits a deprecation warning would either have to catch/suppress that warning or fail. If you're moving the functionality of a class to a new class, you should also move any applicable tests to the new class. Whether you keep a test on the deprecated class (one that probably only checks that it emits a deprecation warning) is up to you.
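
A sketch of what such a test might look like, assuming pytest (names are placeholders): with warnings escalated to errors elsewhere (e.g. `filterwarnings = error` in the pytest configuration), this becomes the one place where the deprecation warning is expected rather than fatal.

```python
import warnings
import pytest

class BidirectionalLM:
    """Stand-in for the deprecated class."""
    def __init__(self) -> None:
        warnings.warn("BidirectionalLM is deprecated.", DeprecationWarning)

def test_deprecated_class_warns() -> None:
    # pytest.warns fails the test if no matching warning is raised,
    # and captures it if one is, so this passes even under -W error.
    with pytest.warns(DeprecationWarning):
        BidirectionalLM()
```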

@matt-peters (Contributor)

Probably should have looked at this earlier, but I was out over the holidays, so apologies for the late feedback. What is the reasoning for explicitly adding ShuffledSentence to the names? There is nothing inside the BidirectionalLM class that restricts it to shuffled sentences vs. longer contexts, and I have used the calypso version to train on both long contexts and the shuffled Billion Word Benchmark.

The assumptions about shuffling require changes in the data preparation / iterator and in the statefulness (or not) of the contextual encoder, not in the model itself. So it seems to me the abstractions should be a LanguageModel (with a bidirectional option), plus a StatefulIterator that reads very long contexts, chunks them into smaller pieces, and does the alignment from batch to batch?
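
A rough sketch of the chunking and batch-to-batch alignment such a stateful iterator would need (hypothetical code, not an AllenNLP API): the corpus is laid out as parallel rows, and each batch's window picks up exactly where the previous batch's window ended, so the encoder's hidden state remains valid across batches.

```python
from typing import Iterator, List

def stateful_batches(token_ids: List[int], batch_size: int,
                     window: int) -> Iterator[List[List[int]]]:
    # Split the long token stream into `batch_size` contiguous rows.
    row_len = len(token_ids) // batch_size
    rows = [token_ids[i * row_len:(i + 1) * row_len]
            for i in range(batch_size)]
    # Yield aligned fixed-length windows; row r of batch t continues
    # row r of batch t - 1, so hidden state can be carried over.
    for start in range(0, row_len, window):
        yield [row[start:start + window] for row in rows]
```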

@nelson-liu (Contributor, Author)

That's a good point @matt-peters, will edit.

@nelson-liu (Contributor, Author)

@brendan-ai2, this is ready for another look, FYI.

@nelson-liu changed the title from "Turn BidirectionalLM into a more-general SentenceShuffledLM" to "Turn BidirectionalLM into a more-general LanguageModel class" on Jan 5, 2019
@brendan-ai2 (Contributor) left a comment

Thanks for all the edits! Really glad to have this PR. :) One final comment, LGTM after that!

@nelson-liu merged commit 088f0bb into allenai:master on Jan 8, 2019
@nelson-liu deleted the unidirectional_lm branch on January 8, 2019 at 00:42
@nelson-liu (Contributor, Author)

Thanks for your thorough comments, @brendan-ai2 ! Much appreciated.

WrRan pushed a commit to WrRan/allennlp that referenced this pull request on Jan 8, 2019
Turn BidirectionalLM into a more-general LanguageModel class (allenai#2264)

DeNeutoy pushed a commit that referenced this pull request Jan 31, 2019
* Fix bug in uniform_unit_scaling #2239 (#2273)

* Fix type annotation for .forward(...) in tutorial (#2122)

* Add a Contributions section to README.md (#2277)

* script for doing archive surgery (#2223)

* script for doing archive surgery

* simplify script

* Fix spelling in tutorial README (#2283)

* fix #2285 (#2286)

* Update the `find-lr` subcommand help text. (#2289)

* Update the elmo command help text.

* Update the find-lr subcommand help text.

* Add __repr__ to Vocabulary (#2293)

As it currently stands, the following is logged during training:

```
2019-01-06 10:46:21,832 - INFO - allennlp.common.from_params - instantiating class <class 'allennlp.models.language_model.LanguageModel'> from params {'bidirectional': False, 'contextualizer': {'bidirectional': False, 'dropout': 0.5, 'hidden_size': 200, 'input_size': 200, 'num_layers': 2, 'type': 'lstm'}, 'dropout': 0.5, 'text_field_embedder': {'token_embedders': {'tokens': {'embedding_dim': 200, 'type': 'embedding'}}}} and extras {'vocab': <allennlp.data.vocabulary.Vocabulary object at 0x7ff7811665f8>}
```

Note that the `Vocabulary` does not provide any useful information, since it doesn't have `__repr__` defined. This provides a fix.

* Update the base image in the Dockerfiles. (#2298)

* Don't deprecate bidirectional-language-model name (#2297)

* bump version number to v0.8.1

* Bump version numbers to v0.8.2-unreleased

* Turn BidirectionalLM into a more-general LanguageModel class (#2264)

* more help info

* typo fix

* add option '--inplace', '--force'

* clearer help text
matt-gardner pushed a commit that referenced this pull request Mar 18, 2019
* move some utilities from allennlp/scripts to allennlp/allennlp/tools

* make pylint happy

* add modules to API doc
reiyw pushed a commit to reiyw/allennlp that referenced this pull request Nov 12, 2019
TalSchuster pushed a commit to TalSchuster/allennlp-MultiLang that referenced this pull request Feb 20, 2020