script for doing archive surgery by joelgrus · Pull Request #2223 · allenai/allennlp · GitHub
This repository was archived by the owner on Dec 16, 2022. It is now read-only.

script for doing archive surgery #2223

Merged
merged 3 commits into allenai:master on Jan 5, 2019

Conversation

joelgrus
Contributor
@joelgrus joelgrus commented Dec 20, 2018

allows you to modify a model.tar.gz by either

  • inspecting its config
  • adding a new key-value pair
  • deleting an existing key-value pair
  • moving a value to a different location (which includes renaming a key)
  • replacing a value

unfortunately, it can only do one of these at a time. I think that covers a lot of use cases, but the usage would get a lot more complicated if I allowed multiple operations in one call.

making the files_to_archive piece work right was sort of intricate
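
To make that concrete, here is a minimal sketch of what a single `replace` operation looks like under the hood. This is not the PR's code: it assumes the config lives in a top-level `config.json` inside the archive, handles only one dotted-key replacement, and ignores the `files_to_archive` bookkeeping mentioned above.

```python
import json
import os
import tarfile
import tempfile


def replace_key(archive_file: str, output_file: str, dotted_key: str, new_value: str) -> None:
    """Copy archive_file to output_file with one config value replaced."""
    with tempfile.TemporaryDirectory() as tempdir:
        # Unpack the existing model.tar.gz.
        with tarfile.open(archive_file, "r:gz") as archive:
            archive.extractall(tempdir)

        # Walk the dotted key down to its parent and overwrite the leaf value.
        config_path = os.path.join(tempdir, "config.json")
        with open(config_path) as f:
            config = json.load(f)
        *parents, leaf = dotted_key.split(".")
        node = config
        for key in parents:
            node = node[key]
        node[leaf] = new_value

        with open(config_path, "w") as f:
            json.dump(config, f, indent=4)

        # Re-pack everything into the new archive.
        with tarfile.open(output_file, "w:gz") as archive:
            for name in os.listdir(tempdir):
                archive.add(os.path.join(tempdir, name), arcname=name)
```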

--

the motivating cause was having to update the trained quarel model, which itself depended on a trained parser model, which was using a deprecated key in its config.

the entire workflow was as follows:

  1. download the model
wget https://s3-us-west-2.amazonaws.com/allennlp/models/quarel-parser-zero-2018.10.03.tar.gz
  2. inspect its config
scripts/archive_surgery.py inspect --input-file quarel-parser-zero-2018.10.03.tar.gz

inspection reveals that it depends on a tagger model in two places:

validation_dataset_reader.world_extraction_model: https://s3-us-west-2.amazonaws.com/allennlp/models/quarel-tagger-2018.10.03.tar.gz
dataset_reader.world_extraction_model: https://s3-us-west-2.amazonaws.com/allennlp/models/quarel-tagger-2018.10.03.tar.gz

so

  3. download that tagger model
wget https://s3-us-west-2.amazonaws.com/allennlp/models/quarel-tagger-2018.10.03.tar.gz
  4. inspect it
scripts/archive_surgery.py inspect --input-file quarel-tagger-2018.10.03.tar.gz

it has a deprecated entry: model.constraint_type: BIO, which we need to

  5. rename to model.label_encoding
scripts/archive_surgery.py move --input-file quarel-tagger-2018.10.03.tar.gz --output-file quarel-tagger-2018.12.20.tar.gz --old-key model.constraint_type --new-key model.label_encoding

and then

  6. upload to s3
aws s3 cp quarel-tagger-2018.12.20.tar.gz s3://allennlp/models/quarel-tagger-2018.12.20.tar.gz --acl public-read

now we need to replace the two values that referred to the old tagger model. unfortunately, we need to do that in two steps, because the surgery script doesn't handle multiple changes.

  7. replace one, save to intermediate file:
scripts/archive_surgery.py replace --input-file quarel-parser-zero-2018.10.03.tar.gz --output-file quarel-parser-zero-2018.12.20.tar.gz.intermediate --key validation_dataset_reader.world_extraction_model --value https://s3-us-west-2.amazonaws.com/allennlp/models/quarel-tagger-2018.12.20.tar.gz
  8. replace the other, save to final file
scripts/archive_surgery.py replace --input-file quarel-parser-zero-2018.12.20.tar.gz.intermediate --output-file quarel-parser-zero-2018.12.20.tar.gz --key dataset_reader.world_extraction_model --value https://s3-us-west-2.amazonaws.com/allennlp/models/quarel-tagger-2018.12.20.tar.gz
  9. finally, upload the fixed model to s3
aws s3 cp quarel-parser-zero-2018.12.20.tar.gz s3://allennlp/models/quarel-parser-zero-2018.12.20.tar.gz --acl public-read

@joelgrus joelgrus requested a review from schmmd December 20, 2018 22:56
@schmmd
Member
schmmd commented Dec 20, 2018

If you want to be able to chain, you could rely more on stdin/stdout, e.g.

cat quarel-parser-zero-2018.10.03.tar.gz | \
    scripts/archive_surgery.py replace --key validation_dataset_reader.world_extraction_model --value https://s3-us-west-2.amazonaws.com/allennlp/models/quarel-tagger-2018.12.20.tar.gz | \
    scripts/archive_surgery.py replace --key dataset_reader.world_extraction_model --value https://s3-us-west-2.amazonaws.com/allennlp/models/quarel-tagger-2018.12.20.tar.gz \
    > quarel-parser-zero-2018.12.20.tar.gz 
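
For illustration, a rough sketch of how a streaming version could work (hedged, not code from this PR; `apply_operation` is a stand-in for the actual surgery): tarfile's `r|gz` / `w|gz` modes read and write gzipped tars over non-seekable pipes, so stdin/stdout chaining is feasible.

```python
import io
import json
import sys
import tarfile


def apply_operation(data: bytes) -> bytes:
    # Stand-in for the actual surgery: round-trip the config through json
    # so the shape of the pipeline is visible.
    config = json.loads(data)
    return json.dumps(config, indent=4).encode("utf-8")


# tarfile's streaming modes ("r|gz" / "w|gz") work on the non-seekable
# pipes that stdin/stdout provide.
with tarfile.open(fileobj=sys.stdin.buffer, mode="r|gz") as tar_in, \
        tarfile.open(fileobj=sys.stdout.buffer, mode="w|gz") as tar_out:
    for member in tar_in:
        fileobj = tar_in.extractfile(member)
        if member.isfile() and member.name.endswith("config.json"):
            data = apply_operation(fileobj.read())
            member.size = len(data)  # keep the header consistent with the new data
            fileobj = io.BytesIO(data)
        tar_out.addfile(member, fileobj)
```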

command = args.command

if command != "inspect" and os.path.exists(args.output_file):
    raise ValueError("output file already exists")
Member

I wonder if you get this for free with argparse.FileType('w')
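
(A hedged aside, not code from the PR: `argparse.FileType('w')` opens and truncates the output at parse time, so it would not refuse an existing file, but the exclusive-creation mode `'x'` should:)

```python
import argparse

parser = argparse.ArgumentParser()
# FileType('x') passes mode 'x' straight to open(), so an existing file makes
# the parser exit with a "can't open" error instead of silently truncating it.
parser.add_argument("--output-file", type=argparse.FileType("x"))
args = parser.parse_args()
```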

if command == "inspect":
    sys.exit(0)

# Otherwise let's print out the new config
Member

"...to stdout"

@matt-gardner
Contributor

I'm not sure how feasible this is, but would things be simpler if you had the script pop out to EDITOR?

@joelgrus
Contributor Author

I'm really leery of writing .tar.gz files to stdout by default. (And each step unzips and rezips everything, so if people frequently need multiple changes, I'd rather design for doing it in one step; it's just hard to get the API right for that, and I wanted to get this done.)

@schmmd
Member
schmmd commented Dec 20, 2018

@matt-gardner my approach to updating these tar.gzs is to simply use vim on the tarball directly, which works well for me.

@matt-peters
Contributor

Since there is a quick, easy-to-understand, and easy-to-describe method for accomplishing this (untar the archive, edit the config, re-tar), is it really necessary to have the 200+ line script to accomplish the same thing? A line of code not written is worth two lines written? Or something like that.

@joelgrus
Contributor Author

@matt-peters my main motivation was that every time we deprecate a config key, we break every trained model.tar.gz that exists in the world, and I wanted to make it as easy as possible for our users to fix their configs.

that is, I wanted to be able to tell them "run this one-line command and it will fix your model," rather than "untar it, find the config, open a text editor, make changes, then re-tar it," which is fine if you're used to doing it but not a great user experience.

but if people think this is not a great idea, I'm not wedded to it

@nelson-liu
Contributor

fwiw: i just tried to modify a config in one of the fixtures, and the tests weren't taking the modified config (archive loading code complains about tarfile.ReadError: not a gzip file). Using this script worked beautifully, thanks.
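
(A hedged guess at one way that error can arise, not necessarily what happened here: the archive-loading code opens the file as a gzipped tar, so a fixture re-packed without compression fails in exactly that way.)

```python
import io
import json
import tarfile

# Write a tar archive *without* gzip compression ("w" instead of "w:gz").
config = json.dumps({"model": {"type": "crf_tagger"}}).encode("utf-8")
info = tarfile.TarInfo("config.json")
info.size = len(config)
with tarfile.open("model.tar.gz", "w") as archive:
    archive.addfile(info, io.BytesIO(config))

# Opening it as a gzipped tar then fails:
tarfile.open("model.tar.gz", "r:gz")  # tarfile.ReadError: not a gzip file
```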

@schmmd
Member
schmmd commented Jan 4, 2019

We should merge. I share some of the concerns raised but it's written and we already have some good feedback from users. We discussed this in issue review with @brendan-ai2 and @matt-gardner.

@joelgrus
Contributor Author
joelgrus commented Jan 4, 2019

ok, as discussed, I changed it to just launch an editor
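
(Roughly, the editor-based flow looks like the sketch below: extract the archive, open its config in `$EDITOR`, and re-pack. This is a hedged illustration with hypothetical names, not the merged script.)

```python
import os
import subprocess
import tarfile
import tempfile


def edit_archive(archive_file: str, output_file: str) -> None:
    """Open the archived config in the user's editor, then re-pack the archive."""
    with tempfile.TemporaryDirectory() as tempdir:
        with tarfile.open(archive_file, "r:gz") as archive:
            archive.extractall(tempdir)

        # Block until the editor exits, then re-tar whatever was saved.
        editor = os.environ.get("EDITOR", "vi")
        subprocess.run([editor, os.path.join(tempdir, "config.json")], check=True)

        with tarfile.open(output_file, "w:gz") as archive:
            for name in os.listdir(tempdir):
                archive.add(os.path.join(tempdir, name), arcname=name)
```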

@joelgrus joelgrus merged commit e6b0f21 into allenai:master Jan 5, 2019
WrRan pushed a commit to WrRan/allennlp that referenced this pull request Jan 6, 2019
* script for doing archive surgery

* simplify script
@schmmd
Member
schmmd commented Jan 7, 2019

Seems like a significant code reduction ;-)

DeNeutoy pushed a commit that referenced this pull request Jan 31, 2019
* Fix bug in uniform_unit_scaling #2239 (#2273)

* Fix type annotation for .forward(...) in tutorial (#2122)

* Add a Contributions section to README.md (#2277)

* script for doing archive surgery (#2223)

* script for doing archive surgery

* simplify script

* Fix spelling in tutorial README (#2283)

* fix #2285 (#2286)

* Update the `find-lr` subcommand help text. (#2289)

* Update the elmo command help text.

* Update the find-lr subcommand help text.

* Add __repr__ to Vocabulary (#2293)

As it currently stands, the following is logged during training:

```
2019-01-06 10:46:21,832 - INFO - allennlp.common.from_params - instantiating class <class 'allennlp.model
s.language_model.LanguageModel'> from params {'bidirectional': False, 'contextualizer': {'bidirectional':
 False, 'dropout': 0.5, 'hidden_size': 200, 'input_size': 200, 'num_layers': 2, 'type': 'lstm'}, 'dropout
': 0.5, 'text_field_embedder': {'token_embedders': {'tokens': {'embedding_dim': 200, 'type': 'embedding'}
}}} and extras {'vocab': <allennlp.data.vocabulary.Vocabulary object at 0x7ff7811665f8>}
```

Note that the `Vocabulary` does not provide any useful information, since it doesn't have `__repr__` defined. This provides a fix.

* Update the base image in the Dockerfiles. (#2298)

* Don't deprecate bidirectional-language-model name (#2297)

* bump version number to v0.8.1

* Bump version numbers to v0.8.2-unreleased

* Turn BidirectionalLM into a more-general LanguageModel class (#2264)

Fixes #2255

This PR replaces the `BidirectionalLM` class with a more-general `LanguageModel` that can be used in either the unidirectional/forward setting or the bidirectional setting.

It also accordingly replaces the `BidirectionalLanguageModelTokenEmbedder` with a `LanguageModelTokenEmbedder`.

Also fixes bug in the experiment_unsampled.jsonnet config that was preventing a test from actually being unsampled.

TODO:

- [x] test the unidirectional case
- [x] properly deprecate `BidirectionalLM` and `BidirectionalLanguageModelTokenEmbedder` 
- [x] check docs for accuracy
- [x] fix user-facing training configs

* more help info

* typo fix

* add option '--inplace', '--force'

* clearer help text
matt-gardner pushed a commit that referenced this pull request Mar 18, 2019
* Fix bug in uniform_unit_scaling #2239 (#2273)

* Fix type annotation for .forward(...) in tutorial (#2122)

* Add a Contributions section to README.md (#2277)

* script for doing archive surgery (#2223)

* script for doing archive surgery

* simplify script

* Fix spelling in tutorial README (#2283)

* fix #2285 (#2286)

* Update the `find-lr` subcommand help text. (#2289)

* Update the elmo command help text.

* Update the find-lr subcommand help text.

* Add __repr__ to Vocabulary (#2293)

As it currently stands, the following is logged during training:

```
2019-01-06 10:46:21,832 - INFO - allennlp.common.from_params - instantiating class <class 'allennlp.model
s.language_model.LanguageModel'> from params {'bidirectional': False, 'contextualizer': {'bidirectional':
 False, 'dropout': 0.5, 'hidden_size': 200, 'input_size': 200, 'num_layers': 2, 'type': 'lstm'}, 'dropout
': 0.5, 'text_field_embedder': {'token_embedders': {'tokens': {'embedding_dim': 200, 'type': 'embedding'}
}}} and extras {'vocab': <allennlp.data.vocabulary.Vocabulary object at 0x7ff7811665f8>}
```

Note that the `Vocabulary` does not provide any useful information, since it doesn't have `__repr__` defined. This provides a fix.

* Update the base image in the Dockerfiles. (#2298)

* Don't deprecate bidirectional-language-model name (#2297)

* bump version number to v0.8.1

* Bump version numbers to v0.8.2-unreleased

* Turn BidirectionalLM into a more-general LanguageModel class (#2264)

Fixes #2255

This PR replaces the `BidirectionalLM` class with a more-general `LanguageModel` that can be used in either the unidirectional/forward setting or the bidirectional setting.

It also accordingly replaces the `BidirectionalLanguageModelTokenEmbedder` with a `LanguageModelTokenEmbedder`.

Also fixes bug in the experiment_unsampled.jsonnet config that was preventing a test from actually being unsampled.

TODO:

- [x] test the unidirectional case
- [x] properly deprecate `BidirectionalLM` and `BidirectionalLanguageModelTokenEmbedder`
- [x] check docs for accuracy
- [x] fix user-facing training configs

* move some utilities from allennlp/scripts to allennlp/allennlp/tools

* make pylint happy

* add modules to API doc
reiyw pushed a commit to reiyw/allennlp that referenced this pull request Nov 12, 2019
TalSchuster pushed a commit to TalSchuster/allennlp-MultiLang that referenced this pull request Feb 20, 2020