script for doing archive surgery by joelgrus · Pull Request #2223 · allenai/allennlp · GitHub
This repository was archived by the owner on Dec 16, 2022. It is now read-only.

script for doing archive surgery #2223

Merged
merged 3 commits into allenai:master on Jan 5, 2019

Conversation

joelgrus
Contributor
@joelgrus joelgrus commented Dec 20, 2018

allows you to modify a model.tar.gz by either

  • inspecting its config
  • adding a new key-value pair
  • deleting an existing key-value pair
  • moving a value to a different location (which includes renaming a key)
  • replacing a value

unfortunately, it can only do one of these at a time. I think that covers a lot of use cases, but the usage would get a lot more complicated if I allowed multiple operations in one call.

making the files_to_archive piece work right was sort of intricate
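
To make that concrete, here is a minimal sketch of what a single `replace` operation looks like under the hood. This is not the PR's code: it assumes the config lives in a top-level `config.json` inside the archive, handles only one dotted-key replacement, and ignores the `files_to_archive` bookkeeping mentioned above.

```python
import json
import os
import tarfile
import tempfile


def replace_key(archive_file: str, output_file: str, dotted_key: str, new_value: str) -> None:
    """Copy archive_file to output_file with one config value replaced."""
    with tempfile.TemporaryDirectory() as tempdir:
        # Unpack the existing model.tar.gz.
        with tarfile.open(archive_file, "r:gz") as archive:
            archive.extractall(tempdir)

        # Walk the dotted key down to its parent and overwrite the leaf value.
        config_path = os.path.join(tempdir, "config.json")
        with open(config_path) as f:
            config = json.load(f)
        *parents, leaf = dotted_key.split(".")
        node = config
        for key in parents:
            node = node[key]
        node[leaf] = new_value

        with open(config_path, "w") as f:
            json.dump(config, f, indent=4)

        # Re-pack everything into the new archive.
        with tarfile.open(output_file, "w:gz") as archive:
            for name in os.listdir(tempdir):
                archive.add(os.path.join(tempdir, name), arcname=name)
```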

--

the motivating cause was having to update the trained quarel model, which itself depended on a trained parser model, which was using a deprecated key in its config.

the entire workflow was as follows:

  1. download the model
wget https://s3-us-west-2.amazonaws.com/allennlp/models/quarel-parser-zero-2018.10.03.tar.gz
  2. inspect its config
scripts/archive_surgery.py inspect --input-file quarel-parser-zero-2018.10.03.tar.gz

inspection reveals that it depends on a tagger model in two places:

validation_dataset_reader.world_extraction_model: https://s3-us-west-2.amazonaws.com/allennlp/models/quarel-tagger-2018.10.03.tar.gz
dataset_reader.world_extraction_model: https://s3-us-west-2.amazonaws.com/allennlp/models/quarel-tagger-2018.10.03.tar.gz

so

  3. download that tagger model
wget https://s3-us-west-2.amazonaws.com/allennlp/models/quarel-tagger-2018.10.03.tar.gz
  4. inspect it
scripts/archive_surgery.py inspect --input-file quarel-tagger-2018.10.03.tar.gz

it has a deprecated entry: model.constraint_type: BIO, which we need to

  5. rename to model.label_encoding
scripts/archive_surgery.py move --input-file quarel-tagger-2018.10.03.tar.gz --output-file quarel-tagger-2018.12.20.tar.gz --old-key model.constraint_type --new-key model.label_encoding

and then

  6. upload to s3
aws s3 cp quarel-tagger-2018.12.20.tar.gz s3://allennlp/models/quarel-tagger-2018.12.20.tar.gz --acl public-read

now we need to replace the two values that referred to the old tagger model. unfortunately, we need to do that in two steps, because the surgery script doesn't handle multiple changes.

  7. replace one, save to intermediate file:
scripts/archive_surgery.py replace --input-file quarel-parser-zero-2018.10.03.tar.gz --output-file quarel-parser-zero-2018.12.20.tar.gz.intermediate --key validation_dataset_reader.world_extraction_model --value https://s3-us-west-2.amazonaws.com/allennlp/models/quarel-tagger-2018.12.20.tar.gz
  8. replace the other, save to final file
scripts/archive_surgery.py replace --input-file quarel-parser-zero-2018.12.20.tar.gz.intermediate --output-file quarel-parser-zero-2018.12.20.tar.gz --key dataset_reader.world_extraction_model --value https://s3-us-west-2.amazonaws.com/allennlp/models/quarel-tagger-2018.12.20.tar.gz
  9. finally, upload the fixed model to s3
aws s3 cp quarel-parser-zero-2018.12.20.tar.gz s3://allennlp/models/quarel-parser-zero-2018.12.20.tar.gz --acl public-read

@joelgrus joelgrus requested a review from schmmd December 20, 2018 22:56
@schmmd
Member
schmmd commented Dec 20, 2018

If you want to be able to chain, you could rely more on stdin/stdout, e.g.

cat quarel-parser-zero-2018.10.03.tar.gz | \
    scripts/archive_surgery.py replace --key validation_dataset_reader.world_extraction_model --value https://s3-us-west-2.amazonaws.com/allennlp/models/quarel-tagger-2018.12.20.tar.gz | \
    scripts/archive_surgery.py replace --key dataset_reader.world_extraction_model --value https://s3-us-west-2.amazonaws.com/allennlp/models/quarel-tagger-2018.12.20.tar.gz \
    > quarel-parser-zero-2018.12.20.tar.gz 
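
For illustration, a rough sketch of how a streaming version could work (hedged, not code from this PR; `apply_operation` is a stand-in for the actual surgery): tarfile's `r|gz` / `w|gz` modes read and write gzipped tars over non-seekable pipes, so stdin/stdout chaining is feasible.

```python
import io
import json
import sys
import tarfile


def apply_operation(data: bytes) -> bytes:
    # Stand-in for the actual surgery: round-trip the config through json
    # so the shape of the pipeline is visible.
    config = json.loads(data)
    return json.dumps(config, indent=4).encode("utf-8")


# tarfile's streaming modes ("r|gz" / "w|gz") work on the non-seekable
# pipes that stdin/stdout provide.
with tarfile.open(fileobj=sys.stdin.buffer, mode="r|gz") as tar_in, \
        tarfile.open(fileobj=sys.stdout.buffer, mode="w|gz") as tar_out:
    for member in tar_in:
        fileobj = tar_in.extractfile(member)
        if member.isfile() and member.name.endswith("config.json"):
            data = apply_operation(fileobj.read())
            member.size = len(data)  # keep the header consistent with the new data
            fileobj = io.BytesIO(data)
        tar_out.addfile(member, fileobj)
```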

command = args.command

if command != "inspect" and os.path.exists(args.output_file):
    raise ValueError("output file already exists")
Member

I wonder if you get this for free with argparse.FileType('w')
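
(A hedged aside, not code from the PR: `argparse.FileType('w')` opens and truncates the output at parse time, so it would not refuse an existing file, but the exclusive-creation mode `'x'` should:)

```python
import argparse

parser = argparse.ArgumentParser()
# FileType('x') passes mode 'x' straight to open(), so an existing file makes
# the parser exit with a "can't open" error instead of silently truncating it.
parser.add_argument("--output-file", type=argparse.FileType("x"))
args = parser.parse_args()
```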

if command == "inspect":
    sys.exit(0)

# Otherwise let's print out the new config
Member

"...to stdout"

@matt-gardner
Contributor

I'm not sure how feasible this is, but would things be simpler if you had the script pop out to EDITOR?

@joelgrus
Contributor Author

I'm really leery of writing .tar.gz files to stdout by default. (And each step unzips and rezips everything, so if people frequently need multiple changes, I'd rather design for doing it in one step; it's just hard to get the API right for that, and I wanted to get this done.)

@schmmd
Member
schmmd commented Dec 20, 2018

@matt-gardner my approach to updating these tar.gzs is to simply use vim on the tarball directly, which works well for me.

@matt-peters
Contributor

Since there is a quick, easy-to-understand, and easy-to-describe method for accomplishing this (untar the archive, edit the config, re-tar), is it really necessary to have the 200+ line script to accomplish the same thing? A line of code not written is worth two lines written? Or something like that.

@joelgrus
Contributor Author

@matt-peters my main motivation was that every time we deprecate a config key, we break every trained model.tar.gz that exists in the world, and I wanted to make it as easy as possible for our users to fix their configs.

that is, I wanted to be able to tell them "run this one-line command and it will fix your model," rather than "untar it, find the config, open a text editor, make changes, then re-tar it," which is fine if you're used to doing it but not a great user experience.

but if people think this is not a great idea, I'm not wedded to it

@nelson-liu
Contributor

fwiw: i just tried to modify a config in one of the fixtures, and the tests weren't taking the modified config (archive loading code complains about tarfile.ReadError: not a gzip file). Using this script worked beautifully, thanks.
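
(A hedged guess at one way that error can arise, not necessarily what happened here: the archive-loading code opens the file as a gzipped tar, so a fixture re-packed without compression fails in exactly that way.)

```python
import io
import json
import tarfile

# Write a tar archive *without* gzip compression ("w" instead of "w:gz").
config = json.dumps({"model": {"type": "crf_tagger"}}).encode("utf-8")
info = tarfile.TarInfo("config.json")
info.size = len(config)
with tarfile.open("model.tar.gz", "w") as archive:
    archive.addfile(info, io.BytesIO(config))

# Opening it as a gzipped tar then fails:
tarfile.open("model.tar.gz", "r:gz")  # tarfile.ReadError: not a gzip file
```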

@schmmd
Member
schmmd commented Jan 4, 2019

We should merge. I share some of the concerns raised but it's written and we already have some good feedback from users. We discussed this in issue review with @brendan-ai2 and @matt-gardner.

@joelgrus
Contributor Author
joelgrus commented Jan 4, 2019

ok, as discussed, I changed it to just launch an editor
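
(Roughly, the editor-based flow looks like the sketch below: extract the archive, open its config in `$EDITOR`, and re-pack. This is a hedged illustration with hypothetical names, not the merged script.)

```python
import os
import subprocess
import tarfile
import tempfile


def edit_archive(archive_file: str, output_file: str) -> None:
    """Open the archived config in the user's editor, then re-pack the archive."""
    with tempfile.TemporaryDirectory() as tempdir:
        with tarfile.open(archive_file, "r:gz") as archive:
            archive.extractall(tempdir)

        # Block until the editor exits, then re-tar whatever was saved.
        editor = os.environ.get("EDITOR", "vi")
        subprocess.run([editor, os.path.join(tempdir, "config.json")], check=True)

        with tarfile.open(output_file, "w:gz") as archive:
            for name in os.listdir(tempdir):
                archive.add(os.path.join(tempdir, name), arcname=name)
```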

@joelgrus joelgrus merged commit e6b0f21 into allenai:master Jan 5, 2019
WrRan pushed a commit to WrRan/allennlp that referenced this pull request Jan 6, 2019
* script for doing archive surgery

* simplify script
@schmmd
Member
schmmd commented Jan 7, 2019

Seems like a significant code reduction ;-)

DeNeutoy pushed a commit that referenced this pull request Jan 31, 2019
* Fix bug in uniform_unit_scaling #2239 (#2273)

* Fix type annotation for .forward(...) in tutorial (#2122)

* Add a Contributions section to README.md (#2277)

* script for doing archive surgery (#2223)

* script for doing archive surgery

* simplify script

* Fix spelling in tutorial README (#2283)

* fix #2285 (#2286)

* Update the `find-lr` subcommand help text. (#2289)

* Update the elmo command help text.

* Update the find-lr subcommand help text.

* Add __repr__ to Vocabulary (#2293)

As it currently stands, the following is logged during training:

```
2019-01-06 10:46:21,832 - INFO - allennlp.common.from_params - instantiating class <class 'allennlp.model
s.language_model.LanguageModel'> from params {'bidirectional': False, 'contextualizer': {'bidirectional':
 False, 'dropout': 0.5, 'hidden_size': 200, 'input_size': 200, 'num_layers': 2, 'type': 'lstm'}, 'dropout
': 0.5, 'text_field_embedder': {'token_embedders': {'tokens': {'embedding_dim': 200, 'type': 'embedding'}
}}} and extras {'vocab': <allennlp.data.vocabulary.Vocabulary object at 0x7ff7811665f8>}
```

Note that the `Vocabulary` does not provide any useful information, since it doesn't have `__repr__` defined. This provides a fix.

* Update the base image in the Dockerfiles. (#2298)

* Don't deprecate bidirectional-language-model name (#2297)

* bump version number to v0.8.1

* Bump version numbers to v0.8.2-unreleased

* Turn BidirectionalLM into a more-general LanguageModel class (#2264)

Fixes #2255

This PR replaces the `BidirectionalLM` class with a more-general `LanguageModel` that can be used in either the unidirectional/forward setting or the bidirectional setting.

It also accordingly replaces the `BidirectionalLanguageModelTokenEmbedder` with a `LanguageModelTokenEmbedder`.

Also fixes bug in the experiment_unsampled.jsonnet config that was preventing a test from actually being unsampled.

TODO:

- [x] test the unidirectional case
- [x] properly deprecate `BidirectionalLM` and `BidirectionalLanguageModelTokenEmbedder` 
- [x] check docs for accuracy
- [x] fix user-facing training configs

* more help info

* typo fix

* add option '--inplace', '--force'

* clearer help text
matt-gardner pushed a commit that referenced this pull request Mar 18, 2019
* Fix bug in uniform_unit_scaling #2239 (#2273)

* Fix type annotation for .forward(...) in tutorial (#2122)

* Add a Contributions section to README.md (#2277)

* script for doing archive surgery (#2223)

* script for doing archive surgery

* simplify script

* Fix spelling in tutorial README (#2283)

* fix #2285 (#2286)

* Update the `find-lr` subcommand help text. (#2289)

* Update the elmo command help text.

* Update the find-lr subcommand help text.

* Add __repr__ to Vocabulary (#2293)

As it currently stands, the following is logged during training:

```
2019-01-06 10:46:21,832 - INFO - allennlp.common.from_params - instantiating class <class 'allennlp.model
s.language_model.LanguageModel'> from params {'bidirectional': False, 'contextualizer': {'bidirectional':
 False, 'dropout': 0.5, 'hidden_size': 200, 'input_size': 200, 'num_layers': 2, 'type': 'lstm'}, 'dropout
': 0.5, 'text_field_embedder': {'token_embedders': {'tokens': {'embedding_dim': 200, 'type': 'embedding'}
}}} and extras {'vocab': <allennlp.data.vocabulary.Vocabulary object at 0x7ff7811665f8>}
```

Note that the `Vocabulary` does not provide any useful information, since it doesn't have `__repr__` defined. This provides a fix.

* Update the base image in the Dockerfiles. (#2298)

* Don't deprecate bidirectional-language-model name (#2297)

* bump version number to v0.8.1

* Bump version numbers to v0.8.2-unreleased

* Turn BidirectionalLM into a more-general LanguageModel class (#2264)

Fixes #2255

This PR replaces the `BidirectionalLM` class with a more-general `LanguageModel` that can be used in either the unidirectional/forward setting or the bidirectional setting.

It also accordingly replaces the `BidirectionalLanguageModelTokenEmbedder` with a `LanguageModelTokenEmbedder`.

Also fixes bug in the experiment_unsampled.jsonnet config that was preventing a test from actually being unsampled.

TODO:

- [x] test the unidirectional case
- [x] properly deprecate `BidirectionalLM` and `BidirectionalLanguageModelTokenEmbedder`
- [x] check docs for accuracy
- [x] fix user-facing training configs

* move some utilities from allennlp/scripts to allennlp/allennlp/tools

* make pylint happy

* add modules to API doc
reiyw pushed a commit to reiyw/allennlp that referenced this pull request Nov 12, 2019
TalSchuster pushed a commit to TalSchuster/allennlp-MultiLang that referenced this pull request Feb 20, 2020