Hf hub by LysandreJik · Pull Request #5052 · allenai/allennlp · GitHub
This repository was archived by the owner on Dec 16, 2022. It is now read-only.

Hf hub #5052

Merged

merged 13 commits into allenai:main on Apr 20, 2021

Conversation

LysandreJik (Contributor) commented Mar 12, 2021

This PR adds support for the HuggingFace model hub as a way to download files. The hub support is put into cached_path() and allows the following:

from allennlp.predictors.predictor import Predictor
import allennlp_models.rc

- checkpoint = "https://storage.googleapis.com/allennlp-public-models/bidaf-elmo.2021-02-11.tar.gz"
+ checkpoint = "lysandre/bidaf-elmo-model-2020.03.19"
predictor = Predictor.from_path(checkpoint)
predictions = predictor.predict_json({
  "passage":
      "The Matrix is a 1999 science fiction action "
      "film written and directed by The Wachowskis, "
      "starring Keanu Reeves, Laurence Fishburne, "
      "Carrie-Anne Moss, Hugo Weaving, and Joe P"
      "antoliano.",
  "question":
      "Who stars in The Matrix?"
})
print(predictions["best_span_str"])

This offers model versioning (even across archives, since they're stored unarchived on the hub, as can be seen in the example above), storage, and the inference API for supported models.

The model above is available here: https://huggingface.co/lysandre/bidaf-elmo-model-2020.03.19

Feel free to test out the inference API. It's currently only set up for the question-answering model; let me know if there are other pipelines you would be interested in and we can look into setting those up.

You can also try it out by uploading your own checkpoints to the model hub, and adding the following tags to the README file:

---
tags:
- allennlp
- question-answering
---

dirkgr (Member) commented Mar 12, 2021

I am very excited about having seamless download from the HF Hub in AllenNLP! That could be useful for more things than just downloading models. For example, some dataset readers rely on downloading pre-computed vocabulary files.

Whether we upload all our existing models to the hub and reference them there by default is a separate question. The Google Storage bucket we use doesn't cause any issues right now, but I understand that putting these on the hub opens the door to using the HF inference API later?

LysandreJik (Contributor, Author):

Yes, that's right! Similar to what has been done with pyannote (for example julien-c/voice-activity-detection), the goal would be to ultimately host demos directly on the inference API for models uploaded to the hub.

For BiDAF-ELMo for example, we could imagine a QA widget like on your cool demo.

dirkgr (Member) commented Mar 18, 2021

I am not forgetting about this; I am just swamped at the moment. We're happy to review a PR at any time that puts hub support into cached_path(). Let us know when you have a "final form".

Moving all the models is a slightly bigger project that I'll have to coordinate with the guys working on the demo, so I can't get to it right this second, but it's on my radar.

LysandreJik (Contributor, Author):

That's good to know, thanks @dirkgr. We'll focus on cached_path() then; I agree that's what would make the most sense. Will let you know once we have a final form.

Commits pushed: ELMO example · Cleanup · ok · Version is necessary · Prettier · env

LysandreJik (Contributor, Author):

@dirkgr, I have updated the PR description with the final proposal.

dirkgr mentioned this pull request on Apr 15, 2021
dirkgr (Member) left a comment:

It's hard to see from the diff: Can this be combined with archive support, i.e., if I say lysandre/something.zip!filename.txt, will it work?

if not isinstance(url_or_filename, str):
    url_or_filename = str(url_or_filename)

if not os.path.isfile(url_or_filename) and urlparse(url_or_filename).scheme not in ("http", "https", "s3"):
Member:

I am a little worried about this. It's not transparent to a user that the behavior of this hinges on the presence or absence of a local file in their current working directory. It also makes the AllenNLP configuration files a lot less self-contained.

Do you think we could treat this like a URL, and make it huggingface://lysandre/pair-classification-roberta-mnli or hf://lysandre/pair-classification-roberta-mnli or even hub://lysandre/pair-classification-roberta-mnli?
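A minimal sketch of how such a scheme check could look with Python's standard urlparse; the scheme names are just the proposals floated above (not the final choice), and the helper name is hypothetical:

from urllib.parse import urlparse

def is_hub_reference(identifier: str) -> bool:
    # With an explicit scheme, deciding "hub or not" no longer depends on
    # whether a matching local file happens to exist.
    return urlparse(identifier).scheme in ("huggingface", "hf", "hub")

assert is_hub_reference("hf://lysandre/pair-classification-roberta-mnli")
assert not is_hub_reference("/tmp/pair-classification-roberta-mnli")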

LysandreJik (Contributor, Author):

Yes, we could definitely treat it like a URL and have it preceded by any of the three schemes you proposed. Is there one you favor?

Member:

I think hub is too generic. So I vote for either huggingface://, hf://, or both.

LysandreJik (Contributor, Author):

Right now the files uploaded to the hub are not in archived form. The bidaf example mentioned above was uploaded to the hub as an untarred version of itself, so that it's easier to browse, inspect, and version each of its files: https://huggingface.co/lysandre/bidaf-elmo-model-2020.03.19/tree/main

Regarding your question: if I understand correctly, you're asking whether we could target a single file in a repository (akin to your archives), rather than the entire repository. This is definitely possible; with the current implementation, this works:

from allennlp.modules.elmo import Elmo, batch_to_ids

- options_file = "https://s3-us-west-2.amazonaws.com/allennlp/models/elmo/2x1024_128_2048cnn_1xhighway/elmo_2x1024_128_2048cnn_1xhighway_options.json"
- weight_file = "https://s3-us-west-2.amazonaws.com/allennlp/models/elmo/2x1024_128_2048cnn_1xhighway/elmo_2x1024_128_2048cnn_1xhighway_weights.hdf5"
+ options_file = "lysandre/elmo-2x4096_512_2048cnn_2xhighway/options.json"
+ weight_file = "lysandre/elmo-2x4096_512_2048cnn_2xhighway/weights.hdf5"

elmo = Elmo(options_file, weight_file, 2, dropout=0)

sentences = [['First', 'sentence', '.'], ['Another', '.']]
character_ids = batch_to_ids(sentences)
embeddings = elmo(character_ids)
print(embeddings)

Happy to look into adding a ! instead of the second / if it makes more sense given your current implementation!

epwalsh (Member) left a comment:

Hi @LysandreJik, thanks for doing this! I thought I'd leave my two cents as well, since the caching functionality is near and dear to my heart.

- The test for distributed metrics now takes a parameter specifying how often you want to run it.


## [v2.3.0](https://github.com/allenai/allennlp/releases/tag/v2.3.0) - 2021-04-14

### Added

- Ported the following Huggingface `LambdaLR`-based schedulers: `ConstantLearningRateScheduler`, `ConstantWithWarmupLearningRateScheduler`, `CosineWithWarmupLearningRateScheduler`, `CosineHardRestartsWithWarmupLearningRateScheduler`.
+ Ported the following HuggingFace `LambdaLR`-based schedulers: `ConstantLearningRateScheduler`, `ConstantWithWarmupLearningRateScheduler`, `CosineWithWarmupLearningRateScheduler`, `CosineHardRestartsWithWarmupLearningRateScheduler`.
Member:

Ah, good catch! 😉


Comment on lines 249 to 266
    filename = url_or_filename.split("/")[2]
    url_or_filename = "/".join(url_or_filename.split("/")[:2])
else:
    filename = None

if "@" in url_or_filename:
    repo_id = url_or_filename.split("@")[0]
    revision = url_or_filename.split("@")[1]
else:
    repo_id = url_or_filename
    revision = None

try:
    if filename is not None:
        url = hf_hub_url(
            repo_id=repo_id, filename=filename, revision=revision
        )
        url_or_filename = cached_download(
epwalsh (Member):

This is a bit confusing to me.

We set url_or_filename (along with filename) on line 250, but then override it to something else on line 266.
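One hypothetical way to restructure along the lines of this comment: parse into dedicated variables and only assign the local path at the very end (variable names here are illustrative, not from the PR):

identifier = url_or_filename  # e.g. "lysandre/some-repo@main/options.json"

# Split off an optional filename after the second "/".
parts = identifier.split("/")
if len(parts) > 2:
    repo_ref, filename = "/".join(parts[:2]), "/".join(parts[2:])
else:
    repo_ref, filename = identifier, None

# Split off an optional "@revision" from the repository reference.
repo_id, _, revision = repo_ref.partition("@")
revision = revision or None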

Comment on lines 274 to 275
except HTTPError as e:
    logger.warning(f"Tried to download from Hugging Face Hub but failed with {e}")
epwalsh (Member):

Why warn here and not just let the exception propagate?

I think it would be better to emit a logger.info() statement above that says "treating url_or_filename as a HuggingFace Hub resource" or something, and then just let this error propagate if it happens.
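A sketch of the suggested pattern, reusing the names from the diff above; the exact call sites may differ in the final code:

from huggingface_hub import cached_download, hf_hub_url

logger.info("Treating %s as a HuggingFace Hub resource", url_or_filename)
url = hf_hub_url(repo_id=repo_id, filename=filename, revision=revision)
# No try/except around the download: an HTTPError from the hub propagates to
# the caller with its full traceback instead of being downgraded to a warning.
url_or_filename = cached_download(url)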

CHANGELOG.md Outdated
@@ -9,14 +9,15 @@ and this project adheres to [Semantic Versioning](https://semver.org/spec/v2.0.0

### Added

- Added support for the HuggingFace Hub as an alternative way to handle loading files.
Member:

If we go with huggingface:// or hf:// type URLs, we should mention that here.

LysandreJik (Contributor, Author):

I went with hf://! I updated the changelog.

dirkgr (Member) commented Apr 15, 2021

GitHub is blowing up the commenting threads, so I'll reply like this instead.

I personally think hf:// is quite cool, but I have a thing for terseness. I'm fine with whatever @LysandreJik wants to do here, especially if you want to align it with other projects. It would be annoying if it's hf:// in AllenNLP but huggingface:// in TensorFlow or some other framework you're integrating the hub into.

As a user, I would expect that indicating a single file within a repository would work, and I would have expected that to work with /, not !. I can do that with GitHub URLs after all. But if it's hard to implement, this is already a great addition even without that feature.
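For context, the ! syntax referenced here is allennlp's existing convention in cached_path for addressing a file inside a downloaded archive. A hedged illustration (the member path inside the archive is made up):

from allennlp.common.file_utils import cached_path

# "!" separates the archive location from the member path inside it.
vocab_path = cached_path(
    "https://storage.googleapis.com/allennlp-public-models/"
    "bidaf-elmo.2021-02-11.tar.gz!vocabulary/tokens.txt"
)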

LysandreJik (Contributor, Author) left a comment:

Thank you both for your comments! I have updated the code accordingly:

- Access to the hub is now done via the hf:// scheme, which cleans up the code since there is no longer any ambiguity with local paths.
- Cleaned up the code around the url_or_filename variable.

I have kept the behavior with the second / for file access instead of !, but it is simple enough to update, so let me know if that is preferred.
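Taken together, the identifier forms agreed on in this thread would look roughly like this; a sketch based on the discussion, using the example repositories from earlier, not verified against the merged code:

from allennlp.common.file_utils import cached_path

# Whole repository
model_path = cached_path("hf://lysandre/bidaf-elmo-model-2020.03.19")

# Single file inside a repository, addressed with the second "/" rather than "!"
options_file = cached_path("hf://lysandre/elmo-2x4096_512_2048cnn_2xhighway/options.json")

# Repository pinned to a revision with "@" ("main" here is an illustrative revision)
pinned_path = cached_path("hf://lysandre/bidaf-elmo-model-2020.03.19@main")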


if not isinstance(url_or_filename, str):
    url_or_filename = str(url_or_filename)

if urlparse(url_or_filename).scheme == "hf":
    # Remove the hf:// prefix
    identifier = url_or_filename[5:]
LysandreJik (Contributor, Author):

I've updated the names to identifier and model_identifier; please let me know if this looks better to you, @epwalsh!
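As a small aside on why the code above slices off the prefix rather than reassembling the urlparse result: for these identifiers the first path segment lands in netloc. A quick standalone check (not from the PR):

from urllib.parse import urlparse

parsed = urlparse("hf://lysandre/bidaf-elmo-model-2020.03.19")
assert parsed.scheme == "hf"
assert parsed.netloc == "lysandre"
assert parsed.path == "/bidaf-elmo-model-2020.03.19"

# Slicing the fixed five-character "hf://" prefix keeps the identifier whole:
assert "hf://lysandre/bidaf-elmo-model-2020.03.19"[5:] == "lysandre/bidaf-elmo-model-2020.03.19"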

url = hf_hub_url(
    repo_id=repo_id, filename=filename, revision=revision
)
url_or_filename = cached_download(
LysandreJik (Contributor, Author):

The final assignment is made to url_or_filename because it now holds a local path, and it is handled as a local path in the subsequent statements of the cached_path method.

dirkgr (Member) commented Apr 19, 2021

Can you run the formatter? I always run this until it has no more complaints:

black .; and flake8; and mypy allennlp --ignore-missing-imports --no-strict-optional --no-site-packages --cache-dir=/dev/null

dirkgr (Member) commented Apr 19, 2021

Oh, that's fish syntax. In bash it's:

black . && flake8 && mypy allennlp --ignore-missing-imports --no-strict-optional --no-site-packages --cache-dir=/dev/null

LysandreJik (Contributor, Author):

I've updated the style and fixed the conflicts. I think the PR looks good!

As a next step, I was wondering how we could amplify visibility for this; contributing to the docs seems like a good first step. Do you have any pointers on where I could help? After that, I'll write a blog post focusing on the integration and the capabilities it enables.

How does that sound?

LysandreJik (Contributor, Author):

This PR adds support for the inference API as well, as you can see on the bidaf model on the hub. Would it make sense for you to upload some of your models to the hub as a starting point, once this PR is merged?

epwalsh (Member) commented Apr 20, 2021

> As a next step, I was wondering how we could amplify visibility for this; contributing to the docs seems like a good first step. Do you have any pointers on where I could help?

@LysandreJik, we like to post updates like this on GitHub Discussions. If you posted there, we could then tweet about it from the official AllenNLP Twitter account.

> After that, I'll write a blog post focusing on the integration and the capabilities it enables.

That would be great as well, and we'd be happy to tweet about that too!

epwalsh (Member) left a comment:

Looks great!

LysandreJik (Contributor, Author):

Great, I'll take a look at GitHub Discussions. Thanks!

epwalsh enabled auto-merge (squash) April 20, 2021 20:53
epwalsh disabled auto-merge April 20, 2021 20:53
epwalsh enabled auto-merge (squash) April 20, 2021 20:53
epwalsh merged commit a84b9b1 into allenai:main Apr 20, 2021
epwalsh (Member) commented Apr 20, 2021

@LysandreJik, I just did a patch release with this included: allennlp==2.3.1

LysandreJik (Contributor, Author):

Published in a discussion: #5140

dirkgr added a commit that referenced this pull request May 10, 2021
* Load from HF hub

ELMO example
Cleanup
ok
Version is necessary
Prettier
env

* Tests

* Remove erroneous addition

* Add dependency to huggingface_hub

* Update changelog

* Address review comments

* Update changelog

* Update allennlp/common/file_utils.py

Co-authored-by: Julien Chaumond <chaumond@gmail.com>

* Style

* Fix style and tests

Co-authored-by: Dirk Groeneveld <dirkg@allenai.org>
Co-authored-by: Julien Chaumond <chaumond@gmail.com>