Hf hub by LysandreJik · Pull Request #5052 · allenai/allennlp · GitHub
This repository was archived by the owner on Dec 16, 2022. It is now read-only.

Hf hub #5052

Merged

merged 13 commits into allenai:main on Apr 20, 2021

Conversation

LysandreJik (Contributor) commented Mar 12, 2021

This PR adds support for the HuggingFace model hub as a way to download files. The hub support is put into cached_path() and allows the following:

from allennlp.predictors.predictor import Predictor
import allennlp_models.rc

- checkpoint = "https://storage.googleapis.com/allennlp-public-models/bidaf-elmo.2021-02-11.tar.gz"
+ checkpoint = "lysandre/bidaf-elmo-model-2020.03.19"
predictor = Predictor.from_path(checkpoint)
predictions = predictor.predict_json({
  "passage":
      "The Matrix is a 1999 science fiction action "
      "film written and directed by The Wachowskis, "
      "starring Keanu Reeves, Laurence Fishburne, "
      "Carrie-Anne Moss, Hugo Weaving, and Joe P"
      "antoliano.",
  "question":
      "Who stars in The Matrix?"
})
print(predictions["best_span_str"])

This offers model versioning (even across archives, since they're stored unarchived on the hub, as can be seen in the example above), storage, and the inference API for supported models.

The model above is available here: https://huggingface.co/lysandre/bidaf-elmo-model-2020.03.19

Feel free to test out the inference API. It's currently only set up for the question-answering model; let me know if there are other pipelines you would be interested in and we can look into setting those up.

You can also try it out by uploading your own checkpoints to the model hub, and adding the following tags to the README file:

---
tags:
- allennlp
- question-answering
---

dirkgr (Member) commented Mar 12, 2021

I am very excited about having seamless download from the HF Hub in AllenNLP! That could be useful for more things than just downloading models. For example, some dataset readers rely on downloading pre-computed vocabulary files.

Whether we upload all our existing models to the hub and reference them there by default is a separate question. The Google Storage bucket we use doesn't cause any issues right now, but I understand that putting these on the hub opens the door to using the HF inference API later?

LysandreJik (Contributor, Author):

Yes, that's right! Similar to what has been done with pyannote (for example julien-c/voice-activity-detection), the goal would be to ultimately host demos directly on the inference API for models uploaded to the hub.

For BiDAF-ELMo for example, we could imagine a QA widget like on your cool demo.

dirkgr (Member) commented Mar 18, 2021

I am not forgetting about this; I am just swamped at the moment. We're happy to review a PR at any time that puts hub support into cached_path(). Let us know when you have a "final form".

Moving all the models is a slightly bigger project that I'll have to coordinate with the guys working on the demo, so I can't get to it right this second, but it's on my radar.

LysandreJik (Contributor, Author):

That's good to know, thanks @dirkgr. We'll focus on cached_path() then; I agree that's what would make the most sense. Will let you know once we have a final form.

Commits pushed: ELMO example · Cleanup · ok · Version is necessary · Prettier · env

LysandreJik (Contributor, Author):

@dirkgr, I have updated the PR description with the final proposal.

dirkgr mentioned this pull request on Apr 15, 2021
dirkgr (Member) left a comment:

It's hard to see from the diff: Can this be combined with archive support, i.e., if I say lysandre/something.zip!filename.txt, will it work?

if not isinstance(url_or_filename, str):
    url_or_filename = str(url_or_filename)

if not os.path.isfile(url_or_filename) and urlparse(url_or_filename).scheme not in ("http", "https", "s3"):
Member:

I am a little worried about this. It's not transparent to a user that the behavior of this hinges on the presence or absence of a local file in their current working directory. It also makes the AllenNLP configuration files a lot less self-contained.

Do you think we could treat this like a URL, and make it huggingface://lysandre/pair-classification-roberta-mnli or hf://lysandre/pair-classification-roberta-mnli or even hub://lysandre/pair-classification-roberta-mnli?
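A minimal sketch of how such a scheme check could look with Python's standard urlparse; the scheme names are just the proposals floated above (not the final choice), and the helper name is hypothetical:

from urllib.parse import urlparse

def is_hub_reference(identifier: str) -> bool:
    # With an explicit scheme, deciding "hub or not" no longer depends on
    # whether a matching local file happens to exist.
    return urlparse(identifier).scheme in ("huggingface", "hf", "hub")

assert is_hub_reference("hf://lysandre/pair-classification-roberta-mnli")
assert not is_hub_reference("/tmp/pair-classification-roberta-mnli")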

LysandreJik (Contributor, Author):

Yes, we could definitely treat it like a URL and have it preceded by any of the three schemes you proposed. Is there one you favor?

Member:

I think hub is too generic. So I vote for either huggingface://, hf://, or both.

LysandreJik (Contributor, Author):

Right now the files uploaded to the hub are not in archived form. The bidaf example mentioned above was uploaded to the hub as an untarred version of itself, so that it's easier to browse, inspect, and version each of its files: https://huggingface.co/lysandre/bidaf-elmo-model-2020.03.19/tree/main

Regarding your question: if I understand correctly, you're asking whether we could target a single file in a repository (akin to your archives), rather than the entire repository. This is definitely possible; with the current implementation, this works:

from allennlp.modules.elmo import Elmo, batch_to_ids

- options_file = "https://s3-us-west-2.amazonaws.com/allennlp/models/elmo/2x1024_128_2048cnn_1xhighway/elmo_2x1024_128_2048cnn_1xhighway_options.json"
- weight_file = "https://s3-us-west-2.amazonaws.com/allennlp/models/elmo/2x1024_128_2048cnn_1xhighway/elmo_2x1024_128_2048cnn_1xhighway_weights.hdf5"
+ options_file = "lysandre/elmo-2x4096_512_2048cnn_2xhighway/options.json"
+ weight_file = "lysandre/elmo-2x4096_512_2048cnn_2xhighway/weights.hdf5"

elmo = Elmo(options_file, weight_file, 2, dropout=0)

sentences = [['First', 'sentence', '.'], ['Another', '.']]
character_ids = batch_to_ids(sentences)
embeddings = elmo(character_ids)
print(embeddings)

Happy to look into adding a ! instead of the second / if it makes more sense given your current implementation!

epwalsh (Member) left a comment:

Hi @LysandreJik, thanks for doing this! I thought I'd leave my two cents as well, since the caching functionality is near and dear to my heart.

- The test for distributed metrics now takes a parameter specifying how often you want to run it.


## [v2.3.0](https://github.com/allenai/allennlp/releases/tag/v2.3.0) - 2021-04-14

### Added

- Ported the following Huggingface `LambdaLR`-based schedulers: `ConstantLearningRateScheduler`, `ConstantWithWarmupLearningRateScheduler`, `CosineWithWarmupLearningRateScheduler`, `CosineHardRestartsWithWarmupLearningRateScheduler`.
+ Ported the following HuggingFace `LambdaLR`-based schedulers: `ConstantLearningRateScheduler`, `ConstantWithWarmupLearningRateScheduler`, `CosineWithWarmupLearningRateScheduler`, `CosineHardRestartsWithWarmupLearningRateScheduler`.
Member:

Ah, good catch! 😉


Comment on lines 249 to 266
    filename = url_or_filename.split("/")[2]
    url_or_filename = "/".join(url_or_filename.split("/")[:2])
else:
    filename = None

if "@" in url_or_filename:
    repo_id = url_or_filename.split("@")[0]
    revision = url_or_filename.split("@")[1]
else:
    repo_id = url_or_filename
    revision = None

try:
    if filename is not None:
        url = hf_hub_url(
            repo_id=repo_id, filename=filename, revision=revision
        )
        url_or_filename = cached_download(
epwalsh (Member):

This is a bit confusing to me.

We set url_or_filename (along with filename) on line 250, but then override it to something else on line 266.
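One hypothetical way to restructure along the lines of this comment: parse into dedicated variables and only assign the local path at the very end (variable names here are illustrative, not from the PR):

identifier = url_or_filename  # e.g. "lysandre/some-repo@main/options.json"

# Split off an optional filename after the second "/".
parts = identifier.split("/")
if len(parts) > 2:
    repo_ref, filename = "/".join(parts[:2]), "/".join(parts[2:])
else:
    repo_ref, filename = identifier, None

# Split off an optional "@revision" from the repository reference.
repo_id, _, revision = repo_ref.partition("@")
revision = revision or None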

Comment on lines 274 to 275
except HTTPError as e:
    logger.warning(f"Tried to download from Hugging Face Hub but failed with {e}")
epwalsh (Member):

Why warn here and not just let the exception propagate?

I think it would be better to emit a logger.info() statement above that says "treating url_or_filename as a HuggingFace Hub resource" or something, and then just let this error propagate if it happens.
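A sketch of the suggested pattern, reusing the names from the diff above; the exact call sites may differ in the final code:

from huggingface_hub import cached_download, hf_hub_url

logger.info("Treating %s as a HuggingFace Hub resource", url_or_filename)
url = hf_hub_url(repo_id=repo_id, filename=filename, revision=revision)
# No try/except around the download: an HTTPError from the hub propagates to
# the caller with its full traceback instead of being downgraded to a warning.
url_or_filename = cached_download(url)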

CHANGELOG.md Outdated
@@ -9,14 +9,15 @@ and this project adheres to [Semantic Versioning](https://semver.org/spec/v2.0.0

### Added

- Added support for the HuggingFace Hub as an alternative way to handle loading files.
Member:

If we go with huggingface:// or hf:// type URLs, we should mention that here.

LysandreJik (Contributor, Author):

I went with hf://! I updated the changelog.

dirkgr (Member) commented Apr 15, 2021

GitHub is blowing up the commenting threads, so I'll reply like this instead.

I personally think hf:// is quite cool, but I have a thing for terseness. I'm fine with whatever @LysandreJik wants to do here, especially if you want to align it with other projects. It would be annoying if it's hf:// in AllenNLP but huggingface:// in TensorFlow or some other framework you're integrating the hub into.

As a user, I would expect that indicating a single file within a repository would work, and I would have expected that to work with /, not !. I can do that with GitHub URLs after all. But if it's hard to implement, this is already a great addition even without that feature.
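For context, the ! syntax referenced here is allennlp's existing convention in cached_path for addressing a file inside a downloaded archive. A hedged illustration (the member path inside the archive is made up):

from allennlp.common.file_utils import cached_path

# "!" separates the archive location from the member path inside it.
vocab_path = cached_path(
    "https://storage.googleapis.com/allennlp-public-models/"
    "bidaf-elmo.2021-02-11.tar.gz!vocabulary/tokens.txt"
)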

LysandreJik (Contributor, Author) left a comment:

Thank you both for your comments! I have updated the code accordingly:

- Access to the hub is now done via the hf:// scheme, which cleans up the code since there is no longer any ambiguity with local paths.
- Cleaned up the code around the url_or_filename variable.

I have kept the behavior with the second / for file access instead of !, but it is simple enough to update, so let me know if that is preferred.
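Taken together, the identifier forms agreed on in this thread would look roughly like this; a sketch based on the discussion, using the example repositories from earlier, not verified against the merged code:

from allennlp.common.file_utils import cached_path

# Whole repository
model_path = cached_path("hf://lysandre/bidaf-elmo-model-2020.03.19")

# Single file inside a repository, addressed with the second "/" rather than "!"
options_file = cached_path("hf://lysandre/elmo-2x4096_512_2048cnn_2xhighway/options.json")

# Repository pinned to a revision with "@" ("main" here is an illustrative revision)
pinned_path = cached_path("hf://lysandre/bidaf-elmo-model-2020.03.19@main")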


if not isinstance(url_or_filename, str):
    url_or_filename = str(url_or_filename)

if urlparse(url_or_filename).scheme == "hf":
    # Remove the hf:// prefix
    identifier = url_or_filename[5:]
LysandreJik (Contributor, Author):

I've updated the names to identifier and model_identifier; please let me know if this looks better to you, @epwalsh!
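As a small aside on why the code above slices off the prefix rather than reassembling the urlparse result: for these identifiers the first path segment lands in netloc. A quick standalone check (not from the PR):

from urllib.parse import urlparse

parsed = urlparse("hf://lysandre/bidaf-elmo-model-2020.03.19")
assert parsed.scheme == "hf"
assert parsed.netloc == "lysandre"
assert parsed.path == "/bidaf-elmo-model-2020.03.19"

# Slicing the fixed five-character "hf://" prefix keeps the identifier whole:
assert "hf://lysandre/bidaf-elmo-model-2020.03.19"[5:] == "lysandre/bidaf-elmo-model-2020.03.19"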

url = hf_hub_url(
    repo_id=repo_id, filename=filename, revision=revision
)
url_or_filename = cached_download(
LysandreJik (Contributor, Author):

The final assignment is made to url_or_filename because it now holds a local path, and it is handled as a local path in the subsequent statements of the cached_path method.

dirkgr (Member) commented Apr 19, 2021

Can you run the formatter? I always run this until it has no more complaints:

black .; and flake8; and mypy allennlp --ignore-missing-imports --no-strict-optional --no-site-packages --cache-dir=/dev/null

dirkgr (Member) commented Apr 19, 2021

Oh, that's fish syntax. In bash it's:

black . && flake8 && mypy allennlp --ignore-missing-imports --no-strict-optional --no-site-packages --cache-dir=/dev/null

LysandreJik (Contributor, Author):

I've updated the style and fixed the conflicts. I think the PR looks good!

As a next step, I was wondering how we could amplify visibility for this; contributing to the docs seems like a good first step. Do you have any pointers on where I could help? After that, I'll write a blog post focusing on the integration and the capabilities it enables.

How does that sound?

LysandreJik (Contributor, Author):

This PR adds support for the inference API as well, as you can see on the bidaf model on the hub. Would it make sense for you to upload some of your models to the hub as a starting point, once this PR is merged?

epwalsh (Member) commented Apr 20, 2021

> As a next step, I was wondering how we could amplify visibility for this; contributing to the docs seems like a good first step. Do you have any pointers on where I could help?

@LysandreJik, we like to post updates like this on GitHub Discussions. If you posted there, we could then tweet about it from the official AllenNLP Twitter account.

> After that, I'll write a blog post focusing on the integration and the capabilities it enables.

That would be great as well, and we'd be happy to tweet about that too!

epwalsh (Member) left a comment:

Looks great!

LysandreJik (Contributor, Author):

Great, I'll take a look at GitHub Discussions. Thanks!

epwalsh enabled auto-merge (squash) April 20, 2021 20:53
epwalsh disabled auto-merge April 20, 2021 20:53
epwalsh enabled auto-merge (squash) April 20, 2021 20:53
epwalsh merged commit a84b9b1 into allenai:main Apr 20, 2021
epwalsh (Member) commented Apr 20, 2021

@LysandreJik, I just did a patch release with this included: allennlp==2.3.1

LysandreJik (Contributor, Author):

Published in a discussion: #5140

dirkgr added a commit that referenced this pull request May 10, 2021
* Load from HF hub

ELMO example
Cleanup
ok
Version is necessary
Prettier
env

* Tests

* Remove erroneous addition

* Add dependency to huggingface_hub

* Update changelog

* Address review comments

* Update changelog

* Update allennlp/common/file_utils.py

Co-authored-by: Julien Chaumond <chaumond@gmail.com>

* Style

* Fix style and tests

Co-authored-by: Dirk Groeneveld <dirkg@allenai.org>
Co-authored-by: Julien Chaumond <chaumond@gmail.com>