I am very excited about having seamless download from the HF Hub in AllenNLP! That could be useful for more things than just downloading models. For example, some dataset readers rely on downloading pre-computed vocabulary files. Whether we upload all our existing models to the hub and reference them there by default is a separate question. The Google Storage bucket we use doesn't cause any issues right now, but I understand putting these on the hub opens the door to using the HF inference API later?
Yes, that's right! Similar to what has been done with pyannote for example: julien-c/voice-activity-detection, the goal would be to ultimately host demos directly on the inference API for the models uploaded to the hub. For BiDAF-ELMo for example, we could imagine a QA widget like on your cool demo.
I am not forgetting about this, I am just swamped at the moment. We're happy to review a PR at any time that puts hub support in. Moving all the models is a slightly bigger project that I'll have to coordinate with the folks working on the demo, so I can't get to it right this second, but it's on my radar.
That's good to know, thanks @dirkgr. We'll be focusing on the
@dirkgr, I have updated the PR description with the final proposal.
It's hard to see from the diff: can this be combined with archive support, i.e., if I say `lysandre/something.zip!filename.txt`, will it work?
allennlp/common/file_utils.py
Outdated
```python
if not isinstance(url_or_filename, str):
    url_or_filename = str(url_or_filename)

if not os.path.isfile(url_or_filename) and not urlparse(url_or_filename).scheme in ("http", "https", "s3"):
```
I am a little worried about this. It's not transparent to a user that the behavior hinges on the presence or absence of a local file in their current working directory. It also makes the AllenNLP configuration files a lot less self-contained.
Do you think we could treat this like a URL, and make it `huggingface://lysandre/pair-classification-roberta-mnli`, `hf://lysandre/pair-classification-roberta-mnli`, or even `hub://lysandre/pair-classification-roberta-mnli`?
Yes, we could definitely treat it like a URL and prefix it with any of the three schemes you proposed. Is there one you favor?
I think `hub` is too generic, so I vote for either `huggingface://`, `hf://`, or both.
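The scheme check being discussed could be sketched like this (a hypothetical illustration, not the PR's actual implementation; the function name is made up):

```python
from urllib.parse import urlparse

def is_hub_identifier(url_or_filename: str) -> bool:
    # Treat hf:// (or huggingface://) URLs as Hugging Face Hub resources;
    # anything else falls through to the existing local-file/URL handling.
    return urlparse(url_or_filename).scheme in ("hf", "huggingface")

print(is_hub_identifier("hf://lysandre/pair-classification-roberta-mnli"))  # True
print(is_hub_identifier("https://example.com/model.tar.gz"))                # False
print(is_hub_identifier("/local/path/model.tar.gz"))                        # False
```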
Right now the files uploaded to the hub are not in archived form. The bidaf example mentioned above was uploaded to the hub as an untarred version of itself, so that it's easier to browse, inspect, and version each of its files: https://huggingface.co/lysandre/bidaf-elmo-model-2020.03.19/tree/main

Regarding your question, if I understand correctly you're asking whether we could target a single file in a repository (akin to your archives), rather than the entire repository. This is definitely possible; with the current implementation the following is acceptable:

```python
from allennlp.modules.elmo import Elmo, batch_to_ids

# Previously these pointed at the S3 bucket:
# options_file = "https://s3-us-west-2.amazonaws.com/allennlp/models/elmo/2x1024_128_2048cnn_1xhighway/elmo_2x1024_128_2048cnn_1xhighway_options.json"
# weight_file = "https://s3-us-west-2.amazonaws.com/allennlp/models/elmo/2x1024_128_2048cnn_1xhighway/elmo_2x1024_128_2048cnn_1xhighway_weights.hdf5"

options_file = "lysandre/elmo-2x4096_512_2048cnn_2xhighway/options.json"
weight_file = "lysandre/elmo-2x4096_512_2048cnn_2xhighway/weights.hdf5"

elmo = Elmo(options_file, weight_file, 2, dropout=0)

sentences = [['First', 'sentence', '.'], ['Another', '.']]
character_ids = batch_to_ids(sentences)

embeddings = elmo(character_ids)
print(embeddings)
```

Happy to look into adding a
Hi @LysandreJik, thanks for doing this! I thought I'd leave my two cents as well, since the caching functionality is near and dear to my heart.
```diff
 - The test for distributed metrics now takes a parameter specifying how often you want to run it.

 ## [v2.3.0](https://github.com/allenai/allennlp/releases/tag/v2.3.0) - 2021-04-14

 ### Added

-- Ported the following Huggingface `LambdaLR`-based schedulers: `ConstantLearningRateScheduler`, `ConstantWithWarmupLearningRateScheduler`, `CosineWithWarmupLearningRateScheduler`, `CosineHardRestartsWithWarmupLearningRateScheduler`.
+- Ported the following HuggingFace `LambdaLR`-based schedulers: `ConstantLearningRateScheduler`, `ConstantWithWarmupLearningRateScheduler`, `CosineWithWarmupLearningRateScheduler`, `CosineHardRestartsWithWarmupLearningRateScheduler`.
```
Ah, good catch! 😉
allennlp/common/file_utils.py
Outdated
```python
    filename = url_or_filename.split("/")[2]
    url_or_filename = "/".join(url_or_filename.split("/")[:2])
else:
    filename = None

if "@" in url_or_filename:
    repo_id = url_or_filename.split("@")[0]
    revision = url_or_filename.split("@")[1]
else:
    repo_id = url_or_filename
    revision = None

try:
    if filename is not None:
        url = hf_hub_url(
            repo_id=repo_id, filename=filename, revision=revision
        )
        url_or_filename = cached_download(
```
This is a bit confusing to me. We set `url_or_filename` (along with `filename`) on line 250, but then override it to something else on line 266.
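For reference, the repo/revision/filename splitting under discussion could be collected into one self-contained helper. This is only a sketch: the function name and the exact "namespace/repo[@revision][/filename]" identifier format are assumptions, not the PR's final code.

```python
from typing import Optional, Tuple

def parse_hub_identifier(identifier: str) -> Tuple[str, Optional[str], Optional[str]]:
    # Split "namespace/repo[@revision][/filename]" into (repo_id, revision, filename).
    parts = identifier.split("/")
    repo_id = "/".join(parts[:2])
    # Everything past the first two segments is treated as a file path in the repo.
    filename = "/".join(parts[2:]) or None
    revision = None
    if "@" in repo_id:
        repo_id, revision = repo_id.split("@", 1)
    return repo_id, revision, filename
```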
allennlp/common/file_utils.py
Outdated
```python
except HTTPError as e:
    logger.warning(f"Tried to download from Hugging Face Hub but failed with {e}")
```
Why warn here and not just let the exception propagate?
I think it would be better to emit a `logger.info()` statement above that says "treating url_or_filename as a HuggingFace Hub resource" or something, and then just let this error propagate if it happens.
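A minimal sketch of the suggested pattern, log the intent and let the failure propagate, is below. The `fetch` callable stands in for the real download call and is hypothetical:

```python
import logging

logger = logging.getLogger(__name__)

def hub_download(identifier: str, fetch) -> str:
    # Announce intent at info level, then let any exception raised by the
    # download surface to the caller instead of downgrading it to a warning.
    logger.info("Treating %r as a HuggingFace Hub resource", identifier)
    return fetch(identifier)
```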
CHANGELOG.md
Outdated
```diff
@@ -9,14 +9,15 @@ and this project adheres to [Semantic Versioning](https://semver.org/spec/v2.0.0

 ### Added

+- Added support for the HuggingFace Hub as an alternative way to handle loading files.
```
If we go with `huggingface://` or `hf://` type URLs, we should mention that here.
I went with `hf://`! I updated the changelog.
GitHub is blowing up the commenting threads, so I'll reply like this instead. I personally think

As a user, I would expect that indicating a single file within a repository would work, and I would have expected that to work with
Thank you both for your comments! I have updated the code accordingly:

- Access to the hub is now done via the `hf://` scheme, which cleans up the code as there is no ambiguity with local paths anymore.
- Cleaned up the code relative to the `url_or_filename` variable.

I have kept the behavior with the second `/` for file access instead of `!`, but it is simple enough to update, so let me know if that is preferred.
```python
if not isinstance(url_or_filename, str):
    url_or_filename = str(url_or_filename)

if urlparse(url_or_filename).scheme == "hf":
    # Remove the hf:// prefix
    identifier = url_or_filename[5:]
```
I've updated the names to `identifier` and `model_identifier`, please let me know if this looks better to you @epwalsh!
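As a quick illustration of the `hf://` prefix removal quoted above (the helper name here is hypothetical, added only for this example):

```python
def strip_hf_scheme(url: str) -> str:
    # Mirror the snippet under review: drop a leading "hf://" (5 characters)
    # to get the bare model identifier; leave other strings untouched.
    prefix = "hf://"
    return url[len(prefix):] if url.startswith(prefix) else url

print(strip_hf_scheme("hf://lysandre/pair-classification-roberta-mnli"))
# lysandre/pair-classification-roberta-mnli
```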
allennlp/common/file_utils.py
Outdated
```python
url = hf_hub_url(
    repo_id=repo_id, filename=filename, revision=revision
)
url_or_filename = cached_download(
```
The final assignment is done to `url_or_filename` as this is now a local path. It is handled as a local path in the following statements of the `cached_path` method.
Can you run the formatter? I always run this until it has no more complaints:
Oh, that's
Co-authored-by: Julien Chaumond <chaumond@gmail.com>
I've updated the style and fixed the conflicts. I think the PR looks good! As a second step, I was wondering how we could amplify visibility for this; contributing to the docs seems like a good first step. Do you have any pointers on where I could help? How does that sound?
This PR adds support for the inference API as well, as you can see on the bidaf model on the hub. Would it make sense for you to upload some of your models to the hub as a starting point, once this PR is merged?
@LysandreJik, we like to post updates like this on GitHub Discussions. If you posted there, we could then tweet about it from the official AllenNLP twitter account.
That would be great as well, and we'd be happy to tweet about that too!
Looks great!
Great, I'll take a look at the GitHub discussions. Thanks!
@LysandreJik, I just did a patch release with this included:
Published in a discussion: #5140
* Load from HF hub (ELMO example, Cleanup, ok, Version is necessary, Prettier env)
* Tests
* Remove erroneous addition
* Add dependency to huggingface_hub
* Update changelog
* Address review comments
* Update changelog
* Update allennlp/common/file_utils.py
* Style
* Fix style and tests

Co-authored-by: Dirk Groeneveld <dirkg@allenai.org>
Co-authored-by: Julien Chaumond <chaumond@gmail.com>
This PR adds support for the HuggingFace model hub as a way to download files. The hub support is put into `cached_path()`, and allows the following:

This offers model versioning (even across archives, as they're stored unarchived on the hub, as can be seen with the example above), storage, and the inference API for supported models.
The model above is available here: https://huggingface.co/lysandre/bidaf-elmo-model-2020.03.19
Feel free to test out the inference API. It's currently only set up for the question answering model; let me know if there are other pipelines you would be interested in and we can look into setting those up.
You can also try it out by uploading your own checkpoints to the model hub, and adding the following tags to the README file: