8000 Bug: `parse_wheel_filename` permits non-normalized versions. · Issue #873 · pypa/packaging · GitHub
[go: up one dir, main page]
More Web Proxy on the site http://driver.im/
Skip to content

Bug: parse_wheel_filename permits non-normalized versions. #873

New issue

Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.

By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.

Already on GitHub? Sign in to your account

Open
di opened this issue Feb 13, 2025 · 18 comments
Open

Bug: parse_wheel_filename permits non-normalized versions. #873

di opened this issue Feb 13, 2025 · 18 comments

Comments

@di
Copy link
Member
di commented Feb 13, 2025

https://packaging.python.org/en/latest/specifications/binary-distribution-format/ says:

Version numbers should be normalised according to the Version specifier specification.

Currently, parse_wheel_filename will raise InvalidWheelFilename for some invalid filenames, but https://packaging.python.org/en/latest/specifications/version-specifiers/#normalization has a long list of version normalizations that parse_wheel_filename does not enforce:

>>> from packaging.utils import parse_wheel_filename
>>> parse_wheel_filename('foo-01.0.0-py3-none-any.whl')
('foo', <Version('1.0.0')>, (), frozenset({<py3-none-any @ 140365619033920>}))
@notatallshaw
Copy link
Member
notatallshaw commented Feb 13, 2025

Does packaging have any sort of deprecation process for this sort of change? i.e. a public function that throws an error where it didn't used to.

Pip is migrating to use parse_wheel_filename as the primary way to parse wheel filenames in 25.1 and I'm concerned on impact to users.

I'm still learning how pip migrates to new packaging versions, I guess we could patch packaging to throw a warning instead of an exception for an appropriate deprecation period.

@di
Copy link
Member Author
di commented Feb 13, 2025

@notatallshaw You may be interested in this branch I am working on to provide a better API here around creating/parsing filenames, including "strict" and "non-strict" parsing: main...di:packaging:filename-api

From a cursory glance, it appears that most build backends are producing wheels with normalized filenames, so I wouldn't expect this to be too disruptive, but I could also be convinced to leave parse_wheel_filename alone until we have a better way to introduce this change.

@brettcannon
Copy link
Member

Does packaging have any sort of deprecation process for this sort of change?

We can raise a warning for a while if an analysis of PyPI suggests it would be disruptive.

@notatallshaw
Copy link
Member
notatallshaw commented Feb 13, 2025

From a cursory glance, it appears that most build backends are producing wheels with normalized filenames, so I wouldn't expect this to be too disruptive, but I could also be convinced to leave parse_wheel_filename alone until we have a better way to introduce this change.

I don't want to hold up packaging if this is a clear win for most use cases. But a bit of background:

Currently pip is using a custom regex to parse wheel filenames: https://github.com/pypa/pip/blob/25.0.1/src/pip/_internal/models/wheel.py#L21

There was a use case pip explicitly supported where _ was allowed in the version part, so we entered a depreciation cycle. So far I've seen two reports of users hitting this (pypa/pip#12938 (comment) and quic/aimet#3743)

These two use cases suggest the two scenarios are users hand rolling their own wheel filenames on private indexes / filesystems and perhaps an edge case in scikit-build-core, again for a project not uploaded to PyPI.

In general when pip is consuming wheel filenames I would rather it be non-strict in cases where the filename can be unambiguously normalized. But I also suppose we could lengthen the depreciation cycle and flag all issues to users (outside the explicitly supported _ in the version part). I would need to do some investigation on what the difference is between this additional strictness here and pip's current regex.

@di
Copy link
Member Author
di commented Feb 13, 2025

We can raise a warning for a while if an analysis of PyPI suggests it would be disruptive.

Taking a look at all filenames uploaded in the last month:

COPY (
    SELECT
        fe.time AS timestamp,
        fe.additional->>'filename' AS filename
    FROM
        file_events fe
    WHERE
        fe.time >= NOW() - INTERVAL '1 month'
        AND fe.tag = 'file:add'
        AND fe.additional->>'filename' LIKE '%.whl'
    ORDER BY fe.time
) TO STDOUT WITH (FORMAT CSV, HEADER);

Analyzing with:

import csv

from packaging.utils import InvalidWheelFilename, parse_wheel_filename

total_filenames = 0
exception_count = 0

with open("out.csv") as file:
    reader = csv.DictReader(file)  # Assumes CSV has a header

    for row in reader:
        total_filenames += 1
        filename = row["filename"]
        try:
            parse_wheel_filename(filename)
        except InvalidWheelFilename:
            exception_count += 1

percentage_exceptions = (exception_count / total_filenames) * 100
print(f"Total filenames processed: {total_filenames}")
print(f"Total exceptions: {exception_count}")
print(f"Percentage of exceptions: {percentage_exceptions:.2f}%")

With this branch, we get:

Total filenames processed: 210589
Total exceptions: 24
Percentage of exceptions: 0.01%

The invalid filenames in question are:

anki-25.01b1-cp39-abi3-macosx_12_0_arm64.whl
anki-25.01b1-cp39-abi3-macosx_12_0_x86_64.whl
anki-25.01b1-cp39-abi3-manylinux_2_35_aarch64.whl
anki-25.01b1-cp39-abi3-manylinux_2_35_x86_64.whl
anki-25.01b1-cp39-abi3-win_amd64.whl
anki-25.01rc1-cp39-abi3-macosx_12_0_arm64.whl
anki-25.01rc1-cp39-abi3-macosx_12_0_x86_64.whl
anki-25.01rc1-cp39-abi3-manylinux_2_35_aarch64.whl
anki-25.01rc1-cp39-abi3-manylinux_2_35_x86_64.whl
anki-25.01rc1-cp39-abi3-win_amd64.whl
anki-25.02-cp39-abi3-macosx_12_0_arm64.whl
anki-25.02-cp39-abi3-macosx_12_0_x86_64.whl
anki-25.02-cp39-abi3-manylinux_2_35_aarch64.whl
anki-25.02-cp39-abi3-manylinux_2_35_x86_64.whl
anki-25.02-cp39-abi3-win_amd64.whl
anki-25.02rc1-cp39-abi3-macosx_12_0_arm64.whl
anki-25.02rc1-cp39-abi3-macosx_12_0_x86_64.whl
anki-25.02rc1-cp39-abi3-manylinux_2_35_aarch64.whl
anki-25.02rc1-cp39-abi3-manylinux_2_35_x86_64.whl
anki-25.02rc1-cp39-abi3-win_amd64.whl
aqt-25.01b1-py3-none-any.whl
aqt-25.01rc1-py3-none-any.whl
aqt-25.02-py3-none-any.whl
aqt-25.02rc1-py3-none-any.whl

So it's really only two projects that are producing wheels with non-normalized versions, not 24 projects.

I can expand the analysis to a larger timeframe if anyone would like?

@di
Copy link
Member Author
di commented Feb 13, 2025

Also, seems reasonable to add a strict flag to parse_wheel_filename that disables this check, which pip could optionally set to handle their own deprecation period.

@notatallshaw
Copy link
Member
notatallshaw commented Feb 13, 2025

We can raise a warning for a while if an analysis of PyPI suggests it would be disruptive.

So it's really only two projects that are producing wheels with non-normalized versions

While helpful information, my experience supporting pip is that it's users who are not uploading to PyPI that are caught by changes to more strictly comply with the spec.

Also, seems reasonable to add a strict flag to parse_wheel_filename that disables this check, which pip could optionally set to handle their own deprecation period.

I think this would make it easier, and I definitely like the look of the filename API you linked earlier.

All that said, please don't consider it a show stopper, looking at all this code has made me think I need to re-review this and check additional scenarios before migrating to parse_wheel_filename, as I said we can always lengthen the depreciation cycle or patch packaging.

@di
Copy link
Member Author
di commented Feb 13, 2025

All that said, please don't consider it a show stopper, looking at all this code has made me think I need to re-review this and check additional scenarios before migrating to parse_wheel_filename, as I said we can always lengthen the depreciation cycle or patch packaging.

If the migration hasn't started yet, I would recommend waiting until we can get a better API for parsing filenames into this library. Note that parse_wheel_filename also doesn't enforce that the project name is normalized (a much more common issue) and that it's partner parse_sdist_filename has a bunch of issues: #527.

@notatallshaw
Copy link
Member
notatallshaw commented Feb 13, 2025

If the migration hasn't started yet, I would recommend waiting until we can get a better API for parsing filenames into this library. Note that parse_wheel_filename also doesn't enforce that the project name is normalized

We started the depreciation in pip 24.3 due to be complete in pip 25.1 (the next release), I'll discuss with the other pip maintainers, thanks.

and that it's partner parse_sdist_filename has a bunch of issues

I've not looked at sdists in general yet, that seemed like a much bigger can of worms. The only place I think we're using parse_sdist_filename right now is to filter filenames when a user points pip to a file directory to find packages, I added that before I understood that sdist and wheel names were tricky, fortunately that was done in pip 24.0 (a year ago) and we've not seen any complaints.

@brettcannon
Copy link
Member

If the migration hasn't started yet, I would recommend waiting until we can get a better API for parsing filenames into this library.

What API are you thinking of?

@di
Copy link
Member Author
di commented Feb 13, 2025

What API are you thinking of?

The one I mentioned here:

@notatallshaw You may be interested in this branch I am working on to provide a better API here around creating/parsing filenames, including "strict" and "non-strict" parsing: main...di:packaging:filename-api

@di
Copy link
Member Author
di commented Feb 13, 2025

I'm also OK waiting on that (or something like that) to land instead of merging #874, we can add this check to PyPI in the interim instead.

@brettcannon
Copy link
Member

What API are you thinking of?

The one I mentioned here:

FYI that won't work as written as it doesn't handle compressed wheel tags (it's why https://packaging.pypa.io/en/stable/utils.html#packaging.utils.parse_wheel_filename returns a set of tags instead of a single one).

@notatallshaw
Copy link
Member
notatallshaw commented Feb 24, 2025

FYI I changed the approach to migrate to this function in pip: pypa/pip#13229

parse_wheel_filename will be the default way to parse wheel filenames, but if parsing fails it falls back to the old parsing logic during the deprecation period.

Rereading the spec I do see that it says names and versions of binary distributions should be normalized. Because pip now has a fallback we can extend the deprecation period if this function becomes strict about this if this isn't released before the next pip release.

@di
Copy link
Member Author
di commented Feb 24, 2025

What API are you thinking of?

The one I mentioned here:

FYI that won't work as written as it doesn't handle compressed wheel tags (it's why https://packaging.pypa.io/en/stable/utils.html#packaging.utils.parse_wheel_filename returns a set of tags instead of a single one).

Ack, will take that into account, it's still a WIP as I'm hoping to integrate it into PyPI first as a proof-of concept, but I'm currently blocked on other things that haven't landed yet.

@notatallshaw
Copy link
Member
notatallshaw commented Feb 25, 2025

I see @di's PR for setuptools to produce normalized wheel file names just released: pypa/setuptools#4766

I'd assumed that build tooling had largely migrated to using normalized names, this broke a bunch of tests in pip's CI, I'm sure in many private repos they will not be strictly following the spec on name normalization.

It's up to packaging what to do with this function, but from pip's perspective this makes me nervous about being strict about name normalization. If packaging becomes strict here, with no non-strict option, I think pip should hold off on migrating for several years.

@di
Copy link
Member Author
di commented Feb 25, 2025

I agree. I'm closing #874 for now in favor of a new API here that supports both strict and non-strict parsing for a gentler migration.

@notatallshaw
Copy link
Member
notatallshaw commented May 3, 2025

Just a quick update to this, as I had conversations about this change in other settings, and it was pointed out to me the spec does call out tools that consume wheels:

Tools consuming wheels must be prepared to accept . (FULL STOP) and uppercase letters, however, as these were allowed by an earlier version of this specification.

So any update or replacement to parse_wheel_filename should ideally have both the standard behavior and this behavior.

Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment
Projects
None yet
Development

Successfully merging a pull request may close this issue.

3 participants
0