8000 click.format_filename regression in click 8 for non-unicode paths · Issue #2395 · pallets/click · GitHub
[go: up one dir, main page]
More Web Proxy on the site http://driver.im/
Skip to content
click.format_filename regression in click 8 for non-unicode paths #2395
Closed
@RazerM

Description

@RazerM

Summary

On non-Windows platforms, Python uses the "surrogateescape" error handler by default for filesystem encoding (sys.getfilesystemencodeerrors()).

The docstring for format_filename says this (emphasis mine)

Formats a filename for user display. The main purpose of this function is to ensure that the filename can be displayed at all. This will decode the filename to unicode if necessary in a way that it will not fail. Optionally, it can shorten the filename to not include the full path to the filename.

and under Printing Filenames (emphasis mine):

Because filenames might not be Unicode, formatting them can be a bit tricky.

The way this works with click is through the format_filename() function. It does a best-effort conversion of the filename to Unicode and will never fail. This makes it possible to use these filenames in the context of a full Unicode string.

But on Click 8, format_filename will always return a string with surrogates intact. Strings in this form are destined for functions which have to operate on the file like open, not for printing to the user. When you try to do so, you get a UnicodeEncodeError.

Reproducer

You have to run this example on an OS like Linux which allows non-unicode filenames.

This script

import os
import click

@click.command()
@click.argument('path', type=click.Path(), required=False)
def cli(path):
    if path is not None:
        click.echo(f"printing {path!r} as {click.format_filename(path)}")

    filename_bytes = b'foo\xe9'
    filename_str = os.fsdecode(filename_bytes)

    # We can open filename_bytes or filename_str, the result is the same.
    # When filename_str (which contains surrogates) is given, Python encodes
    # it back using os.fsencode to get the original bytes.
    with open(filename_str, 'w') as fp:
        fp.write('Hello, world!')

    # I'm just using scandir to demonstrate that paths come from the standard
    # library in this surrogateescaped form.
    #
    # This is equivalent to click.echo(click.format_filename(filename_str))
    for entry in os.scandir('.'):
        if entry.name != filename_str:
            continue
        click.echo(click.format_filename(entry.name))

if __name__ == '__main__':
    cli()

Save this file as cli.py and run it:

python cli.py

On Click 8 you get an exception:

Traceback (most recent call last):
  File "cli.py", line 25, in <module>
    cli()
  File "<redacted>/python3.7/site-packages/click/core.py", line 1130, in __call__
    return self.main(*args, **kwargs)
  File "<redacted>/python3.7/site-packages/click/core.py", line 1055, in main
    rv = self.invoke(ctx)
  File "<redacted>/python3.7/site-packages/click/core.py", line 1404, in invoke
    return ctx.invoke(self.callback, **ctx.params)
  File "<redacted>/python3.7/site-packages/click/core.py", line 760, in invoke
    return __callback(*args, **kwargs)
  File "cli.py", line 22, in cli
    click.echo(click.format_filename(entry.name))
  File "<redacted>/python3.7/site-packages/click/utils.py", line 299, in echo
    file.write(out)  # type: ignore
UnicodeEncodeError: 'utf-8' codec can't encode character '\udce9' in position 3: surrogates not allowed

On Click 7 you get the expected output:

foo�

You can also run it with a filename argument to see how it passes through click.Path. Using shell completion to complete the filename is the easiest thing to do, in case this escape sequence is unique to my shell (probably not, but I'm not sure).

python cli.py 'foo'$'\351'

Environment:

  • Python version: 3.7 and 3.10 tested
  • Click version: 8.1.3
  • OS: Linux

Cause

Implementation on Click 7.1.2:

click/src/click/_compat.py

Lines 470 to 475 in 1784558

def filename_to_ui(value):
if isinstance(value, bytes):
value = value.decode(get_filesystem_encoding(), "replace")
else:
value = value.encode("utf-8", "surrogateescape").decode("utf-8", "replace")
return value

Implementation on Click 8:

return os.fsdecode(filename)

This means that on Click 8, format_filename is a no-op, since os.fsdecode returns strings unchanged.

Other issues

Under File Path Arguments (emphasis mine):

In the previous example, the files were opened immediately. But what if we just want the filename? The naïve way is to use the default string argument type. However, remember that Click is Unicode-based, so the string will always be a Unicode value. Unfortunately, filenames can be Unicode or bytes depending on which operating system is being used. As such, the type is insufficient.

Instead, you should be using the Path type, which automatically handles this ambiguity. Not only will it return either bytes or Unicode depending on what makes more sense, but it will also be able to do some basic checks for you such as existence checks.

click.Path always returns str as far as I can tell, unless you explicitly give type=bytes so I'm not sure which automatic handling this is referring to. I tried all the way back to Click 1 and didn't see it ever returning bytes. Maybe it's something related to Python 2 and the docs are outdated?


These should use fspath. fsdecode calls fspath for you but the decoding part is a no-op since the input is a str.

click/src/click/_compat.py

Lines 391 to 393 in 9c6f4c8

# Standard streams first. These are simple because they ignore the
# atomic flag. Use fsdecode to handle Path("-").
if os.fsdecode(filename) == "-":

https://github.com/pallets/click/blob/8.1.3/src/click/types.py#L853


This fsdecode is a no-op if the filename: str type annotation is correct. The !r conversion means it gets printed with surrogate escape sequences visible which probably isn't what we want (I think it's probably better to print a replacement character — this is arguable because a technical user could in principle use the repr in Python).

self.ui_filename = os.fsdecode(filename)
self.filename = filename
def format_message(self) -> str:
return _("Could not open file {filename!r}: {message}").format(

click.File doesn't use the repr:

self.fail(f"'{os.fsdecode(value)}': {e.strerror}", param, ctx)

This works because sys.stderr.errors defaults to "backslashreplace". If you caught this exception and tried to print it to stdout, it would cause a UnicodeEncodeError

Metadata

Metadata

Assignees

No one assigned

    Labels

    No labels
    No labels

    Type

    No type

    Projects

    No projects

    Milestone

    Relationships

    None yet

    Development

    No branches or pull requests

    Issue actions

      0