Description
Summary
On non-Windows platforms, Python uses the "surrogateescape"
error handler by default for filesystem encoding (sys.getfilesystemencodeerrors()
).
The docstring for format_filename
says this (emphasis mine)
Formats a filename for user display. The main purpose of this function is to ensure that the filename can be displayed at all. This will decode the filename to unicode if necessary in a way that it will not fail. Optionally, it can shorten the filename to not include the full path to the filename.
and under Printing Filenames (emphasis mine):
Because filenames might not be Unicode, formatting them can be a bit tricky.
The way this works with click is through the format_filename() function. It does a best-effort conversion of the filename to Unicode and will never fail. This makes it possible to use these filenames in the context of a full Unicode string.
But on Click 8, format_filename
will always return a string with surrogates intact. Strings in this form are destined for functions which have to operate on the file like open
, not for printing to the user. When you try to do so, you get a UnicodeEncodeError.
Reproducer
You have to run this example on an OS like Linux which allows non-unicode filenames.
This script
import os
import click
@click.command()
@click.argument('path', type=click.Path(), required=False)
def cli(path):
if path is not None:
click.echo(f"printing {path!r} as {click.format_filename(path)}")
filename_bytes = b'foo\xe9'
filename_str = os.fsdecode(filename_bytes)
# We can open filename_bytes or filename_str, the result is the same.
# When filename_str (which contains surrogates) is given, Python encodes
# it back using os.fsencode to get the original bytes.
with open(filename_str, 'w') as fp:
fp.write('Hello, world!')
# I'm just using scandir to demonstrate that paths come from the standard
# library in this surrogateescaped form.
#
# This is equivalent to click.echo(click.format_filename(filename_str))
for entry in os.scandir('.'):
if entry.name != filename_str:
continue
click.echo(click.format_filename(entry.name))
if __name__ == '__main__':
cli()
Save this file as cli.py
and run it:
python cli.py
On Click 8 you get an exception:
Traceback (most recent call last):
File "cli.py", line 25, in <module>
cli()
File "<redacted>/python3.7/site-packages/click/core.py", line 1130, in __call__
return self.main(*args, **kwargs)
File "<redacted>/python3.7/site-packages/click/core.py", line 1055, in main
rv = self.invoke(ctx)
File "<redacted>/python3.7/site-packages/click/core.py", line 1404, in invoke
return ctx.invoke(self.callback, **ctx.params)
File "<redacted>/python3.7/site-packages/click/core.py", line 760, in invoke
return __callback(*args, **kwargs)
File "cli.py", line 22, in cli
click.echo(click.format_filename(entry.name))
File "<redacted>/python3.7/site-packages/click/utils.py", line 299, in echo
file.write(out) # type: ignore
UnicodeEncodeError: 'utf-8' codec can't encode character '\udce9' in position 3: surrogates not allowed
On Click 7 you get the expected output:
foo�
You can also run it with a filename argument to see how it passes through click.Path. Using shell completion to complete the filename is the easiest thing to do, in case this escape sequence is unique to my shell (probably not, but I'm not sure).
python cli.py 'foo'$'\351'
Environment:
- Python version: 3.7 and 3.10 tested
- Click version: 8.1.3
- OS: Linux
Cause
Implementation on Click 7.1.2:
Lines 470 to 475 in 1784558
Implementation on Click 8:
Line 400 in 9c6f4c8
This means that on Click 8, format_filename
is a no-op, since os.fsdecode
returns strings unchanged.
Other issues
Under File Path Arguments (emphasis mine):
In the previous example, the files were opened immediately. But what if we just want the filename? The naïve way is to use the default string argument type. However, remember that Click is Unicode-based, so the string will always be a Unicode value. Unfortunately, filenames can be Unicode or bytes depending on which operating system is being used. As such, the type is insufficient.
Instead, you should be using the Path type, which automatically handles this ambiguity. Not only will it return either bytes or Unicode depending on what makes more sense, but it will also be able to do some basic checks for you such as existence checks.
click.Path
always returns str
as far as I can tell, unless you explicitly give type=bytes
so I'm not sure which automatic handling this is referring to. I tried all the way back to Click 1 and didn't see it ever returning bytes. Maybe it's something related to Python 2 and the docs are outdated?
These should use fspath. fsdecode calls fspath for you but the decoding part is a no-op since the input is a str.
Lines 391 to 393 in 9c6f4c8
https://github.com/pallets/click/blob/8.1.3/src/click/types.py#L853
This fsdecode is a no-op if the filename: str
type annotation is correct. The !r
conversion means it gets printed with surrogate escape sequences visible which probably isn't what we want (I think it's probably better to print a replacement character — this is arguable because a technical user could in principle use the repr in Python).
Lines 264 to 268 in 9c6f4c8
click.File doesn't use the repr:
Line 734 in 9c6f4c8
This works because sys.stderr.errors defaults to "backslashreplace"
. If you caught this exception and tried to print it to stdout, it would cause a UnicodeEncodeError