click.format_filename regression in click 8 for non-unicode paths

Summary

On non-Windows platforms, Python uses the "surrogateescape" error handler by default for filesystem encoding (sys.getfilesystemencodeerrors()).

The docstring for format_filename says this (emphasis mine)

Formats a filename for user display. The main purpose of this function is to ensure that the filename can be displayed at all. This will decode the filename to unicode if necessary in a way that it will not fail. Optionally, it can shorten the filename to not include the full path to the filename.

and under Printing Filenames (emphasis mine):

Because filenames might not be Unicode, formatting them can be a bit tricky.

The way this works with click is through the format_filename() function. It does a best-effort conversion of the filename to Unicode and will never fail. This makes it possible to use these filenames in the context of a full Unicode string.

But on Click 8, format_filename will always return a string with surrogates intact. Strings in this form are destined for functions which have to operate on the file like open, not for printing to the user. When you try to do so, you get a UnicodeEncodeError.

Reproducer

You have to run this example on an OS like Linux which allows non-unicode filenames.

This script

import os
import click

@click.command()
@click.argument('path', type=click.Path(), required=False)
def cli(path):
    if path is not None:
        click.echo(f"printing {path!r} as {click.format_filename(path)}")

    filename_bytes = b'foo\xe9'
    filename_str = os.fsdecode(filename_bytes)

    # We can open filename_bytes or filename_str, the result is the same.
    # When filename_str (which contains surrogates) is given, Python encodes
    # it back using os.fsencode to get the original bytes.
    with open(filename_str, 'w') as fp:
        fp.write('Hello, world!')

    # I'm just using scandir to demonstrate that paths come from the standard
    # library in this surrogateescaped form.
    #
    # This is equivalent to click.echo(click.format_filename(filename_str))
    for entry in os.scandir('.'):
        if entry.name != filename_str:
            continue
        click.echo(click.format_filename(entry.name))

if __name__ == '__main__':
    cli()

Save this file as cli.py and run it:

python cli.py

On Click 8 you get an exception:

Traceback (most recent call last):
  File "cli.py", line 25, in <module>
    cli()
  File "<redacted>/python3.7/site-packages/click/core.py", line 1130, in __call__
    return self.main(*args, **kwargs)
  File "<redacted>/python3.7/site-packages/click/core.py", line 1055, in main
    rv = self.invoke(ctx)
  File "<redacted>/python3.7/site-packages/click/core.py", line 1404, in invoke
    return ctx.invoke(self.callback, **ctx.params)
  File "<redacted>/python3.7/site-packages/click/core.py", line 760, in invoke
    return __callback(*args, **kwargs)
  File "cli.py", line 22, in cli
    click.echo(click.format_filename(entry.name))
  File "<redacted>/python3.7/site-packages/click/utils.py", line 299, in echo
    file.write(out)  # type: ignore
UnicodeEncodeError: 'utf-8' codec can't encode character '\udce9' in position 3: surrogates not allowed

On Click 7 you get the expected output:

foo�

You can also run it with a filename argument to see how it passes through click.Path. Using shell completion to complete the filename is the easiest thing to do, in case this escape sequence is unique to my shell (probably not, but I'm not sure).

python cli.py 'foo'$'\351'

Environment:

Python version: 3.7 and 3.10 tested
Click version: 8.1.3
OS: Linux

Cause

Implementation on Click 7.1.2:

click/src/click/_compat.py

Lines 470 to 475 in 1784558

    
           def filename_to_ui(value): 
        
               if isinstance(value, bytes): 
        
                   value = value.decode(get_filesystem_encoding(), "replace") 
        
               else: 
        
                   value = value.encode("utf-8", "surrogateescape").decode("utf-8", "replace") 
        
               return value

Implementation on Click 8:

click/src/click/utils.py

Line 400 in 9c6f4c8

return os.fsdecode(filename)

This means that on Click 8, format_filename is a no-op, since os.fsdecode returns strings unchanged.

Other issues

Under File Path Arguments (emphasis mine):

In the previous example, the files were opened immediately. But what if we just want the filename? The naïve way is to use the default string argument type. However, remember that Click is Unicode-based, so the string will always be a Unicode value. Unfortunately, filenames can be Unicode or bytes depending on which operating system is being used. As such, the type is insufficient.

Instead, you should be using the Path type, which automatically handles this ambiguity. Not only will it return either bytes or Unicode depending on what makes more sense, but it will also be able to do some basic checks for you such as existence checks.

click.Path always returns str as far as I can tell, unless you explicitly give type=bytes so I'm not sure which automatic handling this is referring to. I tried all the way back to Click 1 and didn't see it ever returning bytes. Maybe it's something related to Python 2 and the docs are outdated?

These should use fspath. fsdecode calls fspath for you but the decoding part is a no-op since the input is a str.

click/src/click/_compat.py

Lines 391 to 393 in 9c6f4c8

    
           # Standard streams first. These are simple because they ignore the 
        
           # atomic flag. Use fsdecode to handle Path("-"). 
        
           if os.fsdecode(filename) == "-":

https://github.com/pallets/click/blob/8.1.3/src/click/types.py#L853

This fsdecode is a no-op if the filename: str type annotation is correct. The !r conversion means it gets printed with surrogate escape sequences visible which probably isn't what we want (I think it's probably better to print a replacement character — this is arguable because a technical user could in principle use the repr in Python).

click/src/click/exceptions.py

Lines 264 to 268 in 9c6f4c8

    
               self.ui_filename = os.fsdecode(filename) 
        
               self.filename = filename 
        
           def format_message(self) -> str: 
        
               return _("Could not open file {filename!r}: {message}").format(

click.File doesn't use the repr:

click/src/click/types.py

Line 734 in 9c6f4c8

self.fail(f"'{os.fsdecode(value)}': {e.strerror}", param, ctx)

This works because sys.stderr.errors defaults to "backslashreplace". If you caught this exception and tried to print it to stdout, it would cause a UnicodeEncodeError

Provide feedback

Saved searches

Use saved searches to filter your results more quickly

Uh oh!

Uh oh!

Summary

Reproducer

Cause

Other issues

Metadata

Assignees

Labels

Type

Projects

Milestone

Relationships

Development

	def filename_to_ui(value):
	if isinstance(value, bytes):
	value = value.decode(get_filesystem_encoding(), "replace")
	else:
	value = value.encode("utf-8", "surrogateescape").decode("utf-8", "replace")
	return value

	# Standard streams first. These are simple because they ignore the
	# atomic flag. Use fsdecode to handle Path("-").
	if os.fsdecode(filename) == "-":

	self.ui_filename = os.fsdecode(filename)
	self.filename = filename

	def format_message(self) -> str:
	return _("Could not open file {filename!r}: {message}").format(

Uh oh!

Description

Summary

Reproducer

Cause

Other issues

Metadata

Metadata

Assignees

Labels

Type

Projects

Milestone

Relationships

Development

Issue actions