Fix #4546 (carriage return in HTML) #4548
Conversation
As I've noted on the linked issue, this will very likely
have unexpected effects in various places, since all of
the parsers have been developed under the assumption that
the documents don't contain CRs. Changing the HTML reader is
only a start to what would be required.
Hence I think it's too risky to merge this.
All readers except CommonMark, Docx, EPUB, Native and Odt call crFilter by themselves immediately before processing the text; I mentioned this in the first commit message. Removing the crFilter functionality from the Text<->ByteString conversion functions just moves control over this issue to the readers, so they can decide whether to filter out CRs or not. No tests are broken, by the way. If something breaks later, it is only a matter of adding crFilter to the affected reader, but it looks like none of the remaining readers need it: CommonMark uses an external library, Docx and Odt are XML formats, EPUB is a container for HTML, and Native, as a serialization format, is better off without any additional filtering.
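For context, the `crFilter` being discussed (from Text.Pandoc.Shared) is essentially a one-line filter; a minimal sketch, with the type signature assumed from how the readers use it:

```haskell
import qualified Data.Text as T

-- Sketch of crFilter: drop every carriage return. Note that a
-- lone CR vanishes entirely rather than becoming a newline,
-- which is exactly the behaviour the linked issue is about.
crFilter :: T.Text -> T.Text
crFilter = T.filter (/= '\r')

main :: IO ()
main = print (crFilter (T.pack "a\r\nb\rc"))  -- prints "a\nbc"
```

With this definition CRLF collapses to LF, but the lone CR between `b` and `c` is silently dropped instead of being treated as a line break.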
Ah, I'd forgotten that we'd moved crFilter inside the individual
readers.
The other things to check would be other consumers of the UTF8
functions... For example, does the template engine assume that
CRs have been stripped from templates? Will there be changes
if you do a header-include?
Just checked: the templating engine does not seem to assume anything about CRs. Header-includes now pass CRs through; before this PR, CRs were filtered out of includes. Isn't the new behaviour more correct? Why should CRs be removed from includes rather than being included as-is?
I can't think of a reason why they should be stripped. I'm just trying
to think of all the places this change might matter, and which we may
not have thought of...
Grep for
See the docs for System.IO.NewlineMode: "Specifies the translation, if any, of newline characters between internal Strings and the external file or stream. Haskell Strings are assumed to represent newlines with the '\n' character; the newline mode specifies how to translate '\n' on output, and what to translate into '\n' on input." So, I think part of the motivation for stripping CRs was to ensure that our strings conform to this invariant. (The newlines then get translated to CRLF on appropriate platforms, and a similar conversion is done on input -- see putStrLnWith in UTF8 module.) Note that System.IO only supports CR and CRLF newlines. So I'm not completely confident about the possible side effects of this change.
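The translation described in those docs can be observed directly with `hSetNewlineMode`; a small demo (the file name is arbitrary) that forces CRLF output even on a POSIX system and then inspects the raw bytes:

```haskell
import qualified Data.ByteString.Char8 as B
import System.IO

-- Demo of System.IO newline translation: with outputNL = CRLF,
-- the '\n' appended by hPutStrLn is written as "\r\n" on any
-- platform. The Newline type only has LF and CRLF constructors,
-- so a lone-CR newline convention cannot even be expressed here.
main :: IO ()
main = do
  h <- openFile "nl-demo.txt" WriteMode
  hSetNewlineMode h (NewlineMode { inputNL = CRLF, outputNL = CRLF })
  hPutStrLn h "one"
  hClose h
  bytes <- B.readFile "nl-demo.txt"  -- ByteString I/O is binary, no translation
  print bytes                         -- prints "one\r\n"
```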
Another option would be to change `crFilter`:

```haskell
crFilter [] = []
crFilter ('\r':'\n':xs) = '\n' : crFilter xs
crFilter ('\r':xs)      = '\n' : crFilter xs
crFilter (x:xs)         = x : crFilter xs
```

One would want to benchmark this to make sure it's not much slower than the current filter.
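A `Data.Text` analogue of that `String` version (the name `crToLf` is made up here) could use two `replace` passes; the order matters, otherwise CRLF would turn into two LFs:

```haskell
import qualified Data.Text as T

-- Hypothetical Text version of the normalizing filter above:
-- first rewrite CRLF to LF, then any remaining lone CR to LF.
crToLf :: T.Text -> T.Text
crToLf = T.replace (T.pack "\r") (T.pack "\n")
       . T.replace (T.pack "\r\n") (T.pack "\n")

main :: IO ()
main = print (crToLf (T.pack "a\r\nb\rc"))  -- prints "a\nb\nc"
```

The benchmark concern applies here too: each `T.replace` pass allocates a fresh buffer, whereas `T.filter` makes a single pass.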
Text files are read via

Instead of making the HTML reader parse CR as a newline (according to the HTML spec), you propose to replace single CRs with LF for all text formats, so they will all treat a single CR as a newline when they should not. And "\r\r\n" will be translated to "\r\n" on *nix and to "\n" on Windows. The proper solution would be to read all files in binary mode (by replacing

Then we will treat "\r\r\n" properly, as a carriage return followed by a newline, and single CRs will survive conversion from text markup formats. If someone needs to preserve CRs, this change can be applied on top of this PR. It is not related to the problem of single CRs not being handled as a newline in the HTML reader.
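A sketch of that proposal (the name `readFileBinary` and the exact behaviour are hypothetical here, not pandoc's actual `UTF8.readFile`): read raw bytes so the platform newline translation never runs, decode UTF-8, strip a leading BOM, and normalize CRLF, but not lone CR, to LF:

```haskell
import qualified Data.ByteString as B
import qualified Data.Text as T
import qualified Data.Text.Encoding as TE
import Data.Maybe (fromMaybe)

-- Hypothetical replacement for UTF8.readFile along the lines
-- proposed above. ByteString I/O is always binary, so Windows
-- CRLF translation does not apply; we do the CRLF -> LF step
-- ourselves, once, and leave lone CRs for readers to decide on.
readFileBinary :: FilePath -> IO T.Text
readFileBinary fp = do
  bytes <- B.readFile fp
  let txt   = TE.decodeUtf8 bytes
      noBom = fromMaybe txt (T.stripPrefix (T.pack "\xFEFF") txt)
  return (T.replace (T.pack "\r\n") (T.pack "\n") noBom)

main :: IO ()
main = do
  B.writeFile "cr-test.txt" (TE.encodeUtf8 (T.pack "\xFEFF\&a\r\nb\rc"))
  t <- readFileBinary "cr-test.txt"
  print t  -- prints "a\nb\rc": BOM gone, CRLF normalized, lone CR kept
```

Under this scheme the byte sequence `\r\r\n` comes through as `\r` followed by `\n`, matching the "carriage return followed by newline" reading above, and a lone CR survives for the HTML reader to interpret per its spec.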
No, the idea was that it would be translated to

You seem confident that it should be treated as
Your PR contains a change to Text.Pandoc.UTF8, which potentially affects many things. It's not just a change to the HTML reader. So I don't want to go forward until the details have been understood and sorted out.
I assume it is a typo and read it as "LF and CRLF".
If CRLF is converted to LF (on Windows), "\r\r\n" will be converted to "\r\n" during reading, and
If we agree that only LF (Haskell and POSIX standard) and CRLF (for Windows compatibility) line endings are valid, then CRCRLF can be read either as CR<>newline or CRCR<>newline.
CRCRLF is just a corner case I invented, in which case applying the proposed

I don't care about the CRCRLF case that much, but it worked correctly before my PR and works the same after my PR (unless we decide that we need to support CR newlines in all text formats).
Ok, let me summarize. System.IO has built-in "newline normalization", which translates CRLF in files to LF in process memory on Windows.
I assume that LF and CRLF are the only valid newlines. It is just HTML that treats single CRs differently, according to its standard. Before the PR it works like this:

1. Text files are read with UTF8.readFile, which applies newline normalization, removes the BOM and applies `crFilter`.
2. All `Text` pandoc readers apply `crFilter` again for themselves, which is a no-op because CRs are already removed.
After PR:
The changes are minimal because

If we want not a minimal change, but a more correct one, I propose that we change `UTF8.readFile` to read files in binary mode, remove the BOM and replace all CRLFs with LF regardless of the platform `pandoc` runs on. Then remove `crFilter` everywhere.
@jgm It is one more reason to move
Or, even better: make

In any case, I would like this minimal change to be applied as-is if no test case is found where it breaks something, just to fix the bug (in both HTML and CommonMark now). Additional correctness fixes can be applied on top of it after that.
> Before PR it works like this:
>
> 1. Text files are read with UTF8.readFile, which applies newline normalization, removes BOM and applies `crFilter`.
> 2. All `Text` pandoc readers apply `crFilter` again for themselves, which is no-op because CRs are already removed.
It's not a no-op! If you use pandoc as a library, you may not be
getting your input from UTF8.readFile or UTF8.getContents. That's why
`crFilter` needs to be there in the readers.
UTF8.readFile does filter CRs, it's true, because it uses
`toString`, which uses `toText`, which filters CRs. You might think
that because we use `crFilter` in the readers, we don't need to filter
them here. But that's not so obvious to me. For example, consider
the use of readFile to read a LaTeX include file, which is then added
to the stream of input to be parsed. Here `crFilter` isn't called;
we rely on CRs being filtered by `readFile`. There may be other cases
like this, too. I think a very careful audit of all uses of the UTF8
module's functions is needed before we remove CR stripping there.
> If we want not a minimal change, but more correct one, I propose that we change `UTF8.readFile` to read files in binary mode, remove BOM and replace all CRLFs with LF regardless of platform `pandoc` runs on. Then remove `crFilter` everywhere.
That doesn't work, because of the point above -- you can use the readers
and pass them your own input, without ever using UTF8.readFile. This
is what people typically do when they use pandoc in a web app, for
example.
Hmm, I hadn't thought of the readers as an external interface, thanks. Then readers should expect to receive anything, OK.
OK, let's postpone this PR until I add CR filtering to LaTeX.hs (there are two calls to
The use in the LaTeX reader was just one example. There are other formats that use include files, too. Of course, now that I think of it,
Applied to
Readers already use crFilter for themselves anyway.
According to HTML specification, CR, LF and CRLF are newlines. Closes jgm#4546
Force-pushed from 10a0d76 to c39fcef
No description provided.