Fix #4546 (carriage return in HTML) #4548
Conversation
As I've noted on the linked issue, this will very likely
have unexpected effects in various places, since all of
the parsers have been developed under the assumption that
the documents don't contain CRs. Changing the HTML reader is
only a start to what would be required.
Hence I think it's too risky to merge this.
All readers except CommonMark, Docx, EPUB, Native and Odt call crFilter by themselves immediately before processing the text; I mentioned this in the first commit message. Removing the crFilter functionality from the Text<->ByteString conversion functions just moves control over this issue to the readers, so they can decide whether to filter out CRs or not. No tests are broken, by the way. If something breaks later, it is only a matter of adding crFilter to the affected reader, but it looks like none of the remaining readers need it: CommonMark uses an external library, Docx and Odt are XML formats, EPUB is a container for HTML, and Native, as a serialization format, is better off without any additional filtering.
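For context, the `crFilter` being discussed (from Text.Pandoc.Shared) is essentially a one-line filter; a minimal sketch, with the type signature assumed from how the readers use it:

```haskell
import qualified Data.Text as T

-- Sketch of crFilter: drop every carriage return. Note that a
-- lone CR vanishes entirely rather than becoming a newline,
-- which is exactly the behaviour the linked issue is about.
crFilter :: T.Text -> T.Text
crFilter = T.filter (/= '\r')

main :: IO ()
main = print (crFilter (T.pack "a\r\nb\rc"))  -- prints "a\nbc"
```

With this definition CRLF collapses to LF, but the lone CR between `b` and `c` is silently dropped instead of being treated as a line break.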
Ah, I'd forgotten that we'd moved crFilter inside the individual
readers.
The other things to check would be other consumers of the UTF8
functions... For example, does the template engine assume that
CRs have been stripped from templates? Will there be changes
if you do a header-include?
Just checked: the templating engine does not seem to assume anything about CRs. Header-includes now pass CRs through; before this PR, CRs were filtered out of includes. Isn't the new behaviour more correct? Why should CRs be removed from includes rather than being included as-is?
I can't think of a reason why they should be stripped. I'm just trying
to think of all the places this change might matter, and which we may
not have thought of...
Grep for
See the docs for System.IO.NewlineMode: "Specifies the translation, if any, of newline characters between internal Strings and the external file or stream. Haskell Strings are assumed to represent newlines with the '\n' character; the newline mode specifies how to translate '\n' on output, and what to translate into '\n' on input." So, I think part of the motivation for stripping CRs was to ensure that our strings conform to this invariant. (The newlines then get translated to CRLF on appropriate platforms, and a similar conversion is done on input -- see putStrLnWith in UTF8 module.) Note that System.IO only supports CR and CRLF newlines. So I'm not completely confident about the possible side effects of this change.
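The translation described in those docs can be observed directly with `hSetNewlineMode`; a small demo (the file name is arbitrary) that forces CRLF output even on a POSIX system and then inspects the raw bytes:

```haskell
import qualified Data.ByteString.Char8 as B
import System.IO

-- Demo of System.IO newline translation: with outputNL = CRLF,
-- the '\n' appended by hPutStrLn is written as "\r\n" on any
-- platform. The Newline type only has LF and CRLF constructors,
-- so a lone-CR newline convention cannot even be expressed here.
main :: IO ()
main = do
  h <- openFile "nl-demo.txt" WriteMode
  hSetNewlineMode h (NewlineMode { inputNL = CRLF, outputNL = CRLF })
  hPutStrLn h "one"
  hClose h
  bytes <- B.readFile "nl-demo.txt"  -- ByteString I/O is binary, no translation
  print bytes                         -- prints "one\r\n"
```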
Another option would be to change `crFilter`:

```haskell
crFilter [] = []
crFilter ('\r':'\n':xs) = '\n' : crFilter xs
crFilter ('\r':xs)      = '\n' : crFilter xs
crFilter (x:xs)         = x : crFilter xs
```

One would want to benchmark this to make sure it's not much slower than the current filter.
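A `Data.Text` analogue of that `String` version (the name `crToLf` is made up here) could use two `replace` passes; the order matters, otherwise CRLF would turn into two LFs:

```haskell
import qualified Data.Text as T

-- Hypothetical Text version of the normalizing filter above:
-- first rewrite CRLF to LF, then any remaining lone CR to LF.
crToLf :: T.Text -> T.Text
crToLf = T.replace (T.pack "\r") (T.pack "\n")
       . T.replace (T.pack "\r\n") (T.pack "\n")

main :: IO ()
main = print (crToLf (T.pack "a\r\nb\rc"))  -- prints "a\nb\nc"
```

The benchmark concern applies here too: each `T.replace` pass allocates a fresh buffer, whereas `T.filter` makes a single pass.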
Text files are read via

Instead of making the HTML reader parse CR as a newline (according to the HTML spec), you propose to replace single CRs with LF for all text formats, so they will all treat a single CR as a newline when they should not. And "\r\r\n" will be translated to "\r\n" on *nix and to "\n" on Windows. The proper solution would be to read all files in binary mode (by replacing

Then we will treat "\r\r\n" properly, as a carriage return followed by a newline, and single CRs will survive conversion from text markup formats. If someone needs to preserve CRs, this change can be applied on top of this PR. It is not related to the problem of single CRs not being handled as a newline in the HTML reader.
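A sketch of that proposal (the name `readFileBinary` and the exact behaviour are hypothetical here, not pandoc's actual `UTF8.readFile`): read raw bytes so the platform newline translation never runs, decode UTF-8, strip a leading BOM, and normalize CRLF, but not lone CR, to LF:

```haskell
import qualified Data.ByteString as B
import qualified Data.Text as T
import qualified Data.Text.Encoding as TE
import Data.Maybe (fromMaybe)

-- Hypothetical replacement for UTF8.readFile along the lines
-- proposed above. ByteString I/O is always binary, so Windows
-- CRLF translation does not apply; we do the CRLF -> LF step
-- ourselves, once, and leave lone CRs for readers to decide on.
readFileBinary :: FilePath -> IO T.Text
readFileBinary fp = do
  bytes <- B.readFile fp
  let txt   = TE.decodeUtf8 bytes
      noBom = fromMaybe txt (T.stripPrefix (T.pack "\xFEFF") txt)
  return (T.replace (T.pack "\r\n") (T.pack "\n") noBom)

main :: IO ()
main = do
  B.writeFile "cr-test.txt" (TE.encodeUtf8 (T.pack "\xFEFF\&a\r\nb\rc"))
  t <- readFileBinary "cr-test.txt"
  print t  -- prints "a\nb\rc": BOM gone, CRLF normalized, lone CR kept
```

Under this scheme the byte sequence `\r\r\n` comes through as `\r` followed by `\n`, matching the "carriage return followed by newline" reading above, and a lone CR survives for the HTML reader to interpret per its spec.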
No, the idea was that it would be translated to

You seem confident that it should be treated as
Your PR contains a change to Text.Pandoc.UTF8, which potentially affects many things. It's not just a change to the HTML reader. So I don't want to go forward until the details have been understood and sorted out.
I assume it is a typo and read it as "LF and CRLF".
If CRLF is converted to LF (on Windows), "\r\r\n" will be converted to "\r\n" during reading, and
If we agree that only LF (Haskell and POSIX standard) and CRLF (for Windows compatibility) line endings are valid, then CRCRLF can be read either as CR<>newline or CRCR<>newline.
CRCRLF is just a corner case I invented, in which case applying the proposed

I don't care about the CRCRLF case that much, but it worked correctly before my PR and works the same after my PR (unless we decide that we need to support CR newlines in all text formats).
Ok, let me summarize. System.IO has built-in "newline normalization", which translates CRLF in files to LF in process memory on Windows.
I assume that LF and CRLF are the only valid newlines. It is just HTML that treats single CRs differently, according to its standard. Before the PR it works like this:

1. Text files are read with UTF8.readFile, which applies newline normalization, removes the BOM and applies `crFilter`.
2. All `Text` pandoc readers apply `crFilter` again for themselves, which is a no-op because CRs are already removed.
After PR:
The changes are minimal because

If we want not a minimal change, but a more correct one, I propose that we change `UTF8.readFile` to read files in binary mode, remove the BOM and replace all CRLFs with LF regardless of the platform `pandoc` runs on. Then remove `crFilter` everywhere.
@jgm It is one more reason to move
Or, even better: make

In any case, I would like this minimal change to be applied as-is if no test case is found where it breaks something, just to fix the bug (in both HTML and CommonMark now). Additional correctness fixes can be applied on top of it after that.
> Before PR it works like this:
>
> 1. Text files are read with UTF8.readFile, which applies newline normalization, removes BOM and applies `crFilter`.
> 2. All `Text` pandoc readers apply `crFilter` again for themselves, which is no-op because CRs are already removed.
It's not a no-op! If you use pandoc as a library, you may not be
getting your input from UTF8.readFile or UTF8.getContents. That's why
`crFilter` needs to be there in the readers.
UTF8.readFile does filter CRs, it's true, because it uses
`toString`, which uses `toText`, which filters CRs. You might think
that because we use `crFilter` in the readers, we don't need to filter
them here. But that's not so obvious to me. For example, consider
the use of readFile to read a LaTeX include file, which is then added
to the stream of input to be parsed. Here `crFilter` isn't called;
we rely on CRs being filtered by `readFile`. There may be other cases
like this, too. I think a very careful audit of all uses of the UTF8
module's functions is needed before we remove CR stripping there.
> If we want not a minimal change, but more correct one, I propose that we change `UTF8.readFile` to read files in binary mode, remove BOM and replace all CRLFs with LF regardless of platform `pandoc` runs on. Then remove `crFilter` everywhere.
That doesn't work, because of the point above -- you can use the readers
and pass them your own input, without ever using UTF8.readFile. This
is what people typically do when they use pandoc in a web app, for
example.
Hmm, I hadn't thought of the readers as an external interface, thanks. Then readers should expect to receive anything, OK.
OK, let's postpone this PR until I add CR filtering to LaTeX.hs (there are two calls to
The use in the LaTeX reader was just one example. There are other formats that use include files, too. Of course, now that I think of it,
Applied to
Readers already use crFilter for themselves anyway.
According to HTML specification, CR, LF and CRLF are newlines. Closes jgm#4546
Force-pushed from 10a0d76 to c39fcef
No description provided.