8000 Fix #4546 (carriage return in HTML) by link2xt · Pull Request #4548 · jgm/pandoc · GitHub
[go: up one dir, main page]
More Web Proxy on the site http://driver.im/
Skip to content

Fix #4546 (carriage return in HTML) #4548

New issue

Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.

By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.

Already on GitHub? Sign in to your account

Open
wants to merge 3 commits into
base: main
Choose a base branch
from
Open

Conversation

link2xt
Copy link
Collaborator
@link2xt link2xt commented Apr 11, 2018

No description provided.

@jgm
Copy link
Owner
jgm commented Apr 11, 2018 via email

@link2xt
Copy link
Collaborator Author
link2xt commented Apr 11, 2018

@jgm

As I've noted on the linked issue, this will very likely have unexpected effects in various places, since all of the parsers have been developed under the assumption that the documents don't contain CRs.

All readers except CommonMark, Docx, EPUB, Native and Odt call crFilter by themselves immediately before processing the text. I mentioned it in the first commit message. Removing crFilter functionality from Text<->ByteString conversion functions just moves control over this issue to readers, so they can decide whether they filter out CRs or not.

No tests are broken by the way. If something breaks later, it is only a matter of adding crFilter to the affected reader, but looks like all the readers left don't need it. CommonMark uses external library, Docx and Odt are XML formats, EPUB is a container for HTML, and Native, as a serialization format, is better off without any kind of additional filtering.

@jgm
Copy link
Owner
jgm commented Apr 11, 2018 via email

@link2xt
Copy link
Collaborator Author
link2xt commented Apr 12, 2018

@jgm

For example, does the template engine assume that CRs have been stripped from templates? Will there be changes if you do a header-include?

Just checked, templating engine does not seem to assume anything about CRs. Header-includes now pass CRs through. Before this PR CRs were filtered out from includes.

Isn't the new behaviour more correct? Why should CRs be removed from includes, as opposed to including them as-is?

@jgm
Copy link
Owner
jgm commented Apr 12, 2018 via email

@link2xt
Copy link
Collaborator Author
link2xt commented Apr 12, 2018

Grep for toText and toString (which calls toText) does not return that many results, none of them looks like a place to insert crFilter for me.

@jgm
Copy link
Owner
jgm commented Apr 14, 2018

See the docs for System.IO.NewlineMode:

"Specifies the translation, if any, of newline characters between internal Strings and the external file or stream. Haskell Strings are assumed to represent newlines with the '\n' character; the newline mode specifies how to translate '\n' on output, and what to translate into '\n' on input."

So, I think part of the motivation for stripping CRs was to ensure that our strings conform to this invariant. (The newlines then get translated to CRLF on appropriate platforms, and a similar conversion is done on input -- see putStrLnWith in UTF8 module.)

Note that System.IO only supports CR and CRLF newlines.

So I'm not completely confident about the possible side effects of this change.

@jgm
Copy link
Owner
jgm commented Apr 14, 2018

Another option would be to change crFilter to something like

crFilter [] = []
crFilter ('\r':'\n':xs) = '\n':crFilter xs
crFilter ('\r':xs) = '\n':crFilter xs
crFilter (x:xs) = x:crFilter xs

One would want to benchmark this to make sure it's not much slower than the current filter.

@link2xt
Copy link
Collaborator Author
link2xt commented Apr 14, 2018

@jgm

See the docs for System.IO.NewlineMode
...
So, I think part of the motivation for stripping CRs was to ensure that our strings conform to this invariant.

Text files are read via Text.Pandoc.UTF8.readFile. It calls System.IO.openFile, which opens a file in text mode, in which CRLF is translated to LF on Windows. The reason for stripping CRs manually is to handle the case when a file with CRLFs is read on *nix platform.

Another option would be to change crFilter to something like

Instead of making HTML reader parse CR as a newline (according to HTML spec) you propose to replace single CRs with LF for all text formats, so now they will all treat single CR as a newline, when they should not. And "\r\r\n" will be translated to "\r\n" on *nix and to "\n" on Windows.

Proper solution would be to read all files in binary mode (by replacing openFile with openBinaryFile) and replace CRLFs manually with

crFilter [] = []
crFilter ('\r':'\n':xs) = '\n':crFilter xs
crFilter (x:xs) = x:crFilter xs

Then, we will treat "\r\r\n" properly as a carriage return followed by newline and single CRs will survive conversion from text markup formats. If someone needs to preserve CRs, this change can be applied on top of this PR. It is not related to the problem: single CRs not being handled as a newline in HTML reader.