-
Notifications
You must be signed in to change notification settings - Fork 1.8k
RTL Language Support #2107
New issue
Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.
By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.
Already on GitHub? Sign in to your account
base: master
Are you sure you want to change the base?
RTL Language Support #2107
Conversation
- Text aligned right when direction is set to rtl - Text of RTL languages is reordered while LTR languages should be untouched
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
This is a very nice addition. Nicola has done a lot of good work in this area and is very active in maintaining his projects so I think using his libraries is a solid choice.
The basic implementation looks solid so I'm fine moving forward with what you have even if you don't work out some of the kinks.
What were your thoughts regarding block rendering. I was thinking we render from the last child instead of the first child for containers where direction is "rtl". But it would require knowing which children fit on the line first rather than calculating as we go. |
I did not really think about it in detail. The problem is even more complex though when trying to implement it perfectly. This line is rather simple:
With the current implementation we get 21 34 65 due to a bug in the library I guess will be fixed soon. The approach to just reorder the blocks of one line (if we would be able to solve the problem you described) would produce a wrong result here:
We want an unchanged order of 123456. The same belongs not just to english parts in an RTL block but also to the RTL languages as far as I understand where (some) words or glyphs are not reordered. My initial thoughts were to add the changes even earlier in the dompdf process and passing the content of the whole block into the Bidi calculation. and rearranging the inline blocks accordingly. |
Starting with this very basic support (after the aligning fix) would be a good idea in my opinion. There are so many things to do or consider having a better implementation. Reverting order of table columns is just one more coming to my mind. |
I was looking at this and I have some ideas about 712. Let me collect my thoughts and get back to you. |
it isn't work on the persian language. |
Then you most likely did not set a |
…uages like arabic
Activated shaping of the tc-lib-unicode library in 62e7b36 which should fix this and also the issues in #712 |
Hello all. First of all, thank you for your work and support! I just would like to know if you know when the dompdf version will be available with this feature? I can see the PR is approved now. :) |
Thanks for these informations. :) |
@simonberger I went ahead and resolved a conflict in the Style class so this PR should be up-to-date with the changes I made. |
I'll take another quick look, but I think we're probably good to merge. |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
OK, so I've finally had a chance to look at this and toy around with the rendering and thinking about this feature some more. Sorry for the late detail analysis, there's a lot to discuss here so strap yourself in. Or scroll to the bottom for the tl;dr
I think the overall idea works but we need to tweak to the specifics of the implementation. I found three issues that should be addressable without too much work. I haven't made any changes because I wanted to get your feedback.
First, with this implementation you have to disable font subsetting. Otherwise, the rendered PDF shows the unknown-glyph character for shaped text. The reason is that the character subset in the document is determined before layout (ref). This is problematic because text shaping basically replaces the typed character with the shaped version. So, what we have in the font subset is the typed characters when we really need the shaped characters.
The quickest fix would be to modify the font subsetting logic to parse RTL text through tc-lib-unicode prior to registering it in the font subset. This would result in a performance hit, though, because then we'll be parsing the text with tc-lib-unicode twice.
I believe the long term solution to the subsetting issue is to move the logic into CPDF. This should give us better performance and more reliable output by making subsetting part of the CPDF rendering ("out") process.
Second, I'm pretty sure that even if text is not styled RTL that it should still be rendered with shaping. We could go all out and perform subsetting, directionalization, and shaping of the text prior to rendering (i.e. in the location referenced above where we currently do subsetting). It's a pretty significant change, though, because the actual document would have to be manipulated to make this work. As such this probably requires a larger discussion and may be best left to a future release. I think short term I'm fine limiting shaping to RTL text.
Long term, and ignoring for now discussion of the process by which we determine to what text we apply shaping, I think we can also resolve this issue by moving the logic into the PDF library (same as subsetting). It seems like the appropriate separation of concerns that the HTML+CSS rendering engine determines the where of rendering and the PDF rendering engine determines the what. With caveats that we would have to figure out as we analyze the issue more.
Third, the current implementation works well for short amounts of text but appears to break when the text flows to a new line. We're performing the directionality parsing too early in the process and, as a result, the text is reversed prior to reflowing within the document. So words that should appear on the first line appear, instead, on the last line.
For example, the following HTML:
<!DOCTYPE html>
<html>
<head>
<meta charset="utf-8">
<meta http-equiv="Content-type" content="text/html; charset=UTF-8">
<link href="https://fonts.googleapis.com/css2?family=Amiri&display=swap" rel="stylesheet">
<style type="text/css">
body{
direction: rtl;
font-family: Amiri, DejaVu Sans, monospace;
font-size: 36pt;
}
</style>
</head>
<body>
<p>
این یک متن تست جهت بررسی موضوع
12 34 56
<br>
<span>
این یک متن تست جهت بررسی موضوع
12 34 56
</span>
</p>
</body>
</html>
looks as follows in the browser vs PDF:
(highlighted text shows how particular words are placed incorrectly).
I think to address this issue we just need to move tc-lib-unicode usage later in the process. Maybe make it part of the canvas method and update the text methods with an argument indicating whether or not the text should be directionalized and shaped before being placed in the PDF.
This is probably one more areas where moving the actual text parsing logic into CPDF makes sense.
tl;dr long term my thought is we move a lot of the actual unicode parsing into CPDF. That seems like a lot more work than the current implementation, so I think for the next release we can better accommodate this feature by doing the following:
a. add a first round of text shaping to the font subset routine when the text is styled RTL so that the subsetted font contains the correct characters
b. keep the current logic in the text reflower but pass in "L" for the forcertl argument so that we transform the text early enough for reflow but don't reverse line order
c. add logic to the canvas adapter text methods to specify the directionality of the text and have those methods perform the RTL transformation
src/FrameReflower/Text.php
Outdated
$text = $this->_collapse_white_space($text); | ||
|
||
if ($style->direction === 'rtl') { | ||
$bidi = new Bidi($text, null, null, 'R', true); |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
I think instead of passing in 'R' for forcertl we should pass in false. tc-lib-unicode appears to have some logic (ref) to determine if RTL is needed for the passed-in text. I did some basic testing using the following HTML and it seems to work as expected:
<!DOCTYPE html>
<html>
<head>
<meta charset="utf-8">
<meta http-equiv="Content-type" content="text/html; charset=UTF-8">
<link href="https://fonts.googleapis.com/css2?family=Amiri&display=swap" rel="stylesheet">
<style type="text/css">
body{
direction: rtl;
font-family: Amiri, DejaVu Sans, monospace;
font-size: 36pt;
}
</style>
</head>
<body>
<div class="control-body">
<p>
12 <span style="color: red">34</span> 56
</p>
<p>
این یک متن تست جهت بررسی موضوع
</p>
</div>
</body>
</html>
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
In case a right-to-left direction is request I think it is safe to force it.
Which problems do you want to solve letting tc-lib-unciode decide what to do?
If we are doing the step to pass all text it is another story surely. But I don't know if browsers are having an auto detection. I doubt that as there is just the CSS value ltr or rtl.
EDIT:
In fact the browsers indeed are applying a RTL text transform and a missing direction:rtl style just is not doing a right aligning.
EDIT 2:
Even direction:ltr is not stopping the browsers from transforming the text. This style is moreover changing structure ordering (table columns, floating?, text aligning...) rather than the text itself?
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
I think we can keep this initial release focused on the text its 6D40 elf. We can focus a future release on layout modification for rtl content. Conceptually this seems straightforward, just position from the right side of the container instead of the left. I think probably what will make it difficult is locating all the places where we need to add logic for this.
@bsweeney Thank you for the detailed review. That's what I was hoping for. (1) Font subsetting To fix that in the current state we have some options beside the tc-unicode-lib double run:
(2) Always apply shaping (3) Breaking multiline RTL text While writing an answer here I did a change which should work better: I don't know how hard I thought about the place I set the bidi transform. I remember the idea was to be able to still do a split in case the text exceeds the limit just after the transform (I doubt that's really needed) but because the transform changes the order of the words the order gets broken if a line break happens. |
Found the code comments suggesting the move of the font subsetting from rendering to output time. |
Regarding subsetting: So as you can see the whole reason we do subsetting the way we do is because the subset is built at the time the font is selected. I think the change would actually be fairly straightforward, but that's based on preliminary analysis. From what I see we would need to
There may be additional work, but that's what I see right now. We should also look at fixing up font subsetting support in PDFLib as I'm not sure we're doing it at all right now (I think all we need to do is set the global subsetting option). This should probably be done in a separate branch. And we should probably do this for the next release if the totality of the work is as I've outlined above. |
When will it be appropriate for Hebrew? |
@sbc640964 It should work for Hebrew text. |
@simonberger and this the document
|
@sbc640964 Here is a working sample with your text:
EDIT: |
@simonberger I think that text is supposed to be rendered RTL due to strong directionality of the characters. This goes back to our discussion of enabling the bidi parsing at all time. This is a tweak I was looking at making against your branch but haven't gotten very far with (and we need to discuss more anyway). @sbc640964 this branch is still a work-in-progress and we still have much to work out. If you set the direction of the text to RTL you'll see it rendered in the order expected. The directionality logic is only enabled when RTL is specified.
|
I have been testing out these commits on my staging site and am finding a similar issue about the RTL rendering improperly if the text string is too long to fit onto one line. This was mentioned by @bsweeney in their comment above (#2107 (review)). Additionally, I'm finding that the unordered list decoration is not appearing before the list item text (which should be the right side of the page) instead it is staying to the left side of the page. Did anyone else have that issue? |
Solution is here Click Here |
Hello! |
Found the best and simple trick to fix this, simply write :
It fixes the issue with letters AND lines that are reversed |
Here is the full explanation : Add the following in the file located at vendor/dompdf/dompdf/src/Renderer/Text.php Above the line Add:
In your text, add: |
Thank you for the explanation! It would be better if we could fix this without touching the vendor files. |
This feature adds:
This does not use the ar-php library which was unstable in my tests and provided often wrong results. Instead it includes https://github.com/tecnickcom/tc-lib-unicode which looks promising.
I have zero knowledge of all languages like Arabic, Hebrew, Persian this supports. If someone has some more insights and could find flaws of this implementation or on the use of this library instead of ar-php please let us know here.
Known issues: