RTL Language Support #2107

simonberger · 2020-03-07T23:16:09Z

This feature adds:

Text is aligned right when direction is set to RTL
Text of RTL languages is reordered while LTR languages should be untouched
Dir attribute is supported beside the direction style on the most important elements

This does not use the ar-php library which was unstable in my tests and provided often wrong results. Instead it includes https://github.com/tecnickcom/tc-lib-unicode which looks promising.

I have zero knowledge of all languages like Arabic, Hebrew, Persian this supports. If someone has some more insights and could find flaws of this implementation or on the use of this library instead of ar-php please let us know here.

Known issues:

If inline blocks are used to decorate text for example, this block isn't correctly reordered in the line. Trying to move even these blocks with its contents could be a very troublesome job. It is caused by how dompdf internally works. I'll still have a look and maybe someone has an idea.

- Text aligned right when direction is set to rtl - Text of RTL languages is reordered while LTR languages should be untouched

bsweeney

This is a very nice addition. Nicola has done a lot of good work in this area and is very active in maintaining his projects so I think using his libraries is a solid choice.

The basic implementation looks solid so I'm fine moving forward with what you have even if you don't work out some of the kinks.

src/Css/Style.php

bsweeney · 2020-03-08T16:53:39Z

What were your thoughts regarding block rendering. I was thinking we render from the last child instead of the first child for containers where direction is "rtl". But it would require knowing which children fit on the line first rather than calculating as we go.

simonberger · 2020-03-08T17:31:06Z

I did not really think about it in detail. The problem is even more complex though when trying to implement it perfectly.

This line is rather simple:

<p style="direction: rtl">
	12 <span style="color: red">34</span> 56
</p>

With the current implementation we get 21 34 65 due to a bug in the library I guess will be fixed soon.
But we want to get 56 34 12 anyway.

The approach to just reorder the blocks of one line (if we would be able to solve the problem you described) would produce a wrong result here:

<p style="direction: rtl">
	12<span style="color: red">34</span>56
</p>

We want an unchanged order of 123456. The same belongs not just to english parts in an RTL block but also to the RTL languages as far as I understand where (some) words or glyphs are not reordered.

My initial thoughts were to add the changes even earlier in the dompdf process and passing the content of the whole block into the Bidi calculation. and rearranging the inline blocks accordingly.
But as wrote I did not evaluate if this is feasible nor how much effort this would cost.

simonberger · 2020-03-08T17:53:13Z

Starting with this very basic support (after the aligning fix) would be a good idea in my opinion. There are so many things to do or consider having a better implementation. Reverting order of table columns is just one more coming to my mind.
We could try to tackle it in a later version.

… set

simonberger · 2020-03-10T19:37:41Z

Addresses:
#1009
#2103

I do not completely understand the issue in #712 but that shouldn't be fixed with this right?

bsweeney · 2020-03-11T02:41:07Z

I was looking at this and I have some ideas about 712. Let me collect my thoughts and get back to you.

maziyaryosofi · 2020-03-14T19:52:09Z

it isn't work on the persian language.
also have problem with word and align:right also not working.
in this method the text like سلام print as مالس

simonberger · 2020-03-14T20:46:19Z

Then you most likely did not set a direction: rtl css style
The characters aren't joined though which is the problem mentioned in #712.

…uages like arabic

simonberger · 2020-03-14T21:10:26Z

Activated shaping of the tc-lib-unicode library in 62e7b36 which should fix this and also the issues in #712
I deactivated this initially because of some results that seemed not correct to me and now somehow almost forgot about it. On this example it produces a correct result.
We need some more testing.

ThomasCaud · 2020-04-20T08:22:16Z

Hello all. First of all, thank you for your work and support! I just would like to know if you know when the dompdf version will be available with this feature? I can see the PR is approved now. :)
Thomas

simonberger · 2020-04-20T21:21:13Z

We still had one issue which now is fixed. (#2134, #2135)
This will be merged when #2134 is finished.

Releasing is managed by @bsweeney. From my side I'll just try to merge all my other pending pull requests and won't start something else for the next release.

ThomasCaud · 2020-04-21T07:19:13Z

Thanks for these informations. :)

bsweeney · 2020-05-02T23:52:36Z

@simonberger I went ahead and resolved a conflict in the Style class so this PR should be up-to-date with the changes I made.

simonberger · 2020-05-03T00:24:02Z

@bsweeney
Thank you for Merging. Results are similar to how our merge branch looked like which is good.
From my side we could merge this now.
Do you want to have another look at #712 and/or did you test my change activating shaping?

bsweeney · 2020-05-03T19:35:07Z

I'll take another quick look, but I think we're probably good to merge.

bsweeney

OK, so I've finally had a chance to look at this and toy around with the rendering and thinking about this feature some more. Sorry for the late detail analysis, there's a lot to discuss here so strap yourself in. Or scroll to the bottom for the tl;dr

I think the overall idea works but we need to tweak to the specifics of the implementation. I found three issues that should be addressable without too much work. I haven't made any changes because I wanted to get your feedback.

First, with this implementation you have to disable font subsetting. Otherwise, the rendered PDF shows the unknown-glyph character for shaped text. The reason is that the character subset in the document is determined before layout (ref). This is problematic because text shaping basically replaces the typed character with the shaped version. So, what we have in the font subset is the typed characters when we really need the shaped characters.

The quickest fix would be to modify the font subsetting logic to parse RTL text through tc-lib-unicode prior to registering it in the font subset. This would result in a performance hit, though, because then we'll be parsing the text with tc-lib-unicode twice.

I believe the long term solution to the subsetting issue is to move the logic into CPDF. This should give us better performance and more reliable output by making subsetting part of the CPDF rendering ("out") process.

Second, I'm pretty sure that even if text is not styled RTL that it should still be rendered with shaping. We could go all out and perform subsetting, directionalization, and shaping of the text prior to rendering (i.e. in the location referenced above where we currently do subsetting). It's a pretty significant change, though, because the actual document would have to be manipulated to make this work. As such this probably requires a larger discussion and may be best left to a future release. I think short term I'm fine limiting shaping to RTL text.

Long term, and ignoring for now discussion of the process by which we determine to what text we apply shaping, I think we can also resolve this issue by moving the logic into the PDF library (same as subsetting). It seems like the appropriate separation of concerns that the HTML+CSS rendering engine determines the where of rendering and the PDF rendering engine determines the what. With caveats that we would have to figure out as we analyze the issue more.

Third, the current implementation works well for short amounts of text but appears to break when the text flows to a new line. We're performing the directionality parsing too early in the process and, as a result, the text is reversed prior to reflowing within the document. So words that should appear on the first line appear, instead, on the last line.

For example, the following HTML:

<!DOCTYPE html>
<html>
<head>
    <meta charset="utf-8">
    <meta http-equiv="Content-type" content="text/html; charset=UTF-8">

    <link href="https://fonts.googleapis.com/css2?family=Amiri&display=swap" rel="stylesheet">
    <style type="text/css">
        body{
            direction: rtl;
            font-family: Amiri, DejaVu Sans, monospace;
            font-size: 36pt;
        }
    </style>
</head>
<body>
  <p>
    این یک متن تست جهت بررسی موضوع
    12 34 56
    <br>
    <span>
      این یک متن تست جهت بررسی موضوع
      12 34 56
    </span>
  </p>
</body>
</html>

looks as follows in the browser vs PDF:

(highlighted text shows how particular words are placed incorrectly).

I think to address this issue we just need to move tc-lib-unicode usage later in the process. Maybe make it part of the canvas method and update the text methods with an argument indicating whether or not the text should be directionalized and shaped before being placed in the PDF.

This is probably one more areas where moving the actual text parsing logic into CPDF makes sense.

tl;dr long term my thought is we move a lot of the actual unicode parsing into CPDF. That seems like a lot more work than the current implementation, so I think for the next release we can better accommodate this feature by doing the following:

a. add a first round of text shaping to the font subset routine when the text is styled RTL so that the subsetted font contains the correct characters
b. keep the current logic in the text reflower but pass in "L" for the forcertl argument so that we transform the text early enough for reflow but don't reverse line order
c. add logic to the canvas adapter text methods to specify the directionality of the text and have those methods perform the RTL transformation

bsweeney · 2020-05-08T03:13:22Z

src/FrameReflower/Text.php

+                $text = $this->_collapse_white_space($text);
+
+                if ($style->direction === 'rtl') {
+                    $bidi = new Bidi($text, null, null, 'R', true);


I think instead of passing in 'R' for forcertl we should pass in false. tc-lib-unicode appears to have some logic (ref) to determine if RTL is needed for the passed-in text. I did some basic testing using the following HTML and it seems to work as expected:

<!DOCTYPE html> <html> <head> <meta charset="utf-8"> <meta http-equiv="Content-type" content="text/html; charset=UTF-8"> <link href="https://fonts.googleapis.com/css2?family=Amiri&display=swap" rel="stylesheet"> <style type="text/css"> body{ direction: rtl; font-family: Amiri, DejaVu Sans, monospace; font-size: 36pt; } </style> </head> <body> <div class="control-body"> <p> 12 <span style="color: red">34</span> 56 </p> <p> این یک متن تست جهت بررسی موضوع </p> </div> </body> </html>

In case a right-to-left direction is request I think it is safe to force it.
Which problems do you want to solve letting tc-lib-unciode decide what to do?

If we are doing the step to pass all text it is another story surely. But I don't know if browsers are having an auto detection. I doubt that as there is just the CSS value ltr or rtl.

EDIT:
In fact the browsers indeed are applying a RTL text transform and a missing direction:rtl style just is not doing a right aligning.

EDIT 2:
Even direction:ltr is not stopping the browsers from transforming the text. This style is moreover changing structure ordering (table columns, floating?, text aligning...) rather than the text itself?

I think we can keep this initial release focused on the text its 6D40 elf. We can focus a future release on layout modification for rtl content. Conceptually this seems straightforward, just position from the right side of the container instead of the left. I think probably what will make it difficult is locating all the places where we need to add logic for this.

simonberger · 2020-05-08T18:08:50Z

@bsweeney Thank you for the detailed review. That's what I was hoping for.

(1) Font subsetting
Good point. First a question. Isn't it generally possible (needed effort aside) to create the embedded font object dynamically or at the end or does this have to be complete at the start?

To fix that in the current state we have some options beside the tc-unicode-lib double run:

Disable font subsetting if we have any RTL style (no preferred solution)
Include all chars that could be added by shaping (https://github.com/tecnickcom/tc-lib-unicode-data/blob/develop/src/Arabic.php). This could be a good compromise but from a quick look it does not seem to work that well as the font subsetting is looking at the text instead of caring at the first place which chars the font provides. But we can include the possible substitutions whenever we hit a "source-glyph".
Running tc-lib-unicode or a sub process of it as you mentioned

(2) Always apply shaping
Thought about this as well. But we maybe should do that if we are sure it works well. Also we might enable processing it just by option.
On the other hand I don't know when and why shaping should have impact on non RTL text. It manipulates just an assessable number of glyphs.

(3) Breaking multiline RTL text
Took me some time to get the problem. [offtopic]It is quiet annoying to work on this with debugging tools as they switch are applying RTL direction to debugging state while it is off in the editor[/offtopic].

While writing an answer here I did a change which should work better:
7680338

I don't know how hard I thought about the place I set the bidi transform. I remember the idea was to be able to still do a split in case the text exceeds the limit just after the transform (I doubt that's really needed) but because the transform changes the order of the words the order gets broken if a line break happens.

simonberger · 2020-05-10T18:08:02Z

Found the code comments suggesting the move of the font subsetting from rendering to output time.
I evaluated the change and for my knowledge of the code almost the whole logic of the selectFontmethod needs to move because of usage of the data probably coming from subsetting. But that should feasible.

bsweeney · 2020-05-10T19:43:32Z

Regarding subsetting: So as you can see the whole reason we do subsetting the way we do is because the subset is built at the time the font is selected. I think the change would actually be fairly straightforward, but that's based on preliminary analysis. From what I see we would need to

Remove the font subsetting logic from Dompdf.cls.php and then within the CPDF class call registerText from the addText method.
In CPDF move the actual subsetting logic from the selectFont method to the "out" action of the o_font method.

There may be additional work, but that's what I see right now. We should also look at fixing up font subsetting support in PDFLib as I'm not sure we're doing it at all right now (I think all we need to do is set the global subsetting option).

This should probably be done in a separate branch. And we should probably do this for the next release if the totality of the work is as I've outlined above.

sbc640964 · 2020-06-28T17:01:35Z

When will it be appropriate for Hebrew?

simonberger · 2020-06-28T18:58:15Z

@sbc640964 It should work for Hebrew text.
If you test it please leave some feedback if it works well or if you have problems.
We have no language knowledge to really test it.

sbc640964 · 2020-06-29T14:08:17Z

@simonberger
I try unsuccessfully.
I am adding the
<meta http-equiv="Content-Type" content="text/html; charset=utf-8"/>

and this the document

<style>
  @font-face {
       font-family: 'Assistant';
       font-style: normal;
       font-weight: normal;
       src: url("the/path/to/ttf/file") format('truetype');
   }
</style>
<h2 class="text-info pb-5 font-weight-normal"  style="font-family: 'Assistant', sans-serif;">זו הכותרת בעברית</h2>

and this is the result

simonberger · 2020-06-29T14:54:15Z

@sbc640964
Something is wrong with your styles or your font file.

Here is a working sample with your text:

<html>
<head>
    <link rel="stylesheet" href="https://fonts.googleapis.com/css2?family=Assistant&display=swap">
</head>
<body>
    <div style="font-family: 'Assistant', sans-serif;">זו הכותרת בעברית</div>
</body>
</html>

EDIT:
That Hebrew text is displayed correctly is nothing this feature changes. It works already in 0.8.5. This pull request applies right to left rendering and right aligning.

bsweeney · 2020-06-30T03:20:34Z

@simonberger I think that text is supposed to be rendered RTL due to strong directionality of the characters. This goes back to our discussion of enabling the bidi parsing at all time. This is a tweak I was looking at making against your branch but haven't gotten very far with (and we need to discuss more anyway).

@sbc640964 this branch is still a work-in-progress and we still have much to work out. If you set the direction of the text to RTL you'll see it rendered in the order expected. The directionality logic is only enabled when RTL is specified.

<html>
<head>
    <link rel="stylesheet" href="https://fonts.googleapis.com/css2?family=Assistant&display=swap">
</head>
<body>
    <div style="font-family: 'Assistant', sans-serif; direction: rtl;">זו הכותרת בעברית</div>
</body>
</html>

kory-northrop · 2021-11-24T18:02:39Z

I have been testing out these commits on my staging site and am finding a similar issue about the RTL rendering improperly if the text string is too long to fit onto one line. This was mentioned by @bsweeney in their comment above (#2107 (review)).

Additionally, I'm finding that the unordered list decoration is not appearing before the list item text (which should be the right side of the page) instead it is staying to the left side of the page. Did anyone else have that issue?

learn05 · 2022-03-31T19:36:24Z

Solution is here Click Here

per #2107

Magnatox · 2024-07-30T19:33:20Z

Hello!
Did anyone find a solution for the first words rendered in a new line when the text is too long?

johnslm · 2024-11-10T23:10:18Z

Found the best and simple trick to fix this, simply write :

<p dir="rtl" style="text-align: right; writing-mode: horizontal-tb; direction: rtl; font-family: XXX;">YOUR RTL TEXT</p>

It fixes the issue with letters AND lines that are reversed

Magnatox · 2024-11-10T23:18:48Z

Found the best and simple trick to fix this, simply write :

<p dir="rtl" style="text-align: right; writing-mode: horizontal-tb; direction: rtl; font-family: XXX;">YOUR RTL TEXT</p>

It fixes the issue with letters AND lines that are reversed

Hello @johnslm
Can you confirm if it fixes this issue?

johnslm · 2024-11-10T23:45:33Z

Here is the full explanation :

Add the following in the file located at vendor/dompdf/dompdf/src/Renderer/Text.php

Above the line $this->_canvas->text($x, $y, $text,

Add:

if (strtolower($style->direction) == 'rtl') {
    preg_match_all('/./us', $text, $ar);
    $text = join('', array_reverse($ar[0]));
}

In your text, add:
<p dir="rtl" style="text-align: right; writing-mode: horizontal-tb; direction: rtl; font-family: XXX;">YOUR RTL TEXT</p>

Magnatox · 2024-11-11T00:00:31Z

Here is the full explanation :

Add the following in the file located at vendor/dompdf/dompdf/src/Renderer/Text.php

Above the line $this->_canvas->text($x, $y, $text,

Add:
if (strtolower($style->direction) == 'rtl') {
    preg_match_all('/./us', $text, $ar);
    $text = join('', array_reverse($ar[0]));
}
In your text, add:
<p dir="rtl" style="text-align: right; writing-mode: horizontal-tb; direction: rtl; font-family: XXX;">YOUR RTL TEXT</p>

Thank you for the explanation!

It would be better if we could fix this without touching the vendor files.

dompdf#712 Added support for languages read from right to left

1f8f057

- Text aligned right when direction is set to rtl - Text of RTL languages is reordered while LTR languages should be untouched

simonberger mentioned this pull request Mar 7, 2020

text-align: right doesn't work #2103

Closed

bsweeney approved these changes Mar 8, 2020

View reviewed changes

bsweeney reviewed Mar 8, 2020

View reviewed changes

src/Css/Style.php Outdated Show resolved Hide resolved

simonberger added 2 commits March 8, 2020 21:37

Fixed set align:right through direction:rtl just if no align has been…

2dc98a1

… set

fixed last commit

30abbdb

dompdf#2107 dompdf#712 Activated shaping (joining of glyphs) for lang…

62e7b36

…uages like arabic

simonberger mentioned this pull request Apr 12, 2020

Update Style class to more fully support computed property usage #2134

Merged

Merge branch 'develop' into feature/rtl_language_support

bf053a4

bsweeney added this to the 0.8.6 milestone May 8, 2020

bsweeney requested changes May 8, 2020

View reviewed changes

Move bidi transform past the line splitting

7680338

simonberger mentioned this pull request May 10, 2020

Move font subsetting from render to output #2149

Merged

4 tasks

bsweeney mentioned this pull request Jul 18, 2020

Dompdf Version 0.8.5 Arabic RTL Not Showing Correctly #2197

Closed

bsweeney mentioned this pull request Jul 9, 2021

Support The Arabic language #2484

Closed

bsweeney modified the milestones: dompdf-next, 2.0.0 Aug 24, 2021

bsweeney mentioned this pull request Oct 7, 2021

dompdf not getting urdu text #2586

Closed

bsweeney mentioned this pull request Oct 27, 2021

Possible bug generating certain Unicode #2618

Closed

bsweeney linked an issue Oct 27, 2021 that may be closed by this pull request

Support complex text layout (bidi, shaping, ordering) #2619

Open

19 tasks

bsweeney modified the milestones: 2.0.0, 2.1.0 Nov 8, 2021

bsweeney changed the base branch from develop to master November 8, 2021 17:09

bsweeney mentioned this pull request Dec 10, 2022

Supporting LTR #3078

Closed

bsweeney mentioned this pull request May 2, 2023

dompdf Urdu Words Not Joining Issue #3091

Open

bsweeney mentioned this pull request Aug 20, 2023

Issue generating pdf in hebrew language using dompdf libraray #3260

Closed

bsweeney mentioned this pull request Mar 17, 2024

add Hebrew support #3420

Open

bsweeney added a commit that referenced this pull request Mar 19, 2024

WIP

9b15471

per #2107

bsweeney added a commit that referenced this pull request Mar 19, 2024

WIP

a5ea311

per #2107

bsweeney modified the milestones: 3.1.0, 4.0.0 Dec 6, 2024

Provide feedback

Saved searches

Use saved searches to filter your results more quickly

RTL Language Support #2107

RTL Language Support #2107

RTL Language Support #2107

Are you sure you want to change the base?

RTL Language Support #2107

Conversation

Choose a reason for hiding this comment

Choose a reason for hiding this comment

Choose a reason for hiding this comment

Choose a reason for hiding this comment

Choose a reason for hiding this comment