automatic pinyin and furigana ruby tags in js
follyanna uses intelligent furigana placement to strip redundant kana from furigana readings. It also supports pinyin rendering.
For instance:
- 頑張る[がんばる] -> 頑張る
- 加油[jia1 you2] -> 加油
This is helpful because a lot of dictionary data, such as JMdict, gives the reading for a whole entry rather than per character. Rendered as a single block over the whole word, these readings are often a bit clunky to read:
- 頑張る
- 阿吽の呼吸
This is similar to react-furi, but I wanted a version that could work in an Anki deck or in other plain JS environments without including React.
- Include furigana.bundle.js in your webpage.
頑張る[がんばる]
頑張る
阿吽の呼吸[あうんのこきゅう]
阿吽の呼吸
冴え冴え[さえざえ]
冴え冴え
権兵衛が種蒔きゃ烏がほじくる[ごんべえがたねまきゃからすがほじくる]
権兵衛が種蒔きゃ烏がほじくる
蒔かぬ種は生えぬ[まかぬたねははえぬ]
蒔かぬ種は生えぬ
秋の野芥子[あきののげし]
秋の野芥子
巻き脚絆[まききゃはん]
巻き脚絆
We use a greedy algorithm to match each kana sequence in the text to a set of candidate positions in the reading. If the kana sequence sits at the start or tail of the string (as in 頑張る), we know with certainty where it matches. If, after matching the start and tail sequences, a kana sequence in the middle of the string has only one candidate position, we can match it unambiguously as well.
The only ambiguous cases are those with at least two candidates in the middle of the string. Right now we cannot resolve these, so the tokenizer falls back to placing the full reading over the whole string. This should be solvable using Levenshtein distance or public datasets, and support may be added in a future revision.
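The matching described above can be sketched roughly as follows. This is a minimal illustration, not follyanna's actual code: the function names (splitRuns, candidatePositions, alignFurigana, toRubyHtml) are hypothetical, it assumes the reading is written in the same kana script as the kana in the text, and it only handles a single kana run in the middle of the string, falling back to the whole string otherwise.

```javascript
// Hypothetical helpers for illustration; not follyanna's API.
const isKana = (ch) => /[\u3040-\u309f\u30a0-\u30ff]/.test(ch);

// Split text into alternating runs of kana and non-kana characters.
function splitRuns(text) {
  const runs = [];
  for (const ch of text) {
    const kana = isKana(ch);
    const last = runs[runs.length - 1];
    if (last && last.kana === kana) last.text += ch;
    else runs.push({ text: ch, kana });
  }
  return runs;
}

// Positions where a mid-string kana run could sit in the reading,
// leaving at least one reading character for the kanji on each side.
function candidatePositions(rest, kana) {
  const positions = [];
  for (let i = 1; i + kana.length < rest.length; i++) {
    if (rest.startsWith(kana, i)) positions.push(i);
  }
  return positions;
}

function alignFurigana(text, reading) {
  const runs = splitRuns(text);
  const head = [];
  const tail = [];
  let rest = reading;

  // Kana at the start of the text must match the start of the reading.
  if (runs.length > 0 && runs[0].kana && rest.startsWith(runs[0].text)) {
    const run = runs.shift();
    rest = rest.slice(run.text.length);
    head.push({ text: run.text });
  }
  // Kana at the tail of the text must match the end of the reading.
  const last = runs[runs.length - 1];
  if (last && last.kana && rest.endsWith(last.text)) {
    runs.pop();
    rest = rest.slice(0, rest.length - last.text.length);
    tail.push({ text: last.text });
  }

  // One kana run in the middle with exactly one candidate: split there.
  let middle = [];
  if (runs.length === 3 && runs[1].kana) {
    const positions = candidatePositions(rest, runs[1].text);
    if (positions.length === 1) {
      const at = positions[0];
      middle = [
        { text: runs[0].text, ruby: rest.slice(0, at) },
        { text: runs[1].text },
        { text: runs[2].text, ruby: rest.slice(at + runs[1].text.length) },
      ];
    }
  }
  // Ambiguous or unsupported: put the remaining reading over the whole rest.
  if (middle.length === 0 && runs.length > 0) {
    middle = [{ text: runs.map((r) => r.text).join(""), ruby: rest }];
  }
  return [...head, ...middle, ...tail];
}

// Render segments as ruby markup (no HTML escaping in this sketch).
function toRubyHtml(segments) {
  return segments
    .map((s) => (s.ruby ? `<ruby>${s.text}<rt>${s.ruby}</rt></ruby>` : s.text))
    .join("");
}

console.log(toRubyHtml(alignFurigana("頑張る", "がんばる")));
// <ruby>頑張<rt>がんば</rt></ruby>る
console.log(toRubyHtml(alignFurigana("巻き脚絆", "まききゃはん")));
// <ruby>巻き脚絆<rt>まききゃはん</rt></ruby>
```

In this sketch 巻き脚絆 falls back to the whole string because き has two candidate positions in まききゃはん, which is the ambiguous mid-string case.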
It's not particularly hard to do pinyin rendering; it's just bundled into the same library as a utility for multi-language environments.
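As a rough idea of what pinyin rendering involves, here is a small sketch that pairs each character with one space-separated syllable and wraps it in a ruby tag. The function names (toneMark, pinyinToRuby) are hypothetical, and converting tone numbers to tone marks is an assumption made for the example; the library's own output format may differ.

```javascript
// Hypothetical helpers for illustration; not follyanna's API.
const TONE_MARKS = {
  a: "āáǎà",
  e: "ēéěè",
  i: "īíǐì",
  o: "ōóǒò",
  u: "ūúǔù",
  "ü": "ǖǘǚǜ",
};

// Convert numbered pinyin such as "jia1" to tone-marked pinyin.
function toneMark(syllable) {
  const match = /^([a-zü]+)([1-5])$/.exec(syllable.toLowerCase());
  if (!match) return syllable; // not numbered pinyin, leave untouched
  const [, letters, digit] = match;
  const tone = Number(digit);
  if (tone === 5) return letters; // neutral tone carries no mark

  // Standard placement: "a" or "e" takes the mark, then the "o" in "ou",
  // otherwise the last vowel in the syllable.
  let index;
  if (letters.includes("a")) index = letters.indexOf("a");
  else if (letters.includes("e")) index = letters.indexOf("e");
  else if (letters.includes("ou")) index = letters.indexOf("o");
  else index = Math.max(...[..."iouü"].map((v) => letters.lastIndexOf(v)));
  if (index < 0) return letters; // no vowel found

  const marked = TONE_MARKS[letters[index]][tone - 1];
  return letters.slice(0, index) + marked + letters.slice(index + 1);
}

// Pair each character with one syllable; if the counts differ,
// fall back to a single ruby tag over the whole string.
function pinyinToRuby(hanzi, reading) {
  const chars = [...hanzi];
  const syllables = reading.trim().split(/\s+/).map(toneMark);
  if (chars.length !== syllables.length) {
    return `<ruby>${hanzi}<rt>${syllables.join(" ")}</rt></ruby>`;
  }
  return chars
    .map((ch, i) => `<ruby>${ch}<rt>${syllables[i]}</rt></ruby>`)
    .join("");
}

console.log(pinyinToRuby("加油", "jia1 you2"));
```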
- Set up CDN
- Deal with ambiguities when there are two or more mid-string candidates