License: LGPLv3
Author: Uvarov Michael (freeakk@gmail.com)
Module for working with strings. A string is a flat list of Unicode characters.
All operations on Unicode data follow the Unicode standards.
Use the latest tag of this library for hacking, because I am reading UTS 35 and writing a large amount of bad code on the master branch.
- UAX 44 Unicode Character Database
- ux_string uses ux_char and ux_unidata.
- ux_uca uses ux_char and ux_unidata.
- ux_char uses ux_unidata.
- ux_unidata is for internal data access.
This module provides functions for working with UNIDATA, which contains data about Unicode characters.
- to_nfc/1
- to_nfd/1
- to_nfkd/1
- to_nfkc/1
- is_nfc/1
- is_nfd/1
- is_nfkc/1
- is_nfkd/1
- to_lower/1
- to_upper/1
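As a rough illustration of what the four normalization forms mean, the sketch below uses the `unicode` module from the OTP standard library (OTP 20 or later), not ux_string. The module name `nf_demo` is hypothetical, and it is only an assumption that `ux_string:to_nfc/1` and its siblings give equivalent results for the same input.

```erlang
-module(nf_demo).
-export([forms/1]).

%% Return the four Unicode normalization forms of a list of code points.
%% This uses OTP's unicode module, not ux_string.
forms(Str) ->
    #{nfc  => unicode:characters_to_nfc_list(Str),
      nfd  => unicode:characters_to_nfd_list(Str),
      nfkc => unicode:characters_to_nfkc_list(Str),
      nfkd => unicode:characters_to_nfkd_list(Str)}.
```

For example, `nf_demo:forms([$e, 16#0301])` composes the pair into the single code point 233 under NFC and keeps `[101,769]` under NFD, while the ligature `[16#FB01]` only becomes `"fi"` under the compatibility forms NFKC and NFKD.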
A grapheme is a letter together with its modifiers.
- length/1
- reverse/1
- first/2
- last/2
Code:
(ux@delta)11> ux_string:length("FF g̈").
4
(ux@delta)12> string:len("FF g̈").
5
(ux@delta)13> ux_string:to_graphemes("FF g̈").
["F","F"," ",[103,776]]
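The difference between counting code points and counting graphemes can also be reproduced with the OTP standard library alone. The sketch below assumes OTP 20 or later, where the `string` module works on grapheme clusters; the module name `grapheme_demo` is hypothetical and nothing here calls ux_string.

```erlang
-module(grapheme_demo).
-export([counts/1]).

%% Return {CodePoints, Graphemes, Clusters} for a list of code points.
%% erlang:length/1 counts list elements (code points), while
%% string:length/1 and string:to_graphemes/1 work on grapheme clusters.
counts(Str) ->
    {erlang:length(Str), string:length(Str), string:to_graphemes(Str)}.
```

Calling `grapheme_demo:counts([$F, $F, $\s, $g, 16#0308])` yields 5 code points and 4 graphemes, with the last cluster being `[103,776]`.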
- explode/2,3
- html_special_chars/1 (like htmlspecialchars in PHP)
- strip_tags/1,2
Code:
ux_string:explode(["==", "++", "|"], "+++-+=|==|==|=+-+++").
Result:
[[],"+-+=",[],[],[],[],"=+-","+"]
Code:
ux_html:strip_tags("<b>bold text</b>").
Result:
"bold text"
A type is a Unicode General Category.
Code:
Str = "Erlang created the field of telephone "
"networks analysis. His early work in scrutinizing the use of local, exchange "
"and trunk telephone line usage in a small community, to understand the "
"theoretical requirements of an efficient network led to the creation of the "
"Erlang formula, which became a foundational element of present day "
"telecommunication network studies.",
ux_string:explode_types(['Zs', 'Lu'], Str).
Result:
[[],"rlang","created","the","field","of","telephone",
"networks","analysis.",[],"is","early","work","in",
"scrutinizing","the","use","of","local,","exchange","and",
"trunk","telephone","line","usage","in","a","small",
[...]|...]
Code:
ux_string:types(Str).
Result:
['Lu','Ll','Ll','Ll','Ll','Ll','Zs','Ll','Ll','Ll','Ll',
'Ll','Ll','Ll','Zs','Ll','Ll','Ll','Zs','Ll','Ll','Ll','Ll',
'Ll','Zs','Ll','Ll','Zs','Ll'|...]
Here the atom 'Lu' is Letter, Uppercase and 'Ll' is Letter, Lowercase. For more
about types, see the description of ux_char:type/1.
Code:
ux_string:delete_types(['Ll'], Str).
Result:
"E . H , , E , ."
Code:
ux_char:type($ ).
Result:
'Zs'
- Normative Categories:
    - Lu Letter, Uppercase
    - Ll Letter, Lowercase
    - Lt Letter, Titlecase
    - Mn Mark, Non-Spacing
    - Mc Mark, Spacing Combining
    - Me Mark, Enclosing
    - Nd Number, Decimal Digit
    - Nl Number, Letter
    - No Number, Other
    - Zs Separator, Space
    - Zl Separator, Line
    - Zp Separator, Paragraph
    - Cc Other, Control
    - Cf Other, Format
    - Cs Other, Surrogate
    - Co Other, Private Use
    - Cn Other, Not Assigned (no characters in the file have this property)
- Informative Categories:
    - Lm Letter, Modifier
    - Lo Letter, Other
    - Pc Punctuation, Connector
    - Pd Punctuation, Dash
    - Ps Punctuation, Open
    - Pe Punctuation, Close
    - Pi Punctuation, Initial quote (may behave like Ps or Pe depending on usage)
    - Pf Punctuation, Final quote (may behave like Ps or Pe depending on usage)
    - Po Punctuation, Other
    - Sm Symbol, Math
    - Sc Symbol, Currency
    - Sk Symbol, Modifier
    - So Symbol, Other
See Unicode Technical Standard #10.
- compare/2,3
- sort/1,2
- sort_key/1,2
- sort_array/1,2
- search/2,3,4
Code from the Erlang shell:
1> ux_uca:sort_key("a").
<<21,163,0,0,32,0,0,2,0,0,255,255>>
2> ux_uca:sort_key("abc").
<<21,163,21,185,21,209,0,0,34,0,0,4,0,0,255,255,255,255,
255,255>>
3> ux_uca:sort_key("abcd").
<<21,163,21,185,21,209,21,228,0,0,35,0,0,5,0,0,255,255,
255,255,255,255,255,255>>
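Sort keys are useful when the same strings are compared many times: each key is computed once and the keys are then compared as ordinary terms. The following is a minimal sketch of that pattern, assuming the keys order correctly under Erlang's term comparison (as binaries like the ones above normally do). The module `key_sort` and the `KeyFun` parameter are hypothetical names, not part of ux.

```erlang
-module(key_sort).
-export([sort_by_key/2]).

%% Sort List by the key that KeyFun computes for each element.
%% Every key is computed exactly once; lists:keysort/2 is stable,
%% so elements with equal keys keep their original order.
sort_by_key(KeyFun, List) ->
    Keyed = [{KeyFun(X), X} || X <- List],
    [X || {_Key, X} <- lists:keysort(1, Keyed)].
```

With ux loaded, a call such as `key_sort:sort_by_key(fun ux_uca:sort_key/1, Strings)` would be the intended use; any other key function, for instance `fun string:lowercase/1`, works the same way.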
Code:
ux_uca:compare("a", "a").
ux_uca:compare("a", "b").
ux_uca:compare("c", "b").
Result:
equal
lower
greater
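A three-way comparator that returns lower, equal or greater, as in the results above, cannot be passed to `lists:sort/2` directly, because that function expects a boolean ordering fun. The sketch below shows one way to adapt it; the module name `cmp_sort` is hypothetical, and the comparator is passed in as a parameter so the code does not depend on ux being installed.

```erlang
-module(cmp_sort).
-export([sort_with/2]).

%% Sort List with a three-way comparator Cmp(A, B) that returns
%% lower | equal | greater. lists:sort/2 needs a fun that returns
%% true when A may come before B, i.e. when A is not greater than B.
sort_with(Cmp, List) ->
    lists:sort(fun(A, B) -> Cmp(A, B) =/= greater end, List).
```

Assuming `ux_uca:compare/2` behaves as shown, `cmp_sort:sort_with(fun ux_uca:compare/2, Strings)` would sort by collation order; for many strings, `ux_uca:sort/1` is likely the better choice.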
Code:
Options = ux_uca_options:get_options([
{natural_sort, false},
{strength, 3},
{alternate, shifted}
]),
InStrings = ["erlang", "esl", "nitrogen", "epm", "mochiweb", "rebar", "eunit"],
OutStrings = ux_uca:sort(Options, InStrings),
[io:format("~ts~n", [S]) || S <- OutStrings],
SortKeys = [{Str, ux_uca:sort_key(Options, Str)} || Str <- OutStrings],
[io:format("~ts ~w~n", [S, K]) || {S, K} <- SortKeys],
ok.
Result:
epm
erlang
esl
eunit
mochiweb
nitrogen
rebar
epm [5631,5961,5876,0,32,32,32,0,2,2,2]
erlang [5631,6000,5828,5539,5890,5700,0,32,32,32,32,32,32,0,2,2,2,2,2,2]
esl [5631,6054,5828,0,32,32,32,0,2,2,2]
eunit [5631,6121,5890,5760,6089,0,32,32,32,32,32,0,2,2,2,2,2]
mochiweb [5876,5924,5585,5735,5760,6180,5631,5561,0,32,32,32,32,32,32,32,32,0,2,2,2,2,2,2,2,2]
nitrogen [5890,5760,6089,6000,5924,5700,5631,5890,0,32,32,32,32,32,32,32,32,0,2,2,2,2,2,2,2,2]
rebar [6000,5631,5561,5539,6000,0,32,32,32,32,32,0,2,2,2,2,2]
ok
Code:
(ux@delta)30> ux_uca:search("The quick brown fox jumps over the lazy dog.",
"fox").
{"The quick brown ","fox"," jumps over the lazy dog."}
(ux@delta)33> ux_uca:search("The quick brown fox jumps over the lazy dog.",
"cat").
false
Code:
(ux@delta)20> CF = fun(S) -> ux_uca_options:get_options([{strength,S}]) end.
#Fun<erl_eval.6.80247286>
(ux@delta)32> ux_uca:search(CF(2), "The quick brown fox jumps over the lazy dog.",
"dog", maximal).
{"The quick brown fox jumps over the lazy"," dog.",[]}
(ux@delta)21> ux_uca:search(CF(2), "fF", "F").
{[],"f","F"}
(ux@delta)22> ux_uca:search(CF(3), "fF", "F").
{"f","F",[]}
Code:
(ux@delta)20> CF = fun(S) -> ux_uca_options:get_options([{strength,S}]) end.
#Fun<erl_eval.6.80247286>
(ux@delta)27> ux_uca:search(CF(3), "! F ?S?", "! F !", 'minimal').
{"! ","F"," ?S?"}
(ux@delta)28> ux_uca:search(CF(3), "! F ?S?", "! F !", 'maximal').
{[],"! F ?","S?"}
(ux@delta)29> ux_uca:search(CF(3), "! F ?S?", "! F !", 'medium').
{[],"! F "," ?S?"}
Stores UNIDATA information. For internal use only.
ux_unidata_filelist:set_source(Level, ParserType, ImportedDataTypes,
FromFile).
For example:
ux_unidata_filelist:set_source(process, blocks, all, code:priv_dir(ux) ++ "/UNIDATA/Blocks.txt").
loads data about Unicode blocks from priv/UNIDATA/Blocks.txt.
So, different processes can use their own unidata dictionaries.
Level is process, application or node.
Parsers are located in the ux_unidata_parser_* modules.
Default unidata files are loaded when the application tries to access them.