Unicode eXtension

License: LGPLv3

Author: Uvarov Michael (freeakk@gmail.com)

Module for working with strings. A string is a flat list of Unicode characters.

All operations follow the algorithms described in the Unicode Standard and its annexes.

Use the latest tag of this library for hacking, because I'm reading UTS 35 and writing a large amount of bad code on the master branch.

This library implements only these documents:

UAX 15 Unicode Normalization Forms
UTS 10 Unicode Collation Algorithm

and some parts from:

UAX 44 Unicode Character Database

Structure of the library

ux_string uses ux_char and ux_unidata.

ux_uca uses ux_char and ux_unidata.

ux_char uses ux_unidata.

ux_unidata provides internal data access.

ux_string.erl: String Functions for lists of Unicode characters.

This module provides functions that work with UNIDATA, the data describing Unicode characters.

Functions for working with Unicode Normal Forms (UNF)

to_nfc/1
to_nfd/1
to_nfkd/1
to_nfkc/1
is_nfc/1
is_nfd/1
is_nfkc/1
is_nfkd/1
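
To illustrate what the four normal forms do, here is a minimal sketch that uses only OTP's own `unicode` module (OTP 20 or later), not ux_string. The module name `nf_sketch` and its functions are hypothetical; that `ux_string:to_nfc/1` and friends return the same results is an assumption.

```erlang
-module(nf_sketch).
-export([demo/0, is_nfc/1]).

%% Uses only OTP's unicode module (OTP 20 or later), not ux_string.
demo() ->
    Decomposed = [$e, 16#0301],   % "e" + COMBINING ACUTE ACCENT
    Composed   = [16#00E9],       % precomposed LATIN SMALL LETTER E WITH ACUTE
    Ligature   = [16#FB01],       % LATIN SMALL LIGATURE FI
    #{nfc  => unicode:characters_to_nfc_list(Decomposed),
      nfd  => unicode:characters_to_nfd_list(Composed),
      nfkc => unicode:characters_to_nfkc_list(Ligature),
      nfkd => unicode:characters_to_nfkd_list(Ligature),
      %% Canonical forms leave compatibility characters untouched.
      nfc_ligature => unicode:characters_to_nfc_list(Ligature)}.

%% A flat list of code points is in NFC if normalizing it changes nothing.
is_nfc(Str) ->
    unicode:characters_to_nfc_list(Str) =:= Str.
```

NFC composes the pair into one code point, NFD splits it again, and only the compatibility forms (NFKC, NFKD) replace the ligature with "fi".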

Functions from stdlib for Unicode strings

to_lower/1
to_upper/1

Functions for processing strings as groups of graphemes

A grapheme is a letter together with its modifiers.

length/1
reverse/1
first/2
last/2

Examples

Code:

(ux@delta)11> ux_string:length("FF g̈").
4
(ux@delta)12> string:len("FF g̈").       
5
(ux@delta)13> ux_string:to_graphemes("FF g̈").
["F","F"," ",[103,776]]
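
For comparison, the difference between counting code points and counting graphemes can be sketched with the `string` module that ships with OTP 20 or later. This is not ux_string, and the module name `grapheme_sketch` is hypothetical. Note that OTP returns single-code-point clusters as integers rather than one-element lists.

```erlang
-module(grapheme_sketch).
-export([demo/0]).

demo() ->
    %% "FF g" followed by COMBINING DIAERESIS (U+0308), as in the example above.
    Str = [$F, $F, $\s, $g, 16#0308],
    #{code_points => erlang:length(Str),        % counts every code point
      graphemes   => string:length(Str),        % counts grapheme clusters
      clusters    => string:to_graphemes(Str)}. % g and its diaeresis stay together
```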

"PHP-style" string functions

explode/2,3
html_special_chars/1 (htmlspecialchars in php)
strip_tags/1,2

Examples

Code:

ux_string:explode(["==", "++", "|"], "+++-+=|==|==|=+-+++").

Result:

[[],"+-+=",[],[],[],[],"=+-","+"]
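
The result above is consistent with a left-to-right scan that, at each position, tries the delimiters in the order given. The following is a minimal sketch of that idea, not the library's implementation; the module name `explode_sketch` is hypothetical and the scanning rule is inferred from this one example.

```erlang
-module(explode_sketch).
-export([explode/2]).

%% Split Str at any of the delimiters, scanning from the left.
%% Empty delimiters are dropped, since they would never consume input.
explode(Delims, Str) ->
    explode([D || D <- Delims, D =/= []], Str, [], []).

explode(_Delims, [], Cur, Acc) ->
    lists:reverse([lists:reverse(Cur) | Acc]);
explode(Delims, [H | T] = Str, Cur, Acc) ->
    case match_any(Delims, Str) of
        {ok, Rest} ->
            %% Delimiter found: close the current piece and start a new one.
            explode(Delims, Rest, [], [lists:reverse(Cur) | Acc]);
        nomatch ->
            explode(Delims, T, [H | Cur], Acc)
    end.

%% Return the input after the first delimiter that is a prefix of Str.
match_any([], _Str) ->
    nomatch;
match_any([D | Ds], Str) ->
    case lists:prefix(D, Str) of
        true  -> {ok, lists:nthtail(length(D), Str)};
        false -> match_any(Ds, Str)
    end.
```

With the delimiters and input from the example, this sketch yields the same list of pieces.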

Code:

ux_html:strip_tags("<b>bold text</b>").

Result:

"bold text"

Type functions

A type is a Unicode General Category.

Code:

Str = "Erlang created the field of telephone
networks analysis. His early work in scrutinizing the use of local, exchange
and trunk telephone line usage in a small community, to understand the
theoretical requirements of an efficient network led to the creation of the
Erlang formula, which became a foundational element of present day
telecommunication network studies.",
ux_string:explode_types(['Zs', 'Lu'], Str).

Result:

[[],"rlang","created","the","field","of","telephone",
 "networks","analysis.",[],"is","early","work","in",
 "scrutinizing","the","use","of","local,","exchange","and",
 "trunk","telephone","line","usage","in","a","small",
 [...]|...]

Code:

ux_string:types(Str).

Result:

['Lu','Ll','Ll','Ll','Ll','Ll','Zs','Ll','Ll','Ll','Ll',
 'Ll','Ll','Ll','Zs','Ll','Ll','Ll','Zs','Ll','Ll','Ll','Ll',
 'Ll','Zs','Ll','Ll','Zs','Ll'|...]

Here the atom 'Lu' means Letter, Uppercase and 'Ll' means Letter, Lowercase. For more about types, see the description of ux_char:type/1.

Code:

ux_string:delete_types(['Ll'], Str).

Result:

"E       . H        ,          ,                E ,           ."

ux_char.erl: Char Functions

Code:

ux_char:type($ ).

Result:

'Zs'

List of types

Normative Categories:
- Lu Letter, Uppercase
- Ll Letter, Lowercase
- Lt Letter, Titlecase
- Mn Mark, Non-Spacing
- Mc Mark, Spacing Combining
- Me Mark, Enclosing
- Nd Number, Decimal Digit
- Nl Number, Letter
- No Number, Other
- Zs Separator, Space
- Zl Separator, Line
- Zp Separator, Paragraph
- Cc Other, Control
- Cf Other, Format
- Cs Other, Surrogate
- Co Other, Private Use
- Cn Other, Not Assigned (no characters in the file have this property)
Informative Categories:
- Lm Letter, Modifier
- Lo Letter, Other
- Pc Punctuation, Connector
- Pd Punctuation, Dash
- Ps Punctuation, Open
- Pe Punctuation, Close
- Pi Punctuation, Initial quote (may behave like Ps or Pe depending on usage)
- Pf Punctuation, Final quote (may behave like Ps or Pe depending on usage)
- Po Punctuation, Other
- Sm Symbol, Math
- Sc Symbol, Currency
- Sk Symbol, Modifier
- So Symbol, Other

ux_uca.erl: Unicode Collation Algorithm

See Unicode Technical Standard #10.

Functions

compare/2,3
sort/1,2
sort_key/1,2
sort_array/1,2
search/2,3,4

Examples

Code from the Erlang shell:

1> ux_uca:sort_key("a").   
<<21,163,0,0,32,0,0,2,0,0,255,255>>

2> ux_uca:sort_key("abc"). 
<<21,163,21,185,21,209,0,0,34,0,0,4,0,0,255,255,255,255,
  255,255>>

3> ux_uca:sort_key("abcd").
<<21,163,21,185,21,209,21,228,0,0,35,0,0,5,0,0,255,255,
  255,255,255,255,255,255>>

Code:

ux_uca:compare("a", "a").
ux_uca:compare("a", "b").
ux_uca:compare("c", "b").

Result:

equal
lower
greater

Code:

Options = ux_uca_options:get_options([ 
        {natural_sort, false}, 
        {strength, 3}, 
        {alternate, shifted} 
    ]),
InStrings = ["erlang", "esl", "nitrogen", "epm", "mochiweb", "rebar", "eunit"],
OutStrings = ux_uca:sort(Options, InStrings),
[io:format("~ts~n", [S]) || S <- OutStrings],

SortKeys = [{Str, ux_uca:sort_key(Options, Str)} || Str <- OutStrings],
[io:format("~ts ~w~n", [S, K]) || {S, K} <- SortKeys],

ok.

Result:

epm
erlang
esl
eunit
mochiweb
nitrogen
rebar
epm [5631,5961,5876,0,32,32,32,0,2,2,2]
erlang [5631,6000,5828,5539,5890,5700,0,32,32,32,32,32,32,0,2,2,2,2,2,2]
esl [5631,6054,5828,0,32,32,32,0,2,2,2]
eunit [5631,6121,5890,5760,6089,0,32,32,32,32,32,0,2,2,2,2,2]
mochiweb [5876,5924,5585,5735,5760,6180,5631,5561,0,32,32,32,32,32,32,32,32,0,2,2,2,2,2,2,2,2]
nitrogen [5890,5760,6089,6000,5924,5700,5631,5890,0,32,32,32,32,32,32,32,32,0,2,2,2,2,2,2,2,2]
rebar [6000,5631,5561,5539,6000,0,32,32,32,32,32,0,2,2,2,2,2]
ok
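
The keys printed above follow the layout described in UTS #10: all primary weights, a zero separator, all secondary weights, another zero, then all tertiary weights. Here is a minimal sketch of that step, assuming the collation elements are already known as `{Primary, Secondary, Tertiary}` tuples. The module name `sort_key_sketch` is hypothetical and is not part of ux.

```erlang
-module(sort_key_sketch).
-export([sort_key/1, compare/2]).

%% Elements is a list of {Primary, Secondary, Tertiary} weights.
%% Zero weights are skipped and levels are separated by 0, as in UTS #10.
sort_key(Elements) ->
    L1 = [P || {P, _, _} <- Elements, P =/= 0],
    L2 = [S || {_, S, _} <- Elements, S =/= 0],
    L3 = [T || {_, _, T} <- Elements, T =/= 0],
    L1 ++ [0] ++ L2 ++ [0] ++ L3.

%% Keys built this way can be ordered with plain term comparison.
compare(KeyA, KeyB) when KeyA < KeyB -> lower;
compare(KeyA, KeyB) when KeyA > KeyB -> greater;
compare(_KeyA, _KeyB) -> equal.
```

Feeding in the weights shown for "epm" (5631, 5961, 5876, each with secondary weight 32 and tertiary weight 2) reproduces the key in the listing, and that key compares as lower than the one for "esl".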

Searching

Code:

(ux@delta)30> ux_uca:search("The quick brown fox jumps over the lazy dog.",
"fox").
{"The quick brown ","fox"," jumps over the lazy dog."}

(ux@delta)33> ux_uca:search("The quick brown fox jumps over the lazy dog.",
"cat").         
false

Searching and Strength

Code:

(ux@delta)20> CF = fun(S) -> ux_uca_options:get_options([{strength,S}]) end.      
#Fun<erl_eval.6.80247286>

(ux@delta)32> ux_uca:search(CF(2),
"The quick brown fox jumps over the lazy dog.", "dog", maximal).
{"The quick brown fox jumps over the lazy"," dog.",[]}

(ux@delta)21> ux_uca:search(CF(2), "fF", "F").                                    
{[],"f","F"}

(ux@delta)22> ux_uca:search(CF(3), "fF", "F").
{"f","F",[]}

Searching and Match-Style

Code:

(ux@delta)20> CF = fun(S) -> ux_uca_options:get_options([{strength,S}]) end.      
#Fun<erl_eval.6.80247286>

(ux@delta)27> ux_uca:search(CF(3), "! F   ?S?", "! F !", 'minimal').
{"! ","F","   ?S?"}

(ux@delta)28> ux_uca:search(CF(3), "! F   ?S?", "! F !", 'maximal').
{[],"! F   ?","S?"}

(ux@delta)29> ux_uca:search(CF(3), "! F   ?S?", "! F !", 'medium'). 
{[],"! F ","  ?S?"}

ux_unidata.erl

Stores UNIDATA information. For internal use only.

Data loading

ux_unidata_filelist:set_source(Level, ParserType, ImportedDataTypes,
FromFile).

For example:

ux_unidata_filelist:set_source(process, blocks, all, code:priv_dir(ux) ++ "/UNIDATA/Blocks.txt").

loads data about Unicode blocks from priv/UNIDATA/Blocks.txt.

So, different processes can use their own unidata dictionaries.

Level is process, application or node.

Parsers are located in the ux_unidata_parser_* modules.

Default unidata files are loaded when the application first tries to access them.
