Unicode Codepoints Parser

description

script for creating codepoint-to-name maps from ICU unicode codepoint data. its primary purpose is the construction of, optionally customised, emoji codepoint maps

this project was born out of a script originally found in the emo project, u 8EF3 sed for this very same purpose

usage

briefly, the following command downloads and generates the latest supported emoji set

    $ ./ucp.py -d -g -o   # output at 'codepoints/unicode-12/unicode-12.map
    $ ./ucp.py -d -g > results

verbosely.. the raw datasets can currently be found under the following root url:

https://www.unicode.org/Public/

e.g.

https://www.unicode.org/Public/UCD/latest/ucd/UnicodeData.txt https://www.unicode.org/Public/12.0.0/ucd/UnicodeData.txt https://www.unicode.org/Public/emoji/5.0/emoji-data.txt

these can be downloaded using this script's --download and optionally --target options with file(s) being written to codepoints/unicode-version/upstream subdirectories

the --generate option can then be used to collate these files into a single, leaner, unicode-version.raw file prior to final filtering

the --matchsets option can be used to adjust the default builtin filter which is called _emoji2_. --matchsets takes a comma delimited list builtin matchset name(s) and/or arbitrary match strings. '[+]/-' prefixes can be used to designate whether the matchset is a positive (default) / negative (blacklist) filter. the negative filter is applied before the positive filter.

usage: ucp.py [-h] [-lt] [-t [TARGET]] [-d] [-u [URL_ROOT]] [-g] [-s SOURCE]
              [-ms MATCHSETS] [-o] [-v]

Unicode Codepoints Parser

optional arguments:
  -h, --help            show this help message and exit
  -lt, --list-targets   list unicode versions available for targeting
  -t [TARGET], --target [TARGET]
                        unicode version to target
  -d, --download        download ICU codepoints data
  -u [URL_ROOT], --url [URL_ROOT]
                        override default root url for ICU codepoints data
  -g, --generate        generate the codepoints map
  -s SOURCE, --source SOURCE
                        local path to either an ICU codepoints upstream data
                        set ora collated '.raw' file (default: stdin)
  -ms MATCHSETS, --matchsets MATCHSETS
                        comma delimited list of pre-defined matchsets and/or
                        arbitrary match strings. '[+]/-' prefixes can be used
                        todesignate the matchset as a positive/negative filter
  -o, --dump            dump the generated map to the standard target version
                        location and disable the default writing to stdout
  -v, --verbose         increase the level of information output

examples

dowloading

    $ ./ucp.py -d -t unicode-10
    # downloading ICU codepoints for 'unicode-10'

    $ ls -a1R codepoints/unicode-10/upstream
    codepoints/unicode-10/upstream:
    .
    ..
    emoji-data.txt
    emoji-sequences.txt
    emoji-variation-sequences.txt
    emoji-zwj-sequences.txt
    UnicodeData.txt

generating maps

    $ ./ucp.py -o -g -ms "_emoji_" -t unicode-6 -s codepoints/unicode-6/unicode-6.raw
    mapped 1006 codepoints

    $ ./ucp.py -o -g -ms "_emoji_,flag" -t unicode-6 -s codepoints/unicode-6/unicode-6.raw
    mapped 1015 codepoints

    $ ./ucp.py -o -g -ms "_emoji_,-flag" -t unicode-6 -s codepoints/unicode-12/unicode-12.raw
    mapped 997 codepoints

    $ ./ucp.py -o -g -ms "_emoji2_" -s codepoints/unicode/unicode-12.raw
    # generating codepoints map
    mapped 5887 codepoints

    $ ./ucp.py -o -g -ms "_emoji2_,0x1f01" -s codepoints/unicode-12/unicode-12.raw
    # generating codepoints map
    mapped 5888 codepoints

    $ ./ucp.py -o -g -ms "_emoji2_,0x1f0" -s codepoints/unicode-12/unicode-12.raw
    # generating codepoints map
    mapped 5913 codepoints

    ./ucp.py -o -g -ms "_emoji2_,-church" -s codepoints/unicode-12/unicode-12.raw
    # generating codepoints map
    mapped 5886 codepoints

    ./ucp.py -o -g -ms "_emoji_" -s codepoints/unicode-12/unicode-12.raw
    # generating codepoints map
    mapped 1024 codepoints

    ./ucp.py -o -g -ms "_emoji_,flag" -s codepoints/unicode-12/unicode-12.raw
    # generating codepoints map
    mapped 1297 codepoints

listing maps

    $ ./ucp.py -g -ms "flag" -s codepoints/unicode-12/unicode-12.raw
    # generating codepoints map
    mapped 282 codepoints
    0x2690;white-flag
    0x2691;black-flag
    0x26f3;flag-in-hole

    ...

    0x1f38c;crossed-flags
    0x1f3c1;chequered-flag
    0x1f3f3;waving-white-flag
    0x1f3f4;black-flag

    ...

    0x1f3f4 0xe0067 0xe0062 0xe0065 0xe006e 0xe0067 0xe007f;flag:-england
    0x1f3f4 0xe0067 0xe0062 0xe0073 0xe0063 0xe0074 0xe007f;flag:-scotland
    0x1f3f4 0xe0067 0xe0062 0xe0077 0xe006c 0xe0073 0xe007f;flag:-wales
    0x1f3f3 0xfe0f 0x200d 0x1f308;rainbow-flag
    0x1f3f4 0x200d 0x2620 0xfe0f;pirate-flag

Name		Name	Last commit message	Last commit date
Latest commit History 28 Commits
codepoints		codepoints
.gitignore		.gitignore
README.md		README.md
ucp.py		ucp.py

Provide feedback

Saved searches

Use saved searches to filter your results more quickly

Uh oh!

Repository files navigation

Unicode Codepoints Parser

description

usage

examples

dowloading

generating maps

listing maps

contributors

About

Uh oh!

Releases

Packages

Contributors 2

Uh oh!

Languages

elbeardmorez/ucp

Folders and files

Latest commit

History

Repository files navigation

Unicode Codepoints Parser

description

usage

examples

dowloading

generating maps

listing maps

contributors

About

Resources

Uh oh!

Stars

Watchers

Forks

Releases

Packages 0

Contributors 2

Uh oh!

Languages

Packages