elbeardmorez/ucp

Unicode Codepoints Parser

description

script for creating codepoint-to-name maps from ICU unicode codepoint data. its primary purpose is the construction of (optionally customised) emoji codepoint maps

this project was born out of a script originally found in the emo project, used for this very same purpose

usage

briefly, the following command downloads and generates the latest supported emoji set

    $ ./ucp.py -d -g -o   # output at 'codepoints/unicode-12/unicode-12.map'
    $ ./ucp.py -d -g > results

verbosely.. the raw datasets can currently be found under the following root url:

https://www.unicode.org/Public/

e.g.

https://www.unicode.org/Public/UCD/latest/ucd/UnicodeData.txt

https://www.unicode.org/Public/12.0.0/ucd/UnicodeData.txt

https://www.unicode.org/Public/emoji/5.0/emoji-data.txt

these can be downloaded using this script's --download and optionally --target options with file(s) being written to codepoints/unicode-version/upstream subdirectories
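
as a rough illustration of the layout described above, the sketch below pairs an upstream url with the local path a download would land at. the function name `plan_download` and the `ucd_version` parameter are hypothetical, and how the script itself maps a target such as 'unicode-12' onto an upstream version directory is not shown here, so that mapping is passed in by hand. nothing is fetched.

```python
from pathlib import PurePosixPath

# root url as given above; overridable, mirroring the --url option
URL_ROOT = "https://www.unicode.org/Public/"


def plan_download(target, ucd_version, filename="UnicodeData.txt", url_root=URL_ROOT):
    """Return (url, local path) for one upstream file.

    Assumption: the caller supplies the version directory ('12.0.0')
    that corresponds to the target name ('unicode-12').
    """
    url = f"{url_root.rstrip('/')}/{ucd_version}/ucd/{filename}"
    destination = PurePosixPath("codepoints") / target / "upstream" / filename
    return url, str(destination)


url, path = plan_download("unicode-12", "12.0.0")
print(url)   # https://www.unicode.org/Public/12.0.0/ucd/UnicodeData.txt
print(path)  # codepoints/unicode-12/upstream/UnicodeData.txt
```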

the --generate option can then be used to collate these files into a single, leaner, unicode-version.raw file prior to final filtering
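
to give a feel for what collating involves, here is a minimal sketch that reduces UnicodeData.txt style lines (semicolon separated, codepoint in the first field and name in the second) to codepoint/name pairs. the function name `parse_unicode_data` is hypothetical, the lowercase hyphenated naming is inferred from the listings further down, and the actual layout of the '.raw' file is not documented here, so this should not be read as the script's own implementation.

```python
def parse_unicode_data(lines):
    """Yield (codepoint, name) pairs from UnicodeData.txt style lines."""
    for line in lines:
        line = line.strip()
        # skip blanks and comment lines (the emoji data files use '#' comments)
        if not line or line.startswith("#"):
            continue
        fields = line.split(";")
        if len(fields) < 2:
            continue
        codepoint = int(fields[0], 16)
        # assumption: names are lowercased and spaces become hyphens
        name = fields[1].strip().lower().replace(" ", "-")
        yield codepoint, name


sample = [
    "1F3C1;CHEQUERED FLAG;So;0;ON;;;;;N;;;;;",
    "2690;WHITE FLAG;So;0;ON;;;;;N;;;;;",
]
for codepoint, name in parse_unicode_data(sample):
    print(f"{codepoint:#x};{name}")
```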

the --matchsets option can be used to adjust the default builtin filter, which is called _emoji2_. --matchsets takes a comma delimited list of builtin matchset name(s) and/or arbitrary match strings. '[+]/-' prefixes can be used to designate whether the matchset is a positive (default) / negative (blacklist) filter. the negative filter is applied before the positive filter.
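
the prefix handling and filter ordering described above can be sketched roughly as follows. `parse_matchsets` and `apply_filters` are hypothetical names, match strings are assumed to be plain substrings tested against both the codepoint and the name, and builtin names such as _emoji2_ are not expanded, so treat this as an approximation of the idea only.

```python
def parse_matchsets(spec):
    """Split a comma delimited spec into (positive, negative) term lists."""
    positive, negative = [], []
    for token in spec.split(","):
        token = token.strip()
        if not token:
            continue
        if token.startswith("-"):
            negative.append(token[1:])
        elif token.startswith("+"):
            positive.append(token[1:])
        else:
            positive.append(token)  # no prefix means positive
    return positive, negative


def apply_filters(entries, positive, negative):
    """Drop negative matches first, then keep only positive matches."""
    kept = []
    for codepoints, name in entries:
        haystack = f"{codepoints};{name}"
        if any(term in haystack for term in negative):
            continue
        if positive and not any(term in haystack for term in positive):
            continue
        kept.append((codepoints, name))
    return kept


entries = [
    ("0x2690", "white-flag"),
    ("0x26ea", "church"),
    ("0x1f3c1", "chequered-flag"),
    ("0x1f600", "grinning-face"),
]
positive, negative = parse_matchsets("flag,+face,-white")
for codepoints, name in apply_filters(entries, positive, negative):
    print(f"{codepoints};{name}")
```

with the sample entries, 'white-flag' is removed by the negative term before the positive terms are considered, and 'church' is dropped because it matches no positive term.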

usage: ucp.py [-h] [-lt] [-t [TARGET]] [-d] [-u [URL_ROOT]] [-g] [-s SOURCE]
              [-ms MATCHSETS] [-o] [-v]

Unicode Codepoints Parser

optional arguments:
  -h, --help            show this help message and exit
  -lt, --list-targets   list unicode versions available for targeting
  -t [TARGET], --target [TARGET]
                        unicode version to target
  -d, --download        download ICU codepoints data
  -u [URL_ROOT], --url [URL_ROOT]
                        override default root url for ICU codepoints data
  -g, --generate        generate the codepoints map
  -s SOURCE, --source SOURCE
                        local path to either an ICU codepoints upstream data
                        set or a collated '.raw' file (default: stdin)
  -ms MATCHSETS, --matchsets MATCHSETS
                        comma delimited list of pre-defined matchsets and/or
                        arbitrary match strings. '[+]/-' prefixes can be used
                        to designate the matchset as a positive/negative filter
  -o, --dump            dump the generated map to the standard target version
                        location and disable the default writing to stdout
  -v, --verbose         increase the level of information output

examples

downloading

    $ ./ucp.py -d -t unicode-10
    # downloading ICU codepoints for 'unicode-10'

    $ ls -a1R codepoints/unicode-10/upstream
    codepoints/unicode-10/upstream:
    .
    ..
    emoji-data.txt
    emoji-sequences.txt
    emoji-variation-sequences.txt
    emoji-zwj-sequences.txt
    UnicodeData.txt

generating maps

    $ ./ucp.py -o -g -ms "_emoji_" -t unicode-6 -s codepoints/unicode-6/unicode-6.raw
    mapped 1006 codepoints

    $ ./ucp.py -o -g -ms "_emoji_,flag" -t unicode-6 -s codepoints/unicode-6/unicode-6.raw
    mapped 1015 codepoints

    $ ./ucp.py -o -g -ms "_emoji_,-flag" -t unicode-6 -s codepoints/unicode-6/unicode-6.raw
    mapped 997 codepoints

    $ ./ucp.py -o -g -ms "_emoji2_" -s codepoints/unicode-12/unicode-12.raw
    # generating codepoints map
    mapped 5887 codepoints

    $ ./ucp.py -o -g -ms "_emoji2_,0x1f01" -s codepoints/unicode-12/unicode-12.raw
    # generating codepoints map
    mapped 5888 codepoints

    $ ./ucp.py -o -g -ms "_emoji2_,0x1f0" -s codepoints/unicode-12/unicode-12.raw
    # generating codepoints map
    mapped 5913 codepoints

    $ ./ucp.py -o -g -ms "_emoji2_,-church" -s codepoints/unicode-12/unicode-12.raw
    # generating codepoints map
    mapped 5886 codepoints

    $ ./ucp.py -o -g -ms "_emoji_" -s codepoints/unicode-12/unicode-12.raw
    # generating codepoints map
    mapped 1024 codepoints

    $ ./ucp.py -o -g -ms "_emoji_,flag" -s codepoints/unicode-12/unicode-12.raw
    # generating codepoints map
    mapped 1297 codepoints

listing maps

    $ ./ucp.py -g -ms "flag" -s codepoints/unicode-12/unicode-12.raw
    # generating codepoints map
    mapped 282 codepoints
    0x2690;white-flag
    0x2691;black-flag
    0x26f3;flag-in-hole

    ...

    0x1f38c;crossed-flags
    0x1f3c1;chequered-flag
    0x1f3f3;waving-white-flag
    0x1f3f4;black-flag

    ...

    0x1f3f4 0xe0067 0xe0062 0xe0065 0xe006e 0xe0067 0xe007f;flag:-england
    0x1f3f4 0xe0067 0xe0062 0xe0073 0xe0063 0xe0074 0xe007f;flag:-scotland
    0x1f3f4 0xe0067 0xe0062 0xe0077 0xe006c 0xe0073 0xe007f;flag:-wales
    0x1f3f3 0xfe0f 0x200d 0x1f308;rainbow-flag
    0x1f3f4 0x200d 0x2620 0xfe0f;pirate-flag
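
the listed lines are straightforward to consume downstream. the sketch below, assuming the 'space separated hex codepoints;name' layout shown in the listing, turns each line into the actual character sequence paired with its name. `parse_map_line` is a hypothetical helper and not part of ucp.py.

```python
def parse_map_line(line):
    """Convert '0x.. 0x..;name' into (character sequence, name)."""
    codepoints, _, name = line.strip().partition(";")
    # int(..., 16) accepts the '0x' prefix used in the map
    sequence = "".join(chr(int(cp, 16)) for cp in codepoints.split())
    return sequence, name


sample_map = [
    "0x2690;white-flag",
    "0x1f3f4 0x200d 0x2620 0xfe0f;pirate-flag",
]
names_by_sequence = dict(parse_map_line(line) for line in sample_map)

print(names_by_sequence["\u2690"])
print(names_by_sequence["\U0001F3F4\u200d\u2620\ufe0f"])
```

multi-codepoint entries such as the zwj sequences stay intact as a single key, so a lookup by the full sequence resolves to one name.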
