- LATTE-PTM-WS (https://github.com/tchayintr/latte-ptm-ws/)
- Character-based word segmentation
- Multi-granularity Lattice (character-word)
- Encoded with Bidirectional-GAT
- BERT-CRF architecture (+LSTM)
- BMES tagging scheme
- B: beginning, M: middle, E: end, and S: single
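The BMES scheme labels every character by its position within its word. Not from the repo itself; a minimal sketch of the standard conversion from a segmented sentence to BMES tags (the function name `to_bmes` is an assumption for illustration):

```python
def to_bmes(words):
    """Map a word-segmented sentence (list of words) to per-character BMES tags.

    Single-character words get S; longer words get B, then M for each
    interior character, then E.
    """
    tags = []
    for word in words:
        if len(word) == 1:
            tags.append("S")
        else:
            tags.extend(["B"] + ["M"] * (len(word) - 2) + ["E"])
    return tags
```

For example, `to_bmes(["我", "喜欢", "自然语言"])` yields `["S", "B", "E", "B", "M", "M", "E"]`.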
- CTB6 (Chinese):
  - word-f1: 98.1
  - oov-recall: 90.6
- BCCWJ (Japanese):
  - word-f1: 99.4
  - oov-recall: 92.1
- BEST2010 (Thai):
  - char-bin-f1: 99.1
  - word-f1: 97.7
- Format each dataset in sl (word-segmented sentence line)
- In this format, each line contains a word-segmented sentence, with words separated by white spaces.
- zh: https://huggingface.co/yacht/latte-mc-bert-base-chinese-ws
- ja: https://huggingface.co/yacht/latte-mc-bert-base-japanese-ws
- th: https://huggingface.co/yacht/latte-mc-bert-base-thai-ws
- model/
  - PyTorch model files
- pip
  - requirements.txt
  - pip install -r requirements.txt
- conda
  - environment.yml
  - conda env create -f environment.yml
- See scripts/ for examples
- The scripts for the best models:
- zh: scripts/run-ctb6-mc-bert.sh
- ja: scripts/run-bccwj-mc-bert.sh
- th: scripts/run-best2010-mc-bert.sh
- Published in Journal of Natural Language Processing