Sudachi is a Japanese morphological analyzer. Morphological analysis consists mainly of the following tasks.
- Segmentation
- Part-of-speech tagging
- Normalization
Sudachi has the following features.
- Multiple-length segmentation
  - You can switch the segmentation mode
  - Morphemes and named entities are extracted at once
- Large lexicon
  - Based on UniDic and NEologd
- Plugins
  - You can change the behavior of each processing step
- Close integration with the synonym dictionary
  - The synonym dictionary will be released at a later date
Sudachi has three types of dictionaries.
- Small: includes only the vocabulary of UniDic
- Core: includes basic vocabulary (default)
- Full: includes miscellaneous proper nouns
For more details, see SudachiDict.
Run the command-line tool with a configuration string:
$ java -jar sudachi-XX.jar -s '{"systemDict":"system_small.dic"}'
$ java -jar sudachi-XX.jar [-r conf] [-s json] [-m mode] [-a] [-d] [-f] [-o output] [file...]
- -r conf specifies the settings file (cannot be combined with -s)
- -s json overrides the default settings with a JSON string (cannot be combined with -r)
- -p directory specifies the root directory of resources
- -m {A|B|C} specifies the split mode
- -a also outputs the dictionary form and the reading form
- -d dumps debug output
- -o output specifies the output file (default: standard output)
- -f ignores errors and continues processing
$ echo 東京都へ行く | java -jar target/sudachi.jar
東京都 名詞,固有名詞,地名,一般,*,* 東京都
へ 助詞,格助詞,*,*,*,* へ
行く 動詞,非自立可能,*,*,五段-カ行,終止形-一般 行く
EOS
$ echo 東京都へ行く | java -jar target/sudachi.jar -a
東京都 名詞,固有名詞,地名,一般,*,* 東京都 東京都 トウキョウト
へ 助詞,格助詞,*,*,*,* へ へ エ
行く 動詞,非自立可能,*,*,五段-カ行,終止形-一般 行く 行く イク
EOS
$ echo 東京都へ行く | java -jar target/sudachi.jar -m A
東京 名詞,固有名詞,地名,一般,*,* 東京
都 名詞,普通名詞,一般,*,*,* 都
へ 助詞,格助詞,*,*,*,* へ
行く 動詞,非自立可能,*,*,五段-カ行,終止形-一般 行く
EOS
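To consume this output from another program, each line can be parsed field by field. The following is a minimal sketch and not part of Sudachi. It assumes that the fields are separated by tab characters (the listing above may show them as spaces), that the third field is the normalized form, and that with -a the dictionary form and the reading form follow it. The class name SudachiOutputLine and its members are made up for illustration.

```java
import java.util.Arrays;
import java.util.List;

// Illustrative sketch only; not part of the Sudachi API.
class SudachiOutputLine {
    final String surface;
    final List<String> partOfSpeech;
    final String normalizedForm;
    final String dictionaryForm; // null unless the line came from -a output
    final String readingForm;    // null unless the line came from -a output

    private SudachiOutputLine(String[] f) {
        surface = f[0];
        partOfSpeech = Arrays.asList(f[1].split(","));
        normalizedForm = f[2];
        dictionaryForm = f.length > 3 ? f[3] : null;
        readingForm = f.length > 4 ? f[4] : null;
    }

    /** "EOS" marks the end of a sentence in the output. */
    static boolean isEndOfSentence(String line) {
        return line.equals("EOS");
    }

    /** Parses one morpheme line; assumes tab-separated fields. */
    static SudachiOutputLine parse(String line) {
        String[] fields = line.split("\t", -1);
        if (fields.length < 3) {
            throw new IllegalArgumentException("expected at least 3 fields: " + line);
        }
        return new SudachiOutputLine(fields);
    }
}
```

A caller would read the output line by line, start a new sentence whenever isEndOfSentence returns true, and pass every other line to parse.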
You can find details in the Javadoc.
To compile an application with the Sudachi API, declare a dependency on Sudachi in your Maven project.
<dependency>
  <groupId>com.worksap.nlp</groupId>
  <artifactId>sudachi</artifactId>
  <version>0.3.2</version>
</dependency>
Sudachi provides three split modes: A, B, and C. Mode A divides text into the shortest units, equivalent to the UniDic short unit. Mode C extracts the longest units, equivalent to named entities. Mode B produces middle units, between A and C.
The following are examples with the core dictionary.
A:選挙/管理/委員/会
B:選挙/管理/委員会
C:選挙管理委員会
A:客室/乗務/員
B:客室/乗務員
C:客室乗務員
A:労働/者/協同/組合
B:労働者/協同/組合
C:労働者協同組合
A:機能/性/食品
B:機能性/食品
C:機能性食品
The following are examples with the full dictionary.
A:医薬/品/安全/管理/責任/者
B:医薬品/安全/管理/責任者
C:医薬品安全管理責任者
A:消費/者/安全/調査/委員/会
B:消費者/安全/調査/委員会
C:消費者安全調査委員会
A:さっぽろ/テレビ/塔
B:さっぽろ/テレビ塔
C:さっぽろテレビ塔
A:カンヌ/国際/映画/祭
B:カンヌ/国際/映画祭
C:カンヌ国際映画祭
In full-text search, using A and B together can improve both precision and recall.
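One way to read the remark about full-text search is that indexing units from more than one mode lets both short and long queries match. The following is a minimal sketch under that reading. The units are hard-coded from the first core-dictionary example above rather than produced by a Sudachi call, and the class SplitModeSketch and its members are hypothetical.

```java
import java.util.LinkedHashSet;
import java.util.List;
import java.util.Map;
import java.util.Set;

// Illustrative sketch only; the data is copied from the example above.
class SplitModeSketch {
    enum Mode { A, B, C }

    // Units of the same surface string under each split mode.
    static final Map<Mode, List<String>> UNITS = Map.of(
        Mode.A, List.of("選挙", "管理", "委員", "会"),
        Mode.B, List.of("選挙", "管理", "委員会"),
        Mode.C, List.of("選挙管理委員会"));

    /** Collects the distinct units of the given modes, e.g. as index terms. */
    static Set<String> indexTerms(Mode... modes) {
        Set<String> terms = new LinkedHashSet<>();
        for (Mode mode : modes) {
            terms.addAll(UNITS.get(mode));
        }
        return terms;
    }
}
```

With indexTerms(Mode.A, Mode.B), the term set contains both 委員 and 委員会, so a query for either unit can match the same document.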
You can use or create plugins that modify the behavior of Sudachi.
Type of Plugins | Example |
---|---|
Modify the Inputs | Character normalization |
Make OOVs | Considering script styles |
Connect Words | Inhibition, Overwrite costs |
Modify the Path | Fix Person names, Equalization of splitting |
The following plugins are provided.
Type of Plugins | Plugin | Description |
---|---|---|
Modify the Inputs | Character normalization | Full/half-width, cases, variants |
Modify the Inputs | Normalization of prolonged sound marks | Normalizes "~" and runs of "ー" |
Make OOVs | Make one-character OOVs | Used as the fallback |
Make OOVs | MeCab-compatible OOVs | |
Connect Words | Inhibition | Specified by part of speech |
Modify the Path | Join Katakana OOVs | |
Modify the Path | Join numerics | |
Modify the Path | Equalization of splitting* | Smooths the splitting of OOVs and non-OOVs |
Modify the Path | Normalize numerics | Normalizes Kanji numerics and scales |
Modify the Path | Estimate person names* | |
Plugins marked with * will be released at a later date.
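To make the "Modify the Path" category more concrete, here is a minimal sketch of a numeric-joining step. It is not Sudachi's plugin API: the PathRewriter interface, the use of plain strings as tokens, and the digit test are all assumptions made for illustration.

```java
import java.util.ArrayList;
import java.util.List;

// Illustrative sketch only; not Sudachi's plugin interface.
class PathRewriteSketch {
    /** Hypothetical hook: receives the chosen path and may rewrite it. */
    interface PathRewriter {
        List<String> rewrite(List<String> path);
    }

    /** True if the token is non-empty and consists only of Unicode digits. */
    static boolean isNumeric(String token) {
        return !token.isEmpty() && token.codePoints().allMatch(Character::isDigit);
    }

    /** Merges each run of adjacent numeric tokens into a single token. */
    static final PathRewriter JOIN_NUMERICS = path -> {
        List<String> out = new ArrayList<>();
        for (String token : path) {
            int last = out.size() - 1;
            if (last >= 0 && isNumeric(token) && isNumeric(out.get(last))) {
                out.set(last, out.get(last) + token);
            } else {
                out.add(token);
            }
        }
        return out;
    };
}
```

Applied to the tokens 2, 0, 1, 8, 年, the rewriter returns 2018, 年. A real plugin would operate on lattice nodes with part-of-speech information rather than on bare strings.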
Sudachi normalizes the following variations.
- Okurigana
  - e.g. 打込む → 打ち込む
- Script
  - e.g. かつ丼 → カツ丼
- Variant
  - e.g. 附属 → 付属
- Misspelling
  - e.g. シュミレーション → シミュレーション
- Contracted form
  - e.g. ちゃあ → ては
DefaultInputTextPlugin normalizes input text in the following order.
- Lowercasing with Character.toLowerCase()
- Unicode normalization (NFKC)
When rewrite.def contains the following kinds of entries, DefaultInputTextPlugin does not apply the above processing and gives priority to these rules instead.
- Ignore
# single code point: this character is skipped in character normalization
髙
- Replace
# rewrite rule: <target> <replacement>
A' Ā
If the number of characters increases as a result of character normalization, Sudachi may output morphemes whose length is 0 in the original input text.
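The order described above (lowercasing, then NFKC, with rewrite.def entries taking priority) can be sketched with the JDK alone. This is a simplified illustration, not the DefaultInputTextPlugin implementation. The class and method names are hypothetical, the rules are passed in as a set and a map rather than read from rewrite.def, and the character ㈱ is used only because NFKC expands it to three characters.

```java
import java.text.Normalizer;
import java.util.Map;
import java.util.Set;

// Illustrative sketch only; not the DefaultInputTextPlugin source.
class InputNormalizationSketch {

    /** Lowercases each code point, then applies Unicode NFKC normalization. */
    static String defaultNormalize(String s) {
        StringBuilder lower = new StringBuilder();
        s.codePoints().map(Character::toLowerCase).forEach(lower::appendCodePoint);
        return Normalizer.normalize(lower, Normalizer.Form.NFKC);
    }

    /**
     * Applies replacement rules (longest target first) and ignored code points
     * before falling back to the default normalization for everything else.
     */
    static String normalize(String text, Set<Integer> ignored, Map<String, String> replacements) {
        StringBuilder out = new StringBuilder();
        StringBuilder pending = new StringBuilder(); // text awaiting default normalization
        int i = 0;
        while (i < text.length()) {
            String hit = null;
            for (String target : replacements.keySet()) {
                if (text.startsWith(target, i) && (hit == null || target.length() > hit.length())) {
                    hit = target;
                }
            }
            int cp = text.codePointAt(i);
            if (hit != null) {
                // A replace rule wins: flush what came before, emit the replacement as-is.
                out.append(defaultNormalize(pending.toString()));
                pending.setLength(0);
                out.append(replacements.get(hit));
                i += hit.length();
            } else if (ignored.contains(cp)) {
                // An ignored code point is copied through without normalization.
                out.append(defaultNormalize(pending.toString()));
                pending.setLength(0);
                out.appendCodePoint(cp);
                i += Character.charCount(cp);
            } else {
                pending.appendCodePoint(cp);
                i += Character.charCount(cp);
            }
        }
        return out.append(defaultNormalize(pending.toString())).toString();
    }
}
```

For example, normalize("㈱ＳＵＤＡＣＨＩ", Set.of(), Map.of()) yields "(株)sudachi": one input character has become three, which is the situation in which a morpheme can have length 0 in the original text. Passing Set.of(0x3231) as the ignored set keeps ㈱ unchanged.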
To create and use your own dictionaries, please refer to docs/user_dict.md.
Feature | Sudachi | MeCab | kuromoji |
---|---|---|---|
Multiple segmentation | Yes | No | Limited ^a |
Normalization | Yes | No | Limited ^b |
Joining, correction | Yes | No | Limited ^b |
Multiple user dictionaries | Yes | Yes | No |
Memory saving | Good ^c | Poor | Good |
Accuracy | Good | Good | Good |
Speed | Good | Excellent | Good |
- ^a: approximation with n-best
- ^b: with Lucene filters
- ^c: dictionary sharing across multiple Java VMs via memory mapping
- Speeding up
- Releasing plugins
- Improving the accuracy
- Adding more split information
- Adding more normalized forms
- Fixing reading forms (pronunciation -> Furigana)
- Coordinating segmentation with the synonym dictionary
Sudachi by Works Applications Co., Ltd. is licensed under the Apache License, Version 2.0.
Copyright (c) 2017 Works Applications Co., Ltd.
Licensed under the Apache License, Version 2.0 (the "License"); you may not use this file except in compliance with the License. You may obtain a copy of the License at
http://www.apache.org/licenses/LICENSE-2.0
Unless required by applicable law or agreed to in writing, software distributed under the License is distributed on an "AS IS" BASIS, WITHOUT WARRANTIES OR CONDITIONS OF ANY KIND, either express or implied. See the License for the specific language governing permissions and limitations under the License.
We also provide a plugin for Elasticsearch.
An implementation of Sudachi in Python is also available.
We have a Slack workspace for developers and users to ask questions and discuss a variety of topics.
- https://sudachi-dev.slack.com/
- (Please get an invite from here)
We have published a paper about Sudachi and its language resources: "Sudachi: a Japanese Tokenizer for Business" (Takaoka et al., LREC2018).
When citing Sudachi in papers, books, or services, please use the following BibTeX entry:
@InProceedings{TAKAOKA18.8884,
author = {Kazuma Takaoka and Sorami Hisamoto and Noriko Kawahara and Miho Sakamoto and Yoshitaka Uchida and Yuji Matsumoto},
title = {Sudachi: a Japanese Tokenizer for Business},
booktitle = {Proceedings of the Eleventh International Conference on Language Resources and Evaluation (LREC 2018)},
year = {2018},
month = {may},
date = {7-12},
location = {Miyazaki, Japan},
editor = {Nicoletta Calzolari (Conference chair) and Khalid Choukri and Christopher Cieri and Thierry Declerck and Sara Goggi and Koiti Hasida and Hitoshi Isahara and Bente Maegaard and Joseph Mariani and Hélène Mazo and Asuncion Moreno and Jan Odijk and Stelios Piperidis and Takenobu Tokunaga},
publisher = {European Language Resources Association (ELRA)},
address = {Paris, France},
isbn = {979-10-95546-00-9},
language = {english}
}