CTRD is a new Chinese Theme-Rheme Discourse Dataset for Chinese discourse analysis. It contains 525 manually annotated news articles (45,591 sentences in total) extracted from OntoNotes 4.0. Unlike the Penn Discourse TreeBank (PDTB) and datasets based on Rhetorical Structure Theory (RST), CTRD is annotated according to a novel discourse annotation scheme for Chinese based on Halliday’s Systemic Functional Grammar (SFG) and Thematic Progression Patterns (TPP).
If you use the dataset in your work, please cite the following papers:
[1] Yiqi Tong, Jiangbin Zheng, Hongkang Zhu, Yidong Chen, Xiaodong Shi. A Document-Level Neural Machine Translation Model with Dynamic Caching Guided by Theme-Rheme Information. In: Proceedings of the 28th International Conference on Computational Linguistics (COLING 2020), Barcelona, Spain (Online), Dec. 8-13, 2020, pp. 4385–4395.
[2] Biao Fu, Yiqi Tong, Dawei Tian, Yidong Chen, Xiaodong Shi, Ming Zhu. CTRD: A Chinese Theme-Rheme Discourse Dataset. In: Proceedings of the 10th CCF International Conference on Natural Language Processing and Chinese Computing (NLPCC 2021), Qingdao, China, Oct. 13-17, 2021.