[1]ZHANG Hongxi,CAI Zhijie.Construction of a Tibetan verb-ending type dataset for automatic question answering[J].CAAI Transactions on Intelligent Systems,2025,20(5):1207-1216.[doi:10.11992/tis.202410002]

Construction of a Tibetan verb-ending type dataset for automatic question answering

CAAI Transactions on Intelligent Systems [ISSN 1673-4785/CN 23-1538/TP]; Volume: 20; Issue: 5 (2025); Pages: 1207-1216; Column: Academic Papers: Natural Language Processing and Understanding; Publication date: 2025-09-05

Title: Construction of a Tibetan verb-ending type dataset for automatic question answering

Author(s): ZHANG Hongxi¹,²; CAI Zhijie¹,²
1. College of Computer Science and Technology, Qinghai Normal University, Xining 810016, China;
2. The State Key Laboratory of Tibetan Intelligence, Xining 810008, China

Keywords: natural language processing; Tibetan; automatic Q&A; TiQuAD_36414 dataset; Q&A template; verb; la case auxiliary word; effectiveness

CLC: TP391

DOI: 10.11992/tis.202410002

Abstract: A Tibetan automatic question answering (Q&A) dataset is a crucial data foundation for advancing research on Tibetan automatic Q&A technologies. To address the shortage of automatic Q&A datasets in Tibetan, this paper first reviews the current state of automatic Q&A dataset construction in English, Chinese, and Tibetan, and on that basis examines the features of the most common verb-ending type sentences in Tibetan. It then constructs sentence and question templates and proposes a template-based method for building a Tibetan automatic Q&A dataset from “verb-ending + la case auxiliary word” sentences. A new Tibetan automatic Q&A dataset, TiQuAD_36414, is generated with this method. Finally, the validity of the dataset is verified using the mean opinion score (MOS) method, along with the F1 and exact match (EM) scores of the BiDAF (bidirectional attention flow), R-Net (gated self-matching networks), and QANet (question answering net) models. The experimental results show that TiQuAD_36414 outperforms the baseline Tibetan Q&A dataset.
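For readers unfamiliar with the two automatic metrics named in the abstract, the sketch below shows how EM and token-level F1 are commonly computed for span-extraction Q&A. It is an illustration only, not the authors' evaluation code: the function names are hypothetical, and the tokenization of Tibetan answers is left to the caller because the abstract does not specify it.

```python
from collections import Counter


def exact_match(prediction: str, reference: str) -> float:
    """Return 1.0 if the two answers are identical after trimming whitespace."""
    # Assumption: no further normalization is applied; the paper may differ.
    return float(prediction.strip() == reference.strip())


def f1_score(pred_tokens: list[str], ref_tokens: list[str]) -> float:
    """Token-level F1: harmonic mean of precision and recall over shared tokens."""
    if not pred_tokens or not ref_tokens:
        # Two empty answers count as a match; exactly one empty counts as a miss.
        return float(pred_tokens == ref_tokens)
    # Multiset intersection counts each shared token at most as often
    # as it appears in both answers.
    overlap = sum((Counter(pred_tokens) & Counter(ref_tokens)).values())
    if overlap == 0:
        return 0.0
    precision = overlap / len(pred_tokens)
    recall = overlap / len(ref_tokens)
    return 2 * precision * recall / (precision + recall)
```

A dataset-level score is usually the mean of these per-question values, so a predicted answer that overlaps the reference only partially still earns F1 credit while scoring 0 on EM.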

Last Update: 2025-09-05
