[1] ZHANG Hongxi, CAI Zhijie. Construction of a Tibetan verb-ending type dataset for automatic question answering[J]. CAAI Transactions on Intelligent Systems, 2025, 20(5): 1207-1216. [doi:10.11992/tis.202410002]

Construction of a Tibetan verb-ending type dataset for automatic question answering

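CAAI Transactions on Intelligent Systems [ISSN 1673-4785/CN 23-1538/TP] Volume: 20 Issue: No. 5, 2025 Pages: 1207-1216 Column: Academic Papers - Natural Language Processing and Understanding Publication date: 2025-09-05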

Title:: Construction of a Tibetan verb-ending type dataset for automatic question answering

Author(s):: ZHANG Hongxi^1,2, CAI Zhijie^1,2; 1. College of Computer Science and Technology, Qinghai Normal University, Xining 810016, China;
2. The State Key Laboratory of Tibetan Intelligence, Xining 810008, China

Keywords:: natural language processing; Tibetan; automatic Q&A; TiQuAD_36414 dataset; Q&A template; verb; la case auxiliary word; effectiveness

CLC number:: TP391

DOI:: 10.11992/tis.202410002

Abstract:: Automatic question answering (Q&A) datasets are a crucial data foundation for research on Tibetan automatic Q&A technologies. To address the shortage of Tibetan automatic Q&A datasets, this paper first reviews the current state of automatic Q&A dataset construction in English, Chinese, and Tibetan, and then analyzes the question-answer structure of verb-ending sentences, the most frequent sentence type in Tibetan. By constructing templates for sentences and questions, it proposes a scheme for building a Tibetan automatic Q&A dataset of the “verb-ending + la case auxiliary word” type, and builds a new dataset, TiQuAD_36414, according to this scheme. The validity of the dataset is verified using the mean opinion score (MOS) method together with the F1 and exact match (EM) scores of the BiDAF (bidirectional attention flow), RNet (gated self-matching networks), and QANet (question answering net) models. The experimental results indicate that TiQuAD_36414 is of good quality and performs better than the baseline Tibetan Q&A dataset.
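
The abstract reports F1 and EM scores without defining them. The following is a minimal sketch of how these two metrics are commonly computed for extractive Q&A, not the authors' evaluation script. The function names and the whitespace tokenizer are assumptions made for illustration; Tibetan text would need its own syllable or word segmenter passed in as `tokenize`.

```python
from collections import Counter
from typing import Callable, List


def whitespace_tokens(text: str) -> List[str]:
    """Hypothetical tokenizer: split on whitespace.

    Assumption for illustration only; Tibetan answers would need a
    dedicated syllable or word segmenter instead.
    """
    return text.split()


def exact_match(prediction: str, reference: str) -> float:
    """Return 1.0 if the stripped answer strings are identical, else 0.0."""
    return float(prediction.strip() == reference.strip())


def token_f1(
    prediction: str,
    reference: str,
    tokenize: Callable[[str], List[str]] = whitespace_tokens,
) -> float:
    """Harmonic mean of token-level precision and recall for one answer pair."""
    pred_tokens = tokenize(prediction)
    ref_tokens = tokenize(reference)
    if not pred_tokens or not ref_tokens:
        # Two empty answers count as a match; exactly one empty is a miss.
        return float(pred_tokens == ref_tokens)
    # Multiset intersection counts each shared token at most as often
    # as it appears in both answers.
    overlap = sum((Counter(pred_tokens) & Counter(ref_tokens)).values())
    if overlap == 0:
        return 0.0
    precision = overlap / len(pred_tokens)
    recall = overlap / len(ref_tokens)
    return 2 * precision * recall / (precision + recall)
```

Dataset-level scores are usually the average of these per-question values over all question-answer pairs. EM rewards only answers that match the reference in full, while F1 gives partial credit for overlapping tokens.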


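Memo

Received: 2024-10-02.
Funding: National Natural Science Foundation of China (61966031, 61866032); Key Laboratory of Tibetan Information Processing, Ministry of Education (2013-Z-Y17, 2014-Z-Y32, 2015-Z-Y03).
About the authors: ZHANG Hongxi, master's student. Main research interests: Tibetan information processing and Tibetan natural language processing. E-mail: 1036974179@qq.com. CAI Zhijie, professor, doctoral supervisor, Ph.D. Main research interests: Tibetan information processing and Tibetan natural language processing. Has published 64 academic papers. E-mail: czjqhsd@163.com.
Corresponding author: CAI Zhijie. E-mail: Czjqhsd@163.com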

Last Update: 2025-09-05
