[1] ZHANG Hongxi, CAI Zhijie. Construction of a Tibetan verb-ending type dataset for automatic question answering[J]. CAAI Transactions on Intelligent Systems, 2025, 20(5): 1207-1216. [doi:10.11992/tis.202410002]
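CAAI Transactions on Intelligent Systems [ISSN 1673-4785/CN 23-1538/TP] Volume:
20
Issue:
No. 5, 2025
Pages:
1207-1216
Section:
Academic Papers: Natural Language Processing and Understanding
Publication date:
2025-09-05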
- Title:
-
Construction of a Tibetan verb-ending type dataset for automatic question answering
- Author(s):
-
ZHANG Hongxi (张洪溪)1,2, CAI Zhijie (才智杰)1,2
-
1. College of Computer Science and Technology, Qinghai Normal University, Xining 810016, Qinghai, China;
2. The State Key Laboratory of Tibetan Intelligence, Xining 810008, Qinghai, China
- Keywords:
-
natural language processing; Tibetan; automatic Q&A; TiQuAD_36414 dataset; Q&A template; verb; la case auxiliary word; effectiveness
- CLC number:
-
TP391
- DOI:
-
10.11992/tis.202410002
- Abstract:
-
Automatic question answering (Q&A) datasets are a crucial data foundation for research on Tibetan automatic Q&A technology. To address the shortage of automatic Q&A datasets in Tibetan, this paper first reviews the current state of automatic Q&A dataset construction in English, Chinese, and Tibetan, and then analyzes the question-answer structure of verb-ending sentences, the most frequent sentence type in Tibetan. By constructing templates for sentences and questions, it proposes a template-based scheme for building a Tibetan automatic Q&A dataset from “verb-ending + la case auxiliary word” sentences, and uses this scheme to build a new Tibetan automatic Q&A dataset, TiQuAD_36414. The validity of the dataset is verified with the mean opinion score (MOS) method and with the F1 and exact match (EM) scores of the BiDAF (bidirectional attention flow), RNet (gated self-matching networks), and QANet (question answering net) models. The experimental results show that TiQuAD_36414 is of good quality and performs better than the baseline Tibetan Q&A dataset.
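The EM and F1 scores named in the abstract are commonly computed per question and then averaged. The following is a minimal sketch of that common convention, not the authors' evaluation script: it assumes answers have already been split into tokens (the abstract does not say how Tibetan text is tokenized for scoring), and the function names `exact_match`, `token_f1`, and `evaluate` as well as the placeholder tokens are hypothetical.

```python
from collections import Counter


def exact_match(prediction: list[str], reference: list[str]) -> float:
    """Return 1.0 if the predicted token sequence equals the reference, else 0.0."""
    return float(prediction == reference)


def token_f1(prediction: list[str], reference: list[str]) -> float:
    """Token-overlap F1 between a predicted and a reference answer."""
    if not prediction or not reference:
        # Two empty answers count as a match; one empty side scores zero.
        return float(prediction == reference)
    # Multiset intersection counts each shared token at most as often
    # as it appears on both sides.
    overlap = sum((Counter(prediction) & Counter(reference)).values())
    if overlap == 0:
        return 0.0
    precision = overlap / len(prediction)
    recall = overlap / len(reference)
    return 2 * precision * recall / (precision + recall)


def evaluate(pairs: list[tuple[list[str], list[str]]]) -> dict[str, float]:
    """Average EM and F1 (as percentages) over (prediction, reference) pairs."""
    n = len(pairs)
    em = sum(exact_match(p, r) for p, r in pairs) / n
    f1 = sum(token_f1(p, r) for p, r in pairs) / n
    return {"EM": 100 * em, "F1": 100 * f1}


# Placeholder tokens stand in for tokenized Tibetan answers (assumption).
example_pairs = [
    (["t1", "t2"], ["t1", "t2"]),        # exact match
    (["t1", "t3"], ["t1", "t2", "t3"]),  # partial overlap
]
scores = evaluate(example_pairs)
print(scores)
```

EM rewards only answers that match the reference exactly, while F1 gives partial credit for overlapping tokens, which is why the two are usually reported together.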
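Memo
Received: 2024-10-02.
Funding: National Natural Science Foundation of China (61966031, 61866032); Project of the Key Laboratory of Tibetan Information Processing, Ministry of Education (2013-Z-Y17, 2014-Z-Y32, 2015-Z-Y03).
About the authors: ZHANG Hongxi, master's student. Main research interests: Tibetan information processing and Tibetan natural language processing. E-mail: 1036974179@qq.com. CAI Zhijie, professor, doctoral supervisor, Ph.D. Main research interests: Tibetan information processing and Tibetan natural language processing. Has published 64 academic papers. E-mail: czjqhsd@163.com.
Corresponding author: CAI Zhijie. E-mail: Czjqhsd@163.com
Last Update:
2025-09-05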