[1] REN Qingji, CAI Zhijie. A method for enhancing Tibetan text data based on adjective knowledge base[J]. CAAI Transactions on Intelligent Systems, 2026, 21(2): 519-528. [doi:10.11992/tis.202503033]
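CAAI Transactions on Intelligent Systems [ISSN 1673-4785/CN 23-1538/TP] Volume:
21
Issue:
No. 2, 2026
Pages:
519-528
Column:
Academic Papers: Fundamentals of Artificial Intelligence
Publication date:
2026-03-05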
- Title:
-
A method for enhancing Tibetan text data based on adjective knowledge base
- Author(s):
-
REN Qingji (仁青吉)1,2, CAI Zhijie (才智杰)1,2
-
1. College of Computer Science and Technology, Qinghai Normal University, Xining 810016, China;
2. The State Key Laboratory of Tibetan Intelligent Information Processing and Application, Xining 810008, China
-
- Keywords:
-
natural language processing; low-resource languages; Tibetan language; adjectives; knowledge base; data augmentation; modified object; sentence structure
- Classification number:
-
TP391
- DOI:
-
10.11992/tis.202503033
- Abstract:
-
In deep-learning-based natural language processing, the quality and scale of a dataset directly affect model performance. Data augmentation, an effective means of expanding and enriching datasets, is an essential technique in natural language processing. To address the scarcity of Tibetan data resources, this paper analyzes the semantic, sentiment, and modified-object features of Tibetan adjectives on the basis of real corpora. Tibetan adjectives are classified by semantic features and modified objects into 5 main categories (describing properties, states, quantities, senses, and feelings) comprising 46 subcategories. By extracting the features of Tibetan adjectives and of the objects they modify, a Tibetan adjective knowledge base and a synonym table for modified objects are constructed, and a Tibetan text data augmentation method based on the adjective knowledge base is proposed. The method replaces adjectives by matching features such as adjective type and syllable count; it also matches the sentence structure of each modified object and replaces that object with the corresponding word from the synonym table. Experimental results show that the method significantly increases the volume of Tibetan text data, achieving a total growth rate of 990.22% on a sentence set drawn from Tibetan textbooks for primary school grades one through six. It also performs well in downstream tasks: with RoBERTa, TiBERT, TBERT, and CINO as the pre-trained models, the correlation coefficient of the SimCSE model increases by 8.78, 3.17, 0.61, and 1.33 percentage points, respectively, and in the text classification task, accuracy, recall, and F1 score improve by 5.97, 9.51, and 9.31 percentage points, respectively.
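The replacement procedure summarized in the abstract can be illustrated with a minimal Python sketch. This is not the authors' implementation: the knowledge-base layout, the category labels ("color", "size"), the structure tag "N+ADJ", the syllable counts, and all function names are assumptions made for illustration, and English glosses stand in for Tibetan tokens. The `growth_rate` helper reflects only one plausible reading of "growth rate" (generated sentences relative to original sentences); the abstract does not define the term.

```python
from dataclasses import dataclass
from itertools import product


@dataclass(frozen=True)
class AdjEntry:
    word: str
    category: str   # hypothetical subcategory label, not one of the paper's 46
    syllables: int  # hypothetical syllable count stored in the knowledge base


# Toy adjective knowledge base (assumed layout; glosses stand in for Tibetan).
ADJ_KB = [
    AdjEntry("red", "color", 2),
    AdjEntry("yellow", "color", 2),
    AdjEntry("white", "color", 2),
    AdjEntry("big", "size", 2),
    AdjEntry("crimson", "color", 3),
]

# Toy synonym table for modified objects, keyed by (word, structure tag).
OBJ_SYNONYMS = {
    ("flower", "N+ADJ"): ["blossom"],
}


def adjective_candidates(word, kb=ADJ_KB):
    """Adjectives sharing the category and syllable count of `word`."""
    entry = next((e for e in kb if e.word == word), None)
    if entry is None:
        return [word]  # unknown adjective: keep it unchanged
    return [
        e.word
        for e in kb
        if e.category == entry.category and e.syllables == entry.syllables
    ]


def object_candidates(word, structure, table=OBJ_SYNONYMS):
    """The object itself plus synonyms listed for the same sentence structure."""
    return [word] + table.get((word, structure), [])


def augment(tokens, structure, adj_index, obj_index):
    """Generate every variant of `tokens` that differs from the original."""
    variants = []
    adjs = adjective_candidates(tokens[adj_index])
    objs = object_candidates(tokens[obj_index], structure)
    for adj, obj in product(adjs, objs):
        new_tokens = list(tokens)
        new_tokens[adj_index] = adj
        new_tokens[obj_index] = obj
        if new_tokens != list(tokens):
            variants.append(new_tokens)
    return variants


def growth_rate(n_original, n_generated):
    """Assumed definition: generated sentences as a percentage of originals."""
    return 100.0 * n_generated / n_original


if __name__ == "__main__":
    # Tibetan places the adjective after the noun, hence the (noun, adjective) order.
    sentence = ["flower", "red"]
    new_sentences = augment(sentence, "N+ADJ", adj_index=1, obj_index=0)
    for s in new_sentences:
        print(" ".join(s))
    print(growth_rate(1, len(new_sentences)))
```

Storing the category and syllable count on each entry keeps the matching step a plain filter over the knowledge base. Keying the synonym table on both the word and a structure tag means an object is only replaced in contexts where the table lists a synonym for that structure.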
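Memo
Received: 2025-03-24.
Funding: National Natural Science Foundation of China (61866032, 61966031); Qinghai Provincial Department of Science and Technology project (2019-SF-129); Key Laboratory of Tibetan Information Processing of the Ministry of Education project (2020-ZJ-Y05).
About the authors: REN Qingji, doctoral candidate; main research interests are Tibetan information processing and Tibetan natural language processing. E-mail: 1054808891@qq.com. CAI Zhijie, professor, doctoral supervisor, Ph.D.; main research interests are Tibetan information processing and Tibetan natural language processing; has published 64 academic papers. E-mail: Czjqhsd@163.com.
Corresponding author: CAI Zhijie. E-mail: Czjqhsd@163.com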