[1] REN Qingji, CAI Zhijie. A method for enhancing Tibetan text data based on adjective knowledge base[J]. CAAI Transactions on Intelligent Systems, 2026, 21(2): 519-528. [doi:10.11992/tis.202503033]
CAAI Transactions on Intelligent Systems [ISSN 1673-4785/CN 23-1538/TP]
Volume:
21
Issue:
2026, No. 2
Page number:
519-528
Column:
Academic Papers: Fundamentals of Artificial Intelligence
Publication date:
2026-03-05
- Title:
-
A method for enhancing Tibetan text data based on adjective knowledge base
- Author(s):
-
REN Qingji (1, 2); CAI Zhijie (1, 2)
-
1. College of Computer Science and Technology, Qinghai Normal University, Xining 810016, China;
2. The State Key Laboratory of Tibetan Intelligent Information Processing and Application, Xining 810008, China
-
- Keywords:
-
natural language processing; low-resource languages; Tibetan language; adjectives; knowledge base; data augmentation; modified object; sentence structure
- CLC:
-
TP391
- DOI:
-
10.11992/tis.202503033
- Abstract:
-
In deep-learning-based natural language processing, the quality and scale of datasets directly affect model performance. Data augmentation is an essential technique for expanding and enriching datasets. To address the scarcity of Tibetan data resources, this paper analyzes the semantic, emotional, and modified-object features of Tibetan adjectives based on actual corpora. Tibetan adjectives are divided into five main categories (descriptive properties, states, quantities, sensations, and feelings) comprising forty-six subcategories in total. By extracting the features of Tibetan adjectives and the objects they modify, a knowledge base of Tibetan adjectives and a synonym table for modified objects were constructed. We propose a data augmentation method based on this knowledge base that replaces adjectives with others of matching type and syllable count, and substitutes modified objects with corresponding synonyms according to their syntactic structures. Experimental results indicate that the method substantially increases the volume of Tibetan text data, achieving a total growth rate of 990.22% on sentence sets derived from Tibetan language textbooks for grades one through six. It also performs well in downstream tasks. With RoBERTa, TiBERT, TBERT, and CINO as the pre-trained models, the correlation coefficient of the SimCSE model increases by 8.78, 3.17, 0.61, and 1.33 percentage points, respectively. In the text classification task, accuracy, recall, and F1 score improve by 5.97, 9.51, and 9.31 percentage points, respectively.
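
The replacement idea summarized in the abstract can be illustrated with a minimal Python sketch. It is not the authors' implementation: the knowledge-base entries, the category labels, the synonym table, the function names, and the use of "." as a syllable separator in transliterated words are all assumptions made for illustration (real Tibetan text separates syllables with the tsheg mark, U+0F0B). The sketch swaps an adjective for others of the same category and syllable count, and swaps the modified object for a listed synonym.

```python
from itertools import product

# Hypothetical toy knowledge base: adjective -> category.
# Words are transliterated, with "." standing in for the Tibetan
# syllable separator. Entries and categories are illustrative only.
ADJ_KB = {
    "chen.po": "property",
    "chung.chung": "property",
    "ring.po": "property",
    "mang.po": "quantity",
    "nyung.nyung": "quantity",
}

# Hypothetical synonym table for modified objects.
OBJ_SYNONYMS = {
    "khang.pa": ["khyim"],
}


def syllable_count(word):
    """Count syllables in a "."-separated toy word."""
    return len(word.split("."))


def adjective_candidates(adj):
    """Adjectives with the same category and syllable count as adj (adj included)."""
    category, n = ADJ_KB[adj], syllable_count(adj)
    return [a for a, c in ADJ_KB.items()
            if c == category and syllable_count(a) == n]


def object_candidates(obj):
    """The object itself plus any synonyms listed for it."""
    return [obj] + OBJ_SYNONYMS.get(obj, [])


def augment(tokens, adj_idx, obj_idx):
    """Return new token lists built by substitution, excluding the original sentence."""
    results = []
    pairs = product(adjective_candidates(tokens[adj_idx]),
                    object_candidates(tokens[obj_idx]))
    for adj, obj in pairs:
        new_tokens = list(tokens)
        new_tokens[adj_idx] = adj
        new_tokens[obj_idx] = obj
        if new_tokens != tokens:
            results.append(new_tokens)
    return results


if __name__ == "__main__":
    # Toy sentence: object, adjective, copula (positions given by hand).
    for sentence in augment(["khang.pa", "chen.po", "red"], adj_idx=1, obj_idx=0):
        print(" ".join(sentence))
```

In this sketch the positions of the adjective and its modified object are supplied by hand. Locating them from the sentence structure, as the abstract describes, would require the syntactic analysis that the toy example leaves out.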