[1]ZHANG Xiaochuan,CHEN Panpan,XING Xinlai,et al.A data augmentation method built on GPT-2 model[J].CAAI Transactions on Intelligent Systems,2024,19(1):209-216.[doi:10.11992/tis.202304055]
CAAI Transactions on Intelligent Systems [ISSN 1673-4785 / CN 23-1538/TP]
Volume: 19
Issue: 2024(1)
Pages: 209-216
Column: Artificial Intelligence Deans Forum (人工智能院长论坛)
Publication date: 2024-01-05
- Title: A data augmentation method built on GPT-2 model
- Author(s): ZHANG Xiaochuan; CHEN Panpan; XING Xinlai; YANG Changmeng; TENG Da
- Affiliation: Liangjiang Artificial Intelligence College, Chongqing University of Technology, Chongqing 401135, China
- Keywords: natural language processing; artificial intelligence; data augmentation; sentence classification; few samples; sequence to sequence; generative pre-trained language model; bidirectional encoder representation from Transformers
- CLC: TP391.1
- DOI: 10.11992/tis.202304055
- Abstract: The sentence classification task often suffers from insufficient training data. Moreover, text is discrete, which makes it difficult to augment data while preserving semantics, and balancing semantic consistency with diversity is also challenging. To address these issues, this paper proposes a punishing generative pre-trained transformer for data augmentation, PunishGPT-DA for short. A penalty term and a hyperparameter α are designed to work together with the negative log-likelihood loss function when fine-tuning GPT-2 (generative pre-training 2.0), encouraging the model to attend to outputs that have small predicted probabilities but are still reasonable. A filter based on BERT (bidirectional encoder representation from Transformers) then removes generated samples with significant semantic bias. The method achieves a 16-fold expansion of the training set and, compared with GPT-2, improves accuracy by 1.1%, 4.9%, and 8.7% on intent recognition, question classification, and sentiment analysis, respectively. Experimental results demonstrate that the proposed method effectively balances the requirements of semantic consistency and diversity, improving the training performance of downstream task models.