ZHANG Xiaochuan, CHEN Panpan, XING Xinlai, et al. A data augmentation method built on GPT-2 model[J]. CAAI Transactions on Intelligent Systems, 2024, 19(1): 209-216. DOI: 10.11992/tis.202304055.
《智能系统学报》(CAAI Transactions on Intelligent Systems) [ISSN 1673-4785 / CN 23-1538/TP]
Volume: 19
Issue: 2024, No. 1
Pages: 209-216
Column: Artificial Intelligence Deans Forum
Publication date: 2024-01-05
- Title: A data augmentation method built on GPT-2 model
- Author(s): ZHANG Xiaochuan (张小川), CHEN Panpan (陈盼盼), XING Xinlai (邢欣来), YANG Changmeng (杨昌萌), TENG Da (滕达)
- Affiliation: Liangjiang Artificial Intelligence College, Chongqing University of Technology, Chongqing 401135, China
- Keywords: natural language processing; artificial intelligence; data augmentation; sentence classification; few samples; sequence to sequence; generative pre-trained language model; bidirectional encoder representations from Transformers
- CLC number: TP391.1
- DOI: 10.11992/tis.202304055
- Document code: 2024-01-04
- Abstract: The sentence classification task often suffers from insufficient training data. Because text is discrete, it is difficult to perform data augmentation while preserving semantics, and semantic consistency is hard to balance against diversity. To address these problems, this paper proposes a punishing generative pre-trained transformer for data augmentation (PunishGPT-DA). A penalty term and a hyperparameter α are designed; together with the negative log-likelihood loss they are used to fine-tune GPT-2 (generative pre-training 2.0), encouraging the model to attend to outputs that have small predicted probabilities but are still reasonable. A filter based on BERT (bidirectional encoder representations from Transformers) then removes generated samples whose semantics deviate too far from the originals. The method achieves a 16-fold expansion of the training set and, compared with GPT-2, improves accuracy by 1.1%, 4.9%, and 8.7% on intent recognition, question classification, and sentiment analysis, respectively. Experimental results demonstrate that the proposed method effectively balances the requirements of semantic consistency and diversity and improves the training performance of downstream task models.
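The abstract describes two components that lend themselves to a short code sketch: a penalized fine-tuning loss for GPT-2 and a BERT-based semantic filter. The paper's exact penalty term and filtering rule are not given in this record, so the PyTorch / Hugging Face sketch below is illustrative only: the over-confidence penalty (mean top-1 probability) weighted by α, the mean-pooled BERT sentence embeddings, and the similarity threshold of 0.8 are assumptions, not the authors' implementation.

# Minimal sketch; penalty form, ALPHA, pooling, and threshold are assumed, not taken from the paper.
import torch
import torch.nn.functional as F
from transformers import GPT2LMHeadModel, GPT2Tokenizer, BertModel, BertTokenizer

ALPHA = 0.1  # assumed value of the hyperparameter alpha

def punished_loss(logits, labels):
    # Standard LM objective: shift so each position predicts the next token.
    logits = logits[:, :-1, :].contiguous()
    labels = labels[:, 1:].contiguous()
    nll = F.cross_entropy(logits.view(-1, logits.size(-1)), labels.view(-1))
    # Assumed penalty: mean top-1 probability, discouraging the model from putting
    # almost all mass on the single most likely token so that low-probability but
    # still plausible continuations keep some weight.
    penalty = F.softmax(logits, dim=-1).max(dim=-1).values.mean()
    return nll + ALPHA * penalty

def keep_sample(original, generated, bert, bert_tok, threshold=0.8):
    # BERT-based filter: keep a generated sentence only if its embedding stays close
    # to the original's (cosine similarity of mean-pooled last hidden states).
    with torch.no_grad():
        embs = []
        for text in (original, generated):
            enc = bert_tok(text, return_tensors="pt", truncation=True)
            embs.append(bert(**enc).last_hidden_state.mean(dim=1))
    return F.cosine_similarity(embs[0], embs[1]).item() >= threshold

# Usage sketch: one fine-tuning step with the punished loss.
gpt2 = GPT2LMHeadModel.from_pretrained("gpt2")
gpt2_tok = GPT2Tokenizer.from_pretrained("gpt2")
batch = gpt2_tok("what time does the store open?", return_tensors="pt")
loss = punished_loss(gpt2(**batch).logits, batch["input_ids"])
loss.backward()

# Usage sketch: filter a generated sample against its source sentence.
bert = BertModel.from_pretrained("bert-base-uncased")
bert_tok = BertTokenizer.from_pretrained("bert-base-uncased")
keep_sample("what time does the store open?", "when does the shop open?", bert, bert_tok)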
Memo
Received: 2023-04-30.
Foundation items: National Natural Science Foundation of China (61702063); Chongqing Special Project for Technological Innovation and Application Development (cstc2021jscx-dxwtBX0019).
About the authors: ZHANG Xiaochuan, professor, vice dean of the Liangjiang Artificial Intelligence College at Chongqing University of Technology, director of its Institute of Artificial Intelligence Systems, executive director of the Chinese Association for Artificial Intelligence and chairman of its Machine Game technical committee, and executive director and deputy secretary-general of the Chongqing Artificial Intelligence Society; main research interests are computer game playing, intelligent robots, and software engineering; has led or taken part in more than 30 government-funded and more than 50 industry-commissioned research projects, received 2 provincial or ministerial science and technology awards and 2 teaching achievement awards, published more than 100 academic papers, and edited 6 monographs and textbooks, e-mail: zxc@cqut.edu.cn. CHEN Panpan, master's student; main research interests are natural language processing and question-answering service robots, e-mail: 2972646722@qq.com. XING Xinlai, lecturer, Ph.D.; main research interests are natural language processing and dialogue systems; has led or taken part in more than 10 research projects and published more than 10 academic papers, e-mail: xingxinlai@cqut.edu.cn.
Corresponding author: ZHANG Xiaochuan, e-mail: zxc@cqut.edu.cn.