[1] ZHANG Lin, LIU Mingtong, ZHANG Yujie, et al. Explore the low-resource iterative paraphrase generation enhancement method[J]. CAAI Transactions on Intelligent Systems, 2022, 17(4): 680-687. [doi:10.11992/tis.202106032]
CAAI Transactions on Intelligent Systems [ISSN 1673-4785/CN 23-1538/TP]
Volume: 17
Issue: No. 4, 2022
Pages: 680-687
Section: Academic Papers - Machine Learning
Publication date: 2022-07-05
- Title:
Explore the low-resource iterative paraphrase generation enhancement method
- Author(s):
ZHANG Lin, LIU Mingtong, ZHANG Yujie, XU Jin’an, CHEN Yufeng
School of Computer and Information Technology, Beijing Jiaotong University, Beijing 100044, China
-
- Keywords:
low-resource; iterative; paraphrase generation; data enhancement; screening algorithm; neural network model; encoder-decoder framework; attention mechanism
- CLC number:
TP18
- DOI:
10.11992/tis.202106032
- Abstract:
Paraphrase generation aims to convert a given sentence into a semantically consistent but differently worded sentence within the same language. At present, the success of paraphrase generation models based on deep neural networks depends on large-scale paraphrase parallel corpora; when faced with a new language or a new domain, model performance drops sharply. To address this dilemma, we propose a low-resource iterative paraphrase generation enhancement method, which maximizes the use of monolingual corpora and a small-scale paraphrase parallel corpus to iteratively train the paraphrase generation model and generate pseudo paraphrase data, thereby enhancing model performance. Furthermore, we propose a pseudo-data screening algorithm based on sentence fluency, semantic similarity, and expression diversity, which selects high-quality pseudo paraphrase data for each round of iterative training. Experimental results on the public Quora dataset show that the proposed method surpasses the baseline model on both semantic and diversity metrics while using only 30% of the paraphrase corpus, verifying its effectiveness.
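The screening step described in the abstract can be sketched as a ranking over candidate (source, paraphrase) pairs, where a pair must be fluent, semantically close to its source, and lexically different from it. The sketch below is purely illustrative: the paper uses trained neural scorers, whereas the three scoring functions here are simple stand-in proxies (length heuristic, Jaccard word overlap, and copied-word fraction), and the multiplicative combination is an assumption, not the paper's formula.

```python
# Illustrative sketch of pseudo-data screening for paraphrase pairs.
# NOTE: fluency/similarity/diversity are toy proxies, not the paper's models.

def fluency(sentence):
    # Proxy: treat sentences of reasonable length as fluent.
    n = len(sentence.split())
    return 1.0 if 3 <= n <= 30 else 0.0

def semantic_similarity(src, cand):
    # Proxy: Jaccard overlap of word sets (a real system would use
    # sentence embeddings or a trained matching model).
    a, b = set(src.lower().split()), set(cand.lower().split())
    return len(a & b) / len(a | b) if a | b else 0.0

def diversity(src, cand):
    # Proxy: fraction of candidate words not copied from the source.
    a, b = set(src.lower().split()), set(cand.lower().split())
    return len(b - a) / len(b) if b else 0.0

def screen(pairs, top_k):
    # Multiplicative score: a pair scores 0 if it fails any criterion
    # (e.g., an exact copy has diversity 0; an unrelated sentence has
    # similarity 0), so only fluent, close-but-different pairs survive.
    def score(pair):
        src, cand = pair
        return (fluency(cand)
                * semantic_similarity(src, cand)
                * diversity(src, cand))
    return sorted(pairs, key=score, reverse=True)[:top_k]

pairs = [
    ("how can I learn python quickly", "how can I learn python quickly"),
    ("how can I learn python quickly", "what is the fastest way to learn python"),
    ("how can I learn python quickly", "bananas are yellow"),
]
best = screen(pairs, top_k=1)
# The exact copy and the unrelated sentence both score 0, so the
# genuine paraphrase is selected.
```

In the actual method, the surviving pairs would be added to the training data for the next iteration of the paraphrase generation model.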
Memo
Received: 2021-06-23.
Foundation item: National Natural Science Foundation of China (61876198, 61976015, 61976016).
Biographies: ZHANG Lin, master's student, whose main research interests are paraphrase generation and machine translation; LIU Mingtong, Ph.D., whose main research interests are dependency parsing, sentence matching, paraphrase generation, machine translation, and natural language processing; ZHANG Yujie, professor, whose main research interests are machine translation, multilingual information processing, syntactic parsing, and natural language processing, and who has published more than 30 academic papers.
Corresponding author: ZHANG Yujie. E-mail: yjzhang@bjtu.edu.cn