[1] WANG Quanbin, TAN Ying. Chinese grammatical error correction method based on data augmentation and copy mechanism[J]. CAAI Transactions on Intelligent Systems, 2020, 15(1): 99-106. [doi:10.11992/tis.202001014]
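CAAI Transactions on Intelligent Systems [ISSN 1673-4785/CN 23-1538/TP] Volume:
15
Issue:
No. 1, 2020
Pages:
99-106
Section:
Academic Papers: Natural Language Processing and Understanding
Publication date:
2020-01-05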
- Title:
-
Chinese grammatical error correction method based on data augmentation and copy mechanism
- Author(s):
-
WANG Quanbin (汪权彬), TAN Ying (谭营)
-
School of Electronics Engineering and Computer Science, Peking University, Beijing 100871, China
- Keywords:
-
self-attention mechanism; copy mechanism; sequence-to-sequence learning; Chinese; grammatical error correction; neural networks; text generation; fluency
- CLC number:
-
TP389.1
- DOI:
-
10.11992/tis.202001014
- Abstract:
-
Chinese is a widely used language. Because it differs fundamentally from Indo-European languages, learners of Chinese tend to make a variety of grammatical errors. This article proposes an automatic grammatical error correction method that targets errors beginners may make in written Chinese, such as wrongly written characters and incorrect word order. First, we introduce a copy mechanism into the self-attention model to build a new model, C-Transformer, which maps an erroneous text sequence to the corresponding correct sequence. Second, starting from public datasets, we use sequence-to-sequence learning to generate different forms of erroneous text from correct text, and we design a filter for the generated erroneous text based on fluency, semantic, and syntactic measures. Finally, drawing on the pictographic nature of Chinese characters, we build homograph and homophone word lists and artificially construct error samples by mapping through these lists to expand the training data. Experimental results show that the method corrects wrongly written characters, improper word order, missing words, redundant words, and other errors well, and that it achieves state-of-the-art results on the standard test set for Chinese grammatical error correction.
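The word-list mapping step described in the abstract can be illustrated with a minimal sketch. Everything in it is an assumption made for illustration: the two small confusion tables, the function name `corrupt`, the `rate` parameter, and the choice to substitute at the character level are hypothetical and are not taken from the paper, whose actual word lists and sampling procedure are not given on this page.

```python
import random

# Hypothetical confusion tables (illustrative only, not the authors' lists).
HOMOPHONES = {"在": ["再"], "的": ["得", "地"]}
SIMILAR_SHAPE = {"己": ["已"], "未": ["末"]}


def corrupt(sentence, tables, rate=0.1, rng=None):
    """Build one (erroneous, correct) training pair from a correct sentence.

    Each character that appears in any confusion table is replaced, with
    probability `rate`, by a randomly chosen confusable character.
    """
    if rng is None:
        rng = random.Random(0)  # fixed seed so the sketch is reproducible
    noisy = []
    for ch in sentence:
        # Gather every confusable candidate for this character.
        candidates = [c for table in tables for c in table.get(ch, [])]
        if candidates and rng.random() < rate:
            noisy.append(rng.choice(candidates))
        else:
            noisy.append(ch)
    return "".join(noisy), sentence


if __name__ == "__main__":
    pair = corrupt("我在家", [HOMOPHONES, SIMILAR_SHAPE], rate=1.0)
    print(pair)
```

Pairs produced this way keep the correct sentence as the target and the corrupted sentence as the input, which is the form a sequence-to-sequence correction model is trained on. This substitution only produces wrong-character errors; the other error types named in the abstract (word order, missing or redundant words) come from the learned generation step and are not covered by this sketch.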
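Memo
Received: 2020-01-09.
Funding: National Key R&D Program of China (2018AAA0100300, 2018AAA0102301); National Key Basic Research Development Program of China (2015CB352302); National Natural Science Foundation of China (61673025, 61375119); Beijing Natural Science Foundation (4162029).
About the authors: WANG Quanbin, doctoral candidate; main research interests are machine learning, deep neural networks, and natural language processing. TAN Ying, professor and doctoral supervisor; main research interests are intelligence science, computational intelligence and swarm intelligence, machine learning, artificial neural networks, swarm robotics, and big data mining. He is the inventor of the fireworks algorithm and has published 12 academic monographs and more than 330 academic papers.
Corresponding author: TAN Ying. E-mail: ytan@pku.edu.cn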
Last Update:
1900-01-01