[1] WANG Quanbin, TAN Ying. Chinese grammatical error correction method based on data augmentation and copy mechanism[J]. CAAI Transactions on Intelligent Systems, 2020, 15(1): 99-106. [doi:10.11992/tis.202001014]

Chinese grammatical error correction method based on data augmentation and copy mechanism

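CAAI Transactions on Intelligent Systems [ISSN 1673-4785/CN 23-1538/TP] Volume: 15 Issue: No. 1, 2020 Pages: 99-106 Section: Academic Papers: Natural Language Processing and Understanding Publication date: 2020-01-05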

Title:: Chinese grammatical error correction method based on data augmentation and copy mechanism

Author(s):: WANG Quanbin, TAN Ying; School of Electronics Engineering and Computer Science, Peking University, Beijing 100871, China

Keywords:: self-attention mechanism; copy mechanism; sequence-to-sequence learning; Chinese; grammatical error correction; neural networks; text generation; fluency

CLC number:: TP389.1

DOI:: 10.11992/tis.202001014

Abstract:: Chinese is a widely used written language, but because it differs inherently from the Indo-European languages, beginning learners of Chinese tend to make a wide variety of grammatical errors. This paper proposes an automatic grammatical error correction method that targets the errors beginners may make in written Chinese, such as wrong characters and improper word order. First, we introduce a copy mechanism into the self-attention model to build a new model, C-Transformer, which maps an erroneous text sequence to the corrected one. Second, starting from public datasets, we use sequence-to-sequence learning to generate, from correct text, corresponding erroneous text in different forms, and we design a filter for the generated erroneous text based on fluency, semantic, and syntactic measures. Finally, exploiting the pictographic nature of Chinese characters, we build vocabularies of visually similar and homophonic words and construct additional error samples by vocabulary mapping to expand the training data. Experimental results show that the method corrects wrong characters, improper word order, missing words, redundant words, and other errors well, and achieves state-of-the-art performance on the standard test set for Chinese grammatical error correction.
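The vocabulary-mapping step described in the abstract can be illustrated with a minimal Python sketch. It is not the authors' implementation: the confusion table, the function names `corrupt` and `make_pairs`, and the `rate` parameter are all hypothetical, and the table holds only a few hand-picked character pairs for illustration. The idea shown is simply that each character of a correct sentence may be swapped for a visually or phonetically similar one, giving an (erroneous, correct) pair for training.

```python
import random

# Hypothetical confusion table: each character maps to characters that look
# or sound alike. The entries are illustrative and not taken from the paper.
CONFUSION_TABLE = {
    "在": ["再"],  # homophones (zai)
    "做": ["作"],  # homophones (zuo)
    "已": ["己"],  # visually similar
    "未": ["末"],  # visually similar
}


def corrupt(sentence, table, rate, rng):
    """Return a copy of `sentence` with some characters swapped for confusable ones."""
    out = []
    for ch in sentence:
        candidates = table.get(ch)
        # Substitute only characters that have an entry, with probability `rate`.
        if candidates and rng.random() < rate:
            out.append(rng.choice(candidates))
        else:
            out.append(ch)
    return "".join(out)


def make_pairs(correct_sentences, table, rate=0.3, seed=0):
    """Build (erroneous, correct) training pairs, skipping unchanged sentences."""
    rng = random.Random(seed)
    pairs = []
    for sent in correct_sentences:
        noisy = corrupt(sent, table, rate, rng)
        if noisy != sent:
            pairs.append((noisy, sent))
    return pairs
```

With `rate=1.0` every character that has a table entry is replaced, so `make_pairs(["我已经在做了"], CONFUSION_TABLE, rate=1.0)` returns `[("我己经再作了", "我已经在做了")]`. A lower rate leaves most characters intact, which is probably closer to how real learner errors are distributed, though the appropriate rate is an assumption of this sketch and not a value given in the source.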



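Memo

Received: 2020-01-09.
Funding: National Key Research and Development Program of China (2018AAA0100300, 2018AAA0102301); National Key Basic Research and Development Program of China (2015CB352302); National Natural Science Foundation of China (61673025, 61375119); Beijing Natural Science Foundation (4162029)
About the authors: WANG Quanbin, Ph.D. candidate; main research interests are machine learning, deep neural networks, and natural language processing. TAN Ying, professor and doctoral supervisor; main research interests are intelligence science, computational intelligence and swarm intelligence, machine learning, artificial neural networks, swarm robotics, and big data mining. He is the inventor of the fireworks algorithm and has published 12 academic monographs and more than 330 academic papers
Corresponding author: TAN Ying. E-mail: ytan@pku.edu.cn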

