[1] YAO Lin, LIU Yi, LI Xinxin, et al. Chinese named entity recognition via word boundary-based character embedding[J]. CAAI Transactions on Intelligent Systems, 2016, 11(1): 37-42. [doi:10.11992/tis.201507065]

Chinese named entity recognition via word boundary-based character embedding

Editorial Office of CAAI Transactions on Intelligent Systems [ISSN: 1673-4785 / CN: 23-1538/TP]

Volume:
11
Issue:
No. 1, 2016
Pages:
37-42
Column:
Publication date:
2016-02-25

Article Info

Title:
Chinese named entity recognition via word boundary-based character embedding
Author(s):
YAO Lin 1,2,3; LIU Yi 1; LI Xinxin 4; LIU Hong 2
1. Shenzhen High-Tech Industrial Park, Shenzhen 518057, China;
2. School of Electronics Engineering and Computer Science, Peking University, Beijing 100871, China;
3. School of Software, Harbin Institute of Technology, Harbin 150001, China;
4. School of Computer Science and Technology, Harbin Institute of Technology Shenzhen Graduate School, Shenzhen 518055, China
Keywords:
machine learning; Chinese named entity recognition; deep neural networks; feature vector; feature extraction
CLC number:
TP391.1
DOI:
10.11992/tis.201507065
Abstract:
Most machine-learning-based Chinese named entity recognition systems rely on a large number of manually extracted features, and extracting those features is time-consuming and laborious. To reduce this dependence on feature extraction, this paper presents a Chinese named entity recognition system built on word-boundary-based character embeddings. The method uses a neural network to automatically extract the feature information contained in a large amount of unlabeled data and generates character feature vectors. Because the character is not the most basic semantic unit of Chinese, plain character vectors suffer from semantic ambiguity when one character has several meanings. Since the same character usually carries a different meaning at different positions within a word, the position of each character within its word is added to the character feature vector to form a word-boundary character vector, which is then used to train a deep neural network model. The system achieves an F1 score of 89.18% on the SIGHAN Bakeoff-3 (2006) MSRA corpus, which is close to the state of the art, and shows that it both removes the dependence on feature extraction and reduces the semantic confusion caused by character polysemy.
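The idea of attaching word-position information to each character can be illustrated with a small sketch. It assumes a B/M/E/S tagging scheme (begin, middle, end, single-character word) and a `char_TAG` token format; both are illustrative choices, since the abstract only states that the character's position within its word is added to the character vector. The function name `boundary_tag_chars` is likewise hypothetical.

```python
from typing import List


def boundary_tag_chars(words: List[str]) -> List[str]:
    """Turn a segmented sentence into position-tagged character tokens.

    Assumed scheme (not taken from the paper): B = first character of a
    word, M = middle character, E = last character, S = single-character word.
    """
    tokens = []
    for word in words:
        if len(word) == 1:
            tokens.append(f"{word}_S")
            continue
        for i, ch in enumerate(word):
            if i == 0:
                tag = "B"
            elif i == len(word) - 1:
                tag = "E"
            else:
                tag = "M"
            tokens.append(f"{ch}_{tag}")
    return tokens


# The same character receives different tokens at different word positions,
# so an embedding trained on these tokens keeps a separate vector for each.
sentence = ["北京大学", "在", "北京"]
print(boundary_tag_chars(sentence))
```

In this example the character 京 appears once as `京_M` and once as `京_E`, so a vector lookup keyed on the tagged token would keep the two uses apart. Any embedding trainer that accepts token sequences could, under these assumptions, be run on the tagged output of a segmented unlabeled corpus.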

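Memo

Received: 2015-08-13; Revised:
Funding: Original Project R&D and Intangible Cultural Heritage Industrialization Funding Program (YC2015057).
About the authors: YAO Lin, born in 1975, senior engineer. Main research interests are bioinformatics and natural language processing. Has led and participated in a number of research projects and published more than 20 academic papers. LIU Yi, born in 1972, research fellow. Main research interests are speech recognition, multimedia information processing, and embedded software and systems. Has led and participated in dozens of projects, including National Natural Science Foundation of China projects, and published more than 50 academic papers, of which 6 are indexed by SCI and 22 by EI. LIU Hong, born in 1967, professor and doctoral supervisor, among the first experts selected for the national "Ten Thousand Talents Program" and a national "Leading Young and Middle-aged Talent in Scientific and Technological Innovation". Main research interests are hardware-software co-design, computer vision and intelligent robots, and image processing and pattern recognition. Has published more than 50 academic papers.
Corresponding author: YAO Lin. E-mail: 1250047487@qq.com.
Last Update: 1900-01-01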