
[1] JIAN Ping, ZONG Cheng-qing. A new approach to identifying Chinese maximal-length phrases using bidirectional labeling[J]. CAAI Transactions on Intelligent Systems, 2009, 4(5): 406-413. [doi:10.3969/j.issn.1673-4785.2009.05.004]


A new approach to identifying Chinese maximal-length phrases using bidirectional labeling


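CAAI Transactions on Intelligent Systems [ISSN 1673-4785/CN 23-1538/TP] Volume: 4; Issue: 2009, No. 5; Pages: 406-413; Section: Research Papers (Natural Language Processing and Understanding); Publication date: 2009-10-25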

Title:: A new approach to identifying Chinese maximal-length phrases using bidirectional labeling

Article ID:: 1673-4785(2009)05-0406-08

Author(s):: JIAN Ping, ZONG Cheng-qing; National Laboratory of Pattern Recognition, Institute of Automation, Chinese Academy of Sciences, Beijing 100190, China

Keywords:: maximal-length noun phrase identification; prepositional phrase identification; sequence labeling; bidirectional labeling; fork position

CLC number:: TP391

DOI:: 10.3969/j.issn.1673-4785.2009.05.004

Document code:: A

Abstract:: Chinese maximal-length phrases (maximal-length noun phrases and prepositional phrases) have distinctive linguistic characteristics. When such phrases are labeled deterministically with classifier-based sequential taggers in both the forward (left-to-right) and backward (right-to-left) directions, the results show that the two directions are complementary. Based on this observation, both left-to-right and right-to-left sequential labeling were used to identify Chinese maximal-length noun phrases and prepositional phrases, and a novel probabilistic fusion algorithm based on “fork positions” was developed to combine the bidirectional results. Experiments were carried out on the Penn Chinese Treebank, a segmented, part-of-speech tagged, and fully bracketed corpus. The results confirmed that the proposed algorithm effectively exploits the complementary strengths of the two directions and achieves good phrase identification performance.
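The abstract does not spell out the fusion rule, so the following is only a minimal sketch of one way a fork-position-based probabilistic fusion could look, not the authors' algorithm. It assumes that a fork position is a token where the forward and backward labelers disagree, that each labeler emits a strictly positive probability per token, and that a B/I/O tag scheme is used. The names `fork_segments` and `fuse` are hypothetical.

```python
from math import log
from typing import List, Tuple

# One (tag, probability) pair per token; probabilities are assumed to be > 0.
Labeled = List[Tuple[str, float]]


def fork_segments(fwd: Labeled, bwd: Labeled) -> List[Tuple[int, int]]:
    """Return half-open [start, end) runs of tokens where the two directions disagree."""
    segments: List[Tuple[int, int]] = []
    start = None
    for i, ((f_tag, _), (b_tag, _)) in enumerate(zip(fwd, bwd)):
        if f_tag != b_tag and start is None:
            start = i  # a run of disagreement opens here
        elif f_tag == b_tag and start is not None:
            segments.append((start, i))  # the run closes before token i
            start = None
    if start is not None:
        segments.append((start, len(fwd)))
    return segments


def fuse(fwd: Labeled, bwd: Labeled) -> List[str]:
    """Keep agreed tags; inside each disagreeing run, keep the more probable direction."""
    if len(fwd) != len(bwd):
        raise ValueError("forward and backward outputs must cover the same tokens")
    fused = [tag for tag, _ in fwd]  # agreed positions are identical in both outputs
    for start, end in fork_segments(fwd, bwd):
        # Assumed scoring rule: sum of log-probabilities over the run.
        f_score = sum(log(p) for _, p in fwd[start:end])
        b_score = sum(log(p) for _, p in bwd[start:end])
        if b_score > f_score:
            fused[start:end] = [tag for tag, _ in bwd[start:end]]
    return fused


# Hypothetical outputs for a four-token sentence (backward output is given in
# left-to-right token order so the two lists align).
forward = [("B", 0.9), ("I", 0.6), ("O", 0.55), ("O", 0.95)]
backward = [("B", 0.85), ("I", 0.7), ("I", 0.8), ("O", 0.9)]
fused_tags = fuse(forward, backward)
```

In this made-up example the two directions disagree only on the third token, and the backward labeler is more confident there, so its tag is kept. Scoring whole runs rather than single tokens is a design choice of the sketch that keeps the chosen tags within a run consistent with one direction.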

References:: ［1］XUE Nanwen, XIA Fei, CHIOU Fudong, et al. The Penn Chinese Treebank: phrase structure annotation of a large corpus［J］. Natural Language Engineering, 2005, 11(2): 207-238.
［2］LI Wenjie, ZHOU Ming, PAN Haihua, et al. Corpus-based maximal-length Chinese noun phrases extraction［C］//Advances and Applications on Computational Linguistics. Beijing: Tsinghua University Press, 1995: 119-124. (in Chinese)
［3］ZHOU Qiang, SUN Maosong, HUANG Changning. Automatic identification of Chinese maximal noun phrases［J］. Journal of Software, 2000, 11(2): 195-201. (in Chinese)
［4］WANG Lixia, SUN Honglin. Automatic recognition of prepositional phrases in Chinese［J］. Journal of Chinese Information Processing, 2005, 19(3): 80-86. (in Chinese)
［5］GAN Junwei, HUANG Degen. Automatic identification of Chinese prepositional phrase［J］. Journal of Chinese Information Processing, 2005, 19(4): 17-23. (in Chinese)
［6］ZHOU Guodong, SU Jian, TEY Tongguan. Hybrid text chunking［C］//Proceedings of the 2nd Workshop on Learning Language in Logic and the 4th Conference on Computational Natural Language Learning. Lisbon, Portugal, 2000: 163-165.
［7］KUDO T, MATSUMOTO Y. Chunking with support vector machines［C］//Proceedings of the North American Chapter of the Association for Computational Linguistics. Pittsburgh, USA, 2001: 192-199.
［8］SHA Fei, PEREIRA F. Shallow parsing with conditional random fields［C］//Proceedings of the North American Chapter of the Association for Computational Linguistics. Edmonton, Canada, 2003: 213-220.
［9］BAI Xuemei, LI Jinji, KIM Dongil, et al. Identification of maximal-length noun phrases based on expanded chunks and classified punctuations in Chinese［C］//Proceedings of International Conference on Computer Processing of Oriental Languages. Singapore, 2006: 268-276.
［10］FENG Chong, CHEN Zhaoxiong, HUANG Heyan, et al. Recognition of complex maximal length noun phrase using conditional random fields［J］. Mini-Micro Systems, 2006, 27(6): 1134-1139. (in Chinese)
［11］TJONG KIM SANG E F. Noun phrase recognition by system combination［C］//Proceedings of the North American Chapter of the Association for Computational Linguistics. Seattle, USA, 2000: 50-55.
［12］CHEN Wenliang, ZHANG Yujie, ISAHARA H. An empirical study of Chinese chunking［C］//Proceedings of the Joint Conference of the International Committee on Computational Linguistics and the Association for Computational Linguistics. Sydney, Australia, 2006: 97-104.
［13］LEE Linshan, LIN Longji, CHEN Kehjiann. An efficient natural language processing system specially designed for the Chinese language［J］. Computational Linguistics, 1991, 17(4): 347-374.
［14］WU Yuchieh, YANG Jiechi, LEE Yueshi, et al. Efficient and robust phrase chunking using support vector machines［C］//Proceedings of Asia Information Retrieval Symposium. Singapore, 2006: 350-361.
［15］RATNAPARKHI A. A maximum entropy model for part-of-speech tagging［C］//Proceedings of the Empirical Methods in Natural Language Processing. New Brunswick, USA, 1996: 133-142.
［16］MCCALLUM A, FREITAG D, PEREIRA F. Maximum entropy Markov models for information extraction and segmentation［C］//Proceedings of the International Conference on Machine Learning. Stanford, USA, 2000: 591-598.
［17］TAN Yongmei, YAO Tianshun, CHEN Qing, et al. Applying conditional random fields to Chinese shallow parsing［C］// Proceedings of the Conference on Intelligent Text Processing and Computational Linguistics. Mexico City, Mexico, 2005: 167-176.
［18］ZONG Chengqing. Statistical natural language processing［M］. Beijing: Tsinghua University Press, 2008: 175-177, 179-181. (in Chinese)
［19］KITTLER J, HATEF M, DUIN R P W, et al. On combining classifiers［J］. IEEE Transactions on Pattern Analysis and Machine Intelligence, 1998, 20(3): 226-239.
［20］TJONG KIM SANG E F, VEENSTRA J. Representing text chunks［C］// Proceedings of European Chapter of the Association for Computational Linguistics. Bergen, Norway, 1999: 173-179.
［21］KUDO T. YamCha: Yet another multipurpose chunk annotator［EB/OL］. (2005-09-05)［2009-02-25］. http://www.chasen.org/~tAKu/software/yamcha/.
［22］KUDO T. CRF++: Yet another CRF toolkit［EB/OL］. (2007-03-07)［2009-02-25］. http://crfpp.sourceforge.net/.
［23］QIAN X. Pocket CRF［EB/OL］. (2008-08-05)［2009-02-25］. http://sourceforge.net/projects/pocket-crf-1/files/.

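Memo

About the authors:
JIAN Ping, female, born in 1982, Ph.D. candidate. Her main research interests are natural language processing and dependency parsing.
ZONG Cheng-qing, male, born in 1963, research professor and doctoral supervisor. He is deputy director of the National Laboratory of Pattern Recognition, Institute of Automation, Chinese Academy of Sciences; associate editor of the international journal IEEE Intelligent Systems; invited academic advisor and chair professor at Tsinghua University; adjunct professor at the Graduate University of the Chinese Academy of Sciences; executive board member of the Asian Federation of Natural Language Processing (AFNLP); council member of the Chinese Association for Artificial Intelligence and deputy director of its Natural Language Processing Technical Committee; and council member of the Chinese Information Processing Society of China and deputy director of its Machine Translation Technical Committee. He has served as program committee chair or member for a number of international conferences. His main research interests include the theory and methods of natural language processing, machine translation, and human-computer dialogue. As principal investigator he has led more than 10 national and international cooperative projects and has applied for several national invention patents. He has published more than 70 academic papers and one academic monograph.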

Last Update: 2009-12-29
