[1] JIAN Ping, ZONG Cheng-qing. A new approach to identifying Chinese maximal-length phrases using bidirectional labeling[J]. CAAI Transactions on Intelligent Systems, 2009, 4(05): 406-413. [doi:10.3969/j.issn.1673-4785.2009.05.004]

A method for identifying Chinese maximal-length phrases based on the fusion of bidirectional labeling

CAAI Transactions on Intelligent Systems [ISSN: 1673-4785 / CN: 23-1538/TP]

Volume:
4
Issue:
2009, No. 05
Pages:
406-413
Section:
Publication date:
2009-10-25

Article Info

Title:
A new approach to identifying Chinese maximal-length phrases using bidirectional labeling
Article ID:
1673-4785(2009)05-0406-08
Author(s):
JIAN Ping, ZONG Cheng-qing
National Laboratory of Pattern Recognition, Institute of Automation, Chinese Academy of Sciences, Beijing 100190, China
Keywords:
maximal-length noun phrase identification; prepositional phrase identification; sequence labeling; bidirectional labeling; fork position
CLC number:
TP391
DOI:
10.3969/j.issn.1673-4785.2009.05.004
Document code:
A
Abstract:
Chinese maximal-length phrases (maximal-length noun phrases and prepositional phrases) have distinctive linguistic properties. When such phrases are labeled with deterministic, classifier-based sequential labeling in both directions, the results show that left-to-right and right-to-left labeling are complementary. In this paper, both left-to-right and right-to-left sequential labeling are used to identify Chinese maximal-length noun phrases and prepositional phrases, and a novel "fork position" based probabilistic algorithm is proposed to fuse the bidirectional results. Experiments were carried out on the Penn Chinese Treebank, a segmented, part-of-speech tagged, and fully bracketed corpus. The results confirm that the proposed algorithm effectively exploits the complementary strengths of the two directions and achieves good phrase identification performance.
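
To make the fusion idea in the abstract concrete, here is a minimal Python sketch. It is not the authors' algorithm: the abstract does not give the scoring details, so the sketch assumes that a "fork position" is where the left-to-right and right-to-left label sequences start to disagree, and that each disagreeing span is resolved by keeping the direction whose labels have the higher summed log-probability. The names `fork_spans` and `fuse`, the B/I/O labels, and the probabilities are all hypothetical.

```python
from math import log
from typing import List, Tuple

# One (label, classifier probability) pair per token; probabilities must be > 0.
Labeled = List[Tuple[str, float]]


def fork_spans(fwd_labels: List[str], bwd_labels: List[str]) -> List[Tuple[int, int]]:
    """Return maximal [start, end) spans where the two label sequences disagree."""
    spans = []
    start = None
    for i, (a, b) in enumerate(zip(fwd_labels, bwd_labels)):
        if a != b and start is None:
            start = i                      # a fork opens here
        elif a == b and start is not None:
            spans.append((start, i))       # the two directions agree again
            start = None
    if start is not None:
        spans.append((start, len(fwd_labels)))
    return spans


def fuse(forward: Labeled, backward: Labeled) -> List[str]:
    """Fuse two labelings; assumed rule: higher summed log-probability wins per span."""
    if len(forward) != len(backward):
        raise ValueError("both labelings must cover the same tokens")
    fwd_labels = [label for label, _ in forward]
    bwd_labels = [label for label, _ in backward]
    fused = list(fwd_labels)               # agreeing positions are kept as they are
    for start, end in fork_spans(fwd_labels, bwd_labels):
        fwd_score = sum(log(p) for _, p in forward[start:end])
        bwd_score = sum(log(p) for _, p in backward[start:end])
        if bwd_score > fwd_score:
            fused[start:end] = bwd_labels[start:end]
    return fused


# Hypothetical output of a left-to-right and a right-to-left labeler for five tokens.
forward = [("B", 0.9), ("I", 0.8), ("O", 0.6), ("O", 0.7), ("B", 0.9)]
backward = [("B", 0.95), ("I", 0.9), ("I", 0.9), ("O", 0.8), ("B", 0.85)]
print(fuse(forward, backward))
```

Choosing labels span by span can yield an ill-formed sequence (for example an `I` directly after an `O`), so a practical system would need a consistency check after fusion; the sketch leaves that out.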

Memo

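Memo:
Author biographies:
JIAN Ping, female, born in 1982, Ph.D. candidate. Her main research interests are natural language processing and dependency parsing.
ZONG Cheng-qing, male, born in 1963, research professor and doctoral supervisor. He is deputy director of the National Laboratory of Pattern Recognition, Institute of Automation, Chinese Academy of Sciences; associate editor of the international journal IEEE Intelligent Systems; invited academic advisor and chair professor at Tsinghua University; adjunct professor at the Graduate University of the Chinese Academy of Sciences; executive member of the Asian Federation of Natural Language Processing (AFNLP); council member of the Chinese Association for Artificial Intelligence and deputy director of its Natural Language Processing Technical Committee; and council member of the Chinese Information Processing Society of China and deputy director of its Machine Translation Technical Committee. He has served as program committee chair or member for a number of international conferences. His main research interests include the theory and methods of natural language processing, machine translation, and human-computer dialogue. As principal investigator he has led more than 10 national and international cooperative projects and applied for several national invention patents. He has published more than 70 academic papers and one academic monograph.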
Last Update: 2009-12-29