[1] LIU Yanchao, GUO Jianyi, YU Zhengtao, et al. A hybrid method to recognize complex Vietnamese named entity incorporating entity properties[J]. CAAI Transactions on Intelligent Systems, 2016, 11(4): 503-512. [doi:10.11992/tis.201606009]
CAAI Transactions on Intelligent Systems [ISSN 1673-4785/CN 23-1538/TP] Volume:
11
Issue:
No. 4, 2016
Pages:
503-512
Section:
Academic Papers: Knowledge Engineering
Publication date:
2016-07-25
- Title:
-
A hybrid method to recognize complex Vietnamese named entity incorporating entity properties
- Author(s):
-
LIU Yanchao (刘艳超)1, GUO Jianyi (郭剑毅)1,2, YU Zhengtao (余正涛)1,2, ZHOU Lanjiang (周兰江)1,2, YAN Xin (严馨)1,2, CHEN Xiuqin (陈秀琴)3
-
1. School of Information Engineering and Automation, Kunming University of Science and Technology, Kunming 650500, China;
2. Key Laboratory of Intelligent Information Processing, Kunming University of Science and Technology, Kunming 650500, China (English record: Key Laboratory of Pattern Recognition and Intelligent Computing of Yunnan College);
3. School of International Education, Kunming University of Science and Technology, Kunming 650093, China
-
- Keywords:
-
Vietnamese; entity library construction; entity recognition; maximum entropy; rules; entity characteristics
- CLC number:
-
TP391
- DOI:
-
10.11992/tis.201606009
- Abstract:
-
Named entity recognition (NER) is a fundamental task in natural language processing. To address the difficulty of recognizing complex Vietnamese named entities and the resulting low F-scores, this paper proposes a hybrid Vietnamese NER method that incorporates an entity library. First, based on the characteristics of the Vietnamese language and of its entities, effective local and global features are selected and a maximum entropy model is applied to recognize Vietnamese named entities. Second, Vietnamese named entities are recognized using the named entity rules formulated in this paper. Third, the two sets of recognition results are combined, with the rules as the primary source and the statistical model as a supplement. Finally, after manual proofreading, the correctly labeled entities are added to the entity library, which is thereby expanded dynamically and provides rich corpus material and a basis for formulating rules and selecting features. Experiments show that the method effectively combines the strengths of the rule-based and statistical approaches so that each compensates for the other's weaknesses, and that it markedly improves precision, recall, and F-score.
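The combination and library-expansion steps described in the abstract can be illustrated with a minimal sketch. It is not the authors' implementation: the `Entity` type, the function names, the span-overlap conflict policy, and the example sentence with its hand-written recognizer outputs are all assumptions made for illustration. The abstract states only that rules take priority over the statistical model and that manually verified entities are added to the entity library.

```python
from dataclasses import dataclass


@dataclass(frozen=True)
class Entity:
    start: int   # index of the first token (inclusive)
    end: int     # index one past the last token (exclusive)
    label: str   # e.g. "PER", "LOC", "ORG"
    source: str  # "rule" or "stat"


def overlaps(a: Entity, b: Entity) -> bool:
    """True if the two token spans share at least one token."""
    return a.start < b.end and b.start < a.end


def merge_rule_first(rule_entities, stat_entities):
    """Keep every rule-based entity; keep a statistical entity only if it
    overlaps no rule-based one (assumed reading of "rules first")."""
    merged = list(rule_entities)
    for cand in stat_entities:
        if not any(overlaps(cand, kept) for kept in rule_entities):
            merged.append(cand)
    return sorted(merged, key=lambda e: (e.start, e.end))


def expand_library(library, tokens, verified):
    """Add verified entities to the library (label -> set of surface strings)."""
    for e in verified:
        library.setdefault(e.label, set()).add(" ".join(tokens[e.start:e.end]))
    return library


# Hypothetical example: whitespace tokens and made-up recognizer outputs.
tokens = "Ông Nguyễn Văn An làm việc tại Đại học Bách khoa Hà Nội".split()
rule_entities = [Entity(7, 13, "ORG", "rule")]   # the full organization name
stat_entities = [Entity(1, 4, "PER", "stat"),    # the person name
                 Entity(11, 13, "LOC", "stat")]  # a place nested in the ORG

merged = merge_rule_first(rule_entities, stat_entities)
# `merged` stands in here for the manually proofread result.
library = expand_library({}, tokens, merged)
```

In this example the nested location span produced by the statistical model is discarded because it overlaps the rule-based organization span, while the person span is kept because no rule covers it. Other conflict policies, such as preferring the longest span or comparing confidence scores, would be equally consistent with the abstract.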
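Memo
Received: 2014-04-01.
About the authors: LIU Yanchao, male, born in 1990, master's student; his main research interests are natural language processing and information extraction. GUO Jianyi, female, born in 1964, professor and master's supervisor; her main research interests are natural language processing, information extraction, and machine learning. She has led and participated in multiple projects funded by the National Natural Science Foundation of China, the Yunnan Provincial Major Special Fund for Information Technology, and the Yunnan Provincial Natural Science Foundation, and has received one First Prize of the Yunnan Provincial Science and Technology Progress Award and one Second Prize of the Yunnan Provincial Natural Science Award. She has published more than 70 academic papers and served as chief editor of 2 textbooks. YU Zhengtao, male, born in 1970, professor and doctoral supervisor, Ph.D.; his main research interests are natural language processing, information retrieval, and machine learning. As the first-ranked recipient, he has received one First Prize of the Yunnan Provincial Science and Technology Progress Award, one Second Prize of the Yunnan Provincial Natural Science Award, and one Third Prize of the Yunnan Provincial Science and Technology Progress Award. He has published more than 150 academic papers, of which more than 80 are indexed by SCI or EI.
Corresponding author: GUO Jianyi. E-mail: gjade86@hotmail.com.
Last Update: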
1900-01-01