[1] WANG Pei, XIAN Yantuan, GUO Jianyi, et al. A novel method using word vector and graphical models for entity disambiguation in specific topic domains[J]. CAAI Transactions on Intelligent Systems, 2016, 11(3): 366-374. [doi:10.11992/tis.201603044]
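Editorial Office of CAAI Transactions on Intelligent Systems [ISSN 1673-4785/CN 23-1538/TP]
Volume: 11
Issue: No. 3, 2016
Pages: 366-374
Column: Academic Papers: Natural Language Processing and Understanding
Publication date: 2016-06-25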
- Title:
-
A novel method using word vector and graphical models for entity disambiguation in specific topic domains
- Author(s):
-
WANG Pei (1), XIAN Yantuan (1,2), GUO Jianyi (1,2), WEN Yonghua (1,2), CHEN Wei (1,2), WANG Hongbin (1,2)
-
1. School of Information Engineering and Automation, Kunming University of Science and Technology, Kunming 650500, China;
2. Key Laboratory of Intelligent Information Processing, Kunming University of Science and Technology, Kunming 650500, China
- Keywords:
-
entity disambiguation; entity linking; Word2Vec; graph model; random walk; Wikipedia
- Classification number:
-
TP393
- DOI:
-
10.11992/tis.201603044
- Abstract:
-
This paper proposes a method that combines word vectors with a graph model to perform entity disambiguation in a specific domain, using tourism as the example. First, pages under the tourism category of an offline Wikipedia database are selected to build a domain knowledge base. Next, Word2Vec is used to train a word vector model on the knowledge-base text together with text crawled from major tourism websites. A manually annotated entity relation graph is then combined with this model, and a graph-based random walk algorithm assists the similarity computation so that word-to-word similarity within the tourism domain is estimated more accurately. Finally, several keywords are extracted from the context of the mention to be disambiguated and from the knowledge-base text of each candidate entity; the trained word vector model and the graph model are used to compute the similarity of every mention keyword against every candidate keyword, and the candidate entity with the highest mean similarity is chosen as the target entity. Experimental results show that this similarity measure effectively captures the similarity between an entity mention and its target entity, and therefore disambiguates entities in a specific domain fairly accurately.
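The final scoring step described in the abstract can be sketched roughly as follows. This is a minimal illustration, not the authors' implementation: the abstract gives no formulas, so the blending weight `alpha`, the restart probability, the use of random walk with restart as the graph measure, and all toy vectors, graph edges, and candidate names below are assumptions invented for the example. In practice the vectors would come from a trained Word2Vec model and the graph from the annotated entity relation graph.

```python
import math

def cosine(u, v):
    """Cosine similarity between two equal-length vectors (0.0 if either is zero)."""
    dot = sum(a * b for a, b in zip(u, v))
    norm = math.sqrt(sum(a * a for a in u)) * math.sqrt(sum(b * b for b in v))
    return dot / norm if norm else 0.0

def random_walk_scores(graph, start, restart=0.15, steps=50):
    """Random walk with restart on a graph given as {node: [neighbours]}.
    Returns the approximate visiting probability of every node."""
    prob = {n: 1.0 if n == start else 0.0 for n in graph}
    for _ in range(steps):
        nxt = {n: (restart if n == start else 0.0) for n in graph}
        for n, neighbours in graph.items():
            if not neighbours:
                # Dangling node: return its probability mass to the start node.
                nxt[start] += (1 - restart) * prob[n]
                continue
            share = (1 - restart) * prob[n] / len(neighbours)
            for m in neighbours:
                nxt[m] += share
        prob = nxt
    return prob

def word_similarity(w1, w2, vectors, graph, alpha=0.7):
    """Blend embedding similarity with graph proximity (alpha is an assumed weight)."""
    vec_sim = cosine(vectors[w1], vectors[w2]) if w1 in vectors and w2 in vectors else 0.0
    graph_sim = random_walk_scores(graph, w1)[w2] if w1 in graph and w2 in graph else 0.0
    return alpha * vec_sim + (1 - alpha) * graph_sim

def mean_cross_similarity(mention_kw, candidate_kw, vectors, graph):
    """Average similarity over every (mention keyword, candidate keyword) pair."""
    pairs = [(a, b) for a in mention_kw for b in candidate_kw]
    return sum(word_similarity(a, b, vectors, graph) for a, b in pairs) / len(pairs)

def disambiguate(mention_kw, candidates, vectors, graph):
    """Return the candidate entity whose keywords have the highest mean similarity."""
    return max(
        candidates,
        key=lambda name: mean_cross_similarity(mention_kw, candidates[name], vectors, graph),
    )

# Toy data, invented for illustration only.
vectors = {
    "mountain": [1.0, 0.1], "hiking": [0.9, 0.2], "snow": [0.8, 0.3],
    "hotel": [0.1, 1.0], "booking": [0.2, 0.9], "room": [0.0, 1.0],
}
graph = {
    "mountain": ["hiking", "snow"], "hiking": ["mountain"], "snow": ["mountain"],
    "hotel": ["booking", "room"], "booking": ["hotel"], "room": ["hotel"],
}
mention_keywords = ["hiking", "snow"]
candidates = {
    "Shangri-La (scenic area)": ["mountain", "snow"],
    "Shangri-La (hotel chain)": ["hotel", "room"],
}

print(disambiguate(mention_keywords, candidates, vectors, graph))
```

Averaging over all keyword pairs keeps the score comparable between candidates with different numbers of keywords; a linear blend is only one of several plausible ways to combine the embedding and graph signals.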
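Memo
Received: 2016-3-19; revised: (date not given).
Funding: National Natural Science Foundation of China (61262041, 61472168, 61462054, 61562052); Key Project of the Natural Science Foundation of Yunnan Province (2013FA030).
About the authors: WANG Pei, male, born 1990, master's student; research interests: natural language processing and information extraction. XIAN Yantuan, male, born 1981, doctoral student; research interests: natural language processing, information extraction, machine translation, and machine learning. GUO Jianyi, female, born 1964, professor; research interests: natural language processing, information extraction, and machine learning.
Corresponding author: GUO Jianyi. E-mail: gjade86@hotmail.com.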
Last Update:
1900-01-01