
[1] WANG Pei, XIAN Yantuan, GUO Jianyi, et al. A novel method using word vector and graphical models for entity disambiguation in specific topic domains[J]. CAAI Transactions on Intelligent Systems, 2016, 11(3): 366-374. [doi:10.11992/tis.201603044]


A novel method using word vector and graphical models for entity disambiguation in specific topic domains


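Editorial Office of CAAI Transactions on Intelligent Systems [ISSN 1673-4785/CN 23-1538/TP] Volume: 11; Issue: No. 3, 2016; Pages: 366-374; Column: Academic Papers, Natural Language Processing and Understanding; Publication date: 2016-06-25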

Title:: A novel method using word vector and graphical models for entity disambiguation in specific topic domains

Author(s):: WANG Pei^1, XIAN Yantuan^1,2, GUO Jianyi^1,2, WEN Yonghua^1,2, CHEN Wei^1,2, WANG Hongbin^1,2; 1. School of Information Engineering and Automation, Kunming University of Science and Technology, Kunming 650500, China;
2. Key Laboratory of Intelligent Information Processing, Kunming University of Science and Technology, Kunming 650500, China

Keywords:: entity disambiguation; entity linking; Word2Vec; graphical model; random walk; Wikipedia

CLC number:: TP393

DOI:: 10.11992/tis.201603044

Abstract:: This paper proposes a method that combines word vectors and a graph model to perform entity disambiguation in a specific domain, using tourism as the example. First, pages under the tourism category of an offline Wikipedia database are selected to build a domain knowledge base. Next, the Word2Vec tool is used to train a word vector model on the knowledge-base text together with tourism text crawled from major tourism websites; a graph-based random walk algorithm over a manually annotated entity relation graph then assists the similarity computation, so that the similarity between words in the tourism domain can be calculated more accurately. Finally, several keywords are extracted from the background text of the entity to be disambiguated and from the knowledge-base text of each candidate entity, the trained word vector model and the graph model are used to compute cross similarities between the two keyword sets, and the candidate entity with the highest mean similarity is selected as the target entity. Experimental results show that this similarity measure effectively captures the similarity between an entity mention and its target entity, and thus achieves fairly accurate entity disambiguation in a specific domain.
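
The final selection step described in the abstract (cross similarity between mention keywords and candidate keywords, then picking the candidate with the highest mean) can be illustrated with a small Python sketch. This is not the authors' implementation: the toy word vectors, the toy relation graph, the candidate names, the random walk with restart, and the linear blend weight `alpha` are all assumptions made for illustration, since the abstract does not specify how the vector and graph scores are combined. In practice the vectors would come from a trained Word2Vec model and the graph from the annotated entity relation graph.

```python
import math

# Toy 2-d word vectors (assumption: stand-ins for trained Word2Vec vectors).
VECTORS = {
    "sea": [1.0, 0.1], "beach": [0.9, 0.2], "island": [0.8, 0.3],
    "temple": [0.1, 1.0], "mountain": [0.2, 0.9],
}
# Toy undirected relation graph as adjacency lists (assumption).
GRAPH = {
    "sea": ["beach", "island"], "beach": ["sea", "island"],
    "island": ["sea", "beach"], "temple": ["mountain"], "mountain": ["temple"],
}

def cosine(u, v):
    """Cosine similarity of two equal-length vectors."""
    dot = sum(a * b for a, b in zip(u, v))
    nu = math.sqrt(sum(a * a for a in u))
    nv = math.sqrt(sum(b * b for b in v))
    return dot / (nu * nv) if nu and nv else 0.0

def random_walk_scores(graph, start, restart=0.15, steps=50):
    """Visiting probabilities of a random walk with restart from `start`."""
    prob = {n: 1.0 if n == start else 0.0 for n in graph}
    for _ in range(steps):
        nxt = {n: (restart if n == start else 0.0) for n in graph}
        for n, neighbours in graph.items():
            if not neighbours:
                # Dangling node: send its mass back to the start node.
                nxt[start] += (1 - restart) * prob[n]
                continue
            share = (1 - restart) * prob[n] / len(neighbours)
            for m in neighbours:
                nxt[m] += share
        prob = nxt
    return prob

def word_similarity(w1, w2, alpha=0.7):
    """Blend vector similarity with a graph score (hypothetical weighting)."""
    vec = cosine(VECTORS[w1], VECTORS[w2]) if w1 in VECTORS and w2 in VECTORS else 0.0
    walk = random_walk_scores(GRAPH, w1)[w2] if w1 in GRAPH and w2 in GRAPH else 0.0
    return alpha * vec + (1 - alpha) * walk

def mean_cross_similarity(mention_keywords, candidate_keywords):
    """Mean similarity over every (mention keyword, candidate keyword) pair."""
    pairs = [(m, c) for m in mention_keywords for c in candidate_keywords]
    return sum(word_similarity(m, c) for m, c in pairs) / len(pairs)

def disambiguate(mention_keywords, candidates):
    """Return the candidate whose keywords have the highest mean similarity."""
    return max(candidates,
               key=lambda name: mean_cross_similarity(mention_keywords, candidates[name]))

CANDIDATES = {
    "candidate_A": ["beach", "island"],     # hypothetical coastal entity
    "candidate_B": ["temple", "mountain"],  # hypothetical inland entity
}
best = disambiguate(["sea", "beach"], CANDIDATES)
print(best)
```

With these toy inputs the mention keywords `sea` and `beach` select `candidate_A`. Note that the walk probabilities and the cosine values are on different scales, so a real system would likely normalise them before blending.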



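Memo

Received: 2016-03-19; Revised:
Funding: National Natural Science Foundation of China (61262041, 61472168, 61462054, 61562052); Key Project of the Natural Science Foundation of Yunnan Province (2013FA030).
About the authors: WANG Pei, male, born in 1990, master's student; main research interests: natural language processing and information extraction. XIAN Yantuan, male, born in 1981, doctoral student; research interests: natural language processing, information extraction, machine translation, and machine learning. GUO Jianyi, female, born in 1964, professor; main research areas: natural language processing, information extraction, and machine learning.
Corresponding author: GUO Jianyi. E-mail: gjade86@hotmail.com.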

Last Update: 1900-01-01
