[1] MA Jialin, ZHANG Yongjun, WANG Zhijian. Multi-topic extraction algorithm based on concept clusters[J]. CAAI Transactions on Intelligent Systems, 2015, 10(2): 261-266. [doi:10.3969/j.issn.1673-4785.201405066]
CAAI Transactions on Intelligent Systems [ISSN 1673-4785/CN 23-1538/TP] Volume:
10
Issue:
No. 2, 2015
Pages:
261-266
Section:
Academic Papers: Machine Learning
Publication date:
2015-04-25
- Title:
-
Multi-topic extraction algorithm based on concept clusters
- Author(s):
-
MA Jialin (马甲林)1,2, ZHANG Yongjun (张永军)1,2, WANG Zhijian (王志坚)1
-
1. College of Computer and Information, Hohai University, Nanjing 211100, Jiangsu, China;
2. School of Computer Engineering, Huaiyin Institute of Technology, Huai'an 223003, Jiangsu, China
-
- Keywords:
-
semantics; sparsity; context; knowledge base; concept clusters; multi-topic extraction; K-means; MEABCC
- CLC number:
-
TP18
- DOI:
-
10.3969/j.issn.1673-4785.201405066
- Document code:
-
A
- Abstract:
-
Multi-topic texts are common in the real world, and multi-topic extraction is widely used in fields such as information retrieval and library and information science. Most traditional topic extraction algorithms extract a single topic for the text as a whole, and they suffer from a lack of semantic information and from high-dimensional, sparse vectors. Using HowNet as the knowledge base, this paper represents text as concept vectors, merges synonyms and disambiguates polysemous words according to the semantics and context of the concepts, and computes semantic similarity from the semantic relations among concepts. On this basis, a multi-topic extraction algorithm based on concept clusters (MEABCC) is proposed, which clusters the concepts to obtain multiple topic clusters. When the K-means algorithm is used for concept clustering, it is improved with a "preset seed" method to compensate for the unstable time and space overhead and the large fluctuation in results caused by the sensitivity of traditional K-means to its initial centers. Experimental results show that the algorithm achieves good precision, recall, and F1 values.
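The "preset seed" idea mentioned in the abstract can be illustrated with a minimal sketch. The abstract does not say how MEABCC chooses its seeds or how distances are measured, so everything below is an assumption made for illustration: the function name `kmeans_preset_seeds`, the toy 2-D "concept vectors", the hand-picked seeds, and the use of Euclidean distance in place of the paper's HowNet-based semantic similarity. The only point shown is that fixing the initial centers makes K-means deterministic, which removes the run-to-run fluctuation caused by random initialization.

```python
import math

def kmeans_preset_seeds(points, seeds, max_iter=100):
    """K-means whose initial centers are supplied by the caller, not drawn at random."""
    centers = [list(s) for s in seeds]
    labels = [None] * len(points)
    for _ in range(max_iter):
        # Assignment step: attach each point to its nearest center.
        new_labels = [
            min(range(len(centers)), key=lambda k: math.dist(p, centers[k]))
            for p in points
        ]
        if new_labels == labels:
            break  # assignments are stable, so the centers will not move again
        labels = new_labels
        # Update step: move each center to the mean of its members.
        for k in range(len(centers)):
            members = [p for p, lab in zip(points, labels) if lab == k]
            if members:  # an empty cluster keeps its previous center
                centers[k] = [sum(col) / len(members) for col in zip(*members)]
    return labels, centers

# Toy 2-D "concept vectors" forming two obvious groups (illustrative data only).
concepts = [(0.0, 0.1), (0.2, 0.0), (0.1, 0.2), (5.0, 5.1), (5.2, 4.9), (4.9, 5.0)]
# Hypothetical preset seeds, one per expected topic cluster.
seeds = [(0.0, 0.0), (5.0, 5.0)]
labels, centers = kmeans_preset_seeds(concepts, seeds)
```

Because the seeds are fixed, repeated calls with the same input return the same clusters; each resulting cluster of concepts would then stand for one topic of the text.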
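Memo
Received: 2014-06-01; Revised: .
Funding: Supported by the Young Scientists Fund of the National Natural Science Foundation of China (11201168).
About the authors: MA Jialin, male, born in 1981, is a doctoral candidate whose main research interest is natural language processing. He has received a third prize in the 12th National Multimedia Courseware Competition, a second prize for Outstanding Multimedia Teaching Courseware of Jiangsu Higher Education Institutions, and a third prize of the Huai'an Science and Technology Progress Award; he holds 1 invention patent, has contributed to 1 textbook, and has published 7 academic papers. ZHANG Yongjun, male, born in 1978, is a lecturer and doctoral candidate whose main research interests are Chinese information processing and text data mining; he has published 8 academic papers and contributed to 1 textbook. WANG Zhijian, male, born in 1958, is a professor and doctoral supervisor whose main research interests are network-based computer application technology, software reuse, and network-based software system integration technology; he has led a number of projects, including National "863" Program projects and Jiangsu provincial fund projects, and has published several monographs.
Corresponding author: MA Jialin. E-mail: majialin@126.com.
Last Update: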
2015-06-15