[1] MA Jialin, ZHANG Yongjun, WANG Zhijian. Multi-topic extraction algorithm based on concept clusters[J]. CAAI Transactions on Intelligent Systems, 2015, 10(02): 261-266. [doi:10.3969/j.issn.1673-4785.201405066]

Multi-topic extraction algorithm based on concept clusters

CAAI Transactions on Intelligent Systems [ISSN: 1673-4785 / CN: 23-1538/TP]

Volume:
Vol. 10
Issue:
2015, No. 02
Pages:
261-266
Column:
Publication date:
2015-04-25

Article Info

Title:
Multi-topic extraction algorithm based on concept clusters
Author(s):
MA Jialin 1,2; ZHANG Yongjun 1,2; WANG Zhijian 1
1. College of Computer and Information, Hohai University, Nanjing 211100, China;
2. School of Computer Engineering, Huaiyin Institute of Technology, Huaian 223003, China
Keywords:
semantics; sparsity; context; knowledge base; concept clusters; multi-topic extraction; K-means; MEABCC
CLC number:
TP18
DOI:
10.3969/j.issn.1673-4785.201405066
Document code:
A
Abstract:
Multi-topic documents are common in the real world, and multi-topic extraction is widely used in fields such as information retrieval and library and information science. Most traditional topic extraction algorithms extract a single topic for the text as a whole, and they suffer from a lack of semantic information and from high-dimensional, sparse vectors. Using HowNet as the knowledge base, this paper represents text with concept vectors, merges synonyms and disambiguates polysemous words according to the semantics and context of the concepts, and computes semantic similarity from the semantic relations among concepts. On this basis, a multi-topic extraction algorithm based on concept clusters (MEABCC) is proposed, which clusters the concepts to obtain multiple topic clusters. When K-means is used for concept clustering, it is improved with a "preset seed" method to compensate for the sensitivity of traditional K-means to its initial centers, which causes unstable time and space costs and large fluctuations in the results. Experimental results show that the algorithm achieves good accuracy, recall, and F1 values.
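
The "preset seed" idea in the abstract can be illustrated with a minimal sketch. The abstract does not say how MEABCC selects its seeds or how its HowNet-based similarity is computed, so everything concrete here is an assumption made for illustration: the toy 2-D concept vectors, the hand-picked seeds, the use of plain Euclidean distance in place of semantic similarity, and the name `seeded_kmeans`. The only point it demonstrates is that starting K-means from fixed seeds rather than random centers makes the clustering repeatable.

```python
import math


def euclidean(u, v):
    """Euclidean distance between two equal-length vectors."""
    return math.sqrt(sum((a - b) ** 2 for a, b in zip(u, v)))


def seeded_kmeans(vectors, seeds, max_iter=100):
    """K-means that starts from caller-supplied seeds instead of random centers.

    Hypothetical helper for illustration; not the paper's implementation.
    Returns (labels, centers), where labels[i] is the cluster index of vectors[i].
    """
    centers = [list(s) for s in seeds]
    labels = None
    for _ in range(max_iter):
        # Assignment step: each vector joins its nearest center.
        new_labels = [
            min(range(len(centers)), key=lambda k: euclidean(v, centers[k]))
            for v in vectors
        ]
        if new_labels == labels:  # assignments stable: converged
            break
        labels = new_labels
        # Update step: move each center to the mean of its members.
        for k in range(len(centers)):
            members = [v for v, lab in zip(vectors, labels) if lab == k]
            if members:  # keep the old center if a cluster is empty
                centers[k] = [sum(col) / len(members) for col in zip(*members)]
    return labels, centers


# Toy "concept vectors" (assumed values, not taken from the paper).
concepts = ["bank", "loan", "interest", "river", "flood"]
vectors = [(0.9, 0.1), (0.8, 0.2), (0.85, 0.15), (0.1, 0.9), (0.2, 0.8)]
seeds = [(1.0, 0.0), (0.0, 1.0)]  # one preset seed per expected topic

labels, centers = seeded_kmeans(vectors, seeds)
topics = {}
for name, lab in zip(concepts, labels):
    topics.setdefault(lab, []).append(name)
print(topics)
```

Because the seeds are fixed, repeated runs on the same input give the same topic clusters, which is the stability property the abstract attributes to the preset-seed improvement.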

Memo:
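Received: 2014-06-01; Revised:
Funding: National Natural Science Foundation of China, Young Scientists Fund (11201168).
About the authors: MA Jialin, male, born in 1981, is a doctoral candidate whose main research interest is natural language processing. He has received a third prize at the 12th National Multimedia Courseware Competition, a second prize for Outstanding Multimedia Teaching Courseware of Jiangsu Higher Education Institutions, and a third prize of the Huai'an Science and Technology Progress Award; he holds 1 invention patent, has co-edited 1 textbook, and has published 7 academic papers. ZHANG Yongjun, male, born in 1978, is a lecturer and doctoral candidate whose main research interests are Chinese information processing and text data mining; he has published 8 academic papers and co-edited 1 tutorial. WANG Zhijian, male, born in 1958, is a professor and doctoral supervisor whose main research interests are network-based computer application technology, software reuse, and network-based software system integration technology; he has led a number of projects, including National "863" Program projects and Jiangsu Provincial Foundation projects, and has published several monographs.
Corresponding author: MA Jialin. E-mail: majialin@126.com.
Last Update: 2015-06-15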