[1] LIU Su-qin, CHAI Song. K-means dynamic web topic detection method based on named entities[J]. CAAI Transactions on Intelligent Systems, 2010, 5(02): 122-126.

K-means dynamic web topic detection method based on named entities

CAAI Transactions on Intelligent Systems [ISSN: 1673-4785 / CN: 23-1538/TP]

Volume:
5
Issue:
2010, No. 02
Pages:
122-126
Column:
Publication date:
2010-04-25

Article Info

Title:
K-means dynamic web topic detection method based on named entities
Article ID:
1673-4785(2010)02-0122-05
Author(s):
LIU Su-qin (1), CHAI Song (1, 2)
1. College of Computer & Communication Engineering, China University of Petroleum, Qingdao 266555, Shandong, China;
2. Automation Workstation, Shandong Provincial Military District, Ji'nan 250013, Shandong, China
Keywords:
named entity; web topics; dynamic detection; K-means clustering; self-similarity; topic vector
CLC number:
TP18
Document code:
A
Abstract:
Traditional web topic detection methods are limited in how they represent text features, and the K-means clustering algorithm has some drawbacks of its own. To address both problems, a dynamic K-means detection method for web topics based on named entities is proposed. The method improves the feature representation used in traditional topic detection: a text is represented by a combination of named entities and text feature words, and the weight of each named entity reflects its contribution to the text representation. In addition, an adaptive technique lets the number of clusters K in the K-means algorithm self-converge. This optimizes the K-means algorithm and achieves dynamic detection of web topics through dynamic selection of K. Experimental results indicate that the method distinguishes similar topics well and effectively improves topic detection performance.
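
The two ideas in the abstract can be illustrated with a minimal Python sketch. It is not the authors' implementation, because the abstract does not give their weighting formula or their rule for converging K. The sketch assumes a constant `entity_boost` as a stand-in for the contribution-based entity weight, and it assumes a simple stopping rule: K grows until the mean cosine similarity between documents and their cluster centroids reaches a hypothetical `threshold`. All function names and the toy documents are illustrative.

```python
import math
from collections import Counter

def doc_vector(tokens, entities, entity_boost=3.0):
    """Term-frequency vector; tokens that are named entities are up-weighted.

    entity_boost is a hypothetical constant standing in for the
    contribution-based entity weight described in the abstract.
    """
    counts = Counter(tokens)
    return {t: c * (entity_boost if t in entities else 1.0)
            for t, c in counts.items()}

def cosine(a, b):
    """Cosine similarity between two sparse vectors stored as dicts."""
    dot = sum(w * b.get(t, 0.0) for t, w in a.items())
    norm_a = math.sqrt(sum(w * w for w in a.values()))
    norm_b = math.sqrt(sum(w * w for w in b.values()))
    return dot / (norm_a * norm_b) if norm_a and norm_b else 0.0

def centroid(vectors):
    """Component-wise mean of a non-empty list of sparse vectors."""
    total = Counter()
    for v in vectors:
        for t, w in v.items():
            total[t] += w
    return {t: w / len(vectors) for t, w in total.items()}

def seed_centers(vectors, k):
    """Deterministic seeding: start from the first document, then keep
    adding the document least similar to the seeds chosen so far."""
    centers = [vectors[0]]
    while len(centers) < k:
        centers.append(min(vectors,
                           key=lambda v: max(cosine(v, c) for c in centers)))
    return [dict(c) for c in centers]

def kmeans(vectors, k, max_iter=20):
    """Plain K-means using cosine similarity instead of Euclidean distance."""
    centers = seed_centers(vectors, k)
    labels = None
    for _ in range(max_iter):
        new_labels = [max(range(k), key=lambda j: cosine(v, centers[j]))
                      for v in vectors]
        if new_labels == labels:  # assignments are stable, so stop
            break
        labels = new_labels
        for j in range(k):
            members = [v for v, lab in zip(vectors, labels) if lab == j]
            if members:  # keep the old center if a cluster is empty
                centers[j] = centroid(members)
    return labels, centers

def detect_topics(vectors, threshold=0.9):
    """Assumed stopping rule: increase K until clusters are tight enough."""
    if not vectors:
        raise ValueError("at least one document is required")
    for k in range(1, len(vectors) + 1):
        labels, centers = kmeans(vectors, k)
        tightness = sum(cosine(v, centers[lab])
                        for v, lab in zip(vectors, labels)) / len(vectors)
        if tightness >= threshold:
            break
    return k, labels

# Toy data: two similar topics that differ mainly in their named entities.
docs = [
    (["earthquake", "Sichuan", "rescue", "Sichuan"], {"Sichuan"}),
    (["earthquake", "Sichuan", "relief"], {"Sichuan"}),
    (["earthquake", "Haiti", "rescue", "Haiti"], {"Haiti"}),
    (["earthquake", "Haiti", "relief"], {"Haiti"}),
]
vectors = [doc_vector(tokens, ents) for tokens, ents in docs]
k, labels = detect_topics(vectors, threshold=0.9)
print(k, labels)
```

In this toy setting, up-weighting the entities lowers the similarity between documents that share ordinary words such as "earthquake" but name different places, which is the effect the abstract attributes to the named-entity representation. A lower `threshold` accepts a coarser clustering with fewer topics.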

Memo

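Received: 2009-12-04.
Corresponding author: LIU Su-qin. E-mail: liusq@upc.edu.cn.
About the authors:
LIU Su-qin, female, born in 1968, associate professor, Ph.D. Her main research interests are computer networks and high-performance computing. She has published more than 20 academic papers in the past three years and has written two textbooks.
CHAI Song, male, born in 1981. His main research interests are computer networks, high-performance computing, and its applications.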
Last Update: 2010-05-24