[1] LIU Su-qin (刘素芹), CHAI Song (柴松). K-means dynamic web topic detection method based on named entities [J]. CAAI Transactions on Intelligent Systems (智能系统学报), 2010, 5(2): 122-126.
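CAAI Transactions on Intelligent Systems (《智能系统学报》) [ISSN 1673-4785/CN 23-1538/TP] Volume:
5
Issue:
2010, No. 2
Pages:
122-126
Section:
Academic Papers: Natural Language Processing and Understanding
Publication date: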
2010-04-25
- Title:
-
K-means dynamic web topic detection method based on named entities
- Article ID:
-
1673-4785(2010)02-0122-05
- Author(s):
-
LIU Su-qin (刘素芹)1, CHAI Song (柴松)1,2
-
1. College of Computer & Communication Engineering, China University of Petroleum, Qingdao 266555, China;
2. Automation Workstation, Shandong Provincial Military District, Ji'nan 250013, China
-
- Keywords:
-
named entity; web topic; dynamic detection; K-means clustering; self-similarity; topic vector
- CLC number:
-
TP18
- Document code:
-
A
- Abstract:
-
Conventional web topic detection methods represent text features inadequately, and the K-means clustering algorithm has problems of its own. To address both, the authors propose a dynamic K-means detection method for web topics based on named entities. The method improves the feature representation used in traditional topic detection: text is represented by a combination of named entities and text feature words, and each named entity is weighted by its contribution to the text representation. In addition, an adaptive technique lets the number of clusters K self-converge, which optimizes the K-means algorithm, and the dynamic selection of K realizes dynamic detection of web topics. Experimental results indicate that the method distinguishes similar topics well and effectively improves topic detection performance.
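The abstract gives no formulas, so the following is only a minimal sketch of the general idea, not the authors' algorithm. It assumes a simple multiplicative boost for named-entity terms (`entity_weight`), cosine similarity, deterministic farthest-first seeding, and a rule that increases K until every cluster's mean similarity to its centroid reaches a threshold (`min_self_similarity`). All function names, parameters, and the toy documents are hypothetical.

```python
import math
from collections import Counter


def doc_vector(tokens, entities, entity_weight=3.0):
    """L2-normalized term-frequency vector; named-entity terms are boosted.

    The multiplicative boost is an assumption, not the paper's weighting.
    """
    counts = Counter(tokens)
    vec = {t: c * (entity_weight if t in entities else 1.0)
           for t, c in counts.items()}
    norm = math.sqrt(sum(w * w for w in vec.values()))
    return {t: w / norm for t, w in vec.items()} if norm else vec


def cosine(a, b):
    """Dot product of two sparse vectors (cosine, since both are normalized)."""
    return sum(w * b.get(t, 0.0) for t, w in a.items())


def centroid(members):
    """Normalized sum of the member vectors."""
    total = Counter()
    for v in members:
        for t, w in v.items():
            total[t] += w
    norm = math.sqrt(sum(w * w for w in total.values()))
    return {t: w / norm for t, w in total.items()} if norm else dict(total)


def kmeans(vectors, k, iterations=20):
    """Plain K-means on cosine similarity with farthest-first seeding."""
    centers = [vectors[0]]
    while len(centers) < k:
        # Next seed: the document least similar to all current seeds.
        centers.append(min(vectors,
                           key=lambda v: max(cosine(v, c) for c in centers)))
    labels = None
    for _ in range(iterations):
        new_labels = [max(range(k), key=lambda j: cosine(v, centers[j]))
                      for v in vectors]
        if new_labels == labels:  # assignments stable: converged
            break
        labels = new_labels
        centers = [
            centroid([v for v, l in zip(vectors, labels) if l == j])
            if j in labels else centers[j]  # keep the seed of an empty cluster
            for j in range(k)
        ]
    return labels, centers


def self_similarity(members, center):
    """Mean similarity of a cluster's members to its centroid."""
    return sum(cosine(v, center) for v in members) / len(members)


def detect_topics(vectors, min_self_similarity=0.8):
    """Grow K until every non-empty cluster is cohesive enough (assumed rule)."""
    labels = []
    for k in range(1, len(vectors) + 1):
        labels, centers = kmeans(vectors, k)
        clusters = [[v for v, l in zip(vectors, labels) if l == j]
                    for j in range(k)]
        if all(self_similarity(m, c) >= min_self_similarity
               for m, c in zip(clusters, centers) if m):
            break
    return labels


# Toy input: (tokens, named entities) for four short documents.
docs = [
    (["earthquake", "rescue", "Sichuan"], {"Sichuan"}),
    (["earthquake", "relief", "Sichuan"], {"Sichuan"}),
    (["earthquake", "rescue", "Haiti"], {"Haiti"}),
    (["earthquake", "relief", "Haiti"], {"Haiti"}),
]
print(detect_topics([doc_vector(t, e) for t, e in docs]))
```

On this toy input the boosted entity terms separate the two Sichuan documents from the two Haiti documents, whereas with `entity_weight=1.0` the shared word "earthquake" keeps all four in a single cluster. That illustrates, under the stated assumptions, why entity weighting can help distinguish similar topics.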
Last Update:
2010-05-24