
[1] SHEN Yan, ZHU Yuquan. An optimized algorithm of K-means based on data set partition on CMP systems[J]. CAAI Transactions on Intelligent Systems, 2015, 10(4): 607-614. [doi:10.3969/j.issn.1673-4785.201411036]


An optimized algorithm of K-means based on data set partition on CMP systems


CAAI Transactions on Intelligent Systems [ISSN 1673-4785/CN 23-1538/TP] Volume: 10 Issue: 2015, No. 4 Pages: 607-614 Column: Research Papers, Machine Learning Publication date: 2015-08-25

Title:: An optimized algorithm of K-means based on data set partition on CMP systems

Author(s):: SHEN Yan^1,2, ZHU Yuquan^2; 1. Department of Information Management and Information System, Jiangsu University, Zhenjiang 212013, China;
2. School of Computer Science and Communication Engineering, Jiangsu University, Zhenjiang 212013, China

Keywords:: K-means; clustering algorithm; chip multi-processor (CMP); massive data set; data mining; unsupervised learning; big data

CLC number:: TP181

DOI:: 10.3969/j.issn.1673-4785.201411036

Document code:: A

Abstract:: Although multi-core CPUs are now widespread, the traditional K-means clustering algorithm was not designed for parallel execution and therefore cannot make full use of the multi-core computing capability of modern CPUs, so its clustering efficiency on massive data sets leaves room for improvement. This paper redesigns K-means for parallel execution in a chip multi-processor (CMP) environment and proposes an algorithm named Multi-core K-means (MC-K-means). To exploit the multi-core capability of modern CPUs, MC-K-means decomposes the clustering task into independent, balanced subtasks and distributes them to threads that execute in parallel. Experimental results show that MC-K-means achieves a relatively high multi-core speedup over K-means, which improves its ability to cluster massive data sets.
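The decomposition described in the abstract can be illustrated with a minimal sketch. It is not the authors' MC-K-means: it only assumes that the data set is split into near-equal chunks, that each thread assigns the points of its own chunk to the nearest center, and that the per-chunk sums are merged to update the centers. The names `split_evenly`, `partial_sums`, and `parallel_kmeans` are hypothetical.

```python
from concurrent.futures import ThreadPoolExecutor


def nearest(point, centers):
    """Index of the center closest to point (squared Euclidean distance)."""
    return min(range(len(centers)),
               key=lambda j: sum((p - c) ** 2 for p, c in zip(point, centers[j])))


def split_evenly(data, parts):
    """Contiguous chunks whose sizes differ by at most one (balanced subtasks)."""
    base, extra = divmod(len(data), parts)
    chunks, start = [], 0
    for i in range(parts):
        end = start + base + (1 if i < extra else 0)
        chunks.append(data[start:end])
        start = end
    return chunks


def partial_sums(chunk, centers):
    """One subtask: assign each point of a chunk, return per-cluster sums and counts."""
    dim = len(centers[0])
    sums = [[0.0] * dim for _ in centers]
    counts = [0] * len(centers)
    for point in chunk:
        j = nearest(point, centers)
        counts[j] += 1
        for d in range(dim):
            sums[j][d] += point[d]
    return sums, counts


def parallel_kmeans(data, centers, workers=4, max_iter=100):
    """Hypothetical driver: subtasks touch disjoint chunks, so they need no locking."""
    chunks = split_evenly(data, workers)
    centers = [list(c) for c in centers]
    with ThreadPoolExecutor(max_workers=workers) as pool:
        for _ in range(max_iter):
            results = list(pool.map(lambda ch: partial_sums(ch, centers), chunks))
            new_centers = []
            for j, old in enumerate(centers):
                count = sum(r[1][j] for r in results)
                if count == 0:
                    new_centers.append(old)  # an empty cluster keeps its center
                    continue
                new_centers.append([sum(r[0][j][d] for r in results) / count
                                    for d in range(len(old))])
            if new_centers == centers:  # stop once the centers no longer move
                break
            centers = new_centers
    return centers
```

In CPython the global interpreter lock keeps pure-Python threads from running this arithmetic on several cores at once, so the sketch shows only how the work is divided and merged; an actual speedup would need a runtime with truly parallel threads or a process pool.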



Memo

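Received: 2014-11-28.
Funding: National Natural Science Foundation of China (71271117); National Key Technology R&D Program (2010BAI88B00); Jiangsu Province Natural Science Basic Research Program (BK2010331); Jiangsu Province Doctoral Graduate Innovation Program (CX10B_016X); Jiangsu Province Postdoctoral Research Funding Program (1401056C).
About the authors: SHEN Yan, male, born in 1982, lecturer, Ph.D. His main research interests are data mining and intelligent information systems. He received a third prize of the 2014 China General Chamber of Commerce Science and Technology Award. He has published 11 academic papers, 5 of which are indexed by EI. ZHU Yuquan, male, born in 1965, professor and doctoral supervisor. His main research interests are data mining, intelligent information systems, and information system integration. He received a third prize of the 2014 China General Chamber of Commerce Science and Technology Award, one first prize in the National Multimedia Courseware Competition, one Jiangsu Province Excellent Software Product Award (Jinhui Award), and four provincial or ministerial science and technology progress awards. He has applied for 10 invention patents, 3 of which have been granted, and holds 7 computer software copyrights. He has published more than 70 academic papers, more than 10 of which are indexed by EI, and has published 2 edited books.
Corresponding author: SHEN Yan. E-mail: 104186179@qq.com.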

Last Update: 2015-08-28
