[1] SHEN Yan, ZHU Yuquan. An optimized algorithm of K-means based on data set partition on CMP systems[J]. CAAI Transactions on Intelligent Systems, 2015, 10(4): 607-614. [doi:10.3969/j.issn.1673-4785.201411036]
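CAAI Transactions on Intelligent Systems, Editorial Office [ISSN 1673-4785/CN 23-1538/TP] Volume:
10
Issue:
2015, No. 4
Pages:
607-614
Column:
Academic Papers: Machine Learning
Publication date:
2015-08-25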
- Title:
-
An optimized algorithm of K-means based on data set partition on CMP systems
- Author(s):
-
SHEN Yan1,2, ZHU Yuquan2
-
1. Department of Information Management and Information Systems, Jiangsu University, Zhenjiang 212013, China;
2. School of Computer Science and Communication Engineering, Jiangsu University, Zhenjiang 212013, China
-
- Keywords:
-
K-means; clustering algorithm; chip multi-processor (CMP); massive data set; data mining; unsupervised learning; big data
- CLC number:
-
TP181
- DOI:
-
10.3969/j.issn.1673-4785.201411036
- Document code:
-
A
- Abstract:
-
Although multi-core CPUs are now widespread, the traditional K-means clustering algorithm was not designed for parallel execution and therefore cannot make full use of the multi-core computing capability of modern CPUs, so its clustering efficiency on massive data sets leaves room for improvement. This paper parallelizes K-means for a chip multi-processor (CMP) environment and proposes the Multi-core K-means (MC-K-means) algorithm. MC-K-means decomposes the clustering task into independent, balanced subtasks and assigns them to threads that execute in parallel, thereby exploiting the multi-core computing capability of modern CPUs. Experimental results show that MC-K-means achieves a relatively high multi-core speedup over K-means and improves clustering capability on massive data sets.
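The abstract does not spell out how MC-K-means builds its subtasks, so the following is only an illustrative sketch of one common way to partition K-means work, not the authors' algorithm. It assumes the data set is split into near-equal chunks, each worker assigns its chunk's points to the nearest centroid and returns partial sums and counts, and the partial results are merged to update the centroids. All function names (`split_evenly`, `assign_chunk`, `parallel_kmeans`) and the `n_workers` parameter are hypothetical. Note also that CPython threads share a global interpreter lock, so this pure-Python version shows the task decomposition rather than a real multi-core speedup; native threads or processes would be needed for that.

```python
from concurrent.futures import ThreadPoolExecutor


def squared_distance(p, q):
    """Squared Euclidean distance between two points of equal dimension."""
    return sum((a - b) ** 2 for a, b in zip(p, q))


def split_evenly(points, n_parts):
    """Split points into n_parts contiguous chunks whose sizes differ by at most one."""
    base, extra = divmod(len(points), n_parts)
    chunks, start = [], 0
    for i in range(n_parts):
        end = start + base + (1 if i < extra else 0)
        chunks.append(points[start:end])
        start = end
    return chunks


def assign_chunk(chunk, centroids):
    """Independent subtask: assign one chunk's points to their nearest centroids.

    Returns per-cluster coordinate sums and point counts for this chunk only,
    so workers never write to shared state.
    """
    k, dim = len(centroids), len(centroids[0])
    sums = [[0.0] * dim for _ in range(k)]
    counts = [0] * k
    for p in chunk:
        nearest = min(range(k), key=lambda j: squared_distance(p, centroids[j]))
        counts[nearest] += 1
        for d in range(dim):
            sums[nearest][d] += p[d]
    return sums, counts


def parallel_kmeans(points, centroids, n_workers=4, max_iter=100):
    """Data-partitioned K-means (assumed design, not the paper's exact method)."""
    centroids = [list(c) for c in centroids]
    k, dim = len(centroids), len(centroids[0])
    chunks = split_evenly(points, n_workers)  # balanced subtasks, fixed across iterations
    with ThreadPoolExecutor(max_workers=n_workers) as pool:
        for _ in range(max_iter):
            # Run the assignment step on every chunk concurrently.
            partials = list(pool.map(assign_chunk, chunks, [centroids] * len(chunks)))
            # Merge the partial sums and counts into new centroids.
            new_centroids = []
            for j in range(k):
                count = sum(c[j] for _, c in partials)
                if count == 0:
                    new_centroids.append(centroids[j])  # keep the centroid of an empty cluster
                    continue
                new_centroids.append(
                    [sum(s[j][d] for s, _ in partials) / count for d in range(dim)]
                )
            if new_centroids == centroids:  # converged: centroids stopped moving
                break
            centroids = new_centroids
    return centroids
```

For example, `parallel_kmeans(points, initial_centroids, n_workers=2)` takes a list of coordinate tuples and a list of starting centroids. Because each subtask only reads the current centroids and returns its own partial result, the merge step is the single point of synchronization per iteration.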
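Memo
Received: 2014-11-28; Revised:.
Funding: National Natural Science Foundation of China (71271117); National Key Technology R&D Program of China (2010BAI88B00); Jiangsu Provincial Natural Science Basic Research Program (BK2010331); Jiangsu Provincial Doctoral Candidate Innovation Program (CX10B_016X); Jiangsu Provincial Postdoctoral Research Funding Program (1401056C).
About the authors: SHEN Yan, male, born in 1982, lecturer, Ph.D. His main research interests are data mining and intelligent information systems. He received a third prize of the 2014 Science and Technology Award of the China General Chamber of Commerce and has published 11 academic papers, 5 of which are indexed by EI. ZHU Yuquan, male, born in 1965, professor and doctoral supervisor. His main research interests are data mining, intelligent information systems, and information system integration. He received a third prize of the 2014 Science and Technology Award of the China General Chamber of Commerce, one first prize in the National Multimedia Courseware Competition, one Jiangsu Province Excellent Software Product Award (Jinhui Award), and four provincial or ministerial science and technology progress awards. He has applied for 10 invention patents, 3 of which have been granted, and holds 7 registered computer software copyrights. He has published more than 70 academic papers, over 10 of which are indexed by EI, and has edited 2 books.
Corresponding author: SHEN Yan. E-mail: 104186179@qq.com.
Last Update: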
2015-08-28