[1]齐小刚,胡秋秋,刘立芳.基于MapReduce的并行异常检测算法[J].智能系统学报,2019,14(02):224-230.[doi:10.11992/tis.201809007]
 QI Xiaogang,HU Qiuqiu,LIU Lifang.Parallel anomaly algorithm based on MapReduce[J].CAAI Transactions on Intelligent Systems,2019,14(02):224-230.[doi:10.11992/tis.201809007]
点击复制

基于MapReduce的并行异常检测算法(/HTML)
分享到:

《智能系统学报》[ISSN:1673-4785/CN:23-1538/TP]

卷:
第14卷
期数:
2019年02期
页码:
224-230
栏目:
出版日期:
2019-03-05

文章信息/Info

Title:
Parallel anomaly algorithm based on MapReduce
作者:
齐小刚1 胡秋秋1 刘立芳2
1. 西安电子科技大学 数学与统计学院, 陕西 西安 710071;
2. 西安电子科技大学 计算机学院, 陕西 西安 710071
Author(s):
QI Xiaogang1 HU Qiuqiu1 LIU Lifang2
1. School of Mathematics and Statistics, Xidian University, Xi’an 710071, China;
2. School of Computer Science and Technology, Xidian University, Xi’an 710071, China
关键词:
数据挖掘异常检测局部离群因子HadoopMapReduce分布式文件系统并行计算局部密度
Keywords:
data mininganomaly detectionlocal outlier factorHadoopMapReduceDistributed File Systemparallel computinglocal density
分类号:
TP311
DOI:
10.11992/tis.201809007
摘要:
为了提高数据挖掘中异常检测算法在数据量增大时的准确度、灵敏度和执行效率,本文提出了一种基于MapReduce框架和Local Outlier Factor (LOF)算法的并行异常检测算法(MR-DLOF)。首先,将存放在Hadoop分布式文件系统(HDFS)上的数据集逻辑地切分为多个数据块。然后,利用MapReduce原理将各个数据块中的数据并行处理,使得每个数据点的k-邻近距离和LOF值的计算仅在单个块中执行,从而提高了算法的执行效率;同时重新定义了k-邻近距离的概念,避免了数据集中存在大于或等于k个重复点而导致局部密度为无穷大的情况。最后,将LOF值较大的数据点合并重新计算其LOF值,从而提高算法准确度和灵敏度。通过真实数据集验证了MR-DLOF算法的有效性、高效性和可扩展性。
Abstract:
To improve the accuracy, sensitivity, and efficiency of anomaly detection algorithm in data mining when the amount of data increases, a parallel anomaly detection algorithm (MR-LOF) based on the MapReduce framework and the local outlier factor (LOF) algorithm is proposed in this paper. First, the dataset, stored in the Hadoop distributed file system (HDFS), is logically divided into multiple data blocks. Then, the MapReduce principle is used to process the data in each data block in parallel, so that the k-distance and LOF value of each data point is calculated only in a single block. It greatly improves the efficiency of the algorithm. Simultaneously, the concept of k-distance is redefined. It avoids the situation where the local density is infinite because more than k repeated points exist in the dataset. Finally, the data points whose LOF value is larger than threshold are merged, and the LOF values of combined data are recalculated. This process can effectively improve the accuracy and sensitivity. Experiments with real-world datasets demonstrate the validity, high efficiency, and extendibility of the MR-DLOF algorithm.

参考文献/References:

[1] HAN Jiawei, KAMBER M. Data mining:concepts and techniques[M]. San Francisco:Morgan Kaufmann Publishers Inc., 2006:1-18.
[2] AGGARWAL C C. Outlier analysis[M]. New York:Springer, 2013:75-99.
[3] CHEN Feng, DENG Pan, WAN Jiafu, et al. Data mining for the internet of things:literature review and challenges[J]. International journal of distributed sensor networks, 2015, 2015:431047.
[4] 吴镜锋, 金炜东, 唐鹏. 数据异常的监测技术综述[J]. 计算机科学, 2017, 44(S2):24-28 WU Jingfeng, JIN Weidong, TANG Peng. Survey on monitoring techniques for data abnormalities[J]. Computer science, 2017, 44(S2):24-28
[5] STEINWART I, HUSH D, SCOVEL C. A classification framework for anomaly detection[J]. The journal of machine learning research, 2005, 6(1):211-232.
[6] 邓红莉, 杨韬. 一种基于深度学习的异常检测方法[J]. 信息通信, 2015(3):3-4 DENG Hongli, YANG Tao. An anomaly detection method based on deep learning[J]. Information and communications, 2015(3):3-4
[7] ZHAO Xuanqiang, WANG Guoying, LI Zhixing. Unsupervised network anomaly detection based on abnormality weights and subspace clustering[C]//Proceedings of the 6th International Conference on Information Science and Technology. Dalian, China, 2016:482-486.
[8] 左进, 陈泽茂. 基于改进K均值聚类的异常检测算法[J]. 计算机科学, 2016, 43(8):258-261 ZUO Jin, CHEN Zemao. Anomaly detection algorithm based on improved K-means clustering[J]. Computer science, 2016, 43(8):258-261
[9] 邹云峰, 张昕, 宋世渊, 等. 基于局部密度的快速离群点检测算法[J]. 计算机应用, 2017, 37(10):2932-2937 ZOU Yunfeng, ZHANG Xin, SONG Shiyuan, et al. Fast outlier detection algorithm based on local density[J]. Journal of computer applications, 2017, 37(10):2932-2937
[10] BREUNIG M M, KRIEGEL H P, NG R T, et al. LOF:identifying density-based local outliers[C]//Proceedings of 2000 ACM SIGMOD International Conference on Management of Data. Dallas, Texas, USA, 2000:93-104.
[11] BHATT V, SHARMA K G, RAM A. An enhanced approach for LOF in data mining[C]//Proceedings of 2013 International Conference on Green High Performance Computing. Nagercoil, India, 2013:1-3.
[12] MIAO Dandan, QIN Xiaowei, WANG Weidong. Anomalous cell detection with kernel density-based local outlier factor[J]. China communications, 2015, 12(9):64-75.
[13] TANG Bo, HE Haibo. A local density-based approach for outlier detection[J]. Neurocomputing, 2017, 241:171-180.
[14] DEAN J, GHEMAWAT S. MapReduce:simplified data processing on large clusters[J]. Communications of the ACM, 2008, 51(1):107-113.
[15] GHEMAWAT S, GOBIOFF H, LEUNG S T. The google file system[C]//Proceedings of the 19th ACM Symposium on Operating Systems Principles. Bolton Landing, NY, USA, 2003:29-43.
[16] CHANG F, DEAN J, GHEMAWAT S, et al. Bigtable:a distributed storage system for structured data[C]//Proceedings of the 7th USENIX Symposium on Operating Systems Design and Implementation. Seattle, WA, 2006:15.
[17] MU Yashuang, LIU Xiaodong, YANG Zhihao, et al. A parallel C4.5 decision tree algorithm based on MapReduce[J]. Concurrency and computation practice and experience, 2017, 29(8):e4015.
[18] 侯泳旭, 段磊, 秦江龙, 等. 基于Isolation Forest的并行化异常探测设计[J]. 计算机工程与科学, 2017, 39(2):236-244 HOU Yongxu, DUAN Lei, QIN Jianglong, et al. Parallel anomaly detection based on isolation forest[J]. Computer engineering & science, 2017, 39(2):236-244
[19] 叶海琴, 孟彩霞, 王意锋, 等. 一种基于MapReduce的频繁模式挖掘算法[J]. 南京理工大学学报, 2018, 42(1):62-67 YE Haiqin, MENG Caixia, WANG Yifeng, et al. Frequent pattern mining algorithm based on MapReduce[J]. Journal of Nanjing University of Science and Technology, 2018, 42(1):62-67
[20] 刘义, 景宁, 陈荦, 等. MapReduce框架下基于R-树的k-近邻连接算法[J]. 软件学报, 2013, 24(8):1836-1851 LIU Yi, JING Ning, CHEN Luo, et al. Algorithm for processing k-nearest join based on R-tree in MapReduce[J]. Journal of software, 2013, 24(8):1836-1851
[21] WHITE T. Hadoop:the definitive guide[M]. 4th ed. Gravenstein Highway North:O’reilly Media Inc., 2015:1-4.
[22] http://archive.ics.uci.edu/ml/machine-learning-databases/kddcup99-mld/.

相似文献/References:

[1]张继福,张素兰,胡立华.约束概念格及其构造方法[J].智能系统学报,2006,1(02):31.
 ZHANG Ji-fu,ZHANG Su-lan,HU Li-hua.Constrained concept lattice and its construction method[J].CAAI Transactions on Intelligent Systems,2006,1(02):31.
[2]王国胤,张清华,胡 军.粒计算研究综述[J].智能系统学报,2007,2(06):8.
 WANG Guo-yin,ZHANG Qing-hua,HU Jun.An overview of granular computing[J].CAAI Transactions on Intelligent Systems,2007,2(02):8.
[3]田国会,吉艳青,李晓磊.家庭智能空间下基于场景的人的行为理解[J].智能系统学报,2010,5(01):57.
 TIAN Guo-hui,JI Yan-qing,LI Xiao-lei.Human behaviors understanding based on scene knowledge in home intelligent space[J].CAAI Transactions on Intelligent Systems,2010,5(02):57.
[4]何清.物联网与数据挖掘云服务[J].智能系统学报,2012,7(03):189.
 HE Qing.The Internet of things and the data mining cloud service[J].CAAI Transactions on Intelligent Systems,2012,7(02):189.
[5]李海林,郭韧,万校基.基于特征矩阵的多元时间序列最小距离度量方法[J].智能系统学报,2015,10(03):442.[doi:10.3969/j.issn.1673-4785.201405047]
 LI Hailin,GUO Ren,WAN Xiaoji.A minimum distance measurement method for amultivariate time series based on the feature matrix[J].CAAI Transactions on Intelligent Systems,2015,10(02):442.[doi:10.3969/j.issn.1673-4785.201405047]
[6]申彦,朱玉全.CMP上基于数据集划分的K-means多核优化算法[J].智能系统学报,2015,10(04):607.[doi:10.3969/j.issn.1673-4785.201411036]
 SHEN Yan,ZHU Yuquan.An optimized algorithm of K-means based on data set partition on CMP systems[J].CAAI Transactions on Intelligent Systems,2015,10(02):607.[doi:10.3969/j.issn.1673-4785.201411036]
[7]汤建国,汪江桦,韩莉英,等.基于覆盖粗糙集的语言动力系统[J].智能系统学报,2014,9(02):229.[doi:10.3969/j.issn.1673-4785.201307018]
 TANG Jianguo,WANG Jianghua,HAN Liying,et al.Linguistic dynamic systems based on covering-based rough sets[J].CAAI Transactions on Intelligent Systems,2014,9(02):229.[doi:10.3969/j.issn.1673-4785.201307018]
[8]石磊,杜军平,周亦鹏,等.在线社交网络挖掘与搜索技术研究[J].智能系统学报,2016,11(6):777.[doi:10.11992/tis.201612007]
 SHI Lei,DU Junping,ZHOU Yipeng,et al.A survey on online social network mining and search[J].CAAI Transactions on Intelligent Systems,2016,11(02):777.[doi:10.11992/tis.201612007]
[9]淦文燕,刘冲.一种改进的搜索密度峰值的聚类算法[J].智能系统学报,2017,12(02):229.[doi:10.11992/tis.201512036]
 GAN Wenyan,LIU Chong.An improved clustering algorithm that searches and finds density peaks[J].CAAI Transactions on Intelligent Systems,2017,12(02):229.[doi:10.11992/tis.201512036]
[10]翟俊海,刘博,张素芳.基于粗糙集相对分类信息熵和粒子群优化的特征选择方法[J].智能系统学报,2017,12(03):397.[doi:10.11992/tis.201705004]
 ZHAI Junhai,LIU Bo,ZHANG Sufang.A feature selection approach based on rough set relative classification information entropy and particle swarm optimization[J].CAAI Transactions on Intelligent Systems,2017,12(02):397.[doi:10.11992/tis.201705004]
[11]张玉玲,尹传环.依特征频率的安卓恶意软件异常检测的研究[J].智能系统学报,2018,13(02):168.[doi:10.11992/tis.201609016]
 ZHANG Yuling,YIN Chuanhuan.Android malware outlier detection based on feature frequency[J].CAAI Transactions on Intelligent Systems,2018,13(02):168.[doi:10.11992/tis.201609016]

备注/Memo

备注/Memo:
收稿日期:2018-09-04。
基金项目:国家自然科学基金项目(61572435,61472305,61473222);教育部-中国移动联合基金项目(MCM20170103);复杂电子系统仿真重点实验室基础研究基金项目(DXZT-JC-ZZ-2015-015);宁波市自然科学基金项目(2016A610035,2017A610119).
作者简介:齐小刚,男,1973年生,教授,博士生导师,主要研究方向为系统建模与故障诊断、网络优化与算法设计。发表学术论文50余篇,被SCI检索10余篇、EI检索50余篇。申请专利18项(授权9项)、登记软件著作权3项。;胡秋秋,女,1995年生,硕士研究生,主要研究方向为分布式系统、数据处理与分析。;刘立芳,女,1972年生,教授,博士,主要研究方向为数据处理与智能计算。发表学术论文40余篇,其中被SCI检索9篇、EI检索30余篇。
通讯作者:胡秋秋.E-mail:huqiuq@yeah.net
更新日期/Last Update: 2019-04-25