QI Xiaogang,HU Qiuqiu,LIU Lifang.Parallel anomaly algorithm based on MapReduce[J].CAAI Transactions on Intelligent Systems,2019,14(02):224-230.[doi:10.11992/tis.201809007]





Parallel anomaly algorithm based on MapReduce
齐小刚1 胡秋秋1 刘立芳2
1. 西安电子科技大学 数学与统计学院, 陕西 西安 710071;
2. 西安电子科技大学 计算机学院, 陕西 西安 710071
QI Xiaogang1 HU Qiuqiu1 LIU Lifang2
1. School of Mathematics and Statistics, Xidian University, Xi’an 710071, China;
2. School of Computer Science and Technology, Xidian University, Xi’an 710071, China
data mininganomaly detectionlocal outlier factorHadoopMapReduceDistributed File Systemparallel computinglocal density
为了提高数据挖掘中异常检测算法在数据量增大时的准确度、灵敏度和执行效率,本文提出了一种基于MapReduce框架和Local Outlier Factor (LOF)算法的并行异常检测算法(MR-DLOF)。首先,将存放在Hadoop分布式文件系统(HDFS)上的数据集逻辑地切分为多个数据块。然后,利用MapReduce原理将各个数据块中的数据并行处理,使得每个数据点的k-邻近距离和LOF值的计算仅在单个块中执行,从而提高了算法的执行效率;同时重新定义了k-邻近距离的概念,避免了数据集中存在大于或等于k个重复点而导致局部密度为无穷大的情况。最后,将LOF值较大的数据点合并重新计算其LOF值,从而提高算法准确度和灵敏度。通过真实数据集验证了MR-DLOF算法的有效性、高效性和可扩展性。
To improve the accuracy, sensitivity, and efficiency of anomaly detection algorithm in data mining when the amount of data increases, a parallel anomaly detection algorithm (MR-LOF) based on the MapReduce framework and the local outlier factor (LOF) algorithm is proposed in this paper. First, the dataset, stored in the Hadoop distributed file system (HDFS), is logically divided into multiple data blocks. Then, the MapReduce principle is used to process the data in each data block in parallel, so that the k-distance and LOF value of each data point is calculated only in a single block. It greatly improves the efficiency of the algorithm. Simultaneously, the concept of k-distance is redefined. It avoids the situation where the local density is infinite because more than k repeated points exist in the dataset. Finally, the data points whose LOF value is larger than threshold are merged, and the LOF values of combined data are recalculated. This process can effectively improve the accuracy and sensitivity. Experiments with real-world datasets demonstrate the validity, high efficiency, and extendibility of the MR-DLOF algorithm.


[1] HAN Jiawei, KAMBER M. Data mining:concepts and techniques[M]. San Francisco:Morgan Kaufmann Publishers Inc., 2006:1-18.
[2] AGGARWAL C C. Outlier analysis[M]. New York:Springer, 2013:75-99.
[3] CHEN Feng, DENG Pan, WAN Jiafu, et al. Data mining for the internet of things:literature review and challenges[J]. International journal of distributed sensor networks, 2015, 2015:431047.
[4] 吴镜锋, 金炜东, 唐鹏. 数据异常的监测技术综述[J]. 计算机科学, 2017, 44(S2):24-28 WU Jingfeng, JIN Weidong, TANG Peng. Survey on monitoring techniques for data abnormalities[J]. Computer science, 2017, 44(S2):24-28
[5] STEINWART I, HUSH D, SCOVEL C. A classification framework for anomaly detection[J]. The journal of machine learning research, 2005, 6(1):211-232.
[6] 邓红莉, 杨韬. 一种基于深度学习的异常检测方法[J]. 信息通信, 2015(3):3-4 DENG Hongli, YANG Tao. An anomaly detection method based on deep learning[J]. Information and communications, 2015(3):3-4
[7] ZHAO Xuanqiang, WANG Guoying, LI Zhixing. Unsupervised network anomaly detection based on abnormality weights and subspace clustering[C]//Proceedings of the 6th International Conference on Information Science and Technology. Dalian, China, 2016:482-486.
[8] 左进, 陈泽茂. 基于改进K均值聚类的异常检测算法[J]. 计算机科学, 2016, 43(8):258-261 ZUO Jin, CHEN Zemao. Anomaly detection algorithm based on improved K-means clustering[J]. Computer science, 2016, 43(8):258-261
[9] 邹云峰, 张昕, 宋世渊, 等. 基于局部密度的快速离群点检测算法[J]. 计算机应用, 2017, 37(10):2932-2937 ZOU Yunfeng, ZHANG Xin, SONG Shiyuan, et al. Fast outlier detection algorithm based on local density[J]. Journal of computer applications, 2017, 37(10):2932-2937
[10] BREUNIG M M, KRIEGEL H P, NG R T, et al. LOF:identifying density-based local outliers[C]//Proceedings of 2000 ACM SIGMOD International Conference on Management of Data. Dallas, Texas, USA, 2000:93-104.
[11] BHATT V, SHARMA K G, RAM A. An enhanced approach for LOF in data mining[C]//Proceedings of 2013 International Conference on Green High Performance Computing. Nagercoil, India, 2013:1-3.
[12] MIAO Dandan, QIN Xiaowei, WANG Weidong. Anomalous cell detection with kernel density-based local outlier factor[J]. China communications, 2015, 12(9):64-75.
[13] TANG Bo, HE Haibo. A local density-based approach for outlier detection[J]. Neurocomputing, 2017, 241:171-180.
[14] DEAN J, GHEMAWAT S. MapReduce:simplified data processing on large clusters[J]. Communications of the ACM, 2008, 51(1):107-113.
[15] GHEMAWAT S, GOBIOFF H, LEUNG S T. The google file system[C]//Proceedings of the 19th ACM Symposium on Operating Systems Principles. Bolton Landing, NY, USA, 2003:29-43.
[16] CHANG F, DEAN J, GHEMAWAT S, et al. Bigtable:a distributed storage system for structured data[C]//Proceedings of the 7th USENIX Symposium on Operating Systems Design and Implementation. Seattle, WA, 2006:15.
[17] MU Yashuang, LIU Xiaodong, YANG Zhihao, et al. A parallel C4.5 decision tree algorithm based on MapReduce[J]. Concurrency and computation practice and experience, 2017, 29(8):e4015.
[18] 侯泳旭, 段磊, 秦江龙, 等. 基于Isolation Forest的并行化异常探测设计[J]. 计算机工程与科学, 2017, 39(2):236-244 HOU Yongxu, DUAN Lei, QIN Jianglong, et al. Parallel anomaly detection based on isolation forest[J]. Computer engineering & science, 2017, 39(2):236-244
[19] 叶海琴, 孟彩霞, 王意锋, 等. 一种基于MapReduce的频繁模式挖掘算法[J]. 南京理工大学学报, 2018, 42(1):62-67 YE Haiqin, MENG Caixia, WANG Yifeng, et al. Frequent pattern mining algorithm based on MapReduce[J]. Journal of Nanjing University of Science and Technology, 2018, 42(1):62-67
[20] 刘义, 景宁, 陈荦, 等. MapReduce框架下基于R-树的k-近邻连接算法[J]. 软件学报, 2013, 24(8):1836-1851 LIU Yi, JING Ning, CHEN Luo, et al. Algorithm for processing k-nearest join based on R-tree in MapReduce[J]. Journal of software, 2013, 24(8):1836-1851
[21] WHITE T. Hadoop:the definitive guide[M]. 4th ed. Gravenstein Highway North:O’reilly Media Inc., 2015:1-4.
[22] http://archive.ics.uci.edu/ml/machine-learning-databases/kddcup99-mld/.


 ZHANG Ji-fu,ZHANG Su-lan,HU Li-hua.Constrained concept lattice and its construction method[J].CAAI Transactions on Intelligent Systems,2006,1(02):31.
[2]王国胤,张清华,胡 军.粒计算研究综述[J].智能系统学报,2007,2(06):8.
 WANG Guo-yin,ZHANG Qing-hua,HU Jun.An overview of granular computing[J].CAAI Transactions on Intelligent Systems,2007,2(02):8.
 TIAN Guo-hui,JI Yan-qing,LI Xiao-lei.Human behaviors understanding based on scene knowledge in home intelligent space[J].CAAI Transactions on Intelligent Systems,2010,5(02):57.
 HE Qing.The Internet of things and the data mining cloud service[J].CAAI Transactions on Intelligent Systems,2012,7(02):189.
 LI Hailin,GUO Ren,WAN Xiaoji.A minimum distance measurement method for amultivariate time series based on the feature matrix[J].CAAI Transactions on Intelligent Systems,2015,10(02):442.[doi:10.3969/j.issn.1673-4785.201405047]
 SHEN Yan,ZHU Yuquan.An optimized algorithm of K-means based on data set partition on CMP systems[J].CAAI Transactions on Intelligent Systems,2015,10(02):607.[doi:10.3969/j.issn.1673-4785.201411036]
 TANG Jianguo,WANG Jianghua,HAN Liying,et al.Linguistic dynamic systems based on covering-based rough sets[J].CAAI Transactions on Intelligent Systems,2014,9(02):229.[doi:10.3969/j.issn.1673-4785.201307018]
 SHI Lei,DU Junping,ZHOU Yipeng,et al.A survey on online social network mining and search[J].CAAI Transactions on Intelligent Systems,2016,11(02):777.[doi:10.11992/tis.201612007]
 GAN Wenyan,LIU Chong.An improved clustering algorithm that searches and finds density peaks[J].CAAI Transactions on Intelligent Systems,2017,12(02):229.[doi:10.11992/tis.201512036]
 ZHAI Junhai,LIU Bo,ZHANG Sufang.A feature selection approach based on rough set relative classification information entropy and particle swarm optimization[J].CAAI Transactions on Intelligent Systems,2017,12(02):397.[doi:10.11992/tis.201705004]
 ZHANG Yuling,YIN Chuanhuan.Android malware outlier detection based on feature frequency[J].CAAI Transactions on Intelligent Systems,2018,13(02):168.[doi:10.11992/tis.201609016]


更新日期/Last Update: 2019-04-25