[1] QI Xiaogang, HU Qiuqiu, LIU Lifang. Parallel anomaly algorithm based on MapReduce[J]. CAAI Transactions on Intelligent Systems, 2019, 14(2): 224-230. [doi:10.11992/tis.201809007]
CAAI Transactions on Intelligent Systems [ISSN 1673-4785/CN 23-1538/TP] Volume:
14
Issue:
2019, No. 2
Pages:
224-230
Column:
Academic Papers: Machine Learning
Publication date:
2019-03-05
- Title:
-
Parallel anomaly algorithm based on MapReduce
- Author(s):
-
QI Xiaogang (齐小刚)1, HU Qiuqiu (胡秋秋)1, LIU Lifang (刘立芳)2
-
1. School of Mathematics and Statistics, Xidian University, Xi’an 710071, China;
2. School of Computer Science and Technology, Xidian University, Xi’an 710071, China
-
- Keywords:
-
data mining; anomaly detection; local outlier factor; Hadoop; MapReduce; distributed file system; parallel computing; local density
- CLC number:
-
TP311
- DOI:
-
10.11992/tis.201809007
- Abstract:
-
To improve the accuracy, sensitivity, and execution efficiency of anomaly detection algorithms in data mining as the data volume grows, this paper proposes a parallel anomaly detection algorithm (MR-DLOF) based on the MapReduce framework and the local outlier factor (LOF) algorithm. First, the dataset stored in the Hadoop distributed file system (HDFS) is logically divided into multiple data blocks. Then, the data in each block are processed in parallel using MapReduce, so that the k-distance and LOF value of each data point are computed within a single block only, which improves the execution efficiency of the algorithm. At the same time, the concept of k-distance is redefined to avoid the case where the local density becomes infinite because k or more duplicate points exist in the dataset. Finally, the data points whose LOF values exceed a threshold are merged and their LOF values are recalculated, which improves the accuracy and sensitivity of the algorithm. Experiments on real-world datasets demonstrate the effectiveness, efficiency, and scalability of the MR-DLOF algorithm.
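The pipeline described in the abstract can be illustrated with a minimal single-machine sketch. It is not the authors' implementation: the abstract does not state how k-distance is redefined or how the recomputation is scoped, so the sketch makes two assumptions. The k-distance is taken over distinct locations other than the point's own, which keeps it strictly positive when duplicates exist, and the final step recomputes LOF only among the merged candidates. A plain loop over contiguous chunks stands in for the MapReduce tasks over HDFS blocks. The names `lof_scores` and `two_stage_lof` and the parameters `n_blocks` and `threshold` are hypothetical.

```python
import math
from typing import Dict, List, Sequence

Point = Sequence[float]


def lof_scores(points: List[Point], k: int) -> List[float]:
    """LOF of every point, computed only against the points passed in."""
    n = len(points)
    # Pairwise Euclidean distances, O(n^2); acceptable for one small block.
    d = [[math.dist(p, q) for q in points] for p in points]

    # ASSUMPTION: duplicate-safe k-distance = distance to the k-th nearest
    # *distinct location* other than the point's own, so it is never zero.
    locations = {tuple(p) for p in points}
    kdist = []
    for i in range(n):
        others = locations - {tuple(points[i])}
        if not others:  # every point in the block coincides
            return [1.0] * n
        ds = sorted(math.dist(points[i], q) for q in others)
        kdist.append(ds[min(k, len(ds)) - 1])

    # k-neighborhood: all other points (duplicates included) within k-distance.
    nbrs = [[j for j in range(n) if j != i and d[i][j] <= kdist[i]]
            for i in range(n)]

    # Local reachability density; finite because every kdist[j] > 0.
    lrd = []
    for i in range(n):
        reach = [max(kdist[j], d[i][j]) for j in nbrs[i]]
        lrd.append(len(reach) / sum(reach))

    return [sum(lrd[j] for j in nbrs[i]) / (len(nbrs[i]) * lrd[i])
            for i in range(n)]


def two_stage_lof(points: List[Point], k: int, n_blocks: int,
                  threshold: float) -> Dict[int, float]:
    """Block-wise LOF, then merge high-LOF candidates and rescore them."""
    n = len(points)
    size = max(1, math.ceil(n / n_blocks))

    # Stage 1 (stand-in for the parallel phase): score each contiguous
    # block independently and keep points whose LOF exceeds the threshold.
    candidates: List[int] = []
    for start in range(0, n, size):
        idx = list(range(start, min(start + size, n)))
        scores = lof_scores([points[i] for i in idx], k)
        candidates += [i for i, s in zip(idx, scores) if s > threshold]

    # Stage 2: merge the candidates and recompute LOF among them only
    # (ASSUMPTION about the scope of the recomputation).
    rescored = lof_scores([points[i] for i in candidates], k)
    return dict(zip(candidates, rescored))
```

With five copies of one point in a block and `k=2`, a conventional k-distance would be zero for those copies; the distinct-location variant above keeps every local reachability density finite. On very small inputs the merged candidate set may be too small for the rescored values to be informative.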
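Memo
Received: 2018-09-04.
Funding: National Natural Science Foundation of China (61572435, 61472305, 61473222); Ministry of Education-China Mobile Joint Fund (MCM20170103); Basic Research Fund of the Key Laboratory of Complex Electronic System Simulation (DXZT-JC-ZZ-2015-015); Ningbo Natural Science Foundation (2016A610035, 2017A610119).
About the authors: QI Xiaogang, male, born in 1973, is a professor and doctoral supervisor. His main research interests are system modeling and fault diagnosis, and network optimization and algorithm design. He has published more than 50 academic papers, of which more than 10 are indexed by SCI and more than 50 by EI, has applied for 18 patents (9 granted), and has registered 3 software copyrights. HU Qiuqiu, female, born in 1995, is a master's student. Her main research interests are distributed systems and data processing and analysis. LIU Lifang, female, born in 1972, is a professor and holds a doctorate. Her main research interests are data processing and intelligent computing. She has published more than 40 academic papers, of which 9 are indexed by SCI and more than 30 by EI.
Corresponding author: HU Qiuqiu. E-mail: huqiuq@yeah.net
Last Update: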
2019-04-25