[1] ZHANG He, CAI Jiang-hui, ZHANG Ji-fu, et al. An outlier mining algorithm based on information entropy[J]. CAAI Transactions on Intelligent Systems, 2010, 5(02): 150-155.

An outlier mining algorithm based on information entropy

CAAI Transactions on Intelligent Systems [ISSN: 1673-4785 / CN: 23-1538/TP]

Volume:
5
Issue:
2010, No. 02
Pages:
150-155
Column:
Publication date:
2010-04-25

Article Info

Title:
An outlier mining algorithm based on information entropy
Article ID:
1673-4785(2010)02-0150-06
Author(s):
ZHANG He1, CAI Jiang-hui1, ZHANG Ji-fu1, QIAO Kan2
1. School of Computer Science and Technology, Taiyuan University of Science & Technology, Taiyuan 030024, China;
2. School of Automation Science and Electrical Engineering, Beijing University of Aeronautics and Astronautics, Beijing 100191, China
Keywords:
outlier; information entropy; outlier measure factor; data mining
CLC number:
TP311
Document code:
A
Abstract:
Outlier mining aims to discover anomalous patterns that are relatively sparse and isolated within massive volumes of data, but traditional outlier mining methods are strongly influenced by subjective human factors. This paper introduces an outlier measure factor based on information entropy and proposes a new outlier mining algorithm built on it. The algorithm first uses information entropy to compute the outlier measure factor of each data object, then uses that factor to measure each object's degree of outlierness and thereby detect outliers. This effectively removes the influence of subjective human factors on outlier detection, and the factor also gives a clear interpretation of what makes a point an outlier. Experiments on UC Irvine (UCI) data sets and high-dimensional stellar spectrum data verify the feasibility and effectiveness of the algorithm.
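
The abstract does not give the formula for the outlier measure factor, so the following Python sketch only illustrates the general idea of scoring records with information entropy and should not be read as the authors' definition. It assumes categorical attributes, treats attributes as independent, and uses a hypothetical factor: the drop in total dataset entropy when a record is removed. The function names and the cutoff `k` are assumptions made for this illustration.

```python
import math
from collections import Counter


def attribute_entropy(values):
    """Shannon entropy (in bits) of the empirical distribution of one attribute."""
    n = len(values)
    if n == 0:
        return 0.0
    counts = Counter(values)
    return -sum((c / n) * math.log2(c / n) for c in counts.values())


def dataset_entropy(records):
    """Sum of per-attribute entropies (assumes attributes are independent)."""
    if not records:
        return 0.0
    # zip(*records) turns a list of rows into a sequence of columns.
    return sum(attribute_entropy(column) for column in zip(*records))


def outlier_factors(records):
    """Hypothetical factor: entropy of the full data minus entropy without record i.

    A large positive value means removing the record makes the data much
    more orderly, which is taken here as a sign of outlierness.
    """
    total = dataset_entropy(records)
    return [
        total - dataset_entropy(records[:i] + records[i + 1:])
        for i in range(len(records))
    ]


def top_outliers(records, k):
    """Indices of the k records with the largest factor (k is an assumed cutoff)."""
    factors = outlier_factors(records)
    order = sorted(range(len(records)), key=lambda i: factors[i], reverse=True)
    return order[:k]


if __name__ == "__main__":
    # Toy data: four identical records and one that differs in every attribute.
    data = [("a", "x"), ("a", "x"), ("a", "x"), ("a", "x"), ("b", "y")]
    for index, factor in enumerate(outlier_factors(data)):
        print(index, data[index], round(factor, 4))
    print("most outlying record index:", top_outliers(data, 1))
```

In the toy data, removing the last record leaves a perfectly uniform dataset, so it receives the only positive factor. Recomputing the entropy for every record costs quadratic time in this naive form, and the cutoff `k` is still chosen by hand; how the paper avoids such a user-set parameter cannot be determined from the abstract.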

Memo:
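Received: 2008-12-30.
Funding: Supported by the Youth Science Foundation of Shanxi Province (2008021028).
Corresponding author: ZHANG He. E-mail: zhanghe_helen@126.com.
About the authors:
ZHANG He, female, born in 1981, master's student. Her main research interest is data mining.
CAI Jiang-hui, male, born in 1978, lecturer. His main research interest is outlier data mining.
ZHANG Ji-fu, male, born in 1963, professor, Ph.D. His main research interests are data mining, pattern recognition, and intelligent information systems. He has led and completed more than 10 research projects at or above the provincial and ministerial level, including National Natural Science Foundation of China projects and sub-projects of the National "863" Program, and has published more than 100 academic papers, over 30 of which are indexed by SCI or EI.
Last Update: 2010-05-24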