WANG Hong-ding, TONG Yun-hai, TAN Shao-hua, et al. Research progress on outlier mining[J]. CAAI Transactions on Intelligent Systems, 2006, 1(01): 67-73.





Research progress on outlier mining
北京大学视觉与听觉处理国家重点实验室,北京 100871
WANG Hong-ding TONG Yun-hai TAN Shao-hua TANG Shi-wei Y ANG Dong-qing
National Laboratory on Machine Perception, Peking University, Beijing 100871, China
outlier; outlier mining; local outlier; data stream; high-dimensional data
An outlier is a data point that differs significantly from the other data in a data set. One person's noise may be another person's signal; as attention to data quality, fraud detection, network intrusion detection, fault diagnosis, and automated military reconnaissance has grown, outlier mining has therefore attracted increasing interest in information science research. Based on a thorough review of the Chinese and international literature, this paper surveys the state of outlier mining research in the database field, from basic concepts to the principal research problems and underlying techniques, including the origins of outliers, definitions of outliers, and a comparison of popular outlier mining methods. It also summarizes the current state of the art of these techniques and, in light of current research hotspots, discusses future research directions and the challenges facing outlier mining.




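TANG Shi-wei, male, born in 1939, is a professor and doctoral supervisor, and deputy director of the Database Technical Committee of the China Computer Federation. His main research interests are databases and information systems. He has led a number of major national science and technology projects and "973" Program projects, has received numerous awards including the Second Prize of the National Science and Technology Progress Award, and has published many papers in major domestic and international journals and academic conferences.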
Last Update: 2009-04-07