[1] ZHANG He, CAI Jiang-hui, ZHANG Ji-fu, et al. An outlier mining algorithm based on information entropy[J]. CAAI Transactions on Intelligent Systems, 2010, 5(2): 150-155.
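CAAI Transactions on Intelligent Systems [ISSN 1673-4785/CN 23-1538/TP]
Volume: 5
Issue: 2010, No. 2
Pages: 150-155
Column: Academic Papers: Fundamentals of Artificial Intelligence
Publication date: 2010-04-25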
- Title: An outlier mining algorithm based on information entropy
- Article ID: 1673-4785(2010)02-0150-06
- Author(s): ZHANG He1, CAI Jiang-hui1, ZHANG Ji-fu1, QIAO Kan2
  1. School of Computer Science and Technology, Taiyuan University of Science & Technology, Taiyuan 030024, China;
  2. Automation Science and Electrical Engineering College, Beijing University of Aeronautics and Astronautics, Beijing 100191, China
- Keywords: outlier; information entropy; outlier measure factor; data mining
- CLC number: TP311
- Document code: A
- Abstract: Outlier mining aims to discover anomalous data patterns that are relatively sparse and isolated within massive volumes of data, but traditional outlier mining methods are strongly influenced by subjective human factors. This paper introduces an outlier measure factor based on information entropy and presents a new outlier mining algorithm built on it. The algorithm first uses information entropy to calculate the outlier measure factor of each data object, then uses that factor to measure how outlying each object is, and detects outliers accordingly. This effectively eliminates the influence of subjective human factors on outlier detection, and the factor also gives a clear interpretation of what makes a point an outlier. Finally, experiments on UC Irvine (UCI) data sets and high-dimensional stellar spectrum data verify the feasibility and effectiveness of the algorithm.
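The abstract does not define the outlier measure factor, so the following Python sketch only illustrates the general idea of scoring objects with information entropy. It assumes categorical attributes, treats them as independent, and uses a hypothetical score: the drop in total dataset entropy when one record is left out. The function names (`attribute_entropy`, `dataset_entropy`, `outlier_scores`) and the toy records are illustrative assumptions, not the paper's definitions or data.

```python
import math
from collections import Counter


def attribute_entropy(values):
    """Shannon entropy (in bits) of one categorical attribute."""
    counts = Counter(values)
    total = len(values)
    return -sum((c / total) * math.log2(c / total) for c in counts.values())


def dataset_entropy(records):
    """Sum of per-attribute entropies (assumes independent attributes)."""
    if not records:
        return 0.0
    return sum(attribute_entropy(column) for column in zip(*records))


def outlier_scores(records):
    """Hypothetical score: entropy of the full data set minus the entropy
    of the data set with record i removed. Larger means more outlying."""
    full = dataset_entropy(records)
    return [
        full - dataset_entropy(records[:i] + records[i + 1:])
        for i in range(len(records))
    ]


# Toy data (made up for illustration): five identical records, one different.
records = [
    ("red", "small"),
    ("red", "small"),
    ("red", "small"),
    ("red", "small"),
    ("red", "small"),
    ("blue", "large"),
]
scores = outlier_scores(records)
# Indices ordered from most to least outlying under this score.
ranked = sorted(range(len(records)), key=lambda i: scores[i], reverse=True)
```

Under this assumed score, removing the rare record makes the remaining data uniform, so it receives the largest entropy drop and is ranked first. The ranking requires no user-chosen distance or density parameter, which is in the spirit of the abstract's claim about reducing subjective influence.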
Last Update: 2010-05-24