[1] ZHANG Yan, DU Hongle. Imbalanced heterogeneous data ensemble classification based on HVDM-KNN[J]. CAAI Transactions on Intelligent Systems, 2019, 14(04): 733-742. [doi:10.11992/tis.201807023]

Research on an ensemble classification algorithm based on heterogeneous distance

CAAI Transactions on Intelligent Systems [ISSN: 1673-4785 / CN: 23-1538/TP]

Volume:
14
Issue:
2019, No. 04
Pages:
733-742
Section:
Publication date:
2019-07-02

Article Info

Title:
Imbalanced heterogeneous data ensemble classification based on HVDM-KNN
Author(s):
ZHANG Yan, DU Hongle
School of Math and Computer Application, Shangluo University, Shangluo 726000, Shaanxi, China
Keywords:
heterogeneous data; imbalanced data; heterogeneous value difference metric; ensemble learning; oversampling; undersampling
CLC number:
TP391.4
DOI:
10.11992/tis.201807023
Abstract:
To address imbalanced classification on heterogeneous datasets, this paper proposes the heterogeneous value difference metric-Adaboost-KNN (HVDM-Adaboost-KNN) algorithm, which approaches the problem from three angles: dataset resampling, ensemble learning, and weak-classifier construction. The algorithm first balances the dataset with a clustering algorithm, obtaining several balanced data subsets, and builds a sub-classifier on each subset. The heterogeneous distance is used to compute the distance between two samples in a heterogeneous dataset, which improves the classification performance of the KNN algorithm. The Adaboost algorithm is then applied iteratively to obtain the final classifier. Eight UCI datasets are used to evaluate the classification performance of the algorithm on imbalanced data. The experimental results show that, compared with Adaboost and other algorithms, the proposed method improves the F1 value, AUC, and G-mean on heterogeneous imbalanced datasets.
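The heterogeneous distance named in the abstract can be illustrated with a minimal sketch. It assumes the commonly cited form of the heterogeneous value difference metric (HVDM): numeric attributes contribute |x - y| / (4σ), and nominal attributes contribute a normalized value difference built from class-conditional frequencies. The function names `fit_hvdm`, `hvdm`, and `knn_predict`, the handling of unseen nominal values, and the toy data are all hypothetical choices for illustration; the clustering-based balancing and the Adaboost iteration described in the abstract are not shown, and the paper's exact formulation may differ.

```python
import math
from collections import Counter, defaultdict


def fit_hvdm(X, y, nominal):
    """Collect the statistics HVDM needs from training rows X with labels y.

    `nominal` holds the column indices of categorical attributes;
    every other column is treated as numeric.
    """
    classes = sorted(set(y))
    sigma = {}  # numeric column -> standard deviation
    cond = {}   # nominal column -> {value: {class: P(class | value)}}
    for a in range(len(X[0])):
        col = [row[a] for row in X]
        if a in nominal:
            counts = defaultdict(Counter)
            for v, label in zip(col, y):
                counts[v][label] += 1
            cond[a] = {
                v: {c: cnt[c] / sum(cnt.values()) for c in classes}
                for v, cnt in counts.items()
            }
        else:
            mean = sum(col) / len(col)
            var = sum((v - mean) ** 2 for v in col) / len(col)
            sigma[a] = math.sqrt(var) or 1.0  # guard against constant columns
    return {"classes": classes, "sigma": sigma, "cond": cond,
            "nominal": set(nominal)}


def hvdm(model, p, q):
    """Distance between two mixed-type samples under the fitted statistics."""
    total = 0.0
    for a, (u, v) in enumerate(zip(p, q)):
        if a in model["nominal"]:
            pu = model["cond"][a].get(u)
            pv = model["cond"][a].get(v)
            if pu is None or pv is None:
                # Assumption: a value unseen in training is maximally
                # different from any other value and identical to itself.
                d2 = 0.0 if u == v else 1.0
            else:
                d2 = sum((pu[c] - pv[c]) ** 2 for c in model["classes"])
        else:
            d2 = ((u - v) / (4 * model["sigma"][a])) ** 2
        total += d2
    return math.sqrt(total)


def knn_predict(model, X, y, query, k=3):
    """Majority vote among the k training rows closest to `query`."""
    nearest = sorted(range(len(X)), key=lambda i: hvdm(model, X[i], query))[:k]
    return Counter(y[i] for i in nearest).most_common(1)[0][0]


# Toy data: column 0 is numeric, column 1 is nominal.
X = [[1.0, "a"], [2.0, "a"], [9.0, "b"], [10.0, "b"]]
y = [0, 0, 1, 1]
model = fit_hvdm(X, y, nominal={1})
label = knn_predict(model, X, y, [1.5, "a"], k=3)
```

Dividing numeric differences by 4σ puts most of them in roughly the same range as the nominal term, so neither attribute type dominates the sum by scale alone.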

Memo

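Received: 2018-07-22.
Funding: Natural Science Basic Research Program of Shaanxi Province (2015JM6347); Scientific Research Program of the Shaanxi Provincial Education Department (15JK1218); Science and Technology Project of Shangluo University (18sky014); Science and Technology Innovation Team Building Project of Shangluo University (18SCX002); Key Discipline Construction Project of Shangluo University (discipline: Mathematics).
About the authors: ZHANG Yan, female, born in 1977, lecturer. Her main research interests are pattern recognition and machine learning. She has led or participated in 6 provincial/ministerial-level and enterprise cooperation projects and has published more than 10 academic papers. DU Hongle, male, born in 1979, associate professor. His main research interests are data mining and machine learning. He has led or undertaken 12 projects at the university level or above and has published more than 30 academic papers, more than 10 of which are indexed by EI.
Corresponding author: DU Hongle. E-mail: dhl5597@163.com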
Last Update: 2019-08-25