[1] SHI Hongbo, CHEN Yuwen, CHEN Xin. Summary of research on SMOTE oversampling and its improved algorithms[J]. CAAI Transactions on Intelligent Systems, 2019, 14(06): 1073-1083. [doi:10.11992/tis.201906052]

Summary of research on SMOTE oversampling and its improved algorithms

CAAI Transactions on Intelligent Systems [ISSN: 1673-4785 / CN: 23-1538/TP]

Volume:
14
Issue:
2019, No. 06
Pages:
1073-1083
Publication date:
2019-11-05

Article information

Title:
Summary of research on SMOTE oversampling and its improved algorithms
Author(s):
SHI Hongbo, CHEN Yuwen, CHEN Xin
School of Information, Shanxi University of Finance and Economics, Taiyuan, Shanxi 030031
Keywords:
imbalanced data classification; SMOTE; algorithm; k-NN; oversampling; undersampling; high-dimensional data; categorical data
Classification number:
TP391
DOI:
10.11992/tis.201906052
Abstract:
In recent years, the problem of imbalanced classification has received considerable attention. The synthetic minority oversampling technique (SMOTE) changes the data distribution of an imbalanced data set by adding generated minority-class samples, and it is one of the popular methods for improving the performance of classification models on imbalanced data. In this paper, we first describe the principle and algorithm of SMOTE and the problems it suffers from. Then, with respect to these problems, we review related research on four types of extension methods and three types of applications. Finally, we analyze existing research on, and the open problems in, applying SMOTE to big data, streaming data, data with few labels, and other types of data, with the aim of providing a useful reference for the research and application of SMOTE.
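The core idea summarized in the abstract, generating minority-class samples to rebalance a data set, can be illustrated with a minimal sketch. It is not the authors' code: the function name `smote`, its parameters, and the random choice of base sample are illustrative assumptions, and only the basic interpolation step is shown (a new point is placed on the segment between a minority sample and one of its k nearest minority neighbours).

```python
import random
from math import dist


def smote(minority, n_new, k=5, seed=0):
    """Generate n_new synthetic points by SMOTE-style interpolation.

    minority: list of equal-length numeric tuples (minority-class samples).
    Hypothetical helper for illustration only.
    """
    if len(minority) < 2:
        raise ValueError("need at least two minority samples")
    rng = random.Random(seed)
    k = min(k, len(minority) - 1)
    synthetic = []
    for _ in range(n_new):
        # Assumption: the base sample is drawn at random rather than
        # looping over every minority sample a fixed number of times.
        i = rng.randrange(len(minority))
        x = minority[i]
        # Minority neighbours of x (excluding x), sorted by Euclidean distance.
        others = [j for j in range(len(minority)) if j != i]
        others.sort(key=lambda j: dist(x, minority[j]))
        neighbour = minority[rng.choice(others[:k])]
        # The new sample lies on the segment between x and the neighbour.
        gap = rng.random()
        synthetic.append(tuple(a + gap * (b - a) for a, b in zip(x, neighbour)))
    return synthetic
```

For example, `smote([(0.0, 0.0), (1.0, 0.0), (0.0, 1.0), (1.0, 1.0)], 10, k=2)` returns ten points, all inside the unit square, because each one is a convex combination of two existing minority samples. This also hints at why noise and class-boundary overlap matter for SMOTE: synthetic points can only fall between existing minority points.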

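Memo

Received: 2019-06-27.
Funding: National Natural Science Foundation of China (61801279); Natural Science Foundation of Shanxi Province (201801D121115, 2014011022-2).
Author biographies: SHI Hongbo, female, born in 1965, professor and doctoral supervisor. Her main research interests are machine learning and artificial intelligence. She has led or participated in more than 20 projects, including projects of the National Natural Science Foundation of China and the Natural Science Foundation of Shanxi Province, and has published more than 50 academic papers. CHEN Yuwen, female, born in 1995, master's student. Her main research interests are data mining and business intelligence. CHEN Xin, male, born in 1995, master's student. His main research interests are machine learning, data mining, and business intelligence.
Corresponding author: SHI Hongbo. E-mail: shihb@sxufe.edu.cn
Last Update: 2019-12-25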