YE Zhi-fei, WEN Yi-min, LU Bao-liang. A survey of imbalanced pattern classification problems[J]. CAAI Transactions on Intelligent Systems, 2009, (02): 148-156.

A survey of imbalanced pattern classification problems

CAAI Transactions on Intelligent Systems [ISSN: 1673-4785 / CN: 23-1538/TP]

Issue:
2009, No. 02
Pages:
148-156
Publication date:
2009-04-25

Article Info

Title:
A survey of imbalanced pattern classification problems
Article ID:
1673-4785(2009)02-0148-09
Author(s):
YE Zhi-fei (叶志飞) 1, WEN Yi-min (文益民) 2, LU Bao-liang (吕宝粮) 1,3
1. Department of Computer Science and Engineering, Shanghai Jiao Tong University, Shanghai 200240, China;
2. Department of Information Engineering, Hunan Industry Polytechnic, Changsha 410208, China;
3. MOE-Microsoft Key Lab. for Intelligent Computing and Intelligent Systems, Shanghai Jiao Tong University, Shanghai 200240, China
Keywords:
machine learning; imbalanced pattern classification; resampling; cost-sensitive learning; task decomposition; classifier ensemble; evaluation metrics
CLC number:
TP181
Document code:
A
Abstract:
Imbalanced data sets have long been regarded as a significant difficulty when applying machine learning methods to real-world pattern classification problems, because traditional classification methods rarely achieve satisfactory results on them. Although various approaches have been proposed over the past decade, many real-world imbalanced data sets still expose their limitations, and further research is ongoing. In this paper, we provide an up-to-date survey of research on imbalanced pattern classification problems. We first examine the problems that imbalanced data sets cause, and then introduce the main kinds of solutions in detail, together with their representative approaches. Finally, using three real imbalanced data sets, we compare the performance of some typical methods, including resampling, cost-sensitive learning, training set partitioning, and classifier ensembles, and find that training set partitioning and classifier ensembles handle imbalanced data sets comparatively well. Evaluation indexes for classifiers on imbalanced problems and future areas of research are also discussed.
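The abstract names resampling among the compared methods and mentions evaluation indexes, but this page gives no detail on either. As a minimal sketch only, the Python below assumes the simplest resampling variant (random oversampling of the minority class) and one commonly used imbalance-aware index (the G-mean, the geometric mean of the recall on each class). The function names `random_oversample` and `g_mean` are hypothetical and are not taken from the paper; the paper may use different variants and indexes.

```python
import math
import random


def random_oversample(samples, labels, minority_label, seed=0):
    """Replicate randomly chosen minority samples until both classes are the same size.

    Hypothetical helper for illustration; assumes a binary problem.
    """
    rng = random.Random(seed)
    minority = [s for s, y in zip(samples, labels) if y == minority_label]
    if not minority:
        raise ValueError("no minority samples to replicate")
    n_majority = len(labels) - len(minority)
    deficit = max(0, n_majority - len(minority))
    # Draw with replacement from the minority class to fill the deficit.
    extra = [rng.choice(minority) for _ in range(deficit)]
    return list(samples) + extra, list(labels) + [minority_label] * len(extra)


def g_mean(y_true, y_pred, positive_label):
    """Geometric mean of recall on the positive class and recall on the negative class."""
    pairs = list(zip(y_true, y_pred))
    tp = sum(1 for t, p in pairs if t == positive_label and p == positive_label)
    fn = sum(1 for t, p in pairs if t == positive_label and p != positive_label)
    tn = sum(1 for t, p in pairs if t != positive_label and p != positive_label)
    fp = sum(1 for t, p in pairs if t != positive_label and p == positive_label)
    sensitivity = tp / (tp + fn) if (tp + fn) else 0.0
    specificity = tn / (tn + fp) if (tn + fp) else 0.0
    return math.sqrt(sensitivity * specificity)
```

On a 4:1 data set, a classifier that always predicts the majority class reaches 80% accuracy but a G-mean of 0, which illustrates why plain accuracy is a poor index for imbalanced problems.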

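Memo:
Received: 2008-04-23.
Funding: National Natural Science Foundation of China (60375022, 60473040).
About the authors:
YE Zhi-fei, male, born in 1983, master's degree. His main research interests are statistical machine learning and pattern classification.
WEN Yi-min, male, born in 1969, postdoctoral researcher, associate professor, and CCF senior member. His main research interests are statistical learning theory, bioinformatics, and image processing. He has published more than 20 academic papers.
LU Bao-liang, male, born in 1960, professor, doctoral supervisor, Ph.D., and IEEE senior member. His main research interests are brain-like computing theory and models, neural network theory and applications, machine learning, pattern recognition, brain-computer interfaces, and bioinformatics and computational biology. He has published more than 80 academic papers in international journals and conferences, including IEEE Trans. Neural Networks, IEEE Trans. Biomedical Engineering, Neural Networks, and ICCV.
Last Update: 2009-05-04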