A survey of research on imbalanced classification problems - CAAI Transactions on Intelligent Systems

YE Zhi-fei, WEN Yi-min, LU Bao-liang. A survey of imbalanced pattern classification problems[J]. CAAI Transactions on Intelligent Systems, 2009, 4(02): 148-156.





A survey of imbalanced pattern classification problems
YE Zhi-fei 1, WEN Yi-min 2, LU Bao-liang 1,3
1. Department of Computer Science and Engineering, Shanghai Jiao Tong University, Shanghai 200240, China;
2. Department of Information Engineering, Hunan Industry Polytechnic, Changsha 410208, China;
3. MOE-Microsoft Key Lab. for Intelligent Computing and Intelligent Systems, Shanghai Jiao Tong University, Shanghai 200240, China
machine learning; imbalanced pattern classification; resampling; cost-sensitive learning; task decomposition; classifier ensemble; evaluation metrics
Imbalanced data sets have long been regarded as a significant difficulty when applying machine learning methods to real-world pattern classification problems. Although various approaches have been proposed during the past decade, many real-world imbalanced data sets still expose their limitations, and a great deal of further research is under way. In this paper, we provide an up-to-date survey of research on imbalanced pattern classification problems. We first examine the problems that imbalanced data sets bring, and then introduce the different kinds of solutions in detail, together with their representative approaches. Finally, using three real imbalanced data sets, we compare the performance of some typical methods, including resampling, cost-sensitive learning, training set partitioning, and classifier ensembles. In addition, evaluation indexes and future areas of research are also discussed.
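As a rough illustration of two ideas named in the abstract, resampling and evaluation indexes, the following Python sketch shows random undersampling of the majority class and a comparison of plain accuracy with the geometric mean of per-class recall (G-mean). It is not taken from the paper: the function names, the choice of G-mean, and the toy 95-to-5 class split are assumptions made only for this example.

```python
import math
import random
from collections import Counter


def random_undersample(xs, ys, seed=0):
    """Randomly keep as many samples of each class as the smallest class has."""
    rng = random.Random(seed)
    by_class = {}
    for x, y in zip(xs, ys):
        by_class.setdefault(y, []).append(x)
    n_min = min(len(group) for group in by_class.values())
    pairs = [(x, y) for y, group in by_class.items()
             for x in rng.sample(group, n_min)]
    rng.shuffle(pairs)
    return [x for x, _ in pairs], [y for _, y in pairs]


def per_class_recall(y_true, y_pred):
    """Fraction of each true class that was predicted correctly."""
    total = Counter(y_true)
    hits = Counter(t for t, p in zip(y_true, y_pred) if t == p)
    return {c: hits[c] / total[c] for c in total}


def g_mean(y_true, y_pred):
    """Geometric mean of the per-class recalls."""
    recalls = per_class_recall(y_true, y_pred).values()
    return math.prod(recalls) ** (1 / len(recalls))


# Hypothetical toy data: 95 majority samples (label 0), 5 minority samples (label 1).
xs = list(range(100))
ys = [0] * 95 + [1] * 5

# A degenerate classifier that always predicts the majority class.
always_majority = [0] * len(ys)
accuracy = sum(t == p for t, p in zip(ys, always_majority)) / len(ys)
print("accuracy:", accuracy)
print("G-mean:", g_mean(ys, always_majority))

# After undersampling, both classes have the same number of samples.
balanced_xs, balanced_ys = random_undersample(xs, ys)
print("class counts after undersampling:", Counter(balanced_ys))
```

On this toy data the always-majority classifier scores high accuracy while never detecting the minority class, which G-mean exposes. Undersampling discards majority samples, so it trades information for balance; oversampling and cost-sensitive learning are the alternatives the survey lists.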






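LU Bao-liang, male, born in 1960, is a professor, doctoral supervisor, Ph.D., and IEEE Senior Member. His main research interests include brain-like computing theory and models, neural network theory and applications, machine learning, pattern recognition, brain-computer interfaces, bioinformatics, and computational biology. He has published more than 80 papers in international journals and conferences including IEEE Trans. Neural Networks, IEEE Trans. Biomedical Engineering, Neural Networks, and ICCV.
Last Update: 2009-05-04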