[1] HU Xiaosheng, WEN Juping, ZHONG Yong. Imbalanced data ensemble classification using dynamic balance sampling[J]. CAAI Transactions on Intelligent Systems, 2016, 11(2): 257-263. [doi:10.11992/tis.201507015]

Imbalanced data ensemble classification using dynamic balance sampling

Editorial Office of CAAI Transactions on Intelligent Systems [ISSN: 1673-4785 / CN: 23-1538/TP]

Volume:
Vol. 11
Issue:
No. 2, 2016
Pages:
257-263
Section:
Publication date:
2016-04-25

Article Info

Title:
Imbalanced data ensemble classification using dynamic balance sampling
Author(s):
HU Xiaosheng (胡小生), WEN Juping (温菊屏), ZHONG Yong (钟勇)
College of Electronic and Information Engineering, Foshan University, Foshan 528000, Guangdong, China
Keywords:
data mining; classification; imbalanced data; re-sampling; ensemble learning; random forest
CLC number:
TP181
DOI:
10.11992/tis.201507015
Abstract:
Traditional classification algorithms assume a balanced class distribution or equal misclassification costs, so they identify minority-class samples poorly when the data are imbalanced. This paper proposes an ensemble classification method for imbalanced data that combines dynamic balance sampling with Boosting. At the beginning of each iteration, random under-sampling and SMOTE over-sampling are used together to produce a training set of balanced size. The proportion of samples from each class is chosen at random so that the training sets differ from one another, which gives the sub-classifiers a better training platform. Once the sub-classifiers are trained, a weighted vote combines them into the final strong classifier. Experimental results show that the proposed method provides better classification performance on imbalanced data than other approaches.
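The sampling step and the final vote described in the abstract can be illustrated with a small sketch. This is not the authors' implementation: the abstract does not give the exact sampling rule, so the function names (`smote_like`, `dynamic_balance_sample`, `weighted_vote`), the minority-share range `low`/`high`, the neighbour count `k`, and the binary 0/1 labels are all assumptions chosen for illustration. The sketch also assumes the original minority share is below `low`, as is typical of imbalanced data.

```python
import random


def smote_like(minority, n_new, k=5, rng=random):
    """Create n_new synthetic points by interpolating between a minority
    sample and one of its k nearest minority neighbours (SMOTE-style)."""
    synthetic = []
    for _ in range(n_new):
        base = rng.choice(minority)
        # Rank the other minority samples by squared Euclidean distance.
        others = [p for p in minority if p is not base]
        others.sort(key=lambda p: sum((a - b) ** 2 for a, b in zip(base, p)))
        neighbour = rng.choice(others[:k]) if others else base
        gap = rng.random()  # position on the segment base -> neighbour
        synthetic.append(
            tuple(a + gap * (b - a) for a, b in zip(base, neighbour))
        )
    return synthetic


def dynamic_balance_sample(majority, minority, rng=random, low=0.4, high=0.6):
    """Draw one training set of the original size whose minority share is
    picked at random in [low, high] (an assumed range, not from the paper)."""
    total = len(majority) + len(minority)
    n_min = round(total * rng.uniform(low, high))
    n_maj = total - n_min
    # Random under-sampling of the majority class, without replacement.
    maj_part = rng.sample(majority, min(n_maj, len(majority)))
    # Keep every real minority sample, then top up with synthetic ones.
    if n_min <= len(minority):
        min_part = rng.sample(minority, n_min)
    else:
        extra = smote_like(minority, n_min - len(minority), rng=rng)
        min_part = list(minority) + extra
    data = [(x, 0) for x in maj_part] + [(x, 1) for x in min_part]
    rng.shuffle(data)
    return data


def weighted_vote(votes, weights):
    """Combine 0/1 votes of the sub-classifiers using per-classifier weights
    (for example, weights produced by a boosting procedure)."""
    score = sum(w if v == 1 else -w for v, w in zip(votes, weights))
    return 1 if score > 0 else 0


if __name__ == "__main__":
    rng = random.Random(0)
    majority = [(rng.gauss(0, 1), rng.gauss(0, 1)) for _ in range(90)]
    minority = [(rng.gauss(3, 1), rng.gauss(3, 1)) for _ in range(10)]
    for round_no in range(3):
        sample = dynamic_balance_sample(majority, minority, rng=rng)
        n_pos = sum(label for _, label in sample)
        print(round_no, len(sample), n_pos)
```

Calling `dynamic_balance_sample` once per boosting round gives each sub-classifier a differently composed training set of the same size, which is the source of diversity the abstract refers to. A full pipeline would train a base learner on each sample and pass its predictions and weights to `weighted_vote`.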



Memo:
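Received: 2015-7-9; Revised:
Funding: National Spark Program of China (2014GA780031); Natural Science Foundation of Guangdong Province (2015A030313638); Guangdong Higher Education Outstanding Young Innovative Talents Training Program (2013LYM_0097, 2014KQNCX184, 2015KQNCX180); Foshan University institutional research project.
About the authors: HU Xiaosheng, male, born in 1978, lecturer and senior engineer. His main research interests are machine learning, data mining, and artificial intelligence. He has led one project under the Guangdong Provincial Department of Education's Yumiao program, participated in six provincial and municipal or department-level research projects, and published 12 academic papers, four of which are indexed by EI or ISTP. WEN Juping, female, born in 1979, lecturer. Her main research interests are virtual reality and data mining. She has led one research project of the Guangdong Provincial Department of Education, participated in four provincial and department-level research and teaching-reform projects, and published nine academic papers. ZHONG Yong, male, born in 1970, professor, Ph.D. His main research interests are access control, privacy protection, information retrieval, and cloud computing. He has led or participated in more than 10 national and provincial research projects, including projects of the National Natural Science Foundation of China, the National Spark Program, and the provincial Natural Science Foundation, and has published more than 30 academic papers, 10 of which are indexed by SCI or EI.
Corresponding author: HU Xiaosheng. E-mail: feihu@fosu.edu.cn.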
Last Update: 1900-01-01