[1] WANG Junhong, DUAN Bingqian. Research on the SMOTE method based on density[J]. CAAI Transactions on Intelligent Systems, 2017, (06): 865-872. [doi:10.11992/tis.201706049]

Research on the SMOTE method based on density
CAAI Transactions on Intelligent Systems [ISSN: 1673-4785 / CN: 23-1538/TP]

Issue:
2017, No. 06
Pages:
865-872
Publication date:
2017-12-25

Article Info

Title:
Research on the SMOTE method based on density
Author(s):
WANG Junhong, DUAN Bingqian
School of Computer and Information Technology, Shanxi University, Taiyuan 030006, China
Keywords:
imbalance; classification; sampling; precision; density
CLC number:
TP311
DOI:
10.11992/tis.201706049
Abstract:
Resampling techniques are widely used for classifying imbalanced data. The SMOTE (Synthetic Minority Oversampling Technique) algorithm proposed by Chawla alleviates class imbalance to some extent, but it oversamples minority-class data indiscriminately, which can easily lead to over-fitting. To address this problem, this paper presents a new over-sampling method, DS-SMOTE, which identifies sparse samples based on sample density and uses them as seed samples in the sampling process. Following the idea of SMOTE, synthetic samples are then generated between each seed sample and its k nearest neighbors. Experimental results show that, compared with similar methods, DS-SMOTE substantially improves precision and G-mean, indicating that it has certain advantages in handling the classification of imbalanced data.
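The two steps described in the abstract (select sparse seed samples by density, then interpolate SMOTE-style between a seed and one of its k nearest neighbors) can be sketched roughly as follows. This is only an illustrative sketch, not the authors' implementation: the abstract does not give the density definition or the sparsity threshold, so the inverse mean k-neighbor distance and the below-average cutoff used here are assumptions, and the function name `density_seeded_smote` is hypothetical.

```python
import numpy as np


def density_seeded_smote(X_min, n_new, k=5, seed=0):
    """Oversample minority samples, seeding only from sparse (low-density) points.

    ASSUMPTIONS (not from the paper): density is the inverse mean distance to
    the k nearest minority neighbours, and a sample counts as "sparse" when
    its density is below the mean density of the minority class.
    """
    rng = np.random.default_rng(seed)
    X_min = np.asarray(X_min, dtype=float)
    n = len(X_min)
    k = min(k, n - 1)

    # Pairwise Euclidean distances within the minority class.
    diff = X_min[:, None, :] - X_min[None, :, :]
    dist = np.sqrt((diff ** 2).sum(axis=2))
    np.fill_diagonal(dist, np.inf)  # a point is not its own neighbour

    # Indices and distances of the k nearest minority neighbours of each sample.
    nn_idx = np.argsort(dist, axis=1)[:, :k]
    nn_dist = np.take_along_axis(dist, nn_idx, axis=1)

    # Assumed density measure and sparsity rule (see docstring).
    density = 1.0 / (nn_dist.mean(axis=1) + 1e-12)
    seeds = np.flatnonzero(density < density.mean())
    if seeds.size == 0:  # all densities equal: fall back to plain SMOTE seeding
        seeds = np.arange(n)

    # SMOTE step: interpolate between a seed and one of its k neighbours.
    s = rng.choice(seeds, size=n_new)
    nb = nn_idx[s, rng.integers(0, k, size=n_new)]
    gap = rng.random((n_new, 1))
    return X_min[s] + gap * (X_min[nb] - X_min[s]), seeds


# Toy minority class: a tight cluster of four points plus one isolated point.
X_min = np.array([[0.0, 0.0], [0.1, 0.0], [0.0, 0.1], [0.1, 0.1], [3.0, 3.0]])
X_new, seeds = density_seeded_smote(X_min, n_new=4, k=2)
```

In this toy example only the isolated point is selected as a seed, so every synthetic sample lies on a segment between it and one of its two nearest neighbors. Plain SMOTE, by contrast, would draw seeds from all five minority samples.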

References:

[1] CHARTE F, RIVERA A J, JESUS M J D, et al. Addressing imbalance in multilabel classification: Measures and random resampling algorithms[J]. Neurocomputing, 2015, 163: 3-16
[2] RADIVOJAC P, CHAWLA N V, DUNKER A K, et al. Classification and knowledge discovery in protein databases[J]. Journal of Biomedical Informatics, 2004, 37(4): 224-239
[3] LIU Y, CHAWLA N V, HARPER M P, et al. A study in machine learning from imbalanced data for sentence boundary detection in speech[J]. Computer speech and language, 2006, 20(4): 468-494
[4] KUBAT M, HOLTE R C, MATWIN S. Machine learning for the detection of oil spills in satellite radar images[J]. Machine learning, 1998, 30(2): 195-215
[5] QIAN H, HE G. A survey of class-imbalanced data classification[J]. Computer engineering and science, 2010, 5: 025
[6] ZHAI Yun, WANG Shupeng, MA Nan, et al. A data mining method for imbalanced datasets based on one-side link and distribution density of instances[J]. Chinese journal of electronics, 2014, 42(7): 1311-1319
[7] CHARTE F, RIVERA A J, JESUS M J D, et al. Addressing imbalance in multilabel classification: Measures and random resampling algorithms[J]. Neurocomputing, 2015, 163: 3-16
[8] GONG C, GU L. A novel smote-based classification approach to online data imbalance problem[J]. Mathematical problems in engineering, 2016, 35: 1-14
[9] BIAN J, PENG X G, WANG Y, et al. An efficient cost-sensitive feature selection using chaos genetic algorithm for class imbalance problem[J]. Mathematical problems in engineering, 2016, 6: 1-9
[10] CHAWLA N V, BOWYER K W, HALL L O, et al. SMOTE: synthetic minority over-sampling technique[J]. Journal of artificial intelligence research, 2002, 16(1): 321-357
[11] YANG Zhiming, QIAO Liyan, PENG Xiyuan. Research on data mining method for imbalanced dataset based on improved SMOTE[J]. Chinese journal of electronics, 2007, 35(B12): 22-26
[12] HAN H, WANG W Y, MAO B H. Borderline-SMOTE: a new over-sampling method in imbalanced data sets learning[C]//International Conference on Intelligent Computing. Springer Berlin Heidelberg, 2005, 3644(5): 878-887
[13] HE H, BAI Y, GARCIA E A, et al. ADASYN: Adaptive synthetic sampling approach for imbalanced learning[C]//IEEE International Joint Conference on Neural Networks. IEEE Xplore, 2008: 1322-1328
[14] GRZYMALA-BUSSE J W, STEFANOWSKI J, WILK S. A comparison of two approaches to data mining from imbalanced data[J]. Journal of intelligent manufacturing, 2005, 16(6): 565-573
[15] SÁEZ J A, KRAWCZYK B, WOŹNIAK M. Analyzing the oversampling of different classes and types of examples in multi-class imbalanced datasets[J]. Pattern recognition, 2016, 57(C): 164-178
[16] NANNI L, FANTOZZI C, LAZZARINI N. Coupling different methods for overcoming the class imbalance problem[J]. Neurocomputing, 2015, 158(C): 48-61
[17] NAGANJANEYULU S, KUPPA M R. A novel framework for class imbalance learning using intelligent under-sampling[J]. Progress in artificial intelligence, 2013, 2(1): 73-84
[18] ZHANG X, SONG Q, WANG G, et al. A dissimilarity-based imbalance data classification algorithm[J]. Applied intelligence, 2015, 42(3): 544-565
[19] JIANG K, LU J, XIA K. A novel algorithm for imbalance data classification based on genetic algorithm improved SMOTE[J]. Arabian journal for science and engineering, 2016, 41(8): 3255-3266
[20] XU Y, YANG Z, ZHANG Y, et al. A maximum margin and minimum volume hyper-spheres machine with pinball loss for imbalanced data classification[J]. Knowledge-based systems, 2016, 95: 75-85
[21] ANWAR N, JONES G, GANESH S. Measurement of data complexity for classification problems with unbalanced data[J]. Statistical analysis and data mining: the ASA data science journal, 2014, 7(3): 194-211

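Memo:
Received: 2017-06-12.
Funding: National Natural Science Foundation of China (61772323, 61402272); Natural Science Foundation of Shanxi Province (201701D121051).
About the authors: WANG Junhong, female, born in 1979, associate professor, Ph.D.; her main research interests are formal concept analysis, rough sets and granular computing, and data mining. DUAN Bingqian, female, born in 1991, master's student; her main research interest is data mining.
Corresponding author: WANG Junhong. E-mail: wjhwjh@sxu.edu.cn.
Last Update: 2018-01-03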