[1]刘金平,周嘉铭,贺俊宾,等.面向不均衡数据的融合谱聚类的自适应过采样法[J].智能系统学报,2020,15(4):732-739.[doi:10.11992/tis.201909062]
LIU Jinping, ZHOU Jiaming, HE Junbin, et al. Spectral clustering-fused adaptive synthetic oversampling approach for imbalanced data processing[J]. CAAI Transactions on Intelligent Systems, 2020, 15(4): 732-739. [doi:10.11992/tis.201909062]

面向不均衡数据的融合谱聚类的自适应过采样法

CAAI Transactions on Intelligent Systems (《智能系统学报》) [ISSN: 1673-4785 / CN: 23-1538/TP]

Volume:
15
Issue:
No. 4, 2020
Pages:
732-739
Column:
Academic Papers: Machine Learning
Publication date:
2020-07-05

文章信息/Info

Title:
Spectral clustering-fused adaptive synthetic oversampling approach for imbalanced data processing
作者:
刘金平1, 周嘉铭1, 贺俊宾1,2, 唐朝晖3, 徐鹏飞1, 张国勇3
1. 湖南师范大学 智能计算与语言信息处理湖南省重点实验室,湖南 长沙 410081;
2. 湖南省计量检测研究院,湖南 长沙 410014;
3. 中南大学 自动化学院,湖南 长沙 410082
Author(s):
LIU Jinping1, ZHOU Jiaming1, HE Junbin1,2, TANG Zhaohui3, XU Pengfei1, ZHANG Guoyong3
1. Hu’nan Provincial Key Laboratory of Intelligent Computing and Language Information Processing, Hu’nan Normal University, Changsha 410081, China;
2. Hu’nan Institute of Metrology and Test, Changsha 410014, China;
3. School of Automation, Central South University, Changsha 410082, China
关键词:
自适应综合采样法; 不均衡数据集; 谱聚类; 过采样; 模式分类; 数据分布; 有偏分类器; 数据预处理
Keywords:
adaptive synthetic sampling approach (ADASYN); imbalanced dataset; spectral clustering; oversampling; pattern classification; data distribution; biased classifier; data pre-processing
CLC number:
TP391
DOI:
10.11992/tis.201909062
摘要:
分类是模式识别领域中的研究热点,大多数经典的分类器往往默认数据集是分布均衡的,而现实中的数据集往往存在类别不均衡问题,即属于正常/多数类别的数据的数量与属于异常/少数类数据的数量之间的差异很大。若不对数据进行处理往往会导致分类器忽略少数类、偏向多数类,使得分类结果恶化。针对数据的不均衡分布问题,本文提出一种融合谱聚类的综合采样算法。首先采用谱聚类方法对不均衡数据集的少数类样本的分布信息进行分析,再基于分布信息对少数类样本进行过采样,获得相对均衡的样本,用于分类模型训练。在多个不均衡数据集上进行了大量实验,结果表明,所提方法能有效解决数据的不均衡问题,使得分类器对于少数类样本的分类精度得到提升。
Abstract:
Classification is a research hotspot in the field of machine learning. Most classic classifiers assume that the class distribution of the dataset is balanced, whereas real-world datasets often suffer from class imbalance; namely, the number of samples in the normal/majority class differs greatly from the number in the anomaly/minority class. If the data are not processed, the classifier tends to ignore the minority class and become biased towards the majority class, which degrades the classification results. To address the problem of data imbalance, this paper proposes a spectral clustering-fused synthetic sampling algorithm (SCF-ADASYN). First, spectral clustering is employed to analyze the distribution information of the minority-class samples in the imbalanced dataset; then, based on this distribution information, the minority-class samples are oversampled to obtain a relatively balanced dataset for training the classification model. A large number of experiments have been carried out on multiple imbalanced datasets. The results show that SCF-ADASYN can effectively alleviate the imbalance in the dataset and that the classification accuracy of the tested classifiers on the imbalanced datasets is significantly improved.
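The two-stage idea described in the abstract (cluster the minority class, then oversample according to the cluster structure) can be illustrated with a small sketch. This is not the authors' SCF-ADASYN implementation, whose details are not given on this page. It is a minimal illustration under stated assumptions: the function names `spectral_labels` and `cluster_oversample`, the parameters `k` and `gamma`, the RBF affinity, the inverse-cluster-size allocation of synthetic samples, and the SMOTE-style interpolation between two points of the same cluster are all hypothetical choices made for this example.

```python
import numpy as np


def spectral_labels(X, k, gamma=1.0, n_iter=50, seed=0):
    """Basic normalized spectral clustering of the rows of X into k groups."""
    rng = np.random.default_rng(seed)
    # RBF affinity between every pair of samples (assumed kernel).
    sq = ((X[:, None, :] - X[None, :, :]) ** 2).sum(-1)
    W = np.exp(-gamma * sq)
    np.fill_diagonal(W, 0.0)
    # Symmetric normalization D^{-1/2} W D^{-1/2}.
    d_inv_sqrt = 1.0 / np.sqrt(np.maximum(W.sum(axis=1), 1e-12))
    L = W * d_inv_sqrt[:, None] * d_inv_sqrt[None, :]
    # Embed each sample with the k leading eigenvectors, row-normalized.
    _, vecs = np.linalg.eigh(L)
    U = vecs[:, -k:]
    U = U / np.maximum(np.linalg.norm(U, axis=1, keepdims=True), 1e-12)
    # Plain k-means on the spectral embedding.
    centers = U[rng.choice(len(U), size=k, replace=False)]
    for _ in range(n_iter):
        dist = ((U[:, None, :] - centers[None, :, :]) ** 2).sum(-1)
        labels = np.argmin(dist, axis=1)
        for j in range(k):
            if np.any(labels == j):
                centers[j] = U[labels == j].mean(axis=0)
    return labels


def cluster_oversample(X_min, n_new, k=2, gamma=1.0, seed=0):
    """Generate n_new synthetic minority samples, cluster by cluster."""
    rng = np.random.default_rng(seed)
    labels = spectral_labels(X_min, k, gamma=gamma, seed=seed)
    clusters = [X_min[labels == j] for j in range(k) if np.any(labels == j)]
    # Assumption: smaller clusters receive proportionally more new samples.
    weights = np.array([1.0 / len(c) for c in clusters])
    weights /= weights.sum()
    counts = np.floor(weights * n_new).astype(int)
    counts[np.argmax(weights)] += n_new - counts.sum()  # rounding remainder
    synthetic = []
    for c, m in zip(clusters, counts):
        for _ in range(m):
            # Interpolate between two random points of the same cluster.
            a = c[rng.integers(len(c))]
            b = c[rng.integers(len(c))]
            synthetic.append(a + rng.random() * (b - a))
    return np.array(synthetic).reshape(-1, X_min.shape[1])


# Toy data: 60 majority samples and 12 minority samples in two groups.
rng = np.random.default_rng(42)
X_maj = rng.normal(0.0, 1.0, size=(60, 2))
X_min = np.vstack([rng.normal(4.0, 0.3, size=(8, 2)),
                   rng.normal(-4.0, 0.3, size=(4, 2))])
X_new = cluster_oversample(X_min, n_new=len(X_maj) - len(X_min), k=2)
X_min_balanced = np.vstack([X_min, X_new])
print(X_min_balanced.shape)  # (60, 2)
```

Because every synthetic point is an interpolation between two minority samples from the same cluster, the new samples stay inside regions the minority class already occupies, which is the motivation for analyzing the minority distribution before oversampling.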

备注/Memo

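Received: 2019-09-27.
Funding: National Natural Science Foundation of China (61971188, 61771492); NSFC-Guangdong Joint Fund Key Project (U1701261); Hunan Provincial Natural Science Foundation (2018JJ3349); Hunan Provincial Postgraduate Research and Innovation Project (CX20190415).
About the authors: LIU Jinping (刘金平), associate professor, Ph.D.; main research interest: intelligent information processing. ZHOU Jiaming (周嘉铭), master's student; main research interests: data mining and pattern recognition. HE Junbin (贺俊宾), master's student; main research interests: pattern recognition and computer vision.
Corresponding author: LIU Jinping. E-mail: ljp202518@163.com
Last Update: 2020-07-25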