[1] LIU Jinping, ZHOU Jiaming, HE Junbin, et al. Spectral clustering-fused adaptive synthetic oversampling approach for imbalanced data processing[J]. CAAI Transactions on Intelligent Systems, 2020, 15(4): 732-739. [doi:10.11992/tis.201909062]
CAAI Transactions on Intelligent Systems [ISSN 1673-4785/CN 23-1538/TP] Volume:
15
Issue:
No. 4, 2020
Pages:
732-739
Section:
Research Papers: Machine Learning
Publication date:
2020-07-05
- Title:
-
Spectral clustering-fused adaptive synthetic oversampling approach for imbalanced data processing
- Author(s):
-
LIU Jinping1, ZHOU Jiaming1, HE Junbin1,2, TANG Zhaohui3, XU Pengfei1, ZHANG Guoyong3
-
1. Hunan Provincial Key Laboratory of Intelligent Computing and Language Information Processing, Hunan Normal University, Changsha 410081, China;
2. Hunan Institute of Metrology and Test, Changsha 410014, China;
3. School of Automation, Central South University, Changsha 410082, China
-
- Keywords:
-
adaptive synthetic sampling approach (ADASYN); imbalanced dataset; spectral clustering; oversampling; pattern classification; data distribution; biased classifier; data pre-processing
- CLC number:
-
TP391
- DOI:
-
10.11992/tis.201909062
- Abstract:
-
Classification is a research hotspot in pattern recognition. Most classical classifiers assume that the classes in a dataset are balanced, whereas real-world datasets are often imbalanced: the number of samples in the normal (majority) class differs greatly from the number in the abnormal (minority) class. If the data are left unprocessed, the classifier tends to ignore the minority class and favor the majority class, which degrades the classification results. To address the imbalanced data distribution, this paper proposes a spectral clustering-fused adaptive synthetic oversampling algorithm (SCF-ADASYN). First, spectral clustering is used to analyze the distribution of the minority-class samples in the imbalanced dataset. The minority-class samples are then oversampled according to this distribution information, yielding a relatively balanced dataset for training the classification model. Extensive experiments on multiple imbalanced datasets show that SCF-ADASYN effectively alleviates the data imbalance and improves the classification accuracy of the classifiers on minority-class samples.
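The two steps named in the abstract (cluster the minority class, then oversample according to the cluster structure) can be illustrated with a minimal NumPy sketch. This is not the authors' SCF-ADASYN: the abstract does not give the weighting rule, so the sketch assumes that smaller clusters receive more synthetic samples and that new points are linear interpolations between two samples of the same cluster. The function names `spectral_labels` and `cluster_oversample`, the Gaussian affinity width `sigma`, and the cluster count `k` are all hypothetical choices made for illustration.

```python
import numpy as np


def spectral_labels(X, k, sigma=1.0, n_iter=50):
    """Cluster the rows of X into k groups with a basic normalized spectral clustering."""
    # Gaussian (RBF) affinity between all pairs of samples, without self-loops.
    sq_dist = ((X[:, None, :] - X[None, :, :]) ** 2).sum(axis=2)
    W = np.exp(-sq_dist / (2.0 * sigma ** 2))
    np.fill_diagonal(W, 0.0)
    # Symmetrically normalized affinity D^{-1/2} W D^{-1/2}.
    d_inv_sqrt = 1.0 / np.sqrt(np.maximum(W.sum(axis=1), 1e-12))
    M = W * d_inv_sqrt[:, None] * d_inv_sqrt[None, :]
    # Embedding: eigenvectors of the k largest eigenvalues, rows scaled to unit length.
    _, vecs = np.linalg.eigh(M)  # eigenvalues are returned in ascending order
    U = vecs[:, -k:]
    U = U / np.maximum(np.linalg.norm(U, axis=1, keepdims=True), 1e-12)
    # Farthest-point initialization followed by plain k-means (Lloyd's algorithm).
    centers = [U[0]]
    for _ in range(1, k):
        dist = np.min([((U - c) ** 2).sum(axis=1) for c in centers], axis=0)
        centers.append(U[np.argmax(dist)])
    centers = np.array(centers)
    for _ in range(n_iter):
        labels = np.argmin(((U[:, None, :] - centers[None, :, :]) ** 2).sum(axis=2), axis=1)
        for j in range(k):
            if np.any(labels == j):
                centers[j] = U[labels == j].mean(axis=0)
    return labels


def cluster_oversample(X_min, n_new, k=2, sigma=1.0, seed=0):
    """Generate n_new synthetic minority samples, cluster by cluster."""
    rng = np.random.default_rng(seed)
    labels = spectral_labels(X_min, k, sigma)
    sizes = np.bincount(labels, minlength=k).astype(float)
    if not np.any(sizes >= 2):
        raise ValueError("no cluster has at least two samples to interpolate between")
    # ASSUMPTION: weight each cluster by the inverse of its size, so sparse
    # clusters get more synthetic points. Clusters with < 2 samples are skipped.
    weights = np.where(sizes >= 2, 1.0 / np.maximum(sizes, 1.0), 0.0)
    weights = weights / weights.sum()
    counts = np.floor(weights * n_new).astype(int)
    counts[np.argmax(weights)] += n_new - counts.sum()  # rounding remainder
    synthetic = []
    for j in range(k):
        members = X_min[labels == j]
        for _ in range(counts[j]):
            # Interpolate between two distinct samples of the same cluster.
            a, b = members[rng.choice(len(members), size=2, replace=False)]
            synthetic.append(a + rng.random() * (b - a))
    return np.array(synthetic), labels


# Toy minority class: a dense blob of 30 points and a sparse blob of 10 points.
toy_rng = np.random.default_rng(42)
X_min = np.vstack([toy_rng.normal(0.0, 0.5, (30, 2)), toy_rng.normal(8.0, 0.5, (10, 2))])
X_new, labels = cluster_oversample(X_min, n_new=40, k=2)
print(X_new.shape, np.bincount(labels))
```

Because every synthetic point is interpolated inside a single cluster, no sample is placed in the empty region between the two blobs, which is the usual motivation for clustering before oversampling. The published method may weight clusters and choose neighbors differently.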
Last Update:
2020-07-25