[1] LIU Jinping, ZHOU Jiaming, HE Junbin, et al. Spectral clustering-fused adaptive synthetic oversampling approach for imbalanced data processing[J]. CAAI Transactions on Intelligent Systems, 2020, 15(4): 732-739. [doi:10.11992/tis.201909062]
CAAI Transactions on Intelligent Systems [ISSN 1673-4785/CN 23-1538/TP] Volume:
15
Issue:
2020 4
Page number:
732-739
Column:
Academic Papers: Machine Learning
Publication date:
2020-07-05
- Title:
-
Spectral clustering-fused adaptive synthetic oversampling approach for imbalanced data processing
- Author(s):
-
LIU Jinping1; ZHOU Jiaming1; HE Junbin1,2; TANG Zhaohui3; XU Pengfei1; ZHANG Guoyong3
-
1. Hu’nan Provincial Key Laboratory of Intelligent Computing and Language Information Processing, Hu’nan Normal University, Changsha 410081, China;
2. Hu’nan Institute of Metrology and Test, Changsha 410014, China;
3. School of Automation, Central South University, Changsha 410082, China
-
- Keywords:
-
adaptive synthetic sampling approach (ADASYN); imbalanced data set; spectral clustering; oversampling; pattern classification; data distribution; biased classifier; data pre-processing
- CLC:
-
TP391
- DOI:
-
10.11992/tis.201909062
- Abstract:
-
Classification is a research hotspot in machine learning. Most classic classifiers assume that the class distribution of a dataset is roughly balanced, whereas real-world datasets often suffer from class imbalance: the number of samples in the normal (majority) class differs greatly from the number in the anomalous (minority) class. If the data are not processed, the classifier tends to ignore the minority class and be biased towards the majority class, which degrades the classification results. To address data imbalance, this paper proposes a spectral clustering-fused adaptive synthetic oversampling algorithm (SCF-ADASYN). First, spectral clustering is employed to analyze the distribution of the minority-class samples in the imbalanced dataset; the minority-class samples are then oversampled to obtain a relatively balanced dataset, which is used to train the classification model. Extensive experiments have been carried out on multiple imbalanced datasets. The results show that SCF-ADASYN can effectively reduce the imbalance of a dataset and significantly improve the classification accuracy of the tested classifiers on imbalanced datasets.
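The abstract gives only the outline of the method, so the following is a minimal Python sketch of the general idea of cluster-aware oversampling, not the SCF-ADASYN algorithm itself. It assumes the cluster labels of the minority samples have already been obtained (for example, from a spectral clustering step that is not shown here). The function name `oversample_by_cluster`, the size-proportional choice of cluster, and the linear interpolation between two samples of the same cluster are all illustrative assumptions rather than details taken from the paper.

```python
import random
from collections import defaultdict


def oversample_by_cluster(minority, labels, n_new, seed=0):
    """Create n_new synthetic minority samples by interpolating within clusters.

    minority: list of feature vectors (lists of floats).
    labels:   one cluster label per sample, assumed to come from a prior
              clustering step (e.g. spectral clustering, not performed here).
    """
    rng = random.Random(seed)

    # Group the minority samples by their cluster label.
    clusters = defaultdict(list)
    for x, c in zip(minority, labels):
        clusters[c].append(x)

    keys = list(clusters)
    # Assumption: pick clusters with probability proportional to their size.
    weights = [len(clusters[k]) for k in keys]

    synthetic = []
    for _ in range(n_new):
        members = clusters[rng.choices(keys, weights=weights)[0]]
        a = rng.choice(members)
        b = rng.choice(members)
        t = rng.random()  # interpolation factor in [0, 1)
        # The new point lies on the segment between two same-cluster samples,
        # so it never falls in the gap between different clusters.
        synthetic.append([ai + t * (bi - ai) for ai, bi in zip(a, b)])
    return synthetic


# Hypothetical toy data: two well-separated minority clusters.
minority = [[0.0, 0.0], [1.0, 0.0], [0.0, 1.0], [10.0, 10.0], [11.0, 10.0]]
labels = [0, 0, 0, 1, 1]
new_points = oversample_by_cluster(minority, labels, n_new=6)
```

Restricting interpolation to samples of the same cluster is the point of the sketch: synthetic samples stay inside regions where minority data actually occur, instead of being placed between distant groups.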