[1] LENG Qiangkui, SUN Xuezi, MENG Xiangfu. A borderline sample synthesis oversampling method based on KNN and random affine transformation[J]. CAAI Transactions on Intelligent Systems, 2025, 20(2): 329-343. [doi:10.11992/tis.202311038]
CAAI Transactions on Intelligent Systems [ISSN 1673-4785/CN 23-1538/TP]
Volume: 20
Issue: 2025(2)
Pages: 329-343
Column:
Academic Papers: Machine Learning
Publication date:
2025-03-05
- Title:
A borderline sample synthesis oversampling method based on KNN and random affine transformation
- Author(s):
LENG Qiangkui; SUN Xuezi; MENG Xiangfu
School of Electronic and Information Engineering, Liaoning Technical University, Huludao 125105, China
- Keywords:
K-nearest neighbor; linear interpolation; borderline sample; natural distribution; oversampling; three nearest neighbor theory; random affine transformation; imbalanced classification
- CLC:
TP391
- DOI:
10.11992/tis.202311038
- Abstract:
Oversampling is a proven strategy for addressing imbalanced data classification challenges. This paper introduces a borderline sample synthesis oversampling method based on K-nearest neighbor (KNN) and random affine transformation, which improves both the seed sample selection and synthetic sample generation stages of existing oversampling methods. First, the three nearest neighbor theory is applied to establish an effective intrinsic neighborhood relationship between samples and to remove noise from the dataset, reducing the risk of overfitting in subsequent classifiers. Next, the minority-class borderline samples, which are difficult to learn but rich in information, are accurately identified and used as sampling seeds. Finally, the method replaces traditional linear interpolation with local random affine transformation, uniformly generating synthetic samples within the approximate manifold of the original data. Compared with traditional oversampling methods, the proposed method more effectively exploits important borderline information in datasets, thereby enhancing classifier performance. Extensive comparative experiments were conducted on 18 benchmark datasets, pitting the proposed method against 8 classic sampling methods, each combined with 4 different classifiers. The results show that the proposed method achieves higher F1 scores and geometric means (G-mean), addressing the imbalanced data classification problem more effectively. Furthermore, statistical analysis confirms that the method attains a better Friedman ranking.