[1] LENG Qiangkui, SUN Xuezi, MENG Xiangfu. A borderline sample synthesis oversampling method based on KNN and random affine transformation[J]. CAAI Transactions on Intelligent Systems, 2025, 20(2): 329-343. [doi:10.11992/tis.202311038]
CAAI Transactions on Intelligent Systems [ISSN 1673-4785/CN 23-1538/TP] Volume:
20
Issue:
No. 2, 2025
Pages:
329-343
Section:
Academic Papers: Machine Learning
Publication date:
2025-03-05
- Title:
-
A borderline sample synthesis oversampling method based on KNN and random affine transformation
- Author(s):
-
LENG Qiangkui, SUN Xuezi, MENG Xiangfu
-
School of Electronic and Information Engineering, Liaoning Technical University, Huludao 125105, China
-
- Keywords:
-
K-nearest neighbor; linear interpolation; borderline sample; natural distribution; oversampling; three nearest neighbor theory; random affine transformation; imbalanced classification
- CLC number:
-
TP391
- DOI:
-
10.11992/tis.202311038
- Abstract:
-
Oversampling is an effective strategy for handling imbalanced data classification. This paper proposes a borderline sample synthesis oversampling method based on K-nearest neighbor (KNN) and random affine transformation, which improves both the seed sample selection stage and the synthetic sample generation stage of existing oversampling methods. First, the three nearest neighbor theory is introduced to establish an effective intrinsic neighborhood relationship between samples and to remove noise from the dataset, which reduces the overfitting risk of subsequent classifiers. Next, the minority-class borderline samples that are difficult to learn but contain rich information are accurately identified and used as sampling seeds. Finally, local random affine transformation replaces the linear interpolation mechanism, so that synthetic samples are generated uniformly within the approximate manifold of the original data. Compared with traditional oversampling methods, the proposed method exploits the important borderline information in a dataset more fully, giving the classifier more support and improving its classification performance. Extensive comparative experiments were conducted on 18 benchmark datasets against 8 classic sampling methods, each combined with 4 different classifiers. The results show that the proposed method achieves higher F1 scores and geometric means (G-mean) and addresses the imbalanced data classification problem more effectively. In addition, statistical analysis confirms that the method attains a higher Friedman ranking.
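The three stages named in the abstract (noise removal, borderline seed selection, affine generation) can be illustrated with a minimal NumPy sketch. This is not the authors' algorithm: the abstract does not give the exact rules, so the noise criterion (a sample is dropped when none of its 3 nearest neighbors shares its label), the borderline criterion (a minority sample with at least one majority sample among its k nearest neighbors), and the use of Dirichlet weights to form a random convex combination (one special case of an affine combination) are all assumptions made for illustration. The function names and parameters are hypothetical.

```python
import numpy as np


def knn_indices(X, k):
    """Indices of the k nearest neighbors of each row of X (self excluded)."""
    dist = np.linalg.norm(X[:, None, :] - X[None, :, :], axis=2)
    np.fill_diagonal(dist, np.inf)  # a point is never its own neighbor
    return np.argsort(dist, axis=1)[:, :k]


def borderline_affine_oversample(X, y, minority=1, k=5, n_new=10, seed=0):
    """Illustrative three-stage oversampler; every rule below is an assumption.

    Assumes more than k samples remain after filtering and that at least
    one minority sample survives.
    """
    rng = np.random.default_rng(seed)
    X = np.asarray(X, dtype=float)
    y = np.asarray(y)

    # Stage 1 (assumed rule): treat a sample as noise when none of its
    # 3 nearest neighbors carries the same label, and drop it.
    nn3 = knn_indices(X, 3)
    keep = (y[nn3] == y[:, None]).any(axis=1)
    X, y = X[keep], y[keep]

    # Stage 2 (assumed rule): a minority sample is a borderline seed when
    # at least one of its k nearest neighbors is a majority sample.
    nn = knn_indices(X, k)
    is_min = y == minority
    has_majority_neighbor = (y[nn] != minority).any(axis=1)
    seeds = np.flatnonzero(is_min & has_majority_neighbor)
    if seeds.size == 0:  # no borderline sample found: fall back to all minority
        seeds = np.flatnonzero(is_min)

    # Stage 3: replace two-point linear interpolation with a random
    # combination of the seed and its nearest minority neighbors. The
    # weights are nonnegative and sum to 1 (a convex, hence affine, mix).
    X_min = X[is_min]
    synthetic = []
    for _ in range(n_new):
        s = X[rng.choice(seeds)]
        order = np.argsort(np.linalg.norm(X_min - s, axis=1))
        local = X_min[order[: k + 1]]  # the seed itself plus up to k neighbors
        w = rng.dirichlet(np.ones(len(local)))
        synthetic.append(w @ local)
    return X, y, np.array(synthetic)
```

Because each synthetic point here is a convex combination of several neighboring minority samples, it falls inside the local convex hull of those samples instead of on a single line segment between two of them, which is the intuition behind replacing linear interpolation. A true affine combination would also allow negative weights that sum to 1; that variant is left out of this sketch.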
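Memo
Received: 2023-11-24.
Funding: National Natural Science Foundation of China, Young Scientists Program (61602056); National Natural Science Foundation of China, General Program (61772249); Liaoning Provincial Department of Education Project (JYTMS20230819); Liaoning Technical University Doctoral Research Start-up Fund (21-1043).
About the authors:
LENG Qiangkui, professor, doctoral supervisor, Ph.D., senior member of the China Computer Federation. Main research interests: artificial intelligence and machine learning. Has led one National Natural Science Foundation of China Young Scientists project, one Liaoning Provincial doctoral research start-up fund project, one Liaoning Provincial Natural Science Foundation project, and two Liaoning Provincial Department of Education research projects. Has published more than 30 academic papers. E-mail: qkleng@126.com
SUN Xuezi, master's student. Main research interests: artificial intelligence and machine learning. E-mail: 980048119@qq.com
MENG Xiangfu, professor, doctoral supervisor, Ph.D., senior member of the China Computer Federation. Main research interests: spatiotemporal big data analysis, medical image analysis, and artificial intelligence. Has led two National Natural Science Foundation of China projects, one Liaoning Provincial university outstanding young scholar growth program project, and two general projects of the Liaoning Provincial Department of Education. Holds 5 granted invention patents and 10 software copyrights, and has published more than 80 academic papers and 2 monographs. E-mail: marxi@126.com
Corresponding author: LENG Qiangkui. E-mail: qkleng@126.com
Last Update: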
2025-03-05