[1] JI Changpeng (冀常鹏), SHANG Jiaqi (尚佳奇), DAI Wei (代巍). DC-SMOTE oversampling method for an imbalanced dataset[J]. CAAI Transactions on Intelligent Systems (智能系统学报), 2024, 19(3): 525-533. [doi:10.11992/tis.202204013]
CAAI Transactions on Intelligent Systems (《智能系统学报》) [ISSN 1673-4785/CN 23-1538/TP]
Volume: 19
Issue: No. 3, 2024
Pages: 525-533
Column: Academic Papers: Machine Learning
Publication date: 2024-05-05
- Title:
-
DC-SMOTE oversampling method for an imbalanced dataset
- Author(s):
-
JI Changpeng (冀常鹏)1, SHANG Jiaqi (尚佳奇)2, DAI Wei (代巍)1
-
1. School of Electronic and Information Engineering, Liaoning Technical University, Huludao 125105, Liaoning, China;
2. Graduate School, Liaoning Technical University, Huludao 125105, Liaoning, China
-
- Keywords:
-
imbalanced dataset; oversampling; Gaussian kernel function; local gravity; highly imbalanced data; synthetic minority oversampling technique (SMOTE); imbalance ratio; classification
- CLC number:
-
TP181
- DOI:
-
10.11992/tis.202204013
- Document code:
-
2023-09-27
- Abstract:
-
To address the poor classification performance obtained on imbalanced datasets, an oversampling algorithm based on local density and centrality is proposed. For every minority-class sample point in the dataset, the local density is computed with a Gaussian kernel function and the centrality with local gravity. A first type of new sample is synthesized specifically in regions of low local density, which addresses the within-class imbalance problem. The boundary of the minority class is then identified from differences in centrality, and a second type of new sample is synthesized there to strengthen the boundary. In addition, new samples are generated adaptively, which addresses the problem that most oversampling algorithms either do not specify the oversampling amount or blindly pursue equal numbers of samples in the two classes. Finally, experiments on 12 public imbalanced datasets show that the algorithm performs well on both low-imbalance and high-imbalance datasets.
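The first step of the abstract (Gaussian-kernel local density, then synthesis in low-density regions) can be illustrated with a minimal sketch. The abstract does not give the paper's formulas, so everything specific here is an assumption: the density form `rho_i = sum_j exp(-(d_ij / d_c)^2)` is borrowed from density-peak clustering, the cutoff `d_c` and the `quantile` threshold are caller-chosen parameters, the interpolation is plain SMOTE-style, and the function names are hypothetical. The local-gravity centrality, the boundary samples, and the adaptive oversampling amount are not modelled.

```python
import numpy as np


def pairwise_distances(X):
    """Euclidean distance matrix between all rows of X."""
    diff = X[:, None, :] - X[None, :, :]
    return np.sqrt((diff ** 2).sum(axis=-1))


def gaussian_local_density(X_min, d_c):
    """Assumed density form: rho_i = sum_{j != i} exp(-(d_ij / d_c)^2)."""
    kernel = np.exp(-(pairwise_distances(X_min) / d_c) ** 2)
    np.fill_diagonal(kernel, 0.0)  # a point does not count towards its own density
    return kernel.sum(axis=1)


def oversample_sparse_regions(X_min, n_new, d_c, quantile=0.5, seed=0):
    """Synthesize n_new points around low-density minority samples.

    Hypothetical helper: seeds are drawn from samples whose density is at or
    below the given quantile, and each new point is a SMOTE-style
    interpolation between a seed and its nearest minority neighbour.
    """
    rng = np.random.default_rng(seed)
    rho = gaussian_local_density(X_min, d_c)
    sparse_idx = np.flatnonzero(rho <= np.quantile(rho, quantile))

    dist = pairwise_distances(X_min)
    np.fill_diagonal(dist, np.inf)  # exclude self when searching for a neighbour
    nearest = dist.argmin(axis=1)

    seeds = rng.choice(sparse_idx, size=n_new, replace=True)
    gaps = rng.random((n_new, 1))  # interpolation weights in [0, 1)
    return X_min[seeds] + gaps * (X_min[nearest[seeds]] - X_min[seeds])


if __name__ == "__main__":
    # Toy minority class: a tight cluster plus one isolated point.
    X_min = np.array([[0.0, 0.0], [0.1, 0.0], [0.0, 0.1], [0.1, 0.1], [3.0, 3.0]])
    print(gaussian_local_density(X_min, d_c=1.0))
    print(oversample_sparse_regions(X_min, n_new=4, d_c=1.0))
```

In this toy data the isolated point receives the lowest density, so it is among the seeds used for synthesis. In practice the choice of `d_c` strongly affects which samples count as sparse.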
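Memo
Received: 2022-04-10.
About the authors: JI Changpeng, professor; main research interests: signal detection and estimation, intelligent control, electro-hydraulic integration of construction machinery, wireless sensor networks, and computer simulation. Has led or participated in more than 40 completed research projects; awarded one first prize of the Liaoning Province Science and Technology Progress Award, and three first prizes and two second prizes of the Fuxin City Science and Technology Progress Award; holds 6 national invention patents and 16 utility model patents; has published more than 100 academic papers. E-mail: ccp@lntu.edu.cn; SHANG Jiaqi, master's student; main research interests: machine learning and data mining. E-mail: 409516478@qq.com; DAI Wei, lecturer, Ph.D.; main research interests: weak signal detection and information processing; holds 1 national invention patent and 4 software copyrights; has published more than 10 academic papers. E-mail: daiwei0084@126.com
Corresponding author: JI Changpeng. E-mail: ccp@lntu.edu.cn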