[1] HU Feng, LI Luzheng, DAI Jin, et al. Active learning combined with clustering boundary sampling[J]. CAAI Transactions on Intelligent Systems, 2024, 19(2): 482-492. [doi:10.11992/tis.202205020]
CAAI Transactions on Intelligent Systems [ISSN 1673-4785/CN 23-1538/TP] Volume:
19
Issue:
2024, No. 2
Pages:
482-492
Column:
Artificial Intelligence Deans' Forum
Publication date:
2024-03-05
- Title:
-
Active learning combined with clustering boundary sampling
- Author(s):
-
HU Feng (胡峰), LI Luzheng (李路正), DAI Jin (代劲), LIU Qun (刘群)
-
School of Computer Science and Technology, Chongqing University of Posts and Telecommunications (重庆邮电大学 计算机科学与技术学院), Chongqing 400065, China
- Keywords:
-
active learning; machine learning; cluster boundary; density peak clustering; geometric sampling; entropy; version space; active clustering
- Classification number:
-
TP301
- DOI:
-
10.11992/tis.202205020
- Document code:
-
2023-11-20
- Abstract:
-
Active learning is a machine learning approach in which the most valuable samples are selected for labeling. In practice it faces two challenges: it relies on prior assumptions about the classifier, which can cause unexpected drops in classifier performance, and it needs a certain number of samples as a starting condition. Clustering reduces the scale of the problem and is therefore an effective tool for active learning. This study investigates active learning combined with density clustering boundary sampling. First, targeting the cluster boundary regions where classification errors are likely, a density peak clustering boundary point sampling method based on sample density is proposed. On this basis, density entropy is defined and used for a heuristic search of the cluster boundary regions, yielding an active learning method based on cluster boundary sampling. Experimental results show that, compared with five active learning algorithms from the literature, the proposed algorithm achieves equal or even higher classification performance with fewer labels, which indicates that it is an effective active learning algorithm. When labels are scarce (20% of the total number of unlabeled samples), the algorithm achieves good results on metrics such as accuracy and F-score.
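The abstract does not give the authors' formulas, so the following is only a minimal sketch of the general idea of locating cluster boundary points with density peak clustering. It uses the standard density peak quantities (local density `rho` and distance-to-higher-density `delta`) and a simple border rule: a point is a border point if some point within the cutoff distance belongs to another cluster. The function name `cluster_and_border`, the parameters `d_c` and `k`, and the border rule are assumptions made for illustration; the paper's density entropy is not reproduced here because its definition is not stated in this record.

```python
import numpy as np


def cluster_and_border(X, d_c, k):
    """Standard density peak clustering followed by a simple border-point rule.

    X: (n, d) array of samples; d_c: cutoff distance; k: number of centers.
    Returns (labels, border_indices). Illustrative sketch only.
    """
    n = len(X)
    # Pairwise Euclidean distances.
    dist = np.linalg.norm(X[:, None, :] - X[None, :, :], axis=-1)

    # Local density rho_i: number of other points closer than d_c.
    rho = (dist < d_c).sum(axis=1) - 1

    # delta_i: distance to the nearest point that ranks higher in density.
    # Ties in rho are broken by index through the stable sort.
    order = np.argsort(-rho, kind="stable")
    delta = np.zeros(n)
    parent = np.full(n, -1)
    delta[order[0]] = dist[order[0]].max()
    for rank in range(1, n):
        i = order[rank]
        higher = order[:rank]
        j = higher[np.argmin(dist[i, higher])]
        delta[i], parent[i] = dist[i, j], j

    # Centers: the k points with the largest rho * delta.
    centers = np.argsort(-(rho * delta), kind="stable")[:k]
    if order[0] not in centers:  # the densest point must be a center
        centers[-1] = order[0]
    labels = np.full(n, -1)
    labels[centers] = np.arange(k)

    # Every other point inherits the label of its higher-density parent.
    for i in order:
        if labels[i] == -1:
            labels[i] = labels[parent[i]]

    # Border rule (assumption): a neighbor within d_c has a different label.
    close = dist < d_c
    different = labels[:, None] != labels[None, :]
    border = np.where((close & different).any(axis=1))[0]
    return labels, border
```

In an active learning loop, the returned border indices would be natural candidates to send to the annotator first, since they lie where neighboring clusters meet and classification errors are most likely.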
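Memo
Received: 2022-05-17.
Funding: National Key Research and Development Program of China (2018YFC0832102); Key Cooperation Project of the Chongqing Municipal Education Commission (HZ2021008); Natural Science Foundation of Chongqing (cstc2021jcyj-msxmX0849).
About the authors: HU Feng, professor, Ph.D. Main research interests: rough sets, granular computing, and data mining. Has led or participated in 4 National Natural Science Foundation of China projects and participated in 3 key R&D program projects of the Ministry of Science and Technology; as a participant, received one Wu Wenjun Artificial Intelligence Science and Technology Award and one Chongqing Natural Science Award; has published more than 40 academic papers. E-mail: hufeng@cqupt.edu.cn; LI Luzheng, master's student. Main research interests: data mining and active learning. E-mail: isluzheng.li@foxmail.com; DAI Jin, professor, Ph.D., vice dean of the School of Software Engineering, Chongqing University of Posts and Telecommunications. Main research interests: big data knowledge engineering and intelligent information processing. Has undertaken and completed 4 provincial and ministerial-level research projects, published 1 monograph, and published more than 20 academic papers. E-mail: 331545392@qq.com
Corresponding author: HU Feng. E-mail: hufeng@cqupt.edu.cn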
Last Update:
1900-01-01