[1] LI Hang, WANG Jin, ZHAO Rui. Multi-label hypernetwork ensemble learning based on Spark[J]. CAAI Transactions on Intelligent Systems, 2017, 12(5): 624-639. [doi:10.11992/tis.201706033]
CAAI Transactions on Intelligent Systems [ISSN 1673-4785/CN 23-1538/TP] Volume:
12
Issue:
No. 5, 2017
Pages:
624-639
Section:
Academic Papers: Machine Learning
Publication date:
2017-10-25
- Title:
-
Multi-label hypernetwork ensemble learning based on Spark
- Author(s):
-
LI Hang1, WANG Jin2, ZHAO Rui2
-
1. College of Software Engineering, Chongqing University of Posts and Telecommunications, Chongqing 400065, China;
2. Chongqing Key Laboratory of Computational Intelligence, Chongqing University of Posts and Telecommunications, Chongqing 400065, China
-
- Keywords:
-
multi-label learning; hypernetwork; label correlations; Apache Spark; selective ensemble learning
- Classification number:
-
TP181
- DOI:
-
10.11992/tis.201706033
- Abstract:
-
Multi-label learning has attracted considerable attention in recent years in fields such as image recognition and text categorization, and its potential application value continues to grow. Despite rapid progress, two main challenges remain: how to exploit the correlations between labels and how to handle large-scale multi-label data. To address these challenges, we propose SEI-MLHN, a Spark-based multi-label hypernetwork ensemble algorithm that builds on the multi-label hypernetwork (MLHN) algorithm, exploits label correlations effectively, and can process large datasets. First, the algorithm introduces cost sensitivity so that it adapts to imbalanced datasets. Second, it improves the hypernetwork evolutionary learning process, optimizes the loss function, and reduces the algorithm's time complexity. Finally, it applies selective ensemble learning so that it scales to large datasets. Experiments on 11 datasets of different scales show that the proposed algorithm achieves good classification performance and low time complexity and handles large-scale datasets well.
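The abstract names two ingredients, cost sensitivity for imbalanced labels and selective ensembling, without giving their formulas. The sketch below is only a generic illustration of those two ideas in plain Python, not the SEI-MLHN procedure itself: the inverse-frequency cost, the Hamming-accuracy ranking, the majority vote, and every function name here are assumptions made for the example, and no Spark or hypernetwork code is involved.

```python
from typing import List, Sequence

LabelVector = Sequence[int]  # one 0/1 entry per label


def positive_label_costs(y_true: List[LabelVector]) -> List[float]:
    """Assumed cost scheme: weight each label's positives by n_neg / n_pos."""
    n_samples = len(y_true)
    n_labels = len(y_true[0])
    costs = []
    for j in range(n_labels):
        n_pos = sum(row[j] for row in y_true)
        # A label with no positive examples gets a neutral cost of 1.0.
        costs.append((n_samples - n_pos) / n_pos if n_pos else 1.0)
    return costs


def hamming_accuracy(y_true: List[LabelVector], y_pred: List[LabelVector]) -> float:
    """Fraction of (sample, label) entries predicted correctly."""
    total = sum(len(row) for row in y_true)
    correct = sum(t == p for rt, rp in zip(y_true, y_pred) for t, p in zip(rt, rp))
    return correct / total


def select_members(y_val: List[LabelVector],
                   member_preds: List[List[LabelVector]],
                   keep: int) -> List[int]:
    """Indices of the `keep` ensemble members with the best validation accuracy."""
    ranked = sorted(range(len(member_preds)),
                    key=lambda i: hamming_accuracy(y_val, member_preds[i]),
                    reverse=True)
    return sorted(ranked[:keep])


def majority_vote(member_preds: List[List[LabelVector]],
                  selected: List[int]) -> List[List[int]]:
    """Per-label strict majority vote over the selected members only."""
    n_samples = len(member_preds[0])
    n_labels = len(member_preds[0][0])
    return [[int(2 * sum(member_preds[m][s][j] for m in selected) > len(selected))
             for j in range(n_labels)]
            for s in range(n_samples)]
```

In a distributed setting each member's predictions would typically be computed in parallel on partitions of the data; the selection and voting steps above operate only on the collected 0/1 predictions, which is why they can be shown without any cluster code.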
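Memo
Received: 2017-06-09.
Funding: Chongqing Basic and Frontier Research Program (cstc2014jcyjA40001, cstc2014jcyjA40022); Science and Technology Research Project of the Chongqing Education Commission (Natural Science) (KJ1400436).
About the authors: LI Hang, female, born in 1995, is a master's student whose main research interests are machine learning and data mining. WANG Jin, male, born in 1979, is a professor with a Ph.D. whose main research interests are big data parallel processing and distributed computing, and large-scale data mining and machine learning. He has led several national and Chongqing municipal research projects, published more than 50 academic papers (10 of them indexed by SCI), and holds 13 granted patents. ZHAO Rui, male, born in 1990, is a master's student whose main research interests are machine learning and data mining. He has published 2 academic papers, both indexed by EI.
Corresponding author: LI Hang. E-mail: 1326202954@qq.com
Last Update: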
2017-10-25