[1]LI Hang,WANG Jin,ZHAO Rui.Multi-label hypernetwork ensemble learning based on Spark[J].CAAI Transactions on Intelligent Systems,2017,12(5):624-639.[doi:10.11992/tis.201706033]
Copy
CAAI Transactions on Intelligent Systems[ISSN 1673-4785/CN 23-1538/TP] Volume:
12
Number of periods:
2017 5
Page number:
624-639
Column:
学术论文—机器学习
Public date:
2017-10-25
- Title:
-
Multi-label hypernetwork ensemble learning based on Spark
- Author(s):
-
LI Hang1; WANG Jin2; ZHAO Rui2
-
1. College of Software Engineering, Chongqing University of Posts and Telecommunications, Chongqing 400065, China;
2. Chongqing Key Laboratory of Computational Intelligence, Chongqing University of Posts and Telecommunications, Chongqing 400065, China
-
- Keywords:
-
multi-label learning; hypernetwork; label correlations; Apache Spark; selective ensemble learning
- CLC:
-
TP181
- DOI:
-
10.11992/tis.201706033
- Abstract:
-
Multi-label learning has attracted a great deal of attention in recent years and has a wide range of potential real-world applications, including image identification and text categorization. Although great effort has been expended in the development of multi-label learning, two main challenges remain, i.e., how to utilize the correlation between labels and how to tackle large-scale multi-label data. To solve these challenges, based on the multi-label hypernetwork (MLHN) algorithm, in this paper, we propose a Spark-based multi-label hypernetwork ensemble algorithm (SEI-MLHN) that effectively utilizes label correlation and can deal with large-scale multi-label datasets. First, the algorithm introduces cost sensitivity to enable it to adapt to unbalanced datasets. Secondly, it improves the hypernetwork evolution learning process, optimizes the loss function, and reduces the inherent time complexity. Lastly, it uses selective ensemble learning to enable it to adapt to large-scale datasets. We conducted experiments on 11 datasets wit different scales. The results show that the proposed algorithm demonstrates excellent categorization performance, low time complexity, and the capability to handle large-scale datasets.