[1] LI Hang, WANG Jin, ZHAO Rui. Multi-label hypernetwork ensemble learning based on Spark[J]. CAAI Transactions on Intelligent Systems, 2017, 12(05): 624-639. [doi:10.11992/tis.201706033]

Multi-label hypernetwork ensemble learning based on Spark

CAAI Transactions on Intelligent Systems [ISSN: 1673-4785 / CN: 23-1538/TP]

Volume:
12
Issue:
2017, No. 05
Pages:
624-639
Column:
Publication date:
2017-10-25

Article Info

Title:
Multi-label hypernetwork ensemble learning based on Spark
Author(s):
LI Hang¹, WANG Jin², ZHAO Rui²
1. College of Software Engineering, Chongqing University of Posts and Telecommunications, Chongqing 400065, China;
2. Chongqing Key Laboratory of Computational Intelligence, Chongqing University of Posts and Telecommunications, Chongqing 400065, China
Keywords:
multi-label learning; hypernetwork; label correlations; Apache Spark; selective ensemble learning
CLC number:
TP181
DOI:
10.11992/tis.201706033
Abstract:
Multi-label learning has attracted considerable attention in recent years and has a wide range of potential real-world applications, including image recognition and text categorization. Despite rapid progress, two main challenges remain: how to exploit the correlations between labels, and how to handle large-scale multi-label data. To address both, we build on the multi-label hypernetwork (MLHN) algorithm and propose SEI-MLHN, a Spark-based multi-label hypernetwork ensemble algorithm that exploits label correlations and scales to large datasets. First, the algorithm introduces cost sensitivity so that it adapts to imbalanced datasets. Second, it improves the hypernetwork evolutionary learning process and optimizes the loss function, which reduces the time complexity. Finally, it applies selective ensemble learning so that it adapts to large-scale datasets. Experiments on 11 datasets of different scales show that the proposed algorithm achieves good classification performance and low time complexity, and handles large-scale datasets well.

Memo:
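Received: 2017-06-09.
Funding: Chongqing Basic and Frontier Research Program (cstc2014jcyjA40001, cstc2014jcyjA40022); Scientific and Technological Research Program of the Chongqing Municipal Education Commission (Natural Science) (KJ1400436).
About the authors: LI Hang, female, born in 1995, master's student; her main research interests are machine learning and data mining. WANG Jin, male, born in 1979, professor, Ph.D.; his main research interests are parallel processing and distributed computing for big data, and large-scale data mining and machine learning. He has led several national and Chongqing municipal research projects, has published more than 50 academic papers, 10 of them indexed by SCI, and holds 13 granted patents. ZHAO Rui, male, born in 1990, master's student; his main research interests are machine learning and data mining. He has published 2 academic papers, both indexed by EI.
Corresponding author: LI Hang. E-mail: 1326202954@qq.com
Last Update: 2017-10-25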