ZHANG Gang, XIE Xiaoshan, HUANG Ying, et al. An online multi-kernel learning algorithm for big data[J]. CAAI Transactions on Intelligent Systems, 2014, 9(03): 355-363. [doi:10.3969/j.issn.1673-4785.201403067]

An online multi-kernel learning algorithm for big data

CAAI Transactions on Intelligent Systems [ISSN:1673-4785/CN:23-1538/TP]

Volume:
9
Issue:
2014, No. 3
Pages:
355-363
Publication date:
2014-06-25

Article Info

Title:
An online multi-kernel learning algorithm for big data
Author(s):
ZHANG Gang, XIE Xiaoshan, HUANG Ying, WANG Chunru
School of Automation, Guangdong University of Technology, Guangzhou 510006, China
Keywords:
big data stream; online multi-kernel learning; manifold learning; data-dependent kernel; semi-supervised learning
CLC number:
TP18
DOI:
10.3969/j.issn.1673-4785.201403067
Abstract:
In machine learning, the choice of kernel function strongly affects the performance of kernel-based learners, and an effective kernel can be obtained through kernel learning. We present a semi-supervised online multi-kernel learning algorithm for big data streams, which updates the current kernel function in an online manner from the stream segment currently being read. The algorithm adjusts the kernel parameters in a supervised manner using the labels carried by the stream, and simultaneously modifies them in an unsupervised manner through manifold learning, so that the contour surfaces of the kernel follow, as far as possible, some low-dimensional manifold of the data. Its novelty is that supervised and unsupervised kernel learning proceed at the same time, and the data are scanned only once, with no second pass over historical data; this lowers the time complexity and makes the algorithm suitable for kernel learning on big data and high-speed data streams, while the unsupervised component copes with partially missing labels in big data streams. Evaluations on synthetic datasets generated by MOA and on big-data benchmark datasets from the UCI repository show that the proposed algorithm is effective.
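The single-pass, multi-kernel flavor of the abstract can be illustrated with a minimal sketch of the general online multiple kernel learning framework (in the spirit of Jin et al., ref. [9]): a pool of kernel perceptrons combined by Hedge-style weights, each stream element seen exactly once. This is an assumption-laden illustration, not the paper's algorithm; in particular, the semi-supervised, manifold-based kernel update is omitted, and all class and function names here are hypothetical.

```python
import numpy as np

def make_rbf(gamma):
    """RBF kernel k(x, z) = exp(-gamma * ||x - z||^2)."""
    def k(x, z):
        d = x - z
        return float(np.exp(-gamma * (d @ d)))
    return k

class OnlineMultiKernelPerceptron:
    """Hedge-weighted pool of kernel perceptrons, trained in one pass.

    Sketch of online multiple kernel classification (cf. ref. [9]);
    the paper's unsupervised manifold-based update is NOT reproduced.
    """

    def __init__(self, kernels, beta=0.8):
        self.kernels = kernels
        self.w = np.ones(len(kernels))         # Hedge weight per kernel
        self.beta = beta                       # discount for a mistaken kernel
        self.supports = [[] for _ in kernels]  # per-kernel mistake set

    def _score(self, i, x):
        # Kernel-perceptron decision value for kernel i.
        return sum(y * self.kernels[i](sx, x) for sx, y in self.supports[i])

    def predict(self, x):
        # Weighted vote over the per-kernel predictions.
        votes = sum(self.w[i] * np.sign(self._score(i, x))
                    for i in range(len(self.kernels)))
        return 1 if votes >= 0 else -1

    def partial_fit(self, x, y):
        # Each labelled stream element is processed exactly once.
        for i in range(len(self.kernels)):
            if np.sign(self._score(i, x)) != y:  # kernel i errs on (x, y)
                self.supports[i].append((x, y))  # perceptron update
                self.w[i] *= self.beta           # Hedge: discount this kernel
        self.w /= self.w.sum()                   # keep weights normalised

# Toy stream: the label is the sign of the first coordinate, margin ~1.
rng = np.random.default_rng(0)
X = rng.uniform(-2.0, 2.0, size=(200, 2))
X[:, 0] += np.where(X[:, 0] >= 0, 0.5, -0.5)
Y = np.where(X[:, 0] >= 0, 1, -1)

learner = OnlineMultiKernelPerceptron([make_rbf(g) for g in (0.1, 1.0, 10.0)])
for x, y in zip(X, Y):                           # one pass over the stream
    learner.partial_fit(x, y)

accuracy = np.mean([learner.predict(x) == y for x, y in zip(X, Y)])
```

After the single pass, kernels that made many mistakes carry small Hedge weights, so the combined prediction is dominated by the kernels best matched to the stream; this mirrors the abstract's claim that the kernel is learned online without rescanning history.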

References:

[1] GOPALKRISHNAN V, STEIER D, LEWIS H, et al. Big data, big business:bridging the gap[C]//Proceedings of the 1st International Workshop on Big Data, Streams and Heterogeneous Source Mining:Algorithms, Systems, Programming Models and Applications. Beijing, China, 2012:7-11.
[2] YANG H, FONG S. Incrementally optimized decision tree for noisy big data[C]//Proceedings of the 1st International Workshop on Big Data, Streams and Heterogeneous Source Mining:Algorithms, Systems, Programming Models and Applications. Beijing, China, 2012:36-44.
[3] JORDAN M I. Divide-and-conquer and statistical inference for big data[C]//Proceedings of the 18th ACM SIGKDD International Conference on Knowledge Discovery and Data Mining. Beijing, China, 2012:4.
[4] ACAR U A, CHEN Y. Streaming big data with self-adjusting computation[C]//Proceedings of the 2013 Workshop on Data Driven Functional Programming. Rome, Italy, 2013:15-18.
[5] ARI I, CELEBI O F, OLMEZOGULLARI E. Data stream analytics and mining in the cloud[C]//Proceedings of the 2012 IEEE 4th International Conference on Cloud Computing Technology and Science. Washington, DC, USA, 2012:857-862.
[6] AGMON S. The relaxation method for linear inequalities[J]. Canadian Journal of Mathematics, 1954, 6(3):393-404.
[7] GONEN M, ALPAYDIN E. Multiple kernel learning algorithms[J]. Journal of Machine Learning Research, 2011(12):2211-2268.
[8] ORABONA F, JIE L, CAPUTO B. Multi kernel learning with online-batch optimization[J]. Journal of Machine Learning Research, 2012(13):227-253.
[9] JIN R, HOI S C H, YANG T, et al. Online multiple kernel learning:algorithms and mistake bounds[J]. Algorithmic Learning Theory, 2010(6331):390-404.
[10] QIN C, RUSU F. Scalable I/O-bound parallel incremental gradient descent for big data analytics in GLADE[C]//Proceedings of the Second Workshop on Data Analytics in the Cloud. New York, USA, 2013:16-20.
[11] SINDHWANI V, NIYOGI P, BELKIN M. Beyond the point cloud:from transductive to semi-supervised learning[C]//Proceedings of the 22nd International Conference on Machine Learning. Bonn, Germany, 2005:824-831.
[12] LI Hongwei, LIU Yang, LU Hanqing, et al. Gaussian processes classification combined with semi-supervised kernels[J]. Acta Automatica Sinica, 2009, 35(7):888-895.
[13] ZOU Hengming. The mind of the computer: philosophical principles of operating systems[M]. Beijing: China Machine Press, 2012:100-102.
[14] BIFET A, HOLMES G, KIRKBY R, et al. MOA:massive online analysis[J]. Journal of Machine Learning Research, 2010(11):1601-1604.
[15] KREMER H, KRANEN P, JANSEN T, et al. An effective evaluation measure for clustering on evolving data streams[C]//Proceedings of the 17th ACM SIGKDD International Conference on Knowledge Discovery and Data Mining. San Diego, California, USA, 2011:868-876.
[16] BIFET A, HOLMES G, PFAHRINGER B, et al. Mining frequent closed graphs on evolving data streams[C]//Proceedings of the 17th ACM SIGKDD International Conference on Knowledge Discovery and Data Mining. San Diego, USA, 2011:591-599.
[17] ORABONA F, LUO Jie, CAPUTO B. Multi kernel learning with online-batch optimization[J]. Journal of Machine Learning Research, 2012(13):227-253.
[18] HOI S C H, JIN Rong, ZHAO Peilin, et al. Online multiple kernel classification[J]. Machine Learning, 2013, 90(2):289-316.
[19] UCI Machine Learning Repository[EB/OL]. http://archive.ics.uci.edu/ml/. [2014-03-18].
[20] YANG Haiqin, LYU M R, KING I. Efficient online learning for multitask feature selection[J]. ACM Transactions on Knowledge Discovery from Data, 2013, 7(2):6-27.
[21] CHEN Jianhui, LIU Ji, YE Jieping. Learning incoherent sparse and low-rank patterns from multiple tasks[J]. ACM Transactions on Knowledge Discovery from Data, 2012, 5(4):22-31.
[22] HONG Chaoqun, ZHU Jianke. Hypergraph-based multi-example ranking with sparse representation for transductive learning image retrieval[J]. Neurocomputing, 2013(101):94-103.
[23] YU Jun, BIAN Wei, SONG Mingli, et al. Graph based transductive learning for cartoon correspondence construction[J]. Neurocomputing, 2012(79):105-114.

Memo:
Received: 2014-03-25.
Foundation item: Supported by the National Natural Science Foundation of China (81373883).
Biography: XIE Xiaoshan, born in 1990, is an M.S. candidate and the author of 3 academic papers. Her research interests include machine learning, data mining, pattern recognition, and biomedical image processing.
Corresponding author: ZHANG Gang, born in 1979, is a lecturer, Ph.D. candidate, and CCF member. His research interests include machine learning, data mining, and bioinformatics. He has participated in 1 National Natural Science Foundation of China project and 1 Guangdong Natural Science Foundation team project, holds 2 software copyrights and 4 patents, and has published more than 40 academic papers, of which 3 are indexed by SCI and more than 20 by EI. E-mail: ipx@gdut.edu.cn.