<-上一篇/Previous Article 下一篇/Next Article->

[1]温有福,贾彩燕,陈智能.一种多模态融合的网络视频相关性度量方法[J].智能系统学报编辑部,2016,11(3):359-365.[doi:10.11992/tis.201603040]
　WEN Youfu,JIA Caiyan,CHEN Zhineng.A multi-modal fusion approach for measuring web video relatedness[J].CAAI Transactions on Intelligent Systems,2016,11(3):359-365.[doi:10.11992/tis.201603040]

点击复制

一种多模态融合的网络视频相关性度量方法

PDF下载 HTML

《智能系统学报》编辑部[ISSN 1673-4785/CN 23-1538/TP] 卷: 11 期数: 2016年第3期页码: 359-365 栏目: 学术论文—机器感知与模式识别出版日期: 2016-06-25

Title:: A multi-modal fusion approach for measuring web video relatedness

作者:: 温有福^1,2, 贾彩燕¹, 陈智能²; 1. 北京交通大学交通数据分析与数据挖掘北京市重点实验室, 北京 100044;
2. 中国科学院自动化研究所数字内容技术与服务研究中心, 北京 100190

Author(s):: WEN Youfu^1,2, JIA Caiyan¹, CHEN Zhineng²; 1. Beijing Key Lab of Traffic Data Analysis and Mining, Beijing Jiaotong University, Beijing 100044, China;
2. Interactive Media Research and Services Center, Institute of Automation, Chinese Academy of Sciences, Beijing 100190, China

关键词:: 网络视频; 海量视频; 社会特征; 交互; 多源异构信息; 多模态信息融合; 相关性度量; 视频检索

Keywords:: web video; large-scale video; social feature; human-video interactions; multi-source heterogeneous information; social features; multi-modal fusion; relatedness measurement; video retrieval

分类号:: TP393

DOI:: 10.11992/tis.201603040

摘要:: 随着网络和多媒体技术的发展,视频分享网站中的网络视频数量呈爆炸式增长。海量视频库中的高精度视频检索、分类、标注等任务成为亟待解决的研究问题。视频间的相关性度量是这些问题所面临的一个共性基础技术。本文从视频视觉内容,视频标题和标签文本,以及视频上传时间、类别、作者3种人与视频交互产生的社会特征等多源异构信息出发,提出一种新颖的多模态融合的网络视频相关性度量方法,并将所获相关性应用到大规模视频检索任务中。YouTube数据上的实验结果显示:相对于传统单一文本特征、单一视觉特征的检索方案,以及文本和视觉特征相融合的检索方案,文本视觉和用户社会特征多模态融合方法表现出更好的性能。

Abstract:: With the advances in internet and multimedia technologies, the number of web videos on social video platforms rapidly grows. Therefore, tasks such as large-scale video retrieval, classification, and annotation become issues that need to be urgently addressed. Web video relatedness serves as a basic and common infrastructure for these issues. This paper investigates the measurement of web video relatedness from a multi-modal fusion perspective. It proposes to measure web video relatedness based on multi-source heterogeneous information. The multi-modal fusion simultaneously leverages videos’ visual content, title, and tag text as well as social features contributed by human-video interactions (i.e., the upload time, channel, and author of a video). Consequently, a novel multi-modal fusion approach is proposed for computing web video relatedness, which serves to give a ranking criterion and is applied to the task of large-scale video retrieval. Experimental results using YouTube videos show that the proposed text, visual, and users’ social feature multi-modal fusion approach performs best in comparison tests with three alternate approaches; i.e., those approaches that compute web video relatedness based just on text features, just on visual features, or jointly on text and visual features.

参考文献/References:: [1] ZHU Weiyu, TOKLU C, LIOU S P. Automatic news video segmentation and categorization based on closed-captioned text[C]//Proceedings of IEEE International Conference on Multimedia and Expo. Tokyo, Japan, 2001: 829-832.
[2] BREZEALE D, COOK D J. Using closed captions and visual features to classify movies by genre[C]//Poster Session of the Seventh International Workshop on Multimedia Data Mining. Philadelphia, Pennsylvania, USA, 2006.
[3] SCHMIEDEKE S, KELM P, SIKORA T. TUB @ MediaEval 2011 genre tagging task: prediction using bag-of-(visual)-words approaches[C]//Working Notes Proceedings of the MediaEval 2011 Workshop. Pisa, Italy, 2011: 1-2.
[4] LAW-TO J, CHEN Li, JOLY A, et al. Video copy detection: a comparative study[C]//Proceedings of the 6th ACM International Conference on Image and Video Retrieval. New York, NY, USA, 2007: 371-378.
[5] WU Xiao, HAUPTMANN A G, NGO C W. Practical elimination of near-duplicates from web video search[C]//Proceedings of the 15th ACM International Conference on Multimedia. New York, NY, USA, 2007: 218-227.
[6] SONG Jingkuan, YANG Yi, HUANG Zi, et al. Multiple feature hashing for real-time large scale near-duplicate video retrieval[C]//Proceedings of the 19th ACM International Conference on Multimedia. New York, NY, USA, 2011: 423-432.
[7] PERRONNIN F, DANCE C. Fisher kernels on visual vocabularies for image categorization[C]//Proceedings of IEEE Conference on Computer Vision and Pattern Recognition. Minneapolis, MN, USA, 2007: 1-8.
[8] JéGOU H, DOUZE M, SCHMID C, et al. Aggregating local descriptors into a compact image representation[C]//Proceedings of IEEE Conference on Computer Vision and Pattern Recognition (CVPR). San Francisco, CA, USA, 2010: 3304-3311.
[9] TAN H K, NGO C W, HONG R, et al. Scalable detection of partial near-duplicate videos by visual-temporal consistency[C]//Proceedings of the 17th ACM International Conference on Multimedia. New York, NY, USA, 2009: 145-154.
[10] FENG Bailan, CAO Juan, CHEN Zhineng, et al. Multi-modal query expansion for web video search[C]//Proceedings of the 33rd International ACM SIGIR Conference on Research and Development in Information Retrieval. New York, NY, USA, 2010: 721-722.
[11] BREZEALE D, COOK D J. Automatic video classification: a survey of the literature[J]. IEEE Transactions on Systems, Man, and Cybernetics, Part C (Applications and Reviews), 2008, 38(3): 416-430.
[12] YANG Linjun, LIU Jiemin, YANG Xiaokang, et al. Multi-modality web video categorization[C]//Proceedings of the International Workshop on Workshop on Multimedia Information Retrieval. New York, NY, USA, 2007: 265-274.
[13] WU Xiao, ZHAO Wanlei, NGO C W. Towards google challenge: combining contextual and social information for web video categorization[C]//Proceedings of the 17th ACM International Conference on Multimedia. New York, NY, USA, 2009: 1109-1110.
[14] DAVIDSON J, LIEBALD B, LIU J, et al. The YouTube video recommendation system[C]//Proceedings of the 4th ACM Conference on Recommender Systems. New York, NY, USA, 2010: 293-296.
[15] ZHAO Wanlei, WU Xiao, NGO C W. On the annotation of web videos by efficient near-duplicate search[J]. IEEE Transactions on Multimedia, 2010, 12(5): 448-461.
[16] TAN H K, NGO C W, CHUA T S. Efficient mining of multiple partial near-duplicate alignments by temporal network[J]. IEEE Transactions on Circuits and Systems for Video Technology, 2010, 20(11): 1486-1498.
[17] CAO J, ZHANG Y D, SONG Y D, et al. MCG-WEBV: a benchmark dataset for web video analysis[R]. Technical Report, Beijing, China: Institute of Computing Technology, 2009: 324-334.
[18] JIANG Yugang, JIANG Yudong, WANG Jiajun. VCDB: a large-scale database for partial copy detection in videos[M]//FLEET D, PAJDLA T, SCHIELE B, et al. Computer Vision-ECCV 2014. Zurich, Switzerland: Springer, 2014: 357-371.

备注/Memo

收稿日期:2016-3-19;改回日期:。
基金项目:国家自然科学基金项目(61473030,61303175);重点大学研究基金项目(2014JBM031);重点实验室数字媒体技术开放课题
作者简介:温有福,男,1991年生,硕士研究生,主要研究方向为视频/图像检索、社交网络分析。贾彩燕,女,1976年生,副教授,博士生导师,博士,主要研究方向为数据挖掘、社会计算、文本挖掘及生物信息学。近年来主持国家自然科学基金面上项目1项,主持国家自然科学基金青年基金项目和面上项目1项;参加国家自然科学基金重点项目、国家科技重大专项、北京市自然科学基金项目各1项;获得湖南省科学技术进步二等奖1项,发表学术论文40余篇。陈智能,男,1982年生,副研究员,博士,主要研究方向为多媒体内容分析与检索、机器学习、图像处理。近年来主持国家自然科学基金青年基金1项,发表学术论文20余篇。
通讯作者:贾彩燕.E-mail:cyjia@bjtu.edu.cn.

更新日期/Last Update: 1900-01-01

一种多模态融合的网络视频相关性度量方法 PDF下载HTML

备注/Memo

一种多模态融合的网络视频相关性度量方法

PDF下载 HTML