WEN Youfu, JIA Caiyan, CHEN Zhineng. A multi-modal fusion approach for measuring web video relatedness[J]. CAAI Transactions on Intelligent Systems, 2016, 11(3): 359-365. [doi: 10.11992/tis.201603040]

A multi-modal fusion approach for measuring web video relatedness

CAAI Transactions on Intelligent Systems Editorial Office [ISSN: 1673-4785 / CN: 23-1538/TP]

Volume:
Vol. 11
Issue:
No. 3, 2016
Pages:
359-365
Published:
2016-06-25

Article Info

Title:
A multi-modal fusion approach for measuring web video relatedness
Author(s):
WEN Youfu¹,², JIA Caiyan¹, CHEN Zhineng²
1. Beijing Key Lab of Traffic Data Analysis and Mining, Beijing Jiaotong University, Beijing 100044, China;
2. Interactive Media Research and Services Center, Institute of Automation, Chinese Academy of Sciences, Beijing 100190, China
Keywords:
web video; large-scale video; social feature; human-video interaction; multi-source heterogeneous information; multi-modal fusion; relatedness measurement; video retrieval
CLC number:
TP393
DOI:
10.11992/tis.201603040
Abstract:
With advances in internet and multimedia technologies, the number of web videos on social video platforms has grown explosively, so high-accuracy retrieval, classification, and annotation over large-scale video collections have become pressing research problems. Measuring the relatedness between videos is a basic technique common to all of them. This paper measures web video relatedness from a multi-modal fusion perspective, drawing on multi-source heterogeneous information: a video's visual content; its title and tag text; and the social features contributed by human-video interactions, namely the video's upload time, channel, and author. On this basis, a novel multi-modal fusion approach for computing web video relatedness is proposed; the resulting relatedness score serves as a ranking criterion and is applied to large-scale video retrieval. Experimental results on YouTube videos show that the proposed fusion of text, visual, and users' social features outperforms three alternatives: retrieval based on text features alone, on visual features alone, or on text and visual features combined.
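The abstract describes combining per-modality evidence (text, visual, social) into a single relatedness score used for ranking, but does not state the fusion rule itself. A weighted late-fusion of per-modality similarities is a common baseline for this kind of multi-modal combination; the sketch below is a minimal illustration under that assumption, and all names and weights in it are hypothetical, not taken from the paper.

```python
# Hypothetical late-fusion sketch: each video is represented by one feature
# vector per modality (text, visual, social); relatedness is a weighted sum
# of per-modality cosine similarities. The modality names, vectors, and
# weights are illustrative assumptions, not the paper's actual method.
import math

def cosine(a, b):
    """Cosine similarity between two equal-length feature vectors."""
    dot = sum(x * y for x, y in zip(a, b))
    na = math.sqrt(sum(x * x for x in a))
    nb = math.sqrt(sum(y * y for y in b))
    return dot / (na * nb) if na and nb else 0.0

def relatedness(v1, v2, weights=(0.4, 0.4, 0.2)):
    """Fuse text, visual, and social similarities of two videos.

    v1, v2: dicts mapping modality name -> feature vector.
    weights: per-modality fusion weights (assumed; sum to 1 so the
    score stays in [0, 1] for non-negative features).
    """
    sims = [cosine(v1[m], v2[m]) for m in ("text", "visual", "social")]
    return sum(w * s for w, s in zip(weights, sims))

# Toy feature vectors for two videos (purely illustrative).
a = {"text": [1.0, 0.0, 1.0], "visual": [0.2, 0.8], "social": [1.0, 1.0]}
b = {"text": [1.0, 1.0, 0.0], "visual": [0.1, 0.9], "social": [1.0, 0.0]}
score = relatedness(a, b)
```

A real system would replace the toy vectors with, e.g., bag-of-words text features, visual descriptors, and encoded upload-time/channel/author attributes, and would tune the weights on validation data.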

References:

[1] ZHU Weiyu, TOKLU C, LIOU S P. Automatic news video segmentation and categorization based on closed-captioned text[C]//Proceedings of IEEE International Conference on Multimedia and Expo. Tokyo, Japan, 2001: 829-832.
[2] BREZEALE D, COOK D J. Using closed captions and visual features to classify movies by genre[C]//Poster Session of the Seventh International Workshop on Multimedia Data Mining. Philadelphia, Pennsylvania, USA, 2006.
[3] SCHMIEDEKE S, KELM P, SIKORA T. TUB @ MediaEval 2011 genre tagging task: prediction using bag-of-(visual)-words approaches[C]//Working Notes Proceedings of the MediaEval 2011 Workshop. Pisa, Italy, 2011: 1-2.
[4] LAW-TO J, CHEN Li, JOLY A, et al. Video copy detection: a comparative study[C]//Proceedings of the 6th ACM International Conference on Image and Video Retrieval. New York, NY, USA, 2007: 371-378.
[5] WU Xiao, HAUPTMANN A G, NGO C W. Practical elimination of near-duplicates from web video search[C]//Proceedings of the 15th ACM International Conference on Multimedia. New York, NY, USA, 2007: 218-227.
[6] SONG Jingkuan, YANG Yi, HUANG Zi, et al. Multiple feature hashing for real-time large scale near-duplicate video retrieval[C]//Proceedings of the 19th ACM International Conference on Multimedia. New York, NY, USA, 2011: 423-432.
[7] PERRONNIN F, DANCE C. Fisher kernels on visual vocabularies for image categorization[C]//Proceedings of IEEE Conference on Computer Vision and Pattern Recognition. Minneapolis, MN, USA, 2007: 1-8.
[8] JÉGOU H, DOUZE M, SCHMID C, et al. Aggregating local descriptors into a compact image representation[C]//Proceedings of IEEE Conference on Computer Vision and Pattern Recognition (CVPR). San Francisco, CA, USA, 2010: 3304-3311.
[9] TAN H K, NGO C W, HONG R, et al. Scalable detection of partial near-duplicate videos by visual-temporal consistency[C]//Proceedings of the 17th ACM International Conference on Multimedia. New York, NY, USA, 2009: 145-154.
[10] FENG Bailan, CAO Juan, CHEN Zhineng, et al. Multi-modal query expansion for web video search[C]//Proceedings of the 33rd International ACM SIGIR Conference on Research and Development in Information Retrieval. New York, NY, USA, 2010: 721-722.
[11] BREZEALE D, COOK D J. Automatic video classification: a survey of the literature[J]. IEEE Transactions on Systems, Man, and Cybernetics, Part C (Applications and Reviews), 2008, 38(3): 416-430.
[12] YANG Linjun, LIU Jiemin, YANG Xiaokang, et al. Multi-modality web video categorization[C]//Proceedings of the International Workshop on Workshop on Multimedia Information Retrieval. New York, NY, USA, 2007: 265-274.
[13] WU Xiao, ZHAO Wanlei, NGO C W. Towards google challenge: combining contextual and social information for web video categorization[C]//Proceedings of the 17th ACM International Conference on Multimedia. New York, NY, USA, 2009: 1109-1110.
[14] DAVIDSON J, LIEBALD B, LIU J, et al. The YouTube video recommendation system[C]//Proceedings of the 4th ACM Conference on Recommender Systems. New York, NY, USA, 2010: 293-296.
[15] ZHAO Wanlei, WU Xiao, NGO C W. On the annotation of web videos by efficient near-duplicate search[J]. IEEE Transactions on Multimedia, 2010, 12(5): 448-461.
[16] TAN H K, NGO C W, CHUA T S. Efficient mining of multiple partial near-duplicate alignments by temporal network[J]. IEEE Transactions on Circuits and Systems for Video Technology, 2010, 20(11): 1486-1498.
[17] CAO J, ZHANG Y D, SONG Y D, et al. MCG-WEBV: a benchmark dataset for web video analysis[R]. Technical Report, Beijing, China: Institute of Computing Technology, 2009: 324-334.
[18] JIANG Yugang, JIANG Yudong, WANG Jiajun. VCDB: a large-scale database for partial copy detection in videos[M]//FLEET D, PAJDLA T, SCHIELE B, et al. Computer Vision-ECCV 2014. Zurich, Switzerland: Springer, 2014: 357-371.

Memo:
Received: 2016-03-19.
Foundation items: National Natural Science Foundation of China (61473030, 61303175); research funds of key universities (2014JBM031); open project of the Key Laboratory of Digital Media Technology.
About the authors: WEN Youfu, born in 1991, is a master's student; his research interests include video/image retrieval and social network analysis. JIA Caiyan, born in 1976, is an associate professor and doctoral supervisor with a Ph.D.; her research interests include data mining, social computing, text mining, and bioinformatics. In recent years she has led NSFC general and youth program projects; participated in an NSFC key project, a National Science and Technology Major Project, and a Beijing Natural Science Foundation project; won a second prize of the Hunan Province Science and Technology Progress Award; and published more than 40 papers. CHEN Zhineng, born in 1982, is an associate researcher with a Ph.D.; his research interests include multimedia content analysis and retrieval, machine learning, and image processing. In recent years he has led an NSFC youth program project and published more than 20 papers.
Corresponding author: JIA Caiyan. E-mail: cyjia@bjtu.edu.cn.