[1] WEN Youfu, JIA Caiyan, CHEN Zhineng. A multi-modal fusion approach for measuring web video relatedness[J]. CAAI Transactions on Intelligent Systems, 2016, 11(3): 359-365. [doi:10.11992/tis.201603040]
CAAI Transactions on Intelligent Systems [ISSN 1673-4785/CN 23-1538/TP]
Volume: 11
Issue: 2016, No. 3
Pages: 359-365
Column: Academic Papers: Machine Perception and Pattern Recognition
Publication date: 2016-06-25
- Title: A multi-modal fusion approach for measuring web video relatedness
- Author(s): WEN Youfu1,2; JIA Caiyan1; CHEN Zhineng2
- Affiliation(s):
1. Beijing Key Lab of Traffic Data Analysis and Mining, Beijing Jiaotong University, Beijing 100044, China;
2. Interactive Media Research and Services Center, Institute of Automation, Chinese Academy of Sciences, Beijing 100190, China
- Keywords: web video; large-scale video; social features; human-video interactions; multi-source heterogeneous information; multi-modal fusion; relatedness measurement; video retrieval
- CLC: TP393
- DOI: 10.11992/tis.201603040
- Abstract: With advances in internet and multimedia technologies, the number of web videos on social video platforms is growing rapidly. Tasks such as large-scale video retrieval, classification, and annotation have therefore become urgent problems, and web video relatedness serves as a basic, shared infrastructure for all of them. This paper investigates the measurement of web video relatedness from a multi-modal fusion perspective, proposing to measure relatedness from multi-source heterogeneous information. The fusion simultaneously leverages a video's visual content, its title and tag text, and social features contributed by human-video interactions (i.e., the upload time, channel, and author of a video). On this basis, a novel multi-modal fusion approach is proposed for computing web video relatedness; it provides a ranking criterion and is applied to the task of large-scale video retrieval. Experimental results on YouTube videos show that the proposed fusion of text, visual, and users' social features performs best in comparison with three alternative approaches, i.e., approaches that compute web video relatedness from text features alone, from visual features alone, or from text and visual features jointly.
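The fusion idea described in the abstract can be illustrated with a minimal late-fusion sketch: compute a similarity score per modality (text, visual, social) and combine them with weights to rank candidate videos. The feature vectors, weights, and the weighted-sum combination rule below are illustrative assumptions, not the paper's actual method.

```python
# Minimal sketch of multi-modal late fusion for video relatedness.
# Feature vectors and fusion weights here are hypothetical examples.
from math import sqrt

def cosine(a, b):
    """Cosine similarity between two equal-length feature vectors."""
    dot = sum(x * y for x, y in zip(a, b))
    na = sqrt(sum(x * x for x in a))
    nb = sqrt(sum(x * x for x in b))
    return dot / (na * nb) if na and nb else 0.0

def relatedness(query, candidate, weights=(0.4, 0.4, 0.2)):
    """Weighted fusion of text, visual, and social similarities."""
    w_text, w_vis, w_soc = weights
    return (w_text * cosine(query["text"], candidate["text"])
            + w_vis * cosine(query["visual"], candidate["visual"])
            + w_soc * cosine(query["social"], candidate["social"]))

# Toy example: rank two candidate videos against a query video.
query = {"text": [1, 0, 1], "visual": [0.2, 0.8], "social": [1, 1, 0]}
cands = {
    "v1": {"text": [1, 0, 1], "visual": [0.2, 0.7], "social": [1, 0, 0]},
    "v2": {"text": [0, 1, 0], "visual": [0.9, 0.1], "social": [0, 0, 1]},
}
ranked = sorted(cands, key=lambda v: relatedness(query, cands[v]), reverse=True)
```

Here the fused score acts as the ranking criterion for retrieval: candidates sharing more modality-level similarity with the query rank higher.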