GONG Dahan, CHEN Hui, CHEN Shijiang, et al. Matching with agreement for cross-modal image-text retrieval[J]. CAAI Transactions on Intelligent Systems, 2021, 16(6): 1143-1150. [doi:10.11992/tis.202108013]
CAAI Transactions on Intelligent Systems [ISSN 1673-4785 / CN 23-1538/TP]
Volume: 16
Issue: 2021, No. 6
Pages: 1143-1150
Column: Wu Wenjun Artificial Intelligence Science and Technology Award Forum
Publication date: 2021-11-05
Title: Matching with agreement for cross-modal image-text retrieval
Author(s): GONG Dahan1,2, CHEN Hui2,3, CHEN Shijiang4, BAO Yongjun5, DING Guiguang1,2
1. School of Software, Tsinghua University, Beijing 100084, China;
2. Beijing National Research Center for Information Science and Technology, Tsinghua University, Beijing 100084, China;
3. Department of Automation, Tsinghua University, Beijing 100084, China;
4. Zhuoxi Institute of Brain and Intelligence, Hangzhou 311121, China;
5. JD Group, Beijing 100176, China
Keywords: artificial intelligence; computer vision; vision and language; cross-modal retrieval; matching with agreement; attention; convolutional neural network; recurrent neural network; gated recurrent unit
CLC number: TP18
DOI: 10.11992/tis.202108013
Abstract: The task of cross-modal image-text retrieval is important for understanding the correspondence between vision and language. Most existing methods leverage different attention modules to mine region-to-word and word-to-region alignments and thereby explore fine-grained cross-modal correlations. However, the problem of inconsistent alignments arising from such dual attention has rarely been considered. This study proposes a matching with agreement (MAG) method, which exploits alignment consistency to enhance cross-modal retrieval performance. Attention is used to compute cross-modal association alignments, on which a cross-modal matching agreement is built with a novel competitive voting strategy. This agreement measures the consistency of the cross-modal alignments and effectively improves retrieval performance. Extensive experiments on two benchmark datasets, Flickr30K and MS COCO, show that the MAG method achieves state-of-the-art performance, demonstrating its effectiveness.
Memo
Received: 2021-08-13.
Foundation items: National Natural Science Foundation of China (61925107, U1936202); China Postdoctoral Science Foundation Innovative Talent Support Program (BX2021161)
Author biographies: GONG Dahan, Ph.D. candidate, whose main research interests are image semantic understanding and the compression and acceleration of convolutional neural networks; CHEN Hui, assistant researcher, Ph.D., whose main research interests are image semantic understanding and multimedia information processing; DING Guiguang, associate professor, Ph.D., whose main research interests are multimedia information processing and computer vision perception. He has led dozens of national-level projects, including NSFC key projects and national key R&D projects, and has received the Second Prize of the National Science and Technology Progress Award, the First Prize of the Wu Wenjun Artificial Intelligence Science and Technology Progress Award, and the First Prize of the Chinese Institute of Electronics Technology Invention Award. He has published nearly 100 academic papers with nearly 7,000 citations.
Corresponding author: DING Guiguang. E-mail: dinggg@tsinghua.edu.cn
Last update: 2021-12-25