
[1] GONG Dahan, CHEN Hui, CHEN Shijiang, et al. Matching with agreement for cross-modal image-text retrieval[J]. CAAI Transactions on Intelligent Systems, 2021, 16(6): 1143-1150. [doi:10.11992/tis.202108013]


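Title: Matching with agreement for cross-modal image-text retrieval (original Chinese title: 一致性协议匹配的跨模态图像文本检索方法)

Journal: CAAI Transactions on Intelligent Systems (《智能系统学报》) [ISSN 1673-4785/CN 23-1538/TP]
Volume: 16
Issue: No. 6, 2021
Pages: 1143-1150
Column: Wu Wenjun Artificial Intelligence Science and Technology Award Forum
Publication date: 2021-11-05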

Author(s): GONG Dahan^1,2, CHEN Hui^2,3, CHEN Shijiang^4, BAO Yongjun^5, DING Guiguang^1,2
1. School of Software, Tsinghua University, Beijing 100084, China;
2. Beijing National Research Center for Information Science and Technology, Tsinghua University, Beijing 100084, China;
3. Department of Automation, Tsinghua University, Beijing 100084, China;
4. Zhuoxi Institute of Brain and Intelligence, Hangzhou, Zhejiang 311121, China;
5. JD.com, Beijing 100176, China

Keywords: artificial intelligence; computer vision; vision and language; cross-modal retrieval; matching with agreement; attention; convolutional neural network; recurrent neural network; gated recurrent unit

CLC number: TP18

DOI: 10.11992/tis.202108013

Abstract: Cross-modal image-text retrieval is important for understanding the correspondence between vision and language. Most existing methods use separate attention modules to mine region-to-word and word-to-region alignments and thereby explore fine-grained cross-modal correlations. However, they do not consider that this dual attention can produce inconsistent alignments. This study proposes a matching with agreement (MAG) method that exploits alignment consistency to improve cross-modal retrieval. Attention is used to align cross-modal associations, and a cross-modal agreement based on competitive voting is then computed from the alignment results. This agreement measures the consistency of the cross-modal alignments and effectively improves retrieval performance. Extensive experiments on two benchmark datasets, Flickr30K and MS COCO, show that MAG achieves state-of-the-art performance, demonstrating its effectiveness.
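The abstract does not give the agreement formula, so the following is only a minimal sketch of the general idea, not the authors' MAG implementation. It assumes cosine similarity between region and word features, a softmax attention in each direction, and a simple stand-in for competitive voting in which every region votes for its most-attended word and every word votes for its most-attended region. The function names, the temperature value, and the mutual-vote score are all hypothetical choices made for illustration.

```python
import math


def cosine(u, v):
    """Cosine similarity between two equal-length feature vectors."""
    dot = sum(a * b for a, b in zip(u, v))
    nu = math.sqrt(sum(a * a for a in u))
    nv = math.sqrt(sum(b * b for b in v))
    return dot / (nu * nv) if nu and nv else 0.0


def softmax(xs, temperature=1.0):
    """Numerically stable softmax with a temperature (assumed hyperparameter)."""
    m = max(xs)
    exps = [math.exp((x - m) / temperature) for x in xs]
    total = sum(exps)
    return [e / total for e in exps]


def alignments(regions, words, temperature=0.1):
    """Return (region_to_word, word_to_region) attention matrices."""
    sim = [[cosine(r, w) for w in words] for r in regions]
    # Each region attends over all words.
    region_to_word = [softmax(row, temperature) for row in sim]
    # Each word attends over all regions (columns of the similarity matrix).
    word_to_region = [softmax(list(col), temperature) for col in zip(*sim)]
    return region_to_word, word_to_region


def agreement_score(region_to_word, word_to_region):
    """Fraction of mutual votes: region i picks word j AND word j picks region i.

    This is an illustrative stand-in for a voting-based agreement, scaled to
    [0, 1] by the largest possible number of mutual pairs.
    """
    region_votes = [max(range(len(r)), key=r.__getitem__) for r in region_to_word]
    word_votes = [max(range(len(w)), key=w.__getitem__) for w in word_to_region]
    mutual = sum(1 for i, j in enumerate(region_votes) if word_votes[j] == i)
    return mutual / min(len(region_votes), len(word_votes))


if __name__ == "__main__":
    # Toy features: two image regions and two words in a shared 2-D space.
    regions = [[1.0, 0.0], [0.0, 1.0]]
    words = [[1.0, 0.0], [0.0, 1.0]]
    r2w, w2r = alignments(regions, words)
    print(agreement_score(r2w, w2r))
```

Under these assumptions, a score near 1 means the two attention directions pick the same region-word pairs, while a low score signals the kind of inconsistent alignment the abstract describes. In a retrieval model such a score could be combined with the similarity between an image and a sentence, but how MAG does this is not stated here.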



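Memo

Received: 2021-08-13.
Funding: National Natural Science Foundation of China (61925107, U1936202); Innovative Talent Support Program of the China Postdoctoral Science Foundation (BX2021161).
About the authors: GONG Dahan, Ph.D. candidate. His research interests include image semantic understanding and the compression and acceleration of convolutional neural networks. CHEN Hui, assistant researcher, Ph.D. His research interests include image semantic understanding and multimedia information processing. DING Guiguang, associate professor, Ph.D. His research interests include multimedia information processing and computer vision perception. He has led dozens of national-level projects, including key projects of the National Natural Science Foundation of China and key research and development projects. He has received the second prize of the National Science and Technology Progress Award, the first prize of the Wu Wenjun Artificial Intelligence Science and Technology Progress Award, and the first prize of the Chinese Institute of Electronics Technological Invention Award. He has published nearly 100 academic papers, which have been cited nearly 7,000 times.
Corresponding author: DING Guiguang. E-mail: dinggg@tsinghua.edu.cn

Last Update: 2021-12-25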
