GONG Dahan, CHEN Hui, CHEN Shijiang, et al. Matching with agreement for cross-modal image-text retrieval[J]. CAAI Transactions on Intelligent Systems, 2021, 16(6): 1143-1150. [doi: 10.11992/tis.202108013]

Matching with agreement for cross-modal image-text retrieval

References:
[1] WANG Liwei, LI Yin, LAZEBNIK S. Learning deep structure-preserving image-text embeddings[C]//Proceedings of 2016 IEEE Conference on Computer Vision and Pattern Recognition (CVPR). Las Vegas, USA, 2016: 5005-5013.
[2] FAGHRI F, FLEET D J, KIROS J R, et al. VSE++: Improving visual-semantic embeddings with hard negatives[EB/OL]. (2018-07-29)[2021-07-30]. https://arxiv.org/pdf/1707.05612.
[3] KARPATHY A, LI Feifei. Deep visual-semantic alignments for generating image descriptions[C]//Proceedings of 2015 IEEE Conference on Computer Vision and Pattern Recognition. Boston, USA, 2015: 3128-3137.
[4] NAM H, HA J W, KIM J. Dual attention networks for multimodal reasoning and matching[C]//Proceedings of 2017 IEEE Conference on Computer Vision and Pattern Recognition. Honolulu, USA, 2017: 2156-2164.
[5] XU K, BA J, KIROS R, et al. Show, attend and tell: Neural image caption generation with visual attention[C]//International Conference on Machine Learning. Lille, France, 2015: 2048-2057.
[6] LEE K H, CHEN Xi, HUA Gang, et al. Stacked cross attention for image-text matching[C]//Proceedings of the 15th European Conference on Computer Vision-ECCV 2018. Munich, Germany: Springer, 2018: 201-216.
[7] FROME A, CORRADO G S, SHLENS J, et al. DeViSE: A deep visual-semantic embedding model[C]//Proceedings of the 26th International Conference on Neural Information Processing Systems. Nevada, USA, 2013: 2121-2129.
[8] SIMONYAN K, ZISSERMAN A. Very deep convolutional networks for large-scale image recognition[EB/OL]. (2015-04-10). https://arxiv.org/pdf/1409.1556.
[9] MIKOLOV T, CHEN Kai, CORRADO G, et al. Efficient estimation of word representations in vector space[EB/OL]. (2013-09-07)[2021-07-30]. https://arxiv.org/pdf/1301.3781.
[10] KIROS R, SALAKHUTDINOV R, ZEMEL R S. Unifying visual-semantic embeddings with multimodal neural language models[EB/OL]. (2014-11-10). https://arxiv.org/pdf/1411.2539.
[11] CHUNG J, GULCEHRE C, CHO K, et al. Empirical evaluation of gated recurrent neural networks on sequence modeling[EB/OL]. (2014-12-11)[2021-07-30]. https://arxiv.org/pdf/1412.3555.
[12] NIU Zhenxing, ZHOU Mo, WANG Le, et al. Hierarchical multimodal LSTM for dense visual-semantic embedding[C]//2017 IEEE International Conference on Computer Vision. Venice, Italy, 2017: 1899-1907.
[13] REN Shaoqing, HE Kaiming, GIRSHICK R, et al. Faster R-CNN: Towards real-time object detection with region proposal networks[C]//Proceedings of the 28th International Conference on Neural Information Processing Systems. Montreal, Canada, 2015: 91-99.
[14] CHEN Hui, DING Guiguang, LIN Zijia, et al. Cross-modal image-text retrieval with semantic consistency[C]//Proceedings of the 27th ACM International Conference on Multimedia. Nice, France, 2019: 1749-1757.
[15] YOUNG P, LAI A, HODOSH M, et al. From image descriptions to visual denotations: New similarity metrics for semantic inference over event descriptions[J]. Transactions of the association for computational linguistics, 2014, 2(1): 67-78.
[16] LIN T Y, MAIRE M, BELONGIE S, et al. Microsoft COCO: Common objects in context[C]//13th European Conference on Computer Vision-ECCV 2014. Zurich, Switzerland, 2014: 740-755.
[17] PASZKE A, GROSS S, CHINTALA S, et al. Automatic differentiation in PyTorch[C]//31st Conference on Neural Information Processing Systems. Long Beach, USA, 2017.
[18] KINGMA D P, BA J L. Adam: A method for stochastic optimization[EB/OL]. (2015-04-23)[2021-08-01]. https://arxiv.org/pdf/1412.6980.
[19] ZHENG Zhedong, ZHENG Liang, GARRETT M, et al. Dual-path convolutional image-text embeddings with instance loss[J]. ACM transactions on multimedia computing, communications, and applications, 2020, 16(2): 51.
[20] HUANG Yan, WANG Wei, WANG Liang. Instance-aware image and sentence matching with selective multimodal LSTM[C]//Proceedings of 2017 IEEE Conference on Computer Vision and Pattern Recognition. Honolulu, USA, 2017: 2310-2318.
[21] WANG Yaxiong, YANG Hao, QIAN Xueming, et al. Position focused attention network for image-text matching[C]//Proceedings of the Twenty-Eighth International Joint Conference on Artificial Intelligence. Macao, China, 2019: 3792-3798.
[22] SONG Yale, SOLEYMANI M. Polysemous visual-semantic embedding for cross-modal retrieval[C]//2019 IEEE/CVF Conference on Computer Vision and Pattern Recognition. Long Beach, USA, 2019.
[23] CHEN Dan, LI Yongzhong, YU Peizhe, et al. Research and prospect of cross modality person re-identification[J]. Computer systems & applications, 2020, 29(10): 20-28.
[24] LIU Tianyu, LIU Zhengxi. Overview of cross modality person re-identification research[J]. Modern computer, 2021, 27(7): 135-139.
[25] YAO Weina. Image-text cross-modal retrieval based on deep hashing method[D]. Beijing: Beijing Jiaotong University, 2018.
