MO Hongwei, TIAN Peng. An image caption generation method based on attention fusion[J]. CAAI Transactions on Intelligent Systems, 2020, 15(4): 740-749. [doi:10.11992/tis.201910039]
CAAI Transactions on Intelligent Systems [ISSN 1673-4785/CN 23-1538/TP]
Volume: 15
Issue: 2020, No. 4
Pages: 740-749
Section: Academic Papers - Knowledge Engineering
Publication date: 2020-07-05
- Title: An image caption generation method based on attention fusion
- Author(s): MO Hongwei, TIAN Peng
- Affiliation: College of Automation, Harbin Engineering University, Harbin 150001, China
- Keywords: image caption; convolutional neural network; spatial attention; Faster R-CNN; attention mechanism; noun attribute; high-level semantics; reinforcement learning
- CLC number: TP181
- DOI: 10.11992/tis.201910039
- Abstract: Both the spatial attention mechanism and the high-level semantic attention mechanism can improve the quality of image captioning, but extracting spatial attention by directly dividing convolutional neural network feature maps cannot accurately capture the features corresponding to the targets in an image. To improve attention-based image captioning, this paper proposes an image caption model based on attention fusion. It uses Faster R-CNN (faster region-based convolutional neural network) as the encoder, which extracts image features while detecting the accurate positions and noun-attribute features of the target objects; these features then serve as high-level semantic attention and spatial attention, respectively, to guide the generation of the word sequence. Experimental results on the COCO dataset show that the attention-fusion model outperforms the image caption model based on spatial attention and most mainstream image caption models. On top of cross-entropy training, a reinforcement learning method is used to directly optimize the image caption evaluation metrics, which further improves the accuracy of the attention-fusion model.
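As a rough illustration of the fused-attention decoding step the abstract describes (not the authors' implementation; all array names, dimensions, and the additive fusion rule are assumptions), the spatial branch can weight detector region features while the semantic branch weights embeddings of detected noun attributes, and the two context vectors are combined before predicting the next word:

```python
import numpy as np

def softmax(x):
    """Numerically stable softmax over a 1-D score vector."""
    e = np.exp(x - np.max(x))
    return e / e.sum()

def fused_attention_context(regions, attributes, hidden):
    """One decoding step of the attention-fusion idea (hypothetical sketch).

    regions    : (n_regions, d) region features from the detector (spatial branch)
    attributes : (n_attrs, d)   embeddings of detected noun attributes (semantic branch)
    hidden     : (d,)           current decoder hidden state
    Returns the fused context vector and both attention weight vectors.
    """
    spatial_w = softmax(regions @ hidden)       # attend over image regions
    semantic_w = softmax(attributes @ hidden)   # attend over noun attributes
    spatial_ctx = spatial_w @ regions           # weighted sum of region features
    semantic_ctx = semantic_w @ attributes      # weighted sum of attribute embeddings
    fused = spatial_ctx + semantic_ctx          # simple additive fusion (assumption)
    return fused, spatial_w, semantic_w

# Toy example with random features standing in for detector outputs.
rng = np.random.default_rng(0)
regions = rng.standard_normal((36, 8))  # e.g. 36 detected regions
attrs = rng.standard_normal((5, 8))     # e.g. 5 detected noun attributes
h = rng.standard_normal(8)
ctx, sw, aw = fused_attention_context(regions, attrs, h)
print(ctx.shape, sw.shape, aw.shape)
```

In a full model, `ctx` would be fed (with the previous word embedding) into the LSTM decoder at each step; the abstract's reinforcement-learning stage would then fine-tune that decoder by using a caption metric such as CIDEr as the reward, rather than the cross-entropy loss.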
Last Update:
2020-07-25