[1] MO Hongwei, TIAN Peng. An image caption generation method based on attention fusion[J]. CAAI Transactions on Intelligent Systems, 2020, 15(4): 740-749. [doi:10.11992/tis.201910039]

An image caption generation method based on attention fusion

References:
[1] LI Yadong, MO Hong, WANG Shihao. Person retrieval method based on image caption[J]. Journal of system simulation, 2018, 30(7): 377-383.
[2] WU Jie, XIE Siya, SHI Xinbao, et al. Global-local feature attention network with reranking strategy for image caption generation[J]. Optoelectronics letters, 2017, 13(6): 448-451.
[3] DENG Zhenrong, ZHANG Baojun, JIANG Zhouqin. Image description model fusing word2vec and attention mechanism[J]. Journal of computer science, 2019, 46(4): 274-279.
[4] TAO Yunsong, ZHANG Lihong. Research on image description method based on bidirectional attention mechanism[J]. Journal of test and measurement technology, 2019, 33(4): 346-351.
[5] QU Shiru, XI Yuling. Visual attention based on long-short term memory model for image caption generation[C]// 2017 29th Chinese Control and Decision Conference. Chongqing, China, 2017: 4789-4794
[6] JIA Xu, GAVVES E. Guiding the long-short term memory model for image caption generation[C]// 2015 IEEE International Conference on Computer Vision. Santiago, Chile, 2015: 2407-2415.
[7] JIN Junqi, FU Kun, CUI Runpeng, et al. Aligning where to see and what to tell: image caption with region-based attention and scene factorization[J]. IEEE transactions on pattern analysis and machine intelligence, 2017, 39(12): 2321-2334.
[8] FARHADI A, HEJRATI M. Every picture tells a story: Generating sentences from images[C]//European Conference on Computer Vision. Berlin, Heidelberg, 2010: 15-29.
[9] KULKARNI G, PREMRAJ V, DHAR S, et al. Babytalk: understanding and generating simple image descriptions[J]. IEEE transactions on pattern analysis and machine intelligence, 2013, 35(12): 2891-2903.
[10] LI Siming, KULKARNI G, BERG T L, et al. Composing simple image descriptions using web-scale n-grams[C]//Proceedings of the Fifteenth Conference on Computational Natural Language Learning. Portland, Oregon, USA, 2011: 220-228.
[11] KUZNETSOVA P, ORDONEZ V, BERG A C, et al. Collective generation of natural image descriptions[C]//Proceedings of the 50th Annual Meeting of the Association for Computational Linguistics. Jeju, Republic of Korea, 2012: 359-368.
[12] VERMA Y, GUPTA A. Generating image descriptions using semantic similarities in the output space[C]//Proceedings of the IEEE Conference on Computer Vision and Pattern Recognition Workshops. Washington, USA, 2013: 288-293.
[13] DEVLIN J, CHENG Hao, FANG Hao, et al. Language models for image captioning: the quirks and what works[J]. arXiv preprint arXiv:1505.01809, 2015.
[14] SIMONYAN K, ZISSERMAN A. Very deep convolutional networks for large-scale image recognition[J]. arXiv preprint arXiv:1409.1556, 2014.
[15] ALTMAN N S. An introduction to kernel and nearest-neighbor nonparametric regression[J]. The American statistician, 1992, 46(3): 175-185.
[16] MAO Junhua, XU Wei, YANG Yi, et al. Explain images with multimodal recurrent neural networks[J]. arXiv preprint arXiv:1410.1090, 2014.
[17] VINYALS O, TOSHEV A, BENGIO S, et al. Show and tell: a neural image caption generator[C]// Proceedings of the IEEE Conference on Computer Vision and Pattern Recognition. Boston, USA, 2015: 3156-3164.
[18] HOCHREITER S, SCHMIDHUBER J. Long short-term memory[J]. Neural computation, 1997, 9(8): 1735-1780.
[19] XU K, BA J, KIROS R, et al. Show, attend and tell: neural image caption generation with visual attention[C]//International Conference on Machine Learning. Lille, France, 2015: 2048-2057.
[20] YOU Quanzeng, JIN Hailin, WANG Zhaowen, et al. Image captioning with semantic attention[C]//Proceedings of the IEEE Conference on Computer Vision and Pattern Recognition. Las Vegas, USA, 2016: 4651-4659.
[21] CHEN Long, ZHANG Hanwang, XIAO Jun, et al. SCA-CNN: spatial and channel-wise attention in convolutional networks for image captioning[C]//Proceedings of the IEEE Conference on Computer Vision and Pattern Recognition. Honolulu, USA, 2017: 5659-5667.
[22] LUONG M T, PHAM H, MANNING C D. Effective approaches to attention-based neural machine translation[C]// Proceedings of the 2015 Conference on Empirical Methods in Natural Language Processing. Lisbon, Portugal, 2015: 1412-1421.
[23] XU K, BA J L. Show, attend and tell: neural image caption generation with visual attention[C]// Proceedings of the 32nd International Conference on Machine Learning. Lille, France, 2015: 2048-2057.
[24] PEDERSOLI M, LUCAS T, SCHMID C, et al. Areas of attention for image captioning[C]//Proceedings of the IEEE International Conference on Computer Vision. Venice, Italy, 2017: 1242-1250.
[25] LI Linghui, TANG Sheng, DENG Lixi, et al. Image caption with global-local attention[C]//Thirty-First AAAI Conference on Artificial Intelligence. San Francisco, USA, 2017: 4133-4138.
[26] LU Jiasen, XIONG Caiming, PARIKH D, et al. Knowing when to look: adaptive attention via a visual sentinel for image captioning[C]//Proceedings of the IEEE Conference on Computer Vision and Pattern Recognition. Honolulu, USA, 2017: 375-383.
[27] ANDERSON P, HE Xiaodong, BUEHLER C, et al. Bottom-up and top-down attention for image captioning and visual question answering[C]//Proceedings of the IEEE Conference on Computer Vision and Pattern Recognition. Salt Lake, USA, 2018: 6077-6086.
[28] REN Shaoqing, HE Kaiming, GIRSHICK R, et al. Faster R-CNN: towards real-time object detection with region proposal networks[J]. IEEE transactions on pattern analysis and machine intelligence, 2017, 39(6): 1137-1149.
[29] SUTTON R S, BARTO A G. Reinforcement learning: an introduction[M]. Cambridge: MIT Press, 1998.
[30] RANZATO M, CHOPRA S, AULI M, et al. Sequence level training with recurrent neural networks[J]. arXiv preprint arXiv:1511.06732, 2015.
[31] LIU Siqi, ZHU Zhenhai, YE Ning, et al. Improved image captioning via policy gradient optimization of SPIDEr[C]//Proceedings of the IEEE International Conference on Computer Vision. Venice, Italy, 2017: 873-881.
[32] WILLIAMS R J. Simple statistical gradient-following algorithms for connectionist reinforcement learning[J]. Machine learning, 1992, 8: 229-256.
[33] HE Kaiming, ZHANG Xiangyu, REN Shaoqing, et al. Deep residual learning for image recognition[C]// 2016 IEEE Conference on Computer Vision and Pattern Recognition. Las Vegas, USA, 2016.
[34] LIN Tsungyi, MAIRE M, BELONGIE S, et al. Microsoft COCO: common objects in context[C]//European Conference on Computer Vision. Zürich, Switzerland, 2014: 740-755.
[35] KARPATHY A, LI Feifei. Deep visual-semantic alignments for generating image descriptions[J]. IEEE transactions on pattern analysis and machine intelligence, 2016: 664-676.
[36] KINGMA D P, BA J. Adam: a method for stochastic optimization[J]. arXiv preprint arXiv:1412.6980, 2014.
[37] WANG Pidong, NG H T. A beam-search decoder for normalization of social media text with application to machine translation[C]//Conference of the North American Chapter of the Association for Computational Linguistics: Human Language Technologies. Atlanta, Georgia, 2013: 471-481.
[38] PAPINENI K, ROUKOS S, WARD T, et al. BLEU: a method for automatic evaluation of machine translation[C]// Proceedings of the 40th Annual Meeting on Association for Computational Linguistics. Philadelphia, USA, 2002: 311-318.
[39] BANERJEE S, LAVIE A. METEOR: an automatic metric for MT evaluation with improved correlation with human judgments[C]// Proceedings of the ACL Workshop on Intrinsic and Extrinsic Evaluation Measures for Machine Translation and/or Summarization. Michigan, USA, 2005: 65-72.
[40] LIN C Y, HOVY E. Automatic evaluation of summaries using n-gram co-occurrence statistics[C]//Proceedings of the 2003 Human Language Technology Conference of the North American Chapter of the Association for Computational Linguistics. Edmonton, Canada, 2003: 71-78.
[41] VEDANTAM R, ZITNICK C L, PARIKH D. CIDEr: consensus-based image description evaluation[C]//Proceedings of the IEEE Conference on Computer Vision and Pattern Recognition. Boston, USA, 2015: 4566-4575.
[42] ANDERSON P, FERNANDO B, JOHNSON M, et al. SPICE: semantic propositional image caption evaluation[C]//European Conference on Computer Vision. Amsterdam, Netherlands, 2016: 382-398.

Memo

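Received: 2019-10-29.
Funding: New Generation Artificial Intelligence Major Project of the National Key R&D Program of China (2018AAA0102702).
About the authors: MO Hongwei, professor and doctoral supervisor, whose main research interests are artificial intelligence, brain-inspired computing, and intelligent robots; has undertaken and completed 17 projects, including projects funded by the National Natural Science Foundation of China and national defense pre-research programs, holds 7 granted invention patents, and has published more than 70 academic papers and 6 monographs. TIAN Peng, doctoral candidate, whose main research interests are image captioning, visual relationship detection, and scene understanding.
Corresponding author: MO Hongwei. E-mail: honwei2004@126.com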

Last Update: 2020-07-25