MO Hongwei, TIAN Peng. An image caption generation method based on attention fusion[J]. CAAI Transactions on Intelligent Systems, 2020, 15(4): 740-749. [doi:10.11992/tis.201910039]
CAAI Transactions on Intelligent Systems [ISSN 1673-4785/CN 23-1538/TP]
Volume: 15
Issue: 2020, No. 4
Pages: 740-749
Section: Academic Papers - Knowledge Engineering
Publication date: 2020-07-05
- Title: An image caption generation method based on attention fusion
- Author(s): MO Hongwei, TIAN Peng
- Affiliation: College of Automation, Harbin Engineering University, Harbin 150001, China
- Keywords: image caption; convolutional neural network; spatial attention; Faster R-CNN; attention mechanism; noun attribute; high-level semantics; reinforcement learning
- CLC number: TP181
- DOI: 10.11992/tis.201910039
- Abstract: Both the spatial attention mechanism and the high-level semantic attention mechanism can improve the quality of image captioning, but extracting spatial attention by directly dividing convolutional neural network feature maps cannot accurately capture the features corresponding to the targets in an image. To improve attention-based image captioning, this paper proposes an image caption model based on attention fusion. It uses Faster R-CNN (faster region-based convolutional neural network) as the encoder, which extracts image features while detecting the accurate positions and noun-attribute features of the target objects; these features then serve as high-level semantic attention and spatial attention, respectively, to guide the generation of the word sequence. Experimental results on the COCO dataset show that the attention-fusion model outperforms the image caption model based on spatial attention and most mainstream image caption models. On top of cross-entropy training, a reinforcement learning method is used to directly optimize the image caption evaluation metrics, which further improves the accuracy of the attention-fusion model.
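As a rough illustration of the fused-attention decoding step the abstract describes (not the authors' implementation; all array names, dimensions, and the additive fusion rule are assumptions), the spatial branch can weight detector region features while the semantic branch weights embeddings of detected noun attributes, and the two context vectors are combined before predicting the next word:

```python
import numpy as np

def softmax(x):
    """Numerically stable softmax over a 1-D score vector."""
    e = np.exp(x - np.max(x))
    return e / e.sum()

def fused_attention_context(regions, attributes, hidden):
    """One decoding step of the attention-fusion idea (hypothetical sketch).

    regions    : (n_regions, d) region features from the detector (spatial branch)
    attributes : (n_attrs, d)   embeddings of detected noun attributes (semantic branch)
    hidden     : (d,)           current decoder hidden state
    Returns the fused context vector and both attention weight vectors.
    """
    spatial_w = softmax(regions @ hidden)       # attend over image regions
    semantic_w = softmax(attributes @ hidden)   # attend over noun attributes
    spatial_ctx = spatial_w @ regions           # weighted sum of region features
    semantic_ctx = semantic_w @ attributes      # weighted sum of attribute embeddings
    fused = spatial_ctx + semantic_ctx          # simple additive fusion (assumption)
    return fused, spatial_w, semantic_w

# Toy example with random features standing in for detector outputs.
rng = np.random.default_rng(0)
regions = rng.standard_normal((36, 8))  # e.g. 36 detected regions
attrs = rng.standard_normal((5, 8))     # e.g. 5 detected noun attributes
h = rng.standard_normal(8)
ctx, sw, aw = fused_attention_context(regions, attrs, h)
print(ctx.shape, sw.shape, aw.shape)
```

In a full model, `ctx` would be fed (with the previous word embedding) into the LSTM decoder at each step; the abstract's reinforcement-learning stage would then fine-tune that decoder by using a caption metric such as CIDEr as the reward, rather than the cross-entropy loss.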
Last Update:
2020-07-25