[1] MO Hongwei, TIAN Peng. An image caption generation method based on attention fusion[J]. CAAI Transactions on Intelligent Systems, 2020, 15(4): 740-749. [doi:10.11992/tis.201910039]
Journal: CAAI Transactions on Intelligent Systems [ISSN 1673-4785 / CN 23-1538/TP]
Volume: 15
Issue: 2020(4)
Pages: 740-749
Column: Academic Papers: Knowledge Engineering
Publication date: 2020-07-05
Title: An image caption generation method based on attention fusion
Author(s): MO Hongwei; TIAN Peng
Affiliation: College of Automation, Harbin Engineering University, Harbin 150001, China
Keywords: image caption; convolutional neural network; spatial attention; Faster R-CNN; attention mechanism; noun attribute; high-level semantic; reinforcement learning
CLC: TP181
DOI: 10.11992/tis.201910039
Abstract: Both the spatial attention mechanism and the high-level semantic attention mechanism can improve image captioning, but extracting spatial attention by directly dividing the convolutional feature map of a CNN cannot accurately capture the features corresponding to the targets in the image. To improve attention-based image captioning, this paper proposes an image caption model based on attention fusion. It uses Faster R-CNN (faster region-based convolutional neural network) as the encoder to extract image features while detecting the precise positions and noun attributes of target objects; these are then used as spatial attention and high-level semantic attention, respectively, to guide the generation of the word sequence. Experimental results on the COCO dataset show that the attention-fusion model outperforms image caption models based on spatial attention alone as well as most mainstream image caption models. On top of cross-entropy training, the model is further trained with reinforcement learning that directly optimizes the caption evaluation metrics, which significantly improves the accuracy of the attention-fusion image caption model.