simNet: Stepwise Image-Topic Merging Network for Generating Detailed and Comprehensive Image Captions

Fenglin Liu, Xuancheng Ren, Yuanxin Liu, Houfeng Wang, Xu Sun


Abstract
The encoder-decoder framework has shown recent success in image captioning. Visual attention, which favors detailedness, and semantic attention, which favors comprehensiveness, have been separately proposed to ground the caption on the image. In this paper, we propose the Stepwise Image-Topic Merging Network (simNet) that makes use of the two kinds of attention at the same time. At each time step when generating the caption, the decoder adaptively merges the attentive information in the extracted topics and the image according to the generated context, so that the visual information and the semantic information can be effectively combined. The proposed approach is evaluated on two benchmark datasets and achieves state-of-the-art performance.
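The adaptive merging described in the abstract can be pictured as a gated combination of two attention contexts, one over image regions and one over extracted topics. The sketch below is illustrative only: the function names, the dot-product attention, and the scalar sigmoid gate are assumptions for clarity, not the paper's exact formulation.

```python
import numpy as np

def softmax(x):
    # numerically stable softmax over a 1-D score vector
    e = np.exp(x - x.max())
    return e / e.sum()

def attend(query, keys):
    # simple dot-product attention: score each key against the
    # query, normalize, and return the weighted sum as the context
    weights = softmax(keys @ query)
    return weights @ keys

def merge_step(hidden, image_feats, topic_embs, w_gate):
    # attend separately over image region features and topic embeddings
    ctx_img = attend(hidden, image_feats)      # visual context
    ctx_topic = attend(hidden, topic_embs)     # semantic (topic) context
    # a scalar gate computed from the decoder state decides, at each
    # time step, how much to rely on the image vs. the topics
    beta = 1.0 / (1.0 + np.exp(-(w_gate @ hidden)))  # sigmoid gate
    return beta * ctx_img + (1.0 - beta) * ctx_topic
```

The merged context would then feed the decoder when predicting the next caption word; the gate lets the model lean on visual features for concrete details and on topics for broader coverage.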
Anthology ID:
D18-1013
Volume:
Proceedings of the 2018 Conference on Empirical Methods in Natural Language Processing
Month:
October-November
Year:
2018
Address:
Brussels, Belgium
Editors:
Ellen Riloff, David Chiang, Julia Hockenmaier, Jun’ichi Tsujii
Venue:
EMNLP
SIG:
SIGDAT
Publisher:
Association for Computational Linguistics
Pages:
137–149
URL:
https://aclanthology.org/D18-1013
DOI:
10.18653/v1/D18-1013
Cite (ACL):
Fenglin Liu, Xuancheng Ren, Yuanxin Liu, Houfeng Wang, and Xu Sun. 2018. simNet: Stepwise Image-Topic Merging Network for Generating Detailed and Comprehensive Image Captions. In Proceedings of the 2018 Conference on Empirical Methods in Natural Language Processing, pages 137–149, Brussels, Belgium. Association for Computational Linguistics.
Cite (Informal):
simNet: Stepwise Image-Topic Merging Network for Generating Detailed and Comprehensive Image Captions (Liu et al., EMNLP 2018)
PDF:
https://aclanthology.org/D18-1013.pdf
Video:
https://aclanthology.org/D18-1013.mp4
Code:
lancopku/simNet
Data:
Flickr30k