Generating Video Description using Sequence-to-sequence Model with Temporal Attention

Natsuda Laokulrat, Sang Phan, Noriki Nishida, Raphael Shu, Yo Ehara, Naoaki Okazaki, Yusuke Miyao, Hideki Nakayama


Abstract
Automatic video description generation has recently been attracting attention following rapid advances in image caption generation. Automatically generating a description for a video is more challenging than for an image because of the temporal dynamics across frames. Most previous work has relied on Recurrent Neural Networks (RNNs), and attention mechanisms have recently been applied so that the model learns to focus on particular frames of the video while generating each word of the describing sentence. In this paper, we focus on a sequence-to-sequence approach with a temporal attention mechanism. We analyze and compare the results of different attention model configurations. By applying the temporal attention mechanism to the system, we achieve a METEOR score of 0.310 on the Microsoft Video Description dataset, outperforming the state-of-the-art system to date.
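As a rough illustration of what "focusing on some frames while generating each word" can mean, the following is a minimal sketch of soft temporal attention in plain Python. It is not the paper's implementation: the function name `temporal_attention`, the dot-product scoring, and the toy vectors are all assumptions made for illustration, and the abstract does not specify the scoring function or the configurations compared.

```python
import math


def temporal_attention(frame_features, decoder_state):
    """Soft attention over video frames; returns (weights, context).

    frame_features: list of T frame vectors, each of length D.
    decoder_state: vector of length D (hypothetical decoder hidden state).
    Dot-product scoring is an assumption, not the paper's stated choice.
    """
    # One relevance score per frame.
    scores = [sum(f * s for f, s in zip(frame, decoder_state))
              for frame in frame_features]
    # Numerically stable softmax over the time axis.
    max_score = max(scores)
    exps = [math.exp(x - max_score) for x in scores]
    total = sum(exps)
    weights = [e / total for e in exps]
    # Context vector: attention-weighted sum of the frame features.
    dim = len(frame_features[0])
    context = [sum(w * frame[d] for w, frame in zip(weights, frame_features))
               for d in range(dim)]
    return weights, context


# Toy example: three frames with two-dimensional features.
frames = [[1.0, 0.0], [0.0, 1.0], [1.0, 1.0]]
state = [2.0, 0.0]
weights, context = temporal_attention(frames, state)
```

In a sequence-to-sequence decoder, the state changes at every output word, so the weights would be recomputed at each step and the resulting context vector fed to the decoder alongside the previous word.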
Anthology ID:
C16-1005
Volume:
Proceedings of COLING 2016, the 26th International Conference on Computational Linguistics: Technical Papers
Month:
December
Year:
2016
Address:
Osaka, Japan
Editors:
Yuji Matsumoto, Rashmi Prasad
Venue:
COLING
Publisher:
The COLING 2016 Organizing Committee
Pages:
44–52
URL:
https://aclanthology.org/C16-1005
Cite (ACL):
Natsuda Laokulrat, Sang Phan, Noriki Nishida, Raphael Shu, Yo Ehara, Naoaki Okazaki, Yusuke Miyao, and Hideki Nakayama. 2016. Generating Video Description using Sequence-to-sequence Model with Temporal Attention. In Proceedings of COLING 2016, the 26th International Conference on Computational Linguistics: Technical Papers, pages 44–52, Osaka, Japan. The COLING 2016 Organizing Committee.
Cite (Informal):
Generating Video Description using Sequence-to-sequence Model with Temporal Attention (Laokulrat et al., COLING 2016)
PDF:
https://aclanthology.org/C16-1005.pdf
Data
MSVD