Mapping Instructions to Actions in 3D Environments with Visual Goal Prediction

Dipendra Misra, Andrew Bennett, Valts Blukis, Eyvind Niklasson, Max Shatkhin, Yoav Artzi


Abstract
We propose to decompose instruction execution into goal prediction and action generation. We design a model that maps raw visual observations to goals using LINGUNET, a language-conditioned image generation network, and then generates the actions required to complete them. Our model is trained from demonstrations only, without external resources. To evaluate our approach, we introduce two benchmarks for instruction following: LANI, a navigation task, and CHAI, in which an agent executes household instructions. Our evaluation demonstrates the advantages of our model decomposition and illustrates the challenges posed by our new benchmarks.
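To make the abstract's architecture concrete, below is a minimal, illustrative sketch of a language-conditioned goal-prediction network in the spirit of LINGUNET: a convolutional encoder over the observed image, decoder stages conditioned on the instruction through 1x1 filters generated from a text embedding, and a softmax over pixels giving a distribution over goal locations. The class name, layer sizes, two-level depth, and PyTorch framing here are assumptions made for brevity; this is not the authors' released implementation (see the clic-lab/ciff code link below).

# Illustrative sketch only; all hyperparameters and names are assumed.
import torch
import torch.nn as nn
import torch.nn.functional as F

class LingUNetSketch(nn.Module):
    def __init__(self, text_dim=64, channels=16, vocab_size=1000):
        super().__init__()
        self.channels = channels
        # Image encoder: two conv blocks, each halving spatial resolution.
        self.enc1 = nn.Conv2d(3, channels, 5, stride=2, padding=2)
        self.enc2 = nn.Conv2d(channels, channels, 5, stride=2, padding=2)
        # Instruction encoder: word embedding followed by an LSTM.
        self.embed = nn.Embedding(vocab_size, 32)
        self.rnn = nn.LSTM(32, text_dim, batch_first=True)
        # Each half of the text vector parameterizes a 1x1 filter bank
        # that conditions one decoder level on the language.
        self.txt_to_filters1 = nn.Linear(text_dim // 2, channels * channels)
        self.txt_to_filters2 = nn.Linear(text_dim // 2, channels * channels)
        # Decoder: upsample back to input resolution, ending with 1 channel.
        self.dec2 = nn.ConvTranspose2d(channels, channels, 5, stride=2,
                                       padding=2, output_padding=1)
        self.dec1 = nn.ConvTranspose2d(2 * channels, 1, 5, stride=2,
                                       padding=2, output_padding=1)

    def forward(self, image, tokens):
        # image: (B, 3, H, W) observation; tokens: (B, T) instruction word ids.
        b = image.size(0)
        f1 = F.relu(self.enc1(image))
        f2 = F.relu(self.enc2(f1))
        _, (h, _) = self.rnn(self.embed(tokens))
        text = h[-1]                                   # (B, text_dim)
        t1, t2 = text.chunk(2, dim=1)
        # Text-generated 1x1 convolutions applied to the encoder features.
        k1 = self.txt_to_filters1(t1).view(b, self.channels, self.channels, 1, 1)
        k2 = self.txt_to_filters2(t2).view(b, self.channels, self.channels, 1, 1)
        g1 = torch.stack([F.conv2d(f1[i:i + 1], k1[i]) for i in range(b)]).squeeze(1)
        g2 = torch.stack([F.conv2d(f2[i:i + 1], k2[i]) for i in range(b)]).squeeze(1)
        # Decode with a skip connection from the language-filtered features.
        d2 = F.relu(self.dec2(g2))
        d1 = self.dec1(torch.cat([d2, g1], dim=1))     # (B, 1, H, W)
        # Distribution over pixels; the argmax is the predicted goal location.
        return F.softmax(d1.view(b, -1), dim=1).view_as(d1)

if __name__ == "__main__":
    model = LingUNetSketch()
    goal_probs = model(torch.randn(2, 3, 64, 64),
                       torch.randint(0, 1000, (2, 8)))
    print(goal_probs.shape)  # torch.Size([2, 1, 64, 64])

In the full system described in the paper, a separate action-generation component then produces the low-level actions needed to reach the predicted goal; that stage is omitted from this sketch.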
Anthology ID:
D18-1287
Volume:
Proceedings of the 2018 Conference on Empirical Methods in Natural Language Processing
Month:
October-November
Year:
2018
Address:
Brussels, Belgium
Editors:
Ellen Riloff, David Chiang, Julia Hockenmaier, Jun’ichi Tsujii
Venue:
EMNLP
SIG:
SIGDAT
Publisher:
Association for Computational Linguistics
Pages:
2667–2678
URL:
https://aclanthology.org/D18-1287
DOI:
10.18653/v1/D18-1287
Cite (ACL):
Dipendra Misra, Andrew Bennett, Valts Blukis, Eyvind Niklasson, Max Shatkhin, and Yoav Artzi. 2018. Mapping Instructions to Actions in 3D Environments with Visual Goal Prediction. In Proceedings of the 2018 Conference on Empirical Methods in Natural Language Processing, pages 2667–2678, Brussels, Belgium. Association for Computational Linguistics.
Cite (Informal):
Mapping Instructions to Actions in 3D Environments with Visual Goal Prediction (Misra et al., EMNLP 2018)
PDF:
https://aclanthology.org/D18-1287.pdf
Attachment:
 D18-1287.Attachment.zip
Code
clic-lab/ciff (official); additional community code available
Data
LANI