Sampling Informative Training Data for RNN Language Models

Jared Fernandez, Doug Downey


Abstract
We propose an unsupervised importance sampling approach to selecting training data for recurrent neural network (RNN) language models. To increase the information content of the training set, our approach preferentially samples high-perplexity sentences, as determined by an easily queryable n-gram language model. We experimentally evaluate the heldout perplexity of models trained with our various importance sampling distributions. We show that language models trained on data sampled using our proposed approach outperform models trained on randomly sampled subsets of both the Billion Word (Chelba et al., 2014) and WikiText-103 (Merity et al., 2016) benchmark corpora.
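The following is a minimal, illustrative sketch (not the authors' code) of the idea described in the abstract: score each candidate sentence with a cheap n-gram language model and then sample training sentences with probability proportional to their perplexity, so high-perplexity sentences are preferred. The corpus, the add-one-smoothed unigram scorer standing in for the n-gram model, and the exponent `alpha` are assumptions made for illustration only.

```python
import math
import random
from collections import Counter

def unigram_perplexity(sentence, counts, total, vocab_size):
    """Per-token perplexity under an add-one-smoothed unigram model
    (a stand-in for the paper's n-gram language model)."""
    tokens = sentence.split()
    log_prob = 0.0
    for tok in tokens:
        p = (counts[tok] + 1) / (total + vocab_size)
        log_prob += math.log(p)
    return math.exp(-log_prob / max(len(tokens), 1))

def sample_training_subset(candidates, k, alpha=1.0, seed=0):
    """Draw k sentences, weighting each candidate by perplexity ** alpha.
    `alpha` is an assumed knob controlling how sharply the sampler favors
    high-perplexity sentences."""
    counts = Counter(tok for s in candidates for tok in s.split())
    total = sum(counts.values())
    vocab_size = len(counts)
    weights = [unigram_perplexity(s, counts, total, vocab_size) ** alpha
               for s in candidates]
    rng = random.Random(seed)
    # random.choices samples with replacement; a real pipeline would likely
    # sample without replacement or deduplicate the resulting subset.
    return rng.choices(candidates, weights=weights, k=k)

if __name__ == "__main__":
    corpus = [
        "the cat sat on the mat",
        "quantum chromodynamics constrains hadronic interactions",
        "the dog sat on the rug",
    ]
    print(sample_training_subset(corpus, k=2))
```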
Anthology ID:
P18-3002
Volume:
Proceedings of ACL 2018, Student Research Workshop
Month:
July
Year:
2018
Address:
Melbourne, Australia
Editors:
Vered Shwartz, Jeniya Tabassum, Rob Voigt, Wanxiang Che, Marie-Catherine de Marneffe, Malvina Nissim
Venue:
ACL
Publisher:
Association for Computational Linguistics
Pages:
9–13
URL:
https://aclanthology.org/P18-3002
DOI:
10.18653/v1/P18-3002
Cite (ACL):
Jared Fernandez and Doug Downey. 2018. Sampling Informative Training Data for RNN Language Models. In Proceedings of ACL 2018, Student Research Workshop, pages 9–13, Melbourne, Australia. Association for Computational Linguistics.
Cite (Informal):
Sampling Informative Training Data for RNN Language Models (Fernandez & Downey, ACL 2018)
PDF:
https://aclanthology.org/P18-3002.pdf
Data
Billion Word Benchmark, WikiText-103, WikiText-2