Improving Neural Language Models with Weight Norm Initialization and Regularization

Christian Herold, Yingbo Gao, Hermann Ney


Abstract
Embedding and projection matrices are commonly used in neural language models (NLM) as well as in other sequence processing networks that operate on large vocabularies. We examine such matrices in fine-tuned language models and observe that an NLM learns word vectors whose norms are related to the word frequencies. We show that by initializing the weight norms with scaled log word counts, together with other techniques, lower perplexities can be obtained in early epochs of training. We also introduce a weight norm regularization loss term, whose hyperparameters are tuned via a grid search. With this method, we are able to significantly improve perplexities on two word-level language modeling tasks (without dynamic evaluation): from 54.44 to 53.16 on Penn Treebank (PTB) and from 61.45 to 60.13 on WikiText-2 (WT2).
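The two ideas in the abstract — initializing embedding norms to scaled log word counts, and a loss term that regularizes norms toward those targets — can be sketched as follows. This is a minimal illustration, not the authors' exact formulation: the function names, the `scale` and `weight` hyperparameters, and the squared-error form of the penalty are assumptions for the sake of the example.

```python
import math
import random

def init_word_vectors(word_counts, dim, scale=1.0, seed=0):
    """Draw a random direction per word, then rescale it so the vector's
    L2 norm equals scale * log(count + 1), reflecting the observation
    that learned embedding norms correlate with word frequency.
    (`scale` is a hypothetical hyperparameter, not from the paper.)"""
    rng = random.Random(seed)
    vectors = {}
    for word, count in word_counts.items():
        target_norm = scale * math.log(count + 1)  # +1 guards against log(0)
        v = [rng.gauss(0.0, 1.0) for _ in range(dim)]
        norm = math.sqrt(sum(x * x for x in v)) or 1.0
        vectors[word] = [x * target_norm / norm for x in v]
    return vectors

def norm_regularization(vectors, word_counts, scale=1.0, weight=0.01):
    """Illustrative squared-error penalty pulling each vector's norm back
    toward its scaled-log-count target during training."""
    loss = 0.0
    for word, v in vectors.items():
        norm = math.sqrt(sum(x * x for x in v))
        target = scale * math.log(word_counts[word] + 1)
        loss += (norm - target) ** 2
    return weight * loss
```

Immediately after initialization the regularization loss is (numerically) zero, since every norm sits exactly on its target; during training the penalty would keep frequent words' vectors longer than rare words' vectors.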
Anthology ID:
W18-6310
Volume:
Proceedings of the Third Conference on Machine Translation: Research Papers
Month:
October
Year:
2018
Address:
Brussels, Belgium
Editors:
Ondřej Bojar, Rajen Chatterjee, Christian Federmann, Mark Fishel, Yvette Graham, Barry Haddow, Matthias Huck, Antonio Jimeno Yepes, Philipp Koehn, Christof Monz, Matteo Negri, Aurélie Névéol, Mariana Neves, Matt Post, Lucia Specia, Marco Turchi, Karin Verspoor
Venue:
WMT
SIG:
SIGMT
Publisher:
Association for Computational Linguistics
Pages:
93–100
URL:
https://aclanthology.org/W18-6310
DOI:
10.18653/v1/W18-6310
Cite (ACL):
Christian Herold, Yingbo Gao, and Hermann Ney. 2018. Improving Neural Language Models with Weight Norm Initialization and Regularization. In Proceedings of the Third Conference on Machine Translation: Research Papers, pages 93–100, Brussels, Belgium. Association for Computational Linguistics.
Cite (Informal):
Improving Neural Language Models with Weight Norm Initialization and Regularization (Herold et al., WMT 2018)
PDF:
https://aclanthology.org/W18-6310.pdf
Data
WikiText-2