An Empirical Investigation of Discounting in Cross-Domain Language Models

Greg Durrett and Dan Klein
UC Berkeley


Abstract

We investigate the empirical behavior of n-gram discounts within and across domains. When a language model is trained and evaluated on two corpora from exactly the same domain, discounts are roughly constant, matching the assumptions of modified Kneser-Ney LMs. However, when training and test corpora diverge, the empirical discount grows essentially as a linear function of the n-gram count. We adapt a Kneser-Ney language model to incorporate such growing discounts, resulting in perplexity improvements over modified Kneser-Ney and Jelinek-Mercer baselines.
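To make the contrast concrete, here is a minimal Python sketch (not code from the paper) of absolute discounting with interpolation, comparing a roughly constant discount, in the spirit of modified Kneser-Ney, against a discount that grows linearly with the n-gram count. The linear form D(c) = a + b*c, the toy bigram counts, and the uniform lower-order model are illustrative assumptions rather than the paper's exact formulation.

```python
from collections import Counter


def interpolated_prob(bigram_counts, discount_fn, vocab_size):
    """Absolute discounting with interpolation: each observed count c is
    reduced by discount_fn(c), and the reserved mass is spread over the
    vocabulary via a uniform lower-order model (a stand-in for
    Kneser-Ney's continuation distribution)."""
    probs = {}
    for history, followers in bigram_counts.items():
        total = sum(followers.values())
        discounted = {w: max(c - discount_fn(c), 0.0) for w, c in followers.items()}
        reserved = total - sum(discounted.values())  # mass moved to the backoff model
        backoff_weight = reserved / total
        probs[history] = {w: d / total + backoff_weight / vocab_size
                          for w, d in discounted.items()}
    return probs


def constant_discount(c):
    # Constant discount (modified Kneser-Ney uses a few such constants).
    return 0.75


def growing_discount(c):
    # Assumed linear form a + b*c, illustrating a discount that grows with count.
    return 0.5 + 0.1 * c


# Toy counts and parameters, for illustration only.
bigrams = {"the": Counter({"cat": 10, "dog": 5, "mat": 1})}
vocab_size = 1000

p_const = interpolated_prob(bigrams, constant_discount, vocab_size)
p_grow = interpolated_prob(bigrams, growing_discount, vocab_size)
print(p_const["the"]["cat"])  # ~0.578: little mass reserved for the backoff model
print(p_grow["the"]["cat"])   # ~0.531: the growing discount reserves more mass
```

With the growing discount, higher-count n-grams surrender more probability mass to the lower-order model, which is the qualitative behavior the abstract describes for diverging training and test domains.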




Full paper: http://www.aclweb.org/anthology/P/P11/P11-2005.pdf