Better Hypothesis Testing for Statistical Machine Translation: Controlling for Optimizer Instability

Jonathan H. Clark,  Chris Dyer,  Alon Lavie,  Noah A. Smith
Carnegie Mellon University


Abstract

In statistical machine translation, a researcher seeks to determine whether some innovation (e.g., a new feature, model, or inference algorithm) improves translation quality in comparison to a baseline system. To answer this question, they run an experiment to evaluate the behavior of the two systems on held-out data. In this paper, we consider how to make such experiments more statistically reliable. We provide a systematic analysis of the effects of optimizer instability, an extraneous variable that is seldom controlled for, on experimental outcomes, and make recommendations for reporting results more accurately.
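The kind of control the abstract describes can be illustrated concretely. The sketch below (not the paper's released code; the scores and helper names are hypothetical) averages a metric such as BLEU over several optimizer replications run with different random seeds, and applies a paired approximate randomization test to the replication-level scores, so that a reported gain is not an artifact of a single lucky tuning run.

```python
import random
import statistics

def summarize(scores):
    """Mean and sample standard deviation of metric scores
    across optimizer replications (different random seeds)."""
    return statistics.mean(scores), statistics.stdev(scores)

def randomization_test(baseline, system, trials=10000, seed=0):
    """Two-sided paired approximate randomization test on
    replication-level scores: under the null hypothesis the two
    labels are exchangeable within each replication pair, so we
    randomly swap them and see how often the absolute mean
    difference is at least as large as the observed one."""
    rng = random.Random(seed)
    observed = abs(statistics.mean(system) - statistics.mean(baseline))
    count = 0
    for _ in range(trials):
        a, b = [], []
        for x, y in zip(baseline, system):
            if rng.random() < 0.5:
                x, y = y, x  # swap labels within the pair
            a.append(x)
            b.append(y)
        if abs(statistics.mean(b) - statistics.mean(a)) >= observed:
            count += 1
    # Add-one smoothing keeps the Monte Carlo p-value strictly positive.
    return (count + 1) / (trials + 1)

# Hypothetical BLEU scores from five optimizer replications per system.
baseline_scores = [24.1, 24.5, 23.9, 24.3, 24.2]
system_scores = [24.8, 25.1, 24.6, 25.0, 24.7]

base_mean, base_sd = summarize(baseline_scores)
sys_mean, sys_sd = summarize(system_scores)
p_value = randomization_test(baseline_scores, system_scores)
```

Reporting the mean and standard deviation over replications, plus a significance test that respects the pairing, gives readers a sense of both the size of the improvement and how much of it could be optimizer noise.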




Full paper: http://www.aclweb.org/anthology/P/P11/P11-2031.pdf