Language-independent compound splitting with morphological operations

Klaus Macherey (1), Andrew Dai (2), David Talbot (1), Ashok Popat (1), Franz Och (1)
(1) Google Inc., (2) University of Edinburgh


Abstract

Translating compounds is an important problem in machine translation. Because many compounds are never observed during training, they pose a challenge for translation systems. Previous decompounding methods have often been restricted to a small set of languages because they cannot handle more complex compound-forming processes. We present a novel unsupervised method that learns both the compound parts and the morphological operations needed to split compounds into those parts. The morphological operations are learned from a bilingual corpus, while monolingual corpora are used to learn and filter the set of compound part candidates. We evaluate the method on a machine translation task and show significant improvements for various languages, demonstrating the versatility of the approach.
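
To make the idea of splitting with morphological operations concrete, the following is a minimal illustrative sketch, not the method of the paper. It assumes a tiny hand-written German vocabulary (VOCAB) and a hand-written list of linking morphemes (LINKERS); in the paper both the compound parts and the operations are learned from corpora. The names split_compound, VOCAB, LINKERS and MIN_PART_LEN are hypothetical, and the morphological operations are reduced here to deleting a linking morpheme between two parts.

```python
from functools import lru_cache

# Hypothetical toy vocabulary of known compound parts (lower-cased German).
# In the paper, part candidates are learned and filtered from monolingual corpora.
VOCAB = {"arbeit", "amt", "verkehr", "zeichen", "blume", "topf"}

# Hypothetical linking morphemes that may be deleted between two parts.
# In the paper, morphological operations are learned from a bilingual corpus.
LINKERS = ("", "s", "n", "es")

# Assumed minimum part length, to avoid spurious very short parts.
MIN_PART_LEN = 3


def split_compound(word):
    """Split `word` into known parts, preferring the fewest parts.

    Returns the lower-cased word unsplit if no full segmentation exists.
    """
    word = word.lower()

    @lru_cache(maxsize=None)
    def best(start):
        # Shortest tuple of parts covering word[start:], or None if impossible.
        if start == len(word):
            return ()
        candidates = []
        for end in range(start + MIN_PART_LEN, len(word) + 1):
            part = word[start:end]
            if part not in VOCAB:
                continue
            if end == len(word):
                candidates.append((part,))
                continue
            for linker in LINKERS:
                if not word.startswith(linker, end):
                    continue
                nxt = end + len(linker)
                if nxt == len(word):
                    # A linker must be followed by another part.
                    continue
                rest = best(nxt)
                if rest is not None:
                    candidates.append((part,) + rest)
        return min(candidates, key=len) if candidates else None

    parts = best(0)
    return list(parts) if parts else [word]


print(split_compound("Arbeitsamt"))       # linker "s" is dropped
print(split_compound("Verkehrszeichen"))  # linker "s" is dropped
print(split_compound("Blumentopf"))       # linker "n" is dropped
```

Preferring the segmentation with the fewest parts is a simplifying assumption of this sketch; a corpus-based splitter would typically score candidate splits with learned statistics instead.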




Full paper: http://www.aclweb.org/anthology/P/P11/P11-1140.pdf