Identifying Word Translations from Comparable Corpora Using Latent Topic Models

Ivan Vulić,  Wim De Smet,  Marie-Francine Moens
K.U. Leuven, Department of Computer Science, Leuven, Belgium


Abstract

A topic model outputs, for each topic, a multinomial distribution over words. In this paper, we investigate the value of bilingual topic models, specifically a bilingual Latent Dirichlet Allocation model, for finding translations of terms in comparable corpora without using any linguistic resources. Experiments on a document-aligned English-Italian Wikipedia corpus confirm that the developed methods, which use only knowledge from word-topic distributions, outperform methods based on similarity measures in the original word-document space. We also report the best results, which are obtained by combining knowledge from word-topic distributions with similarity measures in the original space.
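
As a rough illustration of how word-topic distributions can be used to propose translations, the following minimal Python sketch ranks target-language words by how closely their topic profiles match that of a source word. It rests on assumptions that the abstract does not state: both languages share one set of topics, a word's profile is its per-topic probability normalized across topics (which corresponds to P(topic | word) under a uniform topic prior), and profiles are compared with cosine similarity. The function names and the toy probabilities are hypothetical and are not taken from the paper.

```python
import math


def topic_profile(word, phi):
    """Return the word's normalized per-topic scores.

    phi is a list with one dict per topic, mapping word -> P(word | topic).
    Normalizing across topics assumes a uniform prior over topics.
    """
    scores = [topic.get(word, 0.0) for topic in phi]
    total = sum(scores)
    return [s / total for s in scores] if total > 0 else scores


def cosine(u, v):
    """Cosine similarity of two equal-length vectors (0.0 if either is all zeros)."""
    dot = sum(a * b for a, b in zip(u, v))
    norm_u = math.sqrt(sum(a * a for a in u))
    norm_v = math.sqrt(sum(b * b for b in v))
    return dot / (norm_u * norm_v) if norm_u and norm_v else 0.0


def rank_translations(source_word, phi_src, phi_tgt, top_n=3):
    """Rank target-language words by topic-profile similarity to source_word."""
    src = topic_profile(source_word, phi_src)
    vocab = {w for topic in phi_tgt for w in topic}
    scored = [(w, cosine(src, topic_profile(w, phi_tgt))) for w in vocab]
    # Highest similarity first; ties are broken alphabetically for determinism.
    return sorted(scored, key=lambda pair: (-pair[1], pair[0]))[:top_n]


# Toy per-topic word distributions over two shared topics (made-up numbers).
phi_en = [{"dog": 0.6, "cat": 0.3, "bank": 0.1}, {"bank": 0.7, "money": 0.3}]
phi_it = [{"cane": 0.6, "gatto": 0.3, "banca": 0.1}, {"banca": 0.7, "denaro": 0.3}]

if __name__ == "__main__":
    for word, score in rank_translations("bank", phi_en, phi_it):
        print(f"{word}\t{score:.3f}")
```

In this toy setup, words that concentrate on the same topics in both languages score highest. Words with identical profiles cannot be told apart by this measure alone, which is one reason a method might also draw on similarity in the original word-document space.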




Full paper: http://www.aclweb.org/anthology/P/P11/P11-2084.pdf