Clustering Comparable Corpora For Bilingual Lexicon Extraction

Bo Li (1), Eric Gaussier (1), Akiko Aizawa (2)
(1) UJF-Grenoble 1 / CNRS, France; (2) National Institute of Informatics, Tokyo, Japan


Abstract

We study the problem of enhancing the comparability of bilingual corpora in order to improve the quality of bilingual lexicons extracted from comparable corpora. We introduce a clustering-based approach for enhancing corpus comparability that exploits the homogeneity of the corpus and preserves most of the vocabulary of the original corpus. Our experiments illustrate the well-foundedness of this method and show that the bilingual lexicons obtained from the homogeneous corpus are of better quality than the lexicons obtained with previous approaches.




Full paper: http://www.aclweb.org/anthology/P/P11/P11-2083.pdf