Fixing Translation Divergences in Parallel Corpora for Neural MT

MinhQuang Pham, Josep Crego, Jean Senellart, François Yvon


Abstract
Corpus-based approaches to machine translation rely on the availability of clean parallel corpora. Such resources are scarce, and because of the automatic processes involved in their preparation, they are often noisy. This paper describes an unsupervised method for detecting translation divergences in parallel sentences. We rely on a neural network that computes cross-lingual sentence similarity scores, which are then used to effectively filter out divergent translations. Furthermore, similarity scores predicted by the network are used to identify and fix some partial divergences, yielding additional parallel segments. We evaluate these methods for English-French and English-German machine translation tasks, and show that using filtered/corrected corpora actually improves MT performance.
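The filtering step described in the abstract can be illustrated with a minimal sketch: score each sentence pair, then keep only pairs whose score reaches a threshold. Everything named here is an assumption made for illustration. `filter_divergent`, `toy_score`, and the threshold value are hypothetical, and `toy_score` is a crude length-ratio stand-in, not the neural similarity model the paper describes.

```python
from typing import Callable, Iterable, List, Tuple

SentencePair = Tuple[str, str]


def filter_divergent(
    pairs: Iterable[SentencePair],
    score: Callable[[str, str], float],
    threshold: float,
) -> Tuple[List[SentencePair], List[SentencePair]]:
    """Split sentence pairs into (kept, rejected) using a similarity threshold."""
    kept: List[SentencePair] = []
    rejected: List[SentencePair] = []
    for src, tgt in pairs:
        # Pairs scoring below the threshold are treated as divergent.
        if score(src, tgt) >= threshold:
            kept.append((src, tgt))
        else:
            rejected.append((src, tgt))
    return kept, rejected


def toy_score(src: str, tgt: str) -> float:
    """Placeholder scorer (assumption): token-count ratio in [0, 1].

    This only stands in for a learned cross-lingual similarity model.
    """
    n_src, n_tgt = len(src.split()), len(tgt.split())
    if max(n_src, n_tgt) == 0:
        return 0.0
    return min(n_src, n_tgt) / max(n_src, n_tgt)


pairs = [
    ("the cat sleeps", "le chat dort"),
    ("hello", "bonjour à tous mes amis ici"),
]
kept, rejected = filter_divergent(pairs, toy_score, threshold=0.5)
```

In practice the scorer would be replaced by a trained model, and the threshold would have to be tuned on held-out data; neither choice is specified here.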
Anthology ID:
D18-1328
Volume:
Proceedings of the 2018 Conference on Empirical Methods in Natural Language Processing
Month:
October-November
Year:
2018
Address:
Brussels, Belgium
Editors:
Ellen Riloff, David Chiang, Julia Hockenmaier, Jun’ichi Tsujii
Venue:
EMNLP
SIG:
SIGDAT
Publisher:
Association for Computational Linguistics
Pages:
2967–2973
URL:
https://aclanthology.org/D18-1328
DOI:
10.18653/v1/D18-1328
Cite (ACL):
MinhQuang Pham, Josep Crego, Jean Senellart, and François Yvon. 2018. Fixing Translation Divergences in Parallel Corpora for Neural MT. In Proceedings of the 2018 Conference on Empirical Methods in Natural Language Processing, pages 2967–2973, Brussels, Belgium. Association for Computational Linguistics.
Cite (Informal):
Fixing Translation Divergences in Parallel Corpora for Neural MT (Pham et al., EMNLP 2018)
PDF:
https://aclanthology.org/D18-1328.pdf
Code:
jmcrego/similarity
Data:
OpenSubtitles