Graph-based Filtering of Out-of-Vocabulary Words for Encoder-Decoder Models

Satoru Katsumata, Yukio Matsumura, Hayahide Yamagishi, Mamoru Komachi


Abstract
Encoder-decoder models typically employ only words that appear frequently in the training corpus, both to limit computational costs and to exclude noisy words. However, this vocabulary set may still include words that interfere with learning in encoder-decoder models. This paper proposes a method for selecting words that are more suitable for learning encoders by using not only frequency but also co-occurrence information, which we capture with the HITS algorithm. The proposed method is applied to two tasks: machine translation and grammatical error correction. For Japanese-to-English translation, the method achieved a BLEU score 0.56 points higher than that of a baseline. It also outperformed the baseline method on English grammatical error correction, with an F-measure 1.48 points higher.
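The abstract describes ranking candidate vocabulary words with the HITS algorithm over co-occurrence information. The snippet below is a minimal sketch of that idea, assuming a simple directed word co-occurrence graph and plain power-iteration HITS; the graph construction, window size, and all function names are illustrative and are not taken from the paper or the linked repository.

```python
from collections import defaultdict

import numpy as np


def build_cooccurrence_graph(sentences, window=2):
    """Build a directed co-occurrence graph: edge u -> v when v follows u
    within `window` tokens. (Illustrative construction, not the paper's
    exact graph definition.)"""
    vocab, edges = {}, defaultdict(set)
    for tokens in sentences:
        for i, u in enumerate(tokens):
            vocab.setdefault(u, len(vocab))
            for v in tokens[i + 1 : i + 1 + window]:
                vocab.setdefault(v, len(vocab))
                edges[u].add(v)
    return vocab, edges


def hits_authority_scores(vocab, edges, iterations=50):
    """Run HITS power iteration on the graph and return an authority
    score per word; higher-scoring words would be kept in the vocabulary."""
    n = len(vocab)
    adj = np.zeros((n, n))
    for u, targets in edges.items():
        for v in targets:
            adj[vocab[u], vocab[v]] = 1.0
    hub = np.ones(n)
    auth = np.ones(n)
    for _ in range(iterations):
        auth = adj.T @ hub
        auth /= np.linalg.norm(auth) or 1.0
        hub = adj @ auth
        hub /= np.linalg.norm(hub) or 1.0
    return {w: auth[i] for w, i in vocab.items()}


# Toy usage: rank words by authority score and keep the top-k as vocabulary.
sentences = [["the", "cat", "sat"], ["the", "dog", "sat"]]
vocab, edges = build_cooccurrence_graph(sentences)
scores = hits_authority_scores(vocab, edges)
top_k = sorted(scores, key=scores.get, reverse=True)[:3]
print(top_k)
```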
Anthology ID:
P18-3016
Volume:
Proceedings of ACL 2018, Student Research Workshop
Month:
July
Year:
2018
Address:
Melbourne, Australia
Editors:
Vered Shwartz, Jeniya Tabassum, Rob Voigt, Wanxiang Che, Marie-Catherine de Marneffe, Malvina Nissim
Venue:
ACL
Publisher:
Association for Computational Linguistics
Pages:
112–119
URL:
https://aclanthology.org/P18-3016
DOI:
10.18653/v1/P18-3016
Cite (ACL):
Satoru Katsumata, Yukio Matsumura, Hayahide Yamagishi, and Mamoru Komachi. 2018. Graph-based Filtering of Out-of-Vocabulary Words for Encoder-Decoder Models. In Proceedings of ACL 2018, Student Research Workshop, pages 112–119, Melbourne, Australia. Association for Computational Linguistics.
Cite (Informal):
Graph-based Filtering of Out-of-Vocabulary Words for Encoder-Decoder Models (Katsumata et al., ACL 2018)
PDF:
https://aclanthology.org/P18-3016.pdf
Code:
Katsumata420/HITS_Ranking