Graph-based Filtering of Out-of-Vocabulary Words for Encoder-Decoder Models

Satoru Katsumata, Yukio Matsumura, Hayahide Yamagishi, Mamoru Komachi


Abstract
Encoder-decoder models typically employ only words that appear frequently in the training corpus, both to limit computational costs and to exclude noisy words. However, this vocabulary set may still include words that interfere with learning in encoder-decoder models. This paper proposes a method for selecting words that are more suitable for learning encoders by using not only frequency but also co-occurrence information, which we capture with the HITS algorithm. The proposed method is applied to two tasks: machine translation and grammatical error correction. For Japanese-to-English translation, the method achieved a BLEU score 0.56 points higher than that of a baseline. It also outperformed the baseline method on English grammatical error correction, with an F-measure 1.48 points higher.
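The abstract describes ranking candidate vocabulary words with the HITS algorithm over co-occurrence information. The snippet below is a minimal sketch of that idea, assuming a simple directed word co-occurrence graph and plain power-iteration HITS; the graph construction, window size, and all function names are illustrative and are not taken from the paper or the linked repository.

```python
from collections import defaultdict

import numpy as np


def build_cooccurrence_graph(sentences, window=2):
    """Build a directed co-occurrence graph: edge u -> v when v follows u
    within `window` tokens. (Illustrative construction, not the paper's
    exact graph definition.)"""
    vocab, edges = {}, defaultdict(set)
    for tokens in sentences:
        for i, u in enumerate(tokens):
            vocab.setdefault(u, len(vocab))
            for v in tokens[i + 1 : i + 1 + window]:
                vocab.setdefault(v, len(vocab))
                edges[u].add(v)
    return vocab, edges


def hits_authority_scores(vocab, edges, iterations=50):
    """Run HITS power iteration on the graph and return an authority
    score per word; higher-scoring words would be kept in the vocabulary."""
    n = len(vocab)
    adj = np.zeros((n, n))
    for u, targets in edges.items():
        for v in targets:
            adj[vocab[u], vocab[v]] = 1.0
    hub = np.ones(n)
    auth = np.ones(n)
    for _ in range(iterations):
        auth = adj.T @ hub
        auth /= np.linalg.norm(auth) or 1.0
        hub = adj @ auth
        hub /= np.linalg.norm(hub) or 1.0
    return {w: auth[i] for w, i in vocab.items()}


# Toy usage: rank words by authority score and keep the top-k as vocabulary.
sentences = [["the", "cat", "sat"], ["the", "dog", "sat"]]
vocab, edges = build_cooccurrence_graph(sentences)
scores = hits_authority_scores(vocab, edges)
top_k = sorted(scores, key=scores.get, reverse=True)[:3]
print(top_k)
```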
Anthology ID:
P18-3016
Volume:
Proceedings of ACL 2018, Student Research Workshop
Month:
July
Year:
2018
Address:
Melbourne, Australia
Editors:
Vered Shwartz, Jeniya Tabassum, Rob Voigt, Wanxiang Che, Marie-Catherine de Marneffe, Malvina Nissim
Venue:
ACL
Publisher:
Association for Computational Linguistics
Pages:
112–119
URL:
https://aclanthology.org/P18-3016
DOI:
10.18653/v1/P18-3016
Cite (ACL):
Satoru Katsumata, Yukio Matsumura, Hayahide Yamagishi, and Mamoru Komachi. 2018. Graph-based Filtering of Out-of-Vocabulary Words for Encoder-Decoder Models. In Proceedings of ACL 2018, Student Research Workshop, pages 112–119, Melbourne, Australia. Association for Computational Linguistics.
Cite (Informal):
Graph-based Filtering of Out-of-Vocabulary Words for Encoder-Decoder Models (Katsumata et al., ACL 2018)
PDF:
https://aclanthology.org/P18-3016.pdf
Code:
Katsumata420/HITS_Ranking