Sanskrit Word Segmentation Using Character-level Recurrent and Convolutional Neural Networks

Oliver Hellwig, Sebastian Nehrdich


Abstract
The paper introduces end-to-end neural network models that tokenize Sanskrit by jointly splitting compounds and resolving phonetic merges (Sandhi). Tokenization of Sanskrit depends on local phonetic and distant semantic features, which are incorporated using convolutional and recurrent elements. Contrary to most previous systems, our models require neither feature engineering nor external linguistic resources, but operate solely on parallel versions of raw and segmented text. The models discussed in this paper clearly improve over previous approaches to Sanskrit word segmentation. Since they are language agnostic, we also demonstrate that they outperform the state of the art for the related task of German compound splitting.
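
To make the described setup concrete, the following is a minimal, hypothetical sketch (in PyTorch) of a character-level tagger that combines a convolutional layer for local phonetic context with a bidirectional LSTM for longer-range context, predicting one segmentation decision per input character. Layer sizes, the label inventory, and the toy data are illustrative assumptions and do not reproduce the authors' published architecture.

    # Hypothetical sketch: character-level CNN + BiLSTM tagger for segmentation.
    # Hyperparameters and the label set are illustrative, not the paper's configuration.
    import torch
    import torch.nn as nn

    class CharSegmenter(nn.Module):
        """Tags each input character with a segmentation decision
        (e.g. copy, insert a word boundary, apply a Sandhi rewrite)."""

        def __init__(self, vocab_size, num_labels, emb_dim=64, conv_dim=128, rnn_dim=128):
            super().__init__()
            self.embed = nn.Embedding(vocab_size, emb_dim, padding_idx=0)
            # Convolution over character embeddings captures local phonetic context.
            self.conv = nn.Conv1d(emb_dim, conv_dim, kernel_size=5, padding=2)
            # Bidirectional LSTM captures longer-range, sentence-level context.
            self.rnn = nn.LSTM(conv_dim, rnn_dim, batch_first=True, bidirectional=True)
            self.out = nn.Linear(2 * rnn_dim, num_labels)

        def forward(self, char_ids):                      # (batch, seq_len)
            x = self.embed(char_ids)                      # (batch, seq_len, emb_dim)
            x = torch.relu(self.conv(x.transpose(1, 2)))  # (batch, conv_dim, seq_len)
            x, _ = self.rnn(x.transpose(1, 2))            # (batch, seq_len, 2*rnn_dim)
            return self.out(x)                            # per-character label scores

    # Toy usage: 40-symbol character vocabulary, 10 segmentation labels.
    model = CharSegmenter(vocab_size=40, num_labels=10)
    chars = torch.randint(1, 40, (2, 25))                 # two sequences of 25 characters
    logits = model(chars)                                  # shape: (2, 25, 10)
    loss = nn.CrossEntropyLoss()(logits.reshape(-1, 10), torch.randint(0, 10, (2 * 25,)))

In such a setup, training data would consist of parallel raw and segmented text, with per-character labels derived by aligning the two versions; no hand-crafted features or external lexica are needed.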
Anthology ID:
D18-1295
Volume:
Proceedings of the 2018 Conference on Empirical Methods in Natural Language Processing
Month:
October-November
Year:
2018
Address:
Brussels, Belgium
Editors:
Ellen Riloff, David Chiang, Julia Hockenmaier, Jun’ichi Tsujii
Venue:
EMNLP
SIG:
SIGDAT
Publisher:
Association for Computational Linguistics
Pages:
2754–2763
URL:
https://aclanthology.org/D18-1295
DOI:
10.18653/v1/D18-1295
Cite (ACL):
Oliver Hellwig and Sebastian Nehrdich. 2018. Sanskrit Word Segmentation Using Character-level Recurrent and Convolutional Neural Networks. In Proceedings of the 2018 Conference on Empirical Methods in Natural Language Processing, pages 2754–2763, Brussels, Belgium. Association for Computational Linguistics.
Cite (Informal):
Sanskrit Word Segmentation Using Character-level Recurrent and Convolutional Neural Networks (Hellwig & Nehrdich, EMNLP 2018)
PDF:
https://aclanthology.org/D18-1295.pdf