Automatic Glossing in a Low-Resource Setting for Language Documentation

Sarah Moeller, Mans Hulden


Abstract
Morphological analysis of morphologically rich and low-resource languages is important to both descriptive linguistics and natural language processing. Field documentary efforts usually procure analyzed data in cooperation with native speakers who are capable of providing some level of linguistic information. Manually annotating such data is very expensive, and the traditional process is arguably too slow in the face of language endangerment and loss. We report on a case study of learning to automatically gloss a Nakh-Daghestanian language, Lezgi, from a very small amount of seed data. We compare a conditional random field-based sequence labeler and a neural encoder-decoder model and show that an F1-score of nearly 0.9 on labeled accuracy of morphemes can be achieved with 3,000 words of transcribed oral text. Errors are mostly limited to morphemes with high allomorphy. These results are potentially useful for developing rapid annotation and fieldwork tools to support documentation of morphologically rich, endangered languages.
Anthology ID:
W18-4809
Volume:
Proceedings of the Workshop on Computational Modeling of Polysynthetic Languages
Month:
August
Year:
2018
Address:
Santa Fe, New Mexico, USA
Editor:
Judith L. Klavans
Venue:
PYLO
Publisher:
Association for Computational Linguistics
Pages:
84–93
URL:
https://aclanthology.org/W18-4809
Cite (ACL):
Sarah Moeller and Mans Hulden. 2018. Automatic Glossing in a Low-Resource Setting for Language Documentation. In Proceedings of the Workshop on Computational Modeling of Polysynthetic Languages, pages 84–93, Santa Fe, New Mexico, USA. Association for Computational Linguistics.
Cite (Informal):
Automatic Glossing in a Low-Resource Setting for Language Documentation (Moeller & Hulden, PYLO 2018)
PDF:
https://aclanthology.org/W18-4809.pdf