The Galactic Dependencies Treebanks: Getting More Data by Synthesizing New Languages

Dingquan Wang, Jason Eisner


Abstract
We release Galactic Dependencies 1.0, a large set of synthetic languages not found on Earth, but annotated in Universal Dependencies format. This new resource is intended to provide training and development data for NLP methods that aim to adapt to unfamiliar languages. Each synthetic treebank is produced from a real treebank by stochastically permuting the dependents of nouns and/or verbs to match the word order of other real languages. We discuss the usefulness, realism, parsability, perplexity, and diversity of the synthetic languages. As a simple demonstration of the use of Galactic Dependencies, we consider single-source transfer, which attempts to parse a real target language using a parser trained on a “nearby” source language. We find that including synthetic source languages somewhat increases the diversity of the source pool, which significantly improves results for most target languages.
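To make the permutation idea in the abstract concrete, the following is a minimal sketch in Python, not the authors' implementation. It assumes a sentence is given as a list of dictionaries with CoNLL-U-like fields (`id`, `form`, `upos`, `head`), and it reorders each selected head together with its dependent subtrees by a uniform random shuffle. The uniform shuffle is a stand-in: the abstract says the reordering is chosen to match the word order of other real languages, which this sketch does not attempt. The function name `permute_treebank_sentence`, the decision to let the head move along with its dependents, and the toy sentence are all assumptions made for illustration.

```python
import random


def permute_treebank_sentence(tokens, permute_upos, rng):
    """Return the word forms of a sentence after reordering selected subtrees.

    tokens: list of dicts with 'id', 'form', 'upos', 'head' (head 0 = root).
    permute_upos: set of UPOS tags whose head-plus-dependents order is shuffled.
    rng: a random.Random instance (uniform shuffle; an assumption of this sketch).
    """
    by_id = {tok["id"]: tok for tok in tokens}
    children = {}
    for tok in tokens:
        children.setdefault(tok["head"], []).append(tok["id"])

    def linearize(node_id):
        # The head and its direct dependents, in original surface order.
        units = sorted(children.get(node_id, []) + [node_id])
        if by_id[node_id]["upos"] in permute_upos:
            rng.shuffle(units)
        order = []
        for unit in units:
            if unit == node_id:
                order.append(node_id)
            else:
                # Each dependent subtree stays contiguous.
                order.extend(linearize(unit))
        return order

    order = []
    for root in children.get(0, []):
        order.extend(linearize(root))
    return [by_id[i]["form"] for i in order]


# Toy sentence (hypothetical, not taken from any treebank).
sentence = [
    {"id": 1, "form": "the", "upos": "DET", "head": 2},
    {"id": 2, "form": "dog", "upos": "NOUN", "head": 3},
    {"id": 3, "form": "chased", "upos": "VERB", "head": 0},
    {"id": 4, "form": "a", "upos": "DET", "head": 5},
    {"id": 5, "form": "cat", "upos": "NOUN", "head": 3},
]

# Shuffle only the verb's head-plus-dependents order; noun phrases keep their order.
permuted = permute_treebank_sentence(sentence, {"VERB"}, random.Random(0))
print(" ".join(permuted))
```

Passing `{"NOUN"}` or `{"NOUN", "VERB"}` as `permute_upos` mirrors the "nouns and/or verbs" choice in the abstract. Because every subtree is emitted as a contiguous block, the dependency structure is unchanged and only the surface order differs.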
Anthology ID:
Q16-1035
Volume:
Transactions of the Association for Computational Linguistics, Volume 4
Year:
2016
Address:
Cambridge, MA
Editors:
Lillian Lee, Mark Johnson, Kristina Toutanova
Venue:
TACL
Publisher:
MIT Press
Pages:
491–505
URL:
https://aclanthology.org/Q16-1035
DOI:
10.1162/tacl_a_00113
Cite (ACL):
Dingquan Wang and Jason Eisner. 2016. The Galactic Dependencies Treebanks: Getting More Data by Synthesizing New Languages. Transactions of the Association for Computational Linguistics, 4:491–505.
Cite (Informal):
The Galactic Dependencies Treebanks: Getting More Data by Synthesizing New Languages (Wang & Eisner, TACL 2016)
PDF:
https://aclanthology.org/Q16-1035.pdf
Video:
https://aclanthology.org/Q16-1035.mp4
Code:
gdtreebank/gdtreebank
Data:
Universal Dependencies