Aaron Smith


2018

82 Treebanks, 34 Models: Universal Dependency Parsing with Multi-Treebank Models
Aaron Smith | Bernd Bohnet | Miryam de Lhoneux | Joakim Nivre | Yan Shao | Sara Stymne
Proceedings of the CoNLL 2018 Shared Task: Multilingual Parsing from Raw Text to Universal Dependencies

We present the Uppsala system for the CoNLL 2018 Shared Task on universal dependency parsing. Our system is a pipeline consisting of three components: the first performs joint word and sentence segmentation; the second predicts part-of-speech tags and morphological features; the third predicts dependency trees from words and tags. Instead of training a single parsing model for each treebank, we trained models with multiple treebanks for one language or closely related languages, greatly reducing the number of models. On the official test run, we ranked 7th of 27 teams for the LAS and MLAS metrics. Our system obtained the best scores overall for word segmentation, universal POS tagging, and morphological features.

An Investigation of the Interactions Between Pre-Trained Word Embeddings, Character Models and POS Tags in Dependency Parsing
Aaron Smith | Miryam de Lhoneux | Sara Stymne | Joakim Nivre
Proceedings of the 2018 Conference on Empirical Methods in Natural Language Processing

We provide a comprehensive analysis of the interactions between pre-trained word embeddings, character models and POS tags in a transition-based dependency parser. While previous studies have shown POS information to be less important in the presence of character models, we show that in fact there are complex interactions between all three techniques. In isolation each produces large improvements over a baseline system using randomly initialised word embeddings only, but combining them quickly leads to diminishing returns. We categorise words by frequency, POS tag and language in order to systematically investigate how each of the techniques affects parsing quality. For many word categories, applying any two of the three techniques is almost as good as the full combined system. Character models tend to be more important for low-frequency open-class words, especially in morphologically rich languages, while POS tags can help disambiguate high-frequency function words. We also show that large character embedding sizes help even for languages with small character sets, especially in morphologically rich languages.

Parser Training with Heterogeneous Treebanks
Sara Stymne | Miryam de Lhoneux | Aaron Smith | Joakim Nivre
Proceedings of the 56th Annual Meeting of the Association for Computational Linguistics (Volume 2: Short Papers)

How to make the most of multiple heterogeneous treebanks when training a monolingual dependency parser is an open question. We start by investigating previously suggested, but little evaluated, strategies for exploiting multiple treebanks based on concatenating training sets, with or without fine-tuning. We go on to propose a new method based on treebank embeddings. We perform experiments for several languages and show that in many cases fine-tuning and treebank embeddings lead to substantial improvements over single treebanks or concatenation, with average gains of 2.0–3.5 LAS points. We argue that treebank embeddings should be preferred due to their conceptual simplicity, flexibility and extensibility.

2016

Climbing Mont BLEU: The Strange World of Reachable High-BLEU Translations
Aaron Smith | Christian Hardmeier | Joerg Tiedemann
Proceedings of the 19th Annual Conference of the European Association for Machine Translation

2015

A Multiword Expression Data Set: Annotating Non-Compositionality and Conventionalization for English Noun Compounds
Meghdad Farahmand | Aaron Smith | Joakim Nivre
Proceedings of the 11th Workshop on Multiword Expressions

2014

ParCor 1.0: A Parallel Pronoun-Coreference Corpus to Support Statistical MT
Liane Guillou | Christian Hardmeier | Aaron Smith | Jörg Tiedemann | Bonnie Webber
Proceedings of the Ninth International Conference on Language Resources and Evaluation (LREC'14)

We present ParCor, a parallel corpus of texts in which pronoun coreference (reduced coreference in which pronouns are used as referring expressions) has been annotated. The corpus is intended to be used both as a resource from which to learn systematic differences in pronoun use between languages and, ultimately, for developing and testing informed Statistical Machine Translation systems aimed at addressing the problem of pronoun coreference in translation. At present, the corpus consists of a collection of parallel English-German documents from two different text genres: TED Talks (transcribed planned speech) and EU Bookshop publications (written text). All documents in the corpus have been manually annotated with respect to the type and location of each pronoun and, where relevant, its antecedent. We provide details of the texts that we selected, the guidelines and tools used to support annotation, and some corpus statistics. The texts in the corpus have already been translated into many languages, and we plan to expand the corpus into these other languages, as well as other genres, in the future.

Breaking Bad: Extraction of Verb-Particle Constructions from a Parallel Subtitles Corpus
Aaron Smith
Proceedings of the 10th Workshop on Multiword Expressions (MWE)

Anaphora Models and Reordering for Phrase-Based SMT
Christian Hardmeier | Sara Stymne | Jörg Tiedemann | Aaron Smith | Joakim Nivre
Proceedings of the Ninth Workshop on Statistical Machine Translation