Jan A. Botha


2023

pdf bib
FRMT: A Benchmark for Few-Shot Region-Aware Machine Translation
Parker Riley | Timothy Dozat | Jan A. Botha | Xavier Garcia | Dan Garrette | Jason Riesa | Orhan Firat | Noah Constant
Transactions of the Association for Computational Linguistics, Volume 11

We present FRMT, a new dataset and evaluation benchmark for Few-shot Region-aware Machine Translation, a type of style-targeted translation. The dataset consists of professional translations from English into two regional variants each of Portuguese and Mandarin Chinese. Source documents are selected to enable detailed analysis of phenomena of interest, including lexically distinct terms and distractor terms. We explore automatic evaluation metrics for FRMT and validate their correlation with expert human evaluation across both region-matched and mismatched rating scenarios. Finally, we present a number of baseline models for this task, and offer guidelines for how researchers can train, evaluate, and compare their own models. Our dataset and evaluation code are publicly available: https://bit.ly/frmt-task.

2020

pdf bib
Asking without Telling: Exploring Latent Ontologies in Contextual Representations
Julian Michael | Jan A. Botha | Ian Tenney
Proceedings of the 2020 Conference on Empirical Methods in Natural Language Processing (EMNLP)

The success of pretrained contextual encoders, such as ELMo and BERT, has brought a great deal of interest in what these models learn: do they, without explicit supervision, learn to encode meaningful notions of linguistic structure? If so, how is this structure encoded? To investigate this, we introduce latent subclass learning (LSL): a modification to classifier-based probing that induces a latent categorization (or ontology) of the probe’s inputs. Without access to fine-grained gold labels, LSL extracts emergent structure from input representations in an interpretable and quantifiable form. In experiments, we find strong evidence of familiar categories, such as a notion of personhood in ELMo, as well as novel ontological distinctions, such as a preference for fine-grained semantic roles on core arguments. Our results provide unique new evidence of emergent structure in pretrained encoders, including departures from existing annotations which are inaccessible to earlier methods.

pdf bib
Entity Linking in 100 Languages
Jan A. Botha | Zifei Shan | Daniel Gillick
Proceedings of the 2020 Conference on Empirical Methods in Natural Language Processing (EMNLP)

We propose a new formulation for multilingual entity linking, where language-specific mentions resolve to a language-agnostic Knowledge Base. We train a dual encoder in this new setting, building on prior work with improved feature representation, negative mining, and an auxiliary entity-pairing task, to obtain a single entity retrieval model that covers 100+ languages and 20 million entities. The model outperforms state-of-the-art results from a far more limited cross-lingual linking task. Rare entities and low-resource languages pose challenges at this large-scale, so we advocate for an increased focus on zero- and few-shot evaluation. To this end, we provide Mewsli-9, a large new multilingual dataset matched to our setting, and show how frequency-based analysis provided key insights for our model and training enhancements.

2018

pdf bib
Learning To Split and Rephrase From Wikipedia Edit History
Jan A. Botha | Manaal Faruqui | John Alex | Jason Baldridge | Dipanjan Das
Proceedings of the 2018 Conference on Empirical Methods in Natural Language Processing

Split and rephrase is the task of breaking down a sentence into shorter ones that together convey the same meaning. We extract a rich new dataset for this task by mining Wikipedia’s edit history: WikiSplit contains one million naturally occurring sentence rewrites, providing sixty times more distinct split examples and a ninety times larger vocabulary than the WebSplit corpus introduced by Narayan et al. (2017) as a benchmark for this task. Incorporating WikiSplit as training data produces a model with qualitatively better predictions that score 32 BLEU points above the prior best result on the WebSplit benchmark.

2017

pdf bib
Natural Language Processing with Small Feed-Forward Networks
Jan A. Botha | Emily Pitler | Ji Ma | Anton Bakalov | Alex Salcianu | David Weiss | Ryan McDonald | Slav Petrov
Proceedings of the 2017 Conference on Empirical Methods in Natural Language Processing

We show that small and shallow feed-forward neural networks can achieve near state-of-the-art results on a range of unstructured and structured language processing tasks while being considerably cheaper in memory and computational requirements than deep recurrent models. Motivated by resource-constrained environments like mobile phones, we showcase simple techniques for obtaining such small neural network models, and investigate different tradeoffs when deciding how to allocate a small memory budget.

2016

pdf bib
Cross-Lingual Morphological Tagging for Low-Resource Languages
Jan Buys | Jan A. Botha
Proceedings of the 54th Annual Meeting of the Association for Computational Linguistics (Volume 1: Long Papers)

2013

pdf bib
Adaptor Grammars for Learning Non-Concatenative Morphology
Jan A. Botha | Phil Blunsom
Proceedings of the 2013 Conference on Empirical Methods in Natural Language Processing

2012

pdf bib
Hierarchical Bayesian Language Modelling for the Linguistically Informed
Jan A. Botha
Proceedings of the Student Research Workshop at the 13th Conference of the European Chapter of the Association for Computational Linguistics

pdf bib
Bayesian Language Modelling of German Compounds
Jan A. Botha | Chris Dyer | Phil Blunsom
Proceedings of COLING 2012