Jonas Kuhn


2019

Learning the Dyck Language with Attention-based Seq2Seq Models
Xiang Yu | Ngoc Thang Vu | Jonas Kuhn

The generalized Dyck language has been used to analyze the ability of Recurrent Neural Networks (RNNs) to learn context-free grammars (CFGs). Recent studies draw conflicting conclusions on their performance, especially regarding the generalizability of the models with respect to the depth of recursion. In this paper, we revisit several common models and experimental settings and discuss potential problems with the tasks and analyses. Furthermore, we explore the use of attention mechanisms within the seq2seq framework to learn the Dyck language, which could compensate for the limited encoding ability of RNNs. Our findings reveal that attention mechanisms still cannot truly generalize over the recursion depth, although they perform much better than other models on the closing bracket tagging task. This also suggests that this commonly used task is not sufficient to test a model’s understanding of CFGs.
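For readers unfamiliar with the terms, the following is a minimal illustrative sketch of what membership in a generalized Dyck language, recursion depth, and the expected closing bracket mean. It is not code from the paper. The three bracket pairs and the function names `dyck_depth` and `next_closer` are assumptions chosen for illustration.

```python
# Illustrative sketch only: bracket inventory and function names are assumptions.
CLOSER = {"(": ")", "[": "]", "{": "}"}
OPENER = {close: open_ for open_, close in CLOSER.items()}


def dyck_depth(s):
    """Return the maximum nesting depth if s is well-bracketed, else None."""
    stack = []
    depth = 0
    for ch in s:
        if ch in CLOSER:
            # Opening bracket: push it and track the deepest nesting seen.
            stack.append(ch)
            depth = max(depth, len(stack))
        elif ch in OPENER:
            # Closing bracket: it must match the innermost open bracket.
            if not stack or stack.pop() != OPENER[ch]:
                return None
        else:
            return None  # symbol outside the bracket alphabet
    return depth if not stack else None


def next_closer(prefix):
    """Return the closing bracket that matches the innermost unclosed
    opening bracket of a valid prefix, or None if nothing is open or the
    prefix is already invalid."""
    stack = []
    for ch in prefix:
        if ch in CLOSER:
            stack.append(ch)
        elif stack and CLOSER[stack[-1]] == ch:
            stack.pop()
        else:
            return None
    return CLOSER[stack[-1]] if stack else None
```

A stack-based checker like this handles any depth by construction, which is the behaviour a model would need to reproduce in order to generalize over recursion depth.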

The (Non-)Utility of Structural Features in BiLSTM-based Dependency Parsers
Agnieszka Falenska | Jonas Kuhn

Classical non-neural dependency parsers put considerable effort into the design of feature functions. In particular, they benefit from structural features, such as features drawn from neighboring tokens in the dependency tree. In contrast, their BiLSTM-based successors achieve state-of-the-art performance without explicit information about the structural context. In this paper we aim to answer the question: How much structural context are the BiLSTM representations able to capture implicitly? We show that features drawn from partial subtrees become redundant when BiLSTMs are used. We provide deep insight into the information flow in transition- and graph-based neural architectures to demonstrate where the implicit information comes from when the parsers make their decisions. Finally, with model ablations we demonstrate that the structural context is not only present in the models but also significantly influences their performance.

Who Sides with Whom? Towards Computational Construction of Discourse Networks for Political Debates
Sebastian Padó | Andre Blessing | Nico Blokker | Erenay Dayanik | Sebastian Haunss | Jonas Kuhn

Understanding the structures of political debates (which actors make what claims) is essential for understanding democratic political decision making. The vision of computational construction of such discourse networks from newspaper reports brings together political science and natural language processing. This paper presents three contributions towards this goal: (a) a requirements analysis, linking the task to knowledge base population; (b) an annotated pilot corpus of migration claims based on German newspaper reports; (c) initial modeling results.

An Environment for Relational Annotation of Political Debates
Andre Blessing | Nico Blokker | Sebastian Haunss | Jonas Kuhn | Gabriella Lapesa | Sebastian Padó

This paper describes the MARDY corpus annotation environment developed for a collaboration between political science and computational linguistics. The tool realizes the complete workflow necessary for annotating a large newspaper text collection with rich information about claims (demands) raised by politicians and other actors, including claim and actor spans, relations, and polarities. In addition to the annotation GUI, the tool supports the identification of relevant documents, text pre-processing, user management, integration of external knowledge bases, annotation comparison and merging, statistical analysis, and the incorporation of machine learning models as “pseudo-annotators”.

2018

Bridging resolution: Task definition, corpus resources and rule-based experiments
Ina Roesiger | Arndt Riester | Jonas Kuhn

Recent work on bridging resolution has so far been based on the corpus ISNotes (Markert et al. 2012), as this was the only corpus available with unrestricted bridging annotation. Hou et al. 2014’s rule-based system currently achieves state-of-the-art performance on this corpus, as learning-based approaches suffer from the lack of available training data. Recently, a number of new corpora with bridging annotations have become available. To test the generalisability of the approach by Hou et al. 2014, we apply a slightly extended rule-based system to these corpora. Besides the expected out-of-domain effects, we also observe low performance on some of the in-domain corpora. Our analysis shows that this is the result of two very different phenomena being defined as bridging, namely referential and lexical bridging. We also report that filtering out gold or predicted coreferent anaphors before applying the bridging resolution system helps improve bridging resolution.

NLATool: an Application for Enhanced Deep Text Understanding
Markus Gärtner | Sven Mayer | Valentin Schwind | Eric Hämmerle | Emine Turcan | Florin Rheinwald | Gustav Murawski | Lars Lischke | Jonas Kuhn

Today, we see an ever-growing number of tools supporting text annotation. Each of these tools is optimized for specific use cases such as named entity recognition. At the same time, we see large and growing knowledge bases such as Wikipedia or the Google Knowledge Graph. In this paper, we introduce NLATool, a web application developed using a human-centered design process. The application combines support for text annotation with enrichment of the text using additional information from a number of sources, directly within the application. The tool helps users efficiently recognize named entities and annotate text, and automatically provides additional information while they solve deep text understanding tasks.

A Lightweight Modeling Middleware for Corpus Processing
Markus Gärtner | Jonas Kuhn

Moving TIGER beyond Sentence-Level
Agnieszka Falenska | Kerstin Eckart | Jonas Kuhn

German Radio Interviews: The GRAIN Release of the SFB732 Silver Standard Collection
Katrin Schweitzer | Kerstin Eckart | Markus Gärtner | Agnieszka Falenska | Arndt Riester | Ina Rösiger | Antje Schweitzer | Sabrina Stehwien | Jonas Kuhn

Supervised Rhyme Detection with Siamese Recurrent Networks
Thomas Haider | Jonas Kuhn

We present the first supervised approach to rhyme detection with Siamese Recurrent Networks (SRNs), which offer near-perfect performance (97% accuracy) with a single model on rhyme pairs for German, English and French, allowing future large-scale analyses. SRNs learn a similarity metric on variable-length character sequences that can be used as a judgement on the distance of imperfect rhyme pairs and for binary classification. For training, we construct a diachronically balanced rhyme gold standard of New High German (NHG) poetry. For further testing, we sample a second collection of NHG poetry and a set of contemporary Hip-Hop lyrics, annotated for rhyme and assonance. We train several high-performing SRN models and evaluate them qualitatively on selected sonnets.

Approximate Dynamic Oracle for Dependency Parsing with Reinforcement Learning
Xiang Yu | Ngoc Thang Vu | Jonas Kuhn

We present a general approach with reinforcement learning (RL) to approximate dynamic oracles for transition systems where exact dynamic oracles are difficult to derive. We treat oracle parsing as a reinforcement learning problem, design the reward function inspired by the classical dynamic oracle, and use Deep Q-Learning (DQN) techniques to train the oracle with gold trees as features. The combination of a priori knowledge and data-driven methods enables an efficient dynamic oracle, which improves the parser performance over static oracles in several transition systems.

Polyglot Semantic Parsing in APIs
Kyle Richardson | Jonathan Berant | Jonas Kuhn

Traditional approaches to semantic parsing (SP) work by training individual models for each available parallel dataset of text-meaning pairs. In this paper, we explore the idea of polyglot semantic translation, or learning semantic parsing models that are trained on multiple datasets and natural languages. In particular, we focus on translating text to code signature representations using the software component datasets of Richardson and Kuhn (2017b,a). The advantage of such models is that they can be used for parsing a wide variety of input natural languages and output programming languages, or mixed input languages, using a single unified model. To facilitate modeling of this type, we develop a novel graph-based decoding framework that achieves state-of-the-art performance on the above datasets, and apply this method to two other benchmark SP tasks.

2017

IMS at the CoNLL 2017 UD Shared Task: CRFs and Perceptrons Meet Neural Networks
Anders Björkelund | Agnieszka Falenska | Xiang Yu | Jonas Kuhn

This paper presents the IMS contribution to the CoNLL 2017 Shared Task. In the preprocessing step we employed a CRF POS/morphological tagger and a neural tagger predicting supertags. On some languages, we also applied word segmentation with the CRF tagger and sentence segmentation with a perceptron-based parser. For parsing we took an ensemble approach by blending multiple instances of three parsers with very different architectures. Our system achieved third place overall and second place for the surprise languages.

Learning Semantic Correspondences in Technical Documentation
Kyle Richardson | Jonas Kuhn

We consider the problem of translating high-level textual descriptions to formal representations in technical documentation as part of an effort to model the meaning of such documentation. We focus specifically on the problem of learning translational correspondences between text descriptions and grounded representations in the target documentation, such as formal representation of functions or code templates. Our approach exploits the parallel nature of such documentation, or the tight coupling between high-level text and the low-level representations we aim to learn. Data is collected by mining technical documents for such parallel text-representation pairs, which we use to train a simple semantic parsing model. We report new baseline results on sixteen novel datasets, including the standard library documentation for nine popular programming languages across seven natural languages, and a small collection of Unix utility manuals.

The Code2Text Challenge: Text Generation in Source Libraries
Kyle Richardson | Sina Zarrieß | Jonas Kuhn

We propose a new shared task for tactical data-to-text generation in the domain of source code libraries. Specifically, we focus on text generation of function descriptions from example software projects. Data is drawn from existing resources used for studying the related problem of semantic parser induction, and spans a wide variety of both natural languages and programming languages. In this paper, we describe these existing resources, which will serve as training and development data for the task, and discuss plans for building new independent test sets.

Multi-modular domain-tailored OCR post-correction
Sarah Schulz | Jonas Kuhn

One of the main obstacles for many Digital Humanities projects is low data availability. Texts have to be digitized in an expensive and time-consuming process in which Optical Character Recognition (OCR) post-correction is one of the time-critical factors. Using the example of OCR post-correction, we show the adaptation of a generic system to solve a specific problem with little data. The system accounts for a diversity of errors encountered in OCRed texts coming from different time periods in the domain of literature. We show that the combination of different approaches, such as Statistical Machine Translation and spell checking, with the help of a ranking mechanism tremendously improves over the individual approaches. Since we consider the accessibility of the resulting tool a crucial part of Digital Humanities collaborations, we describe the workflow we suggest for efficient text recognition and subsequent automatic and manual post-correction.

Function Assistant: A Tool for NL Querying of APIs
Kyle Richardson | Jonas Kuhn

In this paper, we describe Function Assistant, a lightweight Python-based toolkit for querying and exploring source code repositories using natural language. The toolkit is designed to help end-users of a target API quickly find information about functions through high-level natural language queries, or descriptions. For a given text query and background API, the tool finds candidate functions by performing a translation from the text to known representations in the API using the semantic parsing approach of Richardson and Kuhn (2017). Translations are automatically learned from example text-code pairs in example APIs. The toolkit includes features for building translation pipelines and query engines for arbitrary source code projects. To explore this last feature, we perform new experiments on 27 well-known Python projects hosted on GitHub.

2016

How to Train Dependency Parsers with Inexact Search for Joint Sentence Boundary Detection and Parsing of Entire Documents
Anders Björkelund | Agnieszka Faleńska | Wolfgang Seeker | Jonas Kuhn

Flexible and Reliable Text Analytics in the Digital Humanities – Some Methodological Considerations
Jonas Kuhn

The availability of Language Technology Resources and Tools generates a considerable methodological potential in the Digital Humanities: aspects of research questions from the Humanities and Social Sciences can be addressed on text collections in ways that were unavailable to traditional approaches. I start this talk by sketching some sample scenarios of Digital Humanities projects which involve various Humanities and Social Science disciplines, noting that the potential for a meaningful contribution to higher-level questions is highest when the employed language technological models are carefully tailored both (a) to characteristics of the given target corpus, and (b) to relevant analytical subtasks feeding the discipline-specific research questions. Keeping up a multidisciplinary perspective, I then point out a recurrent dilemma in Digital Humanities projects that follow the conventional set-up of collaboration: to build high-quality computational models for the data, fixed analytical targets should be specified as early as possible – but to be able to respond to Humanities questions as they evolve over the course of analysis, the analytical machinery should be kept maximally flexible. To reach both, I argue for a novel collaborative culture that rests on a more interleaved, continuous dialogue. (Re-)Specification of analytical targets should be an ongoing process in which the Humanities Scholars and Social Scientists play a role that is as important as the Computational Scientists’ role. A promising approach lies in the identification of re-occurring types of analytical subtasks, beyond linguistic standard tasks, which can form building blocks for text analysis across disciplines, and for which corpus-based characterizations (viz. annotations) can be collected, compared and revised. On such grounds, computational modeling is more directly tied to the evolving research questions, and hence the seemingly opposing needs of reliable target specifications vs. “malleable” frameworks of analysis can be reconciled. Experimental work following this approach is under way in the Center for Reflected Text Analytics (CRETA) in Stuttgart.

IMS HotCoref DE: A Data-driven Co-reference Resolver for German
Ina Roesiger | Jonas Kuhn

This paper presents a data-driven co-reference resolution system for German that has been adapted from IMS HotCoref, a co-reference resolver for English. It describes the difficulties when resolving co-reference in German text, the adaptation process and the features designed to address linguistic challenges brought forth by German. We report performance on the reference dataset TüBa-D/Z and include a post-task SemEval 2010 evaluation, showing that the resolver achieves state-of-the-art performance. We also include ablation experiments that indicate that integrating linguistic features improves results. The paper also describes the steps and the format necessary to use the resolver on new texts. The tool is freely available for download.

Learning from Within? Comparing PoS Tagging Approaches for Historical Text
Sarah Schulz | Jonas Kuhn

In this paper, we investigate unsupervised and semi-supervised methods for part-of-speech (PoS) tagging in the context of historical German text. We locate our research in the context of Digital Humanities, where the non-canonical nature of text causes issues in a Natural Language Processing world in which tools are mainly trained on standard data. Data deviating from the norm requires tools adjusted to this data. We explore to what extent the availability of such training material and resources related to it influences the accuracy of PoS tagging. We investigate a variety of algorithms including neural nets, conditional random fields and self-learning techniques in order to find the best-fitted approach to tackle data sparsity. Although methods using resources from related languages outperform weakly supervised methods using just a few training examples, we can still reach a promising accuracy with methods abstaining from additional resources.

Learning to Make Inferences in a Semantic Parsing Task
Kyle Richardson | Jonas Kuhn

We introduce a new approach to training a semantic parser that uses textual entailment judgements as supervision. These judgements are based on high-level inferences about whether the meaning of one sentence follows from another. When applied to an existing semantic parsing task, they prove to be a useful tool for revealing semantic distinctions and background knowledge not captured in the target representations. This information is used to improve the quality of the semantic representations being learned and to acquire generic knowledge for reasoning. Experiments are done on the benchmark Sportscaster corpus (Chen and Mooney, 2008), and a novel RTE-inspired inference dataset is introduced. On this new dataset our method strongly outperforms several strong baselines. Separately, we obtain state-of-the-art results on the original Sportscaster semantic parsing task.

Named Entity Disambiguation for little known referents: a topic-based approach
Andrea Glaser | Jonas Kuhn

We propose an approach to Named Entity Disambiguation that avoids a problem of standard work on the task (likewise affecting fully supervised, weakly supervised, or distantly supervised machine learning techniques): the treatment of name mentions referring to people with no (or very little) coverage in the textual training data is systematically incorrect. We propose to indirectly take into account the property information for the “non-prominent” name bearers, such as nationality and profession (e.g., for a Canadian law professor named Michael Jackson, with no Wikipedia article, it is very hard to obtain reliable textual training data). The target property information for the entities is directly available from name authority files, or inferrable, e.g., from listings of sportspeople etc. Our proposed approach employs topic modeling to exploit textual training data based on entities sharing the relevant properties. In experiments with a pilot implementation of the general approach, we show that the approach does indeed work well for name/referent pairs with limited textual coverage in the training data.

2015

Multi-modal Visualization and Search for Text and Prosody Annotations
Markus Gärtner | Katrin Schweitzer | Kerstin Eckart | Jonas Kuhn

Structural Alignment for Comparison Detection
Wiltrud Kessler | Jonas Kuhn

A Pilot Experiment on Exploiting Translations for Literary Studies on Kafka’s “Verwandlung”
Fabienne Cap | Ina Rösiger | Jonas Kuhn

Towards Opinion Mining from Reviews for the Prediction of Product Rankings
Wiltrud Kessler | Roman Klinger | Jonas Kuhn

2014

Learning Structured Perceptrons for Coreference Resolution with Latent Antecedents and Non-local Features
Anders Björkelund | Jonas Kuhn

Visualization, Search, and Error Analysis for Coreference Annotations
Markus Gärtner | Anders Björkelund | Gregor Thiele | Wolfgang Seeker | Jonas Kuhn

A Corpus of Comparisons in Product Reviews
Wiltrud Kessler | Jonas Kuhn

Textual Emigration Analysis (TEA)
Andre Blessing | Jonas Kuhn

Converting an HPSG-based Treebank into its Parallel Dependency-based Treebank
Masood Ghayoomi | Jonas Kuhn

An Out-of-Domain Test Suite for Dependency Parsing of German
Wolfgang Seeker | Jonas Kuhn

UnixMan Corpus: A Resource for Language Learning in the Unix Domain
Kyle Richardson | Jonas Kuhn

Exploring the utility of coreference chains for improved identification of personal names
Andrea Glaser | Jonas Kuhn

A Graphical Interface for Automatic Error Mining in Corpora
Gregor Thiele | Wolfgang Seeker | Markus Gärtner | Anders Björkelund | Jonas Kuhn

2013

Combining Referring Expression Generation and Surface Realization: A Corpus-Based Investigation of Architectures
Sina Zarrieß | Jonas Kuhn

ICARUS – An Extensible Graphical Search Tool for Dependency Treebanks
Markus Gärtner | Gregor Thiele | Wolfgang Seeker | Anders Björkelund | Jonas Kuhn

Morphological and Syntactic Case in Statistical Dependency Parsing
Wolfgang Seeker | Jonas Kuhn

Towards a Tool for Interactive Concept Building for Large Scale Analysis in the Humanities
Andre Blessing | Jonathan Sonntag | Fritz Kliche | Ulrich Heid | Jonas Kuhn | Manfred Stede

Towards Joint Morphological Analysis and Dependency Parsing of Turkish
Özlem Çetinoğlu | Jonas Kuhn

The Effects of Syntactic Features in Automatic Prediction of Morphology
Wolfgang Seeker | Jonas Kuhn

Detection of Product Comparisons - How Far Does an Out-of-the-Box Semantic Role Labeling System Take You?
Wiltrud Kessler | Jonas Kuhn

2012

Making Ellipses Explicit in Dependency Conversion for a German Treebank
Wolfgang Seeker | Jonas Kuhn

A Corpus-based Study of the German Recipient Passive
Patrick Ziering | Sina Zarrieß | Jonas Kuhn

Generating Non-Projective Word Order in Statistical Linearization
Bernd Bohnet | Anders Björkelund | Jonas Kuhn | Wolfgang Seeker | Sina Zarriess

Comparing Non-projective Strategies for Labeled Graph-Based Dependency Parsing
Anders Björkelund | Jonas Kuhn

Phrase Structures and Dependencies for End-to-End Coreference Resolution
Anders Björkelund | Jonas Kuhn

Light Textual Inference for Semantic Parsing
Kyle Richardson | Jonas Kuhn

Data-driven Dependency Parsing With Empty Heads
Wolfgang Seeker | Richárd Farkas | Bernd Bohnet | Helmut Schmid | Jonas Kuhn

The Best of Both Worlds – A Graph-based Completion Model for Transition-based Parsers
Bernd Bohnet | Jonas Kuhn

To what extent does sentence-internal realisation reflect discourse context? A study on word order
Sina Zarrieß | Aoife Cahill | Jonas Kuhn

2011

Underspecifying and Predicting Voice for Surface Realisation Ranking
Sina Zarrieß | Aoife Cahill | Jonas Kuhn

On the Role of Explicit Morphological Feature Representation in Syntactic Dependency Parsing for German
Wolfgang Seeker | Jonas Kuhn

2010

Hard Constraints for Grammatical Function Labelling
Wolfgang Seeker | Ines Rehbein | Jonas Kuhn | Josef van Genabith

A Cross-Lingual Induction Technique for German Adverbial Participles
Sina Zarrieß | Aoife Cahill | Jonas Kuhn | Christian Rohrer

Towards a Large Parallel Corpus of Cleft Constructions
Gerlof Bouma | Lilja Øvrelid | Jonas Kuhn

Design and Development of Part-of-Speech-Tagging Resources for Wolof (Niger-Congo, spoken in Senegal)
Cheikh M. Bamba Dione | Jonas Kuhn | Sina Zarrieß

Training Parsers on Partial Trees: A Cross-language Comparison
Kathrin Spreyer | Lilja Øvrelid | Jonas Kuhn

Informed ways of improving data-driven dependency parsing for German
Wolfgang Seeker | Bernd Bohnet | Lilja Øvrelid | Jonas Kuhn

Cross-Lingual Induction for Deep Broad-Coverage Syntax: A Case Study on German Participles
Sina Zarrieß | Aoife Cahill | Jonas Kuhn | Christian Rohrer

2009

Data-Driven Dependency Parsing of New Languages Using Incomplete and Noisy Training Data
Kathrin Spreyer | Jonas Kuhn

Empirical Lower Bounds on Aligment Error Rates in Syntax-Based Machine Translation
Anders Søgaard | Jonas Kuhn

Exploiting Translational Correspondences for Pattern-Independent MWE Identification
Sina Zarrieß | Jonas Kuhn

Using a maximum entropy-based tagger to improve a very fast vine parser
Anders Søgaard | Jonas Kuhn

Improving data-driven dependency parsing using large-scale LFG grammars
Lilja Øvrelid | Jonas Kuhn | Kathrin Spreyer

2008

Identification of Comparable Argument-Head Relations in Parallel Corpora
Kathrin Spreyer | Jonas Kuhn | Bettina Schrader

2007

Machine Translation as Tree Labeling
Mark Hopkins | Jonas Kuhn

Deep Grammars in a Tree Labeling Approach to Syntax-based Statistical Machine Translation
Mark Hopkins | Jonas Kuhn

2006

Exploring the Potential of Intractable Parsers
Mark Hopkins | Jonas Kuhn

Multilingual parallel treebanking: a lean and flexible approach
Jonas Kuhn | Michael Jellinghaus

A Framework for Incorporating Alignment Information in Parsing
Mark Hopkins | Jonas Kuhn

2005

Parsing Word-Aligned Parallel Corpora in a Grammar Induction Context
Jonas Kuhn

2004

Experiments in parallel-text based grammar induction
Jonas Kuhn

Utilization of Multiple Language Resources for Robust Grammar-Based Tense and Aspect Classification
Alexis Palmer | Jonas Kuhn | Carlota Smith

Applying Computational Linguistic Techniques in a Documentary Project for Q’anjob’al (Mayan, Guatemala)
Jonas Kuhn | B’alam Mateo-Toledo

2003

Compounding and Derivational Morphology in a Finite-State Setting
Jonas Kuhn

2002

OT Syntax – Decidability of Generation-based Optimization
Jonas Kuhn

2000

Processing Optimality-theoretic Syntax by Interleaved Chart Parsing and Generation
Jonas Kuhn

Lexicalized Stochastic Modeling of Constraint-Based Grammars using Log-Linear Measures and EM Training
Stefan Riezler | Detlef Prescher | Jonas Kuhn | Mark Johnson

1996

An Underspecified HPSG Representation for Information Structure
Jonas Kuhn