Chris Biemann

Also published as: Christian Biemann


2019

Learning Graph Embeddings from WordNet-based Similarity Measures
Andrey Kutuzov | Mohammad Dorgham | Oleksiy Oliynyk | Chris Biemann | Alexander Panchenko

We present path2vec, a new approach for learning graph embeddings that relies on structural measures of pairwise node similarities. The model learns representations for nodes in a dense space that approximate a given user-defined graph distance measure, such as the shortest path distance or distance measures that take information beyond the graph structure into account. Evaluation of the proposed model on semantic similarity and word sense disambiguation tasks, using various WordNet-based similarity measures, shows that our approach yields competitive results, outperforming strong graph embedding baselines. The model is computationally efficient, being orders of magnitude faster than the direct computation of graph-based distances.
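The idea of fitting dense vectors to a user-defined graph measure can be illustrated with a toy example. The following is a minimal sketch, not the authors' implementation: it assumes an unweighted, connected graph, uses 1 / (1 + shortest path length) as the target similarity, and fits vectors by plain stochastic gradient descent on the squared error between dot products and targets. The toy graph, the function names, and all hyperparameters are hypothetical.

```python
import random
from collections import deque


def shortest_path_lengths(graph, source):
    """Breadth-first search distances from source in an unweighted graph."""
    dist = {source: 0}
    queue = deque([source])
    while queue:
        node = queue.popleft()
        for neighbour in graph[node]:
            if neighbour not in dist:
                dist[neighbour] = dist[node] + 1
                queue.append(neighbour)
    return dist


def train_embeddings(graph, dim=8, epochs=300, lr=0.05, seed=0):
    """Fit vectors whose dot products approximate 1 / (1 + path length)."""
    rng = random.Random(seed)
    nodes = sorted(graph)
    vecs = {n: [rng.uniform(-0.5, 0.5) for _ in range(dim)] for n in nodes}
    # Target similarity for every unordered node pair (assumes connectivity).
    targets = {}
    for a in nodes:
        dist = shortest_path_lengths(graph, a)
        for b in nodes:
            if a < b:
                targets[(a, b)] = 1.0 / (1.0 + dist[b])
    pairs = list(targets)
    losses = []
    for _ in range(epochs):
        rng.shuffle(pairs)
        total = 0.0
        for a, b in pairs:
            va, vb = vecs[a], vecs[b]
            err = sum(x * y for x, y in zip(va, vb)) - targets[(a, b)]
            total += err * err
            # Gradient step on the squared error for this pair.
            for i in range(dim):
                ga, gb = err * vb[i], err * va[i]
                va[i] -= lr * ga
                vb[i] -= lr * gb
        losses.append(total / len(pairs))
    return vecs, losses


# Hypothetical toy taxonomy, stored as an undirected adjacency list.
graph = {
    "entity": ["animal", "plant"],
    "animal": ["entity", "dog", "cat"],
    "plant": ["entity", "tree"],
    "dog": ["animal"],
    "cat": ["animal"],
    "tree": ["plant"],
}
vecs, losses = train_embeddings(graph)
```

Once such vectors are trained, a similarity query is a single dot product rather than a graph traversal, which is where the speed-up described in the abstract would come from.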

HHMM at SemEval-2019 Task 2: Unsupervised Frame Induction using Contextualized Word Embeddings
Saba Anwar | Dmitry Ustalov | Nikolay Arefyev | Simone Paolo Ponzetto | Chris Biemann | Alexander Panchenko

We present our system for semantic frame induction that showed the best performance in Subtask B.1 and finished as the runner-up in Subtask A of the SemEval 2019 Task 2 on unsupervised semantic frame induction (QasemiZadeh et al., 2019). Our approach separates this task into two independent steps: verb clustering using word embeddings and their context embeddings, and role labeling by combining these embeddings with syntactical features. A simple combination of these steps shows very competitive results and can be extended to process other datasets and languages.

UHH-LT at SemEval-2019 Task 6: Supervised vs. Unsupervised Transfer Learning for Offensive Language Detection
Gregor Wiedemann | Eugen Ruppert | Chris Biemann

We present a neural-network-based transfer learning approach for offensive language detection. For our system, we compare two types of knowledge transfer: supervised and unsupervised pre-training. Supervised pre-training of our bidirectional GRU-3-CNN architecture is performed as multi-task learning, training five different tasks in parallel. The selected tasks are supervised classification problems from public NLP resources with some overlap with offensive language, such as sentiment detection, emoji classification, and aggressive language classification. Unsupervised transfer learning is performed with a thematic clustering of 40M unlabeled tweets via LDA. Based on this dataset, pre-training is performed by predicting the main topic of a tweet. Results indicate that unsupervised transfer from large datasets performs slightly better than supervised training on small ‘near target category’ datasets. In the SemEval task, our system ranks 14th out of 103 participants.

Language-Agnostic Model for Aspect-Based Sentiment Analysis
Md Shad Akhtar | Abhishek Kumar | Asif Ekbal | Chris Biemann | Pushpak Bhattacharyya

In this paper, we propose a language-agnostic deep neural network architecture for aspect-based sentiment analysis. The proposed approach is based on a Bidirectional Long Short-Term Memory (Bi-LSTM) network, which is further assisted by extra hand-crafted features. We define three different architectures for the successful combination of word embeddings and hand-crafted features. We evaluate the proposed approach for six languages (i.e. English, Spanish, French, Dutch, German and Hindi) and two problems (i.e. aspect term extraction and aspect sentiment classification). Experiments show that the proposed model attains state-of-the-art performance in most of the settings.

Reviving a psychometric measure: Classification and prediction of the Operant Motive Test
Dirk Johannßen | Chris Biemann | David Scheffer

Implicit motives allow for the characterization of behavior, subsequent success and long-term development. While this has been operationalized in the operant motive test, research on motives has declined mainly due to labor-intensive and costly human annotation. In this study, we analyze over 200,000 labeled data items from 40,000 participants and utilize them for engineering features for training a logistic model tree machine learning model. It captures manually assigned motives well with an F-score of 80%, coming close to the pairwise annotator intraclass correlation coefficient of r = .85. In addition, we found a significant correlation of r = .2 between subsequent academic success and data automatically labeled with our model in an extrinsic evaluation.

Categorizing Comparative Sentences
Alexander Panchenko | Alexander Bondarenko | Mirco Franzek | Matthias Hagen | Chris Biemann

We tackle the tasks of automatically identifying comparative sentences and categorizing the intended preference (e.g., “Python has better NLP libraries than MATLAB” → Python, better, MATLAB). To this end, we manually annotate 7,199 sentences for 217 distinct target item pairs from several domains (27% of the sentences contain an oriented comparison in the sense of “better” or “worse”). A gradient boosting model based on pre-trained sentence embeddings reaches an F1 score of 85% in our experimental evaluation. The model can be used to extract comparative sentences for pro/con argumentation in comparative / argument search engines or debating technologies.

LT Expertfinder: An Evaluation Framework for Expert Finding Methods
Tim Fischer | Steffen Remus | Chris Biemann

Expert finding is the task of ranking persons for a predefined topic or search query. Finding experts for a specified area is an important task and has attracted much attention in the information retrieval community. Most approaches for this task are evaluated in a supervised fashion, which depends on predefined topics of interest as well as gold standard expert rankings. Famous representatives of such datasets are enriched versions of DBLP provided by the ArnetMiner project or the W3C Corpus of TREC. However, manually ranking experts can be considered highly subjective, and detailed rankings are hardly distinguishable. Evaluating on these datasets does not necessarily guarantee a good or bad performance of the system. Particularly for dynamic systems, where topics are not predefined but formulated as a search query, we believe a more informative approach is to perform user studies for directly comparing different methods in the same view. In order to accomplish this in a user-friendly way, we present the LT Expert Finder web application, which is equipped with various query-based expert finding methods that can be easily extended, a detailed expert profile view, detailed evidence in the form of relevant documents and statistics, and an evaluation component that allows the qualitative comparison between different rankings.

On the Compositionality Prediction of Noun Phrases using Poincaré Embeddings
Abhik Jana | Dima Puzyrev | Alexander Panchenko | Pawan Goyal | Chris Biemann | Animesh Mukherjee

The compositionality degree of multiword expressions indicates to what extent the meaning of a phrase can be derived from the meaning of its constituents and their grammatical relations. Prediction of (non)-compositionality is a task that has been frequently addressed with distributional semantic models. We introduce a novel technique to blend hierarchical information with distributional information for predicting compositionality. In particular, we use hypernymy information of the multiword and its constituents encoded in the form of the recently introduced Poincaré embeddings in addition to the distributional information to detect compositionality for noun phrases. Using a weighted average of the distributional similarity and a Poincaré similarity function, we obtain consistent and substantial, statistically significant improvement across three gold standard datasets over state-of-the-art models based on distributional information only. Unlike traditional approaches that solely use an unsupervised setting, we have also framed the problem as a supervised task, obtaining comparable improvements. Further, we publicly release our Poincaré embeddings, which are trained on the output of handcrafted lexical-syntactic patterns on a large corpus.
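The blending step described above can be sketched in a few lines. The distance function below is the standard Poincaré ball distance; the mapping from distance to similarity (1 / (1 + d)) and the mixing weight `alpha` are assumptions made for illustration, since the abstract does not specify the exact similarity function or weighting used in the paper.

```python
import math


def poincare_distance(u, v):
    """Distance between two points inside the unit ball (Poincaré ball model)."""
    sq_diff = sum((a - b) ** 2 for a, b in zip(u, v))
    sq_u = sum(a * a for a in u)
    sq_v = sum(b * b for b in v)
    return math.acosh(1.0 + 2.0 * sq_diff / ((1.0 - sq_u) * (1.0 - sq_v)))


def poincare_similarity(u, v):
    """Assumed mapping from hyperbolic distance to a similarity in (0, 1]."""
    return 1.0 / (1.0 + poincare_distance(u, v))


def blended_score(distributional_sim, u, v, alpha=0.5):
    """Weighted average of a distributional and a Poincaré similarity."""
    return alpha * distributional_sim + (1.0 - alpha) * poincare_similarity(u, v)


# Hypothetical 2-d embeddings of a phrase and one of its constituents.
phrase_vec = (0.10, 0.20)
constituent_vec = (0.15, 0.25)
score = blended_score(0.8, phrase_vec, constituent_vec, alpha=0.6)
```

Points near the boundary of the unit ball are far from everything else under this metric, which is what allows hierarchical (hypernymy) structure to be encoded in few dimensions.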

Making Fast Graph-based Algorithms with Graph Metric Embeddings
Andrey Kutuzov | Mohammad Dorgham | Oleksiy Oliynyk | Chris Biemann | Alexander Panchenko

Graph measures, such as node distances, are inefficient to compute. We explore dense vector representations as an effective way to approximate the same information. We introduce a simple yet efficient and effective approach for learning graph embeddings. Instead of directly operating on the graph structure, our method takes structural measures of pairwise node similarities into account and learns dense node representations reflecting user-defined graph distance measures, such as the shortest path distance or distance measures that take information beyond the graph structure into account. We demonstrate a speed-up of several orders of magnitude when predicting word similarity by vector operations on our embeddings as opposed to directly computing the respective path-based measures, while outperforming various other graph embeddings on semantic similarity and word sense disambiguation tasks.

Every Child Should Have Parents: A Taxonomy Refinement Algorithm Based on Hyperbolic Term Embeddings
Rami Aly | Shantanu Acharya | Alexander Ossa | Arne Köhn | Chris Biemann | Alexander Panchenko

We introduce the use of Poincaré embeddings to improve existing state-of-the-art approaches to domain-specific taxonomy induction from text as a signal for both relocating wrong hyponym terms within a (pre-induced) taxonomy as well as for attaching disconnected terms in a taxonomy. This method substantially improves previous state-of-the-art results on the SemEval-2016 Task 13 on taxonomy extraction. We demonstrate the superiority of Poincaré embeddings over distributional semantic representations, supporting the hypothesis that they can better capture hierarchical lexical-semantic relationships than embeddings in the Euclidean space.

Adversarial Learning of Privacy-Preserving Text Representations for De-Identification of Medical Records
Max Friedrich | Arne Köhn | Gregor Wiedemann | Chris Biemann

De-identification is the task of detecting protected health information (PHI) in medical text. It is a critical step in sanitizing electronic health records (EHR) to be shared for research. Automatic de-identification classifiers can significantly speed up the sanitization process. However, obtaining a large and diverse dataset to train such a classifier that works well across many types of medical text poses a challenge as privacy laws prohibit the sharing of raw medical records. We introduce a method to create privacy-preserving shareable representations of medical text (i.e. they contain no PHI) that does not require expensive manual pseudonymization. These representations can be shared between organizations to create unified datasets for training de-identification models. Our representation allows training a simple LSTM-CRF de-identification model to an F1 score of 97.4%, which is comparable to a strong baseline that exposes private information in its representation. A robust, widely available de-identification classifier based on our representation could potentially enable studies for which de-identification would otherwise be too costly.

Improving Neural Entity Disambiguation with Graph Embeddings
Özge Sevgili | Alexander Panchenko | Chris Biemann

Entity Disambiguation (ED) is the task of linking an ambiguous entity mention to a corresponding entry in a knowledge base. Current methods have mostly focused on unstructured text data to learn representations of entities; however, there is structured information in the knowledge base itself that should be useful to disambiguate entities. In this work, we propose a method that uses graph embeddings for integrating structured information from the knowledge base with unstructured information from text-based representations. Our experiments confirm that graph embeddings trained on a graph of hyperlinks between Wikipedia articles improve the performance of a simple feed-forward neural ED model and a state-of-the-art neural ED system.

Hierarchical Multi-label Classification of Text with Capsule Networks
Rami Aly | Steffen Remus | Chris Biemann

Capsule networks have been shown to demonstrate good performance on structured data in the area of visual inference. In this paper, we apply and compare simple shallow capsule networks for hierarchical multi-label text classification and show that they can outperform other neural networks, such as CNNs and LSTMs, and non-neural network architectures such as SVMs. For our experiments, we use the established Web of Science (WOS) dataset and introduce a new real-world scenario dataset, the BlurbGenreCollection (BGC). Our results confirm the hypothesis that capsule networks are especially advantageous for rare events and structurally diverse categories, which we attribute to their ability to combine latent encoded information.

TARGER: Neural Argument Mining at Your Fingertips
Artem Chernodub | Oleksiy Oliynyk | Philipp Heidenreich | Alexander Bondarenko | Matthias Hagen | Chris Biemann | Alexander Panchenko

We present TARGER, an open source neural argument mining framework for tagging arguments in free input texts and for keyword-based retrieval of arguments from an argument-tagged web-scale corpus. The currently available models are pre-trained on three recent argument mining datasets and enable the use of neural argument mining without any reproducibility effort on the user’s side. The open source code ensures portability to other domains and use cases.

2018

Par4Sim – Adaptive Paraphrasing for Text Simplification
Seid Muhie Yimam | Chris Biemann

Learning from a real-world data stream and continuously updating the model without explicit supervision is a new challenge for NLP applications with machine learning components. In this work, we have developed an adaptive learning system for text simplification, which improves the underlying learning-to-rank model from usage data, i.e. how users have employed the system for the task of simplification. Our experimental results show that, over a period of time, the performance of the embedded paraphrase ranking model increases steadily, improving from a score of 62.88% up to 75.70% based on the NDCG@10 evaluation metric. To our knowledge, this is the first study where an NLP component is adaptively improved through usage.
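For readers unfamiliar with the NDCG@10 metric reported above, here is a minimal sketch of how it is commonly computed. It assumes the linear-gain variant with a log2 rank discount; the abstract does not state which variant the paper uses, and the relevance grades below are made up for illustration.

```python
import math


def dcg_at_k(relevances, k):
    """Discounted cumulative gain over the first k relevance grades."""
    return sum(rel / math.log2(rank + 2) for rank, rel in enumerate(relevances[:k]))


def ndcg_at_k(relevances, k=10):
    """DCG of the given ranking divided by the DCG of the ideal ordering."""
    ideal = dcg_at_k(sorted(relevances, reverse=True), k)
    return dcg_at_k(relevances, k) / ideal if ideal > 0 else 0.0


# Hypothetical relevance grades of paraphrase candidates, in ranked order.
ranked = [3, 2, 3, 0, 1, 2]
score = ndcg_at_k(ranked, k=10)
```

A score of 1.0 means the candidates are already ordered by decreasing relevance; lower scores indicate that relevant candidates appear further down the list.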

Demonstrating Par4Sem - A Semantic Writing Aid with Adaptive Paraphrasing
Seid Muhie Yimam | Chris Biemann

In this paper, we present Par4Sem, a semantic writing aid tool based on adaptive paraphrasing. Unlike many annotation tools that are primarily used to collect training examples, Par4Sem is integrated into a real-world application, in this case a writing aid tool, in order to collect training examples from usage data. Par4Sem is a tool that supports an adaptive, iterative, and interactive process where the underlying machine learning models are updated for each iteration using new training examples from usage data. After motivating the use of ever-learning tools in NLP applications, we evaluate Par4Sem by adapting it to a text simplification task through mere usage.

A Multilingual Information Extraction Pipeline for Investigative Journalism
Gregor Wiedemann | Seid Muhie Yimam | Chris Biemann

We introduce an advanced information extraction pipeline to automatically process very large collections of unstructured textual data for the purpose of investigative journalism. The pipeline serves as a new input processor for the upcoming major release of our New/s/leak 2.0 software, which we develop in cooperation with a large German news organization. The use case is that journalists receive a large collection of files up to several Gigabytes containing unknown contents. Collections may originate either from official disclosures of documents, e.g. Freedom of Information Act requests, or unofficial data leaks.

Using Semantics for Granularities of Tokenization
Martin Riedl | Chris Biemann

Depending on downstream applications, it is advisable to extend the notion of tokenization from low-level character-based token boundary detection to identification of meaningful and useful language units. This entails both identifying units composed of several single words that form a multiword expression (MWE), as well as splitting single-word compounds into their meaningful parts. In this article, we introduce unsupervised and knowledge-free methods for these two tasks. The main novelty of our research is that the methods are primarily based on distributional similarity, of which we use two flavors: a sparse count-based and a dense neural-based distributional semantic model. First, we introduce DRUID, which is a method for detecting MWEs. The evaluation on MWE-annotated data sets in two languages and newly extracted evaluation data sets for 32 languages shows that DRUID compares favorably to previous methods not utilizing distributional information. Second, we present SECOS, an algorithm for decompounding close compounds. In an evaluation of four dedicated decompounding data sets across four languages and on data sets extracted from Wiktionary for 14 languages, we demonstrate the superiority of our approach over unsupervised baselines, sometimes even matching the performance of previous language-specific and supervised methods. In a final experiment, we show how both decompounding and MWE information can be used in information retrieval. Here, we obtain the best results when combining word information with MWEs and the compound parts in a bag-of-words retrieval set-up. Overall, our methodology paves the way to automatic detection of lexical units beyond standard tokenization techniques without language-specific preprocessing steps such as POS tagging.

BomJi at SemEval-2018 Task 10: Combining Vector-, Pattern- and Graph-based Information to Identify Discriminative Attributes
Enrico Santus | Chris Biemann | Emmanuele Chersoni

This paper describes BomJi, a supervised system for capturing discriminative attributes in word pairs (e.g. yellow as discriminative for banana over watermelon). The system relies on an XGB classifier trained on carefully engineered graph-, pattern- and word embedding-based features. It participated in the SemEval-2018 Task 10 on Capturing Discriminative Attributes, achieving an F1 score of 0.73 and ranking 2nd out of 26 participant systems.

Enriching Frame Representations with Distributionally Induced Senses
Stefano Faralli | Alexander Panchenko | Chris Biemann | Simone Paolo Ponzetto

An Unsupervised Word Sense Disambiguation System for Under-Resourced Languages
Dmitry Ustalov | Denis Teslenko | Alexander Panchenko | Mikhail Chernoskutov | Chris Biemann | Simone Paolo Ponzetto

Retrofitting Word Representations for Unsupervised Sense Aware Word Similarities
Steffen Remus | Chris Biemann

Improving Hypernymy Extraction with Distributional Semantic Classes
Alexander Panchenko | Dmitry Ustalov | Stefano Faralli | Simone P. Ponzetto | Chris Biemann

Building a Web-Scale Dependency-Parsed Corpus from CommonCrawl
Alexander Panchenko | Eugen Ruppert | Stefano Faralli | Simone P. Ponzetto | Chris Biemann

A Report on the Complex Word Identification Shared Task 2018
Seid Muhie Yimam | Chris Biemann | Shervin Malmasi | Gustavo Paetzold | Lucia Specia | Sanja Štajner | Anaïs Tack | Marcos Zampieri

We report the findings of the second Complex Word Identification (CWI) shared task organized as part of the BEA workshop co-located with NAACL-HLT’2018. The second CWI shared task featured multilingual and multi-genre datasets divided into four tracks: English monolingual, German monolingual, Spanish monolingual, and a multilingual track with a French test set, and two tasks: binary classification and probabilistic classification. A total of 12 teams submitted their results in different task/track combinations and 11 of them wrote system description papers that are referred to in this report and appear in the BEA workshop proceedings.

Document-based Recommender System for Job Postings using Dense Representations
Ahmed Elsafty | Martin Riedl | Chris Biemann

Job boards and professional social networks heavily use recommender systems in order to better support users in exploring job advertisements. Detecting the similarity between job advertisements is important for job recommendation systems as it allows, for example, the application of item-to-item based recommendations. In this work, we research the usage of dense vector representations to enhance a large-scale job recommendation system and to rank German job advertisements regarding their similarity. We follow a two-fold evaluation scheme: (1) we exploit historic user interactions to automatically create a dataset of similar jobs that enables an offline evaluation; (2) in addition, we conduct an online A/B test and evaluate the best performing method on our platform reaching more than 1 million users. We achieve the best results by combining job titles with full-text job descriptions. In particular, this method builds a dense document representation using words of the titles to weigh the importance of words of the full-text description. In the online evaluation, this approach allows us to increase the click-through rate on job recommendations for active users by 8.0%.

Unsupervised Semantic Frame Induction using Triclustering
Dmitry Ustalov | Alexander Panchenko | Andrey Kutuzov | Chris Biemann | Simone Paolo Ponzetto

We use dependency triples automatically extracted from a Web-scale corpus to perform unsupervised semantic frame induction. We cast the frame induction problem as a triclustering problem that is a generalization of clustering for triadic data. Our replicable benchmarks demonstrate that the proposed graph-based approach, Triframes, shows state-of-the-art results on this task on a FrameNet-derived dataset and performs on par with competitive methods on a verb class clustering task.

2017

CWIG3G2 - Complex Word Identification Task across Three Text Genres and Two User Groups
Seid Muhie Yimam | Sanja Štajner | Martin Riedl | Chris Biemann

Complex word identification (CWI) is an important task in text accessibility. However, due to the scarcity of CWI datasets, previous studies have only addressed this problem on Wikipedia sentences and have solely taken into account the needs of non-native English speakers. We collect a new CWI dataset (CWIG3G2) covering three text genres (News, WikiNews, and Wikipedia) annotated by both native and non-native English speakers. Unlike previous datasets, we cover single words, as well as complex phrases, and present them for judgment in a paragraph context. We present the first study on cross-genre and cross-group CWI, showing measurable influences in native language and genre types.

Multilingual and Cross-Lingual Complex Word Identification
Seid Muhie Yimam | Sanja Štajner | Martin Riedl | Chris Biemann

Complex Word Identification (CWI) is an important task in lexical simplification and text accessibility. Due to the lack of CWI datasets, previous works largely depend on Simple English Wikipedia and edit histories for obtaining ‘gold standard’ annotations, which are of doubtful quality and limited to English. We collect complex words/phrases (CP) for English, German and Spanish, annotated by both native and non-native speakers, and propose language-independent features that can be used to train multilingual and cross-lingual CWI models. We show that the performance of cross-lingual CWI systems (using a model trained on one language and applying it to the other languages) is comparable to the performance of monolingual CWI systems.

Watset: Automatic Induction of Synsets from a Graph of Synonyms
Dmitry Ustalov | Alexander Panchenko | Chris Biemann

This paper presents a new graph-based approach that induces synsets using synonymy dictionaries and word embeddings. First, we build a weighted graph of synonyms extracted from commonly available resources, such as Wiktionary. Second, we apply word sense induction to deal with ambiguous words. Finally, we cluster the disambiguated version of the ambiguous input graph into synsets. Our meta-clustering approach lets us use an efficient hard clustering algorithm to perform a fuzzy clustering of the graph. Despite its simplicity, our approach shows excellent results, outperforming five competitive state-of-the-art methods in terms of F-score on three gold standard datasets for English and Russian derived from large-scale manually constructed lexical resources.

Replacing OOV Words For Dependency Parsing With Distributional Semantics
Prasanth Kolachina | Martin Riedl | Chris Biemann

Using Pseudowords for Algorithm Comparison: An Evaluation Framework for Graph-based Word Sense Induction
Flavio Massimiliano Cecchini | Chris Biemann | Martin Riedl

Using Linked Disambiguated Distributional Networks for Word Sense Disambiguation
Alexander Panchenko | Stefano Faralli | Simone Paolo Ponzetto | Chris Biemann

We introduce a new method for unsupervised knowledge-based word sense disambiguation (WSD) based on a resource that links two types of sense-aware lexical networks: one is induced from a corpus using distributional semantics, the other is manually constructed. The combination of two networks reduces the sparsity of sense representations used for WSD. We evaluate these enriched representations within two lexical sample sense disambiguation benchmarks. Our results indicate that (1) features extracted from the corpus-based resource help to significantly outperform a model based solely on the lexical resource; (2) our method achieves results comparable to or better than four state-of-the-art unsupervised knowledge-based WSD systems including three hybrid systems that also rely on text corpora. In contrast to these hybrid methods, our approach does not require access to web search engines, texts mapped to a sense inventory, or machine translation systems.

There’s no ‘Count or Predict’ but task-based selection for distributional models
Martin Riedl | Chris Biemann

Entity-Centric Information Access with Human in the Loop for the Biomedical Domain
Seid Muhie Yimam | Steffen Remus | Alexander Panchenko | Andreas Holzinger | Chris Biemann

In this paper, we describe the concept of entity-centric information access for the biomedical domain. With entity recognition technologies approaching acceptable levels of accuracy, we put forward a paradigm of document browsing and searching where the entities of the domain and their relations are explicitly modeled to provide users the possibility of collecting exhaustive information on relations of interest. We describe three working prototypes along these lines: NEW/S/LEAK, which was developed for investigative journalists who need a quick overview of large leaked document collections; STORYFINDER, which is a personalized organizer for information found in web pages that allows adding entities as well as relations, and is capable of personalized information management; and adaptive annotation capabilities of WEBANNO, which is a general-purpose linguistic annotation tool. We will discuss future steps towards the adaptation of these tools to biomedical data, which is subject to a recently started project on biomedical knowledge acquisition. A key difference to other approaches is the centering around the user in a Human-in-the-Loop machine learning approach, where users define and extend categories and enable the system to improve via feedback and interaction.

IIT-UHH at SemEval-2017 Task 3: Exploring Multiple Features for Community Question Answering and Implicit Dialogue Identification
Titas Nandi | Chris Biemann | Seid Muhie Yimam | Deepak Gupta | Sarah Kohail | Asif Ekbal | Pushpak Bhattacharyya

In this paper, we present the system for Answer Selection and Ranking in Community Question Answering, which we built as part of our participation in SemEval-2017 Task 3. We develop a Support Vector Machine (SVM) based system that makes use of textual, domain-specific, word-embedding and topic-modeling features. In addition, we propose a novel method for dialogue chain identification in comment threads. Our primary submission won subtask C, outperforming other systems in all the primary evaluation metrics. We performed well in other English subtasks, ranking third in subtask A and eighth in subtask B. We also developed open source toolkits for all three English subtasks under the name cQARank [https://github.com/TitasNandi/cQARank].

STS-UHH at SemEval-2017 Task 1: Scoring Semantic Textual Similarity Using Supervised and Unsupervised Ensemble
Sarah Kohail | Amr Rekaby Salama | Chris Biemann

This paper reports the STS-UHH participation in the SemEval 2017 shared Task 1 of Semantic Textual Similarity (STS). Overall, we submitted 3 runs covering monolingual and cross-lingual STS tracks. Our participation involves two approaches: an unsupervised approach, which estimates a word alignment-based similarity score, and a supervised approach, which combines dependency graph similarity and coverage features with lexical similarity measures using regression methods. We also present a way of ensembling both models. Out of 84 submitted runs, our team’s best multilingual run was ranked 12th in overall performance with a correlation of 0.61, and 7th among 31 participating teams.

IITPB at SemEval-2017 Task 5: Sentiment Prediction in Financial Text
Abhishek Kumar | Abhishek Sethi | Md Shad Akhtar | Asif Ekbal | Chris Biemann | Pushpak Bhattacharyya

This paper reports team IITPB’s participation in the SemEval 2017 Task 5 on ‘Fine-grained sentiment analysis on financial microblogs and news’. We developed 2 systems for the two tracks. One system was based on an ensemble of Support Vector Classifier and Logistic Regression. This system relied on Distributional Thesaurus (DT), word embeddings and lexicon features to predict a floating sentiment value between -1 and +1. The other system was based on Support Vector Regression using word embeddings, lexicon features, and PMI scores as features. The system was ranked 5th in track 1 and 8th in track 2.

Unsupervised, Knowledge-Free, and Interpretable Word Sense Disambiguation
Alexander Panchenko | Fide Marten | Eugen Ruppert | Stefano Faralli | Dmitry Ustalov | Simone Paolo Ponzetto | Chris Biemann

Interpretability of a predictive model is a powerful feature that gains the trust of users in the correctness of the predictions. In word sense disambiguation (WSD), knowledge-based systems tend to be much more interpretable than knowledge-free counterparts as they rely on the wealth of manually-encoded elements representing word senses, such as hypernyms, usage examples, and images. We present a WSD system that bridges the gap between these two so far disconnected groups of methods. Namely, our system, providing access to several state-of-the-art WSD models, aims to be interpretable as a knowledge-based system while it remains completely unsupervised and knowledge-free. The presented tool features a Web interface for all-word disambiguation of texts that makes the sense predictions human readable by providing interpretable word sense inventories, sense representations, and disambiguation results. We provide a public API, enabling seamless integration.

pdf pdf bib
Unsupervised Does Not Mean Uninterpretable: The Case for Word Sense Induction and Disambiguation
Alexander Panchenko | Eugen Ruppert | Stefano Faralli | Simone Paolo Ponzetto | Chris Biemann

The current trend in NLP is the use of highly opaque models, e.g. neural networks and word embeddings. While these models yield state-of-the-art results on a range of tasks, their drawback is poor interpretability. Using the example of word sense induction and disambiguation (WSID), we show that it is possible to develop an interpretable model that matches the state-of-the-art models in accuracy. Namely, we present an unsupervised, knowledge-free WSID approach, which is interpretable at three levels: word sense inventory, sense feature representations, and disambiguation procedure. Experiments show that our model performs on par with state-of-the-art word sense embeddings and other unsupervised systems while offering the possibility to justify its decisions in human-readable form.

pdf pdf bib
The ContrastMedium Algorithm: Taxonomy Induction From Noisy Knowledge Graphs With Just A Few Links
Stefano Faralli | Alexander Panchenko | Chris Biemann | Simone Paolo Ponzetto

In this paper, we present ContrastMedium, an algorithm that transforms noisy semantic networks into full-fledged, clean taxonomies. ContrastMedium is able to identify the embedded taxonomy structure from a noisy knowledge graph without explicit human supervision such as, for instance, a set of manually selected input root and leaf concepts. This is achieved by leveraging structural information from a companion reference taxonomy, to which the input knowledge graph is linked (either automatically or manually). When used in conjunction with methods for hypernym acquisition and knowledge base linking, our methodology provides a complete solution for end-to-end taxonomy induction. We conduct experiments using automatically acquired knowledge graphs, as well as a SemEval benchmark, and show that our method is able to achieve high performance on the task of taxonomy induction.

pdf pdf bib
Negative Sampling Improves Hypernymy Extraction Based on Projection Learning
Dmitry Ustalov | Nikolay Arefyev | Chris Biemann | Alexander Panchenko

We present a new approach to extraction of hypernyms based on projection learning and word embeddings. In contrast to classification-based approaches, projection-based methods require no candidate hyponym-hypernym pairs. While it is natural to use both positive and negative training examples in supervised relation extraction, the impact of negative examples on hypernym prediction has not been studied so far. In this paper, we show that explicit negative examples used for regularization of the model significantly improve performance compared to the state-of-the-art approach of Fu et al. (2014) on three datasets from different languages.

2016

pdf pdf bib
Language Transfer Learning for Supervised Lexical Substitution
Gerold Hintz | Chris Biemann

pdf pdf bib
new/s/leak – Information Extraction and Visualization for Investigative Data Journalists
Seid Muhie Yimam | Heiner Ulrich | Tatiana von Landesberger | Marcel Rosenbach | Michaela Regneri | Alexander Panchenko | Franziska Lehmann | Uli Fahrer | Chris Biemann | Kathrin Ballweg

pdf pdf bib
Making Sense of Word Embeddings
Maria Pelevina | Nikolay Arefiev | Chris Biemann | Alexander Panchenko

pdf pdf bib
Learning Paraphrasing for Multiword Expressions
Seid Muhie Yimam | Héctor Martínez Alonso | Martin Riedl | Chris Biemann

pdf pdf bib
Impact of MWE Resources on Multiword Recognition
Martin Riedl | Chris Biemann

pdf pdf bib
EmpiriST: AIPHES - Robust Tokenization and POS-Tagging for Different Genres
Steffen Remus | Gerold Hintz | Chris Biemann | Christian M. Meyer | Darina Benikova | Judith Eckle-Kohler | Margot Mieskes | Thomas Arnold

pdf pdf bib
A Web-based Tool for the Integrated Annotation of Semantic and Syntactic Structures
Richard Eckart de Castilho | Éva Mújdricza-Maydt | Seid Muhie Yimam | Silvana Hartmann | Iryna Gurevych | Anette Frank | Chris Biemann

We introduce the third major release of WebAnno, a generic web-based annotation tool for distributed teams. New features in this release focus on semantic annotation tasks (e.g. semantic role labelling or event annotation) and allow the tight integration of semantic annotations with syntactic annotations. In particular, we introduce the concept of slot features, a novel constraint mechanism that allows modelling the interaction between semantic and syntactic annotations, as well as a new annotation user interface. The new features were developed and used in an annotation project for semantic roles on German texts. The paper briefly introduces this project and reports on experiences performing annotations with the new tool. On a comparative evaluation, our tool reaches significant speedups over WebAnno 2 for a semantic annotation task.

pdf pdf bib
Vectors or Graphs? On Differences of Representations for Distributional Semantic Models
Chris Biemann

Distributional Semantic Models (DSMs) have recently received increased attention, together with the rise of neural architectures for scalable training of dense vector embeddings. While some of the literature even includes terms like ‘vectors’ and ‘dimensionality’ in the definition of DSMs, there are some good reasons why we should consider alternative formulations of distributional models. As an instance, I present a scalable graph-based solution to distributional semantics. The model belongs to the family of ‘count-based’ DSMs, keeps its representation sparse and explicit, and thus fully interpretable. I will highlight some important differences between sparse graph-based and dense vector approaches to DSMs: while dense vector-based models are computationally easier to handle and provide a nice uniform representation that can be compared and combined in many ways, they lack interpretability, provenance and robustness. On the other hand, graph-based sparse models have a more straightforward interpretation, handle sense distinctions more naturally and can straightforwardly be linked to knowledge bases, while lacking the ability to compare arbitrary lexical units and a compositionality operation. Since both representations have their merits, I opt for exploring their combination in the outlook.

pdf pdf bib
Towards a resource based on users’ knowledge to overcome the Tip of the Tongue problem.
Michael Zock | Chris Biemann

Language production is largely a matter of words which, in the case of access problems, can be searched for in an external resource (lexicon, thesaurus). In this kind of dialogue the user provides the momentarily available knowledge concerning the target and the system responds with the best guess(es) it can make given this input. As tip-of-the-tongue (ToT)-studies have shown, people always have some knowledge concerning the target (meaning fragments, number of syllables, ...) even if its complete form is eluding them. We will show here how to tap into this knowledge to build a resource likely to help authors (speakers/writers) to overcome the ToT-problem. Yet, before doing so we need a better understanding of the various kinds of knowledge people have when looking for a word. To this end, we asked crowdworkers to provide some cues to describe a given target and then to specify how each of them relates to the target, in the hope that this could help others to find the elusive word. Next, we checked how well a given search strategy worked when applied to differently built lexical networks. The results showed quite dramatic differences, which is not really surprising. After all, different networks are built for different purposes; hence each one of them is more or less suited for a given task. What was more surprising though is the fact that the relational information given by the users did not allow us to find the elusive word in WordNet better than without it.

pdf pdf bib
Domain-Specific Corpus Expansion with Focused Webcrawling
Steffen Remus | Chris Biemann

This work presents a straightforward method for extending or creating in-domain web corpora by focused webcrawling. The focused webcrawler uses statistical N-gram language models to estimate the relatedness of documents and weblinks and needs as input only N-grams or plain texts of a predefined domain and seed URLs as starting points. Two experiments demonstrate that our focused crawler is able to stay focused in domain and language. The first experiment shows that the crawler stays in a focused domain; the second demonstrates that language models trained on focused crawls obtain better perplexity scores on in-domain corpora. We distribute the focused crawler as open source software.

pdf pdf bib
SemRelData ― Multilingual Contextual Annotation of Semantic Relations between Nominals: Dataset and Guidelines
Darina Benikova | Chris Biemann

Semantic relations play an important role in linguistic knowledge representation. Although their role is relevant in the context of written text, there is no approach or dataset that makes use of contextuality of classic semantic relations beyond the boundary of one sentence. We present the SemRelData dataset that contains annotations of semantic relations between nominals in the context of one paragraph. To be able to analyse the universality of this context notion, the annotation was performed on a multi-lingual and multi-genre corpus. To evaluate the dataset, it is compared to large, manually created knowledge resources in the respective languages. The comparison shows that knowledge bases not only have coverage gaps; they also do not account for semantic relations that are manifested in particular contexts only, yet still play an important role for text cohesion.

pdf pdf bib
Unsupervised Compound Splitting With Distributional Semantics Rivals Supervised Methods
Martin Riedl | Chris Biemann

pdf pdf bib
IIT-TUDA at SemEval-2016 Task 5: Beyond Sentiment Lexicon: Combining Domain Dependency and Distributional Semantics Features for Aspect Based Sentiment Analysis
Ayush Kumar | Sarah Kohail | Amit Kumar | Asif Ekbal | Chris Biemann

pdf pdf bib
TAXI at SemEval-2016 Task 13: a Taxonomy Induction Method based on Lexico-Syntactic Patterns, Substrings and Focused Crawling
Alexander Panchenko | Stefano Faralli | Eugen Ruppert | Steffen Remus | Hubert Naets | Cédrick Fairon | Simone Paolo Ponzetto | Chris Biemann

pdf pdf bib
Ambient Search: A Document Retrieval System for Speech Streams
Benjamin Milde | Jonas Wacker | Stefan Radomski | Max Mühlhäuser | Chris Biemann

We present Ambient Search, an open source system for displaying and retrieving relevant documents in real time for speech input. The system works ambiently, that is, it unobtrusively listens to speech streams in the background, identifies keywords and keyphrases for query construction and continuously serves relevant documents from its index. Query terms are ranked with Word2Vec and TF-IDF and are continuously updated to allow for ongoing querying of a document collection. The retrieved documents, in our case Wikipedia articles, are visualized in real time in a browser interface. Our evaluation shows that Ambient Search compares favorably to another implicit information retrieval system on speech streams. Furthermore, we extrinsically evaluate multiword keyphrase generation, showing positive impact for manual transcriptions.

pdf pdf bib
Demonstrating Ambient Search: Implicit Document Retrieval for Speech Streams
Benjamin Milde | Jonas Wacker | Stefan Radomski | Max Mühlhäuser | Chris Biemann

In this demonstration paper we describe Ambient Search, a system that displays and retrieves documents in real time based on speech input. The system operates continuously in ambient mode, i.e. it generates speech transcriptions and identifies main keywords and keyphrases, while also querying its index to display relevant documents without an explicit query. Without user intervention, the results are dynamically updated; users can choose to interact with the system at any time, employing a conversation protocol that is enriched with the ambient information gathered continuously. Our evaluation shows that Ambient Search outperforms another implicit speech-based information retrieval system. Ambient Search is available as open source software.

2015

pdf pdf bib
JoBimViz: A Web-based Visualization for Graph-based Distributional Semantic Models
Eugen Ruppert | Manuel Kaufmann | Martin Riedl | Chris Biemann

pdf pdf bib
Distributional Semantics for Resolving Bridging Mentions
Tim Feuerbach | Martin Riedl | Chris Biemann

pdf pdf bib
Do Supervised Distributional Methods Really Learn Lexical Inference Relations?
Omer Levy | Steffen Remus | Chris Biemann | Ido Dagan

pdf pdf bib
Book Reviews: Ontology-Based Interpretation of Natural Language by Philipp Cimiano, Christina Unger and John McCrae
Chris Biemann

pdf pdf bib
A Single Word is not Enough: Ranking Multiword Expressions Using Distributional Semantics
Martin Riedl | Chris Biemann

2014

pdf pdf bib
That’s sick dude!: Automatic identification of word sense change across different timescales
Sunny Mitra | Ritwik Mitra | Martin Riedl | Chris Biemann | Animesh Mukherjee | Pawan Goyal

pdf pdf bib
Automatic Annotation Suggestions and Custom Annotation Layers in WebAnno
Seid Muhie Yimam | Chris Biemann | Richard Eckart de Castilho | Iryna Gurevych

pdf bib
Distributed Distributional Similarities of Google Books over the Centuries
Martin Riedl | Richard Steuer | Chris Biemann

pdf bib
NoSta-D Named Entity Annotation for German: Guidelines and Dataset
Darina Benikova | Chris Biemann | Marc Reznicek

pdf bib
Lexical Substitution Dataset for German
Kostadin Cholakov | Chris Biemann | Judith Eckle-Kohler | Iryna Gurevych

pdf pdf bib
Multiobjective Optimization and Unsupervised Lexical Acquisition for Named Entity Recognition and Classification
Govind | Asif Ekbal | Chris Biemann

pdf pdf bib
Combining Supervised and Unsupervised Parsing for Distributional Similarity
Martin Riedl | Irina Alles | Chris Biemann

2013

pdf pdf bib
WebAnno: A Flexible, Web-based and Visually Supported System for Distributed Annotations
Seid Muhie Yimam | Iryna Gurevych | Richard Eckart de Castilho | Chris Biemann

pdf pdf bib
SemEval-2013 Task 5: Evaluating Phrasal Semantics
Ioannis Korkontzelos | Torsten Zesch | Fabio Massimo Zanzotto | Chris Biemann

pdf pdf bib
Three Knowledge-Free Methods for Automatic Lexical Chain Extraction
Steffen Remus | Chris Biemann

pdf pdf bib
Supervised All-Words Lexical Substitution using Delexicalized Features
György Szarvas | Chris Biemann | Iryna Gurevych

pdf pdf bib
Exploring Cities in Crime: Significant Concordance and Co-occurrence in Quantitative Literary Analysis
Janneke Rauscher | Leonard Swiezinski | Martin Riedl | Chris Biemann

pdf pdf bib
JoBimText Visualizer: A Graph-based Approach to Contextualizing Distributional Similarity
Chris Biemann | Bonaventura Coppola | Michael R. Glass | Alfio Gliozzo | Matthew Hatem | Martin Riedl

pdf pdf bib
From Global to Local Similarities: A Graph-Based Contextualization Method using Distributional Thesauri
Martin Riedl | Chris Biemann

pdf pdf bib
Scaling to Large³ Data: An Efficient and Effective Method to Compute Distributional Thesauri
Martin Riedl | Chris Biemann

2012

pdf pdf bib
UKP: Computing Semantic Textual Similarity by Combining Multiple Content Similarity Measures
Daniel Bär | Chris Biemann | Iryna Gurevych | Torsten Zesch

pdf pdf bib
Book Review: Graph-Based Natural Language Processing and Information Retrieval by Rada Mihalcea and Dragomir Radev
Chris Biemann

pdf pdf bib
How Text Segmentation Algorithms Gain from Topic Models
Martin Riedl | Chris Biemann

pdf bib
Turk Bootstrap Word Sense Inventory 2.0: A Large-Scale Resource for Lexical Substitution
Chris Biemann

pdf pdf bib
Proceedings of the Joint Workshop on Unsupervised and Semi-Supervised Learning in NLP
Omri Abend | Chris Biemann | Anna Korhonen | Ari Rappoport | Roi Reichart | Anders Søgaard

pdf pdf bib
Sweeping through the Topic Space: Bad luck? Roll again!
Martin Riedl | Chris Biemann

pdf pdf bib
TopicTiling: A Text Segmentation Algorithm based on LDA
Martin Riedl | Chris Biemann

pdf pdf bib
Quantifying Semantics using Complex Network Analysis
Chris Biemann | Stefanie Roos | Karsten Weihe

pdf pdf bib
Using Distributional Similarity for Lexical Expansion in Knowledge-based Word Sense Disambiguation
Tristan Miller | Chris Biemann | Torsten Zesch | Iryna Gurevych

2011

pdf pdf bib
Proceedings of the Workshop on Distributional Semantics and Compositionality
Chris Biemann | Eugenie Giesbrecht

pdf pdf bib
Distributional Semantics and Compositionality 2011: Shared Task Description and Results
Chris Biemann | Eugenie Giesbrecht

pdf pdf bib
Proceedings of Workshop on Robust Unsupervised and Semisupervised Methods in Natural Language Processing
Chris Biemann | Anders Søgaard

2010

pdf pdf bib
Co-Occurrence Cluster Features for Lexical Substitutions in Context
Chris Biemann

2009

pdf pdf bib
Syntax is from Mars while Semantics from Venus! Insights from Spectral Analysis of Distributional Similarity Networks
Chris Biemann | Monojit Choudhury | Animesh Mukherjee

2008

pdf pdf bib
Coling 2008: Proceedings of the 3rd Textgraphs workshop on Graph-based Algorithms for Natural Language Processing
Irina Matveeva | Chris Biemann | Monojit Choudhury | Mona Diab

pdf bib
Unsupervised Parts-of-Speech Induction for Bengali
Joydeep Nath | Monojit Choudhury | Animesh Mukherjee | Christian Biemann | Niloy Ganguly

pdf bib
ASV Toolbox: a Modular Collection of Language Exploration Tools
Chris Biemann | Uwe Quasthoff | Gerhard Heyer | Florian Holz

2007

pdf pdf bib
Proceedings of the ACL 2007 Student Research Workshop
Chris Biemann | Violeta Seretan | Ellen Riloff

pdf pdf bib
A Random Text Model for the Generation of Statistical Language Invariants
Chris Biemann

pdf pdf bib
Unsupervised Natural Language Processing Using Graph Models
Chris Biemann

pdf pdf bib
Proceedings of the Second Workshop on TextGraphs: Graph-Based Algorithms for Natural Language Processing
Chris Biemann | Irina Matveeva | Rada Mihalcea | Dragomir Radev

pdf pdf bib
Combining Contexts in Lexicon Learning for Semantic Parsing
Richard Socher | Chris Biemann | Rainer Osswald

pdf pdf bib
Íslenskur Orðasjóður – Building a Large Icelandic Corpus
Erla Hallsteinsdóttir | Thomas Eckart | Chris Biemann | Uwe Quasthoff | Matthias Richter

2006

pdf pdf bib
Unsupervised Part-of-Speech Tagging Employing Efficient Graph Clustering
Chris Biemann

pdf pdf bib
Dictionary acquisition using parallel text and co-occurrence statistics
Chris Biemann | Uwe Quasthoff

pdf pdf bib
Rigorous dimensionality reduction through linguistically motivated feature selection for text categorization
Hans Friedrich Witschel | Chris Biemann

pdf bib
Corpus Portal for Search in Monolingual Corpora
Uwe Quasthoff | Matthias Richter | Christian Biemann

pdf pdf bib
Chinese Whispers - an Efficient Graph Clustering Algorithm and its Application to Natural Language Processing Problems
Chris Biemann

2004

pdf bib
Linguistic Corpus Search
Christian Biemann | Uwe Quasthoff | Christian Wolff

pdf bib
Automatic Acquisition of Paradigmatic Relations Using Iterated Co-occurrences
Chris Biemann | Stefan Bordag | Uwe Quasthoff

pdf bib
Web Services for Language Resources and Language Technology Applications
Christian Biemann | Stefan Bordag | Uwe Quasthoff | Christian Wolff

pdf pdf bib
Semiautomatic Extension of CoreNet using a Bootstrapping Mechanism on Corpus-based Co-occurrences
Chris Biemann | Sa-Im Shin | Key-Sun Choi

2002

pdf pdf bib
Named Entity Learning and Verification: Expectation Maximization in Large Corpora
Uwe Quasthoff | Christian Biemann | Christian Wolff
