Thierry Declerck

2023

For Sign Languages (SLs), can we create a SignNet, like a WordNet for spoken languages: a network of semantic relations between constitutive elements of SLs? We first discuss approaches that link SL data to wordnets, or integrate such elements with some adaptations into the structure of WordNet. Then, we present requirements for a SignNet, which is built on SL data and then linked to WordNet.

pdf bib abs
Towards an RDF Representation of the Infrastructure consisting in using Wordnets as a conceptual Interlingua between multilingual Sign Language Datasets
Thierry Declerck | Thomas Troelsgård | Sussi Olsen
Proceedings of the 12th Global Wordnet Conference

We present ongoing work dealing with a Linked Data compliant representation of infrastructures using wordnets for connecting multilingual Sign Language data sets. We build for this on already existing RDF and OntoLex representations of Open Multilingual Wordnet (OMW) data sets and work done by the European EASIER research project on the use of the CSV files of OMW for linking glosses and basic semantic information associated with Sign Language data sets in two languages: German and Greek. In this context, we started the transformation into RDF of a Danish data set, which links Danish Sign Language data and the wordnet for Danish, DanNet. The final objective of our work is to include Sign Language data sets (and their conceptual cross-linking via wordnets) in the Linguistic Linked Open Data cloud.

pdf bib
Leveraging DBnary Data to Enrich Information of Multiword Terms in Wiktionary
Gilles Sérasset | Thierry Declerck | Lenka Bajčetić
Proceedings of the 4th Conference on Language, Data and Knowledge

pdf bib
A uniform RDF-based Representation of the Interlinking of Wordnets and Sign Language Data
Thierry Declerck | Sam Bigeard | Dorians Callus | Benjamin Matthews | Sussi Olsen | Loran Ripard Xuereb
Proceedings of the 4th Conference on Language, Data and Knowledge

This paper reports on the shared tasks organized by the 20th IWSLT Conference. The shared tasks address 9 scientific challenges in spoken language translation: simultaneous and offline translation, automatic subtitling and dubbing, speech-to-speech translation, multilingual, dialect and low-resource speech translation, and formality control. The shared tasks attracted a total of 38 submissions by 31 teams. The growing interest towards spoken language translation is also witnessed by the constantly increasing number of shared task organizers and contributors to the overview paper, almost evenly distributed across industry and academia.

We present work dealing with a Linked Open Data (LOD)-compliant representation of Sign Language (SL) data, with the goal of supporting the cross-lingual alignment of SL data and their linking to Spoken Language (SpL) data. The proposed representation is based on activities of groups of researchers in the field of SL who have investigated the use of Open Multilingual Wordnet (OMW) datasets for (manually) cross-linking SL data or for linking SL and SpL data. Another group of researchers is proposing an XML encoding of articulatory elements of SLs and (manually) linking those to an SpL lexical resource. We propose an RDF-based representation of those various data. This unified formal representation offers a semantic repository of information on SL and SpL data that could be accessed for supporting the creation of datasets for training or evaluating NLP applications dealing with SLs, thinking for example of Machine Translation (MT) between SLs and between SLs and SpLs.

pdf bib abs
Linked Open Data compliant Representation of the Interlinking of Nordic Wordnets and Sign Language Data
Thierry Declerck | Sussi Olsen
Proceedings of the Second Workshop on Resources and Representations for Under-Resourced Languages and Domains (RESOURCEFUL-2023)

We present ongoing work dealing with a Linked Open Data (LOD) compliant representation of Sign Language (SL) data, with the goal of supporting the cross-lingual linking of SL data, also to Spoken Language data. As the European EASIER research project has already investigated the use of Open Multilingual Wordnet (OMW) datasets for cross-linking German and Greek SL data, we propose a unified RDF-based representation of OMW and SL data. In this context, we experimented with the transformation into RDF of a rich dataset, which links Danish Sign Language data and the wordnet for Danish, DanNet. We extend this work to other Nordic languages, aiming at supporting cross-lingual comparisons of Nordic Sign Languages. This unified formal representation offers a semantic repository of information on SL data that could be accessed for supporting the creation of datasets for training or evaluating NLP applications that involve SLs.

In the last five years, there has been a significant focus in Natural Language Processing (NLP) on developing larger Pretrained Language Models (PLMs) and introducing benchmarks such as SuperGLUE and SQuAD to measure their abilities in language understanding, reasoning, and reading comprehension. These PLMs have achieved impressive results on these benchmarks, even surpassing human performance in some cases. This has led to claims of superhuman capabilities and the provocative idea that certain tasks have been solved. In this position paper, we take a critical look at these claims and ask whether PLMs truly have superhuman abilities and what the current benchmarks are really evaluating. We show that these benchmarks have serious limitations affecting the comparison between humans and PLMs and provide recommendations for fairer and more transparent benchmarks.

pdf bib abs
Enriching Multiword Terms in Wiktionary with Pronunciation Information
Lenka Bajcetic | Thierry Declerck | Gilles Sérasset
Proceedings of the 19th Workshop on Multiword Expressions (MWE 2023)

We report on work in progress dealing with the automated generation of pronunciation information for English multiword terms (MWTs) in Wiktionary, combining information available for their single components. We describe the issues we were encountering, the building of an evaluation dataset, and our teaming with the DBnary resource maintainer. Our approach shows potential for automatically adding morphosyntactic and semantic information to the components of such MWTs.

2022

pdf bib abs
Using Wiktionary to Create Specialized Lexical Resources and Datasets
Lenka Bajčetić | Thierry Declerck
Proceedings of the Thirteenth Language Resources and Evaluation Conference

This paper describes an approach aiming at utilizing Wiktionary data for creating specialized lexical datasets which can be used for enriching other lexical (semantic) resources or for generating datasets that can be used for evaluating or improving NLP tasks, like Word Sense Disambiguation, Word-in-Context challenges, or Sense Linking across lexicons and dictionaries. We have focused on Wiktionary data about pronunciation information in English, and grammatical number and grammatical gender in German.

pdf bib abs
Towards a new Ontology for Sign Languages
Thierry Declerck
Proceedings of the Thirteenth Language Resources and Evaluation Conference

We present the current status of a new ontology for representing constitutive elements of Sign Languages (SL). This development emerged from investigations on how to represent multimodal lexical data in the OntoLex-Lemon framework, with the goal to publish such data in the Linguistic Linked Open Data (LLOD) cloud. While studying the literature and various sites dealing with sign languages, we saw the need to harmonise all the data categories (or features) defined and used in those sources, and to organise them in an ontology to which lexical descriptions in OntoLex-Lemon could be linked. We make the code of the first version of this ontology available, so that it can be further developed collaboratively by both the Linked Data and the SL communities

pdf bib abs
Towards the Profiling of Linked Lexicographic Resources
Lenka Bajcetic | Seung-bin Yim | Thierry Declerck
Proceedings of Globalex Workshop on Linked Lexicography within the 13th Language Resources and Evaluation Conference

This paper presents Edie: ELEXIS DIctionary Evaluator. Edie is designed to create profiles for lexicographic resources accessible through the ELEXIS platform. These profiles can be used to evaluate and compare lexicographic resources, and in particular they can be used to identify potential data that could be linked.

pdf bib abs
Towards the Linking of a Sign Language Ontology with Lexical Data
Thierry Declerck
Proceedings of Globalex Workshop on Linked Lexicography within the 13th Language Resources and Evaluation Conference

We describe our current work for linking a new ontology for representing constitutive elements of Sign Languages with lexical data encoded within the OntoLex-Lemon framework. We first present very briefly the current state of the ontology, and show how transcriptions of signs can be represented in OntoLex-Lemon, in a minimalist manner, before addressing the challenges of linking the elements of the ontology to full lexical descriptions of the spoken languages.

pdf bib
Proceedings of the 8th Workshop on Linked Data in Linguistics within the 13th Language Resources and Evaluation Conference
Thierry Declerck | John P. McCrae | Elena Montiel | Christian Chiarcos | Maxim Ionov
Proceedings of the 8th Workshop on Linked Data in Linguistics within the 13th Language Resources and Evaluation Conference

pdf bib abs
A Survey of Guidelines and Best Practices for the Generation, Interlinking, Publication, and Validation of Linguistic Linked Data
Fahad Khan | Christian Chiarcos | Thierry Declerck | Maria Pia Di Buono | Milan Dojchinovski | Jorge Gracia | Giedre Valunaite Oleskeviciene | Daniela Gifu
Proceedings of the 8th Workshop on Linked Data in Linguistics within the 13th Language Resources and Evaluation Conference

This article discusses a survey carried out within the NexusLinguarum COST Action which aimed to give an overview of existing guidelines (GLs) and best practices (BPs) in linguistic linked data. In particular it focused on four core tasks in the production/publication of linked data: generation, interlinking, publication, and validation. We discuss the importance of GLs and BPs for LLD before describing the survey and its results in full. Finally we offer a number of directions for future work in order to address the findings of the survey.

2021

pdf bib
Embeddings for the Lexicon: Modelling and Representation
Christian Chiarcos | Thierry Declerck | Maxim Ionov
Proceedings of the 6th Workshop on Semantic Deep Learning (SemDeep-6)

pdf bib abs
Towards the Addition of Pronunciation Information to Lexical Semantic Resources
Thierry Declerck | Lenka Bajčetić
Proceedings of the 11th Global Wordnet Conference

This paper describes ongoing work aiming at adding pronunciation information to lexical semantic resources, with a focus on open wordnets. Our goal is not only to add a new modality to those semantic networks, but also to mark heteronyms listed in them with the pronunciation information associated with their different meanings. This work could contribute in the longer term to the disambiguation of multi-modal resources, which are combining text and speech.

2020

The OntoLex vocabulary enjoys increasing popularity as a means of publishing lexical resources with RDF and as Linked Data. The recent publication of a new OntoLex module for lexicography, lexicog, reflects its increasing importance for digital lexicography. However, not all aspects of digital lexicography have been covered to the same extent. In particular, supplementary information drawn from corpora such as frequency information, links to attestations, and collocation data were considered to be beyond the scope of lexicog. Therefore, the OntoLex community has put forward the proposal for a novel module for frequency, attestation and corpus information (FrAC), that not only covers the requirements of digital lexicography, but also accommodates essential data structures for lexical information in natural language processing. This paper introduces the current state of the OntoLex-FrAC vocabulary, describes its structure, some selected use cases, elementary concepts and fundamental definitions, with a focus on frequency and attestations.

pdf bib abs
Towards an Extension of the Linking of the Open Dutch WordNet with Dutch Lexicographic Resources
Thierry Declerck
Proceedings of the 2020 Globalex Workshop on Linked Lexicography

This extended abstract presents on-going work consisting in interlinking and merging the Open Dutch WordNet and generic lexicographic resources for Dutch, focusing for now on the Dutch and English versions of Wiktionary and using the Algemeen Nederlands Woordenboek as a quality checking instance. As the Open Dutch WordNet is already equipped with a relevant number of complex lexical units, we are aiming at expanding it and proposing a new representational framework for the encoding of the interlinked and integrated data. The longer term goal of the work is to investigate if and on how senses can be restricted to particular morphological variations of Dutch lexical entries, and how to represent this information in a Linguistic Linked Open Data compliant format.

Aligning senses across resources and languages is a challenging task with beneficial applications in the field of natural language processing and electronic lexicography. In this paper, we describe our efforts in manually aligning monolingual dictionaries. The alignment is carried out at sense-level for various resources in 15 languages. Moreover, senses are annotated with possible semantic relationships such as broadness, narrowness, relatedness, and equivalence. In comparison to previous datasets for this task, this dataset covers a wide range of languages and resources and focuses on the more challenging task of linking general-purpose language. We believe that our data will pave the way for further advances in alignment and evaluation of word senses by creating new solutions, particularly those notoriously requiring data such as neural networks. Our resources are publicly available at https://github.com/elexis-eu/MWSA.

pdf bib abs
Language Data Sharing in European Public Services – Overcoming Obstacles and Creating Sustainable Data Sharing Infrastructures
Lilli Smal | Andrea Lösch | Josef van Genabith | Maria Giagkou | Thierry Declerck | Stephan Busemann
Proceedings of the Twelfth Language Resources and Evaluation Conference

Data is key in training modern language technologies. In this paper, we summarise the findings of the first pan-European study on obstacles to sharing language data across 29 EU Member States and CEF-affiliated countries carried out under the ELRC White Paper action on Sustainable Language Data Sharing to Support Language Equality in Multilingual Europe. Why Language Data Matters. We present the methodology of the study, the obstacles identified and report on recommendations on how to overcome those. The obstacles are classified into (1) lack of appreciation of the value of language data, (2) structural challenges, (3) disposition towards CAT tools and lack of digital skills, (4) inadequate language data management practices, (5) limited access to outsourced translations, and (6) legal concerns. Recommendations are grouped into addressing the European/national policy level, and the organisational/institutional level.

In this paper we describe the contributions made by the European H2020 project “Prêt-à-LLOD” (‘Ready-to-use Multilingual Linked Language Data for Knowledge Services across Sectors’) to the further development of the Linguistic Linked Open Data (LLOD) infrastructure. Prêt-à-LLOD aims to develop a new methodology for building data value chains applicable to a wide range of sectors and applications and based around language resources and language technologies that can be integrated by means of semantic technologies. We describe the methods implemented for increasing the number of language data sets in the LLOD. We also present the approach for ensuring interoperability and for porting LLOD data sets and services to other infrastructures, as well as the contribution of the projects to existing standards.

pdf bib abs
Adding Pronunciation Information to Wordnets
Thierry Declerck | Lenka Bajcetic | Melanie Siegel
Proceedings of the LREC 2020 Workshop on Multimodal Wordnets (MMW2020)

We describe on-going work consisting in adding pronunciation information to wordnets, as such information can indicate specific senses of a word. Many wordnets associate with their senses only a lemma form and a part-of-speech tag. At the same time, we are aware that additional linguistic information can be useful for identifying a specific sense of a wordnet lemma when encountered in a corpus. While work already deals with the addition of grammatical number or grammatical gender information to wordnet lemmas,we are investigating the linking of wordnet lemmas to pronunciation information, adding thus a speech-related modality to wordnets

pdf bib abs
On the Linguistic Linked Open Data Infrastructure
Christian Chiarcos | Bettina Klimek | Christian Fäth | Thierry Declerck | John Philip McCrae
Proceedings of the 1st International Workshop on Language Technology Platforms

In this paper we describe the current state of development of the Linguistic Linked Open Data (LLOD) infrastructure, an LOD(sub-)cloud of linguistic resources, which covers various linguistic data bases, lexicons, corpora, terminology and metadata repositories. We give in some details an overview of the contributions made by the European H2020 projects “Prêt-à-LLOD” (‘Ready-to-useMultilingual Linked Language Data for Knowledge Services across Sectors’) and “ELEXIS” (‘European Lexicographic Infrastructure’) to the further development of the LLOD.

2019

pdf bib abs
Porting Multilingual Morphological Resources to OntoLex-Lemon
Thierry Declerck | Stefania Racioppa
Proceedings of the International Conference on Recent Advances in Natural Language Processing (RANLP 2019)

We describe work consisting in porting various morphological resources to the OntoLex-Lemon model. A main objective of this work is to offer a uniform representation of different morphological data sets in order to be able to compare and interlink multilingual resources and to cross-check and interlink or merge the content of morphological resources of one and the same language. The results of our work will be published on the Linguistic Linked Open Data cloud.

pdf bib abs
Using OntoLex-Lemon for Representing and Interlinking German Multiword Expressions in OdeNet and MMORPH
Thierry Declerck | Melanie Siegel | Stefania Racioppa
Proceedings of the Joint Workshop on Multiword Expressions and WordNet (MWE-WN 2019)

We describe work consisting in porting two large German lexical resources into the OntoLex-Lemon model in order to establish complementary interlinkings between them. One resource is OdeNet (Open German WordNet) and the other is a further development of the German version of the MMORPH morphological analyzer. We show how the Multiword Expressions (MWEs) contained in OdeNet can be morphologically specified by the use of the lexical representation and linking features of OntoLex-Lemon, which also support the formulation of restrictions in the usage of such expressions.

pdf bib
Proceedings of the 5th Workshop on Semantic Deep Learning (SemDeep-5)
Luis Espinosa-Anke | Thierry Declerck | Dagmar Gromann | Jose Camacho-Collados | Mohammad Taher Pilehvar
Proceedings of the 5th Workshop on Semantic Deep Learning (SemDeep-5)

pdf bib abs
OntoLex as a possible Bridge between WordNets and full lexical Descriptions
Thierry Declerck | Melanie Siegel
Proceedings of the 10th Global Wordnet Conference

In this paper we describe our current work on representing a recently created German lexical semantics resource in OntoLex-Lemon and in conformance with WordNet specifications. Besides presenting the representation effort, we show the utilization of OntoLex-Lemon to bridge from WordNet-like resources to full lexical descriptions and extend the coverage of WordNets to other types of lexical data, such as decomposition results, exemplified for German data, and inflectional phenomena, here outlined for English data.

2018

pdf bib
Proceedings of the Third Workshop on Semantic Deep Learning
Luis Espinosa Anke | Dagmar Gromann | Thierry Declerck
Proceedings of the Third Workshop on Semantic Deep Learning

pdf bib
Comparing Pretrained Multilingual Word Embeddings on an Ontology Alignment Task
Dagmar Gromann | Thierry Declerck
Proceedings of the Eleventh International Conference on Language Resources and Evaluation (LREC 2018)

pdf bib
An Integrated Formal Representation for Terminological and Lexical Data included in Classification Schemes
Thierry Declerck | Kseniya Egorova | Eileen Schnur
Proceedings of the Eleventh International Conference on Language Resources and Evaluation (LREC 2018)

pdf bib
European Language Resource Coordination: Collecting Language Resources for Public Sector Multilingual Information Management
Andrea Lösch | Valérie Mapelli | Stelios Piperidis | Andrejs Vasiļjevs | Lilli Smal | Thierry Declerck | Eileen Schnur | Khalid Choukri | Josef van Genabith
Proceedings of the Eleventh International Conference on Language Resources and Evaluation (LREC 2018)

2017

pdf bib
Proceedings of the 2nd Workshop on Semantic Deep Learning (SemDeep-2)
Dagmar Gromann | Thierry Declerck | Georg Heigl
Proceedings of the 2nd Workshop on Semantic Deep Learning (SemDeep-2)

pdf bib abs
Multilingual Ontologies for the Representation and Processing of Folktales
Thierry Declerck | Anastasija Aman | Martin Banzer | Dominik Macháček | Lisa Schäfer | Natalia Skachkova
Proceedings of the First Workshop on Language technology for Digital Humanities in Central and (South-)Eastern Europe

We describe work done in the field of folkloristics and consisting in creating ontologies based on well-established studies proposed by “classical” folklorists. This work is supporting the availability of a huge amount of digital and structured knowledge on folktales to digital humanists. The ontological encoding of past and current motif-indexation and classification systems for folktales was in the first step limited to English language data. This led us to focus on making those newly generated formal knowledge sources available in a few more languages, like German, Russian and Bulgarian. We stress the importance of achieving this multilingual extension of our ontologies at a larger scale, in order for example to support the automated analysis and classification of such narratives in a large variety of languages, as those are getting more and more accessible on the Web.

pdf bib abs
Hashtag Processing for Enhanced Clustering of Tweets
Dagmar Gromann | Thierry Declerck
Proceedings of the International Conference Recent Advances in Natural Language Processing, RANLP 2017

Rich data provided by tweets have beenanalyzed, clustered, and explored in a variety of studies. Typically those studies focus on named entity recognition, entity linking, and entity disambiguation or clustering. Tweets and hashtags are generally analyzed on sentential or word level but not on a compositional level of concatenated words. We propose an approach for a closer analysis of compounds in hashtags, and in the long run also of other types of text sequences in tweets, in order to enhance the clustering of such text documents. Hashtags have been used before as primary topic indicators to cluster tweets, however, their segmentation and its effect on clustering results have not been investigated to the best of our knowledge. Our results with a standard dataset from the Text REtrieval Conference (TREC) show that segmented and harmonized hashtags positively impact effective clustering.

2016

pdf bib abs
Towards a WordNet based Classification of Actors in Folktales
Thierry Declerck | Tyler Klement | Antonia Kostova
Proceedings of the 8th Global WordNet Conference (GWC)

In the context of a student software project we are investigating the use of WordNet for improving the automatic detection and classification of actors (or characters) mentioned in folktales. Our starting point is the book “Classification of International Folktales”, out of which we extract text segments that name the different actors involved in tales, taking advantage of patterns used by its author, Hans-Jo ̈rg Uther. We apply on those text segments functions that are implemented in the NLTK interface to WordNet in order to obtain lexical semantic information to enrich the original naming of characters proposed in the “Classification of International Folktales” and to support their translation in other languages.

The Open Linguistics Working Group (OWLG) brings together researchers from various fields of linguistics, natural language processing, and information technology to present and discuss principles, case studies, and best practices for representing, publishing and linking linguistic data collections. A major outcome of our work is the Linguistic Linked Open Data (LLOD) cloud, an LOD (sub-)cloud of linguistic resources, which covers various linguistic databases, lexicons, corpora, terminologies, and metadata repositories. We present and summarize five years of progress on the development of the cloud and of advancements in open data in linguistics, and we describe recent community activities. The paper aims to serve as a guideline to orient and involve researchers with the community and/or Linguistic Linked Open Data.

pdf bib abs
Monolingual Social Media Datasets for Detecting Contradiction and Entailment
Piroska Lendvai | Isabelle Augenstein | Kalina Bontcheva | Thierry Declerck
Proceedings of the Tenth International Conference on Language Resources and Evaluation (LREC'16)

Entailment recognition approaches are useful for application domains such as information extraction, question answering or summarisation, for which evidence from multiple sentences needs to be combined. We report on a new 3-way judgement Recognizing Textual Entailment (RTE) resource that originates in the Social Media domain, and explain our semi-automatic creation method for the special purpose of information verification, which draws on manually established rumourous claims reported during crisis events. From about 500 English tweets related to 70 unique claims we compile and evaluate 5.4k RTE pairs, while continue automatizing the workflow to generate similar-sized datasets in other languages.

pdf bib
Towards a Formal Representation of Components of German Compounds
Thierry Declerck | Piroska Lendvai
Proceedings of the 14th SIGMORPHON Workshop on Computational Research in Phonetics, Phonology, and Morphology

2015

pdf bib
Towards the Representation of Hashtags in Linguistic Linked Open Data Format
Thierry Declerck | Piroska Lendvai
Proceedings of the Second Workshop on Natural Language Processing and Linked Open Data

pdf bib
Processing and Normalizing Hashtags
Thierry Declerck | Piroska Lendvai
Proceedings of the International Conference Recent Advances in Natural Language Processing

2014

pdf bib
How to semantically relate dialectal Dictionaries in the Linked Data Framework
Thierry Declerck | Eveline Wandl-Vogt
Proceedings of the 8th Workshop on Language Technology for Cultural Heritage, Social Sciences, and Humanities (LaTeCH)

pdf bib
Harmonizing Lexical Data for their Linking to Knowledge Objects in the Linked Data Framework
Thierry Declerck
Proceedings of Workshop on Lexical and Grammatical Resources for Language Processing

pdf bib
SentiMerge: Combining Sentiment Lexicons in a Bayesian Framework
Guy Emerson | Thierry Declerck
Proceedings of Workshop on Lexical and Grammatical Resources for Language Processing

pdf bib abs
TMO — The Federated Ontology of the TrendMiner Project
Hans-Ulrich Krieger | Thierry Declerck
Proceedings of the Ninth International Conference on Language Resources and Evaluation (LREC'14)

This paper describes work carried out in the European project TrendMiner which partly deals with the extraction and representation of real time information from dynamic data streams. The focus of this paper lies on the construction of an integrated ontology, TMO, the TrendMiner Ontology, that has been assembled from several independent multilingual taxonomies and ontologies which are brought together by an interface specification, expressed in OWL. Within TrendMiner, TMO serves as a common language that helps to interlink data, delivered from both symbolic and statistical components of the TrendMiner system. Very often, the extracted data is supplied as quintuples, RDF triples that are extended by two further temporal arguments, expressing the temporal extent in which an atemporal statement is true. In this paper, we will also sneak a peek on the temporal entailment rules and queries that are built into the semantic repository hosting the data and which can be used to derive useful new information.

pdf bib abs
A SKOS-based Schema for TEI encoded Dictionaries at ICLTT
Thierry Declerck | Karlheinz Mörth | Eveline Wandl-Vogt
Proceedings of the Ninth International Conference on Language Resources and Evaluation (LREC'14)

At our institutes we are working with quite some dictionaries and lexical resources in the field of less-resourced language data, like dialects and historical languages. We are aiming at publishing those lexical data in the Linked Open Data framework in order to link them with available data sets for highly-resourced languages and elevating them thus to the same digital dignity the mainstream languages have gained. In this paper we concentrate on two TEI encoded variants of the Arabic language and propose a mapping of this TEI encoded data onto SKOS, showing how the lexical entries of the two dialectal dictionaries can be linked to other language resources available in the Linked Open Data cloud.

pdf bib abs
Harmonization of German Lexical Resources for Opinion Mining
Thierry Declerck | Hans-Ulrich Krieger
Proceedings of the Ninth International Conference on Language Resources and Evaluation (LREC'14)

We present on-going work on the harmonization of existing German lexical resources in the field of opinion and sentiment mining. The input of our harmonization effort consisted in four distinct lexicons of German word forms, encoded either as lemmas or as full forms, marked up with polarity features, at distinct granularity levels. We describe how the lexical resources have been mapped onto each other, generating a unique list of entries, with unified Part-of-Speech information and basic polarity features. Future work will be dedicated to the comparison of the harmonized lexicon with German corpora annotated with polarity information. We are further aiming at both linking the harmonized German lexical resources with similar resources in other languages and publishing the resulting set of lexical data in the context of the Linguistic Linked Open Data cloud.

2013

pdf bib
Integration of the Thesaurus for the Social Sciences (TheSoz) in an Information Extraction System
Thierry Declerck
Proceedings of the 7th Workshop on Language Technology for Cultural Heritage, Social Sciences, and Humanities

pdf bib
Porting Elements of the Austrian Baroque Corpus onto the Linguistic Linked Open Data Format
Ulrike Czeitschner | Thierry Declerck | Claudia Resch
Proceedings of the Joint Workshop on NLP&LOD and SWAIE: Semantic Web, Linked Open Data and Information Extraction

pdf bib
Linguistically analyzed labels of knowledge objects: How can they support OBIE? Lessons learned from the Monnet and TrendMiner projects
Thierry Declerck
Proceedings of the Joint Workshop on NLP&LOD and SWAIE: Semantic Web, Linked Open Data and Information Extraction

pdf bib
Proceedings of the 2nd Workshop on Linked Data in Linguistics (LDL-2013): Representing and linking lexicons, terminologies and other language data
Christian Chiarcos | Philipp Cimiano | Thierry Declerck | John P. McCrae
Proceedings of the 2nd Workshop on Linked Data in Linguistics (LDL-2013): Representing and linking lexicons, terminologies and other language data

pdf bib
Linguistic Linked Open Data (LLOD). Introduction and Overview
Christian Chiarcos | Philipp Cimiano | Thierry Declerck | John P. McCrae
Proceedings of the 2nd Workshop on Linked Data in Linguistics (LDL-2013): Representing and linking lexicons, terminologies and other language data

2012

pdf bib abs
Accessing and standardizing Wiktionary lexical entries for the translation of labels in Cultural Heritage taxonomies
Thierry Declerck | Karlheinz Mörth | Piroska Lendvai
Proceedings of the Eighth International Conference on Language Resources and Evaluation (LREC'12)

We describe the usefulness of Wiktionary, the freely available web-based lexical resource, in providing multilingual extensions to catalogues that serve content-based indexing of folktales and related narratives. We develop conversion tools between Wiktionary and TEI, using ISO standards (LMF, MAF), to make such resources available to both the Digital Humanities community and the Language Resources community. The converted data can be queried via a web interface, while the tools of the workflow are to be released with an open source license. We report on the actual state and functionality of our tools and analyse some shortcomings of Wiktionary, as well as potential domains of application.

This paper presents a metadata model for the description of language resources proposed in the framework of the META-SHARE infrastructure, aiming to cover both datasets and tools/technologies used for their processing. It places the model in the overall framework of metadata models, describes the basic principles and features of the model, elaborates on the distinction between minimal and maximal versions thereof, briefly presents the integrated environment supporting the LRs description and search and retrieval processes and concludes with work to be done in the future for the improvement of the model.

pdf bib
Ontology-Based Incremental Annotation of Characters in Folktales
Thierry Declerck | Nikolina Koleva | Hans-Ulrich Krieger
Proceedings of the 6th Workshop on Language Technology for Cultural Heritage, Social Sciences, and Humanities

2010

pdf bib abs
LAF/GrAF-grounded Representation of Dependency Structures
Yoshihiko Hayashi | Thierry Declerck | Chiharu Narawa
Proceedings of the Seventh International Conference on Language Resources and Evaluation (LREC'10)

This paper shows that a LAF/GrAF-based annotation schema can be used for the adequate representation of syntactic dependency structures possibly in many languages. We first argue that there are at least two types of textual units that can be annotated with dependency information: words/tokens and chunks/phrases. We especially focus on importance of the latter dependency unit: it is particularly useful for representing Japanese dependency structures, known as Kakari-Uke structure. Based on this consideration, we then discuss a sub-typing of GrAF to represent the corresponding dependency structures. We derive three node types, two edge types, and the associated constraints for properly representing both the token-based and the chunk-based dependency structures. We finally propose a wrapper program that, as a proof of concept, converts output data from different dependency parsers in proprietary XML formats to the GrAF-compliant XML representation. It partially proves the value of an international standard like LAF/GrAF in the Web service context: an existing dependency parser can be, in a sense, standardized, once wrapped by a data format conversion process.

pdf bib abs
Integration of Linguistic Markup into Semantic Models of Folk Narratives: The Fairy Tale Use Case
Piroska Lendvai | Thierry Declerck | Sándor Darányi | Pablo Gervás | Raquel Hervás | Scott Malec | Federico Peinado
Proceedings of the Seventh International Conference on Language Resources and Evaluation (LREC'10)

Propp's influential structural analysis of fairy tales created a powerful schema for representing storylines in terms of character functions, which is directly exploitable for computational semantic analysis, and procedural generation of stories of this genre. We tackle two resources that draw on the Proppian model - one formalizes it as a semantic markup scheme and the other as an ontology -, both lacking linguistic phenomena explicitly represented in them. The need for integrating linguistic information into structured semantic resources is motivated by the emergence of suitable standards that facilitate this, as well as the benefits such joint representation would create for transdisciplinary research across Digital Humanities, Computational Linguistics, and Artificial Intelligence.

pdf bib abs
Towards a Standardized Linguistic Annotation of the Textual Content of Labels in Knowledge Representation Systems
Thierry Declerck | Piroska Lendvai
Proceedings of the Seventh International Conference on Language Resources and Evaluation (LREC'10)

WWe propose applying standardized linguistic annotation to terms included in labels of knowledge representation schemes (taxonomies or ontologies), hypothesizing that this would help improving ontology-based semantic annotation of texts. We share the view that currently used methods for including lexical and terminological information in such hierarchical networks of concepts are not satisfactory, and thus put forward ― as a preliminary step to our annotation goal ― a model for modular representation of conceptual, terminological and linguistic information within knowledge representation systems. Our CTL model is based on two recent initiatives that describe the representation of terminologies and lexicons in ontologies: the Terminae method for building terminological and ontological models from text (Aussenac-Gilles et al., 2008), and the LexInfo metamodel for ontology lexica (Buitelaar et al., 2009). CTL goes beyond the mere fusion of the two models and introduces an additional level of representation for the linguistic objects, whereas those are no longer limited to lexical information but are covering the full range of linguistic phenomena, including constituency and dependency. We also show that the approach benefits linguistic and semantic analysis of external documents that are often to be linked to semantic resources for enrichment with concepts that are newly extracted or inferred.

pdf bib abs
Extraction, Merging, and Monitoring of Company Data from Heterogeneous Sources
Christian Federmann | Thierry Declerck
Proceedings of the Seventh International Conference on Language Resources and Evaluation (LREC'10)

We describe the implementation of an enterprise monitoring system that builds on an ontology-based information extraction (OBIE) component applied to heterogeneous data sources. The OBIE component consists of several IE modules - each extracting on a regular temporal basis a specific fraction of company data from a given data source - and a merging tool, which is used to aggregate all the extracted information about a company. The full set of information about companies, which is to be extracted and merged by the OBIE component, is given in the schema of a domain ontology, which is guiding the information extraction process. The monitoring system, in case it detects changes in the extracted and merged information on a company with respect to the actual state of the knowledge base of the underlying ontology, ensures the update of the population of the ontology. As we are using an ontology extended with temporal information, the system is able to assign time intervals to any of the object instances. Additionally, detected changes can be communicated to end-users, who can validate and possibly correct the resulting updates in the knowledge base.

2009

pdf bib
Proceedings of SRSL 2009, the 2nd Workshop on Semantic Representation of Spoken Language
Manuel Alcantara-Pla | Thierry Declerck
Proceedings of SRSL 2009, the 2nd Workshop on Semantic Representation of Spoken Language

pdf bib
Concept and Relation Extraction in the Finance Domain
Mihaela Vela | Thierry Declerck
Proceedings of the Eight International Conference on Computational Semantics

2008

pdf bib abs
Foundation of a Component-based Flexible Registry for Language Resources and Technology
Daan Broeder | Thierry Declerck | Erhard Hinrichs | Stelios Piperidis | Laurent Romary | Nicoletta Calzolari | Peter Wittenburg
Proceedings of the Sixth International Conference on Language Resources and Evaluation (LREC'08)

Within the CLARIN e-science infrastructure project it is foreseen to develop a component-based registry for metadata for Language Resources and Language Technology. With this registry it is hoped to overcome the problems of the current available systems with respect to inflexible fixed schema, unsuitable terminology and interoperability problems. The registry will address interoperability needs by refering to a shared vocabulary registered in data category registries as they are suggested by ISO.

pdf bib abs
A Framework for Standardized Syntactic Annotation
Thierry Declerck
Proceedings of the Sixth International Conference on Language Resources and Evaluation (LREC'08)

This poster presents an ISO framework for the standardization of syntactic annotation (SynAF). The normative part SynAF is concerned with a metamodel for syntactic annotation that covers both dimensions of constituency and dependency, and propose thus a multi-layered annotation framework that allows the combined and interrelated annotation of language data along both lines of consideration. This standard is designed to be used in close conjuncion with the metamodel presented in the Linguistic Annotation Framework (LAF) and with ISO 12620, Terminology and other language resources - Data categories.

2006

pdf bib
Annotating text using the Linguistic Description Scheme of MPEG-7: The DIRECT-INFO Scenario
Thierry Declerck | Stephan Busemann | Herwig Rehatschek | Gert Kienast
Proceedings of the 5th Workshop on NLP and XML (NLPXML-2006): Multi-Dimensional Markup in Natural Language Processing

pdf bib abs
Generic NLP Tools for Supporting Shallow Ontology Building
Thierry Declerck | Mihaela Vela
Proceedings of the Fifth International Conference on Language Resources and Evaluation (LREC’06)

In this paper we present on-going investigations on how complex syntactic annotation, combined with linguistic semantics, can possibly help in supporting the semi-automatic building of (shallow) ontologies from text by proposing an automated extraction of (possibly underspecified) semantic relations from linguistically annotated text.

pdf bib abs
Multilingual Lexical Semantic Resources for Ontology Translation
Thierry Declerck | Asunción Gómez Pérez | Ovidiu Vela | Zeno Gantner | David Manzano-Macho
Proceedings of the Fifth International Conference on Language Resources and Evaluation (LREC’06)

We describe the integration of some multilingual language resources in ontological descriptions, with the purpose of providing ontologies, which are normally using concept labels in just one (natural) language, with multilingual facility in their design and use in the context of Semantic Web applications, supporting both the semantic annotation of textual documents with multilingual ontology labels and ontology extraction from multilingual text sources.

pdf bib abs
SynAF: Towards a Standard for Syntactic Annotation
Thierry Declerck
Proceedings of the Fifth International Conference on Language Resources and Evaluation (LREC’06)

In the paper we present the actual state of development of an international standard for syntactic annotation, called SynAF. This standard is being prepared by the Technical Committee ISO/TC 37 (Terminology and Other Language Resources), Subcommittee SC 4 (Language Resource Management), in collaboration with the European eContent Project LIRICS (Linguistic Infrastructure for Interoperable Resources and Systems).