Proceedings of the Fifth International Conference on Language Resources and Evaluation (LREC’06)

Nicoletta Calzolari, Khalid Choukri, Aldo Gangemi, Bente Maegaard, Joseph Mariani, Jan Odijk, Daniel Tapias (Editors)


Anthology ID:
L06-1
Month:
May
Year:
2006
Address:
Genoa, Italy
Venue:
LREC
SIG:
Publisher:
European Language Resources Association (ELRA)
URL:
https://aclanthology.org/L06-1
DOI:

bib
Proceedings of the Fifth International Conference on Language Resources and Evaluation (LREC’06)
Nicoletta Calzolari | Khalid Choukri | Aldo Gangemi | Bente Maegaard | Joseph Mariani | Jan Odijk | Daniel Tapias

pdf bib
DanPASS - A Danish Phonetically Annotated Spontaneous Speech Corpus
Nina Grønnum

A corpus is described consisting of non-scripted monologues and dialogues, recorded by 22 speakers, comprising a total of about 70,000 words, corresponding to well over 10 hours of speech. The monologues were recorded as one-way communication with a blind partner, in which the speaker performed three different tasks: (s)he described a network consisting of various geometrical shapes in various colours; (s)he guided the listener through four different routes in a virtual city map; and (s)he instructed the listener how to build a house from its individual parts. The dialogues are replicas of the HCRC map tasks (http://www.hcrc.ed.ac.uk/maptask/). Annotation is performed in Praat. The sound files are segmented into prosodic phrases, words, and syllables, always to the nearest zero-crossing in the waveform. The corpus is supplied, in seven separate interval tiers, with an orthographical transcription, detailed part-of-speech tags, simplified part-of-speech tags, a phonological transcription, a broad phonetic transcription, the pitch relation between each stressed and post-tonic syllable, the phrasal intonation, and an empty tier for comments.

pdf bib
A Hebrew Tree Bank Based on Cantillation Marks
Andi Wu | Kirk Lowery

In the Masoretic text of the Hebrew Bible (HB), the cantillation marks function like a punctuation system that shows the division and subdivision of each verse, forming a tree structure which is similar to the prosodic tree in modern linguistics. However, in the Masoretic text, the structure is hidden in a complicated set of diacritic symbols and the rich information is accessible only to a few trained scholars. In order to make the structural information available to the general public and to automatic processing by computer, we built a tree bank where the hierarchical structure of each HB verse is explicitly represented in XML format. We coded the punctuation system in a context-free grammar which was then used by a CYK parser to automatically generate trees for the whole HB. The results show that (1) the CFG correctly encoded the annotation rules and (2) the annotation done by the Masoretes is highly consistent.

pdf bib
Techno-langue: The French National Initiative for Human Language Technologies (HLT)
Stéphane Chaudiron | Joseph Mariani

Techno-langue is the French National Program on HLT supported by the French Ministries in charge of Research, Industry and Culture. It addresses four action lines: creating basic language and software resources, organizing evaluation campaigns, participating in the standardization process and creating a Web Portal for disseminating information and surveys to a large audience. This paper presents the main results of the program and an ongoing initiative for launching a transnational program at the European level on a similar basis.

pdf bib
REGULUS: A Generic Multilingual Open Source Platform for Grammar-Based Speech Applications
Manny Rayner | Pierrette Bouillon | Beth Ann Hockey | Nikos Chatzichrisafis

We present an overview of Regulus, an Open Source platform that supports corpus-based derivation of efficient domain-specific speech recognisers from general linguistically motivated unification grammars. We list available Open Source resources, which include compilers, resource grammars for various languages, documentation and a development environment. The greater part of the paper presents a series of experiments carried out using a medium-vocabulary medical speech translation application and a corpus of 801 recorded domain utterances, designed to investigate the impact on speech understanding performance of vocabulary size, grammatical coverage, presence or absence of various linguistic features, degree of generality of the grammar and use or otherwise of probabilistic weighting in the CFG language model. In terms of task accuracy, the most significant factors were the use of probabilistic weighting, the degree of generality of the grammar and the inclusion of features which model sortal restrictions.

pdf bib
Extraction of Temporal Information from Texts in Swedish
Anders Berglund | Richard Johansson | Pierre Nugues

This paper describes the implementation and evaluation of a generic component to extract temporal information from texts in Swedish. It proceeds in two steps. The first step extracts time expressions and events, and generates a feature vector for each element it identifies. Using the vectors, the second step determines the temporal relations, possibly none, between the extracted events and orders them in time. We used a machine learning approach to find the relations between events. To run the learning algorithm, we collected a corpus of road accident reports from newspaper websites, which we manually annotated. This enabled us to train decision trees and to evaluate the performance of the algorithm.

pdf bib
New Approach to Frequency Dictionaries - Czech Example
Jaroslava Hlaváčová

Using the example of the recent edition of the Frequency Dictionary of Czech, we describe and explain some new general principles that should be followed to obtain better results for practical uses of frequency dictionaries. The main one is adopting average reduced frequency instead of absolute frequency for ordering items. The formula for calculating the average reduced frequency is presented in the contribution together with a brief explanation, including examples clarifying the difference between the measures. Then, the Frequency Dictionary of Czech and its parts are described.
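To make the ordering criterion concrete, the following is a minimal Python sketch of average reduced frequency as it is commonly defined in the literature on frequency dictionaries; the toy corpus positions are invented, and the exact formulation used in the dictionary may differ in detail.

def average_reduced_frequency(positions, corpus_size):
    """Average reduced frequency (ARF) of a word.

    positions   -- sorted 0-based token positions of the word's occurrences
    corpus_size -- total number of tokens in the corpus

    A word occurring f times in a corpus of N tokens gets
    ARF = (1/v) * sum(min(d_i, v)), where v = N/f and the d_i are the
    gaps between successive occurrences (measured cyclically, so they
    sum to N). Evenly spread words keep ARF close to f; words clumped
    in a single document get a much smaller value.
    """
    f = len(positions)
    if f == 0:
        return 0.0
    v = corpus_size / f
    # cyclic gaps: distance from each occurrence to the next,
    # wrapping from the last occurrence back to the first
    gaps = [positions[i + 1] - positions[i] for i in range(f - 1)]
    gaps.append(corpus_size - positions[-1] + positions[0])
    return sum(min(d, v) for d in gaps) / v

if __name__ == "__main__":
    # toy corpus of 1000 tokens: "evenly" occurs every 100 tokens,
    # "clumped" occurs 10 times in one short stretch
    evenly = list(range(0, 1000, 100))
    clumped = list(range(500, 510))
    print(average_reduced_frequency(evenly, 1000))   # close to 10
    print(average_reduced_frequency(clumped, 1000))  # close to 1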

pdf bib
A Spell Checker for a World Language: The New Microsoft’s Spanish Spell Checker
Flora Ramírez Bustamante | Alfredo Arnaiz | Mar Ginés

This paper reports work carried out to develop a speller for Spanish at Microsoft Corporation, discusses the technique for isolated-word error correction used by the speller, provides general descriptions of the error data collection and error typology, and surveys a variety of linguistic considerations relevant when dealing with a world language spread over several countries and exposed to different language influences. We show that, even though it has been claimed that the state of the art for practical applications based on isolated-word error correction does not always offer a sensible set of ranked candidates for a misspelling, the introduction of a finer-grained categorization of errors and the use of their relative frequencies has had a positive impact on the speller application developed for Spanish (the corresponding evaluation data is presented).

pdf bib
TransType2 : The Last Word
Elliott Macklovitch

This paper presents the results of the usability evaluations that were conducted within TransType2, an international R&D project the goal of which was to develop a novel approach to interactive machine translation. We briefly sketch the TransType system and then describe the methodology that we elaborated for the five rounds of user trials that were held on the premises of two translation agencies over the last eighteen months of the project. We provide the productivity results posted by the six translators who tested the system and we also discuss some of the non-quantitative factors which influenced the users’ reaction to TransType.

pdf bib
Designing and Recording an Emotional Speech Database for Corpus Based Synthesis in Basque
Ibon Saratxaga | Eva Navas | Inmaculada Hernáez | Iker Aholab

This paper describes an emotional speech database recorded for standard Basque. The database has been designed with the twofold purpose of being used for corpus-based synthesis and of allowing the study of prosodic models for the emotions. The database is thus large, in order to achieve good corpus-based synthesis quality, and contains the same texts recorded in the six basic emotions plus the neutral style. The recordings were carried out by two professional dubbing actors, a man and a woman. The paper explains the whole creation process, beginning with the design stage, continuing with the corpus creation and the recording phases, and finishing with some lessons learned and hints.

pdf bib
Collection, Encoding and Linguistic Processing of a Swedish Medical Corpus - The MEDLEX Experience
Dimitrios Kokkinakis

Corpora annotated with structural and linguistic characteristics play a major role in nearly every area of language processing. During recent years a number of corpora and large data sets have become known and available to research even in specialized fields such as medicine, though these still predominantly target the English language. This paper provides a description of the collection, encoding and linguistic processing of an ever-growing Swedish medical corpus, the MEDLEX Corpus. MEDLEX consists of a variety of text documents belonging to various medical text genres. The MEDLEX Corpus has been structurally annotated using the Corpus Encoding Standard for XML (XCES), lemmatized and automatically annotated with part-of-speech and semantic information (extended named entities and the Medical Subject Headings, MeSH, terminology). The results from the processing stages (part-of-speech, entities and terminology) have been merged into a single representation format and syntactically analysed using a cascaded finite-state parser. Finally, the parser's results are converted into a tree structure that follows the TIGER-XML coding scheme, resulting in a fairly large treebank of Swedish medical texts suitable for further exploration.

pdf bib
A new approach to syntactic annotation
Masaki Noguchi | Hiroshi Ichikawa | Taiichi Hashimoto | Takenobu Tokunaga

Many systems have been developed for creating syntactically annotated corpora. However, they mainly focus on interface usability and hardly pay attention to knowledge sharing among annotators in the task. In order to incorporate the functionality of knowledge sharing, we emphasize the importance of normalizing the annotation process. As a first step toward knowledge sharing, this paper proposes a method of system-initiative annotation in which the system suggests to annotators the order in which to resolve ambiguities. To be more concrete, the system forces annotators to resolve ambiguity of constituent structure in a top-down and depth-first manner, and then to resolve ambiguity of grammatical category in a bottom-up and breadth-first manner. We implemented the system on top of eBonsai, our annotation tool, and conducted experiments to compare eBonsai and the proposed system in terms of annotation accuracy and efficiency. We found that, at least for novice annotators, the proposed system is more efficient while keeping annotation accuracy comparable with eBonsai.

pdf bib
A French Non-Native Corpus for Automatic Speech Recognition
Tien-Ping Tan | Laurent Besacier

Automatic speech recognition (ASR) technology has achieved a level of maturity at which it is already practical for novice users. However, most non-native speakers are still not comfortable with services that include ASR systems, because of the reduced recognition accuracy for non-native speech. This paper describes our approach to constructing a non-native corpus, in particular for French, for testing and adapting automatic speech recognition to non-native speakers. Finally, we also propose in this paper a method for detecting pronunciation variants and possible pronunciation mistakes made by non-native speakers.

pdf bib
Documenting variation across Europe and the Mediterranean: the Pavia Typological Database
Andrea Sansò

This paper describes the Pavia Typological Database (PTD), a follow-up to the MED-TYP database (Sansò 2004). The PTD is an ever-growing repository of primary linguistic data (words, clauses, sentences) documenting a number of morphosyntactic phenomena in the languages of Europe (and including in some cases languages from the Mediterranean area). Its prospective users are typologists wanting to access primary, typologically uninterpreted (but glossed) data, but also anyone interested in linguistic variation on a continental scale. The paper discusses the background and motivation for the creation of the PTD, its present coverage, the techniques used to annotate the primary data, and the general architecture of the database.

pdf bib
Building and Incorporating Language Models for Persian Continuous Speech Recognition Systems
M. Bahrani | H. Sameti | N. Hafezi | H. Movasagh

In this paper, we describe building statistical language models for Persian using a corpus and incorporating them in a Persian continuous speech recognition (CSR) system. We used the Persian Text Corpus for building the language models. First, we preprocessed the texts of the corpus by normalizing the differing orthography of words. Also, the number of POS tags was decreased by clustering POS tags manually. Then we extracted word-based unigram and POS-based bigram and trigram language models from the corpus. We also present the procedure for incorporating the language models in a Persian CSR system. By using the language models, a 27.4% reduction in word error rate was achieved in the best case.
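As a rough illustration of the kind of counts such models start from, the sketch below collects word unigram and POS bigram/trigram statistics from a POS-tagged corpus; the data structures, sentence markers and toy example are our own illustration, not the authors' code.

from collections import Counter

def ngram_counts(tagged_sentences):
    """Collect word unigram and POS bigram/trigram counts from a
    POS-tagged corpus. tagged_sentences is a list of sentences,
    each a list of (word, pos) pairs."""
    word_unigrams = Counter()
    pos_bigrams = Counter()
    pos_trigrams = Counter()
    for sentence in tagged_sentences:
        words = [w for w, _ in sentence]
        tags = ["<s>"] + [t for _, t in sentence] + ["</s>"]
        word_unigrams.update(words)
        pos_bigrams.update(zip(tags, tags[1:]))
        pos_trigrams.update(zip(tags, tags[1:], tags[2:]))
    return word_unigrams, pos_bigrams, pos_trigrams

# toy example with invented words and tags
corpus = [[("ketab", "N"), ("ra", "POSTP"), ("khandam", "V")]]
print(ngram_counts(corpus))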

pdf bib
Blind Evaluation for Thai Search Engines
Shisanu Tongchim | Prapass Srichaivattana | Virach Sornlertlamvanich | Hitoshi Isahara

This paper compares the effectiveness of two different Thai search engines by using a blind evaluation. The probabilistic-based dictionary-less search engine is evaluated against the traditional word-based indexing method. The web documents from 12 Thai newspaper web sites, consisting of 83,453 documents, are used as the test collection. The relevance judgment is conducted on the first five returned results from each system. The evaluation process is completely blind: that is, the retrieved documents from both systems are shown to the judges without any information about the search techniques. Statistical testing shows that the dictionary-less approach is better than the word-based indexing approach in terms of the number of found documents and the number of relevant documents.

pdf bib
An Annotated Corpus Management Tool: ChaKi
Yuji Matsumoto | Masayuki Asahara | Kiyota Hashimoto | Yukio Tono | Akira Ohtani | Toshio Morita

Large-scale annotated corpora are very important not only in linguistic research but also in practical natural language processing tasks, since a number of practical tools such as part-of-speech (POS) taggers and syntactic parsers are now corpus-based or machine-learning-based systems which require some amount of accurately annotated corpora. This article presents an annotated corpus management tool that provides various functions, including flexible search, statistics calculation, and error correction for linguistically annotated corpora. The targets of annotation cover POS tags, base phrase chunks and syntactic dependency structures. This tool aims at helping the consistent construction of lexicons and annotated corpora to be used by researchers in both the linguistics and language processing communities.

pdf bib
Hierarchical Relationships “is-a”: Distinguishing Belonging, Inclusion and Part/of Relationships.
Christophe Jouis

In thesauri, conceptual structures or semantic networks, relationships are too often vague. For instance, in terminology, the relationships between concepts are often reduced to the distinction established by the standards (ISO 704, 1987) and (ISO 1087, 1990) between hierarchical relationships (genus-species relationships and part/whole relationships) and non-hierarchical relationships (“time, space, causal relationships, etc.”). The semantics of relationships are vague because the principal users of these relationships are industrial actors (translators of technical handbooks, terminologists, data-processing specialists, etc.). Nevertheless, the consistency of the models built must always be guaranteed. One possible approach to this problem consists in organizing the relationships in a typology based on logical properties. For instance, we typically use only the general relation “is-a”, which is too vague. We assume that the general relation “is-a” is characterized by asymmetry. This asymmetry is specified as: (1) the belonging of one individualizable entity to a distributive class, (2) inclusion among distributive classes, and (3) the part-of relation (or “composition”).

pdf bib
Morphological annotation of Korean with Directly Maintainable Resources
Ivan Berlocher | Hyun-gue Huh | Eric Laporte | Jee-sun Nam

This article describes an exclusively resource-based method of morphological annotation of written Korean text. Korean is an agglutinative language. Our annotator is designed to process text before the operation of a syntactic parser. In its present state, it annotates one-stem words only. The output is a graph of morphemes annotated with accurate linguistic information. The granularity of the tagset is 3 to 5 times higher than that of usual tagsets. A comparison with a reference annotated corpus showed that it achieves 89% recall without any corpus training. The language resources used by the system are lexicons of stems, transducers of suffixes and transducers for the generation of allomorphs. All can be easily updated, which allows users to control the evolution of the performance of the system. It has been claimed that morphological annotation of Korean text could only be performed by a morphological analysis module accessing a lexicon of morphemes. We show that it can also be performed directly with a lexicon of words and without applying morphological rules at annotation time, which speeds up annotation to 1,210 words. The lexicon of words is obtained from the maintainable language resources through a fully automated compilation process.

pdf bib
PrepNet: a Multilingual Lexical Description of Prepositions
Patrick Saint-Dizier

In this paper, we present the results of a preliminary investigation that aims at constructing a repository of preposition syntactic and semantic behaviors. A preliminary frame-based format for representing their prototypical behavior is then proposed together with related inferential patterns that describe functional or paradigmatic relations between preposition senses.

pdf bib
A Methodology for Developing Multilingual Resources for Terminology
Marie-Claude L’Homme | Hee Sook Bae

This paper presents a project that aims at building lexical resources for terminology. By lexical resources, we mean dictionaries that provide detailed lexico-semantic information on terms, i.e. lexical units the sense of which can be related to a special subject field. In terminology, there is a lack of such resources. The specific dictionaries we are currently developing describe basic French and Korean terms that belong to the fields of computer science and the Internet (e.g. computer, configure, user-friendly, Web, browse, spam). This paper presents the structure of the French and Korean articles: each component is examined and illustrated with examples. We then describe the corpus-based methodology and the different computer applications used for developing the articles. Our methodology comprises five steps: design of the corpora; selection of terms; sense distinction; definition of actantial structures; and listing of semantic relations. Details on the current state of each database are also given.

pdf bib
Does a Virtual Talking Face Generate Proper Multimodal Cues to Draw User’s Attention to Points of Interest?
Stephan Raidt | Gérard Bailly | Frederic Elisei

We present a series of experiments investigating face-to-face interaction between an Embodied Conversational Agent (ECA) and a human interlocutor. The ECA is embodied by a video-realistic talking head with independent head and eye movements. For a beneficial application in face-to-face interaction, the ECA should be able to derive meaning from the communicative gestures of a human interlocutor and, likewise, to reproduce such gestures. By conveying its capability to interpret human behaviour, the system encourages the interlocutor to show appropriate natural activity. It is therefore important that the ECA knows how to display what would correspond to mental states in humans. This makes it possible to interpret the machine processes of the system in terms of human expressiveness and to assign them a corresponding meaning. Thus the system may maintain an interaction based on human patterns. In a first experiment we investigated the ability of our talking head to direct user attention with facial deictic cues (Raidt, Bailly et al. 2005). Users interact with the ECA during a simple card game offering different levels of help and guidance through facial deictic cues. We analyzed the users’ performance and their perception of the quality of assistance given by the ECA. The experiment showed that users profit from its presence and its facial deictic cues. In the follow-up series of experiments presented here, we investigated the effect of enhancing the multimodality of the deictic gestures by adding a spoken instruction.

pdf bib
Skeleton Parsing in Chinese: Annotation Scheme and Guidelines
May Lai-Yin Wong

This paper presents my manual skeleton parsing on a sample text of approximately 100,000 word tokens (or about 2,500 sentences) taken from the PFR Chinese Corpus with a clearly defined parsing scheme of 17 constituent labels. The manually-parsed sample skeleton treebank is one of the very few extant Chinese treebanks. While Chinese part-of-speech tagging and word segmentation have been the subject of concerted research for many years, the syntactic annotation of Chinese corpora is a comparatively new field. The difficulties that I encountered in the production of this treebank demonstrate some of the peculiarities of Chinese syntax. A noteworthy syntactic property is that some serial verb constructions tend to be used as if they were compound verbs. The two transitive verbs in series, unlike common transitive verbs, do not take an object separately within the construction; rather, the serial construction as a whole is able to take the same direct object and the perfective aspect marker le. The skeleton-parsed sample treebank is evaluated against Eyes & Leech (1993)’s criteria and proves to be accurate, uniform and linguistically valid.

pdf bib
Computer-aided summarisation – what the user really wants
Constantin Orăsan | Laura Hasler

Computer-aided summarisation is a technology developed at the University of Wolverhampton as a complement to automatic summarisation, to produce high quality summaries with less effort. To achieve this, a user-friendly environment which incorporates several well-known summarisation methods has been developed. This paper presents the main features of the computer-aided summarisation environment and explains the changes introduced to it as a result of user feedback.

pdf bib
Annotating the Predicate-Argument Structure of Chinese Nominalizations
Nianwen Xue

This paper describes the Chinese NomBank Project, the goal of which is to annotate the predicate-argument structure of nominalized predicates in Chinese. The Chinese Nombank extends the general framework of the English and Chinese Proposition Banks to the annotation of nominalized predicates and adds a layer of semantic annotation to the Chinese Treebank. We first outline the scope of the work by discussing the markability of the nominalized predicates and their arguments. We then attempt to provide a categorization of the distribution of the arguments of nominalized predicates. We also discuss the relevance of the event/result distinction to the annotation of nominalized predicates and the phenomenon of incorporation. Finally we discuss some cross-linguistic differences between English and Chinese.

pdf bib
A Self-Referring Quantitative Evaluation of the ATR Basic Travel Expression Corpus (BTEC)
Kyo Kageura | Genichiro Kikui

In this paper we evaluate the Basic Travel Expression Corpus (BTEC), developed by ATR (Advanced Telecommunication Research Laboratory), Japan. BTEC was specifically developed as a wide-coverage, consistent corpus containing basic Japanese travel expressions with English counterparts, for the purpose of providing basic data for the development of high quality speech translation systems. To evaluate the corpus, we introduce a quantitative method for evaluating the sufficiency of qualitatively well-defined corpora, on the basis of LNRE methods that can estimate the potential growth patterns of various sparse data by fitting various skewed distributions such as the Zipfian group of distributions, lognormal distribution, and inverse Gauss-Poisson distribution to them. The analyses show the coverage of lexical items of BTEC vis-a-vis the possible targets implicitly defined by the corpus itself, and thus provides basic insights into strategies for enhancing BTEC in future.

pdf bib
Features of Terms in Actual Nursing Activities
Hiromi itoh Ozaku | Akinori Abe | Kaoru Sagara | Noriaki Kuwahara | Kiyoshi Kogure

In this paper, we analyze nurses' dialogue and conversation data sets after manual transcriptions and show their features. Recently, medical risk management has been recognized as very important for both hospitals and their patients. To carry out medical risk management, it is important to model nursing activities as well as to collect many accident and incident examples. Therefore, we are now researching strategies of modeling nursing activities in order to understand them (E-nightingale Project). To model nursing activities, it is necessary to collect data of nurses' activities in actual situations and to accurately understand these activities and situations. We developed a method to determine any type of nursing activity from voice data. However we found that our method could not determine several activities because it misunderstood special nursing terms. To improve the accuracy of this method, we focus on analyzing nurses' dialogue and conversation data and on collecting special nursing terms. We have already collected 800 hours of nurses' dialogue and conversation data sets in hospitals to find the tendencies and features of how nurses use special terms such as abbreviations and jargon as well as new terms. Consequently, in this paper we categorize nursing terms according to their usage and effectiveness. In addition, based on the results, we show a rough strategy for building nursing dictionaries.

pdf bib
HAREM: An Advanced NER Evaluation Contest for Portuguese
Diana Santos | Nuno Seco | Nuno Cardoso | Rui Vilela

In this paper we provide an overview of the first evaluation contest for named entity recognition in Portuguese, HAREM, which features several original traits and provided the first state of the art for the field in Portuguese, as well as a public-domain evaluation architecture.

pdf bib
Acceptance Testing of a Spoken Language Translation System
Rafael Banchs | Antonio Bonafonte | Javier Pérez

This paper describes an acceptance test procedure for evaluating a spoken language translation system between Catalan and Spanish. The procedure consists of two independent tests. The first test was an utterance-oriented evaluation for determining how the use of speech benefits communication. This test allowed for comparing the relative performance of the different system components, explicitly: source text to target text, source text to target speech, source speech to target text, and source speech to target speech. The second test was a task-oriented experiment for evaluating whether users could achieve some predefined goals for a given task with the current state of the technology. Eight subjects familiar with the technology and four subjects not familiar with the technology participated in the tests. From the results we can conclude that the state of the technology is getting closer to providing effective speech-to-speech translation systems, but there is still a lot of work to be done in this area. No significant differences in performance were evidenced between users who were familiar with the technology and users who were not. This constitutes, as far as we know, the first evaluation of a spoken translation system that considers performance at both the utterance level and the task level.

pdf bib
Creating Tools for Morphological Analysis of Sumerian
Valentin Tablan | Wim Peters | Diana Maynard | Hamish Cunningham

Sumerian is a long-extinct language documented throughout the ancient Middle East, arguably the first language for which we have written evidence, and is a language isolate (i.e. no related languages have so far been identified). The Electronic Text Corpus of Sumerian Literature (ETCSL), based at the University of Oxford, aims to make accessible on the web over 350 literary works composed during the late third and early second millennia BCE. The transliterations and translations can be searched, browsed and read online using the tools of the website. In this paper we describe the creation of linguistic analysis and corpus search tools for Sumerian, as part of the development of the ETCSL. This is designed to enable Sumerian scholars, students and interested laymen to analyse the texts online and electronically, and to further knowledge about the language.

pdf bib
FonDat1: A Speech Synthesis Corpus for Norwegian
Ingunn Amdal | Torbjørn Svendsen

This paper describes the Norwegian speech database FonDat1, designed for development and assessment of Norwegian unit selection speech synthesis. The quality of unit selection speech synthesis systems depends highly on the database used. The database should contain sufficient phonemic and prosodic coverage. High-quality unit selection synthesis also requires that the database is annotated with accurate information about the identity and position of the units. Traditionally this involves much manual work, either by hand-labeling the entire database or by correcting automatic annotations. We are working on methods for a complete automation of the annotation process. To validate these methods a realistic unit selection synthesis database is needed. In addition to serving as a testbed for annotation tools and synthesis experiments, the process of producing the database using automatic methods is in itself an important result. FonDat1 contains studio recordings of approximately 2000 sentences read by two professional speakers, one male and one female. 10% of the database is manually annotated.

pdf bib
A Computational Lexicon of Contemporary Hebrew
Alon Itai | Shuly Wintner | Shlomo Yona

Computational lexicons are among the most important resources for natural language processing (NLP). Their importance is even greater in languages with rich morphology, where the lexicon is expected to provide morphological analyzers with enough information to enable them to correctly process intricately inflected forms. We describe the Haifa Lexicon of Contemporary Hebrew, the broadest-coverage publicly available lexicon of Modern Hebrew, currently consisting of over 20,000 entries. While other lexical resources of Modern Hebrew have been developed in the past, this is the first publicly available large-scale lexicon of the language. In addition to supporting morphological processors (analyzers and generators), which was our primary objective, the lexicon is used as a research tool in Hebrew lexicography and lexical semantics. It is open for browsing on the web, and several search tools and interfaces were developed which facilitate on-line access to its information. The lexicon is currently used for a variety of NLP applications.

pdf bib
Development of the First LRs for Macedonian: Current Projects
Ruska Ivanovska-Naskova

This paper presents in brief several ongoing projects whose aim is to develop the first LRs for Macedonian, in particular the raw corpus compiled by Prof. George Mitrevski at Auburn University, the preparation for the compilation of a reference corpus of written Macedonian at the MASA (Macedonian Academy of Sciences and Arts), the first small annotated corpus of the Macedonian translation of Orwell’s “1984”, the electronic dictionary of simple words created by Aleksandar Petrovski for the Macedonian module within the corpus processing system Intex/Nooj, and the morphological dictionary developed by the LTRC (Language Technology Research Center). Further, we discuss the importance of the development of basic LRs for Macedonian as a means of preservation and a prerequisite for the creation of the first commercial language products for this Slavic language.

pdf bib
Example-Based Machine Translation Using a Dictionary of Word Pairs
Reinhard Rapp | Carlos Martin Vide

Machine translation systems, whether rule-based, example-based, or statistical, all rely on dictionaries that are in essence mappings between individual words of the source and the target language. Criteria for the disambiguation of ambiguous words and for differences in word order between the two languages are not accounted for in the lexicon. Instead, these important issues are dealt with in the translation engines. Because the engines tend to be compact and (even with data-oriented approaches) do not fully reflect the complexity of the problem, this approach generally does not account for the more fine-grained facets of word behavior. This leads to wrong generalizations and, as a consequence, translation quality tends to be poor. In this paper we suggest approaching this problem by using a new type of lexicon that is not based on individual words but on pairs of words. For each pair of consecutive words in the source language, the lexicon lists the possible translations in the target language together with information on the order and distance of the target words. The process of machine translation is then seen as a combinatorial problem: for all word pairs in a source sentence all possible translations are retrieved from the lexicon, and then those translations are discarded that lead to contradictions when constructing the target sentence. This process implicitly leads to word sense disambiguation and to language-specific reordering of words.
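To illustrate the word-pair lexicon idea in code, here is a small sketch; the lexicon entries, field layout and German-English example are invented for illustration and are not taken from the paper.

# Each pair of consecutive source words maps to candidate target
# translations annotated with the target words' relative order and
# distance.  All entries below are hypothetical.
PAIR_LEXICON = {
    ("das", "haus"): [
        # (target words, order preserved?, distance between them)
        (("the", "house"), True, 1),
    ],
    ("haus", "brennt"): [
        (("house", "burns"), True, 1),
    ],
}

def candidate_translations(source_tokens):
    """Look up every consecutive word pair and return the candidate
    target fragments; a full system would then discard combinations
    whose order/distance constraints contradict each other."""
    candidates = []
    for left, right in zip(source_tokens, source_tokens[1:]):
        candidates.append(PAIR_LEXICON.get((left, right), []))
    return candidates

print(candidate_translations(["das", "haus", "brennt"]))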

pdf bib
Terminological Resources Acquisition Tools: Toward a User-oriented Evaluation Model
Widad Mustafa El Hadi | Ismail Timimi | Marianne Dabbadie | Khalid Choukri | Olivier Hamon | Yun-Chuang Chiao

This paper describes the CESART project which deals with the evaluation of terminological resources acquisition tools. The objective of the project is to propose and validate an evaluation protocol allowing one to objectively evaluate and compare different systems for terminology application such as terminological resource creation and semantic relation extraction. The project also aims to create quality-controlled resources such as domain-specific corpora, automatic scoring tool, etc.

pdf bib
New tools for the encoding of lexical data extracted from corpus
Núria Bel | Sergio Espeja | Montserrat Marimon

This paper describes the methodology and tools that are the basis of our platform AAILE. AAILE has been built for supplying those working on the construction of lexicons for syntactic parsing with more efficient ways of visualizing and analyzing data extracted from corpora. The platform offers support using techniques such as similarity measures, clustering and pattern classification.

pdf bib
Progmatica: A Prosodic Database for European Portuguese
Daniela Braga | Luís Coelho | João P. Teixeira | Diamantino Freitas

In this work, a spontaneous speech corpus of broadcasted television material in European Portuguese (EP) is presented. We decided to name it ProGmatica as it is meant to combine prosody information under a pragmatic framework. Our purpose is to analyse, describe and predict the prosodic patterns that are involved in speech acts and discourse events. It is also our goal to relate both prosody and pragmatics to emotion, style and attitude. In future developments, we intend, by this way, to provide EP TTS systems with pragmatic and emotional dimensions. From the whole recorded material we selected, extracted and saved prototypical speech acts with the help of speech analysis tools. We have a multi-speaker corpus, where linguistic, paralinguistic and extra linguistic information are labelled and related to each other. The paper is organized as follows. In section one, a brief state-of-the-art for the available EP corpora containing prosodic information is presented. In section two, we explain the pragmatic criteria used to structure this database. Then, we describe how the speech signal was labelled and which information layers were considered. In section three, we propose a prosodic prediction model to be applied to each speech act in future. In section four, some of the main problems we went through are discussed and future work is presented.

pdf bib
Iqmt: A Framework for Automatic Machine Translation Evaluation
Jesús Giménez | Enrique Amigó

We present the IQMT Framework for Machine Translation Evaluation Inside QARLA. IQMT offers a common workbench in which existing evaluation metrics can be utilized and combined. It provides i) a measure to evaluate the quality of any set of similarity metrics (KING), ii) a measure to evaluate the quality of a translation using a set of similarity metrics (QUEEN), and iii) a measure to evaluate the reliability of a test set (JACK). The first release of the IQMT package is freely available for public use. Current version includes a set of 26 metrics from 7 different well-known metric families, and allows the user to supply its own metrics. For future releases, we are working on the design of new metrics that are able to capture linguistic aspects of translation beyond lexical ones.

pdf bib
Annotating Bridging Anaphors in Italian: in Search of Reliability
Tommaso Caselli | Irina Prodanof

The aim of this work is the presentation and preliminary evaluation of an XML annotation scheme for marking bridging anaphors of the form “definite article + N” in Italian. The scheme is based on a corpus-study. The data we collected from the evaluation experiment seem to support the reliability of the scheme, although some problems still remain open.

pdf bib
TC-STAR: New language resources for ASR and SLT purposes
Henk van den Heuvel | Khalid Choukri | Christian Gollan | Asuncion Moreno | Djamel Mostefa

In TC-STAR a variety of Language Resources (LRs) is being produced. In this contribution we address the resources that have been created for Automatic Speech Recognition and Spoken Language Translation. So far, these comprise 14 LRs in total: two training SLRs for ASR (English and Spanish), three development LRs and three evaluation LRs for ASR (English, Spanish, Mandarin), and three development LRs and three evaluation LRs for SLT (English-Spanish, Spanish-English, Mandarin-English). In this paper we describe the properties, validation, and availability of these resources.

pdf bib
Constructing A Chinese Chat Language Corpus with A Two-Stage Incremental Annotation Approach
Yunqing Xia | Kam-Fai Wong | Wenjie Li

Chat language refers to the special human language widely used in the community of digital network chat. As chat language has anomalous characteristics in forming words, phrases, and non-alphabetical characters, conventional natural language processing tools are ineffective in handling chat language text. Previous research shows that knowledge-based methods perform less effectively in processing unseen chat terms. This motivates us to construct a chat language corpus so that corpus-based techniques for chat language text processing can be developed and evaluated. However, creating the corpus merely by hand is difficult. First, this work is manpower-consuming. Second, annotation inconsistency is serious. To minimize manpower and annotation inconsistency, a two-stage incremental annotation approach is proposed in this paper for constructing a chat language corpus. Experiments conducted in this paper show that the performance of corpus annotation can be improved greatly with this approach.

pdf bib
Searching for Language Resources on the Web: User Behaviour in the Open Language Archives Community
Baden Hughes

While much effort is expended in the curation of language resources, such investment is largely irrelevant if users cannot locate resources of interest. The Open Language Archives Community (OLAC) was established to define standards for the description of language resources and provide core infrastructure for a virtual digital library, thus addressing the resource discovery issue. In this paper we consider naturalistic user search behaviour in the Open Language Archives Community. Specifically, we have collected the query logs from the OLAC Search Engine over a 2-year period, comprising in excess of 1.2 million queries in over 450K user search sessions. Subsequently we have mined these to discover user search patterns of various types, all pertaining to the discovery of language resources. A number of interesting observations can be made based on this analysis; in this paper we report on a range of properties and behaviours based on empirical evidence.

pdf bib
Statistical Analysis for Thesaurus Construction using an Encyclopedic Corpus
Yasunori Ohishi | Katunobu Itou | Kazuya Takeda | Atsushi Fujii

This paper proposes a method for discriminating hierarchical relations between word pairs. The method is a statistical one, using an “encyclopedic corpus” extracted and organized from Web pages. In the proposed method, we exploit the statistical tendency that hyponyms’ descriptions tend to include their hypernyms, whereas hypernyms’ descriptions do not include all of their hyponyms. Experimental results show that the method detected 61.7% of the relations in an actual thesaurus.
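The asymmetry the method relies on can be illustrated with a toy sketch like the following; the article texts and the single boolean test are invented simplifications of the statistical procedure described in the paper.

def is_hypernym_of(candidate_hyper, candidate_hypo, descriptions):
    """Rough sketch of the asymmetry described above: a hyponym's
    encyclopedic description tends to mention its hypernym, while the
    hypernym's description need not mention every hyponym.
    descriptions maps a headword to its article text.  A real system
    would aggregate counts over many articles rather than apply a
    single boolean test."""
    hypo_text = descriptions.get(candidate_hypo, "")
    hyper_text = descriptions.get(candidate_hyper, "")
    return candidate_hyper in hypo_text and candidate_hypo not in hyper_text

# toy "encyclopedic corpus" (invented entries)
articles = {
    "sparrow": "a small bird of the family Passeridae",
    "bird": "a warm-blooded egg-laying vertebrate with feathers",
}
print(is_hypernym_of("bird", "sparrow", articles))  # True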

pdf bib
BULB: A Unified Lexical Browser
Catherine Havasi | James Pustejovsky | Marc Verhagen

Natural language processing researchers currently have access to a wealth of information about words and word senses. This presents problems as well as resources, as it is often difficult to search through and coordinate lexical information across various data sources. We have approached this problem by creating a shared environment for various lexical resources. This browser, BULB (Brandeis Unified Lexical Browser) and its accompanying front-end provides the NLP researcher with a coordinated display from many of the available lexical resources, focusing, in particular, on a newly developed lexical database, the Brandeis Semantic Ontology (BSO). BULB is a module-based browser focusing on the interaction and display of modules from existing NLP tools. We discuss the BSO, PropBank, FrameNet, WordNet, and CQP, as well as other modules which will extend the system. We then outline future extensions to this work and present a release schedule for BULB.

pdf bib
Automatic Testing and Evaluation of Multilingual Language Technology Resources and Components
Ulrich Schäfer | Daniel Beck

We describe SProUTomat, a tool for daily building, testing and evaluating a complex general-purpose multilingual natural language text processor including its linguistic resources (lingware). Software and lingware are developed, maintained and extended in a distributed manner by multiple authors and projects, i.e., the source code stored in a version control system is modified frequently. The modular design of different, dedicated lingware modules like tokenizers, morphology, gazetteers, type hierarchy, rule formalism on the one hand increases flexibility and re-usability, but on the other hand may lead to fragility with respect to changes. Therefore, frequent testing as known from software engineering is necessary also for lingware to warrant a high level of quality and overall stability of the system. We describe the build, testing and evaluation methods for LT software and lingware we have developed on the basis of the open source, platform-independent Apache Ant tool and the configurable evaluation tool JTaCo.

pdf bib
Spoken Russian in the Russian National Corpus (RNC)
Elena Grishina

The RNC is now a 120-million-word collection of Russian texts and is thus the most representative and authoritative corpus of the Russian language. It is available on the Internet at www.ruscorpora.ru. The RNC contains texts of all genres and types, covering Russian from the 19th to the 21st century. The practice of constructing national corpora has shown that it is indispensable to include sub-corpora of spoken language in the RNC. Therefore, the constructors of the RNC intend to include in it about 10 million words of Spoken Russian. Oral speech in the Corpus is represented in standard Russian orthography. Although this decision makes phonetic exploration of the Spoken Russian Corpus impossible, studying Spoken Russian from any other linguistic point of view is fully possible. In addition to the traditional annotations (metatextual and morphological), the Spoken Sub-corpus carries sociological annotation. Unlike standard oral speech, which is spontaneous and not intended to be reproduced, Multimedia Spoken Russian (MSR) is to a great degree premeditated and evidently meant to be reproduced. MSR is also to be included in the RNC: first of all, we plan to build this very interesting and provocative part of the RNC from the textual component of about 300 Russian films.

pdf bib
Ontology-based Information Extraction with SOBA
Paul Buitelaar | Philipp Cimiano | Stefania Racioppa | Melanie Siegel

In this paper we describe SOBA, a sub-component of the SmartWeb multi-modal dialog system. SOBA is a component for ontology-based information extraction from soccer web pages for automatic population of a knowledge base that can be used for domain-specific question answering. SOBA realizes a tight connection between the ontology, knowledge base and the information extraction component. The originality of SOBA is in the fact that it extracts information from heterogeneous sources such as tabular structures, text and image captions in a semantically integrated way. In particular, it stores extracted information in a knowledge base, and in turn uses the knowledge base to interpret and link newly extracted information with respect to already existing entities.

pdf bib
Alexandria: A Powerful Multilingual Resource for Web
Dominique Dutoit

This paper deals with a new web interface for displaying linguistic data on the web. This new web interface is a general proposal for the web; its present name is Alexandria. Alexandria is a tool that can be downloaded free of charge, under certain conditions. Although the initial idea was hatched six or seven years ago, its technical realization has only been feasible for the past two years. When reading an HTML page, for instance http://www.memodata.com, double-click on any word at random and you will see a window open with a definition of the word, followed by a list of synonyms and expressions using the word. If the window is not in French, your browser is not set to French; in that case, use the menu to change the target language and choose French from among 22 languages.

pdf bib
BITT: A Corpus for Topic Tracking Evaluation on Multimodal Human-Robot-Interaction
Jan Frederik Maas | Britta Wrede

Our research is concerned with the development of robotic systems which can support people in household environments, for example by taking care of elderly people. A central goal of our research consists in creating robot systems which are able to learn and communicate about a given environment without the need for a specially trained user. For communication with such users it is necessary that the robot is able to communicate multimodally, which especially includes the ability to communicate in natural language. We believe that the ability to communicate naturally in multimodal communication must be supported by the ability to access contextual information, with topical knowledge being an important aspect of this knowledge. Therefore, we are currently developing a topic tracking system for situated human-robot communication on our robot systems. This paper describes the BITT (Bielefeld Topic Tracking) corpus, which we built in order to develop and evaluate our system. The corpus consists of human-robot communication sequences about a home-like environment, providing access to the information sources a multimodal topic tracking system requires.

pdf bib
Hand-crafted versus Machine-learned Inflectional Rules: The Euroling-SiteSeeker Stemmer and CST’s Lemmatiser
Hercules Dalianis | Bart Jongejan

The Euroling stemmer is developed for a commercial web site and intranet search engine called SiteSeeker. SiteSeeker is mainly used in the Swedish domain, but to some extent also for the English domain. CST’s lemmatiser comes from the Center for Language Technology, University of Copenhagen, and was originally developed as a research prototype to create lemmatisation rules from training data. In this paper we compare the performance of the stemmer, which uses handcrafted rules for Swedish, Danish and Norwegian as well as one stemmer for Greek, with CST’s lemmatiser, which uses training data to extract lemmatisation rules for Swedish, Danish, Norwegian and Greek. The performance of the two approaches is about the same, with around 10 percent errors. The handcrafted rule-based stemmer techniques are easy to get started with if the programmer has the proper linguistic knowledge. The machine-trained sets of lemmatisation rules are very easy to produce without linguistic knowledge, given that one has correct training data.

pdf bib
Improving coverage and parsing quality of a large-scale LFG for German
Christian Rohrer | Martin Forst

We describe experiments in parsing the German TIGER Treebank. In parsing the complete treebank, 86.44% of the sentences receive full parses; 13.56% receive fragment parses. We discuss the methods used to enhance coverage and parsing quality and we present an evaluation on a gold standard, to our knowledge the first one for a deep grammar of German. Considering the selection performed by our current version of a stochastic disambiguation component, we achieve an f-score of 84.2%, the upper and lower bounds being 87.4% and 82.3% respectively.

pdf bib
Non-probabilistic alignment of rare German and English nominal expressions
Bettina Schrader

We present an alignment strategy that specifically deals with the correct alignment of rare German nominal compounds to their English multiword translations. It recognizes compounds and multiwords based on their character lengths and on their most frequent POS patterns, and aligns them based on their length ratios. Our approach is designed on the basis of a data analysis of roughly 500 German hapax legomena, and as it does not use any frequency or co-occurrence information, it is well suited to aligning rare compounds, but it also achieves good results for more frequent expressions. Experimental results show that the strategy is able to identify correct translations for 70% of the compound hapaxes in our data set. Additionally, we checked 700 randomly chosen entries in the dictionary that was automatically generated by our alignment tool. The results of this experiment also indicate that our strategy works for non-hapaxes as well, including finding multiple correct translations for the same head compound.
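A length-ratio filter of the general kind described might look like the sketch below; the ratio bounds and the example pairs are invented for illustration and are not the thresholds used in the paper.

def plausible_alignment(german_compound, english_multiword,
                        min_ratio=0.6, max_ratio=1.6):
    """Sketch of a character-length-ratio test in the spirit of the
    approach described above: a German compound and a candidate
    English multiword translation are accepted if their character
    lengths are roughly proportional.  The bounds are hypothetical."""
    de_len = len(german_compound)
    en_len = sum(len(w) for w in english_multiword.split())
    ratio = de_len / en_len
    return min_ratio <= ratio <= max_ratio

print(plausible_alignment("Hausarzt", "family doctor"))  # True (8/12 chars)
print(plausible_alignment("Hausarzt", "of"))              # False (8/2 chars)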

pdf bib
Automatic extraction of subcategorization frames for French
Paula Chesley | Susanne Salmon-Alt

This paper describes the automatic extraction of French subcategorization frames from corpora. The subcategorization frames have been acquired via VISL, a dependency-based parser (Bick 2003), whose verb lexicon is currently incomplete with respect to subcategorization frames. Therefore, we have implemented binomial hypothesis testing as a post-parsing filtering step. On a test set of 104 frequent verbs we achieve lower bounds on type precision at 86.8% and on token recall at 54.3%. These results show that, contra (Korhonen et al. 2000), binomial hypothesis testing can be robust for determining subcategorization frames given corpus data. Additionally, we estimate that our extracted subcategorization frames account for 85.4% of all frames in French corpora. We conclude that using a language resource, such as the VISL parser, with a currently unevaluated (and potentially high) error rate can yield robust results in conjunction with probabilistic filtering of the resource output.
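For readers unfamiliar with binomial hypothesis testing as a post-parsing filter, the following is a rough sketch of the general technique (in the spirit of Brent-style frame filtering); the error rate and significance level are placeholders, not the settings used in the paper.

from scipy.stats import binom

def accept_frame(frame_count, verb_count, error_rate=0.05, alpha=0.05):
    """A frame observed frame_count times with a verb observed
    verb_count times is accepted if that many co-occurrences are
    unlikely to arise purely from parser noise occurring with
    probability error_rate.  error_rate and alpha are hypothetical."""
    # probability of observing frame_count or more noisy co-occurrences
    p_chance = binom.sf(frame_count - 1, verb_count, error_rate)
    return p_chance < alpha

print(accept_frame(frame_count=12, verb_count=100))  # True: unlikely to be noise
print(accept_frame(frame_count=2, verb_count=100))   # False: could well be noise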

pdf bib
Layered Speech-Act Annotation for Spoken Dialogue Corpus
Yuki Irie | Shigeki Matsubara | Nobuo Kawaguchi | Yukiko Yamaguchi | Yasuyoshi Inagaki

This paper describes the design of speech act tags for spoken dialogue corpora and its evaluation. Compared with the tags used for conventional corpus annotation, the proposed speech intention tags are specialized enough to determine system operations. However, a more detailed information description increases the number of tag types, which causes ambiguous tag selection. Therefore, we have designed an organization of tags, focusing attention on layered tagging and context-dependent tagging. Over 35,000 utterance units in the CIAIR corpus have been tagged by hand. To evaluate the reliability of the intention tags, a tagging experiment was conducted. The reliability of tagging is evaluated by comparing the tagging among several annotators using the kappa value. As a result, we confirmed that reliable data could be built. This corpus with speech intention tags could be widely used, from basic research to applications of spoken dialogue. In particular, it would play an important role from the viewpoint of the practical use of spoken dialogue corpora.

pdf bib
Visual Surveillance and Video Annotation and Description
Khurshid Ahmad | Craig Bennett | Tim Oliver

The effectiveness of CCTV surveillance networks is in part determined by their ability to perceive possible threats. The traditional means of determining a level of threat has been to manually observe a situation through the network and take action as appropriate. The increasing scale of such surveillance networks has, however, made such an approach untenable, leading us to look for a means by which processes may be automated. Here we investigate the language used by security experts in an attempt to look for patterns in the way in which they describe events as observed through a CCTV camera. It is suggested that natural-language-based descriptions of events may provide the basis for an index which may prove an important component of future automated surveillance systems.

pdf bib
Dictionary Building with the Jibiki Platform: the GDEF case
Mathieu Mangeot | Antoine Chalvin

This paper presents the use of the “Jibiki” generic online dictionary development platform in the case of the GDEF Estonian-French bilingual dictionary building project. This platform has been developed mainly by Mathieu Mangeot and Gilles Sérasset based on their research work in the domain. The platform is generic and thus can be used in (almost) any kind of dictionary development project, from simple monolingual lexicons to complex multilingual pivot dictionaries as well as terminological resources. The platform is available online, thus allowing entry writers to work and collaborate from any part of the world. It consists of two main modules and data management tools: one module for elaborating complex queries on the data and one module for editing entries online. The editing module automatically generates an interface from the XML structure of the entry.

pdf bib
A Syntactically Annotated Corpus of Japanese Spoken Monologue
Tomohiro Ohno | Shigeki Matsubara | Hideki Kashioka | Naoto Kato | Yasuyoshi Inagaki

Recently, monologue data such as lectures and commentary by professionals have been considered valuable intellectual resources and have been gathering attention. On the other hand, in order to use these monologue data effectively and efficiently, it is necessary for the monologue data not only to be accumulated but also to be structured. This paper describes the construction of a Japanese spoken monologue corpus in which a dependency structure is given to each utterance. Spontaneous monologue includes a lot of very long sentences composed of two or more clauses. In these sentences, a subject or adverb may be common to multiple clauses and may be considered to depend on multiple predicates. In order to represent such dependency information faithfully, our scheme allows a bunsetsu to depend on multiple bunsetsus.

pdf bib
SI-PRON: A Pronunciation Lexicon for Slovenian
Jerneja Žganec Gros | Varja Cvetko-Orešnik | Primož Jakopin | Aleš Mihelič

We present the efforts involved in designing SI-PRON, a comprehensive machine-readable pronunciation lexicon for Slovenian. It has been built from two sources and contains all the lemmas from the Dictionary of Standard Slovenian (SSKJ), the most frequent inflected word forms found in contemporary Slovenian texts, and a first pass of inflected word forms derived from SSKJ lemmas. The lexicon file contains the orthography, corresponding pronunciations, lemmas and morphosyntactic descriptors of lexical entries in a format based on requirements defined by the W3C Voice Browser Activity. The current version of the SI-PRON pronunciation lexicon contains over 1.4 million lexical entries. The word list determination procedure, the generation and validation of phonetic transcriptions, and the lexicon format are described in the paper. Along with Onomastica, SI-PRON presents a valuable language resource for linguistic studies and research of speech technologies for Slovenian. The lexicon is already being used by the AlpSynth Slovenian text-to-speech synthesis system and for generating audio samples of the SSKJ word list.

pdf bib
From PropBank to EngValLex: Adapting the PropBank-Lexicon to the Valency Theory of the Functional Generative Description
Silvie Cinková

EngValLex is the name of an FGD-compliant valency lexicon of English verbs, built from the PropBank-Lexicon and following the structure of Vallex, the FGD-based lexicon of Czech verbs. EngValLex is interlinked with the PropBank-Lexicon, thus preserving the original links between the PropBank-Lexicon and the PropBank-Corpus, and it is therefore also intended to be used in corpus annotation. This paper describes the automatic conversion of the PropBank-Lexicon into Pre-EngValLex, as well as the progress of its subsequent manual refinement (EngValLex). First, the PropBank arguments were automatically re-labeled with functors (semantic labels of FGD) and the PropBank rolesets were split into their respective example sentences, which became the FGD valency frames of Pre-EngValLex. Human annotators check and correct the labels and make the preliminary valency frames FGD-compliant. The most essential theoretical difference between the original and EngValLex lies in the syntactic alternations used by the PropBank-Lexicon, which are not yet employed within the Czech framework. The alternation-based approach substantially affects the conception of the frame, making it very different from the one applied within the FGD framework. Preserving the valuable alternation information required special linguistic rules for keeping, altering and re-merging the automatically generated preliminary valency frames.

pdf bib
Training Language Models without Appropriate Language Resources: Experiments with an AAC System for Disabled People
Tonio Wandmacher | Jean-Yves Antoine

Statistical language models (LMs) are highly dependent on their training resources. This not only makes it difficult to interpret evaluation results, it also degrades the performance of LM-based applications. This question has already been studied by others. Considering a specific domain (text prediction in a communication aid for handicapped people), we address the problem from a different point of view: the influence of the language register. Using corpora from five different registers, we discuss three methods to adapt a language model to the language actually being produced, ultimately reducing the effect of training dependency: (a) a simple cache model augmenting the probability of the n last inserted words; (b) a user dictionary keeping every unseen word; and (c) a combined LM interpolating a base model with a dynamically updated user model. Our evaluation is based on the results obtained from a text prediction system working on a trigram LM.
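
As an illustration of method (c), the sketch below interpolates a base model with a dynamically updated user model; the base_model object and its prob(word, history) method are assumptions for the example, not the authors' implementation.

from collections import Counter

class AdaptiveLM:
    def __init__(self, base_model, lam=0.9):
        self.base = base_model          # assumed to expose prob(word, history)
        self.lam = lam                  # interpolation weight for the base model
        self.cache = Counter()          # counts of words the user has inserted
        self.total = 0

    def update(self, word):
        """Call after every word the user inserts."""
        self.cache[word] += 1
        self.total += 1

    def prob(self, word, history):
        p_user = self.cache[word] / self.total if self.total else 0.0
        return self.lam * self.base.prob(word, history) + (1 - self.lam) * p_user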

pdf bib
SParseval: Evaluation Metrics for Parsing Speech
Brian Roark | Mary Harper | Eugene Charniak | Bonnie Dorr | Mark Johnson | Jeremy Kahn | Yang Liu | Mari Ostendorf | John Hale | Anna Krasnyanskaya | Matthew Lease | Izhak Shafran | Matthew Snover | Robin Stewart | Lisa Yung

While both spoken and written language processing stand to benefit from parsing, the standard Parseval metrics (Black et al., 1991) and their canonical implementation (Sekine and Collins, 1997) are only useful for text. The Parseval metrics are undefined when the words input to the parser do not match the words in the gold standard parse tree exactly, and word errors are unavoidable with automatic speech recognition (ASR) systems. To fill this gap, we have developed a publicly available tool for scoring parses that implements a variety of metrics which can handle mismatches in words and segmentations, including: alignment-based bracket evaluation, alignment-based dependency evaluation, and a dependency evaluation that does not require alignment. We describe the different metrics, how to use the tool, and the outcome of an extensive set of experiments on the sensitivity of these metrics.

pdf bib
Constructing a Named Entity Ontology from Web Corpora
Ming-Shun Lin | Hsin-Hsi Chen

This paper proposes a named entity (NE) ontology generation engine, called the XNE-Tree engine, which produces related named entities from a given seed. The engine incrementally extracts named entities that co-occur frequently with the seed by using a common search engine. In each iterative step, the seed is replaced by its siblings or descendants, which form new seeds. In this way, the XNE-Tree engine incrementally builds a tree structure with the original seed as its root. Two seeds, the Chinese transliterated names of Nicole Kidman (a famous actress) and Ernest Hemingway (a famous writer), were used to evaluate the performance of XNE-Tree. To test the applicability of the ontology, we apply it to a phoneme-to-character conversion system, which converts input phoneme syllable sequences into text strings. A total of 100 Chinese transliterated names, including 50 person names and 50 location names, are used as test data. We derive an ontology composed of 7,642 named entities. The results of phoneme-to-character conversion show that the recall rate and the MRR are improved from 0.79 and 0.50 to 0.84 and 0.55, respectively.
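
The MRR figure reported above is the mean reciprocal rank of the correct conversion among the ranked candidates. A hedged sketch of the measure, with hypothetical ranked candidate lists and gold answers:

def mean_reciprocal_rank(ranked_candidates, gold):
    """Average of 1/rank of the correct answer; 0 when the answer is missing."""
    total = 0.0
    for candidates, answer in zip(ranked_candidates, gold):
        rank = next((i + 1 for i, c in enumerate(candidates) if c == answer), None)
        total += 1.0 / rank if rank else 0.0
    return total / len(gold)

ranked = [["kidmen", "kidman"], ["hemingway"]]   # made-up candidate lists
gold = ["kidman", "hemingway"]
print(mean_reciprocal_rank(ranked, gold))        # 0.75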

pdf bib
Spelling Error Patterns in Spanish for Word Processing Applications
Flora Ramírez Bustamante | Enrique López Díaz

This paper reports findings from the elaboration of a typology of spelling errors for Spanish. It also discusses previous generalizations about spelling error patterns found in other studies and offers new insights on them. The typology is based on the analysis of around 76K misspellings found in real-life texts produced by humans. The main goal of the elaboration of the typology was to help in the implementation of a spell checker that detects context-independent misspellings in general unrestricted texts with the most common confusion pairs (i.e. error/correction pairs) to improve the set of ranked correction candidates for misspellings. We found that spelling errors are language dependent and are closely related to the orthographic rules of each language. The statistical data we provide on spelling error patterns in Spanish and their comparison with other data in other related works are the novel contribution of this paper. In this line, this paper shows that some of the general statements found in the literature about spelling error patterns apply mainly to English and cannot be extrapolated to other languages.

pdf bib
Developing a re-usable web-demonstrator for automatic anaphora resolution with support for manual editing of coreference chains
Anders Nøklestad | Øystein Reigem | Christer Johansson

Automatic markup and editing of anaphora and coreference are performed within one system. The processing is trained using memory-based learning, and the representations derive from various lexical resources. The current model reaches an expected combined precision and recall of F=62. Further improvement of the coreference detection is work in progress. Editing of coreference is separated into a module working on an XML file, so the editing mechanism can be reused in other projects. The editor is designed to store on the server a copy of all files that are edited over the Internet using our demonstrator. This might help us expand our database of texts annotated for anaphora and coreference. Further research includes creating high-coverage lexical resources and modules for other languages. The current system is trained on Norwegian bokmål, but we hope to extend this to other languages with available tools (e.g. POS taggers).

pdf bib
Collection of Simultaneous Interpreting Patterns by Using Bilingual Spoken Monologue Corpus
Hitomi Tohyama | Shigeki Matsubara

This paper provides an investigation of simultaneous interpreting patterns based on a manual quantitative analysis of a bilingual spoken monologue corpus. 4,578 pairs of English-Japanese aligned utterances from the CIAIR simultaneous interpretation database were used, making this the largest-scale observation of simultaneous interpreting speech to date. Simultaneous interpreters are required to generate the target speech simultaneously with the source speech and therefore use various strategies to raise simultaneity. In this investigation, simultaneous interpreting patterns with high frequency and high flexibility were extracted from the corpus. As a result, we collected 203 cases among the aligned utterances in which the interpreters' strategies for raising simultaneity were observed. These 203 cases could be categorized into 12 types of interpreting pattern, and 4.5 percent of the English-Japanese monologue data fit these interpreting patterns. The interpreting patterns can be expected to be used as interpreting rules for simultaneous machine interpretation.

pdf bib
Oriental COCOSDA: Past, Present and Future
Shuichi Itahashi | Chiu-yu Tseng | Satoshi Nakamura

The purpose of Oriental COCOSDA is to exchange ideas, share information and discuss regional matters on the creation, utilization and dissemination of spoken language corpora of oriental languages, as well as on assessment methods for speech recognition/synthesis systems, and to promote speech research on oriental languages. A series of International Workshops on East Asian Language Resources and Evaluation (EALREW), or Oriental COCOSDA Workshops, has been held annually since the preparatory meeting in 1997. Since then, workshops have been held every year in Japan, Taiwan, China, Korea, Thailand, Singapore, India and Indonesia. Oriental COCOSDA is managed by a convener, three advisory members, and 21 representatives from ten regions in Oriental countries. Although there are some domestic activities in Oriental countries, much more pan-Asian collaboration with research organizations and consortia is needed. We note that speech research has gradually become popular in Oriental countries and regions including Malaysia, Vietnam and the Xinjiang Uyghur Autonomous Region of China. We plan to hold future Oriental COCOSDA meetings in these places in order to promote speech research there.

pdf bib
Automated detection and annotation of term definitions in German text corpora
Angelika Storrer | Sandra Wellinghoff

We describe an approach to automatically detect and annotate definitions for technical terms in German text corpora. This approach focuses on verbs that typically appear in definitions (= definitor verbs). We specify search patterns based on the valency frames of these definitor verbs and use them (1) to detect and delimit text segments containing definitions and (2) to annotate their main functional components: the definiendum (the term that is defined) and the definiens (meaning postulates for this term). On the basis of these annotations we aim at automatically extracting WordNet-style semantic relations that hold between the head nouns of the definiendum and the head nouns of the definiens. In this paper, we describe our annotation scheme for definitions and report on two studies: (1) a pilot study that evaluates our definition extraction approach using a German corpus with manually annotated definitions as a gold standard, and (2) a feasibility study that evaluates the possibility of extracting hypernym, hyponym and holonym relations from these annotated definitions.
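
As a rough illustration of such definitor-verb patterns, the sketch below matches two hypothetical German definition patterns with regular expressions; the approach described in the paper operates on valency frames and richer linguistic annotation, so this is only an approximation.

import re

# hypothetical surface patterns; the real approach matches valency frames on annotated text
PATTERNS = [
    # "Unter X versteht man Y."  ->  definiendum X, definiens Y
    re.compile(r"Unter (?P<definiendum>.+?) versteht man (?P<definiens>.+?)\."),
    # "X bezeichnet Y."
    re.compile(r"(?P<definiendum>[A-ZÄÖÜ][\wäöüß-]+) bezeichnet (?P<definiens>.+?)\."),
]

def find_definitions(text):
    for pattern in PATTERNS:
        for m in pattern.finditer(text):
            yield m.group("definiendum"), m.group("definiens")

sample = "Unter Morphologie versteht man die Lehre von den Wortformen."
print(list(find_definitions(sample)))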

pdf bib
Building Annotated Written and Spoken Arabic LRs in NEMLAR Project
M. Yaseen | M. Attia | B. Maegaard | K. Choukri | N. Paulsson | S. Haamid | S. Krauwer | C. Bendahman | H. Fersøe | M. Rashwan | B. Haddad | C. Mukbel | A. Mouradi | A. Al-Kufaishi | M. Shahin | N. Chenfour | A. Ragheb

The NEMLAR project (Network for Euro-Mediterranean LAnguage Resource and human language technology development and support, www.nemlar.org) was a project supported by the EC with partners from Europe and Arabic countries, whose objective was to build a network of specialized partners to promote and support the development of Arabic Language Resources (LRs) in the Mediterranean region. The project focused on identifying the state of the art of LRs in the region, assessing priority requirements through consultations with language industry and communication players, and establishing a protocol for developing and identifying a Basic Language Resource Kit (BLARK) for Arabic. The BLARK is defined as the minimal set of language resources necessary for any pre-competitive research and education, in addition to the development of crucial components for any future NLP industry. Following the identification of high-priority resources, the NEMLAR partners agreed to focus on and produce three main resources: 1) an annotated Arabic written corpus of about 500K words, 2) an Arabic speech corpus for TTS applications of 2x5 hours, and 3) an Arabic broadcast news speech corpus of 40 hours of Modern Standard Arabic. For each of the resources, the underlying linguistic models and assumptions, technical specifications, methodologies for collection and construction, and validation and verification mechanisms were defined and applied.

pdf bib
Towards a Slovene Dependency Treebank
Sašo Džeroski | Tomaž Erjavec | Nina Ledinek | Petr Pajas | Zdenek Žabokrtsky | Andreja Žele

The paper presents the initial release of the Slovene Dependency Treebank, currently containing 2000 sentences or 30.000 words. Our approach to annotation is based on the Prague Dependency Treebank, which serves as an excellent model due to the similarity of the languages, the existence of a detailed annotation guide and an annotation editor. The initial treebank contains a portion of the MULTEXT-East parallel word-level annotated corpus, namely the first part of the Slovene translation of Orwell's “1984”. This corpus was first parsed automatically, to arrive at the initial analytic level dependency trees. These were then hand corrected using the tree editor TrEd; simultaneously, the Czech annotation manual was modified for Slovene. The current version is available in XML/TEI, as well as derived formats, and has been used in a comparative evaluation using the MALT parser, and as one of the languages present in the CoNLL-X shared task on dependency parsing. The paper also discusses further work, in the first instance the composition of the corpus to be annotated next.

pdf bib
A Conditional Random Field Framework for Thai Morphological Analysis
Canasai Kruengkrai | Virach Sornlertlamvanich | Hitoshi Isahara

This paper presents a framework for Thai morphological analysis based on the theoretical background of conditional random fields. We formulate morphological analysis of an unsegmented language as a sequential supervised learning problem. Given a sequence of characters, all possibilities of word/tag segmentation are generated, and then the optimal path is selected according to some criterion. We examine two different techniques, the Viterbi score and confidence estimation. Preliminary results are given to show the feasibility of our proposed framework.
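
The selection of an optimal path over segmentation candidates can be illustrated with a simple dynamic program; the score function below is a hypothetical stand-in for the model-derived scores used in the paper, and the lexicon and input are made up.

def best_segmentation(chars, lexicon, score, max_len=10):
    """Pick the highest-scoring word segmentation of a character string."""
    n = len(chars)
    best = [(float("-inf"), None)] * (n + 1)   # (best score up to position i, backpointer)
    best[0] = (0.0, None)
    for i in range(n):
        if best[i][0] == float("-inf"):
            continue
        for j in range(i + 1, min(n, i + max_len) + 1):
            word = chars[i:j]
            if word in lexicon:
                candidate = best[i][0] + score(word)
                if candidate > best[j][0]:
                    best[j] = (candidate, i)
    if n and best[n][1] is None:
        return None                            # no segmentation covers the whole input
    words, i = [], n
    while i > 0:                               # follow backpointers to recover the path
        j = best[i][1]
        words.append(chars[j:i])
        i = j
    return list(reversed(words))

lexicon = {"th", "thai", "ai", "text"}
print(best_segmentation("thaitext", lexicon, score=len))   # ['thai', 'text']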

pdf bib
A Corpus Search System Utilizing Lexical Dependency Structure
Yoshihide Kato | Shigeki Matsubara | Yasuyoshi Inagaki

This paper presents a corpus search system utilizing lexical dependency structure. The user's query consists of a sequence of keywords. For a given query, the system automatically generates dependency structure patterns that consist of the keywords in the query, and returns the sentences whose dependency structures match the generated patterns. The dependency structure patterns are generated by using two operations, combining and interpolation, which utilize the dependency structures in the searched corpus. These operations enable the system to generate only the dependency structure patterns that actually occur in the corpus. The system achieves simple and intuitive corpus search while being linguistically sophisticated enough to utilize structural information.

pdf bib
ANNEX - a web-based Framework for Exploiting Annotated Media Resources
Peter Berck | Albert Russel

Manual annotation of various media streams, time series data and also text sequences is still very time-consuming work that has to be carried out in many areas of linguistics and beyond. Based on many theoretical discussions and practical experiences, professional tools such as ELAN have been deployed to support researchers in this work. Most of these annotation tools operate on local computers. However, since more and more language resources are stored in web-accessible archives, researchers want to benefit from the new possibilities. ANNEX was developed to fill this gap: it allows web-based analysis of complex annotated media streams, i.e., users do not have to download resources or download and install programs. By simply using a normal web browser they can start their linguistic work. Due to the architecture of the Internet, ANNEX does not yet offer options for creating annotations, but this feature will come. Users also have to be aware that media streaming does not offer the same accuracy as playback on local computers.

pdf bib
The English-Slovene ACQUIS corpus
Tomaž Erjavec

The paper presents the SVEZ-IJS corpus, a large parallel annotated English-Slovene corpus containing translated legal texts of the European Union, the ACQUIS Communautaire. The corpus contains approx. 2 x 5 million words and was compiled from the translation memory obtained from the Translation Unit of the Slovene Government Office for European Affairs. The corpus is encoded in XML, according to the Text Encoding Initiative Guidelines TEI P4, where each translation memory unit contains useful metadata and the two aligned segments (sentences). Both the Slovene and English texts are linguistically annotated at the word level with context-disambiguated lemmas and morphosyntactic descriptions, which follow the MULTEXT guidelines. The complete corpus is freely available for research, either via an on-line concordancer or for downloading from the corpus home page at http://nl.ijs.si/svez/.

pdf bib
LAMUS: the Language Archive Management and Upload System
Daan Broeder | Andreas Claus | Freddy Offenga | Romuald Skiba | Paul Trilsbeek | Peter Wittenburg

LAMUS is a web-based service that allows researchers to deposit their language resources into a language resource archive. It was developed at the MPI for Psycholinguistics to allow stricter control of archive coherence and consistency and wider use of the archiving facilities without increasing the workload for archive and corpus managers. LAMUS is based on the IMDI metadata standard for language resources and offers metadata search and browsing over the archive.

pdf bib
Technologies for a Federation of Language Resource Archives
Daan Broeder | Freddy Offenga | Peter Wittenburg | Peter van der Kamp | David Nathan | Sven Strömqvist

The DAM-LR project aims at virtually integrating various European language resource archives to allow users to navigate and operate in a single unified domain of language resources. This type of integration introduces Grid technology to the humanities disciplines and forms a federation of archives. It is the basis for establishing a research infrastructure for language resources which will finally enable eHumanities. Currently, the complete architecture has been designed based on a few well-known components, and some components have already been tested. Based on the technological insights gathered and on discussions within the international DELAMAN network, the ethical and organizational basis for such a federation is defined.

pdf bib
An API for accessing the Data Category Registry
Marc Kemps-Snijders | Julien Ducret | Laurent Romary | Peter Wittenburg

Central ontologies are increasingly important for managing interoperability between different types of language resources. This was the reason for ISO to set up a new committee, ISO TC37/SC4, to take care of language resource management issues. Central to the work of this committee is the definition of a framework for a central registry of data categories that are important in the domain of language resources. This paper describes an application programming interface that was designed to request services from this Data Category Registry. The DCR is operational and the described API has already been tested from a lexicon application.

pdf bib
Foundations of Modern Language Resource Archives
Peter Wittenburg | Daan Broeder | Wolfgang Klein | Stephen Levinson | Laurent Romary

A number of serious reasons will convince an increasing number of researchers to store their relevant material in centers which we will call "language resource archives". These centers combine the duty of long-term preservation with the task of giving different user groups access to their material. Access here is meant in the sense that active interaction with the data is made possible, supporting the integration of new data, new versions or commentaries of all sorts. Modern language resource archives will have to adhere to a number of basic principles to fulfill all requirements, and they will have to be involved in federations to create joint language resource domains, making it even simpler for researchers to access the data. This paper attempts to formulate the essential pillars language resource archives have to adhere to.

pdf bib
Metadata Profile in the ISO Data Category Registry
Freddy Offenga | Daan Broeder | Peter Wittenburg | Julien Ducret | Laurent Romary

Metadata descriptions of language resources are becoming increasingly necessary, since the sheer amount of language resources is increasing rapidly and especially since we are now creating infrastructures to access these resources via the web through integrated domains of language resource archives. Yet the metadata frameworks offered for the domain of language resources (IMDI and OLAC), although mature, are not as widely accepted as necessary. The lack of confidence in the stability and persistence of the concepts and formats introduced by these metadata sets seems to be one argument for people not to invest the time needed for metadata creation. The introduction of these concepts into an ISO standardization process may convince contributors to make use of the terminology. The availability of the ISO Data Category Registry that includes a metadata profile will also offer researchers the opportunity to construct their own metadata set, tailored to the needs of the project at hand but nevertheless supporting interoperability.

pdf bib
LEXUS, a web-based tool for manipulating lexical resources
Marc Kemps-Snijders | Mark-Jan Nederhof | Peter Wittenburg

LEXUS provides a flexible framework for maintaining lexical structure and content. It is the first implementation of the Lexical Markup Framework model currently being developed at ISO TC37/SC4. Amongst its capabilities are the creation of lexicon structures, the manipulation of content and the use of typed relations. Integration of well-established Data Category Registries is supported to further promote interoperability by allowing access to well-established linguistic concepts. Advanced linguistic functionality is offered to assist users in cross-lexicon operations such as search, comparison and merging of lexica. To enable use by various user groups, the look and feel of each lexicon may be customized. In the near future more functionality will be added, including integration with other tools accessing lexical content.

pdf bib
Building Slovene WordNet
Tomaž Erjavec | Darja Fišer

A WordNet is a lexical database in which nouns, verbs, adjectives and adverbs are organized in a conceptual hierarchy, linking semantically and lexically related concepts. Such semantic lexicons have become one of the most valuable resources for a wide range of NLP research and applications, such as semantic tagging, automatic word-sense disambiguation, information retrieval and document summarisation. Following the WordNet design for the English language developed at Princeton, WordNets for a number of other languages have been developed in the past decade, taking the idea into the domain of multilingual processing. This paper reports on the prototype Slovene WordNet, which currently contains about 5,000 top-level concepts. The resource has been automatically translated from the Serbian WordNet with the help of a bilingual dictionary, with synset literals ranked according to the frequency of corpus occurrence, and the results manually corrected. The paper presents the results obtained, discusses some problems encountered along the way and points out some possibilities for automated acquisition and refinement of synsets in the future.
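
A hedged illustration of the translation step (not the project's actual pipeline): each synset literal is looked up in a bilingual dictionary and the candidate target-language literals are ranked by corpus frequency. The dictionary entries and frequencies below are made up.

def translate_synset(literals, bi_dict, corpus_freq):
    """Collect candidate translations for a synset and rank them by corpus frequency."""
    candidates = set()
    for literal in literals:
        candidates.update(bi_dict.get(literal, []))
    return sorted(candidates, key=lambda w: corpus_freq.get(w, 0), reverse=True)

# hypothetical Serbian-to-Slovene dictionary entries and target corpus frequencies
bi_dict = {"pas": ["pes", "kuža"]}
corpus_freq = {"pes": 812, "kuža": 57}
print(translate_synset(["pas"], bi_dict, corpus_freq))   # ['pes', 'kuža']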

pdf bib
Towards an Ontology for Art and Colours
Luciana Bordoni | Tiziana Mazzoli

To meet a variety of needs in information modeling, software development and integration, as well as knowledge management and reuse, various groups within industry, academia and government have been developing and deploying sharable and reusable models known as ontologies. Ontologies play an important role in knowledge representation. In this paper, we address the problem of capturing the knowledge needed for indexing and retrieving art resources. We describe a case study in which we attempt to construct an ontology for a subset of art. The aim of the ontology is to build an extensible repository of knowledge and information about artists, their works and the materials used in artistic creations. Influenced by the recent interest in colours and colouring materials, mainly shared by French researchers and linguists, an ontology prototype has been developed using Protégé. It allows one to organize and catalogue information about artists, art works, colouring materials and related colours.

pdf bib
Construction of a FrameNet Labeler for Swedish Text
Richard Johansson | Pierre Nugues

We describe the implementation of a FrameNet-based semantic role labeling system for Swedish text. To train the system, we used a semantically annotated corpus that was produced by projection across parallel corpora. As part of the system, we developed two frame element bracketing algorithms that are suitable when no robust constituent parsers are available. Apart from being the first such system for Swedish, this is, as far as we are aware, the first semantic role labeling system for a language for which no role-semantic annotated corpora are available. The estimated accuracy of classification of pre-segmented frame elements is 0.75, and the precision and recall measures for the complete task are 0.67 and 0.47, respectively.

pdf bib
ELAN: a Professional Framework for Multimodality Research
Peter Wittenburg | Hennie Brugman | Albert Russel | Alex Klassmann | Han Sloetjes

Utilization of computer tools in linguistic research has gained importance with the maturation of media frameworks for the handling of digital audio and video. The increased use of these tools in gesture, sign language and multimodal interaction studies has led to stronger requirements on the flexibility, the efficiency and in particular the time accuracy of annotation tools. This paper describes the efforts made to make ELAN a tool that meets these requirements, with special attention to developments in the area of time accuracy. Subsequent sections give an overview of other enhancements in the latest versions of ELAN that make it a useful tool for multimodality research.

pdf bib
Ontology-based Language Archive Utilization
Peter Berck | Hans-Jörg Bibiko | Marc Kemps-Snijders | Albert Russel | Peter Wittenburg

At the MPI for Psycholinguistics a large archive of language resources has been created with contributions from many different individual researchers and research projects. All of these resources, in particular annotated media streams and multimedia lexica, are accessible via the web and can be utilized with the help of web-based utilization frameworks. The archive therefore lends itself to motivating users to operate across the boundaries of single corpora and to supporting cross-language work. This, however, can only be done when the problems of interoperability, in particular at the level of linguistic encoding, are solved in an efficient way. Two Max Planck Institutes are cooperating to build a framework that allows users to easily create their own practical ontologies and, if desired, to relate their concepts to central ontologies.

pdf bib
MaltParser: A Data-Driven Parser-Generator for Dependency Parsing
Joakim Nivre | Johan Hall | Jens Nilsson

We introduce MaltParser, a data-driven parser generator for dependency parsing. Given a treebank in dependency format, MaltParser can be used to induce a parser for the language of the treebank. MaltParser supports several parsing algorithms and learning algorithms, and allows user-defined feature models, consisting of arbitrary combinations of lexical features, part-of-speech features and dependency features. MaltParser is freely available for research and educational purposes and has been evaluated empirically on Swedish, English, Czech, Danish and Bulgarian.

pdf bib
SINOD - Slovenian non-native speech database
Andrej Žgank | Darinka Verdonik | Aleksandra Zögling Markuš | Zdravko Kačič

This paper presents the SINOD database, the first Slovenian non-native speech database. It will be used to improve the performance of a large vocabulary continuous speech recogniser for non-native speakers, with the main impact expected for the acoustic models and the recogniser's vocabulary. The SINOD database is designed as a supplement to the Slovenian BNSI Broadcast News database, and the same BN recommendations were used for both databases. Two interviews with non-native Slovenian speakers were incorporated in the set; both non-native speakers were female, whereas the journalist was a native Slovenian male speaker. The transcription approach applied in the production phase is presented, and various statistics and analyses of the database are given in the paper.

pdf bib
Conversion of WordNet to a standard RDF/OWL representation
Mark van Assem | Aldo Gangemi | Guus Schreiber

This paper presents an overview of the work in progress at the W3C to produce a conversion of WordNet to the RDF/OWL representation language in use in the Semantic Web community. Such a standard representation is useful to provide application developers a high-quality resource and to promote interoperability. Important requirements in this conversion process are that it should be complete and should stay close to WordNet's conceptual model. The paper explains the steps taken to produce the conversion and details design decisions such as the composition of the class hierarchy and properties, the addition of suitable OWL semantics and the chosen format of the URIs. Additional topics include a strategy to incorporate OWL and RDFS semantics in one schema such that both RDF(S) infrastructure and OWL infrastructure can interpret the information correctly, problems encountered in understanding the Prolog source files and the description of the two versions that are provided (Basic and Full) to accommodate different usages of WordNet.

pdf bib
Transferring PoS-tagging and lemmatization tools from spoken to written Dutch corpus development
Antal van den Bosch | Ineke Schuurman | Vincent Vandeghinste

We describe a case study in the reuse and transfer of tools in language resource development, from a corpus of spoken Dutch to a corpus of written Dutch. Once tools for a particular language have been developed, it is logical, but not trivial, to reuse them for other types or registers of the language than those the tools were originally designed for. This paper reviews the decisions and adaptations necessary to make this particular transfer from spoken to written language, focusing on a part-of-speech tagger and a lemmatizer. While the lemmatizer can be transferred fairly straightforwardly, the tagger needs to be adapted considerably. We show how it can be adapted without starting from scratch. We describe how the part-of-speech tagset was adapted and how the tagger was retrained to deal with written-text phenomena it had not been trained on earlier.

pdf bib
Edit Distance: A Metric for Machine Translation Evaluation
Mark Przybocki | Gregory Sanders | Audrey Le

NIST has coordinated machine translation (MT) evaluations for several years using an automatic and repeatable evaluation measure. Under the Global Autonomous Language Exploitation (GALE) program, NIST is tasked with implementing an edit-distance-based evaluation of MT. Here “edit distance” is defined as the number of modifications a human editor is required to make to a system translation such that the resulting edited translation conveys, in easily understandable English, the complete meaning of a single high-quality human reference translation. In preparation for this change in evaluation paradigm, NIST conducted two proof-of-concept exercises specifically designed to probe the data space, to answer questions related to editor agreement, and to establish protocols for the formal GALE evaluations. We report here our experimental design, the data used, and our findings from these exercises.
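
For reference, the underlying word-level Levenshtein edit distance (a textbook sketch, not NIST's GALE tooling) counts the minimum number of insertions, deletions and substitutions needed to turn one word sequence into another:

def edit_distance(hyp, ref):
    """Minimum number of word insertions, deletions and substitutions."""
    m, n = len(hyp), len(ref)
    d = [[0] * (n + 1) for _ in range(m + 1)]
    for i in range(m + 1):
        d[i][0] = i
    for j in range(n + 1):
        d[0][j] = j
    for i in range(1, m + 1):
        for j in range(1, n + 1):
            cost = 0 if hyp[i - 1] == ref[j - 1] else 1
            d[i][j] = min(d[i - 1][j] + 1,         # deletion
                          d[i][j - 1] + 1,         # insertion
                          d[i - 1][j - 1] + cost)  # substitution or match
    return d[m][n]

print(edit_distance("the cat sat".split(), "a cat sat down".split()))   # 2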

pdf bib
H. C. Andersen Conversation Corpus
Niels Ole Bernsen | Laila Dybkjær | Svend Kiilerich

This paper describes the design, collection and current status of the Hans Christian Andersen (HCA) conversation corpus. The corpus consists of five separate corpora and represents transcription and annotation of some 57 hours of English spoken and deictic gesture user-system interaction recorded mainly with children 2002-2005. The corpora were collected as part of the development and evaluation process of two consecutive research prototypes. The set-up used to collect each corpus is described as well as our use of each corpus in system development. We describe the annotation of each corpus and briefly present various uses we have made of the corpora so far. The HCA corpus was made publicly available at http://www.niceproject.com/data/ in March 2006.

pdf bib
Towards pertinent evaluation methodologies for word-space models
Magnus Sahlgren

This paper discusses evaluation methodologies for a particular kind of meaning models known as word-space models, which use distributional information to assemble geometric representations of meaning similarities. Word-space models have received considerable attention in recent years, and have begun to see employment outside the walls of computational linguistics laboratories. However, the evaluation methodologies of such models remain infantile, and lack efforts at standardization. Very few studies have critically assessed the methodologies used to evaluate word spaces. This paper attempts to fill some of this void. It is the central goal of this paper to answer the question “how can we determine whether a given word space is a good word space?”

pdf bib
A Model for Context-Based Evaluation of Language Processing Systems and its Application to Machine Translation Evaluation
Andrei Popescu-Belis | Paula Estrella | Margaret King | Nancy Underwood

In this paper, we propose a formal framework that takes into account the influence of the intended context of use of an NLP system on the procedure and the metrics used to evaluate the system. We introduce in particular the notion of a context-dependent quality model and explain how it can be adapted to a given context of use. More specifically, we define vector-space representations of contexts of use and of quality models, which are connected by a generic contextual quality model (GCQM). For each domain, experts in evaluation are needed to build a GCQM based on analytic knowledge and on previous evaluations, using the mechanism proposed here. The main source of inspiration for this work is the FEMTI framework for the evaluation of machine translation, which partly implements the present model and which is described briefly along with insights from other domains.

pdf bib
The importance of precise tokenizing for deep grammars
Martin Forst | Ronald M. Kaplan

We present a non-deterministic finite-state transducer that acts as a tokenizer and normalizer for free text that is input to a broad-coverage LFG grammar of German. We compare the basic tokenizer used in an earlier version of the grammar with the more sophisticated tokenizer that we now use. The revised tokenizer increases the coverage of the grammar in terms of full parses from 68.3% to 73.4% on sentences 8,001 through 10,000 of the TiGer Corpus.

pdf bib
Tagging a Corpus of Interpreted Speeches: the European Parliament Interpreting Corpus (EPIC)
Annalisa Sandrelli | Claudio Bendazzoli

The performance of three different taggers (Treetagger, Freeling and GRAMPAL) is evaluated on three different languages, i.e. English, Italian and Spanish. The materials are transcripts from the European Parliament Interpreting Corpus (EPIC), a corpus of original (source) and simultaneously interpreted (target) speeches. Owing to the oral nature of our materials and to the specific characteristics of spoken language produced in simultaneous interpreting, the chosen taggers have to deal with non-standard word order, disfluencies and other features not to be found in written language. Parts of the tagged sub-corpora were automatically extracted in order to assess the success rate achieved in tagging and lemmatisation. Errors and problems are discussed for each tagger, and conclusions are drawn regarding future developments.

pdf bib
RefRef: A Tool for Viewing and Exploring Coreference Space
Hisami Suzuki | Gary Kacmarcik

We present RefRef, a tool for viewing and exploring coreference space, which is publicly available for research purposes. Unlike similar tools currently available, whose main goal is to assist the annotation process of coreference links, RefRef is dedicated to viewing and exploring coreference-annotated data, whether manually tagged or automatically resolved. RefRef is also highly customizable, as the tool is made available with its source code. In this paper we describe the main functionalities of RefRef as well as some possibilities for customization to meet the specific needs of users of such coreference-annotated text.

pdf bib
Human and machine recognition as a function of SNR
Bernt Andrassy | Harald Hoege

In-car automatic speech recognition (ASR) is usually evaluated with a single word error rate (WER) that does not reveal the recogniser's behaviour for different levels of noise. Yet this behaviour is interesting for car manufacturers in order to predict system performance for different speeds and different car models, and thus to design speech-based applications in a better way. It therefore makes sense to split the single WER into SNR-dependent WERs, where SNR stands for the signal-to-noise ratio, an appropriate measure of the noise level. In this paper an SNR measure based on the concept of the Articulation Index is developed, which allows a direct comparison with human recognition performance.
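
For orientation, the basic signal-to-noise ratio in decibels is computed as sketched below; the Articulation-Index-based SNR measure developed in the paper is more elaborate, and the power values here are made up.

import math

def snr_db(signal_power, noise_power):
    """Signal-to-noise ratio in decibels."""
    return 10.0 * math.log10(signal_power / noise_power)

print(round(snr_db(2.0, 0.5), 1))   # 6.0 dB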

pdf bib
Building a Lexical Database for an Interactive Joke-Generator
R. Manurung | D. O’Mara | H. Pain | G. Ritchie | A. Waller

As part of a project to construct an interactive program which will encourage children to play with language by building jokes, we have developed a large lexical database, closely based on WordNet. As well as the standard WordNet information about part of speech, synonymy, hyponymy, etc, we have added phonetic representations and symbolic links allowing attachment of pictures. All information is represented in a relational database, allowing powerful searches using SQL via a Java API. The lexicon has a facility to label subsets of the lexicon with symbolic names, and we are working to incorporate some educationally relevant word lists as sublexicons. This should also allow us to improve the familiarity ratings which the lexicon assigns to words.

pdf bib
Dealing with unknown words by simple decomposition: feasibility studies with Italian prefixes.
Bruno Cartoni

In this article, we present an experiment that aims to evaluate the feasibility of superficial morphological analysis for analysing unknown constructed neologisms. For any morphosyntactic analyser, lexical incompleteness is a real problem. This lack of information is partly due to lexical creativity, and more especially to the productivity of some morphological processes. We present here a set of word formation rules based on constructional morphology principles that can be used to improve the performance of an Italian morphosyntactic analyser. These rules use only simple computing techniques in order to ensure efficiency, because any improvement in coverage must not slow down the entire system. In the second part of this paper, we describe a method for constraining the rules, and an evaluation of these constraints in terms of performance. Great improvements are achieved in reducing the number of incorrect analyses of unknown neologisms (“noise”), although at the cost of some increase in “silence” (correct analyses which are no longer produced). This classic trade-off between “noise” and “silence” can hardly be avoided, and we believe that this experiment successfully demonstrates the feasibility of superficial analysis in improving performance and points the way to other avenues of research.
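
A minimal sketch of what such superficial decomposition can look like, assuming a small illustrative prefix list and lexicon; the paper's actual rules add constraints to limit noisy analyses.

PREFIXES = ["anti", "ri", "super"]          # illustrative Italian prefixes only

def decompose(word, lexicon):
    """Return (prefix, base) if stripping a known prefix leaves a known word."""
    for prefix in PREFIXES:
        if word.startswith(prefix) and word[len(prefix):] in lexicon:
            return prefix, word[len(prefix):]
    return None

print(decompose("rileggere", {"leggere"}))   # ('ri', 'leggere')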

pdf bib
A Uniform Interface to Large-Scale Linguistic Resources
Serge Sharoff

In the paper we address two practical problems concerning the use of corpora in translation studies. The first stems from the limited resources available for targeted languages and for genres within languages, whereas translation researchers and students need sufficiently large modern corpora, either reflecting general language or specific to a problem domain. The second problem concerns the lack of a uniform interface for accessing the resources, even when they exist. We deal with the first problem by developing a framework for semi-automatic acquisition of large corpora from the Internet for the languages relevant for our research and training needs; we outline the methodology used and discuss the composition of Internet-derived corpora. We deal with the second problem by developing a uniform interface to our corpora. In addition to standard options for choosing corpora and sorting concordance lines, the interface can compute lists of collocations and filter the results according to user-specified patterns in order to detect language-specific syntactic structures.

pdf bib
The Affective Weight of Lexicon
Carlo Strapparava | Alessandro Valitutti | Oliviero Stock

This paper presents resources and functionalities for the recognition and selection of affective evaluative terms. An affective hierarchy was developed first, as an extension of the WordNet-Affect lexical database. The second phase was the development of a semantic similarity function, acquired automatically in an unsupervised way from a large corpus of texts, which allows us to relate concepts and emotional categories. The integration of the two components is a key element for several applications.
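
A hedged sketch of how a corpus-derived similarity function can relate a term to an emotional category, here simply via cosine similarity over made-up vectors; the paper's actual function is acquired from a large corpus rather than hand-specified.

import math

def cosine(u, v):
    """Cosine similarity between two vectors."""
    dot = sum(a * b for a, b in zip(u, v))
    norm = math.sqrt(sum(a * a for a in u)) * math.sqrt(sum(b * b for b in v))
    return dot / norm if norm else 0.0

term_vector = [0.2, 0.7, 0.1]    # hypothetical corpus-derived vector for a term
joy_vector = [0.3, 0.6, 0.0]     # hypothetical vector for the emotional category "joy"
print(round(cosine(term_vector, joy_vector), 2))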

pdf bib
ROTE: A Tool to Support Users in Defining the Relative Importance of Quality Characteristics
Agnes Lisowska | Nancy L. Underwood

This paper describes the Relative Ordering Tool for Evaluation (ROTE) which is designed to support the process of building a parameterised quality model for evaluation. It is a very simple tool which enables users to specify the relative importance of quality characteristics (and associated metrics) to reflect the users' particular requirements. The tool allows users to order any number of quality characteristics by comparing them in a pair-wise fashion. The tool was developed in the context of a collaborative project developing a text mining system. A full scale evaluation of the text mining system was designed and executed for three different users and the ROTE tool was successfully applied by those users during that process. The tool will be made available for general use by the evaluation community.

pdf bib
Using collocations from comparable corpora to find translation equivalents
Serge Sharoff | Bogdan Babych | Anthony Hartley

In this paper we present a tool for finding appropriate translation equivalents for words from the general lexicon using comparable corpora. For a phrase in the source language the tool suggests a range of possible expressions used in similar contexts in target language corpora. In the paper we discuss the method and present results of human evaluation of the performance of the tool.

pdf bib
A Unified Structure for Dutch Dialect Dictionary Data
Folkert de Vriend | Lou Boves | Henk van den Heuvel | Roeland van Hout | Joep Kruijsen | Jos Swanenberg

The traditional dialect vocabulary of the Netherlands and Flanders is recorded and researched in several Dutch and Belgian research institutes and universities. Most of these distributed dictionary creation and research projects collaborate in the “Permanent Overlegorgaan Regionale Woordenboeken” (ReWo). In the project “digital databases and digital tools for WBD and WLD” (D-square) the dialect data published by two of these dictionary projects (Woordenboek van de Brabantse Dialecten and Woordenboek van de Limburgse Dialecten) is being digitised. One of the additional goals of the D-square project is the development of an infrastructure for electronic access to all dialect dictionaries collaborating in the ReWo. In this paper we will firstly reconsider the nature of the core data types - form, sense and location - present in the different dialect dictionaries and the ways these data types are further classified. Next we will focus on the problems encountered when trying to unify this dictionary data and their classifications and suggest solutions. Finally we will look at several implementation issues regarding a specific encoding for the dictionaries.

pdf bib
A Part-of-speech tagger for Irish using Finite-State Morphology and Constraint Grammar Disambiguation
E. Uí Dhonnchadha | J. Van Genabith

This paper describes the methodology used to develop a part-of-speech tagger for Irish, which is used to annotate a corpus of 30 million words of text with part-of-speech tags and lemmas. The tagger is evaluated using a manually disambiguated test corpus and it currently achieves 95% accuracy on unrestricted text. To our knowledge, this is the first part-of-speech tagger for Irish.

pdf bib
Language Challenges for Data Fusion in Question-Answering
Véronique Moriceau

Search engines on the web and most existing question-answering systems provide the user with a set of hyperlinks and/or web page extracts containing answer(s) to a question. These answers are often incoherent to a certain degree (equivalent, contradictory, etc.). It is then quite difficult for the user to know which answer is the correct one. In this paper, we present an approach which aims at providing synthetic numerical answers in a question-answering system. These answers are generated in natural language and, in a cooperative perspective, the aim is to explain to the user the variation of numerical values when several values, apparently incoherent, are extracted from the web as possible answers to a question. We present in particular how lexical resources are essential to answer extraction from the web, to the characterization of the variation mode associated with the type of information and to answer generation in natural language.

pdf bib
BACO - A large database of text and co-occurrences
Luís Sarmento

In this paper we introduce a public resource named BACO (Base de Co-Ocorrências), a very large textual database built from the WPT03 collection, a publicly available crawl of the whole Portuguese web in 2003. BACO uses a generic relational database engine to store 1.5 million web documents in raw text (more than 6GB of plain text), corresponding to 35 million sentences, consisting of more than 1000 million words. BACO comprises four lexicon tables, including a standard single token lexicon, and three n-gram tables (2-grams, 3-grams and 4-grams) with several hundred million entries, and a table containing 780 million co-occurrence pairs. We describe the design choices and explain the preparation tasks involved in loading the data in the relational database. We present several statistics regarding storage requirements and we demonstrate how this resource is currently used.

pdf bib
OntoNERdIE – Mapping and Linking Ontologies to Named Entity Recognition and Information Extraction Resources
Ulrich Schäfer

We describe an implemented offline procedure that maps OWL/RDF-encoded ontologies with large, dynamically maintained instance data to named entity recognition (NER) and information extraction (IE) engine resources, preserving hierarchical concept information and links back to the ontology concepts and instances. The main motivations are (i) improving NER/IE precision and recall in closed domains, (ii) exploiting linguistic knowledge (context, inflection, anaphora) for identifying ontology instances in texts more robustly, (iii) giving full access to ontology instances and concepts in natural language processing results, e.g. for subsequent ontology queries, navigation or inference, and (iv) avoiding duplication of work in the development and maintenance of similar resources in independent places, namely lingware and ontologies. We show an application in hybrid deep-shallow natural language processing that is used, for example, for question analysis in closed domains. Further applications could be automatic hyperlinking or other innovative Semantic Web related applications.

pdf bib
Multiple Dimension Levenshtein Edit Distance Calculations for Evaluating Automatic Speech Recognition Systems During Simultaneous Speech
Jonathan G. Fiscus | Jerome Ajot | Nicolas Radde | Christophe Laprun

Since 1987, the National Institute of Standards and Technology has been providing evaluation infrastructure for the Automatic Speech Recognition (ASR), more recently referred to as Speech-To-Text (STT), research community. From the first efforts in the Resource Management domain to the present research, the NIST SCoring ToolKit (SCTK) has formed the tool set for system developers to make continued progress in many domains: Wall Street Journal, Conversational Telephone Speech (CTS), Broadcast News (BN), and Meetings (MTG), to name a few. For these domains, the community agreed to declare sections of simultaneous speech “not scoreable”. While this had minor impact on most of these domains, the highly interactive nature of Meeting speech rendered a very large fraction of the test material not scoreable. This paper documents a multi-dimensional extension of the dynamic programming solution to Levenshtein edit distance calculations capable of evaluating STT systems during periods of overlapping, simultaneous speech.

pdf bib
FreeLing 1.3: Syntactic and semantic services in an open-source NLP library
J. Atserias | B. Casas | E. Comelles | M. González | L. Padró | M. Padró

This paper describes version 1.3 of the FreeLing suite of NLP tools. FreeLing was first released in February 2004, providing morphological analysis and PoS tagging for Catalan, Spanish, and English. Since then, the package has been improved and enlarged to cover more languages (Italian and Galician) and offer more services: named entity recognition and classification, chunking, dependency parsing, and WordNet-based semantic annotation. FreeLing is not conceived as an end-user oriented tool, but as a library on top of which powerful NLP applications can be developed. Nevertheless, sample interface programs are provided which can be straightforwardly used as fast, flexible, and efficient corpus processing tools. A remarkable feature of FreeLing is that it is distributed under a free-software LGPL license, thus enabling any developer to adapt the package to his needs in order to get the most suitable behaviour for the application being developed.

pdf bib
On Automatic Assignment of Verb Valency Frames in Czech
Jiří Semecký

Many recent NLP applications, including machine translation and information retrieval, could benefit from semantic analysis of language data at the sentence level. This paper presents a method for automatic disambiguation of verb valency frames on Czech data. For each verb occurrence, we extracted features describing its local context. We experimented with diverse types of features, including morphological, syntax-based, idiomatic, animacy and WordNet-based features. The main contribution of the paper lies in determining which of them are most useful for the disambiguation task. The considered features were classified using decision trees, rule-based learning and a Naïve Bayes classifier. We evaluated the methods using 10-fold cross-validation on VALEVAL, a manually annotated corpus of frame annotations containing 7,778 sentences. Syntax-based features proved to be the most effective. When we used the full set of features, we achieved an accuracy of 80.55% against the baseline of 67.87% obtained by assigning the most frequent frame.
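
As an illustration of the evaluation setup, the sketch below runs 10-fold cross-validation with a Naive Bayes classifier over hypothetical local-context features, using scikit-learn; the feature names, labels and tooling are assumptions, not the paper's actual setup.

from sklearn.feature_extraction import DictVectorizer
from sklearn.model_selection import cross_val_score
from sklearn.naive_bayes import MultinomialNB
from sklearn.pipeline import make_pipeline

# hypothetical local-context features and gold frame labels for one verb
X = [{"subj_pos": "N", "has_obj": 1}, {"subj_pos": "P", "has_obj": 0}] * 20
y = ["frame1", "frame2"] * 20

model = make_pipeline(DictVectorizer(sparse=False), MultinomialNB())
scores = cross_val_score(model, X, y, cv=10)   # 10-fold cross-validation accuracy
print(scores.mean())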

pdf bib
An Introduction to NLP-based Textual Anonymisation
Ben Medlock

We introduce the problem of automatic textual anonymisation and present a new publicly-available, pseudonymised benchmark corpus of personal email text for the task, dubbed ITAC (Informal Text Anonymisation Corpus). We discuss the method by which the corpus was constructed, and consider some important issues related to the evaluation of textual anonymisation systems. We also present some initial baseline results on the new corpus using a state-of-the-art HMM-based tagger.

pdf bib
A Web Based General Thesaurus Browser to Support Indexing of Television and Radio Programs
Hennie Brugman | Véronique Malaisé | Luit Gazendam

Documentation and retrieval processes at the Netherlands Institute for Sound and Vision are organized around a common thesaurus. To help improve the quality of these processes, the thesaurus was transformed into an RDF/OWL ontology and extended on the basis of implicit information and external resources. A thesaurus browser web application was designed, implemented and tested on future users.

pdf bib
SKELETON: Specialised knowledge retrieval on the basis of terms and conceptual relations
Judit Feliu | Jorge Vivaldi | M. Teresa Cabré

The main goal of this paper is to present a first approach to the automatic detection of conceptual relations between two terms in specialised written text. Previous experiments based on manual analysis led the authors to implement an automatic query strategy combining the term candidates proposed by an extractor with a list of verbal syntactic patterns used to refine the relations. The next step in the research will be the integration of the results into the term extractor in order to obtain more restricted pieces of information that can be directly reused for the ontology building task.

pdf bib
Building an Evaluation Corpus for German Question Answering by Harvesting Wikipedia
Irene Cramer | Jochen L. Leidner | Dietrich Klakow

The growing interest in open-domain question answering is limited by the lack of evaluation and training resources. To overcome this resource bottleneck for German, we propose a novel methodology to acquire new question-answer pairs for system evaluation that relies on volunteer collaboration over the Internet. Utilizing Wikipedia, a popular free online encyclopedia available in several languages, we show that the data acquisition problem can be cast as a Web experiment. We present a Web-based annotation tool and carry out a distributed data collection experiment. The data gathered from the mostly anonymous contributors is compared to a similar dataset produced in-house by domain experts on the one hand, and to the German questions from the CLEF QA 2004 effort on the other hand. Our analysis of the datasets suggests that using our novel method a medium-scale evaluation resource can be built at very small cost in a short period of time. The technique and software developed here are readily applicable to other languages where free online encyclopedias are available, and our resulting corpus is likewise freely available.

pdf bib
Towards Natural Interactive Question Answering
Gerhard Fliedner

Interactive question answering systems should allow users to lead a coherent information seeking dialogue. Compared with systems that only locally evaluate a question, interactive systems facilitate the information seeking process and provide a more natural feel. We show that by extending a QA system to handle several types of anaphora and ellipsis, the naturalness of the interaction can be considerably improved. We describe an implementation in our prototype QA system for German and give a walk-through example of the enhanced interaction capabilities.

pdf bib
Preprocessing and Tokenisation Standards in DELPH-IN Tools
Benjamin Waldron | Ann Copestake | Ulrich Schäfer | Bernd Kiefer

We discuss preprocessing and tokenisation standards within DELPH-IN, a large scale open-source collaboration providing multiple independent multilingual shallow and deep processors. We discuss (i) a component-specific XML interface format which has been used for some time to interface preprocessor results to the PET parser, and (ii) our implementation of a more generic XML interface format influenced heavily by the (ISO working draft) Morphosyntactic Annotation Framework (MAF). Our generic format encapsulates the information which may be passed from the preprocessing stage to a parser: it uses standoff-annotation, a lattice for the representation of structural ambiguity, intra-annotation dependencies and allows for highly structured annotation content. This work builds on the existing Heart of Gold middleware system, and previous work on Robust Minimal Recursion Semantics (RMRS) as part of an inter-component interface. We give examples of usage with a number of the DELPH-IN processing components and deep grammars.

pdf bib
A Syntactically and Semantically Tagged Corpus of Russian: State of the Art and Prospects
Juri Apresjan | Igor Boguslavsky | Boris Iomdin | Leonid Iomdin | Andrei Sannikov | Victor Sizov

We describe a project aimed at creating a deeply annotated corpus of Russian texts. The annotation consists of comprehensive morphological marking, syntactic tagging in the form of a complete dependency tree, and semantic tagging within a restricted semantic dictionary. Syntactic tagging uses about 80 dependency relations. The syntactically annotated corpus contains more than 28,000 sentences and forms an autonomous part of the Russian National Corpus (www.ruscorpora.ru). Semantic tagging is based on an inventory of semantic features (descriptors) and a dictionary comprising about 3,000 entries, with a set of tags assigned to each lexeme and its argument slots. The set of descriptors assigned to words has been designed in such a way as to construct a linguistically relevant classification for the whole Russian vocabulary. This classification serves for discovering laws according to which the elements of various lexical and semantic classes interact in texts. The inventory of semantic descriptors consists of two parts, object descriptors (about 90 items in total) and predicate descriptors (about a hundred). A set of semantic roles is thoroughly elaborated and contains about 50 roles.

pdf bib
Towards Holistic Summarization – Selecting Summaries, Not Sentences
Martin Hassel | Jonas Sjöbergh

In this paper we present a novel method for automatic text summarization through text extraction, using computational semantics. The new idea is to view all the extracted text as a whole and compute a score for the total impact of the summary, instead of ranking, for instance, individual sentences. A greedy search strategy is used to search through the space of possible summaries and select the summary with the highest score of those found. The aim has been to construct a summarizer that can be quickly assembled, with the use of only a very few basic language tools, for languages that lack large amounts of structured or annotated data or advanced tools for linguistic processing. The proposed method is largely language independent, though we only evaluate it on English in this paper, using ROUGE scores on texts from, among others, the DUC 2004 task 2. On this task our method performs better than several of the systems evaluated there, but worse than the best systems.
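
A greedy whole-summary search of the general kind described in this abstract can be sketched as follows; the scoring function and the toy document are hypothetical placeholders and do not reproduce the semantic impact score used by the authors.

    # Grow the summary one sentence at a time, always keeping the candidate summary
    # whose *overall* score is best, until the length budget is reached.
    def greedy_summary(sentences, score, max_words=100):
        """score() is assumed to rate a whole set of sentences, not single sentences."""
        summary = []
        remaining = list(sentences)
        while remaining:
            candidates = [summary + [s] for s in remaining
                          if sum(len(x.split()) for x in summary + [s]) <= max_words]
            if not candidates:
                break
            best = max(candidates, key=score)
            summary = best
            remaining.remove(best[-1])
        return summary

    # Hypothetical impact score: reward coverage of distinct content words.
    def toy_score(summary):
        return len({w.lower() for s in summary for w in s.split() if len(w) > 3})

    doc = ["The committee approved the new budget on Monday.",
           "The budget increases funding for public transport.",
           "Opponents criticised the decision.",
           "The vote was 12 to 3."]
    print(greedy_summary(doc, toy_score, max_words=20))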

pdf bib
“Casselberveetovallarga” and other Unpronounceable Places: The CrossTowns Corpus
Stefan Schaden | Ute Jekosch

This paper presents a corpus of non-native speech that contains pronunciation variants of European city names from five countries spoken by speakers of four native languages. It was originally designed as a research tool for the study of pronunciation errors by non-native speakers in the pronunciation of foreign city names. The corpus has now been released. Following a brief sketch of the research context in which this data collection was established, the first part of this paper describes the contents and technical specifications of the corpus (design, speakers, language material, recording conditions). Compared to corpora of native speech, non-native speech compilations raise a number of additional difficulties that require specific attention and methodology. Therefore, the second part of the paper aims to point out some of these general issues from the perspective of the experience gained in our research. Strategies to deal with these difficulties will be explored along with their specific benefits and shortfalls, concluding that non-native speech corpora require a number of specific design guidelines which are often difficult to put into practice.

pdf bib
Recognizing Acronyms and their Definitions in Swedish Medical Texts
Dimitrios Kokkinakis | Dana Dannélls

This paper addresses the task of recognizing acronym-definition pairs in Swedish (medical) texts, as well as the compilation of a freely available sample of such manually annotated pairs, a material suitable not only for supervised learning experiments but also as a testbed for evaluating the quality of future acronym-definition recognition systems. A number of approaches to the identification task are described in the literature, particularly within the biomedical domain, but none of them addresses the variation and complexity exhibited in a language other than English. In Swedish texts, two languages (Swedish and English) can be mixed in the same document and/or sentence; Swedish is a compounding language, which significantly degrades the performance of previous approaches unless they are adapted; and, most importantly, there is a large variation of possible acronym-definition permutations realized in the analysed corpora, a variation that is usually ignored in previous studies.
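
For orientation, a simple acronym-definition matching heuristic of the kind common in the earlier literature (scanning a candidate definition backwards for the acronym letters) is sketched below. This is explicitly not the authors' Swedish-specific method, and the example strings are invented.

    def matches(acronym, definition):
        """Return True if the acronym letters can be found, in order, right-to-left
        in the candidate definition (a crude matching heuristic, for illustration)."""
        letters = acronym.lower()
        i = len(letters) - 1
        for token in reversed(definition.lower().split()):
            for ch in reversed(token):
                if i >= 0 and ch == letters[i]:
                    i -= 1
        return i < 0

    print(matches("EKG", "elektrokardiogram"))                    # True
    print(matches("HLT", "human language technology resources"))  # True
    print(matches("HLT", "heterogeneous corpora"))                # False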

pdf bib
Tagging Heterogeneous Evaluation Corpora for Opinionated Tasks
Lun-Wei Ku | Yu-Ting Liang | Hsin-Hsi Chen

Opinion retrieval aims to tell whether a document is positive, neutral or negative on a given topic. Opinion extraction further identifies the supportive and the non-supportive evidence in a document. To evaluate the performance of technologies for opinionated tasks, a suitable corpus is necessary. This paper defines the annotations for opinionated materials. Heterogeneous experimental materials are annotated, and the agreements among annotators are analyzed. How humans monitor opinions of the material as a whole is also examined. The corpus can be employed for opinion extraction, opinion summarization, opinion tracking and opinionated question answering.

pdf bib
Talbanken05: A Swedish Treebank with Phrase Structure and Dependency Annotation
Joakim Nivre | Jens Nilsson | Johan Hall

We introduce Talbanken05, a Swedish treebank based on a syntactically annotated corpus from the 1970s, Talbanken76, converted to modern formats. The treebank is available in three different formats, besides the original one: two versions of phrase structure annotation and one dependency-based annotation, all of which are encoded in XML. In this paper, we describe the conversion process and exemplify the available formats. The treebank is freely available for research and educational purposes.

pdf bib
Transferring Coreference Chains through Word Alignment
Oana Postolache | Dan Cristea | Constantin Orasan

This paper investigates the problem of automatically annotating resources with NP coreference information using a parallel English-Romanian corpus, in order to transfer, through word alignment, coreference chains from the English part to the Romanian part of the corpus. The results show that we can detect Romanian referential expressions and coreference chains with over 80% F-measure, which makes it worthwhile to use our method as a preprocessing step, followed by manual correction, in an annotation effort for creating a large Romanian corpus with coreference information.

pdf bib
A novel Textual Encoding paradigm based on Semantic Web tools and semantics
G. Tummarello | C. Morbidoni | F. Kepler | F. Piazza | P. Puliti

In this paper we perform a preliminary evaluation of how Semantic Web technologies such as RDF and OWL can be used to perform textual encoding. Among the potential advantages, we notice how RDF, given its conceptual graph structure, appears naturally suited to deal with overlapping hierarchies of annotations, something notoriously problematic with classic XML-based markup. To conclude, we show how complex querying can be performed using slight modifications of already existing Semantic Web query tools.

pdf bib
General and Task-Specific Corpus Resources for Polish Adult Learners of English
Anna Bogacka | Katarzyna Dziubalska-Kołaczyk | Grzegorz Krynicki | Dawid Pietrala | Mikołaj Wypych

This paper offers a comparison of two resources for Polish adult learners of English. The first has been designed for the Polish-English Literacy Tutor (PELT), a multimodal system for foreign language learning, as training input to a speech recognition system for highly accented, strongly variable second-language speech. The second corpus is a task-specific resource designed in the PELT framework to investigate the vowel space of English produced by Poles. Linguistically and technologically challenging aspects of the two ventures and their complementary character are presented.

pdf bib
Language Specific and Topic Focused Web Crawling
Olena Medelyan | Stefan Schulz | Jan Paetzold | Michael Poprat | Kornél Markó

We describe an experiment on collecting large language and topic specific corpora automatically by using a focused Web crawler. Our crawler combines efficient crawling techniques with a common text classification tool. Given a sample corpus of medical documents, we automatically extract query phrases and then acquire seed URLs with a standard search engine. Starting from these seed URLs, the crawler builds a new large collection consisting only of documents that satisfy both the language and the topic model. The manual analysis of acquired English and German medicine corpora reveals the high accuracy of the crawler. However, there are significant differences between both languages.

pdf bib
Corpus-Induced Corpus Clean-up
Martin Reynaert

We explore the feasibility of using only unsupervised means to identify non-words, i.e. typos, in a frequency list derived from a large corpus of Dutch and to distinguish between these non-words and real words in the language. We call the system we built and evaluate in this paper ciccl, which stands for “Corpus-Induced Corpus Clean-up”. The algorithm on which ciccl is primarily based is the anagram-key hashing algorithm introduced by (Reynaert, 2004). The core correction mechanism is a simple and effective method which translates the actual characters which make up a word into a large natural number in such a way that all the anagrams, i.e. all the words composed of precisely the same set of characters, are allocated the same natural number. In effect, this constitutes a novel approximate string matching algorithm for indexed text search. This is because, by simple addition, subtraction or a combination of both, all variants within reach of the range of numerical values defined in the alphabet are retrieved by iterating over the alphabet. ciccl's input consists primarily of corpus-derived frequency lists, from which it derives valuable morphological information by performing frequency counts over the substrings of the words; these counts are then used to perform decompounding, as well as to distinguish between most likely correctly spelled words and typos.
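
The anagram-key idea can be illustrated with the sketch below. The exact character-value function used in ciccl is not given in the abstract, so summing a fixed power of each character's code point is only an assumption that preserves the property described: every anagram of a word receives the same numerical key.

    # Sketch of anagram-key hashing (illustrative constants, not Reynaert's actual ones):
    # a word is mapped to a number that depends only on its bag of characters,
    # so all anagrams collide on the same key.
    def anagram_key(word):
        return sum(ord(ch) ** 5 for ch in word.lower())

    lexicon = ["listen", "silent", "enlist", "corpus", "cropus"]  # "cropus" is a typo
    index = {}
    for w in lexicon:
        index.setdefault(anagram_key(w), []).append(w)

    # All anagrams share one bucket:
    print(index[anagram_key("listen")])      # ['listen', 'silent', 'enlist']

    # Variants reachable by adding/subtracting character values can be generated by
    # iterating over the alphabet; transposition typos are anagrams, so the key is equal:
    assert anagram_key("cropus") == anagram_key("corpus")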

pdf bib
TQB: Accessing Multimodal Data Using a Transcript-based Query and Browsing Interface
Andrei Popescu-Belis | Maria Georgescul

This article describes an interface for searching and browsing multimodal recordings of group meetings. We provide first an overall perspective of meeting processing and retrieval applications, and distinguish between the media/modalities that are recorded and the ones that are used for browsing. We then proceed to describe the data and the annotations that are stored in a meeting database. Two scenarios of use for the transcript-based query and browsing interface (TQB) are then outlined: search and browse vs. overview and browse. The main functionalities of TQB, namely the database backend and the multimedia rendering solutions are described. An outline of evaluation perspectives is finally provided, with a description of the user interaction features that will be monitored.

pdf bib
An Annotated Corpus of Typical Durations of Events
Feng Pan | Rutu Mulkar | Jerry R. Hobbs

In this paper, we present our work on generating an annotated corpus for extracting information about the typical durations of events from texts. We include the annotation guidelines, the event classes we categorized, the way we use normal distributions to model vague and implicit temporal information, and how we evaluate inter-annotator agreement. The experimental results show that our guidelines are effective in improving the inter-annotator agreement.
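
As a purely illustrative companion to the idea of modelling vague duration statements with normal distributions, the sketch below places a normal over log-seconds for each annotator's judgement and scores agreement as the overlap of the two distributions. The abstract does not spell out the exact modelling or agreement metric, so every detail here (the bounds-to-parameters mapping, the overlap measure, the example judgements) is an assumption.

    import math

    def normal_from_bounds(lower_s, upper_s):
        """Fit a mean/std in log-seconds so the stated bounds sit roughly 2 std apart."""
        lo, hi = math.log(lower_s), math.log(upper_s)
        return (lo + hi) / 2, (hi - lo) / 4 or 0.1

    def overlap(n1, n2, grid=1000):
        """Approximate the area shared by two normal densities (a value in [0, 1])."""
        (m1, s1), (m2, s2) = n1, n2
        lo = min(m1 - 4 * s1, m2 - 4 * s2)
        hi = max(m1 + 4 * s1, m2 + 4 * s2)
        step = (hi - lo) / grid
        pdf = lambda x, m, s: math.exp(-0.5 * ((x - m) / s) ** 2) / (s * math.sqrt(2 * math.pi))
        return sum(min(pdf(lo + i * step, m1, s1), pdf(lo + i * step, m2, s2))
                   for i in range(grid)) * step

    # Annotator A: "the meeting lasted between 30 minutes and 2 hours"
    # Annotator B: "between 1 and 3 hours"
    a = normal_from_bounds(30 * 60, 2 * 3600)
    b = normal_from_bounds(3600, 3 * 3600)
    print(f"agreement (distribution overlap) = {overlap(a, b):.2f}")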

pdf bib
Lexical similarity can distinguish between automatic and manual translations
Agam Patel | Dragomir R. Radev

We consider the problem of distinguishing automatic translations from manual translations of the same sentence. Using two different similarity metrics (BLEU and Levenshtein edit distance), we found that automatic translations are closer to each other than they are to manual translations. We also use phylogenetic trees to provide a visual representation of the distances between pairs of individual sentences in a set of translations. The differences in lexical distance are statistically significant, both for Chinese to English and for Arabic to English translations.
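
One of the two similarity metrics mentioned, Levenshtein edit distance, can be computed at the word level with the standard dynamic programme below (BLEU is not sketched here); the example sentences are invented.

    # Word-level Levenshtein edit distance between two sentences.
    def levenshtein(a, b):
        a, b = a.split(), b.split()
        prev = list(range(len(b) + 1))
        for i, wa in enumerate(a, 1):
            cur = [i]
            for j, wb in enumerate(b, 1):
                cur.append(min(prev[j] + 1,                  # deletion
                               cur[j - 1] + 1,               # insertion
                               prev[j - 1] + (wa != wb)))    # substitution
            prev = cur
        return prev[-1]

    mt_1 = "the president gave a speech yesterday"
    mt_2 = "the president made a speech yesterday"
    human = "yesterday the president delivered an address"
    # Automatic translations tend to be closer to each other than to the human one.
    print(levenshtein(mt_1, mt_2), levenshtein(mt_1, human))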

pdf bib
Multimedia Database of Meetings and Informal Interactions for Tracking Participant Involvement and Discourse Flow
Nick Campbell | Toshiyuki Sadanobu | Masataka Imura | Naoto Iwahashi | Suzuki Noriko | Damien Douxchamps

At ATR, we are collecting and analysing “meetings” data using a table-top sensor device consisting of a small 360-degree camera surrounded by an array of high-quality directional microphones. This equipment provides a stream of information about the audio and visual events of the meeting which is then processed to form a representation of the verbal and non-verbal interpersonal activity, or discourse flow, during the meeting. This paper describes the resulting corpus of speech and video data which is being collected for the above research. It currently includes data from 12 monthly sessions, comprising 71 video and 33 audio modules. Collection is continuing monthly and is scheduled to include another ten sessions.

pdf bib
Towards Unified Chinese Segmentation Algorithm
Fu Lee Wang | Xiaotie Deng | Feng Zou

As Chinese is an ideographic character-based language, the words in texts are not delimited by spaces. Indexing of Chinese documents is impossible without a proper segmentation algorithm. Many Chinese segmentation algorithms have been proposed in the past. Traditional segmentation algorithms cannot operate without a large dictionary or a large corpus of training data. Nowadays, the Web has become the largest corpus and is ideal for Chinese segmentation. Although search engines do not segment texts into proper words, they maintain huge databases of documents and of the frequencies of character sequences in those documents. Their databases are important potential resources for segmentation. In this paper, we propose a segmentation algorithm that mines web data with the help of search engines. It is the first unified segmentation algorithm for the Chinese language across different geographical areas. Experiments have been conducted on the datasets of a recent Chinese segmentation competition. The results show that our algorithm outperforms the traditional algorithms in terms of precision and recall. Moreover, our algorithm can effectively deal with the problems of segmentation ambiguity, new word (unknown word) detection, and stop words.
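
The general idea of scoring candidate segmentations with character-sequence frequencies can be sketched as below. The frequency table is a small hypothetical stand-in for the counts the real system obtains by querying search engines, and the simple relative-frequency scoring is not the authors' exact formula.

    import math

    # Hypothetical counts standing in for search-engine frequencies of character sequences.
    freq = {"北京": 900_000, "大学": 850_000, "北京大学": 400_000,
            "生": 300_000, "学生": 700_000, "大": 500_000, "北": 200_000, "京": 150_000}
    TOTAL = sum(freq.values())

    def best_segmentation(text, max_len=4):
        # Dynamic programme over prefixes: best[i] = (score, segmentation of text[:i]).
        best = [(0.0, [])] + [(-math.inf, None)] * len(text)
        for i in range(1, len(text) + 1):
            for j in range(max(0, i - max_len), i):
                piece = text[j:i]
                count = freq.get(piece, 0.5)          # smoothing for unseen sequences
                score = best[j][0] + math.log(count / TOTAL)
                if score > best[i][0]:
                    best[i] = (score, best[j][1] + [piece])
        return best[-1][1]

    print(best_segmentation("北京大学生"))   # -> ['北京大学', '生'] with this toy table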

pdf bib
Building a Parallel Multilingual Corpus (Arabic-Spanish-English)
Doaa Samy | Antonio Moreno Sandoval | José M. Guirao | Enrique Alfonseca

This paper presents the results (1st phase) of the on-going research in the Computational Linguistics Laboratory at Autónoma University of Madrid (LLI-UAM) aiming at the development of a multi-lingual parallel corpus (Arabic-Spanish-English) aligned on the sentence level and tagged on the POS level. A multilingual parallel corpus which brings together Arabic, Spanish and English is a new resource for the NLP community that completes the present panorama of parallel corpora. In the first part of this study, we introduce the novelty of our approach and the challenges encountered to create such a corpus. This introductory part highlights the main features of the corpus and the criteria applied during the selection process. The second part focuses on two main stages: basic processing (tokenization and segmentation) and alignment. Methodology of alignment is explained in detail and results obtained in the three different linguistic pairs are compared. POS tagging and tools used in this stage are discussed in the third part. The final output is available in two versions: the non-aligned version and the aligned one. The latter adopts the TMX (Translation Memory Exchange) standard format. At the end, the section dedicated to the future work points out the key stages concerned with extending the corpus and the studies that can benefit, directly or indirectly, from such a resource.

pdf bib
The OSU Quake 2004 corpus of two-party situated problem-solving dialogs
Donna K. Byron | Eric Fosler-Lussier

This report describes the Ohio State University Quake 2004 corpus of English spontaneous task-oriented two-person situated dialog. The corpus was collected using a first-person display of an interior space (rooms, corridors, stairs) in which the partners collaborate on a treasure hunt task. The corpus contains exciting new features such as deictic and exophoric reference, language that is calibrated against the spatial arrangement of objects in the world, and partial-observability of the task world imposed by the perceptual limitations inherent in the physical arrangement of the world. The corpus differs from prior dialog collections which intentionally restricted the interacting subjects from sharing any perceptual context, and which allowed one subject (the direction-giver or system) to have total knowledge of the state of the task world. The corpus consists of audio/video recordings of each person's experience in the virtual world and orthographic transcriptions. The virtual world can also be used by other researchers who want to conduct additional studies using this stimulus.

pdf bib
Second Order Co-occurrence PMI for Determining the Semantic Similarity of Words
Md. Aminul Islam | Diana Inkpen

This paper presents a new corpus-based method for calculating the semantic similarity of two target words. Our method, called Second Order Co-occurrence PMI (SOC-PMI), uses Pointwise Mutual Information to sort lists of important neighbor words of the two target words. Then we consider the words which are common to both lists and aggregate their PMI values (from the opposite list) to calculate the relative semantic similarity. Our method was empirically evaluated using Miller and Charles' (1991) 30 noun pair subset, Rubenstein and Goodenough's (1965) 65 noun pairs, 80 synonym test questions from the Test of English as a Foreign Language (TOEFL), and 50 synonym test questions from a collection of English as a Second Language (ESL) tests. Evaluation results show that our method outperforms several competing corpus-based methods.
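
The pointwise mutual information underlying the method can be computed from corpus counts as in the sketch below; the counts are invented and the aggregation step is a heavily simplified stand-in for the full SOC-PMI procedure, shown only to convey the second-order intuition.

    import math

    # Toy corpus statistics (hypothetical); in the paper these come from a large corpus
    # and a context window around the two target words.
    N = 1_000_000
    count = {"car": 4_000, "automobile": 1_200, "wheel": 3_000, "banana": 2_500}
    cooc = {("car", "wheel"): 900, ("automobile", "wheel"): 300,
            ("car", "banana"): 20, ("automobile", "banana"): 5}

    def pmi(x, y):
        """Pointwise mutual information from joint and marginal counts."""
        joint = cooc.get((x, y)) or cooc.get((y, x)) or 0
        if joint == 0:
            return 0.0
        return math.log2((joint / N) / ((count[x] / N) * (count[y] / N)))

    # First-order step: rank the neighbours of each target word by PMI.
    def neighbours(word):
        return sorted((w for w in count if w != word),
                      key=lambda w: pmi(word, w), reverse=True)

    # Second-order step (simplified): aggregate the PMI that both words have with
    # their shared top neighbours, which is the intuition behind SOC-PMI.
    def soc_pmi(w1, w2, beta=2):
        shared = set(neighbours(w1)[:beta]) & set(neighbours(w2)[:beta])
        return sum(pmi(w1, n) + pmi(w2, n) for n in shared)

    print(soc_pmi("car", "automobile"))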

pdf bib
A Domain Ontology Production Tool Kit Based on Automatically Constructed Case Frames
Yoji Kiyota | Hiroshi Nakagawa

This paper proposes a tool kit to produce a domain ontology for text mining, based on case frames automatically constructed from a raw corpus of a specific domain. Since case frames are strongly related to implicit facts hidden in large domain-specific corpora, we can say that case frames are a promising device for text mining. The aim of the tool kit is to enable automatic analysis of event reports, from which implicit factors of the events are to be extracted. The tool kit enables us to produce a domain ontology by iterating associative retrieval of case frames and manual refinement. In this study, the tool kit is applied to the Japan Airlines pilot report collection, and a domain ontology of contributing factors in the civil aviation domain is experimentally produced. A lot of interesting examples are found in the ontology. In addition, a brief examination of the production process shows the efficiency of the tool kit.

pdf bib
Tangible Objects for the Acquisition of Multimodal Interaction Patterns
Ronnie Taib | Natalie Ruiz

Multimodal user interfaces offer more intuitive interaction for end-users, however usually only through predefined input schemes. This paper describes a user experiment on multimodal interaction pattern identification, using head gesture and speech inputs for 3D graph manipulation. We show that a direct mapping between head gestures and the 3D object predominates; however, even for such a simple task, inputs vary greatly between users and do not exhibit any clustering pattern. Also, in spite of the high degree of expressiveness of linguistic modalities, speech commands in particular tend to use a limited vocabulary. We observed a common set of verb and adverb compounds in a majority of users. In conclusion, we recommend that multimodal user interfaces be individually customisable or adaptive to users’ interaction preferences.

pdf bib
Perspectives of Turning Prague Dependency Treebank into a Knowledge Base
Václav Novák | Jan Hajič

Recently, the Prague Dependency Treebank 2.0 (PDT 2.0) has emerged as the largest text corpus annotated at the level of tectogrammatical representation (“linguistic meaning”) described in Sgall et al. (2004), containing about 0.8 million words (see Hajic (2004)). We hope that this level of annotation is so close to the meaning of the utterances contained in the corpus that it should enable us to automatically transform the texts into the form of a knowledge base, usable for information extraction, question answering, summarization, etc. We can use Multilayered Extended Semantic Networks (MultiNet), described in Helbig (2006), as the target formalism. In this paper we discuss the suitability of such an approach and some of the main issues that will arise in the process. In section 1 we introduce the formalisms underlying PDT 2.0 and MultiNet, in section 2 we describe the role MultiNet can play in the system of Functional Generative Description (FGD), section 3 discusses issues of automatic conversion to MultiNet, and section 4 gives some conclusions.

pdf bib
CEFLE and Direkt Profil: a New Computer Learner Corpus in French L2 and a System for Grammatical Profiling
Jonas Granfeldt | Pierre Nugues | Malin Ågren | Jonas Thulin | Emil Persson | Suzanne Schlyter

The importance of computer learner corpora for research in both second language acquisition and foreign language teaching is rapidly increasing. Computer learner corpora can provide us with data to describe the learner’s interlanguage system at different points of its development and they can be used to create pedagogical tools. In this paper, we first present a new computer learner corpus in French. We then describe an analyzer called Direkt Profil, that we have developed using this corpus. The system carries out a sentence analysis based on developmental sequences, i.e. local morphosyntactic phenomena linked to a development in the acquisition of French as a foreign language. We present a brief introduction to developmental sequences and some examples in French. In the final section, we introduce and evaluate a method to optimize the definition and detection of learner profiles using machine-learning techniques.

pdf bib
Development of a phoneme-to-phoneme (p2p) converter to improve the grapheme-to-phoneme (g2p) conversion of names
Qian Yang | Jean-Pierre Martens | Nanneke Konings | Henk van den Heuvel

It is acknowledged that a good phonemic transcription of proper names is imperative for the success of many modern speech-based services such as directory assistance, car navigation, etc. It is also known that state-of-the-art general-purpose grapheme-to-phoneme (g2p) converters perform rather poorly on many name categories. This paper proposes to use a g2p-p2p tandem comprising a state-of-the-art general-purpose g2p converter that produces an initial transcription and a name category specific phoneme-to-phoneme (p2p) converter that aims at correcting the mistakes made by the g2p converter. The main body of the paper describes a novel methodology for the automatic construction of the p2p converter. The methodology is implemented in a software toolbox that will be made publicly available in a form that will permit the user to design a p2p converter for an arbitrary name category. To give a proof of concept, the toolbox was used for the development of three p2p converters for first names, surnames and geographical names respectively. The obtained systems are small (few rules) and effective: significant improvements (up to 50% relative) of the grapheme-to-phoneme conversion are obtained. These encouraging results call for a further development and improvement of the approach.

pdf bib
Recurrent Markov Cluster (RMCL) Algorithm for the Refinement of the Semantic Network
Jaeyoung Jung | Maki Miyake | Hiroyuki Akam

The purpose of this work is to propose a new methodology that ameliorates the Markov Cluster (MCL) algorithm, which is well known as an efficient way of graph clustering (Van Dongen, 2000). When applied to a graph of word associations, MCL has the effect of producing concept areas in which words are grouped into similar topics or similar meanings as paradigms. However, since each word is assigned to only one cluster representing a concept, Markov clusters cannot capture the polysemy or semantic indeterminacy that is characteristic of natural language. Our Recurrent MCL (RMCL) allows us to create a virtual adjacency relationship among the Markov hard clusters and produce a downsized and intrinsically informative semantic network of word association data. We applied one of the RMCL algorithms (Stepping-stone type) to a Japanese associative concept dictionary and obtained a satisfactory level of performance in refining the semantic network generated from MCL.
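
For reference, the underlying MCL procedure (alternating expansion and inflation of a column-stochastic matrix until the flow stabilises) can be sketched as below; the RMCL refinement itself is not reproduced, and the small graph and parameter values are only illustrative.

    import numpy as np

    def mcl(adjacency, expansion=2, inflation=2.0, iterations=50):
        """Plain Markov Cluster algorithm (Van Dongen, 2000), without the RMCL step."""
        m = adjacency.astype(float) + np.eye(len(adjacency))   # add self-loops
        m /= m.sum(axis=0)                                      # make columns stochastic
        for _ in range(iterations):
            m = np.linalg.matrix_power(m, expansion)            # expansion: spread flow
            m = m ** inflation                                   # inflation: boost strong flow
            m /= m.sum(axis=0)
        # group nodes by the attractor row that receives their remaining flow
        clusters = {}
        for node, col in enumerate(m.T):
            clusters.setdefault(int(col.argmax()), []).append(node)
        return list(clusters.values())

    # Tiny word-association graph with two loosely connected groups of nodes.
    A = np.array([[0, 1, 1, 0, 0],
                  [1, 0, 1, 0, 0],
                  [1, 1, 0, 1, 0],
                  [0, 0, 1, 0, 1],
                  [0, 0, 0, 1, 0]])
    print(mcl(A))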

pdf bib
JASMIN-CGN: Extension of the Spoken Dutch Corpus with Speech of Elderly People, Children and Non-natives in the Human-Machine Interaction Modality
Catia Cucchiarini | Hugo Van hamme | Olga van Herwijnen | Felix Smits

Large speech corpora (LSC) constitute an indispensable resource for conducting research in speech processing and for developing real-life speech applications. In 2004 the Spoken Dutch Corpus (CGN) became available, a corpus of standard Dutch as spoken by adult natives in the Netherlands and Flanders. Owing to budget constraints, CGN does not include speech of children, non-natives, elderly people and recordings of speech produced in human-machine interactions. Since such recordings would be extremely useful for conducting research and for developing HLT applications for these specific groups of speakers of Dutch, a new project, JASMIN-CGN, was started which aims at extending CGN in different ways: by collecting a corpus of contemporary Dutch as spoken by children of different age groups, non-natives with different mother tongues and elderly people in the Netherlands and Flanders and, in addition, by collecting speech material in a communication setting that was not envisaged in CGN: human-machine interaction. We expect that the knowledge gathered from these data can be generalized to developing appropriate systems also for other speaker groups (i.e. adult natives). One third of the data will be collected in Flanders and two thirds in the Netherlands.

pdf bib
A framework for real-time dictionary updating
Cédrick Fairon | Sébastien Paumier

We present a framework that combines a web-based text acquisition tool, a term extractor and a two-level workflow management system tailored for facilitating dictionary updates. Our aim is to show that, thanks to such a methodology, it is possible to monitor data sources and rapidly review and code new dictionary entries. Once approved, these new entries can feed in real-time client dictionary-based applications that need to be continuously kept up to date.

pdf bib
Bilingual speech corpus in two phonetically similar languages
Vicente Alabau | Carlos D. Martínez

As Speech Recognition Systems improve, they become suitable for facing new problems. Multilingual speech recognition is one such problem. In the present work, the case of the Comunitat Valenciana multilingual environment is studied. The official languages in the Comunitat Valenciana (Spanish and Valencian) share most of their acoustic units, and their vocabularies and syntax are quite similar. They have influenced each other for many years. A small corpus on an Information System task was developed for experimentation purposes. This choice will make it possible to develop a working prototype in the future, and it is simple enough to build semi-automatic language models. The design of the acoustic corpus is discussed, showing that all combinations of accents have been studied (native, non-native speakers, male, female, etc.).

pdf bib
METIS-II: Machine Translation for Low Resource Languages
Vincent Vandeghinste | Ineke Schuurman | Michael Carl | Stella Markantonatou | Toni Badia

In this paper we describe a machine translation prototype in which we use only minimal resources for both the source and the target language. A shallow source language analysis, combined with a translation dictionary and a mapping system of source language phenomena into the target language and a target language corpus for generation are all the resources needed in the described system. Several approaches are presented.

pdf bib
The Dutch-Flemish HLT Programme STEVIN: Essential Speech and Language Technology Resources
Elisabeth D’Halleweyn | Jan Odijk | Lisanne Teunissen | Catia Cucchiarini

In 2004 a consortium of ministries and organizations in the Netherlands and Flanders launched the comprehensive Dutch-Flemish HLT programme STEVIN (a Dutch acronym for “Essential Speech and Language Technology Resources”). To guarantee its Dutch-Flemish character, this large-scale programme is carried out under the auspices of the intergovernmental Dutch Language Union (NTU). The aim of STEVIN is to contribute to the further progress of HLT for the Dutch language, by raising awareness of HLT results, stimulating the demand of HLT products, promoting strategic research in HLT, and developing HLT resources that are essential and are known to be missing. Furthermore, a structure was set up for the management, maintenance and distribution of HLT resources. The STEVIN programme, which will run from 2004 to 2009, resulted from HLT activities in the Dutch language area, which were reported on at previous LREC conferences (2000, 2002, 2004). In this paper we will explain how different activities are combined in one comprehensive programme. We will show how cooperation can successfully be realized between different parties (language and speech technology, Flanders and the Netherlands, academia, industry and policy institutions) so as to achieve one common goal: progress in HLT.

pdf bib
On the Web Trilingual Sign Language Dictionary to Learn the foreign Sign Language without Learning a Target Spoken Language
Emiko Suzuki | Tomomi Suzuki | Kyoko Kakihana

This paper describes a trilingual sign language dictionary (Japanese Sign Language, American Sign Language, and Korean Sign Language) which helps users learn each sign language directly from their mother sign language. Our discussion covers two main points. The first is the necessity of a trilingual dictionary. Since there is no “universal sign language” or real “international sign language”, deaf people who want to talk to people whose mother tongue is different from their own must learn at least four languages: their mother sign language, their mother spoken language as the first intermediate language, the target spoken language as the second intermediate language, and the sign language in which they want to communicate. Those two spoken languages become language barriers for deaf people, and our trilingual dictionary removes that barrier. The second point concerns the use of computers. As the use of computers becomes widespread, it is increasingly convenient to study through computer software or Internet facilities. Our WWW dictionary system provides deaf people with an easy means of access using their mother sign language, which means they don't have to overcome the barrier of learning a foreign spoken language. It also provides a way for people who are going to learn three sign languages to look up new vocabulary. We are further planning to examine how our dictionary system could be used to educate and assist deaf people.

pdf bib
Exploiting logical document structure for anaphora resolution
Daniela Goecke | Andreas Witt

The aim of the paper is twofold. First, an approach is presented for selecting the correct antecedent for an anaphoric element according to the kind of text segments in which both of them occur. Basically, information on logical text structure (e.g. chapters, sections, paragraphs) is used in order to determine the antecedent life span of a linguistic expression, i.e. some linguistic expressions are more likely to be chosen as an antecedent throughout the whole text than others. In addition, an appropriate search scope for an anaphor can be defined according to the document structuring elements that include the anaphoric expression. Corpus investigations give rise to the supposition that logical text structure influences the search scope of candidates for antecedents. Second, a solution is presented for integrating the resources used for anaphora resolution. In this approach, multi-layered XML annotation is used in order to make a set of resources accessible to the anaphora resolution system.

pdf bib
A translated corpus of 30,000 French SMS
Cédrick Fairon | Sébastien Paumier

The development of communication technologies has contributed to the appearance of new forms in the written language that scientists have to study according to their peculiarities (typing or viewing constraints, synchronicity, etc). In the particular case of SMS (Short Message Service), studies are complicated by a lack of data, mainly due to technical constraints and privacy considerations. In this paper, we present a corpus of 30,000 French SMS collected through a project in Belgium named “Faites don de vos SMS à la science” (Give your SMS to Science). This corpus is unique in its quality, its size and the fact that the SMS have been manually translated into “standard” French. We will first describe the collection process and discuss the writers' profiles. Then we will explain in detail how the translation was carried out.

pdf bib
Evaluation of Stop Word Lists in Chinese Language
Feng Zou | Fu Lee Wang | Xiaotie Deng | Song Han

In modern information retrieval systems, effective indexing can be achieved by the removal of stop words. To date, many stop word lists have been developed for the English language. However, no standard stop word list has been constructed for the Chinese language yet. With the fast development of information retrieval in Chinese, evaluating Chinese stop word lists becomes critical. In this paper, to save time and relieve the burden of manual comparison, we propose a novel stop word list evaluation method based on a mutual information-based Chinese segmentation methodology. Experiments have been conducted on training texts taken from a recent international Chinese segmentation competition. Results show that effective stop word lists can improve the accuracy of Chinese segmentation significantly.

pdf bib
Intelligent Dictionary Interfaces: Usability Evaluation of Access-Supporting Enhancements
Anna Sinopalnikova | Pavel Smrž

The present paper describes psycholinguistic experiments aimed at exploring the way people behave while accessing electronic dictionaries. In our work we focused on the access by meaning that, in comparison with the access by form, is currently less studied and very seldom implemented in modern dictionary interfaces. Thus, the goal of our experiments was to explore dictionary users’ requirements and to study what services an intelligent dictionary interface should be able to supply to help solving access by meaning problems. We tested several access-supporting enhancements of electronic dictionaries based on various language resources (corpora, wordnets, word association norms and explanatory dictionaries). Experiments were carried out with native speakers of three European languages – English, Czech and Russian. Results for monolingual and bilingual cases are presented.

pdf bib
SmartWeb UMTS Speech Data Collection: The SmartWeb Handheld Corpus
Hannes Mögele | Moritz Kaiser | Florian Schiel

In this paper we outline the German speech data collection for the SmartWeb project, which is funded by the German Ministry of Science and Education. We focus on the SmartWeb Handheld Corpus (SHC), which has been collected by the Bavarian Archive for Speech Signals (BAS) at the Phonetic Institute (IPSK) of Munich University. Signals of SHC are being recorded in real-life environments (indoor and outdoor) with real background noise as well as real transmission line errors. We developed a new elicitation method and recording technique, called situational prompting, which facilitates collecting realistic dialogue speech data in a cost efficient way. We can show that almost realistic speech queries to a dialogue system issued over a mobile PDA or smart phone can be collected very efficiently using an automatic speech server. We describe the technical and linguistic features of the resulting speech corpus, which will be publicly available at BAS or ELDA.

pdf bib
Valency Lexicon of Czech Verbs: Alternation-Based Model
Markéta Lopatková | Zdeněk Žabokrtský | Karolina Skwarska

The main objective of this paper is to introduce an alternation-based model of VALLEX, a valency lexicon of Czech verbs. Alternations describe regular changes in the valency structure of verbs -- they are seen as transformations that take one lexical unit and return a modified lexical unit as a result. We characterize and exemplify “syntactically-based” and “semantically-based” alternations and their effects on verb argument structure. The alternation-based model allows us to distinguish a minimal form of the lexicon, which provides a compact characterization of the valency structure of Czech verbs, and an expanded form of the lexicon useful for some applications.

pdf bib
Are you ready for a call? - Spontaneous conversations in tourism for speech-to-speech translation systems
Darinka Verdonik | Matej Rojc

This paper presents the Turdis database of spontaneous conversations in the tourist domain in the Slovenian language. The database was built for use in developing speech-to-speech translation components, but it can also be used for developing dialogue systems or for linguistic research. The idea was to record a database of telephone conversations in tourism where the naturalness of the conversations is affected as little as possible while permission for recording is still obtained from all the speakers. Recording in a studio environment poses many problems: it is especially difficult to imitate a tourist agent if a speaker does not have such experience and therefore lacks the background knowledge that a tourist agent has. Therefore the Turdis database was recorded with professional tourist agents. An agreement with local tourist companies made it possible to record a tourist agent at his working place, during his working time, answering the telephone. Callers were contacted individually and asked to use the Turdis system and make a call to a selected tourist company. Technically, the recording was done using a PC ISDN card. The database was orthographically transcribed with the Transcriber tool. At present it includes approximately 43,000 words.

pdf bib
Bikers Accessing the Web: The SmartWeb Motorbike Corpus
Moritz Kaiser | Hannes Mögele | Florian Schiel

Three advanced German speech corpora have been collected during the German SmartWeb project. One of them, the SmartWeb Motorbike Corpus (SMC), is described in this paper. As with all SmartWeb speech corpora, SMC is designed for a dialogue system dealing with open domains. The corpus is recorded under the special circumstances of a motorbike ride and contains utterances of the driver related to information retrieval from various sources and different topics. Audio tracks show characteristic noise from the engine and surrounding traffic as well as drop-outs caused by the transmission over Bluetooth and the UMTS mobile network. We discuss the problems of the technical setup and the fully automatic evocation of naturally spoken queries by means of dialogue-like sequences.

pdf bib
Ontology Driven K-Portal Construction and K-Service Provision
Asanee Kawtrakul | Chaveevan Pechsiri | Trakul Permpool | Dussadee Thamvijit | Phukao Sornprasert | Chaiyakorn Yingsaeree | Mukda Suktarachan

Knowledge is crucial for a country's development and for business intelligence, but valuable knowledge is distributed over several websites in heterogeneous formats. Moreover, finding the needed information is a complex task since there is a lack of semantic relations and organization. Even when it is found, an overload may occur because there is no content digestion. This paper focuses on ontology-driven knowledge extraction with natural language processing techniques and a framework of user-centric design for accessing the required information based on users' demands. These demands can be expressed in the form of Know-what, Know-why, Know-where, Know-when, Know-how, and Know-who for a question answering system.

pdf bib
Temporality in relation with discourse structure
Corina Forăscu | Ionuț Cristian Pistol | Dan Cristea

Temporal relations between events and times are often difficult, time-consuming and expensive to discover. In this paper a corpus study is performed to derive a strong relation between discourse structure, as revealed by Veins theory, and the temporal links between entities, as addressed in the TimeML annotation standard. The data interpretation helps us gain insight into how Veins theory can improve the manual and even (semi-)automatic detection of temporal relations.

pdf bib
Corpus Annotation as a Test of a Linguistic Theory
Eva Hajičová | Petr Sgall

In the present contribution we claim that corpus annotation serves, among other things, as an invaluable test for linguistic theories standing behind the annotation schemes, and as such represents an irreplaceable resource of linguistic information for the build-up of grammars. To support this claim we present four linguistic phenomena for the study and relevant description of which in grammar a deep layer of corpus annotation as introduced in the Prague Dependency Treebank has brought important observations, namely the information structure of the sentence, condition of projectivity and word order, types of dependency relations and textual coreference.

pdf bib
Czech-English Word Alignment
Ondřej Bojar | Magdelena Prokopová

We describe an experiment with Czech-English word alignment. Half a thousand sentences were manually annotated by two annotators in parallel, and the most frequent reasons for disagreement are described. We evaluate the accuracy of the GIZA++ alignment toolkit on the data and find that lemmatization of the Czech part can cut the alignment error in half. Furthermore, we document that about 38% of the tokens that were difficult for GIZA++ were already difficult for humans.
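
One common way to score such an alignment against a manual gold standard is the alignment error rate over sure and possible links (Och and Ney style); the abstract does not state which exact measure was used, so the sketch below, with invented links, is only an illustration of how alignment accuracy can be computed.

    # Alignment error rate from sure (S) and possible (P) gold links; links are
    # (source index, target index) pairs, and S is treated as a subset of P.
    def aer(predicted, sure, possible):
        predicted, sure = set(predicted), set(sure)
        possible = set(possible) | sure
        return 1.0 - (len(predicted & sure) + len(predicted & possible)) / (len(predicted) + len(sure))

    gold_sure     = {(0, 0), (1, 2), (2, 1)}     # hypothetical (Czech, English) links
    gold_possible = {(3, 3)}
    giza_links    = {(0, 0), (1, 2), (2, 3)}

    print(f"AER = {aer(giza_links, gold_sure, gold_possible):.2f}")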

pdf bib
MORBO/COMP: a multilingual database of compound words
Emiliano Guevara | Sergio Scalise | Antonietta Bisetto | Chiara Melloni

The aim of this paper is to present the MORBO/COMP project, which has reached its final stage of development and will soon be published on-line. MORBO/COMP is a large database of compound types in over 20 languages. The data for these languages have been collected and analysed by a group of morphologists from various European countries.

pdf bib
A Multimodal Result Ontology for Integrated Semantic Web Dialogue Applications
Daniel Sonntag | Massimo Romanelli

General purpose ontologies and domain ontologies make up the infrastructure of the Semantic Web, which allow for accurate data representations with relations, and data inferences. In our approach to multimodal dialogue systems providing question answering functionality (SMARTWEB), the ontological infrastructure is essential. We aim at an integrated approach in which all knowledge-aware system modules are based on interoperating ontologies in a common data model. The discourse ontology is meant to provide the necessary dialogue and HCI concepts. We present the ontological syntactic structure of multimodal question answering results as part of this discourse ontology which extends the W3C EMMA annotation framework and uses MPEG-7 annotations. In addition, we describe an extension to ontological result structures where automatic and context-based sorting mechanisms can be naturally incorporated.

pdf bib
Elaborating the parameterized Equivalence Class Method for Dutch
Nicole Grégoire

This paper discusses the parameterized Equivalence Class Method for Dutch, an approach developed to incorporate standard lexical representations of Dutch idioms into the representations required by any specific NLP system with as little manual work as possible. The purpose of the paper is to give an overview of the parameters applicable to Dutch, which are determined by examining a large set of data and two Dutch NLP systems. The effects of the introduced parameters are evaluated and the results presented.

pdf bib
Developing a Contextualized Multimodal Corpus for Human-Robot Interaction
Anders Green | Helge Hüttenrauch | Elin Anna Topp | Kerstin Severinson

This paper describes the development process of a contextualized corpus for research on Human-Robot Communication. The data have been collected in two Wizard-of-Oz user studies performed with 22 and 5 users respectively in a scenario that is called the HomeTour. In this scenario the users show the environment (a single room, or a whole floor) to the robot using a combination of speech and gestures. The corpus has been transcribed and annotated with respect to gestures and conversational acts, thus forming a core annotation. We have also annotated or linked other types of data, e.g., laser range finder readings, positioning analysis, questionnaire data and task descriptions that form the annotated context of the scenario. By providing a rich set of different annotated data, the corpus is thus an important resource both for research on natural language speech interfaces for robots and for research on human-robot communication in general.

pdf bib
Uniform and Effective Tagging of a Heterogeneous Giga-word Corpus
Wei-Yun Ma | Chu-Ren Huang

Tagging, as the most crucial annotation of language resources, can still be challenging when the corpus size is big and when the corpus data is not homogeneous. The Chinese Gigaword Corpus is confounded by both challenges. The corpus contains roughly 1.12 billion Chinese characters from two heterogeneous sources: news from Taiwan and from Mainland China, respectively. In other words, in addition to its size, the data also contains two variants of Chinese that are known to exhibit substantial linguistic differences. We utilize the Chinese Sketch Engine as the corpus query tool, by which the grammatical behaviours of the two heterogeneous resources can be captured and displayed in a unified web interface. In this paper, we report our answer to the two challenges of effectively tagging this large-scale corpus. The evaluation results show that our tagging mechanism maintains high annotation quality.

pdf bib
Romanian Valence Dictionary in XML Format
Ana-Maria Barbu | Emil Ionescu | Verginica Barbu Mititelu

Valence dictionaries are dictionaries in which logical predicates (most of the time verbs) are inventoried along with the semantic and syntactic information regarding the role of the arguments with which they combine, as well as the syntactic restrictions these arguments have to obey. In this article we present the incipient stage of the project “Syntactic and semantic database in XML format: an HPSG representation of verb valences in Romanian”. Its aim is the development of a valence dictionary in XML format for a set of 3000 Romanian verbs. Valences are specified for each sense of each verb, along with an illustrative example, possible argument alternations and a set of multiword expressions in which the respective verb occurs with the respective sense. The grammatical formalism we make use of is Head-driven Phrase Structure Grammar, which offers one of the most comprehensive frames for encoding various types of linguistic information for lexical items. XML is the most appropriate mark-up language for describing information structured in the HPSG framework. The project can later be extended to cover all Romanian verbs (around 7000) and also other predicates (nouns, adjectives, prepositions).

pdf bib
Field Evaluation of a Single-Word Pronunciation Training System
Niels Ole Bernsen | Thomas K. Hansen | Svend Kiilerich | Torben Kruchov Madsen

Many learning tasks require substantial skills training. Ideally, the student might benefit the most from having a human expert – a teacher or trainer – at hand throughout, but human expertise remains a scarce resource. The second-best solution could be to do skills training with a computer-based self-training system. This vision of the computer as tutor currently motivates increasing efforts world-wide, in all manner of fields, including that of computer-assisted language learning, or CALL. But, as pointed out by Hincks [2003], along with the growth of the CALL area comes a growing need for empirical evidence that CALL systems have a beneficial effect. This point is reiterated by Chapelle [2002] who defines the goal for Computer Assisted Second Language Research as the gathering of evidence for the effect of CALL and instructional design. This paper presents results of a field test of our pronunciation training system which enables immigrants and others to self-train their pronunciation skills of single Danish words.

pdf bib
Mining Implicit Entities in Queries
Wei Li | Wenjie Li | Qin Lu

Entities are pivotal in describing events and objects, and are also very important in Document Summarization. In general, only explicit entities which can be extracted by a Named Entity Recognizer are used in real applications. However, implicit entities hidden behind phrases or words, e.g. the entity referred to by the phrase “cross border”, have proved to be helpful in Document Summarization. In our experiment, we extract the implicit entities from web resources.

pdf bib
Dependency-structure Annotation to Corpus of Spontaneous Japanese
Kiyotaka Uchimoto | Ryoji Hamabe | Takehiko Maruyama | Katsuya Takanashi | Tatsuya Kawahara | Hitoshi Isahara

In Japanese, syntactic structure of a sentence is generally represented by the relationship between phrasal units, or bunsetsus in Japanese, based on a dependency grammar. In the same way, the syntactic structure of a sentence in a large, spontaneous, Japanese-speech corpus, the Corpus of Spontaneous Japanese (CSJ), is represented by dependency relationships between bunsetsus. This paper describes the criteria and definitions of dependency relationships between bunsetsus in the CSJ. The dependency structure of the CSJ is investigated, and the difference in the dependency structures of written text and spontaneous speech is discussed in terms of the dependency accuracies obtained by using a corpus-based model. It is shown that the accuracy of automatic dependency-structure analysis can be improved if characteristic phenomena of spontaneous speech such as self-corrections, basic utterance units in spontaneous speech, and bunsetsus that have no modifiee are detected and used for dependency-structure analysis.

pdf bib
Structure, Annotation and Tools in the Basque ZT Corpus
N. Areta | A. Gurrutxaga | I. Leturia | Z. Polin | R. Saiz | I. Alegria | X. Artola | A. Diaz de Ilarraza | N. Ezeiza | A. Sologaistoa | A. Soroa | A. Valverde

The ZT corpus (Basque Corpus of Science and Technology) is a tagged collection of specialized texts in Basque, intended to be a main resource for research and development on written technical Basque: terminology, syntax and style. It will be the first written corpus in Basque distributed by ELDA (at the end of 2006), and it is intended to serve as a methodological and functional reference for new projects in the future (e.g. a national corpus for Basque). We also present the technology and the tools used to build this corpus. These tools, Corpusgile and Eulia, provide a flexible and extensible infrastructure for creating, visualizing and managing corpora and for consulting, visualizing and modifying annotations generated by linguistic tools.

pdf bib
Learning Database Content for Spoken Dialogue System Design
Joseph Polifroni | Marilyn Walker

Spoken dialogue systems are common interfaces to backend data in information retrieval domains. As more data is made available on the Web and IE technology matures, dialogue systems, whether they be speech- or text-based, will be more in demand to provide user-friendly access to this data. However, dialogue systems must become both easier to configure, as well as more informative than the traditional form-based systems that are currently available. We present techniques in this paper to address the issue of automating both content selection for use in summary responses and in system initiative queries.

pdf bib
Bilingual Machine-Aided Indexing
Jorge Civera | Alfons Juan

The proliferation of multilingual documentation in our Information Society has become a common phenomenon. This documentation is usually categorised by hand, a time-consuming and arduous burden. This is particularly true in the case of keyword assignment, in which a list of keywords (descriptors) from a controlled vocabulary (thesaurus) is assigned to a document. A possible solution to alleviate this problem comes from so-called Machine-Aided Indexing (MAI) systems. These systems work in cooperation with professional indexers by providing an initial list of descriptors from which the most appropriate ones will be selected. This way of proceeding increases productivity and eases the task of indexers. In this paper, we propose a statistical text classification framework for bilingual documentation, from which we derive two novel bilingual classifiers based on the naive combination of monolingual classifiers. We report preliminary results on the multilingual Acquis Communautaire (AC) corpus that demonstrate the suitability of the proposed classifiers as the backend of a fully-working MAI system.
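
One possible reading of a "naive combination of monolingual classifiers" is sketched below: a small multinomial Naive Bayes classifier is trained per language, and the per-language scores are simply added in log space. This is only an illustration under that assumption; the descriptors, training snippets and smoothing are invented and do not come from the AC corpus.

    import math
    from collections import Counter, defaultdict

    class NB:
        """Minimal multinomial Naive Bayes; descriptors (keywords) are the classes."""
        def fit(self, docs, labels):
            self.prior = Counter(labels)
            self.word_counts = defaultdict(Counter)
            self.vocab = set()
            for doc, lab in zip(docs, labels):
                tokens = doc.split()
                self.word_counts[lab].update(tokens)
                self.vocab.update(tokens)
            self.classes = set(labels)
            return self

        def log_score(self, doc, lab):
            total = sum(self.word_counts[lab].values()) + len(self.vocab)  # add-one smoothing
            score = math.log(self.prior[lab] / sum(self.prior.values()))
            return score + sum(math.log((self.word_counts[lab][w] + 1) / total)
                               for w in doc.split())

    # One classifier per language, trained on hypothetical parallel snippets sharing a descriptor.
    en = NB().fit(["fisheries quota regulation", "budget deficit procedure"],
                  ["FISHERIES", "FINANCE"])
    es = NB().fit(["reglamento de cuotas pesqueras", "procedimiento de déficit presupuestario"],
                  ["FISHERIES", "FINANCE"])

    def bilingual_descriptor(doc_en, doc_es):
        # Naive combination: add the per-language log-scores (multiply probabilities).
        return max(en.classes, key=lambda c: en.log_score(doc_en, c) + es.log_score(doc_es, c))

    print(bilingual_descriptor("quota for fisheries", "cuotas para la pesca"))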

pdf bib
Integrating Methods and LRs for Automatic Keyword Extraction from Open Domain Texts
Alessandro Panunzi | Marco Fabbri | Massimo Moneglia

The paper presents a tool for keyword extraction from multilingual resources developed within the AXMEDIS project. In this tool, lexical collocations (Sinclair, 1991) internal to documents are used to enhance the performance obtained through a standard statistical procedure. A first set of mono-term keywords is extracted through the TF.IDF algorithm (Salton, 1989). The internal analysis of the document generates a second set of multi-term keywords based on the first set, rather than on multi-term frequency comparison with a general resource (Witten et al., 1999). Collocations in which a mono-term keyword occurs as the head are considered multi-term keywords, and are assumed to improve the identification of the content. The evaluation compares the results of the TF.IDF procedure and those obtained with the enhanced procedure in terms of “precision”. Each set of keywords was rated from the point of view of a possible user with regard to: (a) the overall efficiency of the whole set of keywords for the identification of the content; (b) the adequacy of each extracted keyword. Results show that multi-term keywords improve content identification by a relative factor of 100% and that adequacy is enhanced in 33% of cases.
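
A minimal sketch of the TF.IDF ranking step referred to above (Salton, 1989), using an invented three-document toy collection; the collocation-based multi-term extension described in the abstract is not shown.

```python
import math
from collections import Counter

def tf_idf_keywords(documents, doc_index, top_n=5):
    """Rank the terms of one document by TF.IDF against the collection."""
    tokenised = [doc.lower().split() for doc in documents]
    doc_freq = Counter()
    for tokens in tokenised:
        doc_freq.update(set(tokens))          # document frequency per term
    n_docs = len(documents)
    tf = Counter(tokenised[doc_index])        # term frequency in the target doc
    scores = {term: count * math.log(n_docs / doc_freq[term])
              for term, count in tf.items()}
    return sorted(scores, key=scores.get, reverse=True)[:top_n]

docs = [
    "digital rights management for multimedia content distribution",
    "the weather in Genoa was pleasant during the conference",
    "multimedia content is protected by digital rights management tools",
]
print(tf_idf_keywords(docs, 0))
```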

pdf bib
Component Evaluation in a Question Answering System
Luís Fernando Costa | Luís Sarmento

Automatic question answering (QA) is a complex task which lies at the crossroads of Natural Language Processing, Information Retrieval and Human Computer Interaction. A typical QA system has four modules - question processing, document retrieval, answer extraction and answer presentation. In each of these modules, a multitude of tools can be used. Therefore, the performance evaluation of each of these components is of great importance in order to check their impact on the global performance, and to conclude whether these components are necessary, need to be improved or should be substituted. This paper describes some experiments performed in order to evaluate several components of the question answering system Esfinge. We describe the experimental set-up and present the results of error analysis based on the runtime logs of Esfinge. We present the results of component analysis, which provides good insights into the importance of the individual components and pre-processing modules at various levels, namely stemming, named-entity recognition, PoS filtering and filtering of undesired answers. We also present the results of substituting the document source in which Esfinge tries to find possible answers and compare the results obtained using web sources such as Google, Yahoo and BACO, a large database of web documents in Portuguese.

pdf bib
Set-up of a Unit-Selection Synthesis with a Prominent Voice
Stefan Breuer | Sven Bergmann | Ralf Dragon | Sebastian Möller

In this paper, we describe the set-up process and an initial evaluation of a unit-selection speech synthesizer. The synthesizer is unusual in that it is intended to speak with a prominent voice. As a consequence, only very limited resources were available for setting up the unit database. These resources have been extracted from an audio book, segmented with the help of an HMM-based wrapper, and then used with the non-uniform unit-selection approach implemented in the Bonn Open Synthesis System (BOSS). In order to adapt the database to the BOSS implementation, the label files were amended with phrase boundaries, converted to XML, amended with prosodic and spectral information, and then further converted to a MySQL relational database structure. The BOSS system selects units on the basis of this information, adding individual unit costs to the concatenation costs given by MFCC and F0 distances. The paper discusses the problems which occurred during the database set-up, the effort invested, as well as the quality level which can be reached by this approach.

pdf bib
A Deep Linguistic Analysis for Cross-language Information Retrieval
Nasredine Semmar | Meriama Laib | Christian Fluhr

Cross-language information retrieval consists of providing a query in one language and searching for documents in one or more different languages. These documents are ordered by the probability of being relevant to the user's request. The highest ranked document is considered to be the most likely relevant document. The LIC2M cross-language information retrieval system is a weighted Boolean search engine based on a deep linguistic analysis of the query and the documents. This system is composed of a linguistic analyzer, a statistical analyzer, a reformulator, a comparator and a search engine. The linguistic analysis processes both the documents to be indexed and the queries to extract concepts representing their content. This analysis includes a morphological analysis, part-of-speech tagging and a syntactic analysis. In this paper, we present the deep linguistic analysis used in the LIC2M cross-lingual search engine, and we particularly focus on the impact of the syntactic analysis on retrieval effectiveness.

pdf bib
Annotating COMPARA, a Grammar-aware Parallel Corpus
Diana Santos | Susana Inácio

In this paper we describe the annotation of COMPARA, currently the largest post-edited parallel corpus that includes Portuguese. We describe the motivation, the results so far, and the way the corpus is being annotated. We also provide the first grounded results about syntactic ambiguity in Portuguese. Finally, we discuss some interesting problems in this connection.

pdf bib
EuroTermBank - a Terminology Resource based on Best Practice
Lina Henriksen | Claus Povlsen | Andrejs Vasiljevs

The new EU member countries face the problems of terminology resource fragmentation and lack of coordination in terminology development in general. The EuroTermBank project aims to contribute to improving the terminology infrastructure of the new EU countries, and the project will result in a centralized online terminology bank - interlinked to other terminology banks and resources - for the languages of the new EU member countries. The main focus of this paper is a description of how to identify best practice within terminology work seen from a broad perspective. Surveys of real-life terminology work have been conducted, and these surveys have resulted in the identification of scenario-specific best practice descriptions of terminology work. Furthermore, this paper presents an outline of the specific criteria that have been used for the selection of existing term resources to be included in the EuroTermBank database.

pdf bib
Open Resources and Tools for the Shallow Processing of Portuguese: The TagShare Project
Florbela Barreto | António Branco | Eduardo Ferreira | Amália Mendes | Maria Fernanda Bacelar do Nascimento | Filipe Nunes | João Ricardo Silva

This paper presents the TagShare project and the linguistic resources and tools for the shallow processing of Portuguese developed in its scope. These resources include a 1 million token corpus that has been accurately hand annotated with a variety of linguistic information, as well as several state-of-the-art shallow processing tools capable of automatically producing that type of annotation. At present, the linguistic annotations in the corpus are sentence and paragraph boundaries, token boundaries, morphosyntactic POS categories, values of inflection features, lemmas and named entities. Hence, the set of tools comprises a sentence chunker, a tokenizer, a POS tagger, nominal and verbal analyzers and lemmatizers, a verbal conjugator, a nominal “inflector”, and a named-entity recognizer, some of which underlie several online services.

pdf bib
Corpógrafo V3 - From Terminological Aid to Semi-automatic Knowledge Engineering
Luís Sarmento | Belinda Maia | Diana Santos | Ana Pinto | Luís Cabral

In this paper we will present Corpógrafo, a mature web-based environment for working with corpora, for terminology extraction, and for ontology development. We will explain Corpógrafo’s workflow and describe the most important information extraction methods used, namely its term extraction, and definition / semantic relations identification procedures. We will describe current Corpógrafo users and present a brief overview of the XML format currently used to export terminology databases. Finally, we present future improvements for this tool.

pdf bib
On the data base of Romanian syllables and some of its quantitative and cryptographic aspects
Liviu Dinu | Anca Dinu

In this paper we argue for the need to construct a data base of Romanian syllables. We explain the reasons for our choice of the DOOM corpus which we have used. We describe the way syllabification was performed and explain how we have constructed the data base. The main quantitative aspects extracted from our research are presented. We also computed the entropy of the syllables and the entropy of the syllables w.r.t. the consonant-vowel structure. The results are compared with results of similar studies carried out for other languages.
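
For readers unfamiliar with the entropy figures mentioned above, the sketch below computes Shannon entropy over a hypothetical syllable frequency list and over its consonant-vowel (CV) skeleton; the syllables and counts are illustrative and not taken from the DOOM-based database.

```python
import math
from collections import Counter

def entropy(counts):
    """Shannon entropy (in bits) of a frequency distribution."""
    total = sum(counts.values())
    return -sum((c / total) * math.log2(c / total) for c in counts.values())

def cv_skeleton(syllable, vowels="aeiou"):
    """Reduce a syllable to its consonant-vowel structure, e.g. 'stri' -> 'CCCV'."""
    return "".join("V" if ch in vowels else "C" for ch in syllable)

# Hypothetical syllable frequencies extracted from a syllabified word list
syllable_counts = Counter({"re": 120, "lu": 45, "ca": 200, "stri": 8, "ne": 95})
print(f"Syllable entropy: {entropy(syllable_counts):.3f} bits")

cv_counts = Counter()
for syl, freq in syllable_counts.items():
    cv_counts[cv_skeleton(syl)] += freq
print(f"CV-structure entropy: {entropy(cv_counts):.3f} bits")
```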

pdf bib
A Dependency-based Algorithm for Grammar Conversion
Alessandro Bahgat Shehata | Fabio Massimo Zanzotto

In this paper we present a model to transfer one grammatical formalism into another. The model is applicable only under restrictive conditions. However, it is fairly useful for many purposes: parsing evaluation, research on methods for truly combining different parsing outputs to reach better parsing performance, and building larger syntactically annotated corpora for data-driven approaches. The model has been tested on a case study: the translation of the Turin Tree Bank Grammar into the Shallow Grammar of the CHAOS Italian parser.

pdf bib
Ontological and Terminological Commitments and the Discourse of Specialist Communities
Khurshid Ahmad | Maria Teresa Musacchio | Giuseppe Palumbo

The paper presents a corpus-based study aimed at analysing ontological and terminological commitments in the discourse of specialist communities. The analyzed corpus contains the lectures delivered by the Nobel Prize winners in Physics and Economics. The analysis focuses on (a) the collocational use of automatically identified domain-specific terms and (b) a description of meta-discourse in the lectures. Candidate terms are extracted based on the z-score of frequency and on weirdness. Compounds comprising these candidate terms are then identified using the ontology representation system Protégé. This method is then replicated to complete the analysis by including an investigation of metadiscourse markers signalling how writers project themselves into their work.
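
The weirdness measure mentioned above is commonly defined as the ratio of a term's relative frequency in the specialist corpus to its relative frequency in a general-language corpus; the sketch below shows that ratio (with simple add-one smoothing), using invented counts rather than the Nobel-lecture data.

```python
def weirdness(term, specialist_freq, specialist_size, general_freq, general_size):
    """Ratio of relative frequencies, specialist vs. general corpus; values
    well above 1 suggest a domain-specific term (add-one smoothing on the
    general count avoids division by zero)."""
    return (specialist_freq[term] / specialist_size) / \
           ((general_freq.get(term, 0) + 1) / general_size)

# Invented frequency counts for illustration only
spec = {"quark": 250, "market": 40, "the": 9000}
gen = {"market": 5000, "the": 600000}
for term in spec:
    print(term, round(weirdness(term, spec, 150_000, gen, 10_000_000), 1))
```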

pdf bib
LexiPass methodology: a conceptual path from frames to senses and back
Alessandro Oltramari

In this paper we claim that an integration of FrameNet and WordNet will improve the interoperability, user-friendliness and usability of both lexical resources. While the former provides a sophisticated representational structure but only narrow lexical coverage, the latter supplies a dense network of word senses and semantic relations, although it does not support advanced accessibility (i.e., via frames). In line with this integration perspective, we introduce the LexiPass methodology, which combines Burckardt’s tool “WordNet Detour of FrameNet” with basic statistical analysis, enabling frame-guided search and extraction of domain synsets from WordNet.

pdf bib
Fear-type emotions of the SAFE Corpus: annotation issues
Chloé Clavel | Ioana Vasilescu | Laurence Devillers | Thibaut Ehrette | Gaël Richard

The present research focuses on annotation issues in the context of the acoustic detection of fear-type emotions for surveillance applications. The emotional speech material used for this study comes from the previously collected SAFE Database (Situation Analysis in a Fictional and Emotional Database), which consists of audio-visual sequences extracted from fiction movies. A generic annotation scheme was developed to annotate the various emotional manifestations contained in the corpus. The annotation was carried out by two labellers and the two annotation strategies are compared. It emerges that the borderline between emotion and neutral varies according to the labeller. An acoustic validation by a third labeller allows an analysis of the two strategies. Two human strategies are observed: a first one, context-oriented, which mixes audio and contextual (video) information in emotion categorization; and a second one based mainly on audio information. The k-means clustering confirms the role of audio cues in human annotation strategies. It particularly helps in evaluating those strategies from the point of view of a detection system based on audio cues.

pdf bib
Training a Statistical Machine Translation System without GIZA++
Arne Mauser | Evgeny Matusov | Hermann Ney

The IBM Models (Brown et al., 1993) enjoy great popularity in the machine translation community because they offer high quality word alignments and a free implementation is available with the GIZA++ Toolkit (Och and Ney, 2003). Several methods have been developed to overcome the asymmetry of the alignment generated by the IBM Models. A remaining disadvantage, however, is the high model complexity. This paper describes a word alignment training procedure for statistical machine translation that uses a simple and clear statistical model, different from the IBM models. The main idea of the algorithm is to generate a symmetric and monotonic alignment between the target sentence and a permutation graph representing different reorderings of the words in the source sentence. The quality of the generated alignment is shown to be comparable to the standard GIZA++ training in an SMT setup.

pdf bib
Regional Bias in the Broad Phonetic Transcriptions of the Spoken Dutch Corpus
Evie Coussé | Steven Gillis

In this paper, we assess an aspect of the quality of the broad phonetic transcriptions in the Spoken Dutch Corpus (CGN). The corpus contains speech from native speakers of Dutch originating from The Netherlands and the Dutch speaking part of Belgium. The phonetic transcriptions were made by transcribers from both regions. In previous research, we have identified regional differences in the transcribers' behaviour. In this paper, we explore the precise sources of the regional bias in the CGN transcriptions and we evaluate its impact on the phonetic transcriptions. More specifically, (1) the regional bias in the canonical transcriptions that served as the basis for the verification task of the transcribers is critically analysed, and (2) we verify in an experiment the regional bias introduced by the transcribers themselves. The possible effects of this inherent regional bias in the CGN transcriptions on subsequent linguistic analyses are briefly discussed.

pdf bib
Test Collections for Patent Retrieval and Patent Classification in the Fifth NTCIR Workshop
Atsushi Fujii | Makoto Iwayama | Noriko Kando

This paper describes the test collections produced for the Patent Retrieval Task in the Fifth NTCIR Workshop. We performed the invalidity search task, in which each participant group searches a patent collection for the patents that can invalidate the demand in an existing claim. For this purpose, we performed both document and passage retrieval tasks. We also performed the automatic patent classification task using the F-term classification system. The test collections will be available to the public for research purposes.

pdf bib
Evaluating Morphosyntactic Tagging of Croatian Texts
Željko Agić | Marko Tadić

This paper describes the results of the first successful effort to apply a stochastic strategy – namely, a second-order Markov model paradigm implemented by the TnT trigram tagger – to the morphosyntactic tagging of Croatian texts. Besides the tagger, for purposes of both training and testing, we had at our disposal only a 100 Kw Croatia Weekly newspaper subcorpus, manually tagged using approximately 1000 different MULTEXT-East v3 morphosyntactic tags. The test basically consisted of randomly assigning a variable-size portion of the corpus to the tagger’s training procedure and another fixed-size portion, 10% of the corpus, to the tagging procedure itself; this method allowed us not only to provide preliminary results regarding tagger accuracy on Croatian texts, but also to inspect the behaviour of the stochastic tagging paradigm in general. The results taken from the test case providing 90% of the corpus for training varied from around 86% in the worst-case scenario up to a peak of around 95% of correctly assigned full MSD tags. Results on PoS alone, as expected, reached the human error level, with TnT correctly tagging above 98% of the test sets on average. Most MSD errors occurred on types with the highest number of candidate tags per word form – nouns, pronouns and adjectives – while errors on PoS, although following the same pattern, were almost insignificant. Detailed insight into tagging and F-measures for all PoS categories are provided in the course of the paper along with other facts of interest.
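
As a sketch of the evaluation protocol described above (a fixed 10% test portion plus a variable-size training portion, scored at the token level), the snippet below shows one plausible way to draw the splits and compute accuracy; the function names, data layout and example tags are assumptions, not the authors' code.

```python
import random

def split_corpus(sentences, train_fraction, test_fraction=0.10, seed=0):
    """Randomly reserve a fixed-size test portion and a variable-size
    training portion from the remainder of the corpus."""
    rng = random.Random(seed)
    shuffled = sentences[:]
    rng.shuffle(shuffled)
    n_test = int(len(shuffled) * test_fraction)
    n_train = int(len(shuffled) * train_fraction)
    return shuffled[n_test:n_test + n_train], shuffled[:n_test]

def tagging_accuracy(gold_sentences, predicted_sentences):
    """Token-level accuracy over parallel lists of (word, tag) sequences."""
    pairs = [(g[1], p[1])
             for gold, pred in zip(gold_sentences, predicted_sentences)
             for g, p in zip(gold, pred)]
    return sum(g == p for g, p in pairs) / len(pairs)

# Toy usage with invented (word, tag) pairs in a MULTEXT-East-like style
gold = [[("kuća", "Ncfsn"), ("je", "Vcr3s")], [("velika", "Afpfsn")]]
pred = [[("kuća", "Ncfsn"), ("je", "Qz")], [("velika", "Afpfsn")]]
print(round(tagging_accuracy(gold, pred), 2))
```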

pdf bib
Extending the Wizard of Oz Methodology for Multimodal Language-enabled Systems
Martin Rajman | Marita Ailomaa | Agnes Lisowska | Miroslav Melichar | Susan Armstrong

In this paper we present a proposal for extending the standard Wizard of Oz experimental methodology to language-enabled multimodal systems. We first discuss how Wizard of Oz experiments involving multimodal systems differ from those involving voice-only systems. We then go on to discuss the Extended Wizard of Oz methodology and the Wizard of Oz testing environment and protocol that we have developed. We then describe an example of applying this methodology to Archivus, a multimodal system for multimedia meeting retrieval and browsing. We focus in particular on the tools that the wizards would need to successfully and efficiently perform their tasks in a multimodal context. We conclude with some general comments about which questions need to be addressed when developing and using the Wizard of Oz methodology for testing multimodal systems.

pdf bib
Syntactic Annotation of Large Corpora in STEVIN
Gertjan van Noord | Ineke Schuurman | Vincent Vandeghinste

The construction of a 500-million-word reference corpus of written Dutch has been identified as one of the priorities of the Dutch/Flemish STEVIN programme. For part of this corpus, manually corrected syntactic annotations will be provided. The paper presents the background of the syntactic annotation efforts, the Alpino parser which is used as an important tool for constructing the syntactic annotations, as well as a number of other annotation tools and guidelines. For the full STEVIN corpus, automatically derived syntactic annotations will be provided in a later phase of the programme. A number of arguments are provided suggesting that such a resource can be very useful for applications in information extraction, ontology building, lexical acquisition, machine translation and corpus linguistics.

pdf bib
Mining Knowledge fromWikipedia for the Question Answering task
Davide Buscaldi | Paolo Rosso

Although significant advances have been made recently in Question Answering technology, more steps have to be taken in order to obtain better results. Moreover, the best systems at the CLEF and TREC evaluation exercises are very complex systems based on custom-built, expensive ontologies whose aim is to provide the systems with encyclopedic knowledge. In this paper we investigated the use of Wikipedia, the open domain encyclopedia, for the Question Answering task. Previous works considered Wikipedia as a resource in which to look for the answers to the questions. We focused on some different aspects of the problem, such as the validation of the answers returned by our Question Answering System and the use of Wikipedia “categories” in order to determine a set of patterns that should fit the expected answer. Validation consists in, given a possible answer, saying whether it is the right one or not. The possibility of exploiting the categories of Wikipedia had not been considered until now. We performed our experiments using the Spanish version of Wikipedia, with the set of questions of the last CLEF Spanish monolingual exercise. Results show that Wikipedia is a potentially useful resource for the Question Answering task.

pdf bib
Human Verb Associations as the Basis for Gold Standard Verb Classes: Validation against GermaNet and FrameNet
Sabine Schulte im Walde

We describe a gold standard for semantic verb classes which is based on human associations to verbs. The associations were collected in a web experiment and then applied as verb features in a hierarchical cluster analysis. We claim that the resulting classes represent a theory-independent gold standard classification which covers a variety of semantic verb relations, and whose features can be used to guide the feature selection in automatic processes. To evaluate our claims, the association-based classification is validated against two standard approaches to semantic verb classes, GermaNet and FrameNet.

pdf bib
A Grapheme-Based Approach for Accent Restoration in Gikuyu
Peter W. Wagacha | Guy De Pauw | Pauline W. Githinji

The orthography of Gikuyu includes a number of accented characters to represent the entire vowel system. These characters are however not readily available on standard computer keyboards and are usually represented as the nearest available character. This can render reading and understanding written texts more difficult. This paper describes a system that is able to automatically place these accents in Gikuyu text on the basis of local graphemic context. This approach avoids the need for an extensive digital lexicon, typically not available for resource-scarce languages. Using an extended trigram-based approach, the experiments show that this method can achieve very high accuracy even with a limited amount of digitally available textual data. The experiments on Gikuyu are contrasted with experiments on French, German and Dutch.
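
A very small sketch of a context-based restoration approach in the spirit of the one described above: it learns, for each ambiguous unaccented character, the most frequent surface form given one character of graphemic context on each side. The accented-character mapping, the window size and the tiny training data are illustrative assumptions, not the paper's exact model.

```python
from collections import Counter, defaultdict

# Hypothetical mapping from accented vowels to their unaccented substitutes
ACCENTED = {"ĩ": "i", "ũ": "u"}

def train_restorer(accented_texts, n=1):
    """For each occurrence of an ambiguous plain character, record which
    surface form appeared, keyed by n characters of unaccented context."""
    model = defaultdict(Counter)
    ambiguous = set(ACCENTED.values())
    for text in accented_texts:
        plain = "".join(ACCENTED.get(ch, ch) for ch in text)
        for i, (surface, p) in enumerate(zip(text, plain)):
            if p in ambiguous:
                context = (plain[max(0, i - n):i], p, plain[i + 1:i + 1 + n])
                model[context][surface] += 1
    return model

def restore(plain_text, model, n=1):
    """Replace each ambiguous character by its most likely surface form."""
    out, ambiguous = [], set(ACCENTED.values())
    for i, ch in enumerate(plain_text):
        if ch in ambiguous:
            context = (plain_text[max(0, i - n):i], ch, plain_text[i + 1:i + 1 + n])
            seen = model.get(context)
            out.append(seen.most_common(1)[0][0] if seen else ch)
        else:
            out.append(ch)
    return "".join(out)

model = train_restorer(["mũndũ", "andũ"])   # hypothetical training snippets
print(restore("mundu", model))
```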

pdf bib
MEDUSA: User-Centred Design and usability evaluation of Automatic Speech Recognition telephone services in Telefónica Móviles España
Juan José Rodríguez Soler | Pedro Concejero Cerezo | Daniel Tapias Merino | José Sánchez

One of the greatest challenges in the design of speech-recognition-based interfaces concerns navigation through the different service hierarchies and structures. On the one hand, interactions based on human-machine dialogues force a high level of hierarchical structuring of services; on the other hand, it is necessary to wait for the last phases of user interface development to obtain, by means of user trials, a global view of the dialogue problems. To tackle these problems, Telefónica Móviles España has carried out several projects with the final aim of defining a corporate methodology based on rapid prototyping of user interfaces, so that designers can integrate the design of voice interfaces with emulations of navigation through the flow charts. This was also the starting point for a specific software product (MEDUSA) which addresses the need for rapid prototyping of these user interfaces from the earliest stages of the design and analysis phases.

pdf bib
The SALSA Corpus: a German Corpus Resource for Lexical Semantics
Aljoscha Burchardt | Katrin Erk | Anette Frank | Andrea Kowalski | Sebastian Padó | Manfred Pinkal

This paper describes the SALSA corpus, a large German corpus manually annotated with role-semantic information, based on the syntactically annotated TIGER newspaper corpus. The first release, comprising about 20,000 annotated predicate instances (about half the TIGER corpus), is scheduled for mid-2006. In this paper we discuss the annotation framework (frame semantics) and its cross-lingual applicability, problems arising from exhaustive annotation, strategies for quality control, and possible applications.

pdf bib
The JRC-Acquis: A Multilingual Aligned Parallel Corpus with 20+ Languages
Ralf Steinberger | Bruno Pouliquen | Anna Widiger | Camelia Ignat | Tomaž Erjavec | Dan Tufiş | Dániel Varga

We present a new, unique and freely available parallel corpus containing European Union (EU) documents of mostly legal nature. It is available in all 20 official EU languages, with additional documents being available in the languages of the EU candidate countries. The corpus consists of almost 8,000 documents per language, with an average size of nearly 9 million words per language. Pair-wise paragraph alignment information produced by two different aligners (Vanilla and HunAlign) is available for all 190+ language pair combinations. Most texts have been manually classified according to the EUROVOC subject domains so that the collection can also be used to train and test multi-label classification algorithms and keyword-assignment software. The corpus is encoded in XML, according to the Text Encoding Initiative Guidelines. Due to the large number of parallel texts in many languages, the JRC-Acquis is particularly suitable to carry out all types of cross-language research, as well as to test and benchmark text analysis software across different languages (for instance for alignment, sentence splitting and term extraction).

pdf bib
SALTO - A Versatile Multi-Level Annotation Tool
Aljoscha Burchardt | Katrin Erk | Anette Frank | Andrea Kowalski | Sebastian Pado

In this paper, we describe the SALTO tool. It was originally developed for the annotation of semantic roles in the frame semantics paradigm, but can be used for graphical annotation of treebanks with general relational information in a simple drag-and-drop fashion. The tool additionally supports corpus management and quality control.

pdf bib
KNACK-2002: a Richly Annotated Corpus of Dutch Written Text
Véronique Hoste | Guy De Pauw

In this paper, we introduce the annotated KNACK-2002 corpus of Dutch written text. The corpus features five different annotation layers, ranging from the annotation of morphological boundaries at the word level, over the annotation of part-of-speech tags and phrase chunks at the syntactic level to the annotation of named entities at the semantic level and coreferential relations at the discourse level. We believe the corpus is unique in the Dutch language area because of its richness of annotation layers, providing researchers with a useful gold standard data set for different NLP tasks in the domains of morphology, (morpho)syntax, semantics and discourse.

pdf bib
Finding the Appropriate Generalization Level for Binary Ontological Relations Extracted from the Genia Corpus
P. Cimiano | M. Hartung | E. Ratsch

Recent work has aimed at discovering ontological relations from text corpora. Most approaches are based on the assumption that verbs typically indicate semantic relations between concepts. However, the problem of finding the appropriate generalization level for the verb's arguments with respect to a given taxonomy has not received much attention in the ontology learning community. In this paper, we address the issue of determining the appropriate level of abstraction for binary relations extracted from a corpus with respect to a given concept hierarchy. For this purpose, we reuse techniques from the subcategorization and selectional restrictions acquisition communities. The contribution of our work lies in the systematic analysis of three different measures. We conduct our experiments on the Genia corpus and the Genia ontology and evaluate the different measures by comparing the results of our approach with a gold standard provided by one of the authors, a biologist.

pdf bib
Transcription Cost Reduction for Constructing Acoustic Models Using Acoustic Likelihood Selection Criteria
Tomoyuki Kato | Tomiki Toda | Hiroshi Saruwatari | Kiyohiro Shikano

This paper describes a novel method for reducing the transcription effort in the construction of task-adapted acoustic models for a practical automatic speech recognition (ASR) system. We have to prepare actual data samples collected in the practical system and transcribe them for training the task-adapted acoustic models. However, transcribing utterances is a time-consuming and laborious process. In the proposed method, we first adapt initial models to the acoustic environment of the system using a small number of collected data samples with transcriptions. We then automatically select informative training data samples to be transcribed from a large speech corpus based on the acoustic likelihoods of the models. We performed several experimental evaluations in the framework of “Takemarukun”, a practical speech-oriented guidance system. Experimental results show that 1) utterance sets with low likelihoods yield better task-adapted models than those with high likelihoods, although the set with the lowest likelihoods degrades performance because it includes outliers, and 2) MLLR adaptation is effective for training the task-adapted models when the amount of transcribed data is small, while EM training outperforms MLLR if we transcribe more than around 10,000 utterances.
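
A minimal sketch of the likelihood-based selection idea described above: rank untranscribed utterances by average acoustic log-likelihood under the adapted models, discard a small fraction of the very lowest-scoring ones as probable outliers, and pass the next lowest to transcription up to the available budget. Field names, the skip fraction and the scores are invented.

```python
def select_for_transcription(utterances, budget, skip_lowest_fraction=0.02):
    """Pick low-likelihood utterances for transcription, skipping the very
    lowest-scoring ones, which are likely to be outliers."""
    ranked = sorted(utterances, key=lambda u: u["avg_log_likelihood"])
    n_skip = int(len(ranked) * skip_lowest_fraction)
    return ranked[n_skip:n_skip + budget]

pool = [
    {"id": "utt001", "avg_log_likelihood": -71.2},
    {"id": "utt002", "avg_log_likelihood": -55.8},
    {"id": "utt003", "avg_log_likelihood": -93.4},
    {"id": "utt004", "avg_log_likelihood": -62.1},
]
print([u["id"] for u in select_for_transcription(pool, budget=2)])
```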

pdf bib
Part-of-Speech Tagging of Transcribed Speech
Margot Mieskes | Michael Strube

We used four Part-of-Speech taggers, which are available for research purposes and were originally trained on text to tag a corpus of transcribed multiparty spoken dialogues. The assigned tags were then manually corrected. The correction was first used to evaluate the four taggers, then to retrain them. Despite limited resources in time, money and annotators we reached results comparable to those reported for the taggers on text. Based on our experience we present guidelines to produce reliably POS tagged corpora of new domains.

pdf bib
Analysis of TimeBank as a Resource for TimeML Parsing
Branimir Boguraev | Rie Kubota Ando

In our work, we present an analysis of the TimeBank corpus - the only available reference sample of TimeML-compliant annotation - from the point of view of its utility as a training resource for developing automated TimeML annotators. We are encouraged by experimental results indicative of the potential of TimeBank; at the same time, closer inspection of the causes of some systematic errors reveals certain deficiencies in the corpus, primarily to do with its small size and inconsistent annotation. Our analysis suggests that even a reference resource, developed outside of a rigorous process of training corpus design and creation, can be extremely valuable for training and development purposes. The analysis also highlights areas of correction and improvement for evolving the current reference corpus into a community infrastructure resource.

pdf bib
Case Frame Compilation from the Web using High-Performance Computing
Daisuke Kawahara | Sadao Kurohashi

Case frames are important knowledge for a variety of NLP systems, especially when wide-coverage case frames are available. To acquire such large-scale case frames, it is necessary to compile them automatically from an enormous corpus. In this paper, we consider the web as a corpus. We first build a huge text corpus from the web, and then construct case frames from the corpus. It is infeasible to carry out these processes on a single CPU, and thus we employ a high-performance computing environment composed of 350 CPUs. The acquired corpus consists of 470M sentences, and the case frames compiled from them have 90,000 verb entries. The case frames contain most examples of usual use, and are ready to be applied to many NLP analyses and applications.

pdf bib
Data, Annotations and Measures in EASY the Evaluation Campaign for Parsers of French.
Patrick Paroubek | Isabelle Robba | Anne Vilnat | Christelle Ayache

This paper presents the protocol of EASY the evaluation campaign for syntactic parsers of French in the EVALDA project of the TECHNOLANGUE program. We describe the participants, the corpus and its genre partitioning, the annotation scheme, which allows for the annotation of both constituents and relations, the evaluation methodology and, as an illustration, the results obtained by one participant on half of the corpus.

pdf bib
Creation of a Japanese Adverb Dictionary that Includes Information on the Speaker’s Communicative Intention Using Machine Learning
Toshiyuki Kanamaru | Masaki Murata | Hitoshi Isahara

Japanese adverbs are classified as either declarative or normal; the former declare the communicative intention of the speaker, while the latter convey a manner of action, a quantity, or a degree by which the adverb modifies the verb or adjective that it accompanies. We have automatically classified adverbs as either declarative or not declarative using a machine-learning method such as the maximum entropy method. We defined adverbs having positive or negative connotations as the positive data. We classified adverbs in the EDR dictionary and IPADIC used by Chasen using this result and built an adverb dictionary that contains descriptions of the communicative intentions of the speaker.

pdf bib
Exploiting text for extracting image processing resources
Gregory Grefenstette | Fathi Debili | Christian Fluhr | Svitlana Zinger

Much everyday knowledge about physical aspects of objects does not exist as computer data, though such computer-based knowledge will be needed to communicate with next generation voice-commanded personal robots as well in other applications involving visual scene recognition. The largest attempt at manually creating common-sense knowledge, the CYC project, has not yet produced the information needed for these tasks. A new direction is needed, based on an automated approach to knowledge extraction. In this article we present our project to mine web text to find properties of objects that are not currently stored in computer readable form.

pdf bib
Clustering acronyms in biomedical text for disambiguation
Naoaki Okazaki | Sophia Ananiadou

Given the increasing number of neologisms in biomedicine (names of genes, diseases, molecules, etc.), the rate of acronyms used in the literature also increases. Existing acronym dictionaries cannot keep up with the rate of new creations. Thus, discovering and disambiguating acronyms and their expanded forms are essential aspects of text mining and terminology management. We present a method for clustering long forms identified by an acronym recognition method. Applying the acronym recognition method to MEDLINE abstracts, we obtained a list of short/long forms. The recognized short/long forms were classified by a biologist to construct an evaluation set for clustering sets of similar long forms. We observed five types of term variation in the evaluation set (orthographic, morphological, syntactic and lexico-semantic variants, and nested abbreviations) and defined four similarity measures to gather the similar long forms. The complete-link clustering with the four similarity measures achieved 87.5% precision and 84.9% recall on the evaluation set.
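
The sketch below illustrates threshold-based complete-link clustering over long forms, with a character-bigram Dice coefficient standing in for the paper's four similarity measures; the threshold and example strings are illustrative only.

```python
def complete_link_clusters(items, similarity, threshold):
    """Agglomerative complete-link clustering: two clusters are merged only
    if *every* cross-pair similarity is at or above the threshold."""
    clusters = [[item] for item in items]
    merged = True
    while merged:
        merged = False
        for i in range(len(clusters)):
            for j in range(i + 1, len(clusters)):
                if all(similarity(a, b) >= threshold
                       for a in clusters[i] for b in clusters[j]):
                    clusters[i].extend(clusters.pop(j))
                    merged = True
                    break
            if merged:
                break
    return clusters

def char_bigram_similarity(a, b):
    """Dice coefficient over character bigrams, a simple stand-in capturing
    orthographic variation between long forms."""
    bigrams = lambda s: {s[i:i + 2] for i in range(len(s) - 1)}
    x, y = bigrams(a.lower()), bigrams(b.lower())
    return 2 * len(x & y) / (len(x) + len(y))

long_forms = ["nuclear factor kappa B", "nuclear factor-kappaB", "neurofibromatosis 2"]
print(complete_link_clusters(long_forms, char_bigram_similarity, threshold=0.6))
```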

pdf bib
Towards a terminological resource for biomedical text mining
Goran Nenadic | Naoki Okazaki | Sophia Ananiadou

One of the main challenges in biomedical text mining is the identification of terminology, which is a key factor for accessing and integrating the information stored in literature. Manual creation of biomedical terminologies cannot keep pace with the data that becomes available. Still, many of them have been used in attempts to recognise terms in literature, but their suitability for text mining has been questioned as substantial re-engineering is needed to tailor the resources for automatic processing. Several approaches have been suggested to automatically integrate and map between resources, but the problems of extensive variability of lexical representations and ambiguity have been revealed. In this paper we present a methodology to automatically maintain a biomedical terminological database, which contains automatically extracted terms, their mutual relationships, features and possible annotations that can be useful in text processing. In addition to TermDB, a database used for terminology management and storage, we present the following modules that are used to populate the database: TerMine (recognition, extraction and normalisation of terms from literature), AcroTerMine (extraction and clustering of acronyms and their long forms), AnnoTerm (annotation and classification of terms), and ClusTerm (extraction of term associations and clustering of terms).

pdf bib
Extraction of Cross Language Term Correspondences
Hans Hjelm

This paper describes a method for extracting translations of terms across languages, using parallel corpora. The extracted term correspondences are useful when performing query expansion for cross-language information retrieval, or for bilingual lexicon extraction. The method makes use of the mutual information measure and allows for mapping between single-word and multi-word terms and vice versa. The method is scalable (it accommodates addition or removal of data) and produces high quality results, while keeping the computational costs low enough to allow on-the-fly translations in, e.g., cross-language information retrieval systems. The work was carried out in collaboration with Intrafind Software AG (Munich, Germany).
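
A toy version of the mutual-information scoring mentioned above, estimating pointwise mutual information for one candidate source/target term pair from co-occurrence counts over aligned sentence pairs; all counts are invented.

```python
import math

def pointwise_mi(cooc, src_freq, tgt_freq, n_pairs):
    """Pointwise mutual information between a source term and a target term,
    estimated from counts over aligned sentence pairs."""
    p_xy = cooc / n_pairs
    p_x = src_freq / n_pairs
    p_y = tgt_freq / n_pairs
    return math.log2(p_xy / (p_x * p_y))

# Hypothetical counts from a parallel corpus of 10,000 aligned sentence pairs
print(round(pointwise_mi(cooc=80, src_freq=100, tgt_freq=120, n_pairs=10_000), 2))
```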

pdf bib
A Closer Look at Skip-gram Modelling
David Guthrie | Ben Allison | Wei Liu | Louise Guthrie | Yorick Wilks

Data sparsity is a large problem in natural language processing that refers to the fact that language is a system of rare events, so varied and complex that, even using an extremely large corpus, we can never accurately model all possible strings of words. This paper examines the use of skip-grams (a technique whereby n-grams are still stored to model language, but tokens within the window may be skipped) to overcome the data sparsity problem. We analyze this by computing all possible skip-grams in a training corpus and measuring how many adjacent (standard) n-grams these cover in test documents. We examine skip-gram modelling using one to four skips with various amounts of training data and test against similar documents as well as documents generated from a machine translation system. In this paper we also determine the amount of extra training data required to achieve skip-gram coverage using standard adjacent tri-grams.
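
The following sketch enumerates k-skip-n-grams in the usual sense (length-n subsequences spanning at most n + k consecutive tokens, order preserved); it is a generic illustration, not the authors' implementation.

```python
from itertools import combinations

def skip_grams(tokens, n=3, k=2):
    """All k-skip-n-grams of a token sequence: length-n subsequences in which
    at most k tokens are skipped between the first and last word."""
    grams = set()
    for start in range(len(tokens) - n + 1):
        # An n-gram with at most k skips spans at most n + k consecutive tokens
        window_end = min(len(tokens), start + n + k)
        for rest in combinations(range(start + 1, window_end), n - 1):
            grams.add((tokens[start],) + tuple(tokens[i] for i in rest))
    return grams

sentence = "we examined the use of skip grams".split()
print(sorted(skip_grams(sentence, n=2, k=2)))
```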

pdf bib
Development of Linguistic Ontology on Natural Sciences and Technology
B. Dobrov | N. Loukachevitch

The paper describes the main principles of development and current state of Linguistic Ontology on Natural Sciences and Technology intended for information-retrieval tasks. In the development of the ontology we combined three different methodologies: development of information-retrieval thesauri, development of wordnets, formal ontology research. Combination of these methodologies allows us to develop large ontologies for broad domains.

pdf bib
Evaluation for Scenario Question Answering Systems
Matthew W. Bilotti | Eric Nyberg

Scenario Question Answering is a relatively new direction in Question Answering (QA) research that presents a number of challenges for evaluation. In this paper, we propose a comprehensive evaluation strategy for Scenario QA, including a methodology for building reusable test collections for Scenario QA and metrics for evaluating system performance over such test collections. Using this methodology, we have built a test collection, which we have made available for public download as a service to the research community. It is our hope that the widespread availability of quality evaluation materials will fuel research in new approaches to the Scenario QA task.

pdf bib
Stochastic Spoken Natural Language Parsing in the Framework of the French MEDIA Evaluation Campaign
Dirk Bühler | Wolfgang Minker

A stochastic parsing component has been applied to a French spoken language dialogue corpus, recorded in the framework of the MEDIA evaluation campaign. Realized as an ergodic HMM using Viterbi decoding, the parser outputs the most likely semantic representation given a transcribed utterance as input. The semantic sequences used for training and testing have been derived from the semantic representations of the MEDIA corpus. The HMM parameters have been estimated given the word sequences along with their semantic representation. The performance score of the stochastic parser has been automatically determined using the mediaval tool applied to a held-out reference corpus. Evaluation results are presented in the paper.
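
For illustration, here is a compact Viterbi decoder over a fully connected (ergodic) HMM of the kind described above, where states play the role of semantic concepts; the two-concept toy model and its probabilities are invented and far smaller than the MEDIA semantic dictionary.

```python
import math

def viterbi(observations, states, log_start, log_trans, log_emit):
    """Most likely state (concept) sequence for an observed word sequence
    under an ergodic HMM; unseen words get a flat low emission score."""
    best = [{s: (log_start[s] + log_emit[s].get(observations[0], -20.0), [s])
             for s in states}]
    for word in observations[1:]:
        column = {}
        for s in states:
            score, prev = max(
                (best[-1][p][0] + log_trans[p][s], p) for p in states)
            column[s] = (score + log_emit[s].get(word, -20.0),
                         best[-1][prev][1] + [s])
        best.append(column)
    return max(best[-1].values())[1]

# Toy model: two semantic concepts from a hotel-reservation domain (illustrative)
states = ["city", "room_type"]
log_start = {s: math.log(0.5) for s in states}
log_trans = {s: {t: math.log(0.5) for t in states} for s in states}
log_emit = {"city": {"paris": math.log(0.9)},
            "room_type": {"double": math.log(0.8), "room": math.log(0.2)}}
print(viterbi(["paris", "double", "room"], states, log_start, log_trans, log_emit))
```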

pdf bib
Discriminant-Based MRS Banking
Stephan Oepen | Jan Tore Lønning

We present an approach to discriminant-based MRS banking, i.e. the construction of an annotated corpus where each input item is paired with a logical-form semantics. Semantic annotations are produced by parsing with a broad-coverage precision grammar, followed by manual disambiguation. The selection of the preferred analysis for each item (and hence its semantic form) builds on a notion of semantic discriminants, essentially localized dependencies extracted from a full-fledged, underspecified semantic representation.

pdf bib
A highly accurate Named Entity corpus for Hungarian
György Szarvas | Richárd Farkas | László Felföldi | András Kocsor | János Csirik

A highly accurate Named Entity (NE) corpus for Hungarian that is publicly available for research purposes is introduced in the paper, along with its main properties. The results of experiments that apply various Machine Learning models and classifier combination schemes are also presented to serve as a benchmark for further research based on the corpus. The data is a segment of the Szeged Corpus (Csendes et al., 2004), consisting of short business news articles collected from MTI (Hungarian News Agency, www.mti.hu). The annotation procedure was carried out paying special attention to annotation accuracy. The corpus went through a parallel annotation phase done by two annotators, resulting in a tagging with an inter-annotator agreement rate of 99.89%. Controversial taggings were collected and discussed by the two annotators and a linguist with several years of experience in corpus annotation. These examples were tagged following the decision they made together, and finally all entities that had suspicious or dubious annotations were collected and checked for consistency. We consider the result of this correcting process to be virtually free of errors. Our best performing Named Entity Recognizer (NER) model attained an accuracy of 92.86% F-measure on the corpus.

pdf bib
Generic NLP Tools for Supporting Shallow Ontology Building
Thierry Declerck | Mihaela Vela

In this paper we present on-going investigations on how complex syntactic annotation, combined with linguistic semantics, can possibly help in supporting the semi-automatic building of (shallow) ontologies from text by proposing an automated extraction of (possibly underspecified) semantic relations from linguistically annotated text.

pdf bib
Extraction tools for collocations and their morphosyntactic specificities
Julia Ritz | Ulrich Heid

We describe tools for the extraction of collocations not only in the form of word combinations, but also of data about the morphosyntactic properties of collocation candidates. Such data are needed for a detailed lexical description of collocations, and to support both their recognition in text and the generation of collocationally acceptable text. We describe the tool architecture, report on a case study based on noun+verb collocations, and we give a first rough evaluation of the data quality produced.

pdf bib
What in the world is a Shahab?: Wide Coverage Named Entity Recognition for Arabic
Luke Nezda | Andrew Hickl | John Lehmann | Sarmad Fayyaz

This paper describes the development of CiceroArabic, the first wide coverage named entity recognition (NER) system for Modern Standard Arabic. Capable of classifying 18 different named entity classes with over 85% F, CiceroArabic utilizes a new 800,000-word annotated Arabic newswire corpus in order to achieve high performance without the need for hand-crafted rules or morphological information. In addition to describing results from our system, we show that accurate named entity annotation for a large number of semantic classes is feasible, even for very large corpora, and we discuss new techniques designed to boost agreement and consistency among annotators over a long-term annotation effort.

pdf bib
An Anaphora Resolution-Based Anonymization Module
M. Poesio | M. A. Kabadjov | P. Goux | U. Kruschwitz | E. Bishop | L. Corti

Growing privacy and security concerns mean there is an increasing need for data to be anonymized before being publicly released. We present a module for anonymizing references, implemented as part of the SQUAD tools for specifying and testing non-proprietary means of storing and marking up data using universal (XML) standards and technologies. The tool is implemented on top of the GUITAR anaphoric resolver.

pdf bib
The Collection of Distributionally Idiosyncratic Items: A Multilingual Resource for Linguistic Research
Manfred Sailer | Beata Trawiński

We present two collections of lexical items with idiosyncratic distribution. The collections document the behavior of German and English bound words (BW, such as English “headway”), i.e., words which can only occur in one expression (“make headway”). BWs are a problem for both general and idiomatic dictionaries since it is unclear whether they have an independent lexical status and to what extent the expressions in which they occur are typical idiomatic expressions. We propose a system which allows us to document the information about BWs from dictionaries and linguistic literature, together with corpus data and example queries for major text corpora. We present our data structure and point to other phraseologically oriented collections. We will also show differences between the German and the English collection.

pdf bib
Grammar-based tools for the creation of tagging resources for an unresourced language: the case of Northern Sotho
Ulrich Heid | Elsabé Taljard | Danie J. Prinsloo

We describe an architecture for the parallel construction of a tagger lexicon and an annotated reference corpus for the part-of-speech tagging of Northern Sotho, a Bantu language of South Africa, for which no tagged resources have been available so far. Our tools make use of grammatical properties (morphological and syntactic) of the language. We use symbolic pretagging followed by stochastic tagging, an architecture which proves useful not only for the bootstrapping of tagging resources, but also for the tagging of any new text. We discuss the tagset design, the tool architecture and the current state of our ongoing effort.

pdf bib
Building a historical corpus for Classical Portuguese: some technological aspects
Maria Clara Paixão de Sousa | Thorsten Trippel

This paper describes the restructuring process of a large corpus of historical documents and the system architecture that is used for accessing it. The initial challenge of this process was to get the most out of existing material, normalizing the legacy markup and harvesting the inherent information using widely available standards. This resulted in a conceptual and technical restructuring of the formerly existing corpus. The development of the standardized markup and techniques allowed the inclusion of important new materials, such as original 16th and 17th century prints and manuscripts, and enlarged the potential user groups. On the technological side, we were grounded on the premise that open standards are the best way of making sure that the resources will be accessible even after years in an archive. This is a welcome result in view of the additional consequence of the remodeled corpus concept: it serves as a repository for important historical documents, some of which had been preserved for 500 years in paper format. This very rich material can from now on be handled freely for linguistic research goals.

pdf bib
Mixing WordNet, VerbNet and PropBank for studying verb relations
Maria Teresa Pazienza | Marco Pennacchiotti | Fabio Massimo Zanzotto

In this paper we present a novel resource for studying the semantics of verb relations. The resource is created by mixing the sense-relational knowledge enclosed in WordNet, the frame knowledge enclosed in VerbNet and the corpus knowledge enclosed in PropBank. As a result, a set of about 1000 frame pairs is made available. A frame pair represents a pair of verbs in a particular semantic relation, accompanied by specific information such as: the syntactic-semantic frames of the two verbs, the mapping among their thematic roles, and a set of textual examples extracted from the Penn TreeBank. We specifically focus on four relations: Troponymy, Causation, Entailment and Antonymy. The different steps required for the mapping are described in detail and statistics on the mutual coverage of the resources are reported. We also propose a practical use of the resource for the task of Textual Entailment acquisition and for Question Answering. A first attempt to automate the mapping among verb arguments is also presented: early experiments show that simple techniques can achieve good results, up to 85% F-measure.

pdf bib
Local Document Relevance Clustering in IR Using Collocation Information
Leo Wanner | Margarita Alonso Ramos

A series of different automatic query expansion techniques has been suggested in Information Retrieval. To estimate how suitable a document term is as an expansion term, the most popular of them use a measure of the frequency of the co-occurrence of this term with one or several query terms. The benefit of the use of the linguistic relations that hold between query terms is often questioned. If a linguistic phenomenon is taken into account, it is the phrase structure or lexical compound. We propose a technique that is based on the restricted lexical cooccurrence (collocation) of query terms. We use the knowledge on collocations formed by query terms for two tasks: (i) document relevance clustering done in the first stage of local query expansion and (ii) choice of suitable expansion terms from the relevant document cluster. In this paper, we describe the first task, providing evidence from first preliminary experiments on Spanish material that local relevance clustering benefits largely from knowledge on collocations.

pdf bib
SENTIWORDNET: A Publicly Available Lexical Resource for Opinion Mining
Andrea Esuli | Fabrizio Sebastiani

Opinion mining (OM) is a recent subdiscipline at the crossroads of information retrieval and computational linguistics which is concerned not with the topic a document is about, but with the opinion it expresses. OM has a rich set of applications, ranging from tracking users’ opinions about products or about political candidates as expressed in online forums, to customer relationship management. In order to aid the extraction of opinions from text, recent research has tried to automatically determine the “PN-polarity” of subjective terms, i.e. identify whether a term that is a marker of opinionated content has a positive or a negative connotation. Research on determining whether a term is indeed a marker of opinionated content (a subjective term) or not (an objective term) has been instead much scarcer. In this work we describe SENTIWORDNET, a lexical resource in which each WORDNET synset s is associated to three numerical scores Obj(s), Pos(s) and Neg(s), describing how objective, positive, and negative the terms contained in the synset are. The method used to develop SENTIWORDNET is based on the quantitative analysis of the glosses associated to synsets, and on the use of the resulting vectorial term representations for semi-supervised synset classification. The three scores are derived by combining the results produced by a committee of eight ternary classifiers, all characterized by similar accuracy levels but different classification behaviour. SENTIWORDNET is freely available for research purposes, and is endowed with a Web-based graphical user interface.

pdf bib
A Deep-Parsing Approach to Natural Language Understanding in Dialogue System: Results of a Corpus-Based Evaluation
Alexandre Denis | Matthieu Quignard | Guillaume Pitel

This paper presents an approach to dialogue understanding based on deep parsing and rule-based semantic analysis. Its performance in the semantic evaluation performed in the framework of the EVALDA/MEDIA campaign is encouraging. The MEDIA project aims to evaluate natural language understanding systems for French on a hotel reservation task (Devillers et al., 2004). For the evaluation, five participating teams had to produce an annotated version of the input utterances in compliance with a commonly agreed format (the MEDIA formalism). An approach based on symbolic processing was not straightforward given the conditions of the evaluation, but we achieved a score close to that of statistical systems without needing an annotated corpus. Although the architecture was designed for this campaign, which is exclusively dedicated to spoken dialogue understanding, we believe that our approach based on an LTAG parser and two ontologies can be used in real dialogue systems, providing quite robust speech understanding and facilities for interfacing with a dialogue manager and the application itself.

pdf bib
Work within the W3C Internationalization Activity and its Benefit for the Creation and Manipulation of Language Resources
Felix Sasaki

This paper introduces ongoing and current work within Internationalization (i18n) Activity, in the World Wide Web Consortium (W3C). The focus is on aspects of the W3C i18n Activity which are of benefit for the creation and manipulation of multilingual language resources. In particular, the paper deals with ongoing work concerning encoding, visualization and processing of characters; current work on language and locale identification; and current work on internationalization of markup. The main usage scenario is the design of multilingual corpora. This includes issues of corpus creation and manipulation.

pdf bib
Evaluating Symbiotic Systems: the challenge
Margaret King | Nancy Underwood

This paper looks at a class of systems which pose severe problems in evaluation design for current conventional approaches to evaluation. After describing the two conventional evaluation paradigms: the “functionality paradigm” as typified by evaluation campaigns and the ISO inspired “user-centred” paradigm typified by the work of the EAGLES and ISLE projects, it goes on to outline the problems posed by the evaluation of systems which are designed to work in critical interaction with a human expert user and to work over vast amounts of data. These systems pose problems for both paradigms although for different reasons. The primary aim of this paper is to provoke discussion and the search for solutions. We have no proven solutions at present. However, we describe a programme of exploratory research on which we have already embarked, which involves ground clearing work which we expect to result in a deep understanding of the systems and users, a pre-requisite for developing a general framework for evaluation in this field.

pdf bib
All Greek to me! An automatic Greeklish to Greek transliteration system
Aimilios Chalamandaris | Athanassios Protopapas | Pirros Tsiakoulis | Spyros Raptis

This paper presents research on “Greeklish,” that is, a transliteration of Greek using the Latin alphabet, which is used frequently in Greek e-mail communication. Greeklish is not standardized and there are a number of competing conventions co-existing in communication, based on personal preferences regarding similarities between Greek and Latin letters in shape, sound, or keyboard position. Our research has led to the development of “All Greek to me!” the first automatic transliteration system that can cope with any type of Greeklish. In this paper we first present previous research on Greeklish, describing other approaches that have attempted to deal with the same problems. We then provide a brief description of our approach, illustrating the functional flowchart of our system and the main ideas that underlie it. We present measures of system performance, based on about a year’s worth of usage as a public web service, and preliminary research, based on the same corpus, on the use of Greeklish and the trends in preferred Latin-Greek letter mapping. We evaluate the consistency of different transliteration patterns among users as well as the within-user consistency based on coherent principles. Finally we outline planned future research to further understand the use of Greeklish and improve “All Greek to me!” to function reliably embedded in integrated communication platforms bridging e-mail to mobile telephony and ubiquitous connectivity.

pdf bib
Improving Automatic Emotion Recognition from Speech via Gender Differentiation
Thurid Vogt | Elisabeth André

Feature extraction is still a disputed issue for the recognition of emotions from speech. Differences in features for male and female speakers are a well-known problem and it is established that gender-dependent emotion recognizers perform better than gender-independent ones. We propose a way to improve the discriminative quality of gender-dependent features: the emotion recognition system is preceded by an automatic gender detection that decides which of two gender-dependent emotion classifiers is used to classify an utterance. This framework was tested on two different databases, one with emotional speech produced by actors and one with spontaneous emotional speech from a Wizard-of-Oz setting. Gender detection achieved an accuracy of about 90% and the combined gender and emotion recognition system improved the overall recognition rate of a gender-independent emotion recognition system by 2-4%.
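The two-stage routing described above (automatic gender detection followed by gender-specific emotion classification) can be sketched as follows. This is a minimal illustration assuming scikit-learn SVM classifiers and pre-extracted acoustic feature vectors; the paper's actual features and classifiers are not specified in the abstract.

```python
# Minimal sketch of the two-stage architecture: a gender detector routes each
# utterance to one of two gender-dependent emotion classifiers.
import numpy as np
from sklearn.svm import SVC

class GenderGatedEmotionRecognizer:
    def __init__(self):
        self.gender_clf = SVC()                       # predicts "m" / "f" from acoustic features
        self.emotion_clf = {"m": SVC(), "f": SVC()}   # one emotion model per gender

    def fit(self, X, genders, emotions):
        """X: np.ndarray of acoustic feature vectors; genders/emotions: per-utterance labels."""
        self.gender_clf.fit(X, genders)
        for g in ("m", "f"):
            idx = np.array(genders) == g
            self.emotion_clf[g].fit(X[idx], np.array(emotions)[idx])
        return self

    def predict(self, X):
        # Route each utterance through the classifier of its predicted gender.
        predicted_genders = self.gender_clf.predict(X)
        return [self.emotion_clf[g].predict(x.reshape(1, -1))[0]
                for g, x in zip(predicted_genders, X)]
```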

pdf bib
Sentiments on a Grid: Analysis of Streaming News and Views
Khurshid Ahmad | Lee Gillam | David Cheng

In this paper we report on constructing a finite state automaton comprising automatically extracted terminology and significant collocation patterns from a training corpus of specialist news (Reuters Financial News). The automaton can be used to unambiguously identify sentiment-bearing words that might be able to make or break people, companies, perhaps even governments. The paper presents the emerging face of corpus linguistics where a corpus is used to bootstrap both the terminology and the significant meaning-bearing patterns from the corpus. Many current content analysis software systems require a human coder to eyeball terms and sentiment words. Such an approach might yield very good quality results on small text collections, but when confronted with a 40-50 million word corpus it does not scale, and a large-scale computer-based approach is required. We report on the use of Grid computing technologies and techniques to cope with this analysis.

pdf bib
Tools and resources for speech synthesis arising from a Welsh TTS project
Briony Williams | Rhys James Jones | Ivan Uemlianin

The WISPR project ("Welsh and Irish Speech Processing Resources") has been building text-to-speech synthesis systems for Welsh and for Irish, as well as building links between the developers and potential users of the software. The Welsh half of the project has encountered various challenges, in the areas of the tokenisation of input text, the formatting of letter-to-sound rules, and the implementation of the "greedy algorithm" for text selection. The solutions to these challenges have resulted in various tools which may be of use to other developers using Festival for TTS for other languages. These resources are made freely available.

pdf bib
Multilingual Lexical Semantic Resources for Ontology Translation
Thierry Declerck | Asunción Gómez Pérez | Ovidiu Vela | Zeno Gantner | David Manzano-Macho

We describe the integration of some multilingual language resources in ontological descriptions, with the purpose of providing ontologies, which normally use concept labels in just one (natural) language, with a multilingual facility in their design and use in the context of Semantic Web applications, supporting both the semantic annotation of textual documents with multilingual ontology labels and ontology extraction from multilingual text sources.

pdf bib
The Impact of Annotation on the Performance of Protein Tagging in Biomedical Text
Beatrice Alex | Malvina Nissim | Claire Grover

In this paper we discuss five different corpora annotated for protein names. We present several within- and cross-dataset protein tagging experiments showing that different annotation schemes severely affect the portability of statistical protein taggers. By means of a detailed error analysis we identify crucial annotation issues that future annotation projects should take into careful consideration.

pdf bib
Leveraging Machine Readable Dictionaries in Discriminative Sequence Models
Ben Wellner | Marc Vilain

Many natural language processing tasks make use of a lexicon – typically the words collected from some annotated training data along with their associated properties. We demonstrate here the utility of corpus-independent lexicons derived from machine-readable dictionaries. Lexical information is encoded in the form of features in a Conditional Random Field tagger, providing improved performance in cases where: i) limited training data is available; ii) the data is case-less; and iii) the test data genre or domain differs from that of the training data. We show substantial error reductions, especially on unknown words, for the tasks of part-of-speech tagging and shallow parsing, achieving up to 20% error reduction on Penn TreeBank part-of-speech tagging and up to a 15.7% error reduction for shallow parsing using the CoNLL 2000 data. Our results point towards a simple but effective methodology for increasing the adaptability of text processing systems by training models with annotated data in one genre augmented with general lexical information or lexical information pertinent to the target genre (or domain).
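For illustration only, the sketch below shows one way dictionary-derived information could be encoded as per-token features for a sequence tagger such as a CRF. The dictionary format, feature names, and helper function are hypothetical and not taken from the paper.

```python
# Illustrative only: encode machine-readable-dictionary information as
# per-token features of the kind one might feed to a CRF sequence tagger.
def token_features(tokens, position, mrd):
    """mrd maps a lower-cased word form to a set of lexical properties,
    e.g. {"bank": {"NN", "VB"}}, derived from a machine-readable dictionary."""
    word = tokens[position]
    feats = {
        "word.lower": word.lower(),
        "word.istitle": word.istitle(),
        "suffix3": word[-3:],
    }
    # Corpus-independent lexical features contributed by the dictionary.
    for prop in sorted(mrd.get(word.lower(), ())):
        feats[f"mrd.{prop}"] = True
    return feats

mrd = {"bank": {"NN", "VB"}, "swift": {"JJ", "NN"}}
sent = ["The", "swift", "bank", "transfer"]
print(token_features(sent, 2, mrd))
```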

pdf bib
Creating a Large-Scale Arabic to French Statistical MachineTranslation System
Saša Hasan | Anas El Isbihani | Hermann Ney

In this work, the creation of a large-scale Arabic to French statistical machine translation system is presented. We introduce all necessary steps, from corpus acquisition and data preprocessing to training, optimizing the system and eventual evaluation. Since no corpora existed previously, we collected large amounts of data from the web. Arabic word segmentation was crucial to reduce the overall number of unknown words. We describe the phrase-based SMT system used for training and generation of the translation hypotheses. Results on the second CESTA evaluation campaign are reported. The setting was in the medical domain. The prototype reaches a favorable BLEU score of 40.8%.

pdf bib
A Study on Terminology Extraction Based on Classified Corpora
Yirong Chen | Qin Lu | Wenjie Li | Zhifang Sui | Luning Ji

Algorithms for automatic term extraction in a specific domain should consider at least two issues, namely unithood and termhood (Kageura, 1996). Unithood refers to the degree to which a string occurs as a word or a phrase. Termhood (Chen Yirong, 2005) refers to the degree to which a word or a phrase occurs as a domain-specific concept. Unlike unithood, studies on termhood are not yet widely reported. In classified corpora, the class information provides a cue to the nature of the data and can be used in termhood calculation. Three algorithms are provided and evaluated to investigate termhood based on classified corpora. The three algorithms are based on lexicon set computing, on term frequency and document frequency, and on the strength of the relation between a term and its document class, respectively. Our objective is to investigate the effects of these different termhood measurement features. After evaluation, we can determine which features are more effective and how these features can be improved to achieve the best performance. Preliminary results show that the first measure can effectively filter out independent terms or terms of general use.
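As a rough illustration of the class-based intuition behind the second and third measures, the sketch below scores a term higher when its occurrences concentrate in documents of a single class. The concrete formula (total frequency weighted by one minus the normalized class entropy) is an illustrative assumption, not the paper's.

```python
# Toy termhood score over a classified corpus: terms whose occurrences
# concentrate in one document class score higher than evenly spread terms.
import math

def termhood(term_class_counts):
    """term_class_counts: {term: {class_label: frequency}}.
    Returns score = total frequency * (1 - normalized entropy over classes)."""
    scores = {}
    for term, counts in term_class_counts.items():
        total = sum(counts.values())
        probs = [c / total for c in counts.values() if c > 0]
        entropy = -sum(p * math.log(p, 2) for p in probs)
        max_entropy = math.log(len(counts), 2) if len(counts) > 1 else 1.0
        scores[term] = total * (1 - entropy / max_entropy)
    return scores

counts = {"kiln": {"ceramics": 40, "finance": 1},
          "report": {"ceramics": 20, "finance": 22}}
print(termhood(counts))  # "kiln" scores far higher than "report"
```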

pdf bib
Retrieving Terminological Data from the TxtCeram Tagged Domain Corpus: A First Step towards a Terminological Ontology
Anna Estellés | Amparo Alcina | Victoria Soler

In this paper we will focus on corpora as a resource for researching language processing for terminological purposes. Based on the TEI guide, we present the templates used to tag our TxtCeram corpus and its features when working with WordSmith, a text analysis tool. We present an experiment for studying the frequency of hyperonyms in the introduction section of texts, while testing WordSmith’s suitability to work with our tagged corpus.

pdf bib
IMORPHĒ: An Inheritance and Equivalence Based Morphology Description Compiler
Violetta Cavalli-Sforza | Abdelhadi Soudi

IMORPHĒ is a significantly extended version of MORPHE, a morphology description compiler. MORPHE’s morphology description language is based on two constructs: 1) a morphological form hierarchy, whose nodes relate and differentiate surface forms in terms of the common and distinguishing inflectional features of lexical items; and 2) transformational rules, attached to leaf nodes of the hierarchy, which generate the surface form of an item from the base form stored in the lexicon. While MORPHE’s approach to morphology description is intuitively appealing and was successfully used for generating the morphology of several European languages, its application to Modern Standard Arabic yielded morphological descriptions that were highly complex and redundant. Previous modifications and enhancements attempted to capture more elegantly and concisely different aspects of the complex morphology of Arabic, finding theoretical grounding in Lexeme-Based Morphology. Those extensions are being incorporated in a more flexible and less ad hoc fashion in IMORPHE, which retains the unique features of our previous work but embeds them in an inheritance-based framework in order to achieve even more concise and modular morphology descriptions and greater runtime efficiency, and lays the groundwork for IMORPHE to become an analyzer as well as a generator.

pdf bib
Tools and methods for objective or contextual evaluation of topic segmentation
Laurianne Sitbon | Patrice Bellot

In this paper we discuss ways of evaluating topic segmentation, from mathematical measures on variously constructed reference corpora to contextual evaluation depending on different topic segmentation usages. We present an overview of the different ways of building reference corpora and of mathematically evaluating segmentation methods, and then we focus on three tasks which may involve topic segmentation: text extraction, information retrieval and document presentation. We have developed two graphical interfaces, one for intrinsic comparison and the other dedicated to evaluation in an information retrieval context. These tools will very soon be distributed under GPL licences on the Technolangue project web page.

pdf bib
Real life emotions in French and English TV video clips: an integrated annotation protocol combining continuous and discrete approaches
L. Devillers | R. Cowie | J-C. Martin | E. Douglas-Cowie | S. Abrilian | M. McRorie

A major barrier to the development of accurate and realistic models of human emotions is the absence of multi-cultural / multilingual databases of real-life behaviours and of a federative and reliable annotation protocol. QUB and LIMSI teams are working towards the definition of an integrated coding scheme combining their complementary approaches. This multilevel integrated scheme combines the dimensions that appear to be useful for the study of real-life emotions: verbal labels, abstract dimensions and contextual (appraisal based) annotations. This paper describes this integrated coding scheme, a protocol that was set-up for annotating French and English video clips of emotional interviews and the results (e.g. inter-coder agreement measures and subjective evaluation of the scheme).

pdf bib
POS-based Word Reorderings for Statistical Machine Translation
Maja Popović | Hermann Ney

In this work we investigate new possibilities for improving the quality of statistical machine translation (SMT) by applying word reorderings of the source language sentences based on Part-of-Speech tags. Results are presented on the European Parliament corpus containing about 700k sentences and 15M running words. In order to investigate sparse training data scenarios, we also report results obtained on about 1% of the original corpus. The source languages are Spanish and English and the target languages are Spanish, English and German. We propose two types of reorderings depending on the language pair and the translation direction: local reorderings of nouns and adjectives for translation from and into Spanish, and long-range reorderings of verbs for translation into German. For our best translation system, we achieve up to 2% relative reduction of WER and up to 7% relative increase of BLEU score. Improvements can be seen both on the reordered sentences and on the rest of the test corpus. Local reorderings are especially important for the translation systems trained on the small corpus, whereas long-range reorderings are more effective for the larger corpus.
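A minimal sketch of the kind of local reordering described, assuming simplified POS tags and a plain noun-adjective swap rule; the paper's actual rules and tagset may differ.

```python
# Illustrative local POS-based reordering: for translation from Spanish,
# swap adjacent NOUN-ADJ pairs so the source order better matches English.
def reorder_noun_adj(tagged_sentence):
    """tagged_sentence: list of (word, pos) tuples in source order."""
    out = list(tagged_sentence)
    i = 0
    while i < len(out) - 1:
        (w1, t1), (w2, t2) = out[i], out[i + 1]
        if t1 == "NOUN" and t2 == "ADJ":
            out[i], out[i + 1] = out[i + 1], out[i]  # "casa blanca" -> "blanca casa"
            i += 2
        else:
            i += 1
    return out

sent = [("la", "DET"), ("casa", "NOUN"), ("blanca", "ADJ")]
print(reorder_noun_adj(sent))  # [('la','DET'), ('blanca','ADJ'), ('casa','NOUN')]
```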

pdf bib
Error Analysis of Statistical Machine Translation Output
David Vilar | Jia Xu | Luis Fernando D’Haro | Hermann Ney

Evaluation of automatic translation output is a difficult task. Several performance measures like Word Error Rate, Position-Independent Word Error Rate and the BLEU and NIST scores are widely used and provide a useful tool for comparing different systems and evaluating improvements within a system. However, the interpretation of all of these measures is not at all clear, and the identification of the most prominent source of errors in a given system using these measures alone is not possible. Therefore some analysis of the generated translations is needed in order to identify the main problems and to focus the research efforts. This area is however mostly unexplored and few works have dealt with it until now. In this paper we present a framework for classification of the errors of a machine translation system, and we carry out an error analysis of the system used by the RWTH in the first TC-STAR evaluation.

pdf bib
The Sensem Corpus: a Corpus Annotated at the Syntactic and Semantic Level
Irene Castellón | Ana Fernández-Montraveta | Gloria Vázquez | Laura Alonso Alemany | Joan Antoni Capilla

The primary aim of the project SENSEM (Sentence Semantics, BFF2003-06456) is the construction of a Lexical Data Base illustrating the syntactic and semantic behavior of each of the senses of the 250 most frequent verbs of Spanish. With this objective in mind, we are currently building an annotated corpus consisting of sentences extracted from the electronic version of the newspaper El Periódico de Catalunya, totalling approximately 1 million words, with 100 examples of each verb. By the time of the conference, we will be about to complete the annotation of 25,000 sentences, which means roughly a corpus of 800,000 words. Approximately 400,000 of them will have been revised. We expect to make the corpus publicly available by the end of 2006.

pdf bib
GAIA: Common Framework for the Development of Speech Translation Technologies
Javier Pérez | Antonio Bonafonte

We present here an open-source software platform for the integration of speech translation components. This tool is useful to integrate into a common framework different automatic speech recognition, spoken language translation and text-to-speech synthesis solutions, as demonstrated in the evaluation of the European LC-STAR project, and during the development of the national ALIADO project. Gaia operates with great flexibility, and it has been used to obtain the text and speech corpora needed when performing speech translation. The platform follows a modular distributed approach, with a specifically designed extensible network protocol handling the communication with the different modules. A well defined and publicly available API facilitates the integration of existing solutions into the architecture. Completely functional audio and text interfaces together with remote monitoring tools are provided.

pdf bib
Morphological Tools for Six Small Uralic Languages
Attila Novák

This article presents a set of morphological tools for six small endangered minority languages belonging to the Uralic language family, Udmurt, Komi, Eastern Mari, Northern Mansi, Tundra Nenets and Nganasan. Following an introduction to the languages, the two sets of tools used in the project (MorphoLogic's Humor tools and the Xerox Finite State Tool) are described and compared. The article is concluded by a comparison of the six computational morphologies.

pdf bib
ECESS Inter-Module Interface Specification for Speech Synthesis
Javier Pérez | Antonio Bonafonte | Horst-Udo Hain | Eric Keller | Stefan Breuer | Jilei Tian

The newly founded European Centre of Excellence for Speech Synthesis (ECESS) is an initiative to promote the development of the European research area (ERA) in the field of Language Technology. ECESS focuses on the great challenge of high-quality speech synthesis which is of crucial importance for future spoken-language technologies. The main goals of ECESS are to achieve the critical mass needed to promote progress in TTS technology substantially, to integrate basic research know-how related to speech synthesis and to attract public and private funding. To this end, a common system architecture based on exchangeable modules supplied by the ECESS members is to be established. The XML-based interface that connects these modules is the topic of this paper.

pdf bib
Tree Searching/Rewriting Formalism
Petr Němec

We present a formalism capable of searching and optionally replacing forests of subtrees within labelled trees. In particular, the formalism is developed to process linguistic treebanks. When used as a substitution tool, the interpreter processes rewrite rules consisting of a left and a right side. The left side specifies a forest of subtrees to be searched for within a tree by imposing a set of constraints encoded as a query formula. The right side contains the respective substitutions for these subtrees. In the search mode only the left side is present. The formalism is fully implemented. The performance of the implemented tool allows even large linguistic corpora to be processed in acceptable time. The main contributions of the presented work are the expressiveness of the query formula, the elegant and intuitive way the rules are written (and their easy reversibility), and the performance of the implemented tool.

pdf bib
Methods for Creating Semantic Orientation Dictionaries
Maite Taboada | Caroline Anthony | Kimberly Voll

We describe and compare different methods for creating a dictionary of words with their corresponding semantic orientation (SO). We tested how well different dictionaries helped determine the SO of entire texts. To extract SO for each individual word, we used a common method based on pointwise mutual information. Mutual information between a set of seed words and the target words was calculated using two different methods: a NEAR search on the search engine Altavista (since discontinued); an AND search on Google. These two dictionaries were tested against a manually annotated dictionary of positive and negative words. The results show that all three methods are quite close, and none of them performs particularly well. We discuss possible further avenues for research, and also point out some potential problems in calculating pointwise mutual information using Google.
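As a rough illustration of the PMI-based extraction step (in the spirit of Turney's SO-PMI), the sketch below uses toy co-occurrence counts in place of the NEAR/AND search-engine hit counts; the seed words, counts, and smoothing constant are illustrative assumptions, not the paper's configuration.

```python
# Toy semantic-orientation score: sum of PMI with positive seeds minus sum of
# PMI with negative seeds. Real systems replace COOC/FREQ with hit counts.
import math

POSITIVE_SEEDS = ["good", "excellent"]
NEGATIVE_SEEDS = ["bad", "poor"]

COOC = {("great", "good"): 120, ("great", "excellent"): 80,
        ("great", "bad"): 10, ("great", "poor"): 5}
FREQ = {"good": 1000, "excellent": 400, "bad": 900, "poor": 700, "great": 800}

def so_pmi(word, smoothing=0.01):
    """Relative semantic orientation; > 0 suggests positive, < 0 negative.
    (The corpus-size term of full PMI cancels when seed sets are equal-sized.)"""
    def pmi(seed):
        joint = COOC.get((word, seed), 0) + smoothing
        return math.log2(joint) - math.log2(FREQ[word]) - math.log2(FREQ[seed] + smoothing)
    return sum(pmi(s) for s in POSITIVE_SEEDS) - sum(pmi(s) for s in NEGATIVE_SEEDS)

print(so_pmi("great"))  # positive value -> positive orientation
```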

pdf bib
Detecting Inter-domain Semantic Shift using Syntactic Similarity
Masaki Itagaki | Anthony Aue | Takako Aikawa

This poster is a preliminary report of our experiments for detecting semantically shifted terms between different domains for the purposes of new concept extraction. A given term in one domain may represent a different concept in another domain. In our approach, we quantify the degree of similarity of words between different domains by measuring the degree of overlap in their domain-specific semantic spaces. The domain-specific semantic spaces are defined by extracting families of syntactically similar words, i.e. words that occur in the same syntactic context. Our method does not rely on any external resources other than a syntactic parser. Yet it has the potential to extract semantically shifted terms between two different domains automatically while paying close attention to contextual information. The organization of the poster is as follows: Section 1 provides our motivation. Section 2 provides an overview of our NLP technology and explains how we extract syntactically similar words. Section 3 describes the design of our experiments and our method. Section 4 provides our observations and preliminary results. Section 5 presents some work to be done in the future and concluding remarks.
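The overlap idea can be sketched as follows: for a given term, compare its sets of syntactically similar words in the two domains and treat low overlap as a sign of semantic shift. The Jaccard measure and the toy neighbour sets below are illustrative assumptions; the paper's actual similarity computation is not reproduced here.

```python
# Toy semantic-shift score: 1 - Jaccard overlap between a term's
# domain-specific sets of syntactically similar words.
def jaccard(a, b):
    a, b = set(a), set(b)
    return len(a & b) / len(a | b) if a | b else 0.0

def semantic_shift_score(term, neighbours_domain1, neighbours_domain2):
    """neighbours_domainN: {term: set of words occurring in the same syntactic
    contexts in that domain}. Higher score = more likely shifted."""
    return 1.0 - jaccard(neighbours_domain1.get(term, set()),
                         neighbours_domain2.get(term, set()))

general = {"virus": {"flu", "infection", "bacteria", "disease"}}
computing = {"virus": {"worm", "malware", "trojan", "infection"}}
print(semantic_shift_score("virus", general, computing))  # high score -> shifted
```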

pdf bib
Methodology of Lombard Speech Database Acquisition: Experiences with CLSD
Hynek Bořil | Tomáš Bořil | Petr Pollák

In this paper, the process of acquiring the Czech Lombard Speech Database (CLSD'05) is presented. Feature analyses have proven a strong appearance of the Lombard effect in the database. In the small-vocabulary recognition task, significant performance degradation was observed for the Lombard speech recorded in the database. The aim of this paper is to describe the hardware platform, scenarios and recording tool used for the acquisition of CLSD'05. During the database recording and processing, several difficulties were encountered. The most important question was how to adjust the level of speech feedback for the speaker. A method for minimizing the speech attenuation introduced to the speaker by headphones is proposed in this paper. Finally, the contents and corpus of the database are presented to outline its suitability for analysis and modeling of the Lombard effect. The whole CLSD'05 database with detailed documentation is now released for public use.

pdf bib
Dimensions in Dialogue Act Annotation
Harry Bunt

This paper is concerned with the fundamentals of multidimensional dialogue act annotation, i.e. with what it means to annotate dialogues with information about the communicative acts that are performed with the utterances, taking various 'dimensions' into account. Two ideas seem to be prevalent in the literature concerning the notion of dimension: (1) dimensions correspond to different types of information; and (2) a dimension is formed by a set of mutually exclusive tags. In DAMSL, for instance, the terms “dimension” and “layer” are used sometimes in the sense of (1) and sometimes in that of (2). We argue that being mutually exclusive is not a good criterion for a set of dialogue act types to constitute a dimension, even though the description of an object in a multidimensional space should never assign more than one value per dimension. We define a dimension of dialogue act annotation as an aspect of participating in a dialogue that can be addressed independently by means of dialogue acts. We show that DAMSL dimensions such as Info-request, Statement, and Answer do not qualify as proper dimensions, and that the communicative functions in these categories do not fall in any specific dimension, but should be considered as “general-purpose” in the sense that they can be used in any dimension. We argue that using the notion of dimension that we propose, a multidimensional taxonomy of dialogue acts emerges that optimally supports multidimensional dialogue act annotation.

pdf bib
Interoperability of audio corpora : the case of the French corpora
Olivier Baude | Michel Jacobson | Atanas Tchobanov | Richard Walter

We present here the choices which were made within the framework of three oral corpora projects: sociolinguistic studies on Orleans (ESLO), Phonology of Contemporary French (PFC), and the Archivage corpus of the LACITO lab. This comparative presentation of three corpora of audio linguistic resources stems from an analysis of the options the projects had to adopt in order to describe them for discovery purposes and to compare their contents. The aim is to illustrate the value of considering interoperability and the methodology of codings and metadata. Through this step, we want to simplify the technical creation of audio corpora and thus the constitution of linguistic resources usable by enlarged academic and industrial communities.

pdf bib
Multilingual Search in Libraries. The case-study of the Free University of Bozen-Bolzano
R. Bernardi | D. Calvanese | L. Dini | V. Di Tomaso | E. Frasnelli | U. Kugler | B. Plank

This paper presents an on-going project aiming at enhancing the OPAC (Online Public Access Catalog) search system of the Library of the Free University of Bozen-Bolzano with multilingual access. The Multilingual search system (MUSIL), we have developed, integrates advanced linguistic technologies in a user friendly interface and bridges the gap between the world of free text search and the world of conceptual librarian search. In this paper we present the architecture of the system, its interface and preliminary evaluations of the precision of the search results.

pdf bib
A Factored Functional Dependency Transformation of the English Penn Treebank for Probabilistic Surface Generation
Irene Langkilde-Geary | Justin Betteridge

This paper describes a featurized functional dependency corpus automatically derived from the Penn Treebank. Each word in the corpus is associated with over three dozen features describing the functional syntactic structure of a sentence as well as some shallow morphology. The corpus was created for use in probabilistic surface generation, but could also be useful as a resource for the study of English and the development of other NLP applications.

pdf bib
Bootstrapping New Language ASR Capabilities: Achieving Best Letter-to-Sound Performance under Resource Constraints
Jim Talley

One of the most critical components in the process of building automatic speech recognition (ASR) capabilities for a new language is the lexicon, or pronouncing dictionary. For practical reasons, it is desirable to manually create only the minimal lexicon using available native-speaker phonetic expertise and, then, use the resulting seed lexicon for machine learning based induction of a high-quality letter-to-sound (L2S) model for generation of pronunciations for the remaining words of the language. This paper examines the viability of this scenario, specifically investigating three possible strategies for selection of lexemes (words) for manual transcription – choosing the most frequent lexemes of the language, choosing lexemes randomly, and selection of lexemes via an information theoretic diversity measure. The relative effectiveness of these three strategies is evaluated as a function of the number of lexemes to be transcribed to create a bootstrapping lexicon. Generally, the newly developed orthographic diversity based selection strategy outperforms the others for this scenario where a limited number of lexemes can be transcribed. The experiments also provide generally useful insight into expected L2S accuracy sacrifice as a function of decreasing training set size.
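The three selection strategies compared above differ only in how the lexemes to be transcribed are chosen. The sketch below is a rough stand-in for the diversity-based strategy, greedily picking words that add the most unseen letter bigrams; the paper's actual information-theoretic measure is not given in the abstract, so this criterion is an assumption.

```python
# Greedy diversity-based selection: repeatedly pick the word that contributes
# the most letter bigrams not yet covered by the chosen seed lexicon.
def char_ngrams(word, n=2):
    padded = f"#{word}#"
    return {padded[i:i + n] for i in range(len(padded) - n + 1)}

def select_diverse(lexemes, budget):
    covered, chosen = set(), []
    remaining = set(lexemes)
    for _ in range(min(budget, len(remaining))):
        best = max(remaining, key=lambda w: len(char_ngrams(w) - covered))
        chosen.append(best)
        covered |= char_ngrams(best)
        remaining.remove(best)
    return chosen

words = ["nation", "national", "nationally", "rhythm", "quartz"]
print(select_diverse(words, 3))  # favours orthographically dissimilar words
```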

pdf bib
Automated Summarization Evaluation with Basic Elements.
Eduard Hovy | Chin-Yew Lin | Liang Zhou | Junichi Fukumoto

As part of evaluating a summary automatically, it is usual to determine how much of the contents of one or more human-produced “ideal” summaries it contains. Past automated methods such as ROUGE compare summaries using fixed word n-grams, which are not ideal for a variety of reasons. In this paper we describe a framework in which summary evaluation measures can be instantiated and compared, and we implement a specific evaluation method using very small units of content, called Basic Elements, that addresses some of the shortcomings of n-grams. This method is tested on DUC 2003, 2004, and 2005 systems and produces very good correlations with human judgments.

pdf bib
Automatic Construction of Japanese WordNet
Hiroyuki Kaji | Mariko Watanabe

Although WordNets have been developed for a number of languages, no attempts to construct a Japanese WordNet have been known to exist. Taking this into account, we launched a project to automatically translate the Princeton WordNet into Japanese by a method of unsupervised word-sense disambiguation using bilingual comparable corpora. The method we propose aligns English word associations with those in Japanese and iteratively calculates a correlation matrix of Japanese translations of an English word versus its associated words. It then determines the Japanese translation for the English word in a synset by calculating scores for translation candidates according to the correlation matrix and the associated words appearing in the gloss appended to the synset. This method is not robust because a gloss only contains a few associated words. To overcome this difficulty, we extended the method so that it retrieves texts by using the gloss as a query and uses the retrieved texts as well as the gloss to calculate scores for translation candidates. A preliminary experiment using Wall Street Journal and Nihon Keizai Shimbun corpora demonstrated that the proposed method is promising for constructing a Japanese WordNet.

pdf bib
Generating Typed Dependency Parses from Phrase Structure Parses
Marie-Catherine de Marneffe | Bill MacCartney | Christopher D. Manning

This paper describes a system for extracting typed dependency parses of English sentences from phrase structure parses. In order to capture inherent relations occurring in corpus texts that can be critical in real-world applications, many NP relations are included in the set of grammatical relations used. We provide a comparison of our system with Minipar and the Link parser. The typed dependency extraction facility described here is integrated in the Stanford Parser, available for download.

pdf bib
FreP: An electronic tool for extracting frequency information of phonological units from Portuguese written text
S. Frota | M. Vigário | F. Martins

The importance of frequency for phonological phenomena has long been noticed in the literature. However, frequency information available for phonological units in Portuguese is scarce, non-replicable, corpus dependent, and hard to obtain due to the non-existence of a free tool for public use. This paper describes FreP, a new electronic tool that provides frequency counts of phonological units at the word-level and below from Portuguese written text: namely, major classes of segments, syllables and syllable types, phonological clitics, clitic types and size, prosodic words and their shape, word stress location, and syllable type by position within the word and/or status relative to word stress. Useful applications of FreP in general linguistics, phonology, language acquisition and development, speech evaluation and therapy are also described. Forthcoming extensions of the tool include the ability to extract frequency information for different varieties of Portuguese, Brazilian Portuguese in particular, and the ability to provide a SAMPA output from the written text, together with the frequency of segmental features, like manner, place of articulation and laryngeal features. Updated information on FreP can be found at http://www.fl.ul.pt/LaboratorioFonetica/FreP.

pdf bib
Querying Both Parallel And Treebank Corpora: Evaluation Of A Corpus Query System
Ulrik Petersen

The last decade has seen a large increase in the number of available corpus query systems. Some of these are optimized for a particular kind of linguistic annotation (e.g., time-aligned, treebank, word-oriented, etc.). In this paper, we report on our own corpus query system, called Emdros. Emdros is very generic, and can be applied to almost any kind of linguistic annotation using almost any linguistic theory. We describe Emdros and its query language, showing some of the benefits that linguists can derive from using Emdros for their corpora. We then describe the underlying database model of Emdros, and show how two corpora can be imported into the system. One of the two is a parallel corpus of Hungarian and English (the Hunglish corpus), while the other is a treebank of German (the TIGER Corpus). In order to evaluate the performance of Emdros, we then run some performance tests. These show that Emdros has extremely good performance on small corpora (less than 1 million words), and that it scales well to corpora of many millions of words.

pdf bib
Summarizing Answers for Complicated Questions
Liang Zhou | Chin-Yew Lin | Eduard Hovy

Recent work in several computational linguistics (CL) applications (especially question answering) has shown the value of semantics (in fact, many people argue that the current performance ceiling experienced by so many CL applications derives from their inability to perform any kind of semantic processing). But the absence of a large semantic information repository that provides representations for sentences prevents the training of statistical CL engines and thus hampers the development of such semantics-enabled applications. This talk refers to recent work in several projects that seek to annotate large volumes of text with shallower or deeper representations of some semantic phenomena. It describes one of the essential problems—creating, managing, and annotating (at large scale) the meanings of words, and outlines the Omega ontology, being built at ISI, that acts as term repository. The talk illustrates how one can proceed from words via senses to concepts, and how the annotation process can help verify good concept decisions and expose bad ones. Much of this work is performed in the context of the OntoNotes project, joint with BBN, the Universities of Colorado and Pennsylvania, and ISI, that is working to build a corpus of about 1M words (English, Chinese, and Arabic), annotated for shallow semantics, over the next few years.

pdf bib
Unified Lexicon and Unified Morphosyntactic Specifications for Written and Spoken Italian
Monica Monachini | Nicoletta Calzolari | Khalid Choukri | Jochen Friedrich | Giulio Maltese | Michele Mammini | Jan Odijk | Marisa Ulivieri

The goal of this paper is (1) to illustrate a specific procedure for merging different monolingual lexicons, focussing on techniques for detecting and mapping equivalent lexical entries, and (2) to sketch a production model that enables one to obtain lexical resources via unification of existing data. We describe the creation of a Unified Lexicon (UL) from a common sample of the Italian PAROLE-SIMPLE-CLIPS phonological lexicon and of the Italian LCSTAR pronunciation lexicon. We expand previous experiments carried out at ILC-CNR: based on a detailed mechanism for mapping grammatical classifications of candidate UL entries, a consensual set of Unified Morphosyntactic Specifications (UMS) shared by lexica for the written and spoken areas is proposed. The impact of the UL on cross-validation issues is analysed: by looking into conflicts, mismatches and diverging classifications can be detected in both resources. The work presented is in line with the activities promoted by ELRA towards the development of methods for packaging new language resources by combining independently created resources, and was carried out as part of the ELRA Production Committee activities. ELRA aims to exploit the UL experience to carry out such merging activities for resources available on the ELRA catalogue in order to fulfill the users' needs.

pdf bib
Compiling large language resources using lexical similarity metrics for domain taxonomy learning
Ronny Melz | Pum-Mo Ryu | Key-Sun Choi

In this contribution we present a new methodology to compile large language resources for domain-specific taxonomy learning. We describe the necessary stages to deal with the rich morphology of an agglutinative language, i.e. Korean, and point out a second order machine learning algorithm to unveil term similarity from a given raw text corpus. The language resource compilation described is part of a fully automatic top-down approach to construct taxonomies, without involving the human efforts which are usually required.

pdf bib
Tagset Mapping and Statistical Training Data Cleaning-up
Felix Pîrvan | Dan Tufiş

The paper describes a general method (as well as its implementation and evaluation) for deriving mapping systems for different tagsets available in existing training corpora (gold standards) for a specific language. For each pair of corpora (tagged with different tagsets), one such mapping system is derived. This mapping system is then used to improve the tagging of each of the two corpora with the tagset of the other (this process will be called cross-tagging). By reapplying the algorithm to the newly obtained corpora, the accuracy of the underlying training corpora can also be improved. Furthermore, comparing the results with the gold standards makes it possible to assess the distributional adequacy of various tagsets used in processing the language in case. Unlike other methods, such as those reported in (Brants, 1995) or (Tufis & Dragomirescu, 2004), which assume a subsumption relation between the considered tagsets, and as such they aim at minimizing the tagsets by eliminating the feature-value redundancy, this method is applicable for completely unrelated tagsets. Although the experiments were focused on morpho-syntactic (POS) tagging, the method is applicable to other types of tagging as well.

pdf bib
RoCo-News: A Hand Validated Journalistic Corpus of Romanian
Dan Tufiş | Elena Irimia

The paper briefly describes the RoCo project and, in detail, one of its first outcomes, the RoCo-News corpus. RoCo-News is a middle-sized journalistic corpus of Romanian, abundant in proper names, numerals and named entities. The initial raw text was first segmented with the MtSeg segmenter, then POS-annotated with the TNT tagger. RoCo-News was further lemmatized and validated. Because of limited human resources, time constraints and the size of the corpus, hand validation of each individual token was out of the question. The validation stage required a coherent methodology for automatically identifying as many POS annotation and lemmatization errors as possible. The hand validation process was focused on these automatically spotted possible errors. This methodology relied on three main techniques for automatic detection of potential errors: 1. when lemmatizing the corpus, we extracted all the triples that were not found in the word-form lexicon; 2. we checked the correctness of POS annotation for closed-class lexical categories, a technique described by (Dickinson & Meurers, 2003); 3. we exploited the hypothesis (Tufiş, 1999) according to which an accurately tagged text, re-tagged with the language model learnt from it (biased evaluation), should have more than 98% of tokens identically tagged.
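The third technique above (biased re-tagging) boils down to comparing the stored annotation with the output of a tagger trained on the corpus itself and flagging disagreements for hand checking. The sketch below shows only that comparison step; training and re-tagging are assumed to happen elsewhere, and the example tokens are invented.

```python
# Flag tokens where the stored gold tag and the biased re-tagged output
# disagree; these are candidates for manual validation.
def flag_disagreements(gold_tokens, retagged_tokens):
    """Both inputs are parallel lists of (word, tag). Returns suspect tokens."""
    suspects = []
    for i, ((w1, t1), (w2, t2)) in enumerate(zip(gold_tokens, retagged_tokens)):
        assert w1 == w2, "token streams must be parallel"
        if t1 != t2:
            suspects.append((i, w1, t1, t2))
    return suspects

gold = [("the", "DET"), ("plant", "NOUN"), ("closed", "VERB")]
retag = [("the", "DET"), ("plant", "VERB"), ("closed", "VERB")]
print(flag_disagreements(gold, retag))  # [(1, 'plant', 'NOUN', 'VERB')]
```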

pdf bib
Turning a Dependency Treebank into a PSG-style Constituent Treebank
Eckhard Bick

In this paper, we present and evaluate a new method to convert Constraint Grammar (CG) parses of running text into Constituent Treebanks. The conversion is two-step - first a grammar-based method is used to bridge the gap between raw CG annotation and full dependency structure, then phrase structure bracketing and non-terminal nodes are introduced by clustering sister dependents, effectively building one syntactic treebank on top of another. The method is compared with another approach (Bick 2003-2), where constituent structures are arrived at by employing a function-tag based Phrase Structure Grammar (PSG). Results are evaluated on a small reference corpus for both raw and revised CG input, with bracketing F-Scores of 87.5% for raw text and 97.1% for revised CG input, and a raw text edge label accuracy of 95.9% for forms and 86% for functions, or 99.7% and 99.4%, respectively, for revised CG. By applying the tools to the CG-only part of the Danish Arboretum treebank we were able to increase the size of the treebank by 86%, from 197.400 to 367.500 words.

pdf bib
Aligning Multilingual Thesauri
Dan Ştefănescu | Dan Tufiş

The aligning and merging of ontologies with overlapping information is currently one of the most active domains of investigation in the Semantic Web community. Multilingual lexical ontologies and thesauri are fundamental knowledge sources for most NLP projects addressing multilinguality. The alignment of multilingual lexical knowledge sources has various applications, ranging from knowledge acquisition to semantic validation of the interlingual equivalence of presumably the same meaning expressed in different languages. In this paper, we present a general method for aligning ontologies, which was used to align a conceptual thesaurus lexicalized in 20 languages with a partial version of it lexicalized in Romanian. The objective of our work was to align the existing terms in the Romanian Eurovoc to the terms in the English Eurovoc and to automatically update the Romanian Eurovoc. The general formulation of the ontology alignment problem was set up along the lines established by the Heterogeneity group of the KnowledgeWeb consortium, but the actual case study was motivated by the needs of a specific NLP project.

pdf bib
Dependency-Based Phrase Alignment
Radu Ion | Alexandru Ceauşu | Dan Tufiş

Phrase alignment is the task that requires the constituent phrases of two halves of a bitext to be aligned. In order to align phrases, one must discover them first and this article presents a method of aligning phrases that are discovered automatically. Here, the notion of a 'phrase' will be understood as being given by a subtree of a dependency-like structure of a sentence called linkage. To discover phrases, we will make use of two distinct, language independent methods: the IBM-1 model (Brown et al., 1993) adapted to detect linkages and Constrained Lexical Attraction Models (Ion & Barbu Mititelu, 2006). The methods will be combined and the resulted model will be used to annotate the bitext. The accuracy of phrase alignment will be evaluated by obtaining word alignments from link alignments and then by checking the F-measure of the latter word aligner.

pdf bib
Acquis Communautaire Sentence Alignment using Support Vector Machines
Alexandru Ceauşu | Dan Ştefănescu | Dan Tufiş

Sentence alignment is a task that requires not only accuracy, as possible errors can affect further processing, but also small computational resources and language-pair independence. Although many implementations do not use translation equivalents because they are dependent on the language pair, this feature is a requirement for increasing accuracy. The paper presents a hybrid sentence aligner that has two alignment iterations. The first iteration is based mostly on sentence lengths, and the second is based on a translation equivalents table estimated from the results of the first iteration. The aligner uses a Support Vector Machine classifier to discriminate between positive and negative examples of sentence pairs.
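A minimal sketch of the classification step, assuming scikit-learn's SVC and a toy feature vector combining sentence lengths with a translation-equivalence ratio; the actual features and training regime are not specified in the abstract.

```python
# Represent a candidate sentence pair with length-based and translation-
# equivalent features, then let an SVM decide whether the pair is aligned.
from sklearn.svm import SVC

def pair_features(src_len, tgt_len, equiv_ratio):
    """src_len/tgt_len: sentence lengths in tokens; equiv_ratio: fraction of
    source words with a translation equivalent found in the target sentence."""
    return [src_len, tgt_len, src_len / max(tgt_len, 1), equiv_ratio]

# Toy training data: positive (aligned) vs. negative (misaligned) pairs.
X = [pair_features(20, 22, 0.7), pair_features(8, 9, 0.6),    # positives
     pair_features(20, 5, 0.1), pair_features(3, 25, 0.05)]   # negatives
y = [1, 1, 0, 0]

clf = SVC(kernel="rbf").fit(X, y)
print(clf.predict([pair_features(15, 16, 0.65)]))  # likely predicted as aligned
```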

pdf bib
Rule-Based Chunking and Reusability
Claire Grover | Richard Tobin

In this paper we discuss a rule-based approach to chunking implemented using the LT-XML2 and LT-TTT2 tools. We describe the tools and the pipeline and grammars that have been developed for the task of chunking. We show that our rule-based approach is easy to adapt to different chunking styles and that the mark-up of further linguistic information such as nominal and verbal heads can be added to the rules at little extra cost. We evaluate our chunker against the CoNLL 2000 data and discuss discrepancies between our output and the CoNLL mark-up as well as discrepancies within the CoNLL data itself. We contrast our results with the higher scores obtained using machine learning and argue that the portability and flexibility of our approach still make it a more practical solution.

pdf bib
Reconsidering Language Identification for Written Language Resources
Baden Hughes | Timothy Baldwin | Steven Bird | Jeremy Nicholson | Andrew MacKinlay

The task of identifying the language in which a given document (ranging from a sentence to thousands of pages) is written has been relatively well studied over several decades. Automated approaches to written language identification are used widely throughout research and industrial contexts, over both oral and written source materials. Despite this widespread acceptance, a review of previous research in written language identification reveals a number of questions which remain open and ripe for further investigation.

pdf bib
Automatic Terminology Intelligibility Estimation for Readership-oriented Technical Writing
Yasuko Senda | Yasusi Sinohara | Manabu Okumura

This paper describes automatic terminology intelligibility estimation for readership-oriented technical writing. We assume that the term frequency weighted by the types of documents can be an indicator of the term intelligibility for a certain readership. From this standpoint, we analyzed the relationship between the following: average intelligibility levels of 46 technical terms that were rated by about 120 laymen; numbers of documents that an Internet search

pdf bib
SYMBERED - a Symbol-Concept Editing Tool
Mats Lundälv | Katarina Mühlenbock | Bengt Farre | Annika Brännström

The aim of the Nordic SYMBERED project - funded by NUH (the Nordic Development Centre for Rehabilitation Technology) - is to develop a user friendly editing tool that makes use of concept coding to produce web pages with flexible graphical symbol support targeted towards people with Augmentative and Alternative Communication (AAC) needs. Documents produced with the editing tool will be in XML/XHTML format, well suited for publishing on the Internet. These documents will then contain natural language text, such as Swedish or English. Some, or all, of the words in the text will be marked with a concept code defining its meaning. The coded words/concepts may then easily be represented by alternative kinds of graphical symbols and by additional text representations in alternative languages. Thus, within one web document created by the author with the SYMBERED tool, one symbol language can easily be swapped for another. This means that a Bliss and a PCS symbol user can each have his/her preferred kind of symbol support. The SYMBERED editing tool will initially support a limited vocabulary in four to five Nordic languages plus English, and three to four symbol systems, with built-in extensibility to cover more languages and symbol systems.

pdf bib
Automated Deep Lexical Acquisition for Robust Open Texts Processing
Yi Zhang | Valia Kordoni

In this paper, we report on methods to detect and repair lexical errors for deep grammars. The lack of coverage has long been the major problem for deep processing. The existence of various errors in hand-crafted large grammars prevents their usage in real applications. The manual detection and repair of errors requires a significant amount of human effort. An experiment with the British National Corpus shows that about 70% of the sentences contain unknown word(s) for the English Resource Grammar. With the help of error mining methods, many lexical errors are discovered, which cause a large part of the parsing failures. Moreover, with a lexical type predictor based on a maximum entropy model, new lexical entries are automatically generated. The contribution of various features for the model is evaluated. With the disambiguated full parsing results, the precision of the predictor is enhanced significantly.

pdf bib
Manual Annotation and Automatic Image Processing of Multimodal Emotional Behaviours: Validating the Annotation of TV Interviews
J.-C. Martin | G. Caridakis | L. Devillers | K. Karpouzis | S. Abrilian

There has been a great deal of psychological research on emotion and nonverbal communication. Yet these studies were based mostly on acted basic emotions. This paper explores how manual annotation and image processing can cooperate towards the representation of spontaneous emotional behaviour in low-resolution videos from TV. We describe a corpus of TV interviews and the manual annotations that have been defined. We explain the image processing algorithms that have been designed for the automatic estimation of movement quantity. Finally, we explore how image processing can be used for the validation of manual annotations.

pdf bib
WS4LR: A Workstation for Lexical Resources
Cvetana Krstev | Ranka Stanković | Duško Vitas | Ivan Obradović

In this paper we describe WS4LR, the workstation for lexical resources, a software tool developed within the Human Language Technology Group at the Faculty of Mathematics, University of Belgrade. The tool is aimed at manipulating heterogeneous lexical resources, and the need for such a tool came from the large volume of resources the Group has developed in the course of many years and within different projects. The tool handles morphological dictionaries, wordnets, aligned texts and transducers equally and has already proved very useful for various tasks. Although it has so far been used mainly for Serbian, WS4LR is not language dependent and can be successfully used for resources in other languages provided that they follow the described formats and methodologies. The tool operates on the .NET platform and runs on a personal computer under Windows 2000/XP/2003 operating system with at least 256MB of internal memory.

pdf bib
Extending VerbNet with Novel Verb Classes
Karin Kipper | Anna Korhonen | Neville Ryant | Martha Palmer

Lexical classifications have proved useful in supporting various natural language processing (NLP) tasks. The largest verb classification for English is Levin's (1993) work which defined groupings of verbs based on syntactic properties. VerbNet - the largest computational verb lexicon currently available for English - provides detailed syntactic-semantic descriptions of Levin classes. While the classes included are extensive enough for some NLP use, they are not comprehensive. Korhonen and Briscoe (2004) have proposed a significant extension of Levin's classification which incorporates 57 novel classes for verbs not covered (comprehensively) by Levin. This paper describes the integration of these classes into VerbNet. The result is the most extensive Levin-style classification for English verbs which can be highly useful for practical applications.

pdf bib
Towards a Generative Lexical Resource: The Brandeis Semantic Ontology
James Pustejovsky | Catherine Havasi | Jessica Littman | Anna Rumshisky | Marc Verhagen

In this paper we describe the structure and development of the Brandeis Semantic Ontology (BSO), a large generative lexicon ontology and lexical database. The BSO has been designed to allow for more widespread access to Generative Lexicon-based lexical resources and help researchers in a variety of computational tasks. The specification of the type system used in the BSO largely follows that proposed by the SIMPLE specification (Busa et al., 2001), which was adopted by the EU-sponsored SIMPLE project (Lenci et al., 2000).

pdf bib
Act-Topic Patterns for Automatically Checking Dialogue Models
Hans Dybkjær | Laila Dybkjær

When dialogue models are evaluated today, this is normally done by using some evaluation method to collect data, often involving users interacting with the system model, and then subsequently analysing the collected data. We present a tool called DialogDesigner that enables automatic evaluation performed directly on the dialogue model and that does not require any data collection first. DialogDesigner is a tool in support of rapid design and evaluation of dialogue models. The first version was developed in 2005 and enabled developers to create an electronic dialogue model, get various graphical views of the model, run a Wizard-of-Oz (WOZ) simulation session, and extract different presentations in HTML. The second version includes extensions in terms of support for automatic dialogue model evaluation. Various aspects of dialogue model well-formedness can be automatically checked. Some of the automatic analyses simply perform checks based on the state and transition structure of the dialogue model while the core part are based on act-topic annotation of prompts and transitions in the dialogue model and specification of act-topic patterns. This paper focuses on the version 2 extensions.

pdf bib
Predicting MT Quality as a Function of the Source Language
David M. Rojas | Takako Aikawa

This paper describes one phase of a large-scale machine translation (MT) quality assurance project. We explore a novel approach to discriminating MT-unsuitable source sentences by predicting the expected quality of the output. The resources required include a set of source/MT sentence pairs, human judgments on the output, a source parser, and an MT system. We extract a number of syntactic, semantic, and lexical features from the source sentences only and train a classifier that we call the “Syntactic, Semantic, and Lexical Model” (SSLM) (cf. Gamon et al., 2005; Liu & Gildea, 2005; Rajman & Hartley, 2001). Despite the simplicity of the approach, SSLM scores correlate with human judgments and can help determine whether sentences are suitable or unsuitable for translation by our MT system. SSLM also provides information about which source features impact MT quality, connecting this work with the field of controlled language (CL) (cf. Reuther, 2003; Nyberg & Mitamura, 1996). With a focus on the input side of MT, SSLM differs greatly from evaluation approaches such as BLEU (Papineni et al., 2002), NIST (Doddington, 2002) and METEOR (Banerjee & Lavie, 2005) in that these other systems compare MT output with reference sentences for evaluation and do not provide feedback regarding potentially problematic source material. Our method bridges the research areas of CL and MT evaluation by addressing the importance of providing “MT-suitable” English input to enhance output quality.

pdf bib
Named Entity Extraction with Conjunction Disambiguation
Paweł Mazur | Robert Dale

The recognition of named entities is now a well-developed area, with a range of symbolic and machine learning techniques that deliver high accuracy extraction and categorisation of a variety of entity types. However, there are still some named entity phenomena that present problems for existing techniques; in particular, relatively little work has explored the disambiguation of conjunctions appearing in candidate named entity strings. We demonstrate that there are in fact four distinct uses of conjunctions in the context of named entities; we present some experiments using machine-learned classifiers to disambiguate the different uses of the conjunction, with 85% of test examples being correctly classified.

pdf bib
Functioning of the Centre for Dutch Language and Speech Technology
Michel Boekestein | Griet Depoorter | Remco van Veenendaal

The TST Centre manages a broad collection of Dutch digital language resources. It is an initiative of the Dutch Language Union (Nederlandse Taalunie), and is meant to reinforce research in the area of language and speech technology. It does this by stimulating the reuse of these language resources. The TST Centre keeps these resources up to date, facilitates their availability, and offers services such as providing information, documentation, online access, offering catalogues, custom-made data, etc. Also, the TST Centre strives for a uniformised, if not standardised, treatment of language resources of the same nature. A well-thought, structured administration system is needed to manage the various language resources, their updates, derived products, IPR, user administration, etc. We will discuss the organisation, tasks and services of the TST Centre, and the language resources it maintains. Also, we will look into practical data management solutions, IPR issues, and our activities in standardisation and linking language resources.

pdf bib
The MULINCO corpus and corpus platform
Bente Maegaard | Lene Offersgaard | Lina Henriksen | Hanne Jansen | Xavier Lepetit | Costanza Navarretta | Claus Povlsen

The MULINCO project (MUltiLINgual Corpus of the University of Copenhagen) started early 2005. The purpose of this cross-disciplinary project is to create a corpus platform for education and research in monolingual and translation studies. The project covers two main types of corpus texts: literary and non-literary. The platform is being developed using available tools as far as possible, and integrating them in a very open architecture. In this paper we describe the current status and future developments of both the text and tool side of the corpus platform, and we show some examples of student exercises taking advantage of tagged and aligned texts.

pdf bib
Moving to dynamic computational lexicons with LeXFlow
Claudia Soria | Maurizio Tesconi | Francesca Bertagna | Nicoletta Calzolari | Andrea Marchetti | Monica Monachini

In this paper we present LeXFlow, a web application framework where lexicons already expressed in standardised format semi-automatically interact by reciprocally enriching themselves. LeXFlow is intended, on the one hand, for paving the way to the development of dynamic multi-source lexicons; and on the other, for fostering the adoption of standards. Borrowing from techniques used in the domain of document workflows, we model the activity of lexicon management as a particular case of workflow instance, where lexical entries move across agents and become dynamically updated. To this end, we have designed a lexical flow (LF) corresponding to the scenario where an entry of a lexicon A becomes enriched via basically two steps. First, by virtue of being mapped onto a corresponding entry belonging to a lexicon B, the entry (LA) inherits the semantic relations available in lexicon B. Second, by resorting to an automatic application that acquires information about semantic relations from corpora, the relations acquired are integrated into the entry and proposed to the human encoder. As a result of the lexical flow, in addition, for each starting lexical entry (LA) mapped onto a corresponding entry (LB) the flow produces a new entry representing the merging of the original two.

pdf bib
Identifying Named Entities in Text Databases from the Natural History Domain
Caroline Sporleder | Marieke van Erp | Tijn Porcelijn | Antal van den Bosch | Pim Arntzen

In this paper, we investigate whether it is possible to bootstrap a named entity tagger for textual databases by exploiting the database structure to automatically generate domain and database-specific gazetteer lists. We compare three tagging strategies: (i) using the extracted gazetteers in a look-up tagger, (ii) using the gazetteers to automatically extract training data to train a database-specific tagger, and (iii) using a generic named entity tagger. Our results suggest that automatically built gazetteers in combination with a look-up tagger lead to a relatively good performance and that generic taggers do not perform particularly well on this type of data.
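
The gazetteer look-up strategy (i) can be pictured with the sketch below: gazetteer lists extracted from database fields are matched against tokenised text, longest span first. The gazetteer contents and label set are invented examples, not the lists actually derived from the natural history database in the paper.

```python
# Illustrative sketch of a gazetteer look-up tagger. The gazetteer entries and
# labels are toy examples standing in for lists extracted from database fields.
GAZETTEERS = {
    "SPECIES": {"rana temporaria", "bufo bufo"},
    "LOCATION": {"sumatra", "java"},
}

def lookup_tag(tokens, gazetteers=GAZETTEERS, max_len=3):
    tags = ["O"] * len(tokens)
    i = 0
    while i < len(tokens):
        for n in range(max_len, 0, -1):               # longest match first
            span = " ".join(tokens[i:i + n]).lower()
            label = next((l for l, names in gazetteers.items() if span in names), None)
            if label:
                tags[i:i + n] = [f"B-{label}"] + [f"I-{label}"] * (n - 1)
                i += n
                break
        else:
            i += 1
    return tags

print(lookup_tag("We caught Rana temporaria on Java".split()))
```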

pdf bib
Developing Speech Synthesis for Under-Resourced Languages by “Faking it”: An Experiment with Somali
Harold Somers | Gareth Evans | Zeinab Mohamed

Speech synthesis or text-to-speech (TTS) systems are currently available for a number of the world's major languages, but for thousands of other, unsupported, languages no such technology is available. While awaiting the development of such technology, we propose using an existing TTS system for a major language (the base language, BL) to "fake" TTS for an unsupported language (the target language, TL). This paper describes the factors which determine the choice of a suitable BL for a given TL, and describes an experiment with a fake Somali TTS system evaluated in the real-life situation of a doctor–patient dialogue. 28 Somali participants were asked to judge the comprehensibility of 25 short Somali sentences recorded with a German TTS system. Results suggest that "faking it" provides a reasonable stop-gap TTS solution for unsupported languages.
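
The "faking it" idea can be sketched as a simple respelling step: target-language text is rewritten with spellings the base-language synthesiser will pronounce approximately right before it is sent to that synthesiser. The Somali-to-German grapheme mappings below are invented placeholders, not the rules used in the experiment.

```python
# Illustrative sketch of "faking" TTS: respell target-language (TL) text so that
# a base-language (BL) synthesiser pronounces it approximately right.
# The Somali-to-German mappings below are invented examples only.
GRAPHEME_MAP = [
    ("sh", "sch"),   # TL digraph -> closest BL spelling
    ("x", "h"),
    ("c", ""),       # pharyngeal consonant with no close BL equivalent
    ("w", "u"),
]

def respell_for_bl(text: str) -> str:
    """Rewrite TL text with BL spelling conventions, longest rules first."""
    out = text.lower()
    for src, tgt in sorted(GRAPHEME_MAP, key=lambda p: -len(p[0])):
        out = out.replace(src, tgt)
    return out

if __name__ == "__main__":
    sentence = "waan ku arkaa"          # hypothetical Somali input
    print(respell_for_bl(sentence))     # string to send to the German TTS
```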

pdf bib
Using Richly Annotated Trilingual Language Resources for Acquiring Reading Skills in a Foreign Language
Dragoş Ciobanu | Tony Hartley | Serge Sharoff

In an age when demand for innovative and motivating language teaching methodologies is at a very high level, TREAT - the Trilingual REAding Tutor - combines the most advanced natural language processing (NLP) techniques with the latest second and third language acquisition (SLA/TLA) research in an intuitive and user-friendly environment that has been proven to help adult learners (native speakers of L1) acquire reading skills in an unknown L3 which is related to (cognate with) an L2 they know to some extent. This corpus-based methodology relies on existing linguistic resources, as well as materials that are easy to assemble, and can be adapted to support other pairs of L2-L3 related languages, as well. A small evaluation study conducted at the Leeds University Centre for Translation Studies indicates that, when using TREAT, learners feel more motivated to study an unknown L3, acquire significant linguistic knowledge of both the L3 and L2 rapidly, and increase their performance when translating from L3 into L1.

pdf bib
A Development Tool For Multilingual Ontology-based Conceptual
G. Ajani | G. Boella | L. Lesmo | M. Martin | A. Mazzei | P. Rossi

This paper introduces a number of theoretical and practical issues related to the “Syllabus”. Syllabus is a multilingual ontology-based tool, designed to improve the application of the European Directives in the various European countries.

pdf bib
KUNSTI - Knowledge Generation for Norwegian Language Technology
Bente Maegaard | Jens-Erik Fenstad | Lars Ahrenberg | Knut Kvale | Katarina Mühlenbock | Bernt-Erik Heid

KUNSTI is the Norwegian national language technology programme, running 2001-2006 inclusive. The goal of the programme is to boost Norwegian language technology research. In this paper we describe the background, the objectives, the methodology applied in the management of the programme, the projects selected, and our first conclusions. We also describe national programmes from Sweden, France and Germany and compare objectives and methods.

pdf bib
Using a morphological analyzer in high precision POS tagging of Hungarian
Péter Halácsy | András Kornai | Csaba Oravecz | Viktor Trón | Dániel Varga

The paper presents an evaluation of maxent POS disambiguation systems that incorporate an open source morphological analyzer to constrain the probabilistic models. The experiments show that the best proposed architecture, which is the first application of the maximum entropy framework in a Hungarian NLP task, outperforms comparable state of the art tagging methods and is able to handle out of vocabulary items robustly, allowing for efficient analysis of large (web-based) corpora.
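
The constraining idea, letting the morphological analyzer restrict the tags among which the probabilistic disambiguator may choose, can be sketched as follows. The toy analyzer output, tagset and scores are invented stand-ins; the paper's maxent model and Hungarian tagset are of course far richer.

```python
# Illustrative sketch: the disambiguator may only choose among the tags
# proposed by the morphological analyzer for a known word, and falls back to
# the full tagset for out-of-vocabulary items. `tag_scores` stands in for the
# per-tag scores of a trained probabilistic model.
FULL_TAGSET = {"NOUN", "VERB", "ADJ", "ADV", "DET"}

def analyzer(word):
    # Hypothetical analyzer output; a real system would call an open source
    # morphological analyzer here.
    toy = {"vár": {"NOUN", "VERB"}}
    return toy.get(word, set())

def constrained_tag(word, tag_scores):
    candidates = analyzer(word) or FULL_TAGSET      # OOV -> no constraint
    return max(candidates, key=lambda t: tag_scores.get(t, 0.0))

print(constrained_tag("vár", {"NOUN": 0.4, "VERB": 0.55, "ADJ": 0.9}))  # VERB
```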

pdf bib
Ongoing Developments in Automatically Adapting Lexical Resources to the Biomedical Domain
Dominic Widdows | Adil Toumouh | Beate Dorow | Ahmed Lehireche

This paper describes a range of experiments using empirical methods to adapt the WordNet noun ontology for specific use in the biomedical domain. Our basic technique is to extract relationships between terms using the Ohsumed corpus, a large collection of abstracts from PubMed, and to compare the relationships extracted with those that would be expected for medical terms, given the structure of the WordNet ontology. The linguistic methods involve the use of a variety of lexicosyntactic patterns that enable us to extract pairs of coordinate noun terms, and also related groups of adjectives and nouns, using Markov clustering. This enables us in many cases to analyse ambiguous words and select the correct meaning for the biomedical domain. While results are often encouraging, the paper also highlights evident problems and drawbacks with the method, and outlines suggestions for future work.
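
One family of the lexicosyntactic patterns mentioned, coordinate noun terms joined by a conjunction, can be approximated with a simple surface pattern, as in the sketch below. The regular expression and the sample sentences are illustrative only; the actual patterns operate over linguistically analysed text from the Ohsumed corpus.

```python
# Illustrative sketch: extracting coordinate noun pairs ("X and Y", "X or Y")
# from raw text with a simple regular expression. A real system would work on
# POS-tagged text; the pattern and sample sentences here are toy examples.
import re
from collections import Counter

PATTERN = re.compile(r"\b([a-z]+)\s+(?:and|or)\s+([a-z]+)\b")

corpus = [
    "treatment of asthma and bronchitis in adults",
    "symptoms such as fever or nausea were recorded",
]

pairs = Counter()
for sentence in corpus:
    for left, right in PATTERN.findall(sentence.lower()):
        pairs[(left, right)] += 1

print(pairs.most_common())
```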

pdf bib
Multilevel corpus analysis: generating and querying an AGset of spoken Italian (SpIt-MDb).
Renata Savy | Francesco Cutugno | Claudia Crocco

In this paper we present an application of AGTK to a corpus of spoken Italian annotated at many different linguistic levels. The work consists of two parts: a) the presentation of AG-SpIt, a toolkit devoted to corpus data management that we developed according to AGTK proposals; b) the presentation of the corpus structure together with some examples and results of cross-level linguistic analyses obtained by querying the database (SpIt-MDb). As this work is still an ongoing investigation, the results must be considered preliminary, serving as a ‘demo’ that illustrates the potential of the tool and the advantages it offers for validating linguistic theories and annotation systems. Currently, SpIt-MDb is a linguistic resource under development; it represents one of the first attempts to create an Italian corpus labelled at various linguistic levels (from acoustic/sub-phonetic to textual/pragmatic ones) which can be queried across the interrelations among levels.

pdf bib
Feature-based Encoding and Querying Language Resources with Character Semantics
Baden Hughes | Dafydd Gibbon | Thorsten Trippel

In this paper we discuss the explicit representation of character features pertaining to written language resources, which we argue is critically necessary for the long-term archiving of language data. Much of the focus in the creation and preservation of language resources is at the level of the corpus itself; however, it is generally accepted that the long-term interpretation of these language resources requires more than a best-practice data format. In particular, where language resources are created in linguistic fieldwork, and especially for minority languages, the need to preserve not only the resource itself but also additional metadata that allows the resource to be accurately interpreted in the future is becoming a topic of research in itself. In this paper we extend earlier work on semantically based character decomposition to include the representation of character properties in a variety of models, and a mechanism for exploiting these properties through queries.

pdf bib
Building lexical resources for PrincPar, a large coverage parser that generates principled semantic representations
Rajen Subba | Barbara Di Eugenio | Elena Terenzi

Parsing, one of the more successful areas of Natural Language Processing, has mostly been concerned with syntactic structure. Though uncovering the syntactic structure of sentences is very important, in many applications a meaning representation for the input must be derived as well. We report on PrincPar, a parser that builds full meaning representations. It integrates LCFLEX, a robust parser, with a lexicon and ontology derived from two lexical resources, VerbNet and CoreLex, that represent the semantics of verbs and nouns respectively. We show that these two different lexical resources that focus on verbs and nouns can be successfully integrated. We report parsing results on a corpus of instructional text and assess the coverage of those lexical resources. Our evaluation metric is the number of verb frames that are assigned a correct semantics: 72.2% of verb frames are assigned a perfect semantics, and another 10.9% are assigned a partially correct semantics. Our ultimate goal is to develop a (semi)automatic method to derive domain knowledge from instructional text, in the form of linguistically motivated action schemes.

pdf bib
Automatic Detection and Semi-Automatic Revision of Non-Machine-Translatable Parts of a Sentence
Kiyotaka Uchimoto | Naoko Hayashida | Toru Ishida | Hitoshi Isahara

We developed a method for automatically distinguishing the machine-translatable and non-machine-translatable parts of a given sentence for a particular machine translation (MT) system. They can be distinguished by calculating the similarity between a source-language sentence and its back translation for each part of the sentence. The parts with low similarities are highly likely to be non-machine-translatable parts. We showed that the parts of a sentence that are automatically distinguished as non-machine-translatable provide useful information for paraphrasing or revising the sentence in the source language to improve the quality of the translation by the MT system. We also developed a method of providing knowledge useful to effectively paraphrasing or revising the detected non-machine-translatable parts. Two types of knowledge were extracted from the EDR dictionary: one for transforming a lexical entry into an expression used in the definition and the other for conducting the reverse paraphrasing, which transforms an expression found in a definition into the lexical entry. We found that the information provided by the methods helped improve the machine translatability of the originally input sentences.
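
The core back-translation test can be sketched as follows: each part of a sentence is translated into the target language and back, and parts whose back-translation diverges from the original are flagged as likely non-machine-translatable. Here `mt_translate` is a hypothetical stand-in for calls to a real MT system, and word-overlap (Dice) similarity stands in for the similarity measure actually used.

```python
# Illustrative sketch of the back-translation test. `mt_translate` is a
# hypothetical placeholder for a real MT system; Dice word overlap stands in
# for the similarity measure used in the paper.
def mt_translate(text: str, src: str, tgt: str) -> str:
    raise NotImplementedError("plug in a real MT system here")

def dice_similarity(a: str, b: str) -> float:
    sa, sb = set(a.lower().split()), set(b.lower().split())
    if not sa or not sb:
        return 0.0
    return 2 * len(sa & sb) / (len(sa) + len(sb))

def flag_non_translatable(parts, src="ja", tgt="en", threshold=0.5):
    """Return the sentence parts whose back-translation diverges from the original."""
    flagged = []
    for part in parts:
        back = mt_translate(mt_translate(part, src, tgt), tgt, src)
        if dice_similarity(part, back) < threshold:
            flagged.append(part)
    return flagged
```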

pdf bib
Exploring opportunities for Comparability and Enrichment by Linking lexical databases
Isa Maks | Bob Boelhouwer

Results are presented of an ongoing project of the Dutch TST-centre for language and speech technology aiming at the linking of various lexical databases. The project involves four Dutch monolingual lexicons: WlNT05, e-Lex, RBN and RBBN. These databases differ in organisational structure and content. To enable linkage between these lexicons, we developed a common feature value set and a common organisational structure. Both are based upon existing standards for the creation and reusability of lexicons: the Lexical Markup Framework and the EAGLES standard. Examples of the content and structure of each of the lexical databases are presented in their original form. The structure and content are also shown when mapped onto the common framework and feature value set. Thus, the commonalities and the complementarity of the lexical databases are more readily apparent. Moreover, this elaboration of the databases opens up the opportunity for mutual enrichment.

pdf bib
Multilingual Multidocument Summarization Tools and Evaluation
Horacio Saggion

We describe a number of experiments carried out to address the problem of creating summaries from multiple sources in multiple languages. A centroid-based sentence extraction system has been developed which decides the content of the summary using texts in different languages and uses sentences from English sources alone to create the final output. We describe the evaluation of the system in the recent Multilingual Summarization Evaluation MSE 2005 using the pyramids and ROUGE methods.
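
Centroid-based sentence extraction of the kind described can be sketched as follows: a bag-of-words centroid is built over the whole multilingual document cluster, and only the English sentences are scored against it and selected. Weighting, translation and redundancy control are all simplified here relative to the actual system.

```python
# Illustrative sketch of centroid-based sentence extraction. The centroid is a
# bag-of-words frequency vector over the whole document cluster; each English
# sentence is scored by the summed centroid weight of its words. Weighting and
# redundancy handling are simplified relative to the system in the paper.
from collections import Counter

def build_centroid(documents):
    return Counter(w.lower() for doc in documents for w in doc.split())

def summarize(english_sentences, centroid, max_sentences=3):
    scored = sorted(
        english_sentences,
        key=lambda s: sum(centroid[w.lower()] for w in s.split()),
        reverse=True,
    )
    return scored[:max_sentences]

cluster = [
    "The volcano erupted on Monday, forcing thousands to evacuate.",
    "Le volcan est entré en éruption lundi.",   # non-English sources also
]                                               # contribute to the centroid
english = [
    "Thousands were evacuated after the volcano erupted.",
    "Officials gave a press conference.",
]
print(summarize(english, build_centroid(cluster), max_sentences=1))
```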

pdf bib
Building a network of topical relations from a corpus
Olivier Ferret

Lexical networks such as WordNet are known to lack topical relations, although these relations are very useful for tasks such as text summarization or information extraction. In this article, we present a method for automatically building, from a large corpus, a lexical network whose relations are preferably topical ones. As it does not rely on resources such as dictionaries, this method is based on self-bootstrapping: a network of lexical cooccurrences is first built from a corpus and then filtered by using the words of the corpus that are selected by the initial network. We report an evaluation on topic segmentation showing that the results obtained with the filtered network are the same as those obtained with the initial network, although the former is significantly smaller than the latter.
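
The first step of the method, building the network of lexical cooccurrences from a corpus with a sliding window, can be sketched as below; the self-bootstrapping filter that then prunes the network with the words selected by the initial network is not reproduced. The window size and frequency threshold are arbitrary illustrative values.

```python
# Illustrative sketch: building a network of lexical cooccurrences with a
# sliding text window. The self-bootstrapping filtering step described in the
# article is not reproduced here; window and threshold values are arbitrary.
from collections import defaultdict

def cooccurrence_network(tokenized_sentences, window=10, min_count=2):
    counts = defaultdict(int)
    for tokens in tokenized_sentences:
        for i in range(len(tokens)):
            for j in range(i + 1, min(i + window, len(tokens))):
                pair = tuple(sorted((tokens[i], tokens[j])))
                if pair[0] != pair[1]:
                    counts[pair] += 1
    # keep only sufficiently frequent cooccurrence edges
    return {pair: c for pair, c in counts.items() if c >= min_count}
```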

pdf bib
The role of lexical resources in matching classification schemas
P. Bouquet | L. Serafini | S. Zanobini

In this paper, we describe the role and the use of WORDNET as an external lexical resource in a methodology for matching hierarchical classification schemas. The main difference between our methodology and others previously presented is that we devote considerable effort to eliciting the meaning of the structures we match, and we do this by making extensive use of lexical knowledge about the words occurring in labels. The result of this elicitation process is encoded in a formal language, called WDL (WORDNET Description Logic), which is our proposal for injecting lexical semantics into more standard knowledge representation languages.

pdf bib
Dealing with Imbalanced Data using Bayesian Techniques
Manolis Maragoudakis | Katia Kermanidis | Aristogiannis Garbis | Nikos Fakotakis

In the present work, we deal with the significant problem of high imbalance in data in binary or multi-class classification problems. We study two different linguistic applications. The former determines whether a syntactic construction (environment) that co-occurs with a verb in a natural text corpus constitutes a subcategorization frame of the verb or not. The latter is Named Entity Recognition (NER), and it concerns determining whether a noun belongs to a specific Named Entity class. Regarding the subcategorization domain, each environment is encoded as a vector of heterogeneous attributes, where a very high imbalance between positive and negative examples is observed (an imbalance ratio of approximately 1:80). In the NER application, the imbalance between a named entity class and the negative class is even greater (1:120). In order to confront the plethora of negative instances, we suggest a search tactic during the training phase that employs Tomek links for removing unnecessary negative examples from the training set. Regarding the classification mechanism, we argue that Bayesian networks are well suited, and we propose a novel network structure which efficiently handles heterogeneous attributes without discretization and is more classification-oriented. Comparing the experimental results with those of other known machine learning algorithms, our methodology performs significantly better in detecting examples of the rare class.
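
Tomek-link removal, the undersampling tactic mentioned for pruning negative examples, can be sketched as follows, assuming dense numeric feature vectors; the heterogeneous attributes and the Bayesian network classifier of the paper are not shown.

```python
# Illustrative sketch of Tomek-link undersampling: a Tomek link is a pair of
# examples from opposite classes that are each other's nearest neighbours;
# removing the majority-class member of each link cleans the class boundary.
import numpy as np

def remove_tomek_links(X, y, majority_label):
    X, y = np.asarray(X, dtype=float), np.asarray(y)
    dist = np.linalg.norm(X[:, None, :] - X[None, :, :], axis=2)
    np.fill_diagonal(dist, np.inf)
    nearest = dist.argmin(axis=1)
    drop = set()
    for i, j in enumerate(nearest):
        if nearest[j] == i and y[i] != y[j]:    # mutual nearest, opposite class
            drop.add(i if y[i] == majority_label else j)
    keep = [i for i in range(len(y)) if i not in drop]
    return X[keep], y[keep]
```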

pdf bib
Design and acquisition of a telephone spontaneous speech dialogue corpus in Spanish: DIHANA
José-Miguel Benedí | Eduardo Lleida | Amparo Varona | María-José Castro | Isabel Galiano | Raquel Justo | Iñigo López de Letona | Antonio Miguel

In the framework of the DIHANA project, we present the acquisition process of a spontaneous speech dialogue corpus in Spanish. The selected application consists of information retrieval by telephone for nationwide trains. A total of 900 dialogues from 225 users were acquired using the “Wizard of Oz” technique. In this work, we present the design and planning of the dialogue scenes and the wizard strategy used for the acquisition of the corpus. Then, we also present the acquisition tools and a description of the acquisition process.

pdf bib
The Representation of German Prepositional Verbs in a Semantically Based Computer Lexicon
Rainer Osswald | Hermann Helbig | Sven Hartrumpf

We describe the treatment of verbs with prepositional complements in HaGenLex, a semantically based computer lexicon for German. Prepositional verbs such as “bestehen auf” (insist on) subcategorize for a prepositional phrase where the preposition usually has no independent meaning of its own. The lexical semantic information in HaGenLex is specified by means of MultiNet, a full-fledged knowledge representation formalism, which proves to be particularly useful for representing the semantics of verbs with prepositional complements. We indicate how the semantic representation in HaGenLex can be used to define semantic classes of prepositional verbs and briefly discuss the relation of these classes to Levin's verb classes. Moreover, we present first results on the automatic identification of prepositional verbs by corpus-based methods.

pdf bib
Evaluation of multilingual text alignment systems: the ARCADE II project
Yun-Chuang Chiao | Olivier Kraif | Dominique Laurent | Thi Minh Huyen Nguyen | Nasredine Semmar | François Stuck | Jean Véronis | Wajdi Zaghouani

This paper describes the ARCADE II project, concerned with the evaluation of parallel text alignment systems. The ARCADE II project aims at exploring the techniques of multilingual text alignment through a fine evaluation of the existing techniques and the development of new alignment methods. The evaluation campaign consists of two tracks devoted to the evaluation of alignment at sentence and word level respectively. It differs from ARCADE I in the multilingual aspect and the investigation of lexical alignment.

pdf bib
Representation and Inference for Open-Domain QA: Strength and Limits of two Italian Semantic Lexicons
Francesca Bertagna

The paper reports on the results of the exploitation of two Italian lexicons (ItalWordNet and SIMPLE-CLIPS) in an Open-Domain Question Answering application for Italian. The intent is to analyse the behavior of the lexicons in this application in order to understand what their limits and strengths are. The final aim of the paper is to contribute to the debate about the usefulness of computational lexicons in NLP by providing evidence from the point of view of a particular application.

pdf bib
Building a Heterogeneous Information Retrieval Collection of Printed Arabic Documents
Abdelrahim Abdelsapor | Noha Adly | Kareem Darwish | Ossama Emam | Walid Magdy | Magdi Nagi

This paper describes the development of an Arabic document image collection containing 34,651 documents from 1,378 different books and 25 topics with their relevance judgments. The books from which the collection is obtained are part of a larger collection of 75,000 books being scanned for archival and retrieval at the Bibliotheca Alexandrina (BA). The documents in the collection vary widely in topics, fonts, and degradation levels. Initial baseline experiments were performed to examine the effectiveness of different index terms, with and without blind relevance feedback, on Arabic OCR-degraded text.

pdf bib
Gathering a corpus of multimodal computer-mediated meetings
Saturnino Luz | Matt-Mouley Bouamrane | Masood Masoodian

In this paper we describe the gathering of a corpus of synchronised speech and text interaction over the network. The data collection scenarios characterise audio meetings with a significant textual component. Unlike existing meeting corpora, the corpus described in this paper emphasises temporal relationships between speech and text media streams. This is achieved through detailed logging and timestamping of text editing operations, actions on shared user interface widgets and gesturing, as well as generation of speech activity profiles. A set of tools has been developed specifically for these purposes which can be used as a data collection platform for the development of meeting browsers. The data gathered to date consists of nearly 30 hours of recorded audio and time stamped editing operations and gestures.

pdf bib
Language identification from suprasegmental cues: Speech synthesis of Greek utterances from different dialectal variations.
Dimou Athanassia Lida | Chalamandaris Aimilios

In this paper we present the continuation of our research on the ability of native Greek adults to identify their mother tongue from synthesized stimuli which contain only prosodic - melodic and rhythmic - information. In the first section we present the ideas that underlie our theory, together with a brief review of our preliminary results. In the second section the detailed description of our experimental approach is given, as well as the results and their statistical analysis. In the final two sections we provide the conclusions derived from our experiments and the future work we are planning to carry out.

pdf bib
Tregex and Tsurgeon: tools for querying and manipulating tree data structures
Roger Levy | Galen Andrew

With syntactically annotated corpora becoming increasingly available for a variety of languages and grammatical frameworks, tree query tools have proven invaluable to linguists and computer scientists for both data exploration and corpus-based research. We provide a combined engine for tree query (Tregex) and manipulation (Tsurgeon) that can operate on arbitrary tree data structures with no need for preprocessing. Tregex remedies several expressive and implementational limitations of existing query tools, while Tsurgeon is to our knowledge the most expressive tree manipulation utility available.

pdf bib
Question Answering Evaluation Survey
L. Gillard | P. Bellot | M. El-Bèze

Evaluating Question Answering (QA) systems is a very complex task: state-of-the-art systems involve processing whose influence and contribution to the final result are not clear and need to be studied. We present some key points on different aspects of the evaluation of QA systems (QAS): mainly as performed during large-scale campaigns, but also with clues on the evaluation of typical QAS software components. The last part of this paper is devoted to a brief presentation of the French QA campaign EQueR and presents two issues: inter-annotator agreement during the campaign and the reuse of reference patterns.

pdf bib
I-CAB: the Italian Content Annotation Bank
B. Magnini | E. Pianta | C. Girardi | M. Negri | L. Romano | M. Speranza | V. Bartalesi Lenzi | R. Sprugnoli

In this paper we present work in progress for the creation of the Italian Content Annotation Bank (I-CAB), a corpus of Italian news annotated with semantic information at different levels. The first level is represented by temporal expressions, the second level is represented by different types of entities (i.e. persons, organizations, locations and geo-political entities), and the third level is represented by relations between entities (e.g. the affiliation relation connecting a person to an organization). So far I-CAB has been manually annotated with temporal expressions, person entities and organization entities. As we intend I-CAB to become a benchmark for various automatic Information Extraction tasks, we followed a policy of reusing already available markup languages. In particular, we adopted the annotation schemes developed for the ACE Entity Detection and Time Expressions Recognition and Normalization tasks. As the ACE guidelines were originally developed for English, part of the effort consisted in adapting them to the specific morpho-syntactic features of Italian. Finally, we have extended them to include a wider range of entities, such as conjunctions.

pdf bib
The BLARK concept and BLARK for Arabic
Bente Maegaard | Steven Krauwer | Khalid Choukri | Lise Damsgaard Jørgensen

The EU project NEMLAR (Network for Euro-Mediterranean LAnguage Resources) on Arabic language resources carried out two surveys on the availability of Arabic LRs in the region, and on industrial requirements. The project also worked out a BLARK (Basic Language Resource Kit) for Arabic. In this paper we describe the further development of the BLARK concept made during the work on a BLARK for Arabic, as well as the results for Arabic.

pdf bib
Natural Language Processing: A Terminological and Statistical Approach
Gabriella Pardelli | Manuela Sassi | Sara Goggi | Paola Orsolini

The aim of this article is to provide a statistical representation of significant terms used in the field of Natural Language Processing from the 1960s to the present, in order to draft a survey of the most significant research trends in that period. By retrieving these keywords it should be possible to highlight the ebb and flow of some thematic topics. The NLP terminological sample derives from a database created for this purpose using the DBT software (Textual Data Base, ILC patent).

pdf bib
Data for question answering: The case of why
Suzan Verberne | Lou Boves | Nelleke Oostdijk | Peter-Arno Coppen

For research and development of an approach for automatically answering why-questions (why-QA) a data collection was created. The data set was obtained by way of elicitation and comprises a total of 395 why-questions. For each question, the data set includes the source document and one or two user-formulated answers. In addition, for a subset of the questions, user-formulated paraphrases are available. All question-answer pairs have been annotated with information on topic and semantic answer type. The resulting data set is of importance not only for our research, but we expect it to contribute to and stimulate other research in the field of why-QA.

pdf bib
Shallow Semantic Annotation of Bulgarian
Kiril Simov | Petya Osenova

The paper discusses the shallow semantic annotation of a Bulgarian treebank. Our goal is to construct the next layer of linguistic interpretation over the morphological and syntactic layers that have already been encoded in the treebank. The annotation is called shallow because it encodes only the senses for the non-functional words and the relations between the semantic indices connected to them. We do not encode quantifiers and scope information. An ontology is employed as a stock of the concepts and relations that form the word senses. Our lexicon is based on the Generative Lexicon (GL) model (Pustejovsky 1995) as it was implemented in the SIMPLE project (Lenci et al. 2000). GL defines the way in which the words are connected to the concepts and the relations in the ontology. It also provides mechanisms for literal sense changes such as type coercion and metonymy. Some of these phenomena are presented in the annotation.

pdf bib
The Mixer and Transcript Reading Corpora: Resources for Multilingual, Crosschannel Speaker Recognition Research
Christopher Cieri | Walt Andrews | Joseph P. Campbell | George Doddington | Jack Godfrey | Shudong Huang | Mark Liberman | Alvin Martin | Hirotaka Nakasone | Mark Przybocki | Kevin Walker

This paper describes the planning and creation of the Mixer and Transcript Reading corpora, their properties and yields, and reports on the lessons learned during their development.

pdf bib
Annotation of Emotions in Real-Life Video Interviews: Variability between Coders
S. Abrilian | L. Devillers | J-C. Martin

Research on emotional real-life data has to tackle the problem of their annotation. The annotation of emotional corpora raises the issue of how different coders perceive the same multimodal emotional behaviour. The long-term goal of this paper is to produce a guideline for the selection of annotators. The LIMSI team is working towards the definition of a coding scheme integrating emotion, context and multimodal annotations. We present the current defined coding scheme for emotion annotation, and the use of soft vectors for representing a mixture of emotions. This paper describes a perceptive test of emotion annotations and the results obtained with 40 different coders on a subset of complex real-life emotional segments selected from the EmoTV Corpus collected at LIMSI. The results of this first study validate previous annotations of emotion mixtures and highlight the difference of annotation between male and female coders.

pdf bib
Hantology-A Linguistic Resource for Chinese Language Processing and Studying
Ya-Min Chou | Chu-Ren Huang

Hantology, a character-based Chinese language resource, is created to provide an infrastructure for language processing and research on the writing system. Unlike alphabetic or syllabic writing systems, the ideographic writing system of Chinese poses both a challenge and an opportunity. The challenge is that a totally different resource structure must be created to represent and process speakers’ conventionalization of the language. The rare opportunity is that the structure itself is enriched with conceptual classification and can be utilized for ontology building. We describe the contents and possible applications of Hantology in this paper. The applications of Hantology include: (1) an account of the diachronic development of Chinese lexica, (2) character-based language processing, (3) a study of conceptual structure differences in Chinese and English, and (4) comparisons of different ideographic writing systems.

pdf bib
Language Resources Production Models: the Case of the INTERA Multilingual Corpus and Terminology
Maria Gavrilidou | Penny Labropoulou | Stelios Piperidis | Voula Giouli | Nicoletta Calzolari | Monica Monachini | Claudia Soria | Khalid Choukri

This paper reports on the multilingual Language Resources (MLRs), i.e. parallel corpora and terminological lexicons for less widely digitally available languages, that have been developed in the INTERA project and the methodology adopted for their production. Special emphasis is given to the reality factors that have influenced the MLRs development approach and their final constitution. Building on the experience gained in the project, a production model has been elaborated, suggesting ways and techniques that can be exploited in order to improve LRs production taking into account realistic issues.

pdf bib
Semantic Analysis of Abstract Nouns to Compile a Thesaurus of Adjectives
Kyoko Kanzaki | Qing Ma | Eiko Yamamoto | Hitoshi Isahara

Aiming to compile a thesaurus of adjectives, we discuss how to extract abstract nouns categorizing adjectives, clarify the semantic and syntactic functions of these abstract nouns, and manually evaluate the capability to extract the “instance-category” relations. We focused on some Japanese syntactic structures and utilized the possibility of omission of the abstract noun to decide whether or not a semantic relation between an adjective and an abstract noun is an “instance-category” relation. For 63% of the adjectives (57 groups/90 groups) in our experiments, our extracted categories were found to be most suitable. For 22% of the adjectives (20/90), the categories in the EDR lexicon were found to be most suitable. For 14% of the adjectives (13/90), neither our extracted categories nor those in EDR were found to be suitable, or examinees’ own categories were considered to be more suitable. From our experimental results, we found that the correspondence between a group of adjectives and their category name was more suitable in our method than in the EDR lexicon.

pdf bib
Shalmaneser - A Toolchain For Shallow Semantic Parsing
Katrin Erk | Sebastian Padó

This paper presents Shalmaneser, a software package for shallow semantic parsing, the automatic assignment of semantic classes and roles to free text. Shalmaneser is a toolchain of independent modules communicating through a common XML format. System output can be inspected graphically. Shalmaneser can be used either as a “black box” to obtain semantic parses for new datasets (classifiers for English and German frame-semantic analysis are included), or as a research platform that can be extended to new parsers, languages, or classification paradigms.

pdf bib
User-friendly ontology authoring using a controlled language
Valentin Tablan | Tamara Polajnar | Hamish Cunningham | Kalina Bontcheva

In recent years, following the rapid development in the Semantic Web and Knowledge Management research, ontologies have become more in demand in Natural Language Processing. An increasing number of systems use ontologies either internally, for modelling the domain of the application, or as data structures that hold the output resulting from the work of the system, in the form of knowledge bases. While there are many ontology editing tools aimed at expert users, there are very few which are accessible to users wishing to create simple structures without delving into the intricacies of knowledge representation languages. The approach described in this paper allows users to create and edit ontologies simply by using a restricted version of the English language. The controlled language described within is based on an open vocabulary and a restricted set of grammatical constructs. Sentences written in this language unambiguously map into a number of knowledge representation formats including OWL and RDF-S to allow round-trip ontology management.

pdf bib
NPs for Events: Experiments in Coreference Annotation
Laura Hasler | Constantin Orasan | Karin Naumann

This paper describes a pilot project which developed a methodology for NP and event coreference annotation consisting of detailed annotation schemes and guidelines. In order to develop this, a small sample annotated corpus in the domain of terrorism/security was built. The methodology developed can be used as a basis for large-scale annotation to produce much-needed resources. In contrast to related projects, ours focused almost exclusively on the development of annotation guidelines and schemes, to ensure that future annotations based on this methodology capture the phenomena both reliably and in detail. The project also involved extensive discussions in order to redraft the guidelines, as well as major extensions to PALinkA, our existing annotation tool, to accommodate event as well as NP coreference annotation.

pdf bib
COMBINA-PT: A Large Corpus-extracted and Hand-checked Lexical Database of Portuguese Multiword Expressions
Amália Mendes | Sandra Antunes | Maria Fernanda Bacelar do Nascimento | João Miguel Casteleiro | Luísa Pereira | Tiago Sá

This paper presents the COMBINA-PT project, a study of corpus-extracted Portuguese Multiword (MW) expressions. The objective of this on-going project is to compile a large lexical database of multiword (MW) units of the Portuguese language, automatically extracted from a balanced 50 million word corpus, and manually validated with the help of lexical association measures. MW expressions considered in the database include named entities and lexical associations with different degrees of cohesion, ranging from frozen groups, which undergo little or no variation, to lexical collocations composed of words that tend to occur together and that constitute syntactic dependencies, although with a low degree of fixedness. This new resource has a two-fold objective: (i) to be an important research tool which supports the development of MW expression typologies and their lexicographic treatment; (ii) to be of major help in developing and evaluating language processing tools capable of dealing with MW expressions.

pdf bib
Lexicon Development for Varieties of Spoken Colloquial Arabic
David Graff | Tim Buckwalter | Mohamed Maamouri | Hubert Jin

In Arabic speech communities, there is a diglossic gap between written/formal Modern Standard Arabic (MSA) and spoken/casual colloquial dialectal Arabic (DA): the common spoken language has no standard representation in written form, while the language observed in texts has limited occurrence in speech. Hence the task of developing language resources to describe and model DA speech involves extra work to establish conventions for orthography and grammatical analysis. We describe work being done at the LDC to develop lexicons for DA, comprising pronunciation, morphology and part-of-speech labeling for word forms in recorded speech. Components of the approach are: (a) a two-layer transcription, providing a consonant-skeleton form and a pronunciation form; (b) manual annotation of morphology, part-of-speech and English gloss, followed by development of automatic word parsers modeled on the Buckwalter Morphological Analyzer for MSA; (c) customized user interfaces and supporting tools for all stages of annotation; and (d) a relational database for storing, emending and publishing the transcription corpus as well as the lexicon.

pdf bib
MOOD: A Modular Object-Oriented Decoder for Statistical Machine Translation
Alexandre Patry | Fabrizio Gotti | Philippe Langlais

We present an Open Source framework called MOOD developed in order to facilitate the development of a Statistical Machine Translation decoder. MOOD has been modularized using an object-oriented approach which makes it especially suitable for the fast development of state-of-the-art decoders. As a proof of concept, a clone of the pharaoh decoder has been implemented and evaluated. This clone, named ramses, is part of the current distribution of MOOD.

pdf bib
Developing and Using a Pilot Dialectal Arabic Treebank
Mohamed Maamouri | Ann Bies | Tim Buckwalter | Mona Diab | Nizar Habash | Owen Rambow | Dalila Tabessi

In this paper, we describe the methodological procedures and issues that emerged from the development of a pilot Levantine Arabic Treebank (LATB) at the Linguistic Data Consortium (LDC) and its use at the Johns Hopkins University (JHU) Center for Language and Speech Processing workshop on Parsing Arabic Dialects (PAD). This pilot, consisting of morphological and syntactic annotation of approximately 26,000 words of Levantine Arabic conversational telephone speech, was developed under severe time constraints; hence the LDC team drew on their experience in treebanking Modern Standard Arabic (MSA) text. The resulting Levantine dialect treebanked corpus was used by the PAD team to develop and evaluate parsers for Levantine dialect texts. The parsers were trained on MSA resources and adapted using dialect-MSA lexical resources (some developed especially for this task) and existing linguistic knowledge about syntactic differences between MSA and dialect. The use of the LATB for development and evaluation of syntactic parsers allowed the PAD team to provide feedback to the LDC treebank developers. In this paper, we describe the creation of resources for this corpus, as well as transformations on the corpus to eliminate speech effects and lessen the gap between our pre-existing MSA resources and the new dialectal corpus.

pdf bib
Building a Swedish-Turkish Parallel Corpus
Beáta Bandmann Megyesi | Anna Sågvall Hein | Éva Csató Johanson

We present a Swedish-Turkish Parallel Corpus aimed to be used in linguistic research, teaching, and applications in natural language processing, primarily machine translation. The corpus, which is under development, is built using a Basic LAnguage Resource Kit (BLARK) for the two languages, which is then used in the automatic alignment phase to improve alignment accuracy. The corpus is balanced with respect to source and target language and is automatically processed using the Uplug toolkit.

pdf bib
Language Resources for Background Gathering
Horacio Saggion | Robert Gaizauskas

We describe the Cubreporter information access system which allows access to news archives through the use of natural language technology. The system includes advanced text search, question answering, summarization, and entity profiling capabilities. It has been designed taking into account the characteristics of the background gathering task.

pdf bib
An Efficient Approach to Gold-Standard Annotation: Decision Points for Complex Tasks
Julie Medero | Kazuaki Maeda | Stephanie Strassel | Christopher Walker

Inter-annotator consistency is a concern for any corpus building effort relying on human annotation. Adjudication is an effective way to locate and correct discrepancies of various kinds, but it can also be both difficult and time-consuming. This paper introduces the Linguistic Data Consortium (LDC)’s model for decision point-based annotation and adjudication, and describes the annotation tools developed to enable this approach for the Automatic Content Extraction (ACE) Program. Using a customized user interface incorporating decision points, we improved adjudication efficiency over 2004 annotation rates, despite increased annotation task complexity. We examine the factors that lead to more efficient, less demanding adjudication. We further discuss how a decision point model might be applied to annotation tools designed for a wide range of annotation tasks. Finally, we consider issues of annotation tool customization versus development time in the context of a decision point model.

pdf bib
A Corpus-based Approach to the Interpretation of Unknown Words with an Application to German
Stefan Klatt

Usually a high proportion of the different word forms in a corpus receive no reading from the lexical and/or morphological analysis. These unknown words constitute a huge problem for NLP analysis tasks like POS-tagging or syntactic parsing. We present a parameterizable (in principle language-independent) corpus-based approach for the interpretation of unknown words that only needs a tokenized corpus and can be used in both offline and online applications. In combination with a few linguistic (language-dependent) rules, unknown verbs, adjectives, nouns, multiword units etc. are identified. Depending on the recognized word class(es), more detailed morphosyntactic and semantic information is additionally identified, in contrast to the majority of other unknown word guessing methods, which only use a very narrow decision window to assign an unknown word its correct reading or Part-of-Speech tag in a given text. We tested our approach in experiments with German data and obtained very promising results.

pdf bib
The Ritel Corpus - An annotated Human-Machine open-domain question answering spoken dialog corpus
Sophie Rosset | Sandra Petel

In this paper we present a real (as opposed to Wizard-of-Oz) Human-Computer QA-oriented spoken dialog corpus collected with our Ritel platform. This corpus has been orthographically transcribed and annotated in terms of Specific Entities and Topics. Twelve main topics have been chosen. They are refined into 22 sub-topics. The Specific Entities are from five categories and cover Named Entities, linguistic entities, topic-defining entities, general entities and extended entities. The corpus contains 582 dialogs for 6 hours of user speech.

pdf bib
A Cross-language Approach to Rapid Creation of New Morpho-syntactically Annotated Resources
Anna Feldman | Jirka Hana | Chris Brew

We take a novel approach to rapid, low-cost development of morpho-syntactically annotated resources without using parallel corpora or bilingual lexicons. The overall research question is how to exploit language resources and properties to facilitate and automate the creation of morphologically annotated corpora for new languages. This portability issue is especially relevant to minority languages, for which such resources are likely to remain unavailable in the foreseeable future. We compare the performance of our system on languages that belong to different language families (Romance vs. Slavic), as well as different language pairs within the same language family (Portuguese via Spanish vs. Catalan via Spanish). We show that across language families, the most difficult category is the category of nominals (the noun homonymy is challenging for morphological analysis and the order variation of adjectives within a sentence makes it challenging to create a reliable model), whereas different language families present different challenges with respect to their morpho-syntactic descriptions: for the Slavic languages, case is the most challenging category; for the Romance languages, gender is more challenging than case. In addition, we present an alternative evaluation metric for our system, where we measure how much human labor will be needed to convert the result of our tagging to a high precision annotated resource.

pdf bib
Greek Named Entity Recognition using Support Vector Machines, Maximum Entropy and Onetime
Ionas Michailidis | Konstantinos Diamantaras | Spiros Vasileiadis | Yannick Frère

We describe our work on Greek Named Entity Recognition using comparatively three different machine learning techniques: (i) Support Vector Machines (SVM), (ii) Maximum Entropy and (iii) Onetime, a shortcut method based on previous work of one of the authors. The majority of our system’s features use linguistic knowledge provided by: morphology, punctuation, position of the lexical units within a sentence and within a text, electronic dictionaries, and the outputs of external tools (a tokenizer, a sentence splitter, and a Hellenic version of Brill’s Part of Speech Tagger). After testing we observed that the application of a few simple Post Testing Classification Correction (PTCC) rules created after the observation of output errors, improved the results of the SVM and the Maximum Entropy systems output. We achieved very good results with the three methods. Our best configurations (Support Vector Machines with a second degree polynomial kernel and Maximum Entropy) achieved both after the application of PTCC rules an overall F-measure of 91.06.

pdf bib
A Large Subcategorization Lexicon for Natural Language Processing Applications
Anna Korhonen | Yuval Krymolowski | Ted Briscoe

We introduce a large computational subcategorization lexicon which includes subcategorization frame (SCF) and frequency information for 6,397 English verbs. This extensive lexicon was acquired automatically from five corpora and the Web using the current version of the comprehensive subcategorization acquisition system of Briscoe and Carroll (1997). The lexicon is provided freely for research use, along with a script which can be used to filter and build sub-lexicons suited for different natural language processing (NLP) purposes. Documentation is also provided which explains each sub-lexicon option and evaluates its accuracy.

pdf bib
Integrating Linguistic Resources: The American National Corpus Model
Nancy Ide | Keith Suderman

This paper describes the architecture of the American National Corpus and the design decisions we have made in order to make the corpus easy to use with a variety of existing tools with varying functionality, and to allow for layering multiple annotations over the data. The overall goal of the ANC project is to provide an “open linguistic infrastructure” for American English, consisting of as many self-generated or contributed annotations of the data as possible together with derived. The availability of a wide variety of annotations for the same data and in a common format should significantly simplify the processing required to extract annotations from different sources and enable use of the ANC and its annotations with off-the-shelf software.

pdf bib
Representing Linguistic Corpora and Their Annotations
Nancy Ide | Laurent Romary

A Linguistic Annotation Framework (LAF) is being developed within the International Standards Organization Technical Committee 37 Sub-committee on Language Resource Management (ISO TC37 SC4). LAF is intended to provide a standardized means to represent linguistic data and its annotations that is defined broadly enough to accommodate all types of linguistic annotations, and at the same time provide means to represent precise and potentially complex linguistic information. The general principles informing the design of LAF have been previously reported (Ide and Romary, 2003; Ide and Romary, 2004a). This paper describes some of the more technical aspects of the LAF design that have been addressed in the process of finalizing the specifications for the standard.

pdf bib
An Open Source Prosodic Feature Extraction Tool
Zhongqiang Huang | Lei Chen | Mary Harper

There has been an increasing interest in utilizing a wide variety of knowledge sources in order to perform automatic tagging of speech events, such as sentence boundaries and dialogue acts. In addition to the words spoken, the prosodic content of the speech has proved quite valuable in a variety of spoken language processing tasks such as sentence segmentation and tagging, disfluency detection, dialog act segmentation and tagging, and speaker recognition. In this paper, we report on an open source prosodic feature extraction tool based on Praat, with a description of the prosodic features and the implementation details, as well as a discussion of its extension capability. We also evaluate our tool on a sentence boundary detection task and report the system performance on the NIST RT04 CTS data.

pdf bib
Semantic Tag Extraction from WordNet Glosses
Alina Andreevskaia | Sabine Bergler

We propose a method that uses information from WordNet glosses to assign semantic tags to individual word meanings, rather than to entire words. The produced lists of annotated words will be used in sentiment annotation of texts and phrases and in other NLP tasks. The method was implemented in the Semantic Tag Extraction Program (STEP) and evaluated on the category of sentiment (positive, negative or neutral) using two human-annotated lists. The lists were first compared to each other and then used to assess the accuracy of the proposed system. We argue that significant disagreement on sentiment tags between the two human-annotated lists reflects a naturally occurring ambiguity of words located on the periphery of the category of sentiment. The category of sentiment, thus, is believed to be structured as a fuzzy set. Finally, we evaluate the generalizability of STEP to other semantic categories on the example of the category of words denoting increase/decrease in magnitude, intensity or quality of some state or process. The implications of this study for both semantic tagging system development and for performance evaluation practices are discussed.

pdf bib
Getting Deeper Semantics than Berkeley FrameNet with MSFA
Kow Kuroda | Masao Utiyama | Hitoshi Isahara

This paper illustrates relevant details of an on-going semantic-role annotation work based on a framework called MULTILAYERED/DIMENSIONAL SEMANTIC FRAME ANALYSIS (MSFA for short) (Kuroda and Isahara, 2005b), which is inspired by, if not derived from, Frame Semantics/Berkeley FrameNet approach to semantic annotation (Lowe et al., 1997; Johnson and Fillmore, 2000).

pdf bib
The wraetlic NLP suite
Enrique Alfonseca | Antonio Moreno-Sandoval | José María Guirao | María Ruiz-Casado

In this paper, we describe the second release of a suite of language analysers, developed over the last five years, called wraetlic, which includes tools for several partial parsing tasks, both for English and Spanish. It has been successfully used in fields such as Information Extraction, thesaurus acquisition, Text Summarisation and Computer Assisted Assessment.

pdf bib
Linguistic and Biological Annotations of Biological Interaction Events
Tomoko Ohta | Yuka Tateisi | Jin-Dong Kim | Akane Yakushiji | Jun-ichi Tsujii

This paper discusses an augmentation of a corpus of research abstracts in the biomedical domain (the GENIA corpus) with two kinds of annotations: tree annotation and event annotation. The tree annotation identifies the linguistic structure that encodes the relations among entities. The event annotation reveals the semantic structure of the biological interaction events encoded in the text. With these annotations we aim to provide a link between the clue and the target of biological event information extraction.

pdf bib
The ASK Corpus - a Language Learner Corpus of Norwegian as a Second Language
Kari Tenfjord | Paul Meurer | Knut Hofland

In our paper we present the design and interface of ASK, a language learner corpus of Norwegian as a second language which contains essays collected from language tests on two different proficiency levels as well as personal data from the test takers. In addition, the corpus also contains texts and relevant personal data from native Norwegians as control data. The texts as well as the personal data are marked up in XML according to the TEI Guidelines. In order to be able to classify “errors” in the texts, we have introduced new attributes to the TEI corr and sic tags. For each error tag, a correct form is also in the text annotation. Finally, we employ an automatic tagger developed for standard Norwegian, the “Oslo-Bergen Tagger”, together with a facility for manual tag correction. As corpus query system, we are using the Corpus Workbench developed at the University of Stuttgart together with a web search interface developed at Aksis, University of Bergen. The system allows for searching for combinations of words, error types, grammatical annotation and personal data.

pdf bib
Annotation Guidelines for Czech-English Word Alignment
Ivana Kruijff-Korbayová | Klára Chvátalová | Oana Postolache

We report on our experience with manual alignment of Czech and English parallel corpus text. We applied existing guidelines for English and French and augmented them to cover systematically occurring cases in our corpus. We describe the main extensions covered in our guidelines and provide examples. We evaluated both intra- and inter-annotator agreement and obtained very good results of Kappa well above 0.9 and agreement of 95% and 93%, respectively.
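
The chance-corrected agreement figure reported can be computed with Cohen's kappa, sketched below over two annotators' labels for the same set of alignment links; the label values in the example are invented, not taken from the corpus.

```python
# Illustrative sketch of the agreement computation: Cohen's kappa over two
# annotators' labels for the same items. The toy labels below are invented;
# the paper reports kappa well above 0.9 on its Czech-English data.
from collections import Counter

def cohen_kappa(labels_a, labels_b):
    assert len(labels_a) == len(labels_b)
    n = len(labels_a)
    observed = sum(a == b for a, b in zip(labels_a, labels_b)) / n
    ca, cb = Counter(labels_a), Counter(labels_b)
    expected = sum(ca[l] * cb[l] for l in set(labels_a) | set(labels_b)) / (n * n)
    return (observed - expected) / (1 - expected)

a = ["sure", "sure", "possible", "null", "sure", "possible"]
b = ["sure", "possible", "possible", "null", "sure", "sure"]
print(round(cohen_kappa(a, b), 3))
```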

pdf bib
Sign Language corpus analysis: Synchronisation of linguistic annotation and numerical data
Jérémie Segouat | Annelies Braffort | Emilie Martin

This paper presents a study on the synchronization of linguistic annotation and numerical data on a video corpus of French Sign Language. We detail the methodology and sketch out the potential observations that can be provided by this kind of mixed annotation. The corpus is composed of three views: close-up, frontal and top. Some image processing has been performed on each video in order to provide global information on the movement of the signers. This consists of the size and position of a bounding box surrounding the signer. Linguists have studied this corpus and have provided annotations on iconic structures, such as "personal transfers" (role shifts). We used the annotation software ANVIL to synchronize linguistic annotation and numerical data. This new approach to annotation seems promising for the automatic detection of linguistic phenomena, such as the classification of signs according to their size in the signing space and the detection of some iconic structures. Our first results must be consolidated and extended to the whole corpus. The next step will consist of designing automatic processes in order to assist SL annotation.

pdf bib
Lexical Markup Framework (LMF)
Gil Francopoulo | Monte George | Nicoletta Calzolari | Monica Monachini | Nuria Bel | Mandy Pet | Claudia Soria

Optimizing the production, maintenance and extension of lexical resources is one of the crucial aspects impacting Natural Language Processing (NLP). A second aspect involves optimizing the process leading to their integration in applications. In this respect, we believe that the production of a consensual specification on lexicons can be a useful aid for the various NLP actors. Within ISO, the purpose of LMF is to define a standard for lexicons. LMF is a model that provides a common standardized framework for the construction of NLP lexicons. The goals of LMF are to provide a common model for the creation and use of lexical resources, to manage the exchange of data between and among these resources, and to enable the merging of a large number of individual electronic resources to form extensive global electronic resources. In this paper, we describe the work in progress within the sub-group ISO-TC37/SC4/WG4. Experts from many countries have been consulted in order to take into account best practices in many languages for (we hope) all kinds of NLP lexicons.

pdf bib
Geocoding Multilingual Texts: Recognition, Disambiguation and Visualisation
Bruno Pouliquen | Marco Kimler | Ralf Steinberger | Camelia Ignat | Tamara Oellinger | Ken Blackler | Flavio Fluart | Wajdi Zaghouani | Anna Widiger | Ann-Charlotte Forslund | Clive Best

We are presenting a method to recognise geographical references in free text. Our tool must work on various languages with a minimum of language-dependent resources, except a gazetteer. The main difficulty is to disambiguate these place names by distinguishing places from persons and by selecting the most likely place out of a list of homographic place names world-wide. The system uses a number of language-independent clues and heuristics to disambiguate place name homographs. The final aim is to index texts with the countries and cities they mention and to automatically visualise this information on geographical maps using various tools.
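
A minimal sketch of the disambiguation step: candidate places for a homographic name are ranked with simple language-independent heuristics, here place size and whether the candidate's country is already mentioned in the text. The gazetteer entries and the weight are invented for illustration; the system's actual clues are more varied.

```python
# Illustrative sketch of place-name disambiguation over a gazetteer.
# Entries and weights are invented; real systems combine many more clues.
GAZETTEER = {
    "paris": [
        {"country": "FR", "population": 2_100_000},
        {"country": "US", "population": 25_000},      # Paris, Texas
    ],
}

def disambiguate(name, countries_in_text):
    candidates = GAZETTEER.get(name.lower(), [])

    def score(candidate):
        s = candidate["population"]
        if candidate["country"] in countries_in_text:
            s *= 100              # strong boost for countries already mentioned
        return s

    return max(candidates, key=score, default=None)

print(disambiguate("Paris", countries_in_text={"US"}))   # -> the Texas entry
```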

pdf bib
Query Expansion on Compounds
Bolette Sandford Pedersen

Compounds constitute a specific issue in search, in particular in languages where they are written in one word, as is the case for Danish and the other Scandinavian languages. For such languages, expansion of the query compound into separate lemmas is a way of finding the often frequent alternative synonymous phrases in which the content of a compound can also be expressed. However, it is crucial to note that the number of irrelevant hits is generally very high when using this expansion strategy. The aim of this paper is to examine how we can obtain better search results on split compounds, partly by looking at the internal structure of the original compound, partly by analyzing the context in which the split compound occurs. We perform an NP analysis and introduce a new, linguistically based threshold for retrieved hits. The results obtained by using this strategy demonstrate that compound splitting combined with a shallow linguistic analysis focusing on the recognition of NPs can improve search by bringing down the number of irrelevant hits.

pdf bib
Word Knowledge Acquisition for Computational Lexicon Construction
Thatsanee Charoenporn | Canasai Kruengkrai | Thanaruk Theeramunkong | Virach Sornlertlamvanich | Hitoshi Isahara

The growth of multilingual information processing technology has created a need for linguistic resources, especially lexical databases. Many attempts have been made to transform the traditional dictionary into a computational dictionary, widely known as a computational lexicon. TCL’s Computational Lexicon (TCLLEX) is a recent development of a large-scale Thai Lexicon, which aims to serve as a fundamental linguistic resource for natural language processing research. We design both the terminology and the ontology for structuring the lexicon based on the ideas of computability and reusability.

pdf bib
The LOIS Project
Wim Peters | Maria Teresa Sagri | Daniela Tiscornia | Sara Castagnoli

The legal knowledge base resulting from the LOIS (Lexical Ontologies for legal Information Sharing) project consists of legal WordNets in six languages (Italian, Dutch, Portuguese, German, Czech, English). Its architecture is based on the EuroWordNet (EWN) framework (Vossen et al., 1997). Using the EWN framework assures compatibility of the LOIS WordNets with EWN, allowing them to function as an extension of EWN for the legal domain. For each legal system, the document-derived legal concepts are integrated into a taxonomy, which links into existing formal ontologies. These give the legal wordnets a first formal backbone, which can, in future, be further extended. The database consists of 33,000 synsets, and is intended to be used in information retrieval, where it provides mono- and multi-lingual access to European legal databases for legal experts as well as for laymen. The LOIS knowledge base also provides a flexible, modular architecture that allows integration of multiple classification schemes, and enables the comparison of legal systems by exploring translation, equivalence and structure across the different legal wordnets.

pdf bib
A mixed word / morphological approach for extending CELEX for high coverage on contemporary large corpora
Joris Vaneyghen | Guy De Pauw | Dirk Van Compernolle | Walter Daelemans

This paper describes an alternative approach to morphological language modeling, which incorporates constraints on the morphological production of new words. This is done by applying the constraints as a preprocessing step in which only one morphological production rule can be applied to an extended lexicon of known morphemes, lemmas and word forms. This approach is used to extend the CELEX Dutch morphological database, so that a higher coverage can be reached on a large corpus of Dutch newspaper articles. We present experimental results on the coverage of this extended database and use the extension to further evaluate our morphological system, as well as the impact of the constraints on the coverage of out-of-vocabulary words.

pdf bib
Linking Verbal Entries of Different Lexical Resources
Adriana Roventini

In the field of Computational Linguistics, many lexical resources have been developed which aim at encoding complex lexical semantic information according to different linguistic models (WordNet, Frame Semantics, Generative Lexicon, etc.). However, these resources are often neither easily accessible nor available in their entirety. Yet, from the point of view of the continuous growth of technology (Semantic Web), their visibility, availability and integration are becoming of utmost importance. ItalWordNet and PAROLE/SIMPLE/CLIPS are two resources which, tackling lexical semantics from different perspectives and being at least partially complementary, can profit from linking each other. In this paper we address the issue of the linking of these resources focusing on the most problematic part of the lexicon: the second order entities. In particular, after a brief description of the two resources, their different approaches to verb semantics are described; an accurate comparison of a set of verbal entries belonging to the Speech Act semantic class is carried out, aiming to evaluate the possibilities and advantages of a semi-automatic link.

pdf bib
CESTA: First Conclusions of the Technolangue MT Evaluation Campaign
O. Hamon | A. Popescu-Belis | K. Choukri | M. Dabbadie | A. Hartley | W. Mustafa El Hadi | M. Rajman | I. Timimi

This article outlines the evaluation protocol and provides the main results of the French Evaluation Campaign for Machine Translation Systems, CESTA. Following the initial objectives and evaluation plans, the evaluation metrics are briefly described: along with fluency and adequacy assessed by human judges, a number of recently proposed automated metrics are used. Two evaluation campaigns were organized, the first one in the general domain, and the second one in the medical domain. Up to six systems translating from English into French, and two systems translating from Arabic into French, took part in the campaign. The numerical results illustrate the differences between classes of systems, and provide interesting indications about the reliability of the automated metrics for French as a target language, both by comparison to human judges and using correlations between metrics. The corpora that were produced, as well as the information about the reliability of metrics, constitute reusable resources for MT evaluation.

pdf bib
Lemma-oriented dictionaries, concept-oriented terminology and translation memories
André Le Meur | Marie-Jeanne Derouin

Market surveys have pointed out translators’ demand for integrated specialist dictionaries in translation memory tools which they could use in addition to their own compiled dictionaries or stored translated parts of text. For this purpose the German specialist dictionary publisher Langenscheidt Fachverlag in Munich has developed a method and tools together with experts from the University Rennes 2 in France and well-known translation memory providers. The tools for converting dictionary entries (“lemma-oriented”) into terminological entries (“concept-oriented”) are based on lexicographical and terminological ISO standards: ISO 1951 for dictionaries and ISO 16642 for terminology. The method relies on the analysis of polysemic structures into a set of data categories that can be recombined into monosemic entries compatible with most of the terminology management engines on the market. The whole process is based on the TermBridge semantic repository (http://www.genetrix.org) for terminology and machine readable dictionaries and on an XML model “LexTerm” which is a subset of Geneter (ISO 16642 Annex C). It illustrates the value, for linguistic applications, of defining data elements in semantic repositories so that they are reusable in various contexts. This operation is fully integrated in the editorial XML workflow and applies to a series of specialist dictionaries which are now available.

pdf bib
Spanish Synthesis Corpora
Martí Umbert | Asunción Moreno | Pablo Agüero | Antonio Bonafonte

This paper deals with the design of a synthesis database for a high quality corpus-based Speech Synthesis system in Spanish. The database has been designed for speech synthesis, speech conversion and expressive speech. The design follows the specifications of the TC-STAR project and has been applied to collect equivalent English and Mandarin synthesis databases. The sentences of the corpus have been selected mainly from transcribed speech and novels. The selection criterion is phonetic and prosodic coverage. The corpus was completed with sentences specifically designed to cover frequent phrases and words. Two baseline speakers and four bilingual speakers were recorded. Recordings consist of 10 hours of speech for each baseline speaker and one hour of speech for each voice conversion bilingual speaker. The database is labelled and segmented. Pitch marks and phonetic segmentation were done automatically, and up to 50% was manually supervised. The database will be available at ELRA.

pdf bib
Exploiting Linguistic Knowledge in Language Modeling of Czech Spontaneous Speech
Pavel Ircing | Jan Hoidekr | Josef Psutka

In our paper, we present a method for incorporating available linguistic information into a statistical language model that is used in an ASR system for transcribing spontaneous speech. We employ the class-based language model paradigm and use the morphological tags as the basis for word-to-class mapping. Since the number of different tags is at least one order of magnitude lower than the number of words even in tasks with moderately-sized vocabularies, the tag-based model can be rather robustly estimated even on relatively small text corpora. Unfortunately, this robustness goes hand in hand with the restricted predictive ability of the class-based model. Hence we apply a two-pass recognition strategy, where the first pass is performed with the standard word-based n-gram and the resulting lattices are rescored in the second pass using the aforementioned class-based model. Using this decoding scenario, we have managed to moderately improve the word error rate in the performed ASR experiments.
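A minimal sketch of the class-based modelling idea, assuming each word maps to a single morphological tag: the word probability is approximated as P(c_i | c_{i-1}) * P(w_i | c_i), estimated from tagged text. The data structures and the add-one smoothing are illustrative choices, not the authors' implementation; in the two-pass setup such a model would be used to rescore first-pass lattices.

```python
# Toy class-based bigram language model built from a tagged corpus.
from collections import defaultdict

class ClassBigramLM:
    def __init__(self, tagged_corpus):
        # tagged_corpus: list of sentences, each a list of (word, tag) pairs
        self.class_bigrams = defaultdict(lambda: defaultdict(int))
        self.word_in_class = defaultdict(lambda: defaultdict(int))
        self.class_totals = defaultdict(int)
        for sentence in tagged_corpus:
            prev_tag = "<s>"
            for word, tag in sentence:
                self.class_bigrams[prev_tag][tag] += 1
                self.word_in_class[tag][word] += 1
                self.class_totals[tag] += 1
                prev_tag = tag

    def prob(self, word, tag, prev_tag):
        # P(w_i | history) ~ P(c_i | c_{i-1}) * P(w_i | c_i), add-one smoothed
        bigrams = self.class_bigrams[prev_tag]
        p_class = (bigrams[tag] + 1) / (sum(bigrams.values()) + len(self.class_totals) + 1)
        words = self.word_in_class[tag]
        p_word = (words[word] + 1) / (self.class_totals[tag] + len(words) + 1)
        return p_class * p_word
```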

pdf bib
Semantic Descriptors: The Case of Reflexive Verbs
Milena Slavcheva

This paper presents a semantic classification of reflexive verbs in Bulgarian, augmenting the morphosyntactic classes of verbs in the large Bulgarian Lexical Data Base - a language resource utilized in a number of Language Engineering (LE) applications. The semantic descriptors conform to the Unified Eventity Representation (UER), developed by Andrea Schalley. The UER is a graphical formalism, introducing the object-oriented system design to linguistic semantics. Reflexive/non-reflexive verb pairs are analyzed where the non-reflexive member of the opposition, a two-place predicate, is considered the initial linguistic entity from which the reflexive correlate is derived. The reflexive verbs are distributed into initial syntactic-semantic classes which serve as the basis for defining the relevant semantic descriptors in the form of EVENTITY FRAME diagrams. The factors that influence the categorization of the reflexives are the lexical paradigmatic approach to the data, the choice of only one reading for each verb, and top-level generalization of the semantic descriptors. The language models described in this paper provide the possibility for building linguistic components utilizable in knowledge-driven systems.

pdf bib
Exploring HPSG-based Treebanks for Probabilistic Parsing HPSG grammar extraction
Günter Neumann | Berthold Crysmann

We describe a method for the automatic extraction of a Stochastic Lexicalized Tree Insertion Grammar from a linguistically rich HPSG Treebank. The extraction method is strongly guided by HPSG-based head and argument decomposition rules. The tree anchors correspond to lexical labels encoding fine-grained information. The approach has been tested with a German corpus achieving a labeled recall of 77.33% and labeled precision of 78.27%, which is competitive with recent results reported for German parsing using the Negra Treebank.

pdf bib
Proper Names and Linguistic Dynamics
Rita Marinelli | Remo Bindi

Pragmatics is the study of how people exchange meanings through the use of language. In this paper we describe our experience with regard to texts belonging to a large contemporary corpus of written language, in order to verify the uses, changes and flexibility of the meaning of Proper Names (PN). As a matter of fact, while building the lexical semantic database ItalWordNet (IWN), a considerable set of PN (up to now, about 4,000) has been inserted and studied. We give prominence to the polysemy of PN and their shifting or moving from one class to another as an example of the extensibility of language and the possibility of change considering meaning as a dynamic process. Many examples of the sense shifting phenomenon can be evidenced by textual corpora. By comparing the percentages regarding the texts belonging to two different periods of time, an increasing use of the PN with sense extension has been verified. This evidence could confirm the tendency to consider the derived or extended senses as more salient and prevailing over the base senses, confirming a “gradual fixation” of meaning over time. The object of our study (in progress) is to observe the uses of sense extensions, also examining in detail “freshly coined” examples and taking into account their relationship with meta representational capacity and human creativity and the ways in which linguistic dynamics can activate the meaning potential of words.

pdf bib
Towards machine-readable lexicons for South African Bantu languages
Sonja E. Bosch | Laurette Pretorius | Jackie Jones

Lexical information for South African Bantu languages is not readily available in the form of machine-readable lexicons. At present the availability of lexical information is restricted to a variety of paper dictionaries. These dictionaries display considerable diversity in the organisation and representation of data. In order to proceed towards the development of reusable and suitably standardised machine-readable lexicons for these languages, a data model for lexical entries becomes a prerequisite. In this study the general purpose model as developed by Bell & Bird (2000) is used as a point of departure. Firstly, the extent to which the Bell & Bird (2000) data model may be applied to and modified for the above-mentioned languages is investigated. Initial investigations indicate that modification of this data model is necessary to make provision for the specific requirements of lexical entries in these languages. Secondly, a data model in the form of an XML DTD for the languages in question, based on our findings regarding Bell & Bird (2000) and Weber (2002), is presented. Included in this model are additional particular requirements for complete and appropriate representation of linguistic information as identified in the study of available paper dictionaries.

pdf bib
Creation of a corpus of multimodal spontaneous expressions of emotions in Human-Machine Interaction
G. Lechenadec | V. Maffiolo | N. Chateau | J.M. Colletta

This paper presents a laboratory experiment dealing with the constitution of a corpus of multimodal spontaneous expressions of emotions. The originality of this corpus resides in its characteristics (interactions between a virtual actor and humans learning a theater text), in its content (multimodal spontaneous expressions of emotions) and in its two sources of characterization (by the participant and by one of his/her close relations). The corpus collection is part of a study on the fusion of multimodal information (verbal, facial, gestural, postural, and physiological) to improve the detection and characterization of expressions of emotions in human-machine interaction (HMI).

pdf bib
A Dictionary Model for Unifying Machine Readable Dictionaries and Computational Concept Lexicons
Yoshihiko Hayashi | Toru Ishida

The Language Grid, recently proposed by one of the authors, is a language infrastructure available on the Internet. It aims to resolve the problems of accessibility and usability inherent in the currently available language services. The infrastructure will accommodate an operational environment in which a user and/or a software agent can develop a language service that is tailored to specific requirements derived from the various situations of intercultural communication. In order to effectively operate the infrastructure, each atomic language service has to be discovered by the planner of a composite service and incorporated into the composite service scenario. Meta-description of an atomic service is crucial to accomplish the planning process. This paper focuses on dictionary access services and proposes an abstract dictionary model that is vital for the accurate meta-description of such a service. In principle, the proposed model is based on the organization compatible with Princeton WordNet. Computational lexicons, including the EDR dictionary, as well as a range of human monolingual/bilingual dictionaries are uniformly organized into a WordNet-like lexical concept system. A modeling example with a few dictionary instances demonstrates the fundamental validity of the model.

pdf bib
Using Core Ontology for Domain Lexicon Structuring
Rita Marinelli | Adriana Roventini | Giovanni Spadoni

User demand has created the need to manage the growing body of new technical maritime terminology, which includes very different domains such as the juridical or commercial ones. A terminological database was built by exploiting the computational tools of ItalWordNet (IWN) and its lexical-semantic model (EuroWordNet). This paper concerns the development of the database structure and data coding, the relevance of the concepts of ‘term’ and ‘domain’, the information potential of the terms, the complexity of this domain, and the detailed ontology structuring recently undertaken and still in progress. Our domain structure is described by defining a core set of terms representing the two main sub-domains specified in ‘technical-nautical’ and ‘maritime transport’ terminology. These terms are sufficiently general to be the root nodes of the core ontology we are developing. They are mostly domain-dependent, but the link with the Top Ontology of IWN remains, conveying both general ‘foundation’ information and detailed descriptions directly connected with the specific domain. Through the semantic relations linking the synsets, every term ‘inherits’ the top ontology definitions and becomes itself an integral part of the structure. While codifying a term in the maritime database, reference is at the same time made to the Base Concepts of the terminological ontology embedding the term in the semantic network, showing that upper and core ontologies make it possible for the framework to integrate different views on the same domain in a meaningful way.

pdf bib
Linguistic Suite for Polish Cadastral System
Witold Abramowicz | Agata Filipowska | Jakub Piskorski | Krzysztof Węcel | Karol Wieloch

This paper reports on an endeavour of creating basic linguistic resources for geo-referencing of Polish free-text documents. We have defined a fine-grained named entity hierarchy, produced an exhaustive gazetteer, and developed named-entity grammars for Polish. Additionally, an annotated corpus for the cadastral domain was prepared for evaluation purposes. Our baseline approach to geo-referencing is based on application of the aforementioned resources and a lightweight co-referencing technique which utilizes the Jaro-Winkler string-similarity metric. We carried out a detailed evaluation of detecting locations, organizations and persons, which revealed that the best results are obtained via application of a combined grammar for all types. The application of lightweight co-referencing for organizations and persons improves recall but deteriorates precision, and no gain is observed for locations. The paper is accompanied by a demo, a geo-referencing application capable of: (a) finding documents and text fragments based on named entities and (b) populating the spatial ontology from texts.
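A self-contained sketch of the lightweight co-referencing step, using a textbook implementation of the Jaro-Winkler similarity and a simple threshold-based grouping of name variants; the threshold value and the grouping strategy are assumptions for illustration, not the system's exact procedure.

```python
# Jaro similarity: matching characters within a window plus a transposition penalty.
def jaro(s1, s2):
    if s1 == s2:
        return 1.0
    len1, len2 = len(s1), len(s2)
    if len1 == 0 or len2 == 0:
        return 0.0
    window = max(max(len1, len2) // 2 - 1, 0)
    match1, match2 = [False] * len1, [False] * len2
    matches = 0
    for i, c in enumerate(s1):
        for j in range(max(0, i - window), min(i + window + 1, len2)):
            if not match2[j] and s2[j] == c:
                match1[i] = match2[j] = True
                matches += 1
                break
    if matches == 0:
        return 0.0
    k = transpositions = 0
    for i in range(len1):
        if match1[i]:
            while not match2[k]:
                k += 1
            if s1[i] != s2[k]:
                transpositions += 1
            k += 1
    transpositions //= 2
    return (matches / len1 + matches / len2 + (matches - transpositions) / matches) / 3

# Jaro-Winkler: boost the Jaro score for a shared prefix of up to four characters.
def jaro_winkler(s1, s2, scaling=0.1):
    sim = jaro(s1, s2)
    prefix = 0
    for a, b in zip(s1, s2):
        if a != b or prefix == 4:
            break
        prefix += 1
    return sim + prefix * scaling * (1 - sim)

# Toy co-referencing: merge name mentions whose similarity exceeds a threshold.
def group_mentions(mentions, threshold=0.9):
    groups = []
    for mention in mentions:
        for group in groups:
            if jaro_winkler(mention.lower(), group[0].lower()) >= threshold:
                group.append(mention)
                break
        else:
            groups.append([mention])
    return groups
```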

pdf bib
Long-term Analysis of Prosodic Features of Spoken Guidance System User Speech
Hiromichi Kawanami | Takahiro Kitamura | Kiyohiro Shikano

As a practical information guidance system, we have been developing a speech-oriented system named "Takemaru-kun". The system has been operated in a public space since Nov. 2002. The system answers users’ questions about the hall facilities, sightseeing, transportation, weather information around the city, etc. All triggered inputs to the system have been recorded since operation started, and all system inputs from a 22-month period have been manually transcribed and labelled for the speaker’s gender and age category. In this paper, we conduct a long-term prosody analysis of user speech to find clues for inferring the user’s attitude from his or her speech. In this preliminary analysis, it is observed that F0 decreases regardless of age and gender category when the stability of the dialogue system is not established.

pdf bib
Text Mining for Semantic Relations as a Support Base of a Scientific Portal Generator
Vít Nováček | Pavel Smrž | Jan Pomikálek

Current Semantic Web implementation efforts pose a number of challenges. One of the biggest is the development and evolution of specific resources --- ontologies --- as a basis for representing the meaning of the web. This paper deals with the automatic acquisition of semantic relations from the text of scientific publications (journal articles, conference papers, project descriptions, etc.). We also describe the process of building corresponding ontological resources and their application to semi-automatic generation of scientific portals. Extracted relations and ontologies are crucial for the structuring of the information at the portal pages, automatic classification of the presented documents as well as for personalisation at the presentation level. Besides a general description of the portal generating system, we also give a detailed overview of the extraction of semantic relations in the form of a domain-specific ontology. The overview consists of a presentation of the architecture of the ontology extraction system, a description of the methods used for mining semantic relations, and an analysis of selected results and examples.

pdf bib
POS tagset design for Italian
Raffaella Bernardi | Andrea Bolognesi | Corrado Seidenari | Fabio Tamburini

We aim to automatically induce a PoS tagset for Italian by analysing the distributional behaviour of Italian words. To this end, we propose an algorithm that (a) extracts information from loosely labelled dependency structures that encode only basic and broadly accepted syntactic relations, namely Head/Dependent and the distinction of dependents into Argument vs. Adjunct, and (b) derives a possible set of word classes. The paper reports on some preliminary experiments carried out using the induced tagset in conjunction with state-of-the-art PoS taggers. The method proposed to design a proper tagset exploits little, if any, language-specific knowledge: hence it is in principle applicable to any language.

pdf bib
Augmenting a Semantic Verb Lexicon with a Large Scale Collection of Example Sentences
Kentaro Inui | Toru Hirano | Ryu Iida | Atsushi Fujita | Yuji Matsumoto

One of the crucial issues in semantic parsing is how to reduce costs of collecting a sufficiently large amount of labeled data. This paper presents a new approach to cost-saving annotation of example sentences with predicate-argument structure information, taking Japanese as a target language. In this scheme, a large collection of unlabeled examples is first clustered and selectively sampled, and for each sampled cluster, only one representative example is given a label by a human annotator. The advantages of this approach are empirically supported by the results of our preliminary experiments, where we use an existing similarity function and a naive sampling strategy.

pdf bib
The DiaCORIS project: a diachronic corpus of written Italian
C. Onelli | D. Proietti | C. Seidenari | F. Tamburini

The DiaCORIS project aims at the construction of a diachronic corpus comprising written Italian texts produced between 1861 and 1945, extending the structure and the research possibilities of the synchronic 100-million word corpus CORIS/CODIS. A preliminary in depth study has been performed in order to design a representative and well balanced sample of the Italian language over a time period that contains all the main events of contemporary Italian history from the National Unification to the end of the Second World War. The paper describes in detail such design processes as the definition of the main subcorpora and their proportions, the type of documents inserted in each part of the corpus, the document annotation schema and the technological infrastructure designed to manage the corpus access as well as the web interface to corpus data.

pdf bib
A Preliminary Study for Building the Basque PropBank
Eneko Agirre | Izaskun Aldezabal | Jone Etxeberria | Eli Pociello

This paper presents a methodology for adding a layer of semantic annotation to a syntactically annotated corpus of Basque (EPEC), in terms of semantic roles. The proposal we make here is the combination of three resources: the model used in the PropBank project (Palmer et al., 2005), an in-house database with syntactic/semantic subcategorization frames for Basque verbs (Aldezabal, 2004) and the Basque dependency treebank (Aduriz et al., 2003). In order to validate the methodology and to confirm whether the PropBank model is suitable for Basque and our treebank design, we have built lexical entries and labelled all arguments and adjuncts occurring in our treebank for 3 Basque verbs. The result of this study has been very positive, and has produced a methodology adapted to the characteristics of the language and the Basque dependency treebank. Another goal was to determine whether semi-automatic tagging was possible. The idea is to present the human taggers a pre-tagged version of the corpus. We have seen that many arguments could be automatically tagged with high precision, given only the verbal entries for the verbs and a handful of examples.

pdf bib
Detection of inconsistencies in concept classifications in a large dictionary — Toward an improvement of the EDR electronic dictionary —
Eiko Yamamoto | Kyoko Kanzaki | Hitoshi Isahara

The EDR electronic dictionary is a machine-tractable dictionary developed for advanced computer-based processing of natural language. This dictionary comprises eleven sub-dictionaries, including a concept dictionary, word dictionaries, bilingual dictionaries, co-occurrence dictionaries, and a technical terminology dictionary. In this study, we focus on the concept dictionary and aim to revise the arrangement of concepts for improving the EDR electronic dictionary. We believe that unsuitable concepts in a class differ from other concepts in the same class from an abstract perspective. From this notion, we first try to automatically extract those concepts unsuited to the class. We then try semi-automatically to amend the concept explications used to explain the meanings to human users and rearrange them in suitable classes. In the experiment, we try to revise those concepts that are the lower-concepts of the concept “human” in the concept hierarchy and that are directly arranged under concepts with concept explications such as “person as defined by –” and “person viewed from –.” We analyze the result and evaluate our approach.

pdf bib
A methodology for the joint development of the Basque WordNet and Semcor
Eneko Agirre | Izaskun Aldezabal | Jone Etxeberria | Eli Izagirre | Karmele Mendizabal | Eli Pociello | Mikel Quintian

This paper describes the methodology adopted to jointly develop the Basque WordNet and a hand-annotated corpus (the Basque Semcor). This joint development allows for better motivated sense distinctions, and a tighter coupling between both resources. The methodology involves edition, tagging and refereeing tasks. We are currently halfway through the nominal part of the 300.000-word corpus (roughly equivalent to a 500.000-word corpus for English). We present a detailed description of the task, including the main criteria for difficult cases in the edition of the senses and the tagging of the corpus, with special mention of multiword entries. Finally, we give a detailed picture of the current figures, as well as an analysis of the agreement rates.

pdf bib
Benefit of a Class-based Language Model for Real-time Closed-captioning of TV Ice-hockey Commentaries
Jan Hoidekr | J.V. Psutka | Aleš Pražák | Josef Psutka

This article describes the real-time speech recognition system for closed-captioning of TV ice-hockey commentaries. Automatic transcription of TV commentary accompanying an ice-hockey match is usually a hard task due to the spontaneous speech of a commentator, often set against very loud background noise created by the crowd, music, sirens, drums, whistles, etc. Data for building this system was collected from 41 matches that were played during the World Championships in 2000, 2001, and 2002 and were transmitted by Czech TV channels. The real-time closed-captioning system is based on a class-based language model designed after careful analysis of training data and OOV (Out-Of-Vocabulary) words in new (previously unseen) commentaries, with the goal of decreasing the OOV rate and increasing recognition accuracy.

pdf bib
Identifying and Classifying Terms in the Life Sciences: The Case of Chemical Terminology
Stefanie Anstein | Gerhard Kremer | Uwe Reyle

Facing the huge amount of textual and terminological data in the life sciences, we present a theoretical basis for the linguistic analysis of chemical terms. Starting with organic compound names, we conduct a morpho-semantic deconstruction into morphemes and yield a semantic representation of the terms' functional and structural properties. These semantic representations imply both the molecular structure of the named molecules and their class membership. A crucial feature of this analysis, which distinguishes it from all similar existing systems, is its ability to deal with terms that do not fully specify a structure as well as terms for generic classes of chemical compounds. Such `underspecified' terms occur very frequently in scientific literature. Our approach will serve for the support of manual database curation and as a basis for text processing applications.

pdf bib
Conceptual Vector Learning - Comparing Bootstrapping from a Thesaurus or Induction by Emergence
Mathieu Lafourcade

In the framework of Word Sense Disambiguation (WSD) and lexical transfer in Machine Translation (MT), the representation of word meanings is one critical issue. The conceptual vector model aims at representing thematic activations for chunks of text, lexical entries, up to whole documents. Roughly speaking, vectors are supposed to encode ideas associated to words or expressions. In this paper, we first present the conceptual vector model and the notions of semantic distance and contextualization between terms. Then, we present in detail the text analysis process coupled with conceptual vectors, which is used in text classification, thematic analysis and vector learning. The question we focus on is whether a thesaurus is really needed and desirable for bootstrapping the learning. We conducted two experiments with and without a thesaurus and present here some comparative results. Our contribution is that dimension distribution is done more regularly by an emergent procedure. In other words, the resources are more efficiently exploited with an emergent procedure than with a thesaurus: terms (concepts) as listed in a thesaurus somehow relate to their importance in the language, but neither to their frequency in usage nor to their power of discrimination or representativeness.

pdf bib
Rebuilding Lexical Resources for Information Retrieval using Sense Folder Detection and Merging Methods
Ernesto William De Luca | Andreas Nürnberger

In this paper we discuss the problem of sense disambiguation using lexical resources like ontologies or thesauri, with a focus on the application of sense detection and merging methods in information retrieval systems. For an information retrieval task it is important to detect the meaning of a query word for retrieving the related relevant documents. In order to recognize the meaning of a search word, lexical resources, like WordNet, can be used for word sense disambiguation. But, analyzing the WordNet structure, we see that this ontology is fraught with various problems. The overly fine-grained distinction between word senses, for example, is unfavorable for use in information retrieval. We describe related problems and present four implemented online methods to merge SynSets based on relations like hypernyms and hyponyms, and further context information like glosses and domain. Afterwards we show a first evaluation of our approach, compare the different merging methods and briefly discuss future work.

pdf bib
Automatic Acquisition of Semantics-Extraction Patterns
Pavel Smrž

This paper examines the use of parallel and comparable corpora for automatic acquisition of semantics-extraction patterns. It presents a new method of pattern extraction which takes advantage of parallel texts to "port" text mining solutions from a source language to a target language. It is shown that the technique can help in situations when the extraction procedure is to be applied in a language (languages) with a limited set of available resources, e.g. domain-specific thesauri. The primary motivation of our work lies in a particular multilingual e-learning system. For testing purposes, other applications of the given approach were implemented. They include pattern extraction from general texts (tested on wordnet relations), acquisition of domain-specific patterns from a large parallel corpus of legal EU documents, and mining of subjectivity expressions for a multilingual opinion extraction system.

pdf bib
Building a resource for studying translation shifts
Lea Cyrus

This paper describes an interdisciplinary approach which brings together the fields of corpus linguistics and translation studies. It presents ongoing work on the creation of a corpus resource in which translation shifts are explicitly annotated. Translation shifts denote departures from formal correspondence between source and target text, i.e. deviations that have occurred during the translation process. A resource in which such shifts are annotated in a systematic way will make it possible to study those phenomena that need to be addressed if machine translation output is to resemble human translation. The resource described in this paper contains English source texts (parliamentary proceedings) and their German translations. The shift annotation is based on predicate-argument structures and proceeds in two steps: first, predicates and their arguments are annotated monolingually in a straightforward manner. Then, the corresponding English and German predicates and arguments are aligned with each other. Whenever a shift - mainly grammatical or semantic - has occurred, the alignment is tagged accordingly.

pdf bib
Speech Recordings in Public Schools in Germany - the Perfect Show Case for Web-based Recordings and Annotation
Christoph Draxler | Klaus Jänsch

In the Ph@ttSessionz project, geographically distributed high-bandwidth recordings of adolescent speakers are performed in public schools all over Germany. To achieve a consistent technical signal quality, a standard configuration of recording equipment is sent to the participating schools. The recordings are made using the SpeechRecorder software for prompted speech recordings via the WWW. During a recording session, prompts are downloaded from a server, and the speech data is uploaded to the server in a background process. This paper focuses on the technical aspects of the distributed Ph@ttSessionz speech recordings and their annotation.

pdf bib
Court Stenography-To-Text (“STT”) in Hong Kong: A Jurilinguistic Engineering Effort
Benjamin K. Tsou | Tom B.Y. Lai | K.K. Sin | Lawrence Y.L. Cheung

Implementation of legal bilingualism in Hong Kong after 1997 has necessitated the production of voluminous and extensive court proceedings and judgments in both Chinese and English. For the former, Cantonese, a dialect of Chinese, is the home language of more than 90% of the population in Hong Kong and so used in the courts. To record speech in Cantonese verbatim, a Chinese Computer-Aided Transcription system has been developed. The transcription system converts stenographic codes into Chinese text, i.e. from phonetic to orthographic representation of the language. The main challenge lies in the resolution of the severe ambiguity resulting from homocode problems in the conversion process. Cantonese Chinese is typified by problematic homonymy, which presents serious challenges. The N-gram statistical model is employed to estimate the most probable character string of the input transcription codes. Domain-specific corpora have been compiled to support the statistical computation. To improve accuracy, scalable techniques such as domain-specific transcription and special encoding are used. Put together, these techniques deliver 96% transcription accuracy.
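A toy sketch of the homocode-resolution idea: each stenographic code maps to several candidate characters, and a character bigram model picks the most probable sequence by Viterbi search. The code table and the probability function are placeholders for illustration, not the system's actual resources.

```python
# Bigram Viterbi decoding over homocode candidates.
def decode(codes, candidates, bigram_logprob):
    """codes: list of stenographic codes; candidates: code -> list of characters;
    bigram_logprob(prev_char, char) -> log P(char | prev_char)."""
    # best[char] = (score, path) for the best prefix ending in char
    best = {c: (bigram_logprob("<s>", c), [c]) for c in candidates[codes[0]]}
    for code in codes[1:]:
        new_best = {}
        for c in candidates[code]:
            # extend the best-scoring predecessor for this candidate character
            score, path = max(
                ((s + bigram_logprob(p, c), q) for p, (s, q) in best.items()),
                key=lambda x: x[0],
            )
            new_best[c] = (score, path + [c])
        best = new_best
    return max(best.values(), key=lambda x: x[0])[1]
```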

pdf bib
Word Sense Disambiguation and Semantic Disambiguation for Construction Types in Deep Processing Grammars
Dorothee Beermann | Lars Hellan

The paper presents advances in the use of semantic features and interlingua relations for word sense disambiguation (WSD) as part of unification-based deep processing grammars. Formally, we present an extension of Minimal Recursion Semantics, introducing sortal specifications as well as interlingua semantic relations as a means of semantic decomposition.

pdf bib
Annotating Emotions in Meetings
Dennis Reidsma | Dirk Heylen | Roeland Ordelman

We present the results of two trials testing procedures for the annotation of emotion and mental state of the AMI corpus. The first procedure is an adaptation of the FeelTrace method, focusing on a continuous labelling of emotion dimensions. The second method is centered around more discrete labeling of segments using categorical labels. The results reported are promising for this hard task.

pdf bib
Results of the French Evalda-Media evaluation campaign for literal understanding
H. Bonneau-Maynard | C. Ayache | F. Bechet | A. Denis | A. Kuhn | F. Lefevre | D. Mostefa | M. Quignard | S. Rosset | C. Servan | J. Villaneau

The aim of the Media-Evalda project is to evaluate the understanding capabilities of dialog systems. This paper presents the Media protocol for speech understanding evaluation and describes the results of the June 2005 literal evaluation campaign. Five systems, both symbolic and corpus-based, participated in the evaluation, which is based on a common semantic representation. Different scorings have been performed on the system results. The understanding error rate for the Full scoring ranges, depending on the system, from 29% to 41.3%. A diagnostic analysis of these results is proposed.

pdf bib
Multilingual parallel treebanking: a lean and flexible approach
Jonas Kuhn | Michael Jellinghaus

We propose a bootstrapping approach to creating a phrase-level alignment over a sentence-aligned parallel corpus, reporting concrete treebank annotation work performed on a sample of sentence tuples from the Europarl corpus, currently for English, French, German, and Spanish. The manually annotated seed data will be used as the basis for automatically labelling the rest of the corpus. Some preliminary experiments addressing the bootstrapping aspects are presented. The representation format for syntactic correspondence across parallel text that we propose as the starting point for a process of successive refinement emphasizes correspondences of major constituents that realize semantic arguments or modifiers; language-particular details of morphosyntactic realization are intentionally left largely unlabelled. We believe this format is a good basis for training NLP tools for multilingual application contexts in which consistency across languages is more central than fine-grained details in specific languages (in particular, syntax-based statistical Machine Translation).

pdf bib
Automatic Detection of Well Recognized Words in Automatic Speech Transcriptions
Julie Mauclair | Yannick Estève | Simon Petit-Renaud | Paul Deléglise

This work addresses the use of confidence measures for extracting well-recognized words with a very low error rate from automatically transcribed segments in an unsupervised way. We present and compare several confidence measures and propose a method to merge them into a new one. We study its ability to extract correctly recognized word segments relative to the number of rejected words. We apply this fusion measure to select audio segments composed of words with a high confidence score. These segments come from an automatic transcription of French broadcast news given by our speech recognition system based on the CMU Sphinx3.3 decoder. Injecting new data resulting from unsupervised treatments of raw audio recordings into the training corpus of acoustic models gives a statistically significant improvement (95% confidence interval) in terms of word error rate. Experiments have been carried out on the corpus used during ESTER, the French evaluation campaign.
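A hedged sketch of the overall selection scheme: several per-word confidence measures are merged into a single score, and only segments whose words all exceed a threshold are kept for acoustic-model retraining. The weighted linear combination and the threshold value are assumptions for illustration; the paper's actual fusion method may differ.

```python
# Fuse several per-word confidence measures and keep only high-confidence segments.
def fuse(measures, weights):
    # measures: dict name -> score in [0, 1]; weights: dict name -> weight (summing to 1)
    return sum(weights[name] * measures[name] for name in weights)

def select_segments(segments, weights, threshold=0.9):
    """segments: list of lists of (word, measures-dict) pairs.
    A segment is kept only if every word's fused confidence reaches the threshold."""
    kept = []
    for segment in segments:
        if all(fuse(m, weights) >= threshold for _, m in segment):
            kept.append([word for word, _ in segment])
    return kept
```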

pdf bib
Next Generation Language Resources using Grid
Federico Calzolari | Eva Sassolini | Manuela Sassi | Sebastiana Cucurullo | Eugenio Picchi | Francesca Bertagna | Alessandro Enea | Monica Monachini | Claudia Soria | Nicoletta Calzolari

This paper presents a case study concerning the challenges and requirements posed by next generation language resources, realized as an overall model of open, distributed and collaborative language infrastructure. If a sort of “new paradigm” for language resource sharing is required, we think that the emerging and still evolving technology connected to Grid computing is a very interesting and suitable one for a concrete realization of this vision. Given the current limitations of Grid computing, it is very important to test the new environment on basic language analysis tools, in order to get the feeling of what are the potentialities and possible limitations connected to its use in NLP. For this reason, we have done some experiments on a module of the Linguistic Miner, i.e. the extraction of linguistic patterns from restricted domain corpora. The Grid environment has produced the expected results (reduction of the processing time, huge storage capacity, data redundancy) without any additional cost for the final user.

pdf bib
LEXADV - a multilingual semantic Lexicon for Adverbs
Sanni Nimb

The LEXADV project is a Scandinavian research project (2004-2006, financed by Nordplus Sprog) with the aim of extending three Scandinavian semantic lexicons building on the SIMPLE lexicon model (Lenci et al., 2000) with the word class of adverbs. In the lexicons of approx. 400 Danish, Norwegian and Swedish adverbs, the different senses are described with a semantic type and a set of semantic features. A classification covering the many meanings that adverbs can have has been established and integrated into the original SIMPLE ontology. Similarly, new features have been added to the model in order to describe the adverb senses. The working method of the project builds on the fact that the vocabularies of Danish, Norwegian and Swedish are closely related. An encoding tool has been developed with the special purpose of permitting easy transfer of semantic types and features between entries in the three languages. The Danish adverb senses have been described first, based on the definition in a modern, comprehensive Danish dictionary. Afterwards the lemmas have been translated and the semantic data have been copied into the Swedish as well as into the Norwegian equivalent entry. Finally, these copies have been evaluated and, when necessary, adjusted by native speakers.

pdf bib
Multi-domain Multi-lingual Named Entity Recognition: Revisiting & Grounding the resources issue
Voula Giouli | Alexis Konstandinidis | Elina Desypri | Harris Papageorgiou

The paper reports on the development methodology of a system aimed at multi-domain multi-lingual recognition and classification of names in texts, the focus being on the linguistic resources used for training and testing purposes. The corpus presented here has been collected and annotated in the framework of different projects the critical issue being the development of a final resource that is homogenous, re-usable and adaptable to different domains and languages with a view to robust multi-domain and multi-lingual NERC.

pdf bib
Inter-annotator Agreement on a Multilingual Semantic Annotation Task
Rebecca Passonneau | Nizar Habash | Owen Rambow

Six sites participated in the Interlingual Annotation of Multilingual Text Corpora (IAMTC) project (Dorr et al., 2004; Farwell et al., 2004; Mitamura et al., 2004). Parsed versions of English translations of news articles in Arabic, French, Hindi, Japanese, Korean and Spanish were annotated by up to ten annotators. Their task was to match open-class lexical items (nouns, verbs, adjectives, adverbs) to one or more concepts taken from the Omega ontology (Philpot et al., 2003), and to identify theta roles for verb arguments. The annotated corpus is intended to be a resource for meaning-based approaches to machine translation. Here we discuss inter-annotator agreement for the corpus. The annotation task is characterized by annotators’ freedom to select multiple concepts or roles per lexical item. As a result, the annotation categories are sets, the number of which is bounded only by the number of distinct annotator-lexical item pairs. We use a reliability metric designed to handle partial agreement between sets. The best results pertain to the part of the ontology derived from WordNet. We examine change over the course of the project, differences among annotators, and differences across parts of speech. Our results suggest a strong learning effect early in the project.

pdf bib
Measuring Agreement on Set-valued Items (MASI) for Semantic and Pragmatic Annotation
Rebecca Passonneau

Annotation projects dealing with complex semantic or pragmatic phenomena face the dilemma of creating annotation schemes that oversimplify the phenomena, or that capture distinctions conventional reliability metrics cannot measure adequately. The solution to the dilemma is to develop metrics that quantify the decisions that annotators are asked to make. This paper discusses MASI, a distance metric for comparing sets, and illustrates its use in quantifying the reliability of a specific dataset. Annotations of Summary Content Units (SCUs) generate models referred to as pyramids which can be used to evaluate unseen human summaries or machine summaries. The paper presents reliability results for five pairs of pyramids created for document sets from the 2003 Document Understanding Conference (DUC). The annotators worked independently of each other. Differences between application of MASI to pyramid annotation and its previous application to co-reference annotation are discussed. In addition, it is argued that a paradigmatic reliability study should relate measures of inter-annotator agreement to independent assessments, such as significance tests of the annotated variables with respect to other phenomena. In effect, what counts as sufficiently reliable inter-annotator agreement depends on the use the annotated data will be put to.
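The following minimal sketch follows the published definition of MASI for set-valued annotations: the Jaccard overlap of the two sets is scaled by a monotonicity factor (1 for identical sets, 2/3 when one set includes the other, 1/3 for partial overlap, 0 for disjoint sets), and the MASI distance is one minus that value. The snippet is an illustration, not the author's code.

```python
# MASI similarity and distance for two set-valued annotations.
def masi_similarity(a, b):
    a, b = set(a), set(b)
    if not a and not b:
        return 1.0
    jaccard = len(a & b) / len(a | b)
    if a == b:
        monotonicity = 1.0
    elif a <= b or b <= a:
        monotonicity = 2 / 3
    elif a & b:
        monotonicity = 1 / 3
    else:
        monotonicity = 0.0
    return jaccard * monotonicity

def masi_distance(a, b):
    return 1.0 - masi_similarity(a, b)

# One annotator chose a single concept, the other chose a superset of it:
print(masi_distance({"concept1"}, {"concept1", "concept2"}))  # 1 - (1/2)*(2/3) = 0.666...
```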

pdf bib
Exploiting Dynamic Passage Retrieval for Spoken Question Recognition and Context Processing towards Speech-driven Information Access Dialogue
Tomoyosi Akiba

Speech interfaces and dialogue processing abilities have promise for improving the utility of open-domain question answering (QA). We propose a novel method of resolving disambiguation problems arising in such speech- and dialogue-enhanced QA tasks. The proposed method exploits passage retrieval, which is one of the main components common to many QA systems. The basic idea of the method is that the similarity with some passage in the target documents can be used to select the appropriate question from the candidates. In this paper, we applied the method to solve two subtasks of QA: (1) N-best rescoring of LVCSR outputs, which selects the most appropriate candidate as a question sentence, in the speech-driven QA (SDQA) task, and (2) context processing, which composes a complete question sentence from a submitted incomplete one by using the elements appearing in the dialogue context, in the information access dialogue (IAD) task. For both tasks, dynamic passage retrieval is introduced to further improve the performance. The experimental results showed that the proposed method is quite effective in improving QA performance in both tasks.
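A toy sketch of the rescoring idea: each N-best ASR hypothesis is scored by its best-matching passage in the target documents, and the best-supported hypothesis is selected. The term-overlap score below is a simple stand-in for the system's actual passage retrieval component.

```python
# Select the N-best hypothesis best supported by some passage in the collection.
def passage_score(hypothesis, passage):
    hyp_terms = set(hypothesis.lower().split())
    passage_terms = set(passage.lower().split())
    return len(hyp_terms & passage_terms) / (len(hyp_terms) or 1)

def rescore_nbest(nbest_hypotheses, passages):
    return max(nbest_hypotheses,
               key=lambda hyp: max(passage_score(hyp, p) for p in passages))
```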

pdf bib
Annotation of Temporal Relations with Tango
Marc Verhagen | Robert Knippen | Inderjeet Mani | James Pustejovsky

Temporal annotation is a complex task characterized by low markup speed and low inter-annotator agreement scores. Tango is a graphical annotation tool for temporal relations. It is developed for the TimeML annotation language and allows annotators to build a graph that resembles a timeline. Temporal relations are added by selecting events and drawing labeled arrows between them. Tango is integrated with a temporal closure component and includes features like SmartLink, user prompting and automatic linking of time expressions. Tango has been used to create two corpora with temporal annotation, TimeBank and the AQUAINT Opinion corpus.

pdf bib
Annotating Information Structure in a Corpus of Spoken Danish
Patrizia Paggio

This paper presents the work done to annotate a corpus of spoken Danish with information structure tags, and describes a preliminary study in which the corpus has been used to investigate the relation between focus and intra-clausal pauses. The study indicates that the pauses that do fall within the focus domain tend to precede property-expressing words by which the object in focus is distinguished from other similar ones.

pdf bib
Corpus Portal for Search in Monolingual Corpora
Uwe Quasthoff | Matthias Richter | Christian Biemann

A simple and flexible schema for storing and presenting monolingual language resources is proposed. In this format, data for 18 different languages is already available in various sizes. The data is provided free of charge for online use and download. The main target is to ease the application of algorithms for monolingual and interlingual studies.

pdf bib
Constraint-Based Parsing as an Efficient Solution: Results from the Parsing Evaluation Campaign EASy
Tristan Vanrullen | Philippe Blache | Jean-Marie Balfourier

This paper describes the unfolding of the EASy evaluation campaign for French parsers as well as the techniques employed for the participation of the LPL laboratory in this campaign. Three symbolic parsers based on the same resource and the same formalism (Property Grammars) are described and evaluated. The first results of this evaluation are analyzed and lead to the conclusion that symbolic parsing in a constraint-based formalism is efficient and robust.

pdf bib
Parallel Corpora and Phrase-Based Statistical Machine Translation for New Language Pairs via Multiple Intermediaries
Andreas Eisele

We present a large parallel corpus of texts published by the United Nations Organization, which we exploit for the creation of phrase-based statistical machine translation (SMT) systems for new language pairs. We present a setup where phrase tables for these language pairs are used for translation between languages for which parallel corpora of sufficient size are so far not available. We give some preliminary results for this novel application of SMT and discuss further refinements.

pdf bib
Parallel Syntactic Annotation of Multiple Languages
Owen Rambow | Bonnie Dorr | David Farwell | Rebecca Green | Nizar Habash | Stephen Helmreich | Eduard Hovy | Lori Levin | Keith J. Miller | Teruko Mitamura | Florence Reeder | Advaith Siddharthan

This paper describes an effort to investigate the incrementally deepening development of an interlingua notation, validated by human annotation of texts in English plus six languages. We begin with deep syntactic annotation, and in this paper present a series of annotation manuals for six different languages at the deep-syntactic level of representation. Many syntactic differences between languages are removed in the proposed syntactic annotation, making them useful resources for multilingual NLP projects with semantic components.

pdf bib
Corpus description of the ESTER Evaluation Campaign for the Rich Transcription of French Broadcast News
S. Galliano | E. Geoffrois | G. Gravier | J.-F. Bonastre | D. Mostefa | K. Choukri

This paper presents the audio corpus developed in the framework of the ESTER evaluation campaign of French broadcast news transcription systems. This corpus includes 100 hours of manually annotated recordings and 1,677 hours of non-transcribed data. The manual annotations include the detailed verbatim orthographic transcription, the speaker turns and identities, information about acoustic conditions, and named entities. Additional resources generated by automatic speech processing systems, such as phonetic alignments and word graphs, are also described.

pdf bib
Usability evaluation of 3G multimodal services in Telefónica Móviles España
Juan José Rodríguez Soler | Pedro Concejero Cerezo | Carlos Lázaro Ávila | Daniel Tapias Merino

Third generation (3G) services boost mobile multimodal interaction offering users richer communication alternatives for accessing different applications and information services. These 3G services provide more interaction alternatives as well as active learning possibilities than previous technologies but, at the same time, these facts increase the complexity of user interfaces. Therefore, usability in multimodal interfaces has become a key factor in the service design process. In this paper we present the work done to evaluate the usability of automatic video services based on avatars with real potential users of a video-voice mail service. We describe the methodology, the tests carried out and the results and conclusions of the study. This study addresses UMTS/3G problems like the interface model, the voice-image synchronization and the user attention and memory. All the user tests have been carried out using a mobile device to take into account the constraints imposed by the screen size and the presentation and interaction limitations of a current mobile phone.

pdf bib
Exploiting Parallel Corpora for Supervised Word-Sense Disambiguation in English-Hungarian Machine Translation
Márton Miháltz | Gábor Pohl

In this paper we present an experiment to automatically generate annotated training corpora for a supervised word sense disambiguation module operating in an English-Hungarian and a Hungarian-English machine translation system. Training examples for the WSD module of the MT system are produced by annotating ambiguous lexical items in the source language (words having several possible translations) with their proper target language translations. Since manually annotating training examples is very costly, we are experimenting with a method to produce examples automatically from parallel corpora. Our algorithm relies on monolingual and bilingual lexicons and dictionaries in addition to statistical methods in order to annotate examples extracted from a large English-Hungarian parallel corpus accurately aligned at sentence level. In the paper, we present an experiment with the English noun state, where we categorized the different occurrences in the Hunglish parallel corpus. For this noun, most of the examples were covered by multiword lexical items originating from our lexical sources.
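A hedged sketch of the automatic labelling step under simplifying assumptions: for each aligned sentence pair containing the ambiguous English noun, the candidate Hungarian translations from a bilingual lexicon are matched against the Hungarian side, and unambiguous matches become labelled training examples. The lexicon lookup and substring matching are illustrative simplifications, not the authors' exact procedure (which also uses monolingual lexicons and statistical methods).

```python
# Label occurrences of an ambiguous English word with the Hungarian translation
# observed in the aligned sentence.
def label_examples(sentence_pairs, english_word, translations):
    """sentence_pairs: list of (english_sentence, hungarian_sentence) tuples;
    translations: candidate Hungarian equivalents of english_word."""
    labelled = []
    for en, hu in sentence_pairs:
        if english_word not in en.lower().split():
            continue
        matches = [t for t in translations if t in hu.lower()]
        if len(matches) == 1:          # keep only unambiguous cases
            labelled.append((en, matches[0]))
    return labelled
```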

pdf bib
A Framework to Integrate Ubiquitous Knowledge Modeling
Porfírio Filipe | Nuno Mamede

This paper describes our contribution to letting end users configure mixed-initiative spoken dialogue systems to suit their personalized goals. The main problem that we want to address is the reconfiguration of spoken language dialogue systems to deal with generic plug-and-play artifacts. Such reconfiguration can be seen as a portability problem and is a critical research issue. In order to solve this problem, we describe a hybrid approach to designing ubiquitous domain models that allows the dialogue system to perform recognition of available tasks on the fly. Our approach considers two kinds of domain knowledge: the global knowledge and the local knowledge. The global knowledge, which is modeled using a top-down approach, is associated at design time with the dialogue system itself. The local knowledge, which is modeled using a bottom-up approach, is defined with each one of the artifacts. When an artifact is activated or deactivated, a bilateral process, supported by a broker, updates the domain knowledge considering the artifact's local knowledge. We assume that everyday artifacts are augmented with computational capabilities and semantic descriptions supported by their own knowledge model. A case study focusing on a microwave oven is presented.

pdf bib
Searching treebanks for functional constraints: cross-lingual experiments in grammatical relation assignment
Felice Dell’Orletta | Alessandro Lenci | Simonetta Montemagni | Vito Pirrelli

The paper reports on a detailed quantitative analysis of distributional language data of both Italian and Czech, highlighting the relative contribution of a number of distributed grammatical factors to sentence-based identification of subjects and direct objects. The work is based on a Maximum Entropy model of the stochastic resolution of conflicting grammatical constraints, and is demonstrably capable of putting explanatory theoretical accounts to the challenging test of an extensive, usage-based empirical verification.

pdf bib
SynAF: Towards a Standard for Syntactic Annotation
Thierry Declerck

In the paper we present the current state of development of an international standard for syntactic annotation, called SynAF. This standard is being prepared by the Technical Committee ISO/TC 37 (Terminology and Other Language Resources), Subcommittee SC 4 (Language Resource Management), in collaboration with the European eContent Project “LIRICS” (Linguistic Infrastructure for Interoperable Resources and Systems).

pdf bib
EQueR: the French Evaluation campaign of Question-Answering Systems
Christelle Ayache | Brigitte Grau | Anne Vilnat

This paper describes the EQueR-EVALDA Evaluation Campaign, the French evaluation campaign of Question-Answering (QA) systems. The EQueR Evaluation Campaign included two tasks of automatic answer retrieval: the first was a QA task over a heterogeneous collection of texts, mainly newspaper articles, and the second a specialised task in the medical domain over a corpus of medical texts. In total, seven groups participated in the General task and five groups participated in the Medical task. For the General task, the best system obtained 81.46% correct answers in the evaluation of passages, and 67.24% in the evaluation of short answers. We describe herein the specifications, the corpora, the evaluation, the phase of judgment of results, the scoring phase and the results for the two different types of evaluation.

pdf bib
The African Varieties of Portuguese: Compiling Comparable Corpora and Analyzing Data-Derived Lexicon
Maria Fernanda Bacelar do Nascimento | José Bettencourt Gonçalves | Luísa Pereira | Antónia Estrela | Afonso Pereira | Rui Santos | Sancho M. Oliveira

“Linguistic Resources for the Study of the Portuguese African Varieties” is an ongoing project that aims at the constitution, treatment, analysis and availability of a corpus of the African varieties of Portuguese, with 3 million words of written and spoken texts, constituted by five comparable subcorpora corresponding to the varieties of Angola, Cape Verde, Guinea-Bissau, Mozambique and Sao Tome and Principe. This material will allow intra- and inter-corpora comparative studies, which will make visible variations that result from discursive and pragmatic differences of each corpus, as well as aspects of linguistic unity or diversity that characterise the spoken Portuguese of these five African countries. The five corpora are comparable in size (600,000 words each), in chronology (the last 30 years) and in types and genres (24,000 spoken words and c. 580,000 written words, the latter belonging to newspapers, literature and varia). The corpus is automatically annotated and, after the extraction of alphabetical lists of lexical forms, these data will be automatically lemmatised. Five separate lists of vocabulary for each variety will be established. A tool for word extraction and preferential calculus according to predefined indexes, in order to achieve lexicon comparison of the African Portuguese varieties, is being developed. Concordance extraction will also be performed.

pdf bib
Toward a Pan-Chinese Thesaurus
Benjamin K. Tsou | Oi Yee Kwong

In this paper, we propose a corpus-based approach to the construction of a Pan-Chinese lexical resource, starting out with the aim to enrich existing Chinese thesauri in the Pan-Chinese context. The resulting thesaurus is thus expected to contain not only the core senses and usages of Chinese lexical items but also usages specific to individual Chinese speech communities. We introduce the ideas behind the construction of the resource, outline the steps to be taken, and discuss some preliminary analyses. The work is backed up by a unique and large Chinese synchronous corpus containing textual data from various Chinese speech communities including Hong Kong, Beijing, Taipei and Singapore.

pdf bib
User requirements analysis for the design of a reference corpus of written Dutch
Nelleke Oostdijk | Lou Boves

The Dutch Language Corpus Initiative (D-Coi) project aims to specify the design of a 500-million-word reference corpus of written Dutch, and to put the tools and procedures in place that are needed to actually construct such a corpus. One of the tasks in the project is to conduct a user requirements study that should provide the basis for the eventual design of the 500-million-word reference corpus. The present paper outlines the user requirements analysis and reports the results so far.

pdf bib
FRASQUES: A Question Answering system in the EQueR evaluation campaign
Brigitte Grau | Anne-Laure Ligozat | Isabelle Robba | Anne Vilnat | Laura Monceaux

Question-answering (QA) systems aim at providing either a small passage or just the answer to a question in natural language. We have developed several QA systems that work on both English and French. This way, we are able to provide answers to questions given in either language by searching documents in both languages. In this article, we present our French monolingual system FRASQUES, which participated in the EQueR evaluation campaign of QA systems for French in 2004. First, the QA architecture common to our systems is shown. Then, for every step of the QA process, we consider which steps are language-independent, and, for those that are language-dependent, which tools or processes need to be adapted to switch from one language to another. Finally, our results at EQueR are given and commented upon; an error analysis is conducted, and the kind of knowledge needed to answer a question is studied.

pdf bib
Evaluation Methods of a Linguistically Enriched Translation Memory System
Gábor Hodász

The paper gives an overview of the evaluation methods of memory-based translation systems: Translation Memories (TM) and Example Based Machine Translation (EBMT) systems. After a short comparison with the well-discussed methods of evaluation of Machine Translation (MT) Systems we give a brief overview of current methodology on memory-based applications. We propose a new aspect, which takes the content of memory into account: a measure to describe the correspondence between the memory and the current segment to translate. We also offer a brief survey of a linguistically enriched translation memory on which these new methods will be tested.

pdf bib
T2O - Recycling Thesauri into a Multilingual Ontology
Alberto Simões | José João Almeida

In this article we present T2O - a workbench to assist the process of translating heterogeneous resources into ontologies, to enrich and add multilingual information, to help programming with them, and to support ontology publishing. T2O is an ontology algebra.

pdf bib
Data-driven Amharic-English Bilingual Lexicon Acquisition
Saba Amsalu

This paper describes a simple approach of statistical language modelling for bilingual lexicon acquisition from Amharic-English parallel corpora. The goal is to induce a seed translation lexicon from sentence-aligned corpora. The seed translation lexicon contains matches of Amharic lexemes to weakly inflected English words. Purely statistical measures of term distribution are used as the basis for finding correlations between terms. A scoring scheme is codified based on the distributional properties of words. For low-frequency terms, a two-step procedure is applied: first a rough alignment, and then an automatic filtering to sift the output and improve precision. Given the disparity of the languages and the small size of the corpora used, the results demonstrate the viability of the approach.
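
As a rough illustration of this kind of purely distributional lexicon induction (a minimal sketch under our own assumptions, not the scoring scheme used in the paper), the fragment below proposes a seed lexicon from sentence-aligned text by scoring word pairs with the Dice coefficient over sentence-level co-occurrence; the toy tokens and the threshold are invented, and the two-step procedure for low-frequency terms described above is omitted.

```python
# Hypothetical sketch: seed-lexicon induction from sentence-aligned text
# using a simple distributional association score (Dice coefficient).
from collections import Counter
from itertools import product

def induce_seed_lexicon(aligned_pairs, min_score=0.3):
    """aligned_pairs: list of (source_tokens, target_tokens) sentence pairs."""
    src_freq, tgt_freq, co_freq = Counter(), Counter(), Counter()
    for src_sent, tgt_sent in aligned_pairs:
        src_types, tgt_types = set(src_sent), set(tgt_sent)
        src_freq.update(src_types)
        tgt_freq.update(tgt_types)
        co_freq.update(product(src_types, tgt_types))
    lexicon = {}
    for (s, t), c in co_freq.items():
        dice = 2.0 * c / (src_freq[s] + tgt_freq[t])
        if dice >= min_score and dice > lexicon.get(s, ("", 0.0))[1]:
            lexicon[s] = (t, dice)          # keep the best-scoring candidate
    return lexicon

# Toy usage with invented (romanised) tokens:
pairs = [(["bet"], ["house"]), (["bet", "tiliq"], ["big", "house"])]
print(induce_seed_lexicon(pairs))
```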

pdf bib
ISA & ICA - Two Web Interfaces for Interactive Alignment of Bitexts
Jörg Tiedemann

ISA and ICA are two web interfaces for interactive alignment of parallel texts. ISA provides an interface for automatic and manual sentence alignment. It includes cognate filters, uses structural markup to improve automatic alignment, and provides intuitive tools for editing the resulting alignments. Alignment results can be saved to disk or sent via e-mail. ICA provides an interface to the clue aligner from the Uplug toolbox. It allows one to set various parameters and visualizes alignment results in a two-dimensional matrix. Word alignments can be edited and saved to disk.

pdf bib
Wizard-of-Oz Data Collection for Perception and Interaction in Multi-User Environments
Petra-Maria Strauß | Holger Hoffman | Wolfgang Minker | Heiko Neumann | Günther Palm | Stefan Scherer | Friedhelm Schwenker | Harald Traue | Welf Walter | Ulrich Weidenbacher

In this paper we present the setup of an extensive Wizard-of-Oz environment used for the data collection and the development of a dialogue system. The envisioned Perception and Interaction Assistant will act as an independent dialogue partner. Passively observing the dialogue between the two human users with respect to a limited domain, the system should take the initiative and get meaningfully involved in the communication process when required by the conversational situation. The data collection described here involves audio and video data. We aim at building a rich multi-media data corpus to be used as a basis for our research, which includes, inter alia, speech and gaze direction recognition, dialogue modelling and proactivity of the system. We further aspire to obtain data with emotional content to perform research on emotion recognition as well as psychophysiological and usability analysis.

pdf bib
The Evolution of an Evaluation Framework for a Text Mining System
Nancy L. Underwood | Agnes Lisowska

The Parmenides project developed a text mining application applied in three different domains exemplified by case studies for the three user partners in the project. During the lifetime of the project (and in parallel with the development of the system itself) an evaluation framework was developed by the authors in conjunction with the users, and was eventually applied to the system. The object of the exercise was two-fold: firstly to develop and perform a complete user-centered evaluation of the system to assess how well it answered the users' requirements and, secondly, to develop a general framework which could be applied in the context of other users' requirements and (with some modification) to similar systems. In this paper we describe not only the framework but the process of building and parameterising the quality model for each case study and, perhaps most interestingly, the way in which the quality model and users' requirements and expectations evolved over time.

pdf bib
A pilot study for a Corpus of Dutch Aphasic Speech (CoDAS)
Eline Westerhout | Paola Monachesi

In this paper, a pilot study for the development of a Corpus of Dutch Aphasic Speech (CoDAS) is presented. Given the lack of resources of this kind not only for Dutch but also for other languages, CoDAS will be able to set standards and will contribute to future research in this area. Given the special character of the speech contained in CoDAS, we cannot simply carry over the design and annotation protocols of existing corpora, such as the Corpus Gesproken Nederlands (CGN) or CHILDES. However, they have been taken as a starting point. We have investigated whether and how the procedures and protocols for the annotation (part-of-speech tagging) and transcription (orthographic and phonetic) used for the CGN should be adapted in order to annotate and transcribe aphasic speech properly. In addition, we have established the basic requirements with respect to text types, metadata, and annotation levels that CoDAS should fulfill.

pdf bib
A German Sign Language Corpus of the Domain Weather Report
Jan Bungeroth | Daniel Stein | Philippe Dreuw | Morteza Zahedi | Hermann Ney

All systems for automatic sign language translation and recognition, in particular statistical systems, rely on adequately sized corpora. For this purpose, we created the Phoenix corpus that is based on German television weather reports translated into German Sign Language. It comes with a rich annotation of the video data, a bilingual text-based sentence corpus and a monolingual German corpus.

pdf bib
Creation and Use of Lexicons and Ontologies for NL Interfaces to Databases
Roberto Bartolini | Caterina Caracciolo | Emiliano Giovanetti | Alessandro Lenci | Simone Marchi | Vito Pirrelli | Chiara Renso | Laura Spinsanti

In this paper we present an original approach to natural language query interpretation which has been implemented within the FuLL (Fuzzy Logic and Language) Italian project of BC S.r.l. In particular, we discuss here the creation of linguistic and ontological resources, together with the exploitation of existing ones, for natural language-driven database access and retrieval. Both the database and the queries we experiment with are Italian, but the methodology we broach naturally extends to other languages.

pdf bib
Automatic Detection of Orthographic Cues for Cognate Recognition
Andrea Mulloni | Viktor Pekar

Present-day machine translation technologies crucially depend on the size and quality of lexical resources. Much recent research in the area has been concerned with methods to build bilingual dictionaries automatically. In this paper we propose a methodology for the automatic detection of cognates between two languages based solely on the orthography of words. From a set of known cognates, the method induces rules capturing regularities of orthographic mutations that a word undergoes when migrating from one language into the other. The rules are then applied as a preprocessing step before measuring the orthographic similarity between putative cognates. As a result, the method achieves an improvement in F-measure of 11.86% in comparison with detecting cognates based only on the edit distance between them.
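
A minimal sketch of the general idea (our own assumptions; the substitution rules below are invented, not the ones induced in the paper): known orthographic mutations are applied to a source word before computing a normalised edit-distance similarity against a putative cognate.

```python
# Hypothetical sketch: rule preprocessing followed by normalised edit distance.

# Invented substitution rules standing in for induced orthographic mutations.
RULES = [("ph", "f"), ("c", "k"), ("y", "i")]

def apply_rules(word, rules=RULES):
    for src, tgt in rules:
        word = word.replace(src, tgt)
    return word

def edit_distance(a, b):
    # classic dynamic-programming Levenshtein distance
    dp = list(range(len(b) + 1))
    for i, ca in enumerate(a, 1):
        prev, dp[0] = dp[0], i
        for j, cb in enumerate(b, 1):
            prev, dp[j] = dp[j], min(dp[j] + 1, dp[j - 1] + 1, prev + (ca != cb))
    return dp[-1]

def cognate_score(src_word, tgt_word):
    src_norm = apply_rules(src_word.lower())
    dist = edit_distance(src_norm, tgt_word.lower())
    return 1.0 - dist / max(len(src_norm), len(tgt_word))   # 1.0 means identical

print(cognate_score("physics", "fisik"))   # higher than without rule preprocessing
```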

pdf bib
Open Source Corpus Analysis Tools for Malay
Timothy Baldwin | Su’ad Awab

Tokenisers, lemmatisers and POS taggers are vital to the linguistic and digital furtherment of any language. In this paper, we present an open source toolkit for Malay incorporating a word and sentence tokeniser, a lemmatiser and a partial POS tagger, based on heavy reuse of pre-existing language resources. We outline the software architecture of each component, and present an evaluation of each over a 26K word sample of Malay text.

pdf bib
A task-oriented framework for evaluating theme detection systems: A discussion paper
Fidelia Ibekwe-Sanjuan

This paper discusses the inherent difficulties in evaluating systems for theme detection. Such systems are based essentially on unsupervised clustering aiming to discover the underlying structure in a corpus of texts. As the structures are precisely unknown beforehand, it is difficult to devise a satisfactory evaluation protocol. Several problems are posed by cluster evaluation: determining the optimal number of clusters, cluster content evaluation, topology of the discovered structure. Each of these problems has been studied separately but some of the proposed metrics portray significant flaws. Moreover, no benchmark has been commonly agreed upon. Finally, it is necessary to distinguish between task-oriented and activity-oriented evaluation as the two frameworks imply different evaluation protocols. Possible solutions to the activity-oriented evaluation can be sought from the data and text mining communities.

pdf bib
Generation of Language Resources for the Development of Speech Technologies in Catalan
A. Moreno | Albert Febrer | Lluis Márquez

This paper describes a joint initiative of the Catalan and Spanish Governments to produce Language Resources for the Catalan language. A methodology similar to the Basic Language Resource Kit (BLARK) concept was applied to determine the priorities in the production of the Language Resources. The paper shows the LRs and tools currently available for the Catalan language, for both Language and Speech technologies. The production of large databases for Automatic Speech Recognition purposes has already started. All the resources generated in the project follow EU standards, will be validated by an external centre, and will be freely and publicly available through ELRA.

pdf bib
If “it” were “then”, then when was “it”? Establishing the anaphoric role of “then”
Georgiana Puşcaşu | Ruslan Mitkov

The adverb "then" is among the most frequent Englishtemporal adverbs, being also capable of filling a variety of semantic roles. The identification of anaphoric usages of "then"is important for temporal expression resolution, while thetemporal relationship usage is important for event ordering. Given that previous work has not tackled the identification and temporal resolution of anaphoric "then", this paper presents a machine learning approach for setting apart anaphoric usages and a rule-based normaliser that resolves it with respect to an antecedent. The performance of the two modules is evaluated. The present paper also describes the construction of an annotated corpus and the subsequent derivation of training data required by the machine learning module.

pdf bib
Morphdb.hu: Hungarian lexical database and morphological grammar
Viktor Trón | Péter Halácsy | Péter Rebrus | András Rung | Péter Vajda | Eszter Simon

This paper describes morphdb.hu, a Hungarian lexical database and morphological grammar. Morphdb.hu is the outcome of a several-year collaborative effort and represents the resource with the widest coverage and broadest range of applicability presently available for Hungarian. The grammar resource is the formalization of well-founded theoretical decisions handling inflection and productive derivation. The lexical database was created by merging three independent lexical databases, and the resulting resource was further extended.

pdf bib
A Lexicalized Tree-Adjoining Grammar for Vietnamese
H. Phuong Le | T. M. Huyen Nguyen | Laurent Romary | Azim Roussanaly

In this paper, we present the first sizable grammar built for Vietnamese using LTAG, developed over the past two years, named vnLTAG. This grammar aims at modelling written language and is general enough to be both application- and domain-independent. It can be used for the morpho-syntactic tagging and syntactic parsing of Vietnamese texts, as well as text generation. We then present a robust parsing scheme using vnLTAG and a parser for the grammar. We finish with an evaluation using a test suite.

pdf bib
Semi-automatic Building of Swedish Collocation Lexicon
Silvie Cinková | Pavel Pecina | Petr Podveský | Pavel Schlesinger

This work focuses on the semi-automatic extraction of verb-noun collocations from a corpus, performed to provide lexical evidence for the manual lexicographical processing of Support Verb Constructions (SVCs) in the Swedish-Czech Combinatorial Valency Lexicon of Predicate Nouns. The efficiency of the purely manual extraction procedure is significantly improved by the utilization of automatic statistical methods based on lexical association measures.

pdf bib
Creation and analysis of a Polish speech database for use in unit selection synthesis
Dominika Oliver | Krzysztof Szklanny

The main aim of this study is to describe the process of creating a speech database to be used in corpus based text-to-speech synthesis. To help achieve natural sounding speech synthesis, the database construction was aimed at rich phonetic and prosodic coverage based on variable length units (phoneme, diphone, triphone) from different phonetic and prosodic contexts. Following previous work on determining the optimal coverage (Szklanny and Oliver, 2005), text selection was based on the existing text corpus containing parliamentary statements. Corpus balancing was followed by recording of the material. Automatic segmentation was performed, followed by both an automatic and manual check of the data to determine speaker specific phenomena and correct the labelling. Additionally, prosodic annotation involving assignment of the intonation contours was performed in order to assess the accent realisation and determine the prosodic coverage of the database. The prototype speech synthesiser was built to determine the validity of the above steps and test the resulting voice quality.

pdf bib
CoGrOO: a Brazilian-Portuguese Grammar Checker based on the CETENFOLHA Corpus
Jorge Kinoshita | Laís do Nascimento Salvador | Carlos Eduardo Dantas de Menezes

This paper describes an ongoing Portuguese Language grammar checker project, called CoGrOO - Corretor Gramatical para OpenOffice (Grammar Checker for OpenOffice), based on CETENFOLHA, a Brazilian Portuguese morphosyntactically annotated corpus. Two of its features are highlighted: a hybrid architecture, mixing rules and statistics, and its nature as a free software project. This project aims at checking grammatical errors such as nominal and verbal agreement, “crase” (the coalescence of the preposition “a” (to) + the definite singular determiner “a”, yielding “à”), nominal and verbal government, and other common errors in the Brazilian Portuguese language. We also present some empirical results based on the implemented techniques.

pdf bib
Evaluation of Automatically Generated Transcriptions of Non-Native Pronunciations using a Phonetic Distance Measure
Stefan Schaden

The paper reports on the evaluation of a rule-based technique to model prototypical non-native pronunciation variants on the symbolic transcription level. This technique was developed to explore the possibility of an automatic generation of adapted pronunciation lexicons for different non-native speaker groups. The rule sets, which are currently available for nine language directions, are based on non-native speech data compiled specifically for this purpose. Since manual phonetic annotations are available for the speech data, the evaluation was performed on the transcription level by measuring the phonetic distance between the automatically generated pronunciation variants and actual pronunciations of non-native speakers. One of the central questions to be addressed by the evaluation is whether the rules have any predictive value: it has to be determined if and to what degree the rules are capable of generating realistic pronunciation variants for previously unseen speakers. Secondly, the rules should not only represent the pronunciations of individual speakers adequately; instead, they should be representative of speaker groups (cross-speaker representation). The paper outlines the evaluation methodology and presents results for selected language directions.

pdf bib
Slips and errors in spoken data transcription
Isabella Chiari

The present work illustrates the main results of an experiment on errors and repairs in spoken language transcription, with significant relevance for the evaluation of the validity, reliability and correctness of transcriptions of speech belonging to several different typologies, set for the annotation of spoken corpora. In particular, we dealt with errors and repair strategies that appear in the first drafts of the transcription process and that are not easily detectable with automatic post-editing procedures. 20 participants were asked to give an accurate transcription of 22 short utterances, repeated from one to four times, belonging to non-spontaneous (10) and spontaneous (10) conversation. Error analysis suggests a general preference for meaning preservation even after the alteration of the original form, and a preference for certain error patterns and repair strategies.

pdf bib
Evaluation of Web-based Corpora: Effects of Seed Selection and Time Interval
Motoko Ueyama

Recently, there have been efforts to construct written corpora by using the WWW. A promising approach to build Web corpora is to run automated queries to search engines and download pages found in this way. This makes it possible to build corpora rapidly and economically, but we cannot control what are contained in resulting corpora. Under these circumstances, it is important to verify the general nature of Web corpora. This study, in particular, investigated effects of two essential factors on three Japanese corpora that we built: seed terms used for queries; and time interval between different corpus construction sessions, which measures the stability of query results over time. We evaluated the corpora qualitatively, in terms of domains, genres and typical lexical items. Results show these two patterns: 1) both seed selection and time interval affect the distribution of text and lexicon; 2) the effect of seed selection is much stronger. The prominent effect of seed selection suggests that a good understanding of the cause-and-effect relation between seeds and retrieved documents is an important step to gain some control over the characteristics of Web corpora, in particular, for the construction of general corpora meant to represent a language as a whole.

pdf bib
An Incremental Tri-Partite Approach To Ontology Learning
José Iria | Christopher Brewster | Fabio Ciravegna | Yorick Wilks

In this paper we present a new approach to ontology learning. Its basis lies in a dynamic and iterative view of knowledge acquisition for ontologies. The Abraxas approach is founded on three resources, a set of texts, a set of learning patterns and a set of ontological triples, each of which must remain in equilibrium. As events occur which disturb this equilibrium, various actions are triggered to re-establish a balance between the resources. Such events include acquisition of a further text from external resources such as the Web or the addition of ontological triples to the ontology. We develop the concept of a knowledge gap between the coverage of an ontology and the corpus of texts as a measure triggering actions. We present an overview of the algorithm and its functionalities.

pdf bib
Experimental detection of vowel pronunciation variants in Amharic
Thomas Pellegrini | Lori Lamel

The pronunciation lexicon is a fundamental element in an automatic speech transcription system. It associates each lexical entry (usually a grapheme) with one or more phonemic or phone-like forms, the pronunciation variants. Thorough knowledge of the target language is a priori necessary to establish the pronunciation baseforms and variants. The reliance on human expertise can pose difficulties in developing a system for a language where such knowledge may not be readily available. In this article a speech recognizer is used to help select pronunciation variants in Amharic, the official language of Ethiopia, focusing on alternate choices for vowels. This study is carried out using an audio corpus composed of 37 hours of speech from radio broadcasts which were orthographically transcribed by native speakers. Since the corpus is relatively small for estimating pronunciation variants, a first set of studies was carried out at the syllabic level. Word lexica were then constructed based on the observed syllable occurrences. Automatic alignments were compared for lexica containing different vowel variants, with both context-independent and context-dependent acoustic model sets. The variant2+ measure proposed in (Adda-Decker and Lamel, 1999) is used to assess the potential need for pronunciation variants.

pdf bib
Evaluating Automatically Generated Timelines from the Web
Roberta Catizone | Angelo Dalli | Yorick Wilks

As web searches increase, there is a need to represent the search results in the most comprehensible way possible. In particular, we focus on search results from queries about people and places. The standard method for presentation of search results is an ordered list determined by the Web search engine. Although this is satisfactory in some cases, when searching for people and places, presenting the information indexed by time may be more desirable. We are developing a system called Cronopath, which generates a timeline of web search engine results by determining the time frame of each document in the collection and linking elements in the timeline to the relevant articles. In this paper, we propose evaluation guidelines for judging the quality of automatically generated timelines based on a set of common features.

pdf bib
The SAMMIE Corpus of Multimodal Dialogues with an MP3 Player
Ivana Kruijff-Korbayová | Tilman Becker | Nate Blaylock | Ciprian Gerstenberger | Michael Kaißer | Peter Poller | Verena Rieser | Jan Schehl

We describe a corpus of multimodal dialogues with an MP3 player collected in Wizard-of-Oz experiments and annotated with a rich feature set at several layers. We are using the Nite XML Toolkit (NXT) to represent and further process the data. We designed an NXT data model, converted experiment log file data and manual transcriptions into NXT, and are building tools for additional annotation using NXT libraries. The annotated corpus will be used to (i) investigate various aspects of multimodal presentation and interaction strategies both within and across annotation layers; (ii) design an initial policy for reinforcement learning of multimodal clarification requests.

pdf bib
CLiMB ToolKit: A Case Study of Iterative Evaluation in a Multidisciplinary Project
Rebecca Passonneau | Roberta Blitz | David Elson | Angela Giral | Judith Klavans

Digital image collections in libraries and other curatorial institutions grow too rapidly to create new descriptive metadata for subject matter search or browsing. CLiMB (Computational Linguistics for Metadata Building) was a project designed to address this dilemma that involved computer scientists, linguists, librarians, and art librarians. The CLiMB project followed an iterative evaluation model: each next phase of the project emerged from the results of an evaluation. After assembling a suite of text processing tools to be used in extracting metadata, we conducted a formative evaluation with thirteen participants, using a survey in which we varied the order and type of four conditions under which respondents would propose or select image search terms. Results of the formative evaluation led us to conclude that a CLiMB ToolKit would work best if its main function was to propose terms for users to review. After implementing a prototype ToolKit using a browser interface, we conducted an evaluation with ten experts. Users found the ToolKit very habitable, remained consistently satisfied throughout a lengthy evaluation, and selected a large number of terms per image.

pdf bib
Inducing Sense-Discriminating Context Patterns from Sense-Tagged Corpora
Anna Rumshisky | James Pustejovsky

Traditionally, context features used in word sense disambiguation are based on collocation statistics and use only minimal syntactic and semantic information. Corpus Pattern Analysis is a technique for producing knowledge-rich context features that capture sense distinctions. It involves (1) identifying sense-carrying context patterns and (2) using the derived context features to discriminate between unseen instances. Both stages require manual seeding. In this paper, we show how to automate inducing sense-discriminating context features from a sense-tagged corpus.

pdf bib
Building a Large-Scale Repository of Textual Entailment Rules
Milen Kouylekov | Bernardo Magnini

Entailment rules are rules where the left hand side (LHS) specifies some knowledge which entails the knowledge expressed in the RHS of the rule, with some degree of confidence. Simple entailment rules can be combined in complex entailment chains, which in turn are at the basis of entailment-based reasoning, which has been recently proposed as a pervasive and application-independent approach to Natural Language Understanding. We present the first release of a large-scale repository of entailment rules at the lexical level, which have been derived from a number of available resources, including WordNet and a word similarity database. Experiments on the PASCAL-RTE dataset show that this resource plays a crucial role in recognizing textual entailment.
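
The abstract does not specify the repository's storage format; purely as an illustration of how lexical entailment rules with confidences can be chained, here is a small hypothetical sketch in which chain confidence is the product of rule confidences along the path.

```python
# Hypothetical sketch: lexical entailment rules with confidences, chained
# by multiplying confidences along a LHS => ... => RHS path.
from collections import defaultdict

class EntailmentRules:
    def __init__(self):
        self.rules = defaultdict(list)          # lhs -> [(rhs, confidence), ...]

    def add(self, lhs, rhs, confidence):
        self.rules[lhs].append((rhs, confidence))

    def entails(self, lhs, rhs, min_conf=0.1, seen=frozenset()):
        """Best confidence of any entailment chain from lhs to rhs."""
        if lhs == rhs:
            return 1.0
        best = 0.0
        for mid, conf in self.rules.get(lhs, []):
            if mid in seen or conf < min_conf:
                continue
            best = max(best, conf * self.entails(mid, rhs, min_conf, seen | {lhs}))
        return best

rules = EntailmentRules()
rules.add("murder", "kill", 0.9)     # invented confidences, e.g. from hypernymy
rules.add("kill", "die", 0.8)        # or from a word-similarity database
print(rules.entails("murder", "die"))   # 0.9 * 0.8 = 0.72
```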

pdf bib
A Tree Kernel approach to Question and Answer Classification in Question Answering Systems
Alessandro Moschitti | Roberto Basili

A critical step in Question Answering design is the definition of the models for question focus identification and answer extraction. In the case of factoid questions, we can use a question classifier (trained according to a target taxonomy) and a named entity recognizer. Unfortunately, the latter cannot be applied to generate answers related to non-factoid questions. In this paper, we tackle this problem by designing classifiers of non-factoid answers. As the feature design for this learning task is very complex, we take advantage of tree kernels to generate large feature sets from the syntactic parse trees of passages relevant to the target question. Such kernels encode syntactic and lexical information in Support Vector Machines, which can decide whether a sentence focuses on a target taxonomy subject. The experiments with SVMs on the TREC 10 dataset show that our approach is a promising direction for future research.

pdf bib
A joint intelligibility evaluation of French text-to-speech synthesis systems: the EvaSy SUS/ACR campaign
Philippe Boula de Mareüil | Christophe d’Alessandro | Alexander Raake | Gérard Bailly | Marie-Neige Garcia | Michel Morel

The EVALDA/EvaSy project is dedicated to the evaluation of text-to-speech synthesis systems for the French language. It is subdivided into four components: evaluation of the grapheme-to-phoneme conversion module (Boula de Mareüil et al., 2005), evaluation of prosody (Garcia et al., 2006), evaluation of intelligibility, and global evaluation of the quality of the synthesised speech. This paper reports on the key results of the intelligibility and global evaluation of the synthesised speech. It focuses on intelligibility, assessed on the basis of semantically unpredictable sentences, but a comparison with absolute category rating in terms of e.g. pleasantness and naturalness is also provided. Three diphone systems and three selection systems have been evaluated. It turns out that the most intelligible system (diphone-based) is far from being the one which obtains the best mean opinion score.

pdf bib
Finite state tokenisation of an orthographical disjunctive agglutinative language: The verbal segment of Northern Sotho
Winston N Anderson | Petronella M Kotzé

Tokenisation is an important first pre-processing step required to adequately test finite-state morphological analysers. In agglutinative languages each morpheme is concatenatively added on to form a complete morphological structure. Disjunctive agglutinative languages like Northern Sotho write these morphemes, for certain morphological categories only, as separate words separated by spaces or line breaks. These breaks are, by their nature, different from breaks that separate “words” that are written conjunctively. A tokeniser is required to isolate categories, like a verb, from raw text before they can be correctly morphologically analysed. The authors have successfully produced a finite state tokeniser for Northern Sotho, where verb segments are written disjunctively but nominal segments conjunctively. The authors show that, since reduplication in the Northern Sotho language does not affect the pre-processing tokeniser, the disjunctive standard verbal segment as a construct in Northern Sotho is deterministic, finite-state and a regular (Type 3) language in the Chomsky hierarchy, and that the copulative verbal segment, due to its semi-disjunctivism, is ambiguously non-deterministic.

pdf bib
Applying Lexical Constraints on Morpho-Syntactic Patterns for the Identification of Conceptual-Relational Content in Specialized Texts
Jean-François Couturier | Sylvain Neuvel | Patrick Drouin

In this paper, we describe a formal constraint mechanism, which we label Conceptual Constraint Variables (CCVs), introduced to restrict surface patterns during automated text analysis with the objective of increasing precision in the representation of informational contents. We briefly present, and exemplify, the various types of CCVs applicable to the English texts of our corpora, and show how these constraints allow us to resolve some of the problems inherent to surface pattern recognition, more specifically, those related to the resolution of conceptual or syntactic ambiguities introduced by the most frequent English prepositions.

pdf bib
Beyond Multimedia Integration: corpora and annotations for cross-media decision mechanisms
Katerina Pastra

In this paper, we look into the notion of cross-media decision mechanisms, focussing on ones that work within multimedia documents for a variety of applications, such as the generation of intelligent multimedia presentations and multimedia indexing. In order for these mechanisms to go beyond the identification of semantic equivalence relations between media, which is what integration does, appropriate corpora and annotations are needed. Drawing from our experience in the REVEAL THIS project, we indicate the characteristics that such corpora should have, and suggest a number of annotations that would allow for training/designing such mechanisms. We conclude with a view on the suitability of two related markup languages (MPEG-7 and EMMA) for accommodating the suggested annotations.

pdf bib
Building Carefully Tagged Bilingual Corpora to Cope with Linguistic Idiosyncrasy
Yoshihiko Nitta | Masashi Saraki | Satoru Ikehara

We illustrate the effectiveness of a medium-sized, carefully tagged bilingual core corpus, that is, “semantic typology patterns” in our terminology, together with some examples to give concrete evidence of its usefulness. The most important characteristic of these semantic typology patterns is the bridging mechanism between two languages, which is based on sequences of syntactic codes and semantic codes. This characteristic gives the bilingual core corpus both wide coverage and flexible applicability, even though its size is not so large. Further work is needed to develop an intuitive feeling for the appropriate coarseness and fineness of patterns: coarseness concerns generalization in phrase-level and clause-level semantic patterns, while fineness concerns word-level semantic patterns. Based on this, we will complete the core tagged bilingual corpora while enhancing the necessary support functions and utilities.

pdf bib
SlinkET: A Partial Modal Parser for Events
Roser Saurí | Marc Verhagen | James Pustejovsky

We present SlinkET, a parser for identifying contexts of event modality in text developed within the TARSQI (Temporal Awareness and Reasoning Systems for Question Interpretation) research framework. SlinkET is grounded on TimeML, a specification language for capturing temporal and event related information in discourse, which provides an adequate foundation to handle event modality. SlinkET builds on top of a robust event recognizer, and provides each relevant event with a value that specifies the degree of certainty about its factuality; e.g., whether it has happened or holds (factive or counter-factive), whether it is being reported or witnessed by somebody else (evidential), or if it is introduced as a possibility (modal). It is based on well-established technology in the field (namely, finite-state techniques), and informed with corpus-induced knowledge that relies on basic information, such as morphological features, POS, and chunking. SlinkET is under continuing development and it currently achieves a performance ratio of 70% F1-measure.

pdf bib
More Data and Tools for More Languages and Research Areas: A Progress Report on LDC Activities
Christopher Cieri | Mark Liberman

This presentation reports on recent progress the Linguistic Data Consortium has made in addressing the needs of multiple research communities by collecting, annotating and distributing data, simplifying access, and developing standards and tools. Specifically, it describes new trends in publication, a sample of recent projects and significant improvements to LDC Online that improve access to LDC data, especially for those with limited computing support.

pdf bib
TC-STAR: Specifications of Language Resources and Evaluation for Speech Synthesis
A. Bonafonte | H. Höge | I. Kiss | A. Moreno | U. Ziegenhain | H. van den Heuvel | H.-U. Hain | X. S. Wang | M. N. Garcia

In the framework of the EU funded project TC-STAR (Technology and Corpora for Speech to Speech Translation), research on TTS aims at providing a synthesized voice that sounds like the source speaker speaking the target language. To progress in this direction, research within the TC-STAR framework is focused on naturalness, intelligibility, expressivity and voice conversion. For this purpose, specifications for large, high-quality TTS databases have been developed and the data have been recorded for UK English, Spanish and Mandarin. The development of speech technology in TC-STAR is evaluation driven. Assessment of speech synthesis is needed to determine how well a system or technique performs in comparison to previous versions as well as other approaches (systems and methods). Apart from testing the whole system, all components of the system will be evaluated separately. This approach allows better assessment of each component as well as identification of the best techniques in the different speech synthesis processes. This paper describes the specifications of Language Resources for speech synthesis and the specifications for the evaluation of speech synthesis activities.

pdf bib
An observatory on Spoken Italian linguistic resources and descriptive standards.
Miriam Voghera | Francesco Cutugno

We present the national project “Parlare italiano: osservatorio degli usi linguistici”, funded by the Italian Ministry of Education, Scientific Research and University (PRIN 2004). Ten research groups from various Italian universities participate in the project. The project has four fundamental objectives: 1) to plan a national website that collects the most recent theoretical and applied results on spoken language; 2) to create an observatory of the linguistic usages of spoken Italian; 3) to delineate and implement standard and formalized methods and procedures for the study of spoken language; 4) to develop a training program for young researchers. The website will be accessible starting from November 2006.

pdf bib
Linguistic features modeling based on Partial New Cache
Kamel Smaïli | Caroline Lavecchia | Jean-Paul Haton

Agreement in gender and number is a critical problem in statistical language modeling. One of the main problems in the speech recognition of French is the presence of misrecognized words due to bad agreement (in gender and number) between words. Statistical language models do not treat this phenomenon directly. This paper focuses on how to handle the issue of agreement. We introduce an original model called Features-Cache (FC) to estimate the gender and the number of the word to predict. It is a dynamic variable-length Features-Cache whose size is determined in accordance with syntagm delimiters. This model does not need any syntactic parsing; it is used as any other statistical language model. Several models have been evaluated, and the best one achieves an improvement of more than 8 points in terms of perplexity.
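
As an illustration only (a hedged sketch under our own simplifying assumptions, not the authors' Features-Cache model), the fragment below keeps a variable-length cache of gender/number features that is flushed at syntagm delimiters and uses it to estimate the most likely agreement features of the word to predict; the delimiter set and the tiny lexicon are invented.

```python
# Hypothetical sketch: a variable-length cache of (gender, number) features,
# reset at syntagm delimiters, used to estimate agreement of the next word.
from collections import Counter

DELIMITERS = {",", ".", "et", "mais"}          # invented syntagm delimiters

def predict_agreement(history, lexicon):
    """history: preceding words; lexicon: word -> (gender, number)."""
    cache = []
    for word in history:
        if word in DELIMITERS:
            cache = []                         # flush the cache at a delimiter
        elif word in lexicon:
            cache.append(lexicon[word])
    if not cache:
        return None
    genders = Counter(g for g, _ in cache if g != "?")
    numbers = Counter(n for _, n in cache if n != "?")
    return (genders.most_common(1)[0][0] if genders else "?",
            numbers.most_common(1)[0][0] if numbers else "?")

lexicon = {"les": ("?", "plur"), "petites": ("fem", "plur"), "maison": ("fem", "sing")}
print(predict_agreement(["les", "petites"], lexicon))   # ('fem', 'plur')
```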

pdf bib
Semantic Atomicity and Multilinguality in the Medical Domain: Design Considerations for the MorphoSaurus Subword Lexicon
Stefan Schulz | Kornél Markó | Philipp Daumke | Udo Hahn | Susanne Hanser | Percy Nohama | Roosewelt Leite de Andrade | Edson Pacheco | Martin Romacker

We present the lexico-semantic foundations underlying a multilingual lexicon the entries of which are constituted by so-called subwords. These subwords reflect semantic atomicity constraints in the medical domain which diverge from canonical lexicological understanding in NLP. We focus here on criteria to identify and delimit reasonable subword units, to group them into functionally adequate synonymy classes and relate them by two types of lexical relations. The lexicon we implemented on the basis of these considerations forms the lexical backbone for MorphoSaurus, a cross-language document retrieval engine for the medical domain.

pdf bib
Finding representative sets of dialect words for geographical regions
Marko Salmenkivi

We investigate a corpus of geographical distributions of 17,126 Finnish dialect words. Our goal is to automatically find sets of words characteristic of geographical regions. Though our approach is related to the problem of dividing the investigation area into linguistically (and geographically) relatively coherent dialect regions, we do not aim at constructing more or less questionable dialect regions. Instead, we let the boundaries of the regions overlap to get insight into the degree of lexical change between adjacent areas. More concretely, we study the applicability of data clustering approaches to find sets of words with tight spatial distributions, and to cluster the extracted distributions according to their distribution areas. The extracted words belonging to the same cluster can then be utilized as a means to characterize the lexicon of the region. We also automatically pick up words with occurrences appearing in two or more areas that are geographically far from each other. These words may give valuable insight into, e.g., the study of cultural history and the history of settlement.

pdf bib
Coreference Resolution with and without Linguistic Knowledge
Olga Uryupina

State-of-the-art statistical approaches to the Coreference Resolution task rely on sophisticated modeling, but very few (10-20) simple features. In this paper we propose to extend the standard feature set substantially, incorporating more linguistic knowledge. To investigate the usability of linguistically motivated features, we evaluate our system for a variety of machine learners on the standard dataset (MUC-7) with the traditional learning set-up.

pdf bib
Formal v. Informal: Register-Differentiated Arabic MT Evaluation in the PLATO Paradigm
Keith J. Miller | Michelle Vanni

Tasks performed on machine translation (MT) output are associated with input text types such as genre and topic. Predictive Linguistic Assessments of Translation Output, or PLATO, MT Evaluation (MTE) explores a predictive relationship between linguistic metrics and the information processing tasks reliably performable on output. PLATO assigns a linguistic signature, which cuts across the task-based and automated metric paradigms. Here we report on PLATO assessments of clarity, coherence, morphology, syntax, lexical robustness, name-rendering, and terminology in a comparison of Arabic MT engines in which register differentiates the input. With a team of 10 assessors employing eight linguistic tests, we analyzed the results of five systems’ processing of 10 input texts from two distinct linguistic registers: a total of 800 data sets. The analysis pointed to specific areas, such as general lexical robustness, where system performance was comparable on both types of input. Divergent performance, however, was observed on clarity and name-rendering assessments. These results suggest that, while systems may be considered reliable regardless of input register for the lexicon-dependent triage task, register may have an effect on the suitability of MT systems’ output for relevance judgment and information extraction tasks, which rely on clarity and proper named-entity rendering. Further, we show that the evaluation metrics incorporated in PLATO differentiate between MT systems’ performance on a text type for which they are presumably optimized and one on which they are not.

pdf bib
X-Score: Automatic Evaluation of Machine Translation Grammaticality
O. Hamon | M. Rajman

In this paper we report on an experiment with an automated metric used to analyse the grammaticality of machine translation output. The approach (Rajman, Hartley, 2001) is based on the distribution of linguistic information within a translated text, which is assumed to be similar between a learning corpus and the translation. This method is quite inexpensive, since it does not need any reference translation. First we describe the experimental method and the different tests we used. Then we show the promising results we obtained on the CESTA data, and how well they correlate with human judgments.
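
As a hedged illustration of comparing the distribution of linguistic information between a learning corpus and a translation (a minimal sketch under our own assumptions, not the X-Score formula itself), the fragment below computes a Jensen-Shannon divergence between two POS-tag distributions; the tag sequences are invented.

```python
# Hypothetical sketch: compare POS-tag distributions of a reference corpus and
# an MT output with the Jensen-Shannon divergence (lower = more similar profile).
import math
from collections import Counter

def pos_distribution(tags):
    counts = Counter(tags)
    total = sum(counts.values())
    return {t: c / total for t, c in counts.items()}

def js_divergence(p, q):
    keys = set(p) | set(q)
    m = {k: 0.5 * (p.get(k, 0.0) + q.get(k, 0.0)) for k in keys}
    def kl(a, b):
        # a only contains strictly positive probabilities, so the log is defined
        return sum(a[k] * math.log(a[k] / b[k]) for k in a)
    return 0.5 * kl(p, m) + 0.5 * kl(q, m)

corpus_tags = ["DET", "NOUN", "VERB", "DET", "ADJ", "NOUN"]   # invented reference tags
mt_tags = ["DET", "NOUN", "NOUN", "NOUN", "VERB"]             # invented MT-output tags
print(js_divergence(pos_distribution(corpus_tags), pos_distribution(mt_tags)))
```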

pdf bib
Reducing the Granularity of a Computational Lexicon via an Automatic Mapping to a Coarse-Grained Sense Inventory
Roberto Navigli

WordNet is the reference sense inventory of most of the current Word Sense Disambiguation systems. Unfortunately, it encodes too fine-grained distinctions, making it difficult even for humans to solve the ambiguity of words in context. In this paper, we present a method for reducing the granularity of the WordNet sense inventory based on the mapping to a manually crafted dictionary encoding sense groups, namely the Oxford Dictionary of English. We assess the quality of the mapping and discuss the potential of the method.

pdf bib
A BLARK extension for temporal annotation mining
Dafydd Gibbon | Flaviane Romani Fernandes | Thorsten Trippel

The Basic Language Resource Kit (BLARK) proposed by Krauwer is designed for the creation of initial textual resources. There are a number of toolkits for the development of spoken language resources and systems, but tools are lacking for second-level resources, that is, resources which are the result of processing primary-level speech resources such as speech recordings. Typically, processing of this kind in phonetics is done manually, with the aid of spreadsheets and multi-purpose statistics software. We propose a Basic Language and Speech Kit (BLAST) as an extension to BLARK and suggest a strategy for integrating the kit into the Natural Language Toolkit (NLTK). The prototype kit is evaluated in an application examining temporal properties of spoken Brazilian Portuguese.

pdf bib
The Mass-Count Distinction: Acquisition and Disambiguation
Michael Schiehlen | Kristina Spranger

At least in the realm of fast parsing, the mass–count distinction has led the life of a wallflower. We argue in this paper that this should not be so. In particular, we argue, both theoretical linguistics and computational linguistics can gain by a corpus-based investigation of this distinction: Computational linguists get more accurate parses; the knowledge extracted from these parses becomes more reliable; theoretical linguists are presented with new data in a field that has been intensely discussed and yet remains in a state that is not satisfactory from a practical point of view.

pdf bib
Corpus Development and Publication
Andrew W. Cole

This paper will discuss issues relevant to corpus development and publication at the LDC and will illustrate those issues by examining the history of three LDC corpora. This paper will also briefly examine alternative corpus creation and distribution methods and their challenges. The intent of this paper is to increase the available linguistic resources by describing the regulatory and technical environment and thus improving the understanding and interaction between corpus providers and distributors.

pdf bib
Discourse functions of duration in Mandarin: resource design and implementation
Dafydd Gibbon | Shu-Chuan Tseng

A dedicated resource, consisting of annotated speech, tools, and a workflow design, was developed for the detailed investigation of discourse phenomena in Taiwan Mandarin. The discourse phenomena have functions which are associated with positions in utterances, and temporal properties, and include discourse markers (“NAGE”, “NA”, e.g. “hesitation”, “utterance initiation”), discourse particles (“A”, e.g. “utterance finality”, “utterance continuity”, “focus”, etc.), and fillers (“UHN”, “hesitation”). The distribution of particles in relation to their position in utterances and the temporal properties of particles are investigated. The results of the investigation diverge considerably from claims in existing grammars of Mandarin with respect to utterance position, and show in general greater length than for regular syllables. These properties suggest the possibility of developing an automatic discourse item tagger.

pdf bib
From Natural Language to Databases via Ontologies
Leonardo Lesmo | Livio Robaldo

This paper describes an approach to Natural Language access to databases based on ontologies. Their role is to make the central part of the translation process independent both of the specific language and of the particular database schema. The input sentence is parsed and the parse tree is semantically annotated via references to the ontology describing the application. This first step is, of course, language dependent: the parsing process depends on the syntax of the language and the annotation depends on the meaning of words, expressed as links between words and concepts in the ontology. Then, the annotated tree is used to produce an “ontological query”, i.e. a query expressed in terms of paths on the ontology. This second step is entirely language- and DB-independent. Finally, the ontological query is translated into a standard SQL query, on the basis of a concept-to-DB mapping, specifying how each concept and relation is mapped onto the database.

pdf bib
The ALVIS Format for Linguistically Annotated Documents
A. Nazarenko | E. Alphonse | J. Derivière | T. Hamon | G. Vauvert | D. Weissenbacher

The paper describes the ALVIS annotation format and discusses the problems that we encountered in the indexing of large collections of documents for topic-specific search engines. The paper is exemplified on the biological domain and on MedLine abstracts, as developing a specialized search engine for biologists is one of the ALVIS case studies. The ALVIS principle for linguistic annotations is based on existing work and standard proposals. We made the choice of stand-off annotations rather than inserted mark-up, and annotations are encoded as XML elements which form the linguistic subsection of the document record.

pdf bib
SUS-based Method for Speech Reception Threshold Measurement in French
Alexander Raake | Brian FG Katz

We propose a new method for measuring the threshold of 50% sentence intelligibility in noisy or multi-source speech communication situations (Speech Reception Threshold, SRT). Our SRT test complements those available e.g. for English, German, Dutch, Swedish and Finnish with a French test method. The approach we take is based on semantically unpredictable sentences (SUS), which can in principle be created for various languages. This way, the proposed method enables better cross-language comparisons of intelligibility tests. As a starting point for the French language, a set of 288 sentences (24 lists of 12 sentences each) was created. Each of the 24 lists is optimized for homogeneity in terms of phoneme distribution as compared to average French, and for the word occurrence frequency of the employed monosyllabic keywords as derived from French language databases. Based on the optimized text material, a database of target sentences has been recorded with a trained speaker. A test calibration was carried out to yield uniform measurement results over the set of target sentences. First intelligibility measurements show good reliability of the method.

pdf bib
Integrated Linguistic Resources for Language Exploitation Technologies
Stephanie Strassel | Christopher Cieri | Andrew Cole | Denise Dipersio | Mark Liberman | Xiaoyi Ma | Mohamed Maamouri | Kazuaki Maeda

Linguistic Data Consortium has recently embarked on an effort to create integrated linguistic resources and related infrastructure for language exploitation technologies within the DARPA GALE (Global Autonomous Language Exploitation) Program. GALE targets an end-to-end system consisting of three major engines: Transcription, Translation and Distillation. Multilingual speech or text from a variety of genres is taken as input and English text is given as output, with information of interest presented in an integrated and consolidated fashion to the end user. GALE's goals require a quantum leap in the performance of human language technology, while also demanding solutions that are more intelligent, more robust, more adaptable, more efficient and more integrated. LDC has responded to this challenge with a comprehensive approach to linguistic resource development designed to support GALE's research and evaluation needs and to provide lasting resources for the larger Human Language Technology community.

pdf bib
Champollion: A Robust Parallel Text Sentence Aligner
Xiaoyi Ma

This paper describes Champollion, a lexicon-based sentence aligner designed for robust alignment of potentially noisy parallel text. Champollion increases the robustness of the alignment by assigning greater weights to less frequent translated words. Experiments on a manually aligned Chinese-English parallel corpus show that Champollion achieves high precision and recall on noisy data. Champollion can be easily ported to new language pairs, and it is freely available to the public.
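The weighting idea can be illustrated with a toy scoring function (an illustration of the principle stated in the abstract, not Champollion's actual score; the lexicon and frequency counts below are invented):

```python
# Illustrative sketch only -- not Champollion's scoring function. A bilingual
# lexicon drives sentence-pair similarity, and rarer translated words receive
# greater weight (an idf-like term).
import math

def pair_score(src_tokens, tgt_tokens, lexicon, doc_freq, n_sentences):
    """lexicon: source word -> set of target translations;
    doc_freq: target word -> number of target sentences containing it (assumed counts)."""
    tgt = set(tgt_tokens)
    score = 0.0
    for w in src_tokens:
        for t in lexicon.get(w, set()) & tgt:
            # less frequent translations contribute more to the alignment score
            score += math.log(n_sentences / (1 + doc_freq.get(t, 0)))
    return score

lexicon = {"银行": {"bank"}, "河": {"river"}}
doc_freq = {"bank": 40, "river": 3}
print(pair_score(["银行", "在", "河", "边"],
                 ["the", "bank", "is", "by", "the", "river"],
                 lexicon, doc_freq, n_sentences=1000))
```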

pdf bib
The Eclipse Annotator: an extensible system for multimodal corpus creation
Fabian Behrens | Jan-Torsten Milde

The Eclipse Annotator is an extensible tool for the creation of multimodal language resources. It is based on the TASX-Annotator, which has been refactored in order to fit into the plugin-based architecture of the new application.

pdf bib
Adding multi-layer semantics to the Greek Dependency Treebank
Harris Papageorgiou | Elina Desipri | Maria Koutsombogera | Kanella Pouli | Prokopis Prokopidis

In this paper we give an overview of the approach adopted to add a layer of semantic information to the Greek Dependency Treebank [GDT]. Our ultimate goal is to produce a large corpus reliably annotated with rich semantic structures. To this end, a corpus has been compiled encompassing various data sources and domains. This collection has been preprocessed, annotated and validated on the basis of a dependency representation. Taking into account multi-layered annotation schemes designed to provide deeper representations of structure and meaning, we describe the methodology followed for the semantic layer, report on the annotation process and the problems faced, and conclude with comments on future work and the exploitation of the resulting resource.

pdf bib
Comparing linguistic information in treebank annotations
Cristina Bosco | Vincenzo Lombardo

The paper investigates the issue of portability of methods and results over treebanks in different languages and annotation formats. In particular, it addresses the problem of converting an Italian treebank, the Turin University Treebank (TUT), developed in dependency format, into the Penn Treebank format, in order to possibly exploit the tools and methods already developed and compare the adequacy of information encoding in the two formats. We describe the procedures for converting the two annotation formats and we present an experiment that evaluates some linguistic knowledge extracted from the two formats, namely sub-categorization frames.

pdf bib
Dialectal resources on-line: the ALT-Web experience
Nella Cucurullo | Simonetta Montemagni | Matilde Paoli | Eugenio Picchi | Eva Sassolini

The paper presents an on-line dialectal resource, ALT-Web, which gives access to the linguistic data of the Atlante Lessicale Toscano, a specially designed linguistic atlas in which lexical data have both a diatopic and diastratic characterisation. The paper focuses on: the dialectal data representation model; the access modalities to the ALT dialectal corpus; ontology-based search.

pdf bib
Corpus Support for Machine Translation at LDC
Xiaoyi Ma | Christopher Cieri

This paper describes LDC's efforts in collecting, creating and processing different types of linguistic data, including lexicons, parallel text, multiple translation corpora, and human assessment of translation quality, to support the research and development in Machine Translation. Through a combination of different procedures and core technologies, the LDC was able to create very large, high quality, and cost-efficient corpora, which have contributed significantly to recent advances in Machine Translation. Multiple translation corpora and human assessment together facilitate, validate and improve automatic evaluation metrics, which are vital to the development of MT systems. The Bilingual Internet Text Search (BITS) and Champollion sentence aligner enable the finding and processing of large quantities of parallel text. All specifications and tools used by LDC and described in the paper are or will be available to the general public.

pdf bib
Linguistic Resources for Speech Parsing
Ann Bies | Stephanie Strassel | Haejoong Lee | Kazuaki Maeda | Seth Kulick | Yang Liu | Mary Harper | Matthew Lease

We report on the success of a two-pass approach to annotating metadata, speech effects and syntactic structure in English conversational speech: transcribed speech was separately annotated for structural metadata, or structural events (fillers, speech repairs (or edit disfluencies), and SUs, or syntactic/semantic units), and for syntactic structure (treebanking constituent structure and shallow argument structure). The two annotations were then combined into a single representation. Certain alignment issues between the two types of annotation led to the discovery and correction of annotation errors in each, resulting in a more accurate and useful resource. The development of this corpus was motivated by the need to have both metadata and syntactic structure annotated in order to support synergistic work on speech parsing and structural event detection. Automatic detection of these speech phenomena would simultaneously improve parsing accuracy and provide a mechanism for cleaning up transcriptions for downstream text processing. Similarly, constraints imposed by text processing systems such as parsers can be used to help improve identification of disfluencies and sentence boundaries. This paper reports on our efforts to develop a linguistic resource providing both spoken metadata and syntactic structure information, and describes the resulting corpus of English conversational speech.

pdf bib
UAM Text Tools - a flexible NLP architecture
Tomasz Obrębski | Michał Stolarski

The paper presents a new language processing toolkit developed at Adam Mickiewicz University. Its functionality currently includes tokenization, sentence splitting, dictionary-based morphological analysis, heuristic morphological analysis of unknown words, spelling correction, pattern search, and generation of concordances. It is organized as a collection of command-line programs, each performing one operation. The components may be connected in various ways to provide various text processing services, and new user-defined components may be easily incorporated into the system. The toolkit is intended for processing raw (unannotated) text corpora. The system was originally developed for Polish, but its adaptation to other languages is possible.

pdf bib
SAM - an annotation editor for parallel texts
Markus Geilfuss | Jan-Torsten Milde

Annotated parallel texts are an important resource for quantitative and qualitative linguistic research. Creating parallel corpora enables the generation of (bilingual) lexica, provides a basis for the extraction of data used for translation memories, makes it possible to describe the differences between text versions, and simply allows scientists to create texts in cooperation. We describe the design and implementation of an interactive editor allowing the user to annotate parallel texts: SAM, the Script Annotation Manager.

pdf bib
The pragmatic combination of different crosslingual resources
Hans Uszkoreit | Feiyu Xu | Jörg Steffen | Ilhan Aslan

We describe new cross-lingual strategies for the development of multilingual information services on mobile devices. The novelty of our approach is the intelligent modeling of cross-lingual application domains and the combination of textual translation with speech generation. The final system helps users to speak foreign languages and communicate with local people in relevant situations such as restaurants, taxis and emergencies. The advantage of our information services is that they are robust enough for use in real-world situations. They are developed for the Beijing Olympic Games 2008, where most foreigners will have to rely on translation assistance. Their deployment is foreseen as part of the planned ubiquitous mobile information system of the Olympic Games.

pdf bib
Design, Construction and Validation of an Arabic-English Conceptual Interlingua for Cross-lingual Information Retrieval
Nizar Habash | Clinton Mah | Sabiha Imran | Randy Calistri-Yeh | Páraic Sheridan

This paper describes the issues involved in extending a trans-lingual lexicon, the TextWise Conceptual Interlingua (CI), with Arabic terms. The Conceptual Interlingua is based on the Princeton English WordNet (Fellbaum, 1998). It is a central component in the cross-lingual information retrieval (CLIR) system CINDOR (Conceptual INterlingua for DOcument Retrieval). Arabic has a rich morphological system combining templatic and affixational paradigms for both inflectional and derivational morphology. This rich morphology poses a major challenge to the design and building of the Arabic CI and also its validation. This is because the available resources for Arabic, whether manually constructed bilingual lexicons or lexicons automatically derived from bilingual parallel corpora, exist at different levels of morphological representation. We describe here the issues and decisions made in the design and construction of the Arabic-English CI using different types of manual and automatic resources. We also present the results of an extensive validation of the Arabic CI and briefly discuss the evaluation of its use for CLIR on the TREC Arabic Benchmark collection.

pdf bib
Syntactic Lexicon of Polish Predicative Nouns
Grażyna Vetulani | Zygmunt Vetulani | Tomasz Obrębski

In this paper we report on the realization of the SyntLex project, which aims at the construction of a full lexicon grammar for Polish. The lexicon-grammar paradigm in computational linguistics is derived from predicate logic and attributes a central role to predicative constructions. An important class of syntactic constructions in many languages (French, English, Polish and other Slavonic languages in particular) are those based on verbo-nominal collocations, with the verb playing a support role with respect to the noun considered as carrying the predicative information. We refer to earlier research by one of the authors aiming at a full description of verbo-nominal predicative constructions for Polish in the form of an electronic resource for LI applications, and we describe procedures to complete and corpus-validate the resource obtained so far.

pdf bib
The Italian Metaphor Database
Antonietta Alonge

This paper describes the main features of the Italian Metaphor Database, being built at the University of Perugia (Italy). The database is being developed as a resource to be used both as a knowledge base on conceptual metaphors in Italian and their lexical expressions, and to enrich general lexical resources. The reason for developing such a database is that most NLP systems have to deal with metaphorical expressions sooner or later but, as previous research has shown, existing lexical resources for Italian do not contain complete and consistent data on metaphors, empirically derived but theoretically motivated. Thus, with reference to the Cognitive Theory of metaphor, conceptual metaphors instantiated in Italian are being represented in the resource, together with data on the way they are expressed in the language (i.e., through lexical units or multiword expressions), examples of them found within a corpus, and data on metaphorical linguistic expressions encoded/missing within ItalWordNet.

pdf bib
A Methodology and Tool for Representing Language Resources for Information Extraction
José Iria | Fabio Ciravegna

In recent years there has been a growing interest in clarifying the process of Information Extraction (IE) from documents, particularly when coupled with Machine Learning. We believe that a fundamental step forward in clarifying the IE process would be to be able to perform comparative evaluations on the use of different representations. However, this is difficult because most of the time the way information is represented is too tightly coupled with the algorithm at an implementation level, making it impossible to vary representation while keeping the algorithm constant. A further motivation behind our work is to reduce the complexity of designing, developing and testing IE systems. The major contribution of this work is in defining a methodology and providing a software infrastructure for representing language resources independently of the algorithm, mainly for Information Extraction but with application in other fields - we are currently evaluating its use for ontology learning and document classification.

pdf bib
Automatic Evaluation and Composition of NLP Pipelines with Web Services
Harry Halpin

We describe the innovative use of the Semantic Web to describe an existing natural language processing “pipeline”, focusing on how the performance and results of the components may be represented. Earlier work has shown how NLP Web Services can be automatically composed via Semantic Web Service composition; once the results of NLP components can be stored directly, they can also be used to direct the composition, leading to advances in the sharing and evaluation of NLP resources.

pdf bib
Methodological Aspects of Semantic Annotation
Harry Bunt | Amanda Schiffrin

This paper constitutes a preliminary report on the work carried out on semantic content annotation in the LIRICS project, in close collaboration with the activities of ISO TC 37/SC 4/TDG 31. This consists primarily of: (1) identifying commonalities in alternative approaches to the annotation and representation of various types of semantic information; and (2) developing methodological principles and concepts for identifying and characterising representational concepts for semantic content. The LIRICS project does not aim at developing a standard format for the annotation and representation of semantic content, but at providing well-defined descriptive concepts. In particular, the aim is to build an on-line registry of definitions of such concepts, called “data categories”, in accordance with ISO standard 12620. These semantic data categories are abstract concepts whose use is not restricted to any particular format or representation language. We advocate the use of the metamodel as a tool to extract the most important of these abstract overarching concepts, with examples from dialogue act, temporal, reference and semantic role annotation.

pdf bib
PYCOT: An Optimality Theory-based Pronoun Resolution Toolkit
Whitney Gegg-Harrison | Donna K. Byron

In this paper, we present PYCOT, a pronoun resolution toolkit. This toolkit is written in the Python programming language and is intended to be an addition to the open-source NLTK collection of natural language processing tools. We discuss the design of the module as well as studies of its performance on pronoun resolution in English and in Korean.

pdf bib
Simulating Cub Reporter Dialogues: The collection of naturalistic human-human dialogues for information access to text archives
Emma Barker | Ryuichiro Higashinaka | François Mairesse | Robert Gaizauskas | Marilyn Walker | Jonathan Foster

This paper describes a dialogue data collection experiment and resulting corpus for dialogues between a senior mobile journalist and a junior cub reporter back at the office. The purpose of the dialogue is for the mobile journalist to collect background information in preparation for an interview or on-site coverage of a breaking story. The cub reporter has access to text archives that contain such background information. A unique aspect of these dialogues is that they capture information-seeking behavior for an open-ended task against a large unstructured data source. Initial analyses of the corpus show that the experimental design leads to real-time, mixed-initiative, highly interactive dialogues with many interesting properties.

pdf bib
Exploiting Multiple Semantic Resources for Answer Selection
Jeongwoo Ko | Laurie Hiyakumoto | Eric Nyberg

This paper describes the utility of semantic resources such as the Web, WordNet and gazetteers in the answer selection process for a question-answering system. In contrast with previous work using individual semantic resources to support answer selection, our work combines multiple resources to boost the confidence scores assigned to correct answers and evaluates different combination strategies based on unweighted sums, weighted linear combinations, and logistic regression. We apply our approach to select answers from candidates produced by three different extraction techniques of varying quality, focusing on TREC questions whose answers represent locations or proper-names. Our experimental results demonstrate that the combination of semantic resources is more effective than individual resources for all three extraction techniques, improving answer selection accuracy by as much as 32.35% for location questions and 72% for proper-name questions. Of the combination strategies tested, logistic regression models produced the best results for both location and proper-name questions.
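The combination strategy that performed best, logistic regression over per-resource scores, can be sketched as follows (a hypothetical illustration with invented features and data, not the paper's system or feature set):

```python
# Hedged sketch of the combination idea: each answer candidate gets one feature
# per semantic resource (e.g. Web, WordNet, gazetteer agreement scores), and a
# logistic regression model combines them into a single confidence used to
# rerank candidates. Features and data below are invented.
import numpy as np
from sklearn.linear_model import LogisticRegression

# rows: candidates; columns: [web_score, wordnet_score, gazetteer_score]
X_train = np.array([[0.9, 1.0, 1.0], [0.2, 0.0, 1.0], [0.1, 0.0, 0.0], [0.8, 1.0, 0.0]])
y_train = np.array([1, 0, 0, 1])             # 1 = correct answer, 0 = incorrect

model = LogisticRegression().fit(X_train, y_train)

candidates = {"Paris": [0.85, 1.0, 1.0], "Lyon": [0.30, 0.0, 1.0]}
scores = {a: model.predict_proba(np.array([f]))[0, 1] for a, f in candidates.items()}
print(max(scores, key=scores.get))           # pick the highest-confidence candidate
```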

pdf bib
Low-cost Customized Speech Corpus Creation for Speech Technology Applications
Kazuaki Maeda | Christopher Cieri | Kevin Walker

Speech technology applications, such as speech recognition, speech synthesis, and speech dialog systems, often require corpora based on highly customized specifications. Existing corpora available to the community, such as TIMIT and other corpora distributed by LDC and ELDA, do not always meet the requirements of such applications. In such cases, the developers need to create their own corpora. The creation of a highly customized speech corpus, however, could be a very expensive and time-consuming task, especially for small organizations. It requires multidisciplinary expertise in linguistics, management and engineering as it involves subtasks such as the corpus design, human subject recruitment, recording, quality assurance, and in some cases, segmentation, transcription and annotation. This paper describes LDC's recent involvement in the creation of a low-cost yet highly-customized speech corpus for a commercial organization under a novel data creation and licensing model, which benefits both the particular data requester and the general linguistic data user community.

pdf bib
NOMOS: A Semantic Web Software Framework for Annotation of Multimodal Corpora
John Niekrasz | Alexander Gruenstein

We present NOMOS, an open-source software framework for annotation, processing, and analysis of multimodal corpora. NOMOS is designed for use by annotators, corpus developers, and corpus consumers, emphasizing configurability for a variety of specific annotation tasks. Its features include synchronized multi-channel audio and video playback, compatibility with several corpora, platform independence, mixed display capabilities, and a well-defined method for layering datasets. We then describe how the system is used. For corpus development and annotation we present a typical use scenario involving the creation of a schema and specialization of the user interface. For processing and analysis we describe the GUI- and Java-based methods available, including a GUI for query construction and execution, and an automatically generated schema-conforming Java API for processing of annotations. Additionally, we present some specific annotation and research tasks for which NOMOS has been specialized and used, including topic segmentation and decision-point annotation of meetings.

pdf bib
A corpus of tutorial dialogs on theorem proving; the influence of the presentation of the study-material
Christoph Benzmüller | Helmut Horacek | Henri Lesourd | Ivana Kruijff-Korbayova | Marvin Schiller | Magdalena Wolska

We present a new corpus of tutorial dialogs on mathematical theorem proving that was collected in a Wizard-of-Oz setup. Our study is a follow-up to a previous experiment conducted in a similar simulated environment. A major difference between the current and the previous experimental setup is that in this study we varied the presentation of the study material with which the subjects were provided. One sub-group of the subjects was presented with a highly formalized presentation consisting mainly of formulas, while the other received a presentation mainly in natural language. Our goal was to obtain more data on the kind of mixed language that is characteristic of informal mathematical discourse. We hypothesized that the language style of the subjects' interaction with the simulated system would reflect the style of presentation of the study material. In the paper we briefly present the experimental setup, the corpus, and a preliminary quantitative result of the corpus analysis.

pdf bib
Task-based MT Evaluation: From Who/When/Where Extraction to Event Understanding
Jamal Laoudi | Calandra R. Tate | Clare R. Voss

Task-based machine translation (MT) evaluation asks, how well do people perform text-handling tasks given MT output? This method of evaluation yields an extrinsic assessment of an MT engine, in terms of users’ task performance on MT output. While this method is time-consuming, its key advantage is that MT users and stakeholders understand how to interpret the assessment results. Prior experiments showed that subjects can extract individual who-, when-, and where-type elements of information from MT output passages that were not especially fluent. This paper presents the results of a pilot study to assess a slightly more complex task: when given such wh-items already identified in an MT output passage, how well can subjects properly select from and place these items into wh-typed slots to complete a sentence-template about the passage’s event? The results of the pilot with nearly sixty subjects, while only preliminary, indicate that this task was extremely challenging: given six test templates to complete, half of the subjects had no completely correct templates and 42% had exactly one completely correct template. The provisional interpretation of this pilot study is that event-based template completion defines a task ceiling, against which to evaluate future improvements on MT engines.

pdf bib
A New Phase in Annotation Tool Development at the Linguistic Data Consortium: The Evolution of the Annotation Graph Toolkit
Kazuaki Maeda | Haejoong Lee | Julie Medero | Stephanie Strassel

The Linguistic Data Consortium (LDC) has created various annotated linguistic data for a variety of common task evaluation programs and projects to create shared linguistic resources. The majority of these annotated linguistic data were created with highly customized annotation tools developed at LDC. The Annotation Graph Toolkit (AGTK) has been used as a primary infrastructure for annotation tool development at LDC in recent years. Thanks to direct feedback from annotation task designers and annotators in-house, annotation tool development at LDC has entered a new, more mature and productive phase. This paper describes recent additions to LDC's annotation tools that are newly developed or significantly improved since our last report at the Fourth International Conference on Language Resources and Evaluation in 2004. These tools are either directly based on AGTK or share a common philosophy with other AGTK tools.

pdf bib
Modular Approach to Error Analysis and Evaluation for Multilingual Question Answering
Hideki Shima | Mengqiu Wang | Frank Lin | Teruko Mitamura

Multilingual Question Answering systems are generally very complex, integrating several sub-modules to achieve their result. Global metrics (such as average precision and recall) are insufficient when evaluating the performance of individual sub-modules and their influence on each other. In this paper, we present a modular approach to error analysis and evaluation; we use manually-constructed, gold-standard input for each module to obtain an upper-bound for the (local) performance of that module. This approach enables us to identify existing problem areas quickly, and to target improvements accordingly.

pdf bib
Analyzing the Effects of Spoken Dialog Systems on Driving Behavior
Jeongwoo Ko | Fumihiko Murase | Teruko Mitamura | Eric Nyberg | Masahiko Tateishi | Ichiro Akahori

This paper presents an evaluation of a spoken dialog system for automotive environments. Our overall goal was to measure the impact of user-system interaction on the user’s driving performance, and to determine whether adding context-awareness to the dialog system might reduce the degree of user distraction during driving. To address this issue, we incorporated context-awareness into a spoken dialog system, and implemented three system features using user context, network context and dialog context. A series of experiments were conducted under three different configurations: driving without a dialog system, driving while using a context-aware dialog system, and driving while using a context-unaware dialog system. We measured the differences between the three configurations by comparing the average car speed, the frequency of speed changes and the angle between the car’s direction and the centerline on the road. These results indicate that context-awareness could reduce the degree of user distraction when using a dialog system during driving.

pdf bib
Collaborative Annotation that Lasts Forever: Using Peer-to-Peer Technology for Disseminating Corpora and Language Resources
Magesh Balasubramanya | Michael Higgins | Peter Lucas | Jeff Senn | Dominic Widdows

This paper describes a peer-to-peer architecture for representing and disseminating linguistic corpora, linguistic annotation, and resources such as lexical databases and gazetteers. The architecture is based upon a “Universal Database” technology in which all information is represented in globally identified, extensible bundles of attribute-value pairs. These objects are replicated at will between peers in the network, and the business rules that implement replication involve checking digital signatures and proper attribution of data, to avoid information being tampered with or abuse of copyright. Universal identifiers enable comprehensive standoff annotation and commentary. A carefully constructed publication mechanism is described that enables different users to subscribe to material provided by trusted publishers on recognized topics or themes. Access to content and related annotation is provided by distributed indexes, represented using the same underlying data objects as the rest of the database.
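The data model the abstract describes, globally identified bundles of attribute-value pairs supporting stand-off annotation, might be illustrated roughly as follows (a hypothetical sketch, not MAYA Design's actual Universal Database API):

```python
# Hypothetical illustration (not the actual UDB API) of the data model described
# above: every object is a bundle of attribute-value pairs with a universally
# unique identifier, so stand-off annotations can point to other bundles by id
# rather than modifying them, and attribution travels with the data.
import uuid

def make_bundle(**attrs):
    return {"id": str(uuid.uuid4()), **attrs}

sentence = make_bundle(kind="corpus-segment",
                       text="Colorless green ideas sleep furiously.")
annotation = make_bundle(kind="pos-annotation",
                         target=sentence["id"],            # stand-off reference by id
                         tags=["JJ", "JJ", "NNS", "VBP", "RB"],
                         publisher="hypothetical-signer")   # attribution field (invented)

print(annotation["target"] == sentence["id"])  # True: the original bundle is never touched
```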

pdf bib
The Look and Feel of a Confident Entailer
Vasile Rus | Art Graesser

The paper presents a software system that embodies a lexico-syntactic approach to the task of Textual Entailment. Although the approach is based on a minimal set of resources, it is highly confident. The architecture of the system is open and can be easily expanded with more and deeper processing modules. Results on a standard data set are presented.

pdf bib
Using Semantic Overlap Scoring in Answering TREC Relationship Questions
Gregory Marton | Boris Katz

A first step in answering complex questions, such as those in the “Relationship” task of the Text REtrieval Conference's Question Answering track (TREC/QA), is finding passages likely to contain pieces of the answer---passage retrieval. We introduce semantic overlap scoring, a new passage retrieval algorithm that facilitates credit assignment for inexact matches between query and candidate answer. Our official submission ranked best among fully automatic systems, at 23% F-measure, while the best system, with manual input, reached 28%. We use our Nuggeteer tool to robustly evaluate each component of our Relationship system post hoc. Ablation studies show that semantic overlap scoring achieves significant performance improvements over a standard passage retrieval baseline.
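The credit-assignment idea behind semantic overlap scoring can be illustrated with a toy ranking function (an illustration only, not the authors' algorithm; the synonym table stands in for a real lexical resource such as WordNet):

```python
# Toy illustration of graded credit assignment: exact query-term matches in a
# passage earn full credit, inexact (here, synonym) matches earn partial credit,
# so passages are ranked by a graded rather than all-or-nothing overlap.
SYNONYMS = {"purchase": {"buy", "acquire"}, "company": {"firm", "corporation"}}

def overlap_score(query_terms, passage_tokens, partial_credit=0.5):
    passage = set(passage_tokens)
    score = 0.0
    for q in query_terms:
        if q in passage:
            score += 1.0
        elif SYNONYMS.get(q, set()) & passage:
            score += partial_credit
    return score / max(len(query_terms), 1)      # normalise by query length

print(overlap_score(["purchase", "company"],
                    ["the", "firm", "agreed", "to", "buy", "its", "rival"]))
```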

pdf bib
Impact of Question Decomposition on the Quality of Answer Summaries
Finley Lacatusu | Andrew Hickl | Sanda Harabagiu

Generating answers to complex questions in the form of multi-document summaries requires access to question decomposition methods. In this paper we present three methods for decomposing complex questions and we evaluate their impact on the responsiveness of the answers they enable.

pdf bib
An Answer Bank for Temporal Inference
Sanda Harabagiu | Cosmin Adrian Bejan

Answering questions that ask about temporal information involves several forms of inference. In order to develop question answering capabilities that benefit from temporal inference, we believe that a large corpus of questions and answers that are discovered based on temporal information should be available. This paper describes our methodology for creating AnswerTime-Bank, a large corpus of questions and answers on which Question Answering systems can operate using complex temporal inference.

pdf bib
Principles for annotating and reasoning with spatial information
Paul C. Morărescu

In this paper we present the first phase of the ongoing SpaceBank project that attempts to create a linguistic resource for annotating and reasoning with spatial information from text. SpaceBank is the spatial counterpart of TimeBank, an electronic resource for temporal semantics and reasoning. The paper focuses on building an ontology of lexicalized spatial concepts. The textual occurrences of the concepts in this ontology will be annotated using the SpaceML language, briefly described here. SpaceBank is designed to be integrated with TimeBank, for a spatio-temporal model of the textual information.

pdf bib
Interaction between Lexical Base and Ontology with Formal Concept Analysis
Sujian Li | Qin Lu | Wenjie Li | Ruifeng Xu

An ontology describes conceptual knowledge in a specific domain. A lexical base collects a repository of words and gives independent definitions of concepts. In this paper, we propose to use Formal Concept Analysis (FCA) as a tool to help construct an ontology from an existing lexical base. We mainly address two issues. The first is how to select attributes to visualize the relations between lexical terms. The second is how to revise lexical definitions by analysing the relations in the ontology. The focus is thus on the effect of the interaction between a lexical base and an ontology for the purpose of good ontology construction. Finally, experiments have been conducted to verify our ideas.
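For readers unfamiliar with FCA, a minimal sketch of how formal concepts are derived from a small term-attribute context is given below (the context is invented and the brute-force enumeration is for illustration only, not the method used in the paper):

```python
# Minimal Formal Concept Analysis sketch: from a small term-attribute context,
# enumerate the formal concepts (extent, intent) whose lattice shows how lexical
# terms group by shared attributes. The toy context below is invented.
from itertools import combinations

context = {                      # lexical term -> attributes it carries
    "dog":  {"animal", "domestic"},
    "wolf": {"animal", "wild"},
    "cat":  {"animal", "domestic"},
}
attributes = set().union(*context.values())

def extent(intent_set):          # objects having all attributes of the intent
    return {g for g, attrs in context.items() if intent_set <= attrs}

def intent(extent_set):          # attributes shared by all objects of the extent
    return (set.intersection(*(context[g] for g in extent_set))
            if extent_set else set(attributes))

concepts = set()
for r in range(len(attributes) + 1):
    for combo in combinations(sorted(attributes), r):
        ext = extent(set(combo))
        concepts.add((frozenset(ext), frozenset(intent(ext))))  # closed pair

for ext, itt in sorted(concepts, key=lambda c: -len(c[0])):
    print(sorted(ext), sorted(itt))
```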

pdf bib
Semantic-Based Keyword Recovery Function for Keyword Extraction System
Rachada Kongkachandra | Kosin Chamnongthai

The goal of a keyword extraction system is to bring precision and recall as close as possible to 100%. These values are affected by the set of extracted keywords, in which two groups of errors occur: false-rejected and false-accepted keywords. To improve the performance of the system, false-rejected keywords should be recovered and false-accepted keywords should be reduced. In this paper, we enhance conventional keyword extraction systems by attaching a keyword recovery function. This function recovers previously false-rejected keywords by comparing their semantic information with the contents of each relevant document. The function is automated in three processes: domain identification, knowledge base generation and keyword determination. The domain identification process identifies the domain of interest by searching the domain knowledge base using the extracted keywords; the most general domains are selected and then used subsequently. To recover the false-rejected keywords, the keyword determination process matches them, on the basis of their semantics, with keywords of the identified domain within the domain knowledge base. To enable this semantic matching, the definitions of false-rejected keywords and the domain knowledge base are first represented as conceptual graphs by the knowledge base generation process. To evaluate the performance of the proposed function, EXTRACTOR, KEA and our keyword-database-mapping based keyword extractor are compared. The experiments were performed in two modes, i.e. training and recovering. In training mode, we use four glossaries from the Internet and 60 articles from the summary sections of IEICE transactions. In recovering mode, 200 texts from three resources, i.e. the summary sections of 15 chapters of a computer textbook and articles from the IEICE and ACM transactions, are used. The experimental results reveal that our proposed function improves the precision and recall of the conventional keyword extraction systems by approximately 3-5% and 6-10%, respectively.

pdf bib
The Design and Construction of A Chinese Collocation Bank
Ruifeng Xu | Qin Lu | Sujian Li

This paper presents an annotated Chinese collocation bank developed at the Hong Kong Polytechnic University. The definition of collocation with good linguistic consistency and good computational operability is first discussed and the properties of collocations are then presented. Secondly, based on the combination of different properties, collocations are classified into four types. Thirdly, the annotation guideline is presented. Fourthly, the implementation issues for collocation bank construction are addressed including the annotation with categorization, dependency and contextual information. Currently, the collocation bank is completed for 3,643 headwords in a 5-million-word corpus.

pdf bib
Merging two Ontology-based Lexical Resources
Nilda Ruimy

ItalWordNet (IWN) and PAROLE/SIMPLE/CLIPS (PSC), the two largest electronic, general-purpose lexical resources of the Italian language, present many compatible aspects although they are based on two different lexical models, each with its own underlying principles and peculiarities. Such compatibility prompted us to study the feasibility of semi-automatically linking and eventually merging the two lexicons. To this purpose, the ontologies on the basis of which both lexicons are structured were mapped, and the sets of semantic relations enabling lexical units to be related were compared. An overview of this preliminary phase is provided in this paper. The linking methodology and related problematic issues are described. Beyond the advantage for the end user of having access to more exhaustive and in-depth lexical information combining the potentialities and most outstanding features offered by the two lexical models, the resulting benefits and enhancements for the two resources are illustrated, which definitely legitimize the soundness of this linking and merging initiative.

pdf bib
Towards automatic transcription of Somali language
Abdillahi Nimaan | Pascal Nocera | Jean-François Bonastre

Most African countries follow an oral tradition to transmit their cultural, scientific and historic heritage through generations. This ancestral knowledge, accumulated over centuries, is today threatened with disappearing. This paper presents the first steps in building an automatic speech-to-text transcription system for the African oral patrimony, particularly the Djibouti cultural heritage. This work is dedicated to processing the Somali language, which represents half of the targeted Djiboutian audio archives. The main problem is the lack of annotated audio and textual resources for this language. We describe the principal characteristics of the audio (10 hours) and textual (3M words) training corpora collected. Using the large-vocabulary speech recognition engine Speeral, developed at the Laboratoire Informatique d’Avignon (LIA, the computer science laboratory of Avignon), we obtain a word error rate (WER) of about 20.9%. This is an encouraging result, considering the small size of our corpora. This first recognizer of the Somali language will serve as a reference and will be used to transcribe some Djibouti cultural archives. We also discuss future directions of research, such as sub-word indexing of audio archives, related to the specificities of the Somali language.

pdf bib
Competitive Evaluation of Commercially Available Speech Recognizers in Multiple Languages
Susanne Burger | Zachary A. Sloane | Jie Yang

Recent improvements in speech recognition technology have resulted in products that can now demonstrate commercial value in a variety of applications. Many vendors are marketing products which combine ASR applications including continuous dictation, command-and-control interfaces, and transcription of recorded speech at an accuracy of 98%. In this study, we measured the accuracy of certain commercially available desktop speech recognition engines in multiple languages. Using word error rate as a benchmark, this work compares recognition accuracy across eight languages and the products of three manufacturers. Results show that two systems performed almost the same while a third system recognized at lower accuracy, although none of the systems reached the claimed accuracy. Read speech was recognized better than spontaneous speech. The systems for US-English, Japanese and Spanish showed higher accuracy than the systems for UK-English, German, French and Chinese.
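Word error rate, the benchmark used in the study, is conventionally computed from a word-level edit distance; the sketch below shows the standard formula (a generic illustration, not code from the study):

```python
# Standard word error rate: WER = (substitutions + deletions + insertions) / N,
# obtained here from a word-level Levenshtein distance against the reference.
def wer(reference, hypothesis):
    ref, hyp = reference.split(), hypothesis.split()
    # d[i][j]: minimum edits turning the first i reference words into the first j hypothesis words
    d = [[0] * (len(hyp) + 1) for _ in range(len(ref) + 1)]
    for i in range(len(ref) + 1):
        d[i][0] = i
    for j in range(len(hyp) + 1):
        d[0][j] = j
    for i in range(1, len(ref) + 1):
        for j in range(1, len(hyp) + 1):
            cost = 0 if ref[i - 1] == hyp[j - 1] else 1
            d[i][j] = min(d[i - 1][j] + 1,          # deletion
                          d[i][j - 1] + 1,          # insertion
                          d[i - 1][j - 1] + cost)   # substitution or match
    return d[len(ref)][len(hyp)] / len(ref)

print(wer("please dictate this sentence", "please dictated this"))  # 2 errors / 4 words = 0.5
```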

pdf bib
Annotation and Analysis of Emotionally Relevant Behavior in the ISL Meeting Corpus
Kornel Laskowski | Susanne Burger

We present an annotation scheme for emotionally relevant behavior at the speaker contribution level in multiparty conversation. The scheme was applied to a large, publicly available meeting corpus by three annotators, and subsequently labeled with emotional valence. We report inter-labeler agreement statistics for the two schemes, and explore the correlation between speaker valence and behavior, as well as that between speaker valence and the previous speaker's behavior. Our analyses show that the co-occurrence of certain behaviors and valence classes significantly deviates from what is to be expected by chance; in isolated cases, behaviors are predictive of valence.

pdf bib
Building a WordNet for Arabic
Sabri Elkateb | William Black | Horacio Rodríguez | Musa Alkhalifa | Piek Vossen | Adam Pease | Christiane Fellbaum

This paper introduces a recently initiated project that focuses on building a lexical resource for Modern Standard Arabic based on the widely used Princeton WordNet for English (Fellbaum, 1998). Our aim is to develop a linguistic resource with a deep formal semantic foundation in order to capture the richness of Arabic as described in Elkateb (2005). Arabic WordNet is being constructed following methods developed for EuroWordNet (Vossen, 1998). In addition to the standard wordnet representation of senses, word meanings are also being defined with a machine understandable semantics in first order logic. The basis for this semantics is the Suggested Upper Merged Ontology and its associated domain ontologies (Niles and Pease, 2001). We will greatly extend the ontology and its set of mappings to provide formal terms and definitions for each synset. Tools to be developed as part of this effort include a lexicographer's interface modeled on that used for EuroWordNet, with added facilities for Arabic script, following Black and Elkateb's earlier work (2004).

pdf bib
Deep non-probabilistic parsing of large corpora
Benoît Sagot | Pierre Boullier

This paper reports a large-scale non-probabilistic parsing experiment with a deep LFG parser. We briefly introduce the parser we used, named SXLFG, and the resources that were used together with it. Then we report quantitative results about the parsing of a multi-million word journalistic corpus. We show that we can parse more than 6 million words in less than 12 hours, only 6.7% of all sentences reaching the 1s timeout. This shows that deep large-coverage non-probabilistic parsers can be efficient enough to parse very large corpora in a reasonable amount of time.

pdf bib
Automatic Term Extraction from Knowledge Bank of Economics
Magnar Brekke | Kai Innselset | Marita Kristiansen | Kari Øvsthus

KB-N is a web-accessible searchable Knowledge Bank comprising A) a parallel corpus of quality-assured and calibrated English and Norwegian text drawn from economic-administrative knowledge domains, and B) a domain-focused database representing that knowledge universe in terms of defined concepts and their respective bilingual terminological entries. A central mechanism connecting A and B is an algorithm for the automatic extraction of term candidates from aligned translation pairs on the basis of linguistic, lexical and statistical filtering (the first ever for Norwegian). The system is designed and programmed by Paul Meurer at Aksis (UiB). An important pilot application of the term base is subdomain- and collocation-based word-sense disambiguation for LOGON, a system for Norwegian-to-English MT currently being developed.

pdf bib
Comparison of Resource Discovery Methods
Alex Klassmann | Freddy Offenga | Daan Broeder | Romuald Skiba | Peter Wittenburg

There is an ongoing debate as to whether category systems created by experts are an appropriate way to help users find useful resources on the internet. However, for the much more restricted domain of language documentation, such a category system might still prove reasonable if not indispensable. This article gives an overview of the IMDI category set and presents a rough evaluation of its practical use at the Max Planck Institute in Nijmegen.

pdf bib
The Information Commons Gazetteer
Peter Lucas | Magesh Balasubramanya | Dominic Widdows | Michael Higgins

Advances in location aware computing and the convergence of geographic and textual information systems will require a comprehensive, extensible, information rich framework called the Information Commons Gazetteer that can be freely disseminated to small devices in a modular fashion. This paper describes the infrastructure and datasets used to create such a resource. The Gazetteer makes use of MAYA Design's Universal Database Architecture; a peer-to-peer system based upon bundles of attribute-value pairs with universally unique identity, and sophisticated indexing and data fusion tools. The Gazetteer primarily constitutes publicly available geographic information from various agencies that is organized into a well-defined scalable hierarchy of worldwide administrative divisions and populated places. The data from various sources are imported into the commons incrementally and are fused with existing data in an iterative process allowing for rich information to evolve over time. Such a flexible and distributed public resource of the geographic places and place names allows for both researchers and practitioners to realize location aware computing in an efficient and useful way in the near future by eliminating redundant time consuming fusion of disparate sources.

pdf bib
The Lefff 2 syntactic lexicon for French: architecture, acquisition, use
Benoît Sagot | Lionel Clément | Éric Villemonte de La Clergerie | Pierre Boullier

In this paper, we introduce a new lexical resource for French which is freely available as the second version of the Lefff (Lexique des formes fléchies du français - Lexicon of French inflected forms). It is a wide-coverage morphosyntactic and syntactic lexicon whose architecture relies on property inheritance, which makes it more compact and more easily maintainable and allows lexical entries to be described independently of the formalisms in which the lexicon is used. For these two reasons, we define it as a meta-lexicon. We describe its architecture, several automatic or semi-automatic approaches we use to acquire, correct and/or enrich such a lexicon, as well as the way it is used both with an LFG parser and with a TAG parser based on a meta-grammar, so as to build two large-coverage parsers for French. The web site of the Lefff is http://www.lefff.net/.

pdf bib
Structuring a Domain Vocabulary in a General Knowledge Environment
Nilda Ruimy

The study which is reported here aims at investigating the extent to which the conceptual and representational tools provided by a lexical model designed for the semantic representation of general language may suit the requirements of knowledge modelling in a domain-specific perspective. A general linguistic ontology and a set of semantic links, which allow classifying, describing and interconnecting word senses, play a central role in structuring and representing such knowledge. The health and medicine vocabulary has been taken as a case study for this investigation.

pdf bib
LexikoNet - a lexical database based on type and role hierarchies
Alexander Geyken | Norbert Schrader

In this paper LexikoNet, a large lexical ontology of German nouns, is presented. Unlike GermaNet and the Princeton WordNet, LexikoNet has distinguished type and role hypernyms right from the outset and organizes those lexemes in a parallel, independent hierarchy. In addition to roles and types, LexikoNet uses meronymic and holonymic relations as well as the instance relation. LexikoNet is based on a conceptual hierarchy of currently 1,470 classes to which approximately 90,000 word senses taken from a large German monolingual dictionary, the Wörterbuch der deutschen Gegenwartssprache (WDG), are attached. The conceptual classes provide a useful degree of abstraction for the lexicographic description of selectional restrictions, thus making LexikoNet a useful filtering tool for corpus-based lexicographic analysis. LexikoNet is currently used in-house as a filter for lexicographic extraction tasks in the DWDS project. Furthermore, it is used as a classification tool for the “words of the week” provided for the newspaper Die ZEIT on www.zeit.de.

pdf bib
Evaluation of Automatic Speech Recognition and Speech Language Translation within TC-STAR: Results from the first evaluation campaign
Djamel Mostefa | Olivier Hamon | Khalid Choukri

This paper reports on the evaluation activities conducted in the first year of the TC-STAR project. The TC-STAR project, financed by the European Commission within the Sixth Framework Program, is envisaged as a long-term effort to advance research in the core technologies of Speech-to-Speech Translation (SST). SST technology is a combination of Automatic Speech Recognition (ASR), Spoken Language Translation (SLT) and Text To Speech (TTS).

pdf bib
Evaluation of multimodal components within CHIL: The evaluation packages and results
Djamel Mostefa | Marie-Neige Garcia | Khalid Choukri

This article describes the first CHIL evaluation campaign, in which 12 technologies were evaluated. The major outcomes of the first evaluation campaign are the so-called Evaluation Packages. An evaluation package is the full documentation (definition and description of the evaluation methodologies, protocols and metrics) alongside the data sets and software scoring tools that an organisation needs in order to perform the evaluation of one or more systems for a given technology. These evaluation packages will be made available to the community through the ELDA General Catalogue.

pdf bib
The Impact of Evaluation on Multilingual Information Retrieval System Development
Carol Peters

The Cross-Language Evaluation Forum (CLEF) promotes research into the development of truly multilingual systems capable of retrieving relevant information from collections in many languages and in mixed media. The paper discusses some of the main results achieved in the first six years of activity.

pdf bib
The Multilingual Question Answering Track at CLEF
Bernardo Magnini | Danilo Giampiccolo | Lili Aunimo | Christelle Ayache | Petya Osenova | Anselmo Peñas | Maarten de Rijke | Bogdan Sacaleanu | Diana Santos | Richard Sutcliffe

This paper presents an overview of the Multilingual Question Answering evaluation campaigns which have been organized at CLEF (Cross Language Evaluation Forum) since 2003. Over the years, the competition has registered a steady increase in the number of participants and languages involved. In fact, from the original eight groups which participated in the 2003 QA track, the number of competitors rose to twenty-four in 2005. The performance of the systems has also steadily improved, and the average of the best performances in 2005 saw an increase of 10% with respect to the previous year.

pdf bib
A joint prosody evaluation of French text-to-speech synthesis systems
Marie-Neige Garcia | Christophe d’Alessandro | Gérard Bailly | Philippe Boula de Mareüil | Michel Morel

This paper reports on prosodic evaluation in the framework of the EVALDA/EvaSy project for text-to-speech (TTS) evaluation for the French language. Prosody is evaluated using a prosodic transplantation paradigm. Intonation contours generated by the synthesis systems are transplanted onto a common segmental content. Both diphone-based synthesis and natural speech are used. Five TTS systems are tested along with natural voice. The test is a paired preference test (with 19 subjects), using 7 sentences. The results indicate that natural speech consistently obtains the first rank (with an average preference rate of 80%), followed by a selection-based system (72%) and a diphone-based system (58%). However, rather large variations in judgements are observed among subjects and sentences, and in some cases synthetic speech is preferred to natural speech. These results show the remarkable improvement achieved by the best selection-based synthesis systems in terms of prosody. In this way, a new paradigm for evaluating the prosodic component of TTS systems has been successfully demonstrated.