==============================================================
DiscoMT 2017 Shared Task on Cross-lingual Pronoun Prediction
==============================================================
We are pleased to announce an exciting cross-lingual pronoun prediction task for people interested in (discourse-aware) machine translation, anaphora resolution and machine learning in general.
In the cross-lingual pronoun prediction task, participants are asked to predict a target-language pronoun given a source-language pronoun in the context of a sentence. For example, in the English-to-French sub-task, the goal is to predict the correct French translation of "it" or "they" from the classes ce, elle, elles, il, ils, cela (which includes ça), on and OTHER. You may use any type of information that can be extracted from the documents. We provide training and development data and a simple baseline system using an N-gram language model.
Participants are invited to submit systems for the English-French, English-German, German-English and Spanish-English language pairs.
More details can be found below, and on our website: https://www.idiap.ch/workshop/DiscoMT/shared-task
Important Dates:
March 2017 Release of training data
2 May 2017 Release of test data
9 May 2017 System submission deadline
15 May 2017 Release of results
9 June 2017 System paper submission deadline
30 June 2017 Notification of acceptance
14 July 2017 Camera-ready papers due
Discussion group: https://groups.google.com/forum/#!forum/discomt2017-cross-lingual-pronou...
-------------------------------------------------------------------------
Acknowledgements:
The organisation of this task has received support from the following project: Discourse-Oriented Statistical Machine Translation, funded by the Swedish Research Council (2012-916).
-------------------------------------------------------------------------
=========================
Detailed Task Description
=========================
OVERVIEW
Pronoun translation poses a problem for current MT systems as pronoun systems do not map well across languages, e.g., due to differences in gender, number, case, formality, or humanness, and to differences in where pronouns may be used. Translation divergences typically lead to mistakes in MT output, as when translating the English "it" into French ("il", "elle", or "cela"?) or into German ("er", "sie", or "es"?). One way to model pronoun translation is to treat it as a cross-lingual pronoun prediction task.
We propose such a task, which asks participants to predict a target-language pronoun given a source-language pronoun in the context of a sentence. We further provide a lemmatised target-language human-authored translation of the source sentence, and automatic word alignments between the source sentence words and the target-language lemmata. In the translation, the words aligned to a subset of the source-language third-person pronouns are substituted by placeholders. The aim of the task is to predict, for each placeholder, the word that should replace it from a small, closed set of classes, using any type of information that can be extracted from the documents.
The cross-lingual pronoun prediction task will be similar to the task of the same name at WMT16:
http://www.statmt.org/wmt16/pronoun-task.html
Participants are invited to submit systems for the English-French, English-German, German-English and Spanish-English language pairs.
TASK DESCRIPTION
In the cross-lingual pronoun prediction task, you are given a source-language document with a lemmatised and POS-tagged human-authored translation and a set of word alignments between the two languages. In the translation, the lemmatised tokens aligned to the source-language third-person pronouns are substituted by placeholders. Your task is to predict, for each placeholder, the fully inflected word token that should replace the placeholder from a small, closed set of classes. In other words, you provide the fully inflected translation of the source pronoun in the context sketched by the lemmatised and tagged target side. You may use any type of information that you can extract from the documents.
Lemmatised and POS-tagged target-language data is provided in place of fully inflected text. The provision of lemmatised data is intended both to provide a challenging task and to simulate a scenario that is more closely aligned with working with machine translation system output. POS tags provide additional information which may be useful in the disambiguation of lemmas (e.g. noun vs. verb) and in the detection of patterns of pronoun use.
The pronoun prediction task will be run for the following sub-tasks:
English-to-French
English-to-German
German-to-English
Spanish-to-English ****New****
Details of the source-language pronouns and the prediction classes that exist for each of the above sub-tasks are provided in the section on source-language pronoun sets and target-language prediction classes below. The different combinations of source-language pronoun and target-language prediction classes represent some of the different problems that MT systems face when translating pronouns for a given language pair and translation direction.
The task will be evaluated automatically by matching the predictions against the words found in the reference translation. From these matches we compute the overall accuracy, as well as precision, recall and F-score for each class. The primary score for the evaluation is the macro-averaged F-score over all classes. Compared to accuracy, the macro-averaged F-score favours systems that consistently perform well on all classes and penalises systems that maximise the performance on frequent classes while sacrificing infrequent ones.
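To illustrate why the macro-averaged F-score and accuracy can diverge, here is a minimal sketch of both measures. It is not the official scorer: the function names are hypothetical, and it assumes that the set of classes is the union of the labels seen in the reference and in the predictions, whereas an official scorer may average over the fixed class list of the sub-task.

```python
from collections import Counter


def per_class_prf(gold, pred):
    """Return {class: (precision, recall, f1)} for every class in gold or pred."""
    tp, fp, fn = Counter(), Counter(), Counter()
    for g, p in zip(gold, pred):
        if g == p:
            tp[g] += 1
        else:
            fp[p] += 1  # predicted class p wrongly
            fn[g] += 1  # missed an instance of class g
    scores = {}
    for c in sorted(set(gold) | set(pred)):
        prec = tp[c] / (tp[c] + fp[c]) if tp[c] + fp[c] else 0.0
        rec = tp[c] / (tp[c] + fn[c]) if tp[c] + fn[c] else 0.0
        f1 = 2 * prec * rec / (prec + rec) if prec + rec else 0.0
        scores[c] = (prec, rec, f1)
    return scores


def macro_f(gold, pred):
    """Unweighted mean of the per-class F-scores (assumed class set: see above)."""
    scores = per_class_prf(gold, pred)
    return sum(f1 for _, _, f1 in scores.values()) / len(scores)


def accuracy(gold, pred):
    """Fraction of placeholders whose predicted class equals the reference."""
    return sum(g == p for g, p in zip(gold, pred)) / len(gold)


# Toy example: a system that always predicts the frequent class "il".
gold = ["il", "il", "il", "elle"]
pred = ["il", "il", "il", "il"]
print(accuracy(gold, pred), macro_f(gold, pred))
```

In this toy example the accuracy is 0.75, but the macro-averaged F-score is only about 0.43, because the class "elle" is never predicted correctly and contributes an F-score of zero.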
The data supplied for the classification task consists of parallel source-target text with word alignments. In the target-language text, a subset of the words aligned to source-language occurrences of a specified set of pronouns have been replaced by placeholders of the form REPLACE_xx, where xx is the index of the source-language word the placeholder is aligned to. Your task is to predict one of the classes listed in the relevant source-target section below, for each occurrence of a placeholder.
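As a rough illustration of the placeholder convention, the sketch below locates REPLACE_xx tokens in a target-side line and extracts the source index xx. It assumes that the target side is available as whitespace-separated tokens and that each placeholder is a token of its own; the actual file layout is described on the website and may differ. The function name and the example line are hypothetical, and the index is returned unchanged, since whether it counts from zero or one is defined by the data format.

```python
import re

# A placeholder is assumed to be a whole token of the form REPLACE_<digits>.
PLACEHOLDER = re.compile(r"^REPLACE_(\d+)$")


def find_placeholders(target_tokens):
    """Return (target_position, source_index) pairs, one per placeholder token."""
    found = []
    for pos, tok in enumerate(target_tokens):
        match = PLACEHOLDER.match(tok)
        if match:
            found.append((pos, int(match.group(1))))
    return found


# Hypothetical lemmatised target line, not taken from the task data.
target = "mais REPLACE_3 ne venir pas".split()
print(find_placeholders(target))
```

Each returned pair gives the position of the placeholder in the target line and the index of the source-language word it is aligned to, which is the word a classifier would typically inspect first.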
The development and test datasets have been manually filtered to remove non-subject position pronouns and to ensure the fair and accurate evaluation of system performance. For more information on the format of the data files and their filtering, please see the website.
The complete test data for the classification task, including reference translations and word alignments, will be released on 2nd May 2017. Your system submission is due on 9th May 2017, as listed under Important Dates.
SOURCE-LANGUAGE PRONOUN SETS AND TARGET-LANGUAGE PREDICTION CLASS DETAILS
The following sections describe the set of source-language pronouns and target-language classes to be predicted, for each of the four sub-tasks.
This year, a Spanish-to-English translation sub-task has been included. This pair involves the additional difficulty of rendering Spanish null subjects as overt English pronouns. The training data follows the same format as that of the other language pairs. The difference is that the REPLACE_xx placeholder points to the position of a third-person Spanish verb with no overt subject.
You should *always* predict either a word token or "OTHER". See the prediction class lists below for the word tokens to predict in each sub-task.
English-to-French
This sub-task will concentrate on the translation of subject position "it" and "they" from English into French. The following prediction classes exist for this sub-task:
* ce: The French pronoun ce (sometimes with elided vowel as c') as in the expression c'est "it is"
* elle: Feminine singular subject pronoun
* elles: Feminine plural subject pronoun
* il: Masculine singular subject pronoun
* ils: Masculine plural subject pronoun
* cela: Demonstrative pronouns. Includes "cela", "ça", the misspelling "ca", and the rare elided form "ç'"
* on: Indefinite pronoun
* OTHER: Some other word, or nothing at all, should be inserted
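The English-to-French class list above groups several surface forms under one label. A minimal sketch of that grouping, for example when deriving training labels from fully inflected text, might look as follows. The mapping is read directly from the list above, but the function name, the lowercasing and the apostrophe normalisation are assumptions and may not match how the official data was prepared.

```python
# Surface form -> prediction class, following the English-to-French class list.
FRENCH_CLASS_OF = {
    "ce": "ce", "c'": "ce",
    "elle": "elle",
    "elles": "elles",
    "il": "il",
    "ils": "ils",
    "cela": "cela", "ça": "cela", "ca": "cela", "ç'": "cela",
    "on": "on",
}


def french_class(token):
    """Map a French surface token to its prediction class, defaulting to OTHER."""
    # Assumption: case is ignored and typographic apostrophes equal plain ones.
    normalised = token.lower().replace("\u2019", "'")
    return FRENCH_CLASS_OF.get(normalised, "OTHER")


for tok in ["C'", "ça", "Elles", "le"]:
    print(tok, "->", french_class(tok))
```

Any token outside the list falls into OTHER, which mirrors the requirement that a system always predicts either a word token or "OTHER".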
Spanish-to-English
This sub-task will concentrate on the translation of third person Spanish verbs without an overt subject. The following prediction classes exist for this sub-task:
* he: Masculine singular subject pronoun
* she: Feminine singular subject pronoun
* it: Non-gendered singular subject pronoun
* they: Non-gendered plural subject pronoun
* there: Existential "there"
* OTHER: Some other word, or nothing at all, should be inserted
English-to-German
This sub-task will concentrate on the translation of subject position "it" and "they" from English into German. The following prediction classes exist for this sub-task:
* er: Masculine singular subject pronoun
* sie: Feminine singular subject pronoun
* es: Neuter singular subject pronoun
* man: Indefinite pronoun
* OTHER: Some other word, or nothing at all, should be inserted
German-to-English
This sub-task will concentrate on the translation of subject position "er", "sie" and "es" from German into English. The following prediction classes exist for this sub-task:
* he: Masculine singular subject pronoun
* she: Feminine singular subject pronoun
* it: Non-gendered singular subject pronoun
* they: Non-gendered plural subject pronoun
* you: Second person pronoun (with both generic and deictic uses)
* this: Demonstrative pronouns (singular). Includes both "this" and "that"
* these: Demonstrative pronouns (plural). Includes both "these" and "those"
* there: Existential "there"
* OTHER: Some other word, or nothing at all, should be inserted