WMT 2016 Shared Task on Bilingual Document Alignment

Event Notification Type: Call for Participation
Location: ACL 2016
Country: Germany
City: Berlin
Contact: Christian Buck, Philipp Koehn
Submission Deadline: Monday, 2 May 2016

========================================================
WMT 2016 Shared Task on Bilingual Document Alignment
========================================================

Website: http://www.statmt.org/wmt16/bilingual-task.html
At WMT 2016 (collocated with ACL 2016)

Parallel corpora are especially important for statistical machine
translation, but so far the collection of such data within the
academic research community has been ad hoc and limited
in scale. To promote work on this research problem, we organize
a shared task on aligning bilingual documents from crawled
web sites.

More details can be found below, and on our website:
http://www.statmt.org/wmt16/bilingual-task.html

Important Dates:

Release of training data: February 12, 2016
Release of test data: April 11, 2016
Results submission deadline: May 2, 2016
Paper submission deadline: May 8, 2016
Notification of acceptance: June 5, 2016
Camera-ready deadline: June 22, 2016

=========================
Detailed Task Description
=========================

The task is to align French web pages to English web pages
for a given crawled webdomain (a set of web pages under a fully
qualified domain name - FQDN).

TRAINING DATA:
For the crawled data we provide one file per webdomain in .lett format
adapted from Bitextor. This is a plain text format with one line per page.
Each line consists of 6 tab-separated values:

Language ID (e.g. en)
Mime type (always text/html)
Encoding (always charset=utf-8)
URL
HTML in Base64 encoding
Text in Base64 encoding
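
As an illustration of the six-column layout above, here is a minimal
Python sketch that decodes one line. It is not the reader class
provided by the organizers; the names Page, parse_lett_line and
read_lett are our own, and we assume the Base64 payloads decode to
UTF-8 as stated in the encoding column.

```python
import base64
from collections import namedtuple

# Field names are our own labels for the six columns (hypothetical, not official).
Page = namedtuple("Page", "lang mime encoding url html text")


def parse_lett_line(line):
    """Split one tab-separated .lett line and decode the two Base64 fields."""
    lang, mime, encoding, url, html_b64, text_b64 = line.rstrip("\n").split("\t")
    # Assumption: payloads are UTF-8; undecodable bytes are replaced, not fatal.
    html = base64.b64decode(html_b64).decode("utf-8", errors="replace")
    text = base64.b64decode(text_b64).decode("utf-8", errors="replace")
    return Page(lang, mime, encoding, url, html, text)


def read_lett(path):
    """Yield one Page per non-empty line of an uncompressed .lett file."""
    with open(path, encoding="utf-8") as handle:
        for line in handle:
            if line.strip():
                yield parse_lett_line(line)
```

Iterating over read_lett("some-domain.lett") (a placeholder file name)
would then give access to page.lang, page.url and page.text for each
crawled page.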

We make sure that the language ID is reliable, at least for the
documents in the train and test pairs. We also ensure that all
known pairs have been crawled and no URLs are missing
from the crawls.

Text extraction was performed using an HTML5 parser. As the
original HTML pages are available, participants are welcome
to implement their own text extraction, for example to remove
boilerplate.

To facilitate use of the .lett files we provide a simple reader
class in Python.

Additionally, we have identified spans of French text for which
we produced English translations using MT. These translations
are not part of the .lett files but are provided separately.

As part of the training data we provide a set of 1,624 correctly
aligned EN-FR pairs from 49 webdomains. The number of pairs per
webdomain varies between 4 and over 200. All pairs are from within
a single webdomain. Possible matches between two different
webdomains, e.g. siemens.de and siemens.com, are not considered
in this task.

Answer keys are given in the format
Source_URL<TAB>Target_URL

TEST SET:
For testing, we will provide additional crawls of new webdomains,
distinct from the ones in the training data, in the same format. For
these, no known pairs will be provided. Because the full
set of valid document pairs is unknown, evaluation will be based
entirely on precision on an annotated subset of correctly aligned
pairs.

Participants are expected to produce a list of possible pairings in
the format of the training data. Each source URL may be matched
with at most one target URL and vice versa. Should a URL occur
repeatedly, later occurrences are ignored. We provide an evaluation
script to assess performance during development.
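
The one-to-one constraint can be sketched in a few lines of Python.
This is not the official evaluation script, only one possible reading
of the rule: a pair is dropped if its source or its target URL was
already used by an earlier pair. The function names are our own, and
the tab separator in the output is an assumption about the pair format.

```python
def enforce_one_to_one(pairs):
    """Keep a pair only if neither its source nor its target URL was used before."""
    seen_sources, seen_targets, kept = set(), set(), []
    for source_url, target_url in pairs:
        if source_url in seen_sources or target_url in seen_targets:
            continue  # a later occurrence of an already matched URL is ignored
        seen_sources.add(source_url)
        seen_targets.add(target_url)
        kept.append((source_url, target_url))
    return kept


def format_pairs(pairs):
    """Render pairs one per line; the tab separator is an assumption."""
    return "".join(f"{source}\t{target}\n" for source, target in pairs)
```

Because earlier pairs win, it makes sense to sort candidate pairs by
confidence before applying such a filter.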

BASELINE:
We provide a simple baseline method based on URL matching.
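
To give an idea of what URL matching can look like, here is a minimal
Python sketch. It is not the baseline distributed by the organizers.
It assumes that paired pages differ only in a language marker such as
"en" or "fr" inside the URL; the marker list, the function names and
the (English URL, French URL) output order are all our own choices.

```python
import re
from collections import defaultdict

# Hypothetical marker list; real sites use many more variants.
LANG_MARKER = re.compile(r"(?<![a-z0-9])(en|fr)(?![a-z0-9])")


def strip_language_markers(url):
    """Lowercase the URL and delete stand-alone 'en' / 'fr' tokens."""
    return LANG_MARKER.sub("", url.lower())


def match_by_url(pages):
    """pages: iterable of (lang, url). Returns a list of (en_url, fr_url)."""
    buckets = defaultdict(lambda: {"en": [], "fr": []})
    for lang, url in pages:
        if lang in ("en", "fr"):
            buckets[strip_language_markers(url)][lang].append(url)
    pairs = []
    for group in buckets.values():
        # Only accept unambiguous matches: exactly one page per language.
        if len(group["en"]) == 1 and len(group["fr"]) == 1:
            pairs.append((group["en"][0], group["fr"][0]))
    return pairs
```

Under these assumptions, http://example.com/en/about.html and
http://example.com/fr/about.html reduce to the same key and are paired,
while pages whose paths are translated (e.g. /contact vs. /nous) are
left unmatched.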

Training data and baseline method are available at
http://www.statmt.org/wmt16/bilingual-task.html

ORGANIZERS:
Christian Buck, University of Edinburgh
Philipp Koehn, Johns Hopkins University

ACKNOWLEDGMENT:
This shared task received support from a Google Faculty Research Award.