Intrinsic and Extrinsic Evaluation Measures

Workshop at the Annual Meeting of the Association for Computational Linguistics (ACL 2005)
Ann Arbor, Michigan
This one-day workshop will focus on the challenges that the MT
and summarization communities face in developing valid and useful
evaluation measures. Our aim is to bring these two communities
together to learn from each other's approaches.
In the past few years, we have witnessed, in both MT and summarization evaluation, the innovation of n-gram-based intrinsic metrics that automatically score system outputs against human-produced reference documents (e.g., IBM's BLEU and its ISI/USC counterpart, ROUGE). Similarly, there has been renewed interest in user applications and task-based extrinsic measures in both communities (e.g., DUC'05 and TIDES'04). Most recently, evaluation efforts have tested for correlations to cross-validate independently derived intrinsic and extrinsic assessments of system outputs with each other and with human judgments of output quality, such as accuracy and fluency. (An illustrative sketch of these two ideas follows the topic list below.)

The concrete questions that we hope to see addressed in this workshop include, but are not limited to:

- How adequately do intrinsic measures capture the variation between system outputs and human-generated reference documents (summaries or translations)? What methods exist for calibrating and controlling the variation in linguistic complexity and content differences in input test sets and reference sets? How much variation exists within these constructed sets? How does that variation affect different intrinsic measures? How many reference documents are needed for effective scoring?

- How can intrinsic measures go beyond simple n-gram matching to quantify the similarity between system outputs and human references? What other features and weighting alternatives lead to better metrics for both MT and summarization? How can intrinsic measures capture fluency and adequacy? Which types of new intrinsic metrics are needed to adequately evaluate non-extractive summaries and paraphrasing (e.g., interlingual) translations?

- How effectively do extrinsic (or proxy extrinsic) measures capture the quality of system output, as needed for downstream use in human tasks, such as triage (document relevance judgments), extraction (factual question answering), and report writing, and in automated tasks, such as filtering, information extraction, and question answering? For example, when is an MT system good enough that a summarization system benefits from the additional information available in the MT output?

- How should metrics for MT and summarization be assessed and compared? What characteristics should a good metric possess? When is one evaluation method better than another? What are the most effective ways of assessing the correlation testing and statistical modeling that seek to predict human task performance or human notions of output quality (e.g., fluency and adequacy) from "cheaper" automatic metrics? How reliable are human judgments?

Anyone with an interest in MT or summarization evaluation research, or in issues pertaining to the combination of MT and summarization, is encouraged to participate in the workshop. We are looking for research papers on the aforementioned topics, as well as position papers that identify limitations in current approaches and describe promising future research directions.
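To make the two ideas above concrete, here is a minimal, illustrative sketch (not the official BLEU or ROUGE implementations) of (1) scoring a system output against human references by clipped n-gram overlap and (2) correlating such automatic scores with human quality judgments. All function names and the toy data are invented for illustration.

```python
# Minimal sketch: n-gram overlap scoring and metric-vs-human correlation.
# Not the official BLEU/ROUGE; names and toy data are illustrative only.

from collections import Counter
from math import sqrt


def ngrams(tokens, n):
    """Return a multiset (Counter) of the n-grams in a token list."""
    return Counter(tuple(tokens[i:i + n]) for i in range(len(tokens) - n + 1))


def ngram_overlap(candidate, references, n=2):
    """Clipped n-gram precision and recall of a candidate against references.

    Precision (BLEU-style, single n) = matched n-grams / candidate n-grams.
    Recall (ROUGE-N-style)           = matched n-grams / reference n-grams.
    """
    cand = ngrams(candidate.split(), n)
    best_p, best_r = 0.0, 0.0
    for ref in references:  # take the best-matching reference
        refc = ngrams(ref.split(), n)
        matched = sum((cand & refc).values())  # clipped counts
        p = matched / max(sum(cand.values()), 1)
        r = matched / max(sum(refc.values()), 1)
        best_p, best_r = max(best_p, p), max(best_r, r)
    return best_p, best_r


def pearson(xs, ys):
    """Pearson correlation between automatic scores and human judgments."""
    n = len(xs)
    mx, my = sum(xs) / n, sum(ys) / n
    cov = sum((x - mx) * (y - my) for x, y in zip(xs, ys))
    sx = sqrt(sum((x - mx) ** 2 for x in xs))
    sy = sqrt(sum((y - my) ** 2 for y in ys))
    return cov / (sx * sy)


if __name__ == "__main__":
    # Toy data: three system outputs, two human references, and
    # hypothetical human adequacy judgments on a 1-5 scale.
    refs = ["the cat sat on the mat", "a cat was sitting on the mat"]
    outputs = ["the cat sat on the mat",
               "the cat is on the mat",
               "a dog ran in the park"]
    human = [5.0, 3.5, 1.0]

    auto = [ngram_overlap(o, refs, n=2)[1] for o in outputs]  # bigram recall
    print("automatic scores:", [round(a, 2) for a in auto])
    print("correlation with human judgments:", round(pearson(auto, human), 2))
```

In this sketch the bigram recall plays the role of a ROUGE-like intrinsic score, and the Pearson coefficient stands in for the kind of correlation testing used to validate automatic metrics against human judgments.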
SHARED DATA SETS
Shared Data Set for MT Evaluation
To facilitate the comparison of different measures for MT evaluation, we have arranged for LDC to release a shared data package on which researchers can conduct experiments that assess the performance of automatic MT evaluation metrics by correlating their scores with human judgments of MT quality. The shared data set consists of the 2003 TIDES MT-Eval Test Data for both Chinese-to-English and Arabic-to-English MT. For each of these two language-pair data sets, the following is provided:
- The set of test sentences in the original source language (Chinese or Arabic)

*Note*: Human judgment assessments are provided for 7 Chinese-to-English MT systems, but only for 6 Arabic-to-English MT systems; human assessments are not available for the system code-named "ama". The data set package contains detailed instructions on how the data is organized and formatted.
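As a hedged illustration of the kind of experiment this data package is meant to support, the sketch below computes system-level Pearson and Spearman correlations between an automatic metric's per-system scores and averaged human judgments. The system names, scores, and judgment scale are hypothetical placeholders; the actual LDC package documents its own formats.

```python
# Sketch of a system-level metric-validation experiment. All scores and
# system names are invented placeholders (the real package provides human
# judgments for 7 Chinese-to-English systems and 6 Arabic-to-English systems).

from scipy.stats import pearsonr, spearmanr

# Hypothetical per-system automatic metric scores (averaged over the test set)
# and averaged human adequacy judgments for the same systems.
systems = ["sysA", "sysB", "sysC", "sysD", "sysE", "sysF", "sysG"]
metric_scores = [0.31, 0.27, 0.35, 0.22, 0.29, 0.33, 0.25]
human_adequacy = [3.4, 3.0, 3.8, 2.6, 3.1, 3.6, 2.8]

# Pearson measures linear agreement; Spearman asks whether the metric at
# least ranks the systems in the same order as the human judges do.
r, _ = pearsonr(metric_scores, human_adequacy)
rho, _ = spearmanr(metric_scores, human_adequacy)
print(f"Pearson r = {r:.3f}, Spearman rho = {rho:.3f}")
```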
To receive a copy of the shared data, participants must complete and sign a copy of LDC's ACL 2005 Workshop Agreement and fax it to LDC. LDC will then email instructions for downloading the data. Detailed instructions can be found at:

Questions about the shared MT data sets should be addressed to Alon Lavie <alavie@cs.cmu.edu>, who will serve as workshop coordinator for the shared MT data.
Shared Data Set for Summarization Evaluation
Past Document Understanding Conference (DUC) data will be used as the shared data set to facilitate the comparison of different measures for summarization evaluation. The shared data set consists of four years' worth of DUC evaluation data, including:
- Documents
To receive a copy of the shared data, participants must complete the DUC data user agreements and the Agreement Concerning Dissemination of DUC Results. Detailed instructions can be found at:

Questions about the shared summarization data sets with respect to this workshop should be addressed to Chin-Yew Lin <cyl@isi.edu>, who will serve as workshop coordinator for the shared summarization data. DUC-specific questions, however, should be sent to Lori Buckland (lori.buckland AT nist.gov) or Hoa Dang (hoa.dang AT nist.gov) at NIST.
WORKSHOP FORMAT

The workshop will include presentations of research papers and short reports, an invited report on the TIDES 2005 Multi-lingual, multi-document summarization evaluation, and significant discussion time to compare the results of different researchers. The workshop will conclude with a panel of invited discussants addressing future research directions.
TARGET AUDIENCE
SUBMISSION INFORMATION
IMPORTANT DATES
ORGANIZERS
PROGRAM COMMITTEE