Intrinsic and Extrinsic Evaluation Measures

Workshop at the Annual Meeting of the Association for Computational Linguistics (ACL 2005)
Ann Arbor, Michigan
This one-day workshop will focus on the challenges that the MT
and summarization communities face in developing valid and useful
evaluation measures. Our aim is to bring these two communities
together to learn from each other's approaches.
In the past few years, we have witnessed, in both MT and summarization evaluation, the innovation of n-gram-based intrinsic metrics that automatically score system outputs against human-produced reference documents (e.g., IBM's BLEU and its ISI/USC counterpart, ROUGE). Similarly, there has been renewed interest in user applications and task-based extrinsic measures in both communities (e.g., DUC'05 and TIDES'04). Most recently, evaluation efforts have tested for correlations to cross-validate independently derived intrinsic and extrinsic assessments of system outputs with each other and with human judgments of output quality, such as accuracy and fluency. (An illustrative sketch of these two ideas follows the topic list below.)

The concrete questions that we hope to see addressed in this workshop include, but are not limited to:

- How adequately do intrinsic measures capture the variation between system outputs and human-generated reference documents (summaries or translations)? What methods exist for calibrating and controlling the variation in linguistic complexity and content differences in input test sets and reference sets? How much variation exists within these constructed sets? How does that variation affect different intrinsic measures? How many reference documents are needed for effective scoring?

- How can intrinsic measures go beyond simple n-gram matching to quantify the similarity between system outputs and human references? What other features and weighting alternatives lead to better metrics for both MT and summarization? How can intrinsic measures capture fluency and adequacy? Which types of new intrinsic metrics are needed to adequately evaluate non-extractive summaries and paraphrasing (e.g., interlingual) translations?

- How effectively do extrinsic (or proxy extrinsic) measures capture the quality of system output, as needed for downstream use in human tasks, such as triage (document relevance judgments), extraction (factual question answering), and report writing, and in automated tasks, such as filtering, information extraction, and question answering? For example, when is an MT system good enough that a summarization system benefits from the additional information available in the MT output?

- How should metrics for MT and summarization be assessed and compared? What characteristics should a good metric possess? When is one evaluation method better than another? What are the most effective ways of assessing the correlation testing and statistical modeling that seek to predict human task performance or human notions of output quality (e.g., fluency and adequacy) from "cheaper" automatic metrics? How reliable are human judgments?

Anyone with an interest in MT or summarization evaluation research, or in issues pertaining to the combination of MT and summarization, is encouraged to participate in the workshop. We are looking for research papers on the aforementioned topics, as well as position papers that identify limitations in current approaches and describe promising future research directions.
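To make the two ideas above concrete, here is a minimal, illustrative sketch (not the official BLEU or ROUGE implementations) of (1) scoring a system output against human references by clipped n-gram overlap and (2) correlating such automatic scores with human quality judgments. All function names and the toy data are invented for illustration.

```python
# Minimal sketch: n-gram overlap scoring and metric-vs-human correlation.
# Not the official BLEU/ROUGE; names and toy data are illustrative only.

from collections import Counter
from math import sqrt


def ngrams(tokens, n):
    """Return a multiset (Counter) of the n-grams in a token list."""
    return Counter(tuple(tokens[i:i + n]) for i in range(len(tokens) - n + 1))


def ngram_overlap(candidate, references, n=2):
    """Clipped n-gram precision and recall of a candidate against references.

    Precision (BLEU-style, single n) = matched n-grams / candidate n-grams.
    Recall (ROUGE-N-style)           = matched n-grams / reference n-grams.
    """
    cand = ngrams(candidate.split(), n)
    best_p, best_r = 0.0, 0.0
    for ref in references:  # take the best-matching reference
        refc = ngrams(ref.split(), n)
        matched = sum((cand & refc).values())  # clipped counts
        p = matched / max(sum(cand.values()), 1)
        r = matched / max(sum(refc.values()), 1)
        best_p, best_r = max(best_p, p), max(best_r, r)
    return best_p, best_r


def pearson(xs, ys):
    """Pearson correlation between automatic scores and human judgments."""
    n = len(xs)
    mx, my = sum(xs) / n, sum(ys) / n
    cov = sum((x - mx) * (y - my) for x, y in zip(xs, ys))
    sx = sqrt(sum((x - mx) ** 2 for x in xs))
    sy = sqrt(sum((y - my) ** 2 for y in ys))
    return cov / (sx * sy)


if __name__ == "__main__":
    # Toy data: three system outputs, two human references, and
    # hypothetical human adequacy judgments on a 1-5 scale.
    refs = ["the cat sat on the mat", "a cat was sitting on the mat"]
    outputs = ["the cat sat on the mat",
               "the cat is on the mat",
               "a dog ran in the park"]
    human = [5.0, 3.5, 1.0]

    auto = [ngram_overlap(o, refs, n=2)[1] for o in outputs]  # bigram recall
    print("automatic scores:", [round(a, 2) for a in auto])
    print("correlation with human judgments:", round(pearson(auto, human), 2))
```

In this sketch the bigram recall plays the role of a ROUGE-like intrinsic score, and the Pearson coefficient stands in for the kind of correlation testing used to validate automatic metrics against human judgments.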
SHARED DATA SETS
Shared Data Set for MT Evaluation
To facilitate the comparison of different measures for MT evaluation, we have arranged for LDC to release a shared data package on which researchers can conduct experiments that assess the performance of automatic MT evaluation metrics by correlating their scores with human judgments of MT quality. The shared data set consists of the 2003 TIDES MT-Eval Test Data for both Chinese-to-English and Arabic-to-English MT. For each of these two language-pair data sets, the following is provided:
- The set of test sentences in the original source language (Chinese or Arabic)

*Note*: Human judgment assessments are provided for 7 Chinese-to-English MT systems, but only for 6 Arabic-to-English MT systems; human assessments are not available for the system code-named "ama". The data set package contains detailed instructions on how the data is organized and formatted.
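As a hedged illustration of the kind of experiment this data package is meant to support, the sketch below computes system-level Pearson and Spearman correlations between an automatic metric's per-system scores and averaged human judgments. The system names, scores, and judgment scale are hypothetical placeholders; the actual LDC package documents its own formats.

```python
# Sketch of a system-level metric-validation experiment. All scores and
# system names are invented placeholders (the real package provides human
# judgments for 7 Chinese-to-English systems and 6 Arabic-to-English systems).

from scipy.stats import pearsonr, spearmanr

# Hypothetical per-system automatic metric scores (averaged over the test set)
# and averaged human adequacy judgments for the same systems.
systems = ["sysA", "sysB", "sysC", "sysD", "sysE", "sysF", "sysG"]
metric_scores = [0.31, 0.27, 0.35, 0.22, 0.29, 0.33, 0.25]
human_adequacy = [3.4, 3.0, 3.8, 2.6, 3.1, 3.6, 2.8]

# Pearson measures linear agreement; Spearman asks whether the metric at
# least ranks the systems in the same order as the human judges do.
r, _ = pearsonr(metric_scores, human_adequacy)
rho, _ = spearmanr(metric_scores, human_adequacy)
print(f"Pearson r = {r:.3f}, Spearman rho = {rho:.3f}")
```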
To receive a copy of the shared data, participants must complete and sign a copy of LDC's ACL 2005 Workshop Agreement and fax it to LDC. LDC will then email instructions for downloading the data. Detailed instructions can be found at:

Questions about the shared MT data sets should be addressed to Alon Lavie <alavie@cs.cmu.edu>, who will serve as workshop coordinator for the shared MT data.
Shared Data Set for Summarization Evaluation
Past Document Understanding Conference (DUC) data will be used as the shared data set to facilitate the comparison of different measures for summarization evaluation. The shared data set consists of four years' worth of DUC evaluation data, including:
- Documents
To receive a copy of the shared data, participants must complete the DUC data user agreements and the Agreement Concerning Dissemination of DUC Results. Detailed instructions can be found at:

Questions about the shared summarization data sets with respect to this workshop should be addressed to Chin-Yew Lin <cyl@isi.edu>, who will serve as workshop coordinator for the shared summarization data. DUC-specific questions, however, should be sent to Lori Buckland (lori.buckland AT nist.gov) or Hoa Dang (hoa.dang AT nist.gov) at NIST.
WORKSHOP FORMAT

The workshop will include presentations of research papers and short reports, an invited report on the TIDES 2005 Multi-lingual, multi-document summarization evaluation, and significant discussion time to compare the results of different researchers. The workshop will conclude with a panel of invited discussants addressing future research directions.
TARGET AUDIENCE
SUBMISSION INFORMATION
IMPORTANT DATES
ORGANIZERS
PROGRAM COMMITTEE