The Fourth Generation, Evaluation & Metrics (GEM) Workshop

Event Notification Type: 
Call for Papers
Abbreviated Title: 
GEM^2
Location: 
ACL
Thursday, 31 July 2025 to Friday, 1 August 2025
Country: 
Austria
City: 
Vienna
Contact: 
GEM organizers
Submission Deadline: 
Friday, 11 April 2025

UPDATE with 2 announcements: (i) Datasets are now released, and (ii) we will host the ReproNLP shared task (links below).
-------------------------------------------
The fourth iteration of the Generation, Evaluation &
Metrics (GEM) Workshop
will be held as part of ACL 2025, July 27–August 1, 2025. This
year we’re planning a major upgrade to the workshop, which we dub
GEM^2, through the introduction of two large datasets of model predictions
together with prompts and gold standard references,
encouraging researchers from all backgrounds to submit work on meaningful,
efficient and robust evaluation of LLMs: DOVE and DataDecide.
The workshop will also host the ReproNLP shared task on
reproducibility of evaluations in NLP, with a presentation of
(i) the task and results overview by the organisers, and (ii) the results of
the individual reproductions by the participants.

OVERVIEW
Evaluating large language models (LLMs) is challenging. Running
LLMs over a medium- or large-scale corpus can be prohibitively expensive;
LLMs are consistently shown to be highly sensitive to prompt phrasing; and it
is hard to formulate metrics that differentiate and rank different LLMs in a
meaningful way. Consequently, results obtained on
popular benchmarks such as HELM or MMLU can lead to brittle conclusions. We
believe that meaningful, efficient, and robust evaluation is one of the
cornerstones of the scientific method, and that achieving it should be a
community-wide goal. In this workshop we seek innovative research relating to
the evaluation of LLMs and language generation systems in general. We welcome
submissions related, but not limited to, the following topics:

  • Automatic evaluation of generation systems.
  • Creating evaluation corpora and challenge
    sets.
  • Critiques of benchmarking efforts and
    responsibly measuring progress in LLMs.
  • Effective and/or efficient NLG methods that
    can be applied to a wide range of languages and/or scenarios.
  • Application and evaluation of LLMs
    interacting with external data and tools.
  • Evaluation of sociotechnical systems
    employing large language models.
  • Standardizing human evaluation and making
    it more robust.
  • In-depth analyses of outputs of existing
    systems, for example through error analyses, by applying new metrics, or
    by testing the system on new test sets.

Following the success of previous iterations, GEM^2 will
also hold an Industrial Track, which aims to provide actionable insights to
industry professionals and to foster collaborations between academia and
industry. This track will address the unique challenges faced by non-academic
colleagues, highlight the differences in evaluation practices between
academic and industrial research, and explore the challenges of evaluating
generative models with real-world data. The Industrial Track invites
submissions on topics including (but not limited to):

  • Breaking Barriers: Bridging the Gap between
    Academic and Industrial Research.
  • From Data Diversity to Model Robustness:
    Challenges in Evaluating Generative Models with Real-World Data.
  • Beyond Metrics: Evaluating Generative
    Models for Real-World Business Impact.

HOW TO SUBMIT?
Submissions can take either of the following forms:

  • Archival Papers describing original and
    unpublished work can be submitted in a 4- to 8-page format.
  • ARR-Reviewed Archival Papers describing
    original and unpublished work that already has ARR reviews can be
    submitted in a 4- to 8-page format.
  • Non-Archival Abstracts: to discuss work
    already presented or under review at a peer-reviewed venue, we allow the
    submission of 2-page abstracts.

Papers should be submitted directly through OpenReview,
selecting the appropriate track, and must conform to the ACL
2025 style guidelines. We additionally welcome presentations by authors
of papers accepted to the Findings of ACL. The selection process for Findings
presentations is managed centrally by the workshop chairs for the conference,
so we cannot respond to individual inquiries about Findings papers. However,
we will try our best to accommodate authors' requests.

IMPORTANT DATES

  • April 11: Direct paper submission deadline
    (ARR).
  • May 5: Pre-reviewed (ARR) commitment
    deadline.
  • May 19: Notification of acceptance.
  • June 6: Camera-ready paper deadline.
  • July 7: Pre-recorded videos due.
  • July 31 - August 1: Workshop at ACL in
    Vienna.

CONTACT
For any questions, please check the workshop page or email the
organisers: gem-benchmark-chairs@googlegroups.com.