Call for Participation
PROFE 2025: Language Proficiency Evaluation
IberLEF 2025 Shared Task
https://nlp.uned.es/question-answering/profe2025
PROFE 2025 reuses the Spanish proficiency exams developed by Instituto Cervantes over many years to evaluate human students. Automatic systems will therefore be evaluated under the same conditions as humans. Systems will receive a set of exercises with their corresponding instructions, but no exercise-specific training material, so we expect Transfer Learning approaches or the use of generative Large Language Models.
Subtasks
PROFE 2025 has three subtasks, one per exercise type. Teams can participate in any combination of them. Each subtask contains several exercises of the same type. The subtasks are:
Multiple choice subtask: each exercise includes a text and a set of multiple-choice questions about the text, each with exactly one correct answer. Given a question, systems must select the correct answer among the candidates.
Matching subtask: each exercise contains two sets of texts. Systems must find the text in the second set that best matches each text in the first set. There is only one possible match per text, but the first set can contain extra, unnecessary texts.
Fill-the-gap subtask: each exercise contains a text with several gaps, corresponding to textual fragments that have been removed and are presented in shuffled order as options. Systems must determine the correct position for each fragment. There is only one correct fragment per gap, but there can be more candidate fragments than gaps.
The different exercise types open research questions on how to approach each of them, for example by adapting prompts when using generative models.
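As a rough illustration of this prompting angle, the sketch below shows one possible zero-shot prompt for the multiple-choice subtask. The exercise fields ("text", "question", "options"), the Spanish prompt wording and the answer-parsing heuristic are assumptions made for this example only, not the official PROFE 2025 data format, and the actual call to a generative model is left out.

# A minimal zero-shot prompting sketch for the multiple-choice subtask.
# The exercise structure and prompt wording are illustrative assumptions,
# not the official PROFE 2025 data format.

def build_multiple_choice_prompt(text: str, question: str, options: list[str]) -> str:
    """Compose a prompt asking a generative model to pick exactly one option."""
    letters = "ABCDEFGH"
    option_lines = "\n".join(f"{letters[i]}) {opt}" for i, opt in enumerate(options))
    return (
        "Lee el siguiente texto y responde a la pregunta eligiendo una única opción. "
        "Contesta solo con la letra.\n\n"
        f"Texto:\n{text}\n\n"
        f"Pregunta: {question}\n"
        f"Opciones:\n{option_lines}\n\n"
        "Respuesta:"
    )

def parse_answer(model_output: str, n_options: int) -> int | None:
    """Map the model's reply (e.g. 'B' or 'b)') back to an option index."""
    reply = model_output.strip().upper()
    for i in range(n_options):
        if reply.startswith("ABCDEFGH"[i]):
            return i
    return None  # unparseable reply; a retry or a fallback choice could be used here

if __name__ == "__main__":
    prompt = build_multiple_choice_prompt(
        text="El Instituto Cervantes organiza exámenes de español en todo el mundo.",
        question="¿Qué organiza el Instituto Cervantes?",
        options=["Exámenes de español", "Cursos de cocina", "Viajes de estudios"],
    )
    print(prompt)  # send this string to the generative model of your choice
    print(parse_answer("B) Cursos de cocina", n_options=3))  # -> 1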
Dataset
We will use the IC-UNED-RC-ES dataset, created from real examinations at Instituto Cervantes. These exams were designed by human experts to assess language proficiency in Spanish. We have already collected the exams and converted them into a digital format ready to be used in the task. The dataset contains exams at different levels (from A1 to C2).
The complete dataset contains 282 exams with 855 exercises. The total number of evaluation points is 6146 (among 16570 options), distributed by exercise type as follows:
- multiple-choice: 3544 responses
- matching: 2309 responses
- fill-the-gap: 293 responses
In PROFE 2025 we plan to use around 50% of the exams, so the remaining 50% stays hidden for the second edition of PROFE.
We do not intend to distribute the gold standard, in order to prevent overfitting in post-campaign experiments and data contamination in LLMs.
Evaluation measures and baseline
We will use traditional accuracy (proportion of correct answers) as the main evaluation measure. Systems will receive evaluation scores from two different perspectives:
- At the question level, where correct answers are counted individually without grouping them.
- At the exam level, where scores are computed per exam. Each exam contains several exercises of different types. An exam is considered passed if its accuracy (the proportion of correct answers) is above 0.5. The proportion of passed exams is then reported as a global score. This perspective applies only to teams participating in all three subtasks.
In more detail, the evaluation per subtask is as follows:
- Multiple choice subtask: we will measure accuracy as the proportion of questions answered correctly.
- Matching subtask: we will measure accuracy as the proportion of correctly matched texts.
- Fill-the-gap subtask: we will measure accuracy as the proportion of correctly filled gaps.
We will use accuracy as the evaluation measure because there is only one correct option among the candidates and because it is the measure applied to humans taking the same exams. Thus, we can compare the performance of automatic systems and humans under the same conditions.
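To make the two scoring perspectives concrete, the sketch below computes question-level accuracy and the exam-level pass rate with the 0.5 threshold described above. The input representation (a mapping from exam identifiers to per-question correctness flags) is a hypothetical one introduced only for illustration; it is not the official submission or scoring format.

# A minimal sketch of the two evaluation perspectives: question-level accuracy
# and exam-level pass rate. The data structure (exam id -> list of per-question
# booleans) is an illustrative assumption, not the official PROFE 2025 format.

def question_level_accuracy(results: dict[str, list[bool]]) -> float:
    """Proportion of correctly answered questions over all exams."""
    all_answers = [correct for answers in results.values() for correct in answers]
    return sum(all_answers) / len(all_answers)

def exam_level_pass_rate(results: dict[str, list[bool]], threshold: float = 0.5) -> float:
    """Proportion of exams whose per-exam accuracy is above the threshold."""
    passed = [sum(answers) / len(answers) > threshold for answers in results.values()]
    return sum(passed) / len(passed)

if __name__ == "__main__":
    # Toy example: two exams with per-question correctness flags.
    results = {
        "exam_A1_001": [True, True, False, True],    # accuracy 0.75 -> passed
        "exam_B2_007": [False, True, False, False],  # accuracy 0.25 -> not passed
    }
    print(f"Question-level accuracy: {question_level_accuracy(results):.2f}")  # 0.50
    print(f"Exam-level pass rate:    {exam_level_pass_rate(results):.2f}")     # 0.50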
A preliminary baseline using ChatGPT obtains the following results for each exercise type (note that different prompting can produce slightly different results):
- Multiple choice accuracy: 0.64
- Fill-the-gap accuracy: 0.43
- Matching accuracy: 0.51
Schedule
February 6, 2025 Registration opens
March 10, 2025 Training data released
April 28, 2025 Test set release
May 9, 2025 Deadline for submitting runs
May 14, 2025 Release of evaluation results
June 3, 2025 Paper submission deadline
Organizers
Alvaro Rodrigo, UNED NLP & IR Group (Universidad Nacional de Educación a Distancia)
Anselmo Peñas, UNED NLP & IR Group (Universidad Nacional de Educación a Distancia)
Alberto Pérez, UNED NLP & IR Group (Universidad Nacional de Educación a Distancia)
Sergio Moreno, UNED NLP & IR Group (Universidad Nacional de Educación a Distancia)
Javier Fruns, Instituto Cervantes
Inés Soria, Instituto Cervantes
Rodrigo Agerri, HiTZ (Universidad del País Vasco, UPV/EHU)