Natural Language Generation (NLG) is in the ascendant both as a stand-alone data-to-text or text-to-text task and as part of downstream applications (e.g., abstractive summarization, dialogue-based interaction, and question answering). In 2017 alone, three “deep” NLG shared tasks focusing on language generation from abstract semantic representations were organized: WebNLG, SemEval Task 9, and E2E. However, both the 2017 tasks and earlier shared tasks (including the 2011 Surface Realization Shared Task, SR'11) focused on English; multilingual generation has been largely neglected so far.
The 2018 Shared Task (SR'18) follows up on SR’11, this time with an emphasis on multilingual surface generation from Universal Dependencies (UD) treebanks. The multilingual UD dataset has already been used for the CoNLL'17 parsing shared task. This dataset, which currently consists of 102 treebanks covering about 60 languages and can be downloaded freely, facilitates the development of large-scale applications that can potentially work across all of the UD treebank languages in a uniform fashion.
As in SR’11, the proposed shared task comprises two tracks with different levels of complexity:
- Shallow Track: This track starts from genuine UD structures from which word order information has been removed and in which the tokens have been lemmatized, i.e., from unordered dependency trees with lemmatized nodes that hold PoS tags and morphological information as found in the original annotations. The task consists of determining the word order and inflecting the words. Datasets are available for the following languages: Arabic, Czech, Dutch, English, Finnish, French, Italian, Portuguese, Russian and Spanish.
- Deep Track: This track starts from UD structures from which functional words (in particular, auxiliaries, functional prepositions and conjunctions) and surface-oriented morphological information have been removed. In addition to what has to be done for the Shallow Track, the Deep Track thus requires reintroducing the removed functional words and morphological features. Datasets are available for the following languages: English, French and Spanish.
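To make the Shallow Track input more concrete, the following is a minimal Python sketch of how such an input could be derived from one annotated sentence: the surface form is dropped, the token order is shuffled, and the ids and heads are renumbered so that the dependency tree stays intact. The example sentence, the function name `to_shallow_input` and the dictionary layout are assumptions made for illustration only; the field names follow the CoNLL-U columns, and the procedure actually used to prepare the task data may differ.

```python
import random

# A tiny CoNLL-U-style sentence: (id, form, lemma, upos, feats, head, deprel).
# The sentence and its annotation are illustrative, not taken from the task data.
SENTENCE = [
    (1, "The", "the", "DET", "Definite=Def|PronType=Art", 2, "det"),
    (2, "dogs", "dog", "NOUN", "Number=Plur", 3, "nsubj"),
    (3, "barked", "bark", "VERB", "Tense=Past|VerbForm=Fin", 0, "root"),
]


def to_shallow_input(tokens, seed=0):
    """Drop surface forms, shuffle token order, and renumber ids and heads.

    Hypothetical helper: a sketch of the idea, not the organizers' script.
    """
    rng = random.Random(seed)
    shuffled = list(tokens)
    rng.shuffle(shuffled)
    # Map each original id to its new position; head 0 (the root) stays 0.
    new_id = {tok[0]: i for i, tok in enumerate(shuffled, start=1)}
    new_id[0] = 0
    return [
        {
            "id": new_id[tid],
            "lemma": lemma,
            "upos": upos,
            "feats": feats,
            "head": new_id[head],
            "deprel": deprel,
        }
        for (tid, _form, lemma, upos, feats, head, deprel) in shuffled
    ]


shallow = to_shallow_input(SENTENCE)
for node in shallow:
    print(node)
```

A system for the Shallow Track would take nodes like these and has to recover both the original order and the inflected forms (here, "The dogs barked") from the lemmas, PoS tags and morphological features.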
The data used are the Universal Dependencies treebanks V2.0, that is, the same data as used for the CoNLL'17 shared task on Multilingual Parsing from Raw Text to Universal Dependencies.
Participating teams may address either track or both.
Important dates
Dec 11, 2017: Registration for the task open
Dec 11, 2017: Training and development sets available for registered participants
April 8, 2018: System descriptions due
April 9, 2018: Evaluation scripts and test sets available
April 23, 2018: System outputs collected
April 30, 2018: Automatic evaluation due
May 21, 2018: Human evaluation due
May 28, 2018: Camera-ready papers due