Difference between revisions of "Data sets for NLG"

From ACL Wiki
Revision as of 04:12, 11 April 2019


This page lists sets of structured data to be used as input for natural language generation tasks, or to inform research on NLG.

Data-to-text/Concept-to-text Generation

These datasets contain data and corresponding texts based on this data.

E2E

http://www.macs.hw.ac.uk/InteractionLab/E2E/#data

Crowdsourced restaurant descriptions with corresponding restaurant data. English.
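As a rough illustration of what "restaurant data" paired with a description can look like, the sketch below parses a flat `attribute[value]` string into a dictionary. The notation and the sample values are assumptions made for illustration and are not taken from this page; check the dataset's own documentation for the actual file format.

```python
import re

def parse_mr(mr: str) -> dict:
    """Parse a flat 'attribute[value], attribute[value]' string into a dict.

    The attribute[value] notation is an assumed input format, used here
    only to illustrate pairing structured data with a text.
    """
    pattern = re.compile(r"([^,\[\]]+)\[([^\]]*)\]")
    return {m.group(1).strip(): m.group(2).strip() for m in pattern.finditer(mr)}

# Hypothetical record: structured input plus one human-written description.
example_mr = "name[Example Cafe], eatType[coffee shop], area[riverside]"
example_text = "Example Cafe is a coffee shop by the riverside."
parsed = parse_mr(example_mr)
```

Keeping the parsed attributes alongside the reference text makes it easy to check which attributes a generated description actually mentions.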

SUMTIME

https://ehudreiter.files.wordpress.com/2016/12/sumtime.zip

Weather forecasts written by human forecasters, with corresponding forecast data, for UK North Sea marine forecasts.

WeatherGov

https://cs.stanford.edu/~pliang/data/weather-data.zip

Computer-generated weather forecasts from weather.gov (US public forecast), along with corresponding weather data.

WebNLG

https://github.com/ThiagoCF05/webnlg

Crowdsourced descriptions of semantic web entities, with corresponding RDF triples.
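A minimal sketch of how a set of RDF triples and its crowdsourced descriptions might be held together in memory. The class name, the `linearise` helper, the separator characters, and the sample triple are all hypothetical; the dataset's own release defines the real schema.

```python
from dataclasses import dataclass
from typing import List, Tuple

# An RDF triple as (subject, predicate, object); plain strings for simplicity.
Triple = Tuple[str, str, str]

@dataclass
class DataTextPair:
    """One input (a set of triples) with one or more reference texts."""
    triples: List[Triple]
    texts: List[str]

def linearise(triples: List[Triple]) -> str:
    """Flatten triples into a single string, e.g. as input to a text generator."""
    return " ; ".join(f"{s} | {p} | {o}" for s, p, o in triples)

# Hypothetical example entry, not taken from the dataset.
pair = DataTextPair(
    triples=[("Example_City", "country", "Example_Land")],
    texts=["Example City is in Example Land."],
)
flat = linearise(pair.triples)
```

Linearising the triples is only one possible design choice; graph-based encoders would keep the triple structure instead.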

Referring Expressions Generation

Referring expression generation is a sub-task of NLG that focuses only on the generation of referring expressions (descriptions) that identify specific entities called targets.

GRE3D3 and GRE3D7: Spatial Relations in Referring Expressions

http://jetteviethen.net/research/spatial.html

Two web-based production experiments were conducted by Jette Viethen under the supervision of Robert Dale. The resulting corpora GRE3D3 and GRE3D7 contain 720 and 4480 referring expressions, respectively. Each referring expression describes a simple object in a simple 3D scene. GRE3D3 scenes contain 3 objects and GRE3D7 scenes contain 7 objects.

RefClef, RefCOCO, RefCOCO+ and RefCOCOg

https://github.com/lichengunc/refer

Referring expressions for objects in images, and the corresponding images.

The REAL dataset

https://datastorre.stir.ac.uk/handle/11667/82

Referring expressions for objects in images, and the corresponding images.

GeoDescriptors

https://gitlab.citius.usc.es/alejandro.ramos/geodescriptors

Geographical descriptions (e.g., "Norte de Galicia") and the corresponding regions on a map.

TUNA Reference Corpus

https://www.abdn.ac.uk/ncs/departments/computing-science/corpus-496.php

https://www.abdn.ac.uk/ncs/documents/corpus.zip [direct download]

The TUNA Reference Corpus is a semantically and pragmatically transparent corpus of identifying references to objects in visual domains. It was constructed via an online experiment and has since been used in a number of evaluation studies on Referring Expressions Generation, as well as in two Shared Tasks: the Attribute Selection for Referring Expressions Generation task (2007), and the Referring Expression Generation task (2008). Main authors: Kees van Deemter, Albert Gatt, Ielka van der Sluis.

COCONUT Corpus

http://www.pitt.edu/~coconut/coconut-corpus.html

http://www.pitt.edu/%7Ecoconut/corpora/corpus.tar.gz [direct download]

COCONUT was a project on “Cooperative, coordinated natural language utterances”. The COCONUT corpus is a collection of computer-mediated dialogues in which two subjects collaborate on a simple task, namely buying furniture. SGML annotations were added according to the COCONUT-DRI coding scheme (http://www.pitt.edu/%7Epjordan/papers/coconut-manual.pdf).

Dialogue Systems

CLASSiC WOZ corpus on Information Presentation in Spoken Dialogue Systems

http://www.classic-project.org/corpora

CLASSiC is a project on Computational Learning in Adaptive Systems for Spoken Conversation (http://www.classic-project.org/). The Wizard-of-Oz corpus on Information Presentation in Spoken Dialogue Systems contains the wizards' choices of Information Presentation strategy (summary, compare, recommend, or a combination of those) and attribute selection. The domain is restaurant search in Edinburgh. Objective measures (such as dialogue length, number of database hits, and number of sentences generated) as well as subjective measures (the user scores) were logged.


Other

PIL: Patient Information Leaflet corpus

http://mcs.open.ac.uk/nlg/old_projects/pills/corpus/PIL/

http://mcs.open.ac.uk/nlg/old_projects/pills/corpus/PIL-corpus-2.0.tar.gz [direct download]

The Patient Information Leaflet (PIL) corpus is a searchable (http://www.itri.brighton.ac.uk/projects/pills/corpus/PIL/searchtool/search.html) and browsable collection of patient information leaflets, available in various document formats as well as structurally annotated SGML. The PIL corpus was initially developed as part of the ICONOCLAST project at ITRI, Brighton.

This page was imported semi-automatically from the NLG Resources Wiki, which was run by ACL SIGGEN in the years 2005–2009. Please correct conversion errors and help update its contents.

This page is now associated with the Natural Language Generation Portal.