Difference between revisions of "Data sets for NLG blog"

From ACL Wiki
This blog is a supplement to [[Data sets for NLG]]. It collects comments about these data sets from users, authors and other interested parties. We are especially interested in comments about appropriate and inappropriate usage of a data set, "best practice" use of a data set, useful additional information about a data set (e.g., scope, how it was constructed), and pointers to related data sets which may be more appropriate for some users. Links to relevant papers and other resources are welcome.

We'd love to see more content here. Please email Ehud Reiter (e.reiter@abdn.ac.uk) with contributions or other comments.
  
 
=== E2E ===
 
 
The E2E dataset was used in the [http://www.macs.hw.ac.uk/InteractionLab/E2E/ E2E challenge].
 

=== SumTime ===

The SumTime corpus is structured as a database and is provided in text (CSV) and MDB (Microsoft Access) formats.

A good example of the use of SumTime is [https://doi.org/10.1017/S1351324907004664 Automatic generation of weather forecast texts using comprehensive probabilistic generation-space models].

=== Tuna ===

[http://www.lrec-conf.org/proceedings/lrec2010/pdf/251_Paper.pdf Dutch] and [https://www.aclweb.org/anthology/W17-3532 Mandarin] versions of Tuna have been developed.

=== WebNLG ===

Thiago Castro Ferreira and Diego Moussallem spent six months producing an enriched version of WebNLG with high-quality annotations. It is available on [https://github.com/ThiagoCF05/webnlg GitHub].

The WebNLG dataset was used in the [http://webnlg.loria.fr/pages/results.html WebNLG challenge].

=== Weather ===

The Weather dataset uses tree-structured meaning representations for better discourse-level structuring and contains about 30K human-annotated utterances.

It is available on [https://github.com/facebookresearch/TreeNLG GitHub].
  
 
=== Weathergov ===
 
The Weathergov corpus contains the output of a template-based weather forecast generator, not human-written forecasts ([https://ehudreiter.com/2017/05/09/weathergov/ blog post]). Hence ML on Weathergov is an exercise in reverse-engineering a template-based NLG system, not in training an NLG system from human data. If you want to train on human-written weather forecasts, consider using the [https://github.com/facebookresearch/TreeNLG Weather corpus] or the [https://ehudreiter.files.wordpress.com/2016/12/sumtime.zip SumTime corpus] instead.

=== WikiBio ===

WikiBio has had no manual verification or filtering ([https://ehudreiter.com/2019/09/26/generated-texts-must-be-accurate/#comment-15983 blog comment]).

Latest revision as of 13:09, 6 August 2020
