Distributed Distributional Similarities of Google Books Over the Centuries

Martin Riedl, Richard Steuer, Chris Biemann


Abstract
This paper introduces a distributional thesaurus and sense clusters computed on the complete Google Syntactic N-grams dataset, which is extracted from Google Books, a very large corpus of digitized books published between 1520 and 2008. We show that a thesaurus computed on such a large text basis yields much better results than thesauri computed on smaller corpora such as Wikipedia. We also provide distributional thesauri for equal-sized time slices of the corpus. While distributional thesauri can be used as lexical resources in NLP tasks, comparing word similarities over time can reveal sense changes of terms across decades or centuries, and can serve as a resource for diachronic lexicography. Thesauri and clusters are available for download.
Anthology ID: L14-1249
Volume: Proceedings of the Ninth International Conference on Language Resources and Evaluation (LREC'14)
Month: May
Year: 2014
Address: Reykjavik, Iceland
Editors: Nicoletta Calzolari, Khalid Choukri, Thierry Declerck, Hrafn Loftsson, Bente Maegaard, Joseph Mariani, Asuncion Moreno, Jan Odijk, Stelios Piperidis
Venue: LREC
Publisher: European Language Resources Association (ELRA)
Pages: 1401–1405
URL: http://www.lrec-conf.org/proceedings/lrec2014/pdf/274_Paper.pdf
Cite (ACL): Martin Riedl, Richard Steuer, and Chris Biemann. 2014. Distributed Distributional Similarities of Google Books Over the Centuries. In Proceedings of the Ninth International Conference on Language Resources and Evaluation (LREC'14), pages 1401–1405, Reykjavik, Iceland. European Language Resources Association (ELRA).
Cite (Informal): Distributed Distributional Similarities of Google Books Over the Centuries (Riedl et al., LREC 2014)