Unsupervised Random Walk Sentence Embeddings: A Strong but Simple Baseline

Kawin Ethayarajh


Abstract
Using a random walk model of text generation, Arora et al. (2017) proposed a strong baseline for computing sentence embeddings: take a weighted average of word embeddings and modify with SVD. This simple method even outperforms far more complex approaches such as LSTMs on textual similarity tasks. In this paper, we first show that word vector length has a confounding effect on the probability of a sentence being generated in Arora et al.’s model. We propose a random walk model that is robust to this confound, where the probability of word generation is inversely related to the angular distance between the word and sentence embeddings. Our approach beats Arora et al.’s by up to 44.4% on textual similarity tasks and is competitive with state-of-the-art methods. Unlike Arora et al.’s method, ours requires no hyperparameter tuning, which means it can be used when there is no labelled data.
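The abstract names two computations that are easy to make concrete: Arora et al.'s baseline (a frequency-weighted average of word vectors, then a modification via SVD) and the angular distance between a word embedding and the sentence embedding, the quantity the proposed model ties generation probability to. Below is a minimal Python sketch of both. The function names are illustrative, and the a/(a + p(w)) weighting and first-singular-vector removal follow the commonly described SIF method of Arora et al. (2017), not this paper's released code, so treat the details as assumptions.

```python
import numpy as np

def sif_embeddings(sentences, word_vecs, word_probs, a=1e-3):
    """Sketch of the Arora et al. (2017) baseline the abstract describes.
    sentences: list of token lists (assumed non-empty after filtering);
    word_vecs: dict word -> np.ndarray; word_probs: dict word -> unigram
    probability; a: smoothing hyperparameter."""
    embs = []
    for sent in sentences:
        tokens = [w for w in sent if w in word_vecs]
        # Weight each word vector by a / (a + p(w)), then average.
        weights = np.array([a / (a + word_probs[w]) for w in tokens])
        vecs = np.stack([word_vecs[w] for w in tokens])
        embs.append(weights @ vecs / len(tokens))
    embs = np.stack(embs)
    # "Modify with SVD": subtract each sentence embedding's projection
    # onto the corpus's first right singular vector (common component).
    _, _, vt = np.linalg.svd(embs, full_matrices=False)
    u = vt[0]
    return embs - np.outer(embs @ u, u)

def angular_distance(word_vec, sent_vec):
    """Normalized angular distance between a word vector and a sentence
    vector; the proposed model makes generation probability inversely
    related to this quantity, so it is unaffected by vector length."""
    cos = np.dot(word_vec, sent_vec) / (
        np.linalg.norm(word_vec) * np.linalg.norm(sent_vec))
    return np.arccos(np.clip(cos, -1.0, 1.0)) / np.pi
```

Note that `angular_distance` depends only on the angle between the two vectors, which is the point of the abstract's confound argument: unlike an inner-product model, it cannot be inflated by word vector length alone.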
Anthology ID: W18-3012
Volume: Proceedings of the Third Workshop on Representation Learning for NLP
Month: July
Year: 2018
Address: Melbourne, Australia
Editors: Isabelle Augenstein, Kris Cao, He He, Felix Hill, Spandana Gella, Jamie Kiros, Hongyuan Mei, Dipendra Misra
Venue: RepL4NLP
SIG: SIGREP
Publisher: Association for Computational Linguistics
Pages: 91–100
URL: https://aclanthology.org/W18-3012
DOI: 10.18653/v1/W18-3012
Cite (ACL): Kawin Ethayarajh. 2018. Unsupervised Random Walk Sentence Embeddings: A Strong but Simple Baseline. In Proceedings of the Third Workshop on Representation Learning for NLP, pages 91–100, Melbourne, Australia. Association for Computational Linguistics.
Cite (Informal): Unsupervised Random Walk Sentence Embeddings: A Strong but Simple Baseline (Ethayarajh, RepL4NLP 2018)
PDF: https://aclanthology.org/W18-3012.pdf
Notes: W18-3012.Notes.pdf
Data: SST