Distributed Document and Phrase Co-embeddings for Descriptive Clustering

Motoki Sato, Austin J. Brockmeier, Georgios Kontonatsios, Tingting Mu, John Y. Goulermas, Jun’ichi Tsujii, Sophia Ananiadou


Abstract
Descriptive document clustering aims to automatically discover groups of semantically related documents and to assign a meaningful label to characterise the content of each cluster. In this paper, we present a descriptive clustering approach that employs a distributed representation model, namely the paragraph vector model, to capture semantic similarities between documents and phrases. The proposed method uses a joint representation of phrases and documents (i.e., a co-embedding) to automatically select a descriptive phrase that best represents each document cluster. We evaluate our method by comparing its performance to an existing state-of-the-art descriptive clustering method that also uses co-embedding but relies on a bag-of-words representation. Results obtained on benchmark datasets demonstrate that the paragraph vector-based method obtains superior performance over the existing approach in both identifying clusters and assigning appropriate descriptive labels to them.
Anthology ID:
E17-1093
Volume:
Proceedings of the 15th Conference of the European Chapter of the Association for Computational Linguistics: Volume 1, Long Papers
Month:
April
Year:
2017
Address:
Valencia, Spain
Venue:
EACL
SIG:
Publisher:
Association for Computational Linguistics
Note:
Pages:
991–1001
URL:
https://www.aclweb.org/anthology/E17-1093
DOI:
Bib Export formats:
BibTeX MODS XML EndNote