SCDV : Sparse Composite Document Vectors using soft clustering over distributional representations

Dheeraj Mekala, Vivek Gupta, Bhargavi Paranjape, Harish Karnick


Abstract
We present a feature vector formation technique for documents - Sparse Composite Document Vector (SCDV) - which overcomes several shortcomings of the current distributional paragraph vector representations that are widely used for text representation. In SCDV, word embeddings are clustered to capture the multiple semantic contexts in which words occur. They are then chained together to form document topic-vectors that can express complex, multi-topic documents. Through extensive experiments on multi-class and multi-label classification tasks, we outperform the previous state-of-the-art method, NTSG. We also show that SCDV embeddings perform well on heterogeneous tasks such as topic coherence, context-sensitive learning and information retrieval. Moreover, we achieve a significant reduction in training and prediction times compared to other representation methods. SCDV achieves the best of both worlds - better performance with lower time and space complexity.
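The pipeline summarized in the abstract (soft-cluster the word embeddings, chain the per-cluster word vectors together, then combine them into a sparse document vector) can be illustrated with a rough sketch. This is not the authors' implementation. It assumes NumPy only, and it substitutes a softmax over negative squared distances to given centroids for a fitted mixture model's membership probabilities. The function names, the `temperature` parameter, the idf weighting argument, and the `sparsity` threshold value are all hypothetical choices made for illustration.

```python
import numpy as np


def soft_assign(word_vecs, centroids, temperature=1.0):
    """Soft cluster memberships for each word, shape (V, K).

    Assumption: a softmax over negative squared distances stands in for
    the posterior probabilities a fitted mixture model would provide.
    """
    diffs = word_vecs[:, None, :] - centroids[None, :, :]
    sq_dist = (diffs ** 2).sum(axis=2)
    logits = -sq_dist / temperature
    logits -= logits.max(axis=1, keepdims=True)  # numerical stability
    probs = np.exp(logits)
    return probs / probs.sum(axis=1, keepdims=True)


def word_topic_vectors(word_vecs, probs, idf):
    """Chain the K membership-scaled copies of each word vector, shape (V, K*d)."""
    vocab_size = word_vecs.shape[0]
    scaled = probs[:, :, None] * word_vecs[:, None, :]  # (V, K, d)
    return idf[:, None] * scaled.reshape(vocab_size, -1)


def document_vector(token_ids, wtv, sparsity=0.04):
    """Sum word-topic vectors over a document, normalize, zero small entries."""
    doc = wtv[token_ids].sum(axis=0)
    norm = np.linalg.norm(doc)
    if norm > 0:
        doc = doc / norm
    # Hypothetical sparsification rule: drop entries that are small
    # relative to the largest absolute entry.
    threshold = sparsity * np.abs(doc).max()
    doc[np.abs(doc) < threshold] = 0.0
    return doc


if __name__ == "__main__":
    rng = np.random.default_rng(0)
    word_vecs = rng.normal(size=(6, 4))   # toy vocabulary: 6 words, 4 dims
    centroids = rng.normal(size=(3, 4))   # 3 hypothetical cluster centers
    idf = np.ones(6)                      # placeholder idf weights
    probs = soft_assign(word_vecs, centroids)
    wtv = word_topic_vectors(word_vecs, probs, idf)
    doc = document_vector([0, 2, 2, 5], wtv)
    print(doc.shape)  # (12,) = 3 clusters * 4 dimensions
```

The document vector has one block of embedding dimensions per cluster, so a word that belongs to several clusters contributes to several blocks, and the thresholding step is what makes the result sparse.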
Anthology ID:
D17-1069
Volume:
Proceedings of the 2017 Conference on Empirical Methods in Natural Language Processing
Month:
September
Year:
2017
Address:
Copenhagen, Denmark
Editors:
Martha Palmer, Rebecca Hwa, Sebastian Riedel
Venue:
EMNLP
SIG:
SIGDAT
Publisher:
Association for Computational Linguistics
Pages:
659–669
URL:
https://aclanthology.org/D17-1069
DOI:
10.18653/v1/D17-1069
Cite (ACL):
Dheeraj Mekala, Vivek Gupta, Bhargavi Paranjape, and Harish Karnick. 2017. SCDV : Sparse Composite Document Vectors using soft clustering over distributional representations. In Proceedings of the 2017 Conference on Empirical Methods in Natural Language Processing, pages 659–669, Copenhagen, Denmark. Association for Computational Linguistics.
Cite (Informal):
SCDV : Sparse Composite Document Vectors using soft clustering over distributional representations (Mekala et al., EMNLP 2017)
PDF:
https://aclanthology.org/D17-1069.pdf
Video:
https://aclanthology.org/D17-1069.mp4
Code:
dheeraj7596/SCDV