Document Similarity for Texts of Varying Lengths via Hidden Topics

Hongyu Gong, Tarek Sakakini, Suma Bhat, JinJun Xiong


Abstract
Measuring similarity between texts is an important task for several applications. Existing approaches to measuring document similarity are inadequate for document pairs of non-comparable lengths, such as a long document and its summary. This is because of the lexical, contextual, and abstraction gaps between a long, richly detailed document and its concise, abstract summary. In this paper, we present a document matching approach that bridges these gaps by comparing the texts in a common space of hidden topics. We evaluate the matching algorithm on two matching tasks and find that it consistently outperforms strong baselines by a wide margin. We also highlight the benefits of incorporating domain knowledge into text matching.
Anthology ID:
P18-1218
Volume:
Proceedings of the 56th Annual Meeting of the Association for Computational Linguistics (Volume 1: Long Papers)
Month:
July
Year:
2018
Address:
Melbourne, Australia
Editors:
Iryna Gurevych, Yusuke Miyao
Venue:
ACL
Publisher:
Association for Computational Linguistics
Pages:
2341–2351
URL:
https://aclanthology.org/P18-1218
DOI:
10.18653/v1/P18-1218
Cite (ACL):
Hongyu Gong, Tarek Sakakini, Suma Bhat, and JinJun Xiong. 2018. Document Similarity for Texts of Varying Lengths via Hidden Topics. In Proceedings of the 56th Annual Meeting of the Association for Computational Linguistics (Volume 1: Long Papers), pages 2341–2351, Melbourne, Australia. Association for Computational Linguistics.
Cite (Informal):
Document Similarity for Texts of Varying Lengths via Hidden Topics (Gong et al., ACL 2018)
PDF:
https://aclanthology.org/P18-1218.pdf
Note:
P18-1218.Notes.pdf
Poster:
P18-1218.Poster.pdf
Code:
HongyuGong/Document-Similarity-via-Hidden-Topics