Semi-Supervised SimHash for Efficient Document Similarity Search

Qixia Jiang and Maosong Sun
Tsinghua University


Abstract

Searching for documents that are similar to a query document is an important component of modern information retrieval. Some existing hashing methods can be used for efficient document similarity search. However, unsupervised hashing methods cannot incorporate prior knowledge to improve the hash functions. Although some supervised hashing methods can derive effective hash functions from prior knowledge, they are either computationally expensive or poorly discriminative. This paper proposes a novel (semi-)supervised hashing method named Semi-Supervised SimHash (S^3H) for high-dimensional data similarity search. The basic idea of S^3H is to learn the optimal feature weights from prior knowledge and use them to relocate the data such that similar data have similar hash codes. We compare our method against several state-of-the-art methods on two large datasets. In all results, our method achieves the best performance.
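
To make the role of feature weights concrete, the following is a minimal Python sketch of a weighted SimHash, not the S^3H algorithm itself. It assumes documents are bags of features with non-negative values, and it takes the weights as given. The step the abstract describes, learning those weights from prior knowledge, is not shown. The names feature_bits, weighted_simhash, and hamming are hypothetical, and MD5 is an arbitrary choice of per-feature hash.

```python
import hashlib


def feature_bits(feature, n_bits=64):
    """Map a feature string to n_bits pseudo-random bits.

    MD5 is an arbitrary choice here (assumption); it yields 128 bits,
    so n_bits must not exceed 128.
    """
    digest = hashlib.md5(feature.encode("utf-8")).digest()
    value = int.from_bytes(digest, "big")
    return [(value >> i) & 1 for i in range(n_bits)]


def weighted_simhash(doc, weights, n_bits=64):
    """Return an n_bits hash code for doc as a Python int.

    doc:     dict mapping feature -> value (e.g. term frequency).
    weights: dict mapping feature -> weight; features that are
             missing default to 1.0. In this sketch the weights are
             set by hand (assumption), not learned.
    """
    totals = [0.0] * n_bits
    for feature, value in doc.items():
        contribution = weights.get(feature, 1.0) * value
        for i, bit in enumerate(feature_bits(feature, n_bits)):
            # Each feature votes +contribution or -contribution per bit.
            totals[i] += contribution if bit else -contribution
    # Bit i of the code is 1 when the weighted vote is positive.
    return sum(1 << i for i, total in enumerate(totals) if total > 0)


def hamming(a, b):
    """Number of bit positions in which two hash codes differ."""
    return bin(a ^ b).count("1")
```

In this sketch, a feature with weight 0 has no influence on the code, and a feature with a large weight dominates the per-bit votes. This illustrates how a choice of weights can move two documents closer together or further apart in Hamming space.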




Full paper: http://www.aclweb.org/anthology/P/P11/P11-1010.pdf