GNEG: Graph-Based Negative Sampling for word2vec

Zheng Zhang, Pierre Zweigenbaum


Abstract
Negative sampling is an important component in word2vec for distributed word representation learning. We hypothesize that taking into account global, corpus-level information and generating a different noise distribution for each target word better satisfies the requirements of negative examples for each training word than the original frequency-based distribution. For this purpose, we pre-compute word co-occurrence statistics from the corpus and apply network algorithms such as random walk to them. We test this hypothesis through a set of experiments whose results show that our approach boosts performance on the word analogy task by about 5% and improves performance on word similarity tasks by about 1% compared to the skip-gram negative sampling baseline.
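The idea in the abstract, a separate noise distribution per target word derived from corpus-level co-occurrence statistics and a random walk, can be illustrated with a minimal sketch. This is not the authors' implementation: the function names, the window size, the number of walk steps, and the smoothing constant are all assumptions chosen for illustration, and the paper may build and post-process its graph differently.

```python
import numpy as np

def cooccurrence_matrix(sentences, vocab, window=2):
    """Count symmetric word co-occurrences within a fixed window (hypothetical helper)."""
    index = {w: i for i, w in enumerate(vocab)}
    counts = np.zeros((len(vocab), len(vocab)))
    for sent in sentences:
        ids = [index[w] for w in sent if w in index]
        for pos, i in enumerate(ids):
            # Pair word i with the next `window` words to its right.
            for j in ids[pos + 1 : pos + 1 + window]:
                counts[i, j] += 1
                counts[j, i] += 1
    return counts

def noise_distributions(counts, steps=2, smoothing=1e-8):
    """Return a matrix whose row i is a noise distribution for target word i.

    Assumption: the distribution is the `steps`-step random-walk transition
    matrix of the co-occurrence graph. `smoothing` only keeps rows with no
    co-occurrences normalizable.
    """
    transition = counts + smoothing
    transition = transition / transition.sum(axis=1, keepdims=True)
    return np.linalg.matrix_power(transition, steps)

def sample_negatives(noise, target_id, k, rng):
    """Draw k negative word ids from the target word's own noise distribution."""
    return rng.choice(noise.shape[0], size=k, p=noise[target_id])
```

In use, `noise_distributions(cooccurrence_matrix(sentences, vocab))` would be computed once before training, and `sample_negatives` would replace the single frequency-based draw that standard skip-gram negative sampling uses for every target word.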
Anthology ID:
P18-2090
Volume:
Proceedings of the 56th Annual Meeting of the Association for Computational Linguistics (Volume 2: Short Papers)
Month:
July
Year:
2018
Address:
Melbourne, Australia
Editors:
Iryna Gurevych, Yusuke Miyao
Venue:
ACL
Publisher:
Association for Computational Linguistics
Pages:
566–571
URL:
https://aclanthology.org/P18-2090
DOI:
10.18653/v1/P18-2090
Cite (ACL):
Zheng Zhang and Pierre Zweigenbaum. 2018. GNEG: Graph-Based Negative Sampling for word2vec. In Proceedings of the 56th Annual Meeting of the Association for Computational Linguistics (Volume 2: Short Papers), pages 566–571, Melbourne, Australia. Association for Computational Linguistics.
Cite (Informal):
GNEG: Graph-Based Negative Sampling for word2vec (Zhang & Zweigenbaum, ACL 2018)
PDF:
https://aclanthology.org/P18-2090.pdf
Poster:
 P18-2090.Poster.pdf