Simple Queries as Distant Labels for Predicting Gender on Twitter

Chris Emmery, Grzegorz Chrupała, Walter Daelemans


Abstract
The majority of research on extracting missing user attributes from social media profiles uses costly hand-annotated labels for supervised learning. Distantly supervised methods exist, although these generally rely on knowledge gathered from external sources. This paper demonstrates the effectiveness of gathering distant labels for self-reported gender on Twitter using simple queries. We confirm the reliability of this query heuristic by comparing it with manual annotation. Moreover, using these labels for distant supervision, we demonstrate competitive model performance on the same data as models trained on manual annotations. As such, we offer a cheap, extensible, and fast alternative that can be employed beyond the task of gender classification.
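The query heuristic described in the abstract can be illustrated with a minimal sketch. It assumes that a "simple query" is a self-report phrase such as "I'm a girl". The pattern list, the label names "F" and "M", and the function name distant_label are hypothetical and chosen for illustration only; the abstract does not specify the actual queries or the filtering rules.

```python
import re

# Hypothetical self-report patterns (an assumption, not the paper's query set).
SELF_REPORT = re.compile(
    r"\bi(?:'m| am) a (girl|woman|female|boy|man|male)\b",
    re.IGNORECASE,
)

# Map each matched self-report term to an illustrative label.
LABEL_OF = {
    "girl": "F", "woman": "F", "female": "F",
    "boy": "M", "man": "M", "male": "M",
}


def distant_label(tweets):
    """Return 'F' or 'M' when all self-reports in a user's tweets agree.

    Returns None when no self-report is found or when reports conflict,
    so that ambiguous users can be left out of the training set.
    """
    labels = set()
    for text in tweets:
        for match in SELF_REPORT.finditer(text):
            labels.add(LABEL_OF[match.group(1).lower()])
    return labels.pop() if len(labels) == 1 else None
```

Users for whom the function returns None would simply be skipped. Labels produced this way could then be compared against a manually annotated sample to estimate how reliable the heuristic is, in the spirit of the check the abstract mentions.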
Anthology ID:
W17-4407
Volume:
Proceedings of the 3rd Workshop on Noisy User-generated Text
Month:
September
Year:
2017
Address:
Copenhagen, Denmark
Editors:
Leon Derczynski, Wei Xu, Alan Ritter, Tim Baldwin
Venue:
WNUT
Publisher:
Association for Computational Linguistics
Pages:
50–55
URL:
https://aclanthology.org/W17-4407
DOI:
10.18653/v1/W17-4407
Cite (ACL):
Chris Emmery, Grzegorz Chrupała, and Walter Daelemans. 2017. Simple Queries as Distant Labels for Predicting Gender on Twitter. In Proceedings of the 3rd Workshop on Noisy User-generated Text, pages 50–55, Copenhagen, Denmark. Association for Computational Linguistics.
Cite (Informal):
Simple Queries as Distant Labels for Predicting Gender on Twitter (Emmery et al., WNUT 2017)
PDF:
https://aclanthology.org/W17-4407.pdf