Spotting Spurious Data with Neural Networks

Hadi Amiri, Timothy Miller, Guergana Savova


Abstract
Automatic identification of spurious instances (those with potentially wrong labels in datasets) can improve the quality of existing language resources, especially when annotations are obtained through crowdsourcing or automatically generated based on coded rankings. In this paper, we present effective approaches inspired by queueing theory and psychology of learning to automatically identify spurious instances in datasets. Our approaches discriminate instances based on their “difficulty to learn,” determined by a downstream learner. Our methods can be applied to any dataset, provided that a neural network model exists for the dataset's target task. Our best approach outperforms competing state-of-the-art baselines, achieving a mean average precision (MAP) of 0.85 in identifying spurious instances in synthetic datasets and 0.22 in carefully crowdsourced real-world datasets.
Anthology ID:
N18-1182
Volume:
Proceedings of the 2018 Conference of the North American Chapter of the Association for Computational Linguistics: Human Language Technologies, Volume 1 (Long Papers)
Month:
June
Year:
2018
Address:
New Orleans, Louisiana
Editors:
Marilyn Walker, Heng Ji, Amanda Stent
Venue:
NAACL
Publisher:
Association for Computational Linguistics
Pages:
2006–2016
URL:
https://aclanthology.org/N18-1182
DOI:
10.18653/v1/N18-1182
Cite (ACL):
Hadi Amiri, Timothy Miller, and Guergana Savova. 2018. Spotting Spurious Data with Neural Networks. In Proceedings of the 2018 Conference of the North American Chapter of the Association for Computational Linguistics: Human Language Technologies, Volume 1 (Long Papers), pages 2006–2016, New Orleans, Louisiana. Association for Computational Linguistics.
Cite (Informal):
Spotting Spurious Data with Neural Networks (Amiri et al., NAACL 2018)
PDF:
https://aclanthology.org/N18-1182.pdf
Data:
CIFAR-10