Interpreting Neural Network Hate Speech Classifiers

Cindy Wang


Abstract
Deep neural networks have been applied to hate speech detection with apparent success, but they have limited practical applicability without transparency into the predictions they make. In this paper, we perform several experiments to visualize and understand a state-of-the-art neural network classifier for hate speech (Zhang et al., 2018). We adapt techniques from computer vision to visualize sensitive regions of the input stimuli and identify the features learned by individual neurons. We also introduce a method to discover the keywords that are most predictive of hate speech. Our analyses explain the aspects of neural networks that work well and point out areas for further improvement.
Anthology ID:
W18-5111
Volume:
Proceedings of the 2nd Workshop on Abusive Language Online (ALW2)
Month:
October
Year:
2018
Address:
Brussels, Belgium
Editors:
Darja Fišer, Ruihong Huang, Vinodkumar Prabhakaran, Rob Voigt, Zeerak Waseem, Jacqueline Wernimont
Venue:
ALW
Publisher:
Association for Computational Linguistics
Pages:
86–92
URL:
https://aclanthology.org/W18-5111
DOI:
10.18653/v1/W18-5111
Cite (ACL):
Cindy Wang. 2018. Interpreting Neural Network Hate Speech Classifiers. In Proceedings of the 2nd Workshop on Abusive Language Online (ALW2), pages 86–92, Brussels, Belgium. Association for Computational Linguistics.
Cite (Informal):
Interpreting Neural Network Hate Speech Classifiers (Wang, ALW 2018)
PDF:
https://aclanthology.org/W18-5111.pdf