A Twitter Corpus and Benchmark Resources for German Sentiment Analysis

Mark Cieliebak, Jan Milan Deriu, Dominic Egger, Fatih Uzdilli


Abstract
In this paper we present SB10k, a new corpus for sentiment analysis containing approximately 10,000 German tweets. We use this new corpus and two existing corpora to provide state-of-the-art benchmarks for sentiment analysis in German: we implement a CNN (based on the winning system of SemEval-2016) and a feature-based SVM, and compare their performance on all three corpora. For the CNN, we also created German word embeddings trained on 300M tweets. These word embeddings were then optimized for sentiment analysis using distant-supervised learning. The new corpus, the German word embeddings (plain and optimized), and source code to re-run the benchmarks are publicly available.
Anthology ID:
W17-1106
Volume:
Proceedings of the Fifth International Workshop on Natural Language Processing for Social Media
Month:
April
Year:
2017
Address:
Valencia, Spain
Editors:
Lun-Wei Ku, Cheng-Te Li
Venue:
SocialNLP
Publisher:
Association for Computational Linguistics
Pages:
45–51
URL:
https://aclanthology.org/W17-1106
DOI:
10.18653/v1/W17-1106
Cite (ACL):
Mark Cieliebak, Jan Milan Deriu, Dominic Egger, and Fatih Uzdilli. 2017. A Twitter Corpus and Benchmark Resources for German Sentiment Analysis. In Proceedings of the Fifth International Workshop on Natural Language Processing for Social Media, pages 45–51, Valencia, Spain. Association for Computational Linguistics.
Cite (Informal):
A Twitter Corpus and Benchmark Resources for German Sentiment Analysis (Cieliebak et al., SocialNLP 2017)
PDF:
https://aclanthology.org/W17-1106.pdf
Data
SB10k