Comparison of Representations of Named Entities for Document Classification

Lidia Pivovarova, Roman Yangarber


Abstract
We explore representations for multi-word names in text classification tasks, using Reuters (RCV1) topic and sector classification. We find that: the best way to treat names is to split them into tokens and use each token as a separate feature; named entities (NEs) have more impact on sector classification than on topic classification; replacing NEs with entity types is not an effective strategy; representing tokens by different embeddings for proper names vs. common nouns does not improve results. We highlight the improvements over state-of-the-art results that our CNN models yield.
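
To make the compared strategies concrete, the following is a minimal sketch of three ways a multi-word name could be turned into features: split into tokens, joined into a single feature, or replaced by its entity type. It is not the authors' code. The function `represent`, the span format, the mode names, and the example sentence are all assumptions made for illustration only.

```python
from typing import List, Tuple

# Hypothetical span format: (start index, end index exclusive, entity type).
Span = Tuple[int, int, str]

MODES = ("split", "joined", "type")


def represent(tokens: List[str], spans: List[Span], mode: str) -> List[str]:
    """Return a feature sequence under one name-representation strategy.

    'split'  keeps every token of a name as a separate feature.
    'joined' merges a multi-word name into a single feature.
    'type'   replaces the whole name with its entity type.
    """
    if mode not in MODES:
        raise ValueError(f"unknown mode: {mode!r}")

    # Map each span start to its end and type for quick lookup.
    starts = {start: (end, etype) for start, end, etype in spans}

    features: List[str] = []
    i = 0
    while i < len(tokens):
        if mode != "split" and i in starts:
            end, etype = starts[i]
            if mode == "joined":
                features.append("_".join(tokens[i:end]))
            else:  # mode == "type"
                features.append(f"<{etype}>")
            i = end  # skip past the whole name
        else:
            features.append(tokens[i])
            i += 1
    return features


# Illustrative input (invented, not taken from RCV1).
tokens = ["Deutsche", "Bank", "cut", "rates"]
spans = [(0, 2, "ORG")]
for mode in MODES:
    print(mode, represent(tokens, spans, mode))
```

In this sketch the 'split' mode simply leaves the token sequence unchanged, which corresponds to the strategy the abstract reports as most effective.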
Anthology ID:
W18-3008
Volume:
Proceedings of the Third Workshop on Representation Learning for NLP
Month:
July
Year:
2018
Address:
Melbourne, Australia
Editors:
Isabelle Augenstein, Kris Cao, He He, Felix Hill, Spandana Gella, Jamie Kiros, Hongyuan Mei, Dipendra Misra
Venue:
RepL4NLP
SIG:
SIGREP
Publisher:
Association for Computational Linguistics
Pages:
64–68
URL:
https://aclanthology.org/W18-3008
DOI:
10.18653/v1/W18-3008
Cite (ACL):
Lidia Pivovarova and Roman Yangarber. 2018. Comparison of Representations of Named Entities for Document Classification. In Proceedings of the Third Workshop on Representation Learning for NLP, pages 64–68, Melbourne, Australia. Association for Computational Linguistics.
Cite (Informal):
Comparison of Representations of Named Entities for Document Classification (Pivovarova & Yangarber, RepL4NLP 2018)
PDF:
https://aclanthology.org/W18-3008.pdf
Data
RCV1