Docria: Processing and Storing Linguistic Data with Wikipedia

Marcus Klang, Pierre Nugues


Abstract
The availability of user-generated content has increased significantly over time. Wikipedia is one example of a corpus that spans a huge range of topics and is freely available. Storing and processing such corpora requires flexible document models, as they may contain malicious and incorrect data. Docria is a library that attempts to address this issue by providing a solution that can be used with small to large corpora, from laptops using Python interactively in a Jupyter notebook to clusters running map-reduce frameworks with optimized compiled code. Docria is available as open-source code.
Anthology ID: W19-6148
Volume: Proceedings of the 22nd Nordic Conference on Computational Linguistics
Month: September–October
Year: 2019
Address: Turku, Finland
Editors: Mareike Hartmann, Barbara Plank
Venue: NoDaLiDa
Publisher: Linköping University Electronic Press
Pages: 400–405
URL: https://aclanthology.org/W19-6148
Cite (ACL): Marcus Klang and Pierre Nugues. 2019. Docria: Processing and Storing Linguistic Data with Wikipedia. In Proceedings of the 22nd Nordic Conference on Computational Linguistics, pages 400–405, Turku, Finland. Linköping University Electronic Press.
Cite (Informal): Docria: Processing and Storing Linguistic Data with Wikipedia (Klang & Nugues, NoDaLiDa 2019)
PDF: https://aclanthology.org/W19-6148.pdf