HeLI, a Word-Based Backoff Method for Language Identification

Tommi Jauhiainen, Krister Lindén, Heidi Jauhiainen


Abstract
In this paper we describe the Helsinki language identification method, HeLI, and the resources we created for and used in the 3rd edition of the Discriminating between Similar Languages (DSL) shared task, which was organized as part of the VarDial 2016 workshop. The shared task comprised a total of 8 tracks, of which we participated in 7. The shared task had a record number of participants, with 17 teams providing results for the closed track of test set A. Our system reached the 2nd position in 4 tracks (A closed and open, B1 open and B2 open), and in this paper we focus on the methods and data used for those tracks. We describe our word-based backoff method in mathematical notation. We also describe how we selected the corpus we used in the open tracks.
Anthology ID:
W16-4820
Volume:
Proceedings of the Third Workshop on NLP for Similar Languages, Varieties and Dialects (VarDial3)
Month:
December
Year:
2016
Address:
Osaka, Japan
Editors:
Preslav Nakov, Marcos Zampieri, Liling Tan, Nikola Ljubešić, Jörg Tiedemann, Shervin Malmasi
Venue:
VarDial
Publisher:
The COLING 2016 Organizing Committee
Pages:
153–162
URL:
https://aclanthology.org/W16-4820
Cite (ACL):
Tommi Jauhiainen, Krister Lindén, and Heidi Jauhiainen. 2016. HeLI, a Word-Based Backoff Method for Language Identification. In Proceedings of the Third Workshop on NLP for Similar Languages, Varieties and Dialects (VarDial3), pages 153–162, Osaka, Japan. The COLING 2016 Organizing Committee.
Cite (Informal):
HeLI, a Word-Based Backoff Method for Language Identification (Jauhiainen et al., VarDial 2016)
PDF:
https://aclanthology.org/W16-4820.pdf
Code:
tosaja/HeLI