Using Semantics for Granularities of Tokenization

Martin Riedl, Chris Biemann


Abstract
Depending on downstream applications, it is advisable to extend the notion of tokenization from low-level character-based token boundary detection to the identification of meaningful and useful language units. This entails both identifying units composed of several single words that form a multiword expression (MWE), and splitting single-word compounds into their meaningful parts. In this article, we introduce unsupervised and knowledge-free methods for these two tasks. The main novelty of our research is that both methods are primarily based on distributional similarity, of which we use two flavors: a sparse count-based and a dense neural-based distributional semantic model. First, we introduce DRUID, a method for detecting MWEs. The evaluation on MWE-annotated data sets in two languages, and on newly extracted evaluation data sets for 32 languages, shows that DRUID compares favorably to previous methods that do not utilize distributional information. Second, we present SECOS, an algorithm for decompounding closed compounds. In an evaluation on four dedicated decompounding data sets across four languages and on data sets extracted from Wiktionary for 14 languages, we demonstrate the superiority of our approach over unsupervised baselines, sometimes even matching the performance of previous language-specific and supervised methods. In a final experiment, we show how both decompounding and MWE information can be used in information retrieval. Here, we obtain the best results when combining word information with MWEs and compound parts in a bag-of-words retrieval set-up. Overall, our methodology paves the way to the automatic detection of lexical units beyond standard tokenization techniques, without language-specific preprocessing steps such as POS tagging.
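To illustrate the bag-of-words retrieval set-up mentioned in the abstract, here is a minimal sketch, not the authors' implementation: the index terms for a document are its original words, plus detected MWEs, plus the parts of split compounds. The functions `detect_mwes` and `decompound` are hypothetical stand-ins for DRUID and SECOS.

```python
from typing import Callable, List

def expand_bag_of_words(
    tokens: List[str],
    detect_mwes: Callable[[List[str]], List[str]],
    decompound: Callable[[str], List[str]],
) -> List[str]:
    """Return an expanded term list for bag-of-words indexing:
    original words + MWEs + compound parts."""
    terms = list(tokens)                    # keep the original word tokens
    terms.extend(detect_mwes(tokens))       # add detected MWEs, e.g. "hot_dog"
    for token in tokens:
        parts = decompound(token)           # split closed compounds into parts
        if len(parts) > 1:
            terms.extend(parts)             # index parts alongside the full word
    return terms

if __name__ == "__main__":
    # Trivial stand-in detectors for demonstration only.
    toks = ["Bundesfinanzministerium", "Reform"]
    mwe_stub = lambda ts: []                # DRUID would supply MWEs here
    split_stub = lambda t: (["Bundes", "finanz", "ministerium"]
                            if t == "Bundesfinanzministerium" else [t])
    print(expand_bag_of_words(toks, mwe_stub, split_stub))
    # ['Bundesfinanzministerium', 'Reform', 'Bundes', 'finanz', 'ministerium']
```

Keeping the full word next to its parts, rather than replacing it, is what the abstract's best-performing combination suggests: queries can then match either the compound or any of its constituents.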
Anthology ID:
J18-3005
Volume:
Computational Linguistics, Volume 44, Issue 3 - September 2018
Month:
September
Year:
2018
Address:
Cambridge, MA
Venue:
CL
Publisher:
MIT Press
Pages:
483–524
URL:
https://aclanthology.org/J18-3005
DOI:
10.1162/coli_a_00325
Cite (ACL):
Martin Riedl and Chris Biemann. 2018. Using Semantics for Granularities of Tokenization. Computational Linguistics, 44(3):483–524.
Cite (Informal):
Using Semantics for Granularities of Tokenization (Riedl & Biemann, CL 2018)
PDF:
https://aclanthology.org/J18-3005.pdf
Data
GENIA