Statistical Identification of English Loanwords in Korean Using Automatically Generated Training Data

Kirk Baker, Chris Brew


Abstract
This paper describes an accurate, extensible method for automatically classifying unknown foreign words that requires minimal monolingual resources and no bilingual training data (which is often difficult to obtain for an arbitrary language pair). We use a small set of phonologically-based transliteration rules to generate a potentially unlimited amount of pseudo-data that can be used to train a classifier to distinguish etymological classes of actual words. We ran a series of experiments on identifying English loanwords in Korean, in order to explore the consequences of using pseudo-data in place of the original training data. Results show that a sufficient quantity of automatically generated training data, even produced by fairly low precision transliteration rules, can be used to train a classifier that performs within 0.3% of one trained on actual English loanwords (96% accuracy).
Anthology ID:
L08-1095
Volume:
Proceedings of the Sixth International Conference on Language Resources and Evaluation (LREC'08)
Month:
May
Year:
2008
Address:
Marrakech, Morocco
Editors:
Nicoletta Calzolari, Khalid Choukri, Bente Maegaard, Joseph Mariani, Jan Odijk, Stelios Piperidis, Daniel Tapias
Venue:
LREC
SIG:
Publisher:
European Language Resources Association (ELRA)
Note:
Pages:
Language:
URL:
http://www.lrec-conf.org/proceedings/lrec2008/pdf/295_paper.pdf
DOI:
Bibkey:
Cite (ACL):
Kirk Baker and Chris Brew. 2008. Statistical Identification of English Loanwords in Korean Using Automatically Generated Training Data. In Proceedings of the Sixth International Conference on Language Resources and Evaluation (LREC'08), Marrakech, Morocco. European Language Resources Association (ELRA).
Cite (Informal):
Statistical Identification of English Loanwords in Korean Using Automatically Generated Training Data (Baker & Brew, LREC 2008)
Copy Citation:
PDF:
http://www.lrec-conf.org/proceedings/lrec2008/pdf/295_paper.pdf