Tonghai Jiang


2018

Toward Better Loanword Identification in Uyghur Using Cross-lingual Word Embeddings
Chenggang Mi | Yating Yang | Lei Wang | Xi Zhou | Tonghai Jiang
Proceedings of the 27th International Conference on Computational Linguistics

To enrich the vocabulary in low-resource settings, we propose a novel method that identifies loanwords in monolingual corpora. More specifically, we first use cross-lingual word embeddings as the core feature to generate semantically related candidates based on comparable corpora and a small bilingual lexicon; then, a log-linear model combines several shallow features, such as pronunciation similarity and hybrid language model features, to predict the final results. In this paper, we use Uyghur as the recipient language and try to detect loanwords from four donor languages: Arabic, Chinese, Persian and Russian. We conduct two groups of experiments to evaluate the effectiveness of our proposed approach: loanword identification and OOV translation in four language pairs and eight translation directions (Uyghur-Arabic, Arabic-Uyghur, Uyghur-Chinese, Chinese-Uyghur, Uyghur-Persian, Persian-Uyghur, Uyghur-Russian, and Russian-Uyghur). Experimental results on loanword identification show that our method significantly outperforms the baseline models. Neural machine translation models that integrate the loanword identification results achieve the best results on OOV translation (with improvements of 0.5-0.9 BLEU).
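
The two-stage idea in this abstract (embedding-based candidate generation, then a log-linear combination of shallow features) can be illustrated with a minimal sketch. Everything below is an assumption made for illustration and is not taken from the paper: the function names, the toy two-dimensional vectors, the romanized toy words, the use of normalized edit distance as a stand-in for pronunciation similarity, and the unit feature weights. The hybrid language model features are omitted.

```python
import math

def cosine(u, v):
    """Cosine similarity between two equal-length vectors."""
    dot = sum(a * b for a, b in zip(u, v))
    norm = math.sqrt(sum(a * a for a in u)) * math.sqrt(sum(b * b for b in v))
    return dot / norm if norm else 0.0

def edit_distance(s, t):
    """Levenshtein distance computed row by row."""
    prev = list(range(len(t) + 1))
    for i, cs in enumerate(s, 1):
        cur = [i]
        for j, ct in enumerate(t, 1):
            cur.append(min(prev[j] + 1, cur[j - 1] + 1, prev[j - 1] + (cs != ct)))
        prev = cur
    return prev[-1]

def pronunciation_similarity(s, t):
    """Assumed stand-in: 1 minus the length-normalized edit distance."""
    longest = max(len(s), len(t))
    return 1.0 - edit_distance(s, t) / longest if longest else 1.0

def top_candidates(word_vec, donor_vecs, k):
    """Stage 1: the k donor words closest to the word in a shared embedding space."""
    ranked = sorted(donor_vecs, key=lambda w: cosine(word_vec, donor_vecs[w]), reverse=True)
    return ranked[:k]

def loglinear_score(features, weights):
    """Stage 2: weighted sum of feature values (the unnormalized log-linear score)."""
    return sum(weights[name] * value for name, value in features.items())

def identify(word, word_vec, donor_vecs, weights, k=2, threshold=0.0):
    """Return (donor_word, score) for the best candidate, or None below the threshold."""
    best = None
    for cand in top_candidates(word_vec, donor_vecs, k):
        feats = {
            "embedding": cosine(word_vec, donor_vecs[cand]),
            "pronunciation": pronunciation_similarity(word, cand),
        }
        score = loglinear_score(feats, weights)
        if best is None or score > best[1]:
            best = (cand, score)
    return best if best is not None and best[1] >= threshold else None

# Toy data, invented for this sketch only.
DONOR_VECS = {"telefon": [0.9, 0.1], "telegraf": [0.7, 0.3], "kniga": [0.1, 0.9]}
WEIGHTS = {"embedding": 1.0, "pronunciation": 1.0}

result = identify("telefon", [1.0, 0.0], DONOR_VECS, WEIGHTS)
print(result[0])
```

In a real system the weights would be learned from annotated data rather than fixed by hand, and the threshold would be tuned on held-out examples.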

A Neural Network Based Model for Loanword Identification in Uyghur
Chenggang Mi | Yating Yang | Lei Wang | Xi Zhou | Tonghai Jiang
Proceedings of the Eleventh International Conference on Language Resources and Evaluation (LREC 2018)

2017

Log-linear Models for Uyghur Segmentation in Spoken Language Translation
Chenggang Mi | Yating Yang | Rui Dong | Xi Zhou | Lei Wang | Xiao Li | Tonghai Jiang
Proceedings of the International Conference Recent Advances in Natural Language Processing, RANLP 2017

To alleviate data sparsity in spoken Uyghur machine translation, we propose a log-linear morphological segmentation approach. Instead of learning the model only from a monolingual annotated corpus, this approach optimizes Uyghur segmentation for spoken language translation using both bilingual and monolingual corpora. Our approach relies on several features, such as a traditional conditional random field (CRF) feature, a bilingual word alignment feature, and a monolingual suffix-word co-occurrence feature. Experimental results show that our proposed segmentation model for Uyghur spoken language translation achieves a 1.6 BLEU improvement over the state-of-the-art baseline.
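
As a rough illustration of how a log-linear model can choose among candidate segmentations, the sketch below normalizes weighted feature sums into probabilities. The feature names mirror the three feature types named in the abstract, but the feature values, the weights, the candidate segmentations, and the function name `loglinear_probs` are all hypothetical and are not drawn from the paper.

```python
import math

def loglinear_probs(candidates, weights):
    """Map (segmentation, features) pairs to normalized log-linear probabilities."""
    scores = [sum(weights[n] * v for n, v in feats.items()) for _, feats in candidates]
    top = max(scores)  # subtract the maximum for numerical stability
    exps = [math.exp(s - top) for s in scores]
    total = sum(exps)
    return [(seg, e / total) for (seg, _), e in zip(candidates, exps)]

# Hypothetical feature values for three candidate segmentations of one word.
CANDIDATES = [
    ("kitab+lar", {"crf": -0.2, "alignment": 0.8, "cooccurrence": 0.6}),
    ("kita+blar", {"crf": -1.5, "alignment": 0.1, "cooccurrence": 0.05}),
    ("kitablar", {"crf": -0.9, "alignment": 0.4, "cooccurrence": 0.0}),
]
WEIGHTS = {"crf": 1.0, "alignment": 1.0, "cooccurrence": 1.0}  # assumed, not learned

probs = loglinear_probs(CANDIDATES, WEIGHTS)
best_segmentation = max(probs, key=lambda pair: pair[1])[0]
print(best_segmentation)
```

The segmentation with the highest probability is selected; in practice the weights would be tuned, for example against translation quality on a development set.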

2016

Recurrent Neural Network Based Loanwords Identification in Uyghur
Chenggang Mi | Yating Yang | Xi Zhou | Lei Wang | Xiao Li | Tonghai Jiang
Proceedings of the 30th Pacific Asia Conference on Language, Information and Computation: Oral Papers