DIMSIM: An Accurate Chinese Phonetic Similarity Algorithm Based on Learned High Dimensional Encoding

Min Li, Marina Danilevsky, Sara Noeman, Yunyao Li


Abstract
Phonetic similarity algorithms identify words and phrases with similar pronunciation and are used in many natural language processing tasks. However, existing approaches are designed mainly for Indo-European languages and fail to capture the unique properties of Chinese pronunciation. In this paper, we propose DIMSIM, a phonetic similarity algorithm for Chinese based on high dimensional encodings. The encodings are learned from annotated data and map initial and final phonemes separately into n-dimensional coordinates. The phonetic similarity of two Pinyin syllables is then calculated by aggregating the similarities of their initials, finals and tones. DIMSIM demonstrates a 7.5X improvement in mean reciprocal rank over state-of-the-art phonetic similarity approaches.
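As a rough illustration of the aggregation idea in the abstract, the sketch below places initials and finals in separate coordinate spaces and sums the initial, final and tone distances for a pair of Pinyin syllables. Everything concrete here is an assumption made for illustration: the 2-D coordinates are hand-picked toy values rather than the learned encodings, and the Euclidean distance, the unweighted sum, the `TONE_PENALTY` constant and the function names are hypothetical and not taken from the paper or from the System-T/DimSim code.

```python
import math

# Hypothetical toy coordinates, chosen by hand for illustration only.
# They are NOT the encodings learned in the paper.
INITIAL_COORDS = {
    "zh": (1.0, 0.0),
    "z": (1.2, 0.1),
    "sh": (3.0, 0.0),
    "s": (3.2, 0.1),
    "b": (10.0, 5.0),
}
FINAL_COORDS = {
    "an": (0.0, 1.0),
    "ang": (0.3, 1.0),
    "i": (5.0, 5.0),
}

# Assumed flat cost for a tone mismatch (hypothetical value).
TONE_PENALTY = 0.5


def component_distance(coords, a, b):
    """Euclidean distance between two phonemes in one coordinate space."""
    return math.dist(coords[a], coords[b])


def pinyin_distance(p, q):
    """Aggregate distance between two syllables given as (initial, final, tone).

    Smaller values mean the two syllables are treated as more similar.
    """
    (i1, f1, t1), (i2, f2, t2) = p, q
    d_initial = component_distance(INITIAL_COORDS, i1, i2)
    d_final = component_distance(FINAL_COORDS, f1, f2)
    d_tone = 0.0 if t1 == t2 else TONE_PENALTY
    # Assumption: a plain unweighted sum of the three components.
    return d_initial + d_final + d_tone


if __name__ == "__main__":
    zhan1 = ("zh", "an", 1)
    zang1 = ("z", "ang", 1)
    bi1 = ("b", "i", 1)
    print(pinyin_distance(zhan1, zang1))  # small: nearby initials and finals
    print(pinyin_distance(zhan1, bi1))    # large: distant initials and finals
```

With these toy values, syllables whose initials and finals sit close together in their respective spaces (such as zh/z and an/ang) come out nearer to each other than unrelated pairs, which is the behaviour the learned encodings are meant to capture.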
Anthology ID:
K18-1043
Volume:
Proceedings of the 22nd Conference on Computational Natural Language Learning
Month:
October
Year:
2018
Address:
Brussels, Belgium
Editors:
Anna Korhonen, Ivan Titov
Venue:
CoNLL
SIG:
SIGNLL
Publisher:
Association for Computational Linguistics
Pages:
444–453
URL:
https://aclanthology.org/K18-1043
DOI:
10.18653/v1/K18-1043
Cite (ACL):
Min Li, Marina Danilevsky, Sara Noeman, and Yunyao Li. 2018. DIMSIM: An Accurate Chinese Phonetic Similarity Algorithm Based on Learned High Dimensional Encoding. In Proceedings of the 22nd Conference on Computational Natural Language Learning, pages 444–453, Brussels, Belgium. Association for Computational Linguistics.
Cite (Informal):
DIMSIM: An Accurate Chinese Phonetic Similarity Algorithm Based on Learned High Dimensional Encoding (Li et al., CoNLL 2018)
PDF:
https://aclanthology.org/K18-1043.pdf
Code
 System-T/DimSim