Subword Encoding in Lattice LSTM for Chinese Word Segmentation

Jie Yang, Yue Zhang, Shuailong Liang


Abstract
We investigate subword information for Chinese word segmentation by integrating subword embeddings trained using byte-pair encoding into a Lattice LSTM (LaLSTM) network over a character sequence. Experiments on standard benchmarks show that subword information brings significant gains over strong character-based segmentation models. To our knowledge, this is the first study of the effectiveness of subwords for neural word segmentation.
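The sketch below illustrates the two ideas the abstract combines: learning subwords with byte-pair encoding, and matching those subwords against a character sequence to obtain the spans a lattice model could attach extra paths to. It is a toy illustration under stated assumptions, not the authors' implementation: the function names `learn_bpe` and `lattice_spans` are hypothetical, the merges are learned directly from raw strings rather than from a frequency-weighted word list, and no embeddings or LSTM cells are involved.

```python
from collections import Counter


def learn_bpe(corpus, num_merges):
    """Toy BPE: repeatedly merge the most frequent adjacent symbol pair.

    Hypothetical helper for illustration; returns the set of merged subwords.
    """
    seqs = [list(s) for s in corpus]  # start from single characters
    subwords = set()
    for _ in range(num_merges):
        pairs = Counter()
        for seq in seqs:
            for a, b in zip(seq, seq[1:]):
                pairs[(a, b)] += 1
        if not pairs:
            break  # nothing left to merge
        (a, b), _count = pairs.most_common(1)[0]
        merged = a + b
        subwords.add(merged)
        # Rewrite every sequence, replacing the chosen pair with one symbol.
        new_seqs = []
        for seq in seqs:
            out, i = [], 0
            while i < len(seq):
                if i + 1 < len(seq) and seq[i] == a and seq[i + 1] == b:
                    out.append(merged)
                    i += 2
                else:
                    out.append(seq[i])
                    i += 1
            new_seqs.append(out)
        seqs = new_seqs
    return subwords


def lattice_spans(sentence, subwords):
    """Return (start, end, subword) for every multi-character match.

    Each span marks where a subword path could be added on top of the
    character sequence; `end` is exclusive.
    """
    spans = []
    for start in range(len(sentence)):
        for end in range(start + 2, len(sentence) + 1):
            piece = sentence[start:end]
            if piece in subwords:
                spans.append((start, end, piece))
    return spans


if __name__ == "__main__":
    vocab = learn_bpe(["中国人民", "中国", "人民"], num_merges=2)
    print(lattice_spans("中国人民", vocab))
```

In a full model, each matched span would presumably be looked up in a pretrained subword embedding table and fed into the lattice network alongside the character inputs; that part is omitted here.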
Anthology ID:
N19-1278
Volume:
Proceedings of the 2019 Conference of the North American Chapter of the Association for Computational Linguistics: Human Language Technologies, Volume 1 (Long and Short Papers)
Month:
June
Year:
2019
Address:
Minneapolis, Minnesota
Editors:
Jill Burstein, Christy Doran, Thamar Solorio
Venue:
NAACL
Publisher:
Association for Computational Linguistics
Pages:
2720–2725
URL:
https://aclanthology.org/N19-1278
DOI:
10.18653/v1/N19-1278
Cite (ACL):
Jie Yang, Yue Zhang, and Shuailong Liang. 2019. Subword Encoding in Lattice LSTM for Chinese Word Segmentation. In Proceedings of the 2019 Conference of the North American Chapter of the Association for Computational Linguistics: Human Language Technologies, Volume 1 (Long and Short Papers), pages 2720–2725, Minneapolis, Minnesota. Association for Computational Linguistics.
Cite (Informal):
Subword Encoding in Lattice LSTM for Chinese Word Segmentation (Yang et al., NAACL 2019)
PDF:
https://aclanthology.org/N19-1278.pdf
Supplementary:
N19-1278.Supplementary.pdf
Code:
jiesutd/SubwordEncoding-CWS