Learning Condensed Feature Representations from Large Unsupervised Data Sets for Supervised Learning

Jun Suzuki,  Hideki Isozaki,  Masaaki Nagata
NTT CS Lab.


Abstract

This paper proposes a novel approach for effectively utilizing unsupervised data in addition to supervised data for supervised learning. We use unsupervised data to generate informative 'condensed feature representations' from the original feature set used in supervised NLP systems. The main contribution of our method is that it offers dense, low-dimensional feature spaces for NLP tasks while maintaining the state-of-the-art performance of a recently developed high-performance semi-supervised learning technique. Our method matches the results of current state-of-the-art systems with very few features: an F-score of 90.72 with 344 features on the CoNLL-2003 NER data, and a UAS of 93.55 with 12.5K features on dependency parsing data derived from PTB-III.




Full paper: http://www.aclweb.org/anthology/P/P11/P11-2112.pdf