Multi-Input Attention for Unsupervised OCR Correction

Rui Dong, David Smith

Abstract
We propose a novel approach to OCR post-correction that exploits repeated texts in large corpora both as a source of noisy target outputs for unsupervised training and as a source of evidence when decoding. A sequence-to-sequence model with attention is applied for single-input correction, and a new decoder with multi-input attention averaging is developed to search for consensus among multiple sequences. We design two ways of training the correction model without human annotation, either training to match noisily observed textual variants or bootstrapping from a uniform error model. On two corpora of historical newspapers and books, we show that these unsupervised techniques cut the character and word error rates nearly in half on single inputs and, with the addition of multi-input decoding, can rival supervised methods.
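The multi-input decoder described in the abstract can be pictured as attending over each duplicated witness of a passage separately and then averaging the resulting context vectors, so that evidence shared across noisy OCR outputs is reinforced at each decoding step. Below is a minimal NumPy sketch of that attention-averaging step, assuming simple dot-product attention over pre-computed encoder states; the function names (attention_context, multi_input_context) and the toy dimensions are illustrative, not taken from the authors' implementation.

import numpy as np

def softmax(x):
    # Numerically stable softmax over the last axis.
    e = np.exp(x - x.max(axis=-1, keepdims=True))
    return e / e.sum(axis=-1, keepdims=True)

def attention_context(dec_state, enc_states):
    # Dot-product attention of one decoder state over one input's
    # encoder hidden states; returns a weighted context vector.
    scores = enc_states @ dec_state      # (T,)
    weights = softmax(scores)            # (T,)
    return weights @ enc_states          # (H,)

def multi_input_context(dec_state, enc_states_list):
    # Attention averaging across several noisy OCR witnesses:
    # attend to each input independently, then average the
    # per-input context vectors to favor consensus evidence.
    contexts = [attention_context(dec_state, e) for e in enc_states_list]
    return np.mean(contexts, axis=0)

# Toy usage: three duplicated witnesses of the same passage,
# each encoded to a (time, hidden) matrix of states.
rng = np.random.default_rng(0)
H = 8
witnesses = [rng.normal(size=(t, H)) for t in (5, 6, 4)]
dec_state = rng.normal(size=H)
print(multi_input_context(dec_state, witnesses).shape)  # (8,)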
Anthology ID:
P18-1220
Volume:
Proceedings of the 56th Annual Meeting of the Association for Computational Linguistics (Volume 1: Long Papers)
Month:
July
Year:
2018
Address:
Melbourne, Australia
Editors:
Iryna Gurevych, Yusuke Miyao
Venue:
ACL
Publisher:
Association for Computational Linguistics
Pages:
2363–2372
URL:
https://aclanthology.org/P18-1220
DOI:
10.18653/v1/P18-1220
Cite (ACL):
Rui Dong and David Smith. 2018. Multi-Input Attention for Unsupervised OCR Correction. In Proceedings of the 56th Annual Meeting of the Association for Computational Linguistics (Volume 1: Long Papers), pages 2363–2372, Melbourne, Australia. Association for Computational Linguistics.
Cite (Informal):
Multi-Input Attention for Unsupervised OCR Correction (Dong & Smith, ACL 2018)
PDF:
https://aclanthology.org/P18-1220.pdf
Poster:
https://aclanthology.org/P18-1220.Poster.pdf
Data
New York Times Annotated Corpus