Transcribing Lines of Handwritten Text Using TrOCR: An Encoder-Decoder Model Based on Pre-Trained Image and Text Transformers
Natalia Bottaioli, Daniel Parres, Yung-Hsin Chen
⚠ This is a preprint. It may change before it is accepted for publication.

Abstract

This article analyzes several aspects of the handwritten text recognition (HTR) models belonging to the TrOCR family introduced by Minghao Li et al. in [TrOCR: Transformer-based Optical Character Recognition with Pre-trained Models, AAAI Conference on Artificial Intelligence, 2023]. The TrOCR models are designed to recognize single lines of English text using a transformer-based encoder-decoder architecture. All models incorporate a pre-trained vision transformer as the encoder and a pre-trained text transformer as the decoder. The encoder is responsible for extracting key features from the image, while the decoder autoregressively transcribes the text, subword by subword, based on the extracted features. The authors report state-of-the-art performance across different text types, including handwritten, scene, and printed text. Our analysis has several objectives. The first is to gain a better understanding of the training process and the data used to produce the handwritten models. The second is to highlight and explore the functionality and limitations of the TrOCR model in the context of HTR, which poses unique challenges such as variations in individual writing styles. Additionally, we propose an architecture diagram that helped us better understand what the model actually does with an input text-line image, which we hope will be useful for the research community using TrOCR.
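The autoregressive loop described above (encode the image once, then generate one subword at a time conditioned on the features and the tokens emitted so far) can be sketched as follows. This is a toy illustration only, not the TrOCR implementation: the `encode` and `decode_step` stand-ins and the token ids are invented for the example, replacing the real ViT encoder and transformer decoder.

```python
EOS = 0  # hypothetical end-of-sequence token id for this toy example

def encode(image):
    # Stand-in for the vision-transformer encoder: in TrOCR this maps
    # the text-line image to a sequence of feature vectors; here we
    # reduce a list of pixel values to a single toy "feature" integer.
    return sum(image) % 7

def decode_step(features, prefix):
    # Stand-in for the text decoder: predicts the next subword id from
    # the encoder features and the previously generated tokens.
    return (features + len(prefix)) % 5

def transcribe(image, max_len=10):
    # Greedy autoregressive decoding: encode once, then repeatedly ask
    # the decoder for the next token until EOS (or a length cap).
    features = encode(image)
    tokens = []
    for _ in range(max_len):
        token = decode_step(features, tokens)
        if token == EOS:
            break
        tokens.append(token)
    return tokens

print(transcribe([1, 2]))  # → [3, 4]
```

In practice, the published TrOCR checkpoints (e.g. `microsoft/trocr-base-handwritten`) can be run through the Hugging Face `transformers` library via `TrOCRProcessor` and `VisionEncoderDecoderModel`, where `model.generate(...)` performs this decoding loop and the processor maps subword ids back to text.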

This is an MLBriefs article; the source code has not been reviewed!
The original source code is available here (last checked 2026/04/23).

Download