1 Introduction

Handwritten Text Recognition (HTR) is an important field at the intersection of computer vision and natural language processing. It involves the challenging task of converting handwritten text into machine-readable form, with applications ranging from digitizing historical documents to enabling real-time transcription in note-taking applications [1, 2].

The evolution of deep learning models in recent years has led to remarkable advancements in HTR. However, one prevailing trend is the ever-expanding size of these models, a phenomenon that is not limited to HTR but is pervasive across machine learning domains [3, 4]. Although model size typically affects processing time, the size itself may be of paramount concern in certain scenarios. Web browsers, for instance, impose strict limits on memory usage and often require downloading the model assets at initialization time. In such cases, the weight of the model is not merely a matter of optimization but a critical determinant of the feasibility of deploying HTR systems within these constraints. Despite this, most research on model efficiency has focused primarily on reducing training or inference time, often at the expense of model size [5]. For example, the Mixture-of-Experts approach reduces inference cost at the expense of increasing the size of the model [6].

Our objective is to contribute to striking a balance between model size and performance, addressing the challenge of reducing model size in the context of HTR and exploring avenues that could make applications more efficient in this regard. To this end, this work introduces the application of Recursive Neural Networks (RecNN) to HTR. A RecNN is a neural model that reuses the same weights in consecutive layers in order to allow greater depth without an increase in memory size [7]. This approach places particular emphasis on reducing model size while keeping other performance indicators (such as accuracy or inference time) fairly constant.

This paper makes the following contributions:

  1. To the best of the authors’ knowledge, this is the first time that RecNN has been used in the context of HTR.

  2. Since state-of-the-art HTR models typically consist of both graphical and language modeling layers, we provide a complete architectural design of where and how to apply recursion in the neural network.

  3. We introduce the concept of weight scaling in the recursion, which allows greater expressiveness of the recursive network with a minimal impact on its size.

  4. We have carried out exhaustive experiments on several typical HTR benchmarks, contrasting the two critical components focused on in this work (accuracy and model size), along with ablation studies on the individualized impact of the different options.

The rest of the paper is structured as follows: in Sect. 2 we provide the necessary background to understand the starting point of this work; Sect. 3 introduces and describes our methodology; the experiments are presented in Sect. 4; a detailed analysis of results is reported in Sect. 5; finally, Sect. 6 concludes the present work.

2 Background

The field of machine learning has, in recent years, made significant progress thanks to deep neural networks. These models comprise deeper and more complex systems that increase their overall capacity to learn intricate data. While they have become the de facto architectures in academic research, using them in real-world scenarios, especially on resource-constrained platforms such as mobile phones and web browsers, can be challenging.

Researchers have spent several years developing more efficient models, first for the ubiquitous Convolutional Neural Networks (CNN) [8, 9] and, more recently, for Transformers [10,11,12]. These efforts have primarily focused on making models faster at inference, although this often also leads to implicit reductions in model size. However, in scenarios with limited memory and bandwidth, such as web browsers, the size of the model becomes the critical indicator, meaning that techniques with which to compress the model are essential. Approaches such as knowledge distillation to smaller models [13], model pruning [14], and quantization [15] have been used for this purpose.

In this work, we explore the Recursive Neural Networks (RecNN) approach [7]. This approach builds upon the observation that many modern deep learning models are overly complex, and proposes that consecutive layers apply the same function to the input, i.e., that the weights be shared between them. RecNN thus make it possible to significantly reduce model size while keeping the depth and capabilities of deep networks almost intact. Interestingly, the idea is versatile and can be applied to different types of architectures. These networks fit seamlessly into CNN [16] or Transformer [17] architectures, making it possible to replace various blocks with repetitions of the same block. Furthermore, recursion can be applied dynamically in specific layers using techniques such as dynamic gates, which reduce the number of executions of certain layers, not only reducing the model size but also improving inference speed [16].

In this context, RecNN remain an untapped avenue in HTR. The objective of this paper is to provide insights into leveraging these recursive architectures for this task, exploring this approach in order to address the imperative of downsizing models for direct deployment on common user devices.

3 Methodology

This section commences by detailing the current methodology employed in HTR before formally introducing RecNN. We then describe our specific implementation of RecNN in the context of HTR.

3.1 Handwritten text recognition

Formally, let \(\mathcal {T} = \left\{ \left( x_{i},\textbf{z}_{i}\right) : x_{i}\in \mathcal {X},\;\textbf{z}_{i}\in \mathcal {Z}\right\} _{i=1}^{|\mathcal {T}|}\) represent a set of data, where sample \(x_{i}\) is drawn from the space of line-level text images \(\mathcal {X}\) and \(\textbf{z}_{i} = \left( z_{i1},z_{i2},\ldots ,z_{iM_{i}}\right) \) corresponds to its transcript in terms of a predefined set of character symbols. Note that \(\mathcal {Z} = \Sigma ^{*}\), where \(\Sigma \) represents the vocabulary.

Given a query \(x \in \mathcal {X}\), the task of line-level HTR can be formally defined as retrieving the most probable sequence of characters \(\hat{\textbf{z}}\) that satisfies:

$$\begin{aligned} \hat{\textbf{z}} = \arg \max _{\textbf{z} \in \mathcal {Z}} \text {P}( \textbf{z} \mid x ) \end{aligned}$$
(1)

To approximate Eq. 1, the state of the art typically resorts to the Convolutional Recurrent Neural Network (CRNN) [18]. A CRNN constitutes a particular neural architecture formed by an initial block of convolutional layers, which aim to learn an adequate set of features for the task, followed by a group of recurrent layers, which model the temporal dependencies among the features produced by the initial feature-learning block.

To attain a proper end-to-end scheme, the CRNN is trained using the Connectionist Temporal Classification (CTC) algorithm [19], which allows optimizing the weights of the neural network using unsegmented sequential data. This means that, for a given image \(x_{i}\in \mathcal {X}\), we only have its associated sequence of characters \(\textbf{z}_{i} \in \mathcal {Z}\) as its expected output, without any correspondence at pixel level or similar input–output alignment. Due to its particular training procedure, CTC requires the inclusion of an additional “blank” symbol within the \(\Sigma \) vocabulary, i.e., \(\Sigma ' = \Sigma \cup \left\{ {\textit{blank}}\right\} \).

The output of the CRNN can be seen as a posteriorgram, i.e., the probability of each \(\sigma \in \Sigma '\) appearing in each frame of the input image. Most commonly, the actual sequence prediction is obtained from this posteriorgram by using a greedy approach, which keeps the most probable symbol per step, merges repeated symbols in consecutive frames, and eventually removes the blank tokens.
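As an illustration of this decoding step, the following minimal sketch (not the authors' exact code) implements greedy decoding over a posteriorgram, assuming that index 0 is the blank symbol:

```python
import numpy as np

def greedy_ctc_decode(posteriorgram, blank=0):
    """Greedy CTC decoding: best symbol per frame, merge repeats, drop blanks.
    `posteriorgram` is a (frames, |Σ'|) array of per-frame probabilities."""
    best = posteriorgram.argmax(axis=1)             # most probable symbol per frame
    merged = [s for i, s in enumerate(best)         # merge repeated symbols
              if i == 0 or s != best[i - 1]]
    return [int(s) for s in merged if s != blank]   # remove blank tokens

# Toy example with 6 frames over Σ' = {blank, 'a', 'b'}
probs = np.array([[0.1, 0.8, 0.1],    # 'a'
                  [0.2, 0.7, 0.1],    # 'a' (repeated, merged)
                  [0.9, 0.05, 0.05],  # blank
                  [0.1, 0.1, 0.8],    # 'b'
                  [0.1, 0.1, 0.8],    # 'b' (repeated, merged)
                  [0.9, 0.05, 0.05]]) # blank
print(greedy_ctc_decode(probs))       # -> [1, 2], i.e., "ab"
```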

Aside from specific design adjustments (number of layers, regularization techniques, etc.), this architecture belongs to the state of the art in HTR, so it will be employed hereinafter for our study.

3.2 Recursive neural networks

RecNN, as employed within the scope of this research, are defined as neural network architectures characterized by the recursive application of certain components, thus enabling the iterative processing of input data, with the potential to replace specific layers or blocks with recursive calls. A neural network is generally referred to as a RecNN when it includes one or more recursive layers.

Generally speaking, let us define a neural network model as a series of layers \(\mathcal {M} = (L_1, L_2, \ldots , L_N)\). Each layer is, in turn, defined by a specific parametric function \(f_{\theta }\) with parameters (weights) \(\theta \). For the sake of simplicity, let us assume that the function is common throughout the network so the ith layer is simply defined by parameters \(\theta _i\). Then, given an input x, the output y computed by \(\mathcal {M}\) is:

$$\begin{aligned} y = f_{\theta _N} ( \cdots f_{\theta _2} ( f_{\theta _1} (x) ) ) \end{aligned}$$

Within this context, RecNN denotes a neural network in which one or more layers are recursively applied (see Fig. 1). There are two possible ways of using recursion, depending on the purpose:

  • To keep the depth while decreasing the number of parameters. In this case, one can substitute the \((i+1)\)th layer with a recursive use of the ith layer, so that \(f_{\theta _{i+1}} ( f_{\theta _i} ( \cdot ) )\) becomes \(f_{\theta _i} ( f_{\theta _i} ( \cdot ) )\).

  • To increase the depth while keeping the number of parameters. In this case, one can reuse the ith layer by applying the same function k more times, as \(f_{\theta _i} ( \cdots _k f_{\theta _i} ( \cdot ) )\) (see the sketch after this list).
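To make the idea concrete, the following minimal PyTorch-style sketch (our own illustration, not code from [7]) reuses a single layer k times, so that the effective depth grows while the parameter count stays constant:

```python
import torch
import torch.nn as nn

class RecursiveBlock(nn.Module):
    """Applies the same parametric layer k times: f_θ(f_θ(... f_θ(x)))."""
    def __init__(self, layer: nn.Module, k: int):
        super().__init__()
        self.layer = layer   # a single set of weights θ, stored once
        self.k = k           # recursion depth (number of applications)

    def forward(self, x: torch.Tensor) -> torch.Tensor:
        for _ in range(self.k):
            x = self.layer(x)  # the same weights are reused at every step
        return x

# Two stacked 128->128 linear layers vs. one layer applied recursively twice:
stacked   = nn.Sequential(nn.Linear(128, 128), nn.Linear(128, 128))  # 2 x (128*128 + 128) parameters
recursive = RecursiveBlock(nn.Linear(128, 128), k=2)                 # 1 x (128*128 + 128) parameters
```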

Fig. 1
figure 1

Graphical comparison of conventional layers and a recursive layer

RecNN can be trained using conventional methods without requiring modifications to classical learning algorithms like gradient descent and backpropagation. Therefore, the integration of recursive structures into neural networks does not disrupt established training frameworks, allowing for their seamless application.

Finally, it is important to emphasize that RecNN leverage recursive structures to iteratively apply operations to input data, while RNN are designed to process all elements of a sequence with the same weights. Although both ideas are based on similar concepts (shared parameters), they are complementary and can be applied concurrently. Specifically, one can apply recursion to the recurrent block of the CRNN (see Sect. 3.3.3), allowing for more recurrent layers without increasing memory requirements. This is particularly interesting as the recurrent block typically entails the highest number of parameters of the CRNN.

3.3 Recursion in handwritten text recognition

This section details how we implement recursion in an HTR model. Specifically, we propose three different means of including recursion in the CRNN architecture described above.

3.3.1 Recursive convolutional layer

Our first technique involves the utilization of recursive convolutional layers. We introduce the parameter \(\lambda _C\), which signifies the depth of recursion applied to the convolutional layers within a given block. We specifically organize the convolutional layers into groups of \(\lambda _C\) and replace each group with recursive calls of its initial convolutional layer. This process forms the basis of our investigation into the impact of recursion within the convolutional layers. In order to promote generalization and the reusability of recursive convolutional layers, we apply independent batch normalization in each recursive call. This approach has been shown to assist in the specialization of the convolutional kernels for each recursive call [16].
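A minimal sketch of this idea is given below, assuming a PyTorch-style block in which all layers of the group share the same number of channels (illustrative names and details, not our exact implementation):

```python
import torch
import torch.nn as nn

class RecursiveConvBlock(nn.Module):
    """One 3x3 convolution reused λ_C times; the kernel weights are shared,
    while each recursive call has its own BatchNorm, as suggested in [16]."""
    def __init__(self, channels: int, lambda_c: int):
        super().__init__()
        self.conv = nn.Conv2d(channels, channels, kernel_size=3, padding=1)
        self.bns = nn.ModuleList([nn.BatchNorm2d(channels) for _ in range(lambda_c)])
        self.act = nn.LeakyReLU()

    def forward(self, x: torch.Tensor) -> torch.Tensor:
        for bn in self.bns:                 # λ_C recursive calls of the same kernel
            x = self.act(bn(self.conv(x)))
        return x
```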

3.3.2 Convolutional kernel scaling

Our second strategy is kernel scaling for the convolutional layers. This technique entails the use of a single-value learnable parameter \(W^{\alpha }\), serving as a multiplier for the convolutional kernel of each layer (see Fig. 2). Our objective in using this scale value is to facilitate the specialization of the convolutional kernel at each depth of recursion so as to enhance the adaptability and expressiveness of the recursive convolutional layers. In this way, the recursive layers, even with the exact same parameters, can give rise to activations at different scales, which allows the parametric function to be slightly modified at the cost of only one additional weight.
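The following sketch illustrates one possible realization of this idea, assuming one learnable scalar per recursion depth that multiplies the shared kernel (an illustration, not our exact implementation):

```python
import torch
import torch.nn as nn
import torch.nn.functional as F

class ScaledRecursiveConv(nn.Module):
    """Recursive 3x3 convolution whose shared kernel is multiplied by a
    learnable scalar at each recursion depth (one extra weight per depth)."""
    def __init__(self, channels: int, lambda_c: int):
        super().__init__()
        self.conv = nn.Conv2d(channels, channels, kernel_size=3, padding=1)
        self.alpha = nn.Parameter(torch.ones(lambda_c))   # initialized to identity scaling

    def forward(self, x: torch.Tensor) -> torch.Tensor:
        for d in range(len(self.alpha)):
            # Same kernel, rescaled per depth: W' = α_d · W
            x = F.conv2d(x, self.alpha[d] * self.conv.weight, self.conv.bias, padding=1)
        return x
```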

Fig. 2
figure 2

Visualization of the kernel scaling technique

3.3.3 Recursive recurrent layers

As with the convolutional layers, we also implement recursive recurrent layers. This technique makes it possible to explore deeper recurrent structures by making recursive calls to these layers. The extent of recursion within the recurrent layers is regulated by the recursive stride parameter \(\lambda _R\). This parameter controls the depth of recursion, influencing the network’s ability to capture complex sequential patterns and dependencies.
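As an illustrative sketch (again assuming a PyTorch-style implementation, not our exact code), a single bidirectional LSTM layer can be reused \(\lambda _R\) times once the input features have been projected to the layer's output dimensionality:

```python
import torch
import torch.nn as nn

class RecursiveBLSTM(nn.Module):
    """One bidirectional LSTM layer reused λ_R times over the frame sequence."""
    def __init__(self, features: int, units: int, lambda_r: int, dropout: float = 0.5):
        super().__init__()
        # Linear projection so the recursive output (2*units) matches the layer input.
        self.proj = nn.Linear(features, 2 * units)
        self.blstm = nn.LSTM(2 * units, units, bidirectional=True, batch_first=True)
        self.drop = nn.Dropout(dropout)
        self.lambda_r = lambda_r

    def forward(self, x: torch.Tensor) -> torch.Tensor:  # x: (batch, frames, features)
        x = self.proj(x)
        for _ in range(self.lambda_r):      # λ_R recursive calls of the same BLSTM
            x, _ = self.blstm(x)
            x = self.drop(x)
        return x                            # (batch, frames, 2*units)
```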

In order to gain a deeper understanding of recursion in HTR, we will carry out a systematic parametrization of both our models and the techniques employed across a spectrum of values. Detailed information regarding the baseline model and the recursive techniques under consideration is provided in the following sections.

4 Experiments

In this section, we describe the experimental setup designed to carry out our research, providing details of the corpora used for training and evaluation, the evaluation protocol and its associated metrics, and the details of our experiments.

4.1 Corpora

We consider four different common benchmarks in HTR (graphically depicted in Fig. 3):

  • The George Washington Dataset [20]. A small dataset focused on the handwriting of George Washington.

  • The Parzival Database [20]. This contains handwritten historical 13th century manuscripts written in medieval German by three different subjects.

  • The Saint Gall Dataset [21]. A 9th century dataset containing the writings, in Latin, of a single person.

  • The IAM Handwriting Database [22]. The most modern dataset, containing samples of 657 different writers in English.

Fig. 3
figure 3

Exemplary manuscript excerpts from the datasets utilized in our study. While our research centers on line-level HTR, we depict larger document segments in this illustration to enhance the visualization of their graphical features

These datasets provide line-level samples, which were used in this work. The corpora were additionally divided into ten random splits: eight were allocated for training, one for validation, and the remaining one for evaluation. This partitioning scheme ensured that all the available data was effectively utilized.

All line-level images were resized to a consistent height of 64 pixels while preserving their original aspect ratio, and pixel values were normalized to the range \([-1, 1]\) before being fed to the neural network. The generalization capabilities of the models were enhanced by incorporating online data augmentation during the training process through the use of well-established techniques, including scaling, rotation, translation and shearing. Each augmentation was applied randomly with a \(50\%\) probability, and its parameters were selected by means of uniform sampling. Details of the augmentation parameters are shown in Table 1.

Table 1 Details of augmentation, with the minimum and maximum values for each one. In the case of scale, translate and shear, the parameter sampling is performed independently for each axis
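A rough sketch of such an augmentation policy using torchvision is given below. The ranges shown are placeholders rather than the actual values of Table 1, and note that torchvision's RandomAffine samples a single scale factor for both axes, whereas our setup samples each axis independently:

```python
import torchvision.transforms as T

# Placeholder ranges for illustration only; the real minimum/maximum values are those of Table 1.
augment = T.Compose([
    T.RandomApply([T.RandomAffine(degrees=(-5, 5))], p=0.5),                    # rotation
    T.RandomApply([T.RandomAffine(degrees=0, scale=(0.9, 1.1))], p=0.5),        # scaling
    T.RandomApply([T.RandomAffine(degrees=0, translate=(0.05, 0.05))], p=0.5),  # translation
    T.RandomApply([T.RandomAffine(degrees=0, shear=(-5, 5, -5, 5))], p=0.5),    # shearing
])
```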

4.2 Evaluation

Accuracy was measured using the well-known Character Error Rate (CER) metric. Furthermore, in order to assess the influence of model reduction, we also included the model size as a performance indicator. This two-fold evaluation made it possible to gauge the trade-off between model size and recognition accuracy, providing valuable insights into the effectiveness of recursion in terms of model reduction.
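For reference, the CER is the edit (Levenshtein) distance between the predicted and reference transcripts, normalized by the length of the reference; a minimal implementation is sketched below:

```python
def character_error_rate(hypothesis: str, reference: str) -> float:
    """CER = Levenshtein distance between prediction and reference,
    divided by the reference length."""
    prev = list(range(len(reference) + 1))
    for i, h in enumerate(hypothesis, start=1):
        curr = [i]
        for j, r in enumerate(reference, start=1):
            curr.append(min(prev[j] + 1,              # deletion
                            curr[j - 1] + 1,          # insertion
                            prev[j - 1] + (h != r)))  # substitution (0 if characters match)
        prev = curr
    return prev[-1] / max(len(reference), 1)

print(character_error_rate("handwritng", "handwriting"))  # 1 edit / 11 chars ≈ 0.09
```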

In order to quantify the size of the model, we adopted the total number of parameters as the metric of measurement. Since the size of a model is directly correlated with the number of parameters it contains, this metric makes it possible to account for model size in a simple manner. It is also invariant to the precision used to store the weights of the model, and can therefore be used independently of weight quantization techniques.
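In practice, this metric can be obtained directly from any model; for instance, in PyTorch:

```python
import torch.nn as nn

def num_parameters(model: nn.Module) -> int:
    """Model size as the total parameter count (independent of the precision
    used to store the weights)."""
    return sum(p.numel() for p in model.parameters())
```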

Furthermore, it should be noted that in many studies focusing on efficiency in neural models, a key factor of interest is the inference time. This factor is often correlated with the size of the model: a larger model typically requires more operations, thus resulting in longer inference times. However, in our work, we are exclusively addressing the size of the model by incorporating recursive layers, which do not reduce computing time (compared to equal depth). Consequently, our results will not include the inference times because they remain closely tied to the depth (layers) of the model, regardless of the inclusion of recursive mechanisms.

4.3 Implementation details

This study is focused on the comprehensive exploration of recursive neural networks in the specific domain of HTR. This objective has been achieved by formulating a collection of variants that serve as a baseline, thus allowing us to evaluate the influence of recursion on model performance.

We start with the selection of a reference model architecture, which acts as the baseline of our study. This reference model is the CRNN architecture proposed by [18]. However, we made slight modifications to it in order to enable the use of recursive layers.

The CRNN architecture is structured as a series of convolutional blocks, sequentially followed by a recurrent block, and finishing in a classification layer. Each convolutional block is composed of \(N_{c}\) convolutional layers, each of which is configured with a \(3\times 3\) kernel. Batch Normalization [23] is applied after each convolutional layer in order to standardize the output, and a Leaky Rectifier Linear Unit activation function [24] is employed. We use a total of five blocks, each with 16, 32, 48, 64, and 80 filters in all their layers, respectively. Maximum Pooling layers with a \(2\times 2\) kernel are introduced into the first three convolutional blocks with the objective of reducing the dimensionality of the input images. In order to prevent overfitting, a Dropout layer [25] with a probability of \(20\%\) is applied after the last three convolutional blocks. In our adaptation of the architecture, we incorporate a \(1\times 1\) convolution within each block, aligning the number of filters with the convolutional layers. This alteration ensures uniform parameters across the convolutional layers and enables specific layers to be replaced with recursive calls.

The output of the last convolutional block is processed column-wise in the recurrent block, which is composed of \(N_r\) Bidirectional Long Short-Term Memory (LSTM) layers [26] of R units each. Each LSTM layer incorporates a dropout layer, with a dropout rate set at \(50\%\) per layer, contributing to the prevention of overfitting in the model. At the beginning of the recurrent block, we perform a linear projection of the image features in order to again enable recursive calls that replace certain layers. Finally, the classification layer maps each resulting image frame onto the most probable character.
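A condensed, non-recursive sketch of this baseline is given below for orientation; it follows the description above but is not the exact code used in our experiments (names such as n_c, n_r and units are illustrative stand-ins for \(N_c\), \(N_r\) and R):

```python
import torch
import torch.nn as nn

class CRNN(nn.Module):
    """Baseline sketch: 5 convolutional blocks, a BLSTM block, and a per-frame
    classifier over Σ' (vocabulary plus the CTC blank symbol)."""
    def __init__(self, n_c: int = 2, n_r: int = 2, units: int = 256, n_classes: int = 80):
        super().__init__()
        filters = [16, 32, 48, 64, 80]
        blocks, in_ch = [], 1                          # grayscale line images
        for b, out_ch in enumerate(filters):
            layers = [nn.Conv2d(in_ch, out_ch, 1)]     # 1x1 conv aligns the filter counts
            for _ in range(n_c):                       # n_c shareable 3x3 convolutions
                layers += [nn.Conv2d(out_ch, out_ch, 3, padding=1),
                           nn.BatchNorm2d(out_ch), nn.LeakyReLU()]
            if b < 3:
                layers.append(nn.MaxPool2d(2))         # pooling in the first three blocks
            if b >= 2:
                layers.append(nn.Dropout(0.2))         # dropout in the last three blocks
            blocks.append(nn.Sequential(*layers))
            in_ch = out_ch
        self.conv = nn.Sequential(*blocks)
        height = 64 // 2 ** 3                          # input height 64, three 2x2 poolings
        self.proj = nn.Linear(filters[-1] * height, 2 * units)
        self.blstm = nn.LSTM(2 * units, units, num_layers=n_r,
                             dropout=0.5 if n_r > 1 else 0.0,
                             bidirectional=True, batch_first=True)
        self.classifier = nn.Linear(2 * units, n_classes + 1)   # +1 for the blank

    def forward(self, x: torch.Tensor) -> torch.Tensor:  # x: (batch, 1, 64, width)
        f = self.conv(x)                                  # (batch, 80, 8, width/8)
        f = f.permute(0, 3, 1, 2).flatten(2)              # column-wise frames
        out, _ = self.blstm(self.proj(f))
        return self.classifier(out)                       # per-frame logits over Σ'
```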

A comprehensive assessment was conducted by establishing a set of parameters to evaluate (see Table 2). This process began with the definition of the depth of the convolutional and recurrent layers. We considered varying values for \(N_{c}\) within the set \(\{1, 2, 3\}\) and \(N_{r}\) in \(\{1, 2, 3\}\) in order to explore different depths in both the convolutional and recurrent components of the model. We additionally investigated the impact of different LSTM units (R), with values 64 and 256. This systematic evaluation allowed a proper analysis of the effects of recursion on models of different complexities and depths. Moreover, this setup led us to consider models of different sizes, thus facilitating the comparison of models of the same size but with a different depth and complexity once the recursion techniques had been applied. A similar configuration was adopted for the recursion parameters. We assessed the impact of the recursion depth by evaluating \(\lambda _C\) and \(\lambda _R\) across the entire spectrum of recursion depths which, given our baseline, was the set \(\{1, 2, 3\}\). We additionally considered the use of kernel scaling \(\alpha \) by comparing those scenarios in which it was applied (\(\alpha =\) T (True)) with those in which it was not (\(\alpha = \) F (False)).

Our model weights were trained using the gradient descent algorithm, optimizing the CTC [19] loss function, and working with batches of 64 samples from the training set. In order to facilitate this training, we employed the Adam [27] optimizer with a learning rate set to 0.0003. Our models underwent training for a total of 500 epochs, where each epoch represents the process of fitting the model to all the training samples. At the end of each epoch, we conducted a validation step using the validation set. During this validation, we identified and saved the model with the lowest CER in the validation set, thus ensuring that the most accurate model was retained for evaluation. As each of the training samples was of a different width, we incorporated zero padding, ensuring that all samples within a batch matched the size of the largest sample. However, it is important to note that we did not propagate the loss through the outputs of the model that were computed with this added padding. This approach helped maintain consistency during training while preventing the loss from being influenced by the padded regions of the data.
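The following self-contained snippet (illustrative, with dummy tensors standing in for the CRNN outputs) sketches one such training step; passing the true number of frames per sample to PyTorch's CTC loss is what keeps the padded outputs from contributing to the loss:

```python
import torch
import torch.nn as nn

vocab_size, batch, max_frames = 80, 4, 120
model = nn.Linear(256, vocab_size + 1)                     # stand-in for the CRNN
features = torch.randn(batch, max_frames, 256)             # zero-padded frame features
frame_lengths = torch.tensor([120, 95, 110, 80])           # frames before padding
targets = torch.randint(1, vocab_size + 1, (batch, 30))    # padded labels (values avoid the blank index 0)
target_lengths = torch.tensor([30, 22, 27, 18])

ctc = nn.CTCLoss(blank=0, zero_infinity=True)
optimizer = torch.optim.Adam(model.parameters(), lr=3e-4)  # lr = 0.0003 as in the text

log_probs = model(features).log_softmax(2).permute(1, 0, 2)    # CTCLoss expects (T, batch, C)
loss = ctc(log_probs, targets, frame_lengths, target_lengths)  # padded frames are ignored
optimizer.zero_grad()
loss.backward()
optimizer.step()
```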

Table 2 Meta-parameters used in the experiments
Fig. 4
figure 4

Best viewed in color. Results in terms of the CER and the number of parameters across different databases. Each glyph corresponds to a specific recursion technique: no recursion (crosses), recursion in convolutional layers (squares), recursion in recurrent layers (circles), and the combination of both convolutional and recurrent layers (triangles). Moreover, the Pareto frontier, which denotes optimal model configurations, is represented by a solid line, providing optimal values in the trade-offs between CER and model size

5 Results

This section provides an in-depth discussion of the results of our experiments and explores how recursive techniques impact our HTR models. Let us recall that we are focusing on two key metrics: the CER and the number of model parameters. Although these measures make it possible to analyze the performance of each of the strategies considered, they do not suffice to compare the whole set of alternatives and determine which is the best: the two metrics may be in conflict, and improving one of them might, therefore, imply deterioration in the other.

From the aforementioned point of view, our study can be viewed as an instance of a Multi-objective Optimization Problem (MOP) in which two functions are meant to be optimized simultaneously: the CER and the number of model parameters. The usual method employed to evaluate this kind of problem is the concept of non-dominance. One solution is said to dominate another if, and only if, it is better than or equal to it in each goal function and strictly better in at least one of them. The set of non-dominated elements represents the different optimal solutions to the MOP. Each of them is usually referred to as a Pareto-optimal solution, and the whole set is usually known as the Pareto frontier. Our Pareto frontier, therefore, represents a set of model configurations in which no further improvement can be made to one metric (CER) without worsening the other (number of parameters). When exploring the Pareto frontier, we can identify model configurations that provide a favorable balance between performance and model size, thus helping determine which models are the most efficient and finding the optimal trade-offs for our specific HTR task.
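A minimal sketch of this non-dominance test over (CER, number of parameters) pairs is shown below; the values in the toy example are illustrative and not taken from our experiments:

```python
def pareto_frontier(configs):
    """Return the non-dominated configurations; each entry is
    (name, cer, n_params) and both criteria are to be minimized."""
    frontier = []
    for name, cer, size in configs:
        dominated = any(c2 <= cer and s2 <= size and (c2 < cer or s2 < size)
                        for _, c2, s2 in configs)
        if not dominated:
            frontier.append((name, cer, size))
    return frontier

# Toy example (illustrative values only):
models = [("baseline",       6.1, 9.4e6),
          ("conv-recursion",  5.8, 7.1e6),    # better in both criteria
          ("deep-baseline",   5.9, 12.0e6)]   # dominated by conv-recursion
print(pareto_frontier(models))                # -> [('conv-recursion', 5.8, 7100000.0)]
```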

For all the above reasons, in order to help understand how these metrics interact, we have created the visual 2D representation shown in Fig. 4, which illustrates the distribution of results from the different configurations confronting CER and the number of parameters in each dataset. The Pareto frontier has also been highlighted.

Table 3 Set of non-dominated configurations from our experiments

An examination of Fig. 4 shows that the Pareto frontier is predominantly composed of models that utilize one or more recursive techniques. Note that the points form clusters by number of parameters (with small fluctuations in the recursive cases owing to the parameter increments introduced by the independent Batch Normalization layers and kernel scaling, when applied). Within each cluster, we can see that the lower points (lower CER) are usually those with some type of recursion. This observation underscores the fact that these models give rise to optimal configurations, as they strike an optimal balance between CER and the number of parameters.

To provide further insights into these optimal solutions, Table 3 lists the detailed set of non-dominated configurations, which comprises a majority of recursive approaches. First, it is important to note that, for some datasets (such as Washington or Parzival), no configuration without recursion appears at all, since such configurations do not provide an improvement in any of the evaluated criteria. This also indicates that recursive approaches make better use of the model's weights, making them much more suitable for scenarios in which size is important.

Concerning the recursion measures, convolutional recursion seems especially relevant, since almost all optimal configurations include it (\(\lambda _C > 1\)). Recurrent recursion (\(\lambda _R\)) has a lower impact, but it is nevertheless included in most of the optimal solutions. Finally, it should be noted that the kernel scaling (\(\alpha \)), despite providing some non-dominated solutions (especially in the Washington dataset), does not seem to have the same influence as the other two recursion measures.

In terms of general configuration, it is observed that the optimal choice is highly dependent on the dataset, much in the same way as the baseline itself. However, the recursive configurations with \(\lambda _C = 2\) and \(\lambda _R = 1\) or \(\lambda _R = 2\) appear in all datasets and with different base configurations, so they can be postulated as the most robust configurations with which to obtain an optimal result. It is worth highlighting that these configurations usually report the lowest CER of the Pareto frontier, and they achieve this with an optimal number of parameters in those cases.

Fig. 5
figure 5

Impact of individual meta-parameters on model performance. Each meta-parameter is evaluated to determine the percentage of times it leads to improved model performance when compared to similarly sized baseline models

Table 4 Mean CER (in %) over all the datasets of each recursive configuration with respect to a specific baseline model with the same depth

5.1 Further examination

In this section, we take a closer look at what the use of recursive layers means for model performance. First, we analyzed the impact of these techniques on overall model effectiveness. In order to facilitate this analysis, we present Fig. 5, which contains a comprehensive bar plot showing the percentage of times that each individual parameter contributes to a model outperforming a baseline model of a similar size.

Our evaluation first categorizes experiments into groups based on the number of parameters of the evaluated model. These groups are readily distinguished in all cases presented in Fig. 4, where four clusters of models with similar sizes are clearly displayed. We then assess the percentage of times a particular parameter leads to superior model accuracy compared to any of the baseline models within the same group. This provides a nuanced understanding of the specific parameters that drive performance improvements, contributing to a more fine-grained analysis of our models.

A close examination of this plot reveals that all the techniques applied consistently enhance model performance, with particular emphasis on the convolutional recursion, which yields model improvements in over 70% of the cases in which it is applied. The convolutional kernel scaling parameter \(\alpha \) attains a similarly high success rate as regards driving model improvements. It is noteworthy that both recurrent and convolutional recursions demonstrate that a recursion depth \(\lambda \) of 2 tends to yield better performance outcomes than a depth of 3, suggesting that excessive depths may indeed lead to deterioration in model performance. From the above, we can claim that, when a restriction is placed on the number of model parameters, RecNN mostly achieve performance improvements in terms of CER. As a rule of thumb, recursion in the convolutional layers and kernel scaling are generally robust choices.

To conclude our analysis, we present the results in another way, in which the impact of RecNN as a neural architecture in itself can be observed. The first column of Table 4 lists all possible configurations of the baseline, i.e., without recursion. The remaining columns indicate all the possible recursion measures that can be applied to achieve the same depth as the baseline. It is important to note that all configurations (other than “None”) entail a size reduction with respect to the base model of that row, so they are always assured to improve at least the size criterion. The cells contain the average CER over all the experiments performed for each possibility, whereas “-” indicates that a measure is not possible for that specific baseline configuration.

This table provides valuable insights into how the application of recursion techniques impacts each of the baseline models. Remarkably, our experiments reveal that, in all cases, the incorporation of recursion techniques does not significantly deteriorate the performance of the model. Furthermore, for deeper models, note that the CER achieved when applying recursion is even lower than that of the baseline, demonstrating not only a reduction in the number of model parameters but also an enhanced performance in the HTR task. The results clearly indicate that most of the recursive configurations outperform their baselines, showing the effectiveness of both convolutional and recurrent recursion techniques, and even a combination of both, for HTR. It is also worth noting that recursion tends to have a stronger impact on deeper baseline models. Despite not increasing the total number of parameters, the application of recursion enables these models to adopt even deeper configurations, resulting in an enhanced performance in the HTR task. This observation emphasizes the potential of recursion techniques to augment the depth of models without compromising model size, consequently contributing to improved task performance.

6 Conclusion

This study explores the application of Recursive Neural Networks (RecNN) to Handwritten Text Recognition (HTR) tasks. This approach addresses a critical aspect of the model size problem, providing solutions for scenarios with stringent size constraints, such as web environments and edge devices.

Our findings consistently demonstrate that these techniques not only reduce the model size but also enhance its accuracy in many cases. They also show that deeper model configurations particularly benefit from these techniques, leading to performance improvements without a substantial increase in the number of parameters. These recursion techniques therefore represent a successful strategy with which to optimize HTR models, providing the potential for more efficient and effective deployment in various real-world applications.