1 Introduction

Handwritten Text Recognition (HTR) is an important field at the intersection of computer vision and natural language processing. It involves the challenging task of converting handwritten text into machine-readable form, with applications ranging from digitizing historical documents to enabling real-time transcription in note-taking applications [1, 2].

The evolution of deep learning models in recent years has led to remarkable advancements in HTR. However, one prevailing trend is the ever-expanding size of these models, a phenomenon that is not limited to HTR but is pervasive across machine learning domains [3, 4]. Although model size typically affects processing time, the size itself may be of paramount concern in certain scenarios. Web browsers, for instance, impose strict limits on memory usage and often require downloading the model assets at initialization time. In such cases, the weight of the model is not merely a matter of optimization but a critical determinant of the feasibility of deploying HTR systems within these constraints. Despite this, most research on model efficiency has focused primarily on reducing training or inference time, often at the expense of model size [5]. For example, the Mixture-of-Experts approach reduces inference cost at the expense of increasing the size of the model [6].

Our objective is to contribute to striking a balance between model size and performance, addressing the challenge of reducing model size in the context of HTR and exploring avenues that could make applications more efficient in this regard. To this end, this work introduces the application of Recursive Neural Networks (RecNN) to HTR. A RecNN is a neural model that reuses the same weights in consecutive layers in order to allow greater depth without an increase in memory size [7]. This approach places particular emphasis on reducing model size while keeping other performance indicators (such as accuracy or inference time) fairly constant.

This paper makes the following contributions:

  1. To the best of the authors’ knowledge, this is the first time that RecNN has been used in the context of HTR.

  2. Since state-of-the-art HTR models typically consist of both graphical and language modeling layers, we provide a complete architectural design of where and how to apply recursion in the neural network.

  3. We introduce the concept of weight scaling in the recursion, which allows greater expressiveness of the recursive network with a minimal impact on its size.

  4. We have carried out exhaustive experiments on several typical HTR benchmarks, contrasting the two critical components focused on in this work (accuracy and model size), along with ablation studies on the individualized impact of the different options.

The rest of the paper is structured as follows: in Sect. 2 we provide the necessary background to understand the starting point of this work; Sect. 3 introduces and describes our methodology; the experiments are presented in Sect. 4; a detailed analysis of results is reported in Sect. 5; finally, Sect. 6 concludes the present work.

2 Background

The field of machine learning has, in recent years, made significant progress thanks to deep neural networks. These models comprise deeper and more complex systems that increase their overall capacity to learn intricate data. While they have become the de facto architectures in academic research, using them in real-world scenarios, especially on resource-constrained platforms such as mobile phones and web browsers, can be challenging.

Researchers have spent several years developing more efficient models, first for the ubiquitous Convolutional Neural Networks (CNN) [8, 9] and, more recently, for Transformers [10,11,12]. These efforts have primarily focused on making models faster at inference, although this often also leads to implicit reductions in model size. However, in scenarios with limited memory and bandwidth, such as web browsers, the size of the model becomes the critical indicator, meaning that techniques with which to compress the model are essential. Approaches such as knowledge distillation to smaller models [13], model pruning [14], and quantization [15] have been used for this purpose.

In this work, we explore the Recursive Neural Networks (RecNN) approach [7]. This approach builds upon the observation that many modern deep learning models are overly complex, and proposes that consecutive layers apply the same function to the input, i.e., that the weights be shared between them. RecNN thus make it possible to significantly reduce model size while keeping the depth and capabilities of deep networks almost intact. Interestingly, the idea is versatile and can be applied to different types of architectures. These networks fit seamlessly into CNN [16] or Transformer [17] architectures, making it possible to replace various blocks with repetitions of the same block. Furthermore, recursion can be applied dynamically in specific layers using techniques such as dynamic gates, which reduce the number of executions of certain layers, not only reducing the model size but also improving inference speed [16].

In this context, RecNN remain an untapped avenue in HTR. The objective of this paper is to provide insights into leveraging these recursive architectures for this task, exploring this approach in order to address the imperative of downsizing models for direct deployment on common user devices.

3 Methodology

This section commences by detailing the current methodology employed in HTR before formally introducing RecNN. We then describe our specific implementation of RecNN in the context of HTR.

3.1 Handwritten text recognition

Formally, let \(\mathcal {T} = \left\{ \left( x_{i},\textbf{z}_{i}\right) : x_{i}\in \mathcal {X},\;\textbf{z}_{i}\in \mathcal {Z}\right\} _{i=1}^{|\mathcal {T}|}\) represent a set of data, where sample \(x_{i}\) is drawn from the space of line-level text images \(\mathcal {X}\) and \(\textbf{z}_{i} = \left( z_{i1},z_{i2},\ldots ,z_{iM_{i}}\right) \) corresponds to its transcript in terms of a predefined set of character symbols. Note that \(\mathcal {Z} = \Sigma ^{*}\), where \(\Sigma \) represents the vocabulary.

Given a query \(x \in \mathcal {X}\), the task of line-level HTR can be formally defined as retrieving the most probable sequence of characters \(\hat{\textbf{z}}\) that satisfies:

$$\begin{aligned} \hat{\textbf{z}} = \arg \max _{\textbf{z} \in \mathcal {Z}} \text {P}( \textbf{z} \mid x ) \end{aligned}$$
(1)

To approximate Eq. 1, the state of the art typically resorts to the Convolutional Recurrent Neural Network (CRNN) [18]. A CRNN constitutes a particular neural architecture formed by an initial block of convolutional layers, which aim to learn an adequate set of features for the task, followed by a group of recurrent layers, which model the temporal dependencies among the features produced by the initial feature-learning block.

To attain a proper end-to-end scheme, the CRNN is trained using the Connectionist Temporal Classification (CTC) algorithm [19], which allows optimizing the weights of the neural network using unsegmented sequential data. This means that, for a given image \(x_{i}\in \mathcal {X}\), we only have its associated sequence of characters \(\textbf{z}_{i} \in \mathcal {Z}\) as its expected output, without any correspondence at pixel level or similar input–output alignment. Due to its particular training procedure, CTC requires the inclusion of an additional “blank” symbol within the \(\Sigma \) vocabulary, i.e., \(\Sigma ' = \Sigma \cup \left\{ {\textit{blank}}\right\} \).

The output of the CRNN can be seen as a posteriorgram, i.e., the probability of each \(\sigma \in \Sigma '\) appearing in each frame of the input image. Most commonly, the actual sequence prediction is obtained from this posteriorgram by using a greedy approach, which keeps the most probable symbol per step, merges repeated symbols in consecutive frames, and eventually removes the blank tokens.
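As an illustration of this decoding step, the following minimal sketch (not the authors' exact code) implements greedy decoding over a posteriorgram, assuming that index 0 is the blank symbol:

```python
import numpy as np

def greedy_ctc_decode(posteriorgram, blank=0):
    """Greedy CTC decoding: best symbol per frame, merge repeats, drop blanks.
    `posteriorgram` is a (frames, |Σ'|) array of per-frame probabilities."""
    best = posteriorgram.argmax(axis=1)             # most probable symbol per frame
    merged = [s for i, s in enumerate(best)         # merge repeated symbols
              if i == 0 or s != best[i - 1]]
    return [int(s) for s in merged if s != blank]   # remove blank tokens

# Toy example with 6 frames over Σ' = {blank, 'a', 'b'}
probs = np.array([[0.1, 0.8, 0.1],    # 'a'
                  [0.2, 0.7, 0.1],    # 'a' (repeated, merged)
                  [0.9, 0.05, 0.05],  # blank
                  [0.1, 0.1, 0.8],    # 'b'
                  [0.1, 0.1, 0.8],    # 'b' (repeated, merged)
                  [0.9, 0.05, 0.05]]) # blank
print(greedy_ctc_decode(probs))       # -> [1, 2], i.e., "ab"
```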

Aside from specific design adjustments (number of layers, regularization techniques, etc.), this architecture belongs to the state of the art in HTR, so it will be employed hereinafter for our study.

3.2 Recursive neural networks

RecNN, as employed within the scope of this research, are defined as neural network architectures characterized by the recursive application of certain components, thus enabling the iterative processing of input data, with the potential to replace specific layers or blocks with recursive calls. A neural network is generally referred to as a RecNN when it includes one or more recursive layers.

Generally speaking, let us define a neural network model as a series of layers \(\mathcal {M} = (L_1, L_2, \ldots , L_N)\). Each layer is, in turn, defined by a specific parametric function \(f_{\theta }\) with parameters (weights) \(\theta \). For the sake of simplicity, let us assume that the function is common throughout the network so the ith layer is simply defined by parameters \(\theta _i\). Then, given an input x, the output y computed by \(\mathcal {M}\) is:

$$\begin{aligned} y = f_{\theta _N} ( \cdots f_{\theta _2} ( f_{\theta _1} (x) ) ) \end{aligned}$$

Within this context, RecNN denotes a neural network in which one or more layers are recursively applied (see Fig. 1). There are two possible ways of using recursion, depending on the purpose:

  • To keep the depth while decreasing the number of parameters. In this case, one can substitute the \((i+1)\)th layer with a recursive use of the ith layer, so that \(f_{\theta _{i+1}} ( f_{\theta _i} ( \cdot ) )\) becomes \(f_{\theta _i} ( f_{\theta _i} ( \cdot ) )\).

  • To increase the depth while keeping the number of parameters. In this case, one can reuse the ith layer by applying the same function k more times, as \(f_{\theta _i} ( \cdots _k f_{\theta _i} ( \cdot ) )\) (see the sketch after this list).
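To make the idea concrete, the following minimal PyTorch-style sketch (our own illustration, not code from [7]) reuses a single layer k times, so that the effective depth grows while the parameter count stays constant:

```python
import torch
import torch.nn as nn

class RecursiveBlock(nn.Module):
    """Applies the same parametric layer k times: f_θ(f_θ(... f_θ(x)))."""
    def __init__(self, layer: nn.Module, k: int):
        super().__init__()
        self.layer = layer   # a single set of weights θ, stored once
        self.k = k           # recursion depth (number of applications)

    def forward(self, x: torch.Tensor) -> torch.Tensor:
        for _ in range(self.k):
            x = self.layer(x)  # the same weights are reused at every step
        return x

# Two stacked 128->128 linear layers vs. one layer applied recursively twice:
stacked   = nn.Sequential(nn.Linear(128, 128), nn.Linear(128, 128))  # 2 x (128*128 + 128) parameters
recursive = RecursiveBlock(nn.Linear(128, 128), k=2)                 # 1 x (128*128 + 128) parameters
```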

Fig. 1
figure 1

Graphical comparison of conventional layers and a recursive layer

RecNN can be trained using conventional methods without requiring modifications to classical learning algorithms like gradient descent and backpropagation. Therefore, the integration of recursive structures into neural networks does not disrupt established training frameworks, allowing for their seamless application.

Finally, it is important to emphasize that RecNN leverage recursive structures to iteratively apply operations to input data, while RNN are designed to process all elements of a sequence with the same weights. Although both ideas are based on similar concepts (shared parameters), they are complementary and can be applied concurrently. Specifically, one can apply recursion to the recurrent block of the CRNN (see Sect. 3.3.3), allowing for more recurrent layers without increasing memory requirements. This is particularly interesting as the recurrent block typically entails the highest number of parameters of the CRNN.

3.3 Recursion in handwritten text recognition

This section details how we implement recursion in an HTR model. Specifically, we propose three different means of including recursion in the CRNN architecture described above.

3.3.1 Recursive convolutional layer

Our first technique involves the utilization of recursive convolutional layers. We introduce the parameter \(\lambda _C\), which signifies the depth of recursion applied to the convolutional layers within a given block. We specifically organize the convolutional layers into groups of \(\lambda _C\) and replace each group with recursive calls of its initial convolutional layer. This process forms the basis of our investigation into the impact of recursion within the convolutional layers. In order to promote generalization and the reusability of recursive convolutional layers, we apply independent batch normalization in each recursive call. This approach has been shown to assist in the specialization of the convolutional kernels for each recursive call [16].
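A minimal sketch of this idea is given below, assuming a PyTorch-style block in which all layers of the group share the same number of channels (illustrative names and details, not our exact implementation):

```python
import torch
import torch.nn as nn

class RecursiveConvBlock(nn.Module):
    """One 3x3 convolution reused λ_C times; the kernel weights are shared,
    while each recursive call has its own BatchNorm, as suggested in [16]."""
    def __init__(self, channels: int, lambda_c: int):
        super().__init__()
        self.conv = nn.Conv2d(channels, channels, kernel_size=3, padding=1)
        self.bns = nn.ModuleList([nn.BatchNorm2d(channels) for _ in range(lambda_c)])
        self.act = nn.LeakyReLU()

    def forward(self, x: torch.Tensor) -> torch.Tensor:
        for bn in self.bns:                 # λ_C recursive calls of the same kernel
            x = self.act(bn(self.conv(x)))
        return x
```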

3.3.2 Convolutional kernel scaling

Our second strategy is kernel scaling for the convolutional layers. This technique entails the use of a single-value learnable parameter \(W^{\alpha }\), serving as a multiplier for the convolutional kernel of each layer (see Fig. 2). Our objective in using this scale value is to facilitate the specialization of the convolutional kernel at each depth of recursion so as to enhance the adaptability and expressiveness of the recursive convolutional layers. In this way, the recursive layers, even with the exact same parameters, can give rise to activations at different scales, which allows the parametric function to be slightly modified at the cost of only one additional weight.
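The following sketch illustrates one possible realization of this idea, assuming one learnable scalar per recursion depth that multiplies the shared kernel (an illustration, not our exact implementation):

```python
import torch
import torch.nn as nn
import torch.nn.functional as F

class ScaledRecursiveConv(nn.Module):
    """Recursive 3x3 convolution whose shared kernel is multiplied by a
    learnable scalar at each recursion depth (one extra weight per depth)."""
    def __init__(self, channels: int, lambda_c: int):
        super().__init__()
        self.conv = nn.Conv2d(channels, channels, kernel_size=3, padding=1)
        self.alpha = nn.Parameter(torch.ones(lambda_c))   # initialized to identity scaling

    def forward(self, x: torch.Tensor) -> torch.Tensor:
        for d in range(len(self.alpha)):
            # Same kernel, rescaled per depth: W' = α_d · W
            x = F.conv2d(x, self.alpha[d] * self.conv.weight, self.conv.bias, padding=1)
        return x
```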

Fig. 2
figure 2

Visualization of the kernel scaling technique

3.3.3 Recursive recurrent layers

As with the convolutional layers, we also implement recursive recurrent layers. This technique makes it possible to explore deeper recurrent structures by making recursive calls to these layers. The extent of recursion within the recurrent layers is regulated by the recursive stride parameter \(\lambda _R\). This parameter controls the depth of recursion, influencing the network’s ability to capture complex sequential patterns and dependencies.
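As an illustrative sketch (again assuming a PyTorch-style implementation, not our exact code), a single bidirectional LSTM layer can be reused \(\lambda _R\) times once the input features have been projected to the layer's output dimensionality:

```python
import torch
import torch.nn as nn

class RecursiveBLSTM(nn.Module):
    """One bidirectional LSTM layer reused λ_R times over the frame sequence."""
    def __init__(self, features: int, units: int, lambda_r: int, dropout: float = 0.5):
        super().__init__()
        # Linear projection so the recursive output (2*units) matches the layer input.
        self.proj = nn.Linear(features, 2 * units)
        self.blstm = nn.LSTM(2 * units, units, bidirectional=True, batch_first=True)
        self.drop = nn.Dropout(dropout)
        self.lambda_r = lambda_r

    def forward(self, x: torch.Tensor) -> torch.Tensor:  # x: (batch, frames, features)
        x = self.proj(x)
        for _ in range(self.lambda_r):      # λ_R recursive calls of the same BLSTM
            x, _ = self.blstm(x)
            x = self.drop(x)
        return x                            # (batch, frames, 2*units)
```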

In order to gain a deeper understanding of recursion in HTR, we will carry out a systematic parametrization of both our models and the techniques employed across a spectrum of values. Detailed information regarding the baseline model and the recursive techniques under consideration is provided in the following sections.

4 Experiments

In this section, we describe the experimental setup designed to carry out our research, providing details of the corpora used for training and evaluation, the evaluation protocol and its associated metrics, and the details of our experiments.

4.1 Corpora

We consider four different common benchmarks in HTR (graphically depicted in Fig. 3):

  • The George Washington Dataset [20]. A small dataset focused on the handwriting of George Washington.

  • The Parzival Database [20]. This contains handwritten historical 13th century manuscripts written in medieval German by three different subjects.

  • The Saint Gall Dataset [21]. A 9th century dataset containing the writings, in Latin, of a single person.

  • The IAM Handwriting Database [22]. The most modern dataset, containing samples of 657 different writers in English.

Fig. 3
figure 3

Exemplary manuscript excerpts from the datasets utilized in our study. While our research centers on line-level HTR, we depict larger document segments in this illustration to enhance the visualization of their graphical features

These datasets provide line-level samples, which were used in this work. The corpora were additionally divided into ten random splits: eight were allocated for training, one for validation, and the remaining one for evaluation. This partitioning scheme ensured that all the available data was effectively utilized.

All line-level images were resized to a consistent height of 64 pixels while preserving their original aspect ratio, and pixel values were normalized to the range \([-1, 1]\) before being fed to the neural network. The generalization capabilities of the models were enhanced by incorporating online data augmentation during the training process through the use of well-established techniques, including scaling, rotation, translation and shearing. Each augmentation was applied randomly with a \(50\%\) probability, and its parameters were selected by means of uniform sampling. Details of the augmentation parameters are shown in Table 1.

Table 1 Details of augmentation, with the minimum and maximum values for each one. In the case of scale, translate and shear, the parameter sampling is performed independently for each axis
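A rough sketch of such an augmentation policy using torchvision is given below. The ranges shown are placeholders rather than the actual values of Table 1, and note that torchvision's RandomAffine samples a single scale factor for both axes, whereas our setup samples each axis independently:

```python
import torchvision.transforms as T

# Placeholder ranges for illustration only; the real minimum/maximum values are those of Table 1.
augment = T.Compose([
    T.RandomApply([T.RandomAffine(degrees=(-5, 5))], p=0.5),                    # rotation
    T.RandomApply([T.RandomAffine(degrees=0, scale=(0.9, 1.1))], p=0.5),        # scaling
    T.RandomApply([T.RandomAffine(degrees=0, translate=(0.05, 0.05))], p=0.5),  # translation
    T.RandomApply([T.RandomAffine(degrees=0, shear=(-5, 5, -5, 5))], p=0.5),    # shearing
])
```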

4.2 Evaluation

Accuracy was measured using the well-known Character Error Rate (CER) metric. Furthermore, in order to assess the influence of model reduction, we also included the model size as a performance indicator. This two-fold evaluation made it possible to gauge the trade-off between model size and recognition accuracy, providing valuable insights into the effectiveness of recursion in terms of model reduction.
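For reference, the CER is the edit (Levenshtein) distance between the predicted and reference transcripts, normalized by the length of the reference; a minimal implementation is sketched below:

```python
def character_error_rate(hypothesis: str, reference: str) -> float:
    """CER = Levenshtein distance between prediction and reference,
    divided by the reference length."""
    prev = list(range(len(reference) + 1))
    for i, h in enumerate(hypothesis, start=1):
        curr = [i]
        for j, r in enumerate(reference, start=1):
            curr.append(min(prev[j] + 1,              # deletion
                            curr[j - 1] + 1,          # insertion
                            prev[j - 1] + (h != r)))  # substitution (0 if characters match)
        prev = curr
    return prev[-1] / max(len(reference), 1)

print(character_error_rate("handwritng", "handwriting"))  # 1 edit / 11 chars ≈ 0.09
```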

In order to quantify the size of the model, we adopted the total number of parameters as the metric of measurement. Since the size of a model is directly correlated with the number of parameters it contains, this metric makes it possible to account for model size in a simple manner. It is also invariant to the precision used to store the weights of the model, and can therefore be used independently of weight quantization techniques.
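In practice, this metric can be obtained directly from any model; for instance, in PyTorch:

```python
import torch.nn as nn

def num_parameters(model: nn.Module) -> int:
    """Model size as the total parameter count (independent of the precision
    used to store the weights)."""
    return sum(p.numel() for p in model.parameters())
```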

Furthermore, it should be noted that in many studies focusing on efficiency in neural models, a key factor of interest is the inference time. This factor is often correlated with the size of the model: a larger model typically requires more operations, thus resulting in longer inference times. However, in our work, we are exclusively addressing the size of the model by incorporating recursive layers, which do not reduce computing time (compared to equal depth). Consequently, our results will not include the inference times because they remain closely tied to the depth (layers) of the model, regardless of the inclusion of recursive mechanisms.

4.3 Implementation details

This study is focused on the comprehensive exploration of recursive neural networks in the specific domain of HTR. This objective has been achieved by formulating a collection of variants that serve as a baseline, thus allowing us to evaluate the influence of recursion on model performance.

We start with the selection of a reference model architecture, which acts as the baseline of our study. This reference model is the CRNN architecture proposed by [18]. However, we made slight modifications to it in order to enable the use of recursive layers.

The CRNN architecture is structured as a series of convolutional blocks, sequentially followed by a recurrent block, and finishing in a classification layer. Each convolutional block is composed of \(N_{c}\) convolutional layers, each of which is configured with a \(3\times 3\) kernel. Batch Normalization [23] is applied after each convolutional layer in order to standardize the output, and a Leaky Rectifier Linear Unit activation function [24] is employed. We use a total of five blocks, each with 16, 32, 48, 64, and 80 filters in all their layers, respectively. Maximum Pooling layers with a \(2\times 2\) kernel are introduced into the first three convolutional blocks with the objective of reducing the dimensionality of the input images. In order to prevent overfitting, a Dropout layer [25] with a probability of \(20\%\) is applied after the last three convolutional blocks. In our adaptation of the architecture, we incorporate a \(1\times 1\) convolution within each block, aligning the number of filters with the convolutional layers. This alteration ensures uniform parameters across the convolutional layers and enables specific layers to be replaced with recursive calls.

The output of the last convolutional block is processed column-wise in the recurrent block, which is composed of \(N_r\) Bidirectional Long Short-Term Memory (LSTM) layers [26] of R units each. Each LSTM layer incorporates a dropout layer, with a dropout rate set at \(50\%\) per layer, contributing to the prevention of overfitting in the model. At the beginning of the recurrent block, we perform a linear projection of the image features in order to again enable recursive calls that replace certain layers. Finally, the classification layer maps each resulting image frame onto the most probable character.
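A condensed, non-recursive sketch of this baseline is given below for orientation; it follows the description above but is not the exact code used in our experiments (names such as n_c, n_r and units are illustrative stand-ins for \(N_c\), \(N_r\) and R):

```python
import torch
import torch.nn as nn

class CRNN(nn.Module):
    """Baseline sketch: 5 convolutional blocks, a BLSTM block, and a per-frame
    classifier over Σ' (vocabulary plus the CTC blank symbol)."""
    def __init__(self, n_c: int = 2, n_r: int = 2, units: int = 256, n_classes: int = 80):
        super().__init__()
        filters = [16, 32, 48, 64, 80]
        blocks, in_ch = [], 1                          # grayscale line images
        for b, out_ch in enumerate(filters):
            layers = [nn.Conv2d(in_ch, out_ch, 1)]     # 1x1 conv aligns the filter counts
            for _ in range(n_c):                       # n_c shareable 3x3 convolutions
                layers += [nn.Conv2d(out_ch, out_ch, 3, padding=1),
                           nn.BatchNorm2d(out_ch), nn.LeakyReLU()]
            if b < 3:
                layers.append(nn.MaxPool2d(2))         # pooling in the first three blocks
            if b >= 2:
                layers.append(nn.Dropout(0.2))         # dropout in the last three blocks
            blocks.append(nn.Sequential(*layers))
            in_ch = out_ch
        self.conv = nn.Sequential(*blocks)
        height = 64 // 2 ** 3                          # input height 64, three 2x2 poolings
        self.proj = nn.Linear(filters[-1] * height, 2 * units)
        self.blstm = nn.LSTM(2 * units, units, num_layers=n_r,
                             dropout=0.5 if n_r > 1 else 0.0,
                             bidirectional=True, batch_first=True)
        self.classifier = nn.Linear(2 * units, n_classes + 1)   # +1 for the blank

    def forward(self, x: torch.Tensor) -> torch.Tensor:  # x: (batch, 1, 64, width)
        f = self.conv(x)                                  # (batch, 80, 8, width/8)
        f = f.permute(0, 3, 1, 2).flatten(2)              # column-wise frames
        out, _ = self.blstm(self.proj(f))
        return self.classifier(out)                       # per-frame logits over Σ'
```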

A comprehensive assessment was conducted by establishing a set of parameters to evaluate (see Table 2). This process began with the definition of the depth of the convolutional and recurrent layers. We considered varying values for \(N_{c}\) within the set \(\{1, 2, 3\}\) and \(N_{r}\) in \(\{1, 2, 3\}\) in order to explore different depths in both the convolutional and recurrent components of the model. We additionally investigated the impact of different LSTM units (R), with values 64 and 256. This systematic evaluation allowed a proper analysis of the effects of recursion on models of different complexities and depths. Moreover, this setup led us to consider models of different sizes, thus facilitating the comparison of models of the same size but with a different depth and complexity once the recursion techniques had been applied. A similar configuration was adopted for the recursion parameters. We assessed the impact of the recursion depth by evaluating \(\lambda _C\) and \(\lambda _R\) across the entire spectrum of recursion depths which, given our baseline, was the set \(\{1, 2, 3\}\). We additionally considered the use of kernel scaling \(\alpha \) by comparing those scenarios in which it was applied (\(\alpha =\) T (True)) with those in which it was not (\(\alpha = \) F (False)).

Our model weights were trained using the gradient descent algorithm, optimizing the CTC [19] loss function, and working with batches of 64 samples from the training set. In order to facilitate this training, we employed the Adam [27] optimizer with a learning rate set to 0.0003. Our models underwent training for a total of 500 epochs, where each epoch represents the process of fitting the model to all the training samples. At the end of each epoch, we conducted a validation step using the validation set. During this validation, we identified and saved the model with the lowest CER in the validation set, thus ensuring that the most accurate model was retained for evaluation. As each of the training samples was of a different width, we incorporated zero padding, ensuring that all samples within a batch matched the size of the largest sample. However, it is important to note that we did not propagate the loss through the outputs of the model that were computed with this added padding. This approach helped maintain consistency during training while preventing the loss from being influenced by the padded regions of the data.
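The following self-contained snippet (illustrative, with dummy tensors standing in for the CRNN outputs) sketches one such training step; passing the true number of frames per sample to PyTorch's CTC loss is what keeps the padded outputs from contributing to the loss:

```python
import torch
import torch.nn as nn

vocab_size, batch, max_frames = 80, 4, 120
model = nn.Linear(256, vocab_size + 1)                     # stand-in for the CRNN
features = torch.randn(batch, max_frames, 256)             # zero-padded frame features
frame_lengths = torch.tensor([120, 95, 110, 80])           # frames before padding
targets = torch.randint(1, vocab_size + 1, (batch, 30))    # padded labels (values avoid the blank index 0)
target_lengths = torch.tensor([30, 22, 27, 18])

ctc = nn.CTCLoss(blank=0, zero_infinity=True)
optimizer = torch.optim.Adam(model.parameters(), lr=3e-4)  # lr = 0.0003 as in the text

log_probs = model(features).log_softmax(2).permute(1, 0, 2)    # CTCLoss expects (T, batch, C)
loss = ctc(log_probs, targets, frame_lengths, target_lengths)  # padded frames are ignored
optimizer.zero_grad()
loss.backward()
optimizer.step()
```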

Table 2 Meta-parameters used in the experiments
Fig. 4
figure 4

Best viewed in color. Results in terms of the CER and the number of parameters across different databases. Each glyph corresponds to a specific recursion technique: no recursion (crosses), recursion in convolutional layers (squares), recursion in recurrent layers (circles), and the combination of both convolutional and recurrent layers (triangles). Moreover, the Pareto frontier, which denotes optimal model configurations, is represented by a solid line, providing optimal values in the trade-offs between CER and model size

5 Results

This section provides an in-depth discussion of the results of our experiments and explores how recursive techniques impact our HTR models. Let us recall that we are focusing on two key metrics: the CER and the number of model parameters. Although these measures make it possible to analyze the performance of each of the strategies considered, they do not suffice to compare the whole set of alternatives and determine which is the best: the two metrics may be in conflict, and improving one of them might, therefore, imply deterioration in the other.

From the aforementioned point of view, our study can be viewed as an instance of a Multi-objective Optimization Problem (MOP) in which two functions are meant to be optimized simultaneously: the CER and the number of model parameters. The usual method employed to evaluate this kind of problem is the concept of non-dominance. One solution is said to dominate another if, and only if, it is better than or equal to it in each goal function and strictly better in at least one of them. The set of non-dominated elements represents the different optimal solutions to the MOP. Each of them is usually referred to as a Pareto-optimal solution, and the whole set is usually known as the Pareto frontier. Our Pareto frontier, therefore, represents a set of model configurations in which no further improvement can be made to one metric (CER) without worsening the other (number of parameters). When exploring the Pareto frontier, we can identify model configurations that provide a favorable balance between performance and model size, thus helping determine which models are the most efficient and finding the optimal trade-offs for our specific HTR task.
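A minimal sketch of this non-dominance test over (CER, number of parameters) pairs is shown below; the values in the toy example are illustrative and not taken from our experiments:

```python
def pareto_frontier(configs):
    """Return the non-dominated configurations; each entry is
    (name, cer, n_params) and both criteria are to be minimized."""
    frontier = []
    for name, cer, size in configs:
        dominated = any(c2 <= cer and s2 <= size and (c2 < cer or s2 < size)
                        for _, c2, s2 in configs)
        if not dominated:
            frontier.append((name, cer, size))
    return frontier

# Toy example (illustrative values only):
models = [("baseline",       6.1, 9.4e6),
          ("conv-recursion",  5.8, 7.1e6),    # better in both criteria
          ("deep-baseline",   5.9, 12.0e6)]   # dominated by conv-recursion
print(pareto_frontier(models))                # -> [('conv-recursion', 5.8, 7100000.0)]
```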

For all the above reasons, in order to help understand how these metrics interact, we have created the visual 2D representation shown in Fig. 4, which illustrates the distribution of results from the different configurations confronting CER and the number of parameters in each dataset. The Pareto frontier has also been highlighted.

Table 3 Set of non-dominated configurations from our experiments

An examination of Fig. 4 shows that the Pareto frontier is predominantly composed of models that utilize one or more recursive techniques. Note that the points form clusters by number of parameters (with small fluctuations in the recursive cases owing to the parameter increments introduced by the independent Batch Normalization layers and kernel scaling, when applied). Within each cluster, we can see that the lower points (lower CER) are usually those with some type of recursion. This observation underscores the fact that these models give rise to optimal configurations, as they strike an optimal balance between CER and the number of parameters.

To provide further insights into these optimal solutions, Table 3 lists the detailed set of non-dominated configurations, which comprises a majority of recursive approaches. First, it is important to note that, for some datasets (such as Washington or Parzival), no configuration without recursion appears at all, since such configurations do not provide an improvement in any of the evaluated criteria. This also indicates that recursive approaches make better use of the model's weights, making them much more suitable for scenarios in which size is important.

Concerning the recursion measures, convolutional recursion seems especially relevant, since almost all optimal configurations include it (\(\lambda _C > 1\)). Recurrent recursion (\(\lambda _R\)) has a lower impact, but it is nevertheless included in most of the optimal solutions. Finally, it should be noted that the kernel scaling (\(\alpha \)), despite providing some non-dominated solutions (especially in the Washington dataset), does not seem to have the same influence as the other two recursion measures.

In terms of general configuration, it is observed that the optimal choice is highly dependent on the dataset, much in the same way as the baseline itself. However, the recursive configurations with \(\lambda _C = 2\) and \(\lambda _R = 1\) or \(\lambda _R = 2\) appear in all datasets and with different base configurations, so they can be postulated as the most robust configurations with which to obtain an optimal result. It is worth highlighting that these configurations usually report the lowest CER of the Pareto frontier, and they achieve this with an optimal number of parameters in those cases.

Fig. 5
figure 5

Impact of individual meta-parameters on model performance. Each meta-parameter is evaluated to determine the percentage of times it leads to improved model performance when compared to similarly sized baseline models

Table 4 Mean CER (in %) over all the datasets of each recursive configuration with respect to a specific baseline model with the same depth

5.1 Further examination

In this section, we take a closer look at what the use of recursive layers means for model performance. First, we analyzed the impact of these techniques on overall model effectiveness. In order to facilitate this analysis, we present Fig. 5, which contains a comprehensive bar plot showing the percentage of times that each individual parameter contributes to a model outperforming a baseline model of a similar size.

Our evaluation first categorizes experiments into groups based on the number of parameters of the evaluated model. These groups are readily distinguished in all cases presented in Fig. 4, where four clusters of models with similar sizes are clearly displayed. We then assess the percentage of times a particular parameter leads to superior model accuracy compared to any of the baseline models within the same group. This provides a nuanced understanding of the specific parameters that drive performance improvements, contributing to a more fine-grained analysis of our models.

A close examination of this plot reveals that all the techniques applied consistently enhance model performance, with particular emphasis on the convolutional recursion, which yields model improvements in over 70% of the cases in which it is applied. The convolutional kernel scaling parameter \(\alpha \) attains a similarly high success rate as regards driving model improvements. It is noteworthy that both recurrent and convolutional recursions demonstrate that a recursion depth \(\lambda \) of 2 tends to yield better performance outcomes than a depth of 3, suggesting that excessive depths may indeed lead to deterioration in model performance. From the above, we can claim that, when a restriction is placed on the number of model parameters, RecNN mostly achieve performance improvements in terms of CER. As a rule of thumb, recursion in the convolutional layers and kernel scaling are generally robust choices.

To conclude our analysis, we present the results in another way, in which the impact of RecNN as a neural architecture in itself can be observed. The first column of Table 4 lists all possible configurations of the baseline, i.e., without recursion. The remaining columns indicate all the possible recursion measures that can be applied to achieve the same depth as the baseline. It is important to note that all configurations (other than “None”) entail a size reduction with respect to the base model of that row, so they are always assured to improve at least the size criterion. The cells contain the average CER over all the experiments performed for each possibility, whereas “-” indicates that a measure is not possible for that specific baseline configuration.

This table provides valuable insights into how the application of recursion techniques impacts each of the baseline models. Remarkably, our experiments reveal that, in all cases, the incorporation of recursion techniques does not significantly deteriorate the performance of the model. Furthermore, for deeper models, note that the CER achieved when applying recursion is even lower than that of the baseline, demonstrating not only a reduction in the number of model parameters but also an enhanced performance in the HTR task. The results clearly indicate that most of the recursive configurations outperform their baselines, showing the effectiveness of both convolutional and recurrent recursion techniques, and even a combination of both, for HTR. It is also worth noting that recursion tends to have a stronger impact on deeper baseline models. Despite not increasing the total number of parameters, the application of recursion enables these models to adopt even deeper configurations, resulting in an enhanced performance in the HTR task. This observation emphasizes the potential of recursion techniques to augment the depth of models without compromising model size, consequently contributing to improved task performance.

6 Conclusion

This study explores the application of Recursive Neural Networks (RecNN) to Handwritten Text Recognition (HTR) tasks. This approach addresses a critical aspect of the model size problem, providing solutions for scenarios with stringent size constraints, such as web environments and edge devices.

Our findings consistently demonstrate that these techniques not only reduce the model size but also enhance its accuracy in many cases. They also show that deeper model configurations particularly benefit from these techniques, leading to performance improvements without a substantial increase in the number of parameters. These recursion techniques therefore represent a successful strategy with which to optimize HTR models, providing the potential for more efficient and effective deployment in various real-world applications.