

Language Model Integration for the Recognition of Handwritten Medieval Documents

2009 10th International Conference on Document Analysis and Recognition

Markus Wüthrich∗, Marcus Liwicki†, Andreas Fischer∗, Emanuel Indermühle∗, Horst Bunke∗, Gabriel Viehhauser‡, Michael Stolz‡

∗ Institute of Computer Science and Applied Mathematics, University of Bern, Neubrückstrasse 10, CH-3012 Bern, {mwuethri, afischer, eindermu, bunke}@iam.unibe.ch
† German Research Center for Artificial Intelligence (DFKI), Trippstadter Strasse 122, D-67663 Kaiserslautern, marcus.liwicki@dfki.de
‡ Institut für Germanistik, University of Bern, Länggassstrasse 49, CH-3012 Bern, {michael.stolz, viehhauser}@germ.unibe.ch

Abstract

Building recognition systems for historical documents is a difficult task, especially when it comes to medieval scripts. The complexity is mainly caused by the poor quality and the small quantity of the available data. In this paper we apply an HMM based recognition system to medieval manuscripts from the 13th century written in Middle High German. The recognition system, which was originally developed for modern scripts, has been adapted to medieval scripts. Besides the data processing, one of the major challenges is to create a suitable language model. Because of the lack of appropriate independent text corpora for medieval languages, the language model has to be created on the basis of the manuscripts alone. Due to the small size of the corpus, optimizing the language model parameters can quickly lead to overfitting. In this paper we describe a strategy to integrate all available information into the language model and to optimize the language model parameters without suffering from this problem.

1. Introduction

In recent years, interest in the analysis and recognition of historical documents has grown strongly [2]. In order to preserve valuable old handwritings, a huge number of original documents have been photographed or scanned and are available in the form of digital images. Examples include the writings of famous presidents, e.g., George Washington's papers at the Library of Congress, or scientists, e.g., Sir Isaac Newton's writings at the University of Cambridge Library. Together with the creation of these large collections, there is an increasing demand for accessing text document images in digital libraries [1]. Content based indexing is required to search and browse the images. Hence, analysis and recognition of the handwriting is needed for automation.

The automatic reading of historical documents is an off-line task, where only the images of the documents are available. This task is considered to be harder than on-line recognition, where temporal information can also be exploited [20]. For restricted domains with modern scripts, commercial off-line systems are available, e.g., for postal address [23] and bank check reading [13]. For the task of historical document processing, however, only little work exists. This is due to many difficulties, including the low quality of the original paper or parchment, ink bleed-through, stains, holes, and other adverse artifacts. Considering these difficulties, it is not astonishing that the problem of automatically generating textual transcriptions of historical documents is still unsolved [6]. Some approaches try to avoid the complete transcription of the documents.
On the one hand, word spotting aims at efficiently matching keywords against the document images directly, by segmenting the page into word images, performing a clustering based on global features, and matching the keywords against the labeled word clusters [21]. On the other hand, computer aided manual transcription is attempted [3]. While avoiding the generation of a complete transcription, these systems do not benefit from any linguistic information, which has proven to be successful for modern languages.

Little work has been reported on the automatic analysis and transcription of historical handwritten texts. In [15] a survey of text line extraction methods for historical documents is presented. The issue of text line segmentation into words has been discussed in [7] and [17]. In [8] a novel approach based on hidden Markov models (HMM) and so-called alphabet soups is applied to transcribe letters by George Washington. Yet the method is constrained in that it deals with a single writer only. The authors of [19] propose an optical character recognition technique to transcribe Greek manuscripts. However, the proposed method relies heavily on particular characteristics of the Greek alphabet and is difficult to adapt to handwritings in other languages.

In the present paper we apply an HMM based recognition system to medieval manuscripts from the 13th century written in Middle High German. This system was adapted from another recognizer that has proven to be quite successful for modern handwritings [18]. Since language information is very important for building good recognition systems [22, 24], we investigate different strategies for generating and optimizing statistical language models. In contrast to modern languages, where the language information can be extracted from huge corpora, such as the Brown Corpus [10] or the Corpus of Contemporary American English [5], the language model for Middle High German had to be constructed from the manuscript data itself. In this setting, finding a good optimization strategy for the language model parameters is rather delicate: due to the small size of the corpus, we are facing a considerable overfitting problem. Providing a solution to this problem is one of the main concerns of this paper.

The rest of the paper is organized as follows. First, Section 2 describes the medieval data set. Second, in Section 3 our HMM based recognition system is introduced. Next, Section 4 motivates the use of language models and presents our strategies. Subsequently, experimental results are reported in Section 5. Finally, Section 6 draws some conclusions and gives an outlook.

[Figure 1. Example page from the epic poem. St. Gall, collegiate library, cod. 857, page 262]

2. Data

The data used in this paper originate from an epic poem called Parzival. This poem by Wolfram von Eschenbach dates approximately from the 13th century and is written in Middle High German. Parzival is arranged into 16 books, and the poem is written in pairwise rhyming lines. There exist multiple manuscripts of this epic poem, which differ in several ways. For example, they have different writing styles and different dialects, and even their contents differ slightly. Furthermore, the available transcriptions differ heavily between the manuscripts. The manuscript used for the experiments described in this paper is St. Gall, collegiate library, cod. 857. Besides 'Parzival' it also contains other epic poems. It consists of 318 folios, and the transcription of 'Parzival' is available for only about 45 folios. The pages have two columns of text. In total, 4,478 lines of text with transcriptions are available. Figure 1 shows an example page from the manuscript used in our experiments.

Because of their age and the rather low quality of the parchment, medieval manuscripts need special preprocessing. Since the focus of this paper is on the recognition and the language models, some preprocessing steps are performed manually. In the following, the problems occurring in the images and the preprocessing steps applied are listed:

1. The manuscript contains colored initial characters and decorations (see the ninth line of Figure 1 for an example). They mostly precede a text paragraph; in some cases, these decorations lie within the text column or even span multiple columns. In order to simplify the recognition task, such decorations and characters have been removed manually from the affected pages. The time and effort for this task are quite low, because the characters and decorations can be easily identified and removed, and the number of such decorations is rather small.

2. There are also other characteristics of the parchment that make line segmentation and recognition difficult, e.g., seams or holes. Parchment was very expensive in medieval times, so it was stitched together if there was a tear in it. Holes can also be found in the parchment. They are problematic, because the text of the next page is visible in the image. Artifacts of this type are also removed manually.

3. The text is written in two columns. In order to apply line segmentation after the preprocessing steps, the columns are separated from each other.

4. The result of the last step is one image per text column. These images still contain the colored background. Therefore, an edge detection method called Difference of Gaussians (DoG) is automatically applied to these images. This method produces a grayscale picture of the writing without the background.

5. A binarization of the images is performed to eliminate the remaining background noise. (A sketch of steps 4 and 5 is given after this list.)
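The following minimal sketch illustrates what steps 4 and 5 could look like in practice. It is an illustration only: the filter widths and the threshold are assumptions, since the paper does not report the parameter values it used, and a simple global threshold stands in for whatever binarization method the authors applied.

```python
import numpy as np
from scipy.ndimage import gaussian_filter

def remove_background(gray, sigma_fine=1.0, sigma_coarse=4.0, threshold=0.02):
    """Difference-of-Gaussians edge detection (step 4) followed by a
    global threshold (step 5). All parameter values are illustrative."""
    img = gray.astype(np.float64) / 255.0
    # DoG acts as a band-pass filter: it keeps the pen strokes and
    # suppresses the slowly varying parchment background.
    dog = gaussian_filter(img, sigma_fine) - gaussian_filter(img, sigma_coarse)
    # Binarization: keep only pixels with a strong edge response.
    return (np.abs(dog) > threshold).astype(np.uint8)
```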
The transcriptions of the manuscript are not directly available as ASCII text, because they are stored in the TUSTEP file format (http://www.zdv.uni-tuebingen.de/tustep/), a powerful tool for managing transcriptions of Latin and non-Latin manuscripts. The tool offers a command to export a transcription into an ASCII file. Basically, such an exported file consists of the transcribed text, comments, and several nestable annotations. This format was converted into a format suitable for recognition. For detailed information we refer to [25].

3. Recognition System

The recognition system used in this paper is based on hidden Markov models (HMM). Some of its basic components are similar to the recognizer described in [18]. The input to the recognition system is a single line of text. In order to obtain these single text lines, a line segmentation method based on dynamic programming is applied to the pages of the manuscript [16].

To obtain better recognition results, several preprocessing steps are applied to the image data. Firstly, there are the special preprocessing steps described in Section 2, which deal with the problems of medieval manuscripts and are performed on complete manuscript pages. Secondly, there are several preprocessing steps that are applied to the segmented text lines, i.e., skew correction, line positioning, and width correction. The skew correction horizontally aligns the writing, the line positioning normalizes the extent of the three main writing zones (lower, middle, and upper), and the width normalization affects the width of the characters. Because of the regular upright writing, no slant correction is needed.

The features for the recognizer are extracted using a sliding window that is moved from left to right over the text line. In this paper the same features as in [18] are used, namely, the number of black pixels, the position of the uppermost pixel, the orientation of the uppermost pixel, the position of the lowermost pixel, the orientation of the lowermost pixel, the proportion of black pixels to the number of pixels between the uppermost and lowermost pixel, the center of gravity, the second derivative of the moment in vertical direction, and the number of black-to-white transitions in vertical direction.

4. Language Models

The purpose of integrating a language model is to improve a recognition system by using statistical information about the language structure in the recognition process [18]. This is motivated by the fact that humans can easily predict a word while reading. In this paper we use statistical n-gram language models. An n-gram language model stores statistical information about word sequences; it can be used to predict a word if the n-1 preceding words are known. The probability p(W) of a word sequence W = (w_1, ..., w_m) is given by Eq. (1). Eq. (2) is an approximation of Eq. (1) that limits the context to n-1 words. The resulting probabilities are called an n-gram language model; 1-gram, 2-gram, and 3-gram language models are usually referred to as unigram, bigram, and trigram language models.

p(W) = p(w_1) \prod_{i=2}^{m} p(w_i \mid w_1, \ldots, w_{i-1})    (1)

p(W) \approx p(w_1) \prod_{i=2}^{m} p(w_i \mid w_{i-n+1}, \ldots, w_{i-1})    (2)

The probabilities of word sequences are usually extracted from a large set of texts (corpus) that is independent of the training, validation, and test set. A simple method to estimate p(w_i | w_{i-n+1}, ..., w_{i-1}) in Eq. (2) is

p(w_i \mid w_{i-n+1}, \ldots, w_{i-1}) = \frac{C(w_{i-n+1}, \ldots, w_i)}{C(w_{i-n+1}, \ldots, w_{i-1})}    (3)

where the function C counts the occurrences of a word sequence in the given corpus. In practice, however, this estimate is unreliable, because many word sequences never occur in the text corpus and therefore receive a probability of zero. Especially for small text corpora, this phenomenon occurs frequently. Yet a word sequence should not have a probability of zero, because this can lead to a situation where the recognition system is unable to find the correct solution. There are several smoothing methods available that solve this problem [4]. In this paper, the Kneser-Ney smoothing method is used [14].

The problem for the Parzival manuscript is that no text corpora are available that contain sufficiently large amounts of text in the Middle High German language to estimate the probabilities in Eq. (3) reliably. One possible solution to this problem is to use a combination of the training and the validation set (depending on the actual task, the test set may also be used) to compile a corpus for the creation of the language models.
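To make Eqs. (1)-(3) concrete, the sketch below estimates a bigram model from transcription lines. It is a minimal illustration under several assumptions: whitespace tokenization, <s>/</s> sentence markers, mapping of out-of-vocabulary words to <unk>, and truncation of the vocabulary to the most frequent words. It returns the plain, unsmoothed estimate of Eq. (3); the actual system applies Kneser-Ney smoothing [14] on top of such counts.

```python
from collections import Counter

def bigram_lm(lines, vocab_size=2000):
    """Estimate an unsmoothed bigram LM (Eq. 3) from transcription lines."""
    tokens = [line.split() for line in lines]
    # Vocabulary: the vocab_size most frequent words; the rest become <unk>.
    freq = Counter(w for line in tokens for w in line)
    vocab = {w for w, _ in freq.most_common(vocab_size)}
    unigram, bigram = Counter(), Counter()
    for line in tokens:
        words = ["<s>"] + [w if w in vocab else "<unk>" for w in line] + ["</s>"]
        unigram.update(words[:-1])            # history counts C(w1)
        bigram.update(zip(words, words[1:]))  # pair counts C(w1, w2)
    # p(w2 | w1) = C(w1, w2) / C(w1); zero for unseen pairs, hence smoothing.
    return lambda w1, w2: bigram[(w1, w2)] / unigram[w1] if unigram[w1] else 0.0
```

For example, lm = bigram_lm(training_lines) followed by lm("<s>", "der") yields the estimated probability that a line starts with "der".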
The HMM based recognition system used in this paper supports the integration of n-gram language models during decoding. In an HMM based recognition system, decoding is performed with the Viterbi algorithm, which finds the optimal path through the states of an HMM. The integration of the n-gram language model can be expressed recursively as follows:

\phi(s_i) = \phi(s_{i-1}) + \log p(F_i \mid w_i) + \alpha \log p(w_i \mid w_{i-n+1}, \ldots, w_{i-1}) + \beta

where \phi(s_i) is the score of the word sequence s_i = (w_1, ..., w_i) and p(F_i | w_i) is the likelihood returned by the word HMM for w_i and feature sequence F_i. The probability p(w_i | w_{i-n+1}, ..., w_{i-1}) is the n-gram language model probability. The parameter \alpha is known as the Grammar Scale Factor (GSF), because it weights the influence of the language model; any positive value can be assigned to it. The parameter \beta controls the segmentation rate of the recognizer and is called the Word Insertion Penalty (WIP). A negative (positive) value means that the recognizer tends to return fewer (more) words. Both parameters are usually determined experimentally on the validation set.
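The recursion can be read as a per-word score update. The toy function below scores a single word sequence in this way, given per-word HMM log-likelihoods and a bigram model such as the one sketched above. The argument names and the GSF/WIP values are illustrative assumptions; in the actual system this update is applied inside the Viterbi search over all competing hypotheses, not to one fixed sequence.

```python
import math

def word_sequence_score(hmm_loglik, lm, words, gsf=30.0, wip=-50.0):
    """phi(s_i) = phi(s_{i-1}) + log p(F_i|w_i) + gsf * log p_LM + wip."""
    score, prev = 0.0, "<s>"
    for loglik, w in zip(hmm_loglik, words):
        p = lm(prev, w)
        lm_term = math.log(p) if p > 0 else -1e9  # floor for unseen bigrams
        score += loglik + gsf * lm_term + wip
        prev = w
    return score
```

In a decoder, the hypothesis with the highest such score is returned; raising gsf makes the language model dominate the optical evidence, while a more negative wip makes the recognizer return fewer words.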
5. Experiments

For the experiments, the 4,478 text lines from the manuscript described in Section 2 were used. The data was divided into three sets: a training set, a validation set, and a test set. The sets are equally distributed over the pages, because it is not desirable to train a recognizer on pages containing poor quality data and test it on pages containing good quality data, or vice versa. The training set contains 2,237 text lines (~50%), the validation set 912 text lines (~20%), and the test set 1,329 text lines (~30%). Training and validation set together contain 3,980 distinct word classes; the full database contains 4,937.

The recognition system uses an HMM for each of the characters in the alphabet. These HMMs have a linear topology and were trained on the training set. Each HMM uses 16 hidden states, including the non-emitting start and end states. The parameter optimization on the validation set includes the number of Gaussian mixture components, the grammar scale factor, and the word insertion penalty. Four training iterations were made for each increase of the number of Gaussian mixture components. This setting has been adopted from [12].

In order to measure the influence of the vocabulary size and the language model, several versions of the recognition system were created and compared with each other. These systems differ in the source of their vocabulary and in the corpus from which their language model is generated. All language models in this paper are bigram language models. As the basic performance measure, the word accuracy was used, which is defined as

100 \cdot \left( 1 - \frac{\#\text{insertions} + \#\text{substitutions} + \#\text{deletions}}{\text{total length of the ground truth}} \right)
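As a minimal sketch, the word accuracy can be computed from a word-level Levenshtein alignment, whose edit distance is exactly the sum of insertions, substitutions, and deletions in the formula above. Whitespace-separated word strings and a non-empty reference are assumed.

```python
def word_accuracy(reference, hypothesis):
    """Word accuracy (%) = 100 * (1 - (ins + sub + del) / len(reference))."""
    ref, hyp = reference.split(), hypothesis.split()
    # d[i][j]: minimal number of edits between ref[:i] and hyp[:j]
    d = [[0] * (len(hyp) + 1) for _ in range(len(ref) + 1)]
    for i in range(len(ref) + 1):
        d[i][0] = i
    for j in range(len(hyp) + 1):
        d[0][j] = j
    for i in range(1, len(ref) + 1):
        for j in range(1, len(hyp) + 1):
            cost = 0 if ref[i - 1] == hyp[j - 1] else 1
            d[i][j] = min(d[i - 1][j] + 1,         # deletion
                          d[i][j - 1] + 1,         # insertion
                          d[i - 1][j - 1] + cost)  # substitution or match
    return 100.0 * (1.0 - d[len(ref)][len(hyp)] / len(ref))
```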
The following systems were initially tested. System A uses the vocabulary and the language model created from the training set. System B uses the vocabulary and the language model created from the union of the training and the validation set. System C is built from the training, validation, and test set: its vocabulary is built from all three sets, while its language model is created from the training and the validation set only. It can be argued that the test set should not be touched for the creation of a recognition system. However, the purpose of System C is to provide an upper limit of the recognition rate by incorporating all words occurring in the test set. All three systems were tested with different sizes of their vocabulary, including the 1,000, 2,000, and 3,000 most frequent words. Additionally, a version of each of the systems B and C with the full vocabulary of the training and validation set (3,980 words), and a version of System C with the full vocabulary of the training, validation, and test set (4,937 words) were tested.

System | 1,000 | 2,000 | 3,000 | 3,980 | 4,937
A      | 63.52 | 66.79 | 69.94 |   -   |   -
B      | 63.83 | 66.79 | 68.59 | 70.73 |   -
B*     | 64.15 | 68.50 | 70.36 | 73.00 |   -
C      | 65.39 | 70.71 | 73.89 | 77.34 | 81.50

Table 1. Word accuracy (%) on the test set depending on different sizes of the vocabulary. For vocabulary sizes of 2,000 and more, System B* (introduced below) performs statistically significantly better than System B.

The results of the systems on the test set are summarized in Table 1. They show that System A and System B perform similarly. Obviously, System C performs best because, in contrast with Systems A and B, its vocabulary contains all words occurring in the test set.

[Figure 2. Influence of the grammar scale factor on the word accuracy on the validation set for the different systems (word accuracy vs. GSF for Systems A, B, and C).]

Figure 2 shows the influence of the grammar scale factor on the word accuracy for the validation set. The optimal grammar scale factor lies between 50 and 60 for the systems B and C. The optimum for System A is 30, which is considerably smaller than for the other two systems. This shows that the grammar scale factor has an important influence on the word accuracy.

From Table 1 and Figure 2 we conclude that System A and System B perform similarly on the test set, whereas System B and System C perform similarly on the validation set. System A has the worst performance on the validation set. The reason is that the language model and the vocabulary of Systems B and C were built from the validation set itself. Furthermore, the influence of the grammar scale factor is different for System A. This leads to the assumption that Systems B and C suffer from an overfitting effect on the validation set. Therefore, a new system (System B*) was created and tested.

System B* uses the same vocabulary and language model as System B. However, instead of using the parameter values (grammar scale factor, word insertion penalty, and number of Gaussian mixture components) optimized on the validation set, System B* uses the optimized values from System A. This strategy is motivated by the fact that the language model of System A has been generated on the data of the training set only; the data for optimizing the language model parameters therefore come from a different set, preventing the system from overfitting. A comparison of Systems B and B* is given in Table 1 and Figure 3. System B* performs statistically significantly better than System B for vocabulary sizes of 2,000 and more. It can be concluded that the overfitting assumption is true and that the described approach can be used to overcome this problem. A closer look at Figure 3 furthermore reveals that the relative improvement increases with the vocabulary size. This is an interesting observation, because we are facing an open vocabulary recognition task; larger vocabulary sizes may lead to an even better performance.

[Figure 3. Results on the test set for System B and System B* (word accuracy vs. vocabulary size).]

6. Conclusion and Outlook

In this paper we address the problem of recognizing medieval manuscripts from the 13th century written in Middle High German. We apply an HMM based recognition system which was originally developed for modern handwritings. The best word accuracy on the test set is 73.00% with an open vocabulary. This is a very promising result for the difficult task of automatically generating textual transcriptions of historical documents. All experiments were performed in a writer independent fashion. Therefore, it can be expected that the system can be adapted to other historical documents with minimum effort.

Creating recognition systems for medieval languages is quite difficult, because only few resources are available. This affects almost every part of the recognition system. First, the amount of training data is very limited, which makes it hard to train a good recognizer. Second, the sources for creating a suitable language model are restricted as well. Furthermore, the complexity is increased by the poor quality of the data.

In modern handwriting recognition, language models are built from large independent text corpora. This is in most cases not feasible for old languages, because such corpora do not exist, and one is forced to use the ground truth of the available manuscripts for the creation of the language models. In a straightforward approach, all available data, i.e., the training and the validation set, would be used to create a language model. However, optimizing the language model parameters (GSF, WIP) on the validation set is then rather delicate and can lead to overfitting. Therefore, we propose to use only part of the available data, i.e., the training set, to create an initial language model and to optimize the language model parameters on the independent validation set. In a second step, the language model is generated from all available data and is used in the recognition system with the optimized parameters from the first step. In doing so, the language model takes all available data into account without suffering from overfitting. In our experiments, a significant increase of the word accuracy was observed when applying this approach.
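The proposed two-step strategy can be summarized as a short protocol sketch. The helper callables build_lm, tune, and evaluate are hypothetical placeholders for the corresponding stages of the system (language model estimation, GSF/WIP optimization on the validation set, and decoding with a given configuration); train and valid are assumed to be lists of transcription lines.

```python
def two_step_lm_setup(train, valid, build_lm, tune, evaluate):
    """Two-step language model setup that avoids overfitting:
    1) build an LM from the training set only and tune GSF/WIP on the
       independent validation set;
    2) rebuild the LM from all available data and reuse the tuned values."""
    initial_lm = build_lm(train)         # step 1: training data only
    gsf, wip = tune(initial_lm, valid)   # parameters tuned on independent data
    final_lm = build_lm(train + valid)   # step 2: all available data
    return evaluate(final_lm, gsf, wip)  # tuned parameters are NOT re-optimized
```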
We are aware of other approaches that determine the language model parameters using the statistics of the data [9]. However, in our approach we directly use the resulting word accuracy for optimization instead of estimating the possible recognition rate. This has been shown to be efficient in previous work as well.

There are still many open issues concerning medieval manuscripts that need to be investigated, e.g., the automatic extraction and recognition of the initial letters, the recognition of abbreviations, experiments on larger data sets, or the use of other recognizers (see for example [11]).

7. Acknowledgment

This work has been supported by the Swiss National Center of Competence in Research (NCCR) under the project Interactive Multimodal Information Management (IM2).

References

[1] Second International Workshop on Document Image Analysis for Libraries (DIAL 2006), 27-28 April 2006, Lyon, France. IEEE Computer Society, 2006.
[2] International Journal on Document Analysis and Recognition, Special Issue on the Analysis of Historical Documents, volume 9. Springer, 2007.
[3] F. L. Bourgeois and H. Emptoz. DEBORA: Digital access to books of the Renaissance. International Journal on Document Analysis and Recognition (IJDAR), 9(2-4):193–221, 2007.
[4] S. F. Chen and J. Goodman. An empirical study of smoothing techniques for language modeling. Technical report, Computer Science Group, Harvard University, 1998.
[5] M. Davies. The Corpus of Contemporary American English (COCA): 385 million words, 1990-present. Available online at http://www.americancorpus.org, 2008.
[6] A. Downton, J. He, and S. Lucas. User-configurable OCR enhancement for online natural history archives. International Journal on Document Analysis and Recognition (IJDAR), 9(2):263–279, 2007.
[7] M. Feldbach and K. Tönnies. Word segmentation of handwritten dates in historical documents by combining semantic a-priori-knowledge with local features. In Proc. 7th Int. Conf. on Document Analysis and Recognition, Edinburgh, Scotland, volume 1, pages 333–337, 2003.
[8] S. Feng, N. Howe, and R. Manmatha. A hidden Markov model for alphabet-soup word recognition. In Proc. 11th Int. Conference on Frontiers in Handwriting Recognition, 2008.
[9] G. Fink. Markov Models for Pattern Recognition: From Theory to Applications. Springer, Berlin, 2007.
[10] W. Francis and H. Kucera. Manual of Information to Accompany a Standard Sample of Present-Day Edited American English for Use with Digital Computers. Department of Linguistics, Brown University, 1979.
[11] A. Graves, S. Fernandez, J. Schmidhuber, M. Liwicki, and H. Bunke. Unconstrained online handwriting recognition with recurrent neural networks. In Advances in Neural Information Processing Systems 21, 2007.
[12] S. Günter and H. Bunke. HMM-based handwritten word recognition: on the optimization of the number of states, training iterations and Gaussian components. Pattern Recognition, 37:2069–2079, 2004.
[13] S. Impedovo, P. Wang, and H. Bunke. Automatic Bankcheck Processing. World Scientific, 1997.
[14] R. Kneser and H. Ney. Improved backing-off for m-gram language modeling. In Proc. International Conference on Acoustics, Speech, and Signal Processing, volume 1, pages 181–184, 1995.
[15] L. Likforman-Sulem, A. Zahour, and B. Taconet. Text line segmentation of historical documents: a survey. International Journal on Document Analysis and Recognition (IJDAR), 9(2-4):123–138, 2007.
[16] M. Liwicki, E. Indermühle, and H. Bunke. On-line handwritten text line detection using dynamic programming. In Proc. 9th Int. Conf. on Document Analysis and Recognition, pages 447–451, 2007.
[17] R. Manmatha and J. L. Rothfeder. A scale space approach for automatically segmenting words from historical handwritten documents. IEEE Trans. on Pattern Analysis and Machine Intelligence, 27(8):1212–1225, 2005.
[18] U.-V. Marti and H. Bunke. Using a statistical language model to improve the performance of an HMM-based cursive handwriting recognition system. International Journal of Pattern Recognition and Artificial Intelligence, 15:65–90, 2001.
[19] K. Ntzios, B. Gatos, I. Pratikakis, T. Konidaris, and S. J. Perantonis. An old Greek handwritten OCR system based on an efficient segmentation-free approach. International Journal on Document Analysis and Recognition (IJDAR), 9(2):179–192, 2007.
[20] R. Plamondon and S. Srihari. Online and off-line handwriting recognition: A comprehensive survey. IEEE Trans. Pattern Analysis and Machine Intelligence, 22:63–84, 2000.
[21] T. M. Rath and R. Manmatha. Word spotting for historical documents. International Journal on Document Analysis and Recognition (IJDAR), 9(2-4):139–152, 2007.
[22] R. Rosenfeld. Two decades of statistical language modeling: where do we go from here? Proceedings of the IEEE, 88:1270–1278, 2000.
[23] S. Srihari, Y. Shin, V. Ramanaprasad, and D. Lee. A system to read names and addresses on tax forms. Proceedings of the IEEE, 84(7):1038–1049, 1996.
[24] A. Vinciarelli, S. Bengio, and H. Bunke. Offline recognition of unconstrained handwritten texts using HMMs and statistical language models. IEEE Trans. Pattern Analysis and Machine Intelligence, 26:709–720, 2003.
[25] M. Wüthrich. Automatic Recognition of Medieval German Manuscripts. Master's thesis, University of Bern, Switzerland, 2008.