
1 Introduction

In forensic science, graphoscopy covers, among other aspects of handwriting examination, the determination of the authorship of handwritten documents, that is, the association of a questioned text with a specific writer. Its complexity lies in the subjectivity with which the expert establishes measures of similarity between the writing patterns found in a writer's reference models and those in the questioned document, allegedly produced by the same writer.

From a computational point of view, establishing the authorship of handwritten documents is a major challenge, as transposing graphometric features into a computational model can be highly complex.

In recent decades, computational solutions have been presented in the literature. These solutions can be classified into two basic segmentation approaches: contextual and non-contextual [2]. Contextual approaches generally use segmentation processes based on the text's content [3, 4, 7, 8]; segmentation of the text into lines, words, letters, and letter segments is the most commonly used approach. The results generated by this segmentation are subsequently subjected to different feature extraction techniques, which may or may not be associated with handwriting features [1, 2]. Approaches that use contextual models have high complexity and may not accommodate all writing styles, owing to the high intrapersonal variability and the graphical anomalies presented by some writers. For example, there are cases where the contents of adjacent lines overlap (model behavior; Fig. 1(a)), where the slant is pronounced and occurs irregularly (baseline behavior; Fig. 1(b)), or where the text is unreadable, preventing the identification of the connection points between letters (moments) and of the spacing between words (spacing; Fig. 1(c)). Because of these factors, the computational solutions usually presented are based on manual or semi-automatic segmentation.

Fig. 1. (a) Text with overlapping content; (b) Text with irregular alignment; (c) Illegible text.

Non-contextual approaches are based on the morphology of the extracted information rather than its meaning. Therefore, these approaches do not take into account the significance of the information obtained, such as letters and words; in some cases, they seek only stroke segments [1, 2]. Their advantage over contextual approaches is the simplicity of the segmentation and feature extraction techniques, which enables the use of computational applications for forensic purposes.

In both contextual and non-contextual computational approaches, the analysis of the texture obtained from the handwritten document has proven to be a promising alternative. This is due in part to the reduced complexity of the feature extraction process.

Non-contextual methods offer simplicity and efficiency in computer applications. They allow the set of documents under analysis to contain manuscripts with different contents and even manuscripts written in different languages.

Based on the generation of textures, various classification and feature extraction methods have been proposed in the literature; the most popular among them are Gabor filters, wavelets [9], the GLCM (Gray-Level Co-occurrence Matrix) [2], LBP (Local Binary Patterns), and LPQ (Local Phase Quantization) [1].

This paper presents a non-contextual approach for determining the authorship of handwritten documents. The segmentation method used was proposed by Hanusiak [2], and the features were obtained from two distinct classes of texture descriptors: statistical (the GLCM [2]) and model-based (fractal [10]). For classification, a protocol based on an SVM (Support Vector Machine) classifier [1, 2, 7] was used, with dissimilarity measures [1, 2, 10] as metrics. The results, around 97.8 %, obtained with a reduced number of features and a small number of reference samples per writer, are promising.

2 Graphometric Expertise in Handwritten Documents

Given a set of reference manuscripts K (of known authorship) and a questioned manuscript Q (of unknown authorship), the result of the expert analysis is based on the disclosure of a minimum set of graphometric features and the respective degrees of similarity found. The authorship of the questioned document is determined on the basis of the similarities observed when reviewing these features. This set consists of a large group of features, such as attacks, pressure, and endings [2], as listed in Table 1.

Table 1. Genetic and generic attributes.

Graphometric features are divided into two major classes, generic and genetic features of writing [1]. Generic features have elements related to the general shape of the text. Genetic features represent the genesis of the author’s writing, presenting unconsciously produced aspects of the stroke. These are difficult for the writer to hide and for others to reproduce. In an expert examination, each feature can be classified into minimum, medium, or maximum convergence or divergence.

One aspect relevant to the expert analysis process is the set of documents used as models or references, \( K_{1 \ldots n} \), n being the number of reference documents. The number of these documents should be sufficient to allow an accurate analysis of the similarities between the features of the \( K_i \) models and the questioned document Q. The expert observes a set of graphometric features in the n reference samples and looks for them in the questioned document. At the end of the analysis, the expert provides a similarity index between the questioned and reference documents and subsequently makes a decision. The resulting expert report depends on the combination of the results obtained from the individual comparisons of document pairs (reference/questioned).

2.1 Textures and Graphometric Features

One of the texture's most important properties is the preservation of the dynamic behavior of the writer's stroke. Table 2 shows that generic and genetic attributes can be closely observed in a texture image generated from a manuscript. Therefore, in this study, we seek to associate the observable properties of the texture with graphometric features. Thus, two different approaches were used to obtain texture descriptors, namely, model-based and statistical.

Table 2. Relationship between graphometric features and the observable behavior of compressed blocks of handwritten text in texture form.

2.2 GLCM and Haralick’s Descriptors

The descriptors proposed by Haralick [14] are statistical descriptors whose properties seek to model irregular texture patterns. The set consists of 13 descriptors that statistically determine the distribution properties of, and the relationships between, the gray levels contained in a texture. The descriptors are calculated from the GLCM obtained from the image of the texture under analysis.

Works based on the GLCM have demonstrated that, among Haralick's descriptors [14], entropy stands out in approaches whose goal is to determine authorship. Entropy describes the randomness of the pixel distribution in the texture. In manuscripts, this randomness is closely related to the writer's habits during the writing process.

Entropy E can be defined by the following equation:

$$ E = - \sum_{i = 0}^{n} \sum_{j = 0}^{m} p\left( i,j \right) \cdot \log \left( p\left( i,j \right) \right), $$
(1)

where p(i, j) denotes the matrix generated by the GLCM, and n and m represent the maximum dimensions of the matrix.
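As an illustration, the following minimal sketch computes Eq. (1) for a single GLCM offset; the helpers glcm and entropy are ours, written for clarity rather than efficiency.

```python
import numpy as np

def glcm(img, dx, dy, levels):
    # Co-occurrence counts for the offset (dx, dy), normalized to probabilities.
    p = np.zeros((levels, levels), dtype=np.float64)
    rows, cols = img.shape
    for i in range(rows):
        for j in range(cols):
            i2, j2 = i + dy, j + dx
            if 0 <= i2 < rows and 0 <= j2 < cols:
                p[img[i, j], img[i2, j2]] += 1
    return p / p.sum() if p.sum() > 0 else p

def entropy(p):
    # Eq. (1); terms with p(i, j) = 0 contribute nothing and are skipped.
    nonzero = p[p > 0]
    return -np.sum(nonzero * np.log(nonzero))

# toy 4-level fragment, distance 1 in the 0-degree direction
img = np.array([[0, 1, 1, 3],
                [2, 1, 0, 3],
                [3, 2, 1, 0]])
print(entropy(glcm(img, dx=1, dy=0, levels=4)))
```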

2.3 Fractal Geometry

Fractals are elementary geometric shapes, whose pattern replicates indefinitely, creating complex objects that preserve, in each of their parts, features of the whole. Two of the most important properties of fractals are self-similarity and non-integer dimension [13]. Self-similarity is evident when a portion of an object can be seen as a replica of the whole, on a smaller scale. The non-integer dimension is equivalent to modeling phenomena only representable in geometric scales in the range between integer values.

The property that links fractal geometry to graphometric features is the possibility of mapping the structural behavior of the writer's stroke, such as its slant.

The size S at different scales can be defined as follows:

$$ S = L^{D} , $$
(2)

where L denotes the linear scale and D represents the fractal dimension.

The fractal dimension D can thus be obtained from the resulting size S and the linear scale L as follows:

$$ D = \frac{\log \left( S \right)}{\log \left( L \right)} $$
(3)
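For example, the Koch curve, in which each segment is replaced by S = 4 self-similar copies at linear scale L = 3, has the non-integer dimension

$$ D = \frac{\log \left( 4 \right)}{\log \left( 3 \right)} \approx 1.26. $$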

Samarabandu [11] presents a fractal model based on mathematical morphology. Developed by Serra [12], this model uses a series of morphological transformations to analyze an image. The method proposed by Samarabandu was chosen because of its simple implementation and performance compared with other techniques proposed in the literature.

Consider a grayscale image X as a surface in three dimensions, whose height represents the gray level at each point. The surface area at different scales is estimated using a number of dilations of this surface by a structuring element Y, whose size determines the scale ρ. Thus, one can define the surface area of a compact set X with respect to a structuring element Y, symmetric about its origin, by the following function:

$$ S\left( X,Y \right) = \lim_{\rho \to 0} \frac{V\left( \partial X \oplus \rho Y \right)}{2\rho }, $$
(4)

where V denotes the volume, ∂X denotes the boundary of set X, and ⊕ represents the dilation of the boundary of X by the structuring element Y at scale ρ.
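The following is a minimal sketch of a blanket-style estimate in the spirit of Eq. (4), using morphological dilations and erosions of the gray-level surface; reading D from the log-log slope and the SciPy-based implementation are our assumptions, not the authors' exact code.

```python
import numpy as np
from scipy import ndimage

def fractal_dimension(gray, max_rho=6):
    # Blanket-style estimate in the spirit of Eq. (4): dilate/erode the
    # gray-level surface with a cross-shaped structuring element, take the
    # blanket volume V(rho), estimate the area A(rho) = V(rho) / (2 * rho),
    # and read D off the slope of log A vs. log rho (D = 2 - slope).
    cross = ndimage.generate_binary_structure(2, 1)  # 3 x 3 cross element
    upper = gray.astype(np.float64)
    lower = gray.astype(np.float64)
    scales, areas = [], []
    for rho in range(1, max_rho + 1):
        upper = ndimage.grey_dilation(upper, footprint=cross)
        lower = ndimage.grey_erosion(lower, footprint=cross)
        volume = np.sum(upper - lower)       # blanket volume V(rho)
        areas.append(volume / (2.0 * rho))   # surface area A(rho), Eq. (4)
        scales.append(rho)
    slope, _ = np.polyfit(np.log(scales), np.log(areas), 1)
    return 2.0 - slope

# toy usage on a random stand-in for a 128 x 128 texture fragment
fragment = (np.random.rand(128, 128) * 255).astype(np.uint8)
print(fractal_dimension(fragment))
```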

3 Materials and Methods

The proposed method is divided into two main stages, namely, the training procedure and the expert examination procedure, as shown in Fig. 2. Each stage is divided into processes that can be repeated in both of them, such as compression of the manuscript image and extraction of texture descriptors, both detailed in the following sections.

Fig. 2. Block diagram of the proposed method.

3.1 Acquisition and Preparation of Manuscript Database

The database chosen for this work was the IAM Handwriting Database [1, 3, 5, 7, 8], an international database widely used in works related to manuscripts.

The abovementioned database includes 1,539 documents from 657 different writers. A sample document from the database is shown in Fig. 3. The number of documents per writer varies between 1 and 59. For this work, the header and the footer were removed, as indicated by the rectangles in Fig. 3.

Fig. 3. Example of a manuscript from the IAM database.

The documents were then submitted to the compression process presented by Hanusiak [2]. The resulting image was fragmented into rectangles of 128 × 128 pixels, as shown in Fig. 4.

Fig. 4. Compressed manuscript image fragmented into 128 × 128 pixel blocks.

Each set of manuscripts by an author generated an average of 24 fragments measuring 128 × 128 pixels. These images were then separated into training and testing databases and used in the process of obtaining the texture descriptors.

For the generation of the model, the writer-independent approach [1, 2] was used. Thus, the feature vectors generated from combinations of two samples of the same writer formed the sample set called "authorship" (\( W_1 \)), whereas vectors generated from combinations of samples of different writers formed the set referred to as "non-authorship" (\( W_2 \)). The writers who participated in training the model were not used in the test. This model has the advantage of not requiring retraining when a new writer is inserted into the writer database. Training sets of 50, 100, 200, and 410 writers were tested.
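This pairing scheme can be sketched as follows (a minimal illustration; build_training_sets is a hypothetical helper, and the component-wise absolute difference anticipates the dissimilarity measure of Sect. 3.3):

```python
import numpy as np
from itertools import combinations

def build_training_sets(features):
    # features: dict mapping writer id -> list of per-fragment feature vectors.
    # Same-writer fragment pairs yield "authorship" (W1) samples; pairs of
    # fragments from different writers yield "non-authorship" (W2) samples.
    w1, w2 = [], []
    writers = list(features)
    for w in writers:
        for a, b in combinations(features[w], 2):
            w1.append(np.abs(np.asarray(a) - np.asarray(b)))  # dissimilarity
    for wa, wb in combinations(writers, 2):
        for a in features[wa]:
            for b in features[wb]:
                w2.append(np.abs(np.asarray(a) - np.asarray(b)))
    return np.array(w1), np.array(w2)
```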

Samples of the test base, consisting of 240 writers, were divided into two sets. The first was allocated to the reference set \( K_{1 \ldots n} \), with n ranging from 3 to 9 fragments. The remaining author samples were allocated to the test, appearing as questioned document Q.

3.2 Extraction of Texture Descriptors

The features used are based on fractal geometry, through the fractal dimension, and on the GLCM, through entropy.

In this stage, the fractal dimension was calculated on the fragments of the compressed manuscript images. The calculation was repeated on the images quantized to different numbers of gray levels, thereby forming the feature vector. Levels 2, 4, 8, 16, 32, 64, 128, and 256 were used, leading to a vector of eight components.

The Samarabandu [11] algorithm was used with ρ = 6 and a cross-shaped structuring element Y of initial size 1. These values were chosen in preliminary tests.
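Combining the two paragraphs above, a minimal sketch of the eight-component fractal feature vector could look as follows (reusing the fractal_dimension sketch from Sect. 2.3; the quantization scheme is our assumption):

```python
import numpy as np

GRAY_LEVELS = [2, 4, 8, 16, 32, 64, 128, 256]

def fractal_feature_vector(gray):
    # Eight components: the fractal dimension of the fragment requantized
    # to each gray-level depth (fractal_dimension as sketched in Sect. 2.3,
    # with max_rho=6 matching the chosen rho).
    vector = []
    for n in GRAY_LEVELS:
        quantized = (gray.astype(np.float64) / 256.0 * n).astype(np.uint8)
        vector.append(fractal_dimension(quantized))
    return np.array(vector)
```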

For entropy, a GLCM of size 2 × 2 was used, applied to the image fragment in black and white. The size of the matrix was chosen on the basis of the results obtained by Hanusiak [2] and Busch [6]. Four directions, θ = 0°, 45°, 90°, 135°, and five distances, d = 1, 2, 3, 4, 5, were used, leading to a vector of m = 20 components.
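A minimal sketch of this 20-component entropy vector, assuming scikit-image (whose graycomatrix was spelled greycomatrix before version 0.19), could look as follows:

```python
import numpy as np
from skimage.feature import graycomatrix

ANGLES = [0, np.pi / 4, np.pi / 2, 3 * np.pi / 4]  # 0, 45, 90, 135 degrees
DISTANCES = [1, 2, 3, 4, 5]

def entropy_feature_vector(bw):
    # bw: binarized fragment with values in {0, 1}, so each GLCM is 2 x 2.
    # One entropy value, Eq. (1), per (distance, angle) pair -> 20 components.
    glcms = graycomatrix(bw, distances=DISTANCES, angles=ANGLES,
                         levels=2, normed=True)
    vector = []
    for d in range(len(DISTANCES)):
        for a in range(len(ANGLES)):
            p = glcms[:, :, d, a]
            nonzero = p[p > 0]  # skip 0 * log(0) terms
            vector.append(-np.sum(nonzero * np.log(nonzero)))
    return np.array(vector)

# toy usage on a random black-and-white stand-in fragment
bw = (np.random.rand(128, 128) > 0.5).astype(np.uint8)
print(entropy_feature_vector(bw).shape)  # (20,)
```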

3.3 Calculation of Dissimilarity

Dissimilarity between feature vectors was proposed by Cha [9] as a way to transform a multi-class problem into a two-class problem. The dissimilarity is obtained from the difference between the features observable in two distinct fragments of text. For the calculation of this difference, a distance measure is used, such as the Euclidean, Cityblock, Chebychev, Correlation, or Spearman distance [1, 2]. For this work, the Euclidean distance \( D_{x,y} \) was chosen, on the basis of preliminary tests.

Let X and Y be two feature vectors of dimensionality m. The dissimilarity vector \( D_{x,y} \) between X and Y can be defined as follows:

$$ \varvec{D}_{x,y} = \bigcup\nolimits_{i = 1}^{m} \left( \sqrt{ \left( x_{i} - y_{i} \right)^{2} } \right) $$
(5)

3.4 Rule of Decision

The concept of similarity establishes that the distance between the measurements of the features of the same writer tends to zero (similarity), while the distance between the measurements of the features of different writers tends away from zero (dissimilarity). Consequently, the feature vector generated by these distances can be used by the SVM for classification purposes: on the basis of the training model, the SVM classifies each dissimilarity vector as authorship or non-authorship.

Each result of the comparison of the questioned manuscript Q with one of the references K is then combined in a fusion process. With more than one reference (K > 1), it is possible to define a decision criterion that takes into account the results of the individual comparison of each reference with the questioned sample; this is the same concept used by the expert. Several decision rules have been proposed in the literature, such as the majority vote, the sum of the results, and the average of the results [1, 2]. In this work, the linear kernel for the SVM and the sum rule for the decision were used, as they presented the best results in the preliminary tests.
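A minimal sketch of this protocol follows, assuming scikit-learn; since the text does not state whether labels or decision values are summed, summing the SVM margins is one plausible reading, and the function names are ours.

```python
import numpy as np
from sklearn.svm import SVC

def train_model(w1, w2):
    # Writer-independent training: dissimilarity vectors labeled 1 for
    # authorship (W1) and 0 for non-authorship (W2).
    X = np.vstack([w1, w2])
    y = np.concatenate([np.ones(len(w1)), np.zeros(len(w2))])
    return SVC(kernel='linear').fit(X, y)

def is_author(clf, questioned, references):
    # Compare the questioned fragment against each of the n references and
    # fuse with the sum rule over the SVM decision margins: a positive sum
    # falls on the authorship side of the hyperplane.
    margins = [clf.decision_function([np.abs(questioned - ref)])[0]
               for ref in references]
    return sum(margins) > 0
```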

4 Results

The following are the results of the tests performed on the IAM database.

The first test (Table 3) aimed to observe the effect of varying the training set, given that this is a writer-independent model. Under these conditions, the results were expected to improve significantly with an increase in the number of training samples in both classes, \( W_1 \) and \( W_2 \). However, the results showed that even when a significantly smaller number of writers was used in training, the results varied by only around 1.5 %. This was attributed in part to the low dimensionality of the feature vector used, which provided stable performance even with small sets of training samples.

Table 3. Results obtained with the proposed method.

The factor that most influenced the results was the number of samples used as references for each writer. The use of K = 3 yielded the least significant result, whereas K = 7 and K = 9 showed the greatest stability across the tested set. This behavior is also observed in analyses conducted by experts: the more reference samples the expert evaluates, the higher is his or her capacity to distinguish writer variability and, therefore, the lower is the likelihood of mistakes.

Comparing the proposed method with others presented in the literature is always approximate, even when the same database is used. Various aspects can influence the results, such as the classifier used, the training and testing protocol, the texture fragment size, the number of references, the feature vector dimensionality, and the text compression process. However, it is possible to establish behavior trends showing how promising the proposed method is. Table 4 presents a comparative view of some authorship analysis methods.

Table 4. Comparison of results obtained with the methods found in the literature and the proposed method.

5 Conclusion and Future Studies

In this paper, we presented an optimized method for establishing the authorship of questioned handwritten documents by using texture descriptors. Two approaches were used to obtain the descriptors: fractal geometry, through the fractal dimension, and Haralick's entropy, computed from the GLCM. The results show the stability of the writer-independent model with respect to the number of writers used for training, the reduced number of features, and the low number of samples, compared with other works.

Two other important aspects are as follows: (1) the texture-based method uses descriptors that have a direct relationship with graphometric features, and (2) the use of a low-dimensional feature vector allows training with a small number of samples without significantly affecting the performance of the method.

The results of 97.7 % obtained by using the proposed method were promising and consistent with those obtained by other studies.

In future studies, the intention is to incorporate other texture descriptors that can describe the curvilinear behavior of strokes, such as curvelets.