03 Filter Pas If
03 Filter Pas If
03 Filter Pas If
Q.B. Dang*, M.M. Luqman*, M. Coustaty*, C.D. Trant and 1.M. Ogier*
* L3i Laboratory, University of La Rochelle, France
t College of Information and Conununication Technology, Can Tho University, Vietnam
e-mail: quoc_bao.dang@univ-lrJr
Abstract-In this paper, we propose a new feature vector, of establishing a way of efficiently matching document images.
named Scale and Rotation Invariant Features (SRIF), for real
time camera-based document image retrieval. SRIF is based
on Locally Likely Arrangement Hashing (LLAH), which has A document image retrieval system with camera phones
been widely used and accepted as an efficient real-time camera was proposed by Liu and Doermann in 2007 [11]. They
based document image retrieval method based on text. SRIF
proposed the features called "Layout Context" descriptor. The
is computed based on geometrical constraints between pairs of
"Layout Context" features are extracted from the geometrical
nearest points around a keypoint. It can deal with feature point
location of the words' bounding boxes in a document image.
extraction errors which are introduced as a result of the camera
capturing of documents. The experimental results show that SRIF
Beginning at the center of the word and looking for the most
outperforms LLAH in terms of retrieval accuracy and processing visible n neighbors, the "Layout Context" of a word w is
time. proposed. From the angle of the view, the visibility is defined.
According to the authors, the top n visible neighbors are
I. INT ROD U C T ION AND R EL AT E D WORK
rotation invariant and the percentage of view angles that a
neighbor words occupies are also subjected to rotation. From
Recently, the search and retrieval of document images the center of w, the coordinate system origin is established
has been used in a wide range of applications [1] such as: with X-axis parallel to the baseline of w and the width of w is
word spotting [2], [3], document similarity measurement [2], used to define the unit metric. Under this coordinate system,
document image retrieval using queries by examples [4], logo the coordinates of n most visible neighbors are invariant to
& symbol spotting [5], [6], [7], [8] and retrieving scanned similarity transformation.
documents in digital libraries [9]. In the digital age, the The "Layout Context" descriptor is robust against perspective
explosion of the number of portable digital imaging devices distortion, occlusion, uneven lighting and even crinkled pages.
has created a tremendous opportunity for camera-based The experiment results showed that the system is able to
document image retrieval applications. Users can access a identify even a small patch of the document image, captured
huge amount of content on the Internet and a big challenge is by a camera phone, in a known set of documents. A drawback
to propose some tools to link real documents to those captured of this system is that it is quite slow to find every candidate
with digital devices. For instance, some augmented reality page [11].
tools appear to propose similar contents (e.g. newspapers and
magazine articles) to the users by simply capturing an image
with their smartphones or cameras. In camera-based textual document image retrieval, the
Camera-based document image retrieval can be summarized method called Locally Likely Arrangement Hashing (LLAH)
as searching for the most relevant document images regarding is known as an efficient method with regard to accuracy, time
the user's query that is captured by a digital camera. This and scalability [12], [l 3], [14], [15], [16]. To compute this
task requires us to tackle many different problems [10]: feature, the centroid of words or letters connected components
(i) Images captured with cameras usually have a low are considered as keypoints, and feature vectors are built
resolution. from the local arrangement of invariants computed from
(ii) A camera has far less control of lighting conditions on an the keypoints. Some local invariants are computed (affine
object compared to a flatbed scanner, so uneven lighting is invariant or perspective transformation invariant) from k
due to both the physical environment and the response from coplanar points (e.g. k=4 or k=5) [12], [13]. What is more
the device. important is that the authors proposed an efficient hashing
(iii) Perspective distortion problems can occur as the capture technique, and LLAH has been shown superior to Geometric
device is not parallel to the imaging plane (and to the text). Hashing method concerning computational complexity [12],
(iv) Since digital devices are designed to operate over a [17].
variety of distances, focus becomes a significant factor. At LLAH was evaluated on a dataset of 10000 scientific paper
short distances and large apertures, even slight perspective documents. Query images were captured covering entire pages
changes can cause uneven focus. with a 6.3 megapixels digital camera (CANON EOS 300D).
(v) Lastly, the acquisition of images from a camera generally The results were impressive in terms of accuracy, time and
results in acquiring a subpart of the original image. The scalability [l 3].
retrieval process can be seen as a matching process between However, all the methods presented previously are not able
the digitized image and the original one. This then consists to deal with very small portions of documents captured
by camera, or with pages with insufficient text. In order When k 5, the invariant of perspective transformation
=
to improve these features, Takeda et al. [14] proposed an called cross-ratio from 5 points A,B, C,D,E was defined as
extension of the LLAH feature by adding some additional follows:
features which are based on the rank of k area ratios of the
extracted word regions. In another work, in order to consider S(A,B,C).S(A, D, E)
(2)
the case of the capture of a portion of a document, they also S(A,B,D).S(A,C, E)
proposed to improve the LLAH features by adding additional where S(A,B,C) is the area of a triangle with apexes A, B, C.
features based on rank of areas of words regions [16].
Similarly, Kise et al. [15] improved the LLAH feature by In order to reduce the sensibility of the system to keypoint
using the rank of k areas of letter regions and the query extraction errors, multiple LLAH vectors are computed for
expansion method in order to cope with small document each keypoint. As all the possible combinations of m points
portions captured by camera-pen system [15]. An example among n are examined, (';:)
LLAH vectors have to be built
of the rank of areas of k letter or word regions is an order from each keypoint. As a consequence, the more LLAH vectors
of k values which belong to 1 to k, this order being affine are built, the more processing time and memory consumption
invariant. For example, the order of ranking result with k = 6 the system is required. Thus, nand m need to be suitably
e.g. ( 6, 1, 2, 3, 5, 4 ) means that area of region 6 is the biggest set depending on each system. Experiments presented in [12]
one, area of region 1 is the second biggest one, similarly area showed that the highest accuracy was obtained with affine
of region 4 is the smallest one. invariant (n 8, m = 7), while cross-ratio was the second
=
highest (n 8, m
= 7) and affine invariant (n
= 7, m 6) = =
II. LLAH
List
Hash
In this section, we give an overview of LLAH. LLAH in Table
cludes three main steps which are feature extraction, indexing
phase and retrieval phase.
Fig. I. The hash table structure.
A. Feature extraction
These performances rely on the use of integer feature
LLAH feature extraction can be summarized as follows vectors r, that are discretized and normalized as follows [18]:
[13], [18]. LLAH considers centroid of each word connected
r(i) = trunc(r(i)) * 2 + round(r(i) - trunc(r(i))) (3)
component as keypoints, which can be obtained even under
perspective distortion, noise, and low resolution. A deep
And the Hash function is defined as follows [18]:
description on the method to obtain centroid of each word
connected component can be found in [13]. From each d-l
keypoint P, the n nearest neighbor points around keypoint Hindex =
(2: ri q i) mod Hsize (4)
P are selected and organized clockwise. Then, all possible i=O
combination of m points among n are examined ( m < n). where d is the number of dimensions of vector r, q is the
From one arrangement combination of m points, the LLAH level of quantization constant (e.g. q 17), Hsize is the size
=
602
2015 13th International Conference on Document Analysis and Recognition (ICDAR)
C. Retrieval phase o
o
o
o o
Starting from a query image captured with a camera, o
D
Captured query image Similarly to LLAH, keypoints for SRIF are firstly extracted
from centroids of word connected components (we can defi
D nitely employ centroids of letters as keypoints if needed). Next,
n
from each keypoint P, nearest neighbor points around P are
selected and organized clockwise (e.g. 6). After this, alln
m n
=
Indexing I Retrieval
points, the SRIF vector r is calculated based on a sequence
Computation of indices ¢ of scale and rotation invariants calculated from all possible
combinations of 2 points (constrained to among points. P) m
i),
" "
Finally, each value of the SRIF vector, r ( is computed using
! t:�:=
Hash table either one of two invariant values: eij.Lmaxii or eij.Lminij
as presented in equation (5, 6)
�
IV. EXPERIMENTATION
Fig. 5. Captured video from a document at four regions, the overlap between
and LmaXij =
max(IPAI/IPPjl,IPPjI/IPAI) are scale spotting region results and captured region from a query image.
603
2015 13th International Conference on Document Analysis and Recognition (ICDAR)
matched results, RANSAC is applied to find the best 2 SRIFmax Yes 6 5 68% 0.4
3 SRIFmax No 7 6 84% 0.6
homography transformation between query's keypoints and 4 SRIFmax Yes 7 6 48% 0.4
the corresponding keypoints of each document in top-t result 5 SRIFmin No 6 5 - 0.6
documents. If no best transformation can be found, the 6 SRIFmin Yes 6 5 74% 0.37
7 SRlFmin No 7 6 44% 0.6
number of votes is set to zero. Lastly, the document with
8 SRIFmin Yes 7 6 83% 0.4
majority of votes in top-t result documents is returned as the 9 LLAH No 7 6 21% 0.8
result. A correct retrieval result is validated if it has a correct 10 LLAH Yes 7 6 20% 0.6
To validate the correct region, firstly RANSAC is applied so the best result in accuracy of retrieval (with 84%), the second
that we can obtain the spotting region of query image in the highest accuracy of retrieval was SRIFmin adding features
returned document through perspective transformation. Next, (n = 7, m 6). SRIFmin (n
= 6, m 5) did not work in the
= =
the overlap between the ground truth region (where query case without adding features, and SRIFmax (n 6, m 5) = =
image was captured) and the spotting region is calculated. with only 10 dimensions gave 61% of accuracy result. As we
The frame is considered as a correct retrieval result if the can see, LLAH got very low accuracy retrieval result with our
area of the overlap is more than 60 percent of the area of the dataset. The processing time of all methods was less than a
spotting region otherwise it is considered as an incorrect result. second per query. SRIF was a little faster than LLAH, and the
An example of the overlap region validation is shown in Fig. 5. processing time of SRIFmax (n 6, m 5) was the fastest
= =
one.
The Table II shows the result of videos testing. As we can
In order to compare SRIF and LLAH, we measured the
retrieval accuracy and the average retrieval time. Both methods TABLE I!. T HE EXPERIMENTAL RESULTS EVALUATED BY VIDEOS
were tested with two protocols respectively corresponding to
Videos Retrieval Accuracy
frames evaluation and video evaluation. With frames evalua Method Id Avg. Retrieval Time (s)
0 90 180 Avg
tion, the frames retrieval accuracy corresponds to the precision 1 62% 62% 61% 61.6% 0.54
2 71% 70% 67% 69.3% 0.60
measured by the ratio between the number of correct retrieval
3 86% 85% 84% 85.0% 0.42
frames in the total dataset (73350 images). With videos, we 4 50% 48% 45% 47.7% 0.43
evaluated the retrieval accuracy for each video called videos re 5 0.62
trieval accuracy. For this evaluation, 15 frames were extracted 6 76% 75% 73% 74.7% 0.48
7 44% 44% 43% 43.7% 0.75
from each video, and each frame was rotated by an angle of 0, 8 86% 84% 83% 84.0% 0.54
90, or 180 degrees before going to the retrieval phase. If there 9 20% 19% 17% 18.7% 0.85
are more than half of correct frames, video was considered as 10 19% 17% 16% 17.3% 0.67
604
2015 13th International Conference on Document Analysis and Recognition (ICDAR)
page. REF E R EN C E S
Another reason is that the dataset contains many pages with [1] M . B . Kokare and M . Shirdhonkar, "Document image retrieval: an
overview," Int. f. Computer Applications, vol. 1, no. 7, pp. 114-119,
2010.
[3] M. Rusiiiol and J. LJad6s, "Word and symbol spotting using spatial
organization of local descriptors," in Document Analysis Systems, 2008.
DAS'08. T he Eighth IAPR International Workshop on. IEEE, 2008,
pp. 489-496.
Fig. 6. Insufficient number of text and uneven focus query examples
[4] K. Kunze, K. Tanaka, M. Iwamura, and K. Kise, "Annotate me:
supporting active reading using real-time document image retrieval
on mobile devices," in Proceedings of the 2013 ACM conference on
Pervasive and ubiquitous computing adjunct publication. ACM, 2013,
pp. 231-234.
insufficient number of text areas (Fig.6). Another reason is [7] V. P. Le, M. Visani, C. D. Tran, and J.-M. Ogier, "Improving logo
spotting and matching for document categorization by a post-filter based
uneven lighting and uneven focus problems caused by the
on homography," in Document Analysis and Recognition (ICDAR), 2013
camera. Lastly, by applying a rotation transformation, the 12th International Coriference on. IEEE, 2013, pp. 270-274.
retrieval accuracy slightly decreased approximately from 1% [8] S. Yang, "Symbol recognition via statistical integration of pixel-level
to 3%. This is mainly due to the fact that texts were changed constraint histograms: A new descriptor," IEEE transactions on pattern
by rotation transformation, which caused some keypoint ex analysis and machine intelligence, vol. 27, no. 2, pp. 278-281, 2005.
bers of texts; furthermore SRIF outperformed LLAH without [15] K. Kise, M. Chikano, K. Iwata, M. Iwamura, S. Uchida, and S. Omachi,
"Expansion of queries and databases for improving the retrieval accu
adding any additional features from both the retrieval accuracy
racy of document portions: an application to a camera-pen system,"
point of view, and processing time point of view. in Proceedings of the 9th IAPR International Workshop on Document
In the future, we are going to evaluate our features on other Analysis Systems. ACM, 2010, pp. 309-316.
datasets comprising of multi-lingual document images. We [16] T. Nakai, K. Kise, and M. Iwamura, "Real-time retrieval for images
will also compare our features to other features such as SIFT, of documents in various languages using a web camera," in Document
SURF, ORB, and Shape Context in textual document images. Analysis and Recognition, 2009. ICDAR'09. 10th International Confer
ence on. IEEE, 2009, pp. 146-150.
We are planning to apply SRIF for other systems that use
[17] H. J. Wolfson and I. Rigoutsos, "Geometric hashing: An overview,"
smartphones and/or wearable cameras.
Computing in Science and Engineering, vol. 4, no. 4, pp. 10-21, 1997.
605