1. Introduction
The eyes and their movements can be used as an unobtrusive window into a person’s cognitive and neurological processes: studies show that facial expressions are the major interaction modality in human communication, contributing 55% of the meaning of a message. Robust eye detection and tracking algorithms are therefore crucial for a wide variety of disciplines, including human-computer interaction, medical research, optometry, biometry, marketing research, and automotive design (notably through driver attention monitoring systems).
In today’s context of globalization, security is becoming more and more important. Biometry refers to methods that rely on distinctive, measurable characteristics to describe individuals for the purpose of identification and access control. As opposed to traditional methods of identification (token-based or knowledge-based identification), biometric data is unique to and permanently associated with an individual.
Iris recognition is growing in popularity and is currently used in a broad set of applications. The first step in any iris recognition process is the accurate segmentation of the region of interest: the localization of the inner and outer boundaries of the iris and the removal of eyelids, eyelashes, and any specular reflections that may occlude the region of interest. Next, the iris region is unwrapped into a rubber-sheet model and a unique bit-pattern encoding is computed from these pixels.
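For context, the rubber-sheet model introduced by Daugman remaps each pixel of the iris ring to a dimensionless polar coordinate system $(r, \theta)$, with $r \in [0, 1]$ running from the pupillary to the limbic boundary:

$$ x(r, \theta) = (1 - r)\, x_p(\theta) + r\, x_s(\theta), \qquad y(r, \theta) = (1 - r)\, y_p(\theta) + r\, y_s(\theta), $$

where $(x_p(\theta), y_p(\theta))$ and $(x_s(\theta), y_s(\theta))$ are points on the detected pupil and limbus boundaries, respectively. This normalization compensates for pupil dilation and iris size differences before the bit pattern is encoded.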
Recently, a new ocular biometric technique has received particular interest from the scientific community: sclera recognition [1]. The sclera (the white outer coat of the eye) comprises a unique and stable blood vessel structure which can be analyzed to identify humans. Moreover, this method can complement iris biometry by increasing its performance on off-angle or off-axis images.
The major challenge that ocular biometry poses is the accurate segmentation of the regions of interest (the iris or the sclera). An incorrect segmentation not only reduces the available information, but can also introduce extraneous patterns (eyelashes, eyelids) that influence the biometric applicability of the trait.
This work focuses on the accurate segmentation of the eye: the iris region and the external shape of the eye (the eyelids). The results of this paper are therefore relevant to a variety of domains, including iris and sclera biometrics, psychology, psychiatry, optometry, education, and driver assistance monitoring.
This paper will highlight the following contributions:
A fast and accurate iris localization and segmentation algorithm;
A new method for representing the external shape of the eyes based on six control points used to generate two parabolas for the upper and lower eyelids (see the sketch after this list);
An original sclera segmentation algorithm based on both color and shape constraints (as opposed to other methods from the specialized literature, which are based solely on color information [2,3,4]);
A Monte Carlo segmentation algorithm based on the proposed model and the corresponding matching methodology; and
The annotation [5] of a publicly-available database with the eye positions and eyelid boundaries.
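As an illustration of the eyelid representation, the sketch below fits one parabola through three control points. This is a minimal reading of the model, assuming each eyelid parabola is generated from three of the six control points (for instance, the two eye corners plus the mid-eyelid point):

```python
import numpy as np

def eyelid_parabola(p1, p2, p3):
    """Fit y = a*x^2 + b*x + c through three eyelid control points.

    p1, p2, p3: (x, y) pairs, e.g., the two eye corners and the
    topmost (or bottommost) eyelid point. Returns (a, b, c).
    """
    x = np.array([p1[0], p2[0], p3[0]], dtype=float)
    y = np.array([p1[1], p2[1], p3[1]], dtype=float)
    # Vandermonde system: [x^2  x  1] @ [a, b, c]^T = y
    a, b, c = np.linalg.solve(np.vander(x, 3), y)
    return a, b, c

# Example: upper eyelid from corners (10, 50), (60, 52) and top point (35, 30)
upper = eyelid_parabola((10, 50), (60, 52), (35, 30))
```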
The rest of this paper is organized as follows. Section 2 presents a general overview of the most relevant research literature on eye localization and iris and sclera segmentation. The proposed method is detailed in Section 3. Experimental results on eye localization and segmentation are presented in Section 4. The conclusions and future research directions are presented in Section 5.
2. Related Work
Over the past decades, eye localization and tracking has been one of the most active research areas in the field of computer vision, mainly due to its applicability to a wide variety of research fields. Organizing all eye localization methods into a single, general taxonomy proves to be a challenging task.
Based on the light source, eye detection methods can be classified as active light methods (which rely on infrared or near-infrared light sources) and passive light methods (which use visible light to illuminate the eyes). Active light methods exploit a physical property of the pupil that modifies its appearance in the captured image depending on the position of the IR illuminator, yielding two general types: bright pupil and dark pupil. When the IR illuminator is coaxial with the camera, the pupil appears as a bright zone (bright pupil); when the light source is offset from the optical path, the pupil appears dark (dark pupil). This work uses a passive light method.
In [6] a detailed survey on eye detection and tracking techniques is presented, with emphasis on the challenges they pose, as well as on their importance in a broad range of applications. The authors propose a three-class taxonomy based on the model used to represent the eyes:
shape-based methods,
appearance-based methods, and
hybrid methods.
Shape-based methods use a prior model of the eye geometry and its surrounding texture, together with a similarity measure. Appearance-based methods detect and track eyes based on the color distribution or filter responses of the eye region; these methods require a large amount of training data representing the eyes under different illumination conditions and face poses.
Finally, hybrid methods combine two or more approaches to exploit their benefits and to overcome their drawbacks. As an example, the constrained local models (CLM) framework [7], a recently emerged and promising method for facial feature detection, uses a shape model to constrain the locations where each feature might appear and a patch model to describe the appearance of each feature. In [8] a hybrid method is proposed to extract features of the human face. First, a Viola-Jones face detector is applied to approximate the location of the face in the input image. Next, individual facial feature detectors are evaluated and combined based on shape constraints. Finally, the results are refined using active appearance models tuned for edge and corner cues. A three-stage facial feature detection method is presented in [9]. In the first stage, the face region is localized based on the Hausdorff distance between the edges of the input image and a template of facial edges. The second stage uses a similar, smaller model for the eyes. Finally, the pupil locations are refined using a multi-layer perceptron trained with pupil-centered images.
Other methods localize several features on the face using local landmark descriptors and a predefined face shape model. In [10] the authors propose a new approach to localizing features in facial images. Their method uses local detectors for each feature and combines their output with a set of global models for the part locations, computed from a labeled set of examples. To model the appearance of each feature, a sliding-window detector based on a support vector machine (SVM) classifier with gray-scale scale-invariant feature transform (SIFT) features is used. The work presented in [11] localizes nine facial landmarks in order to investigate the problem of automatic labeling of characters in movies. In the first stage, the position and the approximate scale of the face are detected in the input image. The appearance of each feature is determined using a variant of the AdaBoost classifier and Haar-like image features. Next, a generative model of the feature positions is combined with the appearance score to determine the exact position of each landmark.
The approach presented in our paper uses a fast shape-based approach to locate the centers of the irises. After the accurate iris segmentation, the external shape of the eyes (the eyelids) is segmented based on color information and shape constraints. The entire method takes, on average, 20 ms per processed image.
In general, shape-based eye localization techniques impose a circular shape constraint to detect the iris and the pupil [12,13], making the algorithms suitable only for near-frontal images.
In [12], the authors developed a fast eye center localization method based on image gradients: the center of the circular iris pattern is defined as the point where most of the image gradients intersect. An additional post-processing step, based on prior knowledge of the appearance of the eyes, is applied to increase the robustness of the method on images where the contrast between the sclera and the iris is weak, or where occlusions (hair, eyelashes, glasses, etc.) are present.
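This objective can be written compactly (following [12]) as:

$$ c^{*} = \arg\max_{c} \frac{1}{N} \sum_{i=1}^{N} \left( d_i^{T} g_i \right)^{2}, \qquad d_i = \frac{x_i - c}{\lVert x_i - c \rVert}, $$

where $g_i$ is the normalized image gradient at pixel $x_i$. The dot products are simultaneously maximal only at the center of a circular pattern, where every displacement vector $d_i$ is aligned with the gradient on the iris boundary.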
In [14] a multi-stage circular Hough transform is used to determine the iris center and its radius. Additionally, a similarity measure for selecting the iris center is computed based on the circularity measure of the Hough transform, the distance between the hypothetical iris centers, and the contrast between the presumed iris and its background.
In [15] the eyes are localized using two distinctive features of the iris compared to the surrounding area: the eye region has an unpredictable local intensity, and the iris is darker than the neighboring pixels. The eye centers are selected using a score function based on the entropy of the eye region and the darkness of the iris.
With the development of ocular biometrics, the accurate localization of the iris and pupil area has drawn the attention of computer vision scientists. Iris and pupil segmentation were pioneered by Daugman [13], who proposed a methodology that is still relevant today: an integro-differential operator which uses a circular integral to search for the circular path where the derivative of the integral is maximal. The method searches for the circular contour that generates the maximum change in pixel intensity by varying the center $(c_x, c_y)$ and the radius $r$ of the path. This operator has a high computational complexity (for each possible center, multiple radius scans are necessary), and it has problems detecting the iris boundary when the intensity separability between the iris and the sclera is low.
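For reference, Daugman’s integro-differential operator can be written as:

$$ \max_{(r,\, x_0,\, y_0)} \left| G_\sigma(r) * \frac{\partial}{\partial r} \oint_{r,\, x_0,\, y_0} \frac{I(x, y)}{2 \pi r} \, ds \right|, $$

where $I(x, y)$ is the input image, the contour integral is taken along the circle of radius $r$ centered at $(x_0, y_0)$, and $G_\sigma(r)$ is a Gaussian smoothing kernel of scale $\sigma$.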
Another influential approach [16] uses a two-stage iris segmentation method: first, a gradient-based binary edge map is created; then a circular Hough transform finds the parameters of the iris circle. The binary map is generated so that it favors certain ranges of edge orientations (for example, to delimit the iris-sclera boundary, image derivatives are weighted to be more sensitive to vertical edges). The main disadvantage of this method is its dependence on the threshold values used in the edge map construction phase.
Other shape-based eye detection studies use a more complex model of the eye [17,18] that also includes the eyelids, but these methods are computationally demanding and their performance is strongly linked to the initial position of the template. In [18] the eyes are extracted using a deformable template which consists of a circle describing the iris and two parabolas for the upper and lower eyelids. The template is matched to the input image using energy minimization techniques. A similar approach is used in [17], but the authors use information about the location of the eye corners to initialize the template.
Eye corner detection has gained the attention of several research works, as it is relevant to multiple domains, such as biometrics, assisted driving systems, etc. In [19] the authors propose a new eye corner detection method for periocular images that simulate real-world data. First, the iris and the sclera region are segmented in the input image to determine the region of interest. Next, the eye corners are detected based on multiple features (the response of the Harris corner detector, the internal angle between the two corner candidates, and their relative position in the ROI). The method gives accurate results on degraded data, proving its applicability in real-world conditions.
Recently, sclera segmentation has been shown to have particular importance in the context of unconstrained, visible-wavelength iris and sclera biometrics, and several works have started to address this problem. In addition, sclera segmentation benchmarking competitions [20] were organized to evaluate the recent advances in this field and to attract researchers’ attention to it.
The sclera region is usually segmented based on color information. In [21] the sclera region is roughly estimated by thresholding the saturation channel of the HSI color space; based on this segmentation, the region of interest for the iris is determined and the iris is segmented using a modified version of the circular Hough transform. Finally, the eyelid region that overlaps the iris is segmented using a linear Hough transform.
Other methods [2,3,4] use more complex machine vision algorithms to segment out the sclera. In [4] Bayesian classifiers decide whether pixels belong to the sclera region or to the skin region, using the differences between the red and green, and between the blue and green, channels of the RGB color space. In [2] three types of features are extracted from the training images: color features that capture the various relationships between pixels in different color spaces, Zernike moments, and histogram of oriented gradients (HOG) features (the sclera has significantly fewer edges than other regions around the eyes). A two-stage classifier is trained to segment out the sclera region: the classifiers in the first stage operate on pixels, and the second-stage classifier is a neural network that operates on the probabilities output by the first stage. In [3] two classifiers are used: one for the sclera region and one for the iris region. The features used for the sclera classifier are Zernike moments and distinctive color features from different color spaces. The iris is localized based on three control points from the iris on the summation of two edge maps. Eyelids are segmented only near the iris area, based on Canny edge detection and parabolic curve fitting.
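As a concrete illustration of the color cues used by [4]-style pixel classifiers, the sketch below computes the two channel-difference features; the fixed thresholding rule is a simplification of ours, standing in for the learned Bayesian decision:

```python
import numpy as np

def sclera_color_mask(rgb_image, threshold=20.0):
    """Rough sclera mask from per-pixel channel differences.

    The sclera is nearly achromatic, so both R-G and B-G stay close
    to zero, while skin typically shows a strong positive R-G value.
    The threshold is illustrative only; [4] trains Bayesian
    classifiers on these features instead of thresholding.
    """
    img = rgb_image.astype(np.float32)
    r_g = img[..., 0] - img[..., 1]  # red minus green
    b_g = img[..., 2] - img[..., 1]  # blue minus green
    return (np.abs(r_g) < threshold) & (np.abs(b_g) < threshold)
```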
4. Results and Discussion
4.1. Iris Center Localization
The metric used to validate the performance of the eye center localization is the relative error introduced in [9]: the error obtained by the worse of the two eye estimators, normalized by the distance between the eye centers:

$$ wec = \frac{\max\left(\lVert C_l - \hat{C}_l \rVert, \lVert C_r - \hat{C}_r \rVert\right)}{\lVert C_l - C_r \rVert} $$

where $C_l$, $C_r$ are the positions of the left and right iris centers, and $\hat{C}_l$, $\hat{C}_r$ are the positions of the estimated left and right iris centers.
This metric is independent of the image size. Based on the fact that the distance between the inner eye corners is approximately equal to the width of an eye, the relative error metric has the following properties: if wec ≤ 0.25 the error is at most the distance between the eye center and the eye corners; if wec ≤ 0.10 the localization error is at most the diameter of the iris; and, finally, if wec ≤ 0.05 the error is at most the diameter of the pupil.
In addition, two other metrics were implemented, as suggested in [12]: bec and aec, which define the best (lowest) and the averaged error, respectively:

$$ bec = \frac{\min\left(\lVert C_l - \hat{C}_l \rVert, \lVert C_r - \hat{C}_r \rVert\right)}{\lVert C_l - C_r \rVert}, \qquad aec = \frac{\operatorname{avg}\left(\lVert C_l - \hat{C}_l \rVert, \lVert C_r - \hat{C}_r \rVert\right)}{\lVert C_l - C_r \rVert} $$

where min() and avg() are the minimum and the average operators.
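A minimal sketch of these three metrics, assuming the ground-truth and estimated centers are given as (x, y) pairs:

```python
import numpy as np

def eye_center_errors(c_left, c_right, est_left, est_right):
    """Return (wec, bec, aec): worst, best, and average eye-center
    localization errors, normalized by the inter-ocular distance."""
    c_left = np.asarray(c_left, dtype=float)
    c_right = np.asarray(c_right, dtype=float)
    inter_ocular = np.linalg.norm(c_left - c_right)
    err_l = np.linalg.norm(c_left - np.asarray(est_left, dtype=float))
    err_r = np.linalg.norm(c_right - np.asarray(est_right, dtype=float))
    return (max(err_l, err_r) / inter_ocular,
            min(err_l, err_r) / inter_ocular,
            (err_l + err_r) / 2.0 / inter_ocular)

# Example: wec <= 0.05 means the worse estimate lies within the pupil
wec, bec, aec = eye_center_errors((120, 88), (180, 90), (121, 89), (182, 91))
```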
The proposed iris center localization algorithm does not use any color information; only the sclera segmentation stage relies on color. For comparison purposes, our iris center localization method is evaluated on the BIO-ID face database [9], one of the most challenging eye databases, which has been widely used to validate eye localization methods and therefore allows us to benchmark our algorithm against prior work. The dataset reflects realistic capture conditions, featuring a large range of illumination conditions, backgrounds, and face sizes. It contains 1521 gray-scale images of 23 persons, captured during different sessions under variable illumination conditions. Moreover, some of the subjects wear glasses, and in some images the eyes are (half) closed or occluded by strong specular reflections on the glasses. The resolution of the images is low: 384 × 286 pixels.
Results on the BIO-ID face database are depicted in Figure 13 and the ROC curve is shown in Figure 14.
The comparison of our method with other state-of-the-art papers is shown in Table 1. If the performance for the normalized error ∈ {0.05, 0.10, 0.25} was not mentioned explicitly by the authors, we extracted the values from the performance curves; these values are marked with * in the table.
In the case of pupil localization (wec ≤ 0.05) the proposed method is outperformed only by [12]. In the case of eye localization (wec ≤ 0.25) our method outperforms the other works. However, in the case of wec ≤ 0.10 the proposed algorithm is outperformed by three other state-of-the-art methods [8,12,15]. This is because our eye center localization algorithm relies mostly on circularity constraints, and the BIO-ID face database contains multiple images where the eyes are almost closed, in which case the circularity of the iris cannot be observed and the accuracy of the algorithm is impaired. The transition between the cases wec ≤ 0.05 and wec ≤ 0.25 is smoother because in multiple images of the database the circularity of the iris is not observable due to occlusions and closed eyes. To sum up, the proposed algorithm yields accurate results (wec ≤ 0.05 in 74.65% of the images) for the images where the iris is visible, and acceptable results otherwise.
However, our method was designed for use cases, such as biometry, optometry, or human emotion understanding, in which the face is the main component under analysis and the facial region is of medium to good quality. The BIO-ID face database is not adequate for this purpose due to the low quality of its images; we tested our algorithm on it only to enable comparison with other methods.
The proposed iris localization method, based on the accumulation of first-order derivatives, obtains accurate results if the iris is relatively visible in the input image. The search region for the eyes often contains other elements, such as eyeglasses, eyebrows, hair, etc., and, if the iris is occluded in some way (semi-closed eyes or strong specular reflections on the eyeglasses), these external elements can generate a higher circularity response than the actual iris (Figure 15).
We try to filter out these false candidates by imposing appearance constraints (the pupil center must be darker than the surrounding area, so the symmetry transform image is weighted by the inverted, blurred gray-scale image) together with several geometric constraints: separation of the left and right eye candidates, and penalization of eye candidates that are too close to the eyebrow area. Since we are interested in obtaining a pair of eye centers, we also include a metric that models the confidence of the candidates as a pair: the score of a pair is weighted by a Gaussian function of the inter-pupillary distance normalized by the face width, whose mean is the average ratio between the inter-pupillary distance and the face width [24]. However, this method was designed mainly for application domains (such as optometry, human-computer interaction, etc.) in which the user is cooperative and the eye is visible in the input image, and therefore such extreme occlusions are less likely to occur.
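A minimal sketch of this pair-weighting step; the individual candidate scores, the mean ratio mu, and the spread sigma below are placeholders, not the paper’s trained values:

```python
import math

def pair_confidence(score_left, score_right, ipd, face_width,
                    mu=0.46, sigma=0.08):
    """Weight a candidate eye-pair score by a Gaussian prior on the
    inter-pupillary distance (ipd) normalized by the face width.

    mu and sigma are illustrative; mu should be the average
    IPD-to-face-width ratio measured on annotated data [24].
    """
    ratio = ipd / face_width
    gaussian = math.exp(-0.5 * ((ratio - mu) / sigma) ** 2)
    return score_left * score_right * gaussian
```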
As we will further demonstrate, the performance of the algorithm increases with the quality of the image, while the computational cost remains low (on average, the eye center localization algorithm takes six milliseconds on an Intel Core i7 processor).
To test the proposed method we annotated a publicly-available face database [26], created by the University of Michigan Psychology department; in the rest of the paper we will refer to it as the University of Michigan Face Database (UMFD). The database comprises facial images of 575 individuals with ages ranging from 18 to 93 and is intended to capture the representative features of age groups across the lifespan. The dataset contains pictures of 218 adults aged 18–29, 76 adults aged 30–49, 123 adults aged 50–69, and 158 adults aged 70 and older.
Six points were marked on each eye: the center of the pupil, the two eye corners, the top and bottom eyelid points, and a point on the boundary of the iris (Figure 16). The database annotation data can be accessed from [5]; the structure of the annotation data is detailed in Appendix A.
The ROC curves for the iris center localization on the age groups of the University of Michigan Face Database are illustrated in Figure 17, and Table 2 shows the eye center localization results on this database.
From Table 2 it can be noticed that on medium-quality facial images (640 × 480) the performance of the proposed algorithm increases considerably: in 96.30% of the cases, the worse of the two eye center estimates falls within the pupil area.
The accuracy of the algorithm is lower (93.63%) for older subjects, aged between 70 and 94, because a smaller portion of the iris is visible and its circularity cannot be observed.
In conclusion, the proposed eye localization method proves to be efficient on all the use cases considered: pupil localization (wec ≤ 0.05), iris localization (wec ≤ 0.10), and eye localization (wec ≤ 0.25).
4.2. Iris Radius Computation
To evaluate the accuracy of the iris radius estimation algorithm we compute the following normalized errors:

$$ wer = \frac{\max\left(\lvert r_l - \hat{r}_l \rvert, \lvert r_r - \hat{r}_r \rvert\right)}{\operatorname{avg}(r_l, r_r)}, \qquad aer = \frac{\operatorname{avg}\left(\lvert r_l - \hat{r}_l \rvert, \lvert r_r - \hat{r}_r \rvert\right)}{\operatorname{avg}(r_l, r_r)}, \qquad ber = \frac{\min\left(\lvert r_l - \hat{r}_l \rvert, \lvert r_r - \hat{r}_r \rvert\right)}{\operatorname{avg}(r_l, r_r)} $$

where $r_l$ and $r_r$ are the radii of the left and right irises, and $\hat{r}_l$ and $\hat{r}_r$ are the estimated radii of the left and right irises, respectively. The wer (worst error radius) metric is the radius error of the worse radius estimate, aer (average error radius) is the average radius error over the left and right irises, and ber (best error radius) is the radius error of the better radius estimate. All three are normalized by the average of the ground-truth iris radii.
Table 3 shows the performance of the iris radius computation algorithm on the different age groups from the University of Michigan Face Database.
On average, the normalized aer value is 0.0991; in other words, the average error of the iris radius is less than 10% of the actual radius. Taking into account that the iris radius is approximately 12 pixels in the images of the database, the magnitude of the error is about 1–2 pixels.
4.3. Eye Shape Segmentation
To evaluate the eye shape segmentation algorithm, we used several statistical performance measures obtained by analyzing the proportion of pixels assigned to the eye or non-eye region. We computed the number of true positives (TP), true negatives (TN), false positives (FP), and false negatives (FN) by comparing the results of the algorithm with the ground truth from the test databases. The terms true (T) and false (F) refer to the ground truth, and the terms positive (P) and negative (N) refer to the algorithm’s decision. Based on these values, the following statistical measures were determined:
Sensitivity (or recall) is a measure of the proportion of eye pixels that are correctly identified and indicates the algorithm’s ability to correctly detect the eye region; specificity (or true negative rate) measures the proportion of non-eye pixels that are correctly identified as such and relates to the algorithm’s ability to rule out pixels that do not belong to the eye region. In other words, sensitivity quantifies the algorithm’s ability to avoid false negatives, while specificity quantifies its ability to avoid false positives.
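In standard form:

$$ \text{sensitivity} = \frac{TP}{TP + FN}, \qquad \text{specificity} = \frac{TN}{TN + FP}. $$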
Eye shape segmentation results are illustrated in Figure 18: the detected iris is marked with a green circle and the upper and lower eyelid parabolas are depicted in yellow.
From the results on the University of Michigan Face Database it can be noticed that the accuracy of the algorithm decreases with age. For older subjects the sclera portion of the eye becomes less visible due to eyelid ptosis and excess skin around the eyes; the shape of the eyelid is distorted and can no longer be approximated by a parabola (Figure 19). In addition, sclera color degradation [29] is a well-known effect of aging on the eyes that influences the performance of the algorithm. The loss of performance is about 1%–2% on average.
The results of the eye shape segmentation algorithm are strongly dependent on the accuracy of the eye center localization. For example, in the last image of Figure 18 it can be seen that the performance of the algorithm is impaired by the wrong eye center estimate.
To the best of our knowledge, full eye segmentation methods that we can compare against have not previously been reported in the specialized literature. An older work [17] uses a deformable template to find the full shape of the eye, but only the run time of the algorithm is reported.
The algorithm was also tested on an image database that is independent from our training set, the IMM frontal face database [30], which contains 120 facial images of 12 different subjects. All of the images are annotated with 73 landmarks that define the facial features (eyebrows, nose, and jaws); the contour of each eye is marked with eight points (Figure 20).
Results on the IMM face database are depicted in Figure 21 and the numerical scores are shown in Table 4.
Labeled Face Parts in the Wild (LFPW) [10] is a large, real-world dataset of hand-labeled images acquired from Internet search sites using simple text queries. The images are captured in unconstrained environments and contain several elements that can impair the performance of the algorithm: eyes occluded by heavy shadowing or (sun-)glasses, hats, or hair; heavy make-up; and various (theatrical) facial expressions. The only precondition is that the faces are detectable by a face detector. Each image was annotated with 29 fiducial points by three different workers, and the average of these annotations was used as the ground truth. Due to copyright issues, the image files are not distributed; instead, a list of URLs is provided from which the images can be downloaded. Therefore, not all of the original images are still available, as some of the links have disappeared. We downloaded all the images that were still accessible (576 images) from the original set and evaluated the performance of our method on this dataset.
The mean errors of our algorithm compared to other state-of-the-art works and a commercial off-the-shelf (COTS) system [10] are shown in Table 5.
From Table 5 it can be noticed that our method is comparable with the COTS system and with [11], but [10] is more accurate. However, [10] detects 29 fiducial points on the face and the total processing time for an image is 29 s; our method takes, on average, 20 ms to find all six landmarks around the eyes, making it several orders of magnitude faster than [10].
For the eye center localization, the average normalized error is 0.0426. A normalized error of less than 0.05 implies that the detected iris center is within the pupil area. Therefore, our method performs well even on images that are degraded by the capturing conditions.
For the sclera landmarks, the algorithm yields larger localization errors: on average, the normalized distance between the sclera landmarks and the annotated landmarks is 0.0607. First, we note a difference in semantics between the annotated landmarks and the result of our algorithm: the proposed method is intended to segment the sclera region as accurately as possible, not to detect the positions of the eye corners. While in some cases the sclera landmarks coincide with the exact eye corners, this cannot be generalized. For example, in Figure 22 it can be noticed that the sclera region is correctly segmented, but the distance between the annotated inner corner of the left eye and the corresponding sclera landmark is large; the normalized error between these two points is 0.0708.
In addition, some of the images in the database contain sunglasses that totally obstruct the eyes, and in some other images the sclera is not visible due to the low image resolution and cannot be accurately segmented even by a human operator. The problems targeted by our solution are iris and sclera segmentation; to solve these problems, the features under consideration must be visible in the input image.
Figure 23 shows the results of our method on some images from the LFPW database. Due to the lower resolution of the images, we only draw the landmarks (the eye centers and the landmarks used to generate the eyelid parabolas).
The main application domains targeted by our method are optometry and ophthalmology, augmented reality, and human-computer interaction, where the quality of the images is usually medium to good and the user is cooperative.
The method is integrated into a virtual contact lens simulator application and into a digital optometric application that measures the iris diameter and the segment height (the vertical distance in mm from the bottom of the lens to the beginning of the progressive addition on a progressive lens) from facial images. Snapshots of the virtual contact lens simulator are presented in Figure 24, where different contact lens colors are simulated: dark green, hazel, natural green, and ocean blue, respectively.
The average execution time of the full eye segmentation algorithm (the iris and the eyelids) is 20 ms on an Intel Core i7 processor for 640 × 480 resolution images.
5. Conclusions
This paper presents a fast eye segmentation method that extracts multiple features of the eye region, including the center of the pupil, the iris radius, and the external shape of the eyes. Our work achieves superior accuracy compared to the majority of state-of-the-art methods, which measure only a subset of these features.
The eye features are extracted using a multi-stage algorithm: in the first stage, the iris center is accurately detected using circularity constraints; in the second stage, the external eye shape is extracted based on color and shape information through a Monte Carlo sampling framework.
Compared to other state-of-the-art works, our method extracts the full shape of the eye (iris and full eyelid boundaries), and we consider it sufficiently generic to have applications in a variety of domains: optometry, biometry, eye tracking, and so on.
Extensive experiments were performed to demonstrate the effectiveness of the algorithm. Our experiments show that the accuracy of the method depends on the image resolution: increasing the image quality leads to an increase in accuracy without excessively increasing the computation time.
Future work will include increasing the accuracy of the estimated eye shape using additional measurement cues (such as corner detectors for the eye corners) and tracking of the iris centers. By tracking the iris centers, the detection system will reduce detection failures in the case of illumination changes or when one of the irises is not fully visible. Additionally, we intend to use other curves to represent the eyelid shapes, such as third-degree polynomials, splines, or Bézier curves, in order to increase the performance of the eye shape segmentation.