A New Look at Filtering Techniques for Illumination Invariance in Automatic Face Recognition

2006, IEEE International Conference on Automatic Face and Gesture Recognition


A New Look at Filtering Techniques for Illumination Invariance in Automatic Face Recognition

Ognjen Arandjelović    Roberto Cipolla
Department of Engineering, University of Cambridge, Cambridge, UK, CB2 1PZ
{oa214,cipolla}@eng.cam.ac.uk

Abstract

In this paper we propose a novel framework for rapid face recognition under varying illumination, based on simple image filtering techniques. The framework is very general: we demonstrate that it offers a dramatic performance improvement to a wide range of filters and different baseline matching algorithms, without sacrificing their online efficiency.

Illumination invariance remains the most researched, yet the most challenging aspect of automatic face recognition. In this paper we propose a novel, general recognition framework for efficient matching of individual face images, sets or sequences. The framework is based on simple image processing filters that compete with unprocessed greyscale input to yield a single matching score between individuals. It is shown how the discrepancy between the illumination conditions of novel input and of the training data set can be estimated and used to weight the contributions of the two competing representations. We describe an extensive empirical evaluation of the proposed method on 171 individuals and over 1300 video sequences with extreme illumination, pose and head motion variation. On this challenging data set our algorithm consistently demonstrated a dramatic performance improvement over traditional filtering approaches. We demonstrate a reduction of 50–75% in recognition error rates, with the best performing method–filter combination correctly recognizing 96% of the individuals.

1. Introduction

In this work we are interested in illumination invariance for automatic face recognition (AFR) and, in particular, in the case when both the training data and the novel data to be matched are image sets or sequences. Invariance to changing lighting is perhaps the most significant practical challenge for AFR. The illumination setup in which recognition is performed is in most cases impractical to control, its physics is difficult to model accurately, and face appearance differences due to changing illumination are often larger than the differences between individuals [1]. Additionally, the nature of most real-world AFR applications is such that a prompt, often real-time, system response is needed, demanding appropriately efficient matching algorithms.

1.1. Previous Work – AFR across Illumination

Two of the most influential approaches to achieving robustness to changing lighting conditions are the illumination cones of Belhumeur et al. [4, 10] and the 3D morphable model of Blanz and Vetter [5]. In [4] the authors showed that the set of images of a convex, Lambertian object, illuminated by an arbitrary number of point light sources at infinity, forms a convex polyhedral cone in image space, with dimension equal to the number of distinct surface normals. In [10], Georghiades et al. successfully used this result for AFR by re-illuminating images of frontal faces. In the 3D morphable model method, the parameters of a complex generative model, which includes the pose, shape and albedo of a face, are recovered in an analysis-by-synthesis fashion.

Both illumination cones and the 3D morphable model have significant shortcomings for practical AFR use. The former approach assumes very accurately registered face images, illuminated from seven to nine different, well-posed directions for each head pose. This is difficult to achieve in practical imaging conditions (see §3 for typical image data quality). The 3D morphable model, on the other hand, requires a (in our case prohibitively) high image resolution [7], struggles with non-Lambertian effects and multiple light sources, has convergence problems in the presence of background clutter and partial occlusion (glasses, facial hair), and is very computationally demanding.

Most relevant to the material presented in this paper are methods that can be broadly described as quasi illumination-invariant image filters. These include high-pass [3] and locally-scaled high-pass filters [17], directional derivatives [1, 7] and edge maps [1], to name a few. They are most commonly based on very simple image formation models, for example modelling illumination as the spatially low-frequency band of the Fourier spectrum and identity-based information as the high-frequency band [3, 8]. Methods of this group can be applied in a straightforward manner to either single- or multiple-image AFR and are often extremely efficient. However, due to the simplistic nature of the underlying models, in general they do not perform well in the presence of extreme illumination changes.

2. Adapting to Data Acquisition Conditions

The framework proposed in this paper is most closely motivated by the findings first reported in [2]. In that work, several AFR algorithms were evaluated on a large database using (i) raw greyscale input, (ii) a high-pass (HP) filter and (iii) the Self-Quotient Image (QI) [17]. Both the high-pass and, even more so, the Self-Quotient Image representations produced an improvement in recognition over raw greyscale for all methods, which is consistent with previous findings in the literature [1, 3, 8, 17]. Of importance to this paper is that it was also examined in which cases these filters help, and by how much, depending on the data acquisition conditions. It was found, consistently over different algorithms, that recognition rates using greyscale and either the HP or the QI filter were negatively correlated (with ρ ≈ −0.7). This is an interesting result: it means that while on average both representations increase the recognition rate, they actually worsen it in "easy" recognition conditions, when no normalization is needed.

The observed phenomenon is well understood in the context of the energy of intrinsic and extrinsic image differences and noise (see [18] for a thorough discussion). Higher than average recognition rates for raw input correspond to small changes in imaging conditions between training and test, and hence to lower energy of extrinsic variation. In this case, the two filters decrease the SNR, worsening the performance. On the other hand, when the imaging conditions between training and test are very different, normalization of extrinsic variation is the dominant factor and performance is improved. This is an important observation: it suggests that the performance of a method that uses either of the representations can be increased further by detecting the difficulty of the recognition conditions. In this paper we propose a novel learning framework to do this.

Figure 1. Distances (0–1) between sets of faces – interpersonal and intrapersonal comparisons are shown respectively as large red and small blue dots (axes: distance computed from filtered input vs. distance computed from raw input). Individuals are poorly separated.
2.1. Adaptive Framework

Our goal is to implicitly learn how similar the novel and the training (or gallery) illumination conditions are, so as to appropriately emphasize either the comparison of raw input or the comparison of its filtered output. Fig. 1 shows the difficulty of this task: different classes (i.e. persons) are not well separated in the space of 2D feature vectors obtained by stacking the raw and filtered similarity scores.

Let {X_1, ..., X_N} be a database of known individuals, X novel input corresponding to one of the gallery classes, and ρ(·) and F(·), respectively, a given similarity function and a quasi illumination-invariant filter. We then express the degree of belief η that two face sets X and X_i belong to the same person as a weighted combination of the similarities between the corresponding unprocessed and filtered image sets:

    η = (1 − α*) ρ(X, X_i) + α* ρ(F(X), F(X_i)).    (1)

In the light of the previous discussion, we want α* to be small (closer to 0.0) when the novel and the corresponding gallery data have been acquired in similar illuminations, and large (closer to 1.0) when they have been acquired in very different ones. We show that α* can be learnt as a function

    α* = α*(µ),    (2)

where µ is the confusion margin – the difference between the similarities of the two X_i most similar to X. We compute an estimate of α*(µ) in a maximum a posteriori sense:

    α*(µ) = arg max_α p(α, µ),    (3)

where p(α, µ) is the probability that α is the optimal value of the mixing coefficient.

2.2. Learning the α-Function

To learn the α-function defined in (3), we first need an estimate of the joint probability density p(α, µ). The main difficulty of this problem is of a practical nature: to obtain an accurate estimate, a prohibitively large training database would be needed. Instead, we propose a heuristic alternative, computed offline from a small training corpus of individuals imaged in different illumination conditions. Our algorithm is based on an iterative, incremental update of the density, initialized as a uniform density over the domain α, µ ∈ [0, 1]. We iteratively simulate matching of an unknown person against a set of gallery individuals; in each iteration of the algorithm, these are randomly drawn from the offline training database. Since the ground-truth identities of all persons in the offline database are known, we can compute the confusion margin µ(α) for each α = k∆α, using the inter-personal similarity score defined in (1). The density p(α, µ) is then incremented at each (k∆α, µ(0)) proportionally to µ(k∆α). The proposed offline learning algorithm is summarized in Fig. 2. A typical estimate of the probability density p(α, µ) is shown in Fig. 3, with the corresponding α-function in Fig. 4.

Input:  training data D(person, illumination), filtered data F(person, illumination),
        similarity function ρ, filter F.
Output: estimate p̂(α, µ).

1: Initialize p̂(α, µ) = 0.
2: For all illuminations i, j and all persons p:
3:   Initial separation:
       δ_0 = min_{q≠p} [ρ(D(p, i), D(q, j)) − ρ(D(p, i), D(p, j))]
4:   For all k = 0, ..., 1/∆α, with α = k∆α:
5:     Separation given α:
       δ(k∆α) = min_{q≠p} [α ρ(F(p, i), F(q, j)) − α ρ(F(p, i), F(p, j))
                           + (1 − α) ρ(D(p, i), D(q, j)) − (1 − α) ρ(D(p, i), D(p, j))]
6:     Update density estimate:
       p̂(k∆α, δ_0) = p̂(k∆α, δ_0) + δ(k∆α)
7: Smooth the output: p̂(α, µ) = p̂(α, µ) ∗ G_{σ=0.05}
8: Normalize to unit integral: p̂(α, µ) = p̂(α, µ) / ∫∫ p̂(α, x) dx dα

Figure 2. Offline training algorithm.
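To make the offline procedure of Fig. 2 and the MAP estimate of eq. (3) concrete, the following is a minimal Python sketch rather than the authors' implementation. It assumes ρ is a set similarity in [0, 1], reads the confusion margin as the gap between the correct match and the best impostor, and uses an illustrative grid resolution; all names (fused_similarity, learn_alpha_function, N_ALPHA, N_MU) are hypothetical.

```python
import numpy as np
from scipy.ndimage import gaussian_filter

N_ALPHA = 51   # grid over the mixing coefficient alpha in [0, 1]
N_MU = 51      # grid over the confusion margin mu in [0, 1]


def fused_similarity(alpha, sim_raw, sim_filt):
    """Eq. (1): weighted combination of raw and filtered similarities."""
    return (1.0 - alpha) * sim_raw + alpha * sim_filt


def learn_alpha_function(sim_raw, sim_filt):
    """Estimate p_hat(alpha, mu) (Fig. 2) and the MAP alpha-function of eq. (3).

    sim_raw[p, i, q, j]  : raw-input set similarity, (person p, illum. i) vs (q, j)
    sim_filt[p, i, q, j] : the same similarity computed on filtered sets
    """
    n_people, n_illum = sim_raw.shape[0], sim_raw.shape[1]
    alphas = np.linspace(0.0, 1.0, N_ALPHA)
    p_hat = np.zeros((N_ALPHA, N_MU))

    for p in range(n_people):
        impostors = [q for q in range(n_people) if q != p]
        for i in range(n_illum):
            for j in range(n_illum):
                # Confusion margin of the unprocessed representation (alpha = 0).
                mu0 = sim_raw[p, i, p, j] - max(sim_raw[p, i, q, j] for q in impostors)
                mu_bin = int(np.clip(mu0, 0.0, 1.0) * (N_MU - 1))
                for k, alpha in enumerate(alphas):
                    # Margin achieved with the fused score of eq. (1).
                    same = fused_similarity(alpha, sim_raw[p, i, p, j],
                                            sim_filt[p, i, p, j])
                    best_wrong = max(
                        fused_similarity(alpha, sim_raw[p, i, q, j],
                                         sim_filt[p, i, q, j])
                        for q in impostors)
                    # Increment the density at (alpha, mu0) in proportion to the
                    # margin obtained with this alpha (steps 5-6 of Fig. 2).
                    p_hat[k, mu_bin] += max(same - best_wrong, 0.0)

    # Step 7: smooth; sigma = 0.05 on the unit square, converted to grid bins.
    p_hat = gaussian_filter(p_hat, sigma=0.05 * (N_ALPHA - 1))
    # Step 8: normalize to unit integral (up to the constant bin area).
    p_hat /= p_hat.sum() + 1e-12
    # Eq. (3): for each mu, pick the alpha maximising the density estimate.
    alpha_star = alphas[np.argmax(p_hat, axis=0)]
    return p_hat, alpha_star   # alpha_star[m] ~ alpha*(mu = m / (N_MU - 1))
```

In this reading, alpha_star plays the role of the raw α*(µ) estimate of Fig. 4 (a); a monotonic, smoothed fit as in Fig. 4 (b) could be obtained from it by isotonic regression or similar post-processing.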
Figure 3. Learnt probability density p(α, µ) (greyscale surface) and a superimposed raw estimate of the α-function (solid red line) for a high-pass filter (axes: mixing parameter α vs. confusion margin µ).

Figure 4. Typical estimates of the α-function plotted against the confusion margin µ: (a) raw α*(µ) estimate, (b) smooth and monotonic α*(µ). The estimate shown was computed using 40 individuals in 5 illumination conditions for a Gaussian high-pass filter. As expected, α* assumes low values for small confusion margins and high values for large confusion margins (see (1)).

3. Empirical Evaluation

Methods in this paper were evaluated on three databases:

• FaceDB100, with 100 individuals of varying age and ethnicity, and equally represented genders. For each person in the database we collected 7 video sequences of the person in arbitrary motion (significant translation, yaw and pitch, negligible roll), each in a different illumination setting, at 10 fps and 320 × 240 pixel resolution (face size ≈ 60 pixels), see Fig. 5 (a) and Fig. 6.

• FaceDB60, kindly provided to us by Toshiba Corp. This database contains 60 individuals of varying age, mostly male Japanese, and 10 sequences per person. Each sequence corresponds to a different illumination setting, at 10 fps and 320 × 240 pixel resolution (face size ≈ 60 pixels), see Fig. 5 (b).

• FaceVideoDB, freely available and described in [11]. Briefly, it contains 11 individuals and 2 sequences per person, with little variation in illumination but extreme and uncontrolled variations in pose and motion, acquired at 25 fps and 160 × 120 pixel resolution (face size ≈ 45 pixels), see Fig. 5 (c).

Figure 5. Frames from typical video sequences from the 3 databases used for evaluation: (a) FaceDB100, (b) FaceDB60, (c) FaceVideoDB.

Figure 6. Different illumination conditions in databases FaceDB100 (a) and FaceDB60 (b).

Data acquisition: The discussion so far focused on recognition using fixed-scale face images. Our system uses a cascaded detector [16] for localization of faces in cluttered images, which are then rescaled to the uniform resolution of 50 × 50 pixels (approximately the average size of detected faces in our data set).

Methods and representations: The proposed framework was evaluated using the following filters (illustrated in Fig. 7):

• Gaussian high-pass filtered images [3, 8] (HP):

    X_H = X − (X ∗ G_{σ=1.5}),    (4)

• local intensity-normalized high-pass filtered images – similar to the Self-Quotient Image [17] (QI):

    X_Q = X_H / (X − X_H),    (5)

  the division being element-wise,

• distance-transformed edge maps [2, 6] (ED):

    X_E = DistTrans(Canny(X)),    (6)

• Laplacian-of-Gaussian filtered images [1] (LG):

    X_L = X ∗ ∇²G_{σ=3},    (7)

• directional grey-scale derivatives [1, 7] (DX, DY):

    X_x = X ∗ ∂G_{σx=6}/∂x,    X_y = X ∗ ∂G_{σy=6}/∂y.    (8)

Figure 7. Face representations evaluated.
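As an illustration only, below is a sketch of the six filters of eqs. (4)–(8) using OpenCV and NumPy, applied to a greyscale face crop (e.g. the 50 × 50 detections described above). The sigmas follow the equations; the Canny thresholds, the epsilon in the division, and the use of Sobel/Laplacian operators are assumptions not specified in the paper.

```python
import cv2
import numpy as np


def high_pass(X, sigma=1.5):
    """HP, eq. (4): X_H = X - (X * G_sigma)."""
    X = X.astype(np.float32)
    return X - cv2.GaussianBlur(X, (0, 0), sigma)


def self_quotient(X, sigma=1.5, eps=1e-3):
    """QI, eq. (5): X_Q = X_H / (X - X_H), element-wise; eps avoids division by zero."""
    X = X.astype(np.float32)
    XH = high_pass(X, sigma)
    return XH / (X - XH + eps)


def edge_distance_map(X, low=50, high=150):
    """ED, eq. (6): distance transform of the Canny edge map (thresholds assumed)."""
    edges = cv2.Canny(X.astype(np.uint8), low, high)
    # Distance of every pixel to the nearest detected edge.
    return cv2.distanceTransform(cv2.bitwise_not(edges), cv2.DIST_L2, 3)


def laplacian_of_gaussian(X, sigma=3.0):
    """LG, eq. (7): Gaussian smoothing followed by the Laplacian operator."""
    smoothed = cv2.GaussianBlur(X.astype(np.float32), (0, 0), sigma)
    return cv2.Laplacian(smoothed, cv2.CV_32F)


def directional_derivatives(X, sigma=6.0):
    """DX/DY, eq. (8): x- and y-derivatives of the Gaussian-smoothed image."""
    smoothed = cv2.GaussianBlur(X.astype(np.float32), (0, 0), sigma)
    return (cv2.Sobel(smoothed, cv2.CV_32F, 1, 0),
            cv2.Sobel(smoothed, cv2.CV_32F, 0, 1))
```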
As a baseline, to establish the difficulty of the evaluation data set, we compared the performance of our recognition algorithm to that of:

• the state-of-the-art commercial system FaceIt® by Identix [12] (the best performing software in the most recent Face Recognition Vendor Test [13]),

• Constrained MSM (CMSM) [9], used in the state-of-the-art commercial system FacePass® [15],

• the Mutual Subspace Method (MSM) [9], and

• the KL divergence-based algorithm of Shakhnarovich et al. (KLD) [14].

In all tests, both the training data for each person in the gallery and the test data consisted of only a single sequence. Offline training of the proposed algorithm was performed using 40 individuals in 5 illuminations from FaceDB100 – we emphasize that these were not used as test input for the evaluations reported in this section.
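Under this protocol, each novel sequence is scored against every gallery individual. For illustration, and under the same assumptions as the learning sketch in Section 2.2, the following minimal sketch shows how the learnt α-function could drive the fusion at test time; set_similarity stands in for whichever baseline set matcher is used (e.g. MSM or CMSM) and, like the other names here, is hypothetical.

```python
import numpy as np


def identify(novel_raw, novel_filt, gallery_raw, gallery_filt,
             set_similarity, alpha_star):
    """Return the index of the best-matching gallery individual.

    gallery_raw / gallery_filt : lists of enrolled face sets (raw and filtered)
    set_similarity(A, B)       : baseline set-to-set similarity in [0, 1]
    alpha_star                 : alpha*(mu) sampled on a uniform grid over [0, 1]
    """
    s_raw = np.array([set_similarity(novel_raw, g) for g in gallery_raw])
    s_filt = np.array([set_similarity(novel_filt, g) for g in gallery_filt])

    # Confusion margin: gap between the two gallery classes most similar to the
    # novel input, measured on the unprocessed representation.
    top_two = np.sort(s_raw)[-2:]
    mu = float(top_two[1] - top_two[0])

    # Look up the offline-learnt mixing coefficient alpha*(mu), eq. (2).
    alpha = alpha_star[int(np.clip(mu, 0.0, 1.0) * (len(alpha_star) - 1))]

    # Eq. (1): fused degree-of-belief score, then pick the most likely identity.
    eta = (1.0 - alpha) * s_raw + alpha * s_filt
    return int(np.argmax(eta))
```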
3.1. Results

We first evaluated the performance of the four established methods used for comparison purposes, using raw greyscale input. A summary of the results is shown in Table 1. Firstly, note the poor performance of the KLD method. KLD can be considered a proxy for gauging the difficulty of the recognition task, seeing that this algorithm can be expected to perform relatively well if the imaging conditions are not greatly different between the training and test data sets [14]. This is further corroborated by observing that even the two best-performing methods, Identix's FaceIt and Toshiba's CMSM, incorrectly recognized about a quarter of the individuals in our database. Interestingly, while performing marginally better, CMSM showed significantly less robustness to the particular data acquisition conditions used, as witnessed by its more than twice higher standard deviation of recognition scores across the training/test combinations used.

Table 1. Recognition rates (mean/STD, %). The Average row weights the three databases by their number of individuals (100, 60 and 11).

                 FaceIt      CMSM        MSM         KLD
  FaceDB100      64.1/9.2    73.6/22.5   58.3/24.3   17.0/8.8
  FaceDB60       81.8/9.6    79.3/18.6   46.6/28.3   23.0/15.7
  FaceVideoDB    91.9        91.9        81.8        59.1
  Average        72.1        76.8        55.7        21.8

Next, we evaluated the performance of CMSM and MSM using each of the 7 face image representations (raw input and the 6 filter outputs). Recognition results for the 3 databases are shown in blue in Fig. 8 (the results on FaceVideoDB are tabulated in Fig. 8 (c) for ease of visualization). Confirming the first premise of this work, as well as previous research findings, all of the filters produced an improvement in average recognition rates. Little interaction between method/filter combinations was found, the Laplacian-of-Gaussian and the horizontal intensity derivative producing the best results and bringing the best and average recognition errors down to 12% and 9% respectively.

In the last set of experiments, we employed each of the 6 filters in the proposed data-adaptive framework. Recognition results for the 3 databases are shown in red in Fig. 8. The proposed method produced a dramatic performance improvement for all filters, reducing the average recognition error rate to only 4% in the case of the CMSM/Laplacian-of-Gaussian combination. This is a very high recognition rate for such unconstrained conditions (see Fig. 5), such a small amount of training data per gallery individual, and the degree of illumination, pose and motion pattern variation between different sequences. An improvement in robustness to illumination changes can also be seen in the significantly reduced standard deviation of the recognition rates. Finally, it should be emphasized that the demonstrated improvement is obtained with a negligible increase in computational cost, as all time-demanding learning is performed offline.

Figure 8. Error rate statistics: mean and standard deviation of the error rate (%) for MSM, MSM-AD, CMSM and CMSM-AD over the representations RW, HP, QI, ED, LG, DX and DY, on (a) FaceDB100 and (b) FaceDB60; panel (c), FaceVideoDB mean error (%), is tabulated below. The proposed framework (-AD suffix) dramatically improved recognition performance for all method/filter combinations, as witnessed by the reduction in both the error rate averages and their standard deviations.

  (c) FaceVideoDB, mean error (%):

              RW     HP     QI     ED     LG     DX     DY
  MSM         0.00   0.00   0.00   0.00   9.09   0.00   0.00
  MSM-AD      0.00   0.00   0.00   0.00   0.00   0.00   0.00
  CMSM        0.00   9.09   0.00   0.00   0.00   0.00   0.00
  CMSM-AD     0.00   0.00   0.00   0.00   0.00   0.00   0.00

4. Conclusions

We described a novel framework for automatic face recognition in the presence of varying illumination, applicable to matching face sets or sequences, as well as to single-shot recognition. Evaluated on a large, real-world data corpus, the proposed framework was shown to be successful in video-based recognition across a wide range of illumination, pose and face motion pattern changes.

References

[1] Y. Adini, Y. Moses, and S. Ullman. Face recognition: The problem of compensating for changes in illumination direction. PAMI, 19(7), 1997.
[2] O. Arandjelović and R. Cipolla. Face recognition from video using the global shape-illumination manifold. ECCV, 2006.
[3] O. Arandjelović and A. Zisserman. Automatic face recognition for film character retrieval in feature-length films. CVPR, 2005.
[4] P. N. Belhumeur and D. J. Kriegman. What is the set of images of an object under all possible lighting conditions? CVPR, 1996.
[5] V. Blanz and T. Vetter. A morphable model for the synthesis of 3D faces. SIGGRAPH, 1999.
[6] J. Canny. A computational approach to edge detection. PAMI, 8(6), 1986.
[7] M. Everingham and A. Zisserman. Automated person identification in video. CIVR, 2004.
[8] A. Fitzgibbon and A. Zisserman. On affine invariant clustering and automatic cast listing in movies. ECCV, 2002.
[9] K. Fukui and O. Yamaguchi. Face recognition using multi-viewpoint patterns for robot vision. International Symposium of Robotics Research, 2003.
[10] A. S. Georghiades, D. J. Kriegman, and P. N. Belhumeur. Illumination cones for recognition under variable lighting: Faces. CVPR, 1998.
[11] D. O. Gorodnichy. Associative neural networks as means for low-resolution video-based recognition. International Joint Conference on Neural Networks, 2005.
[12] Identix. FaceIt. http://www.FaceIt.com/.
[13] P. J. Phillips, P. Grother, R. J. Micheals, D. M. Blackburn, E. Tabassi, and J. M. Bone. FRVT 2002: Overview and summary. Technical report, National Institute of Justice, 2003.
[14] G. Shakhnarovich, J. W. Fisher, and T. Darrell. Face recognition from long-term observations. ECCV, 2002.
[15] Toshiba. FacePass. www.toshiba.co.jp/mmlab/tech/w31e.htm.
[16] P. Viola and M. Jones. Robust real-time face detection. IJCV, 57(2), 2004.
[17] H. Wang, S. Z. Li, and Y. Wang. Face recognition under varying lighting conditions using self quotient image. AFG, 2004.
[18] X. Wang and X. Tang. Unified subspace analysis for face recognition. ICCV, 2003.