
WO2016050729A1 - Face inpainting using piece-wise affine warping and sparse coding - Google Patents

Face inpainting using piece-wise affine warping and sparse coding

Info

Publication number
WO2016050729A1
Authority
WO
WIPO (PCT)
Prior art keywords
face
module
mask
face image
occlusion
Prior art date
Application number
PCT/EP2015/072354
Other languages
French (fr)
Inventor
Joaquin ZEPEDA SALVATIERRA
Patrick Perez
Xavier BURGOS
Original Assignee
Thomson Licensing
Priority date
Filing date
Publication date
Application filed by Thomson Licensing
Publication of WO2016050729A1

Classifications

    • G PHYSICS
    • G06 COMPUTING; CALCULATING OR COUNTING
    • G06T IMAGE DATA PROCESSING OR GENERATION, IN GENERAL
    • G06T5/00 Image enhancement or restoration
    • G06T5/77 Retouching; Inpainting; Scratch removal
    • G PHYSICS
    • G06 COMPUTING; CALCULATING OR COUNTING
    • G06F ELECTRIC DIGITAL DATA PROCESSING
    • G06F18/00 Pattern recognition
    • G06F18/20 Analysing
    • G06F18/21 Design or setup of recognition systems or techniques; Extraction of features in feature space; Blind source separation
    • G06F18/213 Feature extraction, e.g. by transforming the feature space; Summarisation; Mappings, e.g. subspace methods
    • G06F18/2136 Feature extraction, e.g. by transforming the feature space; Summarisation; Mappings, e.g. subspace methods based on sparsity criteria, e.g. with an overcomplete basis
    • G PHYSICS
    • G06 COMPUTING; CALCULATING OR COUNTING
    • G06F ELECTRIC DIGITAL DATA PROCESSING
    • G06F18/00 Pattern recognition
    • G06F18/20 Analysing
    • G06F18/28 Determining representative reference patterns, e.g. by averaging or distorting; Generating dictionaries
    • G PHYSICS
    • G06 COMPUTING; CALCULATING OR COUNTING
    • G06V IMAGE OR VIDEO RECOGNITION OR UNDERSTANDING
    • G06V40/00 Recognition of biometric, human-related or animal-related patterns in image or video data
    • G06V40/10 Human or animal bodies, e.g. vehicle occupants or pedestrians; Body parts, e.g. hands
    • G06V40/16 Human faces, e.g. facial parts, sketches or expressions
    • G06V40/168 Feature extraction; Face representation
    • G06V40/169 Holistic features and representations, i.e. based on the facial image taken as a whole
    • G PHYSICS
    • G06 COMPUTING; CALCULATING OR COUNTING
    • G06T IMAGE DATA PROCESSING OR GENERATION, IN GENERAL
    • G06T2207/00 Indexing scheme for image analysis or image enhancement
    • G06T2207/20 Special algorithmic details
    • G06T2207/20081 Training; Learning
    • G PHYSICS
    • G06 COMPUTING; CALCULATING OR COUNTING
    • G06T IMAGE DATA PROCESSING OR GENERATION, IN GENERAL
    • G06T2207/00 Indexing scheme for image analysis or image enhancement
    • G06T2207/30 Subject of image; Context of image processing
    • G06T2207/30196 Human being; Person
    • G06T2207/30201 Face


Landscapes

  • Engineering & Computer Science (AREA)
  • Theoretical Computer Science (AREA)
  • Physics & Mathematics (AREA)
  • General Physics & Mathematics (AREA)
  • Computer Vision & Pattern Recognition (AREA)
  • Oral & Maxillofacial Surgery (AREA)
  • Data Mining & Analysis (AREA)
  • Health & Medical Sciences (AREA)
  • Human Computer Interaction (AREA)
  • Multimedia (AREA)
  • General Health & Medical Sciences (AREA)
  • Life Sciences & Earth Sciences (AREA)
  • Artificial Intelligence (AREA)
  • Bioinformatics & Cheminformatics (AREA)
  • Bioinformatics & Computational Biology (AREA)
  • Evolutionary Biology (AREA)
  • Evolutionary Computation (AREA)
  • General Engineering & Computer Science (AREA)
  • Image Processing (AREA)
  • Image Analysis (AREA)

Abstract

A method and apparatus for performing face occlusion removal are described, as shown in Figures 5 and 6, including receiving a face image and an occlusion mask, the occlusion mask indicating missing pixels (505), receiving training images (605), performing face alignment on the received training images and the face image and the occlusion mask (510), receiving a mask (515), receiving a learned dictionary (520) and reconstructing the face image using the mask and the learned dictionary (525).

Description

FACE INPAINTING USING PIECE-WISE AFFINE WARPING AND SPARSE CODING
FIELD OF THE INVENTION
The present invention relates to the reconstruction of lost or deteriorated parts of images or videos and, in particular, to reconstruction of facial expressions of images or videos.
BACKGROUND OF THE INVENTION
This section is intended to introduce the reader to various aspects of art, which may be related to the present embodiments that are described below. This discussion is believed to be helpful in providing the reader with background information to facilitate a better understanding of the various aspects of the present disclosure. Accordingly, it should be understood that these statements are to be read in this light.
Inpainting is the process of reconstructing lost or deteriorated parts of images and videos. In the case of pictures containing human faces, inpainting refers to the process of reconstructing regions of a face that were hidden due to typical occlusions such as sunglasses, hair, etc. From this initial problem formulation one can derive a wide number of similar tasks (detailed below) such as facial transfer, facial hallucination or facial expression transfer, etc.
An aspect of the present disclosure involves applying sparse coding to efficiently recover large regions of a face. Sparse coding methods have been successfully applied to a large number of image processing problems, including denoising, inpainting, compression, classification and face recognition. The aim of sparse coding is to represent each signal vector using a linear combination of a few column vectors, called atoms, from a rectangular matrix called the dictionary. A good dictionary will contain atoms including spatial patterns that occur commonly in natural images. Many off-the-shelf dictionary matrices exist, such as the DCT dictionary, but better results can be obtained by learning the dictionary from a set of training images.
The vast majority of algorithms employing sparse coding for image processing operate by first splitting the image into small image blocks of the same size (e.g., 8x8), and then rasterizing each block to obtain a signal vector (see, for example, the top of Fig. 3). There are two main reasons for this. One reason is that the complexity of sparse coding increases with the size of the signal vector. Yet recent approaches by Rubinstein, Zibulevsky and Elad and also by Zepeda successfully address this complexity issue by structuring the dictionary, making larger dictionaries suitable for large signal vectors tractable. The second reason is that, for generic natural images, spatial patterns become more diverse with increasing signal vector size. Hence, it is more difficult to represent large signal vectors extracted from generic natural images with a small number of atoms, making such vectors ill-suited for sparse coding methods.
There nonetheless exist non-generic image classes that display high spatial dependency even for large block sizes. This is the case for images of faces, particularly when pre-processed to a standard physiognomy and size via piece-wise affine warping. Indeed, two existing approaches exploit this property of face images. The first approach, proposed by Bryt and Elad, deals with compression of face images and employs piecewise-affine warping. Yet Bryt and Elad subsequently apply sparse coding using a standard per-block approach, albeit using per-block learned dictionaries. The second approach, by Wright, Yang, Ganesh, Sastry and Ma, addresses face recognition and does not employ a face warping mechanism. The dictionary in this case includes a concatenation of multiple images of each targeted subject, as opposed to being learned for a reconstruction task (or better yet for the recognition task that the authors address).
Consider a signal vector $y \in \mathbb{R}^d$ that is to be represented using a sparse selection of columns of an over-complete matrix $D \in \mathbb{R}^{d \times N}$. The columns $d_i$ of $D$ are referred to as atoms, and $D$ as the dictionary. A small number $L$ of atoms is selected so that the atoms produce the best approximation error:

$$\min_x \|y - Dx\|_2 \quad \text{s.t.} \quad \|x\|_0 \le L, \qquad (1)$$

where $\|x\|_0$ denotes the number of non-zero coefficients of the vector $x$ or, equivalently, the number of atoms selected. This problem is NP-hard, but standard algorithms exist that obtain approximate solutions using greedy methods or by convexifying the problem through substitution of the $\|x\|_0$ constraint with an additive penalty term $\|x\|_1 = \sum_i |x_i|$, as follows:

$$x^\circ(y, D) = \underset{x}{\arg\min}\; \|y - Dx\|_2^2 + \lambda \|x\|_1. \qquad (2)$$
Given the decomposition $x^\circ$ of the vector $y$, an approximation $\hat{y}$ of $y$ can be obtained using $\hat{y} = D x^\circ$.
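For concreteness, the penalized problem in equation (2) can be approximated with a basic iterative shrinkage-thresholding (ISTA) loop. The NumPy sketch below is illustrative only; the dictionary, penalty weight and iteration count are placeholder choices, not values from the patent.

```python
import numpy as np

def sparse_code_ista(y, D, lam=0.1, n_iter=200):
    """Approximate x°(y, D) = argmin_x ||y - Dx||_2^2 + lam*||x||_1 via ISTA."""
    L = 2.0 * np.linalg.norm(D, ord=2) ** 2      # Lipschitz constant of the gradient
    x = np.zeros(D.shape[1])
    for _ in range(n_iter):
        grad = 2.0 * D.T @ (D @ x - y)           # gradient of ||y - Dx||_2^2
        x = x - grad / L                          # gradient step on data fidelity
        # Soft-thresholding is the proximal operator of the (lam/L)*l1 penalty.
        x = np.sign(x) * np.maximum(np.abs(x) - lam / L, 0.0)
    return x

# Toy usage: a random unit-norm over-complete dictionary and a 2-sparse signal.
rng = np.random.default_rng(0)
D = rng.standard_normal((64, 256))
D /= np.linalg.norm(D, axis=0)                   # unit-norm atoms
x_true = np.zeros(256)
x_true[[3, 40]] = [1.5, -2.0]
y = D @ x_true
x_hat = sparse_code_ista(y, D, lam=0.05)
```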
The dictionary matrix $D$ required in equation (2) needs to be chosen carefully for the task at hand. A good dictionary will contain atoms that represent commonly occurring spatial patterns. Many off-the-shelf dictionary matrices exist, such as the DCT dictionary. But better results can be obtained by learning the dictionary from a training set of vectors $\{y_t \in \mathbb{R}^d\}_t$, such as proposed by Aharon, Elad and Bruckstein and also by Mairal, Bach, Ponce and Sapiro, using the following objective:

$$\underset{D,\{x_t\}}{\arg\min}\; \sum_t \|y_t - D x_t\|_2^2 + \lambda \|x_t\|_1, \quad \|d_i\|_2 = 1 \;\; \forall i. \qquad (3)$$
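As a sketch of what minimizing an objective like (3) can look like in practice, scikit-learn's MiniBatchDictionaryLearning implements a closely related alternating scheme; the random training matrix below merely stands in for rasterized, shape-normalized face vectors.

```python
import numpy as np
from sklearn.decomposition import MiniBatchDictionaryLearning

# Placeholder training set: each row stands in for one rasterized,
# shape-normalized face vector y_t (real data would come from the Fig. 4 pipeline).
rng = np.random.default_rng(0)
Y = rng.standard_normal((500, 64))

learner = MiniBatchDictionaryLearning(
    n_components=128,                 # number of atoms N (over-complete: 128 > 64)
    alpha=1.0,                        # sparsity penalty, the lambda of equation (3)
    transform_algorithm="lasso_lars",
    random_state=0,
)
learner.fit(Y)
D = learner.components_.T             # columns are unit-norm atoms, as in (3)
```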
Inpainting based on sparse coding works as follows: let $\mathcal{A}$ represent the indices of the available pixels of $y$. Letting $y_{\mathcal{A}}$ (respectively $D_{\mathcal{A}}$) denote the sub-vector (sub-matrix) obtained by retaining the coefficients (rows) at positions $\mathcal{A}$, an approximation of the whole image block can be obtained from $D\, x^\circ(y_{\mathcal{A}}, D_{\mathcal{A}})$.
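Reusing the sparse_code_ista sketch above, the masked decomposition just described might be wired up as follows; the mask and dictionary are again illustrative stand-ins.

```python
import numpy as np

def inpaint_block(y_obs, avail, D, lam=0.05):
    """Inpaint one signal: sparse-code the available rows of D, reconstruct all rows.

    y_obs : length-d signal with unreliable values at the missing positions
    avail : boolean mask of length d, True on the available set A
    D     : (d, N) dictionary
    """
    D_A = D[avail, :]                                   # sub-matrix: rows at positions A
    x0 = sparse_code_ista(y_obs[avail], D_A, lam=lam)   # x°(y_A, D_A)
    y_hat = D @ x0                                      # full-length reconstruction D x°
    out = y_obs.copy()
    out[~avail] = y_hat[~avail]                         # fill only the missing pixels
    return out
```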
Estimating the shape of human faces from photos or videos is a widely studied field in computer vision. A goal is to locate the position of a sparse set of pre-defined $P$ 2D key-point landmark locations encoding shape $S$ (commonly including, for example, the corners of the eyes, mouth, and nose):

$$S = [x, y], \quad \text{where } x, y \in \mathbb{R}^P.$$
Early work on shape estimation includes Active Contour Models by Kass, Witkin and Terzopoulos, Template Matching by Yuille, Hallinan and Cohen, Active Shape Models (ASM) by Cootes and Taylor and Active Appearance Models (AAM) by Cootes, Edwards and Taylor. Popular modern approaches such as described by Felzenszwalb, Girshick, McAllester and Ramanan involve first detecting the object parts independently and then estimating shape through flexible parts models. Another family of approaches, by Cao, Wei, Wen and Sun, by Burgos-Artizzu, Perona and Dollar and by Ren, Cao, Wei and Sun, tackles shape estimation as a regression problem, learning regressors that directly predict the object shape or the location of its parts, starting from a raw estimate of its position. These methods are extremely fast and precise, and are able to deal with large amounts of occlusion. An aspect of the present disclosure comprises using the method proposed by Burgos-Artizzu, Perona and Dollar, but any other could equally be used.
Sparse coding has been successfully applied to create face-tailored image compression schemes. An example is the work of Bryt and Elad, which also employs a piecewise-affine warping of the face to normalize physiognomy and size. However, the application targeted by Bryt and Elad is compression, not face inpainting, and their method uses the standard block-by-block rasterization approach, not a whole-image rasterization approach.
The work of Yuille, Hallinan and Cohen also uses sparse coding for face restoration, but in the context of face super-resolution, not the face-inpainting problem. Furthermore, the method of Yuille, Hallinan and Cohen does not consider piecewise-affine face alignment as described in the present disclosure, and the sparse coding stage applied subsequently uses a standard block-by-block sparse-coding approach using dictionaries of approximately 100,000 patches of size 5x5 taken from many face images. In contrast, the principles of the present disclosure involve learning the dictionary for the reconstruction task.
The work of Wright, Nowak and Figueiredo applies sparse-coding using a whole- image rasterization approach, but in the context of face recognition, not face inpainting, and does not employ a piecewise affine warping. In addition, their dictionary matrix is not learned, consisting rather of a concatenation of face examples of the subjects known to the system.
Inpainting parts of an image or video using content from other images/videos has been considered by others, but their methods suppose that the content to be replaced can be found somewhere else in the image/video. That is not the case addressed by the present disclosure, because it cannot be supposed or assumed that the user will have his face fully visible at some point (e.g., by removing his sunglasses or moving away from occluding objects).
Facial expression transfer has been studied with fully unoccluded faces, with the goal of transferring expressions across individuals, or from a video stream into a 3D animated model by estimating 3D facial landmarks. In contrast, the principles of the present disclosure allow recovery of the original expression in large occluded regions of the face.
The work of Jenatton, Obozinski and Bach considers sparse coding and dictionary learning using whole-image rasterization but in a different context. Specifically, the authors impose a constraint that forces their dictionary atoms to be localized in space (e.g., an atom might correspond only to the right eye). Such constraints are contrary to the task of spatial prediction, which requires atoms to model the spatial dependencies in face images. Their approach is tailored for the face recognition application, not face inpainting.

SUMMARY OF THE INVENTION
The proposed method applies sparse coding to inpainting of face images, particularly when large spatial regions of the face are missing. An aspect comprises applying sparse coding to the entire face image following geometrical normalization via piecewise-affine warping. This allows exploitation of subtle spatial dependencies to inpaint in an expression-coherent manner, as it is the case that expressions are manifested in all parts of the face (for example, both the eyes and the mouth take a particular form when one smiles).
The proposed method has a wide range of applications. Examples include recovering full facial expression portrayed by a subject even when large regions of his/her face are hidden, useful for video-conferencing or network social communication. Other examples include security (e.g. , removal of face masks, sunglasses etc.) and video editing (removal of glasses or jewelry, removal of face-covering hair dos). Security is of particular interest to law enforcement and anti-terrorism.
Another example involves virtual removal of head mounted displays (HMDs) such as those produced by Oculus (http://www.oculusvr.com/). A method and apparatus for performing face occlusion removal are described including receiving a face image and an occlusion mask, the occlusion mask indicating missing pixels, receiving training images, performing face alignment on the received training images and the face image and the occlusion mask, receiving a mask, receiving a learned dictionary and reconstructing the face image using the mask and the learned dictionary.
BRIEF DESCRIPTION OF THE DRAWINGS
The present invention is best understood from the following detailed description when read in conjunction with the accompanying drawings. The drawings include the following figures briefly described below:
Fig. 1 is an overview of the proposed approach for expression-aware inpainting through sparse coding.
Fig. 2 is the portion of Fig. 1 that deals with face alignment.
Fig. 3 shows two different image rasterization methods.
Fig. 4 shows the dictionary learning portion of Fig. 1.
Fig. 5 is a flowchart of an exemplary implementation of the proposed method shown in Fig. 1.
Fig. 6 is a flowchart of an exemplary implementation of the offline component of the face alignment step (act) 510 of Fig. 5.
Fig. 7 is a flowchart of an exemplary implementation of the online component of the face alignment step (act) 510 of Fig. 5.
Fig. 8 is a flowchart of an exemplary implementation of the face reconstruction (inpainting) portion of the proposed method.
Fig. 9 is a block diagram of an exemplary apparatus for face occlusion removal.

It should be understood that the drawings are for purposes of illustrating the concepts of the disclosure and are not necessarily the only possible configuration for illustrating the disclosure.
DETAILED DESCRIPTION OF THE PREFERRED EMBODIMENTS

The present description illustrates the principles of the present disclosure. It will thus be appreciated that those skilled in the art will be able to devise various arrangements that, although not explicitly described or shown herein, embody the principles of the disclosure and are included within its scope.
All examples and conditional language recited herein are intended for educational purposes to aid the reader in understanding the principles of the disclosure and the concepts contributed by the inventor to furthering the art, and are to be construed as being without limitation to such specifically recited examples and conditions.
Moreover, all statements herein reciting principles, aspects, and embodiments of the disclosure, as well as specific examples thereof, are intended to encompass both structural and functional equivalents thereof. Additionally, it is intended that such equivalents include both currently known equivalents as well as equivalents developed in the future, i.e., any elements developed that perform the same function, regardless of structure.
Thus, for example, it will be appreciated by those skilled in the art that the block diagrams presented herein represent conceptual views of illustrative circuitry embodying the principles of the disclosure. Similarly, it will be appreciated that any flow charts, flow diagrams, state transition diagrams, pseudo-code, and the like represent various processes which may be substantially represented in computer readable media and so executed by a computer or processor, whether or not such computer or processor is explicitly shown.
The functions of the various elements shown in the figures may be provided through the use of dedicated hardware as well as hardware capable of executing software in association with appropriate software. When provided by a processor, the functions may be provided by a single dedicated processor, by a single shared processor, or by a plurality of individual processors, some of which may be shared. Moreover, explicit use of the term "processor" or "controller" should not be construed to refer exclusively to hardware capable of executing software, and may implicitly include, without limitation, digital signal processor (DSP) hardware, read only memory (ROM) for storing software, random access memory (RAM), and nonvolatile storage. Other hardware, conventional and/or custom, may also be included. Similarly, any switches shown in the figures are conceptual only. Their function may be carried out through the operation of program logic, through dedicated logic, through the interaction of program control and dedicated logic, or even manually, the particular technique being selectable by the implementer as more specifically understood from the context.
In the claims hereof, any element expressed as a means for performing a specified function is intended to encompass any way of performing that function including, for example, a) a combination of circuit elements that performs that function or b) software in any form, including, therefore, firmware, microcode or the like, combined with appropriate circuitry for executing that software to perform the function. The disclosure as defined by such claims resides in the fact that the functionalities provided by the various recited means are combined and brought together in the manner which the claims call for. It is thus regarded that any means that can provide those functionalities are equivalent to those shown herein.
An overview of the proposed approach can be seen in Fig. 1. It is composed of three main steps: 1) face alignment (pre-processing), 2) offline dictionary learning, and 3) face reconstruction through sparse coding. In Fig. 1, the face alignment step is shown in a dark grey box with white letters; that is, face alignment is face-landmark-based warping. The offline dictionary learning is shown by a grey box with black lettering. The remaining boxes are the steps required for face reconstruction through sparse coding (inpainting). Each of the steps is detailed below.
Faces captured in uncontrolled conditions can present a heterogeneity of sizes and positions in the image due to 1) use of different cameras (each with a different field of view, pixel resolution, etc.), 2) the distance of the subject from the camera and 3) the subject's physiognomy.
In order for the proposed approach to be robust to these variations, the proposed method pre-processes images to align the observed face with a standard face, well centered and of a predefined fixed scale. This process is illustrated in Fig. 2. The first step is to estimate the shape of the face $S$, encoded as a sparse set of predefined $P$ 2D key-point landmark locations. This can be achieved using any state-of-the-art algorithm, such as the one proposed by Burgos-Artizzu, Perona and Dollar.
Then, from each training image a shape $S$ is extracted and its associated scale-invariant shape $S'$ is computed by removing size variations:

$$S' = [x', y'] \qquad (4)$$

[the size-normalization formula for $x'$ and $y'$ appears only as an embedded image in the source filing].
Then, the standard face $\bar{S}$ is computed as the average of all $N$ training faces after size normalization:

$$\bar{S} = \frac{1}{N} \sum_{n=1}^{N} S'_n. \qquad (5)$$

Now, given an input image and its estimated face shape $S$, a goal is to warp the current shape $S$ onto the average shape $\bar{S}$. This is achieved by performing a piecewise affine transform, as illustrated on the bottom of Fig. 2.
First, a Delaunay triangulation $DT(\bar{S})$ is computed from the set of $P$ landmark locations in the average shape $\bar{S}$. Then, the average shape is projected onto the current image by performing the inverse of equation (4), and the same Delaunay triangulation is applied to the current shape, yielding $DT(S)$. Finally, every triangle in $DT(S)$ is warped to $DT(\bar{S})$ using an affine transform $A$ such that $A \cdot DT(S) = DT(\bar{S})$ (for which there is a closed-form solution).
As a result, the face is successfully warped onto the average shape, removing variations due to differences in pixel resolutions, camera projections and to different physiognomies.
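The per-triangle affine solve mentioned above has a simple closed form: three source/destination landmark pairs give a 3x3 linear system. Below is a minimal NumPy/SciPy sketch; the landmark arrays, the 68-point shape size and the pixel-warping step are illustrative assumptions, not values from the patent.

```python
import numpy as np
from scipy.spatial import Delaunay

def affine_from_triangles(src_tri, dst_tri):
    """Closed-form 2x3 affine A mapping the 3 points src_tri onto dst_tri."""
    # Homogeneous source coordinates [x, y, 1] per landmark: a 3x3 system.
    src_h = np.hstack([src_tri, np.ones((3, 1))])
    # Solve src_h @ A.T = dst_tri for the 2x3 matrix A.
    return np.linalg.solve(src_h, dst_tri).T

# Placeholder shapes: P landmarks for the current face S and average face S_bar.
rng = np.random.default_rng(0)
S_bar = rng.uniform(0, 100, size=(68, 2))        # average shape (e.g., P = 68)
S = S_bar + rng.normal(scale=3.0, size=S_bar.shape)

tri = Delaunay(S_bar)                            # DT(S_bar), reused for DT(S)
for simplex in tri.simplices:                    # one affine map per triangle
    A = affine_from_triangles(S[simplex], S_bar[simplex])
    # ...apply A to the pixels of the current image that fall in this triangle
```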
Prior to utilization of the system, one needs to train the dictionary used to carry out sparse decompositions. To this end, it is assumed that a training set consisting of a large number of shape-normalized images without occlusion is available. Each shape-normalized image is rasterized using a mask $\mathcal{F}$ indicating the position of face pixels in the normalized image, thus producing the signal vector $y$. The mask $\mathcal{F}$ is computed from the standard face discussed in the previous section, as illustrated on the bottom of Fig. 4.
The resulting set of training vectors $\{y_t\}$ is used to learn a dictionary by minimizing equation (3). The dictionary learning portion of the proposed method is illustrated on the top of Fig. 4.
Given a face image suffering from partial occlusion, the image is first pre-processed using the face alignment method above. Then, let $\mathcal{A}$ denote the indices of available pixels inside the shape-normalized face, and let $\mathcal{M}$ denote the indices of the occluded pixels (for an illustration of these masks, see the bottom of Fig. 4). If the occlusion mask is specified in the image before shape normalization, one only needs to apply the shape normalization function computed in the first step to the occluded pixel positions. The pixels indicated by $\mathcal{A}$ are then concatenated to build the signal vector $y_{\mathcal{A}}$, which is decomposed via sparse coding using a dictionary $D_{\mathcal{A}}$ including only the rows of $D$ corresponding to the available pixels $\mathcal{A}$.
The resulting sparse code vector $x$ is used to obtain an approximation of the pixels in $\mathcal{M}$ using $D_{\mathcal{M}} x$. This estimate is substituted in place of the occlusion in the shape-normalized image, and the composite image is subsequently de-normalized to map it back to the original signal shape.
The occlusion mask $\mathcal{M}$ required in the above proposed method can be manually input by the user. Alternatively, an automatic occlusion detection system that works as follows is proposed: a large training set of two parts is required. The first part consists of shape-normalized images without occlusions. The second part includes occluded shape-normalized images with known $\mathcal{M}$. A feature vector (e.g., the well-known SIFT feature proposed by Lowe) is extracted from each pixel of every image. For each pixel, a binary classifier is learned using the occluded and non-occluded features as a training set. Standard classifier learning algorithms exist, such as the Support Vector Machine (SVM) classifier that has been used extensively in image classification, for example in the work by Chatfield, Lempitsky, Vedaldi and Zisserman.

Fig. 5 is a flowchart of an exemplary implementation of the proposed method shown in Fig. 1. At 505 a face image and occlusion mask $\mathcal{M}$ are accepted (received, input). At 510 face alignment is performed. At 515 mask $\mathcal{A}$ (warped), specifying the positions of the $n < m$ available pixels, is accepted (received, input). At 520 the learned dictionary $D$ is accepted (received, input). At 525 face reconstruction using sparse coding (inpainting) is performed.
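Returning to the automatic occlusion detector described just above: a per-pixel bank of binary classifiers could be prototyped as below. The feature extraction (e.g., dense SIFT) is abstracted away, scikit-learn's LinearSVC stands in for the SVM reference, and all array shapes and names are illustrative assumptions; the sketch also assumes both classes occur at every pixel position in the training data.

```python
import numpy as np
from sklearn.svm import LinearSVC

def train_pixel_classifiers(feats, occluded):
    """One binary occlusion classifier per pixel position.

    feats    : (n_images, n_pixels, feat_dim) dense descriptors, one per pixel
    occluded : (n_images, n_pixels) boolean labels (True = pixel occluded)
    """
    n_pixels = feats.shape[1]
    classifiers = []
    for p in range(n_pixels):
        clf = LinearSVC()                  # linear SVM, per the SVM reference
        clf.fit(feats[:, p, :], occluded[:, p])
        classifiers.append(clf)
    return classifiers

def predict_mask(classifiers, feats_one_image):
    """Predict the occlusion mask M for one shape-normalized image."""
    return np.array([clf.predict(f[None, :])[0]
                     for clf, f in zip(classifiers, feats_one_image)])
```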
There are two components to the face alignment step (act) 510 of Fig. 5. The first component is an offline component. By offline it is meant that the method of the component can be performed ahead of time, offline, on the same or another processor as any other portions of the proposed method of Fig. 5. Fig. 6 is a flowchart of an exemplary implementation of the offline portion of the face alignment step (act) 510 of Fig. 5. At 605 training images with or without occlusion are accepted (received, input). At 610 cascaded regression landmark estimation is performed on the training images. At 615 the average face shape is calculated (determined, computed). At 620 Delaunay triangulation is performed. A Delaunay triangulation for a set P of points in a plane is a triangulation DT(P) such that no point in P is inside the circumcircle of any triangle in DT(P).
The second component of the face alignment step (act) 510 of Fig. 5 is an online component. Fig. 7 is a flowchart of an exemplary implementation of the online component of the face alignment step (act) 510 of Fig. 5. At 705 cascaded regression landmark estimation is performed on the face image with occlusion that was accepted (received, input) at 505. At 710 the results of the Delaunay triangulation are accepted (received, input). At 715 piece-wise affine transform estimation is performed. The piece-wise affine transform estimation yields a warped face image of standard shape and size. In geometry, an affine transformation is a function between affine spaces which preserves points, straight lines and planes. Also, sets of parallel lines remain parallel after an affine transformation. An affine transformation does not necessarily preserve angles between lines or distances between points, though it does preserve ratios of distances between points lying on a straight line.
Dictionary learning is accomplished by applying any of a number of available dictionary learning algorithms to the vectors $y$ obtained from face images without occlusion.

Fig. 8 is a flowchart of an exemplary implementation of the face reconstruction (inpainting) portion of the proposed method. At 805 the vector $y_{\mathcal{A}}$ of available pixels is extracted from the warped face image of standard shape and size. At 810 sparse coding using $D_{\mathcal{A}}$ (the $\mathcal{A}$ rows of $D$) is performed using the learned dictionary and the vector of available pixels. The result is the sparse code vector $x$. At 815 the missing pixels (indicated by the occlusion mask) are reconstructed and $D_{\mathcal{M}} x$ (the reconstructed pixel values) is substituted into positions $\mathcal{M}$ of the warped face image. At 820 the inpainted (reconstructed) face image is unwarped. The result of the unwarping is a reconstructed face (a face image with inpainted occlusion).
Fig. 9 is a block diagram of an exemplary apparatus for face occlusion removal.
The apparatus in which the proposed method is performed may be any suitable processor. Such a suitable processor will also include memory (storage), at least one communications interface, antennas if wireless communications are necessary or available, an internal communications means (such as, but not limited to, a bus or token ring) and at least one display device. Such components are standard and are not shown in Fig. 9 so as not to clutter the figure. The memory (storage) may include, but is not limited to, disks, CDs, any form of RAM, optical disks, etc. The at least one communications interface acts to accept (receive, input) the face image and occlusion mask, mask $\mathcal{A}$, and the learned dictionary (if the learned dictionary processing is performed offline in a standalone processor). The at least one communications interface also outputs the reconstructed (inpainted) face image. That output may be to a printer (for hard copy), to a removable storage device, to a display device or by a network link to another computer system for further processing or face matching. Any or all of the processors herein may be computer systems or may be partially or entirely implemented in application specific integrated circuits (ASICs), field programmable gate arrays (FPGAs), reduced instruction set computers (RISCs) or any other form that a processor may take.

The learned dictionary portion of the proposed method may be performed in the same apparatus (processor) or in a standalone processor or a co-processor of the apparatus having the face alignment module and the face reconstruction module. The face alignment module has two components. The offline component accepts (receives) training images with or without occlusion. The offline component of the face alignment module may be performed within the face occlusion removal apparatus or in a standalone processor or in a co-processor. The offline component of the face alignment module then performs cascaded regression landmark estimation on the training images. The average face shape is then calculated (determined, computed) in the offline component of the face alignment module. Delaunay triangulation is then performed in the offline component of the face alignment module. The online component of the face alignment module then accepts (receives) a face image and occlusion mask. The online component of the face alignment module then performs cascaded regression landmark estimation on the face image. The online component of the face alignment module then performs piece-wise affine transform estimation using the results of the Delaunay triangulation to yield a warped face image of standard shape and size.

The warped face image of standard shape and size is provided to the face reconstruction module, which includes an extraction module, a sparse coding module, a substitution module and an unwarping module. The extraction module also accepts mask $\mathcal{A}$ (warped) specifying the positions of the $n < m$ available pixels. The extraction module extracts the vector $y_{\mathcal{A}}$ of available pixels from the warped face image of standard shape and size. The results of the extraction module are provided to the sparse coding module. The mask $\mathcal{A}$ (warped) specifying the positions of the $n < m$ available pixels is also provided to the sparse coding module. The sparse coding module also accepts the learned dictionary. As shown in Fig. 9, the learned dictionary module is drawn in a dashed outline to indicate that it may be performed within the inpainting (face reconstruction) apparatus, and the learned dictionary is shown as input to the sparse coding module with a solid line (arrow) to indicate that the learned dictionary may instead be provided from a standalone processor. The sparse coding module uses $D_{\mathcal{A}}$ (the $\mathcal{A}$ rows of $D$), the learned dictionary and the vector of available pixels to generate (compute, determine, calculate) a sparse vector $x$. The results of the sparse coding module are provided to the substitution module. The substitution module reconstructs the missing pixels (indicated by the occlusion mask) and substitutes $D_{\mathcal{M}} x$ (the reconstructed pixel values) into positions $\mathcal{M}$ of the warped face image. The results of the substitution module are provided to the unwarping module, which unwarps the inpainted face image. The result of the unwarping module is a reconstructed face (a face image with inpainted occlusion).

The dictionary learning method described above can be specialized for the specific task addressed by the proposed method by considering the following learning problem in place of equation (3):

$$\underset{D_s, D_r}{\arg\min}\; \sum_t \left\| y_{t,\mathcal{M}} - D_r\, x^\circ\!\left(y_{t,\mathcal{A}}, D_{s,\mathcal{A}}\right) \right\|_2^2, \qquad (6)$$

where the columns of $D_s$ are constrained to be unit norm. Here the dictionary $D_s$ is a selection dictionary, and each column in $D_s$ is coupled to a column in the reconstruction dictionary $D_r$. The above problem can be solved using standard gradient-based solvers.
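As a hedged sketch of the "standard gradient-based solvers" mentioned for equation (6), one option is to unroll the inner ISTA iterations so that automatic differentiation can propagate gradients from the masked reconstruction error back to both dictionaries. The PyTorch code below is a toy realization under that assumption, not the patent's actual training procedure; all hyper-parameters are placeholders.

```python
import torch

def ista_code(y_avail, D_avail, lam=0.05, n_iter=30):
    """Unrolled ISTA for x°(y_A, D_A): gradients flow through the iterations."""
    L = 2.0 * torch.linalg.matrix_norm(D_avail, ord=2) ** 2
    x = torch.zeros(D_avail.shape[1])
    for _ in range(n_iter):
        x = x - 2.0 * (D_avail.T @ (D_avail @ x - y_avail)) / L
        x = torch.sign(x) * torch.clamp(x.abs() - lam / L, min=0.0)
    return x

def learn_task_dictionaries(Y, occ_masks, n_atoms=128, steps=100, lr=1e-2):
    """Toy gradient-based solver for the equation (6) objective.

    Y         : (n_samples, m) tensor of shape-normalized face vectors
    occ_masks : (n_samples, m) boolean tensor, True on the occluded set M
    """
    m = Y.shape[1]
    D_s = torch.randn(m, n_atoms, requires_grad=True)   # selection dictionary
    D_r = torch.randn(m, n_atoms, requires_grad=True)   # reconstruction dictionary
    opt = torch.optim.Adam([D_s, D_r], lr=lr)
    for _ in range(steps):
        opt.zero_grad()
        loss = torch.tensor(0.0)
        for y, occ in zip(Y, occ_masks):
            avail = ~occ
            D_s_n = D_s / D_s.norm(dim=0, keepdim=True)  # unit-norm columns of D_s
            x = ista_code(y[avail], D_s_n[avail, :])      # code from available pixels
            loss = loss + ((y[occ] - (D_r @ x)[occ]) ** 2).sum()  # error on M only
        loss.backward()
        opt.step()
    return D_s.detach(), D_r.detach()
```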
Using whole-image rasterization makes the sparse decomposition task in equation (2) computationally demanding. In order to reduce this complexity of the online inpainting stage, the learned atoms can be forced to have a compact support in the region specified by the available-pixels mask. When using equation (6), for example, the resulting learning objective is:

$$\underset{D_s, D_r}{\arg\min}\; \sum_t \left\| y_{t,\mathcal{M}} - D_r\, x^\circ\!\left(y_{t,\mathcal{A}}, D_{s,\mathcal{A}}\right) \right\|_2^2 + \lambda \left\| D_s \right\|_1, \qquad (7)$$

where the $\ell_1$ matrix norm notation is used to denote the summation of the absolute values of all entries in the matrix. Other possibilities include enforcing a minimum support per atom.
The problem in equation (6) needs to be solved individually for each mask $\mathcal{M}$. In order to avoid the extra overhead, random masks can be used that are varied for each sample in the training set. The resulting dictionary is sub-optimal for any specific mask, but performs well on average for any mask.
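A toy helper for drawing a different random mask per training sample might look as follows; the contiguous-run geometry and the occlusion fraction are arbitrary illustrative choices.

```python
import numpy as np

def random_occlusion_mask(m, rng, frac=0.3):
    """Boolean mask over m pixels: one random contiguous run is 'occluded'."""
    mask = np.zeros(m, dtype=bool)
    run = int(frac * m)
    start = rng.integers(0, m - run)
    mask[start:start + run] = True       # a different M for every sample
    return mask
```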
Rather than operating on the entire face image, one can instead define many image strips so that the entire face is covered, and execute the proposed method on a per-strip basis. Each strip will provide an inpainting prediction for a subset of the missing pixels $\mathcal{M}$. If the strips are not disjoint, the average of the available predicted pixel values is taken for each pixel.

The proposed method is applicable to a picture or video containing occluded faces that one desires to reconstruct. The proposed method attempts to preserve the true expression of the subject: even if the eyes were originally occluded, when the person smiles one can see changes in the expression of his/her eyes. This is in clear contrast with classical "static" reconstructions, which are constant regardless of facial expression.
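As a small illustration of the strip-merging rule above (averaging the available predictions wherever strips overlap), consider the following sketch; the strip index sets and predictions are assumed to come from per-strip runs of the method.

```python
import numpy as np

def merge_strip_predictions(m, strips, predictions):
    """Average per-pixel predictions from (possibly overlapping) strips.

    m           : total number of pixels in the warped face
    strips      : list of index arrays, one per strip
    predictions : list of predicted pixel-value arrays, aligned with strips
    """
    total = np.zeros(m)
    count = np.zeros(m)
    for idx, pred in zip(strips, predictions):
        total[idx] += pred
        count[idx] += 1
    out = np.full(m, np.nan)                 # NaN marks pixels no strip covered
    covered = count > 0
    out[covered] = total[covered] / count[covered]   # average where strips overlap
    return out
```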
It is to be understood that the present invention may be implemented in various forms of hardware, software, firmware, special purpose processors, or a combination thereof. Special purpose processors may include application specific integrated circuits (ASICs), reduced instruction set computers (RISCs) and/or field programmable gate arrays (FPGAs). Preferably, the present invention is implemented as a combination of hardware and software. Moreover, the software is preferably implemented as an application program tangibly embodied on a program storage device. The application program may be uploaded to, and executed by, a machine comprising any suitable architecture. Preferably, the machine is implemented on a computer platform having hardware such as one or more central processing units (CPUs), a random access memory (RAM), and input/output (I/O) interface(s). The computer platform also includes an operating system and microinstruction code. The various processes and functions described herein may either be part of the microinstruction code or part of the application program (or a combination thereof), which is executed via the operating system. In addition, various other peripheral devices may be connected to the computer platform such as an additional data storage device and a printing device.
It should be understood that the elements shown in the figures may be implemented in various forms of hardware, software or combinations thereof. Preferably, these elements are implemented in a combination of hardware and software on one or more appropriately programmed general-purpose devices, which may include a processor, memory and input/output interfaces. Herein, the phrase "coupled" is defined to mean directly connected to or indirectly connected with through one or more intermediate components. Such intermediate components may include both hardware and software based components.
It is to be further understood that, because some of the constituent system components and method steps depicted in the accompanying figures are preferably implemented in software, the actual connections between the system components (or the process steps) may differ depending upon the manner in which the present invention is programmed. Given the teachings herein, one of ordinary skill in the related art will be able to contemplate these and similar implementations or configurations of the present invention.

CLAIMS:
1. A method for performing face occlusion removal, said method comprising:
receiving a face image and an occlusion mask, said occlusion mask indicating missing pixels (505);
receiving training images (605);
performing face alignment on said received training images and said face image and said occlusion mask (510);
receiving a mask (515);
receiving a learned dictionary (520); and
reconstructing said face image using said mask and said learned dictionary (525).
2. The method according to claim 1, wherein said face alignment further comprises:
performing cascaded regression landmark estimation on said training images (610);
determining an average face shape using said landmark estimation of said training images (615); and
performing triangulation on said average face shape (620).
3. The method according to claim 2, wherein said triangulation is Delaunay triangulation.
4. The method according to claim 2, wherein said face alignment further comprises:
performing cascaded regression landmark estimation on said face image (705); and
performing piece-wise affine transform estimation using said landmark estimation of said face image and said occlusion mask and said triangulation of said average face shape to generate a warped face image (715).
5. The method according to claim 1, wherein said mask is a warped mask specifying positions of available pixels.
6. The method according to claim 5, wherein said reconstruction of said face image further comprises:
extracting a vector of available pixels of said warped mask (805);
performing sparse coding using said learned dictionary and said vector of available pixels to generate a sparse code vector (810);
reconstructing the missing pixels using the learned dictionary and the sparse code vector (815);
substituting the reconstructed pixels into positions of said warped face image to generate a warped inpainted face image (815); and
unwarping said warped inpainted face image to generate an unwarped inpainted face image (820).
7. The method according to claim 6, further comprising outputting said unwarped inpainted face image.
8. A face occlusion removal apparatus, comprising:
a communications interface, said communications interface receiving a face image and occlusion mask;
said communications interface receiving training images;
a face alignment module, said face alignment module performing face alignment on said received training images and said face image and said occlusion mask, said face alignment module in communication with said communications interface;
said communications interface receiving a mask;
said communications interface receiving a learned dictionary; and
a face reconstruction module, said face reconstruction module reconstructing said face image using said mask and said learned dictionary, said face reconstruction module in communication with said communications interface and said face reconstruction module in communication with said face alignment module.
9. The face alignment module according to claim 8, wherein said face alignment module comprises an offline face alignment component and an online face alignment component, wherein said offline face alignment component operates on said training images and accomplishes an offline portion of face alignment by:
performing cascaded regression landmark estimation on said training images;
determining an average face shape using said landmark estimation of said training images; and
performing triangulation on said average face shape.
10. The offline component of said face alignment module according to claim 9, wherein said triangulation is Delaunay triangulation.
11. The online face alignment component according to claim 9, wherein said online face alignment component operates on said face image and said occlusion mask and accomplishes an online portion of face alignment by:
performing cascaded regression landmark estimation on said face image; and
performing piece-wise affine transform estimation using said landmark estimation of said face image and said occlusion mask and said triangulation of said average face shape to generate a warped face image.
12. The face occlusion removal apparatus according to claim 8, wherein said mask is a warped mask specifying positions of available pixels.
13. The face occlusion removal apparatus according to claim 12, wherein said face reconstruction module further comprises:
an extraction module, said extraction module extracting a vector of available pixels of said warped mask, said extraction module in communication with said communications interface, said extraction module also in communication with said face alignment module;
a sparse coding module, said sparse coding module performing sparse coding using said learned dictionary and said vector of available pixels to generate a sparse code vector, said sparse coding module in communication with said communications interface and also in communication with said extraction module;
a substitution module, said substitution module reconstructing said missing pixels using said learned dictionary and said sparse code vector and substituting said reconstructed pixels into positions of said warped face image to generate a warped inpainted face image, said substitution module in communication with said sparse coding module; and
an unwarping module, said unwarping module unwarping said warped inpainted face image, said unwarping module in communication with said substitution module.
14. The face occlusion removal apparatus according to claim 13, further comprising said unwarping module providing an unwarped inpainted face image to said communications interface for output.