CN110580461A

CN110580461A - A Facial Expression Recognition Algorithm Combining Multi-Level Convolutional Feature Pyramid

Info

Publication number: CN110580461A
Application number: CN201910806700.2A
Authority: CN
Inventors: 符强; 昌涛; 孙希延; 纪元法; 任风华; 严素清; 付文涛
Original assignee: Guilin University of Electronic Technology
Current assignee: Guilin University of Electronic Technology
Priority date: 2019-08-29
Filing date: 2019-08-29
Publication date: 2019-12-17

Abstract

The invention discloses a human facial expression recognition algorithm combined with a multi-level convolution feature pyramid, which comprises the following steps: inputting the original human facial expression picture into the first feature extraction network for feature extraction, and extracting global features; The enlarged facial expression picture is input to the second feature extraction network for feature extraction, and local features are extracted; a feature pyramid network is constructed in the first feature extraction network and the second feature extraction network; The local features in the feature extraction network are used for regional positioning; the global features and local features of the facial expression pictures are fused using the feature fusion network, and the facial expression pictures are classified through the fully connected layer. To achieve the purpose of improving the accuracy of facial expression recognition technology for facial expression recognition.

Description

A Facial Expression Recognition Algorithm Combining Multi-Level Convolutional Feature Pyramid

技术领域technical field

本发明涉及图像处理领域，尤其涉及一种结合多级卷积特征金字塔的人脸表情识别算法。The invention relates to the field of image processing, in particular to a facial expression recognition algorithm combined with a multi-level convolution feature pyramid.

背景技术Background technique

人脸表情不仅是人们本身情绪的一种流露，而且能够传递极其丰富的人体行为信息。人脸表情识别技术是对人类情感研究的主要形式，对人脸表情的准确识别能够有效的判断出人的情感状态。人脸表情识别可以应用于许多领域，如安全驾驶领域、智能化的人机交互领域、医疗监护领域、刑侦审讯中的测谎以及心理治疗等领域。但是现有的人脸表情识别技术对人脸表情识别的准确率较低。Facial expressions are not only an expression of people's own emotions, but also can convey extremely rich information about human behavior. Facial expression recognition technology is the main form of human emotion research. Accurate recognition of human facial expressions can effectively judge people's emotional state. Facial expression recognition can be applied in many fields, such as safe driving, intelligent human-computer interaction, medical monitoring, lie detection in criminal investigation, and psychotherapy. However, the existing facial expression recognition technology has a low accuracy rate for facial expression recognition.

发明内容Contents of the invention

本发明的目的在于提供一种结合多级卷积特征金字塔的人脸表情识别算法，旨在解决现有技术中的人脸表情识别技术对人脸表情识别的准确率较低的技术问题。The purpose of the present invention is to provide a facial expression recognition algorithm combined with a multi-level convolutional feature pyramid, aiming to solve the technical problem of low accuracy of facial expression recognition technology in the prior art.

为实现上述目的，本发明采用的一种结合多级卷积特征金字塔的人脸表情识别算法，包括如下步骤：In order to achieve the above object, a kind of face expression recognition algorithm that the present invention adopts in conjunction with multi-level convolution feature pyramid comprises the following steps:

将原始人脸表情图片输入到第一特征提取网络进行特征提取，并提取全局特征；Input the original facial expression picture to the first feature extraction network for feature extraction, and extract global features;

将经过裁剪、放大的人脸表情图片输入到第二特征提取网络进行特征提取，并提取局部特征；Input the cropped and enlarged facial expression picture to the second feature extraction network for feature extraction, and extract local features;

在第一特征提取网络和第二特征提取网络中构建特征金字塔网络；Constructing a feature pyramid network in the first feature extraction network and the second feature extraction network;

利用注意力区域定位网络对到第二特征提取网络中的局部特征进行区域定位；Using the attention region localization network to localize the local features in the second feature extraction network;

利用特征融合网络对人脸表情图片的全局特征与局部特征进行融合，并通过全连接层对人脸表情图片进行分类。The feature fusion network is used to fuse the global features and local features of the facial expression pictures, and the facial expression pictures are classified through the fully connected layer.

其中，第二级特征提取网络的输入样本是经过第一级特征提取网络中输入样本图片进行裁剪和放大得到。Wherein, the input samples of the second-level feature extraction network are obtained by cropping and enlarging the input sample pictures in the first-level feature extraction network.

其中，第一级特征提取网络和第二级特征提取网络均是一种全卷积结构网络。Among them, the first-level feature extraction network and the second-level feature extraction network are both a fully convolutional network.

其中，在提取全局特征和提取局部特征中的第一层卷积层均使用大小为3*3的卷积核。Among them, the first convolutional layer in extracting global features and extracting local features uses a convolution kernel with a size of 3*3.

其中，原始人脸表情图片输入第一级特征特征提取网络，通过卷积神经网络提取人脸表情图片的特征，然后输出类别和每一类的概率值，所述每一类别的概率值的获得方法为：Wherein, the original facial expression picture is input into the first-level feature extraction network, and the feature of the facial expression picture is extracted by a convolutional neural network, and then the output category and the probability value of each category, the acquisition of the probability value of each category The method is:

C⁽ⁱ⁾＝f(W_i⊙X_i)C ⁽ⁱ⁾ ＝f(W _i ⊙X _i )

p⁽ⁱ⁾＝f(W_i⊙X_i)p ⁽ⁱ⁾ = f(W _i ⊙X _i )

其中，输入人脸表情图片为X，X_i和W_i表示第i级卷积神经网络的输入和权重参数，⊙表示在卷积过程中进行的卷积、池化、激活或归一化操作，f(·)表示特征提取网络，C⁽ⁱ⁾(i＝1,2)表示第i级卷积神经网络的输出类别标签，p⁽ⁱ⁾(i＝1,2)表示第i级卷积神经网络对于输出每一类别的概率值。Among them, the input facial expression picture is X, X _i and W _i represent the input and weight parameters of the i-th level convolutional neural network, and ⊙ represents the convolution, pooling, activation or normalization operations performed during the convolution process , f( ) represents the feature extraction network, C ⁽ⁱ⁾ (i=1,2) represents the output category label of the i-th level convolutional neural network, p ⁽ⁱ⁾ (i=1,2) represents the i-th level volume The product neural network outputs the probability value of each category.

其中，每一级网络都会生成关于输入人脸表情图片的特征图，用表示每一级网络生成特征图的集合，m表示该模型包含的网络级数，其中m＝(1,2)，n表示每一级网络最后输出的特征图的个数，输入图像X用以下式子表示：Among them, each level of network will generate a feature map about the input facial expression picture, using Represents the set of feature maps generated by each level of network, m represents the number of network series included in the model, where m=(1,2), n represents the number of feature maps output by each level of network, and the input image X is represented by the following The formula means:

其中，卷积神经网络中浅层卷积层产生的特征图尺度大，卷积神经网络的深处卷积层产生的特征图尺度小，将后一层的卷积层产生的特征图进行转置卷积操作，然后再将两张特征图相加，统一特征图的尺度。Among them, the feature map generated by the shallow convolution layer in the convolutional neural network has a large scale, and the feature map generated by the deep convolution layer of the convolutional neural network has a small scale. Set the convolution operation, and then add the two feature maps to unify the scale of the feature maps.

其中，注意力区域定位网络将输入的特征图映射为以(D_x,D_y)为中心、D_α为边长的正方形注意力区域，然后将注意力区域映射到原始人脸表情图片，再对其通过裁剪、放大后送入到下一级网络。Among them, the attention area positioning network maps the input feature map to a square attention area with (D _x , D _y ) as the center and D _α as the side length, and then maps the attention area to the original facial expression picture, and then It is sent to the next level network after cropping and amplifying.

其中，第二特征提取网络在对人脸表情进行识别提取中，可自动将人脸表情的显著性区域进行激活，首先必须对特征图中响应最大的区域实现精准定位，然后将这些精准定位的区域映射到原始人脸表情图片中，得到原始人脸表情的重要的局部区域，对局部区域进行特征提取实现图片的局部区域识别。Among them, the second feature extraction network can automatically activate the salient areas of facial expressions during the recognition and extraction of facial expressions. The area is mapped to the original facial expression picture, and the important local area of the original facial expression is obtained, and the feature extraction of the local area is carried out to realize the local area recognition of the picture.

其中，在初始化时将由第一级特征提取网络最后一层产生的特征图全部相加在一起，然后将相加后的的特征图中数值较大的区域定位出来，将定位的区域拟合为一个正方形，然后获取正方形的中心坐标以及正方形边长，并将其设置为注意力网络的初始化参数，表达为：Among them, at the time of initialization, all the feature maps generated by the last layer of the first-level feature extraction network are added together, and then the area with a larger value in the added feature map is located, and the located area is fitted as A square, then get the center coordinates of the square and the side length of the square, and set them as the initialization parameters of the attention network, expressed as:

其中，f表示第一特征提取网络最后一层卷积层产生的特征图，d表示该网络产生的总的特征图数量，F为将每一张特征图的对应的每一个点相加后的总的特征图，h和w表示特征图的高和宽，p表示总的特征图的像素均值。Among them, f represents the feature map generated by the last convolutional layer of the first feature extraction network, d represents the total number of feature maps generated by the network, and F is the sum of each point corresponding to each feature map. The total feature map, h and w represent the height and width of the feature map, and p represents the pixel mean of the total feature map.

本发明的一种结合多级卷积特征金字塔的人脸表情识别算法，通过将原始人脸表情图片输入到第一特征提取网络进行特征提取，并提取全局特征；将经过裁剪、放大的人脸表情图片输入到第二特征提取网络进行特征提取，并提取局部特征；在第一特征提取网络和第二特征提取网络中构建特征金字塔网络；利用注意力区域定位网络对到第二特征提取网络中的局部特征进行区域定位；利用特征融合网络对人脸表情图片的全局特征与局部特征进行融合，并通过全连接层对人脸表情图片进行分类。其中通过采用两级人脸特征提取网络的级联结构来实现人脸特征由全局到局部的转移，在每级人脸特征提取网络中加入了特征金字塔网络，从而提高了该算法对人脸目标大小的鲁棒性。以此获得提高人脸表情识别技术对人脸表情识别的准确率的效果。A kind of facial expression recognition algorithm combined with multi-level convolution feature pyramid of the present invention, by inputting the original facial expression picture into the first feature extraction network for feature extraction, and extracting global features; Input the expression picture to the second feature extraction network for feature extraction, and extract local features; construct a feature pyramid network in the first feature extraction network and the second feature extraction network; use the attention area positioning network to pair it into the second feature extraction network The local features of the facial expression are used to locate the area; the global features and local features of the facial expression picture are fused using the feature fusion network, and the facial expression picture is classified through the fully connected layer. Among them, the transfer of face features from global to local is realized by adopting the cascaded structure of two-level face feature extraction network, and a feature pyramid network is added to each level of face feature extraction network, thereby improving the algorithm's accuracy for face targets. Size robustness. In this way, the effect of improving the accuracy of facial expression recognition by the facial expression recognition technology is obtained.

附图说明Description of drawings

为了更清楚地说明本发明实施例或现有技术中的技术方案，下面将对实施例或现有技术描述中所需要使用的附图作简单地介绍，显而易见地，下面描述中的附图仅仅是本发明的一些实施例，对于本领域普通技术人员来讲，在不付出创造性劳动的前提下，还可以根据这些附图获得其他的附图。In order to more clearly illustrate the embodiments of the present invention or the technical solutions in the prior art, the following will briefly introduce the drawings that need to be used in the description of the embodiments or the prior art. Obviously, the accompanying drawings in the following description are only These are some embodiments of the present invention. Those skilled in the art can also obtain other drawings based on these drawings without creative work.

图1是本发明的结合多级卷积特征金字塔的人脸表情识别算法的步骤图。Fig. 1 is a step diagram of the facial expression recognition algorithm combined with the multi-level convolution feature pyramid of the present invention.

图2是本发明的结合多级卷积特征金字塔的人脸表情识别算法的流程图。Fig. 2 is a flow chart of the facial expression recognition algorithm combined with the multi-level convolutional feature pyramid of the present invention.

具体实施方式Detailed ways

下面详细描述本发明的实施例，所述实施例的示例在附图中示出，其中自始至终相同或类似的标号表示相同或类似的元件或具有相同或类似功能的元件。下面通过参考附图描述的实施例是示例性的，旨在用于解释本发明，而不能理解为对本发明的限制。Embodiments of the present invention are described in detail below, examples of which are shown in the drawings, wherein the same or similar reference numerals designate the same or similar elements or elements having the same or similar functions throughout. The embodiments described below by referring to the figures are exemplary and are intended to explain the present invention and should not be construed as limiting the present invention.

在本发明的描述中，需要理解的是，术语“长度”、“宽度”、“上”、“下”、“前”、“后”、“左”、“右”、“竖直”、“水平”、“顶”、“底”“内”、“外”等指示的方位或位置关系为基于附图所示的方位或位置关系，仅是为了便于描述本发明和简化描述，而不是指示或暗示所指的装置或元件必须具有特定的方位、以特定的方位构造和操作，因此不能理解为对本发明的限制。此外，在本发明的描述中，“多个”的含义是两个或两个以上，除非另有明确具体的限定。In describing the present invention, it should be understood that the terms "length", "width", "upper", "lower", "front", "rear", "left", "right", "vertical", The orientation or positional relationship indicated by "horizontal", "top", "bottom", "inner", "outer", etc. are based on the orientation or positional relationship shown in the drawings, and are only for the convenience of describing the present invention and simplifying the description, rather than Nothing indicating or implying that a referenced device or element must have a particular orientation, be constructed, and operate in a particular orientation should therefore not be construed as limiting the invention. In addition, in the description of the present invention, "plurality" means two or more, unless otherwise specifically defined.

请参阅图1和图2，本发明提供了一种结合多级卷积特征金字塔的人脸表情识别算法，包括如下步骤：Please refer to Fig. 1 and Fig. 2, the present invention provides a kind of facial expression recognition algorithm in conjunction with multi-level convolution feature pyramid, comprises the steps:

S100：将原始人脸表情图片输入到第一特征提取网络进行特征提取，并提取全局特征；S100: Input the original facial expression picture to the first feature extraction network for feature extraction, and extract global features;

S200：将经过裁剪、放大的人脸表情图片输入到第二特征提取网络进行特征提取，并提取局部特征；S200: Input the cropped and enlarged facial expression picture to the second feature extraction network for feature extraction, and extract local features;

S300：在第一特征提取网络和第二特征提取网络中构建特征金字塔网络；S300: Construct a feature pyramid network in the first feature extraction network and the second feature extraction network;

S400：利用注意力区域定位网络对到第二特征提取网络中的局部特征进行区域定位；S400: Use the attention area positioning network to perform area positioning on the local features in the second feature extraction network;

S500：利用特征融合网络对人脸表情图片的全局特征与局部特征进行融合，并通过全连接层对人脸表情图片进行分类。S500: Use a feature fusion network to fuse the global features and local features of the facial expression pictures, and classify the facial expression pictures through a fully connected layer.

在本实施方式中，所述第一级特征提取网络能够对448*448像素的原始人脸表情图片进行特征提取，并提取全局特征；所述第二级特征提取网络对经过裁剪、放大的人脸表情图片进行特征提取，并提取局部特征；其中第二级特征提取网络的输入样本是经过第一级特征提取网络中输入样本图片进行裁剪和放大得到，第一特征提取网络和第二特征提取网络均是一种全卷积结构网络。网络的结构类似于VGG19网络结构，该网络的每一个模块都由一个卷积层，一个BatchNorm层，一个relu层和一个平均池化层来构成的，共有16个卷积模块。在提取全局特征和提取局部特征中的第一层卷积层均使用大小为3*3的卷积核。In this embodiment, the first-level feature extraction network can perform feature extraction on the original facial expression picture of 448*448 pixels, and extract global features; Feature extraction is performed on facial expression pictures, and local features are extracted; the input samples of the second-level feature extraction network are obtained by cropping and enlarging the input sample pictures in the first-level feature extraction network, and the first feature extraction network and the second feature extraction network The network is a fully convolutional network. The structure of the network is similar to the VGG19 network structure. Each module of the network consists of a convolutional layer, a BatchNorm layer, a relu layer and an average pooling layer. There are 16 convolutional modules in total. The first convolutional layer in extracting global features and extracting local features uses a convolution kernel with a size of 3*3.

原始人脸表情图片输入第一级特征特征提取网络，通过卷积神经网络提取人脸表情图片的特征，然后输出类别和每一类的概率值，用以下式子表示：The original facial expression picture is input into the first-level feature extraction network, and the features of the facial expression picture are extracted through the convolutional neural network, and then the output category and the probability value of each category are expressed by the following formula:

C⁽ⁱ⁾＝f(W_i⊙X_i)C ⁽ⁱ⁾ ＝f(W _i ⊙X _i )

p⁽ⁱ⁾＝f(W_i⊙X_i)p ⁽ⁱ⁾ = f(W _i ⊙X _i )

上式中，输入人脸表情图片为X，X_i和W_i表示第i级卷积神经网络的输入和权重参数，⊙表示在卷积过程中进行的卷积、池化、激活或归一化操作，f(·)表示特征提取网络，C⁽ⁱ⁾(i＝1,2)表示第i级卷积神经网络的输出类别标签，p⁽ⁱ⁾(i＝1,2)表示第i级卷积神经网络对于输出每一类别的概率值。In the above formula, the input facial expression picture is X, Xi and W _i represent the input and weight parameters of the i- _th convolutional neural network, and ⊙ represents the convolution, pooling, activation or normalization performed in the convolution process operation, f( ) represents the feature extraction network, C ⁽ⁱ⁾ (i=1,2) represents the output category label of the i-th convolutional neural network, p ⁽ⁱ⁾ (i=1,2) represents the i-th The level convolutional neural network outputs the probability value for each category.

每一级网络都会生成关于输入人脸表情图片的特征图，用表示每一级网络生成特征图的集合，m表示该模型包含的网络级数，其中m＝(1,2)，n表示每一级网络最后输出的特征图的个数，输入图像X用以下式子表示： Each level of the network will generate a feature map of the input facial expression picture, using Represents the set of feature maps generated by each level of network, m represents the number of network series included in the model, where m=(1,2), n represents the number of feature maps output by each level of network, and the input image X is represented by the following The formula means:

卷积神经网络通过每一层的卷积核对输入图片进行逐层提取特征，不同卷积层的特征图对应于输入图片的不同区域与目标，为了解决人脸大小多样化的问题，同时为了更好地利用卷积神经网络的浅层卷积层提取到人脸表情信息，在卷积神经网络的各个卷积层之间构建了特征金字塔网络，由于卷积神经网络中浅层卷积层产生的特征图尺度较大，而卷积神经网络的身处卷积层产生的特征图尺度小，为了统一特征图的尺度，首先将后一层的卷积层产生的特征图进行转置卷积操作，然后再将两张特征图相加，生成的特征图就是整张人脸表情图片的最终表示。The convolutional neural network extracts the features of the input image layer by layer through the convolution kernel of each layer. The feature maps of different convolution layers correspond to different areas and targets of the input image. Make good use of the shallow convolutional layer of the convolutional neural network to extract facial expression information, and construct a feature pyramid network between each convolutional layer of the convolutional neural network, due to the shallow convolutional layer in the convolutional neural network. The feature map of the convolutional neural network has a large scale, while the feature map generated by the convolutional neural network is small. In order to unify the scale of the feature map, the feature map generated by the convolutional layer of the next layer is first transposed and convolved. operation, and then add the two feature maps, and the generated feature map is the final representation of the entire facial expression picture.

人脸表情图片的全局信息可以提供人脸表情的整体信息，而局部特征可以提供更加精细化人脸表情的细节信息，为了实现人脸表情从全局特征到局部特征的过渡与转移，注意力区域定位网络在特征提取网络之间，负责定位原始人脸表情图片中的局部特征区域(如眉毛、眼睛、鼻子、嘴巴)，并将对应的区域进行裁剪，然后将其送入到下一级网络。已经训练完成的人脸表情识别网络在对人脸表情进行识别时可以自动将人脸表情的显著性区域(即对人脸表情识别效果有重要影响力的区域)进行激活，所以首先必须对特征图中响应最大的区域实现精准定位，然后将这些精准定位的区域映射到原始人脸表情图片中，这样就可以得到原始人脸表情的非常重要的局部区域，对局部区域进行特征提取实现图片的局部区域识别。通过采用逐级定位的注意力网络自动定位对人脸表情识别具有重要影响力的区域。The global information of facial expression pictures can provide the overall information of facial expressions, while the local features can provide more refined detailed information of facial expressions. In order to realize the transition and transfer of facial expressions from global features to local features, the attention area The positioning network is between the feature extraction network, responsible for locating the local feature areas (such as eyebrows, eyes, nose, mouth) in the original facial expression picture, and cutting the corresponding area, and then sending it to the next level network . The facial expression recognition network that has been trained can automatically activate the salient area of facial expression (that is, the area that has an important influence on the recognition effect of facial expression) when recognizing facial expression, so the feature must first be The region with the largest response in the figure is precisely positioned, and then these precisely positioned regions are mapped to the original facial expression picture, so that the very important local area of the original facial expression can be obtained, and feature extraction is performed on the local area to realize the picture. Local area recognition. Automatically localize regions that have significant influence on facial expression recognition by employing a hierarchical localization attention network.

注意力区域定位网络将输入的特征图映射为以(D_x，D_y)为中心、D_a为边长的正方形注意力区域，然后将注意力区域映射到原始人脸表情图片，再对其通过裁剪、放大后送入到下一级网络中，这样就可以实现人脸表情图片由全局特征图片到局部特征区域的精细化识别。The attention area localization network maps the input feature map into a square attention area with (D _x , D _y ) as the center and D _a as the side length, and then maps the attention area to the original facial expression picture, and then After cropping and zooming in, it is sent to the next-level network, so that the refined recognition of facial expression pictures from global feature pictures to local feature areas can be realized.

将通过卷积神经网络得到的特征图中强响应区域映射到原始人脸表情图片的关键在于获得特征图中强响应区域的坐标，于是通过设计一个3层全连接神经网络的注意力区域定位网络，将特征图通过该全连接层网络可以实现对特征图中强响应区域的自动定位。为了加快注意力区域定位网络的收敛速度，在初始化时将由第一级特征提取网络最后一层产生的特征图全部相加在一起，然后将相加后的的特征图中数值较大的区域定位出来，将定位的区域拟合为一个正方形，然后获取正方形的中心坐标以及正方形边长，并将其设置为注意力网络的初始化参数，具体实现方式为：The key to mapping the strong response area in the feature map obtained by the convolutional neural network to the original facial expression picture is to obtain the coordinates of the strong response area in the feature map, so by designing a 3-layer fully connected neural network attention area positioning network , passing the feature map through the fully connected layer network can realize the automatic positioning of the strong response area in the feature map. In order to speed up the convergence speed of the attention area positioning network, all the feature maps generated by the last layer of the first-level feature extraction network are added together during initialization, and then the areas with larger values in the added feature maps are located. Come out, fit the positioned area into a square, then get the center coordinates of the square and the side length of the square, and set them as the initialization parameters of the attention network. The specific implementation method is:

上式中，f表示第一特征提取网络最后一层卷积层产生的特征图，d表示该网络产生的总的特征图数量，F为将每一张特征图的对应的每一个点相加后的总的特征图，h和w表示特征图的高和宽，p表示总的特征图的像素均值，然后将总的特征图的每一个像素点与该像素均值进行比较，相当于以该像素均值为总的特征图的阈值，大于该阈值的像素点则置为1，小于或等于该阈值的像素点则置为0，然后选择置为1的最大连通区域的最长边为正方形的边长，最后将正方形的中心坐标及边长初始化注意力定位网络的参数。注意力区域定位网络可以自动定位特征图中响应最大的区域，定位后再对注意力区域进行裁剪与放大，假设以图片的左下角为原点，水平方向为x轴，从左到右逐渐增大，竖直方向为y轴，从下到上逐渐增大。根据前面所得到的正方形注意力区域的中心点坐标和其边长，我们可以计算出正方形注意力区域的四个顶点从而实现了注意力区域的自动裁剪，最后利用双线性插值对注意力区域图片进行放大。得到我们需要的图片。In the above formula, f represents the feature map generated by the last convolutional layer of the first feature extraction network, d represents the total number of feature maps generated by the network, and F is the sum of each point corresponding to each feature map After the total feature map, h and w represent the height and width of the feature map, p represents the pixel mean value of the total feature map, and then compare each pixel point of the total feature map with the pixel mean value, which is equivalent to the The pixel mean value is the threshold of the total feature map, the pixel points greater than the threshold value are set to 1, and the pixel points less than or equal to the threshold value are set to 0, and then the longest side of the largest connected region set to 1 is selected as a square Side length, and finally initialize the parameters of the attention positioning network with the center coordinates and side length of the square. The attention area positioning network can automatically locate the area with the largest response in the feature map, and then crop and enlarge the attention area after positioning. Assume that the lower left corner of the picture is the origin, the horizontal direction is the x-axis, and it gradually increases from left to right. , the vertical direction is the y-axis, which gradually increases from bottom to top. According to the coordinates of the center point of the square attention area and its side length obtained earlier, we can calculate the four vertices of the square attention area to realize the automatic cropping of the attention area, and finally use bilinear interpolation to the attention area The picture is enlarged. Get the pictures we need.

上述提出的算法的每一级特征提取网络都会提取人脸表情图片的特征，第一特征提取网络提取人脸表情图片的全局特征，第二网络提取人脸表情图片的局部细节特征，将每一个通道的特征图对应像素点相加，然后通过使用1*1大小的卷积核对特征维数降维然后再进行特征融合和池化，最后通过全连接层输出人脸表情的类别标签。Each level of feature extraction network of the algorithm proposed above will extract the features of the facial expression picture, the first feature extraction network extracts the global features of the facial expression picture, the second network extracts the local detail features of the facial expression picture, and each The feature map of the channel is added to the corresponding pixels, and then the feature dimension is reduced by using a 1*1 convolution kernel, and then the feature fusion and pooling are performed, and finally the category label of the facial expression is output through the fully connected layer.

在按照上述方法设计好网络之后，为了能够同时优化特征提取网络和注意力区域定位网络，采用了两种损失函数来对网络进行优化，使用这两个网络的交叉熵损失函数与融合网络的交叉熵损失函数的加权和作为该网络的整体损失函数，这样对区分不同人脸表情的效果很好，同时为了约束相似人脸表情的差异，采用了特征提取网络和注意力区域定位网络的两个网络的惩罚验证损失，从而极大地增强了相似人脸表情之间的识别性能。After designing the network according to the above method, in order to optimize the feature extraction network and the attention area positioning network at the same time, two loss functions are used to optimize the network, using the cross entropy loss function of these two networks and the intersection of the fusion network The weighted sum of the entropy loss function is used as the overall loss function of the network, which is very effective in distinguishing different facial expressions. At the same time, in order to constrain the difference of similar facial expressions, two features extraction network and attention area positioning network are used. The network penalizes the validation loss, which greatly enhances the recognition performance between similar facial expressions.

综上所述：该算法可以自动定位人脸表情图片区分度较大的区域，融合人脸表情图片的全局与局部特征，实现了对人脸表情图片的准确识别。第一特征提取网络和第二特征提取网络负责对人脸表情图片进行特征提取，注意力区域定位网络负责对人脸表情图片局部特征区域定位，对人脸局部特征区域进行表示与定位，对于相似度较高的人脸表情图片，如生气与厌恶，该算法实现了高区分度的人脸表情局部特征的区域定位，综合利用卷积神经网络的浅层高分辨率人脸结构信息与深层低分辨率语义信息，获得了更好的人脸局部特征区域的定位，极大地提高了对于相似度较高的人脸表情图片的识别率。To sum up: the algorithm can automatically locate the areas with high discrimination of facial expression pictures, integrate the global and local features of facial expression pictures, and realize the accurate recognition of facial expression pictures. The first feature extraction network and the second feature extraction network are responsible for feature extraction of facial expression pictures, and the attention area positioning network is responsible for locating local feature areas of facial expression pictures, representing and locating local feature areas of faces, and for similar High-resolution facial expression pictures, such as anger and disgust, the algorithm realizes the regional localization of high-resolution facial expression local features, and comprehensively utilizes the superficial high-resolution facial structure information of the convolutional neural network and the deep low-level The high-resolution semantic information obtains better positioning of local feature areas of the face, and greatly improves the recognition rate of facial expression pictures with high similarity.

特征融合网络负责将人脸表情的全局特征与局部特征进行融合，融合人脸表情图片的全局特征信息与人脸表情图片的局部特征信息。人脸表情图片的局部特征信息可以提高相似度较大的人脸表情的区分能力，同时保留人脸表情图片的全局结构信息，所以采用特征融合网络对各级人脸表情特征提取网络提取到的特征进行融合，获得人脸表情图片更加全面的特征信息，从而提高了对人脸表情识别的准确率。由于单极人脸表情特征提取网络无法同时获得人脸表情图片的全局与局部特征，因此采用了两级人脸表情特征提取网络级联的结构实现了特征由全局到局部的转移，针对人脸表情图片存在尺度变化的特点，在每级人脸表情特征提取网络的特征图之间构建了特征金字塔网络，从而提高了网络的特征描述能力，极大地提高了对于存在尺度变化的人脸表情图片、存在遮挡的人脸表情图片以及相似度人脸表情图片的识别准确率。The feature fusion network is responsible for fusing the global features and local features of facial expressions, and fusing the global feature information of facial expression pictures with the local feature information of facial expression pictures. The local feature information of facial expression pictures can improve the ability to distinguish facial expressions with high similarity, and at the same time retain the global structural information of facial expression pictures, so the feature fusion network is used to extract the facial expression feature extraction network at all levels. The features are fused to obtain more comprehensive feature information of facial expression pictures, thereby improving the accuracy of facial expression recognition. Since the unipolar facial expression feature extraction network cannot obtain the global and local features of facial expression pictures at the same time, a two-level facial expression feature extraction network cascade structure is adopted to realize the transfer of features from global to local. Expression pictures have the characteristics of scale changes. A feature pyramid network is constructed between the feature maps of the facial expression feature extraction network at each level, thereby improving the feature description ability of the network and greatly improving the performance of facial expression pictures with scale changes. , the recognition accuracy of facial expression pictures with occlusion and similarity facial expression pictures.

以上所揭露的仅为本发明一种较佳实施例而已，当然不能以此来限定本发明之权利范围，本领域普通技术人员可以理解实现上述实施例的全部或部分流程，并依本发明权利要求所作的等同变化，仍属于发明所涵盖的范围。What is disclosed above is only a preferred embodiment of the present invention, and of course it cannot limit the scope of rights of the present invention. Those of ordinary skill in the art can understand all or part of the process for realizing the above embodiments, and according to the rights of the present invention The equivalent changes required still belong to the scope covered by the invention.

Claims

1. A facial expression recognition algorithm combined with a multilevel convolution feature pyramid is characterized by comprising the following steps:

Inputting the original facial expression picture into a first feature extraction network for feature extraction, and extracting global features;

Inputting the cut and amplified facial expression picture into a second feature extraction network for feature extraction, and extracting local features;

Constructing a feature pyramid network in the first feature extraction network and the second feature extraction network;

Carrying out regional positioning on the local features in the second feature extraction network by using an attention regional positioning network;

And fusing the global features and the local features of the human face expression pictures by using the feature fusion network, and classifying the human face expression pictures through a full connection layer.

2. The algorithm for facial expression recognition in combination with the multi-level pyramid of convolution features of claim 1,

The input sample of the second-level feature extraction network is obtained by cutting and amplifying an input sample picture in the first-level feature extraction network.

3. The algorithm for facial expression recognition in combination with the multi-level pyramid of convolution features of claim 1,

The first level feature extraction network and the second level feature extraction network are both full convolution structure networks.

4. The algorithm for facial expression recognition in combination with the multi-level pyramid of convolution features of claim 3,

The first convolutional layer in both global feature extraction and local feature extraction uses a convolution kernel of size 3 x 3.

5. The algorithm for facial expression recognition in combination with the pyramid of multilevel convolution features of claim 4,

Inputting an original facial expression picture into a first-level feature extraction network, extracting features of the facial expression picture through a convolutional neural network, and then outputting categories and probability values of each category, wherein the probability value of each category is obtained by the following steps:

C⁽ⁱ⁾＝f(W_i⊙X_i)

p⁽ⁱ⁾＝f(W_i⊙X_i)

Wherein, the input facial expression picture is X, X_iAnd W_irepresents the input and weight parameters of the i-th convolutional neural network, indicates the convolution, pooling, activation, or normalization operation performed during convolution, f (-) indicates the feature extraction network, C⁽ⁱ⁾(i ═ 1,2) denotes an output class label of the i-th convolutional neural network, p⁽ⁱ⁾(i ═ 1,2) denotes the probability value of the i-th order convolutional neural network for outputting each class.

6. The algorithm for facial expression recognition in combination with the pyramid of multilevel convolution features of claim 5,

Each level of network generates a feature map about the input facial expression picturerepresenting a set of generated feature maps of each level of network, wherein m represents the number of network levels contained in the model, wherein m is equal to (1,2), n represents the number of the feature maps finally output by each level of network, and an input image X is represented as:

7. The algorithm for facial expression recognition in combination with the multi-level pyramid of convolution features of claim 6,

The feature graph generated by the shallow convolutional layer in the convolutional neural network is large in scale, the feature graph generated by the deep convolutional layer in the convolutional neural network is small in scale, the feature graph generated by the next convolutional layer is transposed and convolved, and then the two feature graphs are added to unify the scales of the feature graphs.

8. the algorithm for facial expression recognition in combination with the multi-level pyramid of convolution features of claim 7,

the attention area positioning network maps the input feature map into (D)_x,D_y) Is a center, D_αthe attention area is a square attention area with side length, then the attention area is mapped to an original facial expression picture, and the original facial expression picture is cut and amplified and then sent to a next-level network.

9. The algorithm for facial expression recognition in combination with the pyramid of multilevel convolution features of claim 8,

The second feature extraction network can automatically activate the salient regions of the facial expressions in the process of identifying and extracting the facial expressions, firstly, accurate positioning must be achieved on the regions with the largest response in the feature images, then the accurately positioned regions are mapped to the original facial expression images, important local regions of the original facial expressions are obtained, and feature extraction is conducted on the local regions to achieve local region identification of the images.

10. The algorithm for facial expression recognition in combination with the multi-level pyramid of convolution features of claim 9,

When initializing, all feature maps generated by the last layer of the first-level feature extraction network are added together, then an area with a larger numerical value in the added feature maps is positioned, the positioned area is fitted into a square, then the central coordinate and the side length of the square are obtained and set as initialization parameters of the attention network, and the initialization parameters are expressed as:

Wherein F represents a feature map generated by the last convolutional layer of the first feature extraction network, d represents the total number of feature maps generated by the network, F is the total feature map obtained by adding each corresponding point of each feature map, h and w represent the height and width of the feature map, and p represents the pixel mean value of the total feature map.