CN108764250A - A method of extracting essential image with convolutional neural networks - Google Patents
Info
- Publication number
- CN108764250A CN108764250A CN201810407424.8A CN201810407424A CN108764250A CN 108764250 A CN108764250 A CN 108764250A CN 201810407424 A CN201810407424 A CN 201810407424A CN 108764250 A CN108764250 A CN 108764250A
- Authority
- CN
- China
- Prior art keywords
- image
- network
- branch
- layer
- convolutional
- Prior art date: 2018-05-02
- Legal status (The legal status is an assumption and is not a legal conclusion. Google has not performed a legal analysis and makes no representation as to the accuracy of the status listed.)
- Granted
Classifications
-
- G—PHYSICS
- G06—COMPUTING; CALCULATING OR COUNTING
- G06V—IMAGE OR VIDEO RECOGNITION OR UNDERSTANDING
- G06V10/00—Arrangements for image or video recognition or understanding
- G06V10/40—Extraction of image or video features
- G06V10/44—Local feature extraction by analysis of parts of the pattern, e.g. by detecting edges, contours, loops, corners, strokes or intersections; Connectivity analysis, e.g. of connected components
-
- G—PHYSICS
- G06—COMPUTING; CALCULATING OR COUNTING
- G06F—ELECTRIC DIGITAL DATA PROCESSING
- G06F18/00—Pattern recognition
- G06F18/20—Analysing
- G06F18/21—Design or setup of recognition systems or techniques; Extraction of features in feature space; Blind source separation
- G06F18/214—Generating training patterns; Bootstrap methods, e.g. bagging or boosting
-
- G—PHYSICS
- G06—COMPUTING; CALCULATING OR COUNTING
- G06N—COMPUTING ARRANGEMENTS BASED ON SPECIFIC COMPUTATIONAL MODELS
- G06N3/00—Computing arrangements based on biological models
- G06N3/02—Neural networks
- G06N3/04—Architecture, e.g. interconnection topology
- G06N3/045—Combinations of networks
-
- G—PHYSICS
- G06—COMPUTING; CALCULATING OR COUNTING
- G06N—COMPUTING ARRANGEMENTS BASED ON SPECIFIC COMPUTATIONAL MODELS
- G06N3/00—Computing arrangements based on biological models
- G06N3/02—Neural networks
- G06N3/08—Learning methods
- G06N3/084—Backpropagation, e.g. using gradient descent
Landscapes
- Engineering & Computer Science (AREA)
- Theoretical Computer Science (AREA)
- Physics & Mathematics (AREA)
- General Physics & Mathematics (AREA)
- Data Mining & Analysis (AREA)
- Evolutionary Computation (AREA)
- Life Sciences & Earth Sciences (AREA)
- Artificial Intelligence (AREA)
- General Engineering & Computer Science (AREA)
- Computing Systems (AREA)
- Software Systems (AREA)
- Molecular Biology (AREA)
- Computational Linguistics (AREA)
- Biophysics (AREA)
- Biomedical Technology (AREA)
- Mathematical Physics (AREA)
- General Health & Medical Sciences (AREA)
- Health & Medical Sciences (AREA)
- Computer Vision & Pattern Recognition (AREA)
- Bioinformatics & Computational Biology (AREA)
- Bioinformatics & Cheminformatics (AREA)
- Evolutionary Biology (AREA)
- Multimedia (AREA)
- Image Analysis (AREA)
- Image Processing (AREA)
Abstract
The invention provides a method for extracting intrinsic ("essential") images with a convolutional neural network. First, an image-to-image two-stream convolutional network with a parallel structure is constructed; the network is then trained on a purpose-built training set to optimize its parameters, so that it extracts environment-invariant multi-layer features and directly reconstructs the intrinsic images (the reflectance map and the illumination map). Because the two-stream convolutional neural network is built on deep-learning theory and has strong feature-extraction capability, it can separate the reflectance map and the illumination map directly from the original image. The model is a fully convolutional, image-to-image network containing two branch streams that generate the illumination map and the reflectance map respectively; by combining higher-layer convolution results with the outputs of the deconvolution operations, the structure reduces the reconstruction error of the illumination and reflectance maps to a certain extent and improves the network's ability to reconstruct features.
Description
Technical Field
The invention belongs to the technical field of image processing, and in particular relates to a method for extracting intrinsic images with a convolutional neural network.
Background Art
Image understanding and analysis is one of the most fundamental tasks in image processing. Because the imaging process is jointly affected by many conditions, such as the characteristics of the target object itself, the shooting environment, and the acquisition equipment, image processing must account for many sources of interference, such as shadows, color discontinuities, and changes in object pose. These variations pose a considerable challenge to image processing algorithms and substantially degrade the performance of existing image analysis algorithms in complex environments. How to improve the robustness of image analysis algorithms in complex environments has therefore become a research hotspot in recent years. In fact, if the intrinsic features in an image can be recovered from the available observed image, the problems encountered in image analysis described above can be solved well. Intrinsic features are the features inherent to the target object and independent of its surroundings, namely the object's reflectance characteristics (including color, texture, and similar information) and its shape. Although these two inherent properties of the target object do not change with the surrounding environment, the observed image captured of the object is affected by the environment. Therefore, if the intrinsic features of an object can be recovered directly from the observed image, its inherent shape, color, texture, and related information are extracted and the influence of environmental changes on the image is removed. This allows a more accurate understanding of the image and also provides a more reliable information basis for robust image analysis in complex environments.
Existing algorithms can be divided into three categories according to how they extract the intrinsic features of a target. The first category comprises implicit intrinsic-feature analysis algorithms, which apply pattern recognition to the multiple modalities of the target object (its appearances under different illumination conditions and different poses). During learning, such algorithms do not specifically consider the internal relationships between the various appearances; instead, they perform pattern analysis directly on the observations, attempting to obtain the distribution function of the target in feature space. A serious problem for this category is the generalization of the learned target description function: the distribution of the training samples strongly affects the distribution function that is finally learned. If the training samples show the target only under a single illumination condition or pose, the learned result is hard to generalize to images of the target under different illumination or new poses. With imperfect samples, such algorithms therefore generalize poorly to target objects in varied, complex environments. The second category comprises explicit intrinsic-feature analysis algorithms, which analyze the internal relationships between the appearances of the target object in different states. Compared with the implicit algorithms, these analyze the object's reflectance and shape directly, based on the physics of image formation and prior knowledge of reflectance and shape; from these inherent features, the image of the target in a new state can then be computed directly. The results obtained by explicit analysis are therefore more accurate and more generalizable. However, such algorithms usually estimate the intrinsic features through constraints on structure, texture, color, and the like; that is, following the theoretical framework of Retinex, they convert the signal classification problem into an energy optimization problem and carry out the computation at a single scale. The accuracy of the result thus depends heavily on the performance of the optimizer. Moreover, because the convexity of the objective function cannot be guaranteed, the solver often falls into a local minimum and fails to reach the optimal solution, or requires an initialization that is already close to it. All of this limits the performance of these algorithms. The third category uses deep-learning neural networks to extract intrinsic images: a convolutional neural network is trained to predict the intrinsic images directly from a single RGB image. However, the network structures of existing algorithms of this type are relatively simple, and their training sets are artificial images synthesized with computer graphics software, so the extracted intrinsic images are not very sharp, especially when such methods are applied to natural images.
Summary of the Invention
To overcome the insufficient feature-extraction capability of existing implicit and explicit intrinsic-feature analysis algorithms, and the fact that existing deep-learning neural-network algorithms mainly target artificial images, the invention provides a method for extracting intrinsic images with a convolutional neural network. First, an image-to-image two-stream convolutional network with a parallel structure is constructed; the network is then trained to optimize its parameters, so that it extracts environment-invariant multi-layer features and directly reconstructs the intrinsic images (the reflectance map and the illumination map). The multi-stream structure, on the one hand, separates the tasks so that different streams extract different features; on the other hand, the two streams constrain each other, which improves the accuracy of the algorithm.
A method for extracting intrinsic images with a convolutional neural network, characterized by the following steps:
Step 1: Construct a two-stream convolutional neural network model with a parallel structure, consisting of one shared branch and two private branches.
The shared branch consists of five convolutional layers, each followed by a pooling layer. All convolution kernels are 3×3, and each layer outputs a feature map: the first convolutional layer outputs 64 channels, the second 128, the third 256, and the fourth and fifth 512 each. The pooling layers use 2×2 average pooling.
The two private branches have identical structure, each containing three deconvolution layers with 4×4 kernels; one branch reconstructs the illumination image and the other reconstructs the reflectance image. All deconvolution layers output 256 channels.
The feature map output by the third convolutional layer of the shared branch, together with the output of the second deconvolution layer of a private branch, serves as the input to that branch's third deconvolution layer; likewise, the feature map output by the fourth convolutional layer of the shared branch, together with the output of the first deconvolution layer of a private branch, serves as the input to that branch's second deconvolution layer.
Step 2: Construct the training data set. From the center of each image in the BOLD data set created by Jiang et al., crop a 1280×1280 region and divide it into five equal parts along both the rows and the columns, so that each image in the original data set yields 25 images of size 256×256. Randomly select 53,720 groups of these images to form the test set; the remaining images form the training set.
Step 3: Train the two-stream convolutional neural network built in Step 1 on the training set obtained in Step 2. First initialize the weights of every layer randomly, then train the network with supervised error backpropagation to obtain the trained network. The base learning rate is 10⁻¹³ with a fixed-learning-rate policy, the batch size is 5, the loss function is SoftmaxWithLoss, and the network is considered converged when the loss values of two successive iterations differ by no more than ±5%.
Step 4: Process the test set obtained in Step 2 with the trained network to obtain the extracted intrinsic images, namely the illumination map and the reflectance map.
The method has also been tested on the public MIT Intrinsic Images dataset, and the results show that it remains effective.
The beneficial effects of the invention are as follows. Because the method follows a deep-learning route to intrinsic-image extraction and exploits the strong feature-extraction capability of neural networks built on deep-learning theory, the reflectance map and the illumination map can be separated directly from the original image. In addition, the two-stream convolutional neural network proposed by the invention is a fully convolutional, image-to-image network model containing two branch streams that generate the illumination map and the reflectance map respectively. By combining higher-layer convolution results with the outputs of the deconvolution operations, the structure enhances the detail of the post-deconvolution feature maps, reduces the reconstruction error of the illumination and reflectance maps to a certain extent, and improves the network's feature-reconstruction ability.
Brief Description of the Drawings
Fig. 1 is a flow chart of the method of the invention for extracting intrinsic images with a convolutional neural network.
Fig. 2 is a structural diagram of the two-stream convolutional neural network constructed by the invention.
Fig. 3 shows example images from the data set constructed by the invention.
Detailed Description of Embodiments
The invention is further described below with reference to the drawings and an embodiment; the invention includes, but is not limited to, the following embodiment.
The invention provides a method for extracting intrinsic images with a convolutional neural network. As shown in Fig. 1, the main procedure is as follows.
1. Construct the two-stream convolutional neural network model with a parallel structure
Reconstructing an image in effect assigns different weights to the features extracted from the image and combines features of the same type, so as to reconstruct the illumination map and the reflectance map from the original image. In other words, all of the required features are present in the same original image. The feature-extraction part can therefore be shared, while the reconstruction of the two different types of intrinsic image must be carried out separately. Accordingly, the network constructed by the invention is divided into a shared branch and private branches. After the convolution operations of the shared branch, the feature maps output by successive layers shrink step by step. To keep the output image the same spatial size as the input image, three deconvolution layers are placed on each of the two private branches, gradually restoring the feature maps to the original spatial size. Inspired by residual network structures, the inventors found experimentally that combining the later two layers of the shared branch with the later two layers of the private branches yields better optimization of the network parameters. For these reasons, the invention constructs the two-stream convolutional neural network with a parallel structure shown in Fig. 2. The network model consists of one shared branch and two private branches.
The shared branch consists of five convolutional layers, each followed by a pooling layer. All convolution kernels are 3×3, and each layer outputs a feature map: the first convolutional layer outputs 64 channels, the second 128, the third 256, and the fourth and fifth 512 each. The pooling layers use 2×2 average pooling. The two private branches have identical structure, each containing three deconvolution layers with 4×4 kernels; one branch reconstructs the illumination image and the other reconstructs the reflectance image, and all deconvolution layers output 256 channels. Moreover, the feature map output by the third convolutional layer of the shared branch and the output of the second deconvolution layer of a private branch together serve as the input to that branch's third deconvolution layer; the feature map output by the fourth convolutional layer of the shared branch and the output of the first deconvolution layer together serve as the input to the second deconvolution layer.
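For concreteness, a minimal sketch of this architecture follows, written in PyTorch (the embodiment below implements the network in Caffe). The kernel sizes, channel widths, 2×2 average pooling, and the two cross-branch combinations follow the description above; the strides, padding, ReLU activations, the use of channel concatenation to combine feature maps, and the final read-out that restores a 256×256 RGB output are not specified in the text and are assumptions of this sketch.

```python
import torch
import torch.nn as nn

class TwoStreamIntrinsicNet(nn.Module):
    def __init__(self):
        super().__init__()
        widths = [3, 64, 128, 256, 512, 512]
        # Shared branch: five 3x3 conv layers, each followed by 2x2 average pooling.
        self.convs = nn.ModuleList(
            nn.Sequential(nn.Conv2d(widths[i], widths[i + 1], 3, padding=1),
                          nn.ReLU(inplace=True),
                          nn.AvgPool2d(2))
            for i in range(5))
        # Two identical private branches of three 4x4 deconv layers (256 channels each).
        self.illumination = self._private_branch()
        self.reflectance = self._private_branch()

    @staticmethod
    def _private_branch():
        return nn.ModuleDict({
            "deconv1": nn.ConvTranspose2d(512, 256, 4, stride=2, padding=1),
            # deconv2 sees deconv1 output (256 ch) concatenated with conv4 output (512 ch).
            "deconv2": nn.ConvTranspose2d(256 + 512, 256, 4, stride=2, padding=1),
            # deconv3 sees deconv2 output (256 ch) concatenated with conv3 output (256 ch).
            "deconv3": nn.ConvTranspose2d(256 + 256, 256, 4, stride=2, padding=1),
            # Assumed read-out: upsample back to the input size and map to RGB.
            "readout": nn.Sequential(
                nn.Upsample(scale_factor=4, mode="bilinear", align_corners=False),
                nn.Conv2d(256, 3, 1)),
        })

    @staticmethod
    def _run_branch(b, c3, c4, c5):
        d1 = torch.relu(b["deconv1"](c5))                      # 8x8   -> 16x16
        d2 = torch.relu(b["deconv2"](torch.cat([d1, c4], 1)))  # 16x16 -> 32x32
        d3 = torch.relu(b["deconv3"](torch.cat([d2, c3], 1)))  # 32x32 -> 64x64
        return b["readout"](d3)                                # 64x64 -> 256x256

    def forward(self, x):                                      # x: (N, 3, 256, 256)
        feats = []
        for conv in self.convs:
            x = conv(x)
            feats.append(x)
        c3, c4, c5 = feats[2], feats[3], feats[4]
        return (self._run_branch(self.illumination, c3, c4, c5),
                self._run_branch(self.reflectance, c3, c4, c5))
```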
2. Construct the data set
The network structure proposed by the invention is relatively complex, with many parameters to train. To let the network reach its best performance, the invention builds a data set for studying intrinsic-image extraction algorithms on the basis of the BOLD data set created by Jiang et al. (Jiang X Y, Schofield A J, Wyatt J L. Correlation-Based Intrinsic Image Extraction from a Single Image [C]. European Conference on Computer Vision, 2010: 58-71). The data set contains 268,600 groups of images, each comprising an original image, an illumination map, and a reflectance map. 53,720 groups were drawn at random to form the test set, used to evaluate the performance of the intrinsic-image extraction algorithm; the remaining 214,880 groups form the training set used to train the deep-learning neural network. The BOLD database contains a large number of high-resolution image groups, all objects photographed under carefully adjusted lighting conditions, chiefly complex patterns, faces, and outdoor scenes; Fig. 3 shows some example images from the data set. Jiang et al. built the database to provide a test platform for image processing algorithms, in particular intrinsic-image extraction, illumination-removal, and light-source estimation algorithms. To this end they provide maps of the lighting conditions and of the object surfaces, i.e., illumination maps and reflectance maps, all of which are standard RGB color-space images with linear luminance characteristics. Weighing the number of images, image quality, and scene complexity, the invention selects the image groups whose subjects are complex patterns as the basis for the data set used in this research. The original images are 1280 pixels wide and 1350 pixels high; this amount of data is too large for an ordinary computer and easily leads to over-learning, which is unfavorable for training a deep-learning neural network. The selected image category has one obvious characteristic: the key information is concentrated in the middle of the image. The invention therefore places a 1280×1280 window at the center of each image, crops the original image, and then divides the crop into five equal parts along both the rows and the columns, so that one original image yields 25 smaller images of 256×256. Cropping the original images in this way preserves their key information and maximizes data utilization, and it also brings several practical benefits for this research: the amount of data per image group is moderate and places no excessive demands on computer performance; the relatively reasonable image size makes the convolutional neural network easier to design; and the cropped images contain both positive and negative samples, which helps avoid overfitting to a certain extent.
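A minimal sketch of this cropping scheme, under the assumption that the images are loaded with PIL; the file path and any naming scheme are illustrative only:

```python
from PIL import Image

def tile_image(path):
    img = Image.open(path)                       # BOLD images: 1280 x 1350 pixels
    w, h = img.size
    left, top = (w - 1280) // 2, (h - 1280) // 2
    img = img.crop((left, top, left + 1280, top + 1280))  # centred 1280 x 1280 window
    tiles = []
    for row in range(5):                         # split rows and columns into fifths
        for col in range(5):
            box = (col * 256, row * 256, (col + 1) * 256, (row + 1) * 256)
            tiles.append(img.crop(box))          # 25 tiles of 256 x 256
    return tiles
```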
3. Train the network
In this embodiment, the constructed network is trained under the Caffe framework with the training set of the data set built from BOLD. Compared with other frameworks, Caffe is easy to install, supports all major operating systems, and offers good interface support for Python and Matlab. Because the constructed network structure is relatively complex, the amount of data to learn is large, and many iterations are required, and also to keep the network from learning so fast that it misses the optimal solution, the base learning rate was set after repeated trials to 10⁻¹³, with the learning-rate policy set to "fixed", i.e., a fixed learning rate. In consideration of computer performance, and to avoid the network converging too quickly, the batch size was set to 5 and the loss function to SoftmaxWithLoss.
The loss function measures the difference between the network's output and the ground-truth label; as the number of iterations increases, the network loss decreases, i.e., the estimate moves closer to the ground truth. On a single dimension the loss function can be written as

$$J(\theta) = -\frac{1}{m}\sum_{i=1}^{m}\sum_{j=0}^{255} 1\{y_i = j\}\,\log\frac{e^{\theta_j^{\top} x_i}}{\sum_{l=0}^{255} e^{\theta_l^{\top} x_i}}$$

where {(x₁, y₁), ..., (x_m, y_m)} are m groups of labeled training data, x denotes the input, y the corresponding label with y ∈ [0, 255], 1{F} is the indicator function, which equals 1 when F is true and 0 when F is false, and θ denotes the parameters of the convolutional neural network. During training, the error between the estimated image and the ground-truth label is backpropagated through the neural network to optimize its parameters, so that the error gradually shrinks. For an RGB image, the loss is the sum of this loss over the R, G, and B dimensions of the image.
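The per-dimension loss above is what Caffe's SoftmaxWithLoss layer computes: a 256-way softmax cross-entropy over pixel intensities. A small sketch follows; how the network's outputs are decoded into per-intensity class scores is not spelled out in the text, so the tensor shapes here are assumptions for illustration.

```python
import torch
import torch.nn.functional as F

def intensity_softmax_loss(logits, target):
    # logits: (N, 256, H, W), one score per possible intensity 0..255 at each pixel.
    # target: (N, H, W) integer ground-truth intensities for one colour channel.
    return F.cross_entropy(logits, target.long())

def rgb_loss(logits_rgb, target_rgb):
    # logits_rgb: list of three (N, 256, H, W) tensors, one per colour channel.
    # target_rgb: (N, 3, H, W) ground-truth image with integer values 0..255.
    # The RGB loss is the sum of the single-dimension loss over R, G and B.
    return sum(intensity_softmax_loss(logits_rgb[c], target_rgb[:, c])
               for c in range(3))
```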
After roughly 210,000 iterations, successive loss values fluctuated within ±5% of each other and the network gradually converged; that is, under the current network structure the parameters approach their optimum and the network's ability to extract intrinsic images approaches its best, yielding the trained network. Although the two private branches look identical in structure, the ground-truth labels supplied to them differ, so the parameters they learn during training also differ. Consequently, when the network is used to extract intrinsic images, the two branches operate on the data differently, allowing the different branches to extract the different types of intrinsic image.
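A minimal training-loop sketch under the stated settings (fixed learning rate of 10⁻¹³, batch size 5, supervised error backpropagation, and stopping once successive loss values agree within ±5%) is given below. The embodiment trains with a Caffe solver; this loop substitutes plain SGD in PyTorch for illustration, `criterion` stands for the RGB reconstruction loss sketched above, and `loader` is an assumed data loader yielding (image, illumination, reflectance) batches.

```python
import torch

def train(model, loader, criterion, max_epochs=100):
    opt = torch.optim.SGD(model.parameters(), lr=1e-13)   # "fixed" learning-rate policy
    prev = None
    for _ in range(max_epochs):
        for image, illum_gt, refl_gt in loader:           # batches of size 5
            illum, refl = model(image)
            loss = criterion(illum, illum_gt) + criterion(refl, refl_gt)
            opt.zero_grad()
            loss.backward()                               # supervised error backpropagation
            opt.step()
            cur = loss.item()
            # Stated convergence rule: two successive loss values within ±5%.
            if prev is not None and abs(cur - prev) <= 0.05 * abs(prev):
                return model
            prev = cur
    return model
```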
4. Extract intrinsic images with the trained network
The trained network processes the test set contained in the data set built in Step 2: each RGB image is converted into a three-dimensional matrix and fed to the network as input, and after the network's multi-layer computation the extracted intrinsic images, namely the illumination map and the reflectance map, are obtained. The method has also been tested on the public MIT Intrinsic Images dataset, and the results show that it remains effective.
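A sketch of this inference step, assuming the PyTorch model from the earlier sketches; any input normalization or decoding of class scores into intensities performed by the original Caffe deployment is glossed over here.

```python
import numpy as np
import torch
from PIL import Image

def extract_intrinsics(model, path):
    # Load the RGB test image as a three-dimensional matrix (H, W, 3).
    img = np.asarray(Image.open(path).convert("RGB"), dtype=np.float32)
    x = torch.from_numpy(img).permute(2, 0, 1).unsqueeze(0)  # (1, 3, 256, 256)
    model.eval()
    with torch.no_grad():
        illumination, reflectance = model(x)   # the two extracted intrinsic images
    return illumination.squeeze(0), reflectance.squeeze(0)
```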
Claims (1)
Priority Applications (1)
Application Number | Priority Date | Filing Date | Title |
---|---|---|---|
CN201810407424.8A CN108764250B (en) | 2018-05-02 | 2018-05-02 | A method of extracting essential images using convolutional neural network |
Applications Claiming Priority (1)
Application Number | Priority Date | Filing Date | Title |
---|---|---|---|
CN201810407424.8A CN108764250B (en) | 2018-05-02 | 2018-05-02 | A method of extracting essential images using convolutional neural network |
Publications (2)
Publication Number | Publication Date |
---|---|
CN108764250A true CN108764250A (en) | 2018-11-06 |
CN108764250B CN108764250B (en) | 2021-09-17 |
Family
ID=64008978
Family Applications (1)
Application Number | Title | Priority Date | Filing Date |
---|---|---|---|
CN201810407424.8A Active CN108764250B (en) | 2018-05-02 | 2018-05-02 | A method of extracting essential images using convolutional neural network |
Country Status (1)
Country | Link |
---|---|
CN (1) | CN108764250B (en) |
Cited By (8)
Publication number | Priority date | Publication date | Assignee | Title |
---|---|---|---|---|
CN109658330A (en) * | 2018-12-10 | 2019-04-19 | 广州市久邦数码科技有限公司 | A kind of color development method of adjustment and device |
CN110659023A (en) * | 2019-09-11 | 2020-01-07 | 腾讯科技(深圳)有限公司 | Method for generating programming content and related device |
CN111179196A (en) * | 2019-12-28 | 2020-05-19 | 杭州电子科技大学 | Multi-resolution depth network image highlight removing method based on divide-and-conquer |
CN111325221A (en) * | 2020-02-25 | 2020-06-23 | 青岛海洋科学与技术国家实验室发展中心 | Image feature extraction method based on image depth information |
CN111489321A (en) * | 2020-03-09 | 2020-08-04 | 淮阴工学院 | Depth network image enhancement method and system based on derivative graph and Retinex |
CN113034353A (en) * | 2021-04-09 | 2021-06-25 | 西安建筑科技大学 | Essential image decomposition method and system based on cross convolution neural network |
US20210272236A1 (en) * | 2019-02-28 | 2021-09-02 | Tencent Technology (Shenzhen) Company Limited | Image enhancement method and apparatus, and storage medium |
CN114742922A (en) * | 2022-04-07 | 2022-07-12 | 苏州科技大学 | An adaptive image engine color optimization method, system and storage medium |
Citations (7)
Publication number | Priority date | Publication date | Assignee | Title |
---|---|---|---|---|
CN104756152A (en) * | 2012-10-26 | 2015-07-01 | Sk电信有限公司 | Image correction device for accelerating image correction and method for same |
CN106778583A (en) * | 2016-12-07 | 2017-05-31 | 北京理工大学 | Vehicle attribute recognition methods and device based on convolutional neural networks |
CN107145857A (en) * | 2017-04-29 | 2017-09-08 | 深圳市深网视界科技有限公司 | Face character recognition methods, device and method for establishing model |
CN107330451A (en) * | 2017-06-16 | 2017-11-07 | 西交利物浦大学 | Clothes attribute retrieval method based on depth convolutional neural networks |
CN107403197A (en) * | 2017-07-31 | 2017-11-28 | 武汉大学 | A kind of crack identification method based on deep learning |
CN107633272A (en) * | 2017-10-09 | 2018-01-26 | 东华大学 | A kind of DCNN textural defect recognition methods based on compressed sensing under small sample |
JP2018018422A (en) * | 2016-07-29 | 2018-02-01 | 株式会社デンソーアイティーラボラトリ | Prediction device, prediction method and prediction program |
-
2018
- 2018-05-02 CN CN201810407424.8A patent/CN108764250B/en active Active
Patent Citations (7)
Publication number | Priority date | Publication date | Assignee | Title |
---|---|---|---|---|
CN104756152A (en) * | 2012-10-26 | 2015-07-01 | Sk电信有限公司 | Image correction device for accelerating image correction and method for same |
JP2018018422A (en) * | 2016-07-29 | 2018-02-01 | 株式会社デンソーアイティーラボラトリ | Prediction device, prediction method and prediction program |
CN106778583A (en) * | 2016-12-07 | 2017-05-31 | 北京理工大学 | Vehicle attribute recognition methods and device based on convolutional neural networks |
CN107145857A (en) * | 2017-04-29 | 2017-09-08 | 深圳市深网视界科技有限公司 | Face character recognition methods, device and method for establishing model |
CN107330451A (en) * | 2017-06-16 | 2017-11-07 | 西交利物浦大学 | Clothes attribute retrieval method based on depth convolutional neural networks |
CN107403197A (en) * | 2017-07-31 | 2017-11-28 | 武汉大学 | A kind of crack identification method based on deep learning |
CN107633272A (en) * | 2017-10-09 | 2018-01-26 | 东华大学 | A kind of DCNN textural defect recognition methods based on compressed sensing under small sample |
Non-Patent Citations (3)
Title |
---|
Gao Huang et al., "Densely Connected Convolutional Networks", 2017 IEEE Conference on Computer Vision and Pattern Recognition *
Jiang X Y et al., "Correlation-Based Intrinsic Image Extraction from a Single Image", European Conference on Computer Vision *
Wang Jun et al., "Infrared and Visible Image Fusion Method Based on Non-Subsampled Contourlet Transform and Sparse Representation", Acta Armamentarii *
Cited By (12)
Publication number | Priority date | Publication date | Assignee | Title |
---|---|---|---|---|
CN109658330A (en) * | 2018-12-10 | 2019-04-19 | 广州市久邦数码科技有限公司 | A kind of color development method of adjustment and device |
CN109658330B (en) * | 2018-12-10 | 2023-12-26 | 广州市久邦数码科技有限公司 | Color development adjusting method and device |
US20210272236A1 (en) * | 2019-02-28 | 2021-09-02 | Tencent Technology (Shenzhen) Company Limited | Image enhancement method and apparatus, and storage medium |
US11790497B2 (en) * | 2019-02-28 | 2023-10-17 | Tencent Technology (Shenzhen) Company Limited | Image enhancement method and apparatus, and storage medium |
CN110659023A (en) * | 2019-09-11 | 2020-01-07 | 腾讯科技(深圳)有限公司 | Method for generating programming content and related device |
CN111179196A (en) * | 2019-12-28 | 2020-05-19 | 杭州电子科技大学 | Multi-resolution depth network image highlight removing method based on divide-and-conquer |
CN111179196B (en) * | 2019-12-28 | 2023-04-18 | 杭州电子科技大学 | Multi-resolution depth network image highlight removing method based on divide-and-conquer |
CN111325221A (en) * | 2020-02-25 | 2020-06-23 | 青岛海洋科学与技术国家实验室发展中心 | Image feature extraction method based on image depth information |
CN111325221B (en) * | 2020-02-25 | 2023-06-23 | 青岛海洋科技中心 | Image feature extraction method based on image depth information |
CN111489321A (en) * | 2020-03-09 | 2020-08-04 | 淮阴工学院 | Depth network image enhancement method and system based on derivative graph and Retinex |
CN113034353A (en) * | 2021-04-09 | 2021-06-25 | 西安建筑科技大学 | Essential image decomposition method and system based on cross convolution neural network |
CN114742922A (en) * | 2022-04-07 | 2022-07-12 | 苏州科技大学 | An adaptive image engine color optimization method, system and storage medium |
Also Published As
Publication number | Publication date |
---|---|
CN108764250B (en) | 2021-09-17 |
Similar Documents
Publication | Publication Date | Title |
---|---|---|
CN108764250A (en) | A method of extracting essential image with convolutional neural networks | |
CN113298818B (en) | Building Segmentation Method of Remote Sensing Image Based on Attention Mechanism and Multi-scale Features | |
Boulch | ConvPoint: Continuous convolutions for point cloud processing | |
Baldassarre et al. | Deep koalarization: Image colorization using cnns and inception-resnet-v2 | |
US11328173B2 (en) | Switchable propagation neural network | |
CN109255831A (en) | The method that single-view face three-dimensional reconstruction and texture based on multi-task learning generate | |
Zhu et al. | Irisformer: Dense vision transformers for single-image inverse rendering in indoor scenes | |
CN114170497A (en) | A multi-scale underwater fish detection method based on attention module | |
CN110399888B (en) | Weiqi judging system based on MLP neural network and computer vision | |
CN111986108A (en) | Complex sea-air scene image defogging method based on generation countermeasure network | |
CN114742719A (en) | An end-to-end image dehazing method based on multi-feature fusion | |
CN110852182A (en) | A deep video human behavior recognition method based on 3D spatial time series modeling | |
CN112529789B (en) | Weak supervision method for removing shadow of urban visible light remote sensing image | |
CN111476249A (en) | Construction method of multi-scale large-receptive-field convolutional neural network | |
CN114972216B (en) | A method for constructing a texture surface defect detection model and its application | |
CN116391206A (en) | Stereoscopic performance capture with neural rendering | |
DE102022113244A1 (en) | Joint shape and appearance optimization through topology scanning | |
Schambach et al. | A multispectral light field dataset and framework for light field deep learning | |
Bai et al. | Deep graph learning for spatially-varying indoor lighting prediction | |
CN115631427A (en) | Multi-scene ship detection and segmentation method based on mixed attention | |
CN114998507A (en) | Luminosity three-dimensional reconstruction method based on self-supervision learning | |
CN114663880A (en) | Three-dimensional target detection method based on multi-level cross-modal self-attention mechanism | |
CN117115538A (en) | Photovoltaic cell defect detection bionic model based on mimic vision | |
Liu et al. | Image Decomposition Sensor Based on Conditional Adversarial Model | |
Hesham et al. | Image colorization using Scaled-YOLOv4 detector |
Legal Events
Date | Code | Title | Description |
---|---|---|---|
PB01 | Publication | |
SE01 | Entry into force of request for substantive examination | |
GR01 | Patent grant | |