CN108377387A - Virtual reality method for evaluating video quality based on 3D convolutional neural networks - Google Patents
- Publication number
- CN108377387A CN108377387A CN201810240647.XA CN201810240647A CN108377387A CN 108377387 A CN108377387 A CN 108377387A CN 201810240647 A CN201810240647 A CN 201810240647A CN 108377387 A CN108377387 A CN 108377387A
- Authority
- CN
- China
- Prior art keywords
- video
- cnn
- videos
- frame
- virtual reality
- Prior art date
- Legal status (The legal status is an assumption and is not a legal conclusion. Google has not performed a legal analysis and makes no representation as to the accuracy of the status listed.)
- Pending
Classifications
-
- H—ELECTRICITY
- H04—ELECTRIC COMMUNICATION TECHNIQUE
- H04N—PICTORIAL COMMUNICATION, e.g. TELEVISION
- H04N17/00—Diagnosis, testing or measuring for television systems or their details
- H04N17/004—Diagnosis, testing or measuring for television systems or their details for digital television systems
-
- H—ELECTRICITY
- H04—ELECTRIC COMMUNICATION TECHNIQUE
- H04N—PICTORIAL COMMUNICATION, e.g. TELEVISION
- H04N19/00—Methods or arrangements for coding, decoding, compressing or decompressing digital video signals
- H04N19/10—Methods or arrangements for coding, decoding, compressing or decompressing digital video signals using adaptive coding
- H04N19/134—Methods or arrangements for coding, decoding, compressing or decompressing digital video signals using adaptive coding characterised by the element, parameter or criterion affecting or controlling the adaptive coding
- H04N19/154—Measured or subjectively estimated visual quality after decoding, e.g. measurement of distortion
Landscapes
- Engineering & Computer Science (AREA)
- Multimedia (AREA)
- Signal Processing (AREA)
- Health & Medical Sciences (AREA)
- Biomedical Technology (AREA)
- General Health & Medical Sciences (AREA)
- Testing, Inspecting, Measuring Of Stereoscopic Televisions And Televisions (AREA)
Abstract
The present invention relates to a stereoscopic video quality evaluation method based on a 3D convolutional neural network (3D CNN), comprising the following steps. Video preprocessing: a VR difference video is obtained from the left-view and right-view videos of a VR video; frames are sampled uniformly from the difference video and each frame is cut into non-overlapping blocks, the blocks at the same position across frames forming one VR video patch, so that enough data are generated to train the 3D CNN. Establishing the 3D CNN model. Training the 3D CNN model: using stochastic gradient descent, the VR video patches serve as input and each patch inherits the quality score of its source video as its label; patches are fed into the network in batches, and after many iterations the weights of every layer are fully optimized, finally yielding a convolutional neural network model that can evaluate virtual reality video quality. Obtaining the final result. The present invention improves the accuracy of objective evaluation methods.
Description
Technical field
The invention belongs to the field of video processing and relates to a virtual reality video quality evaluation method.
Background technology
As a new simulation and interaction technology, virtual reality (VR) is used in many fields such as architecture, gaming and the military. It can create a virtual environment consistent with the rules of the real world, or build a simulated environment completely detached from reality, bringing people a more realistic audiovisual and immersive experience [1]. As an important carrier of virtual reality, panoramic stereoscopic video is currently the form closest to the definition of VR video and plays a huge role. However, during acquisition, storage and transmission, limitations of equipment and processing means inevitably introduce distortions that degrade the quality of VR videos. It is therefore essential to study an evaluation method that can effectively assess virtual reality video quality. Subjective evaluation, however, is easily disturbed by many factors, is time-consuming and laborious, and its results are not sufficiently stable. In contrast, objective evaluation assesses quality in software, requires neither participants nor large-scale subjective tests, is easy to operate, and correlates strongly with subjective evaluation, so it has attracted increasing attention from researchers.
Because virtual reality technology has emerged only in recent years, there is currently no specification standard or objective evaluation system for VR videos [2]. VR videos are characterized by realism, immersion and stereoscopic perception [3]; among conventional multimedia types, stereoscopic video is closest to VR video in its characteristics, so evaluating VR videos needs to draw on current ideas in stereoscopic video quality evaluation. Current objective evaluation methods for stereoscopic video fall into three classes: the first class comprises methods based on the human visual system (HVS); the second class comprises methods based on image features combined with machine learning; the third class comprises methods using deep learning. All of the above provide useful references for the objective evaluation of VR videos.
[1] Minderer M, Harvey C D, Donato F, et al. Neuroscience: Virtual reality explored. Nature, 2016, 533(7603): 324.
[2] X. Ge, L. Pan, Q. Li. Multi-Path Cooperative Communications Networks for Augmented and Virtual Reality Transmission. IEEE Transactions on Multimedia, vol. 19, no. 10, pp. 2345-2358, 2017.
[3] Hosseini M, Swaminathan V. Adaptive 360 VR Video Streaming: Divide and Conquer. In IEEE International Symposium on Multimedia, IEEE, 2017: 107-110.
Invention content
The object of the invention is to establish a VR video quality evaluation method that fully considers the characteristics of virtual reality. The VR video objective quality evaluation method proposed by the present invention uses the deep learning model of 3D convolutional neural networks (3D CNN), letting the machine extract VR video features instead of relying on traditional hand-crafted features; moreover, the 3D CNN deep learning model can fully consider the temporal motion information of video. At the same time, the present invention designs a score fusion strategy fitted to the characteristics of how VR videos are produced and played, so as to make an accurate and objective assessment. The technical solution is as follows:
A stereoscopic video quality evaluation method based on a 3D CNN, the evaluation method comprising the following steps:
1) Video preprocessing: obtain a VR difference video from the left-view and right-view videos of the VR video, sample frames uniformly from the difference video, and cut each frame into non-overlapping blocks; the blocks at the same position across frames constitute one VR video patch, generating enough data for training the 3D CNN.
2) Establish the 3D CNN model: the model comprises two convolutional layers, two pooling layers and two fully connected layers; the activation function is the rectified linear unit (ReLU), and Dropout is used to prevent over-fitting. The layer structure and training parameters of the network are then adjusted to achieve a better classification effect.
3) Train the 3D CNN model: using stochastic gradient descent, take the VR video patches as input, with each patch labeled by the quality score of its source video; feed the patches into the network in batches, so that after many iterations the weights of every layer are fully optimized, finally obtaining a convolutional neural network model that can evaluate virtual reality video quality.
4) Obtain the final result: use the trained 3D CNN to score the VR video patches, then apply the score fusion strategy to assign different weights to patches at different positions, and compute the weighted sum as the final objective quality evaluation score of the virtual reality video.
The VR video objective quality evaluation method proposed by the invention uses a recent deep learning model that can extract higher-dimensional features of VR videos: no hand-crafted features are needed, the machine itself learns which features to extract, and the temporal motion information of the video is fully taken into account. In addition, the present invention combines the way VR videos are produced and played: different video patch scores are given different weights and combined through the score fusion strategy to express the objective quality of the VR video as a whole. The video preprocessing adopted by the invention is simple and highly practical, and the proposed test model is fast and easy to operate. The objective VR video quality scores obtained by this method are highly consistent with subjective evaluation results and can accurately reflect the quality of VR videos.
Description of the drawings
Fig. 1 Flow chart of VR video preprocessing.
Fig. 2 Framework of the 3D CNN network.
Fig. 3 3D grid diagram.
Fig. 4 Scatter plots of subjective versus objective scores: (a) symmetric distortion, (b) asymmetric distortion, (c) H.264 distortion, (d) JPEG2000 distortion.
Specific implementation mode
In the stereoscopic video quality evaluation method based on a 3D CNN provided by the invention, each distorted VR video consists of a left video Vl and a right video Vr. The evaluation method comprises the following steps.
Step 1: build the difference video Vd according to the principle of stereoscopic perception. First, every frame of the original VR videos and the distorted VR videos is converted to grayscale; then the required difference video is computed from the left video Vl and the right video Vr. The value of the difference video Vd at video location (x, y, z) is computed as shown in formula (1):
Vd(x, y, z) = |Vl(x, y, z) - Vr(x, y, z)|    (1)
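Formula (1) maps directly onto an element-wise absolute difference. The following is a minimal NumPy sketch, not part of the patent; the array shapes and the toy data are illustrative assumptions (real inputs would be grayscale frame stacks of the left and right views):

```python
import numpy as np

def difference_video(left: np.ndarray, right: np.ndarray) -> np.ndarray:
    """Compute the VR difference video Vd(x, y, z) = |Vl - Vr|, as in formula (1).

    left, right: float arrays of shape (frames, height, width), already grayscale.
    """
    if left.shape != right.shape:
        raise ValueError("left and right videos must have the same shape")
    return np.abs(left - right)

# Toy example: two 4-frame 8x8 "videos".
rng = np.random.default_rng(0)
vl = rng.random((4, 8, 8))
vr = rng.random((4, 8, 8))
vd = difference_video(vl, vr)
```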
Step 2: cut the VR difference video into blocks to build video patches and expand the capacity of the data set. Specifically, 1 frame is extracted from every 8 frames of each VR difference video, N frames in total. Square image blocks of 32 × 32 pixels are cut at the same positions of the extracted frames; the image blocks from the same video at the same position then constitute one VR video patch. To fully extract the spatial information of the video, each frame is uniformly cut into non-overlapping image blocks, so each frame yields M image blocks. Depending on its resolution, each VR video thus yields M video patches of size 32 × 32 × N.
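Step 2 above can be sketched as follows in NumPy. This is an illustrative implementation, not the patent's code; the toy resolution is an assumption chosen so the block grid divides evenly:

```python
import numpy as np

def extract_patches(diff_video: np.ndarray, frame_step: int = 8, block: int = 32) -> np.ndarray:
    """Cut a difference video into non-overlapping block x block x N patches (Step 2).

    diff_video: array of shape (frames, height, width).
    Returns an array of shape (M, N, block, block): M blocks per frame,
    N sampled frames (1 out of every `frame_step`).
    """
    sampled = diff_video[::frame_step]          # 1 frame out of every 8
    n, h, w = sampled.shape
    rows, cols = h // block, w // block         # non-overlapping grid
    patches = []
    for r in range(rows):
        for c in range(cols):
            # Same spatial block across all sampled frames forms one patch.
            patches.append(sampled[:, r*block:(r+1)*block, c*block:(c+1)*block])
    return np.stack(patches)

video = np.zeros((64, 128, 96))   # toy: 64 frames of 128x96
patches = extract_patches(video)  # N = 64/8 = 8, M = (128//32)*(96//32) = 12
```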
Step 3: build and train the 3D CNN deep learning model. The model of the invention is composed of two 3D convolutional layers, two 3D pooling layers and two fully connected layers. Building on 2D CNNs, a 3D CNN considers the information between multiple inputs and can effectively extract the temporal motion information of the video, so the convolution and pooling processes of the 3D CNN must be explained. The formula of 3D convolution is:
Fi^l = f( Σk Fk^(l-1) * Wi^l + bi^l )    (2)
where k indexes the feature maps in layer (l-1) connected to the current convolution kernel, Fk^(l-1) is the k-th 3D feature map in layer (l-1), Wi^l is the i-th 3D convolution kernel in layer l, and * denotes convolution over Fk^(l-1). An additive bias term bi^l and a nonlinear activation function f(·) are applied to obtain the final feature map; typical choices of f(·) include the sigmoid function, the hyperbolic tangent function and the rectified linear function.
The formula of 3D pooling is:
P^l(x, y, z) = max_(m,n,j)∈R F^l(x + m, y + n, z + j)    (3)
where m, n, j range over the region R of selected points in the feature map.
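Formulas (2) and (3) can be made concrete with a naive NumPy sketch. This is illustrative only: the "valid" (no-padding) convolution, the ReLU default, max pooling, and all shapes are assumptions, and the loops are written for clarity rather than speed (as in most CNNs, the kernel is correlated rather than flipped):

```python
import numpy as np

def conv3d(feature_maps, kernels, bias, f=lambda x: np.maximum(x, 0.0)):
    """3D convolution per formula (2): Fi = f(sum_k Fk * Wi + bi).

    feature_maps: (K, D, H, W) input maps; kernels: (I, K, d, h, w); bias: (I,).
    """
    I, K, d, h, w = kernels.shape
    _, D, H, W = feature_maps.shape
    out = np.zeros((I, D - d + 1, H - h + 1, W - w + 1))
    for i in range(I):
        for z in range(out.shape[1]):
            for y in range(out.shape[2]):
                for x in range(out.shape[3]):
                    region = feature_maps[:, z:z+d, y:y+h, x:x+w]
                    out[i, z, y, x] = np.sum(region * kernels[i]) + bias[i]
    return f(out)

def maxpool3d(fmap, size=2):
    """3D max pooling per formula (3), over non-overlapping size^3 regions."""
    out = np.zeros((fmap.shape[0] // size, fmap.shape[1] // size, fmap.shape[2] // size))
    for z in range(out.shape[0]):
        for y in range(out.shape[1]):
            for x in range(out.shape[2]):
                out[z, y, x] = fmap[z*size:(z+1)*size,
                                    y*size:(y+1)*size,
                                    x*size:(x+1)*size].max()
    return out

fm = np.ones((1, 4, 4, 4))              # one 4x4x4 input map of ones
ker = np.ones((1, 1, 2, 2, 2))          # one 2x2x2 kernel of ones
res = conv3d(fm, ker, bias=np.zeros(1)) # every output value = 2*2*2 = 8
pooled = maxpool3d(res[0], size=3)      # 3x3x3 map pooled to 1x1x1
```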
In the present invention the 3D CNN is trained with stochastic gradient descent and uses ReLU as the activation function. To prevent over-fitting, the invention adopts dropout: a dropout rate of 0.5 is used after each pooling layer, and a dropout rate of 0.25 after the first fully connected layer. The mini-batch size of the network is 128 and the learning rate for model training is set to 0.001. In addition, batch normalization is applied between each convolution and the subsequent activation to accelerate network training.
The objective function adopted by this model is as follows:
L = (1/N) Σi (yi - f(xi))^2 + λ||W||^2
where λ is the regularization parameter, yi is the true quality score and f(xi) is the predicted score. After model construction, 80% of the data are used for training and 20% for testing.
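The regularized mean-squared-error objective above can be evaluated directly. The sketch below is illustrative; in particular, summing the squared L2 norm over a list of flattened weight arrays, and the value of λ, are assumptions consistent with the stated λ-regularization rather than details from the patent:

```python
import numpy as np

def objective(y_true, y_pred, weights, lam=0.01):
    """L2-regularized MSE: mean over i of (yi - f(xi))^2, plus lam * ||W||^2."""
    mse = np.mean((np.asarray(y_true) - np.asarray(y_pred)) ** 2)
    reg = lam * sum(np.sum(w ** 2) for w in weights)
    return mse + reg

y = [3.0, 4.0, 5.0]
pred = [2.5, 4.0, 5.5]
loss = objective(y, pred, weights=[np.array([1.0, -1.0])], lam=0.1)
# mse = (0.25 + 0 + 0.25)/3 = 1/6; reg = 0.1 * 2 = 0.2
```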
Step 4: after the deep model has produced patch scores, obtain the final VR video score through the score fusion strategy. The score fusion strategy used by the invention follows the equirectangular projection pattern of VR videos and assigns different weights to VR video patches at different positions to obtain the final objective quality score. Equirectangular projection significantly stretches the polar regions of the video during projection, which affects the spatial distribution of VR videos in the planar form. Since objective quality evaluation methods take the planar video as input, while subjective scores are based on the perceptual experience of the spherical video, the invention designs the score fusion strategy shown in formula (4):
Sf = Σx,y Wxy Sxy / Σx,y Wxy    (4)
where Sf is the final score, Sxy is the predicted score of the video patch at frame position (x, y), x is the width position and y the height position, and Wxy is the weight of the corresponding position, determined by the ratio of h', the vertical distance from the patch center to the VR video center, to h, the vertical height of the VR video.
Step 5: choose a database. To show that the objective quality scores predicted by the method of the invention are highly consistent with subjective quality scores and can accurately reflect image quality, the method is tested on the VRQ-TJU database. The database contains 13 original VR videos and 364 distorted VR videos; the distortion types are H.264 and JPEG2000, covering both symmetric and asymmetric distortion: 104 symmetrically distorted videos and 260 asymmetrically distorted videos.
Four indices commonly used internationally to assess objective image quality algorithms are adopted to evaluate the performance of the method of the invention: the Pearson linear correlation coefficient (PLCC), the Spearman rank-order correlation coefficient (SRCC), the Kendall rank-order correlation coefficient (KROCC) and the root mean squared error (RMSE). The closer the three correlation coefficients are to 1 and the smaller the RMSE, the more accurate the algorithm.
Step 6: analyze and compare algorithm performance. To verify the specificity and effectiveness of the invention for VR video quality evaluation, the invention is compared on the database against one representative method from each of image quality assessment (IQA), stereoscopic image quality assessment (SIQA), video quality assessment (VQA) and stereoscopic video quality assessment (SVQA), corresponding in turn to [1], [2], [3] and [4] below.
Table 1 Overall data indices
Table 2 Indices of the invention for different distortion types
[1] A. Liu, W. Lin, and M. Narwaria. Image quality assessment based on gradient similarity. IEEE Transactions on Image Processing, 21(4): 1500, 2012.
[2] Alexandre Benoit, Patrick Le Callet, Patrizio Campisi, and Romain Cousseau. Using disparity for quality assessment of stereoscopic images. In IEEE International Conference on Image Processing, pages 389–392, 2008.
[3] Kalpana Seshadrinathan, Rajiv Soundararajan, Alan Conrad Bovik, and Lawrence K. Cormack. Study of subjective and objective quality assessment of video. IEEE Transactions on Image Processing, 19(6): 1427–1441, 2010.
[4] Nukhet Ozbek and A. Murat Tekalp. Unequal inter-view rate allocation using scalable stereo video coding and an objective stereo video quality measure. In IEEE Intern.
Claims (1)
1. A stereoscopic video quality evaluation method based on a 3D CNN, the evaluation method comprising the following steps:
1) video preprocessing: obtaining a VR difference video from the left-view and right-view videos of a VR video, sampling frames uniformly from the difference video, and cutting each frame into non-overlapping blocks, the blocks at the same position across frames constituting one VR video patch, so as to generate enough data for training the 3D CNN;
2) establishing a 3D CNN model: comprising two convolutional layers, two pooling layers and two fully connected layers, the activation function being the rectified linear unit (ReLU), Dropout being used to prevent over-fitting; then adjusting the layer structure and training parameters of the network to achieve a better classification effect;
3) training the 3D CNN model: using stochastic gradient descent, taking the VR video patches as input, each patch being labeled with the quality score of its source video and input into the network in batches, the weights of every layer being fully optimized after many iterations, finally obtaining a convolutional neural network model usable for evaluating virtual reality video quality;
4) obtaining a final result: using the trained 3D CNN to obtain the scores of the VR video patches, then using the score fusion strategy to assign different weights to VR video patches at different positions, and computing the weighted sum as the final objective quality evaluation score of the virtual reality video.
Priority Applications (1)
Application Number | Priority Date | Filing Date | Title |
---|---|---|---|
CN201810240647.XA CN108377387A (en) | 2018-03-22 | 2018-03-22 | Virtual reality method for evaluating video quality based on 3D convolutional neural networks |
Applications Claiming Priority (1)
Application Number | Priority Date | Filing Date | Title |
---|---|---|---|
CN201810240647.XA CN108377387A (en) | 2018-03-22 | 2018-03-22 | Virtual reality method for evaluating video quality based on 3D convolutional neural networks |
Publications (1)
Publication Number | Publication Date |
---|---|
CN108377387A true CN108377387A (en) | 2018-08-07 |
Family
ID=63019046
Family Applications (1)
Application Number | Title | Priority Date | Filing Date |
---|---|---|---|
CN201810240647.XA Pending CN108377387A (en) | 2018-03-22 | 2018-03-22 | Virtual reality method for evaluating video quality based on 3D convolutional neural networks |
Country Status (1)
Country | Link |
---|---|
CN (1) | CN108377387A (en) |
Citations (5)
Publication number | Priority date | Publication date | Assignee | Title |
---|---|---|---|---|
WO2015017796A2 (en) * | 2013-08-02 | 2015-02-05 | Digimarc Corporation | Learning systems and methods |
CN105898279A (en) * | 2016-06-01 | 2016-08-24 | 宁波大学 | Stereoscopic image quality objective evaluation method |
US20170270653A1 (en) * | 2016-03-15 | 2017-09-21 | International Business Machines Corporation | Retinal image quality assessment, error identification and automatic quality correction |
CN107633513A (en) * | 2017-09-18 | 2018-01-26 | 天津大学 | The measure of 3D rendering quality based on deep learning |
CN107766249A (en) * | 2017-10-27 | 2018-03-06 | 广东电网有限责任公司信息中心 | A kind of software quality comprehensive estimation method of Kernel-based methods monitoring |
Cited By (5)
Publication number | Priority date | Publication date | Assignee | Title |
---|---|---|---|---|
CN109615627A (en) * | 2018-12-14 | 2019-04-12 | 国网山东省电力公司信息通信公司 | A kind of power transmission and transformation inspection image quality evaluating method and system |
US11315354B2 (en) | 2018-12-24 | 2022-04-26 | Samsung Electronics Co., Ltd. | Method and apparatus that controls augmented reality (AR) apparatus based on action prediction |
CN109871124A (en) * | 2019-01-25 | 2019-06-11 | 华南理工大学 | Emotion virtual reality scenario appraisal procedure based on deep learning |
CN109871124B (en) * | 2019-01-25 | 2020-10-27 | 华南理工大学 | Emotion virtual reality scene evaluation method based on deep learning |
CN113853796A (en) * | 2019-05-22 | 2021-12-28 | 诺基亚技术有限公司 | Methods, apparatuses and computer program products for volumetric video encoding and decoding |
Similar Documents
Publication | Publication Date | Title |
---|---|---|
CN107437092B (en) | The classification method of retina OCT image based on Three dimensional convolution neural network | |
CN110555434B (en) | Method for detecting visual saliency of three-dimensional image through local contrast and global guidance | |
Zhou et al. | Binocular responses for no-reference 3D image quality assessment | |
CN108377387A (en) | Virtual reality method for evaluating video quality based on 3D convolutional neural networks | |
CN109360178B (en) | Fusion image-based non-reference stereo image quality evaluation method | |
CN108389192A (en) | Stereo-picture Comfort Evaluation method based on convolutional neural networks | |
Fang et al. | Stereoscopic image quality assessment by deep convolutional neural network | |
CN109166144A (en) | A kind of image depth estimation method based on generation confrontation network | |
CN108449595A (en) | Virtual reality method for evaluating video quality is referred to entirely based on convolutional neural networks | |
CN106462771A (en) | 3D image significance detection method | |
CN109831664B (en) | Rapid compressed stereo video quality evaluation method based on deep learning | |
Yang et al. | Predicting stereoscopic image quality via stacked auto-encoders based on stereopsis formation | |
CN108259893B (en) | Virtual reality video quality evaluation method based on double-current convolutional neural network | |
Yue et al. | Blind stereoscopic 3D image quality assessment via analysis of naturalness, structure, and binocular asymmetry | |
CN108235003B (en) | Three-dimensional video quality evaluation method based on 3D convolutional neural network | |
CN109389591A (en) | Color image quality evaluation method based on colored description | |
CN110516716A (en) | Non-reference picture quality appraisement method based on multiple-limb similarity network | |
CN109167996A (en) | It is a kind of based on convolutional neural networks without reference stereo image quality evaluation method | |
CN109523513A (en) | Based on the sparse stereo image quality evaluation method for rebuilding color fusion image | |
CN103780895B (en) | A kind of three-dimensional video quality evaluation method | |
CN108520510B (en) | No-reference stereo image quality evaluation method based on overall and local analysis | |
CN110490252A (en) | A kind of occupancy detection method and system based on deep learning | |
Kim et al. | Binocular fusion net: deep learning visual comfort assessment for stereoscopic 3D | |
CN107371016A (en) | Based on asymmetric distortion without with reference to 3D stereo image quality evaluation methods | |
CN109788275A (en) | Naturality, structure and binocular asymmetry are without reference stereo image quality evaluation method |
Legal Events
Date | Code | Title | Description |
---|---|---|---|
PB01 | Publication | ||
SE01 | Entry into force of request for substantive examination | ||
RJ01 | Rejection of invention patent application after publication |
Application publication date: 20180807 |
|