
US20170289539A1 - Method for Determining a Visual Quality Index of a High Dynamic Range Video Sequence - Google Patents


Info

Publication number
US20170289539A1
Authority
US
United States
Prior art keywords
hdr
sequence
frame
hdr sequence
frames
Prior art date
Legal status (The legal status is an assumption and is not a legal conclusion. Google has not performed a legal analysis and makes no representation as to the accuracy of the status listed.)
Granted
Application number
US15/087,578
Other versions
US9794554B1 (en)
Inventor
Patrick Le Callet
Matthieu Perreira Da Silva
Manish Narwaria
Current Assignee (The listed assignees may be inaccurate. Google has not performed a legal analysis and makes no representation or warranty as to the accuracy of the list.)
Centre National de la Recherche Scientifique CNRS
Universite de Nantes
Original Assignee
Centre National de la Recherche Scientifique CNRS
Universite de Nantes
Priority date (The priority date is an assumption and is not a legal conclusion. Google has not performed a legal analysis and makes no representation as to the accuracy of the date listed.)
Filing date
Publication date
Application filed by Centre National de la Recherche Scientifique CNRS, Universite de Nantes filed Critical Centre National de la Recherche Scientifique CNRS
Priority to US15/087,578
Assigned to CENTRE NATIONAL DE LA RECHERCHE SCIENTIFIQUE - CNRS, UNIVERSITE DE NANTES. Assignment of assignors interest (see document for details). Assignors: NARWARIA, MANISH; DA SILVA, MATTHIEU PERREIRA; LE CALLET, PATRICK
Publication of US20170289539A1
Application granted
Publication of US9794554B1
Assigned to NANTES UNIVERSITE by merger (see document for details). Assignors: UNIVERSITE DE NANTES
Legal status: Active
Adjusted expiration

Classifications

    • H ELECTRICITY
    • H04 ELECTRIC COMMUNICATION TECHNIQUE
    • H04N PICTORIAL COMMUNICATION, e.g. TELEVISION
    • H04N 17/00 Diagnosis, testing or measuring for television systems or their details
    • H04N 17/02 Diagnosis, testing or measuring for television systems or their details for colour television signals
    • G PHYSICS
    • G06 COMPUTING; CALCULATING OR COUNTING
    • G06F ELECTRIC DIGITAL DATA PROCESSING
    • G06F 18/00 Pattern recognition
    • G06F 18/20 Analysing
    • G06F 18/22 Matching criteria, e.g. proximity measures
    • G06K 9/6215
    • G PHYSICS
    • G06 COMPUTING; CALCULATING OR COUNTING
    • G06V IMAGE OR VIDEO RECOGNITION OR UNDERSTANDING
    • G06V 10/00 Arrangements for image or video recognition or understanding
    • G06V 10/98 Detection or correction of errors, e.g. by rescanning the pattern or by human intervention; Evaluation of the quality of the acquired patterns
    • G06V 10/993 Evaluation of the quality of the acquired pattern
    • G PHYSICS
    • G06 COMPUTING; CALCULATING OR COUNTING
    • G06V IMAGE OR VIDEO RECOGNITION OR UNDERSTANDING
    • G06V 20/00 Scenes; Scene-specific elements
    • G06V 20/40 Scenes; Scene-specific elements in video content

Abstract

A method for objectively determining a visual quality index of at least one high dynamic range video sequence, referred to as an HDR sequence, distorted by image processing operations and issued from a reference high dynamic range video sequence, referred to as a reference sequence or a reference HDR sequence. The method is based on signal pre-processing, transformation, and subsequent frequency-based decomposition. Video quality is then computed based on a spatio-temporal analysis that relates to human eye fixation behavior during video viewing. One advantage of this method is that it does not involve expensive computations.

Description

  • The present invention relates generally to the field of High Dynamic Range (HDR) video sequences and more specifically to determining a visual quality index for such HDR video sequences after they have been distorted by image processing operations.
  • BACKGROUND
  • The advent of better technologies in the field of visual signal capture and processing has fueled a paradigm shift in today's multimedia communication systems. As a result, the notion of network-centric quality of service (QoS) in multimedia systems is being extended by relying on the concept of quality of experience (QoE). In this quest to increase the immersive video experience and the overall QoE of the end user, newer technologies such as 3D, ultra-high definition (UHD) and, more recently, high dynamic range (HDR) imaging have gained prominence within the multimedia signal processing community. HDR in particular has attracted attention since it revisits, in a way, how we capture and display natural scenes. This is motivated by the fact that natural scenes often exhibit large ranges of illumination values. However, such high luminance values often exceed the capabilities of traditional low dynamic range (LDR) capture and display devices. Consequently, it is not possible to properly expose the dark and the bright areas simultaneously in one image or one video during capture. This may lead to over-exposure (saturated pixels that are fully white) and/or under-exposure (very dark or noisy pixels as the sensor's response falls below its noise threshold). In both cases, visual information is either lost or altered. HDR imaging focuses on minimizing such losses and therefore aims at improving the quality of the displayed pixels by incorporating higher contrast and luminance.
  • As a result, HDR imaging has attracted attention from both academia and industry, and there has been interest and effort in developing tools/algorithms for HDR video processing. For instance, there have been recent efforts within the Moving Picture Experts Group (MPEG) to extend High Efficiency Video Coding (HEVC) to HDR. Likewise, JPEG has announced extensions that will augment the original JPEG standard with support for HDR image compression. Despite some work on evaluating the quality of HDR images and video sequences, there is an overall lack of efforts to quantify and measure the impact of such tools on HDR video quality using both subjective and objective approaches.
  • It is therefore important to develop objective methods for HDR video quality measurement and benchmark their performance against subjective ground truth.
  • With regard to visual quality measurement, both subjective and objective approaches can be used. The former involves the use of human subjects to judge and rate the quality of the test stimuli. With appropriate laboratory conditions and a sufficiently large subject panel, it remains the most accurate method. The latter employs a computational model to provide estimates of the subjective video quality. While such objective models may not mimic subjective opinions accurately in a general scenario, they can be reasonably effective in specific conditions/applications. Hence, they can be an important tool for automating the testing and standardization of HDR video processing algorithms such as HDR video compression, post-processing, inverse video tone mapping, etc., especially when subjective tests may not be feasible.
  • Therefore, there is a need for a tool for automatically determining a visual quality index of an HDR video sequence that has undergone distortions due to image processing operations such as HDR video compression/decompression, post-processing, and inverse video tone mapping.
  • SUMMARY
  • The present invention relates to a method for determining a visual quality index of at least one high dynamic range video sequence, called HDR sequence, distorted by image processing operations and issued from a reference high dynamic range video sequence, called reference sequence, each of the HDR sequence and the reference sequence comprising Nframe video frames t, with Nframe≧2 and tε[1 . . . Nframe], each video frame t comprising a plurality of pixels organized into rows and columns and each pixel having at least a luminance value, said method comprising the steps of:
  • applying a transformation to the video frames t of the HDR sequence and the reference sequence in order to obtain video frames t of the HDR sequence and the reference sequence in a perceived luminance domain, the transformed luminance values of the video frames t in the perceived luminance domain being substantially linear to the luminance values perceived by the human visual system for the HDR sequence and the reference sequence,
  • computing, for each couple of frames t of the HDR sequence and the reference sequence in the perceived luminance domain, Nscale×Norient similarity frames Simt,s,o representative of a perceptual similarity between the frame t of the HDR sequence and the frame t of the reference sequence at different spatial scales s and different spatial orientations o, with sε[1 . . . Nscale] and oε[1 . . . Norient], a similarity value being associated with each pixel of the similarity frame Simt,s,o,
  • computing, for each couple of frames t of the HDR sequence and the reference sequence, a global similarity frame Simt based on the computed similarity frames Simt,s,o at the different spatial scales s and the different spatial orientations o,
  • pooling, for each group of q consecutive global similarity frames Simt, with q≧2, and for each one of a plurality of spatio-temporal tubes within said group of q consecutive global similarity frames Simt, the similarity values of the pixels included in said spatio-temporal tube in order to generate a short term error value for each of said spatio-temporal tubes, the short term error values of the spatio-temporal tubes being included into an error map,
  • pooling at least a portion of the short term error values of each error map in order to generate a short term quality score for each group of q consecutive global similarity frames Simt, and
  • computing the visual quality index of the HDR sequence based on said short term quality scores.
  • According to the invention, the visual quality index is computed based on HDR signal transformation and subsequent analysis of spatio-temporal segments or tubes of the HDR sequence to be qualified and the reference sequence from which the HDR sequence is issued.
  • According to a particular embodiment, said portion of the short term error values of each error map comprises the m lowest short term error values of the error map, with m being an integer value lower than the total number of short term error values in the error map.
  • In a particular embodiment, the number m is a predetermined percentage of the total number of short term error values in the error map.
  • In a particular embodiment, said predetermined percentage is comprised between 5% and 50%.
  • In a particular embodiment, the method of the invention further comprises a preliminary step, before transforming the HDR sequence and the reference sequence into the perceived luminance domain, said preliminary step consisting in transforming the luminance values of the HDR sequence and the reference sequence into emitted luminance values, said emitted luminance values depending on at least luminance characteristics of the display device used to display the video sequences.
  • In a particular embodiment, the similarity frame Simt,s,o associated to the frames t of the HDR sequence and the reference sequence for a spatial scale s and a spatial orientation o is computed by the steps of:
  • applying a log-Gabor filter to the frame t of the HDR sequence and the frame t of the reference sequence in the perceived luminance domain at the spatial scale s and the spatial orientation o,
  • computing an inverse Fourier Transform of the product of the results of the log-Gabor filter for the frame t of the HDR sequence and the frame t of the reference sequence in order to generate a subband frame lt,s,o (HDR) for the frame t of the HDR sequence and a subband frame lt,s,o (REF) for the frame t of the reference sequence, and
  • computing the similarity frame Simt,s,o based on the subband frames lt,s,o (HDR) and lt,s,o (REF).
  • In a particular embodiment, the global similarity frame Simt is defined by the formula:
  • Sim_t = \frac{1}{N_{scale} \times N_{orient}} \sum_{s=1}^{N_{scale}} \sum_{o=1}^{N_{orient}} Sim_{t,s,o}
  • In a particular embodiment, the spatio-temporal tubes are non-overlapping spatio-temporal tubes.
  • In a particular embodiment, the video frames t of the HDR sequence and the reference sequence in the perceived luminance domain are generated by applying a perceptually uniform encoding to the video frames t of the HDR sequence and the reference sequence or, when appropriate, to the video frames t of the HDR sequence and the reference sequence issued from said preliminary step (transformation into emitted luminance values).
  • BRIEF DESCRIPTION OF THE DRAWINGS
  • The invention can be better understood with reference to the following description and drawings, given by way of example and not limiting the scope of protection, and in which:
  • FIG. 1 is a flow chart of the successive steps implemented when performing a method for determining the visual quality index of a HDR sequence according to an embodiment of the invention;
  • FIG. 2a and FIG. 2b are response curves of luminance values for a logarithmic transform and for a perceptually uniform encoding in two different ranges of luminance; and
  • FIG. 3 is a flow chart describing in detail the final steps of the flow chart of FIG. 1.
  • The components in the figures are not necessarily to scale, emphasis instead being placed upon illustrating the principles of the invention.
  • DETAILED DESCRIPTION
  • Preliminary information on the human visual system (HVS) and high dynamic range (HDR) video is given in order to properly understand the invention and its context. Humans perceive the outside visual world through the interaction between luminance (measured in candela per square meter, cd/m^2) and the eyes. Luminance first passes through the cornea. Then it enters the pupil, an aperture that is modified by the iris, a muscular diaphragm. Subsequently, light is refracted by the lens and hits the photoreceptors in the retina. There are two types of photoreceptors: cones and rods. The cones are located mostly in the fovea. They are more sensitive at luminance levels from 10^−2 cd/m^2 to 10^8 cd/m^2 (referred to as photopic or daylight vision). Furthermore, color vision is due to three types of cones: short, middle and long wavelength cones. The rods, on the other hand, are sensitive at luminance levels from 10^−6 cd/m^2 to 10 cd/m^2 (scotopic or night vision). The rods are more sensitive than cones but do not provide color vision.
  • Pertaining to the luminance levels found in the real world, direct sunlight at noon can be of the order of 10^7 cd/m^2 or more, while a starlit night is in the range of 10^−1 cd/m^2. This corresponds to more than 8 orders of magnitude. With regard to human eyes, their dynamic range depends on the time allowed to adjust or adapt to the given luminance levels. Due to the presence of rods and cones, human eyes have a remarkable ability to adjust to varying luminance levels, both dynamically (i.e. instantaneously) and over a period of time (i.e. with adaptation time). Given sufficient adaptation time, the dynamic range of human eyes is about 13 orders of magnitude. However, without adaptation, the instantaneous human vision range is smaller: the eyes dynamically adjust so that a person can see about 5 orders of magnitude at any point within the entire range.
  • Since the typical temporal frequency of video signals does not allow sufficient adaptation time, the dynamic vision range (5 orders of magnitude) is the more relevant one in the context of the present invention as well as in HDR video processing in general. However, typical digital imaging sensors (assuming the typical single-exposure setting) and LDR displays are not capable of dealing with such a large dynamic range as is present in the real world, and most of them (both capturing sensors and displays) can handle up to 3 orders of magnitude. Due to this limitation, scenes captured and viewed via LDR technologies will have lower contrast (visual details are either saturated or noisy) and a smaller color gamut than what the eyes can perceive. This in turn can decrease the immersive experience of the end-user.
  • HDR imaging technologies have therefore been developed to overcome the inadequacies of LDR capture and display technologies via better video signal capture, representation and display, so that the dynamic range of the video can better match the instantaneous range of the eye. In particular, the major distinguishing factor of HDR imaging (in comparison to traditional LDR imaging) is its focus on capturing and displaying scenes as natively (i.e. how they appear in the real world) as possible by considering the physical luminance of the scene in question. Two important points should, however, be mentioned at the very outset. First, it must be emphasized that in HDR imaging one usually deals with proportional (and not absolute) luminance values. More specifically, unless there is a prior and accurate camera calibration, luminance values in an HDR video file represent the real-world luminance up to an unknown scale. This, nonetheless, is sufficient for most purposes. Secondly, the HDR displays currently available cannot display luminance beyond a specified limit, given hardware limitations. This necessitates a pre-processing step for both subjective and objective HDR video quality measurement, as elaborated further in step S0. Despite these two caveats, HDR imaging can improve the viewer experience significantly as compared to LDR imaging. The present invention therefore seeks to address the issue of objective video quality measurement for HDR video.
  • FIG. 1 represents a block diagram describing the steps of the method according to a preferred embodiment of the invention. It takes as input the distorted HDR sequence to be analyzed, denoted HDR, and the reference HDR sequence, denoted REF, from which the HDR sequence is issued. The distortions of the sequence HDR can be the result of video or image processing operations applied to the original sequence REF, such as video compression, post-processing or inverse video tone mapping.
  • As illustrated by FIG. 1, the method comprises the following steps:
  • step S0: transformation of the native input luminance values of the sequences HDR and REF into emitted luminance values;
  • step S1: transformation of the emitted luminance values of the sequences HDR and REF into perceived luminance values;
  • Step S2: computation of a similarity map Simt for each couple of frames t of the sequences HDR and REF representative of the perceptual similarity between the frame t of the sequence HDR and the frame t of the sequence REF;
  • Step S3: short term temporal pooling on the similarity maps Simt;
  • Step S4: spatial pooling; and
  • Step S5: Long-term temporal pooling.
  • The steps S0-S5 are described in detail in the following paragraphs.
  • Step S0
  • Two observations with regard to HDR video signal representation should first be mentioned. First, native HDR signal values are in general only proportional to the actual scene luminance and not equal to it. Therefore, the exact scene luminance at each pixel location is generally unknown. Second, since the maximum luminance values of real-world scenes can be vastly different, the concept of a fixed maximum (or white point) does not exist for HDR values. In view of these two observations, HDR video signals must be interpreted based on the display device. Thus, their values should advantageously be recalibrated according to the characteristics of the HDR display device used to view them. This is unlike the case of LDR video, where the format is more standardized: e.g. for an 8-bit representation, the maximum value is 255, which is mapped to the peak display luminance, typically not exceeding 500 cd/m^2. With regard to HDR display devices, inherent hardware limitations impose a limit on the maximum luminance that can be displayed.
  • Thus, a pre-processing of the HDR video signal is advantageously required so that the pre-defined maximum luminance point is not exceeded. Specifically, unlike the LDR domain, HDR videos are generally viewed on HDR display devices that may have different peak luminances and/or contrast ratios. Thus, artifact visibility for the same HDR video can be different depending on the display device used, e.g. there are different levels of saturation according to the peak luminance.
  • This step S0 can be skipped if the HDR data are already display adapted. This step is therefore optional.
  • Different strategies, from simple ones like linear scaling to more sophisticated ones, can be adopted for this pre-processing step. This pre-processing consists, for example, in rescaling the luminance values with respect to the maximum displayable luminance of the display device used for the HDR sequences. This maximum displayable luminance is equal to 4000 cd/m^2 for a SIM2 Solar47 HDR display device.
  • In a variant, a normalization operation is applied to the native luminance values. A normalization factor is determined as the maximum, over all the frames in the sequence HDR, of the mean of the top 5% native luminance values. Specifically, a vector MT5 whose elements are the mean of the top 5% luminance values in each frame of the sequence HDR is computed, that is
  • MT_5 = \left\{ \frac{1}{|T_5|} \sum_{v \in T_5} N_{v,t} \right\}_{t=1,2,\ldots,N_{frame}}
  • where Nv,t denotes the native luminance value at spatial location v for the frame t, Nframe is the total number of frames of the sequence HDR, and T5 denotes the set of the highest 5% luminance values in the frame, with |T5| its cardinality.
  • Then, the native luminance values N are converted to emitted luminance values E as
  • E = \frac{N \times 179}{\max(MT_5)} \quad (1)
  • where the multiplication factor of 179 is the luminous efficacy of equal-energy white light, as defined and used by the Radiance file format (RGBE) for the conversion to actual luminance values. Then, a clipping function is applied to limit the E values to the range defined by the black point (lowest displayable luminance) and the maximum displayable luminance, both depending on the display characteristics.
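  • A minimal sketch of this pre-processing in Python/NumPy is given below, assuming the native luminance values arrive as one 2-D float array per frame. The function name and the default display limits are illustrative assumptions rather than values fixed by the method (the 4000 cd/m^2 peak corresponds to the SIM2 display example above).

```python
import numpy as np

def to_emitted_luminance(frames, black_point=0.03, peak=4000.0):
    """Sketch of step S0: native HDR luminance -> emitted luminance.

    frames: iterable of 2-D arrays of native (relative) luminance values.
    black_point, peak: assumed display limits in cd/m^2.
    """
    # Vector MT5: mean of the top 5% native luminance values of each frame.
    mt5 = []
    for f in frames:
        flat = np.sort(f.ravel())
        top5 = flat[int(0.95 * flat.size):]       # set T5 for this frame
        mt5.append(top5.mean())
    # Relation (1): scale by the luminous efficacy constant 179 (RGBE),
    # normalized by max(MT5), then clip to the displayable range.
    scale = 179.0 / max(mt5)
    return [np.clip(f * scale, black_point, peak) for f in frames]
```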
  • Step S1
  • The step S1 is the transformation of the emitted luminance values of the sequences HDR and REF, denoted EHDR and EREF respectively, into perceived luminance values denoted PHDR and PREF respectively. This step is required since there exists a non-linear relationship between the perceived and emitted luminance values, given the response of the human visual system to different luminance levels. An implication of such non-linearity is that the changes introduced in the emitted luminance by an HDR video processing algorithm may not have a direct correspondence to the actual modification of visual quality. This is different from the case of LDR representation, in which the pixel values are typically gamma encoded. Thus, LDR video encodes information that is non-linearly related to the scene luminance (the non-linearity arising from the gamma curve). As a result of such non-linear representation, the changes in LDR pixel values can be approximately linearly related to the actual change perceived by the HVS. Due to this, many LDR image/video quality measurement methods directly employ the said gamma-encoded pixel values as input and assume that changes in LDR pixels (or changes in features extracted from those pixels) due to distortion can quantify quality degradation (the reference video is always assumed to be of perfect quality). Therefore, to achieve functionality similar to the LDR domain, the said non-linearity of the HVS response to the emitted luminance should be taken into account for objective HDR video quality evaluation. In this way, the input values to the objective HDR video quality estimator would be expected to be approximately linearly related to the changes induced by distortions.
  • According to Weber's law, a small increment of luminance at a low luminance level is perceived as larger than the same increment at a higher luminance level. Therefore, two transformations can be used:
  • the logarithmic transformation, or
  • the Perceptually Uniform (PU) encoding as disclosed in “Extending quality metrics to full luminance range images” T. Aydin, R. Mantiuk, H. Seidel, Proceedings of the SPIE, vol. 6806, 2008, pp. 68060B-68060B-10.
  • These two transformations can be used to transform the emitted luminance values in the range from 10^−5 to 10^8 cd/m^2 into approximately perceptually uniform code values. These two transformations are plotted in FIG. 2a and FIG. 2b. FIG. 2a shows the response of these two transformations to input luminance values in the range from 1 to 200 cd/m^2, and FIG. 2b shows their response to input luminance values in the range from 200 to 10 000 cd/m^2.
  • From FIG. 2a, it can be noticed that the response of the PU encoding is relatively more linear at lower luminance compared to the logarithmic one.
  • To further quantify this, it has been found that the linear correlation between the original and transformed signals was 0.9334 for PU encoding and 0.9071 for the logarithmic transformation, for the range between 1 and 200 cd/m^2. On the other hand, both the PU and logarithmic curves have a similar response for higher luminance values (above 1000 cd/m^2), as shown in FIG. 2b. In this case, the linear correlations were 0.8703 and 0.8763 for the PU and logarithmic transformations respectively. Thus, PU encoding better approximates the response of the HVS, which is approximately linear at lower luminance and increasingly logarithmic for higher luminance values. Due to this, PU encoding is expected to better model the underlying non-linear relationship between the HVS response and the emitted luminance.
  • Therefore, in a preferred embodiment, the step S1 is performed by applying a PU encoding. PU encoding is, for example, implemented as a look-up-table operation so as not to substantially increase the computational overhead.
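  • The look-up-table implementation can be sketched as follows. The PU curve itself is tabulated in the Aydin et al. reference; the table here is a hypothetical monotone stand-in used only to show the interpolation mechanics.

```python
import numpy as np

# Hypothetical PU table; the real curve is the fit published by Aydin et al.
lum_table = np.logspace(-5, 8, 256)      # emitted luminance, 1e-5..1e8 cd/m^2
pu_table = np.linspace(0.0, 1.0, 256)    # placeholder PU code values

def pu_encode(emitted):
    """Sketch of step S1: PU encoding as a look-up-table interpolation."""
    logl = np.log10(np.clip(emitted, lum_table[0], lum_table[-1]))
    return np.interp(logl, np.log10(lum_table), pu_table)
```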
  • Step S2
  • According to the invention, a spatio-temporal comparison of segments of the sequences HDR and REF is performed in order to generate similarity maps for each couple of frames t of the sequences HDR and REF. First, subband signals are generated by applying log-Gabor filters to the luminance values PHDR and PREF. Such log-Gabor filters are introduced, for example, in "Relations between the statistics of natural images and the response properties of cortical cells", D. Field, J. Opt. Soc. Am. A 4, December 1987, 2379-2394. Subband signals are calculated at different spatial scales and spatial orientations.
  • Log-Gabor filters are widely used in image analysis and are used here to compare intrinsic characteristics of natural scenes. In our approach, the log-Gabor filters are used in the frequency domain and can be defined in polar coordinates by h(f,θ)=Hf×Hθ with Hf and Hθ being the radial and angular components, respectively:
  • H_{s,o}(f,\theta) = \exp\left(-\frac{\log(f/f_s)^2}{2\,\log(\sigma_s/f_s)^2}\right) \times \exp\left(-\frac{(\theta-\theta_0)^2}{2\sigma_0^2}\right) \quad (2)
  • where Hs,o is the filter denoted by spatial scale index s and orientation index o, fs is the normalized center frequency of the scale, θ is the orientation, σs defines the radial bandwidth B in octaves with B = 2√(2/log(2))·|log(σs/fs)|, θ0 represents the center orientation of the filter, and σ0 defines the angular bandwidth ΔΩ = 2σ0√(2/log(2)).
  • Video frames PHDR and PREF in the perceived luminance domain are decomposed into a set of subbands by computing the inverse DFT (Discrete Fourier Transform) of the product of each frame's DFT with the frequency-domain filter defined in relation (2).
  • The resulting subband values for the video frames PHDR and PREF are denoted lt,s,o (HDR) and lt,s,o (REF) respectively. Here, s=1, 2, . . . , Nscale, o=1, 2, . . . , Norient and t=1, 2, . . . , Nframe, wherein Nscale is the total number of scales, Norient is the total number of orientations and Nframe is the total number of frames in the sequences HDR and REF.
  • Second, a similarity map between the subband values lt,s,o (HDR) and lt,s,o (REF) is computed for each couple of frames t of the sequences HDR and REF at each spatial scale s and each orientation o.
  • The similarity map for a frame t at a scale s and an orientation o is computed as follows:
  • Sim_{t,s,o} = \frac{2 \cdot l_{t,s,o}(HDR) \cdot l_{t,s,o}(REF) + k}{\{l_{t,s,o}(HDR)\}^2 + \{l_{t,s,o}(REF)\}^2 + k} \quad (3)
  • wherein k is a small constant added to avoid division by zero. The similarity map comprises as many pixels as the frames t of the sequences HDR and REF.
  • Each pixel or point of the similarity map is related to a specific pixel P of the frames t of the sequences HDR and REF. The value of this point is representative of a similarity level between the pixel P of the frame t of the sequence HDR and the pixel P of the frame t of the sequence REF.
  • A global similarity map Simt for the frame t can then be computed by pooling across spatial scales and orientations. Different methods can be used for computing the global similarity map, such as those based on a contrast sensitivity function (CSF), but a possible bottleneck is that of computing the desired CSF accurately, especially one applicable to both near-threshold and supra-threshold distortions. Thus, according to a preferred embodiment, the global similarity map Simt is computed as follows:
  • Sim_t = \frac{1}{N_{scale} \times N_{orient}} \sum_{s=1}^{N_{scale}} \sum_{o=1}^{N_{orient}} Sim_{t,s,o} \quad (4)
  • The global similarity map Simt is representative of the similarity level between the frame t of the sequence HDR and the frame t of the sequence REF. The similarity maps for the whole video sequence can be represented as the set {Simt} for t=1, 2, . . . , Nframe.
  • These similarity maps help to quantify the effect of local distortions by assessing their impact across frequency and orientation. This effect can then be exploited via a spatio-temporal analysis in order to calculate a short term quality value in a spatially and temporally localized neighborhood, and subsequently obtain an overall HDR video quality score, as described in the following steps.
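  • The sketch below ties relations (2) to (4) together for one couple of PU-encoded frames. The scale spacing, bandwidth ratios and the constant k are illustrative assumptions; the patent text does not fix these values.

```python
import numpy as np

def log_gabor(shape, fs, theta0, sigma_ratio=0.55, sigma0=np.pi / 8):
    """Frequency-domain log-Gabor filter H_{s,o}(f, theta) of relation (2)."""
    rows, cols = shape
    fy = np.fft.fftfreq(rows)[:, None]
    fx = np.fft.fftfreq(cols)[None, :]
    f = np.hypot(fx, fy)
    f[0, 0] = 1.0                                 # avoid log(0) at DC
    radial = np.exp(-np.log(f / fs) ** 2 / (2 * np.log(sigma_ratio) ** 2))
    radial[0, 0] = 0.0                            # zero DC response
    dtheta = np.angle(np.exp(1j * (np.arctan2(fy, fx) - theta0)))
    angular = np.exp(-dtheta ** 2 / (2 * sigma0 ** 2))
    return radial * angular

def global_similarity(p_hdr, p_ref, n_scale=5, n_orient=4, k=1e-3):
    """Relations (3)-(4): global similarity map Sim_t for one frame pair."""
    F_hdr, F_ref = np.fft.fft2(p_hdr), np.fft.fft2(p_ref)
    sim = np.zeros(p_hdr.shape)
    for s in range(n_scale):
        fs = 0.25 / 2 ** s                        # assumed scale spacing
        for o in range(n_orient):                 # orientations 45 deg apart
            H = log_gabor(p_hdr.shape, fs, o * np.pi / n_orient)
            l_hdr = np.real(np.fft.ifft2(F_hdr * H))  # subband l_{t,s,o}(HDR)
            l_ref = np.real(np.fft.ifft2(F_ref * H))  # subband l_{t,s,o}(REF)
            sim += (2 * l_hdr * l_ref + k) / (l_hdr**2 + l_ref**2 + k)
    return sim / (n_scale * n_orient)
```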
  • Step S3
  • Video signals propagate information along both spatial and temporal dimensions. However, due to the visual acuity limitations of the eye, humans fixate their attention on local regions when viewing a video, because only a small area of the retina, generally referred to as the fovea, has high visual acuity. This is due to the higher density of cone photoreceptor cells in the fovea. Consequently, human eyes have to rapidly shift their gaze (the time between such movements is the fixation duration) to bring localized regions of the visual signal into the foveal field. Thus, humans tend to judge video quality in a local context, both spatially and temporally, and determine the overall video quality based on those assessments. In other words, the impact of distortions introduced in video frames is not limited just to the spatial dimension but rather manifests spatio-temporally.
  • Therefore, a possible strategy for objective video quality measurement is by analyzing the video sequence in a spatio-temporal (ST) dimension, so that the impact of distortions can be localized along both spatial and temporal axes.
  • The next steps will be described with reference to FIG. 1 and FIG. 3. Thus, according to the invention, the similarity maps {Simt} are each divided into short-term ST (for Spatio-Temporal) tubes defined by a 3-dimensional region with x horizontal, y vertical and z temporal data points, i.e. a cuboid with dimensions x×y×z, as illustrated in FIG. 3. The axes x and y define the spatial axes while the axis z defines the temporal axis. The values of x and y together define the area of the fixated region. Therefore, these can be computed by taking into account the viewing distance, the central angle of the visual field in the fovea and the display resolution. On the other hand, a good range of z can be determined by considering the average fixation duration when viewing a video sequence. While this can vary due to content and/or distortions, studies related to the analysis of eye movements during video viewing indicate that values in the range of 300-500 ms (8-12 frames) are a reasonable choice.
  • In a first step S3, a short term temporal pooling is performed. The aim of this step is to pool or fuse the data in local spatio-temporal neighborhoods and, more specifically, the data present in ST tubes. In the embodiment illustrated by FIG. 3, the ST tubes are non-overlapping. In a variant, they could be partially overlapping.
  • Keeping in mind that the goal is to characterize the effect of spatial distortions over a short term duration equal to the fixation time (300-500 ms), a standard deviation of the similarity values is computed for each ST tube.
  • Consequently, a short term error value is computed for each ST tube of a group of q consecutive similarity maps Simt, q being for example equal to 10 (the standard deviation of the similarity values is then computed over 10 consecutive frames of the video sequences HDR and REF).
  • The determination of the values of x, y and z can be performed as follows. It is assumed that the central angle of the visual field in the fovea is 2°. Then, a quantity W representing the length of the fixated window in terms of number of pixels can be computed as

  • W = \tan(2°) \times V \times \sqrt{R/D_A} \quad (5)
  • where V is the viewing distance in cm, R is the display resolution and DA is the display area. In an example, V=178 cm, R=1080×1920 pixels and DA=6100 cm^2. Plugging these values into relation (5) gives W≈115. To reduce the computational effort, the method may be run on downsampled (by a factor of 2) video frames, and hence the approximate length of the fixated window is W/2≈58. Thus, the values x and y are set to 64 pixels, the nearest standard block size. To determine z, a fixation duration of 400 ms is assumed and, with a frame rate of 25 frames per second, z=10 frames. The numbers of scales s and orientations o are for example 5 and 4, respectively, i.e. Nscale=5 and Norient=4. The orientations are equally spaced by 45°.
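  • The arithmetic of relation (5) with these example values can be checked directly:

```python
import math

V = 178.0                 # viewing distance in cm
R = 1080 * 1920           # display resolution in pixels
D_A = 6100.0              # display area in cm^2

W = math.tan(math.radians(2.0)) * V * math.sqrt(R / D_A)
print(round(W))           # ~115; halved after 2x downsampling -> ~58,
                          # rounded up to the standard block size 64
```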
  • The short term error values are grouped into a 2D error map, denoted ST_{v,t_s}, where v represents the spatial coordinates and t_s (=1, 2, . . . , N_frame/z) is the index of the resulting spatio-temporal frames. The error map comprises one point per ST tube. By this definition, a video sequence with lower visual quality will have higher localized standard deviation values in the error map ST_{v,t_s}, and these values decrease as the signal quality improves.
  • Thus, the maps ST_{v,t_s} help to quantify the signal coherence level in local neighborhoods.
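  • As an illustrative sketch of step S3 (assuming NumPy; the array layout and function name are assumptions, not taken from the patent), the error map can be obtained by computing one standard deviation per non-overlapping x×y×z tube:

```python
import numpy as np

def short_term_error_map(sim_maps, x=64, y=64):
    """Pool one group of z consecutive global similarity maps (shape z x H x W)
    into a 2D error map ST_{v,ts}: one similarity standard deviation per tube."""
    z, h, w = sim_maps.shape
    n_rows, n_cols = h // y, w // x
    error_map = np.empty((n_rows, n_cols))
    for i in range(n_rows):
        for j in range(n_cols):
            tube = sim_maps[:, i * y:(i + 1) * y, j * x:(j + 1) * x]
            error_map[i, j] = tube.std()    # higher value -> lower local quality
    return error_map

# One error map per group of z = 10 similarity maps (sims is a hypothetical stack):
# error_maps = [short_term_error_map(sims[k:k + 10]) for k in range(0, len(sims), 10)]
```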
  • Steps S5 and S6
  • The next steps S5 and S6 perform spatial and long term temporal pooling to obtain an overall video quality score HDR_VQM for the whole sequence HDR. The score HDR_VQM is the visual quality index.
  • To obtain an overall video quality score that can quantify the level of annoyance in the video sequence, the local errors present in the error maps ST_{v,t_s} are pooled further in two stages:
  • (a) a spatial pooling is performed in a step S5 to generate a time series of short term quality scores, and
  • (b) a long term temporal pooling is performed in a step S6 to fuse short term quality scores into a single number denoting the overall annoyance level.
  • These steps are based on the premise that humans evaluate overall video quality through continuous assessment of the impact of the short term errors or annoyances they come across while viewing the video sequence. Therefore, a spatial pooling S5 is performed on the error maps ST_{v,t_s} in order to obtain short-term quality scores, as illustrated in FIG. 3.
  • Then, a long term pooling S6 is applied to compute the overall video quality score. The following equation implements both steps S5 and S6:
  • HDR_VQM = (1 / (|L_p^{t_s}| × |L_p^{v}|)) × Σ_{t_s∈L_p} Σ_{v∈L_p} ST_{v,t_s}  (6)
  • where L_p denotes the set with the lowest p% values (i.e. the m lowest values) and |·| stands for the cardinality of the set. Both the short term spatial pooling S5 and the long term temporal pooling S6 are preferably performed over the lowest p% values. This is because the HVS does not necessarily process visual data in its entirety and makes certain choices to minimize the amount of data to be analyzed. It is, of course, non-trivial to realize and integrate such exact HVS mechanisms into an objective method.
  • The pooling factor p is for example set to 5%, but it may range between 5% and 50% without introducing significant changes in the results.
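  • A possible implementation of the two pooling stages of relation (6) is sketched below (assuming NumPy and the error maps produced in step S3; the two lowest-p% selections are applied sequentially, which matches the double sum when the set sizes are fixed):

```python
import numpy as np

def lowest_p_mean(values, p=0.05):
    """Mean over the set L_p of the m lowest values, with m = ceil(p * N)."""
    values = np.sort(np.ravel(values))
    m = max(1, int(np.ceil(p * values.size)))
    return values[:m].mean()

def hdr_vqm(error_maps, p=0.05):
    """Step S5: spatial pooling of each error map ST_{v,ts} into a short term
    quality score; step S6: long term temporal pooling of those scores."""
    short_term_scores = [lowest_p_mean(st, p) for st in error_maps]
    return lowest_p_mean(np.array(short_term_scores), p)
```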
  • The results of this method have been compared to quality measurements made by 25 observers (subjective quality measurements). The method of the invention has shown good results.
  • While example embodiments are capable of various modifications and alternative forms, embodiments thereof are shown by way of example in the drawings and are herein described in detail. It should be understood, however, that there is no intent to limit example embodiments to the particular forms disclosed; on the contrary, example embodiments are to cover all modifications, equivalents, and alternatives falling within the scope of the claims. Like numbers refer to like elements throughout the description of the figures.
  • The method according to the invention, which is illustrated by the flow charts of FIGS. 1 and 3, may be implemented by hardware, software, firmware, middleware, microcode, hardware description languages, or any combination thereof. When implemented in software, firmware, middleware or microcode, the program code or code segments to perform the necessary tasks may be stored in a machine- or computer-readable medium such as a storage medium. One or more processors may perform the necessary tasks. Specific structural and functional details disclosed herein are merely representative for purposes of describing example embodiments of the present invention. This invention may, however, be embodied in many alternate forms and should not be construed as limited to only the embodiments set forth herein.

Claims (9)

1. A method for determining a visual quality index of at least one high dynamic range video sequence (HDR sequence), distorted by image processing operations and issued from a reference high dynamic range video sequence, each of the HDR sequence and the reference HDR sequence comprising N_frame video frames t, with N_frame≧2 and t∈[1 . . . N_frame], each video frame t comprising a plurality of pixels organized into rows and columns and each pixel having at least a luminance value, the method comprising:
applying a transformation to the video frames t of the HDR sequence and the reference HDR sequence in order to obtain video frames t of the HDR sequence and the reference HDR sequence in a perceived luminance domain, the transformed luminance values of the video frames t in the perceived luminance domain being substantially linear to the luminance values perceived by the human visual system for the HDR sequence and the reference HDR sequence,
computing, for each couple of frames t of the HDR sequence and the reference HDR sequence in the perceived luminance domain, N_scale×N_orient similarity frames Sim_{t,s,o} representative of a perceptual similarity between the frame t of the HDR sequence and the frame t of the reference HDR sequence at different spatial scales s and different spatial orientations o, with s∈[1 . . . N_scale] and o∈[1 . . . N_orient], a similarity value being associated to each pixel of the similarity frame Sim_{t,s,o},
computing, for each couple of frames t of the HDR sequence and the reference HDR sequence, a global similarity frame Sim_t based on the computed similarity frames Sim_{t,s,o} at the different spatial scales s and the different spatial orientations o,
pooling, for each group of q consecutive global similarity frames Sim_t, with q≧2, and for each one of a plurality of spatio-temporal tubes within said group of q consecutive global similarity frames Sim_t, the similarity values of the pixels included in said spatio-temporal tubes in order to generate a short term error value for each of said spatio-temporal tubes, the short term error values of each spatio-temporal tube being included in an error map,
pooling at least a portion of the short term error values of each error map in order to generate a short term quality score for each group of q consecutive global similarity frames Sim_t, and
computing the visual quality index of the HDR sequence based on said short term quality scores.
2. The method of claim 1, wherein the portion of the short term error values of each error map comprises the m lowest short term error values of the error map, with m being an integer value lower than the total number of short term error values in the error map.
3. The method of claim 2, wherein the number m is a predetermined percentage of the total number of short term error values in the error map.
4. The method of claim 3, wherein the predetermined percentage is between about 5% and about 50%.
5. The method of claim 1, wherein the method further comprises a preliminary step, before transforming the HDR sequence and the reference HDR sequence into the perceived luminance domain, the preliminary step including transforming the luminance values of the HDR sequence and the reference HDR sequence into emitted luminance values.
6. The method of claim 1, wherein the similarity frame Sim_{t,s,o} associated to the frames t of the HDR sequence and the reference HDR sequence for a spatial scale s and a spatial orientation o is computed by the steps of:
applying a log-Gabor filter to the frame t of the HDR sequence and the frame t of the reference HDR sequence in the perceived luminance domain at the spatial scale s and the spatial orientation o,
computing an inverse Fourier Transform of the product of the results of the log-Gabor filter for the frame t of the HDR sequence and the frame t of the reference HDR sequence in order to generate a subband frame l_{t,s,o}^(HDR) for the frame t of the HDR sequence and a subband frame l_{t,s,o}^(REF) for the frame t of the reference HDR sequence, and
computing the similarity frame Sim_{t,s,o} based on the subband frames l_{t,s,o}^(HDR) and l_{t,s,o}^(REF).
7. The method of claim 1, wherein the global similarity frame Sim_t is defined by the formula:
Sim_t = (1 / (N_scale × N_orient)) Σ_{s=1}^{N_scale} Σ_{o=1}^{N_orient} Sim_{t,s,o}
8. The method of claim 1, wherein the spatio-temporal tubes are non-overlapping spatio-temporal tubes.
9. The method of claim 1, wherein the video frames t of the HDR sequence and the reference HDR sequence in the perceived luminance domain are generated by applying a perceptually uniform encoding to the video frames t of the HDR sequence and the reference HDR sequence or, when appropriate, to the video frames t of the HDR sequence and the reference sequence issued from said preliminary step.
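For illustration only, and not as part of the claims: a minimal sketch of the filtering and similarity steps recited in claim 6, assuming NumPy. The log-Gabor parameterization (min_wavelength, mult, sigma_f, sigma_theta) and the SSIM-like similarity formula with stabilizing constant c are assumptions made for the example; the patent does not fix them.

```python
import numpy as np

def log_gabor(shape, scale, orient, n_orients=4,
              min_wavelength=3, mult=2.0, sigma_f=0.55, sigma_theta=0.6):
    """Frequency-domain log-Gabor filter at one scale s and orientation o.
    Built on the unshifted FFT grid so it can multiply fft2 output directly."""
    rows, cols = shape
    fy = np.fft.fftfreq(rows)[:, None]
    fx = np.fft.fftfreq(cols)[None, :]
    f = np.hypot(fx, fy)
    f[0, 0] = 1.0                                   # avoid log(0) at DC
    theta = np.arctan2(fy, fx)
    f0 = 1.0 / (min_wavelength * mult ** scale)     # center frequency of this scale
    radial = np.exp(-(np.log(f / f0) ** 2) / (2 * np.log(sigma_f) ** 2))
    radial[0, 0] = 0.0                              # zero DC response
    angle0 = orient * np.pi / n_orients             # spaced by 45° when n_orients = 4
    dtheta = np.arctan2(np.sin(theta - angle0), np.cos(theta - angle0))
    angular = np.exp(-(dtheta ** 2) / (2 * sigma_theta ** 2))
    return radial * angular

def subband(frame, g):
    """Subband magnitude l_{t,s,o}: inverse FFT of the filtered spectrum."""
    return np.abs(np.fft.ifft2(np.fft.fft2(frame) * g))

def similarity_frame(l_hdr, l_ref, c=1e-3):
    """Pixel-wise similarity between the two subbands (assumed SSIM-like form)."""
    return (2 * l_hdr * l_ref + c) / (l_hdr ** 2 + l_ref ** 2 + c)
```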
US15/087,578 2016-03-31 2016-03-31 Method for determining a visual quality index of a high dynamic range video sequence Active 2036-06-01 US9794554B1 (en)

Priority Applications (1)

Application Number Priority Date Filing Date Title
US15/087,578 US9794554B1 (en) 2016-03-31 2016-03-31 Method for determining a visual quality index of a high dynamic range video sequence

Applications Claiming Priority (1)

Application Number Priority Date Filing Date Title
US15/087,578 US9794554B1 (en) 2016-03-31 2016-03-31 Method for determining a visual quality index of a high dynamic range video sequence

Publications (2)

Publication Number Publication Date
US20170289539A1 true US20170289539A1 (en) 2017-10-05
US9794554B1 US9794554B1 (en) 2017-10-17

Family

ID=59962114

Family Applications (1)

Application Number Title Priority Date Filing Date
US15/087,578 Active 2036-06-01 US9794554B1 (en) 2016-03-31 2016-03-31 Method for determining a visual quality index of a high dynamic range video sequence

Country Status (1)

Country Link
US (1) US9794554B1 (en)

Families Citing this family (1)

* Cited by examiner, † Cited by third party
Publication number Priority date Publication date Assignee Title
EP3614627B1 (en) * 2018-08-20 2021-09-15 EXFO Inc. Telecommunications network and services qoe assessment

Family Cites Families (19)

* Cited by examiner, † Cited by third party
Publication number Priority date Publication date Assignee Title
US5446492A (en) * 1993-01-19 1995-08-29 Wolf; Stephen Perception-based video quality measurement system
JP3025415B2 (en) * 1995-01-20 2000-03-27 ケイディディ株式会社 Digital compression / reproduction image quality evaluation device
US6363116B1 (en) * 1997-04-04 2002-03-26 Tektronix, Inc. Picture quality assessment using spatial location with or without subsampling
US5940124A (en) * 1997-07-18 1999-08-17 Tektronix, Inc. Attentional maps in objective measurement of video quality degradation
DE69803830T2 (en) * 1998-03-02 2002-09-12 Koninklijke Kpn N.V., Groningen Method, device, ASIC and their use for objective video quality assessment
EP1151618B1 (en) * 1999-02-11 2003-10-08 BRITISH TELECOMMUNICATIONS public limited company Analysis of video signal quality
US6493023B1 (en) * 1999-03-12 2002-12-10 The United States Of America As Represented By The Administrator Of The National Aeronautics And Space Administration Method and apparatus for evaluating the visual quality of processed digital video sequences
US6690839B1 (en) * 2000-01-17 2004-02-10 Tektronix, Inc. Efficient predictor of subjective video quality rating measures
US6943827B2 (en) * 2001-04-16 2005-09-13 Kddi Corporation Apparatus for monitoring quality of picture in transmission
US20030161406A1 (en) * 2002-02-26 2003-08-28 Chulhee Lee Methods for objective measurement of video quality
US20080285651A1 (en) * 2007-05-17 2008-11-20 The Hong Kong University Of Science And Technology Spatio-temporal boundary matching algorithm for temporal error concealment
EP2077527A1 (en) * 2008-01-04 2009-07-08 Thomson Licensing Method for assessing image quality
US20120020415A1 (en) * 2008-01-18 2012-01-26 Hua Yang Method for assessing perceptual quality
EP2114080A1 (en) * 2008-04-30 2009-11-04 Thomson Licensing Method for assessing the quality of a distorted version of a frame sequence
WO2011133505A1 (en) * 2010-04-19 2011-10-27 Dolby Laboratories Licensing Corporation Quality assessment of high dynamic range, visual dynamic range and wide color gamut image and video
US8723960B2 (en) * 2010-07-02 2014-05-13 Thomson Licensing Method for measuring video quality using a reference, and apparatus for measuring video quality using a reference
WO2012142285A2 (en) * 2011-04-12 2012-10-18 Dolby Laboratories Licensing Corporation Quality assessment for images that have extended dynamic ranges or wide color gamuts
US9541494B2 (en) * 2013-12-18 2017-01-10 Tektronix, Inc. Apparatus and method to measure display quality
US9396531B2 (en) * 2013-12-23 2016-07-19 Tufts University Systems and methods for image and video signal measurement

Cited By (6)

* Cited by examiner, † Cited by third party
Publication number Priority date Publication date Assignee Title
US11205257B2 (en) * 2018-11-29 2021-12-21 Electronics And Telecommunications Research Institute Method and apparatus for measuring video quality based on detection of change in perceptually sensitive region
CN110866489A (en) * 2019-11-07 2020-03-06 腾讯科技(深圳)有限公司 Image recognition method, device, equipment and storage medium
CN111988613A (en) * 2020-08-05 2020-11-24 华侨大学 Screen content video quality analysis method based on tensor decomposition
US20230336878A1 (en) * 2020-12-17 2023-10-19 Beijing Bytedance Network Technology Co., Ltd. Photographing mode determination method and apparatus, and electronic device and storage medium
US12081876B2 (en) * 2020-12-17 2024-09-03 Beijing Bytedance Network Technology Co., Ltd. Method for determining photographing mode, electronic device and storage medium
CN118521876A (en) * 2024-07-22 2024-08-20 华侨大学 Immersion type video quality evaluation method and device based on similarity measurement

Also Published As

Publication number Publication date
US9794554B1 (en) 2017-10-17


Legal Events

Date Code Title Description
AS Assignment

Owner name: UNIVERSITE DE NANTES, FRANCE

Free format text: ASSIGNMENT OF ASSIGNORS INTEREST;ASSIGNORS:LE CALLET, PATRICK;DA SILVA, MATTHIEU PERREIRA;NARWARIA, MANISH;SIGNING DATES FROM 20170322 TO 20170323;REEL/FRAME:042188/0608

Owner name: CENTRE NATIONAL DE LA RECHERCHE SCIENTIFIQUE - CNRS

Free format text: ASSIGNMENT OF ASSIGNORS INTEREST;ASSIGNORS:LE CALLET, PATRICK;DA SILVA, MATTHIEU PERREIRA;NARWARIA, MANISH;SIGNING DATES FROM 20170322 TO 20170323;REEL/FRAME:042188/0608

STCF Information on status: patent grant

Free format text: PATENTED CASE

MAFP Maintenance fee payment

Free format text: PAYMENT OF MAINTENANCE FEE, 4TH YEAR, LARGE ENTITY (ORIGINAL EVENT CODE: M1551); ENTITY STATUS OF PATENT OWNER: LARGE ENTITY

Year of fee payment: 4

AS Assignment

Owner name: NANTES UNIVERSITE, FRANCE

Free format text: MERGER;ASSIGNOR:UNIVERSITE DE NANTES;REEL/FRAME:064824/0596

Effective date: 20211001