
CN110324626B - Dual-code-stream face resolution fidelity video coding and decoding method for monitoring of Internet of things - Google Patents

Dual-code-stream face resolution fidelity video coding and decoding method for monitoring of Internet of things

Info

Publication number
CN110324626B
CN110324626B (application CN201910618148.4A)
Authority
CN
China
Prior art keywords
face
image
frame
resolution
code stream
Prior art date
Legal status (The legal status is an assumption and is not a legal conclusion. Google has not performed a legal analysis and makes no representation as to the accuracy of the status listed.)
Active
Application number
CN201910618148.4A
Other languages
Chinese (zh)
Other versions
CN110324626A (en)
Inventor
肖晶
肖尚武
陈宇
彭冬梅
廖良
朱荣
Current Assignee (The listed assignees may be inaccurate. Google has not performed a legal analysis and makes no representation or warranty as to the accuracy of the list.)
SUZHOU Institute OF WUHAN UNIVERSITY
Wuhan University WHU
Original Assignee
SUZHOU Institute OF WUHAN UNIVERSITY
Wuhan University WHU
Priority date (The priority date is an assumption and is not a legal conclusion. Google has not performed a legal analysis and makes no representation as to the accuracy of the date listed.)
Filing date
Publication date
Application filed by SUZHOU Institute OF WUHAN UNIVERSITY, Wuhan University WHU filed Critical SUZHOU Institute OF WUHAN UNIVERSITY
Priority to CN201910618148.4A priority Critical patent/CN110324626B/en
Publication of CN110324626A publication Critical patent/CN110324626A/en
Application granted granted Critical
Publication of CN110324626B publication Critical patent/CN110324626B/en
Active legal-status Critical Current
Anticipated expiration legal-status Critical


Classifications

    • G PHYSICS
    • G06 COMPUTING; CALCULATING OR COUNTING
    • G06F ELECTRIC DIGITAL DATA PROCESSING
    • G06F 18/00 Pattern recognition
    • G06F 18/20 Analysing
    • G06F 18/21 Design or setup of recognition systems or techniques; Extraction of features in feature space; Blind source separation
    • G06F 18/214 Generating training patterns; Bootstrap methods, e.g. bagging or boosting
    • G PHYSICS
    • G06 COMPUTING; CALCULATING OR COUNTING
    • G06N COMPUTING ARRANGEMENTS BASED ON SPECIFIC COMPUTATIONAL MODELS
    • G06N 3/00 Computing arrangements based on biological models
    • G06N 3/02 Neural networks
    • G06N 3/04 Architecture, e.g. interconnection topology
    • G06N 3/045 Combinations of networks
    • G PHYSICS
    • G06 COMPUTING; CALCULATING OR COUNTING
    • G06V IMAGE OR VIDEO RECOGNITION OR UNDERSTANDING
    • G06V 20/00 Scenes; Scene-specific elements
    • G06V 20/40 Scenes; Scene-specific elements in video content
    • G06V 20/41 Higher-level, semantic clustering, classification or understanding of video scenes, e.g. detection, labelling or Markovian modelling of sport events or news items
    • G PHYSICS
    • G06 COMPUTING; CALCULATING OR COUNTING
    • G06V IMAGE OR VIDEO RECOGNITION OR UNDERSTANDING
    • G06V 40/00 Recognition of biometric, human-related or animal-related patterns in image or video data
    • G06V 40/10 Human or animal bodies, e.g. vehicle occupants or pedestrians; Body parts, e.g. hands
    • G06V 40/16 Human faces, e.g. facial parts, sketches or expressions
    • G06V 40/161 Detection; Localisation; Normalisation
    • G PHYSICS
    • G06 COMPUTING; CALCULATING OR COUNTING
    • G06V IMAGE OR VIDEO RECOGNITION OR UNDERSTANDING
    • G06V 40/00 Recognition of biometric, human-related or animal-related patterns in image or video data
    • G06V 40/10 Human or animal bodies, e.g. vehicle occupants or pedestrians; Body parts, e.g. hands
    • G06V 40/16 Human faces, e.g. facial parts, sketches or expressions
    • G06V 40/172 Classification, e.g. identification
    • H ELECTRICITY
    • H04 ELECTRIC COMMUNICATION TECHNIQUE
    • H04N PICTORIAL COMMUNICATION, e.g. TELEVISION
    • H04N 19/00 Methods or arrangements for coding, decoding, compressing or decompressing digital video signals
    • H04N 19/10 Methods or arrangements for coding, decoding, compressing or decompressing digital video signals using adaptive coding
    • H04N 19/102 Methods or arrangements for coding, decoding, compressing or decompressing digital video signals using adaptive coding characterised by the element, parameter or selection affected or controlled by the adaptive coding
    • H04N 19/132 Sampling, masking or truncation of coding units, e.g. adaptive resampling, frame skipping, frame interpolation or high-frequency transform coefficient masking
    • H ELECTRICITY
    • H04 ELECTRIC COMMUNICATION TECHNIQUE
    • H04N PICTORIAL COMMUNICATION, e.g. TELEVISION
    • H04N 19/00 Methods or arrangements for coding, decoding, compressing or decompressing digital video signals
    • H04N 19/10 Methods or arrangements for coding, decoding, compressing or decompressing digital video signals using adaptive coding
    • H04N 19/169 Methods or arrangements for coding, decoding, compressing or decompressing digital video signals using adaptive coding characterised by the coding unit, i.e. the structural portion or semantic portion of the video signal being the object or the subject of the adaptive coding
    • H04N 19/177 Methods or arrangements for coding, decoding, compressing or decompressing digital video signals using adaptive coding characterised by the coding unit, i.e. the structural portion or semantic portion of the video signal being the object or the subject of the adaptive coding, the unit being a group of pictures [GOP]
    • H ELECTRICITY
    • H04 ELECTRIC COMMUNICATION TECHNIQUE
    • H04N PICTORIAL COMMUNICATION, e.g. TELEVISION
    • H04N 19/00 Methods or arrangements for coding, decoding, compressing or decompressing digital video signals
    • H04N 19/10 Methods or arrangements for coding, decoding, compressing or decompressing digital video signals using adaptive coding
    • H04N 19/169 Methods or arrangements for coding, decoding, compressing or decompressing digital video signals using adaptive coding characterised by the coding unit, i.e. the structural portion or semantic portion of the video signal being the object or the subject of the adaptive coding
    • H04N 19/186 Methods or arrangements for coding, decoding, compressing or decompressing digital video signals using adaptive coding characterised by the coding unit, i.e. the structural portion or semantic portion of the video signal being the object or the subject of the adaptive coding, the unit being a colour or a chrominance component
    • H ELECTRICITY
    • H04 ELECTRIC COMMUNICATION TECHNIQUE
    • H04N PICTORIAL COMMUNICATION, e.g. TELEVISION
    • H04N 19/00 Methods or arrangements for coding, decoding, compressing or decompressing digital video signals
    • H04N 19/50 Methods or arrangements for coding, decoding, compressing or decompressing digital video signals using predictive coding
    • H04N 19/503 Methods or arrangements for coding, decoding, compressing or decompressing digital video signals using predictive coding involving temporal prediction
    • H04N 19/51 Motion estimation or motion compensation
    • H04N 19/577 Motion compensation with bidirectional frame interpolation, i.e. using B-pictures
    • H ELECTRICITY
    • H04 ELECTRIC COMMUNICATION TECHNIQUE
    • H04N PICTORIAL COMMUNICATION, e.g. TELEVISION
    • H04N 19/00 Methods or arrangements for coding, decoding, compressing or decompressing digital video signals
    • H04N 19/50 Methods or arrangements for coding, decoding, compressing or decompressing digital video signals using predictive coding
    • H04N 19/59 Methods or arrangements for coding, decoding, compressing or decompressing digital video signals using predictive coding involving spatial sub-sampling or interpolation, e.g. alteration of picture size or resolution
    • H ELECTRICITY
    • H04 ELECTRIC COMMUNICATION TECHNIQUE
    • H04N PICTORIAL COMMUNICATION, e.g. TELEVISION
    • H04N 21/00 Selective content distribution, e.g. interactive television or video on demand [VOD]
    • H04N 21/80 Generation or processing of content or additional data by content creator independently of the distribution process; Content per se
    • H04N 21/85 Assembly of content; Generation of multimedia applications
    • H04N 21/854 Content authoring
    • H04N 21/8547 Content authoring involving timestamps for synchronizing content
    • H ELECTRICITY
    • H04 ELECTRIC COMMUNICATION TECHNIQUE
    • H04N PICTORIAL COMMUNICATION, e.g. TELEVISION
    • H04N 7/00 Television systems
    • H04N 7/18 Closed-circuit television [CCTV] systems, i.e. systems in which the video signal is not broadcast

Landscapes

  • Engineering & Computer Science (AREA)
  • Multimedia (AREA)
  • Theoretical Computer Science (AREA)
  • Signal Processing (AREA)
  • Physics & Mathematics (AREA)
  • General Physics & Mathematics (AREA)
  • Health & Medical Sciences (AREA)
  • General Health & Medical Sciences (AREA)
  • Data Mining & Analysis (AREA)
  • Oral & Maxillofacial Surgery (AREA)
  • Computational Linguistics (AREA)
  • Artificial Intelligence (AREA)
  • Evolutionary Computation (AREA)
  • Life Sciences & Earth Sciences (AREA)
  • Human Computer Interaction (AREA)
  • Software Systems (AREA)
  • General Engineering & Computer Science (AREA)
  • Molecular Biology (AREA)
  • Mathematical Physics (AREA)
  • Computing Systems (AREA)
  • Biophysics (AREA)
  • Bioinformatics & Cheminformatics (AREA)
  • Bioinformatics & Computational Biology (AREA)
  • Computer Vision & Pattern Recognition (AREA)
  • Evolutionary Biology (AREA)
  • Biomedical Technology (AREA)
  • Computer Security & Cryptography (AREA)
  • Compression Or Coding Systems Of Tv Signals (AREA)

Abstract

The invention discloses a dual-code-stream face-resolution-fidelity video encoding and decoding method for Internet of Things surveillance. First, face elements are extracted from the surveillance picture, with an MTCNN convolutional neural network detecting and tracking faces. The original surveillance video is then down-sampled and encoded to obtain a low-resolution base-layer code stream. Next, the original-resolution face images are filled into the corresponding regions of the up-sampled image, and the up-sampled recovery of the low-resolution image is subtracted to obtain the difference information of the face regions, which is encoded to obtain a face recovery-layer code stream. A corresponding decoding method is also provided: a dual-code-stream face recovery decoding algorithm decodes a surveillance picture whose local face resolution is preserved. The invention fuses high-definition faces with a low-definition background into an image with local resolution fidelity, retains the rich detail of the key regions among the scene elements, greatly reduces the video coding bit rate, improves the compression ratio of surveillance video, and is highly practical.

Description

Dual-code-stream face resolution fidelity video coding and decoding method for monitoring of Internet of things
Technical Field
The invention relates to the technical field of surveillance video coding, and in particular to a dual-code-stream face-resolution-fidelity video encoding and decoding method for Internet of Things surveillance.
Background
Urban video surveillance systems play an increasingly important role in public safety. As coverage expands and image-definition requirements rise, the volume of surveillance video generated each day keeps growing, driving up energy, transmission, and storage costs. How to preserve the analyzability of surveillance video while reducing camera access and network transmission costs has become an important problem to be solved.
Under the prior art, to meet the wide-coverage, massive-access requirements of urban surveillance, NB-IoT (Narrow Band Internet of Things) has emerged as a representative Internet of Things technology, providing the conditions for extending low-bandwidth, low-power video surveillance. Surveillance video capture is trending toward high definition while the uplink bandwidth of the Internet of Things is extremely narrow, posing a serious challenge to surveillance video coding efficiency. To resolve the contradiction between video definition and bandwidth, mainstream methods improve the compression ratio by mining redundant information, along two main directions: foreground-background separation, and coding built around regions of interest and non-interest[1]. Background-separation approaches mostly model the background to remove background redundancy in the surveillance video, greatly reducing the bits needed to encode the background[2]. Region-of-interest-based surveillance video coding mainly studies rate allocation between regions of interest and non-interest, realized by adjusting quantization step sizes[3]. With conventional surveillance video coding, local face information is damaged after compression, severely limiting the extraction and recognition of face information.
In the process of implementing the present invention, the inventors found that the prior-art methods have at least the following technical problems:
(1) conventional region-of-interest video coding extracts scene elements coarsely and inaccurately, occupying excessive bit rate;
(2) differentiated coding of regions of interest and non-interest is achieved only by adjusting the quantization step size; when the step size is too large, the background suffers large-area quantization distortion (the "screen-blooming" effect), seriously degrading the overall visual impression;
(3) the capacity of the quantization step size to adjust the video compression ratio is limited, and narrow-band Internet of Things requirements cannot be met while the overall picture quality remains acceptable. The relevant prior-art references are as follows:
[1] Meuel H, Munderloh M, Ostermann J. Low bit rate ROI based video coding for HDTV aerial surveillance video sequences[C]// Computer Vision & Pattern Recognition Workshops, 2011.
[2] Zhang X, Huang T, Tian Y, et al. Background-Modeling-Based Adaptive Prediction for Surveillance Video Coding[J]. IEEE Transactions on Image Processing, 2014, 23(2): 769-784.
[3] Jiang H, Deng W, Shen Z. Surveillance Video Processing Using Compressive Sensing[J]. Inverse Problems & Imaging, 2017, 6(2): 201-214.
Disclosure of Invention
In view of the above, the present invention provides a dual-code-stream face-resolution-fidelity video encoding and decoding method for Internet of Things surveillance, so as to solve, or at least partially solve, the technical problem that prior-art methods cannot achieve face resolution fidelity over a narrow-band Internet of Things.
In a first aspect, the invention provides a dual-code-stream face-resolution-fidelity video encoding method for Internet of Things surveillance, comprising the following steps:
step S1: acquiring an original surveillance video image, detecting faces in key frames of the surveillance video with an MTCNN (Multi-Task Cascaded Convolutional Neural Network), and performing multi-face tracking on non-key frames with the KCF (Kernelized Correlation Filter) algorithm, to obtain the original-resolution face images of all frames of the original surveillance video;
step S2: down-sampling the original surveillance video image to obtain a low-resolution image, encoding the low-resolution image to obtain a low-resolution base-layer code stream, and marking a timestamp when the base code stream is encoded;
step S3: up-sampling the low-resolution image obtained in step S2 to obtain a first image with the same resolution as the original surveillance video image, filling the original-resolution face images into the up-sampled first image and fusing to obtain a second image, and subtracting the first image from the second image to obtain the difference information of the face regions;
step S4: encoding the face-region difference information obtained in step S3 to obtain a face recovery-layer code stream.
In one embodiment, step S1 specifically includes:
step S1.1: acquiring the original surveillance video image through a high-resolution surveillance camera, and pre-processing the original surveillance video image;
step S1.2: determining the group-of-pictures interval and coding mode, wherein in each GOP (Group of Pictures) the first frame is a key frame (I frame) and the remaining frames are P frames;
step S1.3: inputting each I-frame image into the MTCNN model, detecting all faces in the image, marking the face detection boxes, caching the original face images of the frame, and storing the coordinate positions of the faces in the image;
step S1.4: reading all faces in the I frame, creating multi-target tracking objects referenced to those faces, and training with the KCF kernelized correlation filtering algorithm to obtain a regressor for each face region;
step S1.5: inputting each subsequent P frame, sampling a preset area around the target position of the previous frame, comparing the sampled positions against the tracked-object image with the trained regressor, recording the response values of all sampling points, determining the face tracking region whose response value is maximal and meets a threshold, cropping and caching the tracked face image, and storing the spatial-domain coordinates of the face, thereby obtaining the original-resolution face images of all frames (a combined sketch of these steps follows).
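For concreteness, the following Python sketch mirrors steps S1.2 through S1.5 using off-the-shelf stand-ins: the facenet-pytorch MTCNN implementation for I-frame detection and OpenCV's KCF tracker (opencv-contrib-python) for P-frame tracking. The GOP length, crop handling, and all names are illustrative assumptions rather than values fixed by this disclosure.

```python
# Sketch of step S1: MTCNN detection on I frames, KCF tracking on P frames.
# Assumptions: GOP length 16; facenet-pytorch and opencv-contrib-python installed.
import cv2
from facenet_pytorch import MTCNN

GOP_SIZE = 16                      # assumed GOP interval: 1 I frame + 15 P frames
mtcnn = MTCNN(keep_all=True)       # keep_all=True -> detect every face in the frame

def extract_faces(video_path):
    """Return, per frame, a list of (bbox, face_crop) at original resolution."""
    cap = cv2.VideoCapture(video_path)
    trackers, results, idx = [], [], 0
    while True:
        ok, frame = cap.read()
        if not ok:
            break
        faces = []
        if idx % GOP_SIZE == 0:    # key frame (I frame): run the detector
            rgb = cv2.cvtColor(frame, cv2.COLOR_BGR2RGB)
            boxes, _ = mtcnn.detect(rgb)
            trackers = []
            for x1, y1, x2, y2 in (boxes if boxes is not None else []):
                box = (int(x1), int(y1), int(x2 - x1), int(y2 - y1))
                t = cv2.TrackerKCF_create()          # one KCF regressor per face
                t.init(frame, box)
                trackers.append(t)
                x, y, w, h = box
                faces.append((box, frame[y:y + h, x:x + w].copy()))
        else:                      # non-key frame (P frame): track each face
            for t in trackers:
                ok_t, box = t.update(frame)          # fails when response < threshold
                if ok_t:
                    x, y, w, h = map(int, box)
                    faces.append(((x, y, w, h), frame[y:y + h, x:x + w].copy()))
        results.append(faces)
        idx += 1
    cap.release()
    return results
```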
In one embodiment, step S2 specifically includes:
step S2.1: processing the original surveillance video image with a designated down-sampling filter to obtain a low-resolution image;
step S2.2: encoding the low-resolution image with x265, an encoder of the HEVC standard, to obtain the base code stream of the surveillance picture, and marking a timestamp when the base code stream is encoded.
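As a minimal sketch of this step, assuming an ffmpeg binary built with libx265 is available: the disclosure fixes only the use of a designated down-sampling filter and the x265 HEVC encoder, so the bicubic filter, the 1/4-per-axis scale, and the CRF value below are illustrative choices. The container's presentation timestamps stand in for the timestamps marked at encoding time.

```python
# Sketch of step S2: down-sample and encode the base layer with x265 via ffmpeg.
import subprocess

def encode_base_layer(src_path, dst_path, scale=0.25, crf=32):
    """Down-sample by `scale` per axis and encode as HEVC (assumed parameters)."""
    subprocess.run([
        "ffmpeg", "-y", "-i", src_path,
        "-vf", f"scale=iw*{scale}:ih*{scale}:flags=bicubic",  # down-sampling filter
        "-c:v", "libx265", "-crf", str(crf),                  # HEVC base-layer encode
        dst_path,
    ], check=True)   # frame PTS in the output container serve as the timestamps
```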
In one embodiment, step S3 specifically includes:
step S3.1: up-sampling the low-resolution image obtained in step S2 with a bilinear interpolation filter to obtain a first image P1 with the same resolution as the original surveillance video image;
step S3.2: taking the face data cached for the corresponding frame, reading the coordinate positions of the faces in the image, filling the original-resolution face images into the up-sampled first image P1, and fusing to obtain a second image P2 with clear faces and a blurred background;
step S3.3: subtracting P1 from P2 to obtain the difference information of the image, which is then encoded as the face recovery code stream; specifically, the second image P2 and the up-sampled image P1 are converted to YUV format and differenced channel by channel. The background-region information is then zero, and each face region holds the difference between the high-definition face and the up-sampled low-definition face at the same resolution, calculated as in formula (1):
$$Y_d = Y_{P2} - Y_{P1},\qquad U_d = U_{P2} - U_{P1},\qquad V_d = V_{P2} - V_{P1} \tag{1}$$
where Y, U, and V are the luminance and two chrominance components respectively, the subscript d denotes the difference information of the three channel components, and the subscripts P1 and P2 denote the low-definition up-sampled image and the image fused with the high-definition faces, respectively.
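The following numpy/OpenCV sketch illustrates steps S3.1 through S3.3 and formula (1); the function and variable names are illustrative, and the differences are stored as signed 16-bit values so negative residuals survive.

```python
# Sketch of step S3: up-sample (P1), paste high-definition faces (P2),
# and take the per-channel YUV difference of formula (1).
import cv2
import numpy as np

def face_difference(low_res_frame, orig_size, face_crops):
    """orig_size is (width, height); face_crops is a list of ((x, y, w, h), crop)."""
    # P1: bilinear up-sampling of the decoded low-resolution frame
    p1 = cv2.resize(low_res_frame, orig_size, interpolation=cv2.INTER_LINEAR)
    # P2: P1 with the cached original-resolution faces pasted back in
    p2 = p1.copy()
    for (x, y, w, h), crop in face_crops:
        p2[y:y + h, x:x + w] = crop
    # Channel-wise difference in YUV: zero in the background, nonzero on faces
    p1_yuv = cv2.cvtColor(p1, cv2.COLOR_BGR2YUV).astype(np.int16)
    p2_yuv = cv2.cvtColor(p2, cv2.COLOR_BGR2YUV).astype(np.int16)
    return p2_yuv - p1_yuv          # (Y_d, U_d, V_d) of formula (1)
```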
In one embodiment, the method further comprises:
step S3.4: post-processing the obtained face-image difference information, supplementing it to obtain the complete difference sequence for all frames.
In one embodiment, step S3.4 specifically includes:
step S3.4.1: caching the difference-information amount for each frame of the GOP (Group of Pictures); when some faces are missed, the difference information jumps between neighbouring frames, and when all faces are missed the difference information is zero;
step S3.4.2: judging whether a face detection was missed according to whether the face-recovery difference information exhibits a missing jump between neighbouring frames;
step S3.4.3: when the missed-detection gap in the time domain is short and detection succeeds both before and after it, performing motion estimation from the single or multiple frames on each side and applying a preset prediction method to inter-frame predict the difference-information frame, where the pixel values of the bidirectional prediction are calculated as in formula (2) (a sketch of this repair follows after step S3.4.5):
$$\mathrm{Pre}(i,j) = \left[ \frac{n \cdot \mathrm{PreF}(i,j) + m \cdot \mathrm{PreB}(i,j)}{m+n} \right] \tag{2}$$
where PreF(i, j) and PreB(i, j) denote the predicted pixels derived from the adjacent forward and backward reference frames respectively, m indicates a forward reference m frames away, n indicates a backward reference n frames away, the square brackets denote rounding of the predicted pixel result, and Pre(i, j) is the bidirectional prediction result;
step S3.4.4: when the time-domain missed detection spans several frames and cannot be repaired by bidirectional prediction, adopting a pixel synthesis method based on a convolutional neural network: several adjacent frames before and after are input and pixel-level local convolution is performed to generate multi-frame interpolation information, the local pixel convolution proceeding as in formula (3):
$$\mathrm{FeaturePixel}(x, y, t_p) = \sum_{t=F_1}^{F_2} \sum_{i} \sum_{j} K(i, j, t)\, P(x+i,\ y+j,\ t) \tag{3}$$
where FeaturePixel is the predicted convolution result for the pixel at the corresponding position of frame t_p, the three convolution dimensions are the spatial pixel abscissa, the pixel ordinate, and the image-frame time (K denoting the local convolution kernel and P the input pixels), and [F1, F2] is the input frame range;
step S3.4.5: arranging the original difference information and the interpolated data into a sequence in time order, yielding a difference sequence of high-definition face images in which the original resolution is maintained throughout.
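A sketch of the short-gap repair of step S3.4.3, as promised above, implementing formula (2) as reconstructed here (weights inversely proportional to temporal distance, with rounding); the surrounding gap-scanning logic is an illustrative assumption.

```python
# Sketch of steps S3.4.2/S3.4.3: find missed frames and fill them by
# bidirectional prediction per formula (2).
import numpy as np

def bidirectional_predict(pre_f, pre_b, m, n):
    """pre_f / pre_b: valid difference frames m frames before / n frames after."""
    pred = (n * pre_f.astype(np.float64) + m * pre_b.astype(np.float64)) / (m + n)
    return np.rint(pred).astype(pre_f.dtype)   # the square brackets in (2): rounding

def repair_gaps(diff_seq, valid):
    """diff_seq: difference frames of one GOP; valid[i] is False where detection failed."""
    out = list(diff_seq)
    for i, ok in enumerate(valid):
        if ok:
            continue
        f = max((j for j in range(i) if valid[j]), default=None)          # nearest before
        b = next((j for j in range(i + 1, len(valid)) if valid[j]), None)  # nearest after
        if f is not None and b is not None:
            out[i] = bidirectional_predict(out[f], out[b], i - f, b - i)
    return out
```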
In one embodiment, step S4 specifically includes:
step S4.1: encoding the face-image difference sequence, including intra-frame coding and inter-frame coding of the difference information;
step S4.2: marking the face recovery code stream according to the timestamps of the base code stream, to ensure time synchronization between the base code stream and the face recovery-layer code stream (a packet-tagging sketch follows the sub-steps below).
In one embodiment, step S4.1 specifically includes:
step S4.1.1: after the continuous face-recovery difference sequence is obtained, inputting it into a conventional encoder to obtain the face-recovery difference code stream;
step S4.1.2: performing intra-frame coding of the difference information on key frames (I frames) containing faces, using the coding mode corresponding to the base code stream, to compress the intra-frame data volume;
step S4.1.3: for non-key frames, performing inter-frame predictive coding of the difference information to compress its temporal redundancy;
step S4.1.4: statistically coding the face-recovery difference information to obtain the final code stream.
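A sketch of the timestamp labelling of step S4.2; the packet structure is hypothetical, since the disclosure requires only that face recovery-layer packets carry timestamps matching the base code stream so the decoder can pair them.

```python
# Sketch of step S4.2: tag face-recovery packets with base-stream timestamps.
from dataclasses import dataclass

@dataclass
class StreamPacket:              # hypothetical packet layout
    pts: int                     # presentation timestamp taken from the base stream
    layer: str                   # "base" or "face"
    payload: bytes               # encoded data for this frame

def tag_face_packets(base_packets, face_payloads):
    """Give each face-recovery packet the PTS of its base-layer frame."""
    return [StreamPacket(bp.pts, "face", data)
            for bp, data in zip(base_packets, face_payloads)]
```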
Based on the same inventive concept, a second aspect of the present invention provides a video decoding method matching the dual-code-stream face-resolution-fidelity video encoding method for Internet of Things surveillance of the first aspect, comprising:
decoding the face recovery-layer code stream with a dual-code-stream face recovery decoding algorithm to obtain a surveillance picture whose local face resolution is preserved.
In one embodiment, a dual-stream face recovery decoding algorithm is adopted to decode a face recovery layer stream, which specifically includes:
synchronously receiving the code-stream packets of the base code stream and the face recovery-layer code stream, and parsing them to obtain the base code stream and the face-recovery difference-information code stream;
decoding the base code stream, and up-sampling the decoded video image by bilinear interpolation to enlarge it to the original resolution, obtaining the original-resolution image frame f1;
decoding the face-recovery difference-information code stream to obtain the face difference image frame f2 corresponding to each frame;
matching the face-recovery difference information to the up-sampled base-stream images in time order by timestamp, adding f2 to f1 to obtain the face recovery frame f, and processing frame by frame to obtain decoded images with face resolution recovered, calculated as in formula (4):
$$Y_f = Y_{f1} + Y_{f2},\qquad U_f = U_{f1} + U_{f2},\qquad V_f = V_{f1} + V_{f2} \tag{4}$$
where Y, U, and V are the luminance and two chrominance components respectively, and the subscripts f, f1, and f2 denote the reconstructed high-definition face recovery frame, the low-definition original-resolution image frame, and the face difference-information frame; all decoded images form a video in which high-definition faces are fused with the low-definition background.
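Per-frame fusion according to formula (4) can be sketched as follows; clipping back to the 8-bit range is an added assumption for 8-bit video.

```python
# Sketch of formula (4): add the decoded difference frame f2 to the
# up-sampled base frame f1, channel by channel in YUV.
import numpy as np

def fuse_face_frame(f1_yuv, f2_diff):
    """f1_yuv: uint8 up-sampled base frame (YUV); f2_diff: int16 difference frame."""
    f = f1_yuv.astype(np.int16) + f2_diff        # Y, U, V added per formula (4)
    return np.clip(f, 0, 255).astype(np.uint8)   # face-resolution-recovered frame
```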
One or more technical solutions in the embodiments of the present application have at least one or more of the following technical effects:
the invention provides a dual-code-stream face resolution fidelity video coding method for monitoring of the Internet of things, which comprises the steps of firstly, extracting face elements in a monitoring video picture, detecting non-key frames in a monitoring video image by adopting an MTCNN convolutional neural network and tracking a face; adopting a KCF algorithm to perform multi-face tracking on non-key frames in the monitored video image to obtain face original resolution images of all frames in the original monitored video image; then, down-sampling and coding the original monitoring video image to obtain a basic layer code stream, then, filling the original resolution image of the face into the corresponding area of the up-sampled image, and subtracting the up-sampled recovery image of the low resolution image to obtain the difference value information of the face area; and finally, coding the difference information to obtain a face recovery layer code stream.
The invention further provides a video decoding method matching the dual-code-stream face-resolution-fidelity video encoding method for Internet of Things surveillance.
The method first extracts the face information among the scene elements and down-samples and encodes the original picture; it computes the difference between the resolution-recovery image fusing high-definition faces over a low-definition background and the plain low-definition resolution-recovery image, encodes that difference separately as the face-recovery difference code stream, transmits the two streams with stream synchronization, and at the decoding end receives the dual-stream data and decodes it with a face-resolution-recovery decoder to obtain a video image with local face-resolution fidelity.
Compared with the prior art, the method is based on the idea of image segmentation: it extracts the key regions and encodes the information that restores their definition in an independent code stream, achieving local resolution fidelity. Compared with conventional region-of-interest coding, the segmentation is more precise, key-element information is retained more completely, the compression rate is greatly improved, and adaptability to Internet of Things networks is better; the method can be widely applied in face-oriented surveillance systems.
Drawings
In order to more clearly illustrate the embodiments of the present invention or the technical solutions in the prior art, the drawings used in the description of the embodiments or the prior art will be briefly introduced below, and it is obvious that the drawings in the following description are some embodiments of the present invention, and for those skilled in the art, other drawings can be obtained according to these drawings without creative efforts.
FIG. 1 is a schematic flow chart of a dual-stream face resolution fidelity video coding method for monitoring of the Internet of things according to the invention;
FIG. 2 is a schematic diagram of a system model for dual stream encoding and decoding according to the present invention;
FIG. 3 is a diagram illustrating inter-frame mapping according to the present invention;
fig. 4 is a flowchart of encoding for dual-stream face resolution fidelity in a specific example.
Detailed Description
The invention aims to provide a face-resolution-fidelity video encoding and decoding method for Internet of Things surveillance, addressing the technical problem that prior-art methods cannot achieve face resolution fidelity over a narrow-band Internet of Things, so as to realize high-definition surveillance in which the resolution of local face key information is preserved over a narrow band.
In order to achieve the above purpose, the main concept of the invention is as follows:
the key area and the background of the human face are segmented, and the human face area and the monitoring background are respectively coded by different resolution ratios through double code streams, so that the compression ratio of the monitoring video is improved. And restoring the monitoring video pictures with high local definition of the face and low background definition at the decoding end.
The key region of the face is extracted by using a MTCNN face detection mode based on deep learning to obtain the face contour. The basic code stream is a CIF-level low-resolution image, the face recovery code stream is high-resolution and low-resolution difference information, inter-frame prediction and intra-frame prediction coding of the difference information are carried out after frame interpolation supplement, and the space-time redundancy of the face image difference is removed. And receiving the synchronized double code streams at a decoding end, respectively decoding, fusing the basic image with the human face difference information, and recovering the high-definition image with the original resolution. The technical scheme of the invention proves feasible, can be well suitable for narrow-band Internet of things environment, and can be widely applied to a face-oriented community monitoring system
In order to make the objects, technical solutions and advantages of the embodiments of the present invention clearer, the technical solutions in the embodiments of the present invention will be clearly and completely described below with reference to the drawings in the embodiments of the present invention, and it is obvious that the described embodiments are some, but not all, embodiments of the present invention. All other embodiments, which can be derived by a person skilled in the art from the embodiments given herein without making any creative effort, shall fall within the protection scope of the present invention.
Example one
This embodiment provides a dual-code-stream face-resolution-fidelity video encoding method for Internet of Things surveillance; referring to fig. 1, the method includes:
Step S1: acquiring an original surveillance video image, detecting faces in key frames of the surveillance video with an MTCNN (Multi-Task Cascaded Convolutional Neural Network), and performing multi-face tracking on non-key frames with the KCF (Kernelized Correlation Filter) algorithm, to obtain the original-resolution face images of all frames of the original surveillance video.
Specifically, in step S1 the face elements in the original surveillance video image are extracted, with different methods applied to the key frames and the non-key frames of the surveillance video.
In one embodiment, step S1 specifically includes:
step S1.1: acquiring the original surveillance video image through a high-resolution surveillance camera, and pre-processing the original surveillance video image;
step S1.2: determining the group-of-pictures interval and coding mode, wherein in each GOP (Group of Pictures) the first frame is a key frame (I frame) and the remaining frames are P frames;
step S1.3: inputting each I-frame image into the MTCNN model, detecting all faces in the image, marking the face detection boxes, caching the original face images of the frame, and storing the coordinate positions of the faces in the image;
step S1.4: reading all faces in the I frame, creating multi-target tracking objects referenced to those faces, and training with the KCF kernelized correlation filtering algorithm to obtain a regressor for each face region;
step S1.5: inputting each subsequent P frame, sampling a preset area around the target position of the previous frame, comparing the sampled positions against the tracked-object image with the trained regressor, recording the response values of all sampling points, determining the face tracking region whose response value is maximal and meets a threshold, cropping and caching the tracked face image, and storing the spatial-domain coordinates of the face, thereby obtaining the original-resolution face images of all frames.
Specifically, in step S1.1 the image pre-processing of the original surveillance video image includes histogram equalization, illumination correction, and grayscale processing. GOP stands for Group of Pictures; an IPPP... coding mode is adopted. The main principle of the KCF (Kernelized Correlation Filter) algorithm is to train a discriminative classifier on given samples and use it to judge whether a patch is the tracked target or surrounding background.
In step S1.5, the preset area around the previous frame's target position is a neighbourhood of that position and can be chosen according to actual conditions; the maximum response value indicates the strongest response, and the threshold can be preset. The method of step S1.5 is applied by analogy to all subsequent P frames of the GOP, and repeating the process for all GOPs achieves extraction and caching of the original-resolution face images of all frames.
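As a toy illustration of this response test (normalized cross-correlation standing in for the actual kernelized correlation filter), the sketch below scores candidate positions near the previous target and applies a threshold; the search radius and threshold values are assumptions.

```python
# Toy sketch of the step S1.5 response test: score a search window around the
# previous position against the tracked template and require a minimum response.
import cv2

def best_response(frame_gray, template, prev_xy, search=32, thresh=0.5):
    x0, y0 = prev_xy
    th, tw = template.shape
    region = frame_gray[max(0, y0 - search):y0 + th + search,
                        max(0, x0 - search):x0 + tw + search]
    resp = cv2.matchTemplate(region, template, cv2.TM_CCOEFF_NORMED)
    _, max_val, _, max_loc = cv2.minMaxLoc(resp)   # strongest response point
    # max_loc is relative to `region`; accept it only if it clears the threshold
    return (max_loc, max_val) if max_val >= thresh else (None, max_val)
```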
Step S2: down-sampling the original surveillance video image to obtain a low-resolution image, encoding the low-resolution image to obtain a low-resolution base-layer code stream, and marking a timestamp when the base-layer code stream is encoded.
Specifically, in step S2 the original surveillance video image is down-sampled and encoded to obtain the low-resolution base-layer code stream. The MTCNN model is a multi-task cascaded convolutional neural network; the basic process and principle of detecting the key frames of the surveillance video with the MTCNN convolutional neural network are as follows:
(1) Acquiring a surveillance image through a high-resolution surveillance camera, pre-processing the image, and inputting it into the MTCNN model.
(2) Resizing the picture to different scales to build an image pyramid, which serves as the input to the subsequent three-stage cascaded neural network.
(3) Generating candidate face windows and bounding-box regression vectors with the fully convolutional network P-Net, correcting the candidate windows by bounding-box regression, and finally merging overlapping candidates with NMS (non-maximum suppression).
(4) Feeding all candidate boxes from the previous network into the next network, R-Net, which adds a fully connected layer to refine the face candidate windows, again using bounding-box regression and NMS to filter out wrong candidates.
(5) Inputting the bounding-box regression vectors produced by R-Net into O-Net, which outputs the final face boxes and the corresponding feature-point positions. As with the previous step's bounding-box regression, the Euclidean distance between the predicted and true coordinates is computed and minimized. Real-time detection and extraction of faces in the surveillance picture is thus completed; the original face images of the frame are cached, and the coordinate positions of the faces in the image are stored.
(6) Inputting the continuous video sequence, creating multi-target tracking objects, and achieving multi-target face tracking with the KCF algorithm. When a detection is missed during continuous real-time detection, the face position in that frame is estimated with the tracking prediction algorithm; the face image obtained by tracking estimation is cropped and cached, and its spatial-domain coordinates are stored. Extraction of the original-resolution face images of all frames is thus achieved.
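A usage sketch of this cascade, with the facenet-pytorch MTCNN implementation standing in for the model described above; the input file name is hypothetical. With landmarks=True the detector returns the final boxes, their confidences, and the five O-Net feature-point positions per face.

```python
# Sketch of MTCNN detection output: boxes, confidences, and landmarks.
from facenet_pytorch import MTCNN
from PIL import Image

mtcnn = MTCNN(keep_all=True)                  # P-Net -> R-Net -> O-Net cascade
img = Image.open("surveillance_frame.jpg")    # hypothetical pre-processed I frame
boxes, probs, landmarks = mtcnn.detect(img, landmarks=True)
if boxes is not None:
    for box, p, pts in zip(boxes, probs, landmarks):
        # each pts is a (5, 2) array of facial feature-point coordinates
        print(f"face at {box.astype(int)}, confidence {p:.3f}, landmarks {pts.shape}")
```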
In one embodiment, step S2 specifically includes:
step S2.1: processing the original surveillance video image with a designated down-sampling filter to obtain a low-resolution image;
step S2.2: encoding the low-resolution image with x265, an encoder of the HEVC standard, to obtain the base code stream of the surveillance picture, and marking a timestamp when the base code stream is encoded.
Specifically, the original surveillance video image is high-definition surveillance video at the original resolution; processing with the down-sampling filter yields a low-resolution surveillance picture (image). The marked timestamp is used to synchronize the subsequent face recovery code stream.
Step S3: up-sampling the low-resolution image obtained in step S2 to obtain a first image with the same resolution as the original surveillance video image, filling the original-resolution face images into the up-sampled first image and fusing to obtain a second image, and subtracting the first image from the second image to obtain the difference information of the face regions.
Specifically, in step S3 the extracted original-resolution face images are filled into the corresponding regions of the original-resolution low-definition background image, and the low-definition background image without the fused high-definition faces is subtracted, giving the difference information of the face regions.
In one embodiment, step S3 specifically includes:
step S3.1: up-sampling the low-resolution image obtained in step S2 with a bilinear interpolation filter to obtain a first image P1 with the same resolution as the original surveillance video image;
step S3.2: taking the face data cached for the corresponding frame, reading the coordinate positions of the faces in the image, filling the original-resolution face images into the up-sampled first image P1, and fusing to obtain a second image P2 with clear faces and a blurred background;
step S3.3: subtracting P1 from P2 to obtain the difference information of the image, which is then encoded as the face recovery code stream; specifically, the second image P2 and the up-sampled image P1 are converted to YUV format and differenced channel by channel. The background-region information is then zero, and each face region holds the difference between the high-definition face and the up-sampled low-definition face at the same resolution, calculated as in formula (1):
$$Y_d = Y_{P2} - Y_{P1},\qquad U_d = U_{P2} - U_{P1},\qquad V_d = V_{P2} - V_{P1} \tag{1}$$
where Y, U, and V are the luminance and two chrominance components respectively, the subscript d denotes the difference information of the three channel components, and the subscripts P1 and P2 denote the low-definition up-sampled image and the image fused with the high-definition faces, respectively.
Wherein the method further comprises:
step S3.4: post-processing the obtained face-image difference information, supplementing it to obtain the complete difference sequence for all frames.
Specifically, step S3.4 specifically includes:
step S3.4.1: caching the difference-information amount for each frame of the GOP (Group of Pictures); when some faces are missed, the difference information jumps between neighbouring frames, and when all faces are missed the difference information is zero;
step S3.4.2: judging whether a face detection was missed according to whether the face-recovery difference information exhibits a missing jump between neighbouring frames;
step S3.4.3: when the missed-detection gap in the time domain is short and detection succeeds both before and after it, performing motion estimation from the single or multiple frames on each side and applying a preset prediction method to inter-frame predict the difference-information frame, where the pixel values of the bidirectional prediction are calculated as in formula (2):
$$\mathrm{Pre}(i,j) = \left[ \frac{n \cdot \mathrm{PreF}(i,j) + m \cdot \mathrm{PreB}(i,j)}{m+n} \right] \tag{2}$$
where PreF(i, j) and PreB(i, j) denote the predicted pixels derived from the adjacent forward and backward reference frames respectively, m indicates a forward reference m frames away, n indicates a backward reference n frames away, the square brackets denote rounding of the predicted pixel result, and Pre(i, j) is the bidirectional prediction result;
step S3.4.4: when the time-domain missed detection spans several frames and cannot be repaired by bidirectional prediction, adopting a pixel synthesis method based on a convolutional neural network: several adjacent frames before and after are input and pixel-level local convolution is performed to generate multi-frame interpolation information, the local pixel convolution proceeding as in formula (3):
$$\mathrm{FeaturePixel}(x, y, t_p) = \sum_{t=F_1}^{F_2} \sum_{i} \sum_{j} K(i, j, t)\, P(x+i,\ y+j,\ t) \tag{3}$$
where FeaturePixel is the predicted convolution result for the pixel at the corresponding position of frame t_p, the three convolution dimensions are the spatial pixel abscissa, the pixel ordinate, and the image-frame time (K denoting the local convolution kernel and P the input pixels), and [F1, F2] is the input frame range;
step S3.4.5: arranging the original difference information and the interpolated data into a sequence in time order, yielding a difference sequence of high-definition face images in which the original resolution is maintained throughout.
Specifically, after frame-by-frame real-time face detection combined with the tracking algorithm, essentially all images containing faces can be obtained, but continuous missed face detections cannot be ruled out, so the difference information must be checked for gaps. In step S3.4.2, if a missing jump exists between the frames before and after the face-recovery difference information, the video frames continue to be read; if more than 3 consecutive such frames accumulate, it is assumed by default that no face is present in the picture, whereas if the information amount jumps within the window of 5 frames before and 3 frames after the current frame, a missed face detection is assumed. Naturally, these ranges can be chosen to suit the situation and are not limited to 3 and 5.
In step S3.4.4, the convolution is performed on the pixel's channel data, and several difference prediction results can be output according to actual conditions. In step S3.4.5, the original difference information and the interpolated data are arranged into a sequence in time order, so that after the face tracking algorithm and interpolation-based gap repair, a difference sequence of high-definition face images with the original resolution maintained throughout is obtained. When no face appears in the picture, the continuous difference sequence remains statically zero.
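The missed-detection test described above can be sketched as follows, flagging frames whose difference energy drops to zero while the 5-before/3-after window still carries face information; the energy criterion is an illustrative simplification.

```python
# Sketch of step S3.4.2: flag likely missed detections in the difference sequence.
import numpy as np

def flag_missed(diff_seq, before=5, after=3):
    energy = [int(np.abs(d).sum()) for d in diff_seq]   # per-frame information amount
    missed = []
    for i, e in enumerate(energy):
        window = energy[max(0, i - before):i] + energy[i + 1:i + 1 + after]
        # zero difference while neighbours still carry faces -> suspected miss
        missed.append(e == 0 and any(v > 0 for v in window))
    return missed
```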
Step S4: encoding the difference information of the face regions obtained in step S3 to obtain a face recovery-layer code stream.
Wherein, step S4 specifically includes:
step S4.1: encoding the face-image difference sequence, including intra-frame coding and inter-frame coding of the difference information;
step S4.2: marking the face recovery code stream according to the timestamps of the base code stream, to ensure time synchronization between the base code stream and the face recovery-layer code stream.
Specifically, step S4.1 specifically includes:
step S4.1.1: after the continuous face-recovery difference sequence is obtained, inputting it into a conventional encoder to obtain the face-recovery difference code stream;
step S4.1.2: performing intra-frame coding of the difference information on key frames (I frames) containing faces, using the coding mode corresponding to the base code stream, to compress the intra-frame data volume;
step S4.1.3: for non-key frames, performing inter-frame predictive coding of the difference information to compress its temporal redundancy;
step S4.1.4: statistically coding the face-recovery difference information to obtain the final code stream.
The core of the method provided by the invention is that, according to the different importance of background and face information in surveillance video, the original picture is encoded into a low-definition base-layer code stream while the high-definition face regions are detected, tracked, extracted, and encoded into a face-resolution recovery-layer code stream, realizing face-resolution-fidelity video compression at low bit rate.
Please refer to fig. 3, which is a diagram illustrating the inter-frame mapping of the present invention, and to fig. 4, which is an encoding flowchart for dual-code-stream face resolution fidelity in a specific example, covering the encoding of the enhancement-layer difference information into the face recovery-layer code stream S2 alongside the base code stream S1; the specific process is detailed in the foregoing steps and is not repeated here.
This embodiment provides a dual-code-stream face-resolution-fidelity video encoding method for Internet of Things surveillance: a face-oriented high-resolution surveillance video is input and encoded by the method into synchronized dual code streams, transmitted over the narrow-band Internet of Things, and decoded by the corresponding decoder into a video image with high-definition faces over a low-definition background.
In a specific implementation, footage from a common face-oriented high-definition surveillance camera was collected as test data. The encoding end runs on an embedded development board, the transmission environment simulates a narrow-band bandwidth of 120-160 kbps, and the decoding end is processed and displayed on a back-end server. The objective quality of the encoded and decoded image is evaluated with PSNR (Peak Signal-to-Noise Ratio). The implementation process: video data are collected in real time by the connected camera, processed in real time on the development board, and the code-stream data sent wirelessly. After narrow-band transmission, the back end receives the real-time data stream, decodes the final video with the purpose-built decoder, and stores it on a local disk or displays it directly. The encoding end can set an expected average bit rate according to the actual bandwidth, with the encoding process giving priority to face quality. The decoding end compares in real time the face quality of conventional low definition against the locally high-definition recovery of this scheme, computing the PSNR of the face region against the originally captured picture.
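The face-region PSNR used for this comparison can be computed as in the following sketch, assuming 8-bit samples (peak value 255) and face boxes taken from the cached coordinates.

```python
# Sketch of the evaluation metric: PSNR restricted to a face region.
import numpy as np

def face_psnr(ref_frame, dec_frame, box):
    x, y, w, h = box                                  # cached face coordinates
    ref = ref_frame[y:y + h, x:x + w].astype(np.float64)
    dec = dec_frame[y:y + h, x:x + w].astype(np.float64)
    mse = np.mean((ref - dec) ** 2)
    return float("inf") if mse == 0 else 10 * np.log10(255.0 ** 2 / mse)
```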
Based on the same inventive concept, the invention further provides a decoding method corresponding to the video encoding method in the first embodiment, which is specifically referred to in the second embodiment.
Example two
This embodiment provides a dual-code-stream face-resolution-fidelity video decoding method for Internet of Things surveillance, which comprises:
decoding the face recovery-layer code stream with a dual-code-stream face recovery decoding algorithm to obtain a surveillance picture whose local face resolution is preserved.
Decoding the face recovery-layer code stream with the dual-code-stream face recovery decoding algorithm specifically comprises:
synchronously receiving the code-stream packets of the base code stream and the face recovery-layer code stream, and parsing them to obtain the base code stream and the face-recovery difference-information code stream;
decoding the base code stream, and up-sampling the decoded video image by bilinear interpolation to enlarge it to the original resolution, obtaining the original-resolution image frame f1;
decoding the face-recovery difference-information code stream to obtain the face difference image frame f2 corresponding to each frame;
matching the face-recovery difference information to the up-sampled base-stream images in time order by timestamp, adding f2 to f1 to obtain the face recovery frame f, and processing frame by frame to obtain decoded images with face resolution recovered, calculated as in formula (4):
$$Y_f = Y_{f1} + Y_{f2},\qquad U_f = U_{f1} + U_{f2},\qquad V_f = V_{f1} + V_{f2} \tag{4}$$
where Y, U, and V are the luminance and two chrominance components respectively, and the subscripts f, f1, and f2 denote the reconstructed high-definition face recovery frame, the low-definition original-resolution image frame, and the face difference-information frame; all decoded images form a video in which high-definition faces are fused with the low-definition background.
Specifically, all decoded images (i.e., all decoded frames) form a fused video with high-definition faces and a low-definition background, and the face resolution essentially achieves the fidelity effect. Decoding recovery is then complete, and the result is displayed or fed into the back-end server for face recognition and related processing.
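An end-to-end sketch of this decoding loop: packets from the two streams are paired by timestamp, the base frame is up-sampled by bilinear interpolation, and the difference frame is added per formula (4). decode_base and decode_face_diff are hypothetical helpers standing in for the HEVC decoders and are passed in by the caller.

```python
# Sketch of the Example Two decoder: pair packets by PTS, up-sample, fuse.
import cv2
import numpy as np

def decode_dual_stream(base_packets, face_packets, orig_size,
                       decode_base, decode_face_diff):
    """decode_base / decode_face_diff are assumed HEVC-decoding callables."""
    face_by_pts = {p.pts: p for p in face_packets}   # timestamp alignment
    frames = []
    for bp in base_packets:
        low = decode_base(bp.payload)                # low-resolution base frame
        f1 = cv2.resize(low, orig_size, interpolation=cv2.INTER_LINEAR)
        f2 = decode_face_diff(face_by_pts[bp.pts].payload)   # int16 difference frame
        f = np.clip(f1.astype(np.int16) + f2, 0, 255).astype(np.uint8)
        frames.append(f)                             # face-fidelity frame, formula (4)
    return frames
```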
Please refer to fig. 2, a schematic diagram of the dual-code-stream encoding/decoding system model provided by the invention. The system comprises a hardware system and a software system: the hardware system includes surveillance capture, the encoding end, an NB-IoT (Narrow Band Internet of Things) carrier network, and the related devices of the surveillance back end; the software system includes a hybrid coding system and a decoding-analysis system. Steps S1-S4 of Example One are processed at the encoding end and the decoding process of Example Two at the surveillance back end; the hybrid coding system implements the encoding method of Example One and the decoding-analysis system the decoding method of Example Two.
Based on the results of steps S1-S4 in Example One and the steps in Example Two, compared with a conventional region-of-interest surveillance video coding method, the compression rate is improved by 20 times on average at the same face quality; under the same bandwidth, the local face PSNR is improved by 15 dB on average compared with conventional region-of-interest coding. The specific experimental results are shown in Table 1.
TABLE 1
(table content provided as an image in the original patent publication)
Compared with the prior art, the invention proposes the video coding idea of dual-code-stream face resolution fidelity: it distinguishes the importance of different elements in the scene and integrates deep face detection, video frame interpolation, code-stream synchronization, and other methods, so that key regions of the surveillance picture are extracted efficiently with the original face information retained, while the decoding end recovers the high-definition face resolution from the face difference code-stream information, reducing temporal redundancy as far as possible and improving compression efficiency. Based on this concept, the invention provides a complete and feasible embodiment.
While preferred embodiments of the present invention have been described, additional variations and modifications in those embodiments may occur to those skilled in the art once they learn of the basic inventive concepts. Therefore, it is intended that the appended claims be interpreted as including preferred embodiments and all such alterations and modifications as fall within the scope of the invention.
It will be apparent to those skilled in the art that various modifications and variations can be made in the embodiments of the present invention without departing from the spirit or scope of the embodiments of the invention. Thus, if such modifications and variations of the embodiments of the present invention fall within the scope of the claims of the present invention and their equivalents, the present invention is also intended to encompass such modifications and variations.

Claims (10)

1. A dual-code-stream face resolution fidelity video coding method for monitoring of the Internet of things is characterized by comprising the following steps:
step S1: acquiring an original monitoring video image, detecting key frames in the monitoring video image by adopting an MTCNN (multiple-coded convolutional neural network), and performing multi-face tracking on non-key frames in the monitoring video image by adopting a KCF (kcF) algorithm to obtain face original resolution images of all frames in the original monitoring video image;
step S2: the method comprises the steps of carrying out down-sampling on an original monitoring video image to obtain a low-resolution image, coding the low-resolution image to obtain a low-resolution base layer code stream, and marking a timestamp when the base code stream is coded;
step S3: up-sampling the low-resolution image obtained in the step S2 to obtain a first image with the same resolution as the original surveillance video image, filling the first image with the original resolution image of the face into the up-sampled first image, fusing to obtain a second image, and subtracting the first image from the second image to obtain difference information of the face region;
step S4: encoding the difference information of the face regions obtained in step S3 to obtain a face recovery layer code stream.
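For illustration only, the following minimal Python/OpenCV sketch walks through steps S2 and S3 of claim 1 for a single frame. The 2x scaling factor, the (x, y, w, h) box format and the helper name face_difference_layer are assumptions of the sketch, not features recited by the claim, and the actual entropy coding of steps S2 and S4 is omitted.

    import cv2
    import numpy as np

    def face_difference_layer(frame, face_boxes, scale=2):
        # frame: 8-bit BGR image; face_boxes: list of (x, y, w, h) from step S1
        h, w = frame.shape[:2]
        # step S2: down-sample the original frame (input to the base layer encoder)
        low = cv2.resize(frame, (w // scale, h // scale),
                         interpolation=cv2.INTER_AREA)
        # step S3: up-sample back to the original resolution -> first image P1
        p1 = cv2.resize(low, (w, h), interpolation=cv2.INTER_LINEAR)
        # paste the cached original-resolution face crops into P1 -> second image P2
        p2 = p1.copy()
        for (x, y, bw, bh) in face_boxes:
            p2[y:y + bh, x:x + bw] = frame[y:y + bh, x:x + bw]
        # P2 - P1: zero in the background, face residual inside the boxes
        diff = p2.astype(np.int16) - p1.astype(np.int16)
        return low, diff

Keeping the difference as a signed array preserves negative residuals that an unsigned image type would clip.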
2. The method according to claim 1, wherein step S1 specifically comprises:
step S1.1: acquiring an original monitoring video image through a high-resolution monitoring camera, and performing image preprocessing on the original monitoring video image;
step S1.2: determining the interval and encoding mode of the group of pictures, wherein in each GOP (Group of Pictures) the first frame is a key frame (I frame) and the remaining frames are P frames;
step S1.3: inputting the I-frame image into the MTCNN model, detecting all faces in the image, labeling the face detection boxes, caching the original face images of the frame, and storing the coordinate position parameters of each face in the image;
step S1.4: reading all faces of the I frame, establishing a multi-target tracking object based on them, and training with the KCF kernelized correlation filter algorithm to obtain a regressor for each face region;
step S1.5: inputting each subsequent P frame, sampling a preset area around the target position of the previous frame, comparing the sampled positions with the tracked object image using the trained regressor, recording the response values of all sampling points, determining the face tracking area with the maximum response value that satisfies a threshold, intercepting and caching the tracked face image, and storing the spatial-domain coordinate parameters of the face, thereby obtaining the original-resolution face images of all frames.
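As a concrete illustration of claim 2, the sketch below pairs an off-the-shelf MTCNN implementation (the facenet-pytorch package) with OpenCV's contributed KCF tracker. Both libraries, the RGB NumPy frame format and the process_gop helper are assumptions standing in for the trained models the claim describes; the response-value thresholding of step S1.5 is delegated here to the tracker's built-in confidence test.

    import cv2
    from facenet_pytorch import MTCNN

    mtcnn = MTCNN(keep_all=True)  # detect every face, not just the largest

    def process_gop(frames):
        # frames: list of RGB numpy arrays; frames[0] is the I frame (step S1.3)
        faces = []                                # cached (frame_idx, box, crop)
        boxes, _ = mtcnn.detect(frames[0])        # MTCNN face detection
        trackers = []
        for box in ([] if boxes is None else boxes):
            x1, y1, x2, y2 = [int(v) for v in box]
            faces.append((0, (x1, y1, x2 - x1, y2 - y1),
                          frames[0][y1:y2, x1:x2].copy()))
            t = cv2.TrackerKCF_create()           # one KCF regressor per face (S1.4)
            t.init(frames[0], (x1, y1, x2 - x1, y2 - y1))
            trackers.append(t)
        for idx, frame in enumerate(frames[1:], start=1):   # P frames (step S1.5)
            for t in trackers:
                ok, (x, y, w, h) = t.update(frame)          # max-response search
                if ok:
                    x, y, w, h = int(x), int(y), int(w), int(h)
                    faces.append((idx, (x, y, w, h),
                                  frame[y:y + h, x:x + w].copy()))
        return faces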
3. The method according to claim 1, wherein step S2 specifically comprises:
step S2.1: processing the original monitoring video image with a specified down-sampling filter to obtain a low-resolution image;
step S2.2: encoding the low-resolution image with the x265 encoder of the HEVC standard to obtain the base code stream of the monitoring picture, and marking a timestamp when the base code stream is encoded.
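A minimal sketch of claim 3 follows; it drives the down-sampling filter and an x265 encode through the ffmpeg command-line tool (libx265). The raw-YUV input format, the bicubic filter choice and the file names are illustrative assumptions; a production encoder would call the x265 API directly.

    import subprocess

    def encode_base_layer(src, width, height, scale=2):
        # src: raw planar YUV 4:2:0 file at the original resolution
        subprocess.run([
            "ffmpeg", "-y",
            "-f", "rawvideo", "-pix_fmt", "yuv420p",
            "-s", f"{width}x{height}", "-i", src,
            # step S2.1: the specified down-sampling filter (bicubic here)
            "-vf", f"scale={width // scale}:{height // scale}:flags=bicubic",
            # step S2.2: HEVC encode with x265; MP4 muxing keeps the timestamps
            "-c:v", "libx265", "base_layer.mp4",
        ], check=True)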
4. The method according to claim 2, wherein step S3 specifically comprises:
step S3.1: up-sampling the low-resolution image obtained in step S2 with a bilinear interpolation filter to obtain a first image P1 with the same resolution as the original monitoring video image;
step S3.2: taking the face data cached for the corresponding frame, reading the coordinate position parameters of each face in the image, filling the original-resolution face images into the up-sampled first image P1, and fusing them to obtain a second image P2 with clear faces and a blurred background;
step S3.3: subtracting P1 from P2 to obtain the difference information of the image, to be encoded as the face recovery code stream, specifically comprising: converting the second image P2 and the up-sampled image P1 into YUV format and differencing the three channels separately, whereupon the background area information is zero and the face area is the difference between the high-definition face and the up-sampled low-definition face at the same resolution, calculated as in formula (1):
$$\begin{cases} Y_d = Y_{P2} - Y_{P1} \\ U_d = U_{P2} - U_{P1} \\ V_d = V_{P2} - V_{P1} \end{cases} \qquad (1)$$
y, U, V are luminance and two chrominance components respectively, a lower corner mark d represents difference value information of three channel components, and corner marks P1 and P2 are marked with a low-definition up-sampling image and an image fused with a high-definition human face respectively.
5. The method of claim 4, wherein the method further comprises:
step S3.4: post-processing the obtained face-image difference information, and supplementing it to obtain complete difference sequences for all frames.
6. The method according to claim 5, wherein step S3.4 specifically comprises:
step S3.4.1: caching the difference information of each frame of a group of pictures (GOP), wherein, when some faces are missed by detection, the difference information jumps between the preceding and following frames and the difference information of all missed faces is zero;
step S3.4.2: judging whether face missed detection exists according to whether such jumps appear in the face-recovery difference information of the preceding and following frames;
step S3.4.3: when the time-domain missed-detection interval is short and detection succeeds before and after it, performing motion estimation from the single or multiple frames before and after, and performing inter-frame prediction with a preset prediction method to obtain the difference information frame, wherein the corresponding pixel value of the bidirectional prediction is calculated as in formula (2):
$$\mathrm{Pre}(i,j) = \left[ \frac{n \cdot \mathrm{PreF}(i,j) + m \cdot \mathrm{PreB}(i,j)}{m + n} \right] \qquad (2)$$
where PreF(i, j) and PreB(i, j) respectively denote the predicted pixels derived from the adjacent forward and adjacent backward reference frames, m denotes a forward reference m frames away, n denotes a backward reference n frames away, the square brackets denote rounding of the predicted pixel result, and Pre(i, j) is the bidirectional prediction result;
step S3.4.4: when the time-domain face missed detection spans several frames and cannot be repaired by bidirectional prediction, adopting a pixel-synthesis method based on a convolutional neural network, in which the adjacent preceding and following frames are input and pixel-level local convolution is performed to generate multi-frame interpolation information, the local pixel convolution being as in formula (3):
$$\mathrm{FeaturePixel}(x, y, t_p) = \sum_{t = F_1}^{F_2} \sum_{i} \sum_{j} K(i, j, t)\, P(x + i, y + j, t) \qquad (3)$$
where FeaturePixel is the convolution result predicting the pixel at the corresponding position of frame t_p, P denotes the input pixel values and K the local convolution coefficients, the three convolution dimensions are the spatial pixel abscissa, the pixel ordinate and the image frame time, and [F1, F2] is the frame range of the input;
step S3.4.5: arranging the original difference information and the interpolated data into a sequence in time order to obtain a face high-definition image difference sequence in which the full original resolution is maintained.
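For step S3.4.3 of claim 6, formula (2) amounts to an inverse-distance-weighted blend of the two reference difference frames. A small NumPy transcription follows; the function name and the float64 intermediate are choices of the sketch.

    import numpy as np

    def bidirectional_predict(pre_f, pre_b, m, n):
        # pre_f: forward reference m frames back; pre_b: backward reference n ahead.
        # Formula (2): the closer reference gets the larger weight; [.] = rounding.
        pred = (n * pre_f.astype(np.float64)
                + m * pre_b.astype(np.float64)) / (m + n)
        return np.rint(pred).astype(pre_f.dtype)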
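For step S3.4.4 of claim 6, the following is a minimal sketch of the local spatio-temporal convolution of formula (3) for a single output pixel. The array shapes, the function name and the externally supplied kernel, which stands in for the coefficients the trained network of the claim would predict, are all assumptions of the sketch.

    import numpy as np

    def synthesize_pixel(frames, x, y, kernel):
        # frames: (T, H, W) stack covering the input range [F1, F2];
        # kernel: (T, kh, kw) local coefficients (CNN-predicted in the claim)
        t, kh, kw = kernel.shape
        patch = frames[:, y:y + kh, x:x + kw]    # spatio-temporal neighbourhood
        return float(np.sum(kernel * patch))     # FeaturePixel at (x, y, t_p)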
7. The method according to claim 1, wherein step S4 specifically comprises:
step S4.1: encoding the difference sequence of the face images, including intra-frame encoding and inter-frame encoding of the difference information;
step S4.2: marking the face recovery code stream with the timestamp of the base code stream to ensure time synchronization between the base code stream and the face recovery layer code stream.
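The synchronization of step S4.2 reduces to keying face-recovery packets by the base-layer timestamp. A minimal sketch, assuming packets are dicts carrying a "pts" field (an assumption of the sketch, not a structure defined by the claim):

    def synchronize(base_packets, face_packets):
        # pair each base-layer packet with the face-recovery packet sharing its pts
        face_by_pts = {p["pts"]: p for p in face_packets}
        return [(b, face_by_pts.get(b["pts"])) for b in base_packets]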
8. The method according to claim 7, wherein step S4.1 specifically comprises:
step S4.1.1: after the continuous face-recovery difference sequence is obtained, inputting it into a conventional encoder for encoding to obtain the face-recovery difference code stream;
step S4.1.2: performing intra-frame encoding of the difference information on key I frames containing faces, using the encoding mode corresponding to the base code stream, to compress the intra-frame data volume;
step S4.1.3: for non-key frames, performing inter-frame predictive encoding of the difference information to compress its temporal redundancy;
step S4.1.4: performing statistical (entropy) encoding on the face-recovery difference information to obtain the final code stream.
9. A video decoding method based on the dual-code-stream face resolution fidelity video coding method for monitoring of the Internet of Things according to any one of claims 1 to 8, characterized by comprising the following step:
decoding the face recovery layer code stream with a dual-code-stream face recovery decoding algorithm to obtain a monitoring picture in which the local face resolution is preserved.
10. The method of claim 9, wherein decoding the face recovery layer code stream with the dual-code-stream face recovery decoding algorithm specifically comprises:
synchronously receiving the code stream packets corresponding to the base code stream and the face recovery layer code stream, and parsing them to obtain the base code stream and the face-recovery difference information code stream;
decoding the base code stream, and up-sampling the decoded video image with bilinear interpolation to enlarge the video to the original resolution, obtaining a first original-resolution image frame f1;
decoding the face-recovery difference information code stream to obtain the face difference image frame f2 corresponding to each frame;
matching the face-recovery difference information to the up-sampled base code stream images in time order according to the timestamps, adding f2 to f1 to obtain the face recovery frame f, and processing frame by frame to obtain decoded images with recovered face resolution, calculated as in formula (4):
$$\begin{cases} Y_f = Y_{f1} + Y_{f2} \\ U_f = U_{f1} + U_{f2} \\ V_f = V_{f1} + V_{f2} \end{cases} \qquad (4)$$
y, U, V are luminance and two chrominance components respectively, the lower corner mark f represents a reconstructed high-definition face recovery frame, f1 represents a low-definition original resolution image frame, f2 represents a face difference information frame, the three components correspond to the face recovery frame and the low-definition background recovery frame respectively, and all decoded images form a face high-definition and low-definition fused video image.
CN201910618148.4A 2019-07-10 2019-07-10 Dual-code-stream face resolution fidelity video coding and decoding method for monitoring of Internet of things Active CN110324626B (en)

Priority Applications (1)

Application Number Priority Date Filing Date Title
CN201910618148.4A CN110324626B (en) 2019-07-10 2019-07-10 Dual-code-stream face resolution fidelity video coding and decoding method for monitoring of Internet of things

Publications (2)

Publication Number Publication Date
CN110324626A (en) 2019-10-11
CN110324626B true CN110324626B (en) 2021-05-18

Family

ID=68123191

Family Applications (1)

Application Number Title Priority Date Filing Date
CN201910618148.4A Active CN110324626B (en) 2019-07-10 2019-07-10 Dual-code-stream face resolution fidelity video coding and decoding method for monitoring of Internet of things

Country Status (1)

Country Link
CN (1) CN110324626B (en)

Families Citing this family (14)

* Cited by examiner, † Cited by third party
Publication number Priority date Publication date Assignee Title
CN110717070A (en) * 2019-10-17 2020-01-21 山东浪潮人工智能研究院有限公司 Video compression method and system for indoor monitoring scene
CN112785533B (en) * 2019-11-07 2023-06-16 RealMe重庆移动通信有限公司 Image fusion method, image fusion device, electronic equipment and storage medium
US11375204B2 (en) 2020-04-07 2022-06-28 Nokia Technologies Oy Feature-domain residual for video coding for machines
CN111737525B (en) * 2020-06-03 2022-10-25 西安交通大学 Multi-video program matching method
CN113810696B (en) * 2020-06-12 2024-09-17 华为技术有限公司 Information transmission method, related equipment and system
CN112435440B (en) * 2020-10-30 2022-08-09 成都蓉众和智能科技有限公司 Non-contact type indoor personnel falling identification method based on Internet of things platform
CN112104869B (en) * 2020-11-10 2021-02-02 光谷技术有限公司 Video big data storage and transcoding optimization system
CN112733666A (en) * 2020-12-31 2021-04-30 湖北亿咖通科技有限公司 Method, equipment and storage medium for collecting difficult images and training models
CN112949547A (en) * 2021-03-18 2021-06-11 北京市商汤科技开发有限公司 Data transmission and display method, device, system, equipment and storage medium
CN114630129A (en) * 2022-02-07 2022-06-14 浙江智慧视频安防创新中心有限公司 Video coding and decoding method and device based on intelligent digital retina
CN115361582B (en) * 2022-07-19 2023-04-25 鹏城实验室 Video real-time super-resolution processing method, device, terminal and storage medium
CN116506665A (en) * 2023-06-27 2023-07-28 北京蔚领时代科技有限公司 VR streaming method, system, device and storage medium for self-adaptive code rate control
CN117544782A (en) * 2023-11-09 2024-02-09 四川新视创伟超高清科技有限公司 Target enhancement coding method and device in 8K video of unmanned aerial vehicle
CN117556082B (en) * 2024-01-12 2024-03-22 广东启正电子科技有限公司 Remote face recognition video storage method and system based on sequence coding

Citations (3)

* Cited by examiner, † Cited by third party
Publication number Priority date Publication date Assignee Title
CN101478671A (en) * 2008-01-02 2009-07-08 中兴通讯股份有限公司 Video encoding apparatus applied on video monitoring and video encoding method thereof
CN101938656A (en) * 2010-09-27 2011-01-05 上海交通大学 Video coding and decoding system based on keyframe super-resolution reconstruction
CN107454412A (en) * 2017-08-23 2017-12-08 绵阳美菱软件技术有限公司 A kind of processing method of video image, apparatus and system

Family Cites Families (2)

* Cited by examiner, † Cited by third party
Publication number Priority date Publication date Assignee Title
US9584710B2 (en) * 2008-02-28 2017-02-28 Avigilon Analytics Corporation Intelligent high resolution video system
CN105701515B (en) * 2016-01-18 2019-01-04 武汉大学 A kind of human face super-resolution processing method and system based on the constraint of the double-deck manifold

Non-Patent Citations (2)

* Cited by examiner, † Cited by third party
Title
Superpixel-based segmentation of moving objects for low bitrate ROI coding systems; Holger Meuel et al.; 2013 10th IEEE International Conference on Advanced Video and Signal Based Surveillance; 2013-10-21; full text *
Research on multi-code-stream variable-resolution compression and transmission technology based on scene elements in an Internet of Things environment; Xiao Shangwu et al.; Chinese Journal on Internet of Things (物联网学报); 2018-12-31; full text *

Also Published As

Publication number Publication date
CN110324626A (en) 2019-10-11

Similar Documents

Publication Publication Date Title
CN110324626B (en) Dual-code-stream face resolution fidelity video coding and decoding method for monitoring of Internet of things
EP2782340B1 (en) Motion analysis method based on video compression code stream, code stream conversion method and apparatus thereof
US6625333B1 (en) Method for temporal interpolation of an image sequence using object-based image analysis
US20100303150A1 (en) System and method for cartoon compression
US20140177706A1 (en) Method and system for providing super-resolution of quantized images and video
CN111434115B (en) Method and related device for coding and decoding video image comprising pixel points
Li et al. A scalable coding approach for high quality depth image compression
US6577352B1 (en) Method and apparatus for filtering chrominance signals of image
CN102761765B (en) Deep and repaid frame inserting method for three-dimensional video
Meisinger et al. Automatic tv logo removal using statistical based logo detection and frequency selective inpainting
JP3405788B2 (en) Video encoding device and video decoding device
KR20060111528A (en) Detection of local visual space-time details in a video signal
CN113727073A (en) Method and system for realizing vehicle-mounted video monitoring based on cloud computing
CN112887587A (en) Self-adaptive image data fast transmission method capable of carrying out wireless connection
Xie et al. Just noticeable visual redundancy forecasting: a deep multimodal-driven approach
KR100289054B1 (en) Region segmentation and background mosaic composition
JP2007514362A (en) Method and apparatus for spatial scalable compression techniques
Bosch et al. Video coding using motion classification
Chen et al. AV1 video coding using texture analysis with convolutional neural networks
US7899112B1 (en) Method and apparatus for extracting chrominance shape information for interlaced scan type image
Décombas et al. Seam carving modeling for semantic video coding in security applications
CN114782676B (en) Method and system for extracting region of interest of video
Sakaida et al. Moving object extraction using background difference and region growing with a spatiotemporal watershed algorithm
CN117974881A (en) Traffic human body detection reconstruction method based on video reconstruction technology
CN116723331A (en) Real-time video image equipment based on light weight and processing method

Legal Events

Code Title
PB01 Publication
SE01 Entry into force of request for substantive examination
GR01 Patent grant