WO2024212822A1 - AI-based video encoding method, apparatus, device, and storage medium - Google Patents
AI-based video encoding method, apparatus, device, and storage medium
- Publication number
- WO2024212822A1 (PCT/CN2024/084506)
- Authority
- WO
- WIPO (PCT)
- Prior art keywords
- key point
- frame
- driving
- point information
- reference frame
- Prior art date
Links
- 238000000034 method Methods 0.000 title claims abstract description 60
- 238000007906 compression Methods 0.000 claims abstract description 157
- 230000006835 compression Effects 0.000 claims abstract description 157
- 230000005540 biological transmission Effects 0.000 claims abstract description 36
- 238000001514 detection method Methods 0.000 claims abstract description 33
- 239000011159 matrix material Substances 0.000 claims description 99
- 230000033001 locomotion Effects 0.000 claims description 42
- 239000013067 intermediate product Substances 0.000 claims description 10
- 230000003287 optical effect Effects 0.000 claims description 10
- 238000013528 artificial neural network Methods 0.000 claims description 9
- 238000011156 evaluation Methods 0.000 claims description 8
- 238000004590 computer program Methods 0.000 claims description 6
- 238000012549 training Methods 0.000 abstract description 16
- 230000000694 effects Effects 0.000 abstract description 14
- 230000008569 process Effects 0.000 abstract description 6
- 230000000875 corresponding effect Effects 0.000 description 28
- 238000005516 engineering process Methods 0.000 description 12
- 230000008859 change Effects 0.000 description 7
- 238000011161 development Methods 0.000 description 7
- 230000018109 developmental process Effects 0.000 description 7
- 238000012545 processing Methods 0.000 description 7
- 238000010586 diagram Methods 0.000 description 6
- 230000001815 facial effect Effects 0.000 description 6
- 238000007781 pre-processing Methods 0.000 description 5
- 238000004364 calculation method Methods 0.000 description 4
- 230000006870 function Effects 0.000 description 3
- 230000002457 bidirectional effect Effects 0.000 description 2
- 238000006243 chemical reaction Methods 0.000 description 2
- 238000013144 data compression Methods 0.000 description 2
- 238000006073 displacement reaction Methods 0.000 description 2
- 230000004886 head movement Effects 0.000 description 2
- 230000006872 improvement Effects 0.000 description 2
- 239000000047 product Substances 0.000 description 2
- 230000001172 regenerating effect Effects 0.000 description 2
- 238000004458 analytical method Methods 0.000 description 1
- 230000009286 beneficial effect Effects 0.000 description 1
- 230000002596 correlated effect Effects 0.000 description 1
- 238000013480 data collection Methods 0.000 description 1
- 238000013135 deep learning Methods 0.000 description 1
- 238000013507 mapping Methods 0.000 description 1
- 238000005457 optimization Methods 0.000 description 1
- 230000008929 regeneration Effects 0.000 description 1
- 238000011069 regeneration method Methods 0.000 description 1
- 238000011160 research Methods 0.000 description 1
- 230000008685 targeting Effects 0.000 description 1
- 230000000007 visual effect Effects 0.000 description 1
Classifications
-
- H—ELECTRICITY
- H04—ELECTRIC COMMUNICATION TECHNIQUE
- H04N—PICTORIAL COMMUNICATION, e.g. TELEVISION
- H04N19/00—Methods or arrangements for coding, decoding, compressing or decompressing digital video signals
- H04N19/85—Methods or arrangements for coding, decoding, compressing or decompressing digital video signals using pre-processing or post-processing specially adapted for video compression
-
- G—PHYSICS
- G06—COMPUTING; CALCULATING OR COUNTING
- G06N—COMPUTING ARRANGEMENTS BASED ON SPECIFIC COMPUTATIONAL MODELS
- G06N3/00—Computing arrangements based on biological models
- G06N3/02—Neural networks
- G06N3/04—Architecture, e.g. interconnection topology
- G06N3/0475—Generative networks
-
- G—PHYSICS
- G06—COMPUTING; CALCULATING OR COUNTING
- G06N—COMPUTING ARRANGEMENTS BASED ON SPECIFIC COMPUTATIONAL MODELS
- G06N3/00—Computing arrangements based on biological models
- G06N3/02—Neural networks
- G06N3/08—Learning methods
- G06N3/094—Adversarial learning
-
- G—PHYSICS
- G06—COMPUTING; CALCULATING OR COUNTING
- G06T—IMAGE DATA PROCESSING OR GENERATION, IN GENERAL
- G06T9/00—Image coding
- G06T9/002—Image coding using neural networks
-
- H—ELECTRICITY
- H04—ELECTRIC COMMUNICATION TECHNIQUE
- H04N—PICTORIAL COMMUNICATION, e.g. TELEVISION
- H04N19/00—Methods or arrangements for coding, decoding, compressing or decompressing digital video signals
- H04N19/10—Methods or arrangements for coding, decoding, compressing or decompressing digital video signals using adaptive coding
- H04N19/169—Methods or arrangements for coding, decoding, compressing or decompressing digital video signals using adaptive coding characterised by the coding unit, i.e. the structural portion or semantic portion of the video signal being the object or the subject of the adaptive coding
- H04N19/17—Methods or arrangements for coding, decoding, compressing or decompressing digital video signals using adaptive coding characterised by the coding unit, i.e. the structural portion or semantic portion of the video signal being the object or the subject of the adaptive coding the unit being an image region, e.g. an object
- H04N19/172—Methods or arrangements for coding, decoding, compressing or decompressing digital video signals using adaptive coding characterised by the coding unit, i.e. the structural portion or semantic portion of the video signal being the object or the subject of the adaptive coding the unit being an image region, e.g. an object the region being a picture, frame or field
-
- H—ELECTRICITY
- H04—ELECTRIC COMMUNICATION TECHNIQUE
- H04N—PICTORIAL COMMUNICATION, e.g. TELEVISION
- H04N19/00—Methods or arrangements for coding, decoding, compressing or decompressing digital video signals
- H04N19/42—Methods or arrangements for coding, decoding, compressing or decompressing digital video signals characterised by implementation details or hardware specially adapted for video compression or decompression, e.g. dedicated software implementation
Definitions
- the embodiments of the present application relate to the field of video processing technology, and in particular to an AI-based video encoding method, apparatus, device, storage medium, and product.
- AI video coding methods are mainly deep video coding (DVC) and machine vision coding.
- DVC deep video coding
- Deep video coding DVC is an end-to-end video coding model.
- the entire video compression framework is completed by a neural network and can be trained uniformly.
- Machine vision coding is a video coding technology aimed at intelligent applications. It combines video coding and decoding with machine vision analysis, and implements video coding and decoding tasks by using an end-to-end network system to enable machines to complete visual tasks.
- since the deep video coding (DVC) framework is implemented by a neural network and the deep learning method it uses is mainly based on offline optimization, it suffers from problems such as poor model adaptability, complex model deployment and implementation, and a high transmission bit rate; meanwhile, in addition to video encoding and decoding, the machine vision coding model more importantly needs to use the decoded video to complete machine vision tasks, so it suffers from problems such as strong model specificity and high difficulty in model training.
- AI video coding technology usually requires a large floating-point data representation, which shows no obvious advantage over P-frame compression in High Efficiency Video Coding (HEVC). Therefore, when existing AI video coding technology is used to encode videos, there are problems such as poor adaptability of the coding model, high difficulty of model training, and a high transmission bit rate.
- the embodiments of the present application provide an AI-based video encoding method, apparatus, device, storage medium and product, which solve the problems of poor adaptability of the encoding model, high difficulty in model training and high transmission bit rate when encoding videos using existing AI video encoding technology.
- the method uses a key point detection network to output key point information for the source reference frame and drive frame of the encoded video, and determines the key point information compression result of each drive frame based on the key point information of the source reference frame and the preset compression rules, and then generates code stream data based on the source reference frame, the key point information of the source reference frame and the key point information compression result of each drive frame for transmission, so as to achieve ultra-low bit rate video encoding.
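- As a non-authoritative illustration of the flow just described, the following Python sketch outlines one way the encoding side could be organized; the function names (kp_detector, hevc_encode, compress_keypoints) are placeholders for the key point detection network, the traditional encoder, and the preset compression rule, and are not part of the present application.

```python
# Hypothetical sketch of the encoding pipeline described above.
def encode_video(frames, kp_detector, hevc_encode, compress_keypoints):
    source = frames[0]                 # source reference frame (first frame)
    driving = frames[1:]               # driving frames (all subsequent frames)

    source_bits = hevc_encode(source)  # traditional coding of the reference frame
    source_kp = kp_detector(source)    # key point information of the reference frame

    compressed_kps = []
    for frame in driving:
        kp = kp_detector(frame)        # key point information of each driving frame
        compressed_kps.append(compress_keypoints(kp, source_kp))

    # Code stream: reference-frame bits + reference key points + compressed driving key points.
    return {"ref": source_bits, "ref_kp": source_kp, "drv_kp": compressed_kps}
```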
- an embodiment of the present application provides an AI-based video encoding method, the method comprising:
- a video to be encoded is acquired; a key point detection network is used to output key point information for the source reference frame and the driving frames of the video to be encoded; a key point information compression result of each driving frame is determined according to the key point information of the source reference frame and a preset compression rule; and code stream data is generated for transmission based on the source reference frame, the key point information of the source reference frame, and the key point information compression result of each driving frame.
- an embodiment of the present application further provides an AI-based video encoding device, including:
- a video acquisition module used to acquire the video to be encoded
- a key point information output module used for outputting key point information of a source reference frame and a driving frame of the video to be encoded using a key point detection network
- a key point information compression module is used to determine the key point information compression result of each driving frame according to the key point information of the source reference frame and a preset compression rule;
- the code stream data transmission module is used to generate code stream data for transmission based on the compression result of the source reference frame, the key point information of the source reference frame and the key point information of each driving frame.
- an embodiment of the present application further provides an AI-based video encoding device, the device comprising:
- one or more processors;
- a storage device for storing one or more programs
- when the one or more programs are executed by the one or more processors, the one or more processors implement the AI-based video encoding method described in the embodiments of the present application.
- an embodiment of the present application further provides a storage medium storing computer executable instructions, which, when executed by a computer processor, are used to execute the AI-based video encoding method described in the embodiment of the present application.
- an embodiment of the present application also provides a computer program product, which includes a computer program stored in a computer-readable storage medium, and at least one processor of the device reads and executes the computer program from the computer-readable storage medium, so that the device executes the AI-based video encoding method described in the embodiment of the present application.
- a video to be encoded is obtained; a key point detection network is used to output key point information for the source reference frame and the driving frames of the video to be encoded; a key point information compression result of each driving frame is determined based on the key point information of the source reference frame and the preset compression rule; and code stream data is generated for transmission based on the source reference frame, the key point information of the source reference frame, and the key point information compression result of each driving frame.
- the above-mentioned AI-based video encoding method solves the problems of poor adaptability of the encoding model, high difficulty in model training and high transmission bit rate when encoding videos using existing AI video encoding technology.
- the key point detection network is used to output key point information for the source reference frame and the driving frames of the video to be encoded, and the key point information compression result of each driving frame is determined based on the key point information of the source reference frame and the preset compression rule. Then, code stream data is generated for transmission based on the source reference frame, the key point information of the source reference frame, and the key point information compression result of each driving frame, so that ultra-low bit rate video encoding can be achieved.
- FIG1 is a flowchart of an AI-based video encoding method provided in an embodiment of the present application.
- FIG2 is a schematic diagram of a framework of an AI-based video encoding system provided in an embodiment of the present application
- FIG3 is a flowchart of an AI-based video encoding method provided in an embodiment of the present application.
- FIG4 is a flowchart of an AI-based video encoding method provided in an embodiment of the present application.
- FIG5 is a flowchart of a key point coordinate compression provided by an embodiment of the present application.
- FIG6 is a flow chart of a reconstructed image evaluation method provided in an embodiment of the present application.
- FIG7 is a structural block diagram of an AI-based video encoding device provided in an embodiment of the present application.
- FIG8 is a schematic diagram of the structure of an AI-based video encoding device provided in an embodiment of the present application.
- the terms "first", "second", etc. in the specification and claims of the present application are used to distinguish between similar objects, and are not used to describe a specific order or sequence. It should be understood that the data used in this way are interchangeable under appropriate circumstances, so that the embodiments of the present application can be implemented in orders other than those illustrated or described here; the objects distinguished by "first", "second", etc. are generally of one type, and the number of objects is not limited.
- the first object can be one or more.
- "and/or" in the specification and claims represents at least one of the connected objects, and the character "/" generally indicates that the associated objects are in an "or" relationship.
- the first frame (reference frame) is first encoded and transmitted using a traditional encoder, and then the key points of subsequent frames are extracted at the encoding end and transmitted to the decoding end.
- the decoding end relies on the relative displacement of the key points between the current frame and the reference frame to warp the face in the reference frame and fill in the background, thereby generating the current frame.
- the present application proposes an ultra-low bit rate generative coding method, which can compress the original key point data and reuse the key point data, further reducing the key point information that needs to be transmitted at the encoding end.
- compared with HEVC, a bit rate reduction target of at least 1/10 can be achieved without a significant loss of subjective quality.
- the present invention constructs an end-to-end generative coding system framework and, through pre-encoding processing and an FOM generative compression link, provides the transcoding-engine side and the viewer decoding side with new approaches for regenerating lost frames and for using generated frames in place of traditionally coded frames.
- FIG1 is a flowchart of an AI-based video encoding method provided in an embodiment of the present application, which can be used in video image transmission scenarios, especially video transmission scenarios through video encoding compression.
- the method can be executed by a server, a smart terminal, or another device with encoding and computing capabilities. The method specifically includes the following steps:
- the video to be encoded may be video data to be transmitted, and the video data may be composed of multiple frames of the same or different images.
- the purpose of compressing the video can be achieved, so that the video can be successfully transmitted to the user terminal device.
- the video to be encoded can be obtained through a video recording device at the video recording end, such as a camera, etc.
- data collection and image preprocessing can be performed on the images in the video to be encoded, for example, the video to be encoded can be classified, judged and enhanced, etc., which can improve the robustness of video encoding model training and the efficiency of video information acquisition.
- the source reference frame may be the first frame of the video to be encoded.
- the driving frame may be each frame of the video to be encoded that is located after the source reference frame.
- the image content and image data of each frame of the driving frame may be the same or different, and the image content and image data of the driving frame and the source reference frame may be the same or different.
- key points may be points that can represent the features of each frame of the video to be encoded, including points that represent the overall form or content distribution of the source reference frame and the driving frame image, and points that change in the next frame or multiple frames of the driving frame image.
- key point information may be information used to encode the video to be encoded, including content information, position information, and data information of the key points, such as pixel data of the key points, position coordinate data, and data information within a certain range near the coordinates.
- the key point detection network may be a network for obtaining the key point information, including a key point detector Kp-detector, etc.
- the key point information of the source reference frame and the driving frames may be obtained by applying the sparse motion field network in the key point detection network to the source reference frame and the driving frames of the video to be encoded.
- the preset compression rule may be a rule for compressing each drive frame in the to-be-encoded video according to video compression requirements to reduce the volume of the to-be-encoded video, including an intra-frame compression rule and an inter-frame compression rule, etc.
- the key point information compression result of each drive frame may be a result of compressing the key point data of each drive frame into encoded data, including intra-frame compression results of different key point information data in the same drive frame and inter-frame compression results of the same key point data in different drive frames.
- in one embodiment, an end-to-end generative coding system model based on the FOM (First Order Motion Model) can be used.
- a key point data compression method is adopted: the key point information of each driving frame is compressed within the frame according to the preset compression rule, and the key point information of each driving frame is compressed between frames according to the relationship between the key point information of the source reference frame and that of each driving frame, together with the preset compression rule.
- the key point data compression may include compression of pixel data, compression of position coordinate data, conversion of data type, etc., which are not limited here.
- This scheme further compresses the key point information using preset compression rules, which can greatly improve the compression efficiency based on the key point information obtained by the original compression, achieve the purpose of deep compression of the data, and thus achieve an ultra-low bit rate encoding effect.
- the bitstream data may be data obtained by encoding and compressing the source reference frame and each driving frame in the video to be encoded.
- the video to be encoded may be transmitted to the user terminal device in the form of bitstream data.
- the bitstream data is generated based on the image encoding result of the source reference frame, the key point information encoding result of the source reference frame, and the key point information compression result of each driving frame.
- the source reference frame can be directly used to generate each drive frame by unidirectional reference to restore the video to be encoded. For example: based on the key point information of the source reference frame and the preset compression rules, each drive frame image adjacent to or not adjacent to the source reference frame is generated through dense motion field calculation.
- when the user terminal device receives the bitstream, for videos with occasional facial occlusion, moderate motion, and a simple background, the content information, position information, and data information of the key points of the source reference frame and the driving frames may vary greatly, that is, the key point information of the source reference frame is only weakly correlated with that of the driving frames; after such a video is compressed, the driving frame images cannot be accurately restored by unidirectional reference to the source reference frame alone. Therefore, a backward reference can be added on the basis of FOM, and a bidirectional reference driving strategy is adopted to generate the final image of each driving frame.
- the scene change detection method includes a scene detection network and the calculation of mutual information between adjacent frames. By detecting scene changes, it is judged whether the estimated update probability of the current frame is greater than the mean update probability predicted for the frames encoded before it, and whether the mutual information between the current frame and its adjacent frame is greater than the mean mutual information computed over the encoded frames, so as to indicate a more accurate position for updating the source reference; this finally yields a certain improvement in subjective quality and mitigates poor generation results.
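- The following is a minimal sketch, assuming grayscale uint8 frames, of a histogram-based mutual information measure between adjacent frames, one of the two scene-change cues mentioned above; the bin count and the decision rule (flagging a source reference update when the mutual information falls below the running mean of previously encoded frames) are illustrative assumptions rather than the exact criterion of the embodiment.

```python
import numpy as np

def mutual_information(frame_a, frame_b, bins=64):
    """Histogram-based mutual information between two grayscale frames (uint8 arrays)."""
    joint, _, _ = np.histogram2d(frame_a.ravel(), frame_b.ravel(), bins=bins)
    pxy = joint / joint.sum()
    px = pxy.sum(axis=1, keepdims=True)
    py = pxy.sum(axis=0, keepdims=True)
    nz = pxy > 0                                    # avoid log(0)
    return float(np.sum(pxy[nz] * np.log(pxy[nz] / (px @ py)[nz])))

def needs_reference_update(cur, prev, mi_history):
    """Flag a source-reference update when the current MI drops below the running mean."""
    mi = mutual_information(cur, prev)
    update = len(mi_history) > 0 and mi < np.mean(mi_history)
    mi_history.append(mi)
    return update
```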
- the two-way reference driving strategy can also be adopted according to the user's demand for video restoration quality, and no excessive restrictions are made here.
- the method further includes:
- the code stream data is directly transmitted to the corresponding terminal.
- the transcoding engine can be used to receive the code stream data generated by the above scheme, and to perform transcoding and resolution adaptation when necessary.
- the transcoding engine can be provided with multiple output ports for forwarding the code stream data to different user terminal devices.
- the transcoding engine can identify whether there is a user terminal device whose transmission conditions are incompatible with the code stream data, that is, a user terminal device that cannot recognize or transmit the code stream data.
- the target terminal may be a user terminal device to which the code stream cannot be transmitted directly.
- the target terminal parameter may be a code stream transmission parameter of the target terminal. If the transcoding engine identifies the existence of the target terminal, the code stream data is decoded and re-encoded according to the target terminal parameter, and then the converted code stream data is forwarded to the corresponding target terminal. Since the transcoding engine can be connected to multiple different user terminal devices at the same time, the encoding method and encoding standard re-encoded by the transcoding engine may be the same or different for different target terminals. If the transcoding engine does not identify the existence of the target terminal, the code stream data is directly transmitted to the corresponding user terminal device for the user terminal device to regenerate the video.
- the transcoding engine and the user terminal device can decode the source reference frame code stream to regenerate the source reference frame image, and regenerate each drive frame image in the order of the drive frame according to the regenerated source reference frame and code stream data to restore the video.
- the transcoding engine and the user terminal device can determine whether there is a frame loss phenomenon according to the sequence number of each drive frame in the code stream, and if so, regenerate the lost frame through encoding according to the adjacent drive frames.
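- A minimal sketch of how frame loss could be detected from the driving-frame sequence numbers and how the lost frame's key points could be recovered from adjacent driving frames before regeneration; the linear interpolation used here is an illustrative assumption rather than the specific regeneration method of the embodiment.

```python
def find_lost_frames(received_seq_numbers):
    """Return the driving-frame sequence numbers missing from a received code stream."""
    received = sorted(set(received_seq_numbers))
    present = set(received)
    return [n for n in range(received[0], received[-1] + 1) if n not in present]

def regenerate_lost_keypoints(kp_prev, kp_next):
    """Hypothetical recovery: interpolate the key point data of a lost frame
    from its adjacent driving frames before feeding it to the generator."""
    return {name: (kp_prev[name] + kp_next[name]) / 2.0 for name in kp_prev}
```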
- This solution uses a transcoding engine to identify whether there are terminals that cannot recognize the code stream data, and converts the code stream data based on the target terminal parameters, and then transmits the converted code stream data to the target terminal. This can avoid the problem of being unable to receive code stream data due to different configurations of user terminal devices, and improves the accuracy and efficiency of user terminal devices in receiving and restoring transmitted videos.
- the technical solution provided in the embodiment of the present application obtains a video to be encoded; uses a key point detection network to output key point information for a source reference frame and a driving frame of the video to be encoded; determines a key point information compression result of each driving frame based on the key point information of the source reference frame and a preset compression rule; and generates code stream data for transmission based on the source reference frame, the key point information of the source reference frame and the key point information compression result of each driving frame.
- the above-mentioned AI-based video encoding method solves the problems of poor adaptability of the encoding model, high difficulty in model training and high transmission bit rate when encoding videos using the existing AI video encoding technology.
- the key point detection network is used to output key point information of the source reference frame and the driving frame of the encoded video, and the key point information compression result of each driving frame is determined according to the key point information of the source reference frame and the preset compression rules. Then, based on the key point information of the source reference frame, the source reference frame and the key point information compression result of each driving frame, the code stream data is generated for transmission, so that the effect of ultra-low bit rate video encoding can be achieved.
- this solution also improves the scope of application of the AI video encoding model and reduces the difficulty and cost of model development and training.
- FIG. 2 is a schematic diagram of the framework of an AI-based video encoding system provided in an embodiment of the present application. As shown in Figure 2, it specifically includes: an acquisition/pre-processing module, an AI Enc module, a transcoding engine module and an AIDec module.
- the acquisition/pre-processing module is used to perform pre-processing operations such as classification, enhancement and highlighting on the video to be encoded, and transmit the processed video to be encoded to the AI Enc module.
- the AI Enc module is used to perform encoding and compression operations on the video to be encoded, including a traditional encoding link and two AI encoding links.
- the traditional encoding link runs from the Codec Engine module to the mixing module.
- the Codec Engine module is used to encode the source reference frame image in the video to be encoded, so that the source reference frame image can enter the mixing module and be transmitted to the user end.
- the mixing module is used to mix the image encoding results of the source reference frame with the image encoding results of each driving frame and generate a code stream for the transmission of the encoded video.
- One of the AI encoding links is connected to the Codec Engine module, which is used to encode and compress the source reference frame image, including the DPB (Decoded Picture Buffer) module, the key point detection Net, and the generated information compression module.
- the DPB module is used to decode and cache the source reference frame image encoded by the Codec Engine module, and transmit the decoded source reference frame image to the key point detection Net for the key point detection Net to detect the key point information of the source reference frame image. It can be understood that the source reference frame image can also be directly input into the key point detection Net here.
- the generated information compression module is used to encode and compress the key point information detected by the key point detection Net, and transmit the compressed key point information to the mixing module in the traditional encoding link.
- Another AI encoding link includes a key point detection Net and AIDec, wherein AIDec includes a sparse motion field Net and a dense motion field Net.
- the key point detection Net is used to perform key point information detection operations on each drive frame image, and transmit the detected key point information to the generated information compression module.
- AIDec is used to regenerate each frame image based on the detected key point information, and transmit the generated frame image to the identification Net, forming a generative adversarial network with the identification Net to identify the reliability and accuracy of the generated frame.
- the sparse motion field Net in AIDec is used to obtain the specific data of the key points in the key point information of each frame image
- the dense motion field Net is used to regenerate each frame image based on the data.
- the transcoding engine module is used to transcode the code stream data and regenerate the lost frame data, including Codec Dec and AIDec.
- Codec Dec is used to generate the source reference frame image based on the code stream data and the source reference frame image coding data.
- AIDec is used to regenerate the lost frame based on the code stream data and the generated adjacent reference frame or driving frame image.
- the AIDec module is used to receive the code stream data directly transmitted by the AI Enc module or the code stream data forwarded by the transcoding engine module, including Codec Dec and AIDec.
- Codec Dec is used to generate the source reference frame image according to the code stream data and the source reference frame image coding data.
- AIDec is used to regenerate the lost frame according to the code stream data and the generated adjacent reference frame or driving frame image.
- FIG3 is a flowchart of an AI-based video encoding method provided in an embodiment of the present application. As shown in FIG3 , the method specifically includes the following steps:
- the key point coordinates may be the coordinates of each key point in the source reference frame and each driving frame in the same coordinate system, which represent the position of each key point.
- the change of each frame image can be determined by the change of the key point coordinates.
- the Jacobian matrix can be the Jacobian matrix corresponding to the key point coordinates, that is, the first-order partial derivatives associated with the key point coordinates arranged in a certain way; its determinant is called the Jacobian determinant, and the Jacobian matrix reflects the optimal linear approximation of a differentiable function near a given point.
- the sparse motion field in the key point detection network is used to obtain and output the key point coordinates for the source reference frame and the driving frames of the video to be encoded, and the Jacobian matrices corresponding to the key point coordinates are calculated.
- the source Jacobian matrix may be the Jacobian matrix corresponding to the coordinates of each source key point of the source reference frame
- the driving Jacobian matrix may be the Jacobian matrix corresponding to the coordinates of each driving key point of each driving frame.
- the coordinates of each driving key point may be the same or different, so the driving Jacobian matrix may also be the same or different accordingly.
- the key point detection network outputs the key point coordinates and the Jacobian matrices corresponding to the coordinates, from which the source Jacobian matrix corresponding to the coordinates of each source key point of the source reference frame and the driving Jacobian matrix corresponding to the coordinates of each driving key point of each driving frame are read.
- the first preset compression rule may be a rule for compressing the driving Jacobian matrix of each driving frame, may be a rule for performing residual calculation on the driving Jacobian matrices of adjacent driving frames, or may be a rule for performing residual calculation on each driving Jacobian matrix and the source Jacobian matrix, etc., which are not limited in detail here.
- the driving Jacobian matrix of each driving frame may be compressed to obtain a compression result of the driving Jacobian matrix.
- the method before compressing the driving Jacobian matrix of each driving frame according to the source Jacobian matrix and the first preset compression rule to obtain the driving Jacobian matrix compression result, the method further includes:
- the source Jacobian matrix and the driving Jacobian matrices are converted from Float32 data to Float16 data.
- the driving Jacobian matrix of each driving frame is compressed according to the source Jacobian matrix and the first preset compression rule.
- the precision of the source Jacobian matrix and the driving Jacobian matrix can be reduced.
- the matrix precision is reduced by converting the format of the source Jacobian matrix and the driving Jacobian matrices from Float32 data to Float16 data. At this point, the amount of data required to compute each driving frame is 100 bytes, and the compressed size is about 1/5.09 of the source video.
- in this way, the precision of the Jacobian matrix can be reduced, thereby achieving a preliminary compression of the video.
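- A small NumPy sketch of the precision reduction step described above; the assumed layout (10 key points, each with a 2x2 Jacobian) follows a typical FOM configuration and is only an illustration, not the patented data format.

```python
import numpy as np

# Assumed layout: 10 key points, each with a 2x2 Jacobian (illustrative only).
jacobians_f32 = np.random.uniform(-1.0, 1.0, size=(10, 2, 2)).astype(np.float32)

# Precision reduction: Float32 -> Float16 halves the storage of the Jacobians.
jacobians_f16 = jacobians_f32.astype(np.float16)

print(jacobians_f32.nbytes, "->", jacobians_f16.nbytes)  # 160 -> 80 bytes per frame
```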
- the technical solution provided by the embodiment of the present application uses a key point detection network to output key point coordinates and Jacobian matrices corresponding to the coordinates for the source reference frame and the driving frames of the video to be encoded, reads the source Jacobian matrix corresponding to the coordinates of each source key point of the source reference frame and the driving Jacobian matrix corresponding to the coordinates of each driving key point of each driving frame, and then compresses the driving Jacobian matrix of each driving frame according to the source Jacobian matrix and the first preset compression rule to obtain the driving Jacobian matrix compression result, which can improve the simplicity and efficiency of compressing the driving-frame Jacobian matrices and reduce the difficulty and cost of video coding model development and training.
- FIG4 is a flowchart of an AI-based video encoding method provided in an embodiment of the present application. As shown in FIG4 , the method specifically includes the following steps:
- the frame sequence can be the arrangement sequence of the source reference frame and each driving frame determined according to the playback order of the video image frame.
- the frame sequence can be stored in each frame's code in the form of numbers, symbols, or character strings. Since the overall fluctuation range of the Jacobian matrix corresponding to the same key point position across the source reference frame and the driving frames does not exceed [-1, 1], the Jacobian matrix of each driving frame can be further compressed as follows: according to the frame sequence of the source reference frame and the driving frames, the driving Jacobian matrix of each driving frame is subtracted from the Jacobian matrix of the source reference frame, frame by frame, to obtain the Jacobian residual matrix of each driving frame, and the Jacobian residual matrix is used as the driving Jacobian matrix compression result. In this way, the Float16 Jacobian matrix can be compressed into a Uint8 Jacobian residual matrix; at this point, the size of each driving frame is 60 bytes, and the compressed size is about 1/8.28 of the source video.
- the technical solution provided in the embodiment of the present application subtracts the driving Jacobian matrix of each driving frame from the Jacobian matrix of the source reference frame, frame by frame according to the frame sequence of the source reference frame and the driving frames, to obtain the Jacobian residual matrix of each driving frame as the driving Jacobian matrix compression result, thereby further compressing the Jacobian matrix of each driving frame and improving both the degree of video compression and the video encoding effect.
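- The following NumPy sketch illustrates the inter-frame residual step described above, assuming the residual lies in [-1, 1] as stated; the linear mapping to Uint8 and back is an illustrative quantization choice, not the exact scheme of the embodiment.

```python
import numpy as np

def jacobian_residual_uint8(source_jac_f16, driving_jac_f16):
    """Quantize the per-frame Jacobian residual (assumed to lie in [-1, 1]) to Uint8.
    The linear mapping [-1, 1] -> [0, 255] is an illustrative choice."""
    residual = driving_jac_f16.astype(np.float32) - source_jac_f16.astype(np.float32)
    residual = np.clip(residual, -1.0, 1.0)
    return np.round((residual + 1.0) * 127.5).astype(np.uint8)

def jacobian_from_residual(source_jac_f16, residual_u8):
    """Decoder-side reconstruction of the driving Jacobian from the Uint8 residual."""
    residual = residual_u8.astype(np.float32) / 127.5 - 1.0
    return (source_jac_f16.astype(np.float32) + residual).astype(np.float16)
```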
- FIG5 is a flowchart of a key point coordinate compression provided by an embodiment of the present application, as shown in FIG5 , specifically including the following steps:
- the source key point coordinates may be the coordinates of each key point in the source reference frame; the driving key point coordinates may be the coordinates of each key point in each driving frame.
- the coordinates of each key point may be obtained through the sparse motion field in the key point detection network.
- the second preset compression rule may be a rule for compressing the source key point coordinates and the driving key point coordinates, and may, for example, exploit the coding redundancy of the key point coordinates within a frame or between frames, which is not limited here. Since the key point coordinates are normalized to [-1, 1], the key point coordinates can be mapped from [-1, 1] to [0, 255] to compress them.
- the second preset compression rule can be used to convert the format of the source key point coordinates and the driving key point coordinates, so that the key point coordinates are converted from the Float32 type to the target Uint8 type, and the compression result of the source key point coordinates and the compression results of the driving key point coordinates are then obtained.
- the key point data can be compressed to the maximum extent by multiplexing the key point data.
- the key points can be sampled at a fixed interval of 4 frames. At this time, the compressed video is about 1/27.67 of the source video, achieving a greater degree of compression and improving the effect of video encoding.
- the technical solution provided in the embodiment of the present application reads the coordinates of each source key point of the source reference frame and the coordinates of each driving key point of each driving frame, and adopts the second preset compression rule to convert the format of the source key point coordinates and the driving key point coordinates, converting the Float32 type data into the Uint8 type data, and obtaining the compression result of the source key point coordinates and the compression result of the driving key point coordinates, which can increase the degree of video compression, reduce the compression bit rate, improve the effect of video encoding, and reduce the difficulty of video encoding model development and training.
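- A small sketch of the coordinate quantization and fixed-interval key point reuse described above; the clipping and the exact [-1, 1] to [0, 255] mapping are illustrative choices under the stated normalization assumption.

```python
import numpy as np

def quantize_keypoints(kp_f32):
    """Map normalized key point coordinates from [-1, 1] to Uint8 values in [0, 255]."""
    return np.round((np.clip(kp_f32, -1.0, 1.0) + 1.0) * 127.5).astype(np.uint8)

def dequantize_keypoints(kp_u8):
    """Inverse mapping used at the decoding end."""
    return kp_u8.astype(np.float32) / 127.5 - 1.0

def sample_keypoint_frames(frame_indices, interval=4):
    """Reuse key point data by transmitting it only every `interval` frames
    (a fixed interval of 4 frames, as in the example above)."""
    return [i for i in frame_indices if i % interval == 0]
```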
- FIG6 is a flow chart of a reconstructed image evaluation method provided in an embodiment of the present application. As shown in FIG6 , the method specifically includes the following steps:
- the generator may be a model for mapping a noise signal (usually a random number) into a sample similar to real data, and specifically, may be a model for outputting an image, such as a fully connected neural network and a deconvolution network.
- the generator is used to generate a reconstructed image of the current frame, including a sparse motion field network and a dense motion field network.
- the sparse motion field network acts on the source reference frame image and each driving frame image to obtain the key point information of each frame; the dense motion field acts on the key point information received by the generator and the source reference frame to obtain an intermediate product.
- the intermediate product may be the change content or data of the key point information of the source reference frame and the corresponding key point information of the driving frame, which is used to guide the generator to generate the reconstructed image of the current frame.
- the image generation unit may be a unit for generating a reconstructed image of the current frame, having functions such as data reception, data processing, and image processing.
- the reconstructed image of the current frame may be an image generated by the image generation unit based on the source reference frame image or the driving frame image generated in the previous frame and the key point information of the current frame.
- the source reference frame and each of the driving frames are input into the generator to generate key point information of the source reference frame and key point information of the driving frame through the sparse motion field network of the generator, and the intermediate product is output through the dense motion field network of the generator, and a reconstructed image of the current frame is generated through the image generation unit.
- the intermediate product includes: a motion optical flow field and an occlusion map
- an image generation unit including:
- the motion optical flow field, the occlusion map and the source image are input into an image generation unit to output a reconstructed image of the current frame.
- the motion optical flow field may be the motion of an object between continuous sequence frames, which is caused by the relative motion between the object and the camera.
- the motion optical flow field includes key point information of the observed object in the video and the key point motion trend, etc., and is used to calculate the pixel motion information of each key point between adjacent frames based on the source reference frame and the pixel changes in each driving frame image sequence in the time domain and the correlation between adjacent frames.
- the occlusion map is used to indicate which parts can be obtained by pixel displacement of the source image and which parts need to be obtained by padding through the context when the image generation unit generates the reconstructed image of the current frame.
- the motion optical flow field, the occlusion map and the source reference frame image are input to the image generation unit of the generator to output the reconstructed image of the current frame.
- the efficiency and accuracy of regenerating the driving frame image can be improved.
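- The following PyTorch-style sketch shows one plausible way an image generation unit could combine the motion optical flow field, the occlusion map, and the source reference frame; the tensor shapes, the grid_sample-based warping, and the inpaint_net module are assumptions for illustration, not the exact architecture of the embodiment.

```python
import torch
import torch.nn.functional as F

def generate_frame(source, flow_grid, occlusion, inpaint_net):
    """Illustrative image generation step:
    warp the source reference frame with the dense motion optical flow field,
    then use the occlusion map to decide which pixels come from the warped
    source and which must be filled in from context.

    source:    (N, 3, H, W) source reference frame
    flow_grid: (N, H, W, 2) sampling grid in normalized [-1, 1] coordinates
    occlusion: (N, 1, H, W) soft occlusion map in [0, 1]
    inpaint_net: any module that fills in the occluded regions (assumed here)
    """
    warped = F.grid_sample(source, flow_grid, mode="bilinear", align_corners=True)
    visible = occlusion * warped                       # parts obtainable by pixel displacement
    filled = (1.0 - occlusion) * inpaint_net(warped)   # parts that need context filling
    return visible + filled
```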
- the discriminator can be used to evaluate the generation quality of the current frame reconstructed image generated by the generator, and can be judged by the similarity between the current frame reconstructed image and the actual current frame image.
- the input of the discriminator is the reconstructed image of the current frame, and the output is the authenticity label of the image, that is, the similarity. Therefore, the reconstructed image is input to the discriminator, and the evaluation result of the discriminator on the reconstructed image can be obtained.
- the adversarial neural network is a generative model, which consists of two parts, a generator and a discriminator. They play against each other and confront each other, and generate high-quality data through this confrontation. In the adversarial neural network training process, through the continuous training and updating of the discriminator and the generator, the discriminator network can gradually become more accurate, and the data finally generated by the generator will also be closer to the real data.
- the technical solution provided in the embodiment of the present application generates key point information of the source reference frame and key point information of the driving frame through a sparse motion field network of the generator, outputs an intermediate product through a dense motion field network of the generator, and generates a reconstructed image of the current frame through an image generation unit, and inputs the reconstructed image into a discriminator to obtain an evaluation result of the reconstructed image, which can improve the accuracy of the current frame image regeneration, and thereby improve the reliability of the restored encoded video.
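- A minimal adversarial training step consistent with the generator/discriminator description above; the generator(source, driving) signature, the binary cross-entropy and L1 losses, and the update schedule are illustrative assumptions and may differ from the training procedure actually used.

```python
import torch
import torch.nn.functional as F

def adversarial_step(generator, discriminator, g_opt, d_opt, source, driving):
    # --- discriminator update: real driving frame vs. detached reconstruction ---
    recon = generator(source, driving).detach()
    real_logits = discriminator(driving)
    fake_logits = discriminator(recon)
    d_loss = (F.binary_cross_entropy_with_logits(real_logits, torch.ones_like(real_logits)) +
              F.binary_cross_entropy_with_logits(fake_logits, torch.zeros_like(fake_logits)))
    d_opt.zero_grad(); d_loss.backward(); d_opt.step()

    # --- generator update: fool the discriminator and stay close to the real frame ---
    recon = generator(source, driving)
    fake_logits = discriminator(recon)
    g_loss = (F.binary_cross_entropy_with_logits(fake_logits, torch.ones_like(fake_logits)) +
              F.l1_loss(recon, driving))
    g_opt.zero_grad(); g_loss.backward(); g_opt.step()
    return d_loss.item(), g_loss.item()
```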
- FIG7 is a structural block diagram of an AI-based video encoding device provided in an embodiment of the present application.
- the device is used to execute the AI-based video encoding method provided in the above embodiment, and has functional modules and beneficial effects corresponding to the execution method.
- the device specifically includes: a video acquisition module 601, a key point information output module 602, a key point information compression module 603, and a code stream data transmission module 604, wherein:
- the video acquisition module 601 is used to acquire the video to be encoded
- a key point information output module 602 is used to output key point information using a key point detection network for the source reference frame and the driving frame of the video to be encoded;
- the key point information compression module 603 is used to determine the key point information compression result of each driving frame according to the key point information of the source reference frame and the preset compression rule;
- the code stream data transmission module 604 is used to generate code stream data for transmission based on the compression result of the source reference frame, the key point information of the source reference frame and the key point information of each driving frame.
- the key point information includes: key point coordinates and source Jacobian matrix corresponding to the coordinates;
- the key point information compression module 603 includes:
- a Jacobian matrix reading unit used for reading a source Jacobian matrix corresponding to the coordinates of each source key point of a source reference frame, and reading a driving Jacobian matrix corresponding to the coordinates of each driving key point of each driving frame;
- the Jacobian matrix compression unit is used to compress the driving Jacobian matrix of each driving frame according to the source Jacobian matrix and a first preset compression rule to obtain a driving Jacobian matrix compression result.
- the Jacobian matrix compression unit is specifically used to:
- each driving Jacobian matrix of each driving frame is subtracted from the Jacobian matrix of the source reference frame frame by frame according to the frame sequence to obtain the Jacobian residual matrix of each driving frame as the driving Jacobian matrix compression result.
- the Jacobian matrix compression unit is further used for:
- the source Jacobian matrix and the driving Jacobian matrices are converted from Float32 data to Float16 data.
- the key point information compression module 603 further includes:
- a key point coordinate reading unit used for reading the coordinates of each source key point of a source reference frame, and reading the coordinates of each driving key point of each driving frame;
- the key point coordinate compression unit is used to adopt a second preset compression rule to convert the format of the source key point coordinates and the driving key point coordinates, converting the Float32 type data into the Uint8 type data, and obtain the compression result of the source key point coordinates, as well as the compression result of each driving key point coordinate.
- the video acquisition module 601 includes:
- a reconstructed image generation unit used to input the source reference frame and each of the driving frames into a generator, so as to generate key point information of the source reference frame and key point information of the driving frame through the sparse motion field network of the generator, output an intermediate product through the dense motion field network of the generator, and generate a reconstructed image of the current frame through the image generation unit;
- the reconstructed image evaluation unit is used to input the reconstructed image into the discriminator to obtain an evaluation result of the reconstructed image; wherein the generator and the discriminator constitute an adversarial neural network.
- the intermediate product includes: a motion optical flow field and an occlusion map
- the reconstructed image generating unit is specifically used for:
- the motion optical flow field, the occlusion map and the source image are input into an image generation unit to output a reconstructed image of the current frame.
- the code stream data transmission module 604 is further used to:
- the code stream data is directly transmitted to the corresponding terminal.
- the technical solution provided in the embodiment of the present application includes a video acquisition module for acquiring a video to be encoded; a key point information output module for outputting key point information of a source reference frame and a driving frame of the video to be encoded using a key point detection network; a key point information compression module for determining a key point information compression result of each driving frame according to the key point information of the source reference frame and a preset compression rule; and a code stream data transmission module for generating code stream data for transmission based on the source reference frame, the key point information of the source reference frame and the key point information compression result of each driving frame.
- the above-mentioned AI-based video encoding device solves the problems of poor adaptability of the encoding model, high difficulty in model training and high transmission bit rate when encoding videos using the existing AI video encoding technology.
- the key point detection network is used to output key point information of the source reference frame and the driving frame of the encoded video, and the key point information compression result of each driving frame is determined according to the key point information of the source reference frame and the preset compression rules. Then, based on the key point information of the source reference frame, the source reference frame and the key point information compression result of each driving frame, the code stream data is generated for transmission, thereby achieving the effect of ultra-low bit rate video encoding.
- this solution also broadens the scope of application of the AI video encoding model and reduces the difficulty and cost of model development and training.
- FIG8 is a schematic diagram of the structure of an AI-based video encoding device provided in an embodiment of the present application.
- the device includes a processor 701, a memory 702, an input device 703, and an output device 704; the number of processors 701 in the device may be one or more, and FIG. 8 takes one processor 701 as an example; the processor 701, the memory 702, the input device 703, and the output device 704 in the device may be connected by a bus or in other ways, and FIG. 8 takes a bus connection as an example.
- the memory 702, as a computer-readable storage medium can be used to store software programs, computer executable programs, and modules, such as program instructions/modules corresponding to the AI-based video encoding method in the embodiment of the present application.
- the processor 701 executes various functional applications and data processing of the device by running the software programs, instructions, and modules stored in the memory 702, that is, realizing the above-mentioned AI-based video encoding method.
- the input device 703 can be used to receive input digital or character information, and to generate key signal input related to the user settings and function control of the device.
- the output device 704 may include a display device such as a display screen.
- the embodiment of the present application further provides a storage medium containing computer executable instructions, wherein the computer executable instructions are used to execute an AI-based video encoding method described in the above embodiment when executed by a computer processor, including:
- a video to be encoded is acquired; a key point detection network is used to output key point information for the source reference frame and the driving frames of the video to be encoded; a key point information compression result of each driving frame is determined according to the key point information of the source reference frame and a preset compression rule; and code stream data is generated for transmission based on the source reference frame, the key point information of the source reference frame, and the key point information compression result of each driving frame.
- the various units and modules included are only divided according to functional logic, but are not limited to the above-mentioned division, as long as the corresponding functions can be achieved; in addition, the specific names of the functional units are only for the convenience of distinguishing each other, and are not used to limit the protection scope of the embodiments of the present application.
- various aspects of the method provided in this application may also be implemented in the form of a program product, which includes a program code.
- when the program product is run on a computer device, the program code is used to enable the computer device to perform the steps of the method according to the various exemplary embodiments of the present application described above in this specification.
- the computer device may execute the AI-based video encoding method described in the embodiment of this application.
- the program product may be implemented in any combination of one or more readable media.
Landscapes
- Engineering & Computer Science (AREA)
- Physics & Mathematics (AREA)
- Theoretical Computer Science (AREA)
- Multimedia (AREA)
- Signal Processing (AREA)
- Artificial Intelligence (AREA)
- Evolutionary Computation (AREA)
- General Physics & Mathematics (AREA)
- Biomedical Technology (AREA)
- Molecular Biology (AREA)
- Health & Medical Sciences (AREA)
- Biophysics (AREA)
- Computational Linguistics (AREA)
- Data Mining & Analysis (AREA)
- General Health & Medical Sciences (AREA)
- Life Sciences & Earth Sciences (AREA)
- Computing Systems (AREA)
- General Engineering & Computer Science (AREA)
- Mathematical Physics (AREA)
- Software Systems (AREA)
- Compression Or Coding Systems Of Tv Signals (AREA)
- Two-Way Televisions, Distribution Of Moving Picture Or The Like (AREA)
Abstract
The embodiments of the present application provide an AI-based video encoding method, apparatus, device, storage medium, and product. The method includes: acquiring a video to be encoded; outputting key point information for a source reference frame and driving frames of the video to be encoded using a key point detection network; determining a key point information compression result of each driving frame according to the key point information of the source reference frame and a preset compression rule; and generating code stream data for transmission based on the source reference frame, the key point information of the source reference frame, and the key point information compression result of each driving frame. This solution achieves ultra-low bit rate video encoding: by deeply compressing the video data and combining an AI model to implement the data encoding process, a better video compression effect can be obtained; at the same time, this solution also broadens the scope of application of the AI video encoding model and reduces the difficulty and cost of model development and training.
Description
This application claims priority to Chinese patent application No. 202310396495.3, filed with the China National Intellectual Property Administration on April 12, 2023, the entire contents of which are incorporated herein by reference.
The embodiments of the present application relate to the field of video processing technology, and in particular to an AI-based video encoding method, apparatus, device, storage medium, and product.
With the continuous development of the Internet industry, video is used in an increasingly wide range of scenarios, for example, intelligent security, autonomous driving, smart cities, and the industrial Internet. More and more fields use video as an aid to their work, and more and more people treat sharing and watching videos as a form of entertainment. Because the amount of data in a video is very large, the video must be encoded so that the video data can be successfully transmitted and presented on a terminal device. Therefore, research on video coding technology, especially AI-based video coding technology, has become very important.
In the related art, AI video coding methods are mainly deep video coding (DVC) and machine vision coding. Deep video coding (DVC) is an end-to-end video coding model in which the entire video compression framework is implemented by neural networks and can be trained as a whole; by replacing the modules of a traditional coding framework with deep neural networks, it achieves video coding and image frame reconstruction. Machine vision coding is a video coding technology aimed at intelligent applications; it combines video encoding and decoding with machine vision analysis and uses an end-to-end network system to perform video encoding and decoding tasks so that machines can complete visual tasks.
Because the DVC framework is implemented by neural networks and the deep learning methods used are mainly based on offline optimization, it suffers from poor model adaptability, complex model deployment and implementation, and a high transmission bit rate. Meanwhile, in addition to video encoding and decoding, a machine vision coding model must, more importantly, use the decoded video to complete machine vision tasks, so it suffers from strong model specificity and high training difficulty. Moreover, in the related art, AI video coding technology usually requires a large floating-point data representation, which shows no obvious advantage over P-frame compression in High Efficiency Video Coding (HEVC). Therefore, when existing AI video coding technology is used to encode videos, there are problems of poor adaptability of the coding model, high difficulty of model training, and a high transmission bit rate.
Summary of the Invention
The embodiments of the present application provide an AI-based video encoding method, apparatus, device, storage medium, and product, which solve the problems of poor adaptability of the coding model, high difficulty of model training, and a high transmission bit rate when encoding videos with existing AI video coding technology. A key point detection network is used to output key point information for the source reference frame and the driving frames of the video to be encoded, the key point information compression result of each driving frame is determined according to the key point information of the source reference frame and a preset compression rule, and code stream data is then generated for transmission based on the source reference frame, the key point information of the source reference frame, and the key point information compression result of each driving frame, thereby achieving ultra-low bit rate video encoding. By deeply compressing the video data and combining an AI model to implement the data encoding process, a better video compression effect can be obtained; at the same time, this solution also broadens the scope of application of the AI video encoding model and reduces the difficulty and cost of model development and training.
In a first aspect, an embodiment of the present application provides an AI-based video encoding method, the method including:
acquiring a video to be encoded;
outputting key point information for a source reference frame and driving frames of the video to be encoded using a key point detection network;
determining a key point information compression result of each driving frame according to the key point information of the source reference frame and a preset compression rule; and
generating code stream data for transmission based on the source reference frame, the key point information of the source reference frame, and the key point information compression result of each driving frame.
In a second aspect, an embodiment of the present application further provides an AI-based video encoding apparatus, including:
a video acquisition module, configured to acquire a video to be encoded;
a key point information output module, configured to output key point information for a source reference frame and driving frames of the video to be encoded using a key point detection network;
a key point information compression module, configured to determine a key point information compression result of each driving frame according to the key point information of the source reference frame and a preset compression rule; and
a code stream data transmission module, configured to generate code stream data for transmission based on the source reference frame, the key point information of the source reference frame, and the key point information compression result of each driving frame.
In a third aspect, an embodiment of the present application further provides an AI-based video encoding device, the device including:
one or more processors; and
a storage apparatus configured to store one or more programs,
wherein, when the one or more programs are executed by the one or more processors, the one or more processors implement the AI-based video encoding method described in the embodiments of the present application.
In a fourth aspect, an embodiment of the present application further provides a storage medium storing computer-executable instructions which, when executed by a computer processor, are used to perform the AI-based video encoding method described in the embodiments of the present application.
In a fifth aspect, an embodiment of the present application further provides a computer program product including a computer program stored in a computer-readable storage medium; at least one processor of a device reads and executes the computer program from the computer-readable storage medium, so that the device performs the AI-based video encoding method described in the embodiments of the present application.
In the embodiments of the present application, a video to be encoded is acquired; a key point detection network is applied to the source reference frame and the driving frames of the video to be encoded to output key point information; a key point information compression result of each driving frame is determined according to the key point information of the source reference frame and a preset compression rule; and bitstream data is generated for transmission based on the source reference frame, the key point information of the source reference frame, and the key point information compression results of the driving frames. The above AI-based video encoding method solves the problems of poor coding-model adaptability, difficult model training and high transmission bitrate when existing AI video coding technology is used to encode video, achieves ultra-low-bitrate video encoding through deep compression of the video data combined with an AI model for the encoding process, obtains better video compression, and at the same time broadens the applicability of the AI video encoding model and reduces the difficulty and cost of model development and training.
FIG. 1 is a flowchart of an AI-based video encoding method provided by an embodiment of the present application;
FIG. 2 is a schematic diagram of an AI-based video encoding system framework provided by an embodiment of the present application;
FIG. 3 is a flowchart of an AI-based video encoding method provided by an embodiment of the present application;
FIG. 4 is a flowchart of an AI-based video encoding method provided by an embodiment of the present application;
FIG. 5 is a flowchart of key point coordinate compression provided by an embodiment of the present application;
FIG. 6 is a flowchart of a reconstructed image evaluation method provided by an embodiment of the present application;
FIG. 7 is a structural block diagram of an AI-based video encoding apparatus provided by an embodiment of the present application;
FIG. 8 is a schematic structural diagram of an AI-based video encoding device provided by an embodiment of the present application.
The embodiments of the present application are described in further detail below with reference to the drawings and embodiments. It can be understood that the specific embodiments described here only explain the embodiments of the present application and do not limit them. It should also be noted that, for ease of description, the drawings show only the parts related to the embodiments of the present application rather than the entire structure.
The terms "first", "second" and the like in the specification and claims of the present application are used to distinguish similar objects and are not used to describe a specific order or sequence. It should be understood that data so used are interchangeable where appropriate, so that the embodiments of the present application can be implemented in orders other than those illustrated or described here; objects distinguished by "first", "second" and the like are usually of one class, and the number of objects is not limited, for example, a first object may be one or more. In addition, "and/or" in the specification and claims denotes at least one of the connected objects, and the character "/" generally indicates an "or" relationship between the associated objects.
Some generative compression schemes for face video are based on the First Order Motion Model (FOM) framework: the first frame (reference frame) is first encoded with a traditional encoder and transmitted; the key points of each subsequent frame are then extracted at the encoding end and transmitted to the decoding end, which warps the face in the reference frame and fills in the background according to the relative displacement of the key points between the current frame and the reference frame to generate the current frame.
However, this method is only suitable for talking-head videos with simple content and very little motion. Once large head rotations, facial occlusions or additional background objects appear in the video, the quality of the generated current frame drops sharply; facial deformation, jitter and missing background objects severely affect the viewing experience. The reference frame then has to be updated frequently, which in turn increases the bitrate burden.
To explore the compression limits and application scenarios of generative compression, the present invention proposes an ultra-low-bitrate generative coding method that compresses the original key point data and reuses key point data, further reducing the key point information that the encoding end needs to transmit. It achieves the goal of a bitrate at least 10 times lower than HEVC while ensuring no obvious degradation of subjective quality.
Existing FOM-based generative compression schemes are still at an exploratory stage, and there is considerable room for improvement in bitrate compression and driving strategy. First, the key point information usually requires 240 bytes of floating-point data, which is not advantageous compared with P-frame compression in HEVC. Faced with the strict requirements of practical application scenarios on transmission bitrate and subjective quality, the achievable compression limit and effective application scenarios of generative compression still need further study.
Addressing the problems of the FOM generative compression scheme and the requirements of mobile live-streaming services, the present invention builds an end-to-end generative coding system framework. Through pre-encoding processing and the FOM generative compression link, it offers the transcoding-engine side and the viewer-side decoder two new approaches: regenerating lost frames and replacing traditionally encoded frames with generated frames, respectively.
FIG. 1 is a flowchart of an AI-based video encoding method provided by an embodiment of the present application. The method can be used in video image transmission scenarios, especially scenarios in which video is transmitted after being compressed by video encoding, and can be executed by a server, an intelligent terminal or another device with encoding and computing capability. The method specifically includes the following steps:
S101, acquiring a video to be encoded;
In one embodiment, the video to be encoded may be video data that needs to be transmitted, and the video data may consist of multiple frames of identical or different images. Encoding the video compresses it so that it can be successfully transmitted to the user terminal device.
In this solution, the video to be encoded may be acquired by a video recording device at the recording end, such as a camera. After the video is acquired, data collection and image pre-processing, such as classification, judgment and enhancement, may be performed on the images in the video, which can improve the robustness of video coding model training and the efficiency of video information acquisition.
S102, applying a key point detection network to the source reference frame and the driving frames of the video to be encoded to output key point information;
The source reference frame may be the first frame of the video to be encoded. A driving frame may be each frame located after the source reference frame in the video to be encoded. The image content and image data of each driving frame may be the same as or different from those of other driving frames, and may be the same as or different from those of the source reference frame.
In one embodiment, key points may be points that represent the features of each frame of the video to be encoded, including points that represent the overall form or content distribution of the source reference frame and the driving frames, and points that change in the following driving frame or frames. For example, in a lecture video centered on a teacher, the teacher's head and facial features may serve as key points of the video to be encoded. Key point information may be the information used to encode the video, including the content, position and data information of the key points, such as pixel data of the key points, position coordinate data, and data within a certain range around the coordinates.
The key point detection network may be a network used to obtain the key point information, such as the key point detector Kp-detector. By applying the sparse motion field network in the key point detection network to the source reference frame and the driving frames of the video to be encoded, the key point information of the source reference frame and of the driving frames can be obtained.
S103, determining a key point information compression result of each driving frame according to the key point information of the source reference frame and a preset compression rule;
The preset compression rule may be a rule for compressing the driving frames of the video to be encoded according to the video compression requirements so as to reduce the size of the video, including intra-frame compression rules and inter-frame compression rules. The key point information compression result of each driving frame may be the result of compressing the key point data of the driving frame into encoded data, including intra-frame compression results of different key point information within the same driving frame and inter-frame compression results of the same key point data across different driving frames.
In one embodiment, based on an end-to-end generative coding system model built on the First Order Motion Model (FOM), key point data compression may be used: the key point information of each driving frame is intra-frame compressed according to the preset compression rule, and inter-frame compressed according to the relationship between the key point information of the source reference frame and that of each driving frame and the preset compression rule. The key point data compression may include compression of pixel data, compression of position coordinate data, conversion of data types, and so on, which is not limited here. Through the intra-frame and inter-frame compression of the driving frames, the key point information compression result of each driving frame can be determined.
By further compressing the key point information with the preset compression rule, this solution obtains a considerable gain in compression efficiency on top of the key point information already obtained by compression, achieving deep compression of the data and thereby an ultra-low-bitrate encoding effect.
S104, generating bitstream data for transmission based on the source reference frame, the key point information of the source reference frame, and the key point information compression results of the driving frames.
The bitstream data may be the data obtained by encoding and compressing the source reference frame and the driving frames of the video to be encoded. The video to be encoded may be transmitted to the user terminal device in the form of bitstream data. The bitstream data is generated from the image encoding result of the source reference frame, the encoding result of the key point information of the source reference frame, and the key point information compression results of the driving frames.
In one embodiment, taking a face-centered video as an example, when the user terminal device receives the bitstream, for a video without facial occlusion, with little head movement and with a simple background, the content, position and data information of the key points of the source reference frame and of the driving frames change little, that is, the key point information of the source reference frame and of the driving frames is highly correlated. Therefore, the driving frames can be generated directly from the source reference frame by unidirectional reference to restore the video to be encoded, for example by generating, through dense motion field computation, the driving frame images adjacent or not adjacent to the source reference frame according to the key point information of the source reference frame and the preset compression rule.
In another embodiment, when the user terminal device receives the bitstream, for a video with occasional facial occlusion, medium motion and a simple background, the content, position and data information of the key points of the source reference frame and of the driving frames change considerably, that is, the correlation between the key point information of the source reference frame and that of the driving frames is low. After the video is compressed, the driving frame images cannot be restored accurately from the unidirectional reference of the source reference frame alone. Therefore, a backward reference can be added on top of FOM, and a bidirectional reference driving strategy can be used to generate the final image of each driving frame. For example, the key point information of the current driving frame d and of the driving frames located n frames before and after it is obtained through the sparse motion field; the driving frames n frames before and after the current driving frame serve as the forward reference frame d−n and the backward reference frame d+n; the forward reference frame d−n and the backward reference frame d+n are processed by the dense motion field to produce reference images in their respective reference directions; the reference images of the forward and backward reference frames are then fused to obtain a bidirectional fusion reference map M, and the formula
output = d−n × M + d+n × (1 − M)
guides the final generation of the current driving frame image. Compared with using only unidirectional reference, bidirectional reference brings a clear improvement in subjective quality, with less facial jitter. The update frequency of the reference source is adjusted to achieve good subjective quality, and on this basis a pre-processing module is added to detect scene changes. Scene change detection includes a scene detection network and computing the mutual information between adjacent frames: it is judged whether the estimated update probability of the current frame is greater than the mean of the update probabilities predicted for the frames already encoded before it, and whether the mutual information between the current frame and the adjacent frame is greater than the mean mutual information computed over the already encoded frames, so as to indicate a more accurate location for updating the source reference. Both approaches obtain a certain degree of subjective quality gain and improve poor generation results. For a video without facial occlusion, with little head movement and with a simple background, the bidirectional reference driving strategy may likewise be used according to the user's requirements on restoration quality, which is not limited here.
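The blending step in the formula above can be illustrated with a minimal sketch, assuming the forward and backward reference images have already been produced by the dense motion field and the fusion map M takes values in [0, 1]; the array names and shapes are illustrative only and do not come from the description.

```python
import numpy as np

def blend_bidirectional(ref_fwd: np.ndarray, ref_bwd: np.ndarray, fusion_map: np.ndarray) -> np.ndarray:
    """Blend the forward (d-n) and backward (d+n) reference images with a fusion map M.

    ref_fwd, ref_bwd: H x W x 3 reference images generated from the forward/backward reference frames.
    fusion_map:       H x W map in [0, 1]; 1 means "take the forward reference", 0 the backward one.
    Implements output = d-n * M + d+n * (1 - M) per pixel.
    """
    m = fusion_map[..., None]                      # broadcast M over the color channels
    return ref_fwd * m + ref_bwd * (1.0 - m)

# Illustrative usage with random data standing in for generated reference images.
h, w = 64, 64
ref_fwd = np.random.rand(h, w, 3).astype(np.float32)
ref_bwd = np.random.rand(h, w, 3).astype(np.float32)
fusion_map = np.random.rand(h, w).astype(np.float32)
current = blend_bidirectional(ref_fwd, ref_bwd, fusion_map)
print(current.shape)  # (64, 64, 3)
```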
In one embodiment, optionally, after the bitstream data is generated for transmission based on the source reference frame, the key point information of the source reference frame and the key point information compression results of the driving frames, the method further includes:
identifying, through a transcoding engine, whether there is a terminal that cannot recognize the bitstream data;
if so, converting the bitstream data based on target terminal parameters, and transmitting the converted bitstream data to the target terminal;
if not, transmitting the bitstream data directly to the corresponding terminal.
The transcoding engine may be a server configured to receive the bitstream data produced by the above solution and, when necessary, perform transcoding and handle resolution requirements. The transcoding engine may have multiple output ports for forwarding the bitstream data to different user terminal devices. Through the transcoding engine, it can be identified whether there is a user terminal device whose transmission conditions are inconsistent with the bitstream data, that is, a user terminal device that cannot recognize or transmit the bitstream data.
In one embodiment, the target terminal may be a user terminal device that cannot transmit the bitstream. The target terminal parameters may be the bitstream transmission parameters of the target terminal. If the transcoding engine identifies that such a target terminal exists, the bitstream data is decoded and re-encoded according to the target terminal parameters, and the converted bitstream data is forwarded to the corresponding target terminal. Because the transcoding engine may be connected to multiple different user terminal devices at the same time, the encoding method and encoding standard used by the transcoding engine for re-encoding may be the same or different for different target terminals. If the transcoding engine does not identify such a target terminal, the bitstream data is transmitted directly to the corresponding user terminal device for video regeneration.
In another embodiment, the transcoding engine and the user terminal device may decode the source reference frame bitstream to regenerate the source reference frame image, and regenerate the driving frame images in driving-frame order from the regenerated source reference frame and the bitstream data so as to restore the video. At the same time, the transcoding engine and the user terminal device may judge, from the driving frame numbers in the bitstream, whether frames have been lost; if so, the lost frames are regenerated through encoding based on the adjacent driving frames.
In this solution, by identifying through the transcoding engine whether there is a terminal that cannot recognize the bitstream data, converting the bitstream data based on the target terminal parameters and transmitting the converted bitstream data to the target terminal, the problem that bitstream data cannot be received due to different user terminal configurations can be avoided, and the accuracy and efficiency with which user terminal devices receive and restore the transmitted video are improved.
In the technical solution provided by the embodiment of the present application, a video to be encoded is acquired; a key point detection network is applied to the source reference frame and the driving frames of the video to be encoded to output key point information; a key point information compression result of each driving frame is determined according to the key point information of the source reference frame and a preset compression rule; and bitstream data is generated for transmission based on the source reference frame, the key point information of the source reference frame, and the key point information compression results of the driving frames. The above AI-based video encoding method solves the problems of poor coding-model adaptability, difficult model training and high transmission bitrate when existing AI video coding technology is used to encode video, achieves ultra-low-bitrate video encoding through deep compression of the video data combined with an AI model for the encoding process, obtains better video compression, and at the same time broadens the applicability of the AI video encoding model and reduces the difficulty and cost of model development and training.
FIG. 2 is a schematic diagram of an AI-based video encoding system framework provided by an embodiment of the present application. As shown in FIG. 2, it specifically includes a capture/pre-processing module, an AI Enc module, a transcoding engine module and an AIDec module.
The capture/pre-processing module is used for pre-processing operations such as classification, enhancement and highlighting of the video to be encoded, and transmits the processed video to the AI Enc module.
The AI Enc module encodes and compresses the video to be encoded, and includes one traditional encoding link and two AI encoding links. The traditional encoding link runs from the Codec Engine module to the stream-mixing module. The Codec Engine module encodes the source reference frame image of the video to be encoded so that the source reference frame image can enter the stream-mixing module and be transmitted to the user end. The stream-mixing module mixes the image encoding result of the source reference frame with the image encoding results of the driving frames and generates a bitstream for transmitting the encoded video.
One of the AI encoding links is connected to the Codec Engine module and encodes and compresses the source reference frame image; it includes a DPB (Decoded Picture Buffer) module, a key point detection Net and a generation information compression module. The DPB module decodes and buffers the source reference frame image encoded by the Codec Engine module and transmits the decoded source reference frame image to the key point detection Net, so that the key point detection Net can detect the key point information of the source reference frame image. It can be understood that the source reference frame image may also be fed directly into the key point detection Net. The generation information compression module encodes and compresses the key point information detected by the key point detection Net and transmits the compressed key point information to the stream-mixing module in the traditional encoding link.
The other AI encoding link includes the key point detection Net and AIDec, where AIDec includes a sparse motion field Net and a dense motion field Net. The key point detection Net detects the key point information of each driving frame image and transmits the detected key point information to the generation information compression module. AIDec regenerates each frame image according to the detected key point information and transmits the generated frame image to a discriminator Net, with which it forms a generative adversarial network, so as to assess the reliability and accuracy of the generated frame. Within AIDec, the sparse motion field Net obtains the concrete key point data in the key point information of each frame image, and the dense motion field Net regenerates each frame image from those data.
The transcoding engine module is used for transcoding the bitstream data and regenerating lost-frame data, and includes Codec Dec and AIDec. Codec Dec generates the source reference frame image from the bitstream data and the encoded data of the source reference frame image. AIDec regenerates lost frames according to the bitstream data and the adjacent reference frame or driving frame images that have already been generated.
The AIDec module receives the bitstream data transmitted directly by the AI Enc module or forwarded by the transcoding engine module, and likewise includes Codec Dec and AIDec. Codec Dec generates the source reference frame image from the bitstream data and the encoded data of the source reference frame image. AIDec regenerates lost frames according to the bitstream data and the adjacent reference frame or driving frame images that have already been generated.
FIG. 3 is a flowchart of an AI-based video encoding method provided by an embodiment of the present application. As shown in FIG. 3, it specifically includes the following steps:
S201, acquiring a video to be encoded;
S202, applying a key point detection network to the source reference frame and the driving frames of the video to be encoded to output key point coordinates and the Jacobian matrices corresponding to the coordinates;
The key point coordinates may be the coordinates of the key points in the source reference frame and in each driving frame under the same coordinate system, representing the positions of the key points; changes in the key point coordinates determine the changes of each frame image.
The Jacobian matrix may be the Jacobian matrix corresponding to a key point coordinate, that is, a matrix formed by arranging the first-order partial derivatives at the key point coordinate in a certain way; its determinant is called the Jacobian determinant, and it embodies the optimal linear approximation of a differentiable mapping at a given point. By applying the sparse motion field in the key point detection network to the source reference frame and the driving frames of the video to be encoded, the key point coordinates are obtained and output, and the Jacobian matrices corresponding to the key point coordinates are then computed.
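The per-frame key point payload implied here can be sketched as follows, assuming 10 key points, each carrying a 2-D coordinate and a 2×2 Jacobian matrix, all stored as Float32; under this assumed layout the raw size works out to the 240 bytes of floating-point data mentioned earlier, but the actual key point count and memory layout of the Kp-detector are not specified in the description.

```python
import numpy as np

NUM_KEYPOINTS = 10  # assumed; the actual number depends on the Kp-detector configuration

# Per-frame key point information: coordinates in [-1, 1] and one 2x2 Jacobian per key point.
coords = np.random.uniform(-1.0, 1.0, size=(NUM_KEYPOINTS, 2)).astype(np.float32)
jacobians = np.random.uniform(-1.0, 1.0, size=(NUM_KEYPOINTS, 2, 2)).astype(np.float32)

raw_bytes = coords.nbytes + jacobians.nbytes
print(raw_bytes)  # 10*2*4 + 10*2*2*4 = 80 + 160 = 240 bytes per frame under these assumptions
```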
S203, reading the source Jacobian matrices corresponding to the source key point coordinates of the source reference frame, and reading the driving Jacobian matrices corresponding to the driving key point coordinates of each driving frame;
The source Jacobian matrices may be the Jacobian matrices corresponding to the source key point coordinates of the source reference frame, and the driving Jacobian matrices may be the Jacobian matrices corresponding to the driving key point coordinates of each driving frame. The driving key point coordinates may be the same or different, and accordingly the driving Jacobian matrices may also be the same or different. From the key point coordinates and the corresponding Jacobian matrices output by the key point detection network, the source Jacobian matrices corresponding to the source key point coordinates of the source reference frame and the driving Jacobian matrices corresponding to the driving key point coordinates of each driving frame are read.
S204, compressing the driving Jacobian matrices of each driving frame according to the source Jacobian matrices and a first preset compression rule to obtain a driving Jacobian matrix compression result;
The first preset compression rule may be a rule for compressing the driving Jacobian matrices of the driving frames; it may compute residuals between the driving Jacobian matrices of adjacent driving frames, or compute residuals between each driving Jacobian matrix and the source Jacobian matrix, which is not limited here. According to the source Jacobian matrices and the first preset compression rule, the driving Jacobian matrices of each driving frame can be compressed to obtain a driving Jacobian matrix compression result.
In one embodiment, optionally, before compressing the driving Jacobian matrices of each driving frame according to the source Jacobian matrices and the first preset compression rule to obtain the driving Jacobian matrix compression result, the method further includes:
converting the format of the source Jacobian matrices and the driving Jacobian matrices from Float32 data to Float16 data.
In one embodiment, to compress the driving Jacobian matrices as much as possible, the precision of the source Jacobian matrices and the driving Jacobian matrices may be reduced before the compression. Specifically, the precision may be reduced by converting the format of the source Jacobian matrices and the driving Jacobian matrices from Float32 data to Float16 data. At this point the amount of data needed for each driving frame is 100 bytes, a compression ratio of about 1/5.09 relative to the source video.
In one embodiment, by converting the format of the source Jacobian matrices and the driving Jacobian matrices, the precision of the Jacobian matrices can be reduced, achieving a preliminary compression of the video.
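A minimal sketch of this precision reduction is given below, reusing the assumed 10-key-point layout from the previous sketch; the split that reproduces the 100-byte figure (coordinates quantized to one byte each, Jacobians stored as Float16) is an assumption, since the description states only the per-frame total.

```python
import numpy as np

NUM_KEYPOINTS = 10  # assumed, as above

coords = np.random.uniform(-1.0, 1.0, size=(NUM_KEYPOINTS, 2)).astype(np.float32)
jacobians = np.random.uniform(-1.0, 1.0, size=(NUM_KEYPOINTS, 2, 2)).astype(np.float32)

# Precision reduction: Jacobians go from Float32 to Float16.
jacobians_f16 = jacobians.astype(np.float16)

# Assumed companion step: coordinates quantized from [-1, 1] to Uint8 (detailed later with FIG. 5).
coords_u8 = np.round((coords + 1.0) * 127.5).astype(np.uint8)

per_frame_bytes = coords_u8.nbytes + jacobians_f16.nbytes
print(per_frame_bytes)  # 10*2*1 + 10*2*2*2 = 20 + 80 = 100 bytes under these assumptions
```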
S205, generating bitstream data for transmission based on the source reference frame, the key point information of the source reference frame, and the key point information compression results of the driving frames.
In the technical solution provided by the embodiment of the present application, the key point detection network is applied to the source reference frame and the driving frames of the video to be encoded to output key point coordinates and the Jacobian matrices corresponding to the coordinates; the source Jacobian matrices corresponding to the source key point coordinates of the source reference frame and the driving Jacobian matrices corresponding to the driving key point coordinates of each driving frame are read; and the driving Jacobian matrices of each driving frame are then compressed according to the source Jacobian matrices and the first preset compression rule to obtain a driving Jacobian matrix compression result. This improves the simplicity and efficiency of compressing the Jacobian matrices of the driving frames and reduces the difficulty and cost of developing and training the video encoding model.
FIG. 4 is a flowchart of an AI-based video encoding method provided by an embodiment of the present application. As shown in FIG. 4, it specifically includes the following steps:
S301, acquiring a video to be encoded;
S302, applying a key point detection network to the source reference frame and the driving frames of the video to be encoded to output key point coordinates and the Jacobian matrices corresponding to the coordinates;
S303, reading the source Jacobian matrices corresponding to the source key point coordinates of the source reference frame, and reading the driving Jacobian matrices corresponding to the driving key point coordinates of each driving frame;
S304, based on the frame sequence of the source reference frame and the driving frames, taking, frame by frame in the order of the frame sequence, the difference between the driving Jacobian matrices of each driving frame and the Jacobian matrices of the source reference frame to obtain the Jacobian residual matrices of each driving frame as the driving Jacobian matrix compression result;
In one embodiment, the frame sequence may be the arrangement numbers of the source reference frame and the driving frames determined by the playback order of the video frames. The frame sequence may be stored in the encoding of each frame in the form of numbers, symbols, character strings and the like. Because the overall fluctuation range of the Jacobian matrices corresponding to the same key point positions in the source reference frame and the driving frames does not exceed [-1, 1], the Jacobian matrices of the driving frames can be further compressed by taking, frame by frame according to the frame sequence, the difference between the driving Jacobian matrices of each driving frame and the Jacobian matrices of the source reference frame, obtaining the Jacobian residual matrices of the driving frames as the driving Jacobian matrix compression result. In this way the Float16 Jacobian matrices can be compressed into Uint8 Jacobian residual matrices; at this point the size of each driving frame is 60 bytes, a compression ratio of about 1/8.28 relative to the source video.
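A minimal sketch of this residual step is given below, again under the assumed 10-key-point layout; mapping the residual (bounded by [-1, 1] according to the description) linearly onto [0, 255] is one straightforward choice and is not spelled out in the description.

```python
import numpy as np

NUM_KEYPOINTS = 10  # assumed, as above

def jacobian_residual_u8(src_jac_f16: np.ndarray, drv_jac_f16: np.ndarray) -> np.ndarray:
    """Per-frame residual of the driving Jacobians against the source Jacobians, quantized to Uint8.

    Per the description the residual stays within [-1, 1], so it is mapped linearly to [0, 255].
    """
    residual = drv_jac_f16.astype(np.float32) - src_jac_f16.astype(np.float32)
    residual = np.clip(residual, -1.0, 1.0)
    return np.round((residual + 1.0) * 127.5).astype(np.uint8)

src_jac = np.random.uniform(-0.5, 0.5, size=(NUM_KEYPOINTS, 2, 2)).astype(np.float16)
drv_jac = np.random.uniform(-0.5, 0.5, size=(NUM_KEYPOINTS, 2, 2)).astype(np.float16)

res_u8 = jacobian_residual_u8(src_jac, drv_jac)
coords_u8 = np.zeros((NUM_KEYPOINTS, 2), dtype=np.uint8)  # placeholder for the quantized coordinates
print(res_u8.nbytes + coords_u8.nbytes)  # 40 + 20 = 60 bytes per driving frame under these assumptions
```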
S305, generating bitstream data for transmission based on the source reference frame, the key point information of the source reference frame, and the key point information compression results of the driving frames.
In the technical solution provided by the embodiment of the present application, based on the frame sequence of the source reference frame and the driving frames, the difference between the driving Jacobian matrices of each driving frame and the Jacobian matrices of the source reference frame is taken frame by frame in the order of the frame sequence, and the resulting Jacobian residual matrices of each driving frame serve as the driving Jacobian matrix compression result. This further compresses the Jacobian matrices of the driving frames, increasing the degree of video compression and the effect of video encoding.
FIG. 5 is a flowchart of key point coordinate compression provided by an embodiment of the present application. As shown in FIG. 5, it specifically includes the following steps:
S401, reading the source key point coordinates of the source reference frame, and reading the driving key point coordinates of each driving frame;
In one embodiment, the source key point coordinates may be the coordinates of the key points in the source reference frame, and the driving key point coordinates may be the coordinates of the key points in each driving frame. The key point coordinates can be obtained through the sparse motion field in the key point detection network.
S402, converting, using a second preset compression rule, the format of the source key point coordinates and the driving key point coordinates from Float32 data to Uint8 data, to obtain a compression result of the source key point coordinates and compression results of the driving key point coordinates.
In one embodiment, the second preset compression rule may be a rule for compressing the source key point coordinates and the driving key point coordinates, for example one exploiting the coding redundancy of intra-frame or inter-frame key point coordinates, which is not limited here. Because the key point coordinates are normalized to [-1, 1], they can be mapped from [-1, 1] to [0, 255] to compress them. Specifically, the second preset compression rule may be used to convert the format of the source key point coordinates and the driving key point coordinates so that the data type of the key point coordinates changes from Float32 to Uint8, thereby obtaining the compression result of the source key point coordinates and the compression results of the driving key point coordinates. Further, the key point data can be compressed to the greatest extent by reusing key point data, specifically by sampling the key points at a fixed interval of 4 frames; at this point the compressed video is about 1/27.67 of the source video, achieving a greater degree of compression and improving the video encoding effect.
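A minimal sketch of the coordinate quantization and the fixed 4-frame key point reuse described above; the dequantization step and the choice to repeat the last transmitted key points for the skipped frames are assumptions used only to illustrate the idea.

```python
import numpy as np

KEYPOINT_INTERVAL = 4  # key points sampled once every 4 frames, per the description

def quantize_coords(coords_f32: np.ndarray) -> np.ndarray:
    """Map key point coordinates from [-1, 1] (Float32) to [0, 255] (Uint8)."""
    return np.round((np.clip(coords_f32, -1.0, 1.0) + 1.0) * 127.5).astype(np.uint8)

def dequantize_coords(coords_u8: np.ndarray) -> np.ndarray:
    """Inverse mapping used at the decoding end (assumed symmetric to the quantizer)."""
    return coords_u8.astype(np.float32) / 127.5 - 1.0

def sample_keypoint_frames(num_frames: int, interval: int = KEYPOINT_INTERVAL) -> list:
    """Indices of the driving frames whose key points are actually transmitted."""
    return list(range(0, num_frames, interval))

# Illustrative usage: only every 4th driving frame carries fresh key point coordinates;
# the decoder reuses the most recently transmitted key points for the frames in between.
num_driving_frames = 12
transmitted = sample_keypoint_frames(num_driving_frames)
print(transmitted)  # [0, 4, 8]
```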
In the technical solution provided by the embodiment of the present application, the source key point coordinates of the source reference frame and the driving key point coordinates of each driving frame are read, and the second preset compression rule is used to convert their format from Float32 data to Uint8 data, obtaining the compression result of the source key point coordinates and the compression results of the driving key point coordinates. This increases the degree of video compression, lowers the compressed bitrate, improves the video encoding effect, and at the same time reduces the difficulty of training the video encoding model.
FIG. 6 is a flowchart of a reconstructed image evaluation method provided by an embodiment of the present application. As shown in FIG. 6, it specifically includes the following steps:
S501, inputting the source reference frame and each driving frame into a generator, so that the sparse motion field network of the generator generates the key point information of the source reference frame and the key point information of the driving frames, the dense motion field network of the generator outputs intermediate products, and an image generation unit generates a reconstructed image of the current frame;
In one embodiment, the generator may be a model that maps a noise signal (usually random numbers) to a sample similar to real data; specifically, it may be a model that outputs images, such as a fully connected neural network or a deconvolutional network. The generator is used to generate the reconstructed image of the current frame and includes a sparse motion field network and a dense motion field network. The sparse motion field network acts on the source reference frame image and the driving frame images to obtain the key point information of each frame; the dense motion field acts on the key point information received by the generator and on the source reference frame to obtain the intermediate products. The intermediate products may be the changes in content or data between the key point information of the source reference frame and the corresponding key point information of the driving frame, and are used to guide the generator in generating the reconstructed image of the current frame. The image generation unit may be a unit for generating the reconstructed image of the current frame, with functions such as data reception, data processing and image processing. The reconstructed image of the current frame may be the image that the image generation unit generates from the source reference frame image, or from the driving frame image already generated for the previous frame, together with the key point information of the current frame. The source reference frame and each driving frame are input into the generator, so that the sparse motion field network of the generator generates the key point information of the source reference frame and of the driving frames, the dense motion field network of the generator outputs the intermediate products, and the image generation unit generates the reconstructed image of the current frame.
In one embodiment, optionally, the intermediate products include a motion optical flow field and an occlusion map;
correspondingly, generating the reconstructed image of the current frame through the image generation unit includes:
inputting the motion optical flow field, the occlusion map and the source image into the image generation unit to output the reconstructed image of the current frame.
In one embodiment, the motion optical flow field may describe the motion of objects between consecutive frames of a sequence, caused by the relative motion between the objects and the camera. The motion optical flow field includes the key point information of the observed objects in the video and the motion trends of the key points, and is used to compute the pixel motion information of the key points between adjacent frames from the temporal changes of pixels in the image sequence of the source reference frame and the driving frames and from the correlation between adjacent frames. The occlusion map indicates, when the image generation unit generates the reconstructed image of the current frame, which parts can be obtained by displacing pixels of the source image and which parts need to be obtained by padding from context. The motion optical flow field, the occlusion map and the source reference frame image are input into the image generation unit of the generator to output the reconstructed image of the current frame.
In one embodiment, by inputting the motion optical flow field, the occlusion map and the source image into the image generation unit to output the reconstructed image of the current frame, the efficiency and accuracy of regenerating the driving frame images can be improved.
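The data flow can be illustrated with a minimal sketch, assuming backward warping with nearest-neighbor sampling and a flat gray fill standing in for the context padding; the real image generation unit is a learned network, so this only illustrates how the flow field, occlusion map and source image combine.

```python
import numpy as np

def warp_with_flow(source: np.ndarray, flow: np.ndarray) -> np.ndarray:
    """Backward-warp the source image with a dense flow field (nearest-neighbor sampling).

    source: H x W x 3 source reference image.
    flow:   H x W x 2 per-pixel displacement (dx, dy) pointing back into the source image.
    """
    h, w = source.shape[:2]
    ys, xs = np.mgrid[0:h, 0:w]
    src_x = np.clip(np.round(xs + flow[..., 0]).astype(int), 0, w - 1)
    src_y = np.clip(np.round(ys + flow[..., 1]).astype(int), 0, h - 1)
    return source[src_y, src_x]

def reconstruct(source: np.ndarray, flow: np.ndarray, occlusion: np.ndarray) -> np.ndarray:
    """Combine the warped source with context padding according to the occlusion map.

    occlusion: H x W map in [0, 1]; 1 means the pixel can be taken from the warped source,
    0 means it must be filled from context (here a flat gray placeholder).
    """
    warped = warp_with_flow(source, flow)
    context_fill = np.full_like(source, 0.5)          # stand-in for learned context padding
    occ = occlusion[..., None]
    return occ * warped + (1.0 - occ) * context_fill

# Illustrative usage with random stand-in data.
h, w = 32, 32
source = np.random.rand(h, w, 3).astype(np.float32)
flow = np.random.uniform(-2, 2, size=(h, w, 2)).astype(np.float32)
occlusion = np.random.rand(h, w).astype(np.float32)
print(reconstruct(source, flow, occlusion).shape)  # (32, 32, 3)
```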
S502, inputting the reconstructed image into a discriminator to obtain an evaluation result of the reconstructed image; wherein the generator and the discriminator form an adversarial neural network.
In one embodiment, the discriminator may be used to evaluate the generation quality of the reconstructed image of the current frame produced by the generator, which can be judged from the similarity between the reconstructed current frame and the actual current frame. The input of the discriminator is the reconstructed image of the current frame, and its output is the real/fake label of the image, that is, the similarity; therefore, inputting the reconstructed image into the discriminator yields the discriminator's evaluation of the reconstructed image. The adversarial neural network is a generative model composed of a generator and a discriminator, which play against each other and, through this adversarial process, generate high-quality data. During training of the adversarial neural network, by continuously training and updating the discriminator and the generator, the discriminator network gradually becomes more accurate while the data finally generated by the generator comes closer to real data.
In the technical solution provided by the embodiment of the present application, the sparse motion field network of the generator generates the key point information of the source reference frame and of the driving frames, the dense motion field network of the generator outputs the intermediate products, the image generation unit generates the reconstructed image of the current frame, and the reconstructed image is input into the discriminator to obtain an evaluation result of the reconstructed image. This improves the accuracy of regenerating the current frame image and, in turn, the reliability of restoring the encoded video.
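A minimal sketch of the adversarial evaluation is given below, assuming a toy discriminator that scores an image in (0, 1) and the standard generative-adversarial losses; the actual discriminator architecture and training loss are not specified in the description.

```python
import numpy as np

def toy_discriminator(image: np.ndarray, weights: np.ndarray, bias: float) -> float:
    """Toy stand-in for a discriminator: logistic score in (0, 1) from flattened pixels."""
    logit = float(image.ravel() @ weights + bias)
    return 1.0 / (1.0 + np.exp(-logit))

def adversarial_losses(score_real: float, score_fake: float, eps: float = 1e-7):
    """Standard GAN losses: the discriminator separates real from reconstructed frames,
    while the generator is rewarded when its reconstruction is scored as real."""
    d_loss = -(np.log(score_real + eps) + np.log(1.0 - score_fake + eps))
    g_loss = -np.log(score_fake + eps)
    return float(d_loss), float(g_loss)

# Illustrative usage: random stand-ins for the real driving frame and the reconstructed frame.
rng = np.random.default_rng(0)
real_frame = rng.random((32, 32, 3)).astype(np.float32)
reconstructed = rng.random((32, 32, 3)).astype(np.float32)
weights = rng.normal(scale=1e-3, size=real_frame.size).astype(np.float32)

score_real = toy_discriminator(real_frame, weights, bias=0.0)
score_fake = toy_discriminator(reconstructed, weights, bias=0.0)
print(adversarial_losses(score_real, score_fake))
```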
FIG. 7 is a structural block diagram of an AI-based video encoding apparatus provided by an embodiment of the present application. The apparatus is used to execute the AI-based video encoding method provided by the above embodiments and has the functional modules and beneficial effects corresponding to the method. As shown in FIG. 7, the apparatus specifically includes a video acquisition module 601, a key point information output module 602, a key point information compression module 603 and a bitstream data transmission module 604, wherein
the video acquisition module 601 is configured to acquire a video to be encoded;
the key point information output module 602 is configured to apply a key point detection network to the source reference frame and the driving frames of the video to be encoded to output key point information;
the key point information compression module 603 is configured to determine a key point information compression result of each driving frame according to the key point information of the source reference frame and a preset compression rule; and
the bitstream data transmission module 604 is configured to generate bitstream data for transmission based on the source reference frame, the key point information of the source reference frame, and the key point information compression results of the driving frames.
In a possible embodiment, the key point information includes key point coordinates and the Jacobian matrices corresponding to the coordinates;
correspondingly, the key point information compression module 603 includes:
a Jacobian matrix reading unit, configured to read the source Jacobian matrices corresponding to the source key point coordinates of the source reference frame and to read the driving Jacobian matrices corresponding to the driving key point coordinates of each driving frame; and
a Jacobian matrix compression unit, configured to compress the driving Jacobian matrices of each driving frame according to the source Jacobian matrices and a first preset compression rule to obtain a driving Jacobian matrix compression result.
In a possible embodiment, the Jacobian matrix compression unit is specifically configured to:
based on the frame sequence of the source reference frame and the driving frames, take, frame by frame in the order of the frame sequence, the difference between the driving Jacobian matrices of each driving frame and the Jacobian matrices of the source reference frame, to obtain the Jacobian residual matrices of each driving frame as the driving Jacobian matrix compression result.
In a possible embodiment, the Jacobian matrix compression unit is further configured to:
convert the format of the source Jacobian matrices and the driving Jacobian matrices from Float32 data to Float16 data.
In a possible embodiment, the key point information compression module 603 further includes:
a key point coordinate reading unit, configured to read the source key point coordinates of the source reference frame and to read the driving key point coordinates of each driving frame; and
a key point coordinate compression unit, configured to convert, using a second preset compression rule, the format of the source key point coordinates and the driving key point coordinates from Float32 data to Uint8 data, to obtain a compression result of the source key point coordinates and compression results of the driving key point coordinates.
In a possible embodiment, the video acquisition module 601 includes:
a reconstructed image generation unit, configured to input the source reference frame and each driving frame into a generator, so that the sparse motion field network of the generator generates the key point information of the source reference frame and of the driving frames, the dense motion field network of the generator outputs intermediate products, and an image generation unit generates a reconstructed image of the current frame; and
a reconstructed image evaluation unit, configured to input the reconstructed image into a discriminator to obtain an evaluation result of the reconstructed image, wherein the generator and the discriminator form an adversarial neural network.
In a possible embodiment, the intermediate products include a motion optical flow field and an occlusion map;
correspondingly, the reconstructed image generation unit is specifically configured to:
input the motion optical flow field, the occlusion map and the source image into the image generation unit to output the reconstructed image of the current frame.
In a possible embodiment, the bitstream data transmission module 604 is further configured to:
identify, through a transcoding engine, whether there is a terminal that cannot recognize the bitstream data;
if so, convert the bitstream data based on target terminal parameters and transmit the converted bitstream data to the target terminal;
if not, transmit the bitstream data directly to the corresponding terminal.
In the technical solution provided by the embodiment of the present application, the video acquisition module acquires the video to be encoded; the key point information output module applies a key point detection network to the source reference frame and the driving frames of the video to be encoded to output key point information; the key point information compression module determines a key point information compression result of each driving frame according to the key point information of the source reference frame and a preset compression rule; and the bitstream data transmission module generates bitstream data for transmission based on the source reference frame, the key point information of the source reference frame, and the key point information compression results of the driving frames. The above AI-based video encoding apparatus solves the problems of poor coding-model adaptability, difficult model training and high transmission bitrate when existing AI video coding technology is used to encode video, achieves ultra-low-bitrate video encoding through deep compression of the video data combined with an AI model for the encoding process, obtains better video compression, and at the same time broadens the applicability of the AI video encoding model and reduces the difficulty and cost of model development and training.
FIG. 8 is a schematic structural diagram of an AI-based video encoding device provided by an embodiment of the present application. As shown in FIG. 8, the device includes a processor 701, a memory 702, an input apparatus 703 and an output apparatus 704; there may be one or more processors 701 in the device, and one processor 701 is taken as an example in FIG. 8; the processor 701, memory 702, input apparatus 703 and output apparatus 704 in the device may be connected by a bus or in other ways, and connection by a bus is taken as an example in FIG. 8. As a computer-readable storage medium, the memory 702 can be used to store software programs, computer-executable programs and modules, such as the program instructions/modules corresponding to the AI-based video encoding method in the embodiments of the present application. By running the software programs, instructions and modules stored in the memory 702, the processor 701 executes the various functional applications and data processing of the device, that is, implements the above AI-based video encoding method. The input apparatus 703 can be used to receive input digital or character information and to generate key signal inputs related to user settings and function control of the device. The output apparatus 704 may include a display device such as a display screen.
An embodiment of the present application further provides a storage medium containing computer-executable instructions which, when executed by a computer processor, are used to perform the AI-based video encoding method described in the above embodiments, including:
acquiring a video to be encoded;
applying a key point detection network to the source reference frame and the driving frames of the video to be encoded to output key point information;
determining a key point information compression result of each driving frame according to the key point information of the source reference frame and a preset compression rule; and
generating bitstream data for transmission based on the source reference frame, the key point information of the source reference frame, and the key point information compression results of the driving frames.
It is worth noting that, in the above embodiment of the AI-based video encoding apparatus, the units and modules included are divided only according to functional logic, but are not limited to the above division, as long as the corresponding functions can be implemented; in addition, the specific names of the functional units are only for the convenience of distinguishing them from one another and are not used to limit the protection scope of the embodiments of the present application.
In some possible implementations, the various aspects of the method provided in the present application may also be implemented in the form of a program product, which includes program code; when the program product is run on a computer device, the program code is used to cause the computer device to perform the steps of the method according to the various exemplary embodiments of the present application described above in this specification; for example, the computer device may perform the AI-based video encoding method described in the embodiments of the present application. The program product may be implemented in any combination of one or more readable media.
Claims (12)
- An AI-based video encoding method, comprising: acquiring a video to be encoded; applying a key point detection network to the source reference frame and the driving frames of the video to be encoded to output key point information; determining a key point information compression result of each driving frame according to the key point information of the source reference frame and a preset compression rule; and generating bitstream data for transmission based on the source reference frame, the key point information of the source reference frame, and the key point information compression results of the driving frames.
- The AI-based video encoding method according to claim 1, wherein the key point information comprises key point coordinates and Jacobian matrices corresponding to the coordinates; and correspondingly, determining the key point information compression result of each driving frame according to the key point information of the source reference frame and the preset compression rule comprises: reading source Jacobian matrices corresponding to the source key point coordinates of the source reference frame, and reading driving Jacobian matrices corresponding to the driving key point coordinates of each driving frame; and compressing the driving Jacobian matrices of each driving frame according to the source Jacobian matrices and a first preset compression rule to obtain a driving Jacobian matrix compression result.
- The AI-based video encoding method according to claim 2, wherein compressing the driving Jacobian matrices of each driving frame according to the source Jacobian matrices and the first preset compression rule to obtain the driving Jacobian matrix compression result comprises: based on a frame sequence of the source reference frame and the driving frames, taking, frame by frame in the order of the frame sequence, the difference between the driving Jacobian matrices of each driving frame and the Jacobian matrices of the source reference frame, to obtain Jacobian residual matrices of each driving frame as the driving Jacobian matrix compression result.
- The AI-based video encoding method according to claim 2, wherein before compressing the driving Jacobian matrices of each driving frame according to the source Jacobian matrices and the first preset compression rule to obtain the driving Jacobian matrix compression result, the method further comprises: converting the format of the source Jacobian matrices and the driving Jacobian matrices from Float32 data to Float16 data.
- The AI-based video encoding method according to claim 2, wherein determining the key point information compression result of each driving frame according to the key point information of the source reference frame and the preset compression rule further comprises: reading the source key point coordinates of the source reference frame, and reading the driving key point coordinates of each driving frame; and converting, using a second preset compression rule, the format of the source key point coordinates and the driving key point coordinates from Float32 data to Uint8 data, to obtain a compression result of the source key point coordinates and compression results of the driving key point coordinates.
- The AI-based video encoding method according to any one of claims 1 to 5, wherein after acquiring the video to be encoded, the method further comprises: inputting the source reference frame and each driving frame into a generator, so that a sparse motion field network of the generator generates the key point information of the source reference frame and the key point information of the driving frames, a dense motion field network of the generator outputs intermediate products, and an image generation unit generates a reconstructed image of the current frame; and inputting the reconstructed image into a discriminator to obtain an evaluation result of the reconstructed image, wherein the generator and the discriminator form an adversarial neural network.
- The AI-based video encoding method according to claim 6, wherein the intermediate products comprise a motion optical flow field and an occlusion map; and correspondingly, generating the reconstructed image of the current frame through the image generation unit comprises: inputting the motion optical flow field, the occlusion map and a source image into the image generation unit to output the reconstructed image of the current frame.
- The AI-based video encoding method according to any one of claims 1 to 7, wherein after generating the bitstream data for transmission based on the source reference frame, the key point information of the source reference frame and the key point information compression results of the driving frames, the method further comprises: identifying, through a transcoding engine, whether there is a terminal that cannot recognize the bitstream data; if so, converting the bitstream data based on target terminal parameters, and transmitting the converted bitstream data to the target terminal; and if not, transmitting the bitstream data directly to the corresponding terminal.
- An AI-based video encoding apparatus, comprising: a video acquisition module, configured to acquire a video to be encoded; a key point information output module, configured to apply a key point detection network to the source reference frame and the driving frames of the video to be encoded to output key point information; a key point information compression module, configured to determine a key point information compression result of each driving frame according to the key point information of the source reference frame and a preset compression rule; and a bitstream data transmission module, configured to generate bitstream data for transmission based on the source reference frame, the key point information of the source reference frame, and the key point information compression results of the driving frames.
- An AI-based video encoding device, comprising: one or more processors; and a storage apparatus configured to store one or more programs, wherein when the one or more programs are executed by the one or more processors, the one or more processors implement the AI-based video encoding method according to any one of claims 1 to 8.
- A storage medium storing computer-executable instructions which, when executed by a computer processor, are used to perform the AI-based video encoding method according to any one of claims 1 to 8.
- A computer program product, comprising a computer program, wherein the computer program, when executed by a processor, implements the AI-based video encoding method according to any one of claims 1 to 8.
Applications Claiming Priority (2)
Application Number | Priority Date | Filing Date | Title |
---|---|---|---|
CN202310396495.3 | 2023-04-12 | ||
CN202310396495.3A CN116582686A (zh) | 2023-04-12 | 2023-04-12 | 一种基于ai的视频编码方法、装置、设备和存储介质 |
Publications (1)
Publication Number | Publication Date |
---|---|
WO2024212822A1 true WO2024212822A1 (zh) | 2024-10-17 |
Family
ID=87544358
Family Applications (1)
Application Number | Title | Priority Date | Filing Date |
---|---|---|---|
PCT/CN2024/084506 WO2024212822A1 (zh) | 2023-04-12 | 2024-03-28 | 一种基于ai的视频编码方法、装置、设备和存储介质 |
Country Status (2)
Country | Link |
---|---|
CN (1) | CN116582686A (zh) |
WO (1) | WO2024212822A1 (zh) |
Families Citing this family (1)
Publication number | Priority date | Publication date | Assignee | Title |
---|---|---|---|---|
CN116582686A (zh) * | 2023-04-12 | 2023-08-11 | 百果园技术(新加坡)有限公司 | 一种基于ai的视频编码方法、装置、设备和存储介质 |
Patent Citations (6)
Publication number | Priority date | Publication date | Assignee | Title |
---|---|---|---|---|
CN113132732A (zh) * | 2019-12-31 | 2021-07-16 | 北京大学 | 一种人机协同的视频编码方法及视频编码系统 |
CN114257818A (zh) * | 2020-09-22 | 2022-03-29 | 阿里巴巴集团控股有限公司 | 视频的编、解码方法、装置、设备和存储介质 |
CN114363623A (zh) * | 2021-08-12 | 2022-04-15 | 财付通支付科技有限公司 | 图像处理方法、装置、介质及电子设备 |
US20230105436A1 (en) * | 2021-10-06 | 2023-04-06 | Kwai Inc. | Generative adversarial network for video compression |
CN115941966A (zh) * | 2022-12-30 | 2023-04-07 | 深圳大学 | 一种视频压缩方法及电子设备 |
CN116582686A (zh) * | 2023-04-12 | 2023-08-11 | 百果园技术(新加坡)有限公司 | 一种基于ai的视频编码方法、装置、设备和存储介质 |
Also Published As
Publication number | Publication date |
---|---|
CN116582686A (zh) | 2023-08-11 |