WO2022227689A1 - Video processing method and apparatus - Google Patents
- Publication number
- WO2022227689A1 (PCT/CN2022/070267)
- Authority
- WO
- WIPO (PCT)
- Prior art keywords
- features
- video
- transparency information
- video frame
- fused
Classifications
-
- H—ELECTRICITY
- H04—ELECTRIC COMMUNICATION TECHNIQUE
- H04N—PICTORIAL COMMUNICATION, e.g. TELEVISION
- H04N5/00—Details of television systems
- H04N5/222—Studio circuitry; Studio devices; Studio equipment
- H04N5/262—Studio circuits, e.g. for mixing, switching-over, change of character of image, other special effects ; Cameras specially adapted for the electronic generation of special effects
- H04N5/265—Mixing
-
- G—PHYSICS
- G06—COMPUTING; CALCULATING OR COUNTING
- G06N—COMPUTING ARRANGEMENTS BASED ON SPECIFIC COMPUTATIONAL MODELS
- G06N3/00—Computing arrangements based on biological models
- G06N3/02—Neural networks
- G06N3/08—Learning methods
-
- H—ELECTRICITY
- H04—ELECTRIC COMMUNICATION TECHNIQUE
- H04N—PICTORIAL COMMUNICATION, e.g. TELEVISION
- H04N17/00—Diagnosis, testing or measuring for television systems or their details
- H04N17/004—Diagnosis, testing or measuring for television systems or their details for digital television systems
Definitions
- the present disclosure relates to the field of image processing, and in particular, to a video processing method and apparatus, an electronic device, and a storage medium.
- Image matting is one of the important techniques in the field of image processing.
- Traditional matting techniques use low-level image features such as color or structure to separate the foreground, but in complex scenes the limited expressive power of these low-level features prevents the foreground from being separated accurately.
- image matting technology based on deep learning has become the mainstream image matting technology.
- deep video matting techniques have not been effectively explored due to the lack of large-scale deep learning video matting datasets.
- one of the solutions for deep video matting is to apply a deep image matting technique to video data frame by frame, thereby realizing video matting.
- the present disclosure provides a video processing method and apparatus, an electronic device and a storage medium.
- A video processing method comprising: acquiring a video and partial transparency information corresponding to each video frame in the video; extracting spatial features of multiple scales of each video frame based on each video frame and the partial transparency information; fusing the spatial features of the same scale of adjacent video frames of the video to generate a plurality of fused features of different scales; predicting unknown transparency information of each video frame based on the fused features of the plurality of different scales; and processing the video according to the predicted unknown transparency information.
- Fusing the spatial features of the same scale of adjacent video frames of the video to generate a plurality of fused features of different scales includes: extracting motion information between adjacent video frames, and aligning the spatial features of the same scale of the adjacent frames according to the motion information; and fusing the aligned spatial features of the same scale to generate the plurality of fused features of different scales.
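The alignment step can be illustrated with a minimal sketch. The text does not say how motion information is estimated or applied, so the single integer displacement `(dx, dy)` per frame pair and the helper name `align_by_displacement` are assumptions made for illustration only; a real model might instead use per-pixel motion.

```python
from typing import List

FeatureMap = List[List[float]]  # one channel, indexed [row][col]


def align_by_displacement(neighbor: FeatureMap, dx: int, dy: int) -> FeatureMap:
    """Shift `neighbor` by (dx, dy) pixels so it lines up with the current frame.

    Assumption: motion is one global integer displacement. Positions with no
    source pixel after the shift are filled with 0.0.
    """
    height, width = len(neighbor), len(neighbor[0])
    aligned = [[0.0] * width for _ in range(height)]
    for y in range(height):
        for x in range(width):
            src_y, src_x = y - dy, x - dx
            if 0 <= src_y < height and 0 <= src_x < width:
                aligned[y][x] = neighbor[src_y][src_x]
    return aligned


# A feature at column 0 in the previous frame that appears at column 1 in the
# current frame has moved by dx = 1, so the previous map is shifted right by 1.
previous = [[5.0, 0.0, 0.0],
            [0.0, 0.0, 0.0]]
aligned_previous = align_by_displacement(previous, dx=1, dy=0)
```

After this shift the previous frame's feature sits at the same position as in the current frame, so the two maps can be fused position by position.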
- Fusing the aligned spatial features of the same scale to generate a plurality of fused features of different scales includes: fusing the aligned spatial features of the same scale by directly performing channel merging on the aligned spatial features of each scale, to generate the plurality of fused features of different scales; or fusing the aligned spatial features of the same scale by performing channel merging on the aligned spatial features of each scale and then fusing the channel-merged features using an attention mechanism, to generate the plurality of fused features of different scales.
- using the attention mechanism to fuse the channel-merged features includes: using the channel attention mechanism to fuse the feature channels, and using the spatial attention mechanism to fuse the pixels in the same channel.
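A rough sketch of channel merging followed by channel and spatial attention is given below. The text does not specify how attention weights are computed, so deriving them from plain averages passed through a sigmoid, with no learned parameters, is an assumption for illustration; all function names are hypothetical.

```python
import math
from typing import List

Channel = List[List[float]]   # indexed [row][col]
Features = List[Channel]      # indexed [channel][row][col]


def sigmoid(value: float) -> float:
    return 1.0 / (1.0 + math.exp(-value))


def channel_attention(features: Features) -> Features:
    """Scale each channel by a weight derived from its global average (assumed rule)."""
    fused = []
    for channel in features:
        pixel_count = len(channel) * len(channel[0])
        mean = sum(sum(row) for row in channel) / pixel_count
        weight = sigmoid(mean)
        fused.append([[value * weight for value in row] for row in channel])
    return fused


def spatial_attention(features: Features) -> Features:
    """Scale each pixel position by a weight derived from its cross-channel average."""
    height, width = len(features[0]), len(features[0][0])
    weights = [
        [sigmoid(sum(ch[y][x] for ch in features) / len(features)) for x in range(width)]
        for y in range(height)
    ]
    return [
        [[ch[y][x] * weights[y][x] for x in range(width)] for y in range(height)]
        for ch in features
    ]


def fuse_with_attention(aligned_frames: List[Features]) -> Features:
    """Channel-merge the aligned per-frame features, then apply both attentions."""
    merged = [channel for frame in aligned_frames for channel in frame]
    return spatial_attention(channel_attention(merged))


prev_frame = [[[0.0, 0.0]]]   # 1 channel, 1 x 2 pixels
curr_frame = [[[2.0, 4.0]]]
fused = fuse_with_attention([prev_frame, curr_frame])
```

Dropping the two attention calls and returning `merged` directly corresponds to the first alternative, in which the aligned features are fused by channel merging alone.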
- the fusion of the aligned spatial features of the same scale to generate fusion features of different scales further includes: performing further feature extraction on the fusion features of different scales to obtain new fusion features ; wherein, predicting the unknown transparency information of each video frame based on a plurality of fusion features of different scales includes: predicting the unknown transparency information of each video frame based on a new fusion feature.
- The video processing method further comprises: using a deep neural network model to predict, based on the video and the partial transparency information, unknown transparency information other than the partial transparency information of each video frame, wherein the deep neural network model is an encoder-decoder structure model with a skip layer connection between its encoder and decoder, and the decoder includes a feature fusion module and a prediction branch; and wherein the method further comprises: using the encoder to extract spatial features of multiple scales of each video frame, using the feature fusion module to fuse the spatial features of the same scale of adjacent video frames of the video to generate a plurality of fused features of different scales, and using the prediction branch to predict the unknown transparency information of each video frame based on the fused features of the plurality of different scales.
- The skip layer connection means that the spatial features of different scales generated by the encoder are respectively input to the feature fusion module of the decoder that fuses features of the corresponding scale.
- Extracting the spatial features of multiple scales of each video frame based on each video frame and the partial transparency information includes: connecting each video frame with the transparency information map corresponding to that video frame to form a connected image; and extracting spatial features of multiple scales of the connected image corresponding to each video frame as the spatial features of multiple scales of that video frame.
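Forming the connected image can be sketched as stacking the transparency information map onto the frame as an extra channel. The channel-first layout, the three-channel RGB frame, and the numeric encoding of the map are assumptions for illustration; the helper name is hypothetical.

```python
from typing import List

Plane = List[List[float]]  # one channel, indexed [row][col]


def connect_frame_and_trimap(frame_rgb: List[Plane], trimap: Plane) -> List[Plane]:
    """Stack the trimap as an extra channel behind the frame's color channels."""
    height, width = len(trimap), len(trimap[0])
    for plane in frame_rgb:
        if len(plane) != height or any(len(row) != width for row in plane):
            raise ValueError("frame and trimap must share the same height and width")
    return [*frame_rgb, trimap]


frame = [
    [[0.9, 0.1]],  # R
    [[0.8, 0.2]],  # G
    [[0.7, 0.3]],  # B
]
# Assumed encoding: 1.0 = foreground, 0.0 = background, 0.5 = unknown.
trimap = [[1.0, 0.5]]
connected = connect_frame_and_trimap(frame, trimap)  # 4 channels, 1 x 2 pixels
```

The encoder would then take this four-channel input, so every extracted feature can depend on both the pixel colors and the known transparency.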
- the processing the video according to the predicted unknown transparency information includes: extracting the target object in the video according to the predicted unknown transparency information of each video frame.
- A method for training a deep neural network model, including: acquiring a training video and total transparency information corresponding to each video frame in the training video; based on the training video and partial transparency information within the total transparency information, using the deep neural network model to perform the following operations to predict unknown transparency information other than the partial transparency information: extracting spatial features of multiple scales of each video frame based on each video frame of the training video and the partial transparency information corresponding to that video frame, fusing the spatial features of the same scale of adjacent video frames of the training video to generate a plurality of fused features of different scales, and predicting, based on the fused features of the plurality of different scales, the unknown transparency information other than the partial transparency information in the total transparency information of each video frame; and adjusting the parameters of the deep neural network model by comparing the predicted unknown transparency information with the transparency information other than the partial transparency information in the total transparency information.
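The comparison that drives parameter adjustment can be sketched as a loss computed only over the unknown region. The text says only that predicted and ground-truth values are compared, so the mean absolute error used here and the function name are assumptions for illustration.

```python
from typing import List

Plane = List[List[float]]


def unknown_region_l1_loss(
    predicted: Plane, ground_truth: Plane, unknown_mask: List[List[bool]]
) -> float:
    """Mean absolute error over pixels whose transparency was not given as input."""
    total, count = 0.0, 0
    for pred_row, gt_row, mask_row in zip(predicted, ground_truth, unknown_mask):
        for pred, gt, is_unknown in zip(pred_row, gt_row, mask_row):
            if is_unknown:
                total += abs(pred - gt)
                count += 1
    return total / count if count else 0.0


predicted_alpha = [[1.0, 0.5, 0.0]]
true_alpha = [[1.0, 0.75, 0.0]]
unknown = [[False, True, False]]   # only the middle pixel had unknown transparency
loss = unknown_region_l1_loss(predicted_alpha, true_alpha, unknown)
```

In a training loop this value would be minimized with respect to the model parameters; pixels whose transparency was supplied as input do not contribute.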
- Fusing the spatial features of the same scale of adjacent video frames of the training video to generate a plurality of fused features of different scales includes: extracting motion information between adjacent video frames, and aligning the spatial features of the same scale of the adjacent frames according to the motion information; and fusing the aligned spatial features of the same scale to generate the plurality of fused features of different scales.
- Fusing the aligned spatial features of the same scale to generate a plurality of fused features of different scales includes: fusing the aligned spatial features of the same scale by directly performing channel merging on the aligned spatial features of each scale, to generate the plurality of fused features of different scales; or fusing the aligned spatial features of the same scale by performing channel merging on the aligned spatial features of each scale and then fusing the channel-merged features using an attention mechanism, to generate the plurality of fused features of different scales.
- using the attention mechanism to fuse the channel-merged features includes: using the channel attention mechanism to fuse the feature channels, and using the spatial attention mechanism to fuse the pixels in the same channel.
- the fusion of the aligned spatial features of the same scale to generate fusion features of different scales further includes: performing further feature extraction on the fusion features of different scales to obtain new fusion features ; wherein, predicting unknown transparency information of each video frame based on a plurality of fusion features of different scales includes: predicting unknown transparency information of each video frame based on a new fusion feature.
- The deep neural network model is an encoder-decoder structure model with a skip layer connection between its encoder and decoder, and the decoder includes a feature fusion module and a prediction branch, wherein the method further comprises: using the encoder to extract spatial features of multiple scales of each video frame, using the feature fusion module to fuse the spatial features of the same scale of adjacent video frames of the training video to generate a plurality of fused features of different scales, and using the prediction branch to predict the unknown transparency information of each video frame based on the fused features of the plurality of different scales.
- The skip layer connection means that the spatial features of different scales generated by the encoder are respectively input to the feature fusion module of the decoder that fuses features of the corresponding scale.
- Extracting the spatial features of multiple scales of each video frame based on each video frame and the partial transparency information includes: connecting each video frame with the transparency information map corresponding to that video frame to form a connected image; and extracting spatial features of multiple scales of the connected image corresponding to each video frame as the spatial features of multiple scales of that video frame.
- A video processing apparatus comprising: a data acquisition unit configured to acquire a video and partial transparency information corresponding to each video frame in the video; a prediction unit configured to extract spatial features of multiple scales of each video frame based on each video frame and the partial transparency information, fuse the spatial features of the same scale of adjacent video frames of the video to generate a plurality of fused features of different scales, and predict unknown transparency information of each video frame based on the fused features of the plurality of different scales; and a processing unit configured to process the video according to the predicted unknown transparency information.
- Fusing the spatial features of the same scale of adjacent video frames of the video to generate a plurality of fused features of different scales includes: extracting motion information between adjacent video frames, and aligning the spatial features of the same scale of the adjacent frames according to the motion information; and fusing the aligned spatial features of the same scale to generate the plurality of fused features of different scales.
- Fusing the aligned spatial features of the same scale to generate a plurality of fused features of different scales includes: fusing the aligned spatial features of the same scale by directly performing channel merging on the aligned spatial features of each scale, to generate the plurality of fused features of different scales; or fusing the aligned spatial features of the same scale by performing channel merging on the aligned spatial features of each scale and then fusing the channel-merged features using an attention mechanism, to generate the plurality of fused features of different scales.
- using the attention mechanism to fuse the channel-merged features includes: using the channel attention mechanism to fuse the feature channels, and using the spatial attention mechanism to fuse the pixels in the same channel.
- the fusion of the aligned spatial features of the same scale to generate fusion features of different scales further includes: performing further feature extraction on the fusion features of different scales to obtain new fusion features ; wherein, predicting the unknown transparency information of each video frame based on a plurality of fusion features of different scales includes: predicting the unknown transparency information of each video frame based on a new fusion feature.
- The prediction unit uses a deep neural network model to predict, based on the video and the partial transparency information, unknown transparency information other than the partial transparency information of each video frame, wherein the deep neural network model is an encoder-decoder structure model with a skip layer connection between the encoder and the decoder, and the decoder includes a feature fusion module and a prediction branch; the encoder is used to extract spatial features of multiple scales of each video frame, the feature fusion module is used to fuse the spatial features of the same scale of adjacent video frames of the video to generate a plurality of fused features of different scales, and the prediction branch is used to predict the unknown transparency information of each video frame based on the fused features of the plurality of different scales.
- the skip layer connection indicates that the spatial features of different scales generated by the encoder are respectively input to a feature fusion module of the decoder for fusing the features of the corresponding scales.
- Extracting the spatial features of multiple scales of each video frame based on each video frame and the partial transparency information includes: connecting each video frame with the transparency information map corresponding to that video frame to form a connected image; and extracting spatial features of multiple scales of the connected image corresponding to each video frame as the spatial features of multiple scales of that video frame.
- the processing unit is configured to extract the target object in the video according to the predicted unknown transparency information of each video frame.
- An apparatus for training a deep neural network model, comprising: a training data acquisition unit configured to acquire a training video and total transparency information corresponding to each video frame in the training video; and a model training unit configured to, based on the training video and partial transparency information within the total transparency information, use the deep neural network model to perform the following operations to predict unknown transparency information other than the partial transparency information: extracting spatial features of multiple scales of each video frame based on each video frame and the partial transparency information corresponding to that video frame, fusing the spatial features of the same scale of adjacent video frames of the training video to generate a plurality of fused features of different scales, and predicting, based on the fused features of the plurality of different scales, the unknown transparency information other than the partial transparency information in the total transparency information of each video frame; and to adjust the parameters of the deep neural network model by comparing the predicted unknown transparency information with the transparency information other than the partial transparency information in the total transparency information.
- Fusing the spatial features of the same scale of adjacent video frames of the video to generate a plurality of fused features of different scales includes: extracting motion information between adjacent video frames, and aligning the spatial features of the same scale of the adjacent frames according to the motion information; and fusing the aligned spatial features of the same scale to generate the plurality of fused features of different scales.
- Fusing the aligned spatial features of the same scale to generate a plurality of fused features of different scales includes: fusing the aligned spatial features of the same scale by directly performing channel merging on the aligned spatial features of each scale, to generate the plurality of fused features of different scales; or fusing the aligned spatial features of the same scale by performing channel merging on the aligned spatial features of each scale and then fusing the channel-merged features using an attention mechanism, to generate the plurality of fused features of different scales.
- using the attention mechanism to fuse the channel-merged features includes: using the channel attention mechanism to fuse the feature channels, and using the spatial attention mechanism to fuse the pixels in the same channel.
- the fusion of the aligned spatial features of the same scale to generate fusion features of different scales further includes: performing further feature extraction on the fusion features of different scales to obtain new fusion features ; wherein, predicting unknown transparency information of each video frame based on a plurality of fusion features of different scales includes: predicting unknown transparency information of each video frame based on a new fusion feature.
- The deep neural network model is an encoder-decoder structure model with a skip layer connection between its encoder and decoder, and the decoder includes a feature fusion module and a prediction branch, wherein the model training unit is further configured to: extract spatial features of multiple scales of each video frame using the encoder, fuse the spatial features of the same scale of adjacent video frames of the training video using the feature fusion module to generate a plurality of fused features of different scales, and predict the unknown transparency information of each video frame using the prediction branch based on the fused features of the plurality of different scales.
- The skip layer connection means that the spatial features of different scales generated by the encoder are respectively input to the feature fusion module of the decoder that fuses features of the corresponding scale.
- Extracting the spatial features of multiple scales of each video frame based on each video frame and the partial transparency information includes: connecting each video frame with the transparency information map corresponding to that video frame to form a connected image; and extracting spatial features of multiple scales of the connected image corresponding to each video frame as the spatial features of multiple scales of that video frame.
- An electronic device comprising: at least one processor; and at least one memory storing computer-executable instructions, wherein the computer-executable instructions, when executed by the at least one processor, cause the at least one processor to perform the video processing method or the method of training a deep neural network model as described above.
- A non-volatile computer-readable storage medium storing instructions that, in response to being executed by at least one processor, cause the at least one processor to perform the video processing method or the method for training a deep neural network model described above.
- a computer program product including computer instructions that, in response to the computer instructions being executed by a processor, implement the above-described video processing method or method for training a deep neural network model.
- The spatial features of the same scale of adjacent video frames are fused, so that the temporal information between adjacent video frames is used when predicting the transparency information; this improves the continuity and consistency of the predicted transparency information, that is, its prediction accuracy.
- FIG. 1 is an exemplary system architecture in which exemplary embodiments of the present disclosure may be applied;
- FIG. 2 is a flowchart of a video processing method according to an exemplary embodiment of the present disclosure;
- FIG. 3 is a schematic diagram illustrating an example of a deep neural network model of an exemplary embodiment of the present disclosure;
- FIG. 4 is a schematic diagram illustrating an example of a feature fusion model of an exemplary embodiment of the present disclosure;
- FIG. 5 is a flowchart illustrating a method for training a deep neural network model according to another exemplary embodiment of the present disclosure;
- FIG. 6 is a block diagram illustrating a video processing apparatus of an exemplary embodiment of the present disclosure;
- FIG. 7 is a block diagram illustrating an apparatus for training a deep neural network model according to another exemplary embodiment of the present disclosure.
- FIG. 8 is a block diagram of an electronic device according to an exemplary embodiment of the present disclosure.
- FIG. 1 illustrates an exemplary system architecture 100 in which exemplary embodiments of the present disclosure may be applied.
- the system architecture 100 may include terminal devices 101 , 102 , and 103 , a network 104 and a server 105 .
- The network 104 is the medium used to provide the communication link between the terminal devices 101, 102, 103 and the server 105.
- the network 104 may include various connection types, such as wired, wireless communication links, or fiber optic cables, among others.
- The user can use the terminal devices 101, 102, 103 to interact with the server 105 through the network 104 to receive or send messages (e.g., video data upload requests and video data download requests) and the like.
- Various communication client applications may be installed on the terminal devices 101 , 102 and 103 , such as video recording software, video players, video editing software, instant communication tools, email clients, social platform software, and the like.
- The terminal devices 101, 102, and 103 may be hardware or software. Where the terminal devices 101, 102, and 103 are hardware, they can be various electronic devices that have a display screen and are capable of audio and video playback, recording, editing, and so on, including but not limited to smartphones, tablet computers, laptop computers, and desktop computers.
- Where the terminal devices 101, 102, and 103 are software, they can be installed in the electronic devices listed above, and can be implemented as multiple pieces of software or software modules (for example, to provide distributed services) or as a single piece of software or software module. There is no specific limitation here.
- The terminal devices 101, 102, and 103 may be equipped with image capture devices (e.g., cameras) to capture video data.
- the smallest visual unit that composes a video is a frame.
- Each frame is a static image.
- a dynamic video is formed by synthesizing a sequence of temporally consecutive frames together.
- The terminal devices 101, 102, 103 may also be equipped with components that convert electrical signals into sound (such as speakers) to play sound, and with devices that convert analog audio signals into digital audio signals (for example, microphones) to capture sound.
- the server 105 may be a server that provides various services, such as a background server that provides support for multimedia applications installed on the terminal devices 101 , 102 , and 103 .
- The background server can parse and store received data such as audio and video data upload requests, and can also receive audio and video data download requests sent by the terminal devices 101, 102, and 103 and feed the audio and video data indicated by those download requests back to the terminal devices 101, 102, and 103.
- the server may be hardware or software.
- Where the server is hardware, it can be implemented as a distributed server cluster composed of multiple servers, or as a single server.
- Where the server is software, it can be implemented as multiple pieces of software or software modules (for example, for providing distributed services), or as a single piece of software or software module. There is no specific limitation here.
- the video processing method provided by the embodiments of the present disclosure may be executed by a terminal device or a server, or may also be executed by a terminal device and a server in cooperation. Accordingly, the video processing apparatus may be provided in the terminal device, in the server, or in both the terminal device and the server.
- terminal devices, networks and servers in FIG. 1 are merely illustrative. According to implementation requirements, there may be any number of terminal devices, networks and servers, which are not limited in the present disclosure.
- FIG. 2 is a flowchart of a video processing method according to an exemplary embodiment of the present disclosure.
- the video processing method can be executed by a terminal device, also by a server, or can also be executed by a terminal device and a server in cooperation.
- the video processing method may include steps S210-S250.
- Processing the video may be matting the video.
- Matting refers to separating specific foreground objects (portraits, animals, etc.) from the background of the original image into a separate layer, in preparation for later image compositing.
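To show what the separated layer is used for, the sketch below composites a foreground over a new background using the standard per-pixel relation I = alpha * F + (1 - alpha) * B. The single-channel layout and the function name `composite` are assumptions for illustration; the text does not prescribe a compositing routine.

```python
from typing import List

Plane = List[List[float]]  # one channel, indexed [row][col], values in [0, 1]


def composite(foreground: Plane, background: Plane, alpha: Plane) -> Plane:
    """Blend foreground over background: alpha * F + (1 - alpha) * B per pixel."""
    return [
        [a * f + (1.0 - a) * b for f, b, a in zip(f_row, b_row, a_row)]
        for f_row, b_row, a_row in zip(foreground, background, alpha)
    ]


foreground = [[1.0, 1.0, 1.0]]
background = [[0.0, 0.0, 0.0]]
alpha = [[1.0, 0.5, 0.0]]   # opaque, half transparent, fully transparent
result = composite(foreground, background, alpha)
```

A pixel with alpha 1.0 shows only the foreground, a pixel with alpha 0.0 shows only the background, and intermediate values blend the two, which is why accurate alpha prediction matters along hair and other soft edges.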
- a video and partial transparency information corresponding to each video frame in the video are acquired.
- the transparency is the above-mentioned Alpha.
- The partial transparency information may be a transparency information map that includes a determined foreground area, a determined background area, and an unknown area, that is, a trimap; however, it is not limited thereto and may take any data form that can reflect the partial transparency information.
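A trimap can be sketched as a per-pixel label map that is split into the known transparency and a mask of the unknown area. The 255 / 0 / 128 encoding is a common convention assumed here for illustration, not something stated in the text, and `split_trimap` is a hypothetical helper.

```python
from typing import List, Tuple

# Assumed encoding, not taken from the text.
FOREGROUND, BACKGROUND, UNKNOWN = 255, 0, 128


def split_trimap(trimap: List[List[int]]) -> Tuple[List[List[float]], List[List[bool]]]:
    """Return (known_alpha, unknown_mask) for a trimap.

    known_alpha is 1.0 in the determined foreground and 0.0 elsewhere;
    unknown_mask is True where the transparency still has to be predicted.
    """
    known_alpha, unknown_mask = [], []
    for row in trimap:
        for value in row:
            if value not in (FOREGROUND, BACKGROUND, UNKNOWN):
                raise ValueError(f"unexpected trimap value: {value}")
        known_alpha.append([1.0 if value == FOREGROUND else 0.0 for value in row])
        unknown_mask.append([value == UNKNOWN for value in row])
    return known_alpha, unknown_mask


known_alpha, unknown_mask = split_trimap([[255, 128, 0]])
```

Only the pixels flagged in `unknown_mask` need a predicted transparency value; the rest are already fixed by the trimap.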
- the video may be acquired in response to a user request, and the partial transparency information corresponding to the video frame may be acquired according to user input (for example, an input of the user specifying a partial foreground area and a background area).
- However, the manner of acquiring the partial transparency information is not limited to the above; for example, the partial transparency information can also be obtained automatically through machine analysis without user input. That is, the present disclosure places no limitation on how the video and the partial transparency information are acquired.
- unknown transparency information other than the partial transparency information of each video frame may be predicted using a deep neural network model based on the video and the partial transparency information.
- the present disclosure utilizes a pre-trained deep neural network model to predict unknown transparency information. Next, the description will focus on the operations performed by the deep neural network model of the embodiment of the present disclosure.
- In step S220, spatial features of multiple scales of each video frame may be extracted based on each video frame and the partial transparency information; in step S230, the spatial features of the same scale of adjacent video frames of the video may be fused to generate a plurality of fused features of different scales; in step S240, the unknown transparency information of each video frame may be predicted based on the fused features of the plurality of different scales; and in step S250, the video may be processed according to the predicted unknown transparency information.
- the deep neural network model according to the present disclosure may be an encoder-decoder structure model, a MobileNet network structure model, or a deep residual network model, but is not limited thereto.
- the architecture and operation of the deep neural network model are introduced below by taking the deep neural network model as an encoder-decoder structural model as an example.
- FIG. 3 is a schematic diagram illustrating an example of a deep neural network model of an exemplary embodiment of the present disclosure.
- the encoder-decoder structure model shown in FIG. 3 includes an encoder and a decoder, and the decoder includes a feature fusion module (denoted as ST-FAM in FIG. 3 ) and a prediction branch (not shown).
- the feature fusion module is used to fuse the spatial features of the same scale of adjacent video frames of the video to generate a plurality of fused features of different scales
- the prediction branch is used to predict unknown transparency information of each video frame based on the fusion features of multiple different scales.
- the skip layer connection indicates that the spatial features of different scales generated by the encoder are respectively input to the feature fusion module of the decoder for fusing the features of the corresponding scales.
- the convolutional layers in the encoder that generate spatial features of different scales are connected to the convolutional layers in the decoder that fuse features of the corresponding scales. Because of this correspondence, the encoder-decoder structure model of the present disclosure has skip-layer connections between the encoder and the decoder, which makes it convenient for the decoder to fuse the spatial features of each scale separately.
- in step S220, spatial features of multiple scales of each video frame may be extracted based on each video frame and the partial transparency information corresponding to each video frame.
- each video frame and its corresponding transparency information map (referred to as "Trimap" in FIG. 3) may be connected to form a connected image, and spatial features of multiple scales of the connected image corresponding to each video frame may be extracted as the spatial features of multiple scales of that video frame.
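The connection of a frame with its trimap can be sketched as follows. The sketch assumes that "connecting" means channel-wise concatenation, so that an RGB frame and a single-channel trimap become a four-channel image; the source does not spell this out, and `concat_frame_and_trimap` is a hypothetical helper.

```python
def concat_frame_and_trimap(frame, trimap):
    """Append the trimap value as an extra channel to every pixel of the frame.

    frame:  H x W list of (r, g, b) tuples
    trimap: H x W list of scalar labels
    Returns an H x W list of (r, g, b, trimap) tuples.
    """
    same_height = len(frame) == len(trimap)
    same_width = all(len(fr) == len(tr) for fr, tr in zip(frame, trimap))
    if not (same_height and same_width):
        raise ValueError("frame and trimap must have the same height and width")
    return [[(*pixel, label) for pixel, label in zip(frame_row, trimap_row)]
            for frame_row, trimap_row in zip(frame, trimap)]

frame = [[(10, 20, 30), (40, 50, 60)]]
trimap = [[0, 128]]
connected = concat_frame_and_trimap(frame, trimap)
```

Feeding the connected image to the encoder lets every extracted feature depend on both the pixel colors and the known transparency labels.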
- the encoder adopts the ResNet-50 structure, which may include, for example, a 7x7 convolutional layer (which performs convolution and max-pooling operations) and several standard residual blocks (for example, four), with an overall downsampling stride of 32. However, the structure of the encoder is not limited to this, as long as it can extract spatial features of different scales of each video frame based on the video frame and its corresponding partial transparency information. After passing through the encoder, spatial features of different scales are obtained. Unlike traditional low-level image features, these features carry both low-level expressive ability and rich semantic information, which lays a good foundation for the subsequent reconstruction process.
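To give a sense of what "multiple scales" means with an overall stride of 32, the sketch below computes feature-map sizes for a given input resolution. The per-stage strides (4, 8, 16, 32) follow the standard ResNet-50 layout and are an assumption here; the source states only the final stride of 32 and "for example, four" residual blocks.

```python
import math

def feature_map_sizes(height, width, strides=(4, 8, 16, 32)):
    """Return the (height, width) of the feature map at each stride, rounding up."""
    return [(math.ceil(height / s), math.ceil(width / s)) for s in strides]

sizes = feature_map_sizes(512, 512)
```

For a 512x512 input this gives four maps ranging from 128x128 down to 16x16, one per scale handed to the decoder through the skip-layer connections.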
- when an image matting algorithm is applied to a video frame by frame, the Alpha predicted for each frame is relatively independent and lacks continuity and consistency, that is, the Alpha prediction accuracy is not high. This is because the algorithm processes video frames independently, without considering the connection between adjacent video frames, and so ignores the temporal information in the video.
- the present disclosure simultaneously sends multiple video frames of the video to the deep neural network model, extracts spatial features of multiple scales of each video frame based on each video frame and the partial transparency information, and fuses the spatial features of the same scale of adjacent video frames to generate a plurality of fused features of different scales, so that temporal information is encoded into the fused features; that is, the fused features include both spatial features and temporal features.
- video frames from t_{i-2} to t_{i+2} and their corresponding Trimaps are simultaneously input to the encoder, and the spatial features of different scales generated by the encoder are respectively input to the feature fusion module (denoted as ST-FAM in FIG. 3) for feature fusion.
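Selecting the frames from t_{i-2} to t_{i+2} around a center frame can be sketched as below. The clamping of out-of-range indices at the start and end of the video is an assumed boundary policy (the source does not describe one), and `frame_window` is a hypothetical helper.

```python
def frame_window(num_frames, center, radius=2):
    """Indices of the frames from center - radius to center + radius.

    Indices that fall outside the video are clamped to the first or last
    frame (an assumed boundary policy; the source does not specify one).
    """
    if not 0 <= center < num_frames:
        raise IndexError("center frame index out of range")
    return [min(max(i, 0), num_frames - 1)
            for i in range(center - radius, center + radius + 1)]

window = frame_window(num_frames=10, center=5)
```

Each index in the window selects one frame and its trimap; all of them are passed through the encoder together so that their same-scale features can be fused.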
- although FIG. 3 shows a particular number of ST-FAMs, the present disclosure does not limit the number of ST-FAMs, which may vary with the number of scales selected.
- the ST-FAMs in the decoder in FIG. 3 are used, from top to bottom, to fuse spatial features of different scales; for example, the first ST-FAM fuses the spatial features of the first scale, the second ST-FAM fuses those of the second scale, the third ST-FAM fuses those of the third scale, and the fourth ST-FAM fuses those of the fourth scale, where the first scale is smaller than the second scale, the second scale is smaller than the third scale, and the third scale is smaller than the fourth scale.
- the motion information of objects in the video can help the deep neural network model effectively distinguish the foreground from the background. Therefore, in the present disclosure, when the spatial features of the same scale of adjacent video frames of the video are fused to generate a plurality of fused features of different scales, the motion information between the adjacent video frames is first extracted, the spatial features of the same scale of adjacent frames are aligned according to the motion information, and the aligned spatial features of the same scale are then fused to generate the plurality of fused features of different scales. These operations make effective use of the motion information between video frames, which can further improve the accuracy of the model's prediction results.
- FIG. 4 is a schematic diagram illustrating an example of a feature fusion model of an exemplary embodiment of the present disclosure.
- the ST-FAM module includes two sub-modules: (i) a feature alignment sub-module, which compensates for the misalignment between adjacent frames caused by object movement; and (ii) a feature fusion sub-module, which fuses the spatial features of the same scale of adjacent frames to generate a global fused feature that is beneficial to alpha prediction.
- Such fused features contain temporal information between video frames.
- the feature alignment sub-module may extract motion information between adjacent frames, thereby aligning the same scale spatial features of adjacent frames.
- spatial features can be in the form of feature maps.
- the feature alignment sub-module can first merge the feature maps of the same scale of adjacent frames (for example, F_t, F_{t+n} and F_{t-n} in Figure 3), and then use a convolutional layer to predict, for the feature map at time t, the displacement Δp of each pixel p; this Δp is the motion vector of pixel p between frame t and frame t+1.
- based on the predicted displacement, a deformable convolution layer is used to align the features of frame t+1 to frame t. In this way, the features of the multiple frames in the interval [t-n, t+n] can be automatically aligned to frame t, and the aligned features of these video frames are then sent to the feature fusion sub-module for feature fusion.
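The idea of aligning the features of frame t+1 to frame t using a per-pixel displacement Δp can be sketched with a much simpler stand-in than a deformable convolution: sampling the next frame's feature map at p + Δp. The integer displacements, the border clamping, and the helper `warp_features` are all assumptions for illustration; in the described model the displacement is predicted by a convolutional layer and applied by a deformable convolution.

```python
def warp_features(feat_next, displacement):
    """Align a feature map of frame t+1 to frame t.

    feat_next:    H x W list of feature values from frame t+1
    displacement: H x W list of (dy, dx) integer motion vectors, one per
                  pixel p of frame t, pointing to where p moved in frame t+1
    Returns an H x W map whose value at p is feat_next sampled at p + dp,
    with sampling positions clamped to the map borders.
    """
    height, width = len(feat_next), len(feat_next[0])
    aligned = []
    for y in range(height):
        row = []
        for x in range(width):
            dy, dx = displacement[y][x]
            src_y = min(max(y + dy, 0), height - 1)
            src_x = min(max(x + dx, 0), width - 1)
            row.append(feat_next[src_y][src_x])
        aligned.append(row)
    return aligned

# A single strong feature moves one pixel to the right between t and t+1.
feat_t = [[0, 9, 0, 0]]
feat_next = [[0, 0, 9, 0]]
shift_right = [[(0, 1)] * 4]
aligned = warp_features(feat_next, shift_right)
```

After warping, the feature from frame t+1 sits at the same position as in frame t, so the two maps can be merged without mixing foreground and background responses.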
- the aligned spatial features of the same scale may be fused to generate a plurality of fused features of different scales by directly performing channel merging on the aligned spatial features of each scale.
- the merged features have the feature information of each frame, thereby helping to distinguish the foreground and background, while ensuring the stability of multi-frame prediction.
- however, directly channel-merging the spatial features of the same scale of multiple frames may introduce additional interfering information.
- alternatively, the aligned spatial features of the same scale may be fused to generate a plurality of fused features of different scales by channel-merging the aligned spatial features of each scale and fusing the channel-merged features using an attention mechanism.
- the feature channels can be first fused using the channel attention mechanism, and then the pixels within the same channel can be fused using the spatial attention mechanism.
- the channel attention weights can be computed using a global average pooling operation. These weights are then multiplied with the aligned features to select the channels that are useful for frame t.
- the spatial attention operation is used to increase the interaction between the pixels in the channel and increase the receptive field, thereby reducing the influence of interfering information.
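The channel attention step can be sketched as follows: each channel is reduced to a single number by global average pooling, turned into a weight, and multiplied back onto the channel. Using a plain sigmoid as the gate, with no learned layers in between, is a simplifying assumption; the source says only that the weights are computed using global average pooling. `channel_attention` is a hypothetical helper.

```python
import math

def channel_attention(channels):
    """Reweight channels by a gate computed from global average pooling.

    channels: list of C feature maps, each an H x W list of floats.
    Returns (weights, reweighted_channels).
    """
    weights = []
    for fmap in channels:
        values = [v for row in fmap for v in row]
        pooled = sum(values) / len(values)               # global average pooling
        weights.append(1.0 / (1.0 + math.exp(-pooled)))  # sigmoid gate in (0, 1)
    reweighted = [[[w * v for v in row] for row in fmap]
                  for w, fmap in zip(weights, channels)]
    return weights, reweighted

channels = [
    [[2.0, 2.0], [2.0, 2.0]],      # strongly activated channel
    [[-2.0, -2.0], [-2.0, -2.0]],  # weakly activated channel
]
weights, reweighted = channel_attention(channels)
```

Channels with a small weight contribute little to the fused feature, which is how interfering information from the merged frames is suppressed before the spatial attention step.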
- further feature extraction may be performed on a plurality of fused features of different scales to obtain new fused features.
- This can further increase the receptive field and obtain fusion features with stronger expressive ability.
- for example, a global convolution layer may be applied to encode the previously fused features, further increasing the receptive field and yielding fused features with stronger expressiveness.
- unknown transparency information of each video frame may be predicted based on the multiple fusion features of different scales.
- unknown transparency information for each video frame may be predicted based on the new fused features (ie, the above-mentioned new fused features obtained by performing further feature extraction).
- the fused features output by an ST-FAM are up-sampled and merged with the fused features output by the lower-layer ST-FAM, so that decoded features of the same scale as the original video frame are gradually reconstructed; the decoded features are finally sent to the prediction branch to generate the prediction result, that is, the unknown transparency information of the video frame.
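One decoding step, in which a coarser fused feature is up-sampled and merged with the next finer one, can be sketched as below. Nearest-neighbour up-sampling by a factor of two and element-wise addition as the merge are assumptions; the source does not say which up-sampling or merge operator is used, and both helpers are hypothetical.

```python
def upsample2x(fmap):
    """Nearest-neighbour 2x upsampling of an H x W map to 2H x 2W."""
    out = []
    for row in fmap:
        wide = [v for v in row for _ in range(2)]
        out.append(wide)
        out.append(list(wide))
    return out

def merge(coarse, fine):
    """Upsample the coarse map and add it element-wise to the finer map."""
    up = upsample2x(coarse)
    if len(up) != len(fine) or len(up[0]) != len(fine[0]):
        raise ValueError("fine map must be exactly twice the size of the coarse map")
    return [[a + b for a, b in zip(up_row, fine_row)]
            for up_row, fine_row in zip(up, fine)]

coarse = [[1, 2]]
fine = [[10, 10, 10, 10],
        [20, 20, 20, 20]]
merged = merge(coarse, fine)
```

Repeating this step from the smallest scale to the largest yields features at the resolution of the original frame, which are what the prediction branch consumes.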
- the original video frame may also be used to modify and further refine the prediction result to obtain the final prediction result.
- the output of the decoder is sent to the refinement module for refinement, and finally the prediction result is obtained.
- the video may be processed according to the predicted unknown transparency information.
- the target object in the video can be extracted according to the predicted unknown transparency information of each video frame.
- the extracted target objects can also be synthesized with other videos.
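Synthesizing the extracted target object with another video follows the compositing relation I = αF + (1-α)B from the description, applied per pixel with the predicted alpha and a new background. The sketch below uses single-channel values for brevity; the function and variable names are illustrative.

```python
def composite(foreground, background, alpha):
    """Blend per pixel: I = alpha * F + (1 - alpha) * B (single-channel values)."""
    return [[a * f + (1.0 - a) * b for f, b, a in zip(f_row, b_row, a_row)]
            for f_row, b_row, a_row in zip(foreground, background, alpha)]

foreground = [[200.0, 200.0, 200.0]]
new_background = [[0.0, 100.0, 50.0]]
alpha = [[1.0, 0.5, 0.0]]
result = composite(foreground, new_background, alpha)
```

Pixels with alpha 1 keep the target object, pixels with alpha 0 show the new background, and intermediate values blend the two, which is what makes soft edges such as hair look natural.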
- the video processing method according to the embodiment of the present disclosure has been described above with reference to FIGS. 2 to 4 .
- the video processing method of the present disclosure can produce Alpha prediction results with better continuity and consistency, thereby improving Alpha prediction accuracy.
- the architecture and operation of the deep neural network model, and the video processing method of the present disclosure, have been introduced above by taking an encoder-decoder structure model as an example. However, the deep neural network model according to the present disclosure is not limited to the encoder-decoder structure model; for example, it may also be a MobileNet network structure model, a deep residual network model, and so on.
- the encoder-decoder structure can be adjusted according to actual requirements to cope with different application scenarios.
- for example, because mobile applications have high requirements for speed and real-time performance, the encoder-decoder structure can be replaced with a lightweight network suitable for mobile devices, such as MobileNet;
- where higher accuracy is required, the encoder-decoder structure can be replaced with a deeper network with more expressive power, such as ResNet-101.
- FIG. 5 is a flowchart illustrating a method of training a deep neural network model according to another exemplary embodiment of the present disclosure.
- the method for training a deep neural network model may be executed by a terminal device, by a server, or by a terminal device and a server in cooperation.
- the method for training a deep neural network model may include steps S510-S550.
- in step S510, a training video and the full transparency information corresponding to each video frame in the training video are acquired.
- based on the training video and partial transparency information taken from the full transparency information, the deep neural network model is used to perform steps S520 to S540 to predict the unknown transparency information other than the partial transparency information.
- in step S520, spatial features of multiple scales of each video frame are extracted based on each video frame of the training video and the partial transparency information corresponding to each video frame.
- in step S530, the spatial features of the same scale of adjacent video frames of the training video are fused to generate a plurality of fused features of different scales.
- in step S540, the unknown transparency information, that is, the part of the full transparency information of each video frame other than the partial transparency information, is predicted based on the plurality of fused features of different scales.
- in step S550, the parameters of the deep neural network model are adjusted by comparing the predicted unknown transparency information with the transparency information other than the partial transparency information in the full transparency information.
- a pre-constructed loss function can be utilized when tuning the parameters of the deep neural network model.
- when the deep neural network model adopts a different network structure or model type, the loss function used differs accordingly.
- the present disclosure does not limit the structure or type of the deep neural network model or the loss function used, as long as the model can perform the operations described above.
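The comparison between the predicted unknown transparency and the corresponding ground truth can be sketched as a loss restricted to the unknown region of the trimap. The choice of mean absolute error, the label value 128 for unknown pixels, and the helper `unknown_region_l1` are assumptions; the source explicitly leaves the loss function open.

```python
def unknown_region_l1(pred_alpha, true_alpha, trimap, unknown=128):
    """Mean absolute error between predicted and true alpha over unknown pixels."""
    total, count = 0.0, 0
    for p_row, t_row, m_row in zip(pred_alpha, true_alpha, trimap):
        for p, t, m in zip(p_row, t_row, m_row):
            if m == unknown:
                total += abs(p - t)
                count += 1
    return total / count if count else 0.0

pred = [[0.0, 0.5, 1.0, 0.25]]
true = [[0.9, 0.75, 0.1, 0.5]]
trimap = [[0, 128, 255, 128]]
loss = unknown_region_l1(pred, true, trimap)
```

Errors in the definite foreground and background pixels are ignored because their transparency is already given; only the two unknown pixels contribute to the value that would drive the parameter update.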
- the operations performed by the deep neural network model during training are exactly the same as those performed during prediction, except that training data is used during training and real data to be predicted is used during prediction, and that, during training, the prediction result is compared with the ground-truth value to adjust the model parameters. Since the operations performed by the deep neural network model and the details involved in each operation have already been introduced in the description of the video processing method above, they are not repeated here; for the corresponding parts, reference may be made to the descriptions of FIGS. 2 to 4.
- in the training method, the deep neural network model extracts spatial features of multiple scales of each video frame based on each video frame and the partial transparency information corresponding to each video frame, and fuses the spatial features of the same scale of adjacent video frames of the video to generate a plurality of fused features of different scales. The temporal information between video frames is therefore utilized, so that the trained deep neural network model can provide more accurate prediction results.
- FIG. 6 is a block diagram illustrating a video processing apparatus of an exemplary embodiment of the present disclosure.
- the video processing apparatus 600 may include a data acquisition unit 601 , a prediction unit 602 and a processing unit 603 .
- the data acquisition unit 601 may acquire a video and partial transparency information corresponding to each video frame in the video.
- the prediction unit 602 may extract spatial features of multiple scales of each video frame based on each video frame and the partial transparency information, fuse the spatial features of the same scale of adjacent video frames of the video to generate a plurality of fused features of different scales, and predict the unknown transparency information of each video frame based on the plurality of fused features of different scales.
- the prediction unit 602 may predict unknown transparency information other than the partial transparency information of each video frame using a deep neural network model based on the video and the partial transparency information.
- the processing unit 603 may process the video according to the predicted unknown transparency information.
- the processing unit may extract the target object in the video according to the predicted unknown transparency information of each video frame.
- since the video processing method shown in FIG. 2 can be performed by the video processing apparatus 600 shown in FIG. 6, with the data acquisition unit 601, the prediction unit 602 and the processing unit 603 respectively performing the operations corresponding to step S210, steps S220 to S240, and step S250 in FIG. 2, reference may be made to the corresponding descriptions of FIGS. 2 to 4 for any relevant details of the operations performed by the units in FIG. 6, which are not repeated here.
- although the video processing apparatus 600 is described above as being divided into units that respectively perform the corresponding processing, it is clear to those skilled in the art that the processing performed by the above units may also be performed by the video processing apparatus 600 without any specific unit division or without clear demarcation between the units.
- the video processing apparatus 600 may further include other units, such as a data processing unit, a storage unit, and the like.
- FIG. 7 is a block diagram illustrating an apparatus for training a deep neural network model (hereinafter, for convenience of description, it is simply referred to as a “training apparatus”) according to another exemplary embodiment of the present disclosure.
- a training apparatus 700 may include a training data acquisition unit 701 and a model training unit 702 .
- the training data acquisition unit 701 may acquire a training video and the full transparency information corresponding to each video frame in the training video.
- the model training unit 702 may use a deep neural network model to perform the following operations, based on the training video and partial transparency information in the full transparency information, to predict unknown transparency information other than the partial transparency information: extracting spatial features of multiple scales of each video frame based on each video frame of the training video and the partial transparency information corresponding to each video frame; fusing the spatial features of the same scale of adjacent video frames of the training video to generate a plurality of fused features of different scales; and predicting, based on the plurality of fused features of different scales, the unknown transparency information other than the partial transparency information in the full transparency information of each video frame. The model training unit 702 may then adjust the parameters of the deep neural network model by comparing the predicted unknown transparency information with the transparency information other than the partial transparency information in the full transparency information.
- the operations performed by the deep neural network model during training are exactly the same as those performed during prediction, except that training data is used during training and the real video to be predicted is used during prediction. Therefore, for the operations performed by the deep neural network model and the details involved in each operation, reference may be made to the corresponding descriptions of FIGS. 2 to 4, which are not repeated here.
- although the training apparatus 700 is described above as being divided into units that respectively perform the corresponding processing, it is clear to those skilled in the art that the processing performed by the above units may also be performed by the training apparatus 700 without any specific unit division or without clear demarcation between the units.
- the training device 700 may also include other units, such as a data processing unit, a storage unit, and the like.
- FIG. 8 is a block diagram of an electronic device according to an exemplary embodiment of the present disclosure.
- an electronic device 800 may include at least one memory 801 and at least one processor 802. The at least one memory stores a set of computer-executable instructions which, when executed by the at least one processor, perform the video processing method or the method for training a deep neural network model according to an embodiment of the present disclosure.
- the electronic device may be a PC computer, a tablet device, a personal digital assistant, a smart phone, or other devices capable of executing the above set of instructions.
- the electronic device does not have to be a single electronic device, but can also be any set of devices or circuits that can individually or jointly execute the above-mentioned instructions (or instruction sets).
- the electronic device may also be part of an integrated control system or system manager, or may be configured as a portable electronic device that interfaces locally or remotely (eg, via wireless transmission).
- a processor may include a central processing unit (CPU), a graphics processing unit (GPU), a programmable logic device, a special purpose processor system, a microcontroller, or a microprocessor.
- processors may also include analog processors, digital processors, microprocessors, multi-core processors, processor arrays, network processors, and the like.
- the processor may execute instructions or code stored in memory, which may also store data. Instructions and data may also be sent and received over a network via a network interface device, which may employ any known transport protocol.
- the memory may be integrated with the processor, eg, RAM or flash memory arranged within an integrated circuit microprocessor or the like. Additionally, the memory may comprise a separate device such as an external disk drive, a storage array, or any other storage device that may be used by a database system.
- the memory and the processor may be operatively coupled, or may communicate with each other, eg, through I/O ports, network connections, etc., to enable the processor to read files stored in the memory.
- the electronic device may also include a video display (such as a liquid crystal display) and a user interaction interface (such as a keyboard, mouse, touch input device, etc.). All components of the electronic device can be connected to each other via a bus and/or a network.
- a computer-readable storage medium storing instructions in a non-volatile manner, wherein, in response to the instructions being executed by at least one processor, the at least one processor is caused to perform the video processing method or the method for training a deep neural network model according to exemplary embodiments of the present disclosure.
- Examples of the computer-readable storage medium herein include: Read Only Memory (ROM), Random Access Programmable Read Only Memory (PROM), Electrically Erasable Programmable Read Only Memory (EEPROM), Random Access Memory (RAM), Dynamic Random Access Memory (DRAM), Static Random Access Memory (SRAM), flash memory, non-volatile memory, CD-ROM, CD-R, CD+R, CD-RW, CD+RW, DVD-ROM, DVD-R, DVD+R, DVD-RW, DVD+RW, DVD-RAM, BD-ROM, BD-R, BD-R LTH, BD-RE, Blu-ray or optical disc storage, Hard Disk Drive (HDD), Solid State Drive (SSD), card memory (such as a multimedia card, Secure Digital (SD) card, or Extreme Digital (XD) card), magnetic tape, floppy disk, magneto-optical data storage device, optical data storage device, hard disk, solid state disk, and any other apparatuses configured to store, in a non-transitory manner, a
- the computer program in the above-mentioned computer readable storage medium can be executed in an environment deployed in a computer device such as a client, a host, a proxy device, a server, etc.
- the computer program and any associated data, data files and data structures are distributed over networked computer systems so that the computer programs and any associated data, data files and data structures are stored, accessed and executed in a distributed fashion by one or more processors or computers.
- a computer program product, the instructions in which can be executed by at least one processor in an electronic device to perform the video processing method or the method for training a deep neural network model according to an exemplary embodiment of the present disclosure.
Landscapes
- Engineering & Computer Science (AREA)
- Multimedia (AREA)
- Signal Processing (AREA)
- Physics & Mathematics (AREA)
- Health & Medical Sciences (AREA)
- Theoretical Computer Science (AREA)
- General Health & Medical Sciences (AREA)
- Biomedical Technology (AREA)
- Data Mining & Analysis (AREA)
- Computational Linguistics (AREA)
- Biophysics (AREA)
- Evolutionary Computation (AREA)
- Artificial Intelligence (AREA)
- Molecular Biology (AREA)
- Computing Systems (AREA)
- General Engineering & Computer Science (AREA)
- General Physics & Mathematics (AREA)
- Mathematical Physics (AREA)
- Software Systems (AREA)
- Life Sciences & Earth Sciences (AREA)
- Image Analysis (AREA)
Abstract
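A video processing method and apparatus, an electronic device, and a storage medium. The video processing method includes: acquiring a video and partial transparency information corresponding to each video frame in the video; extracting spatial features of multiple scales of each video frame based on each video frame and the partial transparency information; fusing the spatial features of the same scale of adjacent video frames of the video to generate a plurality of fused features of different scales; predicting unknown transparency information of each video frame based on the plurality of fused features of different scales; and processing the video according to the predicted unknown transparency information.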
Description
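Cross-Reference to Related Applications
This application claims priority to Chinese Patent Application No. 202110468173.6, filed on April 28, 2021, the disclosure of which is incorporated herein by reference in its entirety as part of this application.
The present disclosure relates to the field of image processing, and in particular to a video processing method and apparatus, an electronic device, and a storage medium.
Matting is one of the important techniques in the field of image processing. Traditional matting techniques use low-level features of an image, such as color or structure, to separate the foreground; however, when applied to complex scenes, the matting result is constrained by the limited expressive ability of low-level features, and the foreground cannot be separated accurately. With the development of deep learning, deep-learning-based image matting has become the mainstream image matting technique. However, unlike the increasingly mature deep image matting techniques, deep video matting has not been effectively explored because of the lack of large-scale deep-learning video matting datasets.
Generally, one solution for deep video matting is to apply a deep image matting technique to video data frame by frame, thereby achieving video matting.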
Summary
The present disclosure provides a video processing method and apparatus, an electronic device, and a storage medium.
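According to a first aspect of embodiments of the present disclosure, there is provided a video processing method, including: acquiring a video and partial transparency information corresponding to each video frame in the video; extracting spatial features of multiple scales of each video frame based on each video frame and the partial transparency information; fusing the spatial features of the same scale of adjacent video frames of the video to generate a plurality of fused features of different scales; predicting unknown transparency information of each video frame based on the plurality of fused features of different scales; and processing the video according to the predicted unknown transparency information.
In some embodiments, the fusing of the spatial features of the same scale of adjacent video frames of the video to generate a plurality of fused features of different scales includes: extracting motion information between adjacent video frames, and aligning the spatial features of the same scale of adjacent frames according to the motion information; and fusing the aligned spatial features of the same scale to generate a plurality of fused features of different scales.
In some embodiments, the fusing of the aligned spatial features of the same scale to generate a plurality of fused features of different scales includes: fusing the aligned spatial features of the same scale to generate a plurality of fused features of different scales by directly channel-merging the aligned spatial features of each scale; or fusing the aligned spatial features of the same scale to generate a plurality of fused features of different scales by channel-merging the aligned spatial features of each scale and fusing the channel-merged features using an attention mechanism.
In some embodiments, the fusing of the channel-merged features using an attention mechanism includes: fusing feature channels using a channel attention mechanism, and fusing pixels within the same channel using a spatial attention mechanism.
In some embodiments, the fusing of the aligned spatial features of the same scale to generate a plurality of fused features of different scales further includes: performing further feature extraction on the plurality of fused features of different scales to obtain new fused features; wherein the predicting of unknown transparency information of each video frame based on the plurality of fused features of different scales includes: predicting the unknown transparency information of each video frame based on the new fused features.
In some embodiments, the video processing method further includes: predicting, based on the video and the partial transparency information, unknown transparency information of each video frame other than the partial transparency information using a deep neural network model, wherein the deep neural network model is an encoder-decoder structure model, skip-layer connections exist between the encoder and the decoder of the encoder-decoder structure model, and the decoder includes a feature fusion module and a prediction branch, and wherein the video processing method further includes: extracting the spatial features of multiple scales of each video frame using the encoder, fusing the spatial features of the same scale of adjacent video frames of the video using the feature fusion module to generate a plurality of fused features of different scales, and predicting the unknown transparency information of each video frame based on the plurality of fused features of different scales using the prediction branch.
In some embodiments, in the video processing method, the skip-layer connections indicate that the spatial features of different scales generated by the encoder are respectively input to the feature fusion modules of the decoder that fuse features of the corresponding scales.
In some embodiments, the extracting of spatial features of multiple scales of each video frame based on each video frame and the partial transparency information includes: connecting each video frame with the transparency information map corresponding to that video frame to form a connected image; and extracting spatial features of multiple scales of the connected image corresponding to each video frame as the spatial features of multiple scales of that video frame.
In some embodiments, the processing of the video according to the predicted unknown transparency information includes: extracting a target object in the video according to the predicted unknown transparency information of each video frame.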
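According to a second aspect of embodiments of the present disclosure, there is provided a method of training a deep neural network model, including: acquiring a training video and full transparency information corresponding to each video frame in the training video; based on the training video and partial transparency information in the full transparency information, performing the following operations using a deep neural network model to predict unknown transparency information other than the partial transparency information: extracting spatial features of multiple scales of each video frame based on each video frame of the training video and the partial transparency information corresponding to each video frame, fusing the spatial features of the same scale of adjacent video frames of the training video to generate a plurality of fused features of different scales, and predicting, based on the plurality of fused features of different scales, the unknown transparency information other than the partial transparency information in the full transparency information of each video frame; and adjusting parameters of the deep neural network model by comparing the predicted unknown transparency information with the transparency information other than the partial transparency information in the full transparency information.
In some embodiments, the fusing of the spatial features of the same scale of adjacent video frames of the training video to generate a plurality of fused features of different scales includes: extracting motion information between adjacent video frames, and aligning the spatial features of the same scale of adjacent frames according to the motion information; and fusing the aligned spatial features of the same scale to generate a plurality of fused features of different scales.
In some embodiments, the fusing of the aligned spatial features of the same scale to generate a plurality of fused features of different scales includes: fusing the aligned spatial features of the same scale to generate a plurality of fused features of different scales by directly channel-merging the aligned spatial features of each scale; or fusing the aligned spatial features of the same scale to generate a plurality of fused features of different scales by channel-merging the aligned spatial features of each scale and fusing the channel-merged features using an attention mechanism.
In some embodiments, the fusing of the channel-merged features using an attention mechanism includes: fusing feature channels using a channel attention mechanism, and fusing pixels within the same channel using a spatial attention mechanism.
In some embodiments, the fusing of the aligned spatial features of the same scale to generate a plurality of fused features of different scales further includes: performing further feature extraction on the plurality of fused features of different scales to obtain new fused features; wherein predicting unknown transparency information of each video frame based on the plurality of fused features of different scales includes: predicting the unknown transparency information of each video frame based on the new fused features.
In some embodiments, the deep neural network model is an encoder-decoder structure model, skip-layer connections exist between the encoder and the decoder of the encoder-decoder structure model, and the decoder includes a feature fusion module and a prediction branch, wherein the method further includes: extracting the spatial features of multiple scales of each video frame using the encoder, fusing the spatial features of the same scale of adjacent video frames of the training video using the feature fusion module to generate a plurality of fused features of different scales, and predicting the unknown transparency information of each video frame based on the plurality of fused features of different scales using the prediction branch.
In some embodiments, in the method, the skip-layer connections indicate that the spatial features of different scales generated by the encoder are respectively input to the feature fusion modules of the decoder that fuse features of the corresponding scales.
In some embodiments, extracting spatial features of multiple scales of each video frame based on each video frame and the partial transparency information includes: connecting each video frame with the transparency information map corresponding to that video frame to form a connected image; and extracting spatial features of multiple scales of the connected image corresponding to each video frame as the spatial features of multiple scales of that video frame.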
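According to a third aspect of embodiments of the present disclosure, there is provided a video processing apparatus, including: a data acquisition unit configured to acquire a video and partial transparency information corresponding to each video frame in the video; a prediction unit configured to extract spatial features of multiple scales of each video frame based on each video frame and the partial transparency information, fuse the spatial features of the same scale of adjacent video frames of the video to generate a plurality of fused features of different scales, and predict unknown transparency information of each video frame based on the plurality of fused features of different scales; and a processing unit configured to process the video according to the predicted unknown transparency information.
In some embodiments, the fusing of the spatial features of the same scale of adjacent video frames of the video to generate a plurality of fused features of different scales includes: extracting motion information between adjacent video frames, and aligning the spatial features of the same scale of adjacent frames according to the motion information; and fusing the aligned spatial features of the same scale to generate a plurality of fused features of different scales.
In some embodiments, the fusing of the aligned spatial features of the same scale to generate a plurality of fused features of different scales includes: fusing the aligned spatial features of the same scale to generate a plurality of fused features of different scales by directly channel-merging the aligned spatial features of each scale; or fusing the aligned spatial features of the same scale to generate a plurality of fused features of different scales by channel-merging the aligned spatial features of each scale and fusing the channel-merged features using an attention mechanism.
In some embodiments, the fusing of the channel-merged features using an attention mechanism includes: fusing feature channels using a channel attention mechanism, and fusing pixels within the same channel using a spatial attention mechanism.
In some embodiments, the fusing of the aligned spatial features of the same scale to generate a plurality of fused features of different scales further includes: performing further feature extraction on the plurality of fused features of different scales to obtain new fused features; wherein the predicting of unknown transparency information of each video frame based on the plurality of fused features of different scales includes: predicting the unknown transparency information of each video frame based on the new fused features.
In some embodiments, the prediction unit predicts, based on the video and the partial transparency information, unknown transparency information of each video frame other than the partial transparency information using a deep neural network model, wherein the deep neural network model is an encoder-decoder structure model, skip-layer connections exist between the encoder and the decoder of the encoder-decoder structure model, and the decoder includes a feature fusion module and a prediction branch, and wherein the encoder is used to extract the spatial features of multiple scales of each video frame, the feature fusion module is used to fuse the spatial features of the same scale of adjacent video frames of the video to generate a plurality of fused features of different scales, and the prediction branch is used to predict the unknown transparency information of each video frame based on the plurality of fused features of different scales.
In some embodiments, the skip-layer connections indicate that the spatial features of different scales generated by the encoder are respectively input to the feature fusion modules of the decoder that fuse features of the corresponding scales.
In some embodiments, the extracting of spatial features of multiple scales of each video frame based on each video frame and the partial transparency information includes: connecting each video frame with the transparency information map corresponding to that video frame to form a connected image; and extracting spatial features of multiple scales of the connected image corresponding to each video frame as the spatial features of multiple scales of that video frame.
In some embodiments, the processing unit is configured to extract a target object in the video according to the predicted unknown transparency information of each video frame.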
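According to a fourth aspect of embodiments of the present disclosure, there is provided an apparatus for training a deep neural network model, including: a training data acquisition unit configured to acquire a training video and full transparency information corresponding to each video frame in the training video; and a model training unit configured to perform the following operations using a deep neural network model, based on the training video and partial transparency information in the full transparency information, to predict unknown transparency information other than the partial transparency information: extracting spatial features of multiple scales of each video frame based on each video frame and the partial transparency information corresponding to each video frame, fusing the spatial features of the same scale of adjacent video frames of the training video to generate a plurality of fused features of different scales, and predicting, based on the plurality of fused features of different scales, the unknown transparency information other than the partial transparency information in the full transparency information of each video frame; and to adjust parameters of the deep neural network model by comparing the predicted unknown transparency information with the transparency information other than the partial transparency information in the full transparency information.
In some embodiments, the fusing of the spatial features of the same scale of adjacent video frames of the video to generate a plurality of fused features of different scales includes: extracting motion information between adjacent video frames, and aligning the spatial features of the same scale of adjacent frames according to the motion information; and fusing the aligned spatial features of the same scale to generate a plurality of fused features of different scales.
In some embodiments, the fusing of the aligned spatial features of the same scale to generate a plurality of fused features of different scales includes: fusing the aligned spatial features of the same scale to generate a plurality of fused features of different scales by directly channel-merging the aligned spatial features of each scale; or fusing the aligned spatial features of the same scale to generate a plurality of fused features of different scales by channel-merging the aligned spatial features of each scale and fusing the channel-merged features using an attention mechanism.
In some embodiments, the fusing of the channel-merged features using an attention mechanism includes: fusing feature channels using a channel attention mechanism, and fusing pixels within the same channel using a spatial attention mechanism.
In some embodiments, the fusing of the aligned spatial features of the same scale to generate a plurality of fused features of different scales further includes: performing further feature extraction on the plurality of fused features of different scales to obtain new fused features; wherein predicting unknown transparency information of each video frame based on the plurality of fused features of different scales includes: predicting the unknown transparency information of each video frame based on the new fused features.
In some embodiments, the deep neural network model is an encoder-decoder structure model, skip-layer connections exist between the encoder and the decoder of the encoder-decoder structure model, and the decoder includes a feature fusion module and a prediction branch, wherein the model training unit is further configured to: extract the spatial features of multiple scales of each video frame using the encoder, fuse the spatial features of the same scale of adjacent video frames of the training video using the feature fusion module to generate a plurality of fused features of different scales, and predict the unknown transparency information of each video frame based on the plurality of fused features of different scales using the prediction branch.
In some embodiments, the model training unit is further configured such that the skip-layer connections indicate that the spatial features of different scales generated by the encoder are respectively input to the feature fusion modules of the decoder that fuse features of the corresponding scales.
In some embodiments, extracting spatial features of multiple scales of each video frame based on each video frame and the partial transparency information includes: connecting each video frame with the transparency information map corresponding to that video frame to form a connected image; and extracting spatial features of multiple scales of the connected image corresponding to each video frame as the spatial features of multiple scales of that video frame.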
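According to a fifth aspect of embodiments of the present disclosure, there is provided an electronic device, including: at least one processor; and at least one memory storing computer-executable instructions, wherein the computer-executable instructions, when run by the at least one processor, cause the at least one processor to perform the video processing method or the method of training a deep neural network model described above.
According to a sixth aspect of embodiments of the present disclosure, there is provided a computer-readable storage medium storing instructions in a non-volatile manner, wherein, in response to the instructions being run by at least one processor, the at least one processor is caused to perform the video processing method or the method of training a deep neural network model described above.
According to a seventh aspect of embodiments of the present disclosure, there is provided a computer program product including computer instructions, wherein, in response to the computer instructions being executed by a processor, the video processing method or the method of training a deep neural network model described above is implemented.
By fusing the spatial features of the same scale of adjacent video frames after extracting the spatial features of multiple scales of each video frame, the embodiments of the present disclosure make use of the temporal information between adjacent video frames when predicting transparency information, thereby improving the continuity and consistency of the predicted transparency information, that is, improving the prediction accuracy of the transparency information.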
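FIG. 1 is an exemplary system architecture to which exemplary embodiments of the present disclosure may be applied;
FIG. 2 is a flowchart of a video processing method of an exemplary embodiment of the present disclosure;
FIG. 3 is a schematic diagram illustrating an example of a deep neural network model of an exemplary embodiment of the present disclosure;
FIG. 4 is a schematic diagram illustrating an example of a feature fusion model of an exemplary embodiment of the present disclosure;
FIG. 5 is a flowchart illustrating a method of training a deep neural network model according to another exemplary embodiment of the present disclosure;
FIG. 6 is a block diagram illustrating a video processing apparatus of an exemplary embodiment of the present disclosure;
FIG. 7 is a block diagram illustrating an apparatus for training a deep neural network model according to another exemplary embodiment of the present disclosure;
FIG. 8 is a block diagram of an electronic device according to an exemplary embodiment of the present disclosure.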
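It should be noted that the terms "first", "second" and the like in the specification, the claims, and the above drawings of the present disclosure are used to distinguish similar objects and are not necessarily used to describe a specific order or sequence. It should be understood that data so used may be interchanged where appropriate, so that the embodiments of the present disclosure described herein can be implemented in orders other than those illustrated or described herein. The implementations described in the following embodiments do not represent all implementations consistent with the present disclosure. Rather, they are merely examples of apparatuses and methods consistent with some aspects of the present disclosure as recited in the appended claims.
It should be noted here that "at least one of several items" in the present disclosure covers three parallel cases: "any one of the several items", "a combination of any number of the several items", and "all of the several items". For example, "including at least one of A and B" covers the following three parallel cases: (1) including A; (2) including B; (3) including A and B. As another example, "performing at least one of step one and step two" covers the following three parallel cases: (1) performing step one; (2) performing step two; (3) performing step one and step two.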
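FIG. 1 shows an exemplary system architecture 100 to which exemplary embodiments of the present disclosure may be applied.
As shown in FIG. 1, the system architecture 100 may include terminal devices 101, 102, and 103, a network 104, and a server 105. The network 104 is the medium that provides communication links between the terminal devices 101, 102, 103 and the server 105. The network 104 may include various connection types, such as wired or wireless communication links or fiber-optic cables. A user may use the terminal devices 101, 102, 103 to interact with the server 105 through the network 104 in order to receive or send messages (e.g., video data upload requests and video data download requests). Various communication client applications may be installed on the terminal devices 101, 102, 103, such as video recording software, video players, video editing software, instant messaging tools, email clients, and social platform software. The terminal devices 101, 102, 103 may be hardware or software. When they are hardware, they may be any of various electronic devices that have a display and can play, record, and edit audio and video, including but not limited to smartphones, tablet computers, laptop computers, and desktop computers. When they are software, they may be installed in the electronic devices listed above and may be implemented either as multiple pieces of software or software modules (e.g., to provide distributed services) or as a single piece of software or software module. No specific limitation is imposed here.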
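The terminal devices 101, 102, 103 may be equipped with an image capture device (e.g., a camera) to capture video data. In practice, the smallest visual unit of a video is the frame. Each frame is a static image, and a temporally continuous sequence of frames combined together forms a moving video. In addition, the terminal devices 101, 102, 103 may be equipped with a component that converts electrical signals into sound (e.g., a speaker) to play sound, and with a device that converts analog audio signals into digital audio signals (e.g., a microphone) to capture sound.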
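The server 105 may be a server that provides various services, for example a back-end server that supports the multimedia applications installed on the terminal devices 101, 102, 103. The back-end server may parse, store, and otherwise process received data such as audio and video data upload requests, and may also receive audio and video data download requests sent by the terminal devices 101, 102, 103 and return the audio and video data indicated by those requests to the terminal devices 101, 102, 103.
It should be noted that the server may be hardware or software. When the server is hardware, it may be implemented as a distributed server cluster composed of multiple servers or as a single server. When the server is software, it may be implemented as multiple pieces of software or software modules (e.g., to provide distributed services) or as a single piece of software or software module. No specific limitation is imposed here.
It should be noted that the video processing method provided by the embodiments of the present disclosure may be performed by a terminal device, by a server, or by a terminal device and a server in cooperation. Accordingly, the video processing apparatus may be located in the terminal device, in the server, or in both.
It should be understood that the numbers of terminal devices, networks, and servers in FIG. 1 are merely illustrative. Any number of terminal devices, networks, and servers may be used as the implementation requires, and the present disclosure places no restriction on this.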
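FIG. 2 is a flowchart of a video processing method according to an exemplary embodiment of the present disclosure. The video processing method may be performed by a terminal device, by a server, or by a terminal device and a server in cooperation. The video processing method may include steps S210 to S250.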
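Processing a video may consist of matting the video. Matting means separating a specific foreground object (a person, an animal, etc.) from the background of an original picture so that it becomes a separate layer, in preparation for later image compositing. The matting problem can be defined by the formula I=αF+(1-α)B: a picture (denoted I) is a weighted sum of a foreground layer (denoted F) and a background layer (denoted B), where the weight, also called transparency or alpha (denoted α), is the variable to be solved for. Because the specific values of the foreground and background layers cannot be known from a single given picture, estimating alpha is an ill-posed problem, which means it has no unique solution. To constrain the solution space, additional conditions are usually supplied, such as a specified foreground region, so that the matting problem becomes solvable.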
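The compositing formula above can be illustrated with a minimal sketch. The function names `composite` and `recover_alpha` are hypothetical and are not part of the disclosure; the sketch works on a single scalar pixel value and shows why alpha cannot be recovered from I alone.

```python
def composite(alpha, foreground, background):
    """Blend one pixel value as I = alpha*F + (1 - alpha)*B."""
    if not 0.0 <= alpha <= 1.0:
        raise ValueError("alpha must lie in [0, 1]")
    return alpha * foreground + (1.0 - alpha) * background


def recover_alpha(pixel, foreground, background):
    """Invert the formula. This only works when F and B are already known
    and differ, which is exactly what a single picture does not provide."""
    if foreground == background:
        raise ValueError("alpha is undetermined when F equals B")
    return (pixel - background) / (foreground - background)


# Two different (alpha, F) pairs explain the same observed pixel I = 125,
# which is why the problem is ill-posed without extra constraints.
same_pixel_a = composite(0.25, 200.0, 100.0)
same_pixel_b = composite(0.5, 150.0, 100.0)
```

Both calls yield the same observed value, so an observer who sees only that value cannot tell which alpha produced it; a trimap or similar hint narrows the candidates.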
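Referring to FIG. 2, in step S210, a video and partial transparency information corresponding to each video frame in the video are acquired. Here, transparency is the alpha mentioned above. As an example, the partial transparency information may be a transparency information map that includes a definite foreground region, a definite background region, and an unknown region, i.e., a trimap; it is not limited to this, however, and may take any data form that reflects partial transparency information. In addition, the video may be acquired in response to a user request, and the partial transparency information corresponding to the video frames may be acquired from user input (e.g., input in which the user designates part of the foreground region and the background region). The way of acquiring the partial transparency information is not limited to the above; for example, it may also be obtained automatically through machine analysis without any user input. In other words, the present disclosure places no restriction on how the video and the partial transparency information are acquired.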
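As an illustration of the trimap data form, the following sketch derives a trimap from a full alpha matte, which is one way partial transparency information could be prepared from full transparency information. The encoding 0 / 128 / 255 for background / unknown / foreground is a common convention assumed here, not something the disclosure specifies, and `trimap_from_alpha` is a hypothetical name.

```python
BACKGROUND, UNKNOWN, FOREGROUND = 0, 128, 255  # assumed encoding


def trimap_from_alpha(alpha):
    """Map an H x W alpha matte (values in [0, 1]) to a trimap.

    Pixels with alpha exactly 1 are definite foreground, pixels with
    alpha exactly 0 are definite background, everything else is unknown.
    """
    trimap = []
    for row in alpha:
        out_row = []
        for a in row:
            if a >= 1.0:
                out_row.append(FOREGROUND)
            elif a <= 0.0:
                out_row.append(BACKGROUND)
            else:
                out_row.append(UNKNOWN)
        trimap.append(out_row)
    return trimap
```

The unknown pixels are the ones whose alpha the model has to predict; the other two classes are the "partial transparency information" supplied as input.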
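According to an exemplary embodiment, a deep neural network model may be used, based on the video and the partial transparency information, to predict the unknown transparency information of each video frame other than the partial transparency information.
The present disclosure uses a pre-trained deep neural network model to predict the unknown transparency information. The operations performed by the deep neural network model of the embodiments of the present disclosure are described next.
In some embodiments, in step S220, spatial features of multiple scales may be extracted for each video frame based on each video frame and the partial transparency information; in step S230, the same-scale spatial features of adjacent video frames of the video may be fused to produce fused features of multiple different scales; in step S240, the unknown transparency information of each video frame may be predicted based on the fused features of multiple different scales; and in step S250, the video may be processed according to the predicted unknown transparency information.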
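As an example, the deep neural network model according to the present disclosure may be an encoder-decoder model, a MobileNet model, or a deep residual network model, but is not limited to these. For ease of description, the architecture and operation of the deep neural network model are introduced below using an encoder-decoder model as the example.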
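FIG. 3 is a schematic diagram showing an example of a deep neural network model according to an exemplary embodiment of the present disclosure. The encoder-decoder model shown in FIG. 3 includes an encoder and a decoder, and the decoder includes feature fusion modules (denoted ST-FAM in FIG. 3) and a prediction branch (not shown). In the example of FIG. 3, the encoder extracts the spatial features of multiple scales of each video frame, the feature fusion modules fuse the same-scale spatial features of adjacent video frames of the video to produce fused features of multiple different scales, and the prediction branch predicts the unknown transparency information of each video frame based on the fused features of multiple different scales.
In addition, skip connections exist between the encoder and the decoder of the encoder-decoder model. Here, the skip connections indicate that the spatial features of different scales produced by the encoder are respectively input to the feature fusion module of the decoder that fuses features of the corresponding scale. In other words, each convolutional layer in the encoder that produces spatial features of a given scale is connected to the convolutional layer in the decoder that fuses features of the corresponding scale. These corresponding connections are the skip connections between the encoder and the decoder, and they make it convenient for the decoder to fuse the spatial features of each scale separately.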
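Next, steps S220 to S240 are described further with reference to FIG. 3, again taking an encoder-decoder model as the example of the deep neural network model.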
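After the video and the partial transparency information corresponding to each video frame in the video have been acquired, in step S220 spatial features of multiple scales may be extracted for each video frame based on each video frame and the partial transparency information corresponding to it. For example, each video frame may first be concatenated with the transparency information map corresponding to it (called the "Trimap" in FIG. 3) to form a concatenated image. Subsequently, as part of the same step, spatial features of multiple scales may be extracted from the concatenated image corresponding to each video frame, as the spatial features of multiple scales of that video frame. As shown in FIG. 3, video frames t_{i-2} to t_{i+2} are each concatenated with their corresponding Trimap, and the concatenated images are then input to the encoder. As an example, the encoder uses a ResNet-50 structure, which may include, for example, one 7x7 convolutional layer (the convolutional layer may perform a convolution operation and a max pooling operation) and several standard residual blocks (e.g., 4), with a downsampling stride of 32. The structure of the encoder is not limited to this, however, as long as it can extract spatial features of different scales for each video frame based on the video frame and its corresponding partial transparency information. After the encoder, spatial features of the different scales are obtained. Unlike traditional low-level image features, these features not only have low-level expressive power but also carry rich semantic information, which lays a good foundation for the subsequent reconstruction process.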
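The concatenation step and the resulting feature-map sizes can be sketched as follows. This is only an illustration: the function names are hypothetical, and the per-stage strides (4, 8, 16, 32) are the values usually associated with the four residual stages of ResNet-50, assumed here because the text only states the overall downsampling stride of 32.

```python
def concat_frame_and_trimap(frame, trimap):
    """Append the trimap as a 4th channel to an H x W x 3 frame.

    frame:  H x W list of [r, g, b] pixels.
    trimap: H x W list of scalar labels.
    """
    return [
        [pixel + [label] for pixel, label in zip(frame_row, trimap_row)]
        for frame_row, trimap_row in zip(frame, trimap)
    ]


def feature_map_sizes(height, width, strides=(4, 8, 16, 32)):
    """Spatial size of the encoder output at each assumed stage stride.

    Uses ceiling division, matching convolutions padded so that the
    output size is ceil(input / stride).
    """
    return [(-(-height // s), -(-width // s)) for s in strides]


sizes = feature_map_sizes(256, 256)
```

For a 256 x 256 input this gives four feature maps, one per scale, and each of them would be routed over a skip connection to the fusion module for that scale.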
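When an image matting algorithm is applied to video frames independently, the predicted alpha values tend to be relatively independent of one another and lack continuity and consistency, i.e., the alpha prediction is not very accurate. This is because applying an image matting algorithm to each frame independently does not consider the relationship between adjacent video frames and ignores the temporal information in the video. To make use of the temporal information in the video, the present disclosure feeds multiple video frames of the video into the deep neural network model at the same time, extracts spatial features of multiple scales for each video frame based on each video frame and the partial transparency information, and fuses the same-scale spatial features of adjacent video frames of the video to produce fused features of multiple different scales. Temporal information is thereby encoded into the fused features, so that the fused features contain both spatial features and temporal features. For example, as shown in FIG. 3, video frames t_{i-2} to t_{i+2} and their corresponding Trimaps are input to the encoder at the same time, and the spatial features of different scales produced by the encoder are respectively input to the feature fusion modules (denoted ST-FAM in FIG. 3) for feature fusion. Although FIG. 3 shows four ST-FAMs, the present disclosure does not limit the number of ST-FAMs; according to exemplary embodiments of the present disclosure, the number of ST-FAMs may vary with the number of scales selected. In some embodiments, the encoder in FIG. 3 contains, from top to bottom, convolutional layers of different levels that produce spatial features of different scales. The ST-FAMs in the decoder in FIG. 3, from top to bottom, fuse the spatial features of the respective scales: the first ST-FAM fuses the first-scale spatial features, the second ST-FAM fuses the second-scale spatial features, the third ST-FAM fuses the third-scale spatial features, and the fourth ST-FAM fuses the fourth-scale spatial features, where the first scale is smaller than the second scale, the second scale is smaller than the third scale, and the third scale is smaller than the fourth scale.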
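Next, the specific operation of the feature fusion module mentioned above is described. The present disclosure finds that the motion information of objects in a video helps the deep neural network model distinguish foreground from background effectively. Therefore, when fusing the same-scale spatial features of adjacent video frames of the video to produce fused features of multiple different scales, the present disclosure first extracts motion information between adjacent video frames and aligns the same-scale spatial features of adjacent frames according to the motion information, and then fuses the aligned same-scale spatial features to produce fused features of multiple different scales. These operations make effective use of the motion information between video frames and can thereby further improve the accuracy of the model's predictions.
FIG. 4 is a schematic diagram showing an example of a feature fusion module according to an exemplary embodiment of the present disclosure. An example of the feature fusion module is introduced next with reference to FIG. 4. As shown in FIG. 4, the ST-FAM module includes two submodules: (i) a feature alignment submodule, which compensates for the misalignment between adjacent frames caused by object motion; and (ii) a feature fusion submodule, which fuses the same-scale spatial features of adjacent frames to produce a global fused feature that benefits alpha prediction and contains the temporal information between video frames.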
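In some embodiments, the feature alignment submodule may extract motion information between adjacent frames and thereby align the same-scale spatial features of adjacent frames. For example, the spatial features may take the form of feature maps. The feature alignment submodule may first merge the same-scale feature maps of adjacent frames (e.g., F_t, F_{t+n}, and F_{t-n} in FIG. 3) and then use a convolutional layer to predict the displacement Δp of each pixel p of the feature map at time t; this Δp is the motion vector of pixel p between frame t and frame t+1. A deformable convolution layer is then used to align the features of frame t+1 to frame t. In this way, the features within the time interval [t-n, t+n] can all be aligned automatically to frame t, and the aligned features of these multiple video frames are sent to the feature fusion submodule for feature fusion.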
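The idea of aligning frame t+1 to frame t with a per-pixel displacement Δp can be sketched in a deliberately simplified form. This is not a deformable convolution: a real deformable convolution learns fractional offsets per kernel sampling point and samples bilinearly, whereas the hypothetical `align_to_reference` below uses given integer offsets and nearest-pixel lookup on a single-channel map.

```python
def align_to_reference(neighbor, offsets):
    """Warp a neighbor-frame feature map onto the reference frame grid.

    neighbor: H x W list of feature values from frame t+1.
    offsets:  H x W list of (dy, dx) integer displacements, where
              offsets[y][x] says where pixel (y, x) of frame t has
              moved to in frame t+1.
    Positions that move outside the map are filled with 0.0.
    """
    height, width = len(neighbor), len(neighbor[0])
    aligned = [[0.0] * width for _ in range(height)]
    for y in range(height):
        for x in range(width):
            dy, dx = offsets[y][x]
            src_y, src_x = y + dy, x + dx
            if 0 <= src_y < height and 0 <= src_x < width:
                aligned[y][x] = neighbor[src_y][src_x]
    return aligned


# An object at column 1 in frame t has moved one pixel right in frame t+1.
frame_t_plus_1 = [[0.0, 0.0, 0.0], [0.0, 0.0, 9.0], [0.0, 0.0, 0.0]]
shift_right = [[(0, 1)] * 3 for _ in range(3)]
aligned = align_to_reference(frame_t_plus_1, shift_right)
```

After warping, the response sits at column 1 again, so it overlaps the same object in frame t and the two maps can be merged meaningfully. The zero fill at the border is also a small instance of the "lost pixel" problem that motivates attention-based fusion.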
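According to an exemplary embodiment, the aligned same-scale spatial features may be fused to produce fused features of multiple different scales by directly performing channel merging on the aligned spatial features of each scale. The merged features carry the feature information of every frame, which helps distinguish foreground from background and keeps multi-frame prediction stable. However, because foreground motion is irregular, in some cases of large motion a pixel p of frame t may be lost in frame t+1 after it moves. In such cases, directly merging the same-scale spatial features of multiple frames along the channel dimension may introduce additional interfering information. To mitigate the negative effect of this interference, according to another exemplary embodiment of the present disclosure, the aligned same-scale spatial features may be fused to produce fused features of multiple different scales by performing channel merging on the aligned spatial features of each scale and then fusing the channel-merged features using an attention mechanism.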
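As shown in FIG. 4, after channel merging is performed on the aligned spatial features of each scale, a channel attention mechanism may first be used to fuse the feature channels, and a spatial attention mechanism may then be used to fuse the pixels within the same channel. Specifically, after the feature channels are merged, a global average pooling operation is first used to obtain channel attention weights. These weights are then multiplied onto the aligned features, which selects the channels that are useful for frame t. A spatial attention operation is then applied to increase the interaction between pixels within a channel and enlarge the receptive field, thereby reducing the effect of interfering information.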
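A minimal sketch of the two attention stages follows, under stated assumptions. The channel weights come from global average pooling followed by a sigmoid; practical channel-attention blocks usually insert learned layers between the pooling and the sigmoid, which are omitted here. The spatial mask is a sigmoid of the per-pixel channel mean, which is only one simple choice; the disclosure does not specify the exact spatial attention operation. All function names are hypothetical.

```python
import math


def sigmoid(x):
    return 1.0 / (1.0 + math.exp(-x))


def channel_attention(feats):
    """feats: C x H x W nested lists. Returns (reweighted feats, weights)."""
    weights = []
    for channel in feats:
        pooled = sum(map(sum, channel)) / (len(channel) * len(channel[0]))
        weights.append(sigmoid(pooled))  # global average pooling -> weight
    scaled = [
        [[value * wt for value in row] for row in channel]
        for channel, wt in zip(feats, weights)
    ]
    return scaled, weights


def spatial_attention(feats):
    """Reweight every pixel by a mask shared across channels (assumed form)."""
    c, h, w = len(feats), len(feats[0]), len(feats[0][0])
    mask = [
        [sigmoid(sum(feats[k][y][x] for k in range(c)) / c) for x in range(w)]
        for y in range(h)
    ]
    return [
        [[feats[k][y][x] * mask[y][x] for x in range(w)] for y in range(h)]
        for k in range(c)
    ]


merged = [[[0.0, 0.0]], [[2.0, 2.0]]]  # 2 channels, 1 x 2 pixels each
scaled, weights = channel_attention(merged)
fused = spatial_attention(scaled)
```

In this toy input the second channel has the larger average response, so it receives the larger channel weight, while the output keeps the same C x H x W shape as the input.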
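According to another exemplary embodiment of the present disclosure, further feature extraction may also be performed on the fused features of multiple different scales to obtain new fused features. For example, as shown in FIG. 4, a global convolution layer operation may be used to encode the previously fused features, which further enlarges the receptive field and yields fused features with stronger expressive power.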
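After the fused features of multiple different scales are produced, in step S240 the unknown transparency information of each video frame may be predicted based on the fused features of multiple different scales. In some embodiments, the unknown transparency information of each video frame may be predicted based on the new fused features (i.e., the new fused features obtained through the further feature extraction mentioned above). For example, the fused features output by an upper ST-FAM after it fuses the first-scale spatial features are upsampled and merged with the fused features output by the ST-FAM below it, so that decoded features with the same scale as the original video frame are gradually reconstructed. The decoded features are finally sent to the prediction branch to produce the prediction result, i.e., the unknown transparency information of the video frame.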
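The coarse-to-fine reconstruction described above can be sketched as repeated "upsample, then merge with the next scale". The sketch assumes nearest-neighbor 2x upsampling and channel concatenation as the merge; the disclosure does not fix either choice, and a real decoder would also apply convolutions after each merge. The names `upsample2x` and `decode` are hypothetical.

```python
def upsample2x(feats):
    """Nearest-neighbor 2x upsampling of a C x H x W nested list."""
    out = []
    for channel in feats:
        rows = []
        for row in channel:
            wide = [value for value in row for _ in range(2)]
            rows.append(wide)
            rows.append(list(wide))  # duplicate the row vertically
        out.append(rows)
    return out


def decode(fused_by_scale):
    """Merge fused features ordered from the smallest scale to the largest.

    Each entry must be exactly twice the height and width of the previous
    one. Merging is channel concatenation (list concatenation on axis 0).
    """
    decoded = fused_by_scale[0]
    for finer in fused_by_scale[1:]:
        decoded = upsample2x(decoded) + finer
    return decoded


coarse = [[[1.0]]]                      # 1 channel, 1 x 1
fine = [[[2.0, 3.0], [4.0, 5.0]]]       # 1 channel, 2 x 2
decoded = decode([coarse, fine])
```

The result has the spatial size of the finest input and carries channels from every scale, which is what the prediction branch would then consume.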
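In some embodiments, as shown in FIG. 3, the original video frames may also be used to correct and further refine the prediction result to obtain the final prediction result. For example, as shown in FIG. 3, the output of the decoder is fed into a refinement module for refinement, which yields the final prediction result. After the prediction result is obtained, in step S250 the video may be processed according to the predicted unknown transparency information. For example, a target object in the video may be extracted according to the predicted unknown transparency information of each video frame. The extracted target object may further be composited with another video.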
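As a sketch of the processing in step S250, a predicted alpha matte can be used to place the target object of one frame onto a new background. The function name `replace_background` is hypothetical. The sketch uses the observed frame itself as the foreground layer, which is a common simplification and an assumption here; the disclosure does not describe how foreground colors are estimated.

```python
def replace_background(frame, alpha, new_background):
    """Composite a single-channel frame onto a new background.

    frame, alpha, new_background: H x W nested lists of equal size.
    Each output pixel is alpha * frame + (1 - alpha) * new_background.
    """
    return [
        [a * f + (1.0 - a) * b for f, a, b in zip(f_row, a_row, b_row)]
        for f_row, a_row, b_row in zip(frame, alpha, new_background)
    ]


frame = [[200.0, 50.0]]
alpha = [[1.0, 0.0]]          # left pixel is the object, right is background
backdrop = [[10.0, 20.0]]
result = replace_background(frame, alpha, backdrop)
```

Applying the same function to every frame with its own predicted matte gives the composited video; temporally consistent mattes matter here because frame-to-frame alpha jitter shows up as flicker along object edges.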
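The video processing method according to the embodiments of the present disclosure has been described above with reference to FIGS. 2 to 4. Compared with directly applying an image matting algorithm to a video to predict alpha, the video processing method of the present disclosure produces alpha predictions with better continuity and consistency and improves the accuracy of alpha prediction.
It should be noted that although the architecture and operation of the deep neural network model and the video processing method of the present disclosure have been introduced above using an encoder-decoder model as the example, the deep neural network model according to the present disclosure is not limited to an encoder-decoder model; it may also be, for example, a MobileNet model or a deep residual network model. The encoder-decoder structure may be adjusted according to actual needs to suit different application scenarios. For instance, if the technical solution is to be applied in a mobile app, where speed and real-time performance requirements are high, the encoder-decoder structure may be replaced with a lightweight network suited to mobile devices, such as MobileNet; if the technology is to be deployed on a server side where accuracy requirements are high, the encoder-decoder structure may be replaced with a deeper network with stronger expressive power, such as ResNet-101, to meet the accuracy requirements.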
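As those skilled in the art know, a deep neural network model must be trained before it is used for prediction. The training of the above deep neural network model is briefly introduced next.
FIG. 5 is a flowchart showing a method for training a deep neural network model according to another exemplary embodiment of the present disclosure. The method may be performed by a terminal device, by a server, or by a terminal device and a server in cooperation. The method may include steps S510 to S550.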
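Referring to FIG. 5, in step S510, a training video and full transparency information corresponding to each video frame in the training video are acquired. Next, based on the training video and partial transparency information within the full transparency information, the deep neural network model performs steps S520 to S540 to predict the unknown transparency information other than the partial transparency information. As shown in FIG. 5, in step S520, spatial features of multiple scales are extracted for each video frame based on each video frame of the training video and the partial transparency information corresponding to it. In step S530, the same-scale spatial features of adjacent video frames of the training video are fused to produce fused features of multiple different scales. In step S540, the unknown transparency information other than the partial transparency information within the full transparency information of each video frame is predicted based on the fused features of multiple different scales. Finally, in step S550, the predicted unknown transparency information is compared with the transparency information, within the full transparency information, other than the partial transparency information, in order to adjust the parameters of the deep neural network model. A pre-constructed loss function may be used when adjusting the parameters. Deep neural network models with different network structures or model types will use correspondingly different loss functions; the present disclosure places no restriction on the structure and type of the deep neural network model or on the loss function used, as long as the model can perform the operations described above.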
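The comparison in step S550 can be illustrated with one possible loss. The disclosure leaves the loss function open, so the mean absolute error over unknown-region pixels used below is purely an assumption for illustration, as are the function name `unknown_region_l1` and the trimap label 128 for unknown pixels.

```python
def unknown_region_l1(pred_alpha, true_alpha, trimap, unknown_value=128):
    """Mean absolute error between predicted and ground-truth alpha,
    restricted to pixels the trimap marks as unknown.

    pred_alpha, true_alpha, trimap: H x W nested lists of equal size.
    Returns 0.0 when the trimap contains no unknown pixels.
    """
    total, count = 0.0, 0
    for p_row, t_row, m_row in zip(pred_alpha, true_alpha, trimap):
        for pred, true, label in zip(p_row, t_row, m_row):
            if label == unknown_value:
                total += abs(pred - true)
                count += 1
    return total / count if count else 0.0


# The second pixel is known foreground, so its error is ignored.
loss = unknown_region_l1([[0.5, 1.0]], [[0.25, 0.0]], [[128, 255]])
```

Restricting the loss to the unknown region mirrors the text: only transparency outside the supplied partial information is predicted, so only that part is compared with the ground truth when adjusting parameters.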
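Furthermore, as those skilled in the art know, the operations a deep neural network model performs during training are the same as those it performs during prediction; the differences are that training uses training data while prediction uses the real data to be predicted, and that during training the prediction result is compared with the ground truth to adjust the model parameters. Since the operations performed by the deep neural network model and the details of each operation have already been introduced in the description of the video processing method, they are not repeated here. For the corresponding parts, refer to the descriptions of FIGS. 2 to 4.
According to the above method for training a deep neural network model, the model extracts spatial features of multiple scales for each video frame based on each video frame and its corresponding partial transparency information and fuses the same-scale spatial features of adjacent video frames of the training video to produce fused features of multiple different scales. The temporal information between video frames is therefore used, and the trained deep neural network model can provide more accurate predictions.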
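FIG. 6 is a block diagram showing a video processing apparatus according to an exemplary embodiment of the present disclosure.
Referring to FIG. 6, the video processing apparatus 600 may include a data acquisition unit 601, a prediction unit 602, and a processing unit 603. Specifically, the data acquisition unit 601 may acquire a video and partial transparency information corresponding to each video frame in the video. The prediction unit 602 may extract spatial features of multiple scales for each video frame based on each video frame and the partial transparency information, fuse the same-scale spatial features of adjacent video frames of the video to produce fused features of multiple different scales, and predict the unknown transparency information of each video frame based on the fused features of multiple different scales. According to an exemplary embodiment, the prediction unit 602 may use a deep neural network model, based on the video and the partial transparency information, to predict the unknown transparency information of each video frame other than the partial transparency information.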
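In addition, the processing unit 603 may process the video according to the predicted unknown transparency information. For example, the processing unit may extract a target object in the video according to the predicted unknown transparency information of each video frame.
Since the video processing method shown in FIG. 2 can be performed by the video processing apparatus 600 shown in FIG. 6, and the data acquisition unit 601, the prediction unit 602, and the processing unit 603 can perform operations corresponding to step S210, steps S220 to S240, and step S250 in FIG. 2 respectively, any details of the operations performed by the units in FIG. 6 can be found in the corresponding descriptions of FIGS. 2 to 4 and are not repeated here.
It should also be noted that although the video processing apparatus 600 is described above as divided into units that each perform a corresponding process, it is clear to those skilled in the art that the processes performed by these units may also be performed without any specific division into units, or without clear boundaries between the units. The video processing apparatus 600 may also include other units, such as a data processing unit and a storage unit.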
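FIG. 7 is a block diagram showing an apparatus for training a deep neural network model (hereinafter called the "training apparatus" for ease of description) according to another exemplary embodiment of the present disclosure.
Referring to FIG. 7, the training apparatus 700 may include a training data acquisition unit 701 and a model training unit 702. Specifically, the training data acquisition unit 701 may acquire a training video and full transparency information corresponding to each video frame in the training video. The model training unit 702 may use a deep neural network model, based on the training video and partial transparency information within the full transparency information, to perform the following operations to predict the unknown transparency information other than the partial transparency information: extracting spatial features of multiple scales for each video frame based on each video frame of the training video and the partial transparency information corresponding to it, fusing the same-scale spatial features of adjacent video frames of the training video to produce fused features of multiple different scales, and predicting, based on the fused features of multiple different scales, the unknown transparency information other than the partial transparency information within the full transparency information of each video frame; and may adjust the parameters of the deep neural network model by comparing the predicted unknown transparency information with the transparency information, within the full transparency information, other than the partial transparency information.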
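Likewise, the operations the deep neural network model performs during training are the same as those it performs during prediction, except that training uses training data while prediction uses the real video to be predicted. For the operations performed by the deep neural network model and the details of each operation, refer to the corresponding descriptions of FIGS. 2 to 4, which are not repeated here.
It should also be noted that although the training apparatus 700 is described above as divided into units that each perform a corresponding process, it is clear to those skilled in the art that the processes performed by these units may also be performed without any specific division into units, or without clear boundaries between the units. The training apparatus 700 may also include other units, such as a data processing unit and a storage unit.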
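FIG. 8 is a block diagram of an electronic device according to an exemplary embodiment of the present disclosure.
Referring to FIG. 8, the electronic device 800 may include at least one memory 801 and at least one processor 802. The at least one memory stores a set of computer-executable instructions, and in response to the set of computer-executable instructions being executed by the at least one processor, the video processing method or the method for training a deep neural network model according to the embodiments of the present disclosure is performed.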
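As an example, the electronic device may be a PC, a tablet device, a personal digital assistant, a smartphone, or any other device capable of executing the above set of instructions. The electronic device need not be a single electronic device; it may be any collection of devices or circuits that can execute the above instructions (or instruction set) individually or jointly. The electronic device may also be part of an integrated control system or system manager, or may be configured as a portable electronic device that interfaces locally or remotely (e.g., via wireless transmission).
In the electronic device, the processor may include a central processing unit (CPU), a graphics processing unit (GPU), a programmable logic device, a special-purpose processor system, a microcontroller, or a microprocessor. By way of example and not limitation, the processor may also include an analog processor, a digital processor, a microprocessor, a multi-core processor, a processor array, a network processor, and the like.
The processor may run instructions or code stored in the memory, and the memory may also store data. Instructions and data may also be sent and received over a network via a network interface device, which may use any known transmission protocol.
The memory may be integrated with the processor, for example with RAM or flash memory arranged within an integrated-circuit microprocessor. The memory may also include a standalone device, such as an external disk drive, a storage array, or any other storage device usable by a database system. The memory and the processor may be operatively coupled, or may communicate with each other, for example through an I/O port or a network connection, so that the processor can read files stored in the memory.
In addition, the electronic device may include a video display (such as a liquid crystal display) and a user interaction interface (such as a keyboard, a mouse, or a touch input device). All components of the electronic device may be connected to one another via a bus and/or a network.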
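According to embodiments of the present disclosure, a computer-readable storage medium that stores instructions in a non-volatile manner may also be provided, wherein, in response to the instructions being run by at least one processor, the at least one processor is caused to perform the video processing method or the method for training a deep neural network model according to exemplary embodiments of the present disclosure. Examples of the computer-readable storage medium here include: read-only memory (ROM), programmable read-only memory (PROM), electrically erasable programmable read-only memory (EEPROM), random access memory (RAM), dynamic random access memory (DRAM), static random access memory (SRAM), flash memory, non-volatile memory, CD-ROM, CD-R, CD+R, CD-RW, CD+RW, DVD-ROM, DVD-R, DVD+R, DVD-RW, DVD+RW, DVD-RAM, BD-ROM, BD-R, BD-R LTH, BD-RE, Blu-ray or optical disc storage, hard disk drives (HDD), solid-state drives (SSD), card-type memory (such as a multimedia card, a Secure Digital (SD) card, or an Extreme Digital (XD) card), magnetic tape, floppy disks, magneto-optical data storage devices, optical data storage devices, hard disks, solid-state disks, and any other device configured to store a computer program and any associated data, data files, and data structures in a non-transitory manner and to provide them to a processor or computer so that the processor or computer can execute the computer program. The computer program in the above computer-readable storage medium may run in an environment deployed on computer equipment such as a client, a host, a proxy device, or a server. In one example, the computer program and any associated data, data files, and data structures are distributed over networked computer systems, so that they are stored, accessed, and executed in a distributed manner by one or more processors or computers.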
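According to embodiments of the present disclosure, a computer program product may also be provided, in which instructions can be run by at least one processor of an electronic device to perform the video processing method or the method for training a deep neural network model according to exemplary embodiments of the present disclosure.
All embodiments of the present disclosure may be carried out alone or in combination with other embodiments, and all such cases are regarded as falling within the scope of protection claimed by the present disclosure.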
Claims (53)
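- A video processing method, characterized by comprising: acquiring a video and partial transparency information corresponding to each video frame in the video; extracting spatial features of multiple scales for each video frame based on each video frame and the partial transparency information; fusing the same-scale spatial features of adjacent video frames of the video to produce fused features of multiple different scales; predicting unknown transparency information of each video frame based on the fused features of multiple different scales; and processing the video according to the predicted unknown transparency information.
- The video processing method according to claim 1, characterized in that fusing the same-scale spatial features of adjacent video frames of the video to produce fused features of multiple different scales comprises: extracting motion information between adjacent video frames, and aligning the same-scale spatial features of adjacent frames according to the motion information; and fusing the aligned same-scale spatial features to produce fused features of multiple different scales.
- The video processing method according to claim 2, characterized in that fusing the aligned same-scale spatial features to produce fused features of multiple different scales comprises: fusing the aligned same-scale spatial features by directly performing channel merging on the aligned spatial features of each scale; or fusing the aligned same-scale spatial features by performing channel merging on the aligned spatial features of each scale and fusing the channel-merged features using an attention mechanism.
- The video processing method according to claim 3, wherein fusing the channel-merged features using an attention mechanism comprises: fusing the feature channels using a channel attention mechanism, and fusing the pixels within the same channel using a spatial attention mechanism.
- The video processing method according to claim 4, characterized in that fusing the aligned same-scale spatial features to produce fused features of multiple different scales further comprises: performing further feature extraction on the fused features of multiple different scales to obtain new fused features; wherein predicting the unknown transparency information of each video frame based on the fused features of multiple different scales comprises: predicting the unknown transparency information of each video frame based on the new fused features.
- The video processing method according to claim 1, further comprising: using a deep neural network model, based on the video and the partial transparency information, to predict the unknown transparency information of each video frame other than the partial transparency information, wherein the deep neural network model is an encoder-decoder model, skip connections exist between the encoder and the decoder of the encoder-decoder model, and the decoder includes a feature fusion module and a prediction branch, and wherein the video processing method further comprises: extracting the spatial features of multiple scales of each video frame using the encoder, fusing the same-scale spatial features of adjacent video frames of the video using the feature fusion module to produce fused features of multiple different scales, and predicting the unknown transparency information of each video frame using the prediction branch based on the fused features of multiple different scales.
- The video processing method according to claim 6, wherein the skip connections indicate that the spatial features of different scales produced by the encoder are respectively input to the feature fusion module of the decoder that fuses features of the corresponding scale.
- The video processing method according to claim 1, wherein extracting the spatial features of multiple scales of each video frame based on each video frame and the partial transparency information comprises: concatenating each video frame with the transparency information map corresponding to it to form a concatenated image; and extracting spatial features of multiple scales from the concatenated image corresponding to each video frame, as the spatial features of multiple scales of that video frame.
- The video processing method according to claim 1, wherein processing the video according to the predicted unknown transparency information comprises: extracting a target object in the video according to the predicted unknown transparency information of each video frame.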
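- A method for training a deep neural network model, comprising: acquiring a training video and full transparency information corresponding to each video frame in the training video; using a deep neural network model, based on the training video and partial transparency information within the full transparency information, to perform the following operations to predict unknown transparency information other than the partial transparency information: extracting spatial features of multiple scales for each video frame based on each video frame of the training video and the partial transparency information corresponding to it; fusing the same-scale spatial features of adjacent video frames of the training video to produce fused features of multiple different scales; and predicting, based on the fused features of multiple different scales, the unknown transparency information other than the partial transparency information within the full transparency information of each video frame; and adjusting the parameters of the deep neural network model by comparing the predicted unknown transparency information with the transparency information, within the full transparency information, other than the partial transparency information.
- The method according to claim 10, characterized in that fusing the same-scale spatial features of adjacent video frames of the training video to produce fused features of multiple different scales comprises: extracting motion information between adjacent video frames, and aligning the same-scale spatial features of adjacent frames according to the motion information; and fusing the aligned same-scale spatial features to produce fused features of multiple different scales.
- The method according to claim 11, characterized in that fusing the aligned same-scale spatial features to produce fused features of multiple different scales comprises: fusing the aligned same-scale spatial features by directly performing channel merging on the aligned spatial features of each scale; or fusing the aligned same-scale spatial features by performing channel merging on the aligned spatial features of each scale and fusing the channel-merged features using an attention mechanism.
- The method according to claim 12, wherein fusing the channel-merged features using an attention mechanism comprises: fusing the feature channels using a channel attention mechanism, and fusing the pixels within the same channel using a spatial attention mechanism.
- The method according to claim 13, characterized in that fusing the aligned same-scale spatial features to produce fused features of multiple different scales further comprises: performing further feature extraction on the fused features of multiple different scales to obtain new fused features; wherein predicting the unknown transparency information of each video frame based on the fused features of multiple different scales comprises: predicting the unknown transparency information of each video frame based on the new fused features.
- The method according to claim 10, wherein the deep neural network model is an encoder-decoder model, skip connections exist between the encoder and the decoder of the encoder-decoder model, and the decoder includes a feature fusion module and a prediction branch, and wherein the method comprises: extracting the spatial features of multiple scales of each video frame using the encoder, fusing the same-scale spatial features of adjacent video frames of the training video using the feature fusion module to produce fused features of multiple different scales, and predicting the unknown transparency information of each video frame using the prediction branch based on the fused features of multiple different scales.
- The method according to claim 15, wherein the skip connections indicate that the spatial features of different scales produced by the encoder are respectively input to the feature fusion module of the decoder that fuses features of the corresponding scale.
- The method according to claim 10, wherein extracting the spatial features of multiple scales of each video frame based on each video frame and the partial transparency information comprises: concatenating each video frame with the transparency information map corresponding to it to form a concatenated image; and extracting spatial features of multiple scales from the concatenated image corresponding to each video frame, as the spatial features of multiple scales of that video frame.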
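- A video processing apparatus, comprising: a data acquisition unit configured to acquire a video and partial transparency information corresponding to each video frame in the video; a prediction unit configured to extract spatial features of multiple scales for each video frame based on each video frame and the partial transparency information, fuse the same-scale spatial features of adjacent video frames of the video to produce fused features of multiple different scales, and predict unknown transparency information of each video frame based on the fused features of multiple different scales; and a processing unit configured to process the video according to the predicted unknown transparency information.
- The video processing apparatus according to claim 18, characterized in that fusing the same-scale spatial features of adjacent video frames of the video to produce fused features of multiple different scales comprises: extracting motion information between adjacent video frames, and aligning the same-scale spatial features of adjacent frames according to the motion information; and fusing the aligned same-scale spatial features to produce fused features of multiple different scales.
- The video processing apparatus according to claim 19, characterized in that fusing the aligned same-scale spatial features to produce fused features of multiple different scales comprises: fusing the aligned same-scale spatial features by directly performing channel merging on the aligned spatial features of each scale; or fusing the aligned same-scale spatial features by performing channel merging on the aligned spatial features of each scale and fusing the channel-merged features using an attention mechanism.
- The video processing apparatus according to claim 20, wherein fusing the channel-merged features using an attention mechanism comprises: fusing the feature channels using a channel attention mechanism, and fusing the pixels within the same channel using a spatial attention mechanism.
- The video processing apparatus according to claim 21, characterized in that fusing the aligned same-scale spatial features to produce fused features of multiple different scales further comprises: performing further feature extraction on the fused features of multiple different scales to obtain new fused features; wherein predicting the unknown transparency information of each video frame based on the fused features of multiple different scales comprises: predicting the unknown transparency information of each video frame based on the new fused features.
- The video processing apparatus according to claim 18, wherein the prediction unit uses a deep neural network model, based on the video and the partial transparency information, to predict the unknown transparency information of each video frame other than the partial transparency information, wherein the deep neural network model is an encoder-decoder model, skip connections exist between the encoder and the decoder of the encoder-decoder model, and the decoder includes a feature fusion module and a prediction branch, and wherein the encoder is used to extract the spatial features of multiple scales of each video frame, the feature fusion module is used to fuse the same-scale spatial features of adjacent video frames of the video to produce fused features of multiple different scales, and the prediction branch is used to predict the unknown transparency information of each video frame based on the fused features of multiple different scales.
- The video processing apparatus according to claim 23, wherein the skip connections indicate that the spatial features of different scales produced by the encoder are respectively input to the feature fusion module of the decoder that fuses features of the corresponding scale.
- The video processing apparatus according to claim 18, wherein extracting the spatial features of multiple scales of each video frame based on each video frame and the partial transparency information comprises: concatenating each video frame with the transparency information map corresponding to it to form a concatenated image; and extracting spatial features of multiple scales from the concatenated image corresponding to each video frame, as the spatial features of multiple scales of that video frame.
- The video processing apparatus according to claim 18, wherein the processing unit is configured to extract a target object in the video according to the predicted unknown transparency information of each video frame.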
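- An apparatus for training a deep neural network model, comprising: a training data acquisition unit configured to acquire a training video and full transparency information corresponding to each video frame in the training video; and a model training unit configured to: use a deep neural network model, based on the training video and partial transparency information within the full transparency information, to perform the following operations to predict unknown transparency information other than the partial transparency information: extracting spatial features of multiple scales for each video frame based on each video frame and the partial transparency information corresponding to it, fusing the same-scale spatial features of adjacent video frames of the training video to produce fused features of multiple different scales, and predicting, based on the fused features of multiple different scales, the unknown transparency information other than the partial transparency information within the full transparency information of each video frame; and adjust the parameters of the deep neural network model by comparing the predicted unknown transparency information with the transparency information, within the full transparency information, other than the partial transparency information.
- The apparatus according to claim 27, characterized in that fusing the same-scale spatial features of adjacent video frames of the training video to produce fused features of multiple different scales comprises: extracting motion information between adjacent video frames, and aligning the same-scale spatial features of adjacent frames according to the motion information; and fusing the aligned same-scale spatial features to produce fused features of multiple different scales.
- The apparatus according to claim 28, characterized in that fusing the aligned same-scale spatial features to produce fused features of multiple different scales comprises: fusing the aligned same-scale spatial features by directly performing channel merging on the aligned spatial features of each scale; or fusing the aligned same-scale spatial features by performing channel merging on the aligned spatial features of each scale and fusing the channel-merged features using an attention mechanism.
- The apparatus according to claim 29, wherein fusing the channel-merged features using an attention mechanism comprises: fusing the feature channels using a channel attention mechanism, and fusing the pixels within the same channel using a spatial attention mechanism.
- The apparatus according to claim 30, characterized in that fusing the aligned same-scale spatial features to produce fused features of multiple different scales further comprises: performing further feature extraction on the fused features of multiple different scales to obtain new fused features; wherein predicting the unknown transparency information of each video frame based on the fused features of multiple different scales comprises: predicting the unknown transparency information of each video frame based on the new fused features.
- The apparatus according to claim 27, wherein the deep neural network model is an encoder-decoder model, skip connections exist between the encoder and the decoder of the encoder-decoder model, and the decoder includes a feature fusion module and a prediction branch, and wherein the model training unit is further configured to: extract the spatial features of multiple scales of each video frame using the encoder, fuse the same-scale spatial features of adjacent video frames of the training video using the feature fusion module to produce fused features of multiple different scales, and predict the unknown transparency information of each video frame using the prediction branch based on the fused features of multiple different scales.
- The apparatus according to claim 32, wherein the skip connections indicate that the spatial features of different scales produced by the encoder are respectively input to the feature fusion module of the decoder that fuses features of the corresponding scale.
- The apparatus according to claim 27, wherein extracting the spatial features of multiple scales of each video frame based on each video frame and the partial transparency information comprises: concatenating each video frame with the transparency information map corresponding to it to form a concatenated image; and extracting spatial features of multiple scales from the concatenated image corresponding to each video frame, as the spatial features of multiple scales of that video frame.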
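- An electronic device, characterized by comprising: at least one processor; and at least one memory storing computer-executable instructions, wherein the computer-executable instructions, when run by the at least one processor, cause the at least one processor to perform the following steps: acquiring a video and partial transparency information corresponding to each video frame in the video; extracting spatial features of multiple scales for each video frame based on each video frame and the partial transparency information; fusing the same-scale spatial features of adjacent video frames of the video to produce fused features of multiple different scales; predicting unknown transparency information of each video frame based on the fused features of multiple different scales; and processing the video according to the predicted unknown transparency information.
- The electronic device according to claim 35, characterized in that the computer-executable instructions, when run by the at least one processor, cause the at least one processor to perform the following steps: extracting motion information between adjacent video frames, and aligning the same-scale spatial features of adjacent frames according to the motion information; and fusing the aligned same-scale spatial features to produce fused features of multiple different scales.
- The electronic device according to claim 36, characterized in that the computer-executable instructions, when run by the at least one processor, cause the at least one processor to perform the following steps: fusing the aligned same-scale spatial features to produce fused features of multiple different scales by directly performing channel merging on the aligned spatial features of each scale; or fusing the aligned same-scale spatial features to produce fused features of multiple different scales by performing channel merging on the aligned spatial features of each scale and fusing the channel-merged features using an attention mechanism.
- The electronic device according to claim 37, wherein the computer-executable instructions, when run by the at least one processor, cause the at least one processor to perform the following steps: fusing the feature channels using a channel attention mechanism, and fusing the pixels within the same channel using a spatial attention mechanism.
- The electronic device according to claim 38, characterized in that the computer-executable instructions, when run by the at least one processor, cause the at least one processor to perform the following steps: performing further feature extraction on the fused features of multiple different scales to obtain new fused features; wherein predicting the unknown transparency information of each video frame based on the fused features of multiple different scales comprises: predicting the unknown transparency information of each video frame based on the new fused features.
- The electronic device according to claim 35, characterized in that the computer-executable instructions, when run by the at least one processor, cause the at least one processor to perform the following steps: using a deep neural network model, based on the video and the partial transparency information, to predict the unknown transparency information of each video frame other than the partial transparency information, wherein the deep neural network model is an encoder-decoder model, skip connections exist between the encoder and the decoder of the encoder-decoder model, and the decoder includes a feature fusion module and a prediction branch, and wherein the computer-executable instructions, when run by the at least one processor, cause the at least one processor to perform the following steps: extracting the spatial features of multiple scales of each video frame using the encoder, fusing the same-scale spatial features of adjacent video frames of the video using the feature fusion module to produce fused features of multiple different scales, and predicting the unknown transparency information of each video frame using the prediction branch based on the fused features of multiple different scales.
- The electronic device according to claim 40, characterized in that the skip connections indicate that the spatial features of different scales produced by the encoder are respectively input to the feature fusion module of the decoder that fuses features of the corresponding scale.
- The electronic device according to claim 35, wherein the computer-executable instructions, when run by the at least one processor, cause the at least one processor to perform the following steps: concatenating each video frame with the transparency information map corresponding to it to form a concatenated image; and extracting spatial features of multiple scales from the concatenated image corresponding to each video frame, as the spatial features of multiple scales of that video frame.
- The electronic device according to claim 35, wherein the computer-executable instructions, when run by the at least one processor, cause the at least one processor to perform the following step: extracting a target object in the video according to the predicted unknown transparency information of each video frame.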
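- An electronic device, characterized by comprising: at least one processor; and at least one memory storing computer-executable instructions, wherein the computer-executable instructions, when run by the at least one processor, cause the at least one processor to perform the following steps: acquiring a training video and full transparency information corresponding to each video frame in the training video; using a deep neural network model, based on the training video and partial transparency information within the full transparency information, to perform the following operations to predict unknown transparency information other than the partial transparency information: extracting spatial features of multiple scales for each video frame based on each video frame of the training video and the partial transparency information corresponding to it; fusing the same-scale spatial features of adjacent video frames of the training video to produce fused features of multiple different scales; and predicting, based on the fused features of multiple different scales, the unknown transparency information other than the partial transparency information within the full transparency information of each video frame; and adjusting the parameters of the deep neural network model by comparing the predicted unknown transparency information with the transparency information, within the full transparency information, other than the partial transparency information.
- The electronic device according to claim 44, characterized in that the computer-executable instructions, when run by the at least one processor, cause the at least one processor to perform the following steps: extracting motion information between adjacent video frames, and aligning the same-scale spatial features of adjacent frames according to the motion information; and fusing the aligned same-scale spatial features to produce fused features of multiple different scales.
- The electronic device according to claim 45, characterized in that the computer-executable instructions, when run by the at least one processor, cause the at least one processor to perform the following steps: fusing the aligned same-scale spatial features to produce fused features of multiple different scales by directly performing channel merging on the aligned spatial features of each scale; or fusing the aligned same-scale spatial features to produce fused features of multiple different scales by performing channel merging on the aligned spatial features of each scale and fusing the channel-merged features using an attention mechanism.
- The electronic device according to claim 46, characterized in that the computer-executable instructions, when run by the at least one processor, cause the at least one processor to perform the following steps: fusing the feature channels using a channel attention mechanism, and fusing the pixels within the same channel using a spatial attention mechanism.
- The electronic device according to claim 47, characterized in that the computer-executable instructions, when run by the at least one processor, cause the at least one processor to perform the following steps: performing further feature extraction on the fused features of multiple different scales to obtain new fused features; wherein predicting the unknown transparency information of each video frame based on the fused features of multiple different scales comprises: predicting the unknown transparency information of each video frame based on the new fused features.
- The electronic device according to claim 44, wherein the deep neural network model is an encoder-decoder model, skip connections exist between the encoder and the decoder of the encoder-decoder model, and the decoder includes a feature fusion module and a prediction branch, and wherein the computer-executable instructions, when run by the at least one processor, cause the at least one processor to perform the following steps: extracting the spatial features of multiple scales of each video frame using the encoder, fusing the same-scale spatial features of adjacent video frames of the training video using the feature fusion module to produce fused features of multiple different scales, and predicting the unknown transparency information of each video frame using the prediction branch based on the fused features of multiple different scales.
- The electronic device according to claim 49, characterized in that the skip connections indicate that the spatial features of different scales produced by the encoder are respectively input to the feature fusion module of the decoder that fuses features of the corresponding scale.
- The electronic device according to claim 44, wherein the computer-executable instructions, when run by the at least one processor, cause the at least one processor to perform the following steps: concatenating each video frame with the transparency information map corresponding to it to form a concatenated image; and extracting spatial features of multiple scales from the concatenated image corresponding to each video frame, as the spatial features of multiple scales of that video frame.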
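- A computer-readable storage medium that stores instructions in a non-volatile manner, characterized in that, in response to the instructions being run by at least one processor, the at least one processor is caused to perform the method according to any one of claims 1 to 17.
- A computer program product comprising computer instructions, characterized in that, in response to the computer instructions being executed by a processor, the method according to any one of claims 1 to 17 is implemented.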
Applications Claiming Priority (2)
Application Number | Priority Date | Filing Date | Title |
---|---|---|---|
CN202110468173.6A CN113194270B (zh) | 2021-04-28 | 2021-04-28 | Video processing method and apparatus, electronic device, and storage medium |
CN202110468173.6 | 2021-04-28 |
Publications (1)
Publication Number | Publication Date |
---|---|
WO2022227689A1 (zh) | 2022-11-03 |
Family
ID=76980050
Family Applications (1)
Application Number | Title | Priority Date | Filing Date |
---|---|---|---|
PCT/CN2022/070267 WO2022227689A1 (zh) | 2021-04-28 | 2022-01-05 | Video processing method and apparatus |
Country Status (2)
Country | Link |
---|---|
CN (1) | CN113194270B (zh) |
WO (1) | WO2022227689A1 (zh) |
Families Citing this family (2)
Publication number | Priority date | Publication date | Assignee | Title |
---|---|---|---|---|
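CN113194270B (zh) * | 2021-04-28 | 2022-08-05 | Beijing Dajia Internet Information Technology Co., Ltd. | Video processing method and apparatus, electronic device, and storage medium |
CN116233553A (zh) * | 2022-12-23 | 2023-06-06 | Beijing Yibai Technology Co., Ltd. | Video processing method, apparatus, device, and medium |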
Citations (5)
Publication number | Priority date | Publication date | Assignee | Title |
---|---|---|---|---|
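CN104935832A (zh) * | 2015-03-31 | 2015-09-23 | Zhejiang Gongshang University | Video matting method for video with depth information |
CN109829925A (zh) * | 2019-01-23 | 2019-05-31 | Graduate School at Shenzhen, Tsinghua University | Method for extracting a clean foreground in a matting task and model training method |
CN111161277A (zh) * | 2019-12-12 | 2020-05-15 | Sun Yat-sen University | Natural image matting method based on deep learning |
CN111724400A (zh) * | 2020-06-29 | 2020-09-29 | Beijing Gaosi Bole Education Technology Co., Ltd. | Automatic video matting method and system |
CN113194270A (zh) * | 2021-04-28 | 2021-07-30 | Beijing Dajia Internet Information Technology Co., Ltd. | Video processing method and apparatus, electronic device, and storage medium |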
Family Cites Families (4)
Publication number | Priority date | Publication date | Assignee | Title |
---|---|---|---|---|
US8818028B2 (en) * | 2010-04-09 | 2014-08-26 | Personify, Inc. | Systems and methods for accurate user foreground video extraction |
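CN108305256B (zh) * | 2017-11-28 | 2019-11-15 | Tencent Technology (Shenzhen) Co., Ltd. | Video matting processing method, processing apparatus, and computer-readable storage medium |
CN109377445B (zh) * | 2018-10-12 | 2023-07-04 | Beijing Megvii Technology Co., Ltd. | Model training method, method and apparatus for replacing an image background, and electronic system |
CN112016472B (zh) * | 2020-08-31 | 2023-08-22 | Shandong University | Driver attention region prediction method and system based on target dynamic information |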
- 2021-04-28: CN CN202110468173.6A, patent CN113194270B (zh), Active
- 2022-01-05: WO PCT/CN2022/070267, patent WO2022227689A1 (zh), Application Filing
Also Published As
Publication number | Publication date |
---|---|
CN113194270B (zh) | 2022-08-05 |
CN113194270A (zh) | 2021-07-30 |
Legal Events
Date | Code | Title | Description |
---|---|---|---|
121 | Ep: the epo has been informed by wipo that ep was designated in this application |
Ref document number: 22794173 Country of ref document: EP Kind code of ref document: A1 |
|
NENP | Non-entry into the national phase |
Ref country code: DE |
|
32PN | Ep: public notification in the ep bulletin as address of the adressee cannot be established |
Free format text: NOTING OF LOSS OF RIGHTS PURSUANT TO RULE 112(1) EPC (EPO FORM 1205A DATED 22/03/2024) |
|
122 | Ep: pct application non-entry in european phase |
Ref document number: 22794173 Country of ref document: EP Kind code of ref document: A1 |