WO2023178662A1

WO2023178662A1 - Image and video coding using multi-sensor collaboration and frequency adaptive processing

Info

Publication number: WO2023178662A1
Application number: PCT/CN2022/083079
Authority: WO
Inventors: Cheolkon Jung; Hui Lan; Zhe JI
Original assignee: Guangdong Oppo Mobile Telecommunications Corp., Ltd.
Priority date: 2022-03-25
Filing date: 2022-03-25
Publication date: 2023-09-28
Also published as: CN118901079A

Abstract

In certain aspects, a method is disclosed. A first image and a second image of a same scene are received by a processor. The first image and the second image are acquired by two different sensors. First information at a first frequency of the first image is estimated by the processor based, at least in part, on a first component at the first frequency of the second image. Second information at a second frequency of the first image is estimated by the processor based, at least in part, on a second component at the second frequency of the second image. The first image is reconstructed by the processor based on the first information and the second information.

Description

IMAGE AND VIDEO CODING USING MULTI-SENSOR COLLABORATION AND FREQUENCY ADAPTIVE PROCESSING

BACKGROUND

Embodiments of the present disclosure relate to image and video coding.

In recent years, the storage and transmission of video data have become more and more common, and a huge amount of video data has been produced persistently. Thus, the effective compression of video data is increasingly important. Video coding technology has made meaningful contributions to the compression of video data. The earliest research on video compression can be traced back to 1929 when inter-frame compression was first proposed in that year. After years of research and development, mature video compression codec standards have gradually formed, such as audio video interleave (AVI) , moving picture expert group (MPEG) , advanced video coding (H. 264/AVC) , and high-efficiency video coding (H. 265/HEVC) . The latest versatile video coding (H. 266/VVC) standard was officially published in 2020, representing the most advanced video coding technology at present. Although the structure of VVC is still based on the traditional hybrid video coding mode, its compression rate is about doubled.

SUMMARY

According to one aspect of the present disclosure, a method is disclosed. A first image and a second image of a same scene are received by a processor. The first image and the second image are acquired by two different sensors. First information at a first frequency of the first image is estimated by the processor based, at least in part, on a first component at the first frequency of the second image. Second information at a second frequency of the first image is estimated by the processor based, at least in part, on a second component at the second frequency of the second image. The first image is reconstructed by the processor based on the first information and the second information.

According to another aspect of the present disclosure, a system includes a memory configured to store instructions and a processor coupled to the memory. The processor is configured to, upon executing the instructions, receive a first image and a second image of a same scene. The first image and the second image are acquired by two different sensors. The processor is also configured to, upon executing the instructions, estimate first information at a first frequency of the first image based, at least in part, on a first component at the first frequency of the second image, and estimate second information at a second frequency of the first image based, at least in part, on a second component at the second frequency of the second image. The processor is further configured to, upon executing the instructions, reconstruct the first image based on the first information and the second information.

According to still another aspect of the present disclosure, a method is disclosed. A first image of a scene is acquired by a first sensor. A second image of the scene is acquired by a second sensor. First information at a first frequency of the first image is estimated by a first processor based, at least in part, on a first component at the first frequency of the second image. Second information at a second frequency of the first image is estimated by the first processor based, at least in part, on a second component at the second frequency of the second image. The first image is reconstructed by the first processor based on the first information and the second information.

According to yet another aspect of the present disclosure, a system includes an encoding subsystem and a decoding subsystem. The encoding subsystem includes a first sensor, a second sensor, a first memory configured to store instructions, and a first processor coupled to the first memory and the first and second sensors. The first sensor is configured to acquire a first image of a scene. The second sensor is configured to acquire a second image of the scene. The decoding subsystem includes a second memory configured to store instructions, and a second processor coupled to the second memory. The second processor is configured to, upon executing the instructions, estimate first information at a first frequency of the first image based, at least in part, on a first component at the first frequency of the second image, and estimate second information at a second frequency of the first image based, at least in part, on a second component at the second frequency of the second image. The second processor is further configured to, upon executing the instructions, reconstruct the first image based on the first information and the second information.

These illustrative embodiments are mentioned not to limit or define the present disclosure, but to provide examples to aid understanding thereof. Additional embodiments are described in the Detailed Description, and further description is provided there.

BRIEF DESCRIPTION OF THE DRAWINGS

The accompanying drawings, which are incorporated herein and form a part of the specification, illustrate embodiments of the present disclosure and, together with the description, further serve to explain the principles of the present disclosure and to enable a person skilled in the pertinent art to make and use the present disclosure.

FIG. 1 illustrates a block diagram of an exemplary encoding system, according to some embodiments of the present disclosure.

FIG. 2 illustrates a block diagram of an exemplary decoding system, according to some embodiments of the present disclosure.

FIG. 3 illustrates a detailed block diagram of an exemplary encoding and decoding system, according to some embodiments of the present disclosure.

FIG. 4 illustrates a detailed block diagram of an exemplary compression module, according to some embodiments of the present disclosure.

FIG. 5 illustrates a detailed block diagram of an exemplary guided reconstruction module, according to some embodiments of the present disclosure.

FIG. 6A illustrates a detailed block diagram of an exemplary guided reconstruction module using machine learning models, according to some embodiments of the present disclosure.

FIG. 6B illustrates a detailed block diagram of another exemplary guided reconstruction module using machine learning models, according to some embodiments of the present disclosure.

FIG. 6C illustrates a detailed block diagram of still another exemplary guided reconstruction module using machine learning models, according to some embodiments of the present disclosure.

FIG. 7 illustrates a visual comparison between anchor images and various output depth images from the guided reconstruction modules in FIGs. 6A–6C with different quantization parameters (QPs) , according to some embodiments of the present disclosure.

FIG. 8 illustrates a flow chart of an exemplary method for encoding, according to some embodiments of the present disclosure.

FIG. 9 illustrates a flow chart of an exemplary method for decoding, according to some embodiments of the present disclosure.

FIG. 10A illustrates a detailed flow chart of an exemplary method for image reconstruction, according to some embodiments of the present disclosure.

FIG. 10B illustrates a detailed flow chart of another exemplary method for image reconstruction, according to some embodiments of the present disclosure.

FIG. 11 illustrates a block diagram of an exemplary model training system, according to some embodiments of the present disclosure.

FIG. 12 illustrates an exemplary scheme for model training, according to some embodiments of the present disclosure.

Embodiments of the present disclosure will be described with reference to the accompanying drawings.

DETAILED DESCRIPTION

Although some configurations and arrangements are discussed, it should be understood that this is done for illustrative purposes only. A person skilled in the pertinent art will recognize that other configurations and arrangements can be used without departing from the spirit and scope of the present disclosure. It will be apparent to a person skilled in the pertinent art that the present disclosure can also be employed in a variety of other applications.

It is noted that references in the specification to “one embodiment, ” “an embodiment, ” “an example embodiment, ” “some embodiments, ” “certain embodiments, ” etc., indicate that the embodiment described may include a particular feature, structure, or characteristic, but every embodiment may not necessarily include the particular feature, structure, or characteristic. Moreover, such phrases do not necessarily refer to the same embodiment. Further, when a particular feature, structure, or characteristic is described in connection with an embodiment, it would be within the knowledge of a person skilled in the pertinent art to effect such feature, structure, or characteristic in connection with other embodiments whether or not explicitly described.

In general, terminology may be understood at least in part from usage in context. For example, the term “one or more” as used herein, depending at least in part upon context, may be used to describe any feature, structure, or characteristic in a singular sense or may be used to describe combinations of features, structures or characteristics in a plural sense. Similarly, terms, such as “a, ” “an, ” or “the, ” again, may be understood to convey a singular usage or to convey a plural usage, depending at least in part upon context. In addition, the term “based on” may be understood as not necessarily intended to convey an exclusive set of factors and may, instead, allow for existence of additional factors not necessarily expressly described, again, depending at least in part on context.

Various aspects of image and video coding systems will now be described with reference to various apparatus and methods. These apparatus and methods will be described in the following detailed description and illustrated in the accompanying drawings by various modules, components, circuits, steps, operations, processes, algorithms, etc. (collectively referred to as “elements” ) . These elements may be implemented using electronic hardware, firmware, computer software, or any combination thereof. Whether such elements are implemented as hardware, firmware, or software depends upon the particular application and design constraints imposed on the overall system.

The techniques described herein may be used for various image and video coding applications. As described herein, image and video coding includes both encoding and decoding a video, a frame of a video, or a still image (a.k.a., a map) . For ease of description, the present disclosure may refer to a video, a frame, or an image; unless otherwise stated, in either case, it encompasses a video, a frame of a video, and a still image.

The three-dimension (3D) extension of HEVC (3D-HEVC) is a 3D video coding standard investigated by Joint Collaborative Team on 3D Video Coding Extension Development (JCT-3V) . The goal of 3D-HEVC is to improve the coding technology on the basis of HEVC to efficiently compress multi-viewpoints and their corresponding depth data. 3D-HEVC includes all the key technologies of HEVC and adds technologies that are conducive to multi-view video coding to improve the efficiency of 3D video coding and decoding. Compared with 2D video coding, 3D video coding transmits depth maps to facilitate the synthesis of virtual viewpoints at the decoding end, but there are certain differences between color and depth. Existing video encoding tools are not suitable for depth map encoding, and thus the study of 3D-HEVC is launched. Existing methods, however, are dedicated to modifying the internal modules of 3D-HEVC to improve performance but do not take the characteristics and correlation of the depth map and the color map into account.

On the other hand, with the recent advances in sensor technology, especially the popularization of multi-sensory data, there is a new opportunity to reform and elevate the compression efficiency using multi-sensor collaboration. Multi-sensor data have a significant advantage over single sensor data due to the unique property of each sensor. Multi-sensor collaboration, such as color and depth images, can remarkably increase the coding efficiency. Traditional video codecs, including 3D-HEVC, however, only save bits by removing redundancy and do not consider multi-sensor collaboration to save bits. Moreover, traditional 3D-HEVC has low compression efficiency for depth data, and when the quantization parameter (QP) is large, there will be obvious blocky artifacts. Although most existing methods based on deep learning may achieve speed-up of the prediction mode decision for coding unit/prediction unit (CU/PU) , they cannot deal with blocky artifacts in a large QP caused by 3D-HEVC. Moreover, existing methods focus on enhancing video quality or reducing bitrate, but not both.

To save bits in the bitstream and enhance video/image quality, the present disclosure provides various schemes of guided reconstruction-based image and video coding using multi-sensor collaboration, such as color and depth images, and frequency adaptive processing. As described below in detail, the present disclosure can be implemented for the compression of various multi-sensor data, such as color/depth, color/near-infrared (NIR) , visible (VIS) /infrared (IR) , or color/light detection and ranging (LiDAR) , and can use various video compression standards (a.k.a. codec) , such as HEVC, VVC, audio video standard (AVS) , etc. In some embodiments, the color images acquired by a color sensor represent the color and texture of the scene, while the depth images acquired by a depth sensor represent the 3D geometric shape of the scene. The two types of sensor data can be complementary, and the color images can help reconstruct their corresponding depth images.

In some embodiments, to save bits in the bitstream in multi-sensor collaboration, an original depth image can be downsampled (i.e., having a downsampling factor greater than 1) at the encoding side to become a low resolution (LR) depth image, and the downsampled LR depth image and the corresponding color image can be compressed, e.g., by 3D-HEVC, respectively, into the bitstream to be transmitted to the decoding side. In some embodiments, to preserve the best video/image quality, the original depth image may not be downsampled (i.e., having a downsampling factor equal to 1) at the encoding side, such that the depth image and corresponding color image to be compressed have the same resolution (i.e., size) . The depth and color images can be compressed, by 3D-HEVC, respectively, into the bitstream to be transmitted to the decoding side.

On the decoding side, the color and depth information from the color and depth images, respectively, can be combined and used by guided reconstruction to reconstruct and recover the high resolution (HR) depth image if the depth image has been downsampled, or enhance the quality of the depth image if the depth image has not been downsampled. For example, quality enhancement (e.g., removing blocking artifacts caused by 3D-HEVC codec) and bit reduction may be balanced and optimized as needed under different upsampling factors. Consistent with the scope of the present disclosure, a frequency adaptive processing scheme is used by guided reconstruction to process different frequency components of the input images separately using different machine learning models (e.g., convolutional neural network (CNN) ) that are designed and trained for different frequency components to get more accurate results. In some embodiments, a high frequency learning model is used to estimate rapid transitions in depth, such as edges and details of the scene, while a low frequency learning model is used to estimate slow variations in depth, such as smooth areas of the scene. The final depth image can be reconstructed by combining and upsampling the features estimated from the learning models for different frequency components.

In some embodiments, the design of the learning models is not limited by the value of the downsampling factor and thus, has the flexibility to balance the efficiency and performance without the need to consider the specific downsampling factor. That is, the design of the learning models may be independent of and remain substantially the same under different downsampling factors. In some embodiments, discrete wavelet transform (DWT) is used before the learning models to separate sub-band components of different frequencies from the input image (s) , as well as to adjust the size/resolution relationship of the input images to ensure the consistency of the learning models under different upsampling factors.

FIG. 1 illustrates a block diagram of an exemplary encoding system 100, according to some embodiments of the present disclosure. FIG. 2 illustrates a block diagram of an exemplary decoding system 200, according to some embodiments of the present disclosure. Each

system

100 or 200 may be applied or integrated into various systems and apparatus capable of data processing, such as computers and wireless communication devices. For example,

system

100 or 200 may be the entirety or part of a mobile phone, a desktop computer, a laptop computer, a tablet, a vehicle computer, a gaming console, a printer, a positioning device, a wearable electronic device, a smart sensor, a virtual reality (VR) device, an argument reality (AR) device, or any other suitable electronic devices having data processing capability. As shown in FIGs. 1 and 2,

system

100 or 200 may include a processor 102, a memory 104, and an interface 106. These components are shown as connected to one another by a bus, but other connection types are also permitted. It is understood that

system

100 or 200 may include any other suitable components for performing functions described here.

Processor 102 may include microprocessors, such as graphic processing unit (GPU) , image signal processor (ISP) , central processing unit (CPU) , digital signal processor (DSP) , tensor processing unit (TPU) , vision processing unit (VPU) , neural processing unit (NPU) , synergistic processing unit (SPU) , or physics processing unit (PPU) , microcontroller units (MCUs) , application-specific integrated circuits (ASICs) , field-programmable gate arrays (FPGAs) , programmable logic devices (PLDs) , state machines, gated logic, discrete hardware circuits, and other suitable hardware configured to perform the various functions described throughout the present disclosure. Although only one processor is shown in FIGs. 1 and 2, it is understood that multiple processors can be included. Processor 102 may be a hardware device having one or more processing cores. Processor 102 may execute software. Software shall be construed broadly to mean instructions, instruction sets, code, code segments, program code, programs, subprograms, software modules, applications, software applications, software packages, routines, subroutines, objects, executables, threads of execution, procedures, functions, etc., whether referred to as software, firmware, middleware, microcode, hardware description language, or otherwise. Software can include computer instructions written in an interpreted language, a compiled language, or machine code. Other techniques for instructing hardware are also permitted under the broad category of software.

Memory 104 can broadly include both memory (a.k.a, primary/system memory) and storage (a.k.a., secondary memory) . For example, memory 104 may include random-access memory (RAM) , read-only memory (ROM) , static RAM (SRAM) , dynamic RAM (DRAM) , ferro-electric RAM (FRAM) , electrically erasable programmable ROM (EEPROM) , compact disc read-only memory (CD-ROM) or other optical disk storage, hard disk drive (HDD) , such as magnetic disk storage or other magnetic storage devices, Flash drive, solid-state drive (SSD) , or any other medium that can be used to carry or store desired program code in the form of instructions that can be accessed and executed by processor 102. Broadly, memory 104 may be embodied by any computer-readable medium, such as a non-transitory computer-readable medium. Although only one memory is shown in FIGs. 1 and 2, it is understood that multiple memories can be included.

Interface 106 can broadly include a data interface and a communication interface that is configured to receive and transmit a signal in a process of receiving and transmitting information with other external network elements. For example, interface 106 may include input/output (I/O) devices and wired or wireless transceivers. Although only one memory is shown in FIGs. 1 and 2, it is understood that multiple interfaces can be included.

Processor 102, memory 104, and interface 106 may be implemented in various forms in

system

100 or 200 for performing video coding functions. In some embodiments, processor 102, memory 104, and interface 106 of

system

100 or 200 are implemented (e.g., integrated) on one or more system-on-chips (SoCs) . In one example, processor 102, memory 104, and interface 106 may be integrated on an application processor (AP) SoC that handles application processing in an operating system (OS) environment, including running video encoding and decoding applications. In another example, processor 102, memory 104, and interface 106 may be integrated on a specialized processor chip for video coding, such as a GPU or ISP chip dedicated for image and video processing in a real-time operating system (RTOS) .

As shown in FIG. 1, in encoding system 100, processor 102 may include one or more modules, such as an encoder 101. Although FIG. 1 shows that encoder 101 is within one processor 102, it is understood that encoder 101 may include one or more sub-modules that can be implemented on different processors located closely or remotely with each other. Encoder 101 (and any corresponding sub-modules or sub-units) can be hardware units (e.g., portions of an integrated circuit) of processor 102 designed for use with other components or software units implemented by processor 102 through executing at least part of a program, i.e., instructions. The instructions of the program may be stored on a computer-readable medium, such as memory 104, and when executed by processor 102, it may perform a process having one or more functions related to image and video encoding, such as downsampling, image partitioning, inter prediction, intra prediction, transformation, quantization, filtering, entropy encoding, etc., as described below in detail.

Similarly, as shown in FIG. 2, in decoding system 200, processor 102 may include one or more modules, such as a decoder 201. Although FIG. 2 shows that decoder 201 is within one processor 102, it is understood that decoder 201 may include one or more sub-modules that can be implemented on different processors located closely or remotely with each other. Decoder 201 (and any corresponding sub-modules or sub-units) can be hardware units (e.g., portions of an integrated circuit) of processor 102 designed for use with other components or software units implemented by processor 102 through executing at least part of a program, i.e., instructions. The instructions of the program may be stored on a computer-readable medium, such as memory 104, and when executed by processor 102, it may perform a process having one or more functions related to image and video decoding, such as entropy decoding, upsampling, inverse quantization, inverse transformation, inter prediction, intra prediction, filtering, as described below in detail.

Consistent with the scope of the present disclosure, for multi-sensor collaboration application, at least encoding system 100 further include a plurality of sensors 108 coupled to processor 102, memory 104, and interface 106 via the bus. Sensors 108 may include a first sensor 108 configured to acquire a first image of a scene (e.g., including one or more objects, a.k.a., scene object (s) ) , and a second sensor 108 configured to acquire a second image of the same scene. That is, the first image and the second image may be acquired by two different sensors 108. In some embodiments, the first and second images of the scene are different types of images but are complementary to one another with respect to the scene. In other words, first and second sensors 108 may obtain different types of image information of a scene that, when combined, provides a comprehensive visual representation of the scene. The first and second images may also have characteristics that are correlated, i.e., being correlated images of the same scene. In some embodiments, both first and second images may reflect the edges of the scene (including edges of the objects in the scene) . For example, the first image acquired by first sensor 108 is a depth image, and the second image acquired by second sensor 108 is a color image. The depth image and color image may be correlated as both images can represent the edges of the same scene. The depth image and color image may also be complementary to one another with respect to the scene, e.g., a 3D scene, as the depth and color images, when combined, can provide a comprehensive visual representation of the scene.

In some embodiments, first sensor 108 is a depth sensor, and the depth image represents a 3D geometric shape of the scene. For example, the depth sensor may include any 3D range finder that acquires multi-point distance information across a wide field-of-view (FoV) , such as LiDAR distance sensors, time-of-flight (ToF) cameras, or light-field cameras. In some embodiments, second sensor 108 is a color sensor, and the color image represents the color of the scene. It is understood that the “color” referred to herein may encompass texture and grayscale as well. For example, the color sensor may include any sensor that detects the color of light reflected from an object in any suitable spectrum, such as VIS sensors, IR sensors, VIS-IR sensors, NIR sensors, or red-green-blue/NIR (RGB-NIR) sensors. It is understood that various types of multi-sensor images (data) are not limited to color and depth and may be any other suitable types with respect to the same scene in other examples. It is also understood that the number of sensors 108 and the types of multi-sensor images are not limited to two and may be more than two in other examples. It is further understood that sensors 108 may be configured to acquire videos of different types, such as color video and depth video, each of which includes a plurality of frames, such as color frame and depth frames, respectively. It is still further understood that although not shown in FIG. 2, in some examples, decoding system 200 may include sensors 108 as well, like encoding system 100.

FIG. 3 illustrates a detailed block diagram of an exemplary encoding and decoding system 300, according to some embodiments of the present disclosure. System 300 may be a combination of encoding system 100 and decoding system 200 described above, or any portions of the combination. As shown in FIG. 3, system 300 may include first and second sensors 108, encoder 101, and decoder 201. First sensor 108 may be configured to acquire a depth image D of a scene, and second sensor 108 may be configured to acquire a color image I of the same scene. As described above, a color image and a depth image are used as the example of two complementary images with respect to the same scene for ease of description, and any other suitable types of complementary images with respect to the same scene may be used as well. As described above, the color image and depth of the image may be either still images (maps) or frames of videos captured by sensors 108. It is understood that in some examples, the depth image D and/or the color image I may not be acquired by first and/or second sensors 108 but obtained through any other suitable means. In one example, the depth image D and/or the color image I may be derived or otherwise obtained from other image (s) of the scene using any suitable image analysis or processing techniques. In another example, the depth image D and/or the color image I may be acquired by a third party and transmitted from the third party to encoder 101.

As shown in FIG. 3, encoder 101 (e.g., part of encoding system 100 or encoding and decoding system 300) may include a downsampling module 302 and a compression module 304. Downsampling module 302 is configured to downsample the original depth image D to generate a downsampled depth image D _s, according to some implementations. For example, the original depth image D may be an HR depth image, and the downsampled depth image D _s may be an LR depth image after downsampling. It is understood that unless otherwise stated, an LR image and a downsampled image may be used interchangeably in the present disclosure. Downsampling module 302 may downsample the original depth image D to reduce its size (resolution) by any suitable downsampling techniques, including but not limited to interpolation, uniform sampling, a machine learning model, or any combination thereof. As a result, the amount of data (e.g., the number of bits) representing the downsampled depth image D _s may become smaller than that of the original depth image D. In contrast, in some embodiments, the original color image I does not go through downsampling module 302 and thus, is not downsampled.

It is understood that in some embodiments, the original depth image D is not downsampled. That is, downsampling module 302 may be bypassed such that the original depth image D (e.g., an HR depth image) , as opposed to the downsampled depth image D _s (e.g., an LR depth image) , may be provided to compression module 304. In some embodiments, assuming the same size of the original color image I and depth image D is (W, H, C) and depth video downsampling may be performed by downsampling module 302 using a uniform sampling operation with a downsampling factor s, the height and width of the depth image D or D _s may be become (W/s, H/s, C) , where C represents the number of channels (e.g., C = 3 for an R, G, B color image I) . When the downsampling factor s is set to 1, downsampling module 302 may be bypassed (does not downsample the original depth image D) , such that the size of the original depth image D remains the same, i.e., the same as the size of the original color image I. When the downsampling factor is set to be greater than 1 (e.g., 2 or 4) , downsampling module 302 may downsample the original depth image D to generate the downsampled depth image D _s, such that the size of the downsampled depth image D _s shrinks, i.e., becomes smaller than the size of the original color image I. It is understood that the value of the downsampling factors is not limited to 1, 2, and 4 and may be any suitable values, such as 2.5, 3, 6, 8, etc. It is understood that in the present disclosure, the sizes (resolutions) of two images of the same scene are considered to be the same as long as the heights (H) and widths (W) of the two images are the same, regardless of the numbers of channels thereof. In other words, the sizes (resolutions) of the two images may be compared by their heights and widths, but not the numbers of channels, in the present disclosure.

In some embodiments, compression module 304 is configured to compress (encode) the original color image I and the downsampled depth image D _s (or the original depth image D) , respectively, into a bitstream. The compression may be performed independently for the original color image I and the downsampled/original depth image D _s/D. Compression module 304 may perform the compression using any suitable compression techniques (codecs) , including but not limited to 3D-HEVC, VVC, AVS, etc. For example, FIG. 4 illustrates a detailed block diagram of exemplary compression module 304 implementing the 3D-HEVC codec, according to some embodiments of the present disclosure. As shown in FIG. 4, compression module 304 implementing 3D-HEVC codec may include an HEVC conforming video coder 402 and a depth map coder 404 configured to encode a color frame and a depth frame, respectively, in an independent view of a 3D video. Compression module 304 may also include multiple video coders 406 and depth coders 408 each configured to encode a respective color frame and a respective depth frame in a respective one of N dependent views of the 3D video. Each depth frame may correspond to the downsampled/original depth image D _s/D in FIG. 3, and each color frame may correspond to the original color image I in FIG. 3. Compression module 304 may further include a multiplexer 410 (MUX) to sequentially select the encoded data from each

coder

402, 404, 406, or 406 to form a bitstream. That is, compression module 304 implementing 3D-HEVC codec can compress multi-viewpoints and their corresponding depth data into the bitstream by transmitting depth maps to facilitate the synthesis of virtual viewpoints at the decoding.

Referring back to FIG. 3, decoder 201 (e.g., part of decoding system 200 or encoding and decoding system 300) may include a decompression module 306 and a guided reconstruction module 308. In some embodiments, the bitstream including the compressed original color image I and downsampled/original depth image D _s/D is transmitted from encoder 101 (e.g., by interface 106 of encoding system 100 in FIG. 1) to decoder 201. That is, decompression module 306 of decoder 201 may receive the bitstream having compressed original color image I and downsampled/original depth image D _s/D (e.g., by interface 106 of decoding system 200 in FIG. 2) . In some embodiments, by reducing the size (resolution) of the depth image from the original depth image to D to downsampled depth image D _s (e.g., s > 1) , the number of bits transmitted in the bitstream can be reduced to improve the throughput of system 300. Decompression module 306 may be configured to decompress the compressed original color image I and downsampled/original depth image D _s/D from the bitstream to reconstruct color image I’ and downsampled depth image D _s’, respectively, using the same compression techniques (codec) implemented by compression module 304, including but not limited to, 3D-HEVC, VVC, AVS, etc. The decompression may be performed independently for the compressed color image I and downsampled/original depth image D _s/D. In some embodiments, the 3D-HEVC codec is implemented by decompression module 306 to obtain color image I’ and downsampled depth image D _s’ (or depth image D’) , respectively. It is understood that in some examples, the depth image D _s’/D’ and/or the color image I’ may not be obtained by decoder 201 from the decompressed bitstream but obtained through any other suitable means. In one example, the depth image D _s’/D’ and/or the color image I’ may be derived or otherwise obtained from other image (s) of the scene using any suitable image analysis or processing techniques. In another example, the depth image D _s’/D’ and/or the color image I’ may be acquired by sensors coupled directly to decoder 201 (on the decoder side, not shown) .

It is understood that decompression module 306 may not perfectly restore the original color image I and depth image D _s’/D’ from the bitstream, for example, due to information loss, depending on the codec used by compression module 304 and decompression module 306. In other words, color image I’ may not be identical to color image I, and depth image D _s’/D’ may not be identical to depth image D _s/D. It is understood that in the present disclosure, the compression and decompression processes performed by compression module 304 and decompression module 306, respectively, may be sometimes referred to together as a compression process as well. Accordingly, color image I’ and depth image D _s’/D’ outputted from decompression module 306 may be sometimes referred to as a compressed color image I’ and a compressed depth image D _s’/D’, respectively, as well. Depending on the compression efficiency for depth data and/or the QP used by compression module 304 and decompression module 306 during the compression/decompression process, blocky artifacts may appear at the edges of the scene on depth image D _s’/D’ and color image I’ with information loss, thereby causing blurry or other distortions. For example, when using the 3D-HEVC codec with a relatively low compression efficiency for depth data, the larger the QP is, the blurrier the edges of the scene may be on depth image D _s’/D’.

Guided reconstruction module 308 may be configured to reconstruct a depth image D” (e.g., an HR depth image) from depth image D _s’/D’. The conventional LR depth upsampling process, however, may cause blurry, in particular, at the edges of the scene, due to the lack of high-frequency components. On the other hand, color image I’ has a number of clear and complete edges for reconstructing depth image D” . Moreover, color image I’ may also contain unnecessary edges for reconstructing depth image D” , thereby causing texture copying artifacts after reconstructing depth image D” . Even though the edges in color image I’ can provide important clues for reconstructing depth image D” , they cannot be directly used in upsampling.

Consistent with the scope of the present disclosure, guided reconstruction module 308 may be configured to reconstruct depth image D” from depth image D _s’/D’ with the guidance from color image I’. As original color image I and original depth image D are correlated, compressed color image I’ and depth image D _s’/D’ are also correlated, according to some embodiments. Thus, the correlated characteristics and information thereof (e.g., at the edges of the scene) in color image I’ and depth image D _s’/D’ can be combined in guided reconstruction module 308 to recover depth image D” after upsampling. Depending on whether the input depth image of reconstruction module 308 is downsampled (e.g., s > 1) or not (e.g., s = 1) , guided reconstruction module 308 may upsample and recover downsampled depth image D _s’ or enhance the quality of depth image D’.

In some embodiments, guided reconstruction module 308 separate the sub-band components at different frequencies (e.g., one or more high frequency components and one or more low frequency components) of color image I’ and combine them with the sub-band components at different frequencies of depth image D’, respectively, or with the extracted initial features of downsampled depth image D _s. ’ For example, DWT and/or downsampling may be used to pre-process color image I’ and/or depth image D _s’/D’ to separate the high frequency component (s) and low frequency component (s) and/or adjust and match the sizes of the initial features from color image I’ and depth image D _s’/D’, as described below in detail. In some embodiments, guided reconstruction module 308 then processes the combined initial features at different frequencies using different machine learning models designed and trained for the different frequency components (e.g., a high frequency component CNN and a low frequency component CNN) , respectively, to estimate features at different frequencies separately. In some embodiments, guided reconstruction module 308 then combines and upsamples the estimated features at different frequencies to reconstruct depth image D” . For example, a machine learning model (e.g., an upsampling CNN) may be used to upsample the combined features to reconstruct depth image D” .

As described above, it is understood that the number and/or types of images that can be applied in the image and video coding using multi-sensor collaboration by system 300 are not limited by the examples described above with respect to FIG. 3 and may vary in other examples. It is understood that each of the elements shown in FIG. 3 is independently shown to represent characteristic functions different from each other in system 300, and it does not mean that each component is formed by the configuration unit of separate hardware or single software. That is, each element is included to be listed as an element for convenience of explanation, and at least two of the elements may be combined to form a single element, or one element may be divided into a plurality of elements to perform a function. It is also understood that some of the elements are not necessary elements that perform functions described in the present disclosure but instead may be optional elements for improving performance. It is further understood that these elements may be implemented using electronic hardware, firmware, computer software, or any combination thereof. Whether such elements are implemented as hardware, firmware, or software depends upon the particular application and design constraints imposed on system 300.

FIG. 5 illustrates a detailed block diagram of exemplary guided reconstruction module 308, according to some embodiments of the present disclosure. As shown in FIG. 5, guided reconstruction module 308 may include a pre-processing unit 504, a pre-processing unit 505, a high frequency (HF) estimation unit 506, a low frequency (LF) estimation unit 507, and a reconstruction unit 508. Pre-processing unit 504 may be configured to receive a guide input 502 (e.g., color image I’) and extract components at different frequencies of guide input 502 from guide input 502. In some embodiments, pre-processing unit 504 may be configured to separate one or more first components at a first frequency of guide input 502 and one or more second components at a second frequency of guide input 502 from guide input 502. For example, the first frequency may be higher than the second frequency. For ease of description, the first frequency may be referred to herein as the high frequency (HF) , and the second frequency may be referred to as the low frequency (LF) . In one example in which guide input 502 is color image I’, the HF component of color image I’ may be indicative of rapid transitions in color or texture, such as edges and details, while the LF component of color image I’ may be indicative of slow variations in color or texture, such as smooth and continuous areas. In some embodiments, pre-processing unit 504 may also be configured to extract initial HF features of guide input 502 from the HF component and extract initial LF features of guide input 502 from the LF component. That is, pre-processing unit 504 may separately extract initial features of the corresponding frequency from the respective frequency component.

In some embodiments in which the size of target input 503 (e.g., downsampled depth image D _s’) is smaller than the size of guide input 502 (e.g., color image I’) , pre-processing unit 504 may be further configured to downsample guide input 502 based on the size of a target input 503. For example, pre-processing unit 504 may downsample the extracted HF initial features and LF initial features, such that the resolution of the HF initial features and the LF initial features becomes equal to the resolution of the initial features extracted from target input 503 (e.g., downsampled depth image D _s’) . It is understood that the downsampling function of pre-processing unit 504 may not be performed in some cases, for example, when the size of target input 503 (e.g., downsampled depth image D _s’) is equal to the size of guide input 502 (e.g., color image I’) and/or when the resolution of the HF initial features and the LF initial features of guide input 502 is equal to the resolution of the initial features of target input 503.

Similarly, in some embodiments, pre-processing unit 505 may be configured to receive target input 503 (e.g., depth image D’) and extract components at different frequencies of target input 503 from target input 503. In some embodiments, pre-processing unit 505 may be configured to separate one or more first components at the first frequency (e.g., HF) of target input 503 and one or more second components at the second frequency (e.g., LF) of target input 503 from target input 503. In one example in which target input 503 is depth image D’, the HF component of depth image D’ may be indicative of rapid transitions in depth, such as edges, while the LF component of depth image D’ may be indicative of slow variations in depth, such as smooth areas. In some embodiments, pre-processing unit 505 may also be configured to extract initial HF features of target input 503 from the HF component and extract initial LF features of target input 503 from the LF component. That is, pre-processing unit 505 may separately extract initial features of the corresponding frequency from the respective frequency component.

It is understood that the frequency component separation function of pre-processing unit 505 may not be performed in some cases, for example, depending on the size of target input 503, such as when target input 503 is a downsampled depth image D _s’. In those cases, pre-processing unit 505 may be configured to extract the initial features of target input 503 from target input 503. That is, instead of separately extracting initial features of the corresponding frequency from the respective frequency component, pre-processing unit 505 may extract the initial features directly from target input 503 (e.g., downsampled depth image D _s’) without frequency component separation.

In some embodiments, pre-processing unit 504 may be further configured to combine the initial HF features of guide input 502 with initial features of target input 503 from pre-processing unit 505. Depending on whether pre-processing unit 505 performs the frequency component separation function or not, pre-processing unit 504 may combine the initial HF features of guide input 502 and the initial HF features of target input 503, or combine the initial HF features of guide input 502 and the initial features of target input 503. Similarly, in some embodiments, pre-processing unit 505 may be further configured to combine the initial LF features of guide input 502 from pre-processing unit 504 with initial features of target input 503. Depending on whether pre-processing unit 505 performs the frequency component separation function or not, pre-processing unit 505 may combine the initial LF features of guide input 502 and the initial LF features of target input 503, or combine the initial LF features of guide input 502 and the initial features of target input 503.

That is, pre-processing unit 504 and pre-processing unit 505 can work together to combine initial features from guide input 502 and target input 503 at different frequencies (e.g., HF and LF) to improve the estimation accuracy because different frequency components represent different types of information of guide input 502 and target input 503. Moreover, in case the sizes of guide input 502 and target input 503 do not match, pre-processing unit 504 and/or pre-processing unit 505 can also adjust the size of guide input 502 and/or target input 503 to enable the combination of the initial features from guide input 502 and target input 503.

In some embodiments, HF estimation unit 506 may be configured to receive the combined initial features at the first frequency (e.g., HF initial features) from pre-processing unit 504 (e.g., the combined HF initial features of guide input 502 and target input 503, or the combined HF initial features of guide input 502 and initial features of target input 503) , and estimate the first information at the first frequency (e.g., HF information) of target input 503 based on the combined initial features at the first frequency. Similarly, in some embodiments, LF estimation unit 507 may be configured to receive the combined initial features at the second frequency (e.g., LF initial features) from pre-processing unit 505 (e.g., the combined LF initial features of guide input 502 and target input 503, or the combined LF initial features of guide input 502 and initial features of target input 503) , and estimate the second information at the second frequency (e.g., LF information) of target input 503 based on the combined initial features at the second frequency. In some embodiments, the first information of depth image D _s’/D’ includes features indicative of edges of the scene, and the second information of depth image D _s’/D’ includes features indicative of smooth areas of the scene. In some embodiments, the estimation functions of HF estimation unit 506 and LF estimation unit 507 may be performed using machine learning models designed and trained for the corresponding frequency components.

In some embodiments, reconstruction unit 508 may be configured to reconstruct target input 503 based on the estimated first information and second information to generate target output 510 (e.g., depth image D” ) . To reconstruct target input 503, reconstruction unit 508 may be configured to combine the estimated first information and second information of target input 503, and upsample target input 503 based on the combined first and second information. That is, HF features indicative of edges of the scene may be used as the guidance for reconstructing depth image D _s’/D’ in which the edges become blurry or distorted due to the lack of HF components. In some embodiments, the upsampling function of reconstruction unit 508 may be performed using a machine learning model designed and trained for upsampling.

Consistent with the scope of the present disclosure, in some implementations, machine learning models, such as CNNs, are used by HF estimation unit 506, LF estimation unit 507, and reconstruction unit 508 to improve the efficiency and effectiveness of guided reconstruction module 308, for example, as shown in FIGs. 6A–6C. FIGs. 6A–6C will be described together with FIG. 5 in describing guided reconstruction module 308 implementing machine learning models with different downsampling factors (e.g., 1, 2, and 4) . It is understood that in some examples, any other suitable machine learning models, such as regression, support vector machine (SVM) , decision tree, Bayesian network, etc., may be implemented by guided reconstruction module 308.

As shown in FIGs. 5 and 6A, guide input 502 may be an input color image I’ and target input 503 may be an input depth image D’. That is, the downsampling factor of input depth image D’ may be 1 in this example, and the size of input color image I’ may be the same as the size of input depth image D’. Pre-processing unit 504 may include a discrete wavelet transform (DWT) unit 602, a feature extraction unit 604, and a concatenation (C) layer 606. DWT is a transform that decomposes a given signal into a number of sets, where each set is a time series of coefficients describing the time evolution of the signal in the corresponding frequency band. DWT unit 602 may perform DWT on input color image I’ to separate input color image I’ into four sub-band components: three HF components (HH, HL, LH) obtained by scanning input color image I’ along three different directions (height, width, and diagonal) , and one LF component (LL) . Besides frequency component separation, DWT unit 602 may also reduce the size of input color image I’ to half, i.e., downsample input color image I’ with a downsampling factor of 2. It is understood that in some examples, DWT unit 602 may be replaced with any suitable unit that can perform frequency component separation, as well as downsampling, on input color image I’, such as any other suitable transforms or CNNs. Feature extraction unit 604 may include a convolutional layer (e.g., a 3×3 convolutional layer) for extracting initial HF features from the three HF components (HH, HL, and LH) , and a convolutional layer for extracting initial LF features from one LF component (LL) . It is understood that the number of convolutional layer (s) used by feature extraction unit 604 is not limited to one, as shown in FIG. 6A and may be any suitable number.

Similarly, pre-processing unit 505 may include a DWT unit 603, a feature extraction unit 605, and a concatenation layer 607. DWT unit 603 may perform DWT on input depth image D’ to separate input depth image D’ into four sub-band components: three HF components (HH, HL, LH) obtained by scanning input depth image D’ along three different directions (height, width, and diagonal) , and one LF component (LL) . Besides frequency component separation, DWT unit 603 may also reduce the size of input depth image D’ to half, i.e., downsample input depth image D’ with a downsampling factor of 2. It is understood that in some examples, DWT unit 603 may be replaced with any suitable unit that can perform frequency component separation, as well as downsampling, on input depth image D’, such as any other suitable transforms or CNNs. Feature extraction unit 605 may include a convolutional layer (e.g., a 3×3 convolutional layer) for extracting initial HF features from the three HF components (HH, HL, and LH) , and a convolutional layer for extracting initial LF features from one LF component (LL) . It is understood that the number of convolutional layer (s) used by feature extraction unit 605 is not limited to one, as shown in FIG. 6A and may be any suitable number.

Concatenation layer 606 of pre-processing unit 504 may concatenate the initial HF features of input color image I’ and the initial HF features of input depth image D’ to form concatenated initial HF features of input color image I’ and depth image D’. Similarly, concatenation layer 607 of pre-processing unit 505 may concatenate the initial LF features of input color image I’ and the initial LF features of input depth image D’ to form concatenated initial LF features of input color image I’ and depth image D’. In order to perform the concatenation operation, the size (resolution) of the two feature maps need to be the same. Thus, the size of the extracted HF features of input color image I’ may be the same as the size of the extracted HF features of input depth image D’, and the size of the extracted LF features of input color image I’ may be the same as the size of the extracted LF features of input depth image D’.

HF estimation unit 506 may include an HF machine learning model having one or more sub-models 608 for HF components and features. The HF machine learning model may be a CNN, and each sub-model 608 may be a CNN block. Sub-models 608 may include a number of levels, such as one or more dense blocks, 1×1 convolutional layers, and additional convolutional layers (e.g., 3×3 convolutional layers) , to extract HF features from the initial HF features step-by-step and get deep semantic information with multiscale HF features. In one example, each dense block may include four convolutional layers and a concatenation layer. For example, the semantic information with multiscale HF features from one or more dense blocks of sub-model 608 may be used to distinguish between valid and invalid edges. In one example, a 1×1 convolutional layer may be used to reduce the number of channels of the extracted HF features to reduce the computation complexity of the HF machine learning model. In some embodiments, the initial HF features (i.e., the input of the first sub-model 608 as shown in FIG. 6A) and/or any intermediate HF features between sub-models 608 (not shown) may be added to the output of the last sub-model 608 by an addition operation, for example, to obtain the residual therebetween, thereby further reducing the computation complexity of the HF machine learning model. For example, an element-wise addition at the pixel level may be performed. Eventually, estimated HF features may be extracted by the last 3×3 convolutional layer and provided as the output HF information of HF estimation unit 506.

Similarly, LF estimation unit 507 may include an LF machine learning model having one or more sub-models 609 for LF components and features. The LF machine learning model may be a CNN, and each sub-model 609 may be a CNN block. Sub-models 609 may include a number of levels, such as one or more residual blocks (ResBlocks) , 1×1 convolutional layers, and additional convolutional layers (e.g., 3×3 convolutional layers) , to extract LF features from the initial LF features step-by-step and get deep semantic information with multiscale LF features. In one example, a residual block may include a smaller number of convolutional layers than a dense block (e.g., two convolutional layers) due to because the LF features are simpler and more predictable than the HF features. Each residual block may also be replaced by a dense block. For example, the semantic information with multiscale LF features from one or more residual blocks of sub-model 609 may be used to distinguish between valid and invalid smooth areas. In one example, a 1×1 convolutional layer may be used to reduce the number of channels of the extracted LF features to reduce the computation complexity of the LF machine learning model. In some embodiments, the initial LF features (i.e., the input of the first sub-model 609 as shown in FIG. 6A) and/or any intermediate LF features between sub-models 609 (not shown) may be added to the output of the last sub-model 609 by an addition operation, for example, to obtain the residual therebetween, thereby further reducing the computation complexity of the LF machine learning model. For example, an element-wise addition at the pixel level may be performed. Eventually, estimated LF features may be extracted by the last 3×3 convolutional layer and provided as the output LF information of LF estimation unit 507.

As shown in FIG. 6A, reconstruction unit 508 may include a concatenation layer 610 and an upsampling unit 612. Concatenation layer 610 may concatenate the estimated HF features from HF estimation unit 506 with the estimated LF features from LF estimation unit 507. As described above, due to the edge information loss (i.e., lack of HF components) from compression/decompression, the edges of the scene in depth image D’ may become blurry or otherwise distorted, which can be compensated by estimated HF features from HF components of color image I’ and depth image D’. Upsampling unit 612 may use an upsampling machine learning model (e.g., a CNN) for upsampling input depth image D’ (downsampled by DWT unit 603 to half) based on the concatenated HF and LF features to reconstruct output depth image D” , which has the same size as input depth image D’. The upsampling machine learning model may include one or more upsampling layers to learn and recover output depth image D” . It is understood that the upsampling machine learning model may be replaced with any other suitable upsampling means, such as interpolation or inverse DWT (IDWT) , provided that four sub-band components (HH, HL, LH, and LL) for IDWT can be provided from the concatenated HF and LF features.

As shown in another example of FIG. 6B, guide input 502 may still be an input color image I’, while target input 503 may become a downsampled input depth image D _s’ with half size of original depth image D. That is, the downsampling factor of input depth image D _s’ may be 2 in this example, and the size of input color image I’ may be twice the size of input depth image D _s’. Different from the example in FIG. 6A, pre-processing unit 505 in FIG. 6B may not include DWT unit 603, such that the convolution layer of feature extraction unit 605 may extract the initial features, as opposed to initial HF features and LF features, directly from input depth image D _s’ without frequency component separation. On the one hand, since input depth image D _s’ has already been downsampled, frequency component separation using DWT may not be able to obtain usefully HF or LF features from the downsampled image. On the other hand, DWT may further downsample input depth image D _s’ to cause a mismatch between the size of initial HF/LF features of input color image I’ and the size of initial HF/LF features of input depth image D _s’. That is, when the downsampling factor is 2, DWT may be performed only on input color image I’ because input depth image D _s’ has half size of input color image I’.

As shown in FIG. 6B, different from the example in FIG. 6A, concatenation layer 606 of pre-processing unit 504 may concatenate the initial HF features of input color image I’ and the initial features of input depth image D _s’, and concatenation layer 607 of pre-processing unit 505 may concatenate the initial LF features of input color image I’ and the initial features of input depth image D _s’. As a result, the size of the extracted HF features of input color image I’ may still be the same as the size of the extracted features of input depth image D _s’, and the size of the extracted LF features of input color image I’ may still be the same as the size of the extracted features of input depth image D _s’. The concatenated initial HF features of input color image I’ and initial features of input depth image D _s’ may be provided as the input to HF estimation unit 506, and the concatenated initial LF features of input color image I’ and initial features of input depth image D _s’ may be provided as the input to LF estimation unit 507. Besides pre-processing

units

504 and 505, the remaining elements may remain the same between FIGs. 6A and 6B and thus, are not repeated for ease of description.

As shown in still another example of FIG. 6C, guide input 502 may still be an input color image I’, while target input 503 may become a downsampled input depth image D _s’ with a quarter size of original depth image D. That is, the downsampling factor of input depth image D _s’ may be 4 in this example, and the size of input color image I’ may be four times the size of input depth image D _s’. Different from the example in FIG. 6B, pre-processing unit 504 in FIG. 6C may further include a downsampling unit 614 to downsample the initial HF features and the initial LF features with a downsampling factor of 2, such that the size of the extracted HF features of input color image I’ after downsampling unit 614 may still be the same as the size of the extracted features of input depth image D _s’, and the size of the extracted LF features of input color image I’ after downsampling unit 614 may still be the same as the size of the extracted features of input depth image D _s’. It is understood that the downsampling may be performed before feature extraction unit 604 or before DWT unit 602 in some examples, as long as the size of initial HF/LF features of input color image I can match the size of initial features of input depth image D _s’ at concatenation layers 606 and 607. That is, downsampling unit 614 of pre-processing unit 504 may be configured to downsample input color image I’, the separated HF components (HH, HL, LH) and LF components (LL) of input color image I’, or the extracted HF features and LF features of input color image I’.

As shown in FIG. 6C, the upsampling machine learning model of upsampling unit 612 in reconstruction unit 508 may have more levels (e.g., two) to upsample input depth image D _s’ with a downsampling factor of 4 compared with the example in FIG. 6B where input depth image D _s’ has a downsampling factor of 2. Besides pre-processing

units

504 and 505 and reconstruction unit 508, the remaining elements may remain the same between FIGs. 6A and 6C and thus, are not repeated for ease of description.

It is understood that the number of

sub-models

608 and 609 in HF estimation unit 506 and LF estimation unit 507, as well as the number of levels in each sub-model 608 or 609, are not limited to the examples of FIGs. 6A–6C and may be any suitable numbers as needed, for example, based on the efficiency and accuracy requirements of guided reconstruction module 308. It is also understood that the number of

sub-models

608 and 609 in HF estimation unit 506 and LF estimation unit 507, as well as the number of levels in each sub-model 608 or 609, are not affected by the downsampling factor of input depth image D’/D _s’ and thus, can be optimized based on the efficiency and accuracy requirements of guided reconstruction module 308 without the constrain from the downsampling factor of input depth image D’/D _s’. It is further understood that the downsampling factor of input depth image D’/D _s’ is not limited to the examples (e.g., 1, 2, and 4) of FIGs. 6A–6C and may be any suitable numbers as needed. The design of guided reconstruction unit 308 may be similarly adjusted by introducing or removing DWT unit 603, introducing or removing downsampling unit 614, and/or adjusting the downsampling factor of downsampling unit 614, such that the size of the initial HF/LF features of input color image I’ is the same of the size of the initial features of input depth image D _s’

FIG. 7 illustrates visual comparison between anchor images and various output depth images from guided reconstruction modules 308 in FIGs. 6A–6C with different QPs, according to some embodiments of the present disclosure. As shown in the first column of FIG. 7, the result of the 3D-HEVC anchors contains obvious blocky artifacts with the increase of QP. As QP increases, in the second column of FIG. 7, output depth image D” from guided reconstruction module 308 in FIG. 6A (e.g., downsampling factor = 1) makes depth edges clear while significantly reducing blocking artifacts. As QP increases, in the third column of FIG. 7, output depth image D” from guided reconstruction module 308 in FIG. 6B (e.g., downsampling factor = 2) makes some small edges blurry, causing local distortion without obvious blocky artifacts. As QP increases, in the fourth column of FIG. 7, output depth image D” from guided reconstruction module 308 in FIG. 6C (e.g., downsampling factor = 4) , the blurry degree becomes serious with more distortion; however, there are no serious blocking artifacts. The results demonstrate that guided reconstruction modules 308 in FIGs. 6A–6C can effectively use multi-sensor collaboration for video compression while successfully suppressing blocky artifacts.

Moreover, compared with the 3D-HEVC anchors, guided reconstruction modules 308 in FIGs. 6A–6C can remarkably save bits in terms of Bjontegaard Delta (BD) -rate, while maintaining depth quality. For example, when the downsampling factor is 1, the bitrate is the same as that of 3D-HEVC, but its peak signal-to-noise ratio (PSNR) is higher than the 3D-HEVC anchors, which means that guided reconstruction module 308 in FIG. 6A can effectively improve the depth quality in the same bitrate; when the downsampling factor is 2, guided reconstruction module 308 in FIG. 6B can reduce bitrate by 50%to 75%while maintaining the depth quality; when the downsampling factor is 4, guided reconstruction module 308 in FIG. 6C can further save bitrate about 60%to 90%compared to the 3D-HEVC anchor. Although most depth information may become lost when the downsampling factor is 4, guided reconstruction module 308 in FIG. 6C may still maintain the depth quality to a certain degree.

FIG. 8 illustrates a flow chart of an exemplary method 800 for encoding, according to some embodiments of the present disclosure. Method 800 may be performed by encoder 101 of encoding system 100, encoding and decoding system 300, or any other suitable image and video encoding systems. Method 800 may include

operations

802, 804, 806, and 808, as described below. It is understood that some of the operations (e.g., operation 806) may be optional, and some of the operations may be performed simultaneously, or in a different order than shown in FIG. 8.

At operation 802, a first image of a scene is acquired by a first sensor. In some embodiments, the first image is a depth image. As shown in FIG. 3, first sensor 108 may be configured to acquire a depth image. At operation 804, a second image of the scene is acquired by a second sensor. In some embodiments, the first image and the second image are complementary to one another with respect to the scene. In some embodiments, the second image is a color image. As shown in FIG. 3, second sensor 108 may be configured to acquire a color image. At optional operation 806, the first image is downsampled by a first processor. In some embodiments, to downsample the first image, at least one of interpolation, uniform sampling, or a machine learning model is used. As shown in FIGs. 1 and 3, downsampling module 302 of encoder 101 in processor 102 may be configured to downsample the depth image. At operation 808, The first image and the second image are compressed into a bitstream by the first processor. As shown in FIGs. 1 and 3, compression module 304 of encoder 101 in processor 102 may be configured to compress the depth image and the color image into the output bitstream, for example, using a 3D-HEVC codec. It is understood that

operations

802, 804, 806, and 808 may be repeated for each frame of a video to encode a video, such as a 3D video. It is also understood that in some examples, the first image and/or the second image may not be acquired by first and/or second sensors but obtained through any other suitable means. In one example, the first image and/or the second image may be derived or otherwise obtained from other image (s) of the scene using any suitable image analysis or processing techniques. In another example, the first image and/or the second image may be acquired by a third party and transmitted from the third party to the encoder.

FIG. 9 illustrates a flow chart of an exemplary method 900 for decoding, according to some embodiments of the present disclosure. Method 900 may be performed by decoder 201 of decoding system 200, encoding and decoding system 300, or any other suitable image and video decoding systems. Method 900 may include

operations

902, 904, 906, and 908, as described below. It is understood that some of the operations may be optional, and some of the operations may be performed simultaneously, or in a different order than shown in FIG. 9.

At operation 902, the bitstream is decompressed to receive the first image and the second image by a second processor. As shown in FIGs. 2 and 3, decompression module 306 of decoder 201 in processor 102 may be configured to decompress the input bitstream to receive the depth image and the color image, for example, using a 3D-HEVC codec. It is understood that in some examples, the first image and/or the second image may not be obtained by the decoder from the decompressed bitstream but obtained through any other suitable means. In one example, the first image and/or the second image may be derived or otherwise obtained from other image (s) of the scene using any suitable image analysis or processing techniques. In another example, the first image and/or the second image may be acquired by sensors coupled directly to the decoder. It is further understood that depending on whether optional operation 806 is performed or not, the first image (e.g., the depth image) obtained from the bitstream may be a downsampled first image or not.

At operation 904, first information at a first frequency of the first image is estimated based, at least in part, on a first component at the first frequency of the second image by the second processor. The first information may include features indicative of edges of the scene. At operation 906, second information at a second frequency of the first image is estimated based, at least in part, on a second component at the second frequency of the second image by the second processor. The first frequency may be higher than the second frequency. In some embodiments, the first information may include features indicative of edges of the scene, and the second information may include features indicative of smooth areas of the scene. As shown in FIGs. 2 and 3, guided reconstruction module 308 of decoder 201 in processor 102 may be configured to estimate the first information and the second information.

Referring to FIGs. 10A and 10B, in some embodiments, the first component and the second component of the second image are separated from the second image at operation 1002. In some embodiments, the first and second components are separated using DWT. In some embodiments, initial features at the first frequency of the second image are extracted from the first component at operation 1004, and initial features at the second frequency of the second image are extracted from the second component at operation 1006. As shown in FIGs. 5 and 6A–6C, pre-processing unit 504 of guided reconstruction module 308 may be configured to separate the HF and LF components from input color image I’, and extract initial HF features and initial LF features from the HF component and the LF component, respectively.

In some embodiments in which the size of the first image is the same as the size of the second image, the first information is estimated based on the first component of the second image and a first component at the first frequency of the first image, and the second information is estimated based on the second component of the second image and a second component at the second frequency of the first image.

Referring to FIG. 10A, in some embodiments, the first component and the second component of the first image are separated from the first image at operation 1008. In some embodiments, the first and second components are separated using DWT. In some embodiments, initial features at the first frequency of the first image are extracted from the first component at operation 1010, and initial features at the second frequency of the first image are extracted from the second component at operation 1012. As shown in FIGs. 5 and 6A, pre-processing unit 505 of guided reconstruction module 308 may be configured to separate the HF and LF components from input depth image D’ with a downsampling factor equal to 1, and extract initial HF features and initial LF features from the HF component and the LF component, respectively.

Referring to FIG. 10A, in some embodiments, the initial features at the first frequency of the first and second images are combined at operation 1014. As shown in FIGs. 5 and 6A, pre-processing unit 504 of guided reconstruction module 308 may be configured to combine the initial HF features of input color image I’ and the initial HF features of input depth image D’.

Referring to FIG. 10A, in some embodiments, features at the first frequency are estimated based on the combined initial features at the first frequency of the first and second images using a first learning model for the first frequency at operation 1016. As shown in FIGs. 5 and 6A, HF estimation unit 506 of guided reconstruction module 308 may be configured to estimate HF features of input depth image D’ based on the combined initial HF features of input color image I’ and input depth image D’ using an HF machine learning model.

Referring to FIG. 10A, in some embodiments, the initial features at the second frequency of the first and second images are combined at operation 1018. As shown in FIGs. 5 and 6A, pre-processing unit 505 of guided reconstruction module 308 may be configured to combine the initial LF features of input color image I’ and the initial LF features of input depth image D’.

Referring to FIG. 10A, in some embodiments, features at the second frequency are estimated based on the combined initial features at the second frequency of the first and second images using a second learning model for the second frequency at operation 1020. As shown in FIGs. 5 and 6A, LF estimation unit 507 of guided reconstruction module 308 may be configured to estimate LF features of input depth image D’ based on the combined initial LF features of input color image I’ and input depth image D’ using an LF machine learning model.

In some embodiments in which the size of the first image is smaller than the size of the second image, the first information is estimated based on the first image and the first component of the second image, and the second information is estimated based on the first image and the second component of the second image.

Referring to FIG. 10B, in some embodiments, initial features of the first image are extracted at operation 1009. As shown in FIGs. 5, 6B, and 6C, pre-processing unit 505 of guided reconstruction module 308 may be configured to extract initial features from downsampled input depth image D _s’ with a downsampling factor equal to 2 or 4.

Referring to FIG. 10B, in some embodiments, the initial features at the first frequency of the second image and the initial features of the first image are combined at operation 1011. As shown in FIGs. 5, 6B, and 6C, pre-processing unit 504 of guided reconstruction module 308 may be configured to combine the initial HF features of input color image I’ and the initial features of input depth image D _s’.

Referring to FIG. 10B, in some embodiments, features at the first frequency are estimated based on the combined initial features of the first and second images using a first learning model for the first frequency at operation 1013. As shown in FIGs. 5, 6B, and 6C, HF estimation unit 506 of guided reconstruction module 308 may be configured to estimate HF features of input depth image D _s’ based on the combined initial HF features of input color image I’ and initial features of input depth image D _s’ using an HF machine learning model.

Referring to FIG. 10B, in some embodiments, the initial features at the second frequency of the second image and the initial features of the first image are combined at operation 1015. As shown in FIGs. 5, 6B, and 6C, pre-processing unit 505 of guided reconstruction module 308 may be configured to combine the initial LF features of input color image I’ and the initial features of input depth image D _s’.

Referring to FIG. 10B, in some embodiments, features at the second frequency are estimated based on the combined initial features of the first and second images using a second learning model for the second frequency at operation 1017. As shown in FIGs. 5, 6B, and 6C, LF estimation unit 507 of guided reconstruction module 308 may be configured to estimate LF features of input depth image D _s’ based on the combined initial LF features of input color image I’ and initial features of input depth image D _s’ using an LF machine learning model.

Optionally, in some embodiments, the second image, the first and second components of the second image, or the initial features of the second image are downsampled based on the size of the first image prior to combining the initial features at

operations

1011 and 1015, for example, at operation 1007. As shown in FIGs. 5 and 6C, pre-processing unit 504 of guided reconstruction module 308 may be configured to downsample input color image I’ when the size of downsampled input depth image D _s’ is quarter of the size of input color image I’, such that the size of initial HF/LF features of input color image I’ becomes the same as the size of initial features of input depth image D _s’. It is understood that pre-processing unit 504 of guided reconstruction module 308 may be configured to downsample input color image I’, the separated HF components (HH, HL, LH) and LF components (LL) of input color image I’, or the extracted HF features and LF features of input color image I’.

Referring back to FIG. 9, at operation 908, the first image is reconstructed based on the first information and the second information by the second processor. As shown in FIGs. 2 and 3, guided reconstruction module 308 of decoder 201 in processor 102 may be configured to reconstruct the first image based on the first information and the second information.

Referring to FIGs. 10A and 10B, in some embodiments, to reconstruct the first image, the estimated first information and the estimated second information are combined at operation 1022, and the first image is upsampled based on the combined first and second information at operation 1024, for example, using a third machine learning model for upsampling. As shown in FIGs. 5 and 6A–6C, reconstruction unit 508 of guided reconstruction module 308 may be configured to combine the estimated HF features and LF features of input depth image D’/D _s’, and upsample input depth image D’/D _s’ based on the combined HF and LF features using an upsampling machine learning model to generate output depth image D” that has the same size as input color image I’.

FIG. 11 illustrates a block diagram of an exemplary model training system 1100, according to some embodiments of the present disclosure. System 1100 may be configured to train the various machine learning models described herein, such as the HF, LF, and upsampling machine learning models described above with respect to FIGs. 6A–6C. The machine learning models may be trained jointly as a single model or separately as individual models in different examples by system 1100. System 1100 may be implemented by encoding system 100, decoding system 200, or a separate computing system.

As shown in FIG. 11, system 1100 may include a model training module 1102 configured to train each CNN model 1101 over a set of training samples 1104 based on a loss function 1106 using a training algorithm 1108. CNN models 1101 may include the HF, LF, and upsampling machine learning models described above with respect to FIGs. 6A–6C in detail.

In some embodiments, as shown in FIG. 12, each training sample 1104 includes an input color image 1202 of a scene, an input depth image 1204 of the scene, and a ground truth (GT) depth image 1206 of the scene. The training may be supervised training with GT depth image 1206 in each training sample 1104. In some embodiments, input color image 1202 and input depth image 1204 in each training sample 1104 are compressed. In other words, input color image 1202 and input depth image 1204 may be compressed and decompressed using the same codec to be used by the encoding/decoding system in which CNN models 1101 are to be used. In some embodiments, input color image 1202 and input depth image 1204 in each training sample 1104 are compressed based on a QP. In other words, input color image 1202 and input depth image 1204 may be compressed and decompressed based on the same QP or a similar QP (e.g., in the same group of QPs) that is to be used by the encoding/decoding system in which CNN models 1101 are to be used.

Referring back to FIG. 11, CNN model 1101 may include a plurality of parameters that can be jointly adjusted by model training module 1102 when being fed with training samples 1104. Model training module 1102 may jointly adjust the parameters of CNN model 1101 to minimize loss function 1106 over training samples 1104 using training algorithm 1108. Training algorithm 1108 may be any suitable iterative optimization algorithm for finding the minimum of loss function 1106, including gradient descent algorithms (e.g., the stochastic gradient descent algorithm) .

As shown in FIG. 12, in some embodiments, for each training sample 1104, model training module 1102 is configured to reconstruct an output depth image1208 by upsampling input depth image 1204 with the guidance of HF components from input color image 1202 using CNN model (s) 1101. In some embodiments, for each training sample 1104, model training module 1102 is further configured to train CNN model (s) 1101 based on the difference between each output depth image 1208 and the corresponding GT depth image 1206 using loss function 1106. That is, the loss between output depth image 1208 and GT depth image 1206 may be calculated. For example, loss function 1106 may combine both L1 loss L ₁ and structural similarity index (SSIM) loss L _SSIM with different weights as follows to preserve as much structure and detail information as possible:

Loss _r (x, y) = L ₁ (x, y) +w×L _SSIM (x, y) (2) ,

where x and y are output depth image 1208 and the corresponding GT depth image 1206, respectively, and w is the weight of SSIM loss. L ₁ and L _SSIM may be defined as follows:

L ₁(x, y) = |x-y| (3) ,

L _SSIM (x, y) = [l (x, y) ] ^α [c (x, y) ] ^β [s (x, y) ] ^γ (4) ,

where SSIM compares luminance, contrast, and structure simultaneously as follows:

Luminance part:

Contrast part:

Structure part:

where μ _x and μ _y are means of x and y, respectively;

and

are variances of x and y; σ _xy is the covariance of x and y; c ₁= (k ₁L) ², c ₂= (k ₂L) ² are the constants and c ₃=c ₂/2, and L is the range of pixel values. In some embodiments, since the pixel values are more important for residual maps, a larger weight may be assigned to L1 loss than SSIM loss. For example, the weight w of SSIM loss may be smaller than 1, such as 0.05, k ₁ may equal 0.01, and k ₂ may equal 0.03.

In various aspects of the present disclosure, the functions described herein may be implemented in hardware, software, firmware, or any combination thereof. If implemented in software, the functions may be stored as instructions on a non-transitory computer-readable medium. Computer-readable media includes computer storage media. Storage media may be any available media that can be accessed by a processor, such as processor 102 in FIGs. 1 and 2. By way of example, and not limitation, such computer-readable media can include RAM, ROM, EEPROM, CD-ROM or other optical disk storage, HDD, such as magnetic disk storage or other magnetic storage devices, Flash drive, SSD, or any other medium that can be used to carry or store desired program code in the form of instructions or data structures and that can be accessed by a processing system, such as a mobile device or a computer. Disk and disc, as used herein, includes CD, laser disc, optical disc, digital video disc (DVD) , and floppy disk where disks usually reproduce data magnetically, while discs reproduce data optically with lasers. Combinations of the above should also be included within the scope of computer-readable media.

In some embodiments, the first image and the second image are complementary to one another with respect to the scene.

In some embodiments, the first image is a depth image, and the second image is a color image.

In some embodiments, the first frequency is higher than the second frequency. In some embodiments, the first information includes features indicative of edges of the scene, and the second information includes features indicative of smooth areas of the scene.

In some embodiments, the first component and the second component of the second image are separated from the second image, initial features at the first frequency of the second image are extracted from the first component, and initial features at the second frequency of the second image are extracted from the second component.

In some embodiments, the first and second components are extracted using DWT.

In some embodiments, a size of the first image is the same as a size of the second image. In some embodiments, to estimate the first information, the first information is estimated based on the first component of the second image and a first component at the first frequency of the first image. In some embodiments, to estimate the second information, the second information is estimated based on the second component of the second image and a second component at the second frequency of the first image.

In some embodiments, the first component and the second component of the first image are separated from the first image, initial features at the first frequency of the first image are extracted from the first component, the initial features at the first frequency of the first and second images are combined, initial features at the second frequency of the first image are extracted from the first component, and the initial features at the second frequency of the first and second images are combined.

In some embodiments, to estimate the first information, features at the first frequency are estimated based on the combined initial features at the first frequency of the first and second images using a first learning model for the first frequency. In some embodiments, to estimate the second information, features at the second frequency are estimated based on the combined initial features at the second frequency of the first and second images using a first learning model for the first frequency.

In some embodiments, a size of the first image is smaller than a size of the second image. In some embodiments, to estimate the first information, the first information is estimated based on the first image and the first component of the second image. In some embodiments, to estimate the second information, the second information is estimated based on the first image and the second component of the second image.

In some embodiments, initial features of the first image are extracted, the initial features at the first frequency of the second image and the initial features of the first image are combined, and the initial features at the second frequency of the second image and the initial features of the first image are combined.

In some embodiments, the second image, the first and second components of the second image, or the initial features of the second image are downsampled based on the size of the first image prior to combining the initial features.

In some embodiments, to estimate the first information, features at the first frequency are estimated based on the combined initial features of the first and second images using a first learning model for the first frequency. In some embodiments, to estimate the second information, features at the second frequency are estimated based on the combined initial features of the first and second images using a second learning model for the second frequency.

In some embodiments, to reconstruct the first image, the estimated first information and the estimated second information are combined, and the first image is upsampled based on the combined first and second information.

In some embodiments, to upsample the first image, the first image is upsampled based on the combined first and second information using a third learning model for upsampling.

In some embodiments, the processor is further configured to separate the first component and the second component of the second image from the second image, extract initial features at the first frequency of the second image from the first component, and extract initial features at the second frequency of the second image from the second component.

In some embodiments, the first and second components are separated using DWT.

In some embodiments, a size of the first image is the same as a size of the second image. In some embodiments, to estimate the first information, the processor is further configured to estimate the first information based on the first component of the second image and a first component at the first frequency of the first image. In some embodiments, to estimate the second information, the processor is further configured to estimate the second information based on the second component of the second image and a second component at the second frequency of the first image.

In some embodiments, the processor is further configured to separate the first component and the second component of the first image from the first image, extract initial features at the first frequency of the first image from the first component, combine the initial features at the first frequency of the first and second images, extract initial features at the second frequency of the first image from the first component, and combine the initial features at the second frequency of the first and second images.

In some embodiments, to estimate the first information, the processor is further configured to estimate features at the first frequency based on the combined initial features at the first frequency of the first and second images using a first learning model for the first frequency. In some embodiments, to estimate the second information, the processor is further configured to estimate features at the second frequency based on the combined initial features at the second frequency of the first and second images using a second learning model for the second frequency.

In some embodiments, a size of the first image is smaller than a size of the second image. In some embodiments, to estimate the first information, the processor is further configured to estimate the first information based on the first image and the first component of the second image. In some embodiments, to estimate the second information, the processor is further configured to estimate the second information based on the first image and the second component of the second image.

In some embodiments, the processor is further configured to extract initial features of the first image, combine the initial features at the first frequency of the second image and the initial features of the first image, and combine the initial features at the second frequency of the second image and the initial features of the first image.

In some embodiments, the processor is further configured to downsample the second image, the first and second components of the second image, or the initial features of the second image based on the size of the second image prior to combining the initial features.

In some embodiments, to estimate the first information, the processor is further configured to estimate features at the first frequency based on the combined initial features of the first and second images using a first learning model for the first frequency. In some embodiments, to estimate the second information, the processor is further configured to estimate features at the second frequency based on the combined initial features of the first and second images using a second learning model for the second frequency.

In some embodiments, to reconstruct the first image, the processor is further configured to combine the estimated first information and the estimated second information, and upsample the first image based on the combined first and second information.

In some embodiments, to upsample the first image, the processor is further configured to upsample the first image based on the combined first and second information using a third learning model for upsampling.

In some embodiments, the first image and the second image are compressed by a second processor into a bitstream, the bitstream is transmitted from the second processor to the first processor, and the bitstream is decompressed by the first processor to receive the first image and the second image.

In some embodiments, a size of the first image is the same as a size of the second image. In some embodiments, to estimate the first information, the first information is estimated based on the first image and the first component of the second image. In some embodiments, to estimate the second information, the second information is estimated based on the first image and the second component of the second image.

In some embodiments, the first image is downsampled by the second processor prior to compressing, such that a size of the first image becomes smaller than a size of the second image.

In some embodiments, to estimate the first information, the first information is estimated based on the first image and the first component of the second image. In some embodiments, to estimate the second information, the second information is estimated based on the first image and the second component of the second image.

In some embodiments, the first processor of the encoding subsystem is configured to compress the first image and the second image into a bitstream. In some embodiments, the encoding subsystem further includes a first interface configured to transmit the bitstream to the decoding subsystem. In some embodiments, the decoding subsystem further includes a second interface configured to receive the bitstream from the encoding subsystem. In some embodiments, the second processor of the decoding subsystem is further configured to decompress the bitstream to receive the first image and the second image.

In some embodiments, a size of the first image is the same as a size of the second image. In some embodiments, to estimate the first information, the second processor is further configured to estimate the first information based on the first component of the second image and a first component at the first frequency of the first image. In some embodiments, to estimate the second information, the second processor is further configured to estimate the second information based on the second component of the second image and a second component at the second frequency of the first image.

In some embodiments, the first processor is further configured to downsample the first image prior to compressing, such that a size of the first image becomes smaller than a size of the second image.

In some embodiments, to estimate the first information, the second processor is further configured to estimate the first information based on the first image and the first component of the second image. In some embodiments, to estimate the second information, the second processor is further configured to estimate the second information based on the first image and the second component of the second image.

The foregoing description of the embodiments will so reveal the general nature of the present disclosure that others can, by applying knowledge within the skill of the art, readily modify and/or adapt for various applications such embodiments, without undue experimentation, without departing from the general concept of the present disclosure. Therefore, such adaptations and modifications are intended to be within the meaning and range of equivalents of the disclosed embodiments, based on the teaching and guidance presented herein. It is to be understood that the phraseology or terminology herein is for the purpose of description and not of limitation, such that the terminology or phraseology of the present specification is to be interpreted by the skilled artisan in light of the teachings and guidance.

Embodiments of the present disclosure have been described above with the aid of functional building blocks illustrating the implementation of specified functions and relationships thereof. The boundaries of these functional building blocks have been arbitrarily defined herein for the convenience of the description. Alternate boundaries can be defined so long as the specified functions and relationships thereof are appropriately performed.

The Summary and Abstract sections may set forth one or more but not all exemplary embodiments of the present disclosure as contemplated by the inventor (s) , and thus, are not intended to limit the present disclosure and the appended claims in any way.

Various functional blocks, modules, and steps are disclosed above. The arrangements provided are illustrative and without limitation. Accordingly, the functional blocks, modules, and steps may be reordered or combined in different ways than in the examples provided above. Likewise, some embodiments include only a subset of the functional blocks, modules, and steps, and any such subset is permitted.

The breadth and scope of the present disclosure should not be limited by any of the above-described exemplary embodiments, but should be defined only in accordance with the following claims and their equivalents.

Claims

A method, comprising:

receiving, by a processor, a first image and a second image of a same scene, wherein the first image and the second image are acquired by two different sensors;

estimating, by the processor, first information at a first frequency of the first image based, at least in part, on a first component at the first frequency of the second image;

estimating, by the processor, second information at a second frequency of the first image based, at least in part, on a second component at the second frequency of the second image; and

reconstructing, by the processor, the first image based on the first information and the second information.
The method of claim 1, wherein the first image and the second image are complementary to one another with respect to the scene.
The method of claim 1, wherein the first image is a depth image, and the second image is a color image.
The method of claim 3, wherein

the first frequency is higher than the second frequency; and

the first information comprises features indicative of edges of the scene, and the second information comprises features indicative of smooth areas of the scene.
The method of claim 1, further comprising:

separating the first component and the second component of the second image from the second image;

extracting initial features at the first frequency of the second image from the first component; and

extracting initial features at the second frequency of the second image from the second component.
The method of claim 5, wherein the first and second components are separated using discrete wavelet transform (DWT) .
The method of claim 5, wherein

a size of the first image is the same as a size of the second image;

estimating the first information comprises estimating the first information based on the first component of the second image and a first component at the first frequency of the first image; and

estimating the second information comprises estimating the second information based on the second component of the second image and a second component at the second frequency of the first image.
The method of claim 7, further comprising:

separating the first component and the second component of the first image from the first image;

extracting initial features at the first frequency of the first image from the first component;

combining the initial features at the first frequency of the first and second images;

extracting initial features at the second frequency of the first image from the first component; and

combining the initial features at the second frequency of the first and second images.
The method of claim 8, wherein

estimating the first information comprises estimating features at the first frequency based on the combined initial features at the first frequency of the first and second images using a first learning model for the first frequency; and

estimating the second information comprises estimating features at the second frequency based on the combined initial features at the second frequency of the first and second images using a second learning model for the second frequency.
The method of claim 5, wherein

a size of the first image is smaller than a size of the second image;

estimating the first information comprises estimating the first information based on the first image and the first component of the second image; and

estimating the second information comprises estimating the second information based on the first image and the second component of the second image.
The method of claim 10, further comprising:

extracting initial features of the first image;

combining the initial features at the first frequency of the second image and the initial features of the first image; and

combining the initial features at the second frequency of the second image and the initial features of the first image.
The method of claim 11, further comprising downsampling the second image, the first and second components of the second image, or the initial features of the second image based on the size of the first image prior to combining the initial features.
The method of claim 11, wherein

estimating the first information comprises estimating features at the first frequency based on the combined initial features of the first and second images using a first learning model for the first frequency; and

estimating the second information comprises estimating features at the second frequency based on the combined initial features of the first and second images using a second learning model for the second frequency.
The method of claim 1, wherein reconstructing the first image comprises:

combining the estimated first information and the estimated second information; and

upsampling the first image based on the combined first and second information.
The method of claim 14, wherein upsampling the first image comprises upsampling the first image based on the combined first and second information using a third learning model for upsampling.
A system, comprising:

a memory configured to store instructions; and

a processor coupled to the memory and configured to, upon executing the instructions:

receive a first image and a second image of a same scene, wherein the first image and the second image are acquired by two different sensors;

estimate first information at a first frequency of the first image based, at least in part, on a first component at the first frequency of the second image;

estimate second information at a second frequency of the first image based, at least in part, on a second component at the second frequency of the second image; and

reconstruct the first image based on the first information and the second information.
The system of claim 16, wherein the first image and the second image are complementary to one another with respect to the scene.
The system of claim 16, wherein the first image is a depth image, and the second image is a color image.
The system of claim 18, wherein

the first frequency is higher than the second frequency; and

the first information comprises features indicative of edges of the scene, and the second information comprises features indicative of smooth areas of the scene.
The system of claim 16, wherein the processor is further configured to:

separate the first component and the second component of the second image from the second image;

extract initial features at the first frequency of the second image from the first component; and

extract initial features at the second frequency of the second image from the second component.
The system of claim 20, wherein the first and second components are separated using discrete wavelet transform (DWT) .
The system of claim 20, wherein

a size of the first image is the same as a size of the second image;

to estimate the first information, the processor is further configured to estimate the first information based on the first component of the second image and a first component at the first frequency of the first image; and

to estimate the second information, the processor is further configured to estimate the second information based on the second component of the second image and a second component at the second frequency of the first image.
The system of claim 22, wherein the processor is further configured to:

separate the first component and the second component of the first image from the first image;

extract initial features at the first frequency of the first image from the first component;

combine the initial features at the first frequency of the first and second images;

extract initial features at the second frequency of the first image from the first component; and

combine the initial features at the second frequency of the first and second images.
The system of claim 23, wherein

to estimate the first information, the processor is further configured to estimate features at the first frequency based on the combined initial features at the first frequency of the first and second images using a first learning model for the first frequency; and

to estimate the second information, the processor is further configured to estimate features at the second frequency based on the combined initial features at the second frequency of the first and second images using a second learning model for the second frequency.
The system of claim 20, wherein

a size of the first image is smaller than a size of the second image;

to estimate the first information, the processor is further configured to estimate the first information based on the first image and the first component of the second image; and

to estimate the second information, the processor is further configured to estimate the second information based on the first image and the second component of the second image.
The system of claim 25, wherein the processor is further configured to:

extract initial features of the first image;

combine the initial features at the first frequency of the second image and the initial features of the first image; and

combine the initial features at the second frequency of the second image and the initial features of the first image.
The system of claim 26, wherein the processor is further configured to downsample the second image the first and second components of the second image, or the initial features of the second image based on the size of the first image prior to combining the initial features.
The system of claim 26, wherein

to estimate the first information, the processor is further configured to estimate features at the first frequency based on the combined initial features of the first and second images using a first learning model for the first frequency; and

to estimate the second information, the processor is further configured to estimate features at the second frequency based on the combined initial features of the first and second images using a second learning model for the second frequency.
The system of claim 16, wherein to reconstruct the first image, the processor is further configured to:

combine the estimated first information and the estimated second information; and

upsample the first image based on the combined first and second information.
The system of claim 29, wherein to upsample the first image, the processor is further configured to upsample the first image based on the combined first and second information using a third learning model for upsampling.
A method, comprising:

acquiring, by a first sensor, a first image of a scene;

acquiring, by a second sensor, a second image of the scene;

estimating, by a first processor, first information at a first frequency of the first image based, at least in part, on a first component at the first frequency of the second image;

estimating, by the first processor, second information at a second frequency of the first image based, at least in part, on a second component at the second frequency of the second image; and

reconstructing, by the first processor, the first image based on the first information and the second information.
The method of claim 31, further comprising:

compressing, by a second processor, the first image and the second image into a bitstream;

transmitting the bitstream from the second processor to the first processor; and

decompressing, by the first processor, the bitstream to receive the first image and the second image.
The method of claim 32, wherein

a size of the first image is the same as a size of the second image;

estimating the first information comprises estimating the first information based on the first component of the second image and a first component at the first frequency of the first image; and

estimating the second information comprises estimating the second information based on the second component of the second image and a second component at the second frequency of the first image.
The method of claim 32, further comprising:

downsampling, by the second processor, the first image prior to compressing, such that a size of the first image becomes smaller than a size of the second image.
The method of claim 34, wherein

estimating the first information comprises estimating the first information based on the first image and the first component of the second image; and

estimating the second information comprises estimating the second information based on the first image and the second component of the second image.
A system, comprising:

an encoding subsystem comprising:

a first sensor configured to acquire a first image of a scene;

a second sensor configured to acquire a second image of the scene;

a first memory configured to store instructions; and

a first processor coupled to the first memory and the first and second sensors;

a decoding subsystem comprising:

a second memory configured to store instructions; and

a second processor coupled to the second memory and configured to, upon executing the instructions:

estimate first information at a first frequency of the first image based, at least in part, on a first component at the first frequency of the second image;

estimate second information at a second frequency of the first image based, at least in part, on a second component at the second frequency of the second image; and

reconstruct the first image based on the first information and the second information.
The system of claim 36, wherein

the first processor of the encoding subsystem is configured to compress the first image and the second image into a bitstream;

the encoding subsystem further comprises a first interface configured to transmit the bitstream to the decoding subsystem;

the decoding subsystem further comprises a second interface configured to receive the bitstream from the encoding subsystem; and

the second processor of the decoding subsystem is further configured to decompress the bitstream to receive the first image and the second image.
The system of claim 37, wherein

a size of the first image is the same as a size of the second image;

to estimate the first information, the second processor is further configured to estimate the first information based on the first component of the second image and a first component at the first frequency of the first image; and

to estimate the second information, the second processor is further configured to estimate the second information based on the second component of the second image and a second component at the second frequency of the first image.
The system of claim 37, wherein the first processor is further configured to downsample the first image prior to compressing, such that a size of the first image becomes smaller than a size of the second image.
The system of claim 39, wherein

to estimate the first information, the second processor is further configured to estimate the first information based on the first image and the first component of the second image; and

to estimate the second information, the second processor is further configured to estimate the second information based on the first image and the second component of the second image.