US20200014945A1 - Application acceleration - Google Patents
- Publication number
- US20200014945A1 (application US16/442,581)
- Authority
- US
- United States
- Prior art keywords
- block
- blocks
- frame
- score
- reference frame
- Prior art date
- Legal status (The legal status is an assumption and is not a legal conclusion. Google has not performed a legal analysis and makes no representation as to the accuracy of the status listed.)
- Abandoned
Classifications
-
- H—ELECTRICITY
- H04—ELECTRIC COMMUNICATION TECHNIQUE
- H04N—PICTORIAL COMMUNICATION, e.g. TELEVISION
- H04N19/00—Methods or arrangements for coding, decoding, compressing or decompressing digital video signals
- H04N19/42—Methods or arrangements for coding, decoding, compressing or decompressing digital video signals characterised by implementation details or hardware specially adapted for video compression or decompression, e.g. dedicated software implementation
-
- H—ELECTRICITY
- H04—ELECTRIC COMMUNICATION TECHNIQUE
- H04N—PICTORIAL COMMUNICATION, e.g. TELEVISION
- H04N19/00—Methods or arrangements for coding, decoding, compressing or decompressing digital video signals
- H04N19/42—Methods or arrangements for coding, decoding, compressing or decompressing digital video signals characterised by implementation details or hardware specially adapted for video compression or decompression, e.g. dedicated software implementation
- H04N19/43—Hardware specially adapted for motion estimation or compensation
-
- H—ELECTRICITY
- H04—ELECTRIC COMMUNICATION TECHNIQUE
- H04N—PICTORIAL COMMUNICATION, e.g. TELEVISION
- H04N19/00—Methods or arrangements for coding, decoding, compressing or decompressing digital video signals
- H04N19/50—Methods or arrangements for coding, decoding, compressing or decompressing digital video signals using predictive coding
- H04N19/503—Methods or arrangements for coding, decoding, compressing or decompressing digital video signals using predictive coding involving temporal prediction
- H04N19/51—Motion estimation or motion compensation
- H04N19/567—Motion estimation based on rate distortion criteria
-
- H—ELECTRICITY
- H04—ELECTRIC COMMUNICATION TECHNIQUE
- H04N—PICTORIAL COMMUNICATION, e.g. TELEVISION
- H04N19/00—Methods or arrangements for coding, decoding, compressing or decompressing digital video signals
- H04N19/10—Methods or arrangements for coding, decoding, compressing or decompressing digital video signals using adaptive coding
- H04N19/102—Methods or arrangements for coding, decoding, compressing or decompressing digital video signals using adaptive coding characterised by the element, parameter or selection affected or controlled by the adaptive coding
- H04N19/103—Selection of coding mode or of prediction mode
- H04N19/105—Selection of the reference unit for prediction within a chosen coding or prediction mode, e.g. adaptive choice of position and number of pixels used for prediction
-
- H—ELECTRICITY
- H04—ELECTRIC COMMUNICATION TECHNIQUE
- H04N—PICTORIAL COMMUNICATION, e.g. TELEVISION
- H04N19/00—Methods or arrangements for coding, decoding, compressing or decompressing digital video signals
- H04N19/10—Methods or arrangements for coding, decoding, compressing or decompressing digital video signals using adaptive coding
- H04N19/102—Methods or arrangements for coding, decoding, compressing or decompressing digital video signals using adaptive coding characterised by the element, parameter or selection affected or controlled by the adaptive coding
- H04N19/119—Adaptive subdivision aspects, e.g. subdivision of a picture into rectangular or non-rectangular coding blocks
-
- H—ELECTRICITY
- H04—ELECTRIC COMMUNICATION TECHNIQUE
- H04N—PICTORIAL COMMUNICATION, e.g. TELEVISION
- H04N19/00—Methods or arrangements for coding, decoding, compressing or decompressing digital video signals
- H04N19/10—Methods or arrangements for coding, decoding, compressing or decompressing digital video signals using adaptive coding
- H04N19/102—Methods or arrangements for coding, decoding, compressing or decompressing digital video signals using adaptive coding characterised by the element, parameter or selection affected or controlled by the adaptive coding
- H04N19/12—Selection from among a plurality of transforms or standards, e.g. selection between discrete cosine transform [DCT] and sub-band transform or selection between H.263 and H.264
- H04N19/122—Selection of transform size, e.g. 8x8 or 2x4x8 DCT; Selection of sub-band transforms of varying structure or type
-
- H—ELECTRICITY
- H04—ELECTRIC COMMUNICATION TECHNIQUE
- H04N—PICTORIAL COMMUNICATION, e.g. TELEVISION
- H04N19/00—Methods or arrangements for coding, decoding, compressing or decompressing digital video signals
- H04N19/10—Methods or arrangements for coding, decoding, compressing or decompressing digital video signals using adaptive coding
- H04N19/102—Methods or arrangements for coding, decoding, compressing or decompressing digital video signals using adaptive coding characterised by the element, parameter or selection affected or controlled by the adaptive coding
- H04N19/13—Adaptive entropy coding, e.g. adaptive variable length coding [AVLC] or context adaptive binary arithmetic coding [CABAC]
-
- H—ELECTRICITY
- H04—ELECTRIC COMMUNICATION TECHNIQUE
- H04N—PICTORIAL COMMUNICATION, e.g. TELEVISION
- H04N19/00—Methods or arrangements for coding, decoding, compressing or decompressing digital video signals
- H04N19/10—Methods or arrangements for coding, decoding, compressing or decompressing digital video signals using adaptive coding
- H04N19/102—Methods or arrangements for coding, decoding, compressing or decompressing digital video signals using adaptive coding characterised by the element, parameter or selection affected or controlled by the adaptive coding
- H04N19/132—Sampling, masking or truncation of coding units, e.g. adaptive resampling, frame skipping, frame interpolation or high-frequency transform coefficient masking
-
- H—ELECTRICITY
- H04—ELECTRIC COMMUNICATION TECHNIQUE
- H04N—PICTORIAL COMMUNICATION, e.g. TELEVISION
- H04N19/00—Methods or arrangements for coding, decoding, compressing or decompressing digital video signals
- H04N19/10—Methods or arrangements for coding, decoding, compressing or decompressing digital video signals using adaptive coding
- H04N19/169—Methods or arrangements for coding, decoding, compressing or decompressing digital video signals using adaptive coding characterised by the coding unit, i.e. the structural portion or semantic portion of the video signal being the object or the subject of the adaptive coding
- H04N19/17—Methods or arrangements for coding, decoding, compressing or decompressing digital video signals using adaptive coding characterised by the coding unit, i.e. the structural portion or semantic portion of the video signal being the object or the subject of the adaptive coding the unit being an image region, e.g. an object
- H04N19/176—Methods or arrangements for coding, decoding, compressing or decompressing digital video signals using adaptive coding characterised by the coding unit, i.e. the structural portion or semantic portion of the video signal being the object or the subject of the adaptive coding the unit being an image region, e.g. an object the region being a block, e.g. a macroblock
-
- H—ELECTRICITY
- H04—ELECTRIC COMMUNICATION TECHNIQUE
- H04N—PICTORIAL COMMUNICATION, e.g. TELEVISION
- H04N19/00—Methods or arrangements for coding, decoding, compressing or decompressing digital video signals
- H04N19/50—Methods or arrangements for coding, decoding, compressing or decompressing digital video signals using predictive coding
- H04N19/503—Methods or arrangements for coding, decoding, compressing or decompressing digital video signals using predictive coding involving temporal prediction
- H04N19/51—Motion estimation or motion compensation
- H04N19/58—Motion compensation with long-term prediction, i.e. the reference frame for a current frame not being the temporally closest one
-
- H—ELECTRICITY
- H04—ELECTRIC COMMUNICATION TECHNIQUE
- H04N—PICTORIAL COMMUNICATION, e.g. TELEVISION
- H04N19/00—Methods or arrangements for coding, decoding, compressing or decompressing digital video signals
- H04N19/60—Methods or arrangements for coding, decoding, compressing or decompressing digital video signals using transform coding
- H04N19/61—Methods or arrangements for coding, decoding, compressing or decompressing digital video signals using transform coding in combination with predictive coding
-
- H—ELECTRICITY
- H04—ELECTRIC COMMUNICATION TECHNIQUE
- H04N—PICTORIAL COMMUNICATION, e.g. TELEVISION
- H04N19/00—Methods or arrangements for coding, decoding, compressing or decompressing digital video signals
- H04N19/85—Methods or arrangements for coding, decoding, compressing or decompressing digital video signals using pre-processing or post-processing specially adapted for video compression
- H04N19/86—Methods or arrangements for coding, decoding, compressing or decompressing digital video signals using pre-processing or post-processing specially adapted for video compression involving reduction of coding artifacts, e.g. of blockiness
-
- H—ELECTRICITY
- H04—ELECTRIC COMMUNICATION TECHNIQUE
- H04N—PICTORIAL COMMUNICATION, e.g. TELEVISION
- H04N19/00—Methods or arrangements for coding, decoding, compressing or decompressing digital video signals
- H04N19/90—Methods or arrangements for coding, decoding, compressing or decompressing digital video signals using coding techniques not provided for in groups H04N19/10-H04N19/85, e.g. fractals
- H04N19/96—Tree coding, e.g. quad-tree coding
Definitions
- the present invention relates generally to the fields of computer architecture and application acceleration.
- Video encoding is described herein as one example of a workload which creates heavy stresses (by way of non-limiting example, in the cloud and/or in the data center; the same applies in other equipment and/or locations that are part of the video flow).
- data center workload includes millions of video streams which have to be encoded and re-encoded every minute, in order to stream them very efficiently to multiple devices (generally having different video codecs).
- adaptive streaming additionally requires the video to be encoded to multiple bitrates.
- FIG. 9 is a simplified graphical illustration depicting a relationship between CPU power and various encoding tasks
- FIG. 10 is a simplified pictorial illustration depicting an approximate timeline view of various past, present, and future encoding standards
- FIG. 11 is a simplified tabular illustration depicting complexity of various encoding standards.
- FIG. 9 shows that, for quite some time, a single CPU has not been able to keep up with a single encoding task; this problem, it is believed, will become more severe in the future, as future standards are adopted and become more widely used (see FIG. 10 ) and as the complexity of those future standards is expected to be higher, as shown in FIG. 11 .
- FIG. 12 is a simplified partly pictorial, partly graphical illustration depicting relationships between resolution and complexity.
- FIG. 12 shows a very conservative forecast of the resolution mixture over the years, and also shows the nominal complexity that follows from there being more pixels in every stream.
- FIG. 13 is a simplified graphical illustration depicting relationships between video complexity, CPU capability, and computational effort.
- FIG. 13 shows that, for the real video complexity (complexity per pixel × number of pixels), there is a growing gap between CPU capability and the computational effort needed.
- the art does not yet include any acceleration device offering video encoding acceleration for the data center.
- the term “device” may be used in the present description to describe implementations of exemplary embodiments of the present invention, as well as (in the preceding sentence) apparent lacks in the known art.
- implementation may take place in: an ASIC [Application Specific Integrated Circuit]; an ASSP [Application Specific Standard Part]; an SOC [System on a Chip]; an FPGA [Field Programmable Gate Array]; a GPU [graphics processing unit]; firmware; or any appropriate combination of the preceding.
- the present invention, in exemplary embodiments thereof, seeks to provide an improved video encoding, video compression, motion estimation, Current Picture Referencing (CPR), and computer architecture system in which part of the work (such as, by way of non-limiting example, any one or more of the following: motion estimation; Current Picture Referencing (CPR) transform; deblocking; loop filter; and context-adaptive binary arithmetic coding (CABAC) engine) is offloaded in such a way that (in the specific non-limiting example of an encoder):
- a system including an acceleration device including input circuitry configured, for each of a first plurality of video frames to be encoded, to receive an input including at least one raw video frame and at least one reference frame, and to divide each of the first plurality of video frames to be encoded into a second plurality of blocks, and similarity computation circuitry configured, for each one of the first plurality of video frames to be encoded: for each block of the second plurality of blocks, to produce a score of result blocks based on similarity of each block in each frame to be encoded to every block of the reference frame, an AC energy coefficient, and a displacement vector.
- the at least one raw video frame and the at least one reference frame are identical.
- the score of result blocks includes a ranked list.
- the result blocks are one of fixed size, and variable size.
- system also includes weighting circuitry configured to weight at least some of the second plurality of blocks.
- system also includes upsampling circuitry configured to upsample at least some of the second plurality of blocks, and the score of results blocks is based on similarity of each block to at least one upsampled block.
- system also includes a second component, and the second component receives an output from the acceleration device and produces, based at least in part on the output received from the acceleration device, a second component output in accordance with a coding standard.
- the second component includes a plurality of second components, each of the plurality of second components producing a second component output in accordance with a coding standard, the coding standard for one of the plurality of second components being different from a coding standard of others of the plurality of second components.
- the second component includes an aggregation component configured to aggregate a plurality of adjacent blocks having equal displacement vectors into a larger block.
- the larger block has a displacement vector equal to a displacement vector of each of the plurality of blocks having equal displacement vectors, and the larger block has a score equal to a sum of scores of the plurality of blocks having equal displacement vectors.
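- the aggregation rule described above can be sketched as follows. This is a minimal illustration only, assuming each small block's result is represented as a (displacement vector, score) pair; the function name aggregate_blocks and that representation are hypothetical and are not taken from the present description.

```python
def aggregate_blocks(results):
    """Aggregate adjacent blocks into one larger block when possible.

    results: list of (displacement_vector, score) pairs, one per adjacent
    small block (hypothetical representation).
    Returns (displacement_vector, summed_score) if every block has the same
    displacement vector, otherwise None (no aggregation).
    """
    if not results:
        return None
    vectors = {vector for vector, _ in results}
    if len(vectors) != 1:
        return None
    # The larger block keeps the shared vector; its score is the sum of scores.
    return results[0][0], sum(score for _, score in results)


same = [((1, -2), 3), ((1, -2), 0), ((1, -2), 4), ((1, -2), 1)]
mixed = [((1, -2), 3), ((0, 0), 0), ((1, -2), 4), ((1, -2), 1)]
print(aggregate_blocks(same))   # aggregated larger block
print(aggregate_blocks(mixed))  # None: vectors differ
```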
- a method including providing an acceleration device including input circuitry configured, for each of a first plurality of video frames to be encoded, to receive an input including at least one raw video frame and at least one reference frame, and to divide each of the first plurality of video frames to be encoded into a second plurality of blocks, and similarity computation circuitry configured, for each one of the first plurality of video frames to be encoded: for each block of the second plurality of blocks, to produce a score of result blocks based on similarity of each block in each frame to be encoded to every block of the reference frame, an AC energy coefficient and a displacement vector, and providing the input to the acceleration device, and producing the score of result blocks and the displacement vector based on the input.
- the at least one raw video frame and the at least one reference frame are identical.
- the score of result blocks includes a ranked list.
- the result blocks are one of fixed size, and variable size.
- the method also includes weighting at least some of the second plurality of blocks.
- the weighting weights the block B to produce a weighted block B′ in accordance with the following formula: B′ = A*B + C1, where A and C1 are scalars.
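- the weighting formula can be illustrated with a short sketch, assuming a block is held as a list of pixel rows; the function name weight_block is hypothetical, and rounding or clipping (which differ between standards) is deliberately left out.

```python
def weight_block(block, a, c1):
    """Apply B' = A*B + C1 to every pixel of a 2-D block (list of rows)."""
    return [[a * pixel + c1 for pixel in row] for row in block]


block = [[10, 20], [30, 40]]
weighted = weight_block(block, a=2, c1=5)
print(weighted)
```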
- the method also includes upsampling at least some of the second plurality of blocks, and the score of results blocks is based on similarity of each block to at least one upsampled block.
- the method also includes providing a second component receiving an output from the acceleration device and producing, based at least in part on the output received from the acceleration device, a second component output in accordance with a coding standard.
- the second component includes a plurality of second components, each of the plurality of second components producing a second component output in accordance with a coding standard, the coding standard for one of the plurality of second components being different from a coding standard of others of the plurality of second components.
- the second component includes an aggregation component configured to aggregate a plurality of adjacent blocks having equal motion vectors into a larger block.
- the larger block has a displacement vector equal to a displacement vector of each of the plurality of blocks having equal rank, and the larger block has a score equal to a sum of scores of the plurality of blocks having equal displacement vectors.
- the reference frame includes one of a reconstructed reference frame, and an original reference frame.
- FIG. 1A is a simplified tabular illustration depicting an inter block in a video bit stream
- FIG. 1B is a simplified block diagram illustration of an application acceleration system, constructed and operative in accordance with an exemplary embodiment of the present invention
- FIG. 2 is a simplified pictorial illustration depicting H.264/AVC block partitioning
- FIG. 3 is a simplified pictorial illustration depicting HEVC (H.265) block partitioning
- FIG. 4 is a simplified tabular illustration depicting an exemplary order of block encoding, intended in exemplary cases to be performed by an application acceleration system such as the system of FIG. 1B ;
- FIG. 5 is a simplified tabular illustration depicting an exemplary result of operations performed by an acceleration device in accordance with an exemplary embodiment of the present invention
- FIG. 6 is another simplified tabular illustration depicting an exemplary result of operations performed by an acceleration device in accordance with an exemplary embodiment of the present invention
- FIG. 7 is a simplified pictorial illustration depicting a “local minimum” problem that may be encountered by an exemplary system in accordance with an exemplary embodiment of the present invention.
- FIG. 8 is a simplified pictorial illustration depicting an operation on a low-resolution image, which operation may be carried out by an exemplary system in accordance with an exemplary embodiment of the present invention
- FIG. 9 is a simplified graphical illustration depicting a relationship between CPU power and various encoding tasks.
- FIG. 10 is a simplified pictorial illustration depicting an approximate timeline view of various past, present, and future encoding standards
- FIG. 11 is a simplified tabular illustration depicting complexity of various encoding standards
- FIG. 12 is a simplified partly pictorial, partly graphical illustration depicting relationships between resolution and complexity
- FIG. 13 is a simplified graphical illustration depicting relationships between video complexity, CPU capability, and computational effort
- FIG. 14 is a simplified pictorial illustration depicting various portions of video codec standards
- FIG. 15 is a simplified pictorial illustration depicting an exemplary search area for motion vector prediction
- FIG. 16 is a simplified pictorial illustration depicting a non-limiting example of pixel sub blocks
- FIG. 17A is a simplified partly pictorial, partly block-diagram illustration depicting an exemplary system and process, including motion estimation, for producing a bitstream;
- FIG. 17B is a simplified partly pictorial, partly block-diagram illustration depicting an exemplary system and process, including motion estimation, for producing a bitstream, including use of an exemplary embodiment of the acceleration system of FIG. 1B ;
- FIG. 18 is a simplified block diagram illustration depicting a particular exemplary case of using the acceleration system of FIG. 1B ;
- FIG. 19 is a simplified block diagram illustration depicting another particular exemplary case of using the acceleration system of FIG. 1B ;
- FIG. 20 is a simplified block diagram illustration depicting still another particular exemplary case of using the acceleration system of FIG. 1B ;
- FIG. 21 is a simplified block diagram illustration depicting yet another particular exemplary case of using the acceleration system of FIG. 1B ;
- FIG. 22 is a simplified partly pictorial, partly block diagram illustration of an exemplary embodiment of a portion of the application acceleration system of FIG. 1B ;
- FIG. 23 is a simplified tabular illustration depicting an exemplary result of operations performed by an acceleration device in accordance with another exemplary embodiment of the present invention.
- FIG. 24 is a simplified tabular illustration depicting an exemplary result of operations performed by an acceleration device in accordance with still another exemplary embodiment of the present invention.
- FIG. 25 is a simplified tabular illustration depicting an exemplary result of operations performed by an acceleration device in accordance with yet another exemplary embodiment of the present invention.
- FIGS. 26A and 26B are simplified tabular illustrations useful in understanding still another exemplary embodiment of the present invention.
- FIG. 27A is a simplified partly pictorial, partly block-diagram illustration depicting an exemplary system and process, including motion estimation, for producing a bitstream;
- FIG. 27B is a simplified partly pictorial, partly block-diagram illustration depicting an exemplary system and process, including motion estimation, for producing a bitstream, including use of an exemplary embodiment of the acceleration system of FIG. 1B .
- the motion estimation portion/tool (which is described in the standard by the motion compensation procedure) is generally considered to be the most demanding one when it comes to computational effort.
- the preceding also applies to a Current Picture Referencing (CPR) tool.
- motion estimation is not part of the video standard, as illustrated in FIG. 14 , which is a simplified pictorial illustration depicting various portions of video codec standards.
- a motion estimation procedure is involved in comparing a block (termed herein a “reference block”) against many blocks in the reference frames, finding the block with greatest (or in some cases close to greatest) similarity, and then performing a motion compensation which comprises, as is well known, writing the residual (as is known in the art) as well as the motion vectors (also termed herein “displacement vectors”) in the bitstream.
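- the comparison step described above can be sketched as an exhaustive search. The following is a minimal illustration under stated assumptions, not the method of any particular codec: frames are lists of pixel rows, similarity is measured by a sum of absolute differences (lower is more similar), and the names get_block, sad and best_match are hypothetical.

```python
def get_block(frame, top, left, size):
    """Extract a size x size block whose top-left corner is (top, left)."""
    return [row[left:left + size] for row in frame[top:top + size]]


def sad(block_a, block_b):
    """Sum of absolute differences between two equally sized blocks."""
    return sum(abs(a - b)
               for row_a, row_b in zip(block_a, block_b)
               for a, b in zip(row_a, row_b))


def best_match(cur_frame, ref_frame, top, left, size, search_range):
    """Return (score, (dx, dy)) of the most similar block in ref_frame."""
    target = get_block(cur_frame, top, left, size)
    height, width = len(ref_frame), len(ref_frame[0])
    best = None
    for dy in range(-search_range, search_range + 1):
        for dx in range(-search_range, search_range + 1):
            y, x = top + dy, left + dx
            # Skip candidates that fall outside the reference frame.
            if y < 0 or x < 0 or y + size > height or x + size > width:
                continue
            score = sad(target, get_block(ref_frame, y, x, size))
            if best is None or score < best[0]:
                best = (score, (dx, dy))
    return best


# Toy data: the 2x2 block at (top=2, left=2) is a copy of the reference
# block at (top=3, left=4).
ref = [[y * 8 + x for x in range(8)] for y in range(8)]
cur = [[0] * 8 for _ in range(8)]
for r in range(2):
    for c in range(2):
        cur[2 + r][2 + c] = ref[3 + r][4 + c]
print(best_match(cur, ref, top=2, left=2, size=2, search_range=2))
```

- a practical encoder would then write the displacement vector and the residual to the bitstream; the sketch stops at the search itself.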
- a “factor” may be used to find a block which, multiplied by some constant (factor), is close to a given block.
- two blocks from two different frames may be combined via weighted summing, and then a block close to the average may be found.
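- the weighted summing of two blocks can be sketched as follows, assuming blocks are lists of pixel rows and that the two weights are supplied by the caller; the function name weighted_sum and the default weights of 0.5 (a plain average) are hypothetical choices for illustration.

```python
def weighted_sum(block_a, block_b, w_a=0.5, w_b=0.5):
    """Combine two equally sized blocks pixel by pixel: w_a*A + w_b*B."""
    return [[w_a * a + w_b * b for a, b in zip(row_a, row_b)]
            for row_a, row_b in zip(block_a, block_b)]


from_frame_1 = [[10, 20], [30, 40]]
from_frame_2 = [[30, 40], [50, 60]]
print(weighted_sum(from_frame_1, from_frame_2))
```

- the combined block can then be compared against the block being encoded in the same way as any single candidate block.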
- the motion estimation and the motion compensation are built as one unit, which means that the motion estimation score function (which is the “similarity” to the reference block) is the same as the compensation part, thus allowing the codec to use the score of the best matching block as a residual in the bitstream without further processing.
- FIG. 1A is a simplified tabular illustration depicting an inter block, generally designated 101 , in a video bit stream.
- a high level view of an inter block in a video bitstream is shown in FIG. 1A .
- a block header 103 specifies the block type (which can in general be, by way of non-limiting example, inter or intra, the particular non-limiting example shown in FIG. 1A being an inter block with a single motion vector (MV) structure).
- a motion vector 105 represents the distance from the top left corner of the inter block to the top left corner of the reference block, while residual bits 107 , in certain exemplary embodiments, represent the difference between the reference block and the target block (a given block).
- each section in FIG. 1A is dynamic, in that each one of the three sections can be of variable size.
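- the motion vector and residual fields of FIG. 1A can be illustrated with a short sketch. The coordinate convention ((x, y) corners, vector = reference corner minus block corner) and the sign of the residual (target minus reference) are assumptions made for this example, as are the function names.

```python
def motion_vector(block_top_left, ref_top_left):
    """Displacement (dx, dy) from the inter block's top-left corner
    to the reference block's top-left corner."""
    (bx, by), (rx, ry) = block_top_left, ref_top_left
    return (rx - bx, ry - by)


def residual(target_block, reference_block):
    """Per-pixel difference between the target block and the reference block."""
    return [[t - r for t, r in zip(t_row, r_row)]
            for t_row, r_row in zip(target_block, reference_block)]


print(motion_vector((16, 8), (19, 6)))
print(residual([[5, 7], [9, 4]], [[5, 6], [8, 4]]))
```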
- a goal, in exemplary embodiments of the present invention, is to reduce the total number of bits in an entire bitstream, not just to reduce the total number of bits in a given inter block.
- the part of the process of generating an inter block which involves heavy computation is to read (in many cases, by way of non-limiting example) tens of blocks for every reference block and to calculate the respective differences. Performing this operation itself may consume approximately 50 times more memory bandwidth than accessing the raw video itself. It is also appreciated that the compute effort of the process of generating an inter block may be very large, estimated to be approximately 50 O(number of pixels).
- the motion estimation part is generally responsible not only for finding a block with the minimal residual relative to each reference, but also for finding an optimal partitioning; by way of non-limiting example, a 32×32 block with 5 bits residual will consume many fewer bits in the bitstream than four 8×8 blocks with 0 bits residual in each of them. In this particular non-limiting example, the 32×32 block partitioning would be considered optimal.
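- the bit-count comparison in the preceding example can be worked through with a small sketch. The per-block header and motion vector sizes used here (HEADER_BITS and MV_BITS) are hypothetical values chosen only to make the arithmetic concrete; real sizes are variable and depend on the codec.

```python
HEADER_BITS = 4   # hypothetical per-block header cost
MV_BITS = 12      # hypothetical per-block motion vector cost


def partition_bits(residual_bits_per_block):
    """Total bits for a partitioning: each block pays header + MV + residual."""
    return sum(HEADER_BITS + MV_BITS + r for r in residual_bits_per_block)


single_block = partition_bits([5])          # one block, 5 residual bits
four_blocks = partition_bits([0, 0, 0, 0])  # four blocks, no residual
print(single_block, four_blocks)
```

- under these assumed costs the per-block overhead dominates, which is why the partitioning with the smaller residual is not necessarily the one with the fewest total bits.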
- the method of selecting the best matched block is dependent on the details of the particular codec standard, including, by way of non-limiting example, because different standards treat motion vectors having large magnitude differently.
- Non-limiting examples of “differently” in the preceding sentence include: different partitioning; different sub-pixel interpolation; and different compensation options which may be available. It is appreciated that, in exemplary embodiments of the present invention, sub-pixels are produced by upsampling, as is known in the art.
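- producing sub-pixel samples by upsampling can be sketched in one dimension as follows. This is an illustration only, assuming simple linear interpolation to half-pixel positions; actual standards specify their own interpolation filters, rounding and clipping, and the function name upsample_row_2x is hypothetical.

```python
def upsample_row_2x(row):
    """Insert a half-pixel sample (average of neighbours) between pixels."""
    out = []
    for i, value in enumerate(row):
        out.append(value)
        if i + 1 < len(row):
            out.append((value + row[i + 1]) / 2)
    return out


print(upsample_row_2x([10, 20, 40]))
```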
- the different video standards differ from one another, with respect to motion compensation, in at least the following parameters:
- Older standards allow only full pixel comparison (against real blocks), while newer standards allow fractional sampling interpolations.
- the different standards also differ in filters used for the interpolation/s.
- Other features which differ between different standards include the particulars of rounding and clipping, as are well known in the art.
- Motion estimation procedures include “secret sauce” as described above.
- exemplary embodiments of the present invention will make many more calculations than are known in systems which do not use an acceleration device in accordance with exemplary embodiments of the present invention, leaving open later decisions to be made by the motion compensation/estimation software.
- implementation of an appropriate acceleration device may take place in: an ASIC [Application Specific Integrated Circuit]; an ASSP [Application Specific Standard Part]; an SOC [System on a Chip]; an FPGA [Field Programmable Gate Array]; in firmware; in a GPU [graphics processing unit]; or in any appropriate combination of the preceding. Implementations described in the preceding sentence may also be referred to herein as “circuitry”, without limiting the generality of the foregoing.
- the description will be followed by a detailed explanation of how the acceleration device overcomes issues related to each of Motion vector resolution; Motion compensation residual function; Block size being compensated; and Weighted prediction, as mentioned above.
- motion estimation is described as one particular non-limiting example.
- FIG. 1B is a simplified block diagram illustration of an application acceleration system, constructed and operative in accordance with an exemplary embodiment of the present invention.
- the application acceleration system comprises a video acceleration system.
- the video acceleration system 110 comprises a video acceleration device 120 ; exemplary embodiments of the construction and operation of the video acceleration device 120 are further described herein. As described in further detail below, the video acceleration device 120 is, in exemplary embodiments, configured to produce a result map 140 .
- the result map 140 is provided as input to a further component (often termed herein “SW”, as described above); the further component, in exemplary embodiments, comprises a motion estimation/block partitioning/rate-distortion control unit 130 .
- the control unit 130 may be, as implied by its full name, responsible for:
- optimal performance may take place when: high memory bandwidth is available; multiple queues are available for managing memory access; and virtual memory address translation is available at high performance and to multiple queues.
- one particular non-limiting example of a commercially available system which fulfills the previously-mentioned criteria for optimal performance is ConnectX-5, commercially available from Mellanox Technologies Ltd. It is appreciated that the example of ConnectX-5 is provided as one particular example, and is not meant to be limiting; other systems may alternatively be used.
- the video acceleration device 120 reads previously decoded frames (also known as reconstructed raw video), against which the target frame is being compensated; by way of particular non-limiting example, two previously decoded frames may be read. It is appreciated that, in an exemplary embodiment using CPR, the video acceleration device 120 may read/use the target frame twice, once as a target frame and once as a reference frame.
- a map of motion vector prediction may be provided; the map of motion vector prediction shows a center of a search area for each block. The particular example in the previous sentence is non-limiting, it being appreciated that a center of a search area may be determined, including independently, for any given block.
- FIG. 15 is a simplified pictorial illustration depicting an exemplary search area for motion vector prediction.
- an exemplary map (generally designated 1500 ) is shown.
- the center of the search area for a given block is designated 1510 .
- the search area comprises a plurality of search blocks 1520 , only some of which have been labeled with reference number 1520 for ease of depiction.
- parameters may be configured and may help the system “tune” to a specific encoder, specifying desired quality and speed; such parameters may include, by way of non-limiting example: accuracy of sub pixel interpolation; search area size; aggregation level (maximum block size being aggregated); partitioning information (whether partitioning may be, for example, only into squares or also into rectangles); and block matching function to be used (such as, by way of non-limiting example: SAD (Sum of Absolute Difference); and SSE (Sum of Square Error)).
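By way of non-limiting illustration, the two block matching functions named above may be sketched as follows. The function names `sad` and `sse`, and the representation of a block as a list of pixel rows, are assumptions made for this sketch only and are not taken from the described device.

```python
def sad(block_a, block_b):
    """Sum of Absolute Difference between two equally sized blocks."""
    return sum(abs(a - b)
               for row_a, row_b in zip(block_a, block_b)
               for a, b in zip(row_a, row_b))


def sse(block_a, block_b):
    """Sum of Square Error between two equally sized blocks."""
    return sum((a - b) ** 2
               for row_a, row_b in zip(block_a, block_b)
               for a, b in zip(row_a, row_b))


# Hypothetical 2x2 blocks, used only to exercise the two functions.
target = [[10, 12], [14, 16]]
candidate = [[11, 12], [10, 19]]
print(sad(target, candidate))  # 1 + 0 + 4 + 3 = 8
print(sse(target, candidate))  # 1 + 0 + 16 + 9 = 26
```

SSE penalizes large individual pixel differences more heavily than SAD, which may be one reason for leaving the choice of function configurable.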
- FIG. 4 is a simplified tabular illustration depicting an exemplary order of block encoding, intended in exemplary cases to be performed by an application acceleration system such as the application acceleration system 110 of FIG. 1B .
- the video acceleration device 120 outputs a list of the top ranked blocks, which have maximal similarity to each and every one of the blocks being encoded; for example, the current encoded frame is divided into small blocks (such as, by way of non-limiting example, 8×8 blocks). This is illustrated, including an exemplary non-limiting example of an order of block encoding, in FIG. 4 .
- FIG. 5 is a simplified tabular illustration depicting an exemplary result of operations performed by an acceleration device, such as the video acceleration device 120 of FIG. 1B , in accordance with an exemplary embodiment of the present invention.
- a result map 500 (which may also be termed an “acceleration vector”), shown in FIG. 5 , is an example of the output (also termed herein a “result score board”) produced in exemplary embodiments of the present invention; the result map in FIG. 5 is a particular non-limiting example of the result map 140 of FIG. 1B .
- the result map of FIG. 5 demonstrates the flexibility which is provided to the SW when using embodiments of the acceleration device described herein, in that (by way of non-limiting example) block partitioning decisions can be carried out based on the data comprised in the result map 500 of FIG. 5 .
- the SW can choose to re-partition into bigger blocks, for example (by way of non-limiting example) based on the results for blocks 4, 5, 6, 7, 16, 17, 18, 19, 20, 21, 22, 23 which have arrived from the video acceleration device 120 of FIG. 1 , in order to partition into 64×64 blocks.
- FIG. 17A is a simplified partly pictorial, partly block-diagram illustration depicting an exemplary system and process, including motion estimation, for producing a bitstream, as is known in the field of video.
- a plurality of target frames 1705 are input.
- a comparison operation is carried out at comparison element 1715 .
- the comparison operation is carried out relative to a reference block 1720 , produced/chosen by a motion estimation unit 1725 , based on a decoded picture input received from a decoded picture buffer 1730 .
- the result of the comparison element 1715 is a residual block 1735 .
- the residual block 1735 undergoes a transform operation at a transform unit 1740 ; quantization at a quantizing unit 1750 ; and entropy encoding in an entropy unit 1755 .
- the output of the system of FIG. 17A is a bitstream.
- quantized data from the quantizing unit 1750 is dequantized at an inverse quantizing unit 1760 , and undergoes an inverse transform operation at an inverse transform unit 1765 , thus producing a decoded residual block 1770 .
- the decoded residual block 1770 is added to the reference block 1720 at element 1772 , with a result thereof being processed by loop filters 1775 , and then sent to the decoded picture buffer 1730 , for further processing as previously described.
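The reconstruction loop described above may be sketched, by way of non-limiting example, as follows. This is a simplified sketch under stated assumptions: the transform, inverse transform, entropy coding and loop filters are omitted, quantization is a plain uniform scalar quantizer, and the name `encode_block` and the parameter `qstep` are hypothetical.

```python
def encode_block(target, reference, qstep):
    """Return (quantized residual, reconstructed block) for one block."""
    # Residual block: target block minus the chosen reference block.
    residual = [[t - r for t, r in zip(t_row, r_row)]
                for t_row, r_row in zip(target, reference)]
    # Quantization (the transform stage is omitted in this sketch).
    quantized = [[round(v / qstep) for v in row] for row in residual]
    # Inverse quantization yields the decoded residual block.
    decoded_residual = [[q * qstep for q in row] for row in quantized]
    # Adding the decoded residual to the reference block gives the
    # reconstructed block that would be stored for later prediction.
    reconstructed = [[r + d for r, d in zip(r_row, d_row)]
                     for r_row, d_row in zip(reference, decoded_residual)]
    return quantized, reconstructed


target = [[100, 104], [98, 97]]
reference = [[96, 100], [99, 90]]
quantized, reconstructed = encode_block(target, reference, qstep=4)
print(quantized)      # [[1, 1], [0, 2]]
print(reconstructed)  # [[100, 104], [99, 98]]
```

The reconstructed block differs slightly from the target block because of quantization, which is why the encoder predicts from decoded pictures rather than from the original input.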
- FIG. 17B is a simplified partly pictorial, partly block-diagram illustration depicting an exemplary system and process, including motion estimation, for producing a bitstream, including use of an exemplary embodiment of the acceleration system of FIG. 1B .
- in FIG. 17B , in place of the motion estimation unit 1725 of FIG. 17A , there is shown a block matching unit 1780 instantiated in a video acceleration device such as the video acceleration device 120 of FIG. 1B .
- elements of FIG. 17B having like numbers to elements in FIG. 17A may be similar thereto.
- the block matching unit 1780 produces a result map 1782 , which may be similar to the result map 140 of FIG. 1B , and which is sent to an RDO (rate distortion optimization unit) 1785 , whose function is to choose (in accordance with an applicable metric) a best reference block and partition, with a goal of having bit stream length at the end of the process match a target bitrate, with maximal available quality.
- the RDO 1785 along with the elements which are common between FIGS. 17A and 17B , are SW elements as explained above.
- the problem of motion vector resolution described above is overcome by allowing device configuration, in the case of any particular coding standard, to (by way of one non-limiting example) limit the search to full pixels and half pixels, as appropriate for that standard.
- the difference in the kernel coefficient of sub pixels described above is overcome by allowing SW to configure the kernel coefficient. This coefficient does not generally change “on the fly”; even though the coefficient/coefficients are codec dependent, they are fixed for the entire video stream.
- video encodings may differ in motion vector resolution which is allowed in describing the bit stream; by way of non-limiting example, some may allow 1/4 pixel resolution, while some may allow higher resolution (less than 1/4 pixel, such as, for example, 1/8 pixel).
- Encodings may also differ in a way in which sub pixel samples are defined by interpolation of neighboring pixels using known and fixed coefficients; the coefficients being also termed herein “kernel”.
- the acceleration device may produce (by way of particular non-limiting example, in the case of 1/4 size pixels) sixteen sub-blocks, each of which represents a different fractional motion vector.
- FIG. 16 is a simplified pictorial illustration depicting a non-limiting example of pixel sub blocks.
- each shaded rectangle (such as, by way of non-limiting example, rectangle A 1,0 ) represents the upper-left-hand corner of a pixel, while the nearby non-shaded rectangles (in the example of shaded rectangle A 1,0 , an additional fifteen rectangles: a 1,0 ; b 1,0 ; c 1,0 ; d 1,0 ; e 1,0 ; f 1,0 ; g 1,0 ; h 1,0 ; i 1,0 ; j 1,0 ; k 1,0 ; n 1,0 ; p 1,0 ; q 1,0 ; and r 1,0 ) represent sub-pixel positions, for a total of sixteen rectangles (sub-blocks).
- the kernel may be configured as an input.
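A minimal sketch of sub-pixel interpolation with a configurable kernel is given below, by way of non-limiting example. It computes half-sample values along one row only. The function name, the edge-clamping policy and the integer rounding are assumptions of this sketch; the 8-tap coefficients shown are the half-sample luma filter of HEVC, used here only as one example of a codec-dependent kernel.

```python
def interpolate_half_pel(row, kernel, shift):
    """Half-sample values between row[i] and row[i + 1].

    kernel: integer FIR taps whose sum equals 2 ** shift.
    """
    left = len(kernel) // 2 - 1  # taps located to the left of sample i
    out = []
    for i in range(len(row) - 1):
        acc = 0
        for k, coeff in enumerate(kernel):
            # Clamp the index at the row borders (edge padding).
            idx = min(max(i - left + k, 0), len(row) - 1)
            acc += coeff * row[idx]
        # Round to nearest, then normalize by 2 ** shift.
        out.append((acc + (1 << (shift - 1))) >> shift)
    return out


BILINEAR = [1, 1]                                   # taps sum to 2
HEVC_HALF_LUMA = [-1, 4, -11, 40, 40, -11, 4, -1]   # taps sum to 64

print(interpolate_half_pel([10, 20, 40, 40], BILINEAR, shift=1))  # [15, 30, 40]
print(interpolate_half_pel([50] * 6, HEVC_HALF_LUMA, shift=6))    # [50, 50, 50, 50, 50]
```

Because the taps are passed in as data, the same routine can serve codecs with different kernels, which is consistent with the kernel being fixed per video stream but configurable by SW.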
- Different video encodings also differ in compensation, meaning that the representation of the residual block 1735 of FIG. 17A and FIG. 17B , is done differently in different video encodings.
- the estimation (which block is probably best) is separated from the compensation (how to express the difference between a given block in a reference frame and in a target frame), which in previous systems were computed together.
- the SW will calculate the residual (what remains after compensation), but not the estimation.
- different score functions are implemented in the acceleration device (such as the video acceleration device 120 of FIG. 1B ) thus allowing SW to choose one of them to be used.
- Some encoding standards allow compensating blocks against a weighted image, meaning that a reference frame (previously decoded frame) is multiplied by a factor (a rational number). Alternatively, two reference frames may each be multiplied by a different weighting factor and then summed.
- the acceleration device, in preferred embodiments of the present invention, may either allow configuring a weighting factor for each reference frame, or may receive as input an already weighted frame.
- FIG. 6 is another simplified tabular illustration depicting an exemplary result of operations performed by an acceleration device in accordance with an exemplary embodiment of the present invention.
- By way of non-limiting example of a score board, consider a result score board 600 as shown in FIG. 6 ; in FIG. 6 , bold underscore and bold italics are used similarly to their usage in FIG. 5 .
- Smart SW is able to see that two reference frames in the result score board 600 have the same MV (1/4, −1/4), so the smart SW can itself calculate the compensation of weighted prediction between frame 0 and frame 1, and might thus get a better result than the score board indicates.
- those blocks can be re-partitioned into one 16×16 block (this being only one particular example, which could, for example, be expandable to larger blocks).
- the reason for the possibility of achieving a better result is that, by separating functions between an acceleration device and SW as described herein, the SW can use the varied results provided by the acceleration device to potentially find the “bigger picture” and produce a better result.
- the flexible acceleration device output allows the SW to do re-partitioning based on the acceleration device result, as described immediately above.
- the acceleration device may not necessarily stop the search when reaching a threshold; by contrast, SW algorithms generally have a minimal threshold that causes the SW to stop looking for candidates, which means (by way of non-limiting example) that if the SW found a block with a small residual in the first search try, it will terminate the search process.
- the acceleration device will complete the search and return a vector of the best results found, in order to allow the SW to do re-partitioning. Re-partitioning is discussed in more detail above with reference to FIG. 5 .
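The contrast drawn above, between a search that stops at a threshold and a search that always completes and returns several ranked results, may be sketched as follows. This is a non-limiting sketch: the function names, the use of SAD as the score, and the `top_k` parameter are assumptions and do not describe the actual device.

```python
def sad(block_a, block_b):
    """Sum of Absolute Difference between two equally sized blocks."""
    return sum(abs(a - b)
               for row_a, row_b in zip(block_a, block_b)
               for a, b in zip(row_a, row_b))


def get_block(frame, x, y, size):
    """Extract the size x size block whose top-left corner is (x, y)."""
    return [row[x:x + size] for row in frame[y:y + size]]


def full_search(target_block, reference, center, radius, top_k):
    """Score every candidate in the search area; never stop early."""
    size = len(target_block)
    height, width = len(reference), len(reference[0])
    cx, cy = center
    results = []
    for dy in range(-radius, radius + 1):
        for dx in range(-radius, radius + 1):
            x, y = cx + dx, cy + dy
            if 0 <= x <= width - size and 0 <= y <= height - size:
                score = sad(target_block, get_block(reference, x, y, size))
                results.append((score, (dx, dy)))
    results.sort()           # lowest score (best match) first
    return results[:top_k]   # a vector of the best results, not just one


# Hypothetical 6x6 reference frame in which every 2x2 block is distinct.
reference = [[x * x + 10 * y for x in range(6)] for y in range(6)]
target_block = get_block(reference, 3, 2, 2)  # true displacement is (1, 0)
results = full_search(target_block, reference, center=(2, 2), radius=2, top_k=3)
print(results[0])  # (0, (1, 0))
```

Returning the runners-up as well as the winner is what leaves room for SW to combine results of neighboring blocks when re-partitioning.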
- the acceleration device performs a full and hierarchical search over the following so-called predictors:
- Results of the collocated (the term “collocated” being known in the video art) block/s in one or more previously decoded frames. In exemplary embodiments, using such results is configured by SW. Such results may comprise a P&A map, as described below.
- FIG. 8 is a simplified pictorial illustration depicting an operation on a low resolution image, which operation may be carried out by an exemplary system in accordance with an exemplary embodiment of the present invention.
- resolution of the image is reduced (the picture is made smaller), such as, by way of non-limiting example, by a factor of 4 in each dimension.
- searching and motion estimation are then performed on the reduced resolution image, as shown in FIG. 8 .
- an 8×8 block represents a 64×64 block in the original image; persons skilled in the art will appreciate that there is less noise in such a reduced resolution image than in the original image.
- a candidate anchor identified in the reduced resolution image will allow us to look for a larger anchor in the full-resolution partition.
- the acceleration device returns the best result of each anchor, and sometimes the second best result, and not the total ranking score.
- FIG. 18 is a simplified block diagram illustration depicting a particular exemplary case of using the acceleration system of FIG. 1B .
- FIG. 18 depicts an exemplary case in which a single pass is executed in order to obtain a result; in general, whether a single pass or multiple passes are used is decided by SW. Other exemplary cases, in which more than a single pass may be executed, are described below.
- the video acceleration device 120 of FIG. 1B is shown receiving the following inputs:
- P&A prediction and aggregation
- a second P&A map 1820 which may refer to a second reference frame (such as the second reference frame 1840 mentioned below);
- the first reference frame 1830 , second reference frame 1840 , and the target frame 1850 will be understood in light of the above discussion.
- the first P&A map 1810 and the second P&A map 1820 may be similar in form to the result map 140 of FIG. 1B , or, by way of very particular non-limiting example, to the result map 500 of FIG. 5 .
- the video acceleration device 120 produces a result map 1860 which may be similar in form to the result map 140 of FIG. 1B , or, by way of very particular non-limiting example, to the result map 500 of FIG. 5 .
- one or more reference frames and one or more target frames are received as input.
- a plurality of P&A maps may be received; by way of one particular non-limiting example, when two reference frames and two target frames are received, up to four P&A maps may be received.
- a given P&A map refers to a reference frame paired with a target frame, so that if there are two reference frames and two target frames, there would be four P&A maps to cover the applicable pairings. In a case where a particular P&A map is blank, this may be an indication to the video acceleration device 120 to search without any additional information as to where to search.
- a P&A map provides prediction and aggregation points to aid the video acceleration device 120 in searching related to each target.
- FIG. 19 is a simplified block-diagram illustration depicting another particular exemplary case of using the acceleration system of FIG. 1B .
- FIG. 19 depicts an exemplary case in which two passes are executed in order to obtain a result; in general, whether a single pass or multiple passes are used is decided by SW.
- the two passes depicted in FIG. 19 use downscaling, which is intended to assist in avoiding a local minimum, as described above.
- a full resolution target frame 1910 and a full resolution reference frame 1920 are provided as input.
- The full resolution target frame 1910 and the full resolution reference frame 1920 are downscaled at downscale units 1930 and 1940 respectively (which for the sake of simplicity of depiction and explanation are shown as separate units, it being appreciated that alternatively a single downscale unit may be used to carry out multiple downscale operations).
- the downscale units 1930 and 1940 are shown as downscaling by a factor of 1:8, but it is appreciated that the example of 1:8 is not meant to be limiting.
- the output of the downscale unit 1930 is a downscaled target frame 1950 .
- the output of the downscale unit 1940 is a downscaled reference frame 1960 .
- the downscaled target frame 1950 and the downscaled reference frame 1960 are input into the video acceleration device 120 .
- Two instances of the video acceleration device 120 are shown in FIG. 19 for ease of depiction and explanation, it being appreciated that only a single video acceleration device 120 may be used.
- an empty P&A map 1965 (see description of P&A maps above, with reference to FIG. 18 ) is also input into the video acceleration device 120 .
- the video acceleration device 120 which receives the downscaled target frame 1950 and the downscaled reference frame 1960 produces a P&A map(R) 1970 , “R” designating that the map relates to a reference frame, representing the best downscaled result found for the downscaled target frame 1950 in the downscaled reference frame 1960 .
- the full resolution target frame 1910 and the full resolution reference frame 1920 are each provided as input to the video acceleration device 120 , which also receives the P&A map(R) 1970 , and which produces a second P&A map(R) 1975 .
- information is produced which may assist in finding larger blocks (such as, by way of non-limiting example, 64×64 blocks). This may be because, in effect, the system has “zoomed in” on each frame, so that more of each frame will be in an effective search area.
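The two-pass flow of FIG. 19 may be sketched, by way of non-limiting example, as follows. The sketch assumes averaging as the downscale method, SAD as the score, a zero search center standing in for the empty P&A map in the first pass, and a second-pass radius equal to the downscale factor; all names are hypothetical.

```python
def sad(block_a, block_b):
    return sum(abs(a - b)
               for row_a, row_b in zip(block_a, block_b)
               for a, b in zip(row_a, row_b))


def downscale(frame, factor):
    """Average each factor x factor tile into a single pixel."""
    h, w = len(frame) // factor, len(frame[0]) // factor
    return [[sum(frame[y * factor + j][x * factor + i]
                 for j in range(factor) for i in range(factor)) // (factor * factor)
             for x in range(w)] for y in range(h)]


def best_match(target, reference, x0, y0, size, center_mv, radius):
    """Best (score, mv) for the target block at (x0, y0), searching around center_mv."""
    block = [row[x0:x0 + size] for row in target[y0:y0 + size]]
    best = None
    for dy in range(center_mv[1] - radius, center_mv[1] + radius + 1):
        for dx in range(center_mv[0] - radius, center_mv[0] + radius + 1):
            x, y = x0 + dx, y0 + dy
            if 0 <= x <= len(reference[0]) - size and 0 <= y <= len(reference) - size:
                cand = [row[x:x + size] for row in reference[y:y + size]]
                score = sad(block, cand)
                if best is None or score < best[0]:
                    best = (score, (dx, dy))
    return best


def two_pass_search(target, reference, x0, y0, size, factor, radius):
    # Pass 1: coarse search on downscaled frames, with no prior hint.
    small_t, small_r = downscale(target, factor), downscale(reference, factor)
    _, coarse_mv = best_match(small_t, small_r, x0 // factor, y0 // factor,
                              size // factor, (0, 0), radius)
    # Pass 2: scale the coarse vector up and refine at full resolution.
    center = (coarse_mv[0] * factor, coarse_mv[1] * factor)
    return best_match(target, reference, x0, y0, size, center, factor)


# Hypothetical frames: the target shows the reference content displaced by (4, 2).
reference = [[x * x + 3 * y * y for x in range(16)] for y in range(16)]
target = [[(x + 4) ** 2 + 3 * (y + 2) ** 2 for x in range(16)] for y in range(16)]
result = two_pass_search(target, reference, x0=4, y0=4, size=4, factor=2, radius=2)
print(result)  # (0, (4, 2))
```

In this example the coarse pass covers a displacement of up to four full-resolution pixels while examining only a radius of two, which illustrates the enlarged effective search area mentioned above.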
- FIG. 20 is a simplified block-diagram illustration depicting still another particular exemplary case of using the acceleration system of FIG. 1B .
- FIG. 21 is a simplified block-diagram illustration depicting yet another particular exemplary case of using the acceleration system of FIG. 1B .
- FIGS. 20 and 21 will be understood with reference to the above description of FIG. 19 .
- FIG. 20 depicts a situation in which 1:4 and 1:8 downscaling are both used and are combined.
- FIG. 21 depicts a situation in which both symmetrical and asymmetrical downscaling are used and combined.
- the “quality” of output of the video acceleration device 120 may be improved, with little additional work expended in a few additional passes.
- FIG. 22 is a simplified partly pictorial, partly block diagram illustration of an exemplary embodiment of a portion of the application acceleration system of FIG. 1B .
- FIG. 22 depicts a particular exemplary non-limiting embodiment of the video acceleration device 120 of FIG. 1B .
- the video acceleration device 120 of FIG. 22 is shown receiving as input: a P&A map 2210 ; a reference frame 2215 ; and a target frame 2220 .
- the P&A map 2210 , the reference frame 2215 , and the target frame 2220 may be similar to P&A maps, reference frames, and target frames described above.
- the video acceleration device 120 of FIG. 22 comprises the following elements:
- a reference frame buffer 2230 ;
- a block matching engine 2240 ;
- the result map buffer 2225 is shown as storing a map 2260 , which may be the input P&A map 2210 or another map, as also described below.
- the target frame 2220 or a relevant portion thereof (typically determined under SW control) is received by the video acceleration device 120 , and at least a relevant portion is stored in the target frame buffer 2235 .
- the relevant portion could comprise a current block of 8×8 pixels to be searched.
- the reference frame 2215 or a relevant portion thereof is received by the video acceleration device 120 , and a relevant search area (which may be a search area around the current block in the target frame) is stored in the reference frame buffer 2230 .
- the block matching engine 2240 (which may comprise a plurality of block matching engines, in order to execute more than one operation in parallel) receives the current block stored in the target frame buffer 2235 and the relevant blocks stored in the reference frame buffer 2230 .
- the block matching engine 2240 determines a score (using, by way of non-limiting example as described above, SAD or SSE), and writes the score to the score board storage unit 2245 , producing a score board 2250 .
- Score boards are described above; one particular non-limiting example is the score board 500 of FIG. 5 , described above.
- the block matching engine 2240 may use the P&A map 2210 (which may be stored in the result map buffer 2225 , or elsewhere in the video acceleration device 120 ) to “focus” score determination on blocks indicated in the P&A map 2210 , and blocks in proximity to those blocks.
- the aggregation and ranking circuitry 2255 is configured, in exemplary embodiments, to determine the best results from the score board 2250 , and also to determine large blocks by aggregation, using (by way of non-limiting example) sums of values of adjacent blocks, which blocks have the same displacement vector as a given block, in order to produce an output score board/result map 2260 . While not shown in FIG. 22 , the output score board/result map is generally provided as output from the video acceleration device 120 , as described above.
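Aggregation by summing scores of adjacent blocks which share a displacement vector may be sketched as follows, by way of non-limiting example. The score board layout (a dictionary keyed by block position, each entry mapping a displacement vector to a score) and the function names are assumptions of this sketch.

```python
def aggregate_2x2(score_board, bx, by):
    """Scores of the larger block formed by the 2x2 group starting at (bx, by).

    Only displacement vectors present for all four small blocks are kept;
    for an additive score such as SAD, the sum of the four small-block
    scores at a shared vector is the score of the large block at that vector.
    """
    members = [(bx, by), (bx + 1, by), (bx, by + 1), (bx + 1, by + 1)]
    common = set(score_board[members[0]])
    for member in members[1:]:
        common &= set(score_board[member])
    return {mv: sum(score_board[m][mv] for m in members) for mv in common}


def rank(scores, top_k):
    """Return the top_k (mv, score) pairs, best (lowest) score first."""
    return sorted(scores.items(), key=lambda item: item[1])[:top_k]


# Hypothetical score board for four adjacent small blocks.
board = {
    (0, 0): {(1, 0): 10, (0, 0): 40},
    (1, 0): {(1, 0): 12, (0, 0): 35},
    (0, 1): {(1, 0): 9, (0, 0): 45, (2, 1): 5},
    (1, 1): {(1, 0): 11, (0, 0): 50},
}
large = aggregate_2x2(board, 0, 0)
print(rank(large, top_k=2))  # [((1, 0), 42), ((0, 0), 170)]
```

Note that the vector (2, 1), although it is the best result for one small block, does not survive aggregation because the other three blocks have no score for it.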
- FIG. 23 is a simplified tabular illustration depicting an exemplary result of operations performed by an acceleration device in accordance with another exemplary embodiment of the present invention.
- FIG. 5 described a non-limiting example of system output which included information on a reference frame.
- an entire table (such as, by way of a very particular non-limiting example, the table depicted in FIG. 23 , generally designated 2300 ), may refer to a particular reference frame, it being appreciated that a plurality of tables similar to the table 2300 may be provided, one for each reference frame.
- the reference motion vector (Ref MV) may comprise 2 comma-delimited numbers, comprising x and y coordinates respectively (by way of non-limiting example).
- each residual block 1735 undergoes a transform operation at the transform unit 1740 .
- the transform operation in exemplary embodiments, converts the input (such as each residual block 1735 ) from a spatial domain (energy per location in an image) to a frequency domain.
- a non-limiting example of such a transform is a discrete cosine transform (DCT).
- the output of such a transform is a block having the same dimensions as the input block.
- the top left element of such a block is called a DC element; in particular, it is known in the art of video encoding that the DC element is very close to the average of the values of intensity of pixels in the block, in the spatial domain.
- the other elements in the transform output are called AC elements.
- the quantizing unit 1750 generally quantizes the last AC elements more than the first AC elements, and much more than the DC element.
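The relationship between the DC element and the average pixel intensity may be illustrated with a direct (unoptimized) two-dimensional DCT, as sketched below by way of non-limiting example. The orthonormal DCT-II scaling used here is an assumption of the sketch; under that scaling the DC element of an N×N block equals N times the average pixel value, and other scalings change only the constant factor.

```python
import math


def dct_2d(block):
    """Orthonormal 2-D DCT-II of a square block, computed directly."""
    n = len(block)

    def scale(k):
        return math.sqrt(1.0 / n) if k == 0 else math.sqrt(2.0 / n)

    out = [[0.0] * n for _ in range(n)]
    for u in range(n):
        for v in range(n):
            acc = 0.0
            for y in range(n):
                for x in range(n):
                    acc += (block[y][x]
                            * math.cos((2 * y + 1) * u * math.pi / (2 * n))
                            * math.cos((2 * x + 1) * v * math.pi / (2 * n)))
            out[u][v] = scale(u) * scale(v) * acc
    return out


block = [[52, 55, 61, 66],
         [70, 61, 64, 73],
         [63, 59, 55, 90],
         [67, 61, 68, 104]]
coeffs = dct_2d(block)
mean = sum(sum(row) for row in block) / 16.0
print(coeffs[0][0], 4 * mean)  # the DC element tracks the block average

# For a flat block, all of the energy is in the DC element.
flat = dct_2d([[7] * 4 for _ in range(4)])
```

The output has the same dimensions as the input block, with the DC element at the top left and the AC elements elsewhere.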
- Given a target block T and a candidate reference block C, the energy of the AC coefficients of a residual block R that will be created when compensating block T from block C is: AC in R ⁇ SAD(T,C) ⁇
- FIG. 24 is a simplified tabular illustration depicting an exemplary result of operations performed by an acceleration device in accordance with still another exemplary embodiment of the present invention.
- the table of FIG. 24 may be similar to the table 2300 of FIG. 23 , with the addition of a column for AC energy estimation, the values of which may be computed in accordance with the formula provided above for “AC in R”. It is appreciated that such a column may also be added mutatis mutandis , to the table 500 of FIG. 5 .
- choosing the “best” block to be compensated against, from the encoder perspective may be accomplished as follows.
- the “best” block will be a block that will introduce minimal quality loss, and still meet a “bit budget”; that is, generally an encoder is supplied with a “bit budget” indicating how many bits may be used in encoding.
- the encoder generally maintains an accounting of bit use/quality loss.
- the best block may be determined using a formula such as:
- MV_cost is the number of bits that the encoder needs in order to encode a given motion vector (MV) in the bitstream;
- Residual_cost is the cost in bits, for the encoder to encode the residual coefficient in the bitstream; referring logically to the “delta” between the two blocks (target vs. reference). It is appreciated that the Residual_cost depends on the SAD result and on the AC energy result, since each block is transformed, subtracted from the reference block, and then quantized.
- the quantization process implies that, when using low bitrates, where usually higher quantizers are used, the cost of the residuals will impact less, while the cost of bits used to represent the MV is constant.
- the alpha parameter is introduced, the alpha parameter being generally different for each quantization parameter. For higher quantizers (lower bitrates) the alpha value is smaller than it is for lower quantizers (higher bitrates).
- an acceleration device as described herein may be configured to output just the overall cost, or to rank based on the Cost function above, and thus to reduce the amount of data that the encoder needs to analyze.
- the encoder (or software associated therewith) configures the alpha value or values in advance of operation, and also configures, for every frame being searched against, an average quantization parameter (QP) to the acceleration device, and an alpha value in accordance therewith.
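Ranking candidates by an overall cost may be sketched as follows. The cost formula itself is not reproduced in the present text, so the form used below, in which alpha scales the residual cost and the MV cost is counted in full, is purely an assumption chosen to be consistent with the statements above that alpha is smaller for higher quantizers and that the MV cost is constant. The names `overall_cost` and `rank_candidates` and the candidate values are hypothetical.

```python
def overall_cost(mv_cost, residual_cost, alpha):
    # ASSUMED form, for illustration only; not the formula of the source.
    return mv_cost + alpha * residual_cost


def rank_candidates(candidates, alpha):
    """Sort (name, mv_cost, residual_cost) candidates, cheapest first."""
    return sorted(candidates,
                  key=lambda c: overall_cost(c[1], c[2], alpha))


# Two hypothetical candidates: A has a cheap MV but a large residual,
# B has an expensive MV but a small residual.
candidates = [("A", 4, 100), ("B", 12, 60)]

high_bitrate = rank_candidates(candidates, alpha=1.0)    # residual bits dominate
low_bitrate = rank_candidates(candidates, alpha=0.125)   # MV bits dominate
print(high_bitrate[0][0], low_bitrate[0][0])  # B A
```

Under this assumed form, the same pair of candidates ranks differently at different quantization levels, which illustrates why a per-QP alpha would be configured by the encoder in advance.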
- FIG. 25 is a simplified tabular illustration depicting an exemplary result of operations performed by an acceleration device in accordance with yet another exemplary embodiment of the present invention.
- the table of FIG. 25 may be similar to the table 2300 of FIG. 23 , except that additional information, which may be useful in facilitating bi-directional prediction, has been added to the table 2500 , relative to the table 2300 of FIG. 23 .
- Bi-directional prediction is available in certain video coding standards, and allows prediction from two images; this means that each block in a target image can be compensated against two different blocks, one from each of two different images.
- the acceleration device may compare each block against a block which is a weighted sum of two blocks from two different images, in order to produce the table 2500 .
- a weighting coefficient used in computing the weighted sum is constant for a given target frame.
- an “imaginary” reference block may be assembled using the following formula:
- $\mathrm{RefBlock}(i,j) = W_0 \cdot B_0(i,j) + W_1 \cdot B_1(i,j)$
- $W_0$ and $W_1$ are weights (generally supplied by the encoder, and based on values in relevant video compression standards);
- $(i,j)$ represents a location within a given block;
- $B_0$ represents a first actual block;
- $B_1$ represents a second actual block;
- RefBlock represents the computed “imaginary” reference block;
- $W_0$ and $W_1$ are not dependent on $i$ and $j$.
- the acceleration device may then perform cost calculations on the “imaginary” reference block (using the block matching engine 2240 of FIG. 22 , as described above) as if the “imaginary” reference block were an ordinary reference block, but with two different displacement vectors being output to the score board.
- the displacement vectors may be in a separate table (in a case such as that of FIG. 25 ), or a separate row (in a case such as that of FIG. 5 ).
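Assembling the “imaginary” reference block and scoring it may be sketched as follows, by way of non-limiting example. The weights, pixel values and displacement vectors are hypothetical, SAD is assumed as the score function, and the dictionary used for the score board entry is an assumed layout.

```python
def weighted_ref_block(b0, b1, w0, w1):
    """RefBlock(i, j) = w0 * B0(i, j) + w1 * B1(i, j), with constant weights."""
    return [[w0 * p0 + w1 * p1 for p0, p1 in zip(row0, row1)]
            for row0, row1 in zip(b0, b1)]


def sad(block_a, block_b):
    return sum(abs(a - b)
               for row_a, row_b in zip(block_a, block_b)
               for a, b in zip(row_a, row_b))


# Two actual blocks taken from two different reference images.
b0 = [[10, 20], [30, 40]]
b1 = [[20, 40], [50, 20]]
target = [[16, 30], [40, 28]]

ref_block = weighted_ref_block(b0, b1, w0=0.5, w1=0.5)
score = sad(target, ref_block)

# One score, but two displacement vectors (one per reference image).
entry = {"score": score, "mv0": (1, 0), "mv1": (-2, 1)}
print(ref_block)  # [[15.0, 30.0], [40.0, 30.0]]
print(entry["score"])  # 3.0
```

The combined block is scored exactly like an ordinary reference block; only the score board entry differs, in that it carries a displacement vector for each of the two source blocks.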
- FIGS. 26A and 26B are simplified tabular illustrations useful in understanding still another exemplary embodiment of the present invention.
- FIG. 26A shows an array of blocks, generally designated 2610 .
- a current block being processed 2620 is depicted, along with a plurality of “valid” possible reference blocks 2630 .
- FIG. 26B shows an array of blocks for such a case, generally designated 2640 .
- a current block 2650 is depicted, along with a plurality of “valid” possible reference blocks 2660 and 2670 .
- acceleration is based on receiving a reconstructed buffer in frame resolution, meaning that each frame can access only the previously processed reconstructed frame and cannot access the current reconstructed frame data. Since future codecs as described above use the reconstructed buffer of the current frame, implementation of a solution in accordance with the exemplary embodiment of FIGS. 17A and 17B would be difficult or perhaps impossible, since a current frame reconstructed image is generally not available during operation of the system of FIGS. 17A and 17B .
- FIG. 27A is a simplified partly pictorial, partly block-diagram illustration depicting an exemplary system and process, including motion estimation, for producing a bitstream.
- FIG. 27B is a simplified partly pictorial, partly block-diagram illustration depicting an exemplary system and process, including motion estimation, for producing a bitstream, including use of an exemplary embodiment of the acceleration system of FIG. 1B .
- FIGS. 27A and 27B are similar to FIGS. 17A and 17B , respectively, and like reference numbers have been used therein.
- a motion vector (MV) search is run on the original frame rather than on the reconstructed frame.
- This solution can be executed prior to the encoding of the frame.
- a limitation of this solution may be differences between the original frame data and the reconstructed frame data, but the inventors of the present invention believe that the original frame and the reconstructed frame (and hence the corresponding data) are quite similar, especially for encoding at a high bitrate.
- the purpose of the MV search in this case is mainly for finding the best vector rather than calculating the cost of such a vector, since it can be assumed that the best MV is the same for both the original and the reconstructed frames, while the cost itself is believed to be more affected by the difference of the frames.
- the “real” cost can be calculated by the encoder rather than by the systems and processes of FIGS. 27A and 27B , and since no additional search is required the cost operation is relatively simple.
- software components of the present invention may, if desired, be implemented in ROM (read only memory) form.
- the software components may, generally, be implemented in hardware, if desired, using conventional techniques.
- the software components may be instantiated, for example: as a computer program product or on a tangible medium. In some cases, it may be possible to instantiate the software components as a signal interpretable by an appropriate computer, although such an instantiation may be excluded in certain embodiments of the present invention.
Abstract
Description
- The present application is a Continuation-in-Part of U.S. patent application Ser. No. 16/291,023, of Levi et al, filed 4 Mar. 2019; and claims priority to US Provisional Patent Application 62/695,063 of Levi et al, filed 8 Jul. 2018, and to US Provisional Patent Application 62/726,446 of Levi et al, filed 4 Sep. 2018.
- The present invention relates generally to the fields of computer architecture and application acceleration.
- Video encoding is described herein as one example of a workload which creates heavy stresses (by way of non-limiting example, in the cloud and/or in the data center; the same applies in other equipment and/or locations that are part of the video flow). With respect to the particular non-limiting example of preparing video for an edge computing device (by way of non-limiting example, for a smart phone, a TV set etc.), data center workload includes millions of video streams which have to be encoded and re-encoded during every minute, in order to stream the streams to multiple devices (generally having different video codecs) in a very efficient way. Persons skilled in the art will appreciate that adaptive streaming additionally requires the video to be encoded to multiple bitrates.
- Persons skilled in the art will appreciate the following:
- 1. Solutions to issues mentioned in the present application (including in the Background thereof), as provided in exemplary embodiments of the present invention, have applicability to various use cases, and not only to data center and/or cloud use cases.
- 2. Solutions provided in exemplary embodiments of the present invention have applicability to (by way of non-limiting example): video compression; motion estimation; video deblocking filter; Current Picture Referencing (CPR) and video transform. Non-limiting examples of relevant systems to which exemplary embodiments of the present invention may be applicable include, by way of non-limiting example: HEVC/H.265 (high efficiency video coding); AVC/H.264 (advanced video coding); VP9; AV-1 (AOMedia Video 1); and VVC (versatile video coding). While, generally, uses of exemplary embodiments of the present invention are described in the context of motion estimation, such descriptions are not meant to be limiting, and persons skilled in the art will appreciate, in light of the description herein, how to provide solutions in at least the other cases mentioned above in the present paragraph.
- The load of a single encoding task itself is generally too big for a single CPU. Reference is now made to FIG. 9, which is a simplified graphical illustration depicting a relationship between CPU power and various encoding tasks; to FIG. 10, which is a simplified pictorial illustration depicting an approximate timeline view of various past, present, and future encoding standards; and to FIG. 11, which is a simplified tabular illustration depicting complexity of various encoding standards. FIG. 9 shows that, for quite some time, a single CPU has not been able to keep up with a single encoding task; this problem, it is believed, will become more severe in the future, as future standards are adopted and become more widely used (see FIG. 10) and as the complexity of those future standards is expected to be higher, as shown in FIG. 11.
- In addition, the amount of information/data in every stream is increasing, as more and more information/data is produced and streamed into the edge devices at higher resolution. Reference is now additionally made to FIG. 12, which is a simplified partly pictorial, partly graphical illustration depicting relationships between resolution and complexity. FIG. 12 shows a very conservative forecast of the resolution mixture over the years, and also shows the nominal complexity that results from there being more pixels in every stream.
- Reference is now additionally made to FIG. 13, which is a simplified graphical illustration depicting relationships between video complexity, CPU capability, and computational effort. FIG. 13 shows that, for the real video complexity (complexity per pixel × number of pixels), there is a growing gap between CPU capability and the computational effort needed.
- The consideration of the problem presented up to this point does not include the major increase, which is expected to continue, in the number of video streams that need to be simultaneously processed in the data center.
- Having considered the above, it is fair to ask why the art, as known to the inventors of the present invention, does not yet include any acceleration device offering video encoding acceleration for the data center. Without limiting the generality of the present invention, the term “device” may be used in the present description to describe implementations of exemplary embodiments of the present invention, as well as (in the preceding sentence) apparent lacks in the known art. It is appreciated that, in exemplary embodiments of the present invention, by way of non-limiting example, implementation may take place in: an ASIC [Application Specific Integrated Circuit]; an ASSP [Application Specific Standard Part]; an SOC [System on a Chip]; an FPGA [Field Programmable Gate Array]; a GPU [graphics processing unit]; firmware; or any appropriate combination of the preceding.
- The inventors of the present invention believe that the reasons that the art, as known to the inventors of the present invention, does not yet include an appropriate video acceleration as described above include:
- 1. Technological Reason
-
- The huge diversity in the video standards (as partially imposed by the situation depicted in FIG. 10) which need to be supported, as well as the rich feature-set needed in order to fully support any single encoder, might make it appear, prior to the invention of exemplary embodiments of the present invention, to be impossible to create a generic acceleration device for accelerating video encoding in the data center.
- 2. Business Reason
- In the last decade, a huge amount of technology and knowledge was gained in the industry by encoder vendors (who generally, but not necessarily, provide their encoders in software [SW]; for purposes of ease of description, the term “SW” is used throughout the present specification and claims to describe such encoders, which might in fact be provided, generally by encoder vendors, in software, firmware, hardware, or any appropriate combination thereof) that tuned the video encoding to “their best sweet spot”, representing what each encoder vendor believed to be their competitive advantage. It is important in this context to understand that video compression, by its nature as a lossy compression, differs in various implementations in many aspects, such as: performance; CPU load; latency (which are also well known in other, non-video workloads); but also in quality. For the aspect of quality, there is no common and accepted objective metric that quantifies and grades the quality; rather, multiple metrics (such as, for example, PSNR, SSIM, and VMAF) are known, with no universal agreement on which metric is appropriate. Thus, the accumulated knowledge and reputation of each company somehow blocks any potential acceleration device vendor that wishes to penetrate the market of acceleration devices. In other words, if such a vendor were to create a device such as, by way of one particular non-limiting example, an ASIC implementing a video codec, the vendor would find himself competing against the entire ecosystem. Even within a given vendor, different quality measures might be used for different use cases, thus leading to an incentive to produce multiple such ASIC implementations.
- The present invention, in exemplary embodiments thereof, seeks to provide an improved video encoding, video compression, motion estimation, Current Picture Referencing (CPR), and computer architecture system in which part of the work (such as, by way of non-limiting example, any one or more of the following: motion estimation; Current Picture Referencing (CPR) transform; deblocking; loop filter; and context-adaptive binary arithmetic coding (CABAC) engine) is offloaded in such a way that (in the specific non-limiting example of an encoder):
-
- 1. Work will be dramatically offloaded from the encoder;
- 2. The exemplary embodiments of the present invention will be agnostic to the specific encoder being implemented; and
- 3. A given encoder vendor (by way of non-limiting example, a software encoder vendor) will be enabled to run their own “secret sauce” and use one or more exemplary embodiments of the present invention to provide acceleration as a primitive operation. The appropriate quality/performance can be chosen by a given encoder vendor for the appropriate use case/s.
- There is thus provided in accordance with an exemplary embodiment of the present invention a system including an acceleration device including input circuitry configured, for each of a first plurality of video frames to be encoded, to receive an input including at least one raw video frame and at least one reference frame, and to divide each of the first plurality of video frames to be encoded into a second plurality of blocks, and similarity computation circuitry configured, for each one of the first plurality of video frames to be encoded: for each block of the second plurality of blocks, to produce a score of result blocks based on similarity of each block in each frame to be encoded to every block of the reference frame, an AC energy coefficient, and a displacement vector.
- Further in accordance with an exemplary embodiment of the present invention the at least one raw video frame and the at least one reference frame are identical.
- Still further in accordance with an exemplary embodiment of the present invention the score of result blocks includes a ranked list.
- Additionally in accordance with an exemplary embodiment of the present invention the result blocks are one of fixed size, and variable size.
- Moreover in accordance with an exemplary embodiment of the present invention the system also includes weighting circuitry configured to weight at least some of the second plurality of blocks.
- Further in accordance with an exemplary embodiment of the present invention, for a given block B of the second plurality of blocks, the weighting circuitry is configured to weight the block B to produce a weighted block B′ in accordance with the following formula: B′=A*B+C1, where A and C1 are scalars.
- Still further in accordance with an exemplary embodiment of the present invention the system also includes upsampling circuitry configured to upsample at least some of the second plurality of blocks, and the score of result blocks is based on similarity of each block to at least one upsampled block.
- Additionally in accordance with an exemplary embodiment of the present invention the system also includes a second component, and the second component receives an output from the acceleration device and produces, based at least in part on the output received from the acceleration device, a second component output in accordance with a coding standard.
- Moreover in accordance with an exemplary embodiment of the present invention the second component includes a plurality of second components, each of the plurality of second components producing a second component output in accordance with a coding standard, the coding standard for one of the plurality of second components being different from a coding standard of others of the plurality of second components.
- Further in accordance with an exemplary embodiment of the present invention the second component includes an aggregation component configured to aggregate a plurality of adjacent blocks having equal displacement vectors into a larger block.
- Still further in accordance with an exemplary embodiment of the present invention the larger block has a displacement vector equal to a displacement vector of each of the plurality of blocks having equal displacement vectors, and the larger block has a score equal to a sum of scores of the plurality of blocks having equal displacement vectors.
- There is also provided in accordance with another exemplary embodiment of the present invention a method including providing an acceleration device including input circuitry configured, for each of a first plurality of video frames to be encoded, to receive an input including at least one raw video frame and at least one reference frame, and to divide each of the first plurality of video frames to be encoded into a second plurality of blocks, and similarity computation circuitry configured, for each one of the first plurality of video frames to be encoded: for each block of the second plurality of blocks, to produce a score of result blocks based on similarity of each block in each frame to be encoded to every block of the reference frame, an AC energy coefficient and a displacement vector, and providing the input to the acceleration device, and producing the score of result blocks and the displacement vector based on the input.
- Further in accordance with an exemplary embodiment of the present invention the at least one raw video frame and the at least one reference frame are identical.
- Still further in accordance with an exemplary embodiment of the present invention the score of result blocks includes a ranked list.
- Additionally in accordance with an exemplary embodiment of the present invention the result blocks are one of fixed size, and variable size.
- Moreover in accordance with an exemplary embodiment of the present invention the method also includes weighting at least some of the second plurality of blocks.
- Further in accordance with an exemplary embodiment of the present invention, for a given block B of the second plurality of blocks, the weighting weights the block B to produce a weighted block B′ in accordance with the following formula: B′=A*B+C1, where A and C1 are scalars.
- Still further in accordance with an exemplary embodiment of the present invention the method also includes upsampling at least some of the second plurality of blocks, and the score of result blocks is based on similarity of each block to at least one upsampled block.
- Additionally in accordance with an exemplary embodiment of the present invention the method also includes providing a second component receiving an output from the acceleration device and producing, based at least in part on the output received from the acceleration device, a second component output in accordance with a coding standard.
- Further in accordance with an exemplary embodiment of the present invention the second component includes a plurality of second components, each of the plurality of second components producing a second component output in accordance with a coding standard, the coding standard for one of the plurality of second components being different from a coding standard of others of the plurality of second components.
- Still further in accordance with an exemplary embodiment of the present invention the second component includes an aggregation component configured to aggregate a plurality of adjacent blocks having equal motion vectors into a larger block.
- Additionally in accordance with an exemplary embodiment of the present invention the larger block has a displacement vector equal to a displacement vector of each of the plurality of blocks having equal rank, and the larger block has a score equal to a sum of scores of the plurality of blocks having equal displacement vectors.
- Moreover in accordance with exemplary embodiments of the present invention the reference frame includes one of a reconstructed reference frame, and an original reference frame.
- The present invention will be understood and appreciated more fully from the following detailed description, taken in conjunction with the drawings in which:
- FIG. 1A is a simplified tabular illustration depicting an inter block in a video bit stream;
- FIG. 1B is a simplified block diagram illustration of an application acceleration system, constructed and operative in accordance with an exemplary embodiment of the present invention;
- FIG. 2 is a simplified pictorial illustration depicting H.264/AVC block partitioning;
- FIG. 3 is a simplified pictorial illustration depicting HEVC (H.265) block partitioning;
- FIG. 4 is a simplified tabular illustration depicting an exemplary order of block encoding, intended in exemplary cases to be performed by an application acceleration system such as the system of FIG. 1B;
- FIG. 5 is a simplified tabular illustration depicting an exemplary result of operations performed by an acceleration device in accordance with an exemplary embodiment of the present invention;
- FIG. 6 is another simplified tabular illustration depicting an exemplary result of operations performed by an acceleration device in accordance with an exemplary embodiment of the present invention;
- FIG. 7 is a simplified pictorial illustration depicting a “local minimum” problem that may be encountered by an exemplary system in accordance with an exemplary embodiment of the present invention;
- FIG. 8 is a simplified pictorial illustration depicting an operation on a low-resolution image, which operation may be carried out by an exemplary system in accordance with an exemplary embodiment of the present invention;
- FIG. 9 is a simplified graphical illustration depicting a relationship between CPU power and various encoding tasks;
- FIG. 10 is a simplified pictorial illustration depicting an approximate timeline view of various past, present, and future encoding standards;
- FIG. 11 is a simplified tabular illustration depicting complexity of various encoding standards;
- FIG. 12 is a simplified partly pictorial, partly graphical illustration depicting relationships between resolution and complexity;
- FIG. 13 is a simplified graphical illustration depicting relationships between video complexity, CPU capability, and computational effort;
- FIG. 14 is a simplified pictorial illustration depicting various portions of video codec standards;
- FIG. 15 is a simplified pictorial illustration depicting an exemplary search area for motion vector prediction;
- FIG. 16 is a simplified pictorial illustration depicting a non-limiting example of pixel sub blocks;
- FIG. 17A is a simplified partly pictorial, partly block-diagram illustration depicting an exemplary system and process, including motion estimation, for producing a bitstream;
- FIG. 17B is a simplified partly pictorial, partly block-diagram illustration depicting an exemplary system and process, including motion estimation, for producing a bitstream, including use of an exemplary embodiment of the acceleration system of FIG. 1B;
- FIG. 18 is a simplified block diagram illustration depicting a particular exemplary case of using the acceleration system of FIG. 1B;
- FIG. 19 is a simplified block diagram illustration depicting another particular exemplary case of using the acceleration system of FIG. 1B;
- FIG. 20 is a simplified block diagram illustration depicting still another particular exemplary case of using the acceleration system of FIG. 1B;
- FIG. 21 is a simplified block diagram illustration depicting yet another particular exemplary case of using the acceleration system of FIG. 1B;
- FIG. 22 is a simplified partly pictorial, partly block diagram illustration of an exemplary embodiment of a portion of the application acceleration system of FIG. 1B;
- FIG. 23 is a simplified tabular illustration depicting an exemplary result of operations performed by an acceleration device in accordance with another exemplary embodiment of the present invention;
- FIG. 24 is a simplified tabular illustration depicting an exemplary result of operations performed by an acceleration device in accordance with still another exemplary embodiment of the present invention;
- FIG. 25 is a simplified tabular illustration depicting an exemplary result of operations performed by an acceleration device in accordance with yet another exemplary embodiment of the present invention;
- FIGS. 26A and 26B are simplified tabular illustrations useful in understanding still another exemplary embodiment of the present invention;
- FIG. 27A is a simplified partly pictorial, partly block-diagram illustration depicting an exemplary system and process, including motion estimation, for producing a bitstream; and
- FIG. 27B is a simplified partly pictorial, partly block-diagram illustration depicting an exemplary system and process, including motion estimation, for producing a bitstream, including use of an exemplary embodiment of the acceleration system of FIG. 1B.
- The following general discussion may be helpful in understanding certain exemplary embodiments of the present invention which are described herein.
- Among hundreds of tools in a video compression standard, the motion estimation portion/tool (which is described in the standard by the motion compensation procedure) is generally considered to be the most demanding one when it comes to computational effort. The preceding also applies to a Current Picture Referencing (CPR) tool.
- Theoretically, motion estimation is not part of the video standard, as illustrated in FIG. 14, which is a simplified pictorial illustration depicting various portions of video codec standards. However, a motion estimation procedure is involved in comparing a block (termed herein a “reference block”) against many blocks in the reference frames, finding the block with greatest (or in some cases close to greatest) similarity, and then performing a motion compensation which comprises, as is well known, writing the residual (as is known in the art) as well as the motion vectors (also termed herein “displacement vectors”) in the bitstream. In some cases, a “factor” may be used to find a block which, multiplied by some constant (factor), is close to a given block. In other cases, two blocks from two different frames may be combined via weighted summing, and then a block close to the average may be found.
- Typically, in many codec systems, the motion estimation and the motion compensation are built as one unit, which means that the motion estimation score function (which is the “similarity” to the reference block) is the same as the compensation part, thus allowing the codec to use the score of the best matching block as a residual in the bitstream without further processing.
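The comparison described above can be sketched as an exhaustive block-matching search. This is only an illustrative sketch, not the method of any particular codec or of the acceleration device: the function names `sad` and `full_search`, the use of the sum of absolute differences as the similarity score, and the square search window of `radius` samples are all assumptions made for the example. The block being encoded is called the target block in the code.

```python
from typing import List, Optional, Tuple

Frame = List[List[int]]  # 2-D grid of samples, indexed [row][column]


def sad(target: Frame, ty: int, tx: int,
        ref: Frame, ry: int, rx: int, size: int) -> int:
    """Sum of absolute differences between two size x size blocks."""
    return sum(
        abs(target[ty + dy][tx + dx] - ref[ry + dy][rx + dx])
        for dy in range(size)
        for dx in range(size)
    )


def full_search(target: Frame, ref: Frame, ty: int, tx: int,
                size: int, radius: int) -> Tuple[Tuple[int, int], Optional[int]]:
    """Return ((dy, dx), score) of the most similar block in the reference frame.

    Every displacement within +/- radius samples is tried; candidates that
    fall outside the reference frame are skipped. A lower score means a
    closer match. The score is None only if no candidate was in bounds.
    """
    height, width = len(ref), len(ref[0])
    best_mv: Tuple[int, int] = (0, 0)
    best_score: Optional[int] = None
    for dy in range(-radius, radius + 1):
        for dx in range(-radius, radius + 1):
            ry, rx = ty + dy, tx + dx
            if ry < 0 or rx < 0 or ry + size > height or rx + size > width:
                continue
            score = sad(target, ty, tx, ref, ry, rx, size)
            if best_score is None or score < best_score:
                best_mv, best_score = (dy, dx), score
    return best_mv, best_score
```

The returned displacement plays the role of the motion (displacement) vector, and the score stands in for the size of the residual; a real encoder would typically also weigh the cost of coding the vector itself.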
- Reference is now made to FIG. 1A, which is a simplified tabular illustration depicting an inter block, generally designated 101, in a video bit stream. A high level view of an inter block in a video bitstream is shown in FIG. 1A.
- A block header 103 specifies the block type (which can in general be, by way of non-limiting example, inter or intra, the particular non-limiting example shown in FIG. 1A being an inter block with a single motion vector (MV) structure).
- A motion vector 105 represents the distance between the top left corner of the inter block and the top left corner of the reference block, while residual bits 107, in certain exemplary embodiments, represent the difference between the reference block and the target block (a given block). Each one of the sections shown in FIG. 1A may be compressed differently, as is known in the art.
- The portion/size of each section in FIG. 1A is dynamic, in that each one of the three sections can be of variable size. A goal, in exemplary embodiments of the present invention, is to reduce the total number of bits in an entire bitstream, not just to reduce the total number of bits in a given inter block.
- Generally speaking, the part of the process of generating an inter block which involves heavy computation is to read (in many cases, by way of non-limiting example) tens of blocks for every reference block and to calculate the respective differences. Performing this operation itself may consume approximately 50 times more memory bandwidth than accessing the raw video itself. It is also appreciated that the compute effort of the process of generating an inter block may be very large, and is estimated to be approximately 50 × O(number of pixels).
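The three sections of an inter block described above (header, motion vector, residual) can be pictured with a small data structure. The following is a hedged sketch only: the class `InterBlock`, the helpers `encode_inter_block` and `decode_inter_block`, and the sign convention of the motion vector are hypothetical names and choices for illustration, and the transform, quantization, and entropy coding that a real codec applies to each section are omitted.

```python
from dataclasses import dataclass
from typing import List, Tuple

Block = List[List[int]]  # 2-D grid of samples, indexed [row][column]


@dataclass
class InterBlock:
    block_type: str                 # header: here always "inter"
    motion_vector: Tuple[int, int]  # (dy, dx) between top left corners
    residual: Block                 # per-sample difference: target - matched


def encode_inter_block(target: Block, target_pos: Tuple[int, int],
                       matched: Block, matched_pos: Tuple[int, int]) -> InterBlock:
    """Build the three sections from a target block and its matched block."""
    mv = (matched_pos[0] - target_pos[0], matched_pos[1] - target_pos[1])
    residual = [
        [t - m for t, m in zip(t_row, m_row)]
        for t_row, m_row in zip(target, matched)
    ]
    return InterBlock("inter", mv, residual)


def decode_inter_block(block: InterBlock, matched: Block) -> Block:
    """Recover the target block by adding the residual to the matched block."""
    return [
        [m + r for m, r in zip(m_row, r_row)]
        for m_row, r_row in zip(matched, block.residual)
    ]
```

A closer match leaves a residual with more zeros, which is what allows the residual section to be compressed into fewer bits.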
- The motion estimation part is generally responsible not only for finding a block with the minimal residual relative to each reference, but also for finding an optimal partitioning; by way of non-limiting example, a 32×32 block with 5 bits residual will consume many fewer bits in the bitstream than four 8×8 blocks with 0 bits residual in each of them. In this particular non-limiting example, the 32×32 block partitioning would be considered optimal. Persons skilled in the art will appreciate that the method of selecting the best matched block is dependent on the details of the particular codec standard, including, by way of non-limiting example, because different standards treat motion vectors having large magnitude differently. Non-limiting examples of “differently” in the preceding sentence include: different partitioning; different sub-pixel interpolation; and different compensation options which may be available. It is appreciated that, in exemplary embodiments of the present invention, sub-pixels are produced by upsampling, as is known in the art.
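As a rough illustration of producing sub-pixel positions by upsampling, the sketch below inserts half-sample positions between the original samples of a block. The bilinear averaging used here is an assumption chosen for brevity; actual standards specify their own interpolation filters, rounding, and clipping, and the function name `upsample_2x` is hypothetical.

```python
from typing import List

Block = List[List[int]]  # 2-D grid of samples, indexed [row][column]


def upsample_2x(block: Block) -> Block:
    """Insert half-sample positions between the samples of a non-empty block.

    The output has (2h - 1) x (2w - 1) samples. Even coordinates reproduce
    the original samples; odd coordinates hold the rounded average of the
    two or four nearest original samples (bilinear interpolation).
    """
    h, w = len(block), len(block[0])
    out = [[0] * (2 * w - 1) for _ in range(2 * h - 1)]
    for y in range(2 * h - 1):
        for x in range(2 * w - 1):
            y0, x0 = y // 2, x // 2
            y1 = min(y0 + (y % 2), h - 1)
            x1 = min(x0 + (x % 2), w - 1)
            total = (block[y0][x0] + block[y0][x1]
                     + block[y1][x0] + block[y1][x1])
            out[y][x] = (total + 2) >> 2  # divide by 4 with rounding
    return out
```

A block-matching search can then be run against the upsampled reference, so that a displacement of one sample in the upsampled grid corresponds to half a sample in the original frame.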
- In addition to what has been stated above, the different video standards differ from one another, with respect to motion compensation, in at least the following parameters:
-
- 1. Motion vector resolution:
- Older standards allow only full pixel comparison (against real blocks), while newer standards allow fractional sampling interpolations. The different standards also differ in filters used for the interpolation/s. Other features which differ between different standards include the particulars of rounding and clipping, as are well known in the art.
-
- 2. Motion compensation residual function: calculates residual data in the bitstream.
- 3. Block size being compensated. By way of particular non-limiting example: in H.264/AVC, block partitioning as shown in FIG. 2 was allowed by the standard.
- By contrast, in H.265/HEVC, the partitioning shown in FIG. 3 was introduced.
- 4. Weighted prediction. Weighted prediction was introduced in the H.264 standard, in the Main and Extended profiles. The weighted prediction tool allows scaling of a reference frame with a multiplicative weighting factor, and also the addition of an additive offset (constant) to each pixel. The weighted prediction tool is generally considered to be highly powerful in scenes of fade in/fade out. When fades are uniformly applied across the entire picture, a single weighting factor and offset are sufficient to efficiently encode all macroblocks in a picture that are predicted from the same reference picture. When multiple reference pictures are used, the best weighting factor and offsets generally differ during a fade for the different reference pictures, as brightness levels are more different for more temporally distant pictures.
- By way of particular non-limiting example: for single directional prediction the following equation represents the motion compensation with weighted prediction:
-
- $\mathrm{SampleP} = \mathrm{Clip1}\big(((\mathrm{SampleP}_0 \cdot W_0 + 2^{LWD-1}) \gg LWD) + O_0\big)$, where Clip1( ) is an operator that clips to the range [0, 255], $W_0$ and $O_0$ are the reference picture weighting factor and offset respectively, LWD is the log weight denominator rounding factor, $\mathrm{SampleP}_0$ is the list 0 initial predictor, and SampleP is the weighted predictor.
- Persons skilled in the art will appreciate that the motion estimation is generally performed against the weighted reference frame (resulting from applying the above SampleP formula). Persons skilled in the art will further appreciate that it is a reasonable assumption that the compensation function will be different in future codecs.
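The single directional weighted prediction formula above can be written out in integer arithmetic as follows. This is a minimal sketch assuming 8-bit samples and LWD of at least 1 (so that the rounding term is a whole number); the function names `clip1`, `weighted_sample`, and `weight_block` are illustrative and are not taken from any standard or library.

```python
from typing import List


def clip1(value: int) -> int:
    """Clip a sample to the 8-bit range [0, 255]."""
    return max(0, min(255, value))


def weighted_sample(sample_p0: int, w0: int, o0: int, lwd: int) -> int:
    """Apply weight w0 and offset o0 to the list 0 initial predictor.

    lwd is the log2 of the weight denominator, so w0 == 1 << lwd leaves the
    sample unchanged apart from the offset. Assumes lwd >= 1.
    """
    if lwd < 1:
        raise ValueError("this sketch assumes lwd >= 1")
    rounding = 1 << (lwd - 1)  # 2 ** (lwd - 1)
    return clip1(((sample_p0 * w0 + rounding) >> lwd) + o0)


def weight_block(block: List[List[int]], w0: int, o0: int, lwd: int) -> List[List[int]]:
    """Apply the same weight and offset to every sample of a block or frame."""
    return [[weighted_sample(s, w0, o0, lwd) for s in row] for row in block]
```

For example, with `lwd = 5` a weight of 16 halves the brightness of the reference, which is the kind of adjustment that makes a reference frame usable during a fade. Motion estimation would then be run against the output of `weight_block` rather than against the unweighted reference.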
- It appears reasonable to assume that future codec standards will continue to differ in the points mentioned immediately above.
- Motion estimation procedures include “secret sauce” as described above. In order to allow agnostic preparation which will later allow motion estimation with the desired “secret sauce”, exemplary embodiments of the present invention will make many more calculations than are known in systems which do not use an acceleration device in accordance with exemplary embodiments of the present invention, leaving open later decisions to be made by the motion compensation/estimation software.
- The following is a description of an exemplary embodiment of a method useable in order to create a generic and agnostic acceleration device, offloading motion estimation and Current Picture Referencing (CPR) for a particular codec. By way of non-limiting example, implementation of an appropriate acceleration device may take place in: an ASIC [Application Specific Integrated Circuit]; an ASSP [Application Specific Standard Part]; an SOC [System on a Chip]; an FPGA [Field Programmable Gate Array]; firmware; a GPU [graphics processing unit]; or any appropriate combination of the preceding. Implementations described in the preceding sentence may also be referred to herein as “circuitry”, without limiting the generality of the foregoing. The description will be followed by a detailed explanation of how the acceleration device overcomes issues related to each of Motion vector resolution; Motion compensation residual function; Block size being compensated; and Weighted prediction, as mentioned above. As mentioned above, motion estimation is described as one particular non-limiting example.
- Reference is now made to FIG. 1B, which is a simplified block diagram illustration of an application acceleration system, constructed and operative in accordance with an exemplary embodiment of the present invention.
- In the particular non-limiting example of FIG. 1B, the application acceleration system, generally designated 110, comprises a video acceleration system.
- The video acceleration system 110 comprises a video acceleration device 120; exemplary embodiments of the construction and operation of the video acceleration device 120 are further described herein. As described in further detail below, the video acceleration device 120 is, in exemplary embodiments, configured to produce a result map 140.
- The result map 140 is provided as input to a further component (often termed herein “SW”, as described above); the further component, in exemplary embodiments, comprises a motion estimation/block partitioning/rate-distortion control unit 130. The control unit 130 may be, as implied by its full name, responsible for:
- block partitioning; and
- rate distortion (determining tradeoffs between bit rate and distortion, for example)
- In certain exemplary embodiments of the present invention, it is appreciated that optimal performance may take place when: high memory bandwidth is available; multiple queues are available for managing memory access; and virtual memory address translation is available at high performance and to multiple queues. One non-limiting example of a commercially available system which fulfills the previously-mentioned criteria for optimal performance is the ConnectX-5, commercially available from Mellanox Technologies Ltd. It is appreciated that the example of ConnectX-5 is provided as one particular example, and is not meant to be limiting; other systems may alternatively be used.
- The operation of the system of
FIG. 1B , and particularly of thevideo acceleration device 120 ofFIG. 1B , is now described in more detail. - In certain exemplary embodiments of the present invention, for each video frame being encoded (termed herein “target frame”), the
video acceleration device 120 reads previously decoded frames (also known as reconstructed raw video), against which the target frame is being compensated; by way of particular non-limiting example, two previously decoded frames may be read. It is appreciated that, in an exemplary embodiment using CPR, thevideo acceleration device 120 may read/use the target frame twice, once as a target frame and once as a reference frame. In addition and optionally, a map of motion vector prediction may be provided; the map of motion vector prediction shows a center of a search area for each block. The particular example in the previous sentence is non-limiting, it being appreciated that a center of search are may be determined, including independently, for any given block. Reference is now additionally made toFIG. 15 , which is a simplified pictorial illustration depicting an exemplary search area for motion vector prediction. InFIG. 15 , an exemplary map (generally designated 1500) is shown. In the particular non-limiting example ofFIG. 15 , the center of the search area for a given block is designated 1510. In the particular non-limiting example ofFIG. 15 , the search area comprises a plurality ofsearch blocks 1520, only some of which have been labeled withreference number 1520 for ease of depiction. It is appreciated that, optionally, other parameters may be configured and may help the system “tune” to a specific encoder, specifying desired quality and speed; such parameters may include, by way of non-limiting example: accuracy of sub pixel interpolation; search area size; aggregation level (maximum block size being aggregated); partitioning information (whether partitioning may be, for example, only into squares or also into rectangles); and block matching function to be used (such as, by way of non-limiting example: SAD (Sum of Absolute Difference); and SSE (Sum of Square Error)). - Reference is now additionally made to
FIG. 4, which is a simplified tabular illustration depicting an exemplary order of block encoding, intended in exemplary cases to be performed by an application acceleration system such as the application acceleration system 110 of FIG. 1B. In certain exemplary embodiments of the present invention, the video acceleration device 120 outputs a list of the top ranked blocks, which have maximal similarity to each and every one of the blocks being encoded; for example, the current encoded frame is divided into small blocks (such as, by way of non-limiting example, 8×8 blocks). This is illustrated, including an exemplary non-limiting example of an order of block encoding, in FIG. 4. - Reference is now additionally made to
FIG. 5, which is a simplified tabular illustration depicting an exemplary result of operations performed by an acceleration device, such as the video acceleration device 120 of FIG. 1B, in accordance with an exemplary embodiment of the present invention. A result map 500 (which may also be termed an “acceleration vector”), shown in FIG. 5, is an example of the output (also termed herein a “result score board”) produced in exemplary embodiments of the present invention; the result map in FIG. 5 is a particular non-limiting example of the result map 140 of FIG. 1B. - The result map of
FIG. 5 demonstrates the flexibility which is provided to the SW when using embodiments of the acceleration device described herein, in that (by way of non-limiting example) block partitioning decisions can be carried out based on the data comprised in the result map 500 of FIG. 5. Thus, by way of one non-limiting example, if the SW is “naïve”, and uses the first score for every block, it will produce four 8×8 blocks, with a total residual of (5+0+0+0)=5 (see entries marked in bold underscore in the result map 500). - Alternatively, in another non-limiting example, the SW may choose to create one single 32×32 block (since there are 4 scores that have the same MV value, so that the corresponding blocks can be combined) with a residual of: (6+3+1+1)=11 (see entries marked in bold italics in the result map 500).
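The two choices described above can be sketched in a few lines of Python. This is only an illustrative sketch: the layout of `result_map`, the motion vector values, and the function names `naive_choice` and `merged_choice` are assumptions made for this example and are not taken from FIG. 5; only the scores (5, 0, 0, 0 and 6, 3, 1, 1) follow the text.

```python
# Hypothetical result map: for each block, candidates ordered best-first,
# each candidate being (motion_vector, score). MV values are invented.
result_map = {
    0: [((3, 1), 5), ((0, 0), 6)],
    1: [((2, 2), 0), ((0, 0), 3)],
    2: [((1, 4), 0), ((0, 0), 1)],
    3: [((5, 0), 0), ((0, 0), 1)],
}


def naive_choice(result_map):
    """Pick the first (best) candidate of every block independently."""
    return {blk: cands[0] for blk, cands in result_map.items()}


def merged_choice(result_map):
    """Look for a motion vector offered by every block.

    Returns (mv, summed_score) for the shared MV with the lowest total,
    or None when the blocks have no MV in common. Assumes each MV appears
    at most once per block.
    """
    common = None
    for cands in result_map.values():
        mvs = {mv for mv, _ in cands}
        common = mvs if common is None else common & mvs
    if not common:
        return None

    def total(mv):
        return sum(s for cands in result_map.values() for m, s in cands if m == mv)

    best = min(common, key=total)
    return best, total(best)


naive_total = sum(score for _, score in naive_choice(result_map).values())
print(naive_total)                # total residual of the four separate blocks
print(merged_choice(result_map))  # shared MV and residual of the single merged block
```

The merged block carries a larger residual but needs only one motion vector, which is the kind of trade-off the SW is left free to evaluate.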
- Similarly, the SW can choose to re-partition to bigger blocks, for example (by way of non-limiting example) based on the results of blocks 4, 5, 6, 7, 16, 17, 18, 19, 20, 21, 22, and 23, which have arrived from the
video acceleration device 120 of FIG. 1, in order to partition to 64×64 sizes. - Reference is now made to
FIG. 17A, which is a simplified partly pictorial, partly block-diagram illustration depicting an exemplary system and process, including motion estimation, for producing a bitstream, as is known in the field of video. In the system of FIG. 17A, a plurality of target frames 1705 are input. For a given block 1710 (of one or more target frames 1705, it being appreciated that in general the process is carried out on a plurality of target frames), a comparison operation is carried out at comparison element 1715. The comparison operation is carried out relative to a reference block 1720, produced/chosen by a motion estimation unit 1725, based on a decoded picture input received from a decoded picture buffer 1730. - The result of the
comparison element 1715 is a residual block 1735. The residual block 1735 undergoes a transform operation at a transform unit 1740; quantization at a quantizing unit 1750; and entropy encoding in an entropy unit 1755. The output of the system of FIG. 17A is a bitstream. - Meanwhile, quantized data from the
quantizing unit 1750 is dequantized at an inverse quantizing unit 1760, and undergoes an inverse transform operation at an inverse transform unit 1765, thus producing a decoded residual block 1770. The decoded residual block 1770 is added to the reference block 1720 at element 1772, with a result thereof being processed by loop filters 1775, and then sent to the decoded picture buffer 1730, for further processing as previously described. - Reference is now made to
FIG. 17B, which is a simplified partly pictorial, partly block-diagram illustration depicting an exemplary system and process, including motion estimation, for producing a bitstream, including use of an exemplary embodiment of the acceleration system of FIG. 1B. In the exemplary system and process of FIG. 17B, the motion estimation unit 1725 (FIG. 17A) has been replaced with a block matching unit 1780, instantiated in a video acceleration device such as the video acceleration device 120 of FIG. 1B. Other elements of FIG. 17B, having like numbers to elements in FIG. 17A, may be similar thereto. - The
block matching unit 1780 produces a result map 1782, which may be similar to the result map 140 of FIG. 1B, and which is sent to an RDO (rate distortion optimization unit) 1785, whose function is to choose (in accordance with an applicable metric) a best reference block and partition, with a goal of having bit stream length at the end of the process match a target bitrate, with maximal available quality. In accordance with the above-mentioned discussion, the RDO 1785, along with the elements which are common between FIGS. 17A and 17B, are SW elements as explained above. - In order to elucidate further the above discussion of certain goals of certain exemplary embodiments of the present invention, the following describes a sense in which exemplary embodiments of an acceleration device described herein are “codec agnostic” or a “generic device”:
-
- (1)
- In exemplary embodiments of the present invention, the problem of motion vector resolution described above is overcome by allowing device configuration, in the case of any particular coding standard, to (by way of one non-limiting example) limit the search to full pixels and half pixels, as appropriate for that standard. The difference in the kernel coefficient of sub pixels described above is overcome by allowing SW to configure the kernel coefficient. This coefficient does not generally change “on the fly”; even though the coefficient/coefficients are codec dependent, they are fixed for the entire video stream. It is also appreciated that video encodings may differ in motion vector resolution which is allowed in describing the bit stream; by way of non-limiting example, some may allow ¼ pixel resolution, while some may allow higher resolution (less than ¼ pixel, such as, for example, ⅛ pixel). Encodings may also differ in a way in which sub pixel samples are defined by interpolation of neighboring pixels using known and fixed coefficients; the coefficients being also termed herein “kernel”. For each block, the acceleration device may produce (by way of particular non-limiting example, in the case of ¼ size pixels) sixteen sub-blocks, each of which represents a different fractional motion vector. Reference is now made to
FIG. 16, which is a simplified pictorial illustration depicting a non-limiting example of pixel sub blocks. In FIG. 16, each shaded rectangle (such as, by way of non-limiting example, rectangle A1,0) represents the upper-left-hand corner of a pixel, while the nearby non-shaded rectangles represent sub-pixel positions (in the example of shaded rectangle A1,0, an additional fifteen rectangles: a1,0; b1,0; c1,0; d1,0; e1,0; f1,0; g1,0; h1,0; i1,0; j1,0; k1,0; n1,0; p1,0; q1,0; and r1,0), for a total of sixteen rectangles (sub-blocks). In general, in preferred embodiments of the present invention, the kernel may be configured as an input. - (2)
- Different video encodings also differ in compensation, meaning that the representation of the
residual block 1735 of FIG. 17A and FIG. 17B is done differently in different video encodings. In exemplary embodiments of the present invention, the estimation (which block is probably best) is separated from the compensation (how to express the difference between a given block in a reference frame and in a target frame), which in previous systems were computed together. Thus, in distinction to previous systems, in exemplary embodiments of the present invention the SW will calculate the residual (what remains after compensation), but not the estimation. In certain embodiments it may be that different score functions are implemented in the acceleration device (such as the video acceleration device 120 of FIG. 1B), thus allowing SW to choose one of them to be used. The following is a non-limiting list of exemplary score functions, which functions are known in the art: SAD (Sum of Absolute Difference); and SSE (Sum of Square Error). - (3)
- Some encoding standards allow compensating blocks against a weighted image, meaning that a reference frame (previously decoded frame) is multiplied by a factor, (rational number). Alternatively, a sum of 2 reference frames may each be multiplied by a different weighting factor. The acceleration device, in preferred embodiments of the present invention, may either allow configuring a weighting factor for each reference frame, or may receive as input an already weighted frame.
- Reference is now additionally made to
FIG. 6, which is another simplified tabular illustration depicting an exemplary result of operations performed by an acceleration device in accordance with an exemplary embodiment of the present invention. By way of non-limiting example of a score board, consider a result score board 600 as shown in FIG. 6; in FIG. 6, bold underscore and bold italics are used similarly to their usage in FIG. 5. - “Smart SW” is able to see that two reference frames in the
result score board 600 have the same MV (¼, −¼), so the smart SW can itself calculate the compensation of weighted prediction between frame 0 and frame 1, and might thus get a better result than the score board indicates. In the particular example shown, since blocks - Turning now to the partitioning issue as described above:
- It is believed that the flexible acceleration device output allows the SW to do re-partitioning based on the acceleration device result, as described immediately above.
- In certain exemplary embodiments of the present invention, the acceleration device may not necessarily stop the search when reaching a threshold; by contrast, SW algorithms generally have a minimal threshold that causes the SW to stop looking for candidates, which means (by way of non-limiting example) that if the SW found a block with a small residual in the first search try, it will terminate the search process.
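The contrast between a threshold-terminated search and an exhaustive one can be illustrated with a small pure-Python sketch. The names `sad`, `sse`, `crop`, and `full_search`, the `(score, (dx, dy))` result layout, and the `top_k` parameter are assumptions made for this example; the text does not specify how the device organizes its search.

```python
def sad(a, b):
    """Sum of Absolute Difference between two equally sized blocks."""
    return sum(abs(x - y) for ra, rb in zip(a, b) for x, y in zip(ra, rb))


def sse(a, b):
    """Sum of Square Error between two equally sized blocks."""
    return sum((x - y) ** 2 for ra, rb in zip(a, b) for x, y in zip(ra, rb))


def crop(frame, top, left, size):
    """Extract a size x size block whose upper-left corner is (top, left)."""
    return [row[left:left + size] for row in frame[top:top + size]]


def full_search(target_block, ref_frame, center, radius, top_k=2, score_fn=sad):
    """Score every candidate position in the search area and keep the best few.

    Unlike a threshold-based search, the loop never stops early, so a
    perfect match found first does not hide other good candidates.
    Returns up to top_k tuples (score, (dx, dy)), best first.
    """
    size = len(target_block)
    cy, cx = center
    height, width = len(ref_frame), len(ref_frame[0])
    results = []
    for dy in range(-radius, radius + 1):
        for dx in range(-radius, radius + 1):
            top, left = cy + dy, cx + dx
            if 0 <= top <= height - size and 0 <= left <= width - size:
                score = score_fn(target_block, crop(ref_frame, top, left, size))
                results.append((score, (dx, dy)))
    results.sort()
    return results[:top_k]
```

Returning several ranked candidates per block, rather than a single winner, is what leaves room for the later re-partitioning step.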
- In the particular non-limiting case described above, since the partitioning is done later in the process, and as described in the example, in order to avoid inappropriate termination, the acceleration device will complete the search and return a vector of the best results found, in order to allow the SW to do re-partitioning. Re-partitioning is discussed in more detail above with reference to
FIG. 5. - Aggregation Towards Larger Blocks
-
- The acceleration device also, in exemplary embodiments, provides matching scores for adjacent blocks (blocks that are “upper” and “left”, by way of non-limiting example, relative to a given block) in order to allow aggregation to take place efficiently. In exemplary embodiments, aggregation is done when adjacent blocks have the same displacement vector: the adjacent blocks are replaced by a bigger aggregated block whose matching result is the sum of the scores of the sub-blocks, since the score function is additive.
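Because the score function is additive, aggregation reduces to a comparison of displacement vectors followed by a sum. The sketch below shows this for 2×2 groups of adjacent blocks; the grid-indexed dictionary layout and the name `aggregate_quads` are assumptions made for illustration only.

```python
def aggregate_quads(blocks):
    """Merge 2x2 groups of adjacent blocks that share a displacement vector.

    blocks maps (row, col) in the small-block grid to (mv, score).
    Returns a dict mapping (row // 2, col // 2) in the coarser grid to
    (mv, summed_score) for every group whose four members agree on mv.
    """
    merged = {}
    for (r, c), (mv, _) in blocks.items():
        if r % 2 or c % 2:
            continue  # only start from the upper-left member of each group
        quad = [(r, c), (r, c + 1), (r + 1, c), (r + 1, c + 1)]
        if all(q in blocks and blocks[q][0] == mv for q in quad):
            # Additive score: the big block's score is the sum of its parts.
            merged[(r // 2, c // 2)] = (mv, sum(blocks[q][1] for q in quad))
    return merged
```

The same routine could be applied again to its own output to build still larger blocks, up to whatever aggregation level has been configured.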
- Avoiding Local Minimum
-
- Reference is now additionally made to
FIG. 7, which is a simplified pictorial illustration depicting a “local minimum” problem that may be encountered by an exemplary system in accordance with an exemplary embodiment of the present invention. When searching on small blocks, it may happen that the search algorithm will “fall” into a local minimum, as shown in FIG. 7. Local minimum in this context refers to a small block which may have a very good match (even a perfect match) to a certain region, even though there may exist a match using a much bigger block. Exemplary techniques for overcoming a local minimum problem, using a plurality of block sizes to do so, are discussed below with reference to FIG. 8.
- Reference is now additionally made to
- When dealing with small blocks, there is higher chance that many of the small blocks will be similar and the present invention, in exemplary embodiments thereof, will not find the entire area of a larger object. In order to overcome this problem, the acceleration device, in exemplary embodiments, performs a full and hierarchical search over the following so-called predictors:
- 1. Results of the collocated (the term “collocated” being known in the video art) block/s in one or more previously decoded frames. In exemplary embodiments, using such results is configured by SW. Such results may comprise a P&A map, as described below.
-
- 2. Result/s from adjacent block/s, as described above.
- 3. The result around the global motion vector of the image (that is, the global motion vector is used as a center of a search area), using such result being configured by SW, which may be the case when SW provides such a global motion vector.
- 4. The result of a low resolution image search, as described below in more detail, including with reference to
FIGS. 19-21 .
- Reference is now additionally made to
FIG. 8, which is a simplified pictorial illustration depicting an operation on a low resolution image, which operation may be carried out by an exemplary system in accordance with an exemplary embodiment of the present invention. In the low resolution search technique, resolution of the image is reduced (the picture is made smaller), such as, by way of non-limiting example, by a factor of 4 in each dimension. In an exemplary embodiment, searching and motion estimation are then performed on the reduced resolution image, as shown in FIG. 8. In the non-limiting example of reduction by a factor of 4, in the reduced resolution image of FIG. 8 an 8×8 block represents a 64×64 block in the original image; persons skilled in the art will appreciate that there is less noise in such a reduced resolution image than in the original image. Thus, a candidate anchor identified in the reduced resolution image will allow us to look for a larger anchor in the full-resolution partition. - It is appreciated that, in embodiments of the present invention, the acceleration device returns the best result of each anchor, and sometimes the second best result, and not the total ranking score.
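The low resolution search described above relies on two simple operations: shrinking the frame, and mapping a displacement found on the small frame back to full-resolution units. A minimal sketch follows, assuming a plain box (averaging) filter; the text does not state which downscaling filter is used, and the names `downscale` and `upscale_vector` are hypothetical.

```python
def downscale(frame, factor=4):
    """Reduce resolution by averaging each factor x factor tile into one pixel.

    The box filter is an assumption for illustration. Rows and columns that
    do not fill a whole tile are dropped.
    """
    height, width = len(frame) // factor, len(frame[0]) // factor
    out = []
    for r in range(height):
        row = []
        for c in range(width):
            tile = [frame[r * factor + i][c * factor + j]
                    for i in range(factor) for j in range(factor)]
            row.append(sum(tile) // len(tile))  # integer average
        out.append(row)
    return out


def upscale_vector(mv, factor=4):
    """Convert a displacement found on the reduced image to full-resolution pixels."""
    return (mv[0] * factor, mv[1] * factor)
```

A displacement found on the reduced image would then serve as the center of a search area on the full-resolution frame, rather than as a final result.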
- Reference is now made to
FIG. 18, which is a simplified block diagram illustration depicting a particular exemplary case of using the acceleration system of FIG. 1B. FIG. 18 depicts an exemplary case in which a single pass is executed in order to obtain a result; in general, whether a single pass or multiple passes are used is decided by SW. Other exemplary cases, in which more than a single pass may be executed, are described below. - In
FIG. 18, the video acceleration device 120 of FIG. 1B is shown receiving the following inputs: - a first prediction and aggregation (P&A)
map 1810; - a
second P&A map 1820, which may refer to a second reference frame (such as the second reference frame 1840 mentioned below); - a
first reference frame 1830; - a
second reference frame 1840; and - a target frame 1850.
- The
first reference frame 1830, second reference frame 1840, and the target frame 1850 will be understood in light of the above discussion. - The
first P&A map 1810 and the second P&A map 1820 may be similar in form to the result map 140 of FIG. 1B, or, by way of very particular non-limiting example, to the result map 500 of FIG. 5. In general, the first P&A map 1810 and the second P&A map 1820: - 
- may be optional;
- are provided by SW; and
- are provided with a deliberately poor score, since it is believed that providing a poor score will lead to a better result.
- The
video acceleration device 120 produces a result map 1860 which may be similar in form to the result map 140 of FIG. 1B, or, by way of very particular non-limiting example, to the result map 500 of FIG. 5. - The above description of
FIG. 18 may be further elaborated on as follows. In general, one or more reference frames and one or more target frames are received as input. A plurality of P&A maps may be received; by way of one particular non-limiting example, when two reference frames and two target frames are received, up to four P&A maps may be received. In general, a given P&A map refers to a reference frame paired with a target frame, so that if there are two reference frames and two target frames, there would be four P&A maps to cover the applicable pairings. In a case where a particular P&A map is blank, this may be an indication to the video acceleration device 120 to search without any additional information as to where to search. In general, a P&A map provides prediction and aggregation points to aid the video acceleration device 120 in searching related to each target. - Reference is now made to
FIG. 19, which is a simplified block-diagram illustration depicting another particular exemplary case of using the acceleration system of FIG. 1B. FIG. 19 depicts an exemplary case in which two passes are executed in order to obtain a result; in general, whether a single pass or multiple passes are used is decided by SW. The two passes depicted in FIG. 19 use downscaling, which is intended to assist in avoiding a local minimum, as described above. - In
FIG. 19, a full resolution target frame 1910 and a full resolution reference frame 1920 are provided as input. The full resolution target frame 1910 and the full resolution reference frame 1920 are downscaled at downscale units 1930 and 1940, respectively. - The output of the
downscale unit 1930 is a downscaled target frame 1950. The output of the downscale unit 1940 is a downscaled reference frame 1960. The downscaled target frame 1950 and the downscaled reference frame 1960 are input into the video acceleration device 120. Two instances of the video acceleration device 120 are shown in FIG. 19 for ease of depiction and acceleration, it being appreciated that only a single video acceleration device 120 may be used. - By way of non-limiting example, an empty P&A map 1965 (see description of P&A maps above, with reference to
FIG. 18) is also input into the video acceleration device 120. The video acceleration device 120 which receives the downscaled target frame 1950 and the downscaled reference frame 1960 produces a P&A map(R) 1970, “R” designating that the map relates to a reference frame, representing the best downscaled result found for the downscaled target frame 1950 in the downscaled reference frame 1960. - Meanwhile, the full
resolution target frame 1910 and the full resolution reference frame 1920 are each provided as input to the video acceleration device 120, which also receives the P&A map(R) 1970, and which produces a second P&A map(R) 1975. It is appreciated that, when a method such as that depicted in FIG. 19 is used, so that in addition to a full search on an entire frame a downscaled search is used, information is produced which may assist in finding larger blocks (such as, by way of non-limiting example, 64×64 blocks). This may be because, in effect, the system has “zoomed in” on each frame, so that more of each frame will be in an effective search area. - Reference is now made to
FIG. 20, which is a simplified block-diagram illustration depicting still another particular exemplary case of using the acceleration system of FIG. 1B, and to FIG. 21, which is a simplified block-diagram illustration depicting yet another particular exemplary case of using the acceleration system of FIG. 1B. FIGS. 20 and 21 will be understood with reference to the above description of FIG. 19. FIG. 20 depicts a situation in which 1:4 and 1:8 downscaling are both used and are combined; FIG. 21 depicts a situation in which both symmetrical and asymmetrical downscaling are used and combined. In general, in the situations depicted in FIGS. 20 and 21, the “quality” of output of the video acceleration device 120 may be improved, with little additional work expended in a few additional passes. - Reference is now made to
FIG. 22, which is a simplified partly pictorial, partly block diagram illustration of an exemplary embodiment of a portion of the application acceleration system of FIG. 1B. FIG. 22 depicts a particular exemplary non-limiting embodiment of the video acceleration device 120 of FIG. 1B. - The
video acceleration device 120 of FIG. 22 is shown receiving as input: a P&A map 2210; a reference frame 2215; and a target frame 2220. Each of the P&A map 2210, the reference frame 2215, and the target frame 2220 may be similar to P&A maps, reference frames, and target frames described above. - The
video acceleration device 120 of FIG. 22 comprises the following elements: - a
result map buffer 2225; - a
reference frame buffer 2230; - a
target frame buffer 2235; - a
block matching engine 2240; - a score
board storage unit 2245; and - aggregation and ranking
circuitry 2255. - The
result map buffer 2225 is shown as storing a map 2260, which may be the input P&A map 2210 or another map, as also described below. - A non-limiting example of operation of the
video acceleration device 120 of FIG. 22 is now briefly described: - The
target frame 2220 or a relevant portion thereof (typically determined under SW control) is received by the video acceleration device 120, and at least a relevant portion is stored in the target frame buffer 2235. By way of a particular non-limiting example, the relevant portion could comprise a current block of 8×8 pixels to be searched. - The
reference frame 2215 or a relevant portion thereof is received by the video acceleration device 120, and a relevant search area (which may be a search area around the current block in the target frame) is stored in the reference frame buffer 2230. - The block matching engine 2240 (which may comprise a plurality of block matching engines, in order to execute more than one operation in parallel) receives the current block stored in the
target frame buffer 2235 and the relevant blocks stored in the reference frame buffer 2230. The block matching engine 2240 determines a score (using, by way of non-limiting example as described above, SAD or SSE), and writes the score to the score board storage unit 2245, producing a score board 2250. Score boards are described above; one particular non-limiting example is the score board 500 of FIG. 5, described above. - In certain exemplary embodiments, the
block matching engine 2240 may use the P&A map 2210 (which may be stored in the result map buffer 2225, or elsewhere in the video acceleration device 120) to “focus” score determination on blocks indicated in the P&A map 2210, and blocks in proximity to those blocks. - The aggregation and ranking
circuitry 2255 is configured, in exemplary embodiments, to determine the best results from the score board 2250, and also to determine large blocks by aggregation, using (by way of non-limiting example) sums of values of adjacent blocks, which blocks have the same displacement vector as a given block, in order to produce an output score board/result map 2260. While not shown in FIG. 22, the output score board/result map is generally provided as output from the video acceleration device 120, as described above. - Reference is now made to
FIG. 23, which is a simplified tabular illustration depicting an exemplary result of operations performed by an acceleration device in accordance with another exemplary embodiment of the present invention. Referring back to FIG. 5 and the description thereof, FIG. 5 described a non-limiting example of system output which included information on a reference frame. In another exemplary embodiment, an entire table (such as, by way of a very particular non-limiting example, the table depicted in FIG. 23, generally designated 2300), may refer to a particular reference frame, it being appreciated that a plurality of tables similar to the table 2300 may be provided, one for each reference frame. Thus, in the case of the table 2300 of FIG. 23 (by contrast to the table 500 of FIG. 5), the reference motion vector (Ref MV) may comprise 2 comma-delimited numbers, comprising x and y coordinates respectively (by way of non-limiting example). - The following description may apply, mutatis mutandis, either to the case depicted and described with reference to the table 500 of
FIG. 5, or to the case depicted and described with reference to the table 2300 of FIG. 23. - Referring back to
FIG. 17A, each residual block 1735 undergoes a transform operation at the transform unit 1740. The transform operation, in exemplary embodiments, converts the input (such as each residual block 1735) from a spatial domain (energy per location in an image) to a frequency domain. A non-limiting example of such a transform is a discrete cosine transform (DCT). Generally, the output of such a transform is a block having the same dimensions as the input block. The top left element of such a block is called a DC element; in particular, it is known in the art of video encoding that the DC element is very close to the average of the values of intensity of pixels in the block, in the spatial domain. The other elements in the transform output are called AC elements. It is well known in the art of video compression that the human eye model is less sensitive to errors in the higher frequencies, which are the last elements of the transform output, than to lower frequencies. For this reason, the quantizing unit 1750 generally quantizes the last AC elements more than the first AC elements, and much more than the DC element. - It is also known in the art of video compression that residual blocks with less energy in the AC coefficients are compressed better than other residual blocks. In other words, with fewer bits in a bitstream a decoder will be able to reconstruct a block which is closer to a source signal; in this context “closer” may be, by non-limiting example, as measured by the PSNR metric, as referred to above.
- However, when doing motion estimation, or trying to find the best block in a reference image against which to compensate, it is known in the art that doing a transform to each candidate block in order to estimate the rate distortion optimization score (RDO score) of that candidate block, is extremely compute intensive, and may in fact be practically impossible.
- The following formula is believed to be a good estimation of the energy residing in AC coefficients (a term used interchangeably herein with “AC elements”): Given a target block T and a candidate reference block C, the energy of the AC coefficients of a residual block R that will be created when compensating block T from block C is: AC in R ~ SAD(T,C) − |AVG(T) − AVG(C)|
- where:
-
- SAD represents Sum of Absolute Difference, as described above;
- AVG represents the average of pixel values in a given block;
- ~ represents approximation; and
- | | represents absolute value.
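The “AC in R” estimate can be written directly from the formula above. The sketch below is a literal transcription for blocks given as lists of rows; the function names `sad`, `avg`, and `ac_energy_estimate` are chosen for this example only.

```python
def sad(t, c):
    """Sum of Absolute Difference between target block t and candidate block c."""
    return sum(abs(x - y) for rt, rc in zip(t, c) for x, y in zip(rt, rc))


def avg(block):
    """Average pixel value of a block."""
    pixels = [p for row in block for p in row]
    return sum(pixels) / len(pixels)


def ac_energy_estimate(t, c):
    """Estimate of AC energy in the residual: SAD(T,C) - |AVG(T) - AVG(C)|."""
    return sad(t, c) - abs(avg(t) - avg(c))


T = [[10, 12], [14, 16]]
C = [[9, 12], [15, 15]]
print(ac_energy_estimate(T, C))
```

The estimate needs only sums over the two blocks, which is why it can be evaluated for every candidate without performing a transform.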
- Reference is now made to
FIG. 24, which is a simplified tabular illustration depicting an exemplary result of operations performed by an acceleration device in accordance with still another exemplary embodiment of the present invention. The table of FIG. 24, generally designated 2400, may be similar to the table 2300 of FIG. 23, with the addition of a column for AC energy estimation, the values of which may be computed in accordance with the formula provided above for “AC in R”. It is appreciated that such a column may also be added, mutatis mutandis, to the table 500 of FIG. 5. - Referring back to
FIG. 22, the system of FIG. 22, with minor variations, may be used to produce the table of FIG. 24, the variations being as follows: - 
- The
block matching engine 2240, in addition to determining a score (using, by way of non-limiting example as described above, SAD or SSE) as described above with reference to FIG. 22, also determines the AC coefficient energy using the “AC in R” formula described above. If SSE was used as a score function, as described above with reference to FIG. 22, then SAD in the “AC in R” formula described above is replaced by SSE. When writing the score to the score board storage unit 2245, producing a score board 2250, as described above with reference to FIG. 22, the score board will include: the block number, MV, SAD/SSE, and AC energy.
- The
- Referring again to
FIG. 24 , choosing the “best” block to be compensated against, from the encoder perspective, may be accomplished as follows. The “best” block will be a block that will introduce minimal quality loss, and still meet a “bit budget”; that is, generally an encoder is supplied with a “bit budget” indicating how many bits may be used in encoding. Thus, the encoder generally maintains an accounting of bit use/quality loss. For example, the best block may be determined using a formula such as: -
Cost=MV_cost+Residual_cost*alpha - where:
- MV_cost is the number of bits that the encoder needs in order to encode a given motion vector (MV) in the bitstream;
- Residual_cost is the cost in bits for the encoder to encode the residual coefficients in the bitstream, referring logically to the “delta” between the 2 blocks (target vs. reference). It is appreciated that the Residual_cost depends on the SAD result and on the AC energy result, since the reference block is subtracted from each block, and the resulting residual is transformed and then quantized. The quantization process implies that, when using low bitrates, where usually higher quantizers are used, the cost of the residuals will have less impact, while the cost of bits used to represent the MV is constant. To account for differences in quantization, the alpha parameter is introduced, the alpha parameter being generally different for each quantization parameter. For higher quantizers (lower bitrates) the alpha value is smaller than it is for lower quantizers (higher bitrates).
- The discussion immediately above implies that an acceleration device as described herein may be configured to output just the overall cost, or to rank based on the Cost function above, and thus to reduce the amount of data that the encoder needs to analyze. In order to accomplish this, the encoder (or software associated therewith) configures the alpha value or values in advance of operation, and also configures, for every frame being searched against, an average quantization parameter (QP) to the acceleration device, and an alpha value in accordance therewith.
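Ranking by the Cost function above is a one-line computation once MV_cost, Residual_cost, and alpha are known. The following sketch uses invented numbers and a hypothetical candidate layout (dictionaries with `mv`, `mv_cost`, and `residual_cost` keys); how the device actually derives the two cost terms is not specified here.

```python
def cost(mv_cost, residual_cost, alpha):
    """Cost = MV_cost + Residual_cost * alpha, as given in the text."""
    return mv_cost + residual_cost * alpha


def rank_candidates(candidates, alpha):
    """Return candidates sorted from lowest to highest cost.

    Each candidate is a dict with hypothetical keys 'mv', 'mv_cost' and
    'residual_cost' (both costs expressed in bits).
    """
    return sorted(candidates, key=lambda c: cost(c["mv_cost"], c["residual_cost"], alpha))


candidates = [
    {"mv": (0, 0), "mv_cost": 2, "residual_cost": 40},
    {"mv": (5, -3), "mv_cost": 10, "residual_cost": 20},
]

# Larger alpha (lower quantizer, higher bitrate): the residual dominates.
print(rank_candidates(candidates, alpha=1.0)[0]["mv"])
# Smaller alpha (higher quantizer, lower bitrate): the cheap MV wins.
print(rank_candidates(candidates, alpha=0.25)[0]["mv"])
```

The example shows why alpha must track the quantization parameter: the same two candidates swap places when only alpha changes.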
- Reference is now made to
FIG. 25, which is a simplified tabular illustration depicting an exemplary result of operations performed by an acceleration device in accordance with yet another exemplary embodiment of the present invention. The table of FIG. 25, generally designated 2500, may be similar to the table 2300 of FIG. 23, except that additional information, which may be useful in facilitating bi-directional prediction, has been added to the table 2500, relative to the table 2300 of FIG. 23. - Bi-directional prediction is available in certain video coding standards, and allows prediction from two images; this means that each block in a target image can be compensated against two different blocks, one from each of two different images. In such a case, the acceleration device may compare each block against a block which is a weighted sum of two blocks from two different images, in order to produce the table 2500. As is known in the art, a weighting coefficient used in computing the weighted sum is constant for a given target frame.
- Prior to the score function as shown in table 2500 being calculated (using, by way of particular non-limiting example, SAD or SSE), an “imaginary” reference block may be assembled using the following formula:
- Where W0 and W1 are weights (generally supplied by the encoder, and based on values in relevant video compression standards);
- (i,j) represents a location within a given block;
- B0 represents a first actual block;
-
- It is appreciated, that, generally, W0 and W1 are not dependent on i and j.
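A weighted “imaginary” reference block of the kind described above might be assembled as in the following sketch. It assumes the plain per-pixel weighted sum W0·B0(i,j) + W1·B1(i,j), where B1 stands for a second actual block; both the exact form of the sum and the name B1 are assumptions based on the surrounding description, and `imaginary_block` is a hypothetical name.

```python
def imaginary_block(b0, b1, w0, w1):
    """Build a reference block as a per-pixel weighted sum of two actual blocks.

    Assumed form: result(i, j) = w0 * b0(i, j) + w1 * b1(i, j).
    The weights do not depend on the pixel position (i, j).
    """
    return [[w0 * x + w1 * y for x, y in zip(r0, r1)] for r0, r1 in zip(b0, b1)]


B0 = [[10, 20], [30, 40]]
B1 = [[20, 40], [10, 0]]
print(imaginary_block(B0, B1, 0.5, 0.5))
```

The resulting block can then be scored against the target block with SAD or SSE exactly as an ordinary reference block would be.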
- The acceleration device may then perform cost calculations on the “imaginary” reference block (using the
block matching engine 2240 of FIG. 22, as described above) as if the “imaginary” reference block were an ordinary reference block, but with two different displacement vectors being output to the score board. The displacement vectors may be in a separate table (in a case such as that of FIG. 25), or a separate row (in a case such as that of FIG. 5). - Exemplary embodiments of the present invention which may be useful with future codecs are now described. In such future codecs, it is believed that it will be possible to copy the content of a block from a previous encoded/decoded block. Such an ability is limited to copying data from the current processed coding tree unit (CTU) or an immediately previous CTU only. This CTU restriction simplifies intra block copy (IBC) implementation which may be useful for such future codecs, since copying from a reconstructed buffer may be problematic in some systems; and the CTU restriction eliminates the need to access a reconstructed buffer by allowing addition of a separate, relatively small, buffer for IBC purposes only.
- Reference is now made to FIGS. 26A and 26B, which are simplified tabular illustrations useful in understanding still another exemplary embodiment of the present invention. FIG. 26A shows an array of blocks, generally designated 2610. A current block being processed 2620 is depicted, along with a plurality of “valid” possible reference blocks 2630.
- The inventors of the present invention believe that by relaxing the restriction of which blocks are “valid” as depicted in FIG. 26A, efficiencies may be achieved by allowing computations to take place for future codecs using frame data already processed. FIG. 26B shows an array of blocks for such a case, generally designated 2640. A current block 2650 is depicted, along with a plurality of “valid” possible reference blocks.
- Using the acceleration device in accordance with exemplary embodiments of the present invention (in particular with reference to FIGS. 17A and 17B, and the above explanation thereof), acceleration is based on receiving a reconstructed buffer in frame resolution, meaning that each frame can access only the previously processed reconstructed frame and cannot access the current reconstructed frame data. Since future codecs as described above use the reconstructed buffer of the current frame, implementation of a solution in accordance with the exemplary embodiment of FIGS. 17A and 17B would be difficult or perhaps impossible, since a current frame reconstructed image is generally not available during operation of the system of FIGS. 17A and 17B.
- Reference is now made to FIG. 27A, which is a simplified partly pictorial, partly block-diagram illustration depicting an exemplary system and process, including motion estimation, for producing a bitstream; and to FIG. 27B, which is a simplified partly pictorial, partly block-diagram illustration depicting an exemplary system and process, including motion estimation, for producing a bitstream, including use of an exemplary embodiment of the acceleration system of FIG. 1B. FIGS. 27A and 27B are similar to FIGS. 17A and 17B, respectively, and like reference numbers have been used therein.
- Using the systems and processes of FIGS. 27A and 27B, a motion vector (MV) search is run on the original frame rather than on the reconstructed frame. This solution can be executed prior to the encoding of the frame. A limitation of this solution may be differences between the original frame data and the reconstructed frame data, but the inventors of the present invention believe that the original frame and the reconstructed frame images (and hence the corresponding data) are quite similar, especially when encoding at a high bitrate. The purpose of the MV search in this case is mainly to find the best vector rather than to calculate the cost of such a vector, since it can be assumed that the best MV is the same for both the original and the reconstructed frames, while the cost itself is believed to be more affected by the difference between the frames. The “real” cost can be calculated by the encoder rather than by the systems and processes of FIGS. 27A and 27B, and since no additional search is required the cost operation is relatively simple.
- Additionally, in alternative exemplary embodiments of the present invention, it is appreciated that a similar concept can be used for regular MV search; by running the MV search on the original frame data rather than on reconstructed frame data it is possible to use the acceleration device on an entire video prior to the encoder running, so encoding can be more efficient and parallel implementation can be made simpler.
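The division of work described above, a search on original frames followed by a single cost evaluation on reconstructed data, can be sketched as follows. This is a minimal illustration assuming a full search with SAD as the score; the function names, the frame layout (lists of rows) and the toy frames are all hypothetical.

```python
def block_at(frame, x, y, size):
    """Extract a size-by-size block whose top-left corner is (x, y)."""
    return [row[x:x + size] for row in frame[y:y + size]]


def sad(a, b):
    """Sum of absolute differences between two blocks."""
    return sum(abs(p - q) for ra, rb in zip(a, b) for p, q in zip(ra, rb))


def search_mv(target_frame, ref_frame, x, y, size, radius):
    """Full search on ORIGINAL frames; returns the best (dx, dy) only."""
    target = block_at(target_frame, x, y, size)
    height, width = len(ref_frame), len(ref_frame[0])
    best = None
    for dy in range(-radius, radius + 1):
        for dx in range(-radius, radius + 1):
            rx, ry = x + dx, y + dy
            if rx < 0 or ry < 0 or rx + size > width or ry + size > height:
                continue  # candidate falls outside the reference frame
            cost = sad(target, block_at(ref_frame, rx, ry, size))
            if best is None or cost < best[0]:
                best = (cost, (dx, dy))
    return best[1]


def cost_for_mv(target_frame, recon_ref, x, y, size, mv):
    """One cost evaluation on the RECONSTRUCTED reference, no search."""
    dx, dy = mv
    return sad(block_at(target_frame, x, y, size),
               block_at(recon_ref, x + dx, y + dy, size))


# Toy frames: the target is the reference shifted by one pixel, and the
# reconstructed reference differs slightly from the original reference.
ref = [[8 * r + c for c in range(8)] for r in range(8)]
target_frame = [[ref[r][c + 1] if c + 1 < 8 else 0 for c in range(8)]
                for r in range(8)]
recon_ref = [[ref[r][c] + (1 if (r + c) % 2 == 0 else 0) for c in range(8)]
             for r in range(8)]

mv = search_mv(target_frame, ref, 2, 2, 2, 1)
cost = cost_for_mv(target_frame, recon_ref, 2, 2, 2, mv)
print(mv, cost)
```

The search never touches `recon_ref`, so it could run before encoding starts; the later cost step is a single block comparison per vector.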
- It is appreciated that software components of the present invention may, if desired, be implemented in ROM (read only memory) form. The software components may, generally, be implemented in hardware, if desired, using conventional techniques. It is further appreciated that the software components may be instantiated, for example: as a computer program product or on a tangible medium. In some cases, it may be possible to instantiate the software components as a signal interpretable by an appropriate computer, although such an instantiation may be excluded in certain embodiments of the present invention.
- It is appreciated that various features of the invention which are, for clarity, described in the contexts of separate embodiments may also be provided in combination in a single embodiment. Conversely, various features of the invention which are, for brevity, described in the context of a single embodiment may also be provided separately or in any suitable subcombination.
- It will be appreciated by persons skilled in the art that the present invention is not limited by what has been particularly shown and described hereinabove. Rather the scope of the invention is defined by the appended claims and equivalents thereof:
Claims (24)
B′=A*B+C1
B′=A*B+1
Priority Applications (6)
Application Number | Priority Date | Filing Date | Title |
---|---|---|---|
US16/442,581 US20200014945A1 (en) | 2018-07-08 | 2019-06-17 | Application acceleration |
CN201910602249.2A CN110691241A (en) | 2018-07-08 | 2019-07-05 | Application acceleration |
CN202211212487.0A CN115567722A (en) | 2018-07-08 | 2019-07-05 | Application acceleration |
US16/850,036 US11252464B2 (en) | 2017-06-14 | 2020-04-16 | Regrouping of video data in host memory |
US17/542,426 US11700414B2 (en) | 2017-06-14 | 2021-12-05 | Regrouping of video data in host memory |
US17/898,496 US20230012939A1 (en) | 2018-07-08 | 2022-08-30 | Application acceleration |
Applications Claiming Priority (4)
Application Number | Priority Date | Filing Date | Title |
---|---|---|---|
US201862695063P | 2018-07-08 | 2018-07-08 | |
US201862726446P | 2018-09-04 | 2018-09-04 | |
US16/291,023 US12058309B2 (en) | 2018-07-08 | 2019-03-04 | Application accelerator |
US16/442,581 US20200014945A1 (en) | 2018-07-08 | 2019-06-17 | Application acceleration |
Related Parent Applications (2)
Application Number | Priority Date | Filing Date | Title |
---|---|---|---|
US16/291,023 Continuation-In-Part US12058309B2 (en) | 2017-06-14 | 2019-03-04 | Application accelerator |
US16/850,036 Continuation-In-Part US11252464B2 (en) | 2017-06-14 | 2020-04-16 | Regrouping of video data in host memory |
Related Child Applications (3)
Application Number | Priority Date | Filing Date | Title |
---|---|---|---|
US15/622,094 Continuation-In-Part US20180367589A1 (en) | 2017-06-14 | 2017-06-14 | Regrouping of video data by a network interface controller |
US16/850,036 Continuation-In-Part US11252464B2 (en) | 2017-06-14 | 2020-04-16 | Regrouping of video data in host memory |
US17/898,496 Division US20230012939A1 (en) | 2018-07-08 | 2022-08-30 | Application acceleration |
Publications (1)
Publication Number | Publication Date |
---|---|
US20200014945A1 (en) | 2020-01-09 |
Family
ID=69102407
Family Applications (2)
Application Number | Priority Date | Filing Date | Title |
---|---|---|---|
US16/442,581 Abandoned US20200014945A1 (en) | 2017-06-14 | 2019-06-17 | Application acceleration |
US17/898,496 Pending US20230012939A1 (en) | 2018-07-08 | 2022-08-30 | Application acceleration |
Country Status (2)
Country | Link |
---|---|
US (2) | US20200014945A1 (en) |
CN (2) | CN110691241A (en) |
Also Published As
Publication number | Publication date |
---|---|
US20230012939A1 (en) | 2023-01-19 |
CN110691241A (en) | 2020-01-14 |
CN115567722A (en) | 2023-01-03 |
Legal Events
Code | Title | Description |
---|---|---|
AS | Assignment | Owner name: MELLANOX TECHNOLOGIES, LTD., ISRAEL. Free format text: ASSIGNMENT OF ASSIGNORS INTEREST;ASSIGNORS:LEVI, DOTAN DAVID;WEISSMAN, ASSAF;PINES, KOBI;AND OTHERS;REEL/FRAME:049480/0958. Effective date: 20190617 |
STPP | Information on status: patent application and granting procedure in general | Free format text: DOCKETED NEW CASE - READY FOR EXAMINATION |
STPP | Information on status: patent application and granting procedure in general | Free format text: NON FINAL ACTION MAILED |
STPP | Information on status: patent application and granting procedure in general | Free format text: DOCKETED NEW CASE - READY FOR EXAMINATION |
STPP | Information on status: patent application and granting procedure in general | Free format text: NON FINAL ACTION MAILED |
STPP | Information on status: patent application and granting procedure in general | Free format text: RESPONSE TO NON-FINAL OFFICE ACTION ENTERED AND FORWARDED TO EXAMINER |
STPP | Information on status: patent application and granting procedure in general | Free format text: NON FINAL ACTION MAILED |
STPP | Information on status: patent application and granting procedure in general | Free format text: RESPONSE TO NON-FINAL OFFICE ACTION ENTERED AND FORWARDED TO EXAMINER |
STPP | Information on status: patent application and granting procedure in general | Free format text: NON FINAL ACTION MAILED |
STPP | Information on status: patent application and granting procedure in general | Free format text: RESPONSE TO NON-FINAL OFFICE ACTION ENTERED AND FORWARDED TO EXAMINER |
STPP | Information on status: patent application and granting procedure in general | Free format text: NON FINAL ACTION MAILED |
STCB | Information on status: application discontinuation | Free format text: ABANDONED -- FAILURE TO RESPOND TO AN OFFICE ACTION |