EP3101650A1 - Method and apparatus for performing interleaving - Google Patents
- Publication number
- EP3101650A1 (application EP16172382.0A)
- Authority
- EP
- European Patent Office
- Prior art keywords
- tiled
- operations
- interleaving
- graphics
- compute
- Prior art date
- Legal status (The legal status is an assumption and is not a legal conclusion. Google has not performed a legal analysis and makes no representation as to the accuracy of the status listed.)
- Pending
Classifications
-
- G—PHYSICS
- G09—EDUCATION; CRYPTOGRAPHY; DISPLAY; ADVERTISING; SEALS
- G09G—ARRANGEMENTS OR CIRCUITS FOR CONTROL OF INDICATING DEVICES USING STATIC MEANS TO PRESENT VARIABLE INFORMATION
- G09G5/00—Control arrangements or circuits for visual indicators common to cathode-ray tube indicators and other visual indicators
- G09G5/36—Control arrangements or circuits for visual indicators common to cathode-ray tube indicators and other visual indicators characterised by the display of a graphic pattern, e.g. using an all-points-addressable [APA] memory
- G09G5/363—Graphics controllers
-
- G—PHYSICS
- G09—EDUCATION; CRYPTOGRAPHY; DISPLAY; ADVERTISING; SEALS
- G09G—ARRANGEMENTS OR CIRCUITS FOR CONTROL OF INDICATING DEVICES USING STATIC MEANS TO PRESENT VARIABLE INFORMATION
- G09G2350/00—Solving problems of bandwidth in display systems
-
- G—PHYSICS
- G09—EDUCATION; CRYPTOGRAPHY; DISPLAY; ADVERTISING; SEALS
- G09G—ARRANGEMENTS OR CIRCUITS FOR CONTROL OF INDICATING DEVICES USING STATIC MEANS TO PRESENT VARIABLE INFORMATION
- G09G2352/00—Parallel handling of streams of display data
-
- G—PHYSICS
- G09—EDUCATION; CRYPTOGRAPHY; DISPLAY; ADVERTISING; SEALS
- G09G—ARRANGEMENTS OR CIRCUITS FOR CONTROL OF INDICATING DEVICES USING STATIC MEANS TO PRESENT VARIABLE INFORMATION
- G09G2360/00—Aspects of the architecture of display systems
- G09G2360/08—Power processing, i.e. workload management for processors involved in display operations, such as CPUs or GPUs
-
- G—PHYSICS
- G09—EDUCATION; CRYPTOGRAPHY; DISPLAY; ADVERTISING; SEALS
- G09G—ARRANGEMENTS OR CIRCUITS FOR CONTROL OF INDICATING DEVICES USING STATIC MEANS TO PRESENT VARIABLE INFORMATION
- G09G2360/00—Aspects of the architecture of display systems
- G09G2360/12—Frame memory handling
- G09G2360/121—Frame memory handling using a cache memory
-
- G—PHYSICS
- G09—EDUCATION; CRYPTOGRAPHY; DISPLAY; ADVERTISING; SEALS
- G09G—ARRANGEMENTS OR CIRCUITS FOR CONTROL OF INDICATING DEVICES USING STATIC MEANS TO PRESENT VARIABLE INFORMATION
- G09G2360/00—Aspects of the architecture of display systems
- G09G2360/12—Frame memory handling
- G09G2360/122—Tiling
-
- G—PHYSICS
- G09—EDUCATION; CRYPTOGRAPHY; DISPLAY; ADVERTISING; SEALS
- G09G—ARRANGEMENTS OR CIRCUITS FOR CONTROL OF INDICATING DEVICES USING STATIC MEANS TO PRESENT VARIABLE INFORMATION
- G09G2360/00—Aspects of the architecture of display systems
- G09G2360/12—Frame memory handling
- G09G2360/123—Frame memory handling using interleaving
Definitions
- the present disclosure relates to rendering of graphical images in which a graphics shader and a compute shader are utilized, and more particularly, to a method of interleaving graphics shader operations and compute shader operations.
- Graphical images are often generated through several steps. For example, an image may be generated and then may be read to generate another image through a sequence of render targets (RTs).
- An RT is an intermediate memory surface to which a three-dimensional (3D) image is rendered.
- a sequence of steps may be performed to generate an RT "A” and then to read the RT A so as to generate an RT "B". For example, in a first step, lighting parameters are written to a G-buffer so as to render an image and, in a second step, a lit image may be rendered by reading the G-buffer and doing light-related calculations.
- a graphics processor renders all of a first RT (e.g., the RT A), and then reads a result of the rendering from the external memory so as to generate a second RT (e.g., the RT B).
- a non-transitory computer-readable recording medium having recorded thereon a program for executing the method, by using a computer.
- a method of performing interleaving may include processing a combination of tiled graphics shader operations and tiled compute shader operations of interdependent (or interrelated) render targets; and interleaving the combination of the tiled graphics shader operations and the tiled compute shader operations, according to a result of the processing.
- a method performed in a graphics processing system may include recompiling a compute shader so as to have a tiled access pattern; and interleaving processing of a graphics shader and the recompiled compute shader for a set of interdependent images, wherein the interleaving is performed on a tile-by-tile basis for the interdependent images.
- a non-transitory computer-readable recording medium having recorded thereon a program which, when executed by using a processor, performs a method including determining dependencies of graphics shader operations and compute shader operations of a set of interdependent render target operations; and scheduling an interleaved order of tile processing on interleaved graphics shader operations and compute shader operations so as to reduce traffic to an external memory of a graphics system by maintaining at least one subset of intermediate tile processing computations of the interleaved graphics shader operations and compute shader operations in an on-chip memory of a graphics processing unit (GPU).
- when a part "includes" or "comprises" an element, unless there is a particular description contrary thereto, the part can further include other elements, not excluding the other elements.
- terms such as "unit" and "module" indicate a unit for processing at least one function or operation, wherein the unit or module may be embodied as hardware or software, or by combining hardware and software.
- FIG. 1 illustrates configuration of an apparatus for performing interleaving, according to an embodiment.
- a memory 110 includes a driver 111, and a graphics processing unit (GPU) 120 includes a cache 121.
- the driver 111 may correspond to program instructions.
- the program instructions provide a software interface with respect to the GPU 120, which allows software programs to access hardware configurations of the GPU 120 and to communicate data with the GPU 120.
- the driver 111 may include program instructions and a GPU Application Programming Interface (GPU API) stored in the memory 110.
- the GPU 120 may include not only the cache 121 but also programmable GPU hardware and at least one processor.
- the cache 121 may be used as a tile buffer for buffering tiled data.
- FIG. 2 is a flowchart for describing a method of performing interleaving, according to an embodiment.
- the driver 111 processes a combination of tiled graphics shader operations and tiled compute shader operations of interrelated render targets (RTs).
- the driver 111 interleaves the combination of the tiled graphics shader operations and the tiled compute shader operations.
- FIG. 3 illustrates configuration of a graphics processing system 300, according to an embodiment.
- a memory 304 and a driver 308 shown in FIG. 3 may respectively correspond to the memory 110 and the driver 111 shown in FIG. 1 .
- a GPU 302 and an on-chip memory 360 shown in FIG. 3 may respectively correspond to the GPU 120 and the cache 121 shown in FIG. 1 .
- the graphics processing system 300 includes a central processing unit (CPU) 301 and the memory 304.
- the driver 308 may be stored in the memory 304.
- the driver 308 may correspond to program instructions.
- the program instructions provide a software interface with respect to the GPU 302 that enables software programs to access hardware configurations of the GPU 302 and to communicate data with the GPU 302.
- the software programs may include an operating system program, an application program, and/or other such software programs.
- the driver 308 may support physical and/or virtual hardware.
- the driver 308 includes program instructions stored in memory 304.
- the GPU 302 includes programmable GPU hardware 350, at least one processor 357, and an on-chip cache memory 360.
- the on-chip cache memory 360 may be used as a tile buffer to buffer tiled data 362.
- the driver 308 includes a GPU API 306. Referring to FIG. 3, the GPU 302 is illustrated as a physical GPU component; however, the GPU 302 may correspond to a virtual GPU.
- the GPU 302 may access an external memory 320.
- the external memory 320 may be an external dynamic random access memory (DRAM).
- a graphics processor renders a first RT, writes a result of the rendering to the external memory 320, and then reads the result of the rendering from the external memory 320 so as to generate a second RT; these processes generate a lot of traffic to and from the external memory 320.
- the processes may include rendering with respect to unnecessary portions included in intermediate images.
- a support is provided to the driver 308 so as to interleave graphics shader operations and compute shader operations.
- Portions of the programmable GPU hardware 350 may be optionally modified with a hardware scheduling support 355 so as to optimize the execution of the interleaved scheduling in the GPU 302.
- Graphics rendering operations are performed in a graphics pipeline including a fixed function logic, one or more graphics shaders, and compute shaders.
- the graphics shader is a program used to do shading.
- the graphics shader is a program that runs as a part of the graphics rendering operation.
- the graphics shader processes pixels, vertices, patches, or primitives.
- the compute shader is a program used for computing arbitrary information and provides more flexibility. Here, flexibility indicates a characteristic of software which may be easily changed according to different types of machine and various demands from a user.
- a compute shader that supports the OpenGL ES™ 3.1 Application Programming Interface (API)
- benchmarks are moving image post-processing operations, such as motion blur, depth of field, scaling, and filtering, from graphics shaders to compute shaders.
- a compute shader is used in computing arbitrary information.
- the compute shaders operate in an abstract space.
- a compute shader may have limitations on a work group size (a smallest amount of compute operations) and a local size.
- the local size is defined by the number of invocations of the shader which may take place within each work group.
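The work group and local size relationship described above can be sketched as follows (an illustrative sketch, not part of the patent disclosure; the function names and the 16x16 local size are assumptions):

```python
# Sketch of OpenGL ES 3.1-style compute dispatch arithmetic: the total
# number of shader invocations is the work group count times the local
# size, computed per dimension. Names here are illustrative.

def total_invocations(num_groups, local_size):
    """Total invocations = work groups x invocations per group,
    multiplied across the x, y, z dimensions."""
    gx, gy, gz = num_groups
    lx, ly, lz = local_size
    return (gx * lx) * (gy * ly) * (gz * lz)

def groups_for_image(width, height, local_size):
    """Ceil-divided work group counts so every pixel gets one invocation."""
    lx, ly, _ = local_size
    return (-(-width // lx), -(-height // ly), 1)

groups = groups_for_image(1920, 1080, (16, 16, 1))
assert groups == (120, 68, 1)  # 68 * 16 = 1088 rows, covering all 1080
```

Note that the dispatch over-covers the image (1088 rows for 1080); a real shader would clamp out-of-range invocations.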
- OpenGL™ also defines rules for shared variables and memory barriers. The memory barriers ensure that all memory transactions before the barrier complete before processing continues.
- Compute shaders offer more flexibility and features that may improve the efficiency of post-processing operations. For example, the compute shader gives more freedom in how data is accessed or written. However, the API does not guarantee memory consistency without the use of a global memory barrier (e.g., a glMemoryBarrier).
- a first analysis module 312 analyzes API graphics shader calls and compute shader calls.
- a second analysis module 314 builds a directed acyclic graph (DAG) based on the analysis of the first analysis module 312. Examples of the DAG are as illustrated in FIGS. 4A, 4B, and 4C.
- an API dependency graph (corresponding to the DAG) is built up in the driver 308 so as to detect cases where graphics calls and compute calls may be interleaved.
- API calls are grouped to build a sequence of interleaved execution of graphics shader operations and compute shader operations.
- a scheduling module 330 determines a schedule of interleaved operations. For example, at the shader compilation time, the data access pattern of image load/store operations in the compute shader (or the compute shaders) is analyzed so as to determine whether the image load/store operations are candidates for interleaving. Certain types of data access patterns have characteristics that are compatible with interleaving. For example, if the data access pattern is a statically known strided pattern in a one-dimensional (1D) or two-dimensional (2D) space, the compute shader is a candidate for the interleaving. The strided pattern has a sequence of memory accesses with respect to addresses separated from each other by a stride length.
- a compute shader with a statically known strided pattern has an order of processing that has some general similarities in shape to a tile pattern, which facilitates converting a compute shader to have a tiled access.
- other types of data access patterns may be suitable candidates for interleaving. Operations of the scheduling module 330 are described in greater detail with reference to FIGS. 5 , 6 , and 9 .
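The compile-time candidate test described above (a statically known strided access pattern) can be sketched as follows; the function name and the flat address model are illustrative assumptions, not the driver's actual analysis:

```python
# Sketch of the interleaving-candidate check: given the addresses an image
# load/store sequence would touch, decide whether they form a constant
# strided pattern. Irregular (data-dependent) patterns are rejected.

def detect_stride(addresses):
    """Return the constant stride if accesses are evenly spaced, else None."""
    if len(addresses) < 2:
        return None
    stride = addresses[1] - addresses[0]
    for prev, cur in zip(addresses, addresses[1:]):
        if cur - prev != stride:
            return None  # irregular: not a candidate for tiled conversion
    return stride

# A row-major scan at one 4-byte pixel per invocation is strided...
assert detect_stride([0, 4, 8, 12, 16]) == 4
# ...while a data-dependent gather is not.
assert detect_stride([0, 4, 32, 36]) is None
```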
- the scheduling module 330 determines sets of interleaved tiled graphics shader operations and compute shader operations on interdependent RTs so as to reduce external memory traffic.
- the operation by the scheduling module 330 may be performed by maintaining at least some intermediate results of the interleaved operations in the on-chip memory 360.
- An example of the operation of scheduling the interleaved tiled graphics shader operations and the compute shader operation so as to reduce the external memory traffic is described in more detail with reference to FIG. 5 .
- a support is provided in the driver 308 and a compiler (not shown) so as to automatically convert a compute shader (or compute shaders) to operate with a tiled access pattern by a conversion module 340.
- the compute shaders are converted to operate on tiles so as to generate a tiled equivalent version.
- Such conversion includes recompiling, by using the compiler, the compute shader to adapt the compute shader to perform a tiled memory access in a tile format compatible with that of the tiled memory access of a graphics shader.
- such conversion includes redefining the workgroup dimension of the compute shader to be an integer divisor of a tile's width and height.
- image load instructions may be replaced with tile buffer load instructions.
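The conversion steps above (redefining the workgroup so its dimensions are integer divisors of the tile's width and height, and replacing global image addressing with tile-buffer indexing) can be sketched as follows; all names and the 256-invocation limit are illustrative assumptions:

```python
# Sketch of the compute shader conversion: (1) choose a workgroup whose
# dimensions evenly divide the tile, (2) map a global pixel coordinate to
# an index local to the on-chip tile buffer instead of external memory.

def tiled_workgroup(tile_w, tile_h, max_local=256):
    """Pick the largest (tie-broken toward square) workgroup whose
    dimensions are integer divisors of the tile's width and height."""
    best = (1, 1)
    for wx in range(1, tile_w + 1):
        if tile_w % wx:
            continue
        for wy in range(1, tile_h + 1):
            if tile_h % wy or wx * wy > max_local:
                continue
            key = (wx * wy, -abs(wx - wy))
            if key > (best[0] * best[1], -abs(best[0] - best[1])):
                best = (wx, wy)
    return best

def tile_buffer_index(x, y, tile_w, tile_h):
    """Replace a global image address with an index within the tile."""
    return (y % tile_h) * tile_w + (x % tile_w)

assert tiled_workgroup(32, 32) == (16, 16)
assert tile_buffer_index(33, 2, 32, 32) == 2 * 32 + 1
```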
- a removal module 342 may remove the memory barrier.
- the graphics processing system 300 automatically (e.g., without human intervention) interleaves the processing of graphics shader and compute shaders in order to reduce or eliminate writing the intermediate data to an external (off-chip) memory 320.
- the driver 308 re-orders the API calls and performs any necessary recompiling on the compute shaders so as to process, by using the driver 308 (and the support of a compiler software stack), existing applications written to graphics APIs having the compute shaders.
- FIGS. 4A through 4C illustrate examples of a DAG.
- In FIG. 4A, one embodiment of a data flow graph (or a data structure equivalent thereto) that is generated by the second analysis module 314 is shown.
- the data flow graph of FIG. 4A shows the immediate dependency between a set of two or more images, such as images A and B, and may also include other subsequent images (e.g., C, D).
- FIG. 4A corresponds to a DAG defining dependencies between different RTs.
- the interdependence of different RT images may be simple (e.g., the image B directly flows from the image A) or it may have more complex interdependent relationships (e.g., an image may depend on more than one image). More generally, the image A and the image B may both be used to generate the image C, as illustrated; one or more images might depend on several others at the same time.
- the interdependence therebetween may be generally more complex.
- two graphics shader RTs may be required so as to generate a compute shader RT.
- dependency on a tile level may also be complex.
- a tile in a given image B may depend on more than one tile in a previous image (e.g., an image A).
- there may be individual tiles in a given RT which are not used by any tiles in a subsequent RT.
- the first analysis module 312 and the second analysis module 314 of the driver 308 examine a stream of commands and then determine if the image of the RT B directly or indirectly (e.g., downstream) depends on an RT A.
- the driver 308 determines how many levels are required to follow the dependency graph so as to reduce external memory accesses. Following every level of the dependency graph may not be required in all cases to achieve a reduction in the external memory accesses. Moreover, following every level of a complex dependency graph consumes processing resources.
- the second analysis module 314 limits the number of levels to which the dependency graph is followed. If the dependency graph is complex, a rule limiting the number of levels followed may be applied thereto.
- the scheduling module 330 generates an interleaving schedule so as to attempt to keep intermediate results in an on-chip cache memory (e.g., a tile buffer).
- the DAG maps dependencies at a tile level between tiles rendered by graphics shader operations and tiles operated on by compute shaders.
- the generation of the DAG may include back-projecting each output tile to the input tile (or input tiles) required to generate it.
- the image A may be used in generating the image B, which in turn may be used in generating the image C.
- a single tile in a given image may depend on more than one tile from another image.
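The back-projection used to build the tile-level DAG can be sketched as follows; the fixed-radius footprint model (a compute pass that reads up to a few pixels beyond each tile edge, e.g., a blur) is an assumption for illustration, not the patent's definition:

```python
# Sketch of tile-level dependency construction: each output tile is mapped
# back to the set of input tiles its read footprint covers. A tile may
# depend on more than one tile of the previous render target.

def back_project(out_tile, tile_size, radius, grid_w, grid_h):
    """Return the input tiles an output tile depends on, given a filter
    that reads up to `radius` pixels beyond each tile edge."""
    tx, ty = out_tile
    x0 = (tx * tile_size - radius) // tile_size
    x1 = (tx * tile_size + tile_size - 1 + radius) // tile_size
    y0 = (ty * tile_size - radius) // tile_size
    y1 = (ty * tile_size + tile_size - 1 + radius) // tile_size
    return {(x, y)
            for x in range(max(x0, 0), min(x1, grid_w - 1) + 1)
            for y in range(max(y0, 0), min(y1, grid_h - 1) + 1)}

# With a 2-pixel radius on 32x32 tiles, an interior tile depends on its
# 3x3 neighbourhood in the previous image; a corner tile on a 2x2 block.
assert len(back_project((1, 1), 32, 2, 4, 4)) == 9
assert back_project((0, 0), 32, 2, 4, 4) == {(0, 0), (0, 1), (1, 0), (1, 1)}
```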
- FIG. 5 illustrates an example in which intermediate results are maintained in the on-chip memory 360.
- FIG. 5 illustrates the example in which the intermediate computations of a sequence of operations are maintained in the on-chip cache memory 360 used as a tile buffer.
- the second analysis module 314 determines dependencies between an RT A, an RT B, an RT C, and an RT D.
- the scheduling module 330 determines a schedule to write tiles of the RT A to the on-chip cache memory 360 and then read them to the RT B.
- the scheduling of the processing of writing and reading the tiles may be selected to generate an interleaved schedule such that, after a tile of the image A is produced, it is stored in the on-chip cache memory 360.
- the tile is "immediately read back" so as to produce a tile of the RT B; thus, memory bandwidth may be saved.
- immediately read back may include processing intervening operations between the storing of the tile of the image A in the on-chip memory 360 and the reading of the stored tile from the on-chip memory 360 so as to produce a tile of the RT B.
- "immediately read back” may correspond to reading the tile of the image A from the on-chip memory 360 rather than reading the tile from an external memory 320 so as to produce the tile of the RT B.
- the interleaved schedule may proceed in an order matching with the dependency analysis performed by the second analysis module 314.
- the dependency analysis may be selected to optimize use of the on-chip cache memory 360, so that the need to use the external memory 320 for the intermediate computations may be minimized.
- the dependency analysis performed by the second analysis module 314 may be used in the scheduling scheme to eliminate processing of tiles that do not contribute to a tile of a final image. By doing so, unnecessary work may be eliminated.
- the final image may be rendered and then may be sent to the external memory 320 or may be output for display.
- interleaving may be performed at the tile level for graphics rendering and compute shader operations on a set of producer/consumer render targets (e.g., images 1, 2, 3). Therefore, tile data of an RT1 may be rendered just before it is required so as to generate a corresponding tile of an RT2 with the tiled compute shader, which is rendered just before that portion of the RT2 is required so as to generate an RT3, and so on.
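The producer/consumer ordering above can be contrasted with the conventional per-RT order in a short sketch (illustrative only; the stage names are assumptions):

```python
# Sketch of the two schedules: depth-first (per-tile) interleaving pushes
# each tile through the whole RT chain so intermediates stay in the tile
# buffer; the conventional breadth-first order finishes each RT first.

def interleaved_schedule(num_tiles, stages):
    """Per-tile order: each tile traverses all stages before the next tile."""
    return [(stage, tile) for tile in range(num_tiles) for stage in stages]

def per_rt_schedule(num_tiles, stages):
    """Conventional order: finish each RT before starting the next."""
    return [(stage, tile) for stage in stages for tile in range(num_tiles)]

stages = ["render RT1", "compute RT2", "compute RT3"]
interleaved = interleaved_schedule(2, stages)
assert interleaved[:3] == [("render RT1", 0), ("compute RT2", 0), ("compute RT3", 0)]
```

In the conventional schedule, every RT1 tile is produced before any RT2 work starts, which forces the intermediate image to external memory; the interleaved order lets each tile be consumed while still on chip.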
- a compute shader may generate input data required to render a graphics tile. For example, a compute shader may generate a position of particles, and only a particle affecting a specific XY region may be of interest for the graphics shader.
- FIG. 6 illustrates a flowchart for describing a method of interleaving graphics and compute operations, according to an embodiment.
- the first analysis module 312 processes API calls, and finds interleaving candidates.
- API call processing is deferred and queued to build a DAG by the second analysis module 314 that defines dependencies.
- the driver 308 uses certain events, such as a render target change or a memory barrier, in order to determine when to start processing deferred API calls.
- the removal module 342 determines whether one or more memory barriers can be safely removed by interleaving the graphics with compute calls in a tiled manner (e.g., the memory barrier can be safely removed if the graphics processing for the tile ends before the compute processing occurs).
- a determination of when interleaving is allowed is based on an analysis of memory access patterns of image load/store operations in the compute shader during a compilation time. If the memory access pattern is a statically-known strided pattern in a 1D or 2D space, then the compute shader becomes a candidate for interleaving.
- the conversion module 340 recompiles the compute shader for tiled access patterns.
- a workgroup dimension of the compute shader is redefined to be an integer divisor of a tile's width and height.
- the tile's width and height may be modified. These modifications ensure that the range of access fits with the defined tile dimensions, so all data may be stored in the on-chip memory 360.
- image load instructions may be replaced with tile buffer load instructions. This may include replacing instructions used to compute the target address to index in a tile buffer.
- the scheduling module 330 groups API calls so as to build a sequence of interleaved execution of graphics and compute operations. For example, the scheduling module 330 determines whether interleaving is feasible. The scheduling module 330 removes the memory barrier, based on a result of determining whether interleaving is feasible.
- the scheduling module 330 schedules the interleaved execution of each tile.
- the output of a graphics shader for a given tile is stored in the on-chip memory 360 (or another on-chip memory such as a data register or a buffer). For example, data is immediately read back by a subsequent compute shader, so that unnecessary external memory accesses are eliminated.
- FIG. 7 illustrates a flowchart for describing a method of interleaving graphics and compute operations, according to another embodiment.
- the second analysis module 314 determines dependencies between images and tiles in order to build a DAG.
- Candidates are identified in operation 720 for interleaving of graphics calls and compute calls on a tiled basis.
- the scheduling module 330 identifies the candidates for interleaving.
- an associated candidate compute shader (or compute shaders) is adapted to be a tiled equivalent having a tiled access pattern, by the conversion module 340.
- the conversion module 340 recompiles the compute shader for tiled access patterns.
- interleaving of the combined graphics shader and compute shader is scheduled by the scheduling module 330 in a tiled manner so as to reduce traffic to an external memory by maintaining at least some of the intermediate data result in the on-chip memory 360 (e.g., a tile buffer).
- FIG. 8 shows an example of analyzing a sequence of API calls (e.g., OpenGL™ API calls) for interleaving graphics calls followed by compute calls for a graphics shader 830 and a compute shader 850.
- the first analysis module 312 analyzes API graphics calls 810 and API compute calls 820 in order to determine whether or not graphics and compute calls can be interleaved.
- the second analysis module 314 determines the graphics calls on which the compute call is dependent.
- a render target 840 is an intermediate memory surface to which a 3D image is rendered.
- FIG. 9 illustrates an example in which graphics and compute calls are interleaved by an interleaved scheduling module, according to an embodiment.
- a stride pattern that a compute shader 950 will need may be determined by monitoring memory access patterns of the load/store operations of the compute shader 950.
- a graphics shader 930 may generate tiles in the order in which the compute shader 950 consumes the tiles. For example, once a tile 910 has been produced, the compute shader 950 processes the tile 910. While the compute shader 950 processes the tile 910, the graphics shader 930 produces a tile 920. Then, while the compute shader 950 processes the tile 920, the graphics shader 930 produces another tile. As a result, the processing of the tiles is interleaved between the graphics shader 930 and the compute shader 950.
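The alternation described above can be sketched as a simple pipeline model (illustrative only; the tile identifiers follow FIG. 9, and the pairing format is an assumption):

```python
# Sketch of the FIG. 9 pipelining: at each step the graphics shader
# produces one tile while the compute shader concurrently consumes the
# previously produced tile; None marks an idle shader at the pipe ends.

def pipeline(tiles):
    """Pair each produced tile with the tile consumed concurrently."""
    produced = tiles + [None]   # compute drains the last tile at the end
    consumed = [None] + tiles   # nothing to consume at warm-up
    return list(zip(produced, consumed))

# Tile 910 is produced first; 920 is produced while 910 is consumed, etc.
assert pipeline([910, 920, 930]) == [
    (910, None), (920, 910), (930, 920), (None, 930)]
```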
- the rendering of a graphics frame may include a mixture of graphics and compute commands.
- a graphics rendering engine may mix graphics operations and compute operations for many stages of post-processing and graphics calculations.
- intermediate results that may be written to a memory (e.g., an external memory) may be generated and may be read back later so as to generate a final resulting image or a next intermediate image.
- the compute shader operations may include lighting calculations based on parameter data from a G-buffer (generated due to graphics rendering operations), a motion blur, a depth of field (DOF), compositing, and anti-aliasing.
- a graphics shader is followed by a sequence of compute shader operations.
- the compute shader may be followed by the graphics shader.
- a similar analysis may be applied to a case where the compute shader writes an output to an external memory and the graphics shader loads the data from the external memory.
- the compute shader and graphics shader may be interleaved by recompiling the compute shader so as to output the data to the on-chip buffer.
- the graphics shader consumes the data from the on-chip buffer, and executes the data in a tiled manner.
- caches are used instead of a tile buffer for intermediate data.
- a direct access to a low-level on-chip cache being used as the tile-buffer may not be possible due to a lack of appropriate data paths.
- the nearest cache level may be used instead.
- the energy and performance benefits may be reduced depending on proximity of the cache to compute logic within the GPU.
- hints or directives may be issued to the cache so as to ensure that the data remains resident in the cache and is not swapped out to an external DRAM memory until the dependent calculations are completed.
- the range of access is selected to fit within the cache capacity, so that all working data may stay in the cache.
- an output of a graphics shader for a given tile is stored in a nearest cache. The data is consumed immediately by a subsequent compute shader, and any unnecessary access to a DRAM or a lower cache level is eliminated.
- a module can correspond to a hardware component, a software component, or a combination thereof.
- a module may include one or more processors (e.g., computer processors) and a data storage device including program instructions.
- the one or more processors may be configured by the instructions to function as a special purpose processor to perform one or more methods described herein.
- Software, hardware, and other modules may reside on servers, workstations, mobile devices, smart phones, wearable computers, personal computers, tablet computers, image data encoders, image data decoders, personal digital assistants (PDAs), video projectors, audio-visual receivers, displays (such as televisions (TVs)), digital cinema projectors, media reproducers, and other devices suitable for the purposes described herein. It is obvious to one of ordinary skill in the art that aspects of the system may be practiced as a standalone device or by a system of devices, such as, e.g., a client-server system.
- a software (or program) component may include any non-transitory medium which carries a set of computer-readable instructions which, when executed by a processor, cause a processor to execute one or more methods described herein.
- the software component may be in any of a wide variety of forms.
- the program component may include non-transitory media such as magnetic data storage media including floppy diskettes, hard disk drives, optical data storage media including CD ROMs, DVDs, electronic data storage media including ROMs, flash RAM, EPROMs, hardwired or preprogrammed chips (e.g., EEPROM semiconductor chips), nanotechnology memory, or the like.
- the computer-readable instructions on the program components may optionally be compressed or encrypted.
- Example embodiments of software components may include (but are not limited to) firmware, middleware, operating system software, resident software, application software, microcode, and the like. Both hardware and software components may be centralized or distributed (or a combination thereof), in whole or in part, as known to those skilled in the art. For example, software components and other modules may be accessible via local memory, via a network, via a browser or other application in a distributed computing context or via other means suitable for the purposes described above.
Description
- The present disclosure relates to rendering of graphical images in which a graphics shader and a compute shader are utilized, and more particularly, to a method of interleaving graphics shader operations and compute shader operations.
- Graphical images are often generated through several steps. For example, an image may be generated and then read to generate another image through a sequence of render targets (RTs). An RT is an intermediate memory surface to which a three-dimensional (3D) image is rendered. A sequence of steps may be performed to generate an RT "A" and then to read the RT A so as to generate an RT "B". For example, in a first step, lighting parameters are written to a G-buffer so as to render an image and, in a second step, a lit image may be rendered by reading the G-buffer and performing light-related calculations. A sequence of operations (i.e., the sequence of steps) may be performed on different RTs before a final output image is generated.
- However, these render target steps require a graphics processing unit (GPU) to access an external memory. In a case where a graphics application generates an intermediate image A and then the image A is read so as to generate an image B, where a common image size (e.g., 1920x1080 pixels) is applied thereto and each pixel occupies 4 bytes (RGBA8888 format), the intermediate image A has to be written to the external memory if a cache cannot store data of about 8 megabytes (MB).
- Thus, a graphics processor renders all of a first RT (e.g., the RT A), and then reads a result of the rendering from the external memory so as to generate a second RT (e.g., the RT B).
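The arithmetic behind this external-memory round trip can be checked with a few lines (a simple sketch using the example figures from the description above):

```python
# Footprint of one intermediate render target (RT) at the example
# resolution and pixel format from the description.
WIDTH, HEIGHT = 1920, 1080   # pixels
BYTES_PER_PIXEL = 4          # RGBA8888: one byte per channel

def render_target_bytes(width, height, bytes_per_pixel):
    """Memory footprint of a full render target, in bytes."""
    return width * height * bytes_per_pixel

rt_bytes = render_target_bytes(WIDTH, HEIGHT, BYTES_PER_PIXEL)
rt_mib = rt_bytes / (1024 * 1024)
# Roughly 8 MB in total, so the intermediate image cannot stay
# resident in a small on-chip cache and must round-trip through
# external memory.
```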
- Provided are methods and apparatuses for performing interleaving. Also provided are a non-transitory computer-readable recording medium having recorded thereon a program for executing the method, by using a computer.
- Additional aspects will be set forth in part in the description which follows and, in part, will be apparent from the description, or may be learned by practice of the presented embodiments.
- According to an aspect of an embodiment, a method of performing interleaving may include processing a combination of tiled graphics shader operations and tiled compute shader operations of interdependent (or interrelated) render targets; and interleaving the combination of the tiled graphics shader operations and the tiled compute shader operations, according to a result of the processing.
- According to another aspect of an embodiment, a method performed in a graphics processing system may include recompiling a compute shader so as to have a tiled access pattern; and interleaving processing of a graphics shader and the recompiled compute shader for a set of interdependent images, wherein the interleaving is performed on a tile-by-tile basis for the interdependent images.
- According to another aspect of an embodiment, a non-transitory computer-readable recording medium having recorded thereon a program which, when executed by using a processor, performs a method including determining dependencies of graphics shader operations and compute shader operations of a set of interdependent render target operations; and scheduling an interleaved order of tile processing on interleaved graphics shader operations and compute shader operations so as to reduce traffic to an external memory of a graphics system by maintaining at least one subset of intermediate tile processing computations of the interleaved graphics shader operations and compute shader operations in an on-chip memory of a graphics processing unit (GPU).
- These and/or other aspects will become apparent and more readily appreciated from the following description of the embodiments, taken in conjunction with the accompanying drawings in which:
-
FIG. 1 illustrates a configuration of an apparatus for performing interleaving, according to an embodiment; -
FIG. 2 is a flowchart for describing a method of performing interleaving, according to an embodiment; -
FIG. 3 illustrates a configuration of a graphics processing system, according to an embodiment; -
FIGS. 4A through 4C illustrate examples of a directed acyclic graph (DAG); -
FIG. 5 illustrates an example in which intermediate results are maintained in an on-chip memory; -
FIG. 6 illustrates a flowchart for describing a method of interleaving graphics and compute operations, according to an embodiment; -
FIG. 7 illustrates a flowchart for describing a method of interleaving graphics and compute operations, according to another embodiment; -
FIG. 8 shows an example of analyzing a sequence of Application Programming Interface (API) calls (e.g., OpenGL™ API calls) for interleaving graphics calls followed by compute calls for a graphics shader and a compute shader; and -
FIG. 9 illustrates an example in which graphics and compute calls are interleaved by an interleaved scheduling module, according to an embodiment.
- All terms including descriptive or technical terms which are used herein should be construed as having meanings that are obvious to one of ordinary skill in the art. However, the terms may have different meanings according to an intention of one of ordinary skill in the art, precedent cases, or the appearance of new technologies. Also, some terms may be arbitrarily selected by the applicant, and in this case, the meaning of the selected terms will be described in detail in the detailed description of the invention. Thus, the terms used herein have to be defined based on the meaning of the terms together with the description throughout the specification.
- Also, when a part "includes" or "comprises" an element, unless there is a particular description contrary thereto, the part can further include other elements, not excluding the other elements. In the following description, terms such as "unit" and "module" indicate a unit for processing at least one function or operation, wherein the unit and the module may be embodied as hardware or software or embodied by combining hardware and software.
- The present disclosure will now be described more fully with reference to the accompanying drawings, in which one or more embodiments are shown. The present disclosure may, however, be embodied in many different forms and should not be construed as being limited to the embodiments set forth herein; rather, these embodiments are provided so that this present disclosure will be thorough and complete, and will fully convey the concept of the present disclosure to those of ordinary skill in the art.
- Hereinafter, the present disclosure will be described in detail by explaining embodiments with reference to the attached drawings.
-
FIG. 1 illustrates a configuration of an apparatus for performing interleaving, according to an embodiment.
- Referring to FIG. 1, a memory 110 includes a driver 111, and a graphics processing unit (GPU) 120 includes a cache 121. The driver 111 may correspond to program instructions. Here, the program instructions provide a software interface with respect to the GPU 120, which allows software programs to access hardware configurations of the GPU 120 and to communicate data with the GPU 120. For example, the driver 111 may include program instructions and a GPU Application Programming Interface (GPU API) stored in the memory 110.
- The GPU 120 may include not only the cache 121 but also programmable GPU hardware and at least one processor. In this regard, the cache 121 may be used as a tile buffer for buffering tiled data. -
FIG. 2 is a flowchart for describing a method of performing interleaving, according to an embodiment.
- In operation 210, the driver 111 processes a combination of tiled graphics shader operations and tiled compute shader operations of interrelated render targets (RTs).
- In operation 220, according to a result of the processing, the driver 111 interleaves the combination of the tiled graphics shader operations and the tiled compute shader operations.
- Hereinafter, with reference to FIGS. 3 through 9, examples in which the driver 111 and the GPU 120 perform interleaving are described in detail. -
FIG. 3 illustrates a configuration of a graphics processing system 300, according to an embodiment.
- A memory 304 and a driver 308 shown in FIG. 3 may respectively correspond to the memory 110 and the driver 111 shown in FIG. 1. Also, a GPU 302 and an on-chip memory 360 shown in FIG. 3 may respectively correspond to the GPU 120 and the cache 121 shown in FIG. 1.
- Referring to FIG. 3, the graphics processing system 300 includes a central processing unit (CPU) 301 and the memory 304. The driver 308 may be stored in the memory 304. For example, the driver 308 may correspond to program instructions. In this regard, the program instructions provide a software interface with respect to the GPU 302 that enables software programs to access hardware configurations of the GPU 302 and to communicate data with the GPU 302. For example, the software programs may include an operating system program, an application program, and/or other like software programs. The driver 308 may support physical and/or virtual hardware.
- For example, the driver 308 includes program instructions stored in the memory 304. The GPU 302 includes programmable GPU hardware 350, at least one processor 357, and an on-chip cache memory 360. Here, the on-chip cache memory 360 may be used as a tile buffer to buffer tiled data 362. The driver 308 includes a GPU API 306. Referring to FIG. 3, the GPU 302 is illustrated as a physical GPU component; however, the GPU 302 may correspond to a virtual GPU.
- The GPU 302 may access an external memory 320, which may be, for example, an external dynamic random access memory (DRAM). However, in some embodiments, it is desirable to reduce memory traffic to and from the external memory 320. For example, when a graphics processor renders a first RT, writes a result of the rendering to the external memory 320, and then reads the result of the rendering from the external memory 320 so as to generate a second RT, these processes generate a lot of traffic to and from the external memory 320. In addition, the processes may include rendering with respect to unnecessary portions included in intermediate images.
- For example, in order to maintain intermediate results or data in the on-chip cache memory 360 and to reduce the number of accesses to the external memory 320, support is provided in the driver 308 so as to interleave graphics shader operations and compute shader operations. Portions of the programmable GPU hardware 350 may be optionally modified with a hardware scheduling support 355 so as to optimize the execution of the interleaved scheduling in the GPU 302.
- Graphics rendering operations are performed in a graphics pipeline including fixed function logic, one or more graphics shaders, and compute shaders. The graphics shader is a program that runs as a part of the graphics rendering operation and is used to do shading; it processes pixels, vertices, patches, or primitives. The compute shader is a program used for computing arbitrary information and provides more flexibility. Here, the flexibility indicates a characteristic of software which may be easily changed according to different types of machines and various demands from a user. With the addition of a compute shader that supports the OpenGL-ES™ 3.1 Application Programming Interface (API), benchmarks are moving image post-processing operations like motion blur, depth of field, scaling, and filtering from graphics shaders to compute shaders. In the OpenGL™ standards, a compute shader is used in computing arbitrary information. In general, compute shaders operate in an abstract space. However, a compute shader may have limitations on a work group size (the smallest amount of compute operations) and a local size. In this regard, the local size is defined by the number of invocations of the shader which may take place within each work group. OpenGL™ also defines rules for shared variables and memory barriers. The memory barriers ensure that all memory transactions issued before the barrier complete before processing continues.
- Compute shaders offer more flexibility and features that may improve the efficiency of post-processing operations. For example, the compute shader gives more freedom in how data is accessed or written. However, the API does not guarantee memory consistency without the use of a global memory barrier (e.g., a glMemoryBarrier).
- A first analysis module 312 analyzes API graphics shader calls and compute shader calls. In order to construct a directed acyclic graph (DAG) that defines dependencies between images and tiles, a second analysis module 314 uses the first analysis module 312. Examples of the DAG are illustrated in FIGS. 4A, 4B, and 4C. At run time, an API dependency graph (corresponding to the DAG) is built up in the driver 308 so as to detect cases where graphics calls and compute calls may be interleaved. As described in connection with FIG. 6, API calls are grouped to build a sequence of interleaved execution of graphics shader operations and compute shader operations.
- A scheduling module 330 determines a schedule of interleaved operations. For example, at shader compilation time, the data access pattern of image load/store operations in the compute shader (or the compute shaders) is analyzed so as to determine whether the image load/store operations are candidates for interleaving. Certain types of data access patterns have characteristics that are compatible with interleaving. For example, if the data access pattern is a statically known strided pattern in a one-dimensional (1D) or two-dimensional (2D) space, the compute shader is a candidate for interleaving. A strided pattern is a sequence of memory accesses to addresses separated from each other by a stride length. A compute shader with a statically known strided pattern has an order of processing that is generally similar in shape to a tile pattern, which facilitates converting the compute shader to have a tiled access. However, in principle, other types of data access patterns may also be suitable candidates for interleaving. Operations of the scheduling module 330 are described in greater detail with reference to FIGS. 5, 6, and 9.
- For example, the scheduling module 330 determines sets of interleaved tiled graphics shader operations and compute shader operations on interdependent RTs so as to reduce external memory traffic. The operation by the scheduling module 330 may be performed by maintaining at least some intermediate results of the interleaved operations in the on-chip memory 360. An example of the operation of scheduling the interleaved tiled graphics shader operations and the compute shader operations so as to reduce the external memory traffic is described in more detail with reference to FIG. 5.
- For example, support is provided in the driver 308 and a compiler (not shown) so as to automatically convert a compute shader (or compute shaders) to operate with a tiled access pattern by a conversion module 340. The compute shaders are converted to operate on tiles so as to generate a tiled equivalent version. Such conversion includes recompiling, by using the compiler, the compute shader to adapt it to perform a tiled memory access in a tile format compatible with that of the tiled memory access of a graphics shader. For example, such conversion includes redefining the workgroup dimension of the compute shader to be an integer divisor of a tile's width and height. Additionally, image load instructions may be replaced with tile buffer load instructions. If feasible, for the interleaved graphics shader computations and the compute shader operations, a removal module 342 may remove the memory barrier.
- For example, the graphics processing system 300 automatically (e.g., without human intervention) interleaves the processing of graphics shaders and compute shaders in order to reduce or eliminate writing the intermediate data to an external (off-chip) memory 320. The driver 308 re-orders the API calls and performs any necessary recompiling on the compute shaders so as to process, by using the driver 308 (and the support of a compiler software stack), existing applications written to graphics APIs having compute shaders. -
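The candidate test for a statically known strided pattern may be pictured, in rough outline, as follows (a hypothetical sketch; the actual analysis runs inside the compiler on the shader's load/store instructions rather than on an address list):

```python
def constant_stride(addresses):
    """Return the stride if the accesses form a constant-stride (1D
    strided) pattern, or None for an irregular pattern."""
    if len(addresses) < 2:
        return None
    stride = addresses[1] - addresses[0]
    if all(b - a == stride for a, b in zip(addresses, addresses[1:])):
        return stride
    return None

# A regular pattern (every 16th texel) resembles a tile traversal and
# is an interleaving candidate; an irregular pattern is not.
s1 = constant_stride([0, 16, 32, 48])   # stride 16 -> candidate
s2 = constant_stride([0, 16, 40, 48])   # irregular -> not a candidate
```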
FIGS. 4A through 4C illustrate examples of a DAG.
- Referring to FIG. 4A, one embodiment of a data flow graph (or a data structure equivalent thereto) that is generated by the second analysis module 314 is shown. The data flow graph of FIG. 4A shows the immediate dependency between a set of two or more images, such as images A and B, and may also include other subsequent images (e.g., C, D). FIG. 4A corresponds to a DAG defining dependencies between different RTs. The interdependence of different RT images may be simple (e.g., the image B directly flows from the image A) or it may involve more complex interdependent relationships (e.g., an image may depend on more than one image). More generally, as the image A and the image B are used to generate the image C as illustrated in FIG. 4B, one or more images might depend on several others at the same time. Thus, while there may be a simple sequence of one RT feeding (e.g., being used as input) into another RT, the interdependence therebetween may generally be more complex. For example, two graphics shader RTs may be required so as to generate a compute shader RT. In addition, dependency at a tile level may also be complex. For example, referring to FIG. 4C, a tile in a given image B may depend on more than one tile in a previous image (e.g., an image A). In addition, there may be individual tiles in a given RT which are not used by any tiles in a subsequent RT.
- For example, the first analysis module 312 and the second analysis module 314 of the driver 308 examine a stream of commands and then determine whether the image of the RT B directly or indirectly (e.g., downstream) depends on an RT A. When it is determined that the RT B depends on the RT A, the driver 308 determines how many levels of the dependency graph are required to be followed so as to reduce external memory accesses. Following every level of the dependency graph may not be required in all cases to achieve a reduction in the external memory accesses. Moreover, following every level of a complex dependency graph consumes processing resources. Thus, the second analysis module 314 limits the number of levels to which the dependency graph is followed. If the dependency graph is complex, a rule limiting the number of levels to which the dependency graph is followed may be applied thereto. Afterward, the scheduling module 330 generates an interleaving schedule so as to attempt to keep intermediate results in an on-chip cache memory (e.g., a tile buffer).
- In addition, the DAG maps dependencies at a tile level between tiles rendered by graphics shader operations and tiles operated on by compute shaders. The generation of the DAG may include back-projecting each output tile to the input tile (or input tiles) required to generate it. For example, as illustrated in FIG. 4C, the image A may be used in generating the image B, which in turn may be used in generating the image C. There is a dependency of images and a dependency of tiles from the images. For example, as illustrated in FIG. 4C, a single tile in a given image may depend on more than one tile from another image. -
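The back-projection step just described can be sketched as follows (a hypothetical model in which each output pixel reads input pixels within a fixed radius; the real footprint would come from the shader's analyzed access pattern):

```python
def back_project_tile(tx, ty, tile_w, tile_h, radius):
    """Return the input-tile coordinates that an output tile (tx, ty)
    depends on, assuming each output pixel reads input pixels within
    `radius` pixels of its own position."""
    x0 = max(tx * tile_w - radius, 0)
    x1 = (tx + 1) * tile_w - 1 + radius
    y0 = max(ty * tile_h - radius, 0)
    y1 = (ty + 1) * tile_h - 1 + radius
    return {(ix, iy)
            for iy in range(y0 // tile_h, y1 // tile_h + 1)
            for ix in range(x0 // tile_w, x1 // tile_w + 1)}

# With no filter radius a tile depends only on itself; a small radius
# makes an interior tile depend on its eight neighbours as well, as in
# the one-to-many dependency of FIG. 4C.
```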
FIG. 5 illustrates an example in which intermediate results are maintained in the on-chip memory 360.
- FIG. 5 illustrates the example in which the intermediate computations of a sequence of operations are maintained in the on-chip cache memory 360 used as a tile buffer. In this example, the second analysis module 314 determines dependencies between an RT A, an RT B, an RT C, and an RT D. The scheduling module 330 determines a schedule to write tiles of the RT A to the on-chip cache memory 360 and then read them to the RT B. The scheduling of the processing of writing and reading the tiles may be selected to generate an interleaved schedule such that, after a tile of an image A is produced and is stored in the on-chip cache memory 360, the tile is "immediately read back" so as to produce a tile of the RT B; thus, memory bandwidth may be saved. It will be appreciated that "immediately read back" (also referred to as "directly consumed" or "consumed immediately") may include processing intervening operations between the storing of the tile of the image A in the on-chip memory 360 and the reading of the stored tile from the on-chip memory 360 so as to produce a tile of the RT B. For example, "immediately read back" may correspond to reading the tile of the image A from the on-chip memory 360 rather than reading the tile from an external memory 320 so as to produce the tile of the RT B.
- The interleaved schedule may proceed in an order matching the dependency analysis performed by the second analysis module 314. The dependency analysis may be selected to optimize use of the on-chip cache memory 360, so that the need to use the external memory 320 for the intermediate computations may be minimized. In addition, the dependency analysis performed by the second analysis module 314 may be used in the scheduling scheme for eliminating processing of tiles that do not contribute to a tile of a final image. By doing so, unnecessary work may be eliminated. After a sequence of the intermediate operations is performed, the final image may be rendered and then may be sent to the external memory 320 or may be output for display. While it is desirable to completely maintain all intermediate results in the on-chip cache memory 360, in general, even maintaining a subset of intermediate data results in the on-chip cache memory 360 may be beneficial in reducing the number of external memory accesses that are required to perform graphics processing. Further, it is possible to reduce tile sizes so as to minimize the amount of intermediate data that must be maintained, which in effect allows for an even finer grain of interleaving.
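The elimination of non-contributing tiles can be pictured as a backward walk over the tile-level DAG (a sketch with a hypothetical dependency map; only tiles reachable from the final image's tiles are kept):

```python
def contributing_tiles(final_tiles, deps):
    """Walk the tile-level dependency map backwards from the tiles of
    the final image; any tile never visited need not be processed.
    `deps` maps a tile id to the set of tile ids it reads from."""
    live, stack = set(), list(final_tiles)
    while stack:
        tile = stack.pop()
        if tile not in live:
            live.add(tile)
            stack.extend(deps.get(tile, ()))
    return live

# Hypothetical ids: the final tile reads "b0", which reads "a0"/"a1".
# A tile of RT A that nothing downstream reads never enters the live set.
deps = {"final0": {"b0"}, "b0": {"a0", "a1"}}
live = contributing_tiles({"final0"}, deps)
```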
images -
FIG. 6 illustrates a flowchart for describing a method of interleaving graphics and compute operations, according to an embodiment.
- In operation 610, the first analysis module 312 processes API calls and finds interleaving candidates. API call processing is deferred and queued so that the second analysis module 314 may build a DAG that defines dependencies. In tile-based deferred rendering (TBDR) processing for a graphics-only workload, the driver 308 uses certain events, such as a render target change or a memory barrier, in order to determine when to start processing deferred API calls. In a case where graphics and compute calls are mixed, the removal module 342 determines whether one or more memory barriers can be safely removed by interleaving the graphics calls with the compute calls in a tiled manner (e.g., the memory barrier can be safely removed if the graphics processing for the tile ends before the compute processing occurs). For example, a determination of when interleaving is allowed is based on an analysis of memory access patterns of image load/store operations in the compute shader during compilation time. If the memory access pattern is a statically known strided pattern in a 1D or 2D space, then the compute shader becomes a candidate for interleaving.
- In operation 620, the conversion module 340 recompiles the compute shader for tiled access patterns. For example, a workgroup dimension of the compute shader is redefined to be an integer divisor of a tile's width and height. Alternatively or in addition, the tile's width and height may be modified. These modifications ensure that the range of access fits within the defined tile dimensions, so all data may be stored in the on-chip memory 360. For example, image load instructions may be replaced with tile buffer load instructions. This may include replacing instructions used to compute the target address with instructions that index into a tile buffer.
- In operation 630, the scheduling module 330 groups API calls so as to build a sequence of interleaved execution of graphics and compute operations. For example, the scheduling module 330 determines whether interleaving is feasible. The scheduling module 330 removes the memory barrier based on a result of determining whether interleaving is feasible.
- In operation 640, the scheduling module 330 schedules the interleaved execution of each tile. The output of a graphics shader for a given tile is stored in the on-chip memory 360 (or another on-chip memory such as a data register or a buffer). For example, data is immediately read back by a subsequent compute shader, so that unnecessary external memory accesses are eliminated. -
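The workgroup redefinition in operation 620 can be illustrated with a small helper (a sketch under the assumption that any divisor of the tile edge up to a hardware limit is acceptable; a real compiler would weigh further constraints such as SIMD width):

```python
def largest_divisor_up_to(n, limit):
    """Largest divisor of n that does not exceed limit (n, limit >= 1)."""
    return next(d for d in range(min(n, limit), 0, -1) if n % d == 0)

def tiled_workgroup(tile_w, tile_h, max_wg_w, max_wg_h):
    """Pick workgroup dimensions that are integer divisors of the
    tile's width and height, so whole workgroups cover the tile
    exactly and every access stays inside the on-chip tile buffer."""
    return (largest_divisor_up_to(tile_w, max_wg_w),
            largest_divisor_up_to(tile_h, max_wg_h))

# A 32x32 tile with a 16x16 workgroup limit yields 16x16 workgroups,
# i.e. exactly four workgroups per tile, none straddling a tile edge.
```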
FIG. 7 illustrates a flowchart for describing a method of interleaving graphics and compute operations, according to another embodiment.
- In operation 710, the second analysis module 314 determines dependencies between images and tiles in order to build a DAG. Candidates for interleaving of graphics calls and compute calls on a tiled basis are identified in operation 720. For example, the scheduling module 330 identifies the candidates for interleaving. In operation 730, an associated candidate compute shader (or compute shaders) is adapted to be a tiled equivalent having a tiled access pattern by the conversion module 340. For example, the conversion module 340 recompiles the compute shader for tiled access patterns. In operation 740, interleaving of the combined graphics shader and compute shader is scheduled by the scheduling module 330 in a tiled manner so as to reduce traffic to an external memory by maintaining at least some of the intermediate data results in the on-chip memory 360 (e.g., a tile buffer). -
FIG. 8 shows an example of analyzing a sequence of API calls (e.g., OpenGL™ API calls) for interleaving graphics calls followed by compute calls for a graphics shader 830 and a compute shader 850.
- The first analysis module 312 analyzes API graphics calls 810 and API compute calls 820 in order to determine whether or not graphics and compute calls can be interleaved. When the compute call 820 is encountered, the second analysis module 314 determines the graphics calls on which the compute call is dependent. A render target 840 is an intermediate memory surface to which a 3D image is rendered. -
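The determination made when a compute call is encountered might look like this in outline (a hypothetical sketch in which calls are simple records naming the render targets they read and write, which is far simpler than real API-call tracking):

```python
def producers_of(compute_call, earlier_calls):
    """Return the earlier graphics calls whose written render target is
    read by `compute_call`; these are the calls the compute call
    depends on, and hence candidates to interleave it with."""
    reads = set(compute_call["reads"])
    return [c for c in earlier_calls
            if c["kind"] == "graphics" and c["writes"] in reads]

# Hypothetical call records: a blur compute pass reads RT_A only.
draw_a = {"kind": "graphics", "writes": "RT_A", "reads": set()}
draw_b = {"kind": "graphics", "writes": "RT_B", "reads": set()}
blur = {"kind": "compute", "writes": "RT_C", "reads": {"RT_A"}}
deps = producers_of(blur, [draw_a, draw_b])
```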
FIG. 9 illustrates an example in which graphics and compute calls are interleaved by an interleaved scheduling module, according to an embodiment.
- A stride pattern that a compute shader 950 will need may be determined by monitoring memory access patterns of the load/store operations of the compute shader 950. A graphics shader 930 may generate tiles in the order in which the compute shader 950 consumes the tiles. For example, once a tile 910 has been produced, the compute shader 950 processes the tile 910. While the compute shader 950 processes the tile 910, the graphics shader 930 produces a tile 920. Then, while the compute shader 950 processes the tile 920, the graphics shader 930 produces another tile. As a result, the processing of the tiles is interleaved between the graphics shader 930 and the compute shader 950.
- For example, the rendering of a graphics frame may include a mixture of graphics and compute commands. A graphics rendering engine may mix graphics operations and compute operations for many stages of post-processing and graphics calculations. In particular, without interleaving some stages of processing, intermediate results may be generated, written to a memory (e.g., an external memory), and read back later so as to generate a final resulting image or a next intermediate image. As an example scenario, the compute shader operations may include lighting calculations based on parameter data from a G-buffer (generated due to graphics rendering operations), a motion blur, a depth of field (DOF), compositing, and anti-aliasing. By tiling the compute shader operations, data production and consumption between graphics and compute operations may be interleaved at a tile level. By doing so, data may be retained in the on-chip memory 360, and some graphics applications may eliminate some or even most traffic to an external memory. Therefore, power efficiency and performance may be significantly improved. In addition, an automatic (e.g., without human intervention) removal of explicit global memory barriers also improves performance.
- In a number of graphics applications, a graphics shader is followed by a sequence of compute shader operations. However, in one embodiment, the compute shader may be followed by the graphics shader. In this embodiment, a similar analysis may be applied to a case where the compute shader writes an output to an external memory and the graphics shader loads the data from the external memory. The compute shader and the graphics shader may be interleaved by recompiling the compute shader so as to output the data to the on-chip buffer. The graphics shader consumes the data from the on-chip buffer and processes it in a tiled manner.
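The alternating order of FIG. 9, in which the graphics shader produces tile t+1 while the compute shader consumes tile t, can be sketched as a schedule generator (hypothetical stage names; the real ordering is produced by the scheduling module 330):

```python
def interleaved_schedule(num_tiles):
    """List (stage, tile) steps in which compute on tile t overlaps
    graphics production of tile t + 1 (software pipelining)."""
    steps = [("graphics", 0)]
    for t in range(num_tiles - 1):
        # On hardware these two steps overlap: the compute shader reads
        # tile t straight from the on-chip buffer while the graphics
        # shader renders tile t + 1.
        steps += [("compute", t), ("graphics", t + 1)]
    steps.append(("compute", num_tiles - 1))
    return steps

# For three tiles the order is g0, c0, g1, c1, g2, c2: each tile is
# consumed immediately after it is produced, so no intermediate tile
# needs to touch external memory.
```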
- In another embodiment, caches are used instead of a tile buffer for intermediate data. In certain architectures, a direct access to a low-level on-chip cache being used as the tile-buffer may not be possible due to a lack of appropriate data paths. In such a case, the nearest cache level may be used instead. However, the energy and performance benefits may be reduced depending on proximity of the cache to compute logic within the GPU. If a next-level cache is used, then several changes can be made. For example, image load instructions cannot be changed to tile buffer load instructions. However, hints or directives may be issued to the cache so as to ensure that the data remains resident in the cache and is not swapped out to an external DRAM memory until the dependent calculations are completed. The range of access is selected to fit within the cache capacity, so that all working data may stay in the cache. In addition, an output of a graphics shader for a given tile is stored in a nearest cache. The data is consumed immediately by a subsequent compute shader, and any unnecessary access to a DRAM or a lower cache level is eliminated.
- As used herein, a module can correspond to a hardware component, a software component, or a combination thereof. For example, a module may include one or more processors (e.g., computer processors) and a data storage device including program instructions. The one or more processors may be configured by the instructions to function as a special purpose processor to perform one or more methods described herein. Software, hardware, and other modules may reside on servers, workstations, mobile devices, smart phones, wearable computers, personal computers, tablet computers, image data encoders, image data decoders, personal digital assistants (PDAs), video projectors, audio-visual receivers, displays (such as televisions (TVs)), digital cinema projectors, media reproducers, and other devices suitable for the purposes described herein. It is obvious to one of ordinary skill in the art that aspects of the system may be practiced as a standalone device or by a system of devices, such as, e.g., a client-server system.
- A software (or program) component may include any non-transitory medium which carries a set of computer-readable instructions which, when executed by a processor, cause a processor to execute one or more methods described herein. The software component may be in any of a wide variety of forms. For example, the program component may include non-transitory media such as magnetic data storage media including floppy diskettes, hard disk drives, optical data storage media including CD ROMs, DVDs, electronic data storage media including ROMs, flash RAM, EPROMs, hardwired or preprogrammed chips (e.g., EEPROM semiconductor chips), nanotechnology memory, or the like. The computer-readable instructions on the program components may optionally be compressed or encrypted.
- Example embodiments of software components may include (but are not limited to) firmware, middleware, operating system software, resident software, application software, microcode, and the like. Both hardware and software components may be centralized or distributed (or a combination thereof), in whole or in part, as known to those skilled in the art. For example, software components and other modules may be accessible via local memory, via a network, via a browser or other application in a distributed computing context or via other means suitable for the purposes described above.
- While the present disclosure has been described in conjunction with specific embodiments, it will be understood that it is not intended to limit the present disclosure to the described embodiments. On the contrary, it is intended to cover alternatives, modifications, and equivalents as may be included within the scope of the present disclosure as defined by the appended claims. The present disclosure may be practiced without some or all of these specific details. In addition, well known features may not have been described in detail to avoid unnecessarily obscuring the present disclosure. In accordance with the present disclosure, the components, process steps, and/or data structures may be implemented using various types of operating systems, programming languages, computing platforms, computer programs, and/or computing devices. In addition, one of ordinary skill in the art will recognize that devices such as hardwired devices, field programmable gate arrays (FPGAs), application specific integrated circuits (ASICs), or the like, may also be used without departing from the scope of the present disclosure described herein. The present disclosure may also be tangibly embodied as a set of computer instructions stored on a computer-readable medium such as a memory device.
Claims (15)
- A method of performing interleaving, the method comprising: processing a combination of tiled graphics shader operations and tiled compute shader operations of interdependent render targets; and interleaving the combination of the tiled graphics shader operations and the tiled compute shader operations, according to a result of the processing.
- The method of claim 1, wherein the interleaving comprises determining a schedule of a sequence of interleaved and tiled operations, and wherein the interleaved and tiled operations are selected such that at least one intermediate data result of a first operation of the sequence is directly read from an on-chip memory by a second operation of the sequence.
- The method of claim 1 or 2, wherein the interleaving comprises determining a schedule of a sequence of interleaved and tiled operations, and wherein the interleaved and tiled operations are selected such that traffic to an external memory associated with intermediate data results is reduced.
- The method of any preceding claim, wherein a data access pattern of the tiled graphics shader operations and the tiled compute shader operations is analyzed so as to determine whether a global memory barrier can be safely removed, as a condition for the interleaving and for removing memory barriers.
- The method of any preceding claim, wherein the interleaving is performed on a tile-by-tile basis and a compute shader operates with respect to a tiled access pattern.
- The method of claim 5, wherein a workgroup dimension of the compute shader is redefined so as to be an integer divisor of a width and a height of a tile.
- The method of claim 6, wherein the processing comprises replacing, for the compute shader, image load instructions with tile buffer load instructions.
- The method of any preceding claim, wherein the processing of the combination comprises analyzing application programming interface (API) calls and grouping the API calls so as to build a sequence of interleaved execution of the tiled graphics shader operations and the tiled compute shader operations.
- The method of any preceding claim, wherein the processing comprises generating a directed acyclic graph (DAG) of the interdependent render targets and interdependent tiles, and using the DAG so as to schedule interleaving of the tiled graphics shader operations and the tiled compute shader operations.
- The method of any preceding claim, wherein the interleaving comprises grouping API calls so as to form a sequence of interleaved execution of the tiled graphics shader operations and the tiled compute shader operations.
- The method of any preceding claim, wherein the processing comprises analyzing a data access pattern of load and store operations so as to determine candidates for the interleaving.
- The method of claim 11, wherein the processing comprises identifying a data access pattern of a statically predetermined strided pattern, as a candidate for the interleaving.
- A computer program comprising computer program code means adapted to perform all of the steps of any preceding claim when said program is run on a computer.
- Apparatus for performing interleaving, the apparatus comprising: a processor adapted to process a combination of tiled graphics shader operations and tiled compute shader operations of interdependent render targets; and an interleaving unit adapted to interleave the combination of the tiled graphics shader operations and the tiled compute shader operations, according to a result of the processing.
- The apparatus of claim 14, wherein the interleaving unit is adapted to perform the interleaving on a tile-by-tile basis and wherein a compute shader operates with respect to a tiled access pattern.
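The scheduling described in the claims (interdependent render targets ordered via a DAG, then interleaved tile by tile) can be sketched as follows. This is a hypothetical illustration, not the patented implementation: the target names (`gbuffer`, `lighting`), the pass labels, and the helper functions are all invented for the example.

```python
# Hypothetical sketch: order interdependent render targets with a DAG, then
# emit a tile-by-tile schedule interleaving their graphics and compute passes.

def topo_order(deps):
    """Topologically sort targets given {target: [targets it reads from]}."""
    order, seen = [], set()
    def visit(node):
        if node in seen:
            return
        seen.add(node)
        for producer in deps.get(node, []):
            visit(producer)  # producers are scheduled before consumers
        order.append(node)
    for node in deps:
        visit(node)
    return order

def interleaved_schedule(deps, passes, num_tiles):
    """All passes for tile i run before tile i+1 starts, so each tile's
    intermediate results can remain in on-chip memory between passes."""
    ordered = topo_order(deps)
    return [(tile, target, passes[target])
            for tile in range(num_tiles)
            for target in ordered]

# A graphics pass fills 'gbuffer'; a dependent compute pass ('lighting')
# consumes it, as in a typical deferred-shading pipeline.
deps = {"gbuffer": [], "lighting": ["gbuffer"]}
passes = {"gbuffer": "graphics", "lighting": "compute"}
schedule = interleaved_schedule(deps, passes, num_tiles=2)
# [(0, 'gbuffer', 'graphics'), (0, 'lighting', 'compute'),
#  (1, 'gbuffer', 'graphics'), (1, 'lighting', 'compute')]
```

Note how the producing and consuming passes for each tile are adjacent in the schedule, which is the property that lets intermediate results be read directly from on-chip memory rather than external DRAM.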
Applications Claiming Priority (3)
Application Number | Priority Date | Filing Date | Title |
---|---|---|---|
US201562171071P | 2015-06-04 | 2015-06-04 | |
US14/981,395 US10089775B2 (en) | 2015-06-04 | 2015-12-28 | Automated graphics and compute tile interleave |
KR1020160014084A KR102680271B1 (en) | 2015-06-04 | 2016-02-04 | Method and apparatus for performing an interleaving |
Publications (1)
Publication Number | Publication Date |
---|---|
EP3101650A1 true EP3101650A1 (en) | 2016-12-07 |
Family
ID=56409460
Family Applications (1)
Application Number | Title | Priority Date | Filing Date |
---|---|---|---|
EP16172382.0A Pending EP3101650A1 (en) | 2015-06-04 | 2016-06-01 | Method and apparatus for performing interleaving |
Country Status (1)
Country | Link |
---|---|
EP (1) | EP3101650A1 (en) |
Citations (7)
Publication number | Priority date | Publication date | Assignee | Title |
---|---|---|---|---|
US20030222870A1 (en) * | 1997-07-02 | 2003-12-04 | Mental Images G.M.B.H. & Co, Kg | System and method for generating and using systems of cooperating and encapsulated shaders and shader DAGs for use in a computer graphics system |
US20110050716A1 (en) * | 2009-09-03 | 2011-03-03 | Advanced Micro Devices, Inc. | Processing Unit with a Plurality of Shader Engines |
US20110148919A1 (en) * | 2009-12-17 | 2011-06-23 | Frode Heggelund | Graphics processing systems |
US20130069943A1 (en) * | 2011-09-19 | 2013-03-21 | Qualcomm Incorporated | Optimizing resolve performance with tiling graphics architectures |
US20130235057A1 (en) * | 2012-03-12 | 2013-09-12 | Ati Technologies, Ulc | Area-based dependency chain analysis of shaders and command stream |
US20130293546A1 (en) * | 2012-05-03 | 2013-11-07 | Samsung Electronics Co., Ltd. | Dynamic load balancing apparatus and method for graphic processing unit (gpu) |
US20140327671A1 (en) * | 2013-05-02 | 2014-11-06 | Arm Limited | Graphics processing systems |
Similar Documents
Publication | Publication Date | Title |
---|---|---|
CN106251392B (en) | Method and apparatus for performing interleaving | |
KR102475212B1 (en) | Foveated rendering in tiled architectures | |
US10403025B2 (en) | Automated graphics and compute tile interleave | |
US20170372450A1 (en) | Apparatus and method for non-uniform frame buffer rasterization | |
US10297003B2 (en) | Efficient saving and restoring of context information for context switches | |
US11803936B2 (en) | Tile assignment to processing cores within a graphics processing unit | |
US9881391B2 (en) | Procedurally defined texture maps | |
US9818170B2 (en) | Processing unaligned block transfer operations | |
CN109564694B (en) | Vertex shader for binning-based graphics processing | |
EP2732370B1 (en) | Instruction culling in graphics processing unit | |
KR102006584B1 (en) | Dynamic switching between rate depth testing and convex depth testing | |
KR20180056316A (en) | Method and apparatus for performing tile-based rendering | |
JP4637640B2 (en) | Graphic drawing device | |
KR20210002753A (en) | Compiler support technology to reduce the memory usage of the graphics pipeline | |
KR20170088687A (en) | Computing system and method for performing graphics pipeline of tile-based rendering thereof | |
CN107408293B (en) | Supporting multiple levels of nesting of command buffers in a graphics command stream at a computing device | |
TW202141417A (en) | Methods and apparatus for efficient multi-view rasterization | |
CN115516421A (en) | GPR (general purpose processor) optimization based on GPR release mechanism in GPU (graphics processing Unit) | |
EP3101650A1 (en) | Method and apparatus for performing interleaving | |
CN116457830A (en) | Motion estimation based on region discontinuity | |
KR20180015564A (en) | Method and apparatus for performing tile-based rendering | |
US11790478B2 (en) | Methods and apparatus for mapping source location for input data to a graphics processing unit | |
US12067666B2 (en) | Run-time mechanism for optimal shader | |
US20230377086A1 (en) | Pipeline delay elimination with parallel two level primitive batch binning |
Legal Events
Date | Code | Title | Description
---|---|---|---
| PUAI | Public reference made under article 153(3) EPC to a published international application that has entered the European phase | Free format text: ORIGINAL CODE: 0009012
| STAA | Information on the status of an EP patent application or granted EP patent | Free format text: STATUS: THE APPLICATION HAS BEEN PUBLISHED
| AK | Designated contracting states | Kind code of ref document: A1; Designated state(s): AL AT BE BG CH CY CZ DE DK EE ES FI FR GB GR HR HU IE IS IT LI LT LU LV MC MK MT NL NO PL PT RO RS SE SI SK SM TR
| AX | Request for extension of the European patent | Extension state: BA ME
| STAA | Information on the status of an EP patent application or granted EP patent | Free format text: STATUS: REQUEST FOR EXAMINATION WAS MADE
| 17P | Request for examination filed | Effective date: 20170606
| RBV | Designated contracting states (corrected) | Designated state(s): AL AT BE BG CH CY CZ DE DK EE ES FI FR GB GR HR HU IE IS IT LI LT LU LV MC MK MT NL NO PL PT RO RS SE SI SK SM TR
| STAA | Information on the status of an EP patent application or granted EP patent | Free format text: STATUS: EXAMINATION IS IN PROGRESS
| 17Q | First examination report despatched | Effective date: 20210113
| STAA | Information on the status of an EP patent application or granted EP patent | Free format text: STATUS: EXAMINATION IS IN PROGRESS
| P01 | Opt-out of the competence of the unified patent court (UPC) registered | Effective date: 20230520
| APBK | Appeal reference recorded | Free format text: ORIGINAL CODE: EPIDOSNREFNE
| APBN | Date of receipt of notice of appeal recorded | Free format text: ORIGINAL CODE: EPIDOSNNOA2E
| APBR | Date of receipt of statement of grounds of appeal recorded | Free format text: ORIGINAL CODE: EPIDOSNNOA3E
| APAV | Appeal reference deleted | Free format text: ORIGINAL CODE: EPIDOSDREFNE