US7318223B2 - Method and apparatus for a generic language interface to apply loop optimization transformations - Google Patents
Method and apparatus for a generic language interface to apply loop optimization transformations Download PDFInfo
- Publication number
- US7318223B2 US7318223B2 US10/926,601 US92660104A US7318223B2 US 7318223 B2 US7318223 B2 US 7318223B2 US 92660104 A US92660104 A US 92660104A US 7318223 B2 US7318223 B2 US 7318223B2
- Authority
- US
- United States
- Prior art keywords
- loop
- blocking
- directive
- marked
- computer program
- Prior art date
- Legal status (The legal status is an assumption and is not a legal conclusion. Google has not performed a legal analysis and makes no representation as to the accuracy of the status listed.)
- Expired - Fee Related, expires
Links
Images
Classifications
-
- G—PHYSICS
- G06—COMPUTING; CALCULATING OR COUNTING
- G06F—ELECTRIC DIGITAL DATA PROCESSING
- G06F8/00—Arrangements for software engineering
- G06F8/40—Transformation of program code
- G06F8/41—Compilation
- G06F8/44—Encoding
- G06F8/443—Optimisation
Definitions
- the present invention relates to an improved data processing system.
- the present invention relates to loop optimization transformations.
- the present invention relates to a generic language interface that allows programmers to apply loop optimization transformations on loops in data processing system programs.
- processors execute program instructions by first loading the instructions from memory, which may either be a cache memory or main memory.
- Main memory is a storage device used by the computer system to hold currently executing program instructions and working data.
- An example of main memory is random access memory (RAM).
- Cache memory is a fast memory that holds recently accessed data, designed to speed up subsequent access to the same data.
- the cache memory saves a copy along with associated main memory address.
- the cache memory also monitors addresses of subsequent reads to see if requested data is already stored in the cache. If it is, a cache hit occurs and the data is returned immediately. Otherwise, a cache miss occurs and the data is fetched from main memory and saved in the cache.
- level one cache is smaller in size and located closer to the processor, which provides faster access time.
- level two cache is larger in size and provides slower access time than level one cache.
- level one cache may locate in close proximity with the processor
- level two cache may be located further away from the processor. If an attempt made to access data from the level one cache fails, the processor often steps up to the level two cache or higher to access the same data. Thus, a system may have several levels of cache memory that catch lower level cache misses before attempting to access from main memory.
- Cache memory relies on two properties when accessing program data: temporal locality and spatial locality.
- Temporal locality addresses frequency of data access. If data is accessed once, the same data is likely to be accessed again soon.
- Spatial locality addresses the location of data in memory. If a memory location is accessed then nearby memory locations are likely to be accessed.
- cache memory often operates on several words at a time, which is known as a cache line or cache block.
- main memory reads and writes in terms of a number of cache lines or cache blocks.
- attempts have been made to reduce cache miss rate in computer systems. These attempts include utilizing larger block size, cache size, and pre-fetching instructions. However, these attempts require associated hardware changes.
- the present invention provides a method, apparatus and computer instructions for a generic language interface to apply a number of loop optimization transformations.
- the present invention detects at least one directive in a computer program, generates at least one loop transformation based the at least one directive, and places at least one loop transformation in the computer program.
- FIG. 1 is a pictorial representation of a data processing system in which the present invention may be implemented in accordance with a preferred embodiment of the present invention
- FIG. 2 is a block diagram of a data processing system in which the present invention may be implemented
- FIG. 3 is a diagram illustrating an exemplary implementation of components in accordance with the present invention.
- FIG. 4 is a diagram illustrating relationships between compilers and users in a preferred embodiment of the present invention.
- FIG. 5 is a diagram illustrating the concept of loop interchange in accordance with a preferred embodiment of the present invention.
- FIG. 6 is a diagram illustrating the concept of strip mining in accordance with a preferred embodiment of the present invention.
- FIG. 7 is a diagram illustrating the concept of loop tiling in accordance with a preferred embodiment of the present invention.
- FIG. 8 is a diagram illustrating an exemplary implementation of BLOCK_LOOP directive for the purpose of strip mining in accordance with a preferred embodiment of the present invention
- FIG. 9 is a diagram illustrating an exemplary implementation of BLOCK_LOOP and LOOPID directives for the purpose of loop tiling using machine's actual cache size in accordance with a preferred embodiment of the present invention
- FIG. 10 is a diagram illustrating an exemplary implementation of BLOCK_LOOP and LOOPID directives for the purpose of loop tiling with multiple loop identifiers in accordance with a preferred embodiment of the present invention
- FIG. 11 is a diagram illustrating an exemplary implementation of BLOCK_LOOP and LOOPID directives for the purpose of loop interchange in accordance with a preferred embodiment of the present invention
- FIG. 12A is a diagram illustrating an exemplary implementation of BLOCK_LOOP and LOOPID directives for the purpose of loop tiling for multi-level memory hierarchy in accordance with a preferred embodiment of the present invention
- FIG. 12B is a diagram illustrating exemplary resulting code after BLOCK_LOOP directive 1208 in FIG. 12A is processed by the compiler is depicted in accordance with a preferred embodiment of the present invention
- FIG. 12C is a diagram illustrating exemplary resulting code after BLOCK_LOOP directive 1202 in FIG. 12B is processed by the compiler in accordance with a preferred embodiment of the present invention.
- FIG. 13 is a diagram illustrating concept of fitting lower level cache memory into higher level cache memory in accordance with a preferred embodiment of the present invention.
- a computer 100 which includes system unit 102 , video display terminal 104 , keyboard 106 , storage devices 108 , which may include floppy drives and other types of permanent and removable storage media, and mouse 110 . Additional input devices may be included with personal computer 100 , such as, for example, a joystick, touchpad, touch screen, trackball, microphone, and the like.
- Computer 100 can be implemented using any suitable computer, such as an IBM eServer computer or IntelliStation computer, which are products of International Business Machines Corporation, located in Armonk, N.Y. Although the depicted representation shows a computer, other embodiments of the present invention may be implemented in other types of data processing systems, such as a network computer. Computer 100 also preferably includes a graphical user interface (GUI) that may be implemented by means of systems software residing in computer readable media in operation within computer 100 .
- GUI graphical user interface
- Data processing system 200 is an example of a computer, such as computer 100 in FIG. 1 , in which code or instructions implementing the processes of the present invention may be located.
- Data processing system 200 employs a peripheral component interconnect (PCI) local bus architecture.
- PCI peripheral component interconnect
- AGP Accelerated Graphics Port
- ISA Industry Standard Architecture
- Processor 202 and main memory 204 are connected to PCI local bus 206 through PCI bridge 208 .
- PCI bridge 208 also may include an integrated memory controller and cache memory for processor 202 .
- PCI local bus 206 may be made through direct component interconnection or through add-in connectors.
- local area network (LAN) adapter 210 small computer system interface SCSI host bus adapter 212 , and expansion bus interface 214 are connected to PCI local bus 206 by direct component connection.
- audio adapter 216 graphics adapter 218 , and audio/video adapter 219 are connected to PCI local bus 206 by add-in boards inserted into expansion slots.
- Expansion bus interface 214 provides a connection for a keyboard and mouse adapter 220 , modem 222 , and additional memory 224 .
- SCSI host bus adapter 212 provides a connection for hard disk drive 226 , tape drive 228 , and CD-ROM drive 230 .
- Typical PCI local bus implementations will support three or four PCI expansion slots or add-in connectors.
- An operating system runs on processor 202 and is used to coordinate and provide control of various components within data processing system 200 in FIG. 2 .
- the operating system may be a commercially available operating system such as Windows XP, which is available from Microsoft Corporation.
- An object oriented programming system such as Java may run in conjunction with the operating system and provides calls to the operating system from Java programs or applications executing on data processing system 200 . “Java” is a trademark of Sun Microsystems, Inc. Instructions for the operating system, the object-oriented programming system, and applications or programs are located on storage devices, such as hard disk drive 226 , and may be loaded into main memory 204 for execution by processor 202 .
- FIG. 2 may vary depending on the implementation.
- Other internal hardware or peripheral devices such as flash read-only memory (ROM), equivalent nonvolatile memory, or optical disk drives and the like, may be used in addition to or in place of the hardware depicted in FIG. 2 .
- the processes of the present invention may be applied to a multiprocessor data processing system.
- data processing system 200 may not include SCSI host bus adapter 212 , hard disk drive 226 , tape drive 228 , and CD-ROM 230 .
- the computer to be properly called a client computer, includes some type of network communication interface, such as LAN adapter 210 , modem 222 , or the like.
- data processing system 200 may be a stand-alone system configured to be bootable without relying on some type of network communication interface, whether or not data processing system 200 comprises some type of network communication interface.
- data processing system 200 may be a personal digital assistant (PDA), which is configured with ROM and/or flash ROM to provide non-volatile memory for storing operating system files and/or user-generated data.
- PDA personal digital assistant
- data processing system 200 also may be a notebook computer or hand held computer in addition to taking the form of a PDA.
- data processing system 200 also may be a kiosk or a Web appliance.
- processor 202 uses computer implemented instructions, which may be located in a memory such as, for example, main memory 204 , memory 224 , or in one or more peripheral devices 226 - 230 .
- the present invention provides a method, apparatus and computer instructions for a generic language interface to apply a number of loop optimization transformations.
- loop optimization transformations include loop tiling, strip mining, and loop interchange.
- the present invention provides a generic language interface. Programmers may use this interface to direct a compiler to perform a number of loop transformations.
- the generic language interface includes two new user directives: BLOCK_LOOP directive and LOOP_ID directive.
- a directive is a form of comment that passes specific information to the compiler.
- a user directive is a directive placed in the source code by a user, whom may be, for example, a programmer.
- Programmers may employ user directives of the present invention to direct a compiler to apply loop optimization transformations while directing other compilers to ignore the user directives as the directives are treated as comments by other compilers. In this way, programmers may control the loop transformations process without modifying existing program instructions.
- directives of the present invention may be defined in languages, such as, for example, Fortran, C, and C++.
- directives may be used for compilers, such as, for example, IBM XL Fortran Compiler and IBM XL C/C++ compiler, which are products available from International Business Machines Corporation.
- compilers such as, for example, IBM XL Fortran Compiler and IBM XL C/C++ compiler, which are products available from International Business Machines Corporation.
- directives of the present invention may also be used with other languages and compilers.
- a BLOCK_LOOP directive is a directive that directs the compiler to create a blocking loop for a given loop in a loop nest. “Blocking” is dividing the iteration space of a loop into blocks. These blocks are also referred to as “tiles”. A blocking loop is an outer loop that is created to drive the original loop for each block. When the compiler detects a BLOCK_LOOP directive, the compiler creates the blocking loop to surround the blocked loop.
- a user may specify one or more parameters for a BLOCK_LOOP directive.
- the first parameter of the BLOCK_LOOP directive is the blocking factor.
- the blocking factor is the size of the block to which the iteration space is divided, such as, a block of 50 ⁇ 50.
- Second and subsequent parameters represent the name of one or more loops to be blocked, for example, myLoop 1 and myLoop 2 .
- An example BLOCK_LOOP directive using the above parameters may be BLOCK_LOOP ( 50 , myLoop 1 , myLoop 2 ). If no second or subsequent parameters are defined in the BLOCK_LOOP directive, the compiler is directed to block the loop immediately following the directive.
- the LOOPID directive in the illustrative embodiments directs the compiler to assign a loop with a scope-unique loop identifier.
- the loop identifier may then be used with other directives, such as BLOCK_LOOP directive, to control transformations on the loop or by a reporting facility to report on transformations and other information relating to a specific loop.
- BLOCK_LOOP directive assigns a name for the loop.
- the LOOPID directive includes one parameter, which is the name of the loop.
- An example LOOPID directive may be LOOPID (myLoop 1 ).
- the mechanism of the present invention can relate to two loops at the same time and create different behaviors by creating blocking loops at different nesting levels.
- these behaviors may include other loop transformations, such as, for example, loop interchange, loop tiling or strip mining. These loop transformations are discussed in detail below.
- FIG. 3 a diagram illustrating an exemplary implementation of components 202 , 204 and 208 in FIG. 2 is depicted in accordance with the present invention.
- processor 202 and main memory 204 in FIG. 2 may be implemented as processor 300 and main memory 310 in the FIG. 3 .
- PCI bridge 208 in FIG. 2 may include two or more levels of cache memory, In this example, level 1 cache 304 , and level 2 cache 306 are depicted.
- Level 1 cache 304 may be a fast memory chip that includes a small memory size, such as 64 kilobytes.
- level 1 cache 304 is sometimes referred to as a “primary cache”.
- This cache is located between the processor, such as processor 300 , and level 2 cache 306 .
- level 1 cache 304 may be integrated on the same integrated circuit as processor 300 .
- Level 1 cache 306 also is more expensive compared to level 2 cache 304 , because of its faster access speed.
- Level 2 cache 306 a secondary cache, is larger and slower than level 1 cache 304 .
- Level 2 cache 306 generally locates between the level 1 cache 304 and main memory 310 . When cache misses occur in level 1 cache 306 , processor 300 may attempt to retrieve data from level 2 cache 306 prior to searching for the data in main memory 310 . Unlike level 1 cache 304 , level 2 cache 306 is often located external to the integrated circuit of processor 300 . Level 2 cache 306 also is cheaper to produce compared to level 1 cache 304 , because of its slower access speed. In addition to level 1 and level 2 caches, other levels may also be added to PCI bridge 208 in FIG. 2 , for example, level 3 cache 308 , which is even larger in size than level 2 cache 306 and has slower access time.
- FIG. 4 a diagram illustrating relationships between compilers and users is depicted in a preferred embodiment of the present invention.
- a user such as a programmer, may define directives 401 in source code 400 .
- Directives are in the form of comments.
- directives may or may not be ignored by compiler 402 .
- a user may compile source code 400 through user commands 403 .
- User commands 403 may be a user input command with parameters.
- an optimized program with loop transformations applied by the compiler is generated as a set of machine language instructions, such as machine language instructions 404 .
- the machine language instructions may be for a specific platform, such as, a UNIX platform. A user may execute these instructions on the specific platform with reduced cache misses and, thus, faster execution times.
- the loop transformations include for example, loop interchange, loop tiling, and blocking.
- FIG. 5 a diagram illustrating the concept of loop interchange is depicted in accordance with a preferred embodiment of the present invention.
- Loop 502 or j loop
- Loop 504 or k loop
- Loop 502 is an inner loop that iterates from 1 to 10.
- Loop 504 or k loop
- loop 502 is an outer loop that iterates loop 502 from 1 to 5.
- Block 506 illustrates how loop 502 and loop 504 iterate before loop interchange occurs.
- the j loop, loop 502 iterates for every iteration of the k loop, loop 504 .
- the elements in the minor dimension are stored close to each other in memory. Hence, when an element is accessed, a number of neighboring elements are stored along with it in the cache to form a cache line.
- a cache miss may occur for almost every element accessed, since the elements being accessed may be sparsely distributed in memory.
- Loop Interchange allows changing the iteration order to be more compatible with the data's placement in memory.
- the j loop, or loop 502 is placed outside of the k loop, or loop 504 , represented by loop 522 .
- the order of the loop 502 and the loop 504 is interchanged.
- the k loop or loop 522 still iterates 5 times, as illustrated by block 528 - 532 .
- the j loop or loop 524 iterates loop 522 10 times, as illustrated by block 526 .
- the processor may now access elements that are in memory locations closer to each other, rather than having to fetch elements from sparse memory locations.
- FIG. 6 a diagram illustrating the concept of strip mining is depicted in accordance with a preferred embodiment of the present invention.
- loop 602 iterates i from 1 to 7, as illustrated by block 601 .
- loop 602 is divided into two loops, loop 604 and 606 .
- Loop 604 is similar to loop 602 , which iterates from 1 to 7, but with a new index, ii, assigned. In addition, a step size of 3 is added to the loop.
- the original, loop 602 is modified to loop 606 . This loop iterates the original index i from ii to minimum of (ii+2, 7).
- loop 602 is strip mined into two chunks of three elements and a single element, as illustrated by blocks 603 and 605 . Strip mining therefore fragments a large iteration space into smaller segments.
- loop tiling or loop blocking combines the techniques of loop interchange and strip mining. Loop tiling is similar to strip mining in two or more dimensions, with interchange applied to the blocking loops.
- loop tiling or blocking The main purpose of loop tiling or blocking is to eliminate as many cache misses as possible by transforming memory domain into smaller chucks, rather than sequentially traversing through the entire memory domain. Each chuck of memory should be small enough to fit all the data for a given computation into the cache, thereby maximizing data reuse.
- loop tiling combines strip mining and loop interchange to form small tiles of loop iterations in order to increase spatial locality.
- loop 702 iterates i from 1 to N1 720 , which is an upper bound of the i loop.
- Loop 704 iterates j from 1 to N2 722 , which is an upper bound of the j loop.
- blocking loop 706 is created for loop 702 and loop 702 is modified to become loop 708 .
- Loop 706 is the same as loop 702 , which iterates from 1 to N1 720 , except with an index ii 724 and a step size of Bi 726 .
- blocking loop 710 is created for loop 704 and loop 704 is modified to loop 712 .
- Loop 710 is the same as loop 704 , which iterates from 1 to N2 722 , except with an index jj 728 and a step size of Bj 730 .
- Bi 726 and Bj 730 are blocking factors, which define the bound of the iteration space or the size of a tile. Index ii and jj are added to subdivide the ranges of index i and j into smaller ranges to iterate elements within the tile, in this example, shaded tile Bi 726 ⁇ Bj 730 . As shown in FIG. 7 , loop 708 iterates from index ii to minimum of (ii+N1 ⁇ 1) and Bi. Similarly, loop 712 iterates from index jj to minimum of (jj+N2 ⁇ 1) and Bj.
- loops 708 and 712 only iterate within tiles of the iteration space, while loops 706 and 710 iterate within the entire iteration space.
- cache line may be reused.
- cache line loaded for j+1 may be reused within the tile for the next iteration as j.
- tiling j and i we can fine tune the generated code such that cache lines loaded for Y(j,i) and Y(j+1,i) can be used for the next iteration of i to access Y(j,i+1) and Y(j+1,i+1). For sufficiently large values of N 2 , this can have a large impact on improving the performance of the program.
- FIG. 8 a diagram illustrating an exemplary implementation of BLOCK_LOOP directive for the purpose of strip mining is depicted in accordance with a preferred embodiment of the present invention.
- BLOCK_LOOP directive 806 is inserted by a user into source code, such as source code 400 in FIG. 4 .
- BLOCK_LOOP directive 806 may be inserted by using common directives, such as #pragma directives in C and C++, or user directives in Fortran. Other types of directives also may be used.
- BLOCK_LOOP directive 806 includes only one parameter, which is the blocking factor or tile size of 50. Since no second or subsequent parameters are defined, BLOCK_LOOP directive 806 directs the compiler to block the loop immediately following BLOCK_LOOP directive 806 , in this case, the j loop or loop 804 . In this example, loop 802 iterates i from 1 to N and loop 804 iterates j from 1 to M. N and M are the limits of loop 802 and loop 804 , respectively.
- loops 812 and 814 resulting code generated by the compiler is illustrated by loops 812 and 814 .
- the i loop or loop 802 remains, but the j loop or loop 804 is strip mined by a factor of 50, which means the iteration space of loop 804 is divided into blocks, or strips, of length at most 50.
- the compiler creates an outer loop 812 by first adding an index jj and setting index jj to the initial value of j in loop 804 , which is 1. The compiler then assigns loop 812 to iterate from 1 to M with a step size of 50.
- the compiler modifies original loop 804 with loop 814 , which sets index j to jj and iterates from jj to minimum of (jj+step size ⁇ 1) and M.
- strip mining of a loop may be performed by specifying the tile size in the first parameter, which defines the factor by which the iteration space would be divided.
- FIG. 9 a diagram illustrating an exemplary implementation of BLOCK_LOOP and LOOPID directive for the purpose of loop tiling using machine's actual cache size is depicted in accordance with a preferred embodiment of the present invention.
- BLOCK_LOOP directive 904 is inserted by a user in front of loop 902 .
- BLOCK_LOOP directive 904 is different from BLOCK_LOOP directive 806 in FIG. 8 in that BLOCK_LOOP directive 904 includes two parameters: compute_tile_size 905 and myloop 906 .
- Compute_tile_size 905 is a user-defined function that is invoked when BLOCK_LOOP directive 904 is processed by the compiler.
- compute_tile_size 905 is invoked with an input parameter of M, the tile size of the machine currently running the generated code is computed at run-time based on a limit M.
- loop tiling based on the actual machine's cache size could be achieved using BLOCK_LOOP directive 904 .
- Myloop 906 is an identifier given by a user to identify a loop that is to be blocked by the compiler.
- LOOPID directive 908 gives a user control of marking any loop in a loop nest in order to perform loop transformations.
- LOOPID directive 908 marks either the loop immediately following the directive or a BLOCK_LOOP directive if defined by the user.
- myloop 906 parameter directs the compiler to search for a loop with an identifier of myloop 906 .
- the compiler first scans the source code and registers the nesting levels of each loop. Once the nesting levels are registered, the compiler sorts the order of the nesting levels from the highest to the lowest. Thus, the inner most loop, which has a highest nesting levels, is processed first.
- the j loop or loop 907 is the inner most loop.
- the compiler thus processes LOOPID directive 908 associated with loop 907 and registers the identifier “myloop” with loop 907 .
- the compiler processes BLOCK_LOOP directive 904 , which blocks a loop with an identifier of myloop 906 with a blocking factor computed at run-time by the function compute_tile_size 905 . Since the compiler registered loop 907 previously with an identifier of “myloop”, the compiler recognizes loop 907 as the loop to be blocked. The loop to be blocked has to be nested within the blocking loop.
- the compiler processes BLOCK_LOOP directive 904 by creating loop 910 as an outer loop to divide the iteration space into size of the tile computed at run-time by compute_tile_size 905 .
- Loop 910 iterates from 1 to M with a step size of the result of compute_tile_size 905 with an input limit of M.
- Loop 910 divides M into smaller chucks that fit in the cache.
- Loop 902 remains the same since it was not blocked by BLOCK_LOOP directive 904 .
- the compiler modifies original loop 907 to become loop 914 , which assigns index j to jj and iterates j from jj to minimum of (jj+result of compute_tile_size 904 with an input limit M ⁇ 1) and M.
- Loops 910 and 914 allow users to tune program instructions to a specific memory hierarchy by dividing the limit of M into many different sizes of caches. This tuning improves performance on the machine overall by reusing the smaller chucks of cache memory and fitting the smaller chucks of memory, such as L2 cache, into larger chunks of cache memory, such as L3 cache.
- BLOCK_LOOP directive 904 and LOOPID directive 908 also enable a better interaction between programmers and the compiler.
- FIG. 10 a diagram illustrating an exemplary implementation of BLOCK_LOOP and LOOPID directive for the purpose of loop tiling with multiple loop identifiers is depicted in accordance with a preferred embodiment of the present invention.
- a user placed two BLOCK_LOOP directives in the source code: BLOCK_LOOP directive 1002 and BLOCK_LOOP directive 1008 .
- BLOCK_LOOP directive 1002 includes two parameters: a blocking factor 1004 of 50 and myMainLoop 1006 .
- BLOCK_LOOP directive 1002 directs the compiler to block myMainLoop 1006 with a blocking factor 1004 of 50.
- BLOCK_LOOP directive 1008 includes three parameters: a blocking factor 1010 of 20, myFirstLoop 1012 and mySecondLoop 1014 . Thus, BLOCK_LOOP directive 1008 directs the compiler to block myFirstLoop 1012 and mySecondLoop 1014 with a blocking factor 1010 of 20.
- the user also defines three loop identifiers in the source code: myMainLoop 1016 , myFirstLoop 1020 and mySecondLoop 1024 .
- MyMainLoop 1016 marks the loop immediately after the directive, in this example, the i loop or loop 1030 .
- myFirstLoop 1020 marks the j loop or loop 1032 and mySecondLoop 1024 marks the k loop or loop 1034 .
- the user aimed to block three different loops within a loop nest registered with three different loop identifiers.
- loops 1030 - 1034 are registered with the compiler when the compiler scans the source code. Therefore, compiler recognizes loops 1030 - 1034 with loop identifiers myMainLoop 1016 , myFirstLoop 1020 , and mySecondLoop 1024 , respectively.
- loop 1040 is generated first, then loop 1042 and loop 1044 .
- loop 1040 is a generated blocking loop that iterates from ii to N with a step size of 50.
- N is the limit of loop 1040 .
- Loop 1042 is a generated blocking loop that iterates from 1 to M with a step size of 20.
- Loop 1044 is a generated blocking loop that iterates from 1 to M with a step size of 20.
- M is the limit for loop 1042 and 1044 .
- loops 1040 - 44 the compiler modifies loops 1030 - 1034 to become loops 1046 - 1050 .
- Loop 1046 is a blocked loop that iterates i from ii to minimum of (ii+50 ⁇ 1) and N.
- Loop 1048 is a blocked loop that iterates j from jj to minimum of (jj+20 ⁇ 1) and M.
- Loop 1050 is a blocked loop that iterates k from kk to minimum of (kk+20 ⁇ 1) and M.
- the present invention allows users to control transformations of different loops in a loop nest by applying different blocking factors or tile sizes to different loops.
- the present invention has advantages over the prior art in that by using the BLOCK_LOOP and LOOPID directives, a user may refer to multiple loops in different locations at the same time, without modifying underlying program instructions. This gives flexibility to the user to combine any loops to perform techniques such as loop tiling or strip mining.
- FIG. 11 a diagram illustrating an exemplary implementation of BLOCK_LOOP and LOOPID directive for the purpose of loop interchange is depicted in accordance with a preferred embodiment of the present invention.
- BLOCK_LOOP directive 1106 includes two parameters: a blocking factor 1108 of 1 and a loop identifier of myLoop 1110 .
- MyLoop 1110 is specified by the user with LOOPID directive 1114 , which marks the L loop, loop 1116 .
- a user may cause the loop being blocked, in this case, L loop or loop 1116 , to become a single iteration loop.
- the user will cause the generated blocking loop to make all the iterations of the original loop.
- the k loop or loop 1112 currently wraps the L loop or loop 1116 .
- the compiler After the compiler processed BLOCK_LOOP directive 1106 and LOOPID directive 1114 , the compiler generates loops 1124 and statement 1126 .
- Loop 1124 is generated as an outer blocking loop that blocks the L loop, loop 1116 , which iterates LL from 1 to M. M is the limit of loop 1116 . In this way, loop 1124 makes all the iterations of the original loop, loop 1116 .
- the compiler modifies original loop 1116 to become statement 1126 , which sets index L to LL.
- blocking factor 1108 specified in BLOCK_LOOP directive 1106 is 1.
- statement 1126 is a single iteration loop that only gets executed once.
- the L loop or loop 1124 now wraps the k loop or loop 1128 .
- specifying a blocking factor 1108 of 1 allows a user to perform loop interchange on two loops, which switches the order of access to which the elements are stored in memory.
- a user may direct the compiler to perform loop interchange or to create a different permutation of the loop nest.
- loop interchange changes the access pattern of elements in the loops to the order stored in memory, which improves spatial locality by rearranging access to memory locations that are closer to the processor. For example, in a 2-dimensional matrix, instead of accessing elements from each row, which forces the processor to access memory locations that are sparse, loop interchange may be performed to access elements that are closer to each other.
- FIG. 12A a diagram illustrating an exemplary implementation of BLOCK_LOOP and LOOPID directive for the purpose of loop tiling for multi-level memory hierarchy is depicted in accordance with a preferred embodiment of the present invention.
- BLOCK_LOOP directive 1202 includes two parameters: a blocking factor of L 3 Factor 1204 and an identifier of first_level_blocking 1205 , which is the blocked loop.
- L 3 Factor 1204 may represent a level 3 cache memory in a system.
- Blocked loop first_level_blocking 1205 is defined by the user using LOOPID directive 1206 .
- BLOCK_LOOP directive 1208 instead of a loop immediately following the directive, BLOCK_LOOP directive 1208 immediately follows.
- BLOCK_LOOP directive 1208 is placed in front of j loop or loop 1210 .
- BLOCK_LOOP directive 1208 includes two parameters: a blocking factor of L 2 Factor 1218 and an identifier of inner_space 1220 , which is the blocked loop.
- L 2 Factor 1218 may represent the size of a level 2 cache memory in the system.
- Blocked loop inner_space 1220 is defined by the user with LOOPID directive 1212 , with the k loop or loop 1214 immediately following the directive. Therefore, in this example, inner_space 1212 is blocked with with L2Factor 1218 . The result, represented by first_level_blocking 1206 , is in turn blocked with L3Factor 1204 .
- loop 1204 When the compiler scans the source code, the compiler first registers the nesting level of each loop. In this case, loop 1204 is registered with a nesting level of 0. Loop 1210 is registered with a nesting level of 1. Loop 1214 is registered with a nesting level of 2. Loop 1216 is registered with a nesting level of 3. Then, the compiler sorts the order of the loops from the highest nesting level to the lowest nesting level. Thus, loop 1216 is processed first, then followed by loop 1214 , 1210 and 1204 .
- the compiler processes the k loop or loop 1214 , the compiler processes LOOPID directive 1212 , which marks the k loop or loop 1214 and registers loop 1214 with an identifier of “inner_space”. Then, the compiler processes loop 1210 and BLOCK_LOOP directive 1208 , which is discussed in further detail in FIG. 12B .
- FIG. 12B a diagram illustrating exemplary resulting code after BLOCK_LOOP directive 1208 in FIG. 12A is processed by the compiler is depicted in accordance with a preferred embodiment of the present invention.
- the compiler processes BLOCK_LOOP directive 1208 in FIG. 12A , which blocks a loop with an identifier of “inner_space” with a blocking factor of “L 2 Factor”.
- loop 1218 When the compiler processes BLOCK_LOOP directive 1208 in FIG. 12A , the compiler first creates loop 1218 as an outer blocking loop that blocks loop 1214 in FIG. 12A .
- Loop 1214 is recognized by the compiler, since it is previously registered with an identifier of “inner_space”. Loop 1218 iterates index kk from 1 to M with a step size of L2Factor, which is the blocking factor specified in BLOCK_LOOP directive 1208 in FIG. 12A .
- the compiler modifies loop 1214 in FIG. 12A to become loop 1219 , which iterates original index k from kk to minimum of (kk+L2Factor ⁇ 1) and M, with a step size of 1.
- the compiler continues to process LOOPID directive 1206 that marks the kk loop or loop 1218 and registers loop 1218 with an identifier of “first_level_blocking”. Once loop 1218 is registered, the i loop or loop 1204 and BLOCK_LOOP directive 1202 is processed by the compiler, which is discussed in further detail in FIG. 12C .
- FIG. 12C a diagram illustrating exemplary resulting code after BLOCK_LOOP directive 1202 in FIG. 12B is processed by the compiler is depicted in accordance with a preferred embodiment of the present invention.
- the compiler processes BLOCK_LOOP directive 1202 in FIG. 12B , which blocks a loop with an identifier of “first_level_blocking” with a blocking factor of L3Factor.
- loop 1220 When the compiler processes BLOCK_LOOP directive 1202 in FIG. 12B , the compiler first creates loop 1220 as an outer blocking loop that blocks loop 1218 in FIG. 12B .
- Loop 1218 is recognized by the compiler, since it is previously registered with an identifier of “first_level_blocking”.
- Loop 1220 iterates index kkk from 1 to M, with a step size of L3Factor, which is the blocking factor specified in BLOCK_LOOP directive 1202 in FIG. 12B , multiplied by L2Factor.
- the blocking factor, L 3 fFactor is multiplied by the previous blocking factor, L2Factor, such that we can further divide the iteration space into blocks of L2Factor ⁇ L3Factor.
- Loop 1219 remains the same. However, the compiler modifies loop 1218 in FIG. 12B to become loop 1224 , which iterates original index kk from kkk to minimum of (kk+L3Factor ⁇ L2Factor ⁇ 1) and M, with the same step size of L2Factor.
- the BLOCK_LOOP and LOOPID directives of the present invention may be used to perform loop tiling for multi-level memory hierarchy.
- chunks of level 1 cache memories may be reused to form a larger chunk of level 2 cache memory.
- chunks of level 2 cache memories may be reused to form a larger chunk of level 3 cache memory and so on. This enhances reuse of cache memories and reduces cache miss rates.
- the directives also give user control of loop transformations, such as loop interchange and loop tiling, at each level of the memory hierarchy.
- L1 cache 1240 is represented by two elements or chunks of 2.
- the chunks of 2 of L1 caches may then be reused to form a L2 cache 1242 , which includes a chuck of 6 elements or 3 chunks of L1 caches 1240 .
- the chunks of 6 of L2 caches 1242 may then be fitted or reused to form a L3 cache 1244 , which includes a chunk of 12 elements or 2 chunks of L2 caches 1242 .
- the present invention provides a generic language interface, which includes two new directives, to apply loop optimization transformations.
- the directives convey two pieces of information that relate to two different loops in a loop nest at the same time: a location where the blocking loop may be created and a loop to be blocked at any level of a given loop nest.
- the directives also provide better interactions between the program instructions and the compiler as well as flexibility of performing loop transformation on any given loop in a loop nest. Furthermore, a user or programmer may perform tuning of program instructions to improve performance by first directing the compiler to ignore the directives and observe the result of the loop executions with correctness in mind. Later on, the programmer may apply the directives to speed up loop executions and record the performance gained by using the directives. Iteratively, the programmer may use the performance records to tune program instructions.
- a user may utilize the directives of the present invention to perform loop tiling for multi-level memory hierarchy without modifying the program instructions, such that the memory space may be divided into smaller chunks that can be reused.
Landscapes
- Engineering & Computer Science (AREA)
- General Engineering & Computer Science (AREA)
- Theoretical Computer Science (AREA)
- Software Systems (AREA)
- Physics & Mathematics (AREA)
- General Physics & Mathematics (AREA)
- Devices For Executing Special Programs (AREA)
Abstract
Description
Claims (31)
Priority Applications (1)
Application Number | Priority Date | Filing Date | Title |
---|---|---|---|
US10/926,601 US7318223B2 (en) | 2004-08-26 | 2004-08-26 | Method and apparatus for a generic language interface to apply loop optimization transformations |
Applications Claiming Priority (1)
Application Number | Priority Date | Filing Date | Title |
---|---|---|---|
US10/926,601 US7318223B2 (en) | 2004-08-26 | 2004-08-26 | Method and apparatus for a generic language interface to apply loop optimization transformations |
Publications (2)
Publication Number | Publication Date |
---|---|
US20060048121A1 US20060048121A1 (en) | 2006-03-02 |
US7318223B2 true US7318223B2 (en) | 2008-01-08 |
Family
ID=35944978
Family Applications (1)
Application Number | Title | Priority Date | Filing Date |
---|---|---|---|
US10/926,601 Expired - Fee Related US7318223B2 (en) | 2004-08-26 | 2004-08-26 | Method and apparatus for a generic language interface to apply loop optimization transformations |
Country Status (1)
Country | Link |
---|---|
US (1) | US7318223B2 (en) |
Cited By (10)
Publication number | Priority date | Publication date | Assignee | Title |
---|---|---|---|---|
US20050138029A1 (en) * | 2003-12-19 | 2005-06-23 | International Business Machines Corporation | Generalized index set splitting in software loops |
US20060212862A1 (en) * | 2005-03-15 | 2006-09-21 | International Business Machines Corporation | System, method and program product to optimize code during run time |
US20080046871A1 (en) * | 2006-08-15 | 2008-02-21 | International Business Machines Corporation | Array value substitution and propagation with loop transformations through static analysis |
US20080046870A1 (en) * | 2006-08-15 | 2008-02-21 | International Business Machines Corporation | Compile time evaluation of library functions |
US20090307675A1 (en) * | 2008-06-04 | 2009-12-10 | Ng John L | Data dependence testing for loop fusion with code replication, array contraction, and loop interchange |
US20100175056A1 (en) * | 2002-07-03 | 2010-07-08 | Hajime Ogawa | Compiler apparatus with flexible optimization |
WO2013147896A1 (en) * | 2012-03-30 | 2013-10-03 | Intel Corporation | Instruction and logic to efficiently monitor loop trip count |
US20140229915A1 (en) * | 2013-02-11 | 2014-08-14 | International Business Machines Corporation | Debugger with previous version feature |
US10365906B2 (en) * | 2017-05-23 | 2019-07-30 | Intel Corporation | Compile time interface to run-time libraries |
US10684833B2 (en) * | 2018-03-15 | 2020-06-16 | Intel Corporation | Post-compile cache blocking analyzer |
Families Citing this family (24)
Publication number | Priority date | Publication date | Assignee | Title |
---|---|---|---|---|
US20060048122A1 (en) * | 2004-08-30 | 2006-03-02 | International Business Machines Corporation | Method, system and computer program product for hierarchical loop optimization of machine executable code |
US20060048118A1 (en) * | 2004-08-30 | 2006-03-02 | International Business Machines Corporation | Method and apparatus for optimizing code with artificial statements |
US20070022178A1 (en) * | 2005-07-01 | 2007-01-25 | Lee Prescott V | Systems and methods for adding media from a content input device into a loop |
US8671401B2 (en) * | 2007-04-09 | 2014-03-11 | Microsoft Corporation | Tiling across loop nests with possible recomputation |
US8127281B2 (en) * | 2007-12-12 | 2012-02-28 | International Business Machines Corporation | Method and apparatus for efficient multiple-pattern based matching and transformation of intermediate language expression trees |
US8572595B1 (en) | 2008-02-08 | 2013-10-29 | Reservoir Labs, Inc. | Methods and apparatus for aggressive scheduling in source code compilation |
US8930926B2 (en) * | 2008-02-08 | 2015-01-06 | Reservoir Labs, Inc. | System, methods and apparatus for program optimization for multi-threaded processor architectures |
US8661422B2 (en) * | 2008-02-08 | 2014-02-25 | Reservoir Labs, Inc. | Methods and apparatus for local memory compaction |
US9858053B2 (en) | 2008-02-08 | 2018-01-02 | Reservoir Labs, Inc. | Methods and apparatus for data transfer optimization |
US8572590B2 (en) | 2008-09-17 | 2013-10-29 | Reservoir Labs, Inc. | Methods and apparatus for joint parallelism and locality optimization in source code compilation |
US9185020B2 (en) * | 2009-04-30 | 2015-11-10 | Reservoir Labs, Inc. | System, apparatus and methods to implement high-speed network analyzers |
US8892483B1 (en) | 2010-06-01 | 2014-11-18 | Reservoir Labs, Inc. | Systems and methods for planning a solution to a dynamically changing problem |
US8914601B1 (en) | 2010-10-18 | 2014-12-16 | Reservoir Labs, Inc. | Systems and methods for a fast interconnect table |
US9134976B1 (en) | 2010-12-13 | 2015-09-15 | Reservoir Labs, Inc. | Cross-format analysis of software systems |
US9489180B1 (en) | 2011-11-18 | 2016-11-08 | Reservoir Labs, Inc. | Methods and apparatus for joint scheduling and layout optimization to enable multi-level vectorization |
US9830133B1 (en) | 2011-12-12 | 2017-11-28 | Significs And Elements, Llc | Methods and apparatus for automatic communication optimizations in a compiler based on a polyhedral representation |
US9798588B1 (en) | 2012-04-25 | 2017-10-24 | Significs And Elements, Llc | Efficient packet forwarding using cyber-security aware policies |
US10936569B1 (en) | 2012-05-18 | 2021-03-02 | Reservoir Labs, Inc. | Efficient and scalable computations with sparse tensors |
US9684865B1 (en) | 2012-06-05 | 2017-06-20 | Significs And Elements, Llc | System and method for configuration of an ensemble solver |
US8935684B2 (en) * | 2012-12-13 | 2015-01-13 | Advanced Micro Devices, Inc. | Loop invariant method expression hoisting |
JP2015194881A (en) * | 2014-03-31 | 2015-11-05 | 富士通株式会社 | Compilation device, compilation program, and compilation method |
US11068247B2 (en) * | 2018-02-06 | 2021-07-20 | Microsoft Technology Licensing, Llc | Vectorizing conditional min-max sequence reduction loops |
US11556357B1 (en) | 2021-03-25 | 2023-01-17 | The Mathworks, Inc. | Systems, media, and methods for identifying loops of or implementing loops for a unit of computation |
US11656854B2 (en) * | 2021-08-30 | 2023-05-23 | Huawei Technologies Co., Ltd. | Methods and devices for computing a memory size for software optimization |
Citations (10)
Publication number | Priority date | Publication date | Assignee | Title |
---|---|---|---|---|
US5230053A (en) | 1990-02-05 | 1993-07-20 | Matsushita Electric Industrial Co., Ltd. | Processor scheduling method for iterative loops |
US5303357A (en) * | 1991-04-05 | 1994-04-12 | Kabushiki Kaisha Toshiba | Loop optimization system |
US5535393A (en) * | 1991-09-20 | 1996-07-09 | Reeve; Christopher L. | System for parallel processing that compiles a filed sequence of instructions within an iteration space |
US5953531A (en) * | 1997-07-25 | 1999-09-14 | International Business Machines Corporation | Method of, system for, and computer program product for minimizing loop execution time by optimizing block/tile sizes |
US6038398A (en) * | 1997-05-29 | 2000-03-14 | Hewlett-Packard Co. | Method and apparatus for improving performance of a program using a loop interchange, loop distribution, loop interchange sequence |
US6226790B1 (en) * | 1997-02-28 | 2001-05-01 | Silicon Graphics, Inc. | Method for selecting optimal parameters for compiling source code |
US6567976B1 (en) | 1997-03-20 | 2003-05-20 | Silicon Graphics, Inc. | Method for unrolling two-deep loops with convex bounds and imperfectly nested code, and for unrolling arbitrarily deep nests with constant bounds and imperfectly nested code |
US20030110481A1 (en) * | 2001-12-06 | 2003-06-12 | Kiyomi Wada | Program tuning method |
US20050144605A1 (en) * | 2003-12-26 | 2005-06-30 | Keiko Motokawa | Information processing system and code generation method |
US7086039B2 (en) * | 2002-11-13 | 2006-08-01 | Sun Microsystems, Inc. | Compiler for optimizing source code with computed goto statements |
-
2004
- 2004-08-26 US US10/926,601 patent/US7318223B2/en not_active Expired - Fee Related
Patent Citations (10)
Publication number | Priority date | Publication date | Assignee | Title |
---|---|---|---|---|
US5230053A (en) | 1990-02-05 | 1993-07-20 | Matsushita Electric Industrial Co., Ltd. | Processor scheduling method for iterative loops |
US5303357A (en) * | 1991-04-05 | 1994-04-12 | Kabushiki Kaisha Toshiba | Loop optimization system |
US5535393A (en) * | 1991-09-20 | 1996-07-09 | Reeve; Christopher L. | System for parallel processing that compiles a filed sequence of instructions within an iteration space |
US6226790B1 (en) * | 1997-02-28 | 2001-05-01 | Silicon Graphics, Inc. | Method for selecting optimal parameters for compiling source code |
US6567976B1 (en) | 1997-03-20 | 2003-05-20 | Silicon Graphics, Inc. | Method for unrolling two-deep loops with convex bounds and imperfectly nested code, and for unrolling arbitrarily deep nests with constant bounds and imperfectly nested code |
US6038398A (en) * | 1997-05-29 | 2000-03-14 | Hewlett-Packard Co. | Method and apparatus for improving performance of a program using a loop interchange, loop distribution, loop interchange sequence |
US5953531A (en) * | 1997-07-25 | 1999-09-14 | International Business Machines Corporation | Method of, system for, and computer program product for minimizing loop execution time by optimizing block/tile sizes |
US20030110481A1 (en) * | 2001-12-06 | 2003-06-12 | Kiyomi Wada | Program tuning method |
US7086039B2 (en) * | 2002-11-13 | 2006-08-01 | Sun Microsystems, Inc. | Compiler for optimizing source code with computed goto statements |
US20050144605A1 (en) * | 2003-12-26 | 2005-06-30 | Keiko Motokawa | Information processing system and code generation method |
Non-Patent Citations (14)
Title |
---|
"MIPSpro C/C++ Pragmas", Aug. 15, 2003, Chapter 8, retrieved from: http://techpubs.sgi.com/library/tpl/cgi-bin/getdoc.cgi/0650/bks/SGI<SUB>-</SUB>Developer/books/Pragmas/sgi<SUB>-</SUB>html/ch08.html. * |
"The Paramount Group FAQ", Jun. 4, 2003, section "Ploaris internal directives", retrieved from: http://cobweb.ecn.purdue.edu/ParaMount/FAQ/. * |
Apan Qasem, Guohua Jin, and John Mellor-Crummey, "Improving Performance with Integrated Program Transformations", Oct. 2003, Technical Report TR03-419, Rice University, pp. 1-13. * |
Chelcea et al., "Resynthesis and Peephole Transformations for the Optimization of Large-Scale Asynchronous Systems", ACM Digital Library 2002, pp. 405-410. |
Hewlett Packard, "Exemplar Programming Guide for HP-UX Systems", Dec. 1997, First Edition, pp. xix, 39-40, 83-91. * |
IBM, "XL Fortran Enterprise Edition for AIX, Language Reference, Version 9.1", Aug. 2004, First Edition, pp. 434-435, 445-446. * |
James Gosling, Bill Joy, and Guy Steele, "The Java(TM) Language Specification", 1996, pp. 271-272, 286. * |
Lieuwen et al., "A Transformation-Based Approach to Optimizing Loops in Database Programming Languages", ACM Digital Library 1992, pp. 91-100. |
Qing Yi, Vikram Adve, Ken Kennedy, "Transforming Loops to Recursion for Multi-Level Memory Hierarchies", 2000, Conference on Programming Language Design and Implementation, pp. 169-181. * |
Raghavachari et al., "Ace: A Language for Parallel Programming with Customizable Protocols", ACM Transactions on Computer Systems, vol. 17, No. 3, Aug. 1999, pp. 202-248. |
Stephanie Coleman and Kathryn McKinley, "Tile Size Selection Using Cache Organization and Data Layout", 1995, Proceedings of the ACM SIGPLAN 1995 Conference on Programming Language Design and Implementation, pp. 279-289. * |
Steve Carr and Ken Kennedy, "Compiler Blockablility of Numerical Algorithms", 1992, Proceedings of the 1992 ACM/IEEE Conference on Supercomputing, pp. 114-124. * |
Steven Muchnick, "Advanced Compiler Design Implementation", 1997, Academic Press, pp. 695-698. * |
Whitfield et al., "An Approach for Exploring Code-Improving Transformations", ACM Transactions on Programming Languages and Systems, vol. 19, No. 6, Nov. 1997, pp. 1053-1084. |
Cited By (19)
Publication number | Priority date | Publication date | Assignee | Title |
---|---|---|---|---|
US8418157B2 (en) * | 2002-07-03 | 2013-04-09 | Panasonic Corporation | Compiler apparatus with flexible optimization |
US20100175056A1 (en) * | 2002-07-03 | 2010-07-08 | Hajime Ogawa | Compiler apparatus with flexible optimization |
US20050138029A1 (en) * | 2003-12-19 | 2005-06-23 | International Business Machines Corporation | Generalized index set splitting in software loops |
US20060212862A1 (en) * | 2005-03-15 | 2006-09-21 | International Business Machines Corporation | System, method and program product to optimize code during run time |
US7797690B2 (en) * | 2005-03-15 | 2010-09-14 | International Business Machines Corporation | System, method and program product to optimize code during run time |
US20080046871A1 (en) * | 2006-08-15 | 2008-02-21 | International Business Machines Corporation | Array value substitution and propagation with loop transformations through static analysis |
US20080046870A1 (en) * | 2006-08-15 | 2008-02-21 | International Business Machines Corporation | Compile time evaluation of library functions |
US7890942B2 (en) * | 2006-08-15 | 2011-02-15 | International Business Machines Corporation | Array value substitution and propagation with loop transformations through static analysis |
US7921418B2 (en) | 2006-08-15 | 2011-04-05 | International Business Machines Corporation | Compile time evaluation of library functions |
US20090307675A1 (en) * | 2008-06-04 | 2009-12-10 | Ng John L | Data dependence testing for loop fusion with code replication, array contraction, and loop interchange |
US8677338B2 (en) * | 2008-06-04 | 2014-03-18 | Intel Corporation | Data dependence testing for loop fusion with code replication, array contraction, and loop interchange |
WO2013147896A1 (en) * | 2012-03-30 | 2013-10-03 | Intel Corporation | Instruction and logic to efficiently monitor loop trip count |
US9715388B2 (en) | 2012-03-30 | 2017-07-25 | Intel Corporation | Instruction and logic to monitor loop trip count and remove loop optimizations |
US20140229915A1 (en) * | 2013-02-11 | 2014-08-14 | International Business Machines Corporation | Debugger with previous version feature |
US20140229916A1 (en) * | 2013-02-11 | 2014-08-14 | International Business Machines Corporation | Debugger with previous version feature |
US9047403B2 (en) * | 2013-02-11 | 2015-06-02 | International Business Machines Corporation | Debugger with previous version feature |
US9117017B2 (en) * | 2013-02-11 | 2015-08-25 | International Business Machines Corporation | Debugger with previous version feature |
US10365906B2 (en) * | 2017-05-23 | 2019-07-30 | Intel Corporation | Compile time interface to run-time libraries |
US10684833B2 (en) * | 2018-03-15 | 2020-06-16 | Intel Corporation | Post-compile cache blocking analyzer |
Also Published As
Publication number | Publication date |
---|---|
US20060048121A1 (en) | 2006-03-02 |
Similar Documents
Publication | Publication Date | Title |
---|---|---|
US7318223B2 (en) | Method and apparatus for a generic language interface to apply loop optimization transformations | |
US10169013B2 (en) | Arranging binary code based on call graph partitioning | |
US8413127B2 (en) | Fine-grained software-directed data prefetching using integrated high-level and low-level code analysis optimizations | |
US8713548B2 (en) | Rewriting branch instructions using branch stubs | |
US8627051B2 (en) | Dynamically rewriting branch instructions to directly target an instruction cache location | |
US7571432B2 (en) | Compiler apparatus for optimizing high-level language programs using directives | |
US20070088915A1 (en) | Method and apparatus for software-assisted data cache and prefetch control | |
US8782381B2 (en) | Dynamically rewriting branch instructions in response to cache line eviction | |
US7137111B2 (en) | Aggressive prefetch of address chains | |
US20110154289A1 (en) | Optimization of an application program | |
US9280350B2 (en) | Methods and apparatus to perform adaptive pre-fetch operations in managed runtime environments | |
US7818731B2 (en) | Method and system for reducing memory reference overhead associated with treadprivate variables in parallel programs | |
US20070089097A1 (en) | Region based code straightening | |
US8146068B2 (en) | Managing heuristic properties | |
US7107583B2 (en) | Method and apparatus for reducing cache thrashing | |
JPH1069389A (en) | Device that make good use of branch predictive cache without tag by rearrangement of branch | |
US7356812B2 (en) | Passing parameters by implicit reference | |
US20050183079A1 (en) | Tail duplicating during block layout | |
CN114217806A (en) | Compiling optimization method based on cache writing hint mechanism | |
Lattner et al. | Automatic Pool Allocation: Compile-Time Control of Data Structure Layout in the Heap |
Legal Events
Date | Code | Title | Description |
---|---|---|---|
AS | Assignment |
Owner name: INTERNATIONAL BUSINESS MACHINES CORPORATION, NEW Y Free format text: ASSIGNMENT OF ASSIGNORS INTEREST;ASSIGNORS:BLAINEY, ROBERT JAMES;TAL, ARIE;REEL/FRAME:015148/0735;SIGNING DATES FROM 20040715 TO 20040820 |
|
FEPP | Fee payment procedure |
Free format text: PAYOR NUMBER ASSIGNED (ORIGINAL EVENT CODE: ASPN); ENTITY STATUS OF PATENT OWNER: LARGE ENTITY |
|
FPAY | Fee payment |
Year of fee payment: 4 |
|
AS | Assignment |
Owner name: GOOGLE INC., CALIFORNIA Free format text: ASSIGNMENT OF ASSIGNORS INTEREST;ASSIGNOR:INTERNATIONAL BUSINESS MACHINES CORPORATION;REEL/FRAME:026664/0866 Effective date: 20110503 |
|
FEPP | Fee payment procedure |
Free format text: PAYER NUMBER DE-ASSIGNED (ORIGINAL EVENT CODE: RMPN); ENTITY STATUS OF PATENT OWNER: LARGE ENTITY Free format text: PAYOR NUMBER ASSIGNED (ORIGINAL EVENT CODE: ASPN); ENTITY STATUS OF PATENT OWNER: LARGE ENTITY |
|
REMI | Maintenance fee reminder mailed | ||
LAPS | Lapse for failure to pay maintenance fees | ||
STCH | Information on status: patent discontinuation |
Free format text: PATENT EXPIRED DUE TO NONPAYMENT OF MAINTENANCE FEES UNDER 37 CFR 1.362 |
|
FP | Lapsed due to failure to pay maintenance fee |
Effective date: 20160108 |
|
AS | Assignment |
Owner name: GOOGLE LLC, CALIFORNIA Free format text: CHANGE OF NAME;ASSIGNOR:GOOGLE INC.;REEL/FRAME:044142/0357 Effective date: 20170929 |