WO2002069152A1 - Managing coherence via put/get windows - Google Patents
- Publication number
- WO2002069152A1 (PCT/US2002/005615)
- Authority
- WO
- WIPO (PCT)
- Prior art keywords
- cache
- processor
- data
- memory
- activities
- Prior art date
Classifications
-
- H—ELECTRICITY
- H05—ELECTRIC TECHNIQUES NOT OTHERWISE PROVIDED FOR
- H05K—PRINTED CIRCUITS; CASINGS OR CONSTRUCTIONAL DETAILS OF ELECTRIC APPARATUS; MANUFACTURE OF ASSEMBLAGES OF ELECTRICAL COMPONENTS
- H05K7/00—Constructional details common to different types of electric apparatus
- H05K7/20—Modifications to facilitate cooling, ventilating, or heating
- H05K7/20709—Modifications to facilitate cooling, ventilating, or heating for server racks or cabinets; for data centers, e.g. 19-inch computer racks
- H05K7/20836—Thermal management, e.g. server temperature control
-
- F—MECHANICAL ENGINEERING; LIGHTING; HEATING; WEAPONS; BLASTING
- F04—POSITIVE-DISPLACEMENT MACHINES FOR LIQUIDS; PUMPS FOR LIQUIDS OR ELASTIC FLUIDS
- F04D—NON-POSITIVE-DISPLACEMENT PUMPS
- F04D25/00—Pumping installations or systems
- F04D25/16—Combinations of two or more pumps; Producing two or more separate gas flows
- F04D25/166—Combinations of two or more pumps; Producing two or more separate gas flows using fans
-
- F—MECHANICAL ENGINEERING; LIGHTING; HEATING; WEAPONS; BLASTING
- F04—POSITIVE-DISPLACEMENT MACHINES FOR LIQUIDS; PUMPS FOR LIQUIDS OR ELASTIC FLUIDS
- F04D—NON-POSITIVE-DISPLACEMENT PUMPS
- F04D27/00—Control, e.g. regulation, of pumps, pumping installations or pumping systems specially adapted for elastic fluids
- F04D27/004—Control, e.g. regulation, of pumps, pumping installations or pumping systems specially adapted for elastic fluids by varying driving speed
-
- G—PHYSICS
- G06—COMPUTING; CALCULATING OR COUNTING
- G06F—ELECTRIC DIGITAL DATA PROCESSING
- G06F12/00—Accessing, addressing or allocating within memory systems or architectures
- G06F12/02—Addressing or allocation; Relocation
- G06F12/08—Addressing or allocation; Relocation in hierarchically structured memory systems, e.g. virtual memory systems
- G06F12/0802—Addressing of a memory level in which the access to the desired data or data block requires associative addressing means, e.g. caches
- G06F12/0806—Multiuser, multiprocessor or multiprocessing cache systems
- G06F12/0815—Cache consistency protocols
- G06F12/0837—Cache consistency protocols with software control, e.g. non-cacheable data
-
- G—PHYSICS
- G06—COMPUTING; CALCULATING OR COUNTING
- G06F—ELECTRIC DIGITAL DATA PROCESSING
- G06F15/00—Digital computers in general; Data processing equipment in general
- G06F15/16—Combinations of two or more digital computers each having at least an arithmetic unit, a program unit and a register, e.g. for a simultaneous processing of several programs
- G06F15/163—Interprocessor communication
- G06F15/173—Interprocessor communication using an interconnection network, e.g. matrix, shuffle, pyramid, star, snowflake
- G06F15/17356—Indirect interconnection networks
- G06F15/17368—Indirect interconnection networks non hierarchical topologies
- G06F15/17381—Two dimensional, e.g. mesh, torus
-
- G—PHYSICS
- G06—COMPUTING; CALCULATING OR COUNTING
- G06F—ELECTRIC DIGITAL DATA PROCESSING
- G06F17/00—Digital computing or data processing equipment or methods, specially adapted for specific functions
- G06F17/10—Complex mathematical operations
- G06F17/14—Fourier, Walsh or analogous domain transformations, e.g. Laplace, Hilbert, Karhunen-Loeve, transforms
- G06F17/141—Discrete Fourier transforms
- G06F17/142—Fast Fourier transforms, e.g. using a Cooley-Tukey type algorithm
-
- G—PHYSICS
- G06—COMPUTING; CALCULATING OR COUNTING
- G06F—ELECTRIC DIGITAL DATA PROCESSING
- G06F9/00—Arrangements for program control, e.g. control units
- G06F9/06—Arrangements for program control, e.g. control units using stored programs, i.e. using an internal store of processing equipment to receive or retain programs
- G06F9/46—Multiprogramming arrangements
- G06F9/52—Program synchronisation; Mutual exclusion, e.g. by means of semaphores
-
- G—PHYSICS
- G06—COMPUTING; CALCULATING OR COUNTING
- G06F—ELECTRIC DIGITAL DATA PROCESSING
- G06F9/00—Arrangements for program control, e.g. control units
- G06F9/06—Arrangements for program control, e.g. control units using stored programs, i.e. using an internal store of processing equipment to receive or retain programs
- G06F9/46—Multiprogramming arrangements
- G06F9/52—Program synchronisation; Mutual exclusion, e.g. by means of semaphores
- G06F9/526—Mutual exclusion algorithms
-
- G—PHYSICS
- G09—EDUCATION; CRYPTOGRAPHY; DISPLAY; ADVERTISING; SEALS
- G09G—ARRANGEMENTS OR CIRCUITS FOR CONTROL OF INDICATING DEVICES USING STATIC MEANS TO PRESENT VARIABLE INFORMATION
- G09G5/00—Control arrangements or circuits for visual indicators common to cathode-ray tube indicators and other visual indicators
- G09G5/003—Details of a display terminal, the details relating to the control arrangement of the display terminal and to the interfaces thereto
- G09G5/006—Details of the interface to the display terminal
-
- G—PHYSICS
- G09—EDUCATION; CRYPTOGRAPHY; DISPLAY; ADVERTISING; SEALS
- G09G—ARRANGEMENTS OR CIRCUITS FOR CONTROL OF INDICATING DEVICES USING STATIC MEANS TO PRESENT VARIABLE INFORMATION
- G09G5/00—Control arrangements or circuits for visual indicators common to cathode-ray tube indicators and other visual indicators
- G09G5/003—Details of a display terminal, the details relating to the control arrangement of the display terminal and to the interfaces thereto
- G09G5/006—Details of the interface to the display terminal
- G09G5/008—Clock recovery
-
- H—ELECTRICITY
- H04—ELECTRIC COMMUNICATION TECHNIQUE
- H04L—TRANSMISSION OF DIGITAL INFORMATION, e.g. TELEGRAPHIC COMMUNICATION
- H04L7/00—Arrangements for synchronising receiver with transmitter
- H04L7/02—Speed or phase control by the received code signals, the signals containing no special synchronisation information
- H04L7/033—Speed or phase control by the received code signals, the signals containing no special synchronisation information using the transitions of the received signal to control the phase of the synchronising-signal-generating means, e.g. using a phase-locked loop
- H04L7/0337—Selecting between two or more discretely delayed clocks or selecting between two or more discretely delayed received code signals
- H04L7/0338—Selecting between two or more discretely delayed clocks or selecting between two or more discretely delayed received code signals the correction of the phase error being performed by a feed forward loop
-
- F—MECHANICAL ENGINEERING; LIGHTING; HEATING; WEAPONS; BLASTING
- F24—HEATING; RANGES; VENTILATING
- F24F—AIR-CONDITIONING; AIR-HUMIDIFICATION; VENTILATION; USE OF AIR CURRENTS FOR SCREENING
- F24F11/00—Control or safety arrangements
- F24F11/70—Control systems characterised by their outputs; Constructional details thereof
- F24F11/72—Control systems characterised by their outputs; Constructional details thereof for controlling the supply of treated air, e.g. its pressure
- F24F11/74—Control systems characterised by their outputs; Constructional details thereof for controlling the supply of treated air, e.g. its pressure for controlling air flow rate or air velocity
- F24F11/77—Control systems characterised by their outputs; Constructional details thereof for controlling the supply of treated air, e.g. its pressure for controlling air flow rate or air velocity by controlling the speed of ventilators
-
- G—PHYSICS
- G06—COMPUTING; CALCULATING OR COUNTING
- G06F—ELECTRIC DIGITAL DATA PROCESSING
- G06F2212/00—Indexing scheme relating to accessing, addressing or allocation within memory systems or architectures
- G06F2212/10—Providing a specific technical effect
- G06F2212/1016—Performance improvement
- G06F2212/1024—Latency reduction
-
- G—PHYSICS
- G09—EDUCATION; CRYPTOGRAPHY; DISPLAY; ADVERTISING; SEALS
- G09G—ARRANGEMENTS OR CIRCUITS FOR CONTROL OF INDICATING DEVICES USING STATIC MEANS TO PRESENT VARIABLE INFORMATION
- G09G2360/00—Aspects of the architecture of display systems
- G09G2360/12—Frame memory handling
- G09G2360/121—Frame memory handling using a cache memory
-
- Y—GENERAL TAGGING OF NEW TECHNOLOGICAL DEVELOPMENTS; GENERAL TAGGING OF CROSS-SECTIONAL TECHNOLOGIES SPANNING OVER SEVERAL SECTIONS OF THE IPC; TECHNICAL SUBJECTS COVERED BY FORMER USPC CROSS-REFERENCE ART COLLECTIONS [XRACs] AND DIGESTS
- Y02—TECHNOLOGIES OR APPLICATIONS FOR MITIGATION OR ADAPTATION AGAINST CLIMATE CHANGE
- Y02B—CLIMATE CHANGE MITIGATION TECHNOLOGIES RELATED TO BUILDINGS, e.g. HOUSING, HOUSE APPLIANCES OR RELATED END-USER APPLICATIONS
- Y02B30/00—Energy efficient heating, ventilation or air conditioning [HVAC]
- Y02B30/70—Efficient control or regulation technologies, e.g. for control of refrigerant flow, motor or heating
Abstract
A method and apparatus for managing coherence between two processors (12 and 14) of a two processor node of a multi-processor system (10). Generally the present invention relates to a software algorithm that simplifies and significantly speeds the management of cache coherence in a message passing parallel computer, and to hardware apparatus that assists this cache coherence algorithm. The software algorithm uses the opening and closing of put/get windows (30) to coordinate the activities required to achieve cache coherence. The hardware apparatus may be an extension to the hardware address decoder that creates, in the physical memory address space of the node, an area of virtual memory that does not actually exist, and is therefore able to respond instantly to read and write requests from the processing elements.
Description
MANAGING COHERENCE VIA PUT/GET WINDOWS
CROSS-REFERENCE TO RELATED APPLICATIONS
The present invention claims the benefit of commonly-owned, co-pending United States Provisional Patent Application Serial Number 60/271,124, filed February 24, 2001, entitled MASSIVELY PARALLEL SUPERCOMPUTER, the whole contents and disclosure of which is expressly incorporated by reference herein as if fully set forth herein. This patent application is additionally related to the following commonly-owned, co-pending United States Patent Applications filed on even date herewith, the entire contents and disclosure of each of which is expressly incorporated by reference herein as if fully set forth herein. U.S. patent application Serial No. (YOR920020027US1, YOR920020044US1 (15270)), for "Class Networking Routing"; U.S. patent application Serial No. (YOR920020028US1 (15271)), for "A Global Tree Network for Computing Structures"; U.S. patent application Serial No. (YOR920020029US1 (15272)), for "Global Interrupt and Barrier Networks"; U.S. patent application Serial No. (YOR920020030US1 (15273)), for "Optimized Scalable Network Switch"; U.S. patent application Serial No. (YOR920020031US1, YOR920020032US1 (15258)), for "Arithmetic Functions in Torus and Tree Networks"; U.S. patent application Serial No. (YOR920020033US1, YOR920020034US1 (15259)), for "Data Capture Technique for High Speed Signaling"; U.S. patent application Serial No. (YOR920020035US1 (15260)), for "Managing Coherence Via Put/Get Windows"; U.S. patent application Serial No. (YOR920020036US1, YOR920020037US1 (15261)), for "Low Latency Memory Access And Synchronization"; U.S. patent application Serial No. (YOR920020038US1 (15276)), for "Twin-Tailed Fail-Over for Fileservers Maintaining Full Performance in the Presence of Failure"; U.S. patent application Serial No. (YOR920020039US1 (15277)), for "Fault Isolation Through No-Overhead Link Level Checksums"; U.S. patent application Serial No. (YOR920020040US1 (15278)), for "Ethernet Addressing Via Physical Location for Massively Parallel Systems"; U.S. patent application Serial No. (YOR920020041US1 (15274)), for "Fault Tolerance in a Supercomputer Through Dynamic Repartitioning"; U.S. patent application Serial No.
(YOR920020042US1 (15279)), for "Checkpointing Filesystem"; U.S. patent application Serial No. (YOR920020043US1 (15262)), for "Efficient Implementation of Multidimensional Fast Fourier Transform on a Distributed-Memory Parallel Multi-Node Computer"; U.S. patent application Serial No. (YOR9-20010211US2 (15275)), for "A Novel Massively Parallel Supercomputer"; and U.S. patent application Serial No. (YOR920020045US1 (15263)), for "Smart Fan Modules and System".
BACKGROUND OF THE INVENTION
1. Field of the Invention This invention relates to the field of distributed-memory message-passing parallel computer design and system software, as applied for example to computation in the field of life sciences.
2. Background Art In provisional patent application no. 60/271,124, titled "A Novel Massively Parallel Supercomputer," there is described a semiconductor device with two electronic processors within each node of the multi-computer. One of these processors is designated the "Compute Processor," and is dedicated to application computation. The other processor is the "I/O Processor," which, in most modes of operation, is a service processor that performs the communication activities required of a message-passing multi-computer. Each of these processors contains a separate level 1 data cache (L1), which contains a copy of data in the common memory accessed by both processors. If one processor changes the value of a location in its L1, and the other processor has information from the same location of common memory in its L1, the two copies of the same location are said to be "coherent" if they are made to be the same.
A longstanding problem in the field of computer design is how to keep these L1 caches coherent. Typical solutions employ techniques known as "snooping" the memory bus of the other processor, which is slow and reduces the performance of each processor. Alternatively, the processor that contains an old copy in L1 of the data in the common memory can request a new copy, or mark the old copy obsolete, but this requires knowledge of when the copy became invalid. Sometimes this knowledge is incomplete, forcing unnecessary memory operations and further reducing performance. Other computers make use of "interlocks," whereby one processor is granted permission to use certain data while the other processor cannot, but this permission involves interactions between the two processors, which require additional complex circuitry in the semiconductor device and reduce the performance of the two processors.
Still other solutions in the current practice create a non-cached area of memory to (and/or from) which data are copied. This practice introduces overheads in the performance-critical path of message passing that often makes it impossible to sustain the full bandwidth of the underlying communication hardware. Still other solutions in common practice disable all caching for areas of memory intended to be used for communication. This practice penalizes all memory accesses to these areas, not just those required for message passing.
In the dual processor node disclosed in the above-identified provisional patent application no. 60/271,124, the condition of having two processors, each with a copy of the same memory location in its local L1, is a very common occurrence, since the service processor is gathering data and passing that data to the compute processor. If the messages passed are small and numerous, the problem is exacerbated. Thus, there is a clear need to find a way to make the L1 caches of each processor coherent, without extensive circuitry and with minimal slowdown of each processor.
SUMMARY OF THE INVENTION An object of this invention is to provide an improved procedure for managing coherence in a parallel processing computer system.
Another object of the present invention is to achieve a desired level of coherency between the caches of the processors of a two processor node without extensive circuitry and with minimal slowdown of each processor.
A further object of the invention is to provide a method and apparatus, working in conjunction with software algorithms, to efficiently accomplish high-speed message-passing communications between processors (or between a processor and a direct memory access, or DMA, device), maintaining coherence without significantly reducing performance.
These and other objectives are attained with the method and apparatus of the present invention. In accordance with a first aspect, the invention provides a software algorithm that simplifies and significantly speeds the management of cache coherence in a message passing massively parallel supercomputer (such as the one described in copending patent application no. (attorney Docket 15275)) containing two or more non-coherent processing elements (or even a DMA controller) where one processing element is primarily performing calculations, while the other element is performing message passing activities. Briefly, this algorithm uses the opening and closing of "Put/Get Windows" to coordinate the activities required to achieve cache coherence.
Specifically, because the messages cannot actually be sent or received until after cache coherence has been guaranteed, the invention provides a mechanism to ensure that the data being "put" (sent) is not in the cache of either processor, and that the data being "gotten" (received) is also not in the cache of either processor. By coordinating these activities upon opening and closing the "Put/Get Window", the invention reduces the total amount of work required to achieve coherence and allows that work to be amortized over a large number of individual messages. Also, since both processing elements within a node must perform this work, the invention enables this to happen concurrently. Further, when required, these activities can be coordinated over a large number of independent nodes in the massively parallel machine by employing the Global Barrier Network described in copending patent application no. (attorney Docket 15275).
In accordance with a second aspect, the invention provides a hardware apparatus that assists the above-described cache coherence software algorithm, and limits the total time (or latency) required to achieve cache coherence over the Put/Get Window. This apparatus is a simple extension to the hardware address decoder that creates, in the physical memory address space of the node, an area of virtual memory that (a) does not actually exist, and (b) is therefore able to respond instantly to read and write requests from the processing elements. This further speeds the coherence activities because it allows garbage data (which the processor will never use) to be pulled into the processor's cache, thereby evicting just the modified data and displacing unmodified data with optimal performance. The performance is faster because this garbage data does not actually need to be fetched from memory; rather, the memory controller need only reply instantly.
The performance is also faster because only actually modified data is written to memory from cache, while clean data is simply instantly discarded. Further, for the case where the total size of the "Put/Get Window" exceeds, perhaps greatly, the size of the processor's cache, cleaning the cache in this manner provides an upper bound on the total amount of work that is required to ensure that no data from the communication area remains in the cache. It may be noted that, independent of the above-described software algorithms, this hardware device is useful for computer systems in general which employ a Least Recently Used cache replacement policy.
Also, two specific software instructions may be used in the preferred implementation of the invention. One instruction, termed "data cache block flush and invalidate", may be used to write data from the memory area of the first processor into the shared memory area while, at the same time, preventing the first processor from using the data written in its memory area. A second software instruction, termed "data cache block zero", may be used to write data from the memory area of the first processor into the shared memory. By using these, or similar, software instructions, the method and apparatus of the invention, working in conjunction with software algorithms, achieve high-speed message-passing communications between processors while maintaining coherence without significantly reducing performance.
Further benefits and advantages of the invention will become apparent from a consideration of the following detailed description, given with reference to the accompanying drawings, which specify and show preferred embodiments of the invention.
BRIEF DESCRIPTION OF THE DRAWINGS Figure 1 shows a two processor node embodying this invention.
Figure 2 illustrates a put/get window that may be used in the practice of this invention.
DETAILED DESCRIPTION OF THE PREFERRED EMBODIMENTS The present invention relates to a method and apparatus for managing coherence between two processors of a two processor node of a multi-processor computer system. Figure 1 illustrates a node 10 that may embody this invention. Each of the processors 12, 14 of node 10 has a respective memory area 16, 20, and the two processors share a third memory area 22. Generally the present invention relates to a software algorithm that simplifies and significantly speeds the management of cache coherence in a message passing parallel computer, and to hardware apparatus that assists this cache coherence algorithm. The software algorithm uses the opening and closing of put/get windows to coordinate the activities required to achieve cache coherence. The hardware apparatus may be an extension to the hardware address decoder that creates, in the physical memory address space of the node, an area of virtual memory that (a) does not actually exist, and (b) is therefore able to respond instantly to read and write requests from the processing elements.
As indicated above, this invention utilizes a principle referred to as "put/get" data transfer. As parallel multi-computers are scaled to increasing numbers of nodes, typical application messaging traffic involves an increasing number of messages, where each such message contains a piece of work performed by other nodes in the multi-computer. Generally, one node scatters locally produced work items to numerous other nodes (a "put"), while assembling numerous remotely produced work items into its local memory (a "get"). Overall performance for these multi-computers is often gated by the message passing performance of the system.
For such data transfers, a particularly efficient message-passing interface, described in the literature (see, for example, http://www.mpi-forum.org/docs/docs.html under MPI-2), is known as One-Sided Communication. One-Sided Communication uses a "put/get" message-passing paradigm, where messages carry the source (for "get") or destination (for "put") memory address. In parallel supercomputers operating on a common problem, puts and gets are typically assembled in batches and issued simultaneously. This keeps independently operating processors in rough synchronization, allowing good performance on a common problem. The time during which puts and gets occur is termed the put/get window. This window extends both in time (when it occurs) and in memory (the range of memory addresses over which the data in the put or get resides). Figure 2 shows a put/get window 30 having a number of distinct messages.
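To make the one-sided style concrete, below is a minimal sketch using the standard MPI-2 RMA calls (MPI_Win_create, MPI_Put, MPI_Win_fence); the fence-delimited access epoch is loosely analogous to the put/get window described here. The ring pattern and buffer sizes are illustrative assumptions, not taken from this disclosure.

```c
#include <mpi.h>
#include <stdio.h>

/* Each rank exposes one int to remote access, then puts its rank into
 * its right neighbor inside a fence-delimited epoch -- the epoch is
 * loosely the "put/get window" of the text. */
int main(int argc, char **argv)
{
    int rank, size, recv_buf = -1;
    MPI_Win win;

    MPI_Init(&argc, &argv);
    MPI_Comm_rank(MPI_COMM_WORLD, &rank);
    MPI_Comm_size(MPI_COMM_WORLD, &size);

    /* Expose recv_buf to remote puts and gets. */
    MPI_Win_create(&recv_buf, sizeof(int), sizeof(int),
                   MPI_INFO_NULL, MPI_COMM_WORLD, &win);

    MPI_Win_fence(0, win);                  /* "open" the window */
    int right = (rank + 1) % size;
    MPI_Put(&rank, 1, MPI_INT, right, 0, 1, MPI_INT, win);
    MPI_Win_fence(0, win);                  /* "close": all puts complete */

    printf("rank %d received %d\n", rank, recv_buf);
    MPI_Win_free(&win);
    MPI_Finalize();
    return 0;
}
```

Each fence is a collective synchronization point, which is why batching puts and gets inside one window keeps independently operating processors in rough lockstep.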
The present invention utilizes this put/get window to provide a simple means to manage memory coherence. In accordance with a first aspect, a software algorithm is provided that simplifies and significantly speeds the management of cache coherence in a message passing massively parallel supercomputer (such as the one described in copending patent application no. (attorney Docket 15275)) containing two or more non-coherent processing elements (or even a DMA controller) where one processing element is primarily performing calculations, while the other element is performing message passing activities. Briefly, this algorithm uses the opening and closing of "Put/Get Windows" to coordinate the activities required to achieve cache coherence.
Specifically, because the messages cannot actually be sent or received until after cache coherence has been guaranteed, this invention provides a mechanism to ensure that the data being "put" (sent) is not in the cache of either processor, and that the data being "gotten" (received) is also not in the cache of either processor. By coordinating these activities upon opening and closing the "Put/Get Window", this invention reduces the total amount of work required to achieve coherence and allows that work to be amortized over a large number of individual messages. Also, since both processing elements within a node must perform this work, this invention enables this to happen concurrently. Further, when required, these activities can be coordinated over a large number of independent nodes in the massively parallel machine by employing the Global Barrier Network described in copending patent application no. (attorney Docket 15275).
This algorithm is assisted by the hardware described below but, even in the absence of that apparatus, benefits message-passing computers in general. Without the apparatus, a special reserved area of physical memory, equal in size to the processor's cache, may be utilized, albeit at reduced performance, by loading from this physical area into cache, issuing a DCBT (Data Cache Block Touch) instruction for each cache line of the reserved physical area.
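As a rough illustration only, the sketch below cleans L1 by touching every line of such a reserved, cache-sized area with DCBT, so that each fill displaces one resident line. The cache geometry, the helper name, and the GCC PowerPC inline-assembly form are assumptions, not part of the disclosure.

```c
/* Software-only fallback: touch each line of a reserved, cache-sized
 * area so every fill displaces one resident L1 line; evicted modified
 * lines are written back to common memory.  Sizes are assumptions. */
#define L1_SIZE    (32 * 1024UL)  /* assumed 32 KB L1 */
#define CACHE_LINE 32UL           /* assumed 32-byte lines */

static void clean_l1_via_dcbt(const volatile char *reserved_area)
{
    for (unsigned long off = 0; off < L1_SIZE; off += CACHE_LINE)
        __asm__ volatile("dcbt 0,%0" :: "r"(reserved_area + off));
    __asm__ volatile("sync" ::: "memory");  /* wait for the write-backs */
}
```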
In accordance with a second aspect of the invention, a novel hardware apparatus is provided that assists the above-described cache coherence algorithm, and limits the total time (or latency) required to achieve cache coherence over the Put/Get Window. This apparatus is a simple extension to the hardware address decoder that creates, in the physical memory address space of the node, an area of virtual memory that (a) does not actually exist, and (b) is therefore able to respond instantly to read and write requests from the processing elements. This further speeds the coherence activities because it allows garbage data (which the processor will never use) to be pulled into the processor's cache, thereby evicting just the modified data and displacing unmodified data with optimal performance. The performance is faster because this garbage data
does not actually need to be fetched from memory; rather, the memory controller need only reply instantly.
The performance is also faster because only actually modified data is written to memory from cache, while clean data is simply instantly discarded. Further, for the case where the total size of the "Put/Get Window" exceeds, perhaps greatly, the size of the processor's cache, cleaning the cache in this manner provides an upper bound on the total amount of work that is required to ensure that no data from the communication area remains in the cache. For example, assuming a fully associative cache, if the communication area is 16 Megabytes (a common occurrence), traditional cache flush techniques would require (16 MB / 32 B per cache line =) 524,288 DCBF instructions, while the algorithm described here would require at most 1,024 DCBT instructions if the processor's cache were 32 Kilobytes in size with 32-byte cache lines. It may be noted that, independent of the above-described software algorithm, this hardware device is useful for computer systems in general which employ a Least Recently Used cache replacement policy.
Two specific software embodiments are described below. The first embodiment may be preferred if the size of the message being received is smaller than the size of L1, while the second embodiment may be preferred if the size of the message received is larger than L1.
First embodiment: If the size of the message being received is smaller than the size of L1.
In this case, the invention makes use of a software instruction termed "data cache block flush and invalidate" (DCBF), whereby a contiguous range of memory is written from L1 back to the common memory. DCBF is a PowerPC BookE instruction; similar instructions exist for other processors. At the same time, the data in the cache is marked as invalid, and cannot be used without reloading the contents of the common memory. A DCBF is issued for every line in the address window.
More specifically, when the window is opened for puts or gets, software (in the communication library) instructs the receiving processor (the compute processor in our dual processor node) to flush the contents of L1 in the address window, as described above. This simple operation ensures that the data in common memory are the same as the data in the compute processor's cache and further, because of the invalidate, allows an opportunity for the I/O processor to change the contents of the common memory. The software then instructs the I/O processor to proceed until all expected messages arrive. The software then allows the compute processor to continue to process instructions, and closes the put/get window using a global "and."
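A minimal sketch of this flush step, issuing a DCBF for every cache line in the put/get address window, might look as follows; the line size, helper name, and inline-assembly form are illustrative assumptions.

```c
/* First embodiment: flush and invalidate every L1 line inside the
 * put/get address window, so common memory matches the cache and the
 * I/O processor may safely rewrite it.  Line size is an assumption. */
#define CACHE_LINE 32UL  /* assumed 32-byte lines */

static void dcbf_window(const volatile char *window, unsigned long len)
{
    for (unsigned long off = 0; off < len; off += CACHE_LINE)
        __asm__ volatile("dcbf 0,%0" :: "r"(window + off) : "memory");
    __asm__ volatile("sync" ::: "memory");  /* flushes complete here */
}
```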
Second embodiment: If the size of the message received is larger than the size of L1. In this case, the invention makes use of an instruction termed "data cache block zero" (DCBZ) and reserves a contiguous physical address range equal in size to L1. DCBZ creates a new cache line with contents of zero. If a new cache line is not available, then another cache line in L1 (for example, the least recently used line) has its data written back to the common memory, and is then zeroed with the address given by the DCBZ instruction. DCBZ is a PowerPC BookE instruction; similar instructions exist for other processors. When the software executes DCBZ to all of L1, with addresses in the reserved space, all lines in L1 are flushed, i.e., all modified lines are written back to common memory. The software then allows the compute processor to continue to process instructions, and closes the put/get window using a global "and".
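Under the same illustrative assumptions (32 KB L1, 32-byte lines, GCC PowerPC inline assembly), the second embodiment's cache clean might be sketched as follows: executing DCBZ over an L1-sized reserved range forces every resident modified line to be written back.

```c
/* Second embodiment: zero-allocate one cache line per line of L1 at
 * addresses in the (protected, possibly nonexistent) reserved range,
 * evicting and writing back every resident modified line. */
#define L1_SIZE    (32 * 1024UL)  /* assumed */
#define CACHE_LINE 32UL           /* assumed */

static void clean_l1_via_dcbz(char *reserved_base)
{
    for (unsigned long off = 0; off < L1_SIZE; off += CACHE_LINE)
        __asm__ volatile("dcbz 0,%0" :: "r"(reserved_base + off) : "memory");
    __asm__ volatile("sync" ::: "memory");
}
```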
It may be noted that the reserved physical space need not actually exist; only the address range must be protected, that is, not written to or read from by the user. All writes to this reserved memory space can, in principle, be ignored by the memory controller. All reads to this reserved memory space can, in principle, immediately return an arbitrary value to the requesting processor's L1.
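In software terms, the decoder extension behaves like the toy model below: accesses falling in the reserved range complete instantly, with writes dropped and reads returning an arbitrary value. All addresses, sizes, and names here are hypothetical.

```c
#include <stdint.h>
#include <stdio.h>

#define RESERVED_BASE 0xF0000000UL   /* hypothetical base of reserved range */
#define RESERVED_SIZE (32 * 1024UL)  /* hypothetical: one L1's worth */

static uint8_t ram[64 * 1024];       /* toy backing store for real addresses */

static int in_reserved(uint32_t addr)
{
    return addr - (uint32_t)RESERVED_BASE < RESERVED_SIZE;
}

static uint8_t mem_read(uint32_t addr)
{
    if (in_reserved(addr))
        return 0xAA;                 /* arbitrary: the processor never uses it */
    return ram[addr % sizeof ram];   /* normal (toy) memory path */
}

static void mem_write(uint32_t addr, uint8_t val)
{
    if (in_reserved(addr))
        return;                      /* ignored: no backing storage exists */
    ram[addr % sizeof ram] = val;
}

int main(void)
{
    mem_write(RESERVED_BASE + 8, 0x42);            /* silently dropped */
    printf("%02x\n", mem_read(RESERVED_BASE + 8)); /* instant, arbitrary value */
    return 0;
}
```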
It may also be noted that if DCBF instructions are slower than DCBZ, then the operating system may use the DCBZ instruction for messages smaller than L1 and vice versa.
Using this invention, the I/O Processor need not flush its cache at all if the communication memory space is marked write-through to its LI cache.
The making of the above-mentioned global "and" in a short interval of time, which allows the put/get window to be made temporarily narrow, is discussed in detail in related patent application no. (attorney Docket 15258).
While it is apparent that the invention herein disclosed is well calculated to fulfill the objects previously stated, it will be appreciated that numerous modifications and embodiments may be devised by those skilled in the art, and it is intended that the appended claims cover all such modifications and embodiments as fall within the true spirit and scope of the present invention.
Claims
1. A method of simplifying and speeding the management of cache coherence in a message passing parallel supercomputer including two or more non-coherent processor elements, where one processor element is primarily performing calculations, while the other processor element is performing message passing activities, the method comprising the steps of:
opening and closing a put/get window;
performing activities to achieve cache coherence; and
using said opening and closing of the put/get window to coordinate the activities to achieve cache coherence.
2. A method according to Claim 1, wherein the method is implemented by a software algorithm.
3. A method according to Claim 1, wherein said using step includes the step of ensuring that data being sent is not in the cache of either processor, and that the data being received is also not in the cache of either processor.
4. A method according to Claim 3, wherein the ensuring step includes the step of loading data into cache by issuing a software command.
5. A program storage device readable by machine, tangibly embodying a program of instructions executable by the machine to perform method steps for simplifying and speeding the management of cache coherence in a message passing parallel supercomputer including two or more non-coherent processor elements, where one processor element is primarily performing calculations, while the other processor element is performing message passing activities, the method steps comprising:
opening and closing a put/get window;
performing activities to achieve cache coherence; and
using said opening and closing of the put/get window to coordinate the activities to achieve cache coherence.
6. A program storage device according to Claim 5, wherein said using step includes the step of ensuring that data being sent is not in the cache of either processor, and that the data being received is also not in the cache of either processor.
7. A program storage device according to Claim 6, wherein the ensuring step includes the step of loading data into cache by issuing a software command.
8. A system to simplify and speed the management of cache coherence in a message passing parallel supercomputer including two or more non-coherent processor elements, where one processor element is primarily performing calculations, while the other processor element is performing message passing activities, the system comprising:
means for opening and closing a put/get window;
means for performing activities to achieve cache coherence; and
means for using said opening and closing of the put/get window to coordinate the activities to achieve cache coherence.
9. A system according to Claim 8, wherein said using means includes means for ensuring that data being sent is not in the cache of either processor, and that the data being received is also not in the cache of either processor.
10. A system according to Claim 9, wherein the ensuring means includes means for loading data into cache by issuing a software command.
11. Hardware apparatus to assist achieving cache coherence in a message passing parallel computer including two or more non-coherent processing elements, where one processing element is principally performing calculations, while the second processing element is performing message passing activities, the hardware apparatus comprising:
a memory controller to create, in the physical memory address space of the node, an area of virtual memory that (a) does not actually exist, and (b) is therefore able to respond instantly to read and write requests from the processing elements.
12. Hardware apparatus according to Claim 11, wherein the memory controller allows garbage data, which the processor will never use, to be pulled into the processor's cache, thereby evicting just the modified data and displacing unmodified data with optimal performance.
13. Hardware apparatus according to Claim 12, wherein the garbage data does not actually need to be fetched from memory, rather, the memory controller need only instantly reply.
14. Hardware apparatus according to Claim 13, wherein only actually modified data is written to memory from cache, while clean data is simply instantly discarded.
15. Hardware apparatus according to Claim 14, wherein, when the total size of the put/get window exceeds the size of the processor's cache, cleaning the cache in this manner provides an upper bound on the total amount of work that is required to ensure that no data from the communication area remains in the cache.
16. A method of operating computer hardware apparatus to assist achieving cache coherence in a message passing parallel computer including two or more non-coherent processing elements, where one processing element is principally performing calculations, while the second processing element is performing message passing activities, the method comprising the step of:
using a memory controller to create, in the physical memory address space of the node, an area of virtual memory that (a) does not actually exist, and (b) is therefore able to respond instantly to read and write requests from the processing elements.
17. A method according to Claim 16, wherein the memory controller allows garbage data, which the processor will never use, to be pulled into the processor's cache, thereby evicting just the modified data and displacing unmodified data with optimal performance.
18. A method according to Claim 17, wherein the garbage data does not actually need to be fetched from memory, rather, the memory controller need only instantly reply.
19. A method according to Claim 18, wherein only actually modified data is written to memory from cache, while clean data is simply instantly discarded.
20. A method according to Claim 19, wherein, when the total size of the put/get window exceeds the size of the processor's cache, cleaning the cache in this manner provides an upper bound on the total amount of work that is required to ensure that no data from the communication area remains in the cache.
Priority Applications (1)
Application Number | Priority Date | Filing Date | Title |
---|---|---|---|
PCT/US2002/005615 WO2002069152A1 (en) | 2001-02-24 | 2002-02-25 | Managing coherence via put/get windows |
Applications Claiming Priority (3)
Application Number | Priority Date | Filing Date | Title |
---|---|---|---|
US27112401P | 2001-02-24 | 2001-02-24 | |
US60/271,124 | 2001-02-24 | ||
PCT/US2002/005615 WO2002069152A1 (en) | 2001-02-24 | 2002-02-25 | Managing coherence via put/get windows |
Publications (1)
Publication Number | Publication Date |
---|---|
WO2002069152A1 (en) | 2002-09-06 |
Family
ID=68499858
Family Applications (1)
Application Number | Title | Priority Date | Filing Date |
---|---|---|---|
PCT/US2002/005615 WO2002069152A1 (en) | 2001-02-24 | 2002-02-25 | Managing coherence via put/get windows |
Country Status (1)
Country | Link |
---|---|
WO (1) | WO2002069152A1 (en) |
- 2002
  - 2002-02-25 WO PCT/US2002/005615 patent/WO2002069152A1/en not_active Application Discontinuation
Patent Citations (2)
Publication number | Priority date | Publication date | Assignee | Title |
---|---|---|---|---|
US4682328A (en) * | 1985-08-15 | 1987-07-21 | Mitel Corporation | Dynamic memory refresh and parity checking circuit |
US6223269B1 (en) * | 1997-09-27 | 2001-04-24 | Emc Corporation | Stacked mapped storage system |
Non-Patent Citations (4)
Title |
---|
ANONYMOUS: "Re: non-blocking ops vs. threads", MPI-CORE, July 1995 (1995-07-01), pages 1 - 2, XP002951110, Retrieved from the Internet <URL:http://www.mpi-forum.org/archive/mail/mpi-core/0063.html> * |
GEORGE, R.: "Re: non-blocking ops vs. threads", MPI-CORE, July 1995 (1995-07-01), pages 1 - 3, XP002951109, Retrieved from the Internet <URL:http://www.mpi-forum.org/archives/mail/mpi-core/0056.html> * |
MPI-2: EXTENSIONS TO THE MESSAGE-PASSING INTERFACE, July 1997 (1997-07-01), pages 109 - 143, XP002951111, Retrieved from the Internet <URL:http://www.mpi-forum.org/docs/> * |
SALO, E.: "Re: non-blocking ops vs. threads", MPI-CORE, July 1995 (1995-07-01), pages 1, XP002951108, Retrieved from the Internet <URL:http://www.mpi-forum.org/archive/mail/mpi-core/0057.html> * |
Similar Documents
Publication | Publication Date | Title |
---|---|---|
US8122197B2 (en) | Managing coherence via put/get windows | |
Iftode et al. | Improving release-consistent shared virtual memory using automatic update | |
US7657710B2 (en) | Cache coherence protocol with write-only permission | |
JP4082612B2 (en) | Multiprocessor computer system with multiple coherency regions and software process migration between coherency regions without cache purge | |
US6286090B1 (en) | Mechanism for selectively imposing interference order between page-table fetches and corresponding data fetches | |
US7174434B2 (en) | Low latency memory access and synchronization | |
JPH10133917A (en) | Multiprocess system with coherency-relative error logging capability | |
JPH10187645A (en) | Multiprocess system constituted for storage in many subnodes of process node in coherence state | |
JPH10116253A (en) | Multiprocess system executing synchronous operation | |
JPH10143482A (en) | Multiprocessor system for executing efficient write operation | |
JPH10149342A (en) | Multiprocess system executing prefetch operation | |
JPH10340227A (en) | Multi-processor computer system using local and global address spaces and multi-access mode | |
JPH10143476A (en) | Multiprocess system for executing software for starting prefetch operation | |
JPH10143477A (en) | Multiprocess system provided with reinforced blocking mechanism for read-to-share transaction in numa mode | |
JPH10214230A (en) | Multiprocessor system adopting coherency protocol including response count | |
JPH10134014A (en) | Multiprocess system using three-hop communication protocol | |
JP4577729B2 (en) | System and method for canceling write back processing when snoop push processing and snoop kill processing occur simultaneously in write back cache | |
EP1376368B1 (en) | Mechanism for maintaining cache consistency in computer systems | |
US5875468A (en) | Method to pipeline write misses in shared cache multiprocessor systems | |
Andrews et al. | Notification and multicast networks for synchronization and coherence | |
WO2002069152A1 (en) | Managing coherence via put/get windows | |
James | SCI (Scalable Coherent Interface) Cache Coherence | |
Blumrich et al. | Managing coherence via put/get windows | |
Blumrich et al. | Simplifying and speeding the management of intra-node cache coherence | |
Cheriton et al. | Optimized memory-based messaging: Leveraging the memory system for high-performance communication |
Legal Events
Date | Code | Title | Description |
---|---|---|---|
 | 121 | Ep: the epo has been informed by wipo that ep was designated in this application | |
 | REG | Reference to national code | Ref country code: DE; Ref legal event code: 8642 |
 | 122 | Ep: pct application non-entry in european phase | |
 | NENP | Non-entry into the national phase | Ref country code: JP |
 | WWW | Wipo information: withdrawn in national office | Country of ref document: JP |