US7627744B2 - External memory accessing DMA request scheduling in IC of parallel processing engines according to completion notification queue occupancy level - Google Patents
- Publication number
- US7627744B2 (application No. US11/798,119)
- Authority
- US
- United States
- Prior art keywords
- dma
- vpus
- messages
- vcus
- data
- Prior art date
- Legal status
- Active, expires
Classifications
-
- G—PHYSICS
- G06—COMPUTING; CALCULATING OR COUNTING
- G06F—ELECTRIC DIGITAL DATA PROCESSING
- G06F9/00—Arrangements for program control, e.g. control units
- G06F9/06—Arrangements for program control, e.g. control units using stored programs, i.e. using an internal store of processing equipment to receive or retain programs
- G06F9/46—Multiprogramming arrangements
-
- G—PHYSICS
- G06—COMPUTING; CALCULATING OR COUNTING
- G06F—ELECTRIC DIGITAL DATA PROCESSING
- G06F9/00—Arrangements for program control, e.g. control units
- G06F9/06—Arrangements for program control, e.g. control units using stored programs, i.e. using an internal store of processing equipment to receive or retain programs
- G06F9/30—Arrangements for executing machine instructions, e.g. instruction decode
- G06F9/38—Concurrent instruction execution, e.g. pipeline or look ahead
- G06F9/3885—Concurrent instruction execution, e.g. pipeline or look ahead using a plurality of independent parallel functional units
-
- G—PHYSICS
- G06—COMPUTING; CALCULATING OR COUNTING
- G06F—ELECTRIC DIGITAL DATA PROCESSING
- G06F9/00—Arrangements for program control, e.g. control units
- G06F9/06—Arrangements for program control, e.g. control units using stored programs, i.e. using an internal store of processing equipment to receive or retain programs
- G06F9/30—Arrangements for executing machine instructions, e.g. instruction decode
- G06F9/30145—Instruction analysis, e.g. decoding, instruction word fields
- G06F9/3016—Decoding the operand specifier, e.g. specifier format
- G06F9/30167—Decoding the operand specifier, e.g. specifier format of immediate specifier, e.g. constants
-
- G—PHYSICS
- G06—COMPUTING; CALCULATING OR COUNTING
- G06F—ELECTRIC DIGITAL DATA PROCESSING
- G06F9/00—Arrangements for program control, e.g. control units
- G06F9/06—Arrangements for program control, e.g. control units using stored programs, i.e. using an internal store of processing equipment to receive or retain programs
- G06F9/30—Arrangements for executing machine instructions, e.g. instruction decode
- G06F9/38—Concurrent instruction execution, e.g. pipeline or look ahead
- G06F9/3885—Concurrent instruction execution, e.g. pipeline or look ahead using a plurality of independent parallel functional units
- G06F9/3889—Concurrent instruction execution, e.g. pipeline or look ahead using a plurality of independent parallel functional units controlled by multiple instructions, e.g. MIMD, decoupled access or execute
- G06F9/3891—Concurrent instruction execution, e.g. pipeline or look ahead using a plurality of independent parallel functional units controlled by multiple instructions, e.g. MIMD, decoupled access or execute organised in groups of units sharing resources, e.g. clusters
-
- G—PHYSICS
- G06—COMPUTING; CALCULATING OR COUNTING
- G06F—ELECTRIC DIGITAL DATA PROCESSING
- G06F9/00—Arrangements for program control, e.g. control units
- G06F9/06—Arrangements for program control, e.g. control units using stored programs, i.e. using an internal store of processing equipment to receive or retain programs
- G06F9/44—Arrangements for executing specific programs
- G06F9/455—Emulation; Interpretation; Software simulation, e.g. virtualisation or emulation of application or operating system execution engines
-
- G—PHYSICS
- G06—COMPUTING; CALCULATING OR COUNTING
- G06T—IMAGE DATA PROCESSING OR GENERATION, IN GENERAL
- G06T1/00—General purpose image data processing
-
- G—PHYSICS
- G06—COMPUTING; CALCULATING OR COUNTING
- G06F—ELECTRIC DIGITAL DATA PROCESSING
- G06F9/00—Arrangements for program control, e.g. control units
- G06F9/06—Arrangements for program control, e.g. control units using stored programs, i.e. using an internal store of processing equipment to receive or retain programs
- G06F9/30—Arrangements for executing machine instructions, e.g. instruction decode
- G06F9/30098—Register arrangements
- G06F9/30101—Special purpose registers
Definitions
- Embodiments of the present invention relate generally to circuits and methods for performing massively parallel computations. More particularly, embodiments of the invention relate to an integrated circuit architecture and related methods adapted to generate real-time physics simulations.
- Any visual display of objects and/or environments interacting in accordance with a defined set of physical constraints may generally be considered a “physics-based” simulation.
- Animated environments and objects are typically assigned physical characteristics (e.g., mass, size, location, friction, movement attributes, etc.) and thereafter allowed to visually interact in accordance with the defined set of physical constraints.
- All animated objects are visually displayed by a host system using a periodically updated body of data derived from the assigned physical characteristics and the defined set of physical constraints. This body of data is generically referred to hereafter as “physics data.”
- the frame rate of the host system display necessarily restricts the size and complexity of the physics problems underlying the physics-based simulation in relation to the speed with which the physics problems can be resolved.
- the design emphasis becomes one of increasing data processing speed.
- Data processing speed is determined by a combination of data transfer capabilities and the speed with which the mathematical/logic operations are executed.
- the speed with which the mathematical/logic operations are performed may be increased by sequentially executing the operations at a faster rate, and/or by dividing the operations into subsets and thereafter executing selected subsets in parallel. Accordingly, data bandwidth considerations and execution speed requirements largely define the architecture of a system adapted to generate physics based simulations in real-time.
- the nature of the physics data being processed also contributes to the definition of an efficient system architecture.
- FIG. 1 shows a physics processing unit (PPU) 100 adapted to perform a large number of parallel computations for a physics-based simulation.
- PPU 100 typically executes physics-based computations as part of a secondary application coupled to a main application running in parallel on a host system.
- the main application may comprise an interactive game program that defines a “world state” (e.g., positions, constraints, etc.) for a collection of visual objects.
- the main application coordinates user input/output (I/O) for the game program and performs ongoing updates of the world state.
- the main application also sends data to the secondary application based on the user inputs and the secondary application performs physics-based computations to modify the world state.
- As the secondary application modifies the world state, it periodically and asynchronously sends the modified world state to the main application.
- By partitioning the workload between the main and secondary applications so that the secondary application runs in parallel and asynchronously with the main application, the implementation and programming of the PPU, as well as both of the applications, are substantially simplified. For example, the partitioning allows the main application to check for updates to the world state when convenient, rather than forcing it to conform to the timing of the secondary application.
- PPU 100 can be implemented in a variety of different ways. For example, it could be implemented as a co-processor chip connected to a host system such as a conventional CPU. Similarly, it could be implemented as part of one processor core in a dual core processor. Indeed, those skilled in the art will recognize a wide variety of ways to implement the functionality of PPU 100 in hardware. Moreover, those skilled in the art will also recognize that hardware/software distinctions can be relatively arbitrary, as hardware capability can often be implemented in software, and vice versa.
- the PPU illustrated in FIG. 1 comprises a high-bandwidth external memory 102 , a Data Movement Engine (DME) 101 , a PPU Control Engine (PCE) 103 , and a plurality of Vector Processing Engines (VPEs) 105 .
- Each of VPEs 105 comprises a plurality of Vector Processing Units (VPUs) 107 , each having a primary (L 1 ) memory, and a VPU Control Unit (VCU) 106 having a secondary (L 2 ) memory.
- DME 101 provides a data transfer path between external memory 102 (and/or a host system 108 ) and VPEs 105 .
- PCE 103 is adapted to centralize overall control of the PPU and/or a data communications process between PPU 100 and host system 108 .
- PCE 103 typically comprises a programmable PPU control unit (PCU) 104 for storing and executing PCE control and communications programming.
- PCU 104 may comprise a MIPS64 5Kf processor core from MIPS Technologies, Inc.
- Each of VPUs 107 can be generically considered a “data processing unit,” which is a lower level grouping of mathematical/logic execution units such as floating point processors and/or scalar processors.
- the primary memory L 1 of each VPU 107 is generally used to store instructions and data for executing various mathematical/logic operations. The instructions and data are typically transferred to each VPU 107 under the control of a corresponding one of VCUs 106 .
- Each VCU 106 implements one or more functional aspects of the overall memory control function of the PPU. For example, each VCU 106 may issue commands to DME 101 to fetch data from PPU memory 102 for various VPUs 107 .
- the PPU illustrated in FIG. 1 may include any number of VPEs 105 , and each VPE 105 may include any number of VPUs 107 .
- the overall computational capability of PPU 100 is not limited simply by the number of VPEs and VPUs. For instance, regardless of the number of VPEs and VPUs, memory bus bandwidth and data dependencies may still limit the amount of work that each VPE can do.
- the VCU within each VPE may become overburdened by the large number of memory access commands that it has to perform between VPUs and external memory 102 and/or PCU 104 . As a result, VPUs 107 may end up idly waiting for responses from their corresponding VCU, thus wasting valuable computational resources.
- an integrated circuit comprises an external memory, a control processor, and a plurality of parallel connected VPEs.
- Each one of the VPEs preferably comprises a plurality of VPUs, a plurality of VCUs, a DMA controller, and a VPE messaging unit (VMU) providing a data transfer path between the plurality of VPUs, the plurality of VCUs, the DMA controller, and the control processor.
- the integrated circuit further comprises an External Memory Unit (EMU) providing a data transfer path between the external memory, the control processor, and the plurality of VPEs.
- a PPU comprises an external memory storing at least physics data, a PCE comprising a programmable PCU, and a plurality of parallel connected VPEs.
- Each one of the VPEs comprises a plurality of VPUs, each comprising a grouping of mathematical/logic units adapted to perform computations on physics data for a physics simulation, a plurality of VCUs, a DMA subsystem comprising a DMA controller, and a VMU adapted to transfer messages between the plurality of VPUs, the plurality of VCUs, the DMA subsystem, and the PCE.
- the PPU further comprises an EMU providing a data transfer path between the external memory, the PCE, and the plurality of VPEs.
- A method of operating an integrated circuit is provided. The integrated circuit comprises an external memory, a plurality of parallel connected VPEs each comprising a plurality of VPUs, a plurality of VCUs, and a VMU, and an EMU providing a data transfer path between the external memory and the plurality of VPEs.
- the method comprises transferring a communication message from a VPU in a first VPE among the plurality of VPEs to a communication message virtual queue in the VMU of the first VPE, and transferring the communication message from the communication message virtual queue to a destination communication messages receive first-in-first-out queue (FIFO) located in a VPU or VCU of the first VPE.
- FIG. 1 is a block diagram illustrating a conventional Physics Processing Unit (PPU);
- FIG. 2 is a block diagram illustrating a PPU in accordance with one embodiment of the present invention.
- FIG. 3 is a block diagram of a VPE in accordance with an embodiment of the present invention.
- FIG. 4 is an illustration of a message in the VPE shown in FIG. 3 ;
- FIG. 5 is a block diagram of a scheduler for a message queuing system in the VPE shown in FIG. 3 ;
- FIG. 6 is a flowchart illustrating a typical sequence of operations performed by the VPE 205 shown in FIG. 3 when performing a calculation on data received through an external memory unit;
- FIG. 7 shows various alternative scheduler and queue configurations that could be used in the VPE shown in FIG. 3 ;
- FIG. 8 is a block diagram of a VPE according to yet another embodiment of the present invention.
- FIG. 9 is a flowchart illustrating a method of transferring a communication message between a VPU or VCU in the VPE shown in FIG. 8 according to an embodiment of the present invention.
- FIG. 10 is a flowchart illustrating a method of performing a DMA operation in a VPE based on a DMA request message according to an embodiment of the present invention.
- embodiments of the invention are designed to address problems arising in the context of parallel computing.
- several embodiments of the invention provide mechanisms for managing large numbers of concurrent memory transactions between a collection of data processing units operating in parallel and an external memory.
- Still other embodiments of the invention provide efficient means of communication between the data processing units.
- Embodiments of the invention recognize a need to balance various design, implementation, performance, and programming tradeoffs in a highly specialized hardware platform. For example, as the number of parallel connected components, e.g., vector processing units, in the platform increases, the degree of networking required to coordinate the operation of the components and data transfers between the components also increases. This networking requirement adds to programming complexity. Further, the use of Very Long Instruction Words (VLIWs), multi-threading data transfers, and multiple thread execution can also increase programming complexity. Moreover, as the number of components increases, the added components may cause resource (e.g., bus) contention. Even if the additional components increase overall throughput of the hardware platform, they may decrease response time (e.g., memory latency) for individual components. Accordingly, embodiments of the invention are adapted to strike a balance between these various tradeoffs.
- the invention is described below in the context of a specialized hardware platform adapted to perform mathematical/logic operations for a real-time physics simulation.
- inventive concepts described find ready application in a variety of other contexts.
- various data transfer, scheduling, and communication mechanisms described find ready application in other parallel computing contexts such as graphics processing and image processing, to name but a couple.
- FIG. 2 is a block level diagram of a PPU 200 adapted to run a physics-based simulation in accordance with one exemplary embodiment of the invention.
- PPU 200 comprises an External Memory Unit (EMU) 201 , a PCE 203 , and a plurality of VPEs 205 .
- Each of VPEs 205 comprises a plurality of VCUs 206 , a plurality of VPUs 207 , and a VPE Messaging Unit (VMU) 209 .
- PCE 203 comprises a PCU 204 .
- PPU 200 includes eight (8) VPEs 205 , each containing two (2) VCUs 206 , and eight (8) VPUs 207 .
- EMU 201 is connected between PCE 203 , VPEs 205 , a host system 208 , and an external memory 202 .
- EMU 201 typically comprises a switch adapted to facilitate data transfers between the various components connected thereto. For example, EMU 201 allows data transfers from one VPE to another VPE, between PCE 203 and VPEs 205 , and between external memory 202 and VPEs 205 .
- EMU 201 can be implemented in a variety of ways.
- EMU 201 comprises a crossbar switch.
- EMU 201 comprises a multiplexer.
- EMU 201 comprises a crossbar switch implemented by a plurality of multiplexers. Any data transferred to a VPE through an EMU is referred to as EMU data in this written description.
- Any external memory connected to a PPU through an EMU is referred to as an EMU memory in this written description.
- The term “DMA operation” or “DMA transaction” denotes any data access operation that involves a VPE but not PCE 203 or a processor in host system 208 .
- a read or write operation between external memory 202 and a VPE, or between two VPEs is referred to as a DMA operation.
- DMA operations are typically initiated by VCUs 206 , VPUs 207 , or host system 208 .
- To initiate a DMA operation, an initiator (e.g., a VCU or VPU) sends a DMA command to a DMA controller (not shown). The DMA controller then communicates with various memories in VPEs 205 and external memory 202 or host system 208 based on the DMA command to control data transfers between the various memories.
- Each of VPEs 205 typically includes its own DMA controller, and memory transfers generally occur within a VPE or through EMU 201 .
- Each of VPEs 205 includes a VPE Message Unit (VMU) adapted to facilitate DMA transfers to and from VCUs 206 and VPUs 207 .
- Each VMU typically comprises a plurality of DMA request queues used to store DMA commands, and a scheduler adapted to receive the DMA commands from the DMA request queues and send the DMA commands to various memories in VPEs 205 and/or external memory 202 .
- Each VMU typically further comprises a plurality of communication message queues used to send communication messages between VCUs 206 and VPUs 207 .
- Each of VPEs 205 establishes an independent “computational lane” in PPU 200 .
- independent parallel computations and data transfers can be carried out via each of VPEs 205 .
- PPU 200 has a total of eight (8) computational lanes.
- FIG. 3 is a block diagram showing an exemplary VPE 205 including a plurality of queues and associated hardware for managing memory requests and other data transfers.
- the queues and associated hardware can be viewed as one embodiment of a VMU such as those shown in FIG. 2 .
- VPE 205 comprises VPUs 207 and VCUs 206 .
- Each VPU 207 comprises an instruction memory and a data memory, represented collectively as local memories 501 .
- VPUs 207 are organized in pairs that share the same instruction memory.
- Each of VCUs 206 also comprises a data memory and an instruction memory, collectively represented as local memories 502 .
- VPE 205 further comprises a DMA controller 503 adapted to facilitate data transfers between any of the memories in VPE 205 and external memories such as external memory 202 .
- VPE 205 further comprises an Intermediate Storage Memory (ISM) 505 , which is adapted to store relatively large amounts of data compared with local memories 501 and 502 .
- ISM 505 can be thought of as a “level 2” memory, and local memories 501 and 502 can be thought of as “level 1” memories in a traditional memory hierarchy.
- DMA controller 503 generally fetches chunks of EMU data through EMU 201 and stores the EMU data in ISM 505 .
- the EMU data in ISM 505 is then transferred to VPUs 207 and/or VCUs 206 to perform various computations, and any EMU data modified by VPUs 207 or VCUs 206 are generally copied back to ISM 505 before the EMU data is transferred back to a memory such as external memory 202 through EMU 201 .
- VPE 205 still further comprises a VPU message queue 508 , a VPU scheduler 509 , a VCU message queue 507 , and a VCU scheduler 506 .
- VPU message queue 508 transfers messages from VPUs 207 to VCUs 206 through scheduler 509 .
- VCU message queue 507 transfers messages from VCUs 206 to VPUs 207 via scheduler 506 .
- the term “message” here simply refers to a unit of data, preferably 128 bytes.
- a message can comprise, for example, instructions, pointers, addresses, or operands or results for some computation.
- FIG. 4 shows a simple example of a message that could be sent to a VPU from a VCU.
- a message in VCU message queue 507 of FIG. 3 includes a data type, a pointer to an output address in local memories 501 , respective sizes for first and second input data, and pointers to the first and second input data in ISM 505 .
- the VPU can use the message data to create a DMA command for transferring the first and second input data from ISM 505 to the output address in local memories 501 .
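The message layout of FIG. 4 and its conversion into DMA commands can be sketched in Python as follows. This is only an illustration: the class names `WorkMessage` and `DmaCommand`, the field names, and the choice to place the second input directly after the first at the output address are assumptions, not details given in this description.

```python
from dataclasses import dataclass
from typing import List


@dataclass(frozen=True)
class WorkMessage:
    """Hypothetical model of the FIG. 4 message sent from a VCU to a VPU."""
    data_type: int    # kind of data carried by the message
    output_addr: int  # output address in the VPU's local memories 501
    size_a: int       # size of the first input data
    size_b: int       # size of the second input data
    ptr_a: int        # pointer to the first input data in ISM 505
    ptr_b: int        # pointer to the second input data in ISM 505


@dataclass(frozen=True)
class DmaCommand:
    """Hypothetical DMA command: copy `size` units from ISM to local memory."""
    src_ism_addr: int
    dst_local_addr: int
    size: int


def dma_commands_for(msg: WorkMessage) -> List[DmaCommand]:
    """Build one DMA command per input.

    Assumption: the second input is stored immediately after the first.
    """
    return [
        DmaCommand(msg.ptr_a, msg.output_addr, msg.size_a),
        DmaCommand(msg.ptr_b, msg.output_addr + msg.size_a, msg.size_b),
    ]


msg = WorkMessage(data_type=1, output_addr=0x100, size_a=16, size_b=32,
                  ptr_a=0x2000, ptr_b=0x3000)
commands = dma_commands_for(msg)
```

Here the VPU derives both transfers from a single message, so the VCU never has to issue the ISM-to-local-memory copies itself.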
- Although VPE 205 shown in FIG. 3 includes one queue and scheduler for VPUs 207 and one queue and scheduler for VCUs 206 , the number and arrangement of the queues and schedulers can vary.
- each VPU 207 or VCU 206 may have its own queue and scheduler, or even many queues and schedulers.
- messages from more than one queue may be input to each scheduler.
- FIG. 5 shows an exemplary embodiment of scheduler 506 shown in FIG. 3 .
- the embodiment shown in FIG. 5 is preferably implemented in hardware to accelerate the forwarding of messages from VCUs to VPUs. However, it could also be implemented in software.
- scheduler 506 comprises a logic circuit 702 and a plurality of queues 703 corresponding to VPUs 207 .
- Scheduler 506 receives messages from VCU message queue 507 and inserts the messages into queues 703 based on logic implemented in logic circuit 702 .
- the messages in queues 703 are then sent to VPUs 207 .
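The forwarding performed by scheduler 506 can be sketched as a software model. The routing rule used here (each message carries the index of its target VPU) is an assumption, since the logic implemented in logic circuit 702 is not specified; the names `dispatch`, `vcu_message_queue`, and `queues_703` are likewise hypothetical.

```python
from collections import deque
from typing import Deque, List, Tuple

NUM_VPUS = 8  # assumed VPU count, matching the eight VPUs 207 per VPE

# Hypothetical message layout: (target VPU index, payload).
Message = Tuple[int, str]


def dispatch(vcu_queue: Deque[Message], vpu_queues: List[Deque[str]]) -> None:
    """Drain the VCU message queue into the per-VPU queues in arrival order."""
    while vcu_queue:
        target, payload = vcu_queue.popleft()
        vpu_queues[target].append(payload)


vcu_message_queue: Deque[Message] = deque([(0, "a"), (3, "b"), (0, "c")])
queues_703: List[Deque[str]] = [deque() for _ in range(NUM_VPUS)]
dispatch(vcu_message_queue, queues_703)
```

Messages bound for the same VPU keep their relative order, which is the property a per-destination queue is meant to preserve.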
- FIG. 6 is a flowchart illustrating a typical sequence of operations performed by the VPE 205 shown in FIG. 3 when performing a calculation on EMU data received through EMU 201 .
- Exemplary method steps shown in FIG. 6 are denoted below by parentheses (XXX) to distinguish them from exemplary system elements such as those shown in FIGS. 1 through 5 .
- one of VCUs 206 sends an EMU data request command to DMA controller 503 so that DMA controller 503 will copy EMU data to ISM 505 ( 801 ).
- the VCU 206 then inserts a work message into its message queue 507 .
- the message is delivered by the scheduler to an in-bound queue of a VPU 207 .
- the VPU is instructed to send a command to DMA controller 503 to load the EMU data from ISM 505 into local memory 501 ( 802 ).
- the VPUs 207 perform calculations using the data loaded from ISM 505 ( 803 ).
- VCU 206 sends a command to DMA controller 503 to move results of the calculations from the local memory 501 back to ISM 505 ( 804 ).
- VCU 206 sends a command to DMA controller 503 to move the results of the calculations from ISM 505 to EMU 201 ( 805 ).
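The sequence ( 801 ) through ( 805 ) can be traced with a small Python sketch that uses plain dictionaries as stand-ins for the memories. The function name `run_vpe_sequence` is hypothetical, message queuing and the DMA controller are collapsed into direct copies, and writing the result through the EMU to external memory in the last step is an assumption.

```python
def run_vpe_sequence(external_memory, key, compute):
    """Trace steps 801-805 using dictionaries as stand-in memories."""
    ism = {}           # stand-in for ISM 505
    local_memory = {}  # stand-in for local memory 501 of one VPU
    trace = []

    ism[key] = external_memory[key]                 # (801) EMU data -> ISM
    trace.append(801)
    local_memory[key] = ism[key]                    # (802) ISM -> local memory
    trace.append(802)
    local_memory[key] = compute(local_memory[key])  # (803) VPU calculation
    trace.append(803)
    ism[key] = local_memory[key]                    # (804) results -> ISM
    trace.append(804)
    external_memory[key] = ism[key]                 # (805) ISM -> EMU
    trace.append(805)
    return trace


memory = {"positions": [1.0, 2.0, 3.0]}
steps = run_vpe_sequence(memory, "positions", lambda xs: [2 * x for x in xs])
```

The data passes through ISM 505 in both directions, so the VPU's local memory only ever holds the working set for the current calculation.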
- FIG. 7 shows alternative scheduler and queue configurations that could be used in the VPE 205 shown in FIG. 3 .
- FIG. 7A shows a configuration where there is a one to one correspondence between a VCU 901 and a queue and scheduler 902 and 903 .
- Scheduler 903 sends messages from queue 902 to two VPUs 904 , and in turn, VPUs 904 send messages to other VPUs and VCUs through a queue and scheduler 905 and 906 .
- FIG. 7B shows a configuration where there is a one to many correspondence between a VCU 911 and a plurality of queues and schedulers 912 and 913 .
- Each scheduler 913 sends messages to one of a plurality of VPUs 914 , and each of VPUs 914 sends messages back to VCU 911 through respective queues and schedulers 915 and 916 .
- the queues and schedulers shown in FIG. 7 are generally used for communication and data transfer purposes. However, these and other queues and schedulers could be used for other purposes such as storing and retrieving debugging messages.
- FIG. 8 shows a VPE according to yet another embodiment of the present invention.
- the VPE shown in FIG. 8 is intended to illustrate a way of implementing a message queue system in the VPE, and therefore various processing elements such as those used to perform computations in VPUs are omitted for simplicity of illustration.
- the VPE of FIG. 8 is adapted to pass messages of two types between its various components. These two types of messages are referred to as “communication messages” and “DMA request messages.”
- a communication message comprises a unit of user defined data that gets passed between two VPUs or between a VPU and a VCU in the VPE.
- a communication message may include, for example, instructions, data requests, pointers, or any type of data.
- a DMA request message comprises a unit of data used by a VPU or VCU to request that a DMA transaction be performed by a DMA controller in the VPE. For illustration purposes, it will be assumed that each communication and DMA request message described in relation to FIG. 8 comprises 128 bits of data.
- the VPE of FIG. 8 comprises a plurality of VPUs 207 , a plurality of VCUs 206 , a VMU 209 , and a DMA subsystem 1010 . Messages are passed between VCUs 206 , VPUs 207 , and DMA subsystem 1010 through VMU 209 .
- VMU 209 comprises a first memory 1001 for queuing communication messages and a second memory 1002 for queuing DMA request messages.
- the first and second memories are both 256×128 bit memories, each with one read port and one write port.
- Each of the first and second memories is subdivided into 16 virtual queues.
- the virtual queues in first memory 1001 are referred to as communication message virtual queues, and the virtual queues in second memory 1002 are referred to as DMA request virtual queues.
- VMU 209 preferably guarantees that each virtual queue acts independently of every other virtual queue. Two virtual queues act independently of each other if the usage or contents of either virtual queue never causes the other virtual queue to stop making forward progress.
- Each virtual queue in first and second memories 1001 and 1002 is configured with a capacity and a start address.
- the capacity and start address are typically specified in units of 128 bits, i.e., the size of one message. For example, a virtual queue with a capacity of two (2) can store two messages, or 256 bits. Where the capacity of a virtual queue is set to zero, then the queue is considered to be inactive. However, all active queues generally have a capacity between 2 and 256.
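The capacity and start address configuration can be illustrated with a short Python sketch. The names `VirtualQueueConfig` and `validate` are hypothetical, and the checks (no overlap, no queue running past the end of the memory, a minimum active capacity of two) are assumed consistency rules rather than rules stated in this description.

```python
from dataclasses import dataclass
from typing import List

MEMORY_ENTRIES = 256  # each queue memory holds 256 messages
MESSAGE_BITS = 128    # one entry is one 128-bit message
MAX_QUEUES = 16       # virtual queues per memory


@dataclass(frozen=True)
class VirtualQueueConfig:
    start: int     # start address, in units of one message
    capacity: int  # capacity in messages; 0 marks the queue inactive

    @property
    def active(self) -> bool:
        return self.capacity > 0

    @property
    def bits(self) -> int:
        return self.capacity * MESSAGE_BITS


def validate(configs: List[VirtualQueueConfig]) -> None:
    """Reject layouts that overlap or exceed the memory (assumed rules)."""
    if len(configs) > MAX_QUEUES:
        raise ValueError("at most 16 virtual queues per memory")
    active = sorted((c for c in configs if c.active), key=lambda c: c.start)
    previous_end = 0
    for cfg in active:
        if cfg.capacity < 2:
            raise ValueError("active queues need a capacity of at least 2")
        if cfg.start < previous_end:
            raise ValueError("virtual queues overlap")
        previous_end = cfg.start + cfg.capacity
        if previous_end > MEMORY_ENTRIES:
            raise ValueError("virtual queue exceeds the 256-entry memory")


# Sixteen equal queues of 16 messages each exactly fill the 256 entries.
layout = [VirtualQueueConfig(start=i * 16, capacity=16) for i in range(16)]
validate(layout)
```

An even split is only one possible layout; a single busy queue could be given most of the memory while the others are set to capacity zero.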
- Each virtual queue is also configured with a “high-water” occupancy threshold that can range between one (1) and the capacity of the virtual queue minus one. Where the amount of data stored in a virtual queue exceeds the high-water occupancy threshold, the virtual queue may generate a signal to indicate a change in the virtual queue's behavior. For example, the virtual queue may send an interrupt to PCE 203 to indicate that it will no longer accept data until its occupancy falls below the high-water occupancy threshold.
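One possible reading of the high-water behavior is sketched below in Python. The class name `HighWaterQueue` and the `signals` counter (standing in for an interrupt to PCE 203 ) are hypothetical, and the exact blocking rule is an assumption: the queue stops accepting data once occupancy exceeds the threshold and resumes once occupancy falls below it.

```python
from collections import deque


class HighWaterQueue:
    """Hypothetical normal-mode virtual queue with a high-water threshold."""

    def __init__(self, capacity: int, high_water: int) -> None:
        if not 1 <= high_water <= capacity - 1:
            raise ValueError("threshold must lie in [1, capacity - 1]")
        self.capacity = capacity
        self.high_water = high_water
        self.items = deque()
        self.blocked = False
        self.signals = 0  # stands in for interrupts sent to PCE 203

    def enqueue(self, message) -> bool:
        """Accept a message unless the queue is currently blocked."""
        if self.blocked:
            return False
        self.items.append(message)
        if len(self.items) > self.high_water:
            # Occupancy now exceeds the threshold: signal and stop accepting.
            self.blocked = True
            self.signals += 1
        return True

    def dequeue(self):
        """Remove the oldest message; unblock once below the threshold."""
        message = self.items.popleft()
        if len(self.items) < self.high_water:
            self.blocked = False
        return message


queue = HighWaterQueue(capacity=4, high_water=2)
accepted = [queue.enqueue(m) for m in ("a", "b", "c", "d")]
```

Under this rule occupancy never exceeds the threshold plus one, which stays within the capacity because the threshold is at most the capacity minus one.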
- Each virtual queue can also be configured to operate in a “normal mode” or a “ring buffer mode.”
- In the ring buffer mode, the high-water occupancy threshold is ignored, and new data can always be enqueued in the virtual queue, even if the new data overwrites old data stored in the virtual queue.
- Where new data overwrites old data, a read pointer and a write pointer in the virtual queue are typically moved so that the read pointer points to the oldest data in the virtual queue and the write pointer points to a next address where data will be written.
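The pointer movement on overwrite can be sketched in Python as follows. The class name `RingBufferQueue` and the explicit `count` field are assumptions made for the illustration; only the behavior of the read and write pointers is taken from the description above.

```python
class RingBufferQueue:
    """Hypothetical virtual queue in which new data may overwrite old data."""

    def __init__(self, capacity: int) -> None:
        self.capacity = capacity
        self.slots = [None] * capacity
        self.read = 0   # index of the oldest stored message
        self.write = 0  # index where the next message will be written
        self.count = 0  # number of valid messages currently stored

    def enqueue(self, message) -> None:
        """Always succeeds; a full queue drops its oldest message."""
        if self.count == self.capacity:
            # The slot about to be overwritten holds the oldest message,
            # so the read pointer moves to the next-oldest one.
            self.read = (self.read + 1) % self.capacity
        else:
            self.count += 1
        self.slots[self.write] = message
        self.write = (self.write + 1) % self.capacity

    def dequeue(self):
        """Return the oldest message still stored."""
        if self.count == 0:
            raise IndexError("queue is empty")
        message = self.slots[self.read]
        self.read = (self.read + 1) % self.capacity
        self.count -= 1
        return message


ring = RingBufferQueue(capacity=3)
for value in (1, 2, 3, 4):  # the fourth enqueue overwrites the value 1
    ring.enqueue(value)
drained = [ring.dequeue() for _ in range(3)]
```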
- Each communication message virtual queue is configured with a set of destinations.
- possible destinations include eight (8) VPUs 207 , two (2) VCUs 206 , and PCE 203 , for a total of eleven (11) destinations.
- the eleven destinations are generally encoded as an eleven (11) bit bitstring so that each virtual queue can be configured to send messages to any subset of the eleven destinations.
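The eleven-bit destination encoding can be sketched in Python as follows. The bit ordering (VPUs in bits 0 through 7, VCUs in bits 8 and 9, PCE in bit 10) and the destination names are assumptions; the description states only that eleven bits are used.

```python
from typing import Iterable, List

# Assumed bit order: index in this list is the bit position.
DESTINATIONS: List[str] = (
    [f"VPU{i}" for i in range(8)] + ["VCU0", "VCU1", "PCE"]
)


def encode(names: Iterable[str]) -> int:
    """Return the bitstring (as an int) with one bit set per destination."""
    mask = 0
    for name in names:
        mask |= 1 << DESTINATIONS.index(name)
    return mask


def decode(mask: int) -> List[str]:
    """Return the destination names whose bits are set in the mask."""
    return [name for bit, name in enumerate(DESTINATIONS) if mask & (1 << bit)]


mask = encode(["VPU0", "VCU1", "PCE"])
all_destinations = encode(DESTINATIONS)
```

Because each destination has its own bit, any of the 2^11 subsets can be expressed in a single configuration field.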
- One way to configure the various properties of the virtual queues is by storing configuration information for each of the virtual queues in memory mapped configuration registers.
- the memory mapped configuration registers are typically mapped onto a memory address space of PCE 203 and a memory address space of VCUs 206 .
- VCUs 206 can access the configuration information stored therein, but the virtual queues are preferably only configured by PCE 203 .
- VPUs 207 and VCUs 206 each comprise two (2) first-in-first-out queues (FIFOs) for receiving messages from VMU 209 .
- The two FIFOs are referred to as “receive FIFOs,” and they include a communication messages receive FIFO and a DMA completion notifications receive FIFO.
- Each communication message receive FIFO preferably comprises an 8-entry by 128-bit queue and each DMA completion notifications receive FIFO preferably comprises a 32-entry by 32-bit queue.
- VPUs 207 and VCUs 206 both use a store instruction STQ to send messages to VMU 209, and a load instruction LDQ to read messages from their respective receive FIFOs.
- Pairs of VPUs 207 can share a single physical memory. Accordingly, the receive FIFOs for each pair of VPUs 207 can be implemented in the same physical memory. Where the receive FIFOs for a pair of VPUs 207 are implemented in the same physical memory, there may be memory contention between the VPUs 207 when both try to send load and store instructions to the memory. A simple way to address this type of memory contention is to give one VPU of the pair strict priority over the other VPU in the pair.
- The receive FIFOs in each VPU act independently of each other. In other words, the usage or contents of one receive FIFO will not stop the forward progress of another receive FIFO.
- The communication message receive FIFOs have a configurable high-water occupancy threshold.
- When the occupancy of a communication message receive FIFO reaches the high-water occupancy threshold, the communication message receive FIFO generates a backpressure indication to prevent more messages from being sent to the FIFO.
- The high-water occupancy threshold for a communication message receive FIFO is typically between 1 and 5, with a default of 5.
- Where destinations of a communication message virtual queue generate backpressure indications, the communication message virtual queue is blocked from sending any communication messages to those destinations. As a result, the communication message virtual queue may fill up, causing subsequent attempts to enqueue data to the virtual queue to fail.
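Strict priority between the two VPUs of a pair can be sketched as a two-input arbiter. The function name `arbitrate` and the choice of the VPU labeled "A" as the high-priority unit are assumptions for illustration only.

```python
def arbitrate(request_a, request_b):
    """Grant the shared memory port to at most one VPU of a pair.

    request_a and request_b are booleans that say whether each VPU wants
    the memory in this cycle. VPU "A" is arbitrarily chosen as the
    high-priority unit in this sketch.
    """
    if request_a:
        return "A"  # the high-priority VPU always wins when it requests
    if request_b:
        return "B"  # the other VPU is served only when "A" is idle
    return None     # no request this cycle
```

A known trade-off of strict priority is that the low-priority requester can be delayed for as long as the high-priority requester keeps issuing requests.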
- VMU 209 can only transfer one communication message to a receive FIFO per clock cycle. Accordingly, a scheduler 1003 is included in VMU 209 to provide fairness between the communication message virtual queues.
- Scheduler 1003 typically schedules data transfers between communication message virtual queues and communication message receive FIFOs using a round robin scheduling technique. According to this technique, the scheduler examines each communication message virtual queue in round robin order. Where an examined virtual queue is not empty, and a next communication message in the virtual queue has a destination communication message receive FIFO that is not above its high-water occupancy threshold, the scheduler sends the communication message to the destination communication message receive FIFO. To facilitate efficient examination of the communication message virtual queues, scheduler 1003 maintains an indication of the destination communication message receive FIFO for the next message in each communication message virtual queue. This allows scheduler 1003 to efficiently check whether the destination communication message receive FIFOs are above their respective high-water occupancy thresholds.
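The round robin technique can be sketched in Python as a function that is called once per clock cycle. Several simplifications are assumed: each message carries a single destination rather than a set, backpressure is taken to start when occupancy reaches the threshold, and the function and parameter names are hypothetical. Peeking at the head of each queue stands in for the per-queue destination indication that scheduler 1003 maintains.

```python
from collections import deque


def schedule_one(queues, fifos, high_water, start):
    """Transfer at most one message in this cycle.

    queues: list of deques holding (destination, payload) tuples.
    fifos: dict mapping destination -> deque (the receive FIFO).
    high_water: dict mapping destination -> occupancy threshold.
    start: index of the first queue to examine.
    Returns the index to start from on the next cycle.
    """
    n = len(queues)
    for offset in range(n):
        idx = (start + offset) % n
        queue = queues[idx]
        if not queue:
            continue  # empty queues are skipped
        dest, payload = queue[0]  # peek at the next message's destination
        if len(fifos[dest]) >= high_water[dest]:
            continue  # destination is asserting backpressure
        fifos[dest].append(payload)
        queue.popleft()
        return (idx + 1) % n  # resume after the queue that was served
    return start  # nothing could be sent this cycle
```

Returning the index after the served queue is what gives each non-empty, unblocked queue a turn before any queue is served twice.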
- The DMA request message virtual queues in second memory 1002 receive DMA request messages from VPUs 207 and VCUs 206.
- Each DMA request message typically comprises 128 bits of information, together with an optional 32-bit DMA completion notification.
- The DMA request messages are transferred through the DMA request message virtual queues to a set of DMA request FIFOs 1007.
- The order in which messages are transferred from the DMA request message virtual queues is determined by a scheduler 1004.
- DMA request messages in DMA request FIFOs 1007 are transferred to a DMA controller 1008, which performs DMA transactions based on the DMA request messages.
- A typical DMA transaction comprises, for example, moving data to and/or from various memories associated with VPUs 207 and/or VCUs 206.
- Upon completion of a DMA transaction, any DMA completion notification associated with the DMA request message that initiated the DMA transaction is transferred from DMA controller 1008 to a DMA completion notifications FIFO 1009.
- The DMA completion notification is then transferred to a DMA completion notification receive FIFO in one of VPUs 207 or VCUs 206.
- The DMA request message virtual queues may also include extended completion notification (ECN) messages.
- An ECN message is a 128-bit message inserted in a DMA request message virtual queue immediately after a DMA request message.
- The ECN message is typically used instead of a 32-bit completion notification.
- The ECN message is sent to a communication message receive FIFO through one of the communication message virtual queues to indicate that the DMA request message has been sent to DMA controller 1008.
- An exemplary ECN message is shown in FIG. 8 by a dotted arrow.
- The ECN message can be sent to the communication message virtual queue either upon sending the DMA request message to DMA controller 1008, or upon completion of a DMA transaction initiated by the DMA request message, depending on the value of a “fence” indication in the DMA request message. If the fence indication is set to a first value, the ECN message is sent to the communication message virtual queue upon sending the DMA request message to DMA controller 1008. Otherwise, the ECN message is sent to the communication message virtual queue upon completion of the DMA transaction.
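The effect of the fence indication on ordering can be sketched as an event trace. The constant `FENCE_ON_ISSUE`, its value of 0, the dictionary layout of a request, and the event names are all hypothetical; the text only states that a "first value" selects sending the ECN message when the request is issued.

```python
FENCE_ON_ISSUE = 0  # hypothetical encoding of the "first value"


def process_request(request, log):
    """Append the events for one DMA request to log, in order."""
    ecn = request.get("ecn")
    # The request is handed to the DMA controller.
    log.append(("issue", request["id"]))
    if ecn is not None and request["fence"] == FENCE_ON_ISSUE:
        log.append(("ecn", ecn))  # ECN sent as soon as the request is issued
    # The DMA transaction finishes.
    log.append(("complete", request["id"]))
    if ecn is not None and request["fence"] != FENCE_ON_ISSUE:
        log.append(("ecn", ecn))  # ECN held back until the transaction is done
    return log
```

A receiver that needs the transferred data to be in place before it acts on the ECN message would use the second setting in this sketch.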
- Scheduler 1004 preferably uses a round robin scheduling algorithm to determine the order in which DMA request messages are transferred from DMA request message virtual queues to DMA request FIFOs 1007 . Under the round robin scheduling algorithm, scheduler 1004 reads a next DMA request message from a non-empty DMA request message virtual queue during a current clock cycle. The next DMA request message is selected by cycling through the non-empty DMA request message virtual queues in successive clock cycles in round robin order.
- The next DMA request message is transferred to a DMA request FIFO during the current clock cycle unless one or more of the following conditions are met: DMA request FIFOs 1007 are all full; the next DMA request message has a DMA completion notification destined for a DMA completion notification receive FIFO that is full, or above its high-water occupancy threshold; or, the DMA request message has an associated ECN message, and the ECN message's destination communication message FIFO is full.
- To provide true independence between virtual queues, VMU 209 must prevent DMA completion notifications FIFO 1009 from blocking the progress of DMA controller 1008.
- DMA completion notifications FIFO 1009 may block DMA controller 1008, for example, if VCUs 206 or VPUs 207 are slow to drain their respective DMA completion notification receive FIFOs, causing DMA completion notifications FIFO 1009 to fill up.
- One way that VMU 209 can prevent DMA completion notifications FIFO 1009 from blocking the progress of DMA controller 1008 is by preventing any DMA request message containing a 32-bit DMA completion notification from being dequeued from its DMA request virtual queue unless a DMA completion notifications receive FIFO for which the DMA completion notification is destined is below its high-water occupancy threshold.
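The three blocking conditions can be expressed as a single predicate that a scheduler would evaluate before dequeuing a DMA request message. The classes `ReceiveFifo` and `DmaRequest`, their fields, and the function `can_transfer` are assumptions introduced for this sketch, as are the threshold values used in any example.

```python
from dataclasses import dataclass
from typing import Optional


@dataclass
class ReceiveFifo:
    occupancy: int
    capacity: int
    high_water: int

    def is_full(self):
        return self.occupancy >= self.capacity

    def above_high_water(self):
        return self.occupancy > self.high_water


@dataclass
class DmaRequest:
    # Receive FIFO targeted by an optional 32-bit completion notification.
    completion_dest: Optional[ReceiveFifo] = None
    # Communication message receive FIFO targeted by an optional ECN message.
    ecn_dest: Optional[ReceiveFifo] = None


def can_transfer(request, request_fifos_full):
    """Return True if the request may move to a DMA request FIFO this cycle."""
    if request_fifos_full:
        return False  # condition 1: every DMA request FIFO is full
    dest = request.completion_dest
    if dest is not None and (dest.is_full() or dest.above_high_water()):
        return False  # condition 2: notification target cannot take more
    if request.ecn_dest is not None and request.ecn_dest.is_full():
        return False  # condition 3: ECN target is full
    return True
```

Holding the request in its virtual queue, rather than letting it reach the controller, is what keeps a slow consumer of notifications from stalling requests that belong to other virtual queues.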
- DMA controller 1008 can perform various different types of DMA transactions in response to different DMA request messages. For example, some DMA transactions move data from the instruction memory of one VPU to the instruction memory of another VPU. Other transactions broadcast data from an ISM 1011 to a specified address in the data memories of several or all of VPUs 207 , e.g., VPUs labeled with the suffix “A” in FIGS. 2 and 8 . Still other DMA transactions broadcast data from ISM 1011 to the instruction memories of several or all of VPUs 207 .
- DMA controller 1008 moves data between ISM 1011 and an EMU memory 1012 using “load-locked” and “store-conditional” semantics. More specifically, load-locked semantics are used when transferring data from EMU memory 1012 to ISM 1011, and store-conditional semantics are used when transferring data from ISM 1011 to EMU memory 1012.
- Load-locked semantics and store-conditional semantics both rely on a mechanism whereby an address in EMU memory 1012 is “locked” by associating the address with an identifier of a particular virtual queue within one of VPEs 205 .
- The virtual queue whose identifier is associated with the address is said to have a “lock” on the address. Also, when a virtual queue has a lock on an address, the address is said to be “locked.” If another identifier becomes associated with the address, the virtual queue is said to “lose,” or “release” the lock.
- A virtual queue typically gets a lock on an address in EMU memory 1012 when a DMA request message from the virtual queue instructs DMA controller 1008 to perform a read operation from EMU memory 1012 to ISM 1011.
- A read operation that involves getting a lock on an address is termed a “load-locked” operation.
- When a virtual queue gets a lock on an address, an EMU controller (not shown) in EMU memory 1012 may start a timer. The timer is typically configured to have a limited duration. If the duration is set to zero, then the timer will not be used. While the timer is running, any subsequent read operation to the address in EMU memory 1012 will not unlock or lock any addresses. The use of the timer reduces the probability that an address locked by a DMA transaction from one VPE will be accessed by a DMA transaction from another VPE.
- A “store-conditional” operation is a write operation from ISM 1011 to EMU memory 1012 that only succeeds if it originates from a virtual queue that has a lock on a destination address of the write operation.
- Atomic EMU DMA transactions can be initiated by DMA request messages having 32-bit DMA completion notifications. However, if a store-conditional operation does not succeed, a bit in the corresponding DMA completion notification is set to a predetermined value to indicate the failure to one of VPUs 207 or VCUs 206 .
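The locking mechanism can be sketched in Python as a table that maps each address to the identifier of the virtual queue holding its lock. The class `EmuMemory` and its method names are hypothetical. The timer is omitted, and the sketch assumes that a lock persists after a successful store, which the text does not state. A `False` return value stands in for the failure bit set in the DMA completion notification.

```python
class EmuMemory:
    """Toy model of lock tracking in EMU memory; timer behavior is omitted."""

    def __init__(self):
        self.data = {}
        self.lock_owner = {}  # address -> virtual queue identifier

    def load_locked(self, queue_id, address):
        """Read an address and take its lock for queue_id."""
        # Associating a new identifier with the address takes the lock,
        # so any previous holder loses it.
        self.lock_owner[address] = queue_id
        return self.data.get(address, 0)

    def store_conditional(self, queue_id, address, value):
        """Write only if queue_id still holds the lock; report success."""
        if self.lock_owner.get(address) != queue_id:
            return False  # lock was lost or never held
        self.data[address] = value
        return True
```

If two virtual queues both perform a load-locked read of the same address, only the later one can complete a store-conditional write in this sketch; the earlier one sees a failure and would typically retry the read-modify-write sequence.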
- FIGS. 9 and 10 are flowcharts illustrating methods of sending messages in a circuit such as the VPE shown in FIG. 8 .
- FIG. 9 illustrates a method of transferring a communication message from a VPU or VCU to another VPU or VCU in a VPE according to one embodiment of the invention.
- FIG. 10 illustrates a method of performing a DMA operation in a VPE based on a DMA request message according to an embodiment of the present invention.
- The method of transferring a communication message from a VPU or VCU to another VPU or VCU in a VPE comprises the following. First, in a step 1101, a VPU or VCU writes a communication message to one of a plurality of communication message queues. Next, in a step 1102, a scheduler checks the occupancy of a destination receive FIFO for the communication message. Finally, in a step 1103, if the occupancy of the destination receive FIFO is below a predetermined high-water occupancy threshold, the communication message is transferred to the destination receive FIFO.
- The method of performing the DMA operation in a VPE comprises the following. First, in a step 1201, a VPU or VCU writes a DMA request message to one of a plurality of DMA request message queues. Next, in a step 1202, the DMA request message is transferred from the DMA request message queue to a DMA request FIFO. Then, in a step 1203, the DMA request message is transferred to a DMA controller and the DMA controller performs a DMA operation based on the DMA request message. Finally, in a step 1204, a DMA completion notification associated with the DMA request message is sent to a DMA completion notification receive FIFO in one or more VPUs and/or VCUs within the VPE.
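Steps 1201 through 1204 can be traced with a minimal Python sketch. The function `perform_dma`, the dictionary layout of a request (`src`, `dst`, `notify`, `completion`), and the use of a dictionary as memory are assumptions for illustration; scheduling and backpressure checks are left out so that the sequence of steps stays visible.

```python
from collections import deque


def perform_dma(request, request_queue, request_fifo, memory, receive_fifos):
    """Walk one DMA request through steps 1201 to 1204."""
    # Step 1201: a VPU or VCU writes the request to a DMA request queue.
    request_queue.append(request)
    # Step 1202: the request moves from the queue to a DMA request FIFO.
    request_fifo.append(request_queue.popleft())
    # Step 1203: the DMA controller takes the request and copies the data.
    active = request_fifo.popleft()
    memory[active["dst"]] = memory[active["src"]]
    # Step 1204: the completion notification goes to each listed receive FIFO.
    for unit in active["notify"]:
        receive_fifos[unit].append(active["completion"])
```

In a fuller model, the transfer in step 1202 would be gated by the scheduler conditions described earlier, and step 1204 would pass through DMA completion notifications FIFO 1009.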
Priority Applications (7)
Application Number | Priority Date | Filing Date | Title |
---|---|---|---|
US11/798,119 US7627744B2 (en) | 2007-05-10 | 2007-05-10 | External memory accessing DMA request scheduling in IC of parallel processing engines according to completion notification queue occupancy level |
DE102008022080A DE102008022080B4 (en) | 2007-05-10 | 2008-05-05 | Message queuing system for a parallel integrated circuit architecture and associated operating method |
GB0808251A GB2449168B (en) | 2007-05-10 | 2008-05-07 | Message queuing system for parallel integrated circuit architecture and related method of operation |
TW097117334A TWI416405B (en) | 2007-05-10 | 2008-05-09 | A parallel integrated circuit, a physics processing unit and a method of operating integrated circuit |
CN2008100993042A CN101320360B (en) | 2007-05-10 | 2008-05-09 | Message queuing system for parallel integrated circuit and related operation method |
KR1020080043541A KR100932038B1 (en) | 2007-05-10 | 2008-05-09 | Message Queuing System for Parallel Integrated Circuit Architecture and Its Operation Method |
JP2008124877A JP4428485B2 (en) | 2007-05-10 | 2008-05-12 | Message queuing system for parallel integrated circuit architecture and related operating method |
Publications (2)
Publication Number | Publication Date |
---|---|
US20080282058A1 (en) | 2008-11-13 |
US7627744B2 (en) | 2009-12-01 |
Family
ID=39537379
Family Applications (1)
Application Number | Title | Priority Date | Filing Date |
---|---|---|---|
US11/798,119 Active 2027-12-20 US7627744B2 (en) | 2007-05-10 | 2007-05-10 | External memory accessing DMA request scheduling in IC of parallel processing engines according to completion notification queue occupancy level |
Country Status (7)
Country | Link |
---|---|
US (1) | US7627744B2 (en) |
JP (1) | JP4428485B2 (en) |
KR (1) | KR100932038B1 (en) |
CN (1) | CN101320360B (en) |
DE (1) | DE102008022080B4 (en) |
GB (1) | GB2449168B (en) |
TW (1) | TWI416405B (en) |
Filing Date | Country | Application Number | Patent | Status |
---|---|---|---|---|
2007-05-10 | US | US11/798,119 | US7627744B2 | Active |
2008-05-05 | DE | DE102008022080A | DE102008022080B4 | Active |
2008-05-07 | GB | GB0808251A | GB2449168B | Active |
2008-05-09 | TW | TW097117334A | TWI416405B | Active |
2008-05-09 | KR | KR1020080043541A | KR100932038B1 | Active (IP Right Grant) |
2008-05-09 | CN | CN2008100993042A | CN101320360B | Not active (Expired - Fee Related) |
2008-05-12 | JP | JP2008124877A | JP4428485B2 | Active |
Patent Citations (41)
Publication number | Priority date | Publication date | Assignee | Title |
---|---|---|---|---|
US5010477A (en) | 1986-10-17 | 1991-04-23 | Hitachi, Ltd. | Method and apparatus for transferring vector data between parallel processing system with registers & logic for inter-processor data communication independents of processing operations |
US5123095A (en) | 1989-01-17 | 1992-06-16 | Ergo Computing, Inc. | Integrated scalar and vector processors with vector addressing by the scalar processor |
US5966528A (en) | 1990-11-13 | 1999-10-12 | International Business Machines Corporation | SIMD/MIMD array processor with vector processing |
US5577250A (en) | 1992-02-18 | 1996-11-19 | Apple Computer, Inc. | Programming model for a coprocessor on a computer system |
US5664162A (en) | 1994-05-23 | 1997-09-02 | Cirrus Logic, Inc. | Graphics accelerator with dual memory controllers |
US5721834A (en) | 1995-03-08 | 1998-02-24 | Texas Instruments Incorporated | System management mode circuits systems and methods |
US5765022A (en) | 1995-09-29 | 1998-06-09 | International Business Machines Corporation | System for transferring data from a source device to a target device in which the address of data movement engine is determined |
US6342892B1 (en) | 1995-11-22 | 2002-01-29 | Nintendo Co., Ltd. | Video game system and coprocessor for video game system |
US5938530A (en) | 1995-12-07 | 1999-08-17 | Kabushiki Kaisha Sega Enterprises | Image processing device and image processing method |
US6317819B1 (en) | 1996-01-11 | 2001-11-13 | Steven G. Morton | Digital signal processor containing scalar processor and a plurality of vector processors operating from a single instruction |
US5841444A (en) | 1996-03-21 | 1998-11-24 | Samsung Electronics Co., Ltd. | Multiprocessor graphics system |
US6058465A (en) | 1996-08-19 | 2000-05-02 | Nguyen; Le Trong | Single-instruction-multiple-data processing in a multimedia signal processor |
US5812147A (en) | 1996-09-20 | 1998-09-22 | Silicon Graphics, Inc. | Instruction methods for performing data formatting while moving data between memory and a vector register file |
US6119217A (en) | 1997-03-27 | 2000-09-12 | Sony Computer Entertainment, Inc. | Information processing apparatus and information processing method |
US6324623B1 (en) | 1997-05-30 | 2001-11-27 | Oracle Corporation | Computing system for implementing a shared cache |
US20020135583A1 (en) | 1997-08-22 | 2002-09-26 | Sony Computer Entertainment Inc. | Information processing apparatus for entertainment system utilizing DMA-controlled high-speed transfer and processing of routine data |
US6317820B1 (en) | 1998-06-05 | 2001-11-13 | Texas Instruments Incorporated | Dual-mode VLIW architecture providing a software-controlled varying mix of instruction-level and task-level parallelism |
US6223198B1 (en) | 1998-08-14 | 2001-04-24 | Advanced Micro Devices, Inc. | Method and apparatus for multi-function arithmetic |
US6366998B1 (en) | 1998-10-14 | 2002-04-02 | Conexant Systems, Inc. | Reconfigurable functional units for implementing a hybrid VLIW-SIMD programming model |
US6425822B1 (en) | 1998-11-26 | 2002-07-30 | Konami Co., Ltd. | Music game machine with selectable controller inputs |
US6570571B1 (en) | 1999-01-27 | 2003-05-27 | Nec Corporation | Image processing apparatus and method for efficient distribution of image processing to plurality of graphics processors |
US6341318B1 (en) | 1999-08-10 | 2002-01-22 | Chameleon Systems, Inc. | DMA data streaming |
US20030179205A1 (en) | 2000-03-10 | 2003-09-25 | Smith Russell Leigh | Image display apparatus, method and program based on rigid body dynamics |
US6779049B2 (en) | 2000-12-14 | 2004-08-17 | International Business Machines Corporation | Symmetric multi-processing system with attached processing units being able to access a shared memory without being structurally configured with an address translation mechanism |
US6862026B2 (en) | 2001-02-09 | 2005-03-01 | Fraunhofer-Gesellschaft Zur Foerderung Der Angewandten Forschung E.V. | Process and device for collision detection of objects |
US20020156993A1 (en) | 2001-03-22 | 2002-10-24 | Masakazu Suzuoki | Processing modules for computer architecture for broadband networks |
US20050120187A1 (en) | 2001-03-22 | 2005-06-02 | Sony Computer Entertainment Inc. | External data interface in a computer architecture for broadband networks |
US6966837B1 (en) | 2001-05-10 | 2005-11-22 | Best Robert M | Linked portable and video game systems |
US7120653B2 (en) | 2002-05-13 | 2006-10-10 | Nvidia Corporation | Method and apparatus for providing an integrated file system |
US20040075623A1 (en) | 2002-10-17 | 2004-04-22 | Microsoft Corporation | Method and system for displaying images on multiple monitors |
US20040083342A1 (en) | 2002-10-24 | 2004-04-29 | International Business Machines Corporation | Method and apparatus for enabling access to global data by a plurality of codes in an integrated executable for a heterogeneous architecture |
US20040193754A1 (en) | 2003-03-27 | 2004-09-30 | International Business Machines Corporation | DMA prefetch |
US7149875B2 (en) * | 2003-03-27 | 2006-12-12 | Micron Technology, Inc. | Data reordering processor and method for use in an active memory device |
US20050041031A1 (en) | 2003-08-18 | 2005-02-24 | Nvidia Corporation | Adaptive load balancing in a multi-processor graphics processing system |
US20050086040A1 (en) | 2003-10-02 | 2005-04-21 | Curtis Davis | System incorporating physics processing unit |
US7421303B2 (en) * | 2004-01-22 | 2008-09-02 | Nvidia Corporation | Parallel LCP solver and system incorporating same |
US20050251644A1 (en) * | 2004-05-06 | 2005-11-10 | Monier Maher | Physics processing unit instruction set architecture |
JP2006107514A (en) | 2004-10-05 | 2006-04-20 | Sony Computer Entertainment Inc | System and device which have interface device which can perform data communication with external device |
JP2007052790A (en) | 2005-08-19 | 2007-03-01 | International Business Machines Corp (IBM) | System, method, computer program and device for communicating command parameter between processor and memory flow controller |
US20070079018A1 (en) | 2005-08-19 | 2007-04-05 | Day Michael N | System and method for communicating command parameters between a processor and a memory flow controller |
US20070279422A1 (en) * | 2006-04-24 | 2007-12-06 | Hiroaki Sugita | Processor system including processors and data transfer method thereof |
Non-Patent Citations (9)
Title |
---|
Bishop, et al. "Sparta: Simulation of Physics on a Real-Time Architecture," Proceedings of the 10th Great Lakes Symposium on VLSI, pp. 177-182, 2000. |
Final Office Action, U.S. Appl. No. 10/715,459, dated Oct. 9, 2009. |
Hauth, et al. "Corotational Simulation of Deformable Objects," Journal of WSCG, vol. 12, no. 1-3, 2003. |
Intel Corp. "Intel PCI and PCI Express," 1992-2004, 3 pages. |
Office Action, U.S. Appl. No. 10/715,440, dated Feb. 23, 2009. |
Patents Act 1977: Search Report under Section 17, Sep. 3, 2008. |
Telekinesys Research Ltd. "Havok Game Dynamics SDK," 2002, 33 pages. |
Wakamatsu, et al. "Static Modeling of Linear Object Deformation Based on Differential Geometry," Sage Publications, International Journal of Robotics Research, vol. 23, no. 3, Mar. 2004, pp. 293-311. |
Zhuang, et al. "Real-Time Simulation of Physically Realistic Global Deformation," SIGGRAPH 1999. |
Cited By (20)
Publication number | Priority date | Publication date | Assignee | Title |
---|---|---|---|---|
US11563621B2 (en) | 2006-06-13 | 2023-01-24 | Advanced Cluster Systems, Inc. | Cluster computing |
US12021679B1 (en) | 2006-06-13 | 2024-06-25 | Advanced Cluster Systems, Inc. | Cluster computing |
US11811582B2 (en) | 2006-06-13 | 2023-11-07 | Advanced Cluster Systems, Inc. | Cluster computing |
US11570034B2 (en) | 2006-06-13 | 2023-01-31 | Advanced Cluster Systems, Inc. | Cluster computing |
US20100161914A1 (en) * | 2008-12-23 | 2010-06-24 | Eilert Sean S | Autonomous memory subsystems in computing platforms |
US20100165991A1 (en) * | 2008-12-30 | 2010-07-01 | Veal Bryan E | SIMD processing of network packets |
US8493979B2 (en) * | 2008-12-30 | 2013-07-23 | Intel Corporation | Single instruction processing of network packets |
US9054987B2 (en) | 2008-12-30 | 2015-06-09 | Intel Corporation | Single instruction processing of network packets |
US20100268904A1 (en) * | 2009-04-15 | 2010-10-21 | Sheffield Robert L | Apparatus and methods for region lock management assist circuit in a storage system |
US20100268743A1 (en) * | 2009-04-15 | 2010-10-21 | Hallyal Basavaraj G | Apparatus and methods for tree management assist circuit in a storage system |
US9268695B2 (en) | 2012-12-12 | 2016-02-23 | Avago Technologies General Ip (Singapore) Pte. Ltd. | Methods and structure for using region locks to divert I/O requests in a storage controller having multiple processing stacks |
CN104639596A (en) * | 2013-11-08 | 2015-05-20 | Tata Consultancy Services Limited | System and method for multiple sender support in low latency fifo messaging using rdma |
CN104639596B (en) * | 2013-11-08 | 2018-04-27 | Tata Consultancy Services Limited | System and method for supporting multiple transmitters in low latency FIFO messaging using RDMA |
CN104639597B (en) * | 2013-11-08 | 2018-03-30 | Tata Consultancy Services Limited | System and method for supporting multiple transmitters in low latency FIFO messaging using the TCP/IP protocol |
AU2014200239B2 (en) * | 2013-11-08 | 2015-11-05 | Tata Consultancy Services Limited | System and method for multiple sender support in low latency fifo messaging using rdma |
CN104639597A (en) * | 2013-11-08 | 2015-05-20 | Tata Consultancy Services Limited | System(s) and method(s) for multiple sender support in low latency fifo messaging using tcp/ip protocol |
US11397694B2 (en) | 2019-09-17 | 2022-07-26 | Micron Technology, Inc. | Memory chip connecting a system on a chip and an accelerator chip |
US11416422B2 (en) | 2019-09-17 | 2022-08-16 | Micron Technology, Inc. | Memory chip having an integrated data mover |
US12045503B2 (en) | 2019-09-17 | 2024-07-23 | Micron Technology, Inc. | Programmable engine for data movement |
US12086078B2 (en) | 2019-09-17 | 2024-09-10 | Micron Technology, Inc. | Memory chip having an integrated data mover |
Also Published As
Publication number | Publication date |
---|---|
KR20080099823A (en) | 2008-11-13 |
DE102008022080B4 (en) | 2011-05-05 |
TWI416405B (en) | 2013-11-21 |
GB2449168A (en) | 2008-11-12 |
DE102008022080A1 (en) | 2008-12-11 |
JP2009037593A (en) | 2009-02-19 |
CN101320360B (en) | 2012-03-21 |
TW200901028A (en) | 2009-01-01 |
GB2449168B (en) | 2009-04-22 |
US20080282058A1 (en) | 2008-11-13 |
JP4428485B2 (en) | 2010-03-10 |
KR100932038B1 (en) | 2009-12-15 |
CN101320360A (en) | 2008-12-10 |
GB0808251D0 (en) | 2008-06-11 |
Similar Documents
Publication | Publication Date | Title |
---|---|---|
US7627744B2 (en) | | External memory accessing DMA request scheduling in IC of parallel processing engines according to completion notification queue occupancy level |
EP3631636B1 (en) | | Continuation analysis tasks for gpu task scheduling |
US20210349763A1 (en) | | Technique for computational nested parallelism |
US9830158B2 (en) | | Speculative execution and rollback |
US10067768B2 (en) | | Execution of divergent threads using a convergence barrier |
US20020083373A1 (en) | | Journaling for parallel hardware threads in multithreaded processor |
US20130198760A1 (en) | | Automatic dependent task launch |
US20140337848A1 (en) | | Low overhead thread synchronization using hardware-accelerated bounded circular queues |
US10146575B2 (en) | | Heterogeneous enqueuing and dequeuing mechanism for task scheduling |
TWI489289B (en) | | Pre-scheduled replays of divergent operations |
US20130135327A1 (en) | | Saving and Restoring Non-Shader State Using a Command Processor |
Kornaros et al. | | Enabling efficient job dispatching in accelerator-extended heterogeneous systems with unified address space |
Moore et al. | | Introduction to Multithreaded Processors |
Legal Events
Date | Code | Title | Description |
---|---|---|---|
 | AS | Assignment | Owner name: AGEIA TECHNOLOGIES, INC., MISSOURI. Free format text: ASSIGNMENT OF ASSIGNORS INTEREST; ASSIGNORS: MAHER, MONIER; BORDES, JEAN PIERRE; LAMB, CHRISTOPHER; AND OTHERS; REEL/FRAME: 019377/0568; SIGNING DATES FROM 20070430 TO 20070502 |
 | AS | Assignment | Owner name: AGEIA TECHNOLOGIES, INC., CALIFORNIA. Free format text: RELEASE BY SECURED PARTY; ASSIGNOR: HERCULES TECHNOLOGY GROWTH CAPITAL, INC.; REEL/FRAME: 020827/0853. Effective date: 20080207 |
 | AS | Assignment | Owner name: NVIDIA CORPORATION, CALIFORNIA. Free format text: ASSIGNMENT OF ASSIGNORS INTEREST; ASSIGNOR: AGEIA TECHNOLOGIES, INC.; REEL/FRAME: 021011/0059. Effective date: 20080523 |
 | STCF | Information on status: patent grant | Free format text: PATENTED CASE |
 | FPAY | Fee payment | Year of fee payment: 4 |
 | FPAY | Fee payment | Year of fee payment: 8 |
 | MAFP | Maintenance fee payment | Free format text: PAYMENT OF MAINTENANCE FEE, 12TH YEAR, LARGE ENTITY (ORIGINAL EVENT CODE: M1553); ENTITY STATUS OF PATENT OWNER: LARGE ENTITY. Year of fee payment: 12 |