JP2021530813A

JP2021530813A - Integrated address space for multiple hardware accelerators with dedicated low latency links

Info

Publication number: JP2021530813A
Application number: JP2021503580A
Authority: JP
Inventors: サラブジートシン，; ヘムシー．ニーマ，; ソナルサンタン，; カンケイ．ダオ，; カイルコーベット，; イーワン，; クリストファージェイ．ケース，
Original assignee: Xilinx Inc
Current assignee: Xilinx Inc
Priority date: 2018-07-26
Filing date: 2019-07-25
Publication date: 2021-11-11
Anticipated expiration: 2039-07-25
Also published as: KR20210033996A; EP3827356A1; CN112543925B; CN112543925A; US10802995B2; JP7565911B2; JP2024099640A; WO2020023797A1; US20200081850A1

Abstract

A system may include a host processor (105) connected to a communication bus, a first hardware accelerator (135-1) communicatively linked to the host processor (105) through the communication bus, and a second hardware accelerator (135-2) communicatively linked to the host processor (105) through the communication bus. The first hardware accelerator (135-1) and the second hardware accelerator (135-2) are directly connected through an accelerator link that is independent of the communication bus. The host processor (105) is configured to initiate a data transfer between the first hardware accelerator (135-1) and the second hardware accelerator (135-2) directly through the accelerator link. [Selected drawing] FIG. 1

Description

The present disclosure relates to hardware acceleration and, more particularly, to enabling the use of multiple hardware accelerators through an integrated address space and low-latency communication links.

A heterogeneous computing platform (HCP) refers to a data processing system that includes a host processor connected to one or more other devices through an interface circuit. The devices are generally architecturally different from the host processor. The host processor can offload tasks to the devices. The devices can perform those tasks and make the results available to the host processor. As an illustrative example, the host processor is typically implemented as a central processing unit, while the devices are implemented as graphics processing units (GPUs) and/or digital signal processors (DSPs).

In other HCPs, one or more of the devices that perform tasks offloaded from the host processor include devices adapted for hardware acceleration (referred to as "hardware accelerators"). A hardware accelerator includes circuitry that can perform a task offloaded from the host, as opposed to executing software or program code to perform the task. The circuitry of the hardware accelerator is functionally equivalent to executing the software, but can generally complete the task in less time.

Examples of hardware accelerators include programmable integrated circuits (ICs) such as field programmable gate arrays (FPGAs), partially programmable ICs, and application-specific ICs (ASICs). Clearly, an HCP may include a combination of different devices in which one or more are adapted to execute program code and one or more others are adapted for hardware acceleration.

In one or more embodiments, a system may include a host processor connected to a communication bus, a first hardware accelerator communicatively linked to the host processor through the communication bus, and a second hardware accelerator communicatively linked to the host processor through the communication bus. The first hardware accelerator and the second hardware accelerator are directly connected through an accelerator link that is independent of the communication bus. The host processor is configured to initiate a data transfer between the first hardware accelerator and the second hardware accelerator directly through the accelerator link.

In one or more embodiments, a hardware accelerator may include an endpoint configured to communicate with a host processor over a communication bus, a memory controller connected to a memory local to the hardware accelerator, and a link circuit connected to the endpoint and the memory controller. The link circuit is configured to establish an accelerator link with a target hardware accelerator that is also connected to the communication bus. The accelerator link is a direct connection between the hardware accelerator and the target hardware accelerator that is independent of the communication bus.

In one or more embodiments, a method may include receiving, within a first hardware accelerator, an instruction and a target address for a data transfer sent from a host processor over a communication bus; comparing, by the first hardware accelerator, the target address with an upper bound of an address range corresponding to the first hardware accelerator; and, in response to determining based on the comparing that the target address exceeds the address range, initiating, by the first hardware accelerator, a transaction with a second hardware accelerator to perform the data transfer using an accelerator link that directly connects the first hardware accelerator and the second hardware accelerator.
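
The comparison step of the method can be illustrated with a minimal sketch. It assumes that the upper bound is the highest address served by the first hardware accelerator and that "exceeds" means strictly greater than that bound; the function and constant names are hypothetical and do not come from the disclosure.

```python
LOCAL_MEMORY = "local_memory"
ACCELERATOR_LINK = "accelerator_link"


def select_path(target_address: int, local_upper_bound: int) -> str:
    """Decide how the first accelerator services a host-requested transfer.

    Assumption: local_upper_bound is the highest address belonging to the
    first accelerator, so any larger address belongs to another accelerator.
    """
    if target_address > local_upper_bound:
        # Target lies beyond the local range: start a transaction with the
        # second accelerator over the direct accelerator link.
        return ACCELERATOR_LINK
    # Target falls within the local range: no link transaction is needed.
    return LOCAL_MEMORY
```

Under these assumptions, an address equal to the upper bound is still treated as local, and the first address above it is sent over the accelerator link.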

This Summary section is provided merely to introduce certain concepts and not to identify any key or essential features of the claimed subject matter. Other features of the inventive arrangements will be apparent from the accompanying drawings and from the following detailed description.

The inventive arrangements are illustrated by way of example in the accompanying drawings. The drawings, however, should not be construed as limiting the inventive arrangements to only the particular implementations shown. Various aspects and advantages will become apparent upon review of the following detailed description and upon reference to the drawings.

The drawings show, in order: an example of a system with multiple hardware accelerators; an example implementation of a hardware accelerator; an example of a retransmit engine (RTE); an example method of operation for a system with multiple hardware accelerators; an example of a system with multiple hardware accelerators and one or more additional devices; and an example architecture for an integrated circuit (IC).

While the disclosure concludes with claims defining novel features, it is believed that the various features described within this disclosure will be better understood from a consideration of the description in conjunction with the drawings. The process(es), machine(s), manufacture(s), and any variations thereof described herein are provided for purposes of illustration. Specific structural and functional details described within this disclosure are not to be interpreted as limiting, but merely as a basis for the claims and as a representative basis for teaching one skilled in the art to variously employ the features described in virtually any appropriately detailed structure. Further, the terms and phrases used within this disclosure are not intended to be limiting, but rather to provide an understandable description of the features described.

The present disclosure relates to hardware acceleration and, more particularly, to enabling the use of multiple hardware accelerators through an integrated address space and low-latency communication links. Using hardware accelerators with data processing systems has become an effective technique for offloading tasks from the host processor, thereby reducing the workload on the host processor. Hardware accelerators are typically attached to the host processor through a bus. For example, a hardware accelerator may be mounted on a circuit board that is inserted into an available bus slot of the host system. Typically, each hardware accelerator is mounted on a corresponding circuit board. Adding an additional hardware accelerator to the system usually entails inserting an additional circuit board carrying the hardware accelerator into an available bus slot.

Within conventional systems, applications executed by the host processor must be updated and/or rewritten to specifically access any newly added hardware accelerator (e.g., by hardware address). Further, to transfer data from one hardware accelerator to another, the data is moved from the source hardware accelerator to the host processor and then from the host processor to the target hardware accelerator. The data moves to and from each hardware accelerator through the host processor via the bus. As such, each additional hardware accelerator added to the system increases the number of devices on the bus, thereby creating contention for bandwidth on the bus. As the complexity, number, and/or size of the tasks performed by hardware accelerators (or other devices) increases, the available bandwidth on the bus is further constrained.

In accordance with the inventive arrangements described within this disclosure, an integrated address space for devices is provided. Further, direct communication links between hardware accelerators, referred to herein as "accelerator links," are provided that can operate independently of the bus. The runtime library and driver executed by the host can leverage the integrated address space so that applications executed by the host processor can operate without directly referencing (e.g., addressing) particular hardware accelerators in the system. The runtime library can determine the appropriate addresses to use to effectuate data transfers among the hardware accelerators. As such, applications need not be modified to access additional hardware accelerators that may be added to the system. Further, data transfers can be performed over the accelerator links, which allow data to be transferred directly from one hardware accelerator to another without passing through the host processor, effectively bypassing the bus. As such, the bandwidth used by hardware accelerators on the bus can be significantly reduced, which can increase overall system performance.
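
The bandwidth argument can be made concrete with a small calculation. The sketch below assumes that a host-mediated copy places the payload on the bus twice (source accelerator to host, then host to target accelerator) and that a transfer over an accelerator link places no payload on the bus. Command and control traffic is ignored, and the function name is hypothetical.

```python
def bus_payload_bytes(transfer_bytes: int, use_accelerator_link: bool) -> int:
    """Estimate payload bytes that cross the shared bus for one transfer."""
    if transfer_bytes < 0:
        raise ValueError("transfer size must be non-negative")
    if use_accelerator_link:
        # Data moves directly between accelerators and bypasses the bus.
        return 0
    # Conventional path: accelerator -> host, then host -> accelerator.
    return 2 * transfer_bytes
```

Under these assumptions, every byte moved accelerator-to-accelerator through the host costs two bytes of bus traffic, which is the load an accelerator link removes.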

As noted, additional hardware accelerators can be added to the system using the existing address space without requiring corresponding changes or modifications to the program code (e.g., applications) executed by the host processor. This is supported, at least in part, through implementation of an automatic discovery process for hardware accelerator boards and for adding such boards to the system, use of remote versus local buffer flags, automatic switching to accelerator links for data transfers in at least some cases, and automatic address translation for remote buffers.

Further aspects of the inventive arrangements are described below in greater detail with reference to the figures. For simplicity and clarity of illustration, elements shown in the figures are not necessarily drawn to scale. For example, the dimensions of some of the elements may be exaggerated relative to other elements for clarity. Further, where considered appropriate, reference numbers are repeated among the figures to indicate corresponding, analogous, or like features.

FIG. 1 illustrates an example of a system 100 with multiple hardware accelerators. System 100 is an example of computer hardware that may be used to implement a computer, a server, or another data processing system. System 100 is also an example of a heterogeneous computing system. As pictured, system 100 includes at least one host processor 105 connected to host memory 110 through interface circuit 115.

System 100 also includes a plurality of hardware accelerators 135. In the example of FIG. 1, system 100 includes three hardware accelerators 135-1, 135-2, and 135-3. Although the example of FIG. 1 shows three hardware accelerators, it should be appreciated that system 100 may include fewer than three hardware accelerators or four or more hardware accelerators. Further, system 100 may include one or more other devices such as graphics processing units (GPUs) or digital signal processors (DSPs).

System 100 can store computer-readable instructions (also referred to as "program code") within host memory 110. Host memory 110 is an example of a computer-readable storage medium. Host processor 105 can execute the program code accessed from host memory 110 via interface circuit 115. In one or more embodiments, host processor 105 communicates with host memory 110 through a memory controller (not shown).

Host memory 110 may include one or more physical memory devices such as, for example, a local memory and a bulk storage device. Local memory refers to non-persistent memory device(s) generally used during actual execution of program code. Examples of local memory include random access memory (RAM) and/or any of the various types of RAM suitable for use by a processor during execution of program code, such as DRAM, SRAM, and DDR SDRAM. A bulk storage device refers to a persistent data storage device. Examples of bulk storage devices include, but are not limited to, a hard disk drive (HDD), a solid-state drive (SSD), flash memory, read-only memory (ROM), erasable programmable read-only memory (EPROM), electrically erasable programmable read-only memory (EEPROM), or other suitable memory. System 100 may also include one or more cache memories (not shown) that provide temporary storage of at least some program code in order to reduce the number of times program code must be retrieved from a bulk storage device during execution.

Host memory 110 can store program code and/or data. For example, host memory 110 may store an operating system 120, instructions 125, and data 130. In the example of FIG. 1, instructions 125 may include one or more applications 170, a runtime library 172 (referred to herein as the "runtime"), and a driver 174 that can communicate with hardware accelerators 135. Runtime 172 can handle completion events, manage command queues, and provide notifications to application(s) 170. Data 130 may include, among other types of data items, buffer objects such as buffer objects 176 and 178, which enable direct data transfers between hardware accelerators 135. Buffer object 176 includes a remote flag 180, and buffer object 178 includes a remote flag 182. For purposes of illustration, remote flag 180 is not set, while remote flag 182 is set. System 100, e.g., host processor 105, can execute operating system 120 and instructions 125 to perform the operations described within this disclosure.

Examples of interface circuit 115 include, but are not limited to, a system bus and an input/output (I/O) bus. Interface circuit 115 may be implemented using any of a variety of bus architectures. Examples of bus architectures may include, but are not limited to, the Extended Industry Standard Architecture (EISA) bus, the Accelerated Graphics Port (AGP), the Video Electronics Standards Association (VESA) local bus, the Universal Serial Bus (USB), and the Peripheral Component Interconnect Express (PCIe) bus. Host processor 105 may be connected to host memory 110 through interface circuitry different from that used to couple to hardware accelerators 135. For purposes of illustration, the endpoint for interface circuit 115 through which host processor 105 communicates with other devices is not shown.

System 100 may further include one or more other I/O devices (not shown) connected to interface circuit 115. The I/O devices may be connected to system 100, e.g., to interface circuit 115, either directly or through intervening I/O controllers. Examples of I/O devices include, but are not limited to, a keyboard, a display device, a pointing device, one or more communication ports, and a network adapter. A network adapter refers to circuitry that enables system 100 to become connected to other systems, computer systems, remote printers, and/or remote storage devices through intervening private or public networks. Modems, cable modems, Ethernet cards, and wireless transceivers are examples of different types of network adapters that may be used with system 100.

In the example of FIG. 1, hardware accelerators 135-1, 135-2, and 135-3 are connected to memories 140-1, 140-2, and 140-3, respectively. Memories 140-1, 140-2, and 140-3 are implemented as RAMs as generally described in connection with host memory 110. In one or more embodiments, each hardware accelerator 135 is implemented as an IC. The IC may be a programmable IC. An example of a programmable IC is a field programmable gate array (FPGA).

In the example of FIG. 1, each of hardware accelerators 135 includes an endpoint 145, a link circuit 150, a memory controller 155 (abbreviated as "MC" in FIG. 1), and an interconnect circuit 168. Each hardware accelerator 135 also includes one or more compute units (abbreviated as "CU" in FIG. 1). A compute unit is a circuit that can perform tasks offloaded from host processor 105. For purposes of illustration, each of hardware accelerators 135 is shown to include a compute unit 160 and a compute unit 165. It should be appreciated that hardware accelerators 135 may include fewer or more compute units than shown.

In one example, each of endpoints 145 is implemented as a PCIe endpoint. It should be appreciated that endpoints 145 may be implemented as any type of endpoint suitable for communicating over the particular type or implementation of interface circuit 115 used by system 100. Each of memory controllers 155 is connected to a respective memory 140 to enable access (e.g., reading and writing) of that memory 140 by hardware accelerator 135.

In one or more embodiments, hardware accelerator 135-1 and memory 140-1 are mounted on a first circuit board (not shown), hardware accelerator 135-2 and memory 140-2 are mounted on a second circuit board (not shown), and hardware accelerator 135-3 and memory 140-3 are mounted on a third circuit board (not shown). Each of these circuit boards may include a suitable connector for coupling to a bus port or slot. For example, each of the circuit boards may have a connector configured for insertion into an available PCIe slot (or other bus/interface connector) of system 100.

Each of link circuits 150 can establish an accelerator link with at least one other, e.g., neighboring, link circuit 150. As used herein, an "accelerator link" refers to a communication link that directly connects two hardware accelerators. For example, each of the circuit boards having a hardware accelerator 135 may be connected through wires that connect to link circuits 150. Link circuits 150 may establish the accelerator links over the wires.

In particular embodiments, link circuits 150 are communicatively linked using a ring topology. Data sent over the accelerator link(s) established by link circuits 150 is mastered from left to right, as indicated by the directional arrows; that is, each link circuit acts as master toward its right-hand neighbor. For example, referring to the example of FIG. 1, the link circuit on the left (e.g., link circuit 150-1) may operate as a master, while the neighboring link circuit on the right (e.g., link circuit 150-2) may operate as a slave. Similarly, link circuit 150-2 may operate as a master with respect to link circuit 150-3. Link circuit 150-3 may operate as a master with respect to link circuit 150-1.
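
The ring arrangement can be sketched as follows. The sketch assumes a unidirectional ring in which every transaction travels only in the master-to-slave direction and wraps from the last link circuit back to the first; how responses return is outside its scope. The function name and the zero-based position numbering are hypothetical.

```python
from typing import List


def ring_path(source: int, target: int, ring_size: int) -> List[int]:
    """Return the ring positions visited, in order, from source to target.

    Positions are zero-based. Each hop moves to the next position, which
    models a link circuit acting as master toward its right-hand neighbor.
    """
    if ring_size < 2:
        raise ValueError("a ring needs at least two link circuits")
    if not (0 <= source < ring_size and 0 <= target < ring_size):
        raise ValueError("position outside the ring")
    path = [source]
    current = source
    while current != target:
        current = (current + 1) % ring_size  # wrap from the last to the first
        path.append(current)
    return path
```

With positions 0, 1, and 2 standing for link circuits 150-1, 150-2, and 150-3, `ring_path(2, 1, 3)` visits positions 2, 0, and 1: a transaction leaving link circuit 150-3 reaches link circuit 150-2 by way of link circuit 150-1.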

In one or more embodiments, each link circuit 150 includes a table or register that specifies the amount (or size) of memory 140 for each hardware accelerator (e.g., on each board). Using the table, each link circuit 150 can modify the addresses specified in transactions for purposes of exchanging information over the accelerator links. In particular embodiments, the table or register is static. In one or more other embodiments, the driver can read and/or update the information stored in the table or register dynamically, e.g., at runtime.
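
One way such a size table could support address modification is sketched below. The layout is an assumption made for illustration only: the memories are taken to be concatenated in table order starting at address 0, which this passage does not state. The function name is hypothetical.

```python
from typing import List, Tuple


def locate(global_address: int, memory_sizes: List[int]) -> Tuple[int, int]:
    """Map an address in the combined space to (table index, local offset).

    Assumption: memory_sizes[i] is the size in bytes of the memory attached
    to the i-th accelerator, and the memories are laid out back to back.
    """
    if global_address < 0:
        raise ValueError("address must be non-negative")
    base = 0
    for index, size in enumerate(memory_sizes):
        if global_address < base + size:
            # The address falls inside this accelerator's memory; remove the
            # base so the remaining offset is valid for its local memory.
            return index, global_address - base
        base += size
    raise ValueError("address is outside every accelerator's memory")
```

If a driver updated the table at runtime, only `memory_sizes` would change; the lookup itself would stay the same.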

For purposes of illustration, operation of hardware accelerator 135-2 is described. It should be appreciated that like-numbered components in each respective hardware accelerator can operate in the same or a similar manner. Referring to hardware accelerator 135-2, link circuit 150-2 can receive a transaction from any of a variety of different sources or initiators and route the transaction to any of a variety of targets. For example, link circuit 150-2 can receive a transaction from endpoint 145-2 (e.g., originating from host processor 105), from compute unit 160-2, from compute unit 165-2, from hardware accelerator 135-1 via link circuit 150-1, or from hardware accelerator 135-3, in which case the transaction flows via link circuit 150-3 to link circuit 150-1 and then on to link circuit 150-2. Link circuit 150-2 can route a transaction to any target, such as endpoint 145-2 (e.g., to host processor 105), compute unit 160-2, compute unit 165-2, memory controller 155-2, hardware accelerator 135-1 (via link circuit 150-3 and on to link circuit 150-1), or hardware accelerator 135-3 (via link circuit 150-3), where the target is different from the source or initiator.

For example, host processor 105 can access any location in memory 140-1, memory 140-2, and/or memory 140-3 as part of the integrated address space. In accessing such memories, however, host processor 105 may do so by accessing a selected hardware accelerator, e.g., hardware accelerator 135-2, and then reaching any target, such as memory 140-1, memory 140-2, or memory 140-3, through the selected hardware accelerator using the accelerator links.

As an illustrative and non-limiting example, host processor 105 may initiate a data transfer involving hardware accelerators 135-2 and 135-3. Hardware accelerator 135-2 may be the initiator. In this example, host processor 105, e.g., runtime 172 and/or driver 174, creates buffer object 176 corresponding to hardware accelerator 135-2 and buffer object 178 corresponding to hardware accelerator 135-3. Host processor 105 sets remote flag 182, which indicates that the target address for the data transfer (located in hardware accelerator 135-3) is remote relative to the initiating hardware accelerator (hardware accelerator 135-2).

Endpoint 145-2 is capable of receiving the task offloaded from host processor 105 via interface circuit 115. In one or more embodiments, host processor 105, by way of executing runtime 172 and driver 174, is capable of viewing hardware accelerators 135 as an integrated address space. Endpoint 145-2 may provide the task (e.g., data) to compute unit 160-2. The task may specify a target address within memory 140-3 from which compute unit 160-2 is to retrieve data for performing the offloaded task. Hardware accelerator 135-2, using link circuit 150-2, is capable of initiating and performing the data transfer directly with hardware accelerator 135-3 over the accelerator link established between link circuit 150-2 and link circuit 150-3.

Although the data transfer is initiated by host processor 105, the data transfer is performed using link circuits 150 and occurs without involving host processor 105, host memory 110, or interface circuit 115. The data transfer occurs directly between the hardware accelerators. In a conventional system, the data transfer would occur by host processor 105 retrieving the data from hardware accelerator 135-3 via interface circuit 115 and then providing the data to hardware accelerator 135-2 via interface circuit 115.

The ability of hardware accelerators 135 to read and write data among themselves, without moving that data through host processor 105, significantly reduces the amount of data passed over interface circuit 115 (e.g., the PCIe bus). This preserves substantial bandwidth of interface circuit 115 for use in conveying data between host processor 105 and other hardware accelerators 135. Further, the speed of operation of system 100 may be increased due to the reduction in the time required for hardware accelerators 135 to share data.

System 100 may include fewer components than shown, or additional components not illustrated in FIG. 1, depending upon the particular type of device and/or system that is implemented. In addition, the particular operating system, application(s), and/or I/O devices included may vary based upon system type. Further, one or more of the illustrative components may be incorporated into, or otherwise form a portion of, another component. For example, a processor may include at least some memory. System 100 may be used to implement a single computer or a plurality of networked or interconnected computers, each implemented using the architecture of FIG. 1 or an architecture similar thereto.

FIG. 2 illustrates an example implementation of hardware accelerator 135-2 of FIG. 1. Within FIG. 2, an example implementation of link circuit 150-2 is provided. It should be appreciated that the architecture illustrated for link circuit 150-2 in FIG. 2 may be used to implement any of link circuits 150 illustrated in FIG. 1.

In one or more embodiments, link circuit 150-2 is capable of converting transactions that are to be sent to other hardware accelerators into data stream based packets and routing the packets over the accelerator links established among link circuits 150. In particular embodiments, link circuit 150-2 is capable of converting AMBA eXtensible Interface (AXI) compliant memory mapped transactions into AXI data streams for transmission. Within this disclosure, AXI is used as an example communication protocol. It should be appreciated that other communication protocols may be used. In this regard, the use of AXI is intended for purposes of illustration and not limitation. Link circuit 150-2 is also capable of handling incoming packets from other hardware accelerators (e.g., hardware accelerators 135-1 and 135-3), converting the packets into memory mapped transactions, and routing the data locally within hardware accelerator 135-2. Further, link circuit 150-2 is capable of converting received packets into memory mapped transactions, modifying the transactions, converting the memory mapped transactions back into packets, and passing the packets to the next hardware accelerator. Data received via the accelerator links may be routed internally within hardware accelerator 135-2 as memory mapped transactions.

In the example of FIG. 2, link circuit 150-2 includes transceivers 202 and 204, retransmit engines (RTEs) 206 and 208, and memory map to stream (MM-stream) mappers 210 and 212. MM-stream mappers 210 and 212 are connected to interconnect circuit 214.

As pictured, transceiver 202 may be connected to a corresponding transceiver in hardware accelerator 135-1, while transceiver 204 is connected to a corresponding transceiver in hardware accelerator 135-3. Transceivers 202 and 204 implement the physical layer of the accelerator links established with the other hardware accelerators. Each of transceivers 202 and 204 is capable of implementing a lightweight, serial communications protocol for multi-gigabit communication links. In one or more embodiments, each of transceivers 202 and 204 is capable of implementing a bi-directional interface to a transceiver in a neighboring IC. Transceivers 202 and 204 are capable of automatically initializing the accelerator links with the other hardware accelerators. In general, transceivers 202 and 204 are capable of bi-directional communication to implement low level signaling and low PHY level protocols relating to flow control. Data flows, however, may be implemented using a ring topology and flowing from master to slave (e.g., in a single direction around the ring) as previously described.

For example, transceiver 202 is capable of communicating bi-directionally with a corresponding transceiver within link circuit 150-1 of hardware accelerator 135-1. Transceiver 204 is capable of communicating bi-directionally with a corresponding transceiver within link circuit 150-3 of hardware accelerator 135-3. Each of transceivers 202 and 204 is capable of communicating with a neighboring transceiver using data streams, e.g., AXI data streams.

In particular embodiments, transceivers 202 and 204 are capable of sending data to, and receiving data from, a neighboring hardware accelerator using 8B/10B coding rules. Each of transceivers 202 and 204 is capable of detecting single-bit errors and most multi-bit errors using the 8B/10B coding rules.

In one or more embodiments, each of transceivers 202 and 204 is implemented as an Aurora 8B/10B IP core, which is available from Xilinx, Inc. of San Jose, California. It should be appreciated, however, that the particular core noted is provided for purposes of illustration and is not intended as a limitation. Other transceivers that are capable of operating as described herein may be used.

Transceiver 202 is connected to RTE 206. Transceiver 202 and RTE 206 are capable of communicating through a plurality of data streams running in each direction, thereby supporting bi-directional communication. Transceiver 204 is connected to RTE 208. Transceiver 204 and RTE 208 are capable of communicating through a plurality of data streams running in each direction, thereby supporting bi-directional communication.

RTEs 206 and 208 are capable of managing transactions. In one or more embodiments, RTE 206 and RTE 208 each implement additional layers of the communication protocol on top of those implemented by transceivers 202 and 204, respectively. For example, RTE 206 and RTE 208 each implement a transaction layer (TL)/link layer (LL) and a user layer. These additional layers provide extra assurance regarding data integrity. After initialization, applications are able to pass data across the accelerator links as streams of data. The additional data integrity measures are particularly beneficial because control signals are merged with the data when memory mapped transactions are converted to stream data. A data integrity issue may result in corrupted control signals, and on-chip interconnects and/or buses are intolerant of data loss with respect to control signals.

The TL/LL implements token based flow control to guarantee lossless data communication. In one or more embodiments, the communication channels between neighboring transceivers, and between a transceiver and an RTE, are 128 bits in width. When sending data, each RTE is capable of checking that the receiving link circuit in the target hardware accelerator has sufficient buffering resources (e.g., tokens) to receive the entire transaction before actually sending the transaction to the physical layer implemented by the transceiver. For example, RTE 206 may check that receiving link circuit 150-1 in hardware accelerator 135-1 has sufficient buffer resources to receive the data prior to providing the data to transceiver 202 (within link circuit 150-2) for sending.
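The token check described above can be sketched as follows. This is a minimal illustration only, not the RTE implementation: the class names `ReceiverBuffer` and `Sender`, the one-token-per-flit accounting, and the way tokens are returned are all assumptions made for the example.

```python
from collections import deque


class ReceiverBuffer:
    """Hypothetical receive-side buffer; each free slot is one token."""

    def __init__(self, slots):
        self.free_tokens = slots
        self.queue = deque()

    def accept(self, flits):
        # The sender must never exceed the advertised tokens.
        assert len(flits) <= self.free_tokens
        self.free_tokens -= len(flits)
        self.queue.extend(flits)

    def drain(self, count):
        """Consume up to `count` flits and report how many tokens were freed."""
        freed = min(count, len(self.queue))
        for _ in range(freed):
            self.queue.popleft()
        self.free_tokens += freed
        return freed


class Sender:
    """Hypothetical sending side that tracks the receiver's tokens."""

    def __init__(self, initial_tokens):
        self.tokens = initial_tokens

    def try_send(self, transaction, receiver):
        # Send only if the receiver can hold the entire transaction.
        if len(transaction) > self.tokens:
            return False
        self.tokens -= len(transaction)
        receiver.accept(transaction)
        return True

    def return_tokens(self, count):
        # Models a token-return message arriving from the receiver.
        self.tokens += count
```

Because the sender refuses to start a transaction it cannot complete, the receiver never has to drop data, which is the lossless property the paragraph attributes to the TL/LL.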

RTEs 206 and 208 are capable of detecting data corruption. For example, each of RTEs 206 and 208 is capable of verifying packet length information, packet sequence information, and/or the cyclic redundancy check (CRC) checksum for each packet that is received. When an RTE slave (e.g., the receiving RTE) detects a packet error, the RTE may enter an error abort mode. In the error abort mode, the RTE drops the packet with the error as a failed packet. The RTE further drops all subsequent packets of the transaction. In particular embodiments, initiation of the error abort mode causes the RTE to launch a link retry sequence. Once the link retry sequence is successful, the link master (e.g., the sending RTE) is capable of recovering the transmission by starting from the point of failure.
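The receive-side checks and the error abort behavior can be sketched as follows. The packet layout (a 2-byte sequence number, a 2-byte length, the payload, and a trailing CRC-32) and the names `make_packet` and `Receiver` are assumptions chosen for illustration; the description does not specify the actual packet format or CRC polynomial.

```python
import struct
import zlib


def make_packet(seq, payload):
    """Build a packet using the assumed layout: seq, length, payload, CRC-32."""
    header = struct.pack(">HH", seq, len(payload))
    crc = zlib.crc32(header + payload)
    return header + payload + struct.pack(">I", crc)


class Receiver:
    """Hypothetical receiving RTE that validates packets and aborts on error."""

    def __init__(self):
        self.expected_seq = 0
        self.error_abort = False
        self.delivered = []

    def _valid(self, packet):
        if len(packet) < 8:
            return False
        seq, length = struct.unpack(">HH", packet[:4])
        (crc,) = struct.unpack(">I", packet[-4:])
        return (
            length == len(packet) - 8          # packet length check
            and seq == self.expected_seq       # packet sequence check
            and crc == zlib.crc32(packet[:-4])  # CRC check
        )

    def receive(self, packet):
        if self.error_abort:
            return False  # drop everything until a retry succeeds
        if not self._valid(packet):
            self.error_abort = True  # failed packet: enter error abort mode
            return False
        self.delivered.append(packet[4:-4])
        self.expected_seq += 1
        return True

    def retry(self):
        """Leave error abort mode; return the sequence number to resend from."""
        self.error_abort = False
        return self.expected_seq
```

In this sketch the sender resumes from the sequence number returned by `retry()`, which corresponds to restarting from the point of failure.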

RTE 206 is connected to MM-stream mapper 210. RTE 206 is capable of communicating with MM-stream mapper 210 via a plurality of data streams running in each direction, thereby supporting bi-directional communication. RTE 208 is connected to MM-stream mapper 212. RTE 208 is capable of communicating with MM-stream mapper 212 via a plurality of data streams running in each direction, thereby supporting bi-directional communication.

Each of MM-stream mapper 210 and MM-stream mapper 212 is connected to interconnect circuit 214. Interconnect circuit 214 is capable of routing data among MM-stream mappers 210 and 212 and the other master and/or slave circuits of hardware accelerator 135-2 connected thereto. Interconnect circuit 214 may be implemented as one or more on-chip interconnects. An example of an on-chip interconnect is an AXI bus. An AXI bus is an embedded microcontroller bus interface for use in establishing on-chip connections between circuit blocks and/or systems. Other example implementations of interconnect circuitry may include, but are not limited to, other buses, crossbars, network-on-chips (NoCs), and so forth.

MM-stream mappers 210 and 212 are capable of converting data streams received from RTEs 206 and 208, respectively, into memory mapped transactions that may be provided to interconnect circuit block 214. In this regard, a data stream may be demultiplexed into the multiple channels that support memory mapped transactions. MM-stream mappers 210 and 212 are also capable of converting memory mapped transactions received from interconnect circuit block 214 into stream data that may be provided to RTEs 206 and 208, respectively. MM-stream mappers 210 and 212 are capable of multiplexing the multiple channels that support memory mapped transactions (e.g., including the control signals described) into a single data stream for sending to RTEs 206 and 208, respectively.
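The multiplexing and demultiplexing described above can be sketched as follows. The five channel names are the standard AXI channels (write address, write data, write response, read address, read data); tagging each stream word with a channel index is an assumption made for this example, since the description does not give the actual stream encoding.

```python
from collections import defaultdict

# The five AXI memory mapped channels.
CHANNELS = ("AW", "W", "B", "AR", "R")


def multiplex(beats):
    """Merge (channel, payload) beats, in issue order, into one tagged stream."""
    return [(CHANNELS.index(channel), payload) for channel, payload in beats]


def demultiplex(stream):
    """Split a tagged stream back into per-channel lists, preserving order."""
    per_channel = defaultdict(list)
    for tag, payload in stream:
        per_channel[CHANNELS[tag]].append(payload)
    return dict(per_channel)
```

Since address and control information travel in the same stream as the data in this scheme, a corrupted stream word can damage a control signal, which is why the RTE integrity checks matter.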

In one or more embodiments, each of MM-stream mappers 210 and 212 is capable of adjusting the target address received in a transaction. MM-stream mapper 210, for example, in receiving a transaction from hardware accelerator 135-1 via the accelerator link, may subtract the upper bound of the address range for hardware accelerator 135-2 (e.g., the address range of memory 140-2) from the target address of the transaction. By adjusting the target address as a transaction passes through link circuits 150, the transaction may be directed from one hardware accelerator to another via the accelerator links. Further details relating to the operation of addresses in using the accelerator links are described in greater detail in connection with FIG. 4.
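The per-hop address adjustment can be sketched as follows. This is an illustrative model only: it assumes every accelerator has the same address range of 1 to 1000, that accelerators are indexed 0, 1, 2 in ring order, and that a transaction is forwarded to the next accelerator whenever its address exceeds the local range. The function name `route` is hypothetical.

```python
RANGE_UPPER = 1000  # assumed upper bound of each accelerator's range (1..1000)


def route(address, start, ring_size):
    """Follow a transaction around the ring.

    Returns (accelerator index, local address) where the transaction lands.
    Each hop subtracts the upper bound of the address range, as an
    MM-stream mapper would when the transaction crosses a link.
    """
    current = start
    while address > RANGE_UPPER:
        address -= RANGE_UPPER
        current = (current + 1) % ring_size
    return current, address
```

For instance, `route(1500, 0, 3)` models a transaction issued at the first accelerator with address 1500; one subtraction leaves address 500 at the second accelerator.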

For purposes of illustration, other portions of hardware accelerator 135-2 are described in relation to link circuit 150-2. In the example of FIG. 2, interconnect circuit 214 is connected to a direct memory access (DMA) master circuit 216. DMA master circuit 216, for example, includes a memory mapped interface for communicating with interconnect circuit block 214. DMA master circuit 216 is connected to PCIe endpoint 218. PCIe endpoint 218, which is an example implementation of endpoint 145-2 of FIG. 1, is communicatively linked to host processor 105.

In the example of FIG. 2, interconnect circuit 214 is also connected to one or more compute unit masters 220-1 through 220-N. Each compute unit master 220 provides a bi-directional interface between a compute unit implemented within hardware accelerator 135-2 and interconnect circuit block 214. Each compute unit master 220 further includes a memory mapped interface for communicating with interconnect circuit block 214. Each of compute unit 160-2 and compute unit 165-2 may be connected to interconnect circuit 214 via a slave interface (not shown).

In the example of FIG. 2, interconnect circuit 214 is also connected to one or more memory controller slave circuits 225-1 through 225-N. Each memory controller slave circuit 225 facilitates read and write operations for memory 140-2. Memory 140-2 may be implemented as one or more off-chip memories accessible by hardware accelerator 135-2. Each of memory controllers 225-1 through 225-N further includes a memory mapped interface for communicating with interconnect circuit block 214.

FIG. 3 illustrates an example implementation of RTE 206. The example architecture described in connection with FIG. 3 implements a credit-based flow control/retransmission control scheme using flow control units (FLITs). RTE 206 is capable of translating between the FLIT-based protocol and/or interface used internally and the protocol and/or interface that may be used by applications.

RTE 206 includes a transmit channel 330. Transmit channel 330 is capable of decapsulating data (e.g., AXI) streams into FLIT-based transactions. In the example of FIG. 3, transmit channel 330 includes a transmit (TX) packet cyclic redundancy check (CRC) generator 302, a retry pointer return command (PRET) packet/initial retry command (IRTRY) packet generator and return retry pointer (RRP) embedder 304, a token return (TRET) packet generator and sequence (SEQ) number/forward retry pointer (FRP)/return token count (RTC) embedder 306, a flow control circuit 308, and an output buffer 310. TRET generator and SEQ/FRP/RTC embedder 306 is also connected to retry buffer 312.

RTE 206 includes a receive channel 340. Receive channel 340 is capable of encapsulating the FLIT-based interface and converting the interface into data (e.g., AXI) streams. In the example of FIG. 3, receive channel 340 includes a packet boundary detector 316, a receive (RX) packet CRC circuit 318, an RX packet processor 320, and an input buffer 322. RX packet processor 320 is connected to error handler 324 and to retry sequence circuit 314.

RTE 206 is provided for purposes of illustration and not limitation. It should be appreciated that other architectures suitable for implementing a credit-based flow control/retransmission control scheme may be used. The architecture described in connection with FIG. 3 may also be used to implement RTE 208 of FIG. 2, with a flipped or reversed orientation in terms of data flow.

FIG. 4 illustrates an example method 400 of operation for a system with a plurality of hardware accelerators. Method 400 illustrates an example of a direct data transfer between hardware accelerators. Method 400 may be performed by a system the same as, or similar to, system 100 described in connection with FIG. 1. Method 400 illustrates how insufficient bandwidth on the bus coupling the host processor and the hardware accelerators may be alleviated. Data transfers that otherwise would occur on the bus may be diverted to the accelerator links, thereby freeing bandwidth on the bus for other operations.

In block 405, the system is capable of automatically discovering the hardware accelerator sequence. In one or more embodiments, the hardware accelerators, e.g., boards of the hardware accelerators, are arranged in a ring topology within the system. The host processor is aware of the existing PCIe topology and, as such, of the number of hardware accelerators that exist within the system connected to the PCIe bus. Further, the host processor, e.g., by way of the runtime, is aware of the particular circuitry (e.g., image or configuration bitstream) loaded into each hardware accelerator. As such, the host processor is aware that the hardware accelerators support accelerator links as described herein. The host processor must still determine the sequence of the hardware accelerators. The driver, for example, is capable of performing the automatic discovery of the hardware accelerator sequence described. This auto-discovery capability supports the addition of new and/or additional hardware accelerators to the system without having to modify the applications executed by the host processor.

Each hardware accelerator may have a known address range that is the same for every hardware accelerator. For example, each hardware accelerator may be presumed to have an address range of 16 GB corresponding to 16 GB of memory 140. In one or more embodiments, the host processor is capable of writing a unique value to memory addresses at 16 GB intervals. The host processor may then read the values back to determine the sequence of the hardware accelerators within the ring topology based upon the written and read values.
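One way the write-then-read discovery could work is sketched below against a simulated ring. The description does not say through which accelerator the values are written or read back, so the procedure here is an assumption: unique markers are written at 16 GB intervals through a single entry accelerator, and each accelerator's local memory is then read directly. `SimulatedRing`, `discover_sequence`, and `MARKER_BASE` are hypothetical names, and addresses are zero-based in this sketch.

```python
RANGE = 16 * 2**30      # assumed 16 GB address range per accelerator
MARKER_BASE = 0xA000    # arbitrary base so markers differ from empty memory


class SimulatedRing:
    """Toy model of accelerators wired in a ring whose order is unknown."""

    def __init__(self, ring_order):
        self.ring_order = list(ring_order)
        self.memory = {dev: {} for dev in self.ring_order}

    def write_via(self, entry, address, value):
        # An address k * RANGE + offset lands k hops downstream of `entry`.
        hops, local = divmod(address, RANGE)
        pos = (self.ring_order.index(entry) + hops) % len(self.ring_order)
        self.memory[self.ring_order[pos]][local] = value

    def read_local(self, device, local):
        # Models a direct read of one accelerator's own memory over the bus.
        return self.memory[device].get(local)


def discover_sequence(ring, devices):
    """Return the devices in ring order, starting from devices[0]."""
    entry = devices[0]
    for hop in range(len(devices)):
        ring.write_via(entry, hop * RANGE, MARKER_BASE + hop)
    by_hop = {ring.read_local(dev, 0) - MARKER_BASE: dev for dev in devices}
    return [by_hop[hop] for hop in range(len(devices))]
```

The marker read from each accelerator reveals how many hops it sits from the entry accelerator, which is enough to reconstruct the ring order.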

In block 410, the host processor is capable of creating a buffer on each hardware accelerator at startup. For example, the runtime executed by the host processor is capable of communicating with each hardware accelerator to create a buffer within the memory of each respective hardware accelerator. Referring to FIG. 1, hardware accelerator 135-1 creates a buffer within memory 140-1. Hardware accelerator 135-2 creates a buffer within memory 140-2. Hardware accelerator 135-3 creates a buffer within memory 140-3.

In block 415, the host processor initiates a data transfer between hardware accelerators. The data transfer, for example, may be part of a task that is to be offloaded from the host processor to a hardware accelerator. As an illustrative and non-limiting example, host processor 105 may offload a task for an application to compute unit 160-1 of hardware accelerator 135-1. The task may include instructions and a target address from which compute unit 160-1 is to obtain data for the task. The target address in this example is located in hardware accelerator 135-2 (e.g., in memory 140-2). Accordingly, to perform the task offloaded from the host processor, compute unit 160-1 must retrieve the data from the target address in memory 140-2.

In block 420, the runtime may request a data transfer between hardware accelerator 135-1 and hardware accelerator 135-2. For example, the runtime may request a read of hardware accelerator 135-2 by, or from, hardware accelerator 135-1.

In block 425, the driver is capable of creating a buffer object in the host memory corresponding to hardware accelerator 135-2 and a buffer object in the host memory corresponding to hardware accelerator 135-1. A buffer object is a shadow data structure implemented in the host memory. Each buffer object may correspond to, or represent, a device in the system. A buffer object may include data that supports administrative functions performed by the runtime executed by the host processor.

In one or more embodiments, the buffer objects created in the host memory may include a remote flag. The remote flag may be set to indicate that the buffer object is remote from the perspective of the hardware accelerator that is initiating a transaction. In this example, hardware accelerator 135-1 is reading data from hardware accelerator 135-2. As such, hardware accelerator 135-1 is initiating the transaction. The driver sets the remote flag in the buffer object corresponding to hardware accelerator 135-2 upon creation.

In block 430, the runtime library initiates access to the buffer object (e.g., the remote buffer object) by the initiating hardware accelerator. The runtime library initiates access, from hardware accelerator 135-1, to the buffer object corresponding to hardware accelerator 135-2. For example, the runtime determines that the remote flag is set within the buffer object for hardware accelerator 135-2. In response to determining that the remote flag is set, the runtime library schedules the transfer using the accelerator links established by the link circuits. In scheduling the transfer using the accelerator links between the hardware accelerators, the runtime determines the address to be used by hardware accelerator 135-1 to access the data from hardware accelerator 135-2.
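The role of the remote flag in blocks 425 and 430 can be sketched as follows. The `BufferObject` fields, the helper names, and the string labels for the chosen path are all hypothetical; the description states only that the buffer object is a shadow data structure in host memory that may carry a remote flag.

```python
from dataclasses import dataclass


@dataclass
class BufferObject:
    """Hypothetical shadow structure kept in host memory for one device."""
    device: str
    remote: bool = False


def create_buffer_objects(initiator, target):
    """Driver step: create both objects, flagging the target if it is remote."""
    local_bo = BufferObject(device=initiator)
    target_bo = BufferObject(device=target, remote=(target != initiator))
    return local_bo, target_bo


def choose_path(buffer_object):
    """Runtime step: a set remote flag selects the accelerator link."""
    return "accelerator_link" if buffer_object.remote else "local_memory"
```

With initiator 135-1 and target 135-2, the buffer object for 135-2 is created with the flag set, so the runtime in this sketch schedules the transfer over the accelerator link.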

For purposes of illustration, consider an example in which each of hardware accelerators 135 has an address range of 1 to 1000. In such an example, the runtime may determine that the data to be retrieved from hardware accelerator 135-2 by hardware accelerator 135-1 is in a buffer at address 500 corresponding to hardware accelerator 135-2 (e.g., at address 500 corresponding to memory 140-2). In this example, the runtime adds 1000 to the target address, yielding an address of 1500, which is provided to hardware accelerator 135-1 as the target address from which to read the data to be operated on for the offloaded task.

As another example, if the data were stored at address 500 within memory 140-3, the runtime would add 2000, again assuming that each of hardware accelerators 135 has an address range of 1 to 1000, so that the transaction reaches hardware accelerator 135-3. In general, as is known, return path data may be tracked through the on-chip bus interconnects used (e.g., AXI interconnects). When a read request is issued from a master, for example, the read request is routed through the interconnects to the slave with a series of address decodes and/or address shifts (performed by the MM stream mappers) as the read request traverses each hardware accelerator. Each individual interconnect is capable of tracking which masters have outstanding transactions to each slave. When the read data is returned, it may be sent back over the correct interface(s). In some cases, identifier (ID) bits may be used to associate particular read data with a particular master so that the read data can be returned to that master.
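The address arithmetic in the two preceding examples can be written as a short sketch. The function name `unified_target_address` and the `hops` parameter are assumptions made for illustration; the description states only that the runtime adds 1000 per accelerator traversed when every accelerator has an address range of 1 to 1000.

```python
RANGE_SIZE = 1000  # each accelerator spans addresses 1..1000 in the example


def unified_target_address(local_address: int, hops: int,
                           range_size: int = RANGE_SIZE) -> int:
    """Address the runtime hands to the initiating accelerator.

    hops is the number of accelerator links between the initiator and the
    target (1 for 135-1 -> 135-2, 2 for 135-1 -> 135-3).
    """
    if not 1 <= local_address <= range_size:
        raise ValueError("local_address must lie within the target's range")
    if hops < 0:
        raise ValueError("hops must be non-negative")
    return local_address + hops * range_size


print(unified_target_address(500, hops=1))  # data in memory 140-2
print(unified_target_address(500, hops=2))  # data in memory 140-3
```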

At block 435, the initiating hardware accelerator (e.g., the first hardware accelerator) receives the task from the host processor. Endpoint 145-1, for example, may receive the task and provide the task to compute unit 160-1. The task specifies that the data to be operated on by compute unit 160-1 is located at the target address, which is 1500 in this example. Compute unit 160-1 may have, for example, a control port in which the target address may be stored. In attempting to access the data located at address 1500, compute unit 160-1 recognizes that the address is not within the range of hardware accelerator 135-1. For example, compute unit 160-1 is capable of comparing the address with the upper bound of the address range, 1000, and determining that the address exceeds the upper bound. In this example, compute unit 160-1 is capable of initiating a read transaction from address 1500. For example, compute unit 160-1 may initiate the read transaction as a memory-mapped transaction sent over interconnect 214.

At block 440, the initiating hardware accelerator accesses the target hardware accelerator (e.g., the second hardware accelerator) over the accelerator link. For example, link circuit 150-1 is capable of converting the memory-mapped transaction initiated by compute unit 160-1 into stream-based packets (e.g., using the MM stream mapper). Link circuit 150-1 is further capable of encoding the packets with additional data that supports data integrity checking, retransmission, initialization, and error reporting (e.g., using the RPE). The ring topology may be mastered from left to right. As such, the packets may be output by the transceiver of link circuit 150-1 to link circuit 150-2.

Link circuit 150-2 receives the data stream at transceiver 202 and processes the transaction at RTE 206. MM stream mapper 210 is capable of performing a variety of operations in response to receiving the stream-based packets. MM stream mapper 210, for example, is capable of converting the stream-based packets into a memory-mapped transaction. Further, MM stream mapper 210 is capable of decrementing the target address of 1500 by the upper bound of the address range of hardware accelerator 135-2. As noted, the upper bound may be stored in a table or register within link circuit 150-2, e.g., in MM stream mapper 210. In this example, MM stream mapper 210 decrements the target address of 1500 by 1000, yielding a target address of 500. Since the target address is now local to hardware accelerator 135-2, hardware accelerator 135-2 is able to act on the received transaction. In this example, MM stream mapper 210 provides the memory-mapped transaction to interconnect 214. The memory-mapped transaction may be provided to memory controller 155-2 (e.g., through a memory controller slave) to perform the read transaction. In this manner, hardware accelerator 135-1 is capable of reading data from (or writing data to) hardware accelerator 135-2. The requested data may be provided from memory 140-2 back to the requester using the same path used to send the read request. For example, the data read from memory 140-2 is sent from hardware accelerator 135-2 to hardware accelerator 135-1 without having to traverse forward through the ring topology to hardware accelerator 135-3 and then to hardware accelerator 135-1.

For example, if the target address had been 2500, the result of the decrementing would be 1500. In that case, MM stream mapper 210 determines that the target address is not located in hardware accelerator 135-2, since the decremented target address is still larger than the upper bound of the address range (e.g., 1000) for hardware accelerator 135-2. MM stream mapper 210 may then send the transaction through the interconnect circuitry to MM stream mapper 212 for forwarding on to the next hardware accelerator.
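The decrement-and-decide step performed on the receiving side can be sketched as below. The function `map_incoming` and its string results are hypothetical; the sketch assumes, as in the example, that the stored upper bound is 1000 and that an address is local when it does not exceed that bound after decrementing.

```python
from typing import Tuple


def map_incoming(address: int, upper_bound: int = 1000) -> Tuple[str, int]:
    """Decrement an incoming target address and decide where it goes.

    Returns ("local", addr) when this accelerator can service the
    transaction, or ("forward", addr) when it must be passed on to the
    next accelerator over the accelerator link.
    """
    decremented = address - upper_bound
    if decremented <= upper_bound:
        # Within this accelerator's range: hand to the local interconnect.
        return ("local", decremented)
    # Still above the upper bound: forward with the decremented address.
    return ("forward", decremented)


print(map_incoming(1500))  # transaction aimed at this accelerator
print(map_incoming(2500))  # transaction aimed at the next accelerator
```

Because each hop subtracts its own upper bound, no accelerator needs to know how many accelerators follow it in the chain.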

At block 445, compute unit 160-1 in hardware accelerator 135-1 is capable of generating an interrupt to the host processor informing the host processor that the data transfer between the hardware accelerators is complete. At block 450, the runtime is capable of providing any necessary notifications to applications that the data transfer is complete. The runtime, for example, is capable of handling completion events, command queues, and notifications to applications.

In one or more embodiments, the PCIe endpoint and DMA master are capable of writing to a target address located in a different hardware accelerator. As an illustrative and non-limiting example, the host processor may send data to hardware accelerator 135-1 with a target address located in hardware accelerator 135-2. In that case, the DMA master is capable of recognizing that the target address is located in a different hardware accelerator and scheduling the data transfer over the accelerator link. For example, the DMA master may compare the target address with the upper bound of the address range for hardware accelerator 135-1. In response to determining that the target address exceeds the upper bound, the DMA master is capable of initiating a memory-mapped transaction over the interconnect circuitry to MM stream mapper 212 in link circuit 150-1 for sending over the accelerator link to hardware accelerator 135-2.

In one or more embodiments, the host processor is capable of using the accelerator links for purposes of load balancing. For example, the host processor is capable of using the runtime to determine the status of the DMA channels (e.g., the DMA master) in a selected hardware accelerator to which data is to be provided or a task is to be offloaded. In response to determining that the DMA master is busy or operating above a threshold amount of activity, the host processor may send the data to a different hardware accelerator over the bus. The data may specify a target address within the selected hardware accelerator. The DMA master within the receiving hardware accelerator, upon receiving the data from the host processor, is capable of forwarding the data to the selected hardware accelerator over the accelerator link(s). In particular embodiments, the host processor is capable of choosing the receiving hardware accelerator based upon a determination that the DMA master in the receiving hardware accelerator is not busy or is operating below the threshold amount of activity.
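One way to picture the load-balancing choice is the sketch below. Everything in it is an assumption made for illustration: the description does not define how DMA activity is measured, so `dma_activity` is a hypothetical mapping from accelerator name to a numeric activity level, and the fallback to the selected accelerator when no other candidate is available is likewise not taken from the description.

```python
from typing import Dict


def choose_receiving_accelerator(selected: str,
                                 dma_activity: Dict[str, float],
                                 threshold: float) -> str:
    """Pick the accelerator the host sends the data to over the bus."""
    # The selected accelerator's DMA master is not busy: send directly.
    if dma_activity[selected] <= threshold:
        return selected
    # Otherwise consider the other accelerators whose DMA master is
    # operating below the threshold.
    candidates = {name: level for name, level in dma_activity.items()
                  if name != selected and level < threshold}
    if not candidates:
        return selected  # assumed fallback: nothing better is available
    # Prefer the least busy candidate; it forwards the data to the
    # selected accelerator over the accelerator link(s).
    return min(candidates, key=candidates.get)


activity = {"135-1": 0.2, "135-2": 0.9, "135-3": 0.5}
print(choose_receiving_accelerator("135-2", activity, threshold=0.7))
```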

For purposes of illustration, an example of a write transaction from hardware accelerator 135-1 to hardware accelerator 135-3 is generally described as initiated by the host processor. The host processor, by way of the runtime and driver, sets the remote flag for the target hardware accelerator and determines an address of 2500 (using the prior example, in which the desired address is located at address 500 in hardware accelerator 135-3). The host processor provides instructions to hardware accelerator 135-1 to write to address 2500. Within hardware accelerator 135-1, the transaction with the address of 2500 is presented to interconnect 214. Since the address exceeds the upper bound for hardware accelerator 135-1, interconnect 214 sends the transaction to link circuit 150-1. Link circuit 150-1 sends the transaction to link circuit 150-2. The MM stream mapper in hardware accelerator 135-2 decrements the address by 1000, yielding a new address of 1500. The new address is still remote, since 1500 exceeds the upper address bound of hardware accelerator 135-2. As such, the transaction is forwarded to hardware accelerator 135-3.

The MM stream mapper in hardware accelerator 135-3 decrements the address, yielding a new address of 500. The transaction is then provided via interconnect 214 in hardware accelerator 135-3 to the memory controller, and the data is written to memory 140-3. In the example described, each hardware accelerator uses the address to determine whether it can service the transaction and, if so, where to route the transaction internally (e.g., to the memory controller or another circuit block), or whether the transaction should instead be forwarded to the next hardware accelerator. In particular embodiments, the address is different from the actual address at which the data is written in memory. The write acknowledgement is sent, as described, from hardware accelerator 135-3 through hardware accelerator 135-2 to hardware accelerator 135-1.

For purposes of illustration, another example, a read transaction initiated by hardware accelerator 135-1 to hardware accelerator 135-3, is generally described as initiated by the host processor. The host processor, by way of the runtime and driver, sets the remote flag for the target hardware accelerator and determines an address of 2500 (using the prior example, in which the desired address is located at address 500 in hardware accelerator 135-3). The host processor provides instructions to hardware accelerator 135-1 to read from address 2500. Within hardware accelerator 135-1, the transaction with the address of 2500 is presented to interconnect 214. Since the address exceeds the upper bound for hardware accelerator 135-1, interconnect 214 sends the transaction to link circuit 150-1. Link circuit 150-1 sends the transaction to link circuit 150-2. The MM stream mapper in hardware accelerator 135-2 decrements the address by 1000, yielding a new address of 1500. The new address is still remote, since 1500 exceeds the upper address bound of hardware accelerator 135-2. As such, the transaction is forwarded to hardware accelerator 135-3.

The MM stream mapper in hardware accelerator 135-3 decrements the address, yielding a new address of 500. The transaction is then provided via interconnect 214 in hardware accelerator 135-3 to the memory controller, and the data is read from memory 140-3. In the example described, each hardware accelerator uses the address to determine whether it can service the transaction and, if so, where to route the transaction internally, or whether the transaction should instead be forwarded to the next hardware accelerator. In particular embodiments, the address is different from the actual address from which the data is read from memory. The data that is read is sent, as described, from hardware accelerator 135-3 through hardware accelerator 135-2 to hardware accelerator 135-1.
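The write and read walk-throughs above can be traced end to end with a small simulation. This is a sketch under stated assumptions: `run_transaction` is a hypothetical name, each accelerator's memory is modeled as a Python dictionary, the accelerators are modeled as a simple chain starting at 135-1 (index 0) rather than a full ring, and the response is assumed to retrace the forward path in reverse, as the examples describe.

```python
UPPER_BOUND = 1000  # per-accelerator address range 1..1000, as in the example


def run_transaction(kind, address, memories, value=None):
    """Walk a transaction from the first accelerator down the chain.

    memories holds one dict per accelerator in link order; index 0 is the
    initiator (135-1). Returns (result, forward_path, return_path), where
    the paths are lists of accelerator indices.
    """
    hop = 0
    path = [hop]
    # An address above the upper bound leaves over the accelerator link.
    while address > UPPER_BOUND:
        hop += 1
        if hop >= len(memories):
            raise ValueError("address is beyond the last accelerator")
        address -= UPPER_BOUND  # MM stream mapper decrement on arrival
        path.append(hop)
    if kind == "write":
        memories[hop][address] = value
        result = "ack"  # write acknowledgement travels back to the initiator
    elif kind == "read":
        result = memories[hop][address]
    else:
        raise ValueError("kind must be 'read' or 'write'")
    return result, path, path[::-1]


memories = [{}, {}, {}]  # stand-ins for memories 140-1, 140-2, 140-3
print(run_transaction("write", 2500, memories, value=42))
print(run_transaction("read", 2500, memories))
```

In both calls the transaction passes through index 1 (135-2) on the way to index 2 (135-3), and the acknowledgement or read data returns through the same accelerators in reverse order.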

FIG. 5 illustrates an example of a system including hardware accelerators and one or more additional devices. In the example of FIG. 5, hardware accelerators 135-1 and 135-2 are shown and are connected by an accelerator link using the link circuit in each respective hardware accelerator. For purposes of illustration, hardware accelerator 135-3 is not shown. The system also includes GPU 515, which is connected to memory 520 and I/O device 525.

In the example of FIG. 5, GPU 515 may write data to, or read data from, hardware accelerator 135-2. In this example, the host processor (not shown) provides handle 505-N to GPU 515. In particular embodiments, handles may be implemented as file descriptors. Handle 505-N may point to buffer object 510-N, which corresponds to hardware accelerator 135-2. When GPU 515 uses handle 505-N for a read or write operation, the host processor initiates the action on the buffer object corresponding to handle 505-N, e.g., buffer object 510-N. The host processor determines whether buffer object 510-N is local or remote. Since the remote flag in buffer object 510-N is not set, the host processor may retrieve the data from memory 140-2 over PCIe and provide the data to GPU 515 over PCIe.

In one or more other embodiments, the host processor may initiate retrieval of the data from memory 140-2 by accessing a different hardware accelerator. For example, the host processor may initiate communication with hardware accelerator 135-1 over PCIe to retrieve the data from memory 140-2. In that case, hardware accelerator 135-1 may communicate directly with hardware accelerator 135-2 using the link circuits to retrieve the data from memory 140-2. Hardware accelerator 135-1 may then provide the data back to the host processor, which in turn provides the data to GPU 515 over PCIe.

In another example, I/O device 525, e.g., a camera, may write data to hardware accelerator 135-1. In that case, the host processor is capable of providing handle 505-1 to I/O device 525. Handle 505-1 may point to buffer object 510-1, which corresponds to hardware accelerator 135-1. When I/O device 525 uses handle 505-1 for a write operation, the host processor initiates the action on the buffer object corresponding to handle 505-1, e.g., buffer object 510-1. The host processor determines whether buffer object 510-1 is local or remote. Since the remote flag in buffer object 510-1 is not set, the host processor may receive the data from I/O device 525 and provide such data over PCIe to hardware accelerator 135-1 for writing in memory 140-1 and/or further processing.

In one or more embodiments, the driver is capable of setting the remote flag within a buffer object only in cases of data transfers between hardware accelerators that are capable of using accelerator links as described. FIG. 5 illustrates that, while other types of devices may be used with the hardware accelerators, data transfers between such other devices and the hardware accelerators occur over the bus and involve the host processor.

FIG. 6 illustrates an example architecture 600 for an IC. In one aspect, architecture 600 may be implemented within a programmable IC. For example, architecture 600 may be used to implement a field programmable gate array (FPGA). Architecture 600 may also be representative of a system-on-chip (SOC) type of IC. An SOC is an IC that includes a processor that executes program code and one or more other circuits. The other circuits may be implemented as hardwired circuitry, programmable circuitry, and/or a combination thereof. The circuits may operate cooperatively with one another and/or with the processor.

As shown, architecture 600 includes several different types of programmable circuit (e.g., logic) blocks. For example, architecture 600 may include a large number of different programmable tiles, including multi-gigabit transceivers (MGTs) 601, configurable logic blocks (CLBs) 602, random access memory blocks (BRAMs) 603, input/output blocks (IOBs) 604, configuration and clocking logic (CONFIG/CLOCKS) 605, digital signal processing blocks (DSPs) 606, specialized I/O blocks 607 (e.g., configuration ports and clock ports), and other programmable logic 608 such as digital clock managers, analog-to-digital converters, system monitoring logic, and so forth.

In some ICs, each programmable tile includes a programmable interconnect element (INT) 611 having standardized connections to and from a corresponding INT 611 in each adjacent tile. Therefore, INTs 611, taken together, implement the programmable interconnect structure for the illustrated IC. Each INT 611 also includes the connections to and from the programmable logic element within the same tile, as shown by the examples included at the top of FIG. 6.

For example, a CLB 602 may include a configurable logic element (CLE) 612, which may be programmed to implement user logic, plus a single INT 611. A BRAM 603 may include a BRAM logic element (BRL) 613 in addition to one or more INTs 611. Typically, the number of INTs 611 included in a tile depends on the height of the tile. As pictured, a BRAM tile has the same height as five CLBs, but other numbers (e.g., four) also may be used. A DSP tile 606 may include a DSP logic element (DSPL) 614 in addition to an appropriate number of INTs 611. An IOB 604 may include, for example, two instances of an I/O logic element (IOL) 615 in addition to one instance of an INT 611. The actual I/O pads connected to IOL 615 may not be confined to the area of IOL 615.

In the example pictured in FIG. 6, a columnar area near the center of the die, e.g., formed of regions 605, 607, and 608, may be used for configuration, clock, and other control logic. Horizontal areas 609 extending from this column may be used to distribute the clocks and configuration signals across the breadth of the programmable IC.

Some ICs utilizing the architecture illustrated in FIG. 6 include additional logic blocks that disrupt the regular columnar structure making up a large part of the IC. The additional logic blocks may be programmable blocks and/or dedicated circuitry. For example, a processor block depicted as PROC 610 spans several columns of CLBs and BRAMs.

In one aspect, PROC 610 may be implemented as dedicated circuitry, e.g., as a hardwired processor, that is fabricated as part of the die that implements the programmable circuitry of the IC. PROC 610 may represent any of a variety of different processor types and/or systems, ranging in complexity from an individual processor, e.g., a single core capable of executing program code, to an entire processor system having one or more cores, modules, co-processors, interfaces, or the like.

In another aspect, PROC 610 may be omitted from architecture 600 and replaced with one or more of the other varieties of the programmable blocks described. Further, such blocks may be utilized to form a "soft processor," in that the various blocks of programmable circuitry may be used to form a processor that can execute program code, as is the case with PROC 610.

The phrase "programmable circuitry" refers to programmable circuit elements within an IC, e.g., the various programmable or configurable circuit blocks or tiles described herein, as well as the interconnect circuitry that selectively couples the various circuit blocks, tiles, and/or elements according to configuration data that is loaded into the IC. For example, circuit blocks shown in FIG. 6 that are external to PROC 610, such as CLBs 602 and BRAMs 603, are considered programmable circuitry of the IC.

In general, the functionality of programmable circuitry is not established until configuration data is loaded into the IC. A set of configuration bits may be used to program the programmable circuitry of an IC such as an FPGA. The configuration bit(s) are typically referred to as a "configuration bitstream." In general, programmable circuitry is not operational or functional without first loading a configuration bitstream into the IC. The configuration bitstream effectively implements a particular circuit design within the programmable circuitry. The circuit design specifies, for example, functional aspects of the programmable circuit blocks and physical connectivity among the various programmable circuit blocks.

Circuitry that is "hardwired" or "hardened," i.e., not programmable, is manufactured as part of the IC. Unlike programmable circuitry, hardwired circuitry or circuit blocks are not implemented after the manufacture of the IC through the loading of a configuration bitstream. Hardwired circuitry is generally considered to have dedicated circuit blocks and interconnects, for example, that are functional without first loading a configuration bitstream into the IC, e.g., PROC 610.

In some instances, hardwired circuitry may have one or more operational modes that can be set or selected according to register settings or values stored in one or more memory elements within the IC. The operational modes may be set, for example, through the loading of a configuration bitstream into the IC. Despite this ability, hardwired circuitry is not considered programmable circuitry, since the hardwired circuitry is operable and has a particular function when manufactured as part of the IC.

In the case of an SOC, the configuration bitstream may specify the circuitry that is to be implemented within the programmable circuitry and the program code that is to be executed by PROC 610 or a soft processor. In some cases, architecture 600 includes a dedicated configuration processor that loads the configuration bitstream into the appropriate configuration memory and/or processor memory. The dedicated configuration processor does not execute user-specified program code. In other cases, architecture 600 may utilize PROC 610 to receive the configuration bitstream, load the configuration bitstream into the appropriate configuration memory, and/or extract program code for execution.

FIG. 6 is intended to illustrate an example architecture that may be used to implement an IC that includes programmable circuitry, e.g., a programmable fabric. For example, the number of logic blocks in a column, the relative width of the columns, the number and order of columns, the types of logic blocks included in the columns, the relative sizes of the logic blocks, and the interconnect/logic implementations included at the top of FIG. 6 are purely illustrative. In an actual IC, for example, two or more adjacent columns of CLBs are typically included wherever the CLBs appear, to facilitate the efficient implementation of a user circuit design. The number of adjacent CLB columns, however, may vary with the overall size of the IC. Further, the size and/or positioning of blocks such as PROC 610 within the IC are for purposes of illustration only and are not intended as limitations.

Architecture 600 may be used to implement a hardware accelerator as described herein. In particular embodiments, one or more or each of the endpoint, the link circuit, and the memory controller may be implemented as hardwired circuit blocks. In particular embodiments, one or more or each of the endpoint, the link circuit, and the memory controller may be implemented using programmable circuitry. In still other embodiments, one or more of the circuit blocks noted may be implemented as hardwired circuit blocks while the others are implemented using programmable circuitry.

The embodiments described within this disclosure may be used in any of a variety of applications such as, for example, database acceleration, processing multiple video streams, real-time network traffic monitoring, machine learning, or any other application that may involve multiple hardware accelerators.

For purposes of explanation, specific nomenclature is set forth to provide a thorough understanding of the various inventive concepts disclosed herein. The terminology used herein, however, is for the purpose of describing particular aspects of the inventive arrangements only and is not intended to be limiting.

As defined herein, the singular forms "a," "an," and "the" are intended to include the plural forms as well, unless the context clearly indicates otherwise.

As defined herein, the term "about" means nearly correct or exact, close in value or amount but not precise. For example, the term "about" may mean that the recited characteristic, parameter, or value is within a predetermined amount of the exact characteristic, parameter, or value.

As defined herein, the terms "at least one," "one or more," and "and/or" are open-ended expressions that are both conjunctive and disjunctive in operation unless explicitly stated otherwise. For example, each of the expressions "at least one of A, B, and C," "at least one of A, B, or C," "one or more of A, B, and C," "one or more of A, B, or C," and "A, B, and/or C" means A alone, B alone, C alone, A and B together, A and C together, B and C together, or A, B, and C together.

As defined herein, the term "automatically" means without user intervention. As defined herein, the term "user" means a human being.

As defined herein, the term "computer-readable storage medium" means a storage medium that contains or stores program code for use by or in connection with an instruction execution system, apparatus, or device. As defined herein, a "computer-readable storage medium" is not a transitory, propagating signal per se. A computer-readable storage medium may be, but is not limited to, an electronic storage device, a magnetic storage device, an optical storage device, an electromagnetic storage device, a semiconductor storage device, or any suitable combination of the foregoing. The various forms of memory described herein are examples of computer-readable storage media. A non-exhaustive list of more specific examples of a computer-readable storage medium may include a portable computer diskette, a hard disk, RAM, read-only memory (ROM), erasable programmable read-only memory (EPROM or flash memory), electrically erasable programmable read-only memory (EEPROM), static random access memory (SRAM), portable compact disc read-only memory (CD-ROM), a digital versatile disc (DVD), a memory stick, a floppy disk, and the like.

As defined herein, the term "if" means "when" or "upon" or "in response to" or "responsive to," depending upon the context. Thus, the phrase "if it is determined" or "if [a stated condition or event] is detected" may be construed to mean "upon determining" or "in response to determining" or "upon detecting [the stated condition or event]" or "in response to detecting [the stated condition or event]" or "responsive to detecting [the stated condition or event]," depending on the context.

As defined herein, the term "responsive to" and similar language as described above, e.g., "if," "when," or "upon," mean responding or reacting readily to an action or event. The response or reaction is performed automatically. Thus, if a second action is performed "responsive to" a first action, there is a causal relationship between an occurrence of the first action and an occurrence of the second action. The term "responsive to" indicates the causal relationship.

As defined herein, the terms "one embodiment," "an embodiment," "one or more embodiments," "particular embodiments," or similar language mean that a particular feature, structure, or characteristic described in connection with the embodiment is included in at least one embodiment described within this disclosure. Thus, appearances of the phrases "in one embodiment," "in an embodiment," "in one or more embodiments," "in particular embodiments," and similar language throughout this disclosure may, but do not necessarily, all refer to the same embodiment. The terms "embodiment" and "arrangement" are used interchangeably within this disclosure.

As defined herein, the term "processor" means at least one hardware circuit. The hardware circuit may be configured to carry out instructions contained in program code. The hardware circuit may be an integrated circuit. Examples of a processor include, but are not limited to, a central processing unit (CPU), an array processor, a vector processor, a digital signal processor (DSP), an FPGA, a programmable logic array (PLA), an ASIC, programmable logic circuitry, and a controller.

As defined herein, the term "output" means storing in physical memory elements, e.g., devices, writing to a display or other peripheral output device, sending or transmitting to another system, exporting, or the like.

As defined herein, the term "real time" means a level of processing responsiveness that a user or system senses as sufficiently immediate for a particular process or determination to be made, or that enables the processor to keep up with some external process.

As defined herein, the term "substantially" means that the recited characteristic, parameter, or value need not be achieved exactly, but that deviations or variations, including, for example, tolerances, measurement error, measurement accuracy limitations, and other factors known to those of skill in the art, may occur in amounts that do not preclude the effect the characteristic was intended to provide.

The terms first, second, etc. may be used herein to describe various elements. These elements should not be limited by these terms, as these terms are only used to distinguish one element from another unless stated otherwise or the context clearly indicates otherwise.

A computer program product may include a computer-readable storage medium (or media) having computer-readable program instructions thereon for causing a processor to carry out aspects of the inventive arrangements described herein. Within this disclosure, the term "program code" is used interchangeably with the term "computer-readable program instructions." Computer-readable program instructions described herein may be downloaded to respective computing/processing devices from a computer-readable storage medium, or to an external computer or external storage device via a network, for example, the Internet, a LAN, a WAN, and/or a wireless network. The network may include copper transmission cables, optical transmission fibers, wireless transmission, routers, firewalls, switches, gateway computers, and/or edge devices including edge servers. A network adapter card or network interface in each computing/processing device receives computer-readable program instructions from the network and forwards the computer-readable program instructions for storage in a computer-readable storage medium within the respective computing/processing device.

Computer-readable program instructions for carrying out operations for the inventive arrangements described herein may be assembler instructions, instruction-set-architecture (ISA) instructions, machine instructions, machine-dependent instructions, microcode, firmware instructions, or either source code or object code written in any combination of one or more programming languages, including an object-oriented programming language and/or a procedural programming language. Computer-readable program instructions may include state-setting data. The computer-readable program instructions may execute entirely on the user's computer, partly on the user's computer, as a stand-alone software package, partly on the user's computer and partly on a remote computer, or entirely on the remote computer or server. In the latter scenario, the remote computer may be connected to the user's computer through any type of network, including a LAN or a WAN, or the connection may be made to an external computer (for example, through the Internet using an Internet service provider). In some cases, electronic circuitry including, for example, programmable logic circuitry, an FPGA, or a PLA may execute the computer-readable program instructions by utilizing state information of the computer-readable program instructions to personalize the electronic circuitry, in order to perform aspects of the inventive arrangements described herein.

Certain aspects of the inventive arrangements are described herein with reference to flowchart illustrations and/or block diagrams of methods, apparatus (systems), and computer program products. It will be understood that each block of the flowchart illustrations and/or block diagrams, and combinations of blocks in the flowchart illustrations and/or block diagrams, may be implemented by computer-readable program instructions, e.g., program code.

These computer-readable program instructions may be provided to a processor of a general-purpose computer, special-purpose computer, or other programmable data processing apparatus to produce a machine, such that the instructions, which execute via the processor of the computer or other programmable data processing apparatus, create means for implementing the functions/acts specified in the flowchart and/or block diagram block or blocks. These computer-readable program instructions may also be stored in a computer-readable storage medium that can direct a computer, a programmable data processing apparatus, and/or other devices to function in a particular manner, such that the computer-readable storage medium having instructions stored therein comprises an article of manufacture including instructions which implement aspects of the operations specified in the flowchart and/or block diagram block or blocks.

The computer-readable program instructions may also be loaded onto a computer, other programmable data processing apparatus, or other device to cause a series of operations to be performed on the computer, other programmable apparatus, or other device to produce a computer-implemented process, such that the instructions which execute on the computer, other programmable apparatus, or other device implement the functions/acts specified in the flowchart and/or block diagram block or blocks.

The flowcharts and block diagrams in the figures illustrate the architecture, functionality, and operation of possible implementations of systems, methods, and computer program products according to various aspects of the inventive arrangements. In this regard, each block in the flowcharts or block diagrams may represent a module, segment, or portion of instructions, which comprises one or more executable instructions for implementing the specified operations.

In some alternative implementations, the operations noted in the blocks may occur out of the order noted in the figures. For example, two blocks shown in succession may be executed substantially concurrently, or the blocks may sometimes be executed in the reverse order, depending upon the functionality involved. In other examples, blocks may be performed generally in increasing numeric order, while in still other examples, one or more blocks may be performed in varying order with the results being stored and utilized in subsequent or other blocks that do not immediately follow. It will also be noted that each block of the block diagrams and/or flowchart illustrations, and combinations of blocks in the block diagrams and/or flowchart illustrations, may be implemented by special-purpose hardware-based systems that perform the specified functions or acts or carry out combinations of special-purpose hardware and computer instructions.

The corresponding structures, materials, acts, and equivalents of all means or step plus function elements that may be found in the claims below are intended to include, in combination, any structure, material, or act for performing the function.

In one aspect, the host processor is configured to communicate with the first hardware accelerator and the second hardware accelerator over the communication bus.

In another aspect, the data transfer includes the first hardware accelerator accessing a memory of the second hardware accelerator through the accelerator link.

In another aspect, the host processor is configured to access the memory of the second hardware accelerator by sending data including a target address to the first hardware accelerator, wherein the target address is translated by the host processor to correspond to the second hardware accelerator, and wherein the first hardware accelerator initiates a transaction to access the memory of the second hardware accelerator over the accelerator link based upon the target address.
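For purposes of illustration only, the address translation performed by the host processor may be sketched as follows. The sketch assumes that every hardware accelerator exposes an address range of the same size and that the host expresses the target as an offset from the accelerator that first receives the instruction. Neither assumption is stated in this disclosure, and the names `RANGE_SIZE` and `to_target_address` are hypothetical.

```python
# Assumed for illustration: each accelerator owns an equal-sized address range.
RANGE_SIZE = 0x1000_0000


def to_target_address(hops: int, local_offset: int,
                      range_size: int = RANGE_SIZE) -> int:
    """Return the target address the host would send to the first accelerator.

    hops         -- accelerator-link hops from the first accelerator to the
                    accelerator that owns the memory (0 means the first itself)
    local_offset -- offset inside the owning accelerator's local memory
    """
    if hops < 0:
        raise ValueError("hops must be non-negative")
    if not 0 <= local_offset < range_size:
        raise ValueError("local_offset is outside the accelerator's range")
    # Each hop adds one full range, so the owner's range sits `hops` ranges
    # above the range of the accelerator that receives the instruction.
    return hops * range_size + local_offset
```

Under these assumptions, an address below `RANGE_SIZE` stays on the first hardware accelerator, while any larger address signals that the transaction is to leave over the accelerator link.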

In another aspect, the second hardware accelerator is configured to, in response to receiving a transaction over the accelerator link, decrement a target address for the data transfer by an upper bound of an address range for the second hardware accelerator and determine whether the decremented target address is local.
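As a minimal sketch of the receive-side check described above, the decrement and the locality test may be written as follows. The sketch assumes that the address range of the second hardware accelerator starts at zero and that the upper bound is exclusive; the function name `receive_transaction` is hypothetical and is not taken from this disclosure.

```python
from typing import Tuple


def receive_transaction(target_address: int, upper_bound: int) -> Tuple[int, bool]:
    """Decrement the incoming target address and report whether it is local.

    upper_bound is the (assumed exclusive) upper bound of the receiving
    accelerator's address range, which is assumed to start at zero.
    """
    decremented = target_address - upper_bound
    # The address is local when it lands inside the receiver's own range.
    is_local = 0 <= decremented < upper_bound
    return decremented, is_local
```

When the second value returned is false, the decremented address still lies beyond the receiving accelerator's range, so the transaction is not serviced from local memory.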

In another aspect, the host processor is configured to initiate the data transfer between the first hardware accelerator and the second hardware accelerator based upon a status of a direct memory access circuit of the second hardware accelerator connected to the communication bus.
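One possible way for the host processor to act on the status of the direct memory access circuit is sketched below. The policy shown, which uses the accelerator link only when the direct memory access circuit of the second hardware accelerator is busy, is an assumption made for illustration; this passage states only that the transfer is initiated based upon the status. The names `Path` and `choose_path` are hypothetical.

```python
from enum import Enum


class Path(Enum):
    COMMUNICATION_BUS = "communication bus"
    ACCELERATOR_LINK = "accelerator link"


def choose_path(target_dma_busy: bool) -> Path:
    """Pick a route to the second accelerator's memory (assumed policy)."""
    # Assumed policy: when the target's own DMA circuit is busy, reach its
    # memory through the first accelerator and the accelerator link instead.
    if target_dma_busy:
        return Path.ACCELERATOR_LINK
    return Path.COMMUNICATION_BUS
```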

In another aspect, the host processor is configured to automatically determine a sequence of the first hardware accelerator and the second hardware accelerator in a ring topology.

In another aspect, the host processor is configured to track buffers corresponding to the first hardware accelerator and the second hardware accelerator using remote buffer flags.

The link circuit is configured to initiate a data transfer with the target hardware accelerator over the accelerator link, wherein the data transfer occurs in response to an instruction from the host processor received by the hardware accelerator over the communication bus.

In another aspect, the link circuit includes a first memory map to stream mapper circuit and a second memory map to stream mapper circuit, each configured to convert data streams to memory-mapped transactions and memory-mapped transactions to data streams.
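The two directions of conversion performed by a mapper circuit can be illustrated in software with a simple serializer. The sketch below is not the format used on the accelerator link, which this disclosure does not specify. The header layout (a one-byte opcode, a 64-bit address, and a 32-bit payload length, little-endian), the opcode value, and the names `MemoryMappedWrite`, `to_stream`, and `from_stream` are all assumptions made for illustration.

```python
import struct
from dataclasses import dataclass

# Assumed header layout: opcode (1 byte), address (8 bytes), length (4 bytes).
_HEADER = struct.Struct("<BQI")
WRITE_OPCODE = 0x01  # hypothetical opcode for a write transaction


@dataclass(frozen=True)
class MemoryMappedWrite:
    address: int
    data: bytes


def to_stream(txn: MemoryMappedWrite) -> bytes:
    """Convert a memory-mapped write into a byte stream."""
    return _HEADER.pack(WRITE_OPCODE, txn.address, len(txn.data)) + txn.data


def from_stream(stream: bytes) -> MemoryMappedWrite:
    """Convert a byte stream back into a memory-mapped write."""
    opcode, address, length = _HEADER.unpack_from(stream, 0)
    if opcode != WRITE_OPCODE:
        raise ValueError("unsupported opcode")
    payload = stream[_HEADER.size:_HEADER.size + length]
    if len(payload) != length:
        raise ValueError("truncated stream")
    return MemoryMappedWrite(address, payload)
```

A transaction passed through `to_stream` and then `from_stream` is recovered unchanged, which mirrors the role of the mapper circuits at the two ends of a link.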

In another aspect, each memory map to stream mapper circuit is configured to decrement a target address in a received transaction by an upper bound of an address range of the hardware accelerator.

In another aspect, the link circuit includes a first transceiver configured to send and receive stream data, and a first retransmit engine connected to the first transceiver and the first memory map to stream mapper circuit.

In another aspect, the link circuit further includes a second transceiver configured to send and receive stream data, and a second retransmit engine connected to the second transceiver and the second memory map to stream mapper circuit.

In another aspect, the accelerator link is independent of the communication bus.

In another aspect, initiating the transaction includes initiating a memory-mapped transaction and converting the memory-mapped transaction to a data stream to be sent over the accelerator link.

In another aspect, the method includes, in response to receiving the transaction in the second hardware accelerator, the second hardware accelerator subtracting an upper bound of an address range of the second hardware accelerator from the target address and determining whether a result of the subtracting is within the address range of the second hardware accelerator.
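The subtraction step above can be placed in an end-to-end sketch that follows a target address from the first hardware accelerator to the accelerator that owns it. The sketch assumes equal-sized address ranges that start at zero, an exclusive upper bound, and forwarding to the next accelerator whenever the result is still out of range. The forwarding beyond the second hardware accelerator and the name `route` are assumptions made for illustration and are not stated in this passage.

```python
from typing import Tuple


def route(target_address: int, upper_bound: int, num_accelerators: int) -> Tuple[int, int]:
    """Return (hops, local_address) for a target address.

    The first accelerator only compares the address with its upper bound.
    Each accelerator that receives the transaction over the link subtracts
    the upper bound and then checks whether the result is in its own range.
    """
    if target_address < 0:
        raise ValueError("target_address must be non-negative")
    hops = 0
    address = target_address
    while address >= upper_bound:
        if hops + 1 >= num_accelerators:
            raise ValueError("address lies outside the combined address space")
        hops += 1               # forward over the accelerator link
        address -= upper_bound  # receiver subtracts the upper bound
    return hops, address
```

For example, with an upper bound of `0x100` and three accelerators, `route(0x110, 0x100, 3)` yields one hop and a local address of `0x10`.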

In another aspect, the second hardware accelerator receives the transaction as a data stream and converts the data stream into a memory-mapped transaction.

In another aspect, the method includes determining a status of a direct memory access circuit of the second hardware accelerator and initiating the data transfer in response to the status of the direct memory access circuit of the second hardware accelerator.

The description of the inventive arrangements provided herein is for purposes of illustration and is not intended to be exhaustive or limited to the form and examples disclosed. The terminology used herein was chosen to explain the principles of the inventive arrangements, the practical application or technical improvement over technologies found in the marketplace, and/or to enable others of ordinary skill in the art to understand the inventive arrangements disclosed herein. Modifications and variations may be apparent to those of ordinary skill in the art without departing from the scope and spirit of the described inventive arrangements. Accordingly, reference should be made to the following claims, rather than to the foregoing disclosure, as indicating the scope of such features and implementations.

Claims

A system comprising:
a host processor connected to a communication bus;
a first hardware accelerator communicatively linked to the host processor through the communication bus; and
a second hardware accelerator communicatively linked to the host processor through the communication bus,
wherein the first hardware accelerator and the second hardware accelerator are directly connected through an accelerator link independent of the communication bus, and
wherein the host processor is configured to initiate a data transfer between the first hardware accelerator and the second hardware accelerator directly through the accelerator link.

The system of claim 1, wherein the data transfer comprises the first hardware accelerator accessing the memory of the second hardware accelerator through the accelerator link.

The system of claim 2, wherein the host processor is configured to access the memory of the second hardware accelerator by sending data including a target address to the first hardware accelerator, wherein the target address is translated by the host processor to correspond to the second hardware accelerator, and wherein the first hardware accelerator initiates a transaction to access the memory of the second hardware accelerator over the accelerator link based on the target address.

The system of claim 1, wherein the second hardware accelerator is configured to, in response to receiving a transaction over the accelerator link, decrement a target address for the data transfer by an upper bound of an address range for the second hardware accelerator and determine whether the decremented target address is local.

The system of claim 1, wherein the host processor is configured to initiate the data transfer between the first hardware accelerator and the second hardware accelerator based on a status of a direct memory access circuit of the second hardware accelerator connected to the communication bus.

The system of claim 1, wherein the host processor is configured to automatically determine the sequence of the first hardware accelerator and the second hardware accelerator in a ring topology.

The system of claim 1, wherein the host processor is configured to use the remote buffer flag to track the buffers corresponding to the first hardware accelerator and the second hardware accelerator.

An integrated circuit comprising:
an endpoint configured to communicate with a host processor over a communication bus;
a memory controller connected to a memory local to the integrated circuit; and
a link circuit connected to the endpoint and the memory controller, wherein the link circuit is configured to establish an accelerator link with a target hardware accelerator that is also connected to the communication bus, and wherein the accelerator link is a direct connection between the integrated circuit and the target hardware accelerator that is independent of the communication bus.

The integrated circuit of claim 8, wherein the link circuit is configured to initiate a data transfer with the target hardware accelerator over the accelerator link, and wherein the data transfer occurs in response to an instruction from the host processor received by the integrated circuit over the communication bus.

The integrated circuit of claim 8, wherein the target hardware accelerator is configured to decrement a target address in a transaction received from the integrated circuit by an upper bound of an address range of the integrated circuit.

A method comprising:
receiving, at a first hardware accelerator, an instruction and a target address for a data transfer sent from a host processor over a communication bus;
comparing, by the first hardware accelerator, the target address with an upper bound of an address range corresponding to the first hardware accelerator; and
in response to determining, based on the comparing, that the target address exceeds the address range, initiating, by the first hardware accelerator, a transaction with a second hardware accelerator to perform the data transfer using an accelerator link that directly connects the first hardware accelerator and the second hardware accelerator.

The method of claim 11, wherein the accelerator link is independent of the communication bus.

The method of claim 11, further comprising, in response to receiving the transaction in the second hardware accelerator, the second hardware accelerator subtracting an upper bound of an address range of the second hardware accelerator from the target address and determining whether a result of the subtracting is within the address range of the second hardware accelerator.

The method of claim 11, further comprising:
determining a status of a direct memory access circuit of the second hardware accelerator; and
initiating the data transfer in response to the status of the direct memory access circuit of the second hardware accelerator.

The method of claim 11, wherein the data transfer comprises the first hardware accelerator accessing the memory of the second hardware accelerator through the accelerator link.