JP2000259502A

JP2000259502A - Method and device for transferring write cache data for data storage and data processing system

Info

Publication number: JP2000259502A
Application number: JP11257732A
Authority: JP
Inventors: Kit M Chow; Keith P Muller; Michael W Meyer; Gary L Boggs
Original assignee: NCR International Inc
Current assignee: NCR International Inc
Priority date: 1998-08-11
Filing date: 1999-08-10
Publication date: 2000-09-22
Anticipated expiration: 2019-08-10
Also published as: EP0980041A2; US6711632B1; JP4567125B2; EP0980041A3

Abstract

PROBLEM TO BE SOLVED: To reduce the communication and processing overhead between compute nodes and storage media, and to improve system speed and efficiency, by providing an efficient data write cache protocol for a distributed architecture. SOLUTION: Computing resources are defined by compute nodes 200, each having a processor 216 that executes applications 204 under the control of an OS 202. Storage resources 104 are defined by cliques, each including a first I/O node (ION) 212 and a second I/O node (ION) 214, each operatively coupled to an interconnect fabric 106 by a system interconnect 228. The ION 212 receives a write request containing write data from the compute node 200. The write data is then transferred from the ION 212 to the ION 214, and the ION 214 sends an acknowledgment message to the compute node 200.

Description

DETAILED DESCRIPTION OF THE INVENTION

[0001]

TECHNICAL FIELD OF THE INVENTION

The present invention relates generally to computing systems, and more particularly to a method and apparatus for transferring write cache data in a data storage and data processing system. It provides a single operational view of virtual storage allocation, regardless of processor or memory cabinet boundaries.

[0002]

BACKGROUND OF THE INVENTION

Technological evolution is often the result of a series of seemingly unrelated technical developments. Each of these developments may be significant on its own, but combined they can form the foundation of a major technological advance. Historically, the components of large, complex computer systems have grown unevenly. Examples include (1) the rapid advance in CPU performance relative to disk I/O performance, (2) evolving internal CPU architectures, and (3) interconnect fabrics.

[0003] Over the past decade, disk I/O performance has improved far more slowly than node performance. CPU performance has increased at a rate of 40% to 100% per year, while disk seek times have improved by only 7% per year. If this trend continues as predicted, the number of disk drives that a typical server node can drive will rise to the point where disk drives become the dominant component of most large systems, in both quantity and value. This phenomenon has already appeared in existing large-system installations.

[0004] Uneven performance scaling also occurs within the CPU. To improve CPU performance, CPU vendors combine clock speed increases with architectural changes. Many of these architectural changes are proven technologies borrowed from the parallel processing community. These changes can produce unbalanced performance and may yield less improvement than expected. As a simple example, the rate at which a CPU can service interrupts has not improved as fast as the rate at which it executes basic instructions. System functions that depend on interrupt performance (such as I/O) therefore do not scale with compute power.

[0005] Interconnect fabrics also show uneven technology growth. For years they remained at a performance level of about 10 to 20 MB/s, but over the past year bandwidth has leapt to the 100 MB/s level (and beyond).

[0006] This large performance increase has made the economical deployment of massively parallel processing systems possible.

[0007] This uneven performance adversely affects application architectures and system configuration options. With respect to application performance, for example, attempts to increase the workload to exploit an improvement in one part of the system, such as increased CPU performance, are often thwarted by the lack of an equivalent improvement in the disk subsystem. Even if the CPU could double the number of transactions it processes per second, the disk subsystem could handle only a fraction of that increase. The overall effect of uneven hardware performance growth is that application performance depends increasingly on the characteristics of the particular workload.

[0008] The uneven growth of platform hardware technologies also creates another serious problem: a reduction in the number of options available for configuring multi-node systems. A good example is the way the software architecture of a TERADATA four-node clique has been affected by changes in storage interconnect technology. The TERADATA clique model expects uniform storage connectivity among the nodes of a single clique, so that every disk drive can be accessed from every node. When a node fails, the storage dedicated to that node can then be divided among the remaining nodes. The uneven growth of storage and node technology limits the number of disks that can be connected to a node in a shared-storage environment. This limit arises from the number of drives that can be connected to an I/O channel and the physical number of buses that can be connected in a four-node shared I/O topology. As node performance continues to improve, the number of disk spindles connected to each node must increase to realize the performance gain.

[0009] Cluster and massively parallel processing (MPP) designs are examples of multi-node system designs that attempt to solve the problems described above. Clusters suffer from limited expandability, while MPP systems require additional software to present a sufficiently simple application model (in commercial MPP systems, this software is usually a DBMS). MPP systems also need a form of internal clustering (cliques) to provide very high availability. Both solutions still create challenges in managing the potentially large number of disk drives. These disk drives are electromechanical devices with fairly predictable failure rates. Node interconnect issues are exacerbated in MPP systems because the number of nodes is usually much larger. Both approaches also face disk connectivity challenges, again driven by the large number of drives needed to store very large databases.

[0010] The problems described above are eased in an architecture in which storage entities and compute entities communicate over a high-performance connectivity fabric and act as architectural peers. This architecture allows greater flexibility in managing storage and compute resources. However, this flexibility presents some unique problems. One of them is ensuring secure data storage while retaining the speed and flexibility that the architecture offers.

[0011] In conventional architectures, write caching techniques allow efficient data storage. Data that the CPU would normally write to disk is first written to a write cache. The data is then written to disk during CPU idle cycles. Because writing to the write cache is faster than writing to disk or RAM, this technique improves performance.

[0012] Using a write cache for a disk carries a certain risk, because data remains in the disk device's volatile memory for a longer time before it is written to the disk media. This period is usually no more than a few seconds, but if a crash or system failure occurs before the data is written to non-volatile storage, the data may be lost.

[0013] Write caching can also be used in highly distributed architectures. However, implementing a write cache protocol in such an architecture requires a large amount of communication and processing overhead between the compute nodes and the storage media, which reduces system speed and efficiency.

[0014]

PROBLEM TO BE SOLVED BY THE INVENTION

It is an object of the present invention to provide an efficient data write cache protocol for a distributed architecture that remedies the disadvantages described above.

[0015]

MEANS FOR SOLVING THE PROBLEM

According to a first aspect, the present invention resides in a method of transferring write cache data in a data storage and data processing system, comprising the steps of: receiving, in a first I/O node, a write request containing write data from a compute node; transferring the write data from the first I/O node to a second I/O node; and sending an acknowledgment message from the second I/O node to the compute node.

[0016] After the data has been written to the non-volatile storage of the first I/O node, a purge request or command is sent to the second I/O node to purge the write data from the second I/O node's volatile memory. In some embodiments, the purge request is not sent until the first I/O node receives a second write request, in which case the purge request is sent in the same interrupt as the write data of the second write request. The data processing system comprises a first and a second I/O node, each having means for receiving a write request from a compute node and for transferring the write data to the other I/O node. Each I/O node also has means for sending the acknowledgment message directly to the compute node, rather than by way of the I/O node that sent the write data. The result is an I/O protocol that implements write caching to improve storage speed and turnaround while reducing the number of interrupts required to store data.
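The message flow summarized above can be illustrated with a small in-memory model. This is only a sketch under stated assumptions, not the patented implementation: the class names `ComputeNode` and `IONode`, the method names, and the use of dictionaries for volatile memory and disk are all hypothetical. It shows the second I/O node acknowledging the compute node directly, and a purge request riding along with the write data of the next write request.

```python
from dataclasses import dataclass, field


@dataclass
class ComputeNode:
    acks: list = field(default_factory=list)

    def acknowledge(self, request_id, from_ion):
        # Record which I/O node confirmed the write.
        self.acks.append((request_id, from_ion))


@dataclass
class IONode:
    name: str
    partner: "IONode | None" = None
    volatile_cache: dict = field(default_factory=dict)  # copies held for the partner
    dirty: dict = field(default_factory=dict)           # own cached writes, not yet on disk
    disk: dict = field(default_factory=dict)            # stands in for non-volatile storage
    pending_purges: list = field(default_factory=list)  # purge ids waiting for the next write
    messages_sent: int = 0

    def handle_write(self, node, request_id, data):
        """First I/O node: cache the data, then forward it with any queued purges."""
        self.dirty[request_id] = data
        purges, self.pending_purges = self.pending_purges, []
        self.messages_sent += 1  # one message carries write data and purge ids together
        self.partner.handle_mirror(node, request_id, data, purges)

    def handle_mirror(self, node, request_id, data, purges):
        """Second I/O node: purge old copies, hold the new copy, ack the compute node."""
        for purged_id in purges:
            self.volatile_cache.pop(purged_id, None)
        self.volatile_cache[request_id] = data
        node.acknowledge(request_id, self.name)  # direct ack, not routed via the sender

    def flush(self):
        """Write cached data to non-volatile storage and queue purge requests."""
        for request_id, data in self.dirty.items():
            self.disk[request_id] = data
            self.pending_purges.append(request_id)
        self.dirty.clear()
```

In this model, two writes separated by a flush cost the first I/O node two messages to its partner rather than three, because the purge for the first write travels with the second write's data.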

[0017] According to a second aspect, the present invention resides in an apparatus for transferring write cache data in a data storage and data processing system, comprising: means for receiving, in a first I/O node, a write request containing write data from a compute node; means for transferring the write data from the first I/O node to a second I/O node; and means for sending an acknowledgment message from the second I/O node to the compute node.

[0018] According to yet another aspect, the present invention resides in a program storage device, readable by a computer, tangibly embodying one or more programs of instructions executable by the computer to perform the steps of transferring write cache data in a data storage system according to claims 1 to 7.

[0019]

DETAILED DESCRIPTION OF THE PREFERRED EMBODIMENTS

An embodiment of the present invention is described below, by way of example, with reference to the accompanying drawings.

[0020]

A. Overview

FIG. 1 is an overview of the peer-to-peer data processing architecture of the present invention. The architecture 100 comprises one or more compute resources 102 and one or more storage resources 104. The storage resources 104 are communicatively coupled to the compute resources 102 via one or more interconnect fabrics 106 and communication paths 108. The fabrics 106 provide the communication medium between all nodes and storage, and thereby allow uniform peer access between the compute resources 102 and the storage resources 104.

[0021] In the architecture of FIG. 1, storage is no longer bound to a single set of nodes, as it is in current node-centric architectures, and any node can communicate with all storage. This contrasts with today's multi-node systems, in which the physical system topology limits communication between storage and nodes, and different topologies are often needed to match different workloads. The architecture of FIG. 1 provides a single physical architecture that supports a wide range of system topologies, so the communication patterns of the application software can determine the system topology at any given time, and uneven technology growth can be accommodated. The isolation provided by the fabric 106 allows fine-grained scaling of each major system component.

[0022] FIG. 2 presents a more detailed description of the peer-to-peer architecture of the present invention. Compute resources are defined by one or more compute nodes 200, each with one or more processors 216 executing one or more applications 204 under the control of an OS 202. Peripherals 208 such as tape drives, printers, or other networks are operatively coupled to the compute node 200. Local storage devices 210, such as hard disks, are also operatively coupled to the compute node 200; they store information specific to the compute node 200, such as the instructions comprising the OS 202, the applications 204, or other information. Application instructions may be stored and executed on one or more compute nodes 200 in a distributed processing fashion. In one embodiment, the processor 206 comprises an off-the-shelf, commercially available general-purpose processor, such as the INTEL P6, with associated memory and I/O elements.

[0023] Storage resources 104 are defined by cliques 226, each of which includes a first I/O node or ION 212 and a second I/O node or ION 214, each operatively coupled to the interconnect fabric 106 by a system interconnect 228. The first ION 212 and the second ION 214 are operatively coupled to one or more storage disks 224 (known as "just a bunch of disks", or JBOD), which are associated with a JBOD enclosure 222.

[0024] FIG. 2 depicts a moderate-sized system with a typical two-to-one ratio of IONs 212 to compute nodes. A clique 226 of the present invention could also be implemented with three or more IONs 214 or, with some loss in storage node availability, with a single ION 212. The population of a clique 226 is purely a software matter, because no hardware is shared among the IONs 212. Paired IONs 212 are sometimes called "dipoles".

[0025] The components of the present invention also include a management component or system administrator 230, which interfaces with the compute nodes 200, the IONs 212, and the interconnect fabric 106.

[0026] The connection between the IONs 212 and the JBODs 222 is shown here in simplified form. The actual connection uses Fibre Channel cables to each tier (row; here, four rows) of storage disks 224 in the illustrated configuration. In practice, each ION 212 would probably manage between 40 and 80 storage disks rather than the 20 shown in the illustrated embodiment.

[0027]

B. ION (Storage Node)

1. Internal Architecture

a) Hardware Architecture

FIG. 3 is a detailed diagram of the configuration of the ION 212 and its interface with the JBODs 222. Each ION 212 comprises an I/O connection module 302 for communicative coupling with each storage disk 224 in the JBOD 222 array via the JBOD interconnect 216; a CPU and memory 304 for performing the functions of the ION 212 and implementing the ION physical disk driver 500 described herein; and a power module 306 for supplying power to support the operation of the ION 212.

[0028]

b) JBOD

FIG. 4 is a diagram showing the JBOD enclosure 222 in detail. All components of a JBOD enclosure 222 that can be monitored or controlled are called elements 402-424. All elements 402-424 of a given JBOD enclosure are returned through a Receive Diagnostic Results command with the configuration page code. The ION 212 uses this ordered list of elements to number them. The first element 402 described is element 0, the second element 404 is element 1, and so on. These element numbers are used when creating the LUN_Cs that the management service layer 706 described herein uses to address components.

[0029]

[Table 1]

[0030] Within the enclosure, the location of an element is specified by rack, chassis, and element number, as shown in Table I above. The rack number is a number internal to the dipole, assigned to a rack belonging to that dipole. The chassis position is the height reported by the cabinet management devices. The element number is an index into the element list returned by the SES configuration page. These fields make up the LUN_C format.
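As a rough illustration of how an ordered element list and the rack, chassis, and element fields could combine into a single component address, consider the sketch below. The field widths (8 bits each), the field order, and the function names are assumptions made for the example; the actual LUN_C layout is defined in Table I, which is not reproduced in this text.

```python
# Assumed field widths; the real LUN_C layout comes from Table I (not shown here).
RACK_BITS, CHASSIS_BITS, ELEMENT_BITS = 8, 8, 8


def number_elements(ordered_elements):
    """Number elements by their position in the returned list (element 0, 1, ...)."""
    return {name: index for index, name in enumerate(ordered_elements)}


def pack_lun_c(rack, chassis, element):
    """Pack rack, chassis, and element numbers into one integer address."""
    fields = (("rack", rack, RACK_BITS),
              ("chassis", chassis, CHASSIS_BITS),
              ("element", element, ELEMENT_BITS))
    for label, value, bits in fields:
        if not 0 <= value < (1 << bits):
            raise ValueError(f"{label} value {value} does not fit in {bits} bits")
    return (rack << (CHASSIS_BITS + ELEMENT_BITS)) | (chassis << ELEMENT_BITS) | element


def unpack_lun_c(lun_c):
    """Recover (rack, chassis, element) from a packed address."""
    element = lun_c & ((1 << ELEMENT_BITS) - 1)
    chassis = (lun_c >> ELEMENT_BITS) & ((1 << CHASSIS_BITS) - 1)
    rack = lun_c >> (CHASSIS_BITS + ELEMENT_BITS)
    return rack, chassis, element
```

For example, `number_elements(["power supply", "fan", "disk slot"])` assigns the fan element number 1, and that number could then be passed to `pack_lun_c` together with the rack and chassis position.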

[0031]

c) I/O Interface Driver Architecture

FIG. 5 is a diagram showing the I/O architecture of the ION 212, including the ION physical disk driver 500, which acts as a "SCSI driver" for the ION 212. The ION physical disk driver 500 takes in I/O requests from the RAID (redundant array of inexpensive disks) software drivers or from the management utilities of the system administrator 230, and executes the requests on devices on the device side of the JBOD interconnect 216.

[0032] The physical disk driver 500 of the present invention includes three major components: a high-level driver (HLD) 502, a device-specific high-level driver 504, and a low-level driver 506. The HLD 502 comprises a common portion 503 and a device-specific high-level portion 504, and operates together with the low-level driver 506. The common and device-specific high-level drivers 502 and 504 are adapter-independent and require no modification for new adapter types. The Fibre Channel Interface (FCI) low-level driver 506 supports Fibre Channel adapters and is therefore protocol-specific rather than adapter-specific.

[0033] The FCI low-level driver 506 translates SCSI requests into FCP frames and handles Fibre Channel common services such as Login and Process Login. Operatively coupled to the FCI low-level driver 506 is a hardware interface module (HIM) interface 508, which separates Fibre Channel protocol handling from the adapter-specific routines. The components above are described in more detail below.

[0034]

(1) High-Level Driver

The high-level driver (HLD) 502 is the entry point for all requests to the ION 212, regardless of the type of device being accessed. When a device is opened, the HLD 502 binds command pages to that device. These vendor-specific command pages dictate how a SCSI command descriptor block is to be built for a particular SCSI function. Command pages allow the driver to easily support devices that handle certain SCSI functions differently from what the SCSI specifications prescribe.

[0035]

(a) Common (Non-Device-Specific) Portion

The common portion of the HLD 502 contains the following entry points:

[0036] cs_init: Initializes driver structures and allocates resources.

[0037] cs_open: Prepares a device for use.

[0038] cs_close: Completes I/O and removes a device from service.

[0039] cs_strategy: Block device read/write entry (Buf_t interface).

[0040] cs_intr: Services a hardware interrupt.

[0041] These routines perform the same functions for all device types. Most of them use a switch table indexed by device type (disk, tape, WORM, CD-ROM, and so on) to call device-specific routines that handle any device-specific requirements.
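The switch-table dispatch described above can be sketched as a lookup from device type to a set of device-specific routines. This is a minimal illustration, not the driver's actual code: the table layout and return values are hypothetical, and the routine names simply follow the `csxx_` naming pattern the text uses for disk (`dk`) and tape (`tp`) devices.

```python
def csdk_open(device):
    # Hypothetical disk-specific open; a real driver would read the partition table here.
    return f"disk {device} ready"


def cstp_open(device):
    # Hypothetical tape-specific open.
    return f"tape {device} ready"


# Switch table indexed by device type; each entry holds the device-specific routines.
SWITCH_TABLE = {
    "disk": {"open": csdk_open},
    "tape": {"open": cstp_open},
}


def cs_open(device, device_type):
    """Common entry point: find the device-specific routine by type and call it."""
    try:
        routine = SWITCH_TABLE[device_type]["open"]
    except KeyError:
        raise ValueError(f"unsupported device type: {device_type}") from None
    return routine(device)
```

Supporting a new device type in this arrangement means adding one table entry, while the common entry point stays unchanged.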

[0042] The cs_open function guarantees that the device exists and is ready for I/O operations to be performed on it. Unlike current system architectures, the common portion 503 does not create a table of known devices during initialization of the operating system (OS). Instead, the driver common portion 503 is self-configuring: it determines the state of a device during the initial open of that device. This allows the driver common portion 503 to "see" devices that may have come online after the OS 202 initialization phase.

[0043] During the initial open, SCSI devices are bound to a command page by issuing a SCSI Inquiry command to the target device. If the device responds positively, the response data (which contains information such as vendor ID, product ID, and firmware revision level) is compared against a table of known devices within the SCSI configuration module 516. If a match is found, the device is explicitly bound to the command page specified by that table entry. If no match is found, the device is implicitly bound to a generic CCS (Common Command Set) or SCSI II command page, based on the format of the response data. The driver common portion 503 contains routines used by the low-level driver 506 and the command page functions to allocate resources, to create a DMA list for scatter-gather operations, and to complete a SCSI operation.
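The explicit and implicit binding rules can be sketched as a table lookup with a fallback. The table contents, page names, and dictionary keys below are hypothetical, and the sketch assumes that a response data format value of 2 indicates SCSI II while any other value falls back to CCS.

```python
# Hypothetical table of known devices: (vendor ID, product ID) -> command page name.
KNOWN_DEVICES = {
    ("ACME", "DISK-1000"): "acme_disk_page",
}


def bind_command_page(inquiry):
    """Return (command_page, binding_kind) for a device's Inquiry response data."""
    key = (inquiry["vendor_id"], inquiry["product_id"])
    if key in KNOWN_DEVICES:
        # A table match gives an explicit binding.
        return KNOWN_DEVICES[key], "explicit"
    # No match: bind implicitly to a generic page chosen by response data format.
    if inquiry.get("response_format") == 2:
        return "generic_scsi2_page", "implicit"
    return "generic_ccs_page", "implicit"
```

Because the lookup runs at first open rather than at OS initialization, a device that comes online later is still bound the first time it is opened.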

[0044] All FCI low-level driver 506 routines are called from the driver common portion 503. The driver common portion 503 is the only layer that actually initiates a SCSI operation, which it does by calling the appropriate low-level driver (LLD) routine in the hardware interface module (HIM) to set up the hardware and start the operation. The LLD routines are also accessed through a switch table indexed by a driver ID that is assigned during configuration from the SCSI configuration module 516.

[0045]

(b) Device-Specific Portion

The interface between the common portion 502 and the device-specific routines 504 is similar to the interface to the common portion, and includes the csxx_init, csxx_open, csxx_close, and csxx_strategy commands. The "xx" designation indicates the storage device type (for example, "dk" for disk or "tp" for tape). These routines handle any device-specific requirements. For example, if the device is a disk, csdk_open must read the partition table information from a specific area of the disk, and csdk_strategy must use the partition table information to determine whether a block is out of bounds. (Partition tables define the logical-to-physical disk block mapping for each specific physical disk.)

(c) High-Level Driver Error/Failover Handling

(i) Error Handling

(a) Retries

The most common recovery method of the HLD 502 is to retry the I/Os that failed. The number of retries for a given command type is specified by the command page. For example, because a read or write command is considered very important, its associated command page may set the retry count to 3. An Inquiry command is not as important, and constant retries during start-of-day operations may slow the system down, so its retry count may be set to 0.

【００４６】要求が初めて発行されるとき、その再試行
回数は０に設定される。要求が失敗し、復旧スキムが繰
り返される度に、再試行の回数は増えていく。再試行の
回数が、コマンドページで指定された最大再試行回数を
上回ると、Ｉ／Ｏは失敗に終わり、要求者にメッセージ
が返信される。これがない場合は、要求の再発行が行わ
れる。このルールの唯一の例外は、ユニット・アテンシ
ョンに関するもので、通常はエラーではなくイベントの
通知である。ユニット・アテンションをコマンドとして
受け取り、その最大再試行回数が０又は１に設定されて
いた場合、高レベルドライバ５０２は、このＩ／Ｏに関
する最大再試行回数を２に設定する。こうして、ユニッ
ト・アテンション状態の影響でＩ／Ｏの失敗が早くなる
のを防止する。When a request is first issued, its retry count is set to zero. Each time the request fails and the recovery scheme is repeated, the retry count is incremented. If the retry count exceeds the maximum retry count specified by the command page, the I/O fails and a message is returned to the requester. Otherwise, the request is reissued. The only exception to this rule is for unit attentions, which are typically event notifications rather than errors. If a unit attention is received for a command and its maximum retry count is set to 0 or 1, the high-level driver 502 sets the maximum retry count for this specific I/O to 2. This prevents an I/O from being failed prematurely because of a unit attention condition.
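The retry rule above can be sketched in C as follows. This is only an illustration under stated assumptions: the structure `io_retry` and the functions `note_unit_attention` and `should_retry` are hypothetical names, not identifiers from the driver described here.

```c
#include <assert.h>
#include <stdbool.h>

/* Hypothetical per-I/O retry state; names are illustrative only. */
struct io_retry {
    int retries;      /* retries so far; zero when the request is first issued */
    int max_retries;  /* limit taken from the command page */
};

/* Unit-attention exception: a limit of 0 or 1 is raised to 2 for this I/O. */
static void note_unit_attention(struct io_retry *io)
{
    if (io->max_retries < 2)
        io->max_retries = 2;
}

/* Called after each failure. Returns true if the request should be
 * reissued, false if the I/O must be failed back to the requester. */
static bool should_retry(struct io_retry *io)
{
    io->retries++;                          /* count this recovery attempt */
    return io->retries <= io->max_retries;  /* fail once the limit is exceeded */
}
```

With a limit of 3, the request is reissued after the first three failures and failed on the fourth; with a limit of 0, the first failure is final unless a unit attention raises the limit.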

【００４７】遅延再試行も、一定の時間、待ち行列上で
再試行の置き換えが行われない点を除き、上述の再試行
スキムと同様に扱われる。A delayed retry is handled in the same way as the retry scheme described above, except that the retry is not placed back on the queue until a specified amount of time has elapsed.

【００４８】（ｂ）失敗したＳｃｓｉ_ｏｐＦＣＩ低レベルドライバ５０６に対して発行されたＳｃ
ｓｉ_ｏｐは、いくつかの状況により失敗することがあ
る。表ＩＩは、ＦＣＩ低レベルドライバ５０６がＨＬＤ
５０２に戻すことのできる失敗のタイプを示している。(b) Failed Scsi_ops. A Scsi_op issued to the FCI low-level driver 506 can fail under several circumstances. Table II shows the types of failure that the FCI low-level driver 506 can return to the HLD 502.

【００４９】[0049]

【表２】 [Table 2]

【００５０】（ｃ）リソース不足リソース不足のエラーは、いくつかの希望するリソース
が要求時に利用できなかった際に起こる。普通、こうし
たリソースはシステムメモリ及びドライバ構造メモリで
ある。(C) Resource Shortage Resource shortage errors occur when some desired resources are not available at the time of the request. Typically, such resources are system memory and driver structure memory.

【００５１】システムメモリ不足の処理は、セマフォ・
ブロックによって行われる。メモリリソースをブロック
するスレッドは、すべての新しいＩ／Ｏの発行を妨げ
る。このスレッドは、Ｉ／Ｏが完了してメモリがフリー
になるまでブロックを続ける。System memory shortages are handled through semaphore blocking. A thread that blocks on a memory resource prevents any new I/O from being issued. The thread remains blocked until an I/O completes and frees memory.

【００５２】ドライバ構造リソースは、Ｓｃｓｉ_ｏｐ
及びＩ／Ｏベクトル（ＩＯＶ）リストプールに関連して
いる。ＩＯＶリストは、ディスクが送受信するメモリの
開始値及び長さの値のリストである。このメモリプール
は、プールのサイズを指定する調整可能なパラメータを
利用して、その日の最初に初期化される。Ｓｃｓｉ_ｏ
ｐ又はＩＯＶプールが空の場合、新しいＩ／Ｏによっ
て、これらのプールが増加する。どちらかのプールを増
加させるために、一度に１ページ（4096バイト）のメモ
リが割り当てられる。新しいページのすべてのＳｃｓｉ
_ｏｐ又はＩＯＶプールがフリーになるまで、このペー
ジはフリーにならない。ＩＯＮ２１２が、Ｓｃｓｉ_ｏ
ｐのページの割り当てとフリー化を行っている場合やペ
ージの割り当てとフリー化を絶えず行っている場合は、
関連するパラメータの調整が望ましい。Driver structure resources are related to the Scsi_op and I/O vector (IOV) list pools. An IOV list is a list of memory start addresses and lengths to be transferred to or from disk. These memory pools are initialized at start of day using tunable parameters that specify the pool sizes. If the Scsi_op or IOV pool is empty, new I/O causes these pools to grow. One page (4096 bytes) of memory is allocated at a time to grow either pool. The page is not freed until all Scsi_ops or IOVs from the new page are free. If the ION 212 is constantly allocating and freeing pages for Scsi_ops or IOVs, tuning the associated parameters is desirable.
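The page-at-a-time growth rule can be illustrated with a minimal C sketch. The type `pool_page` and the functions `page_init`, `page_take`, and `page_give` are hypothetical, and the use of `malloc` in place of a kernel page allocator is an assumption made so that the example is self-contained.

```c
#include <assert.h>
#include <stdlib.h>

#define PAGE_BYTES 4096  /* one page is allocated per growth step */

/* Hypothetical bookkeeping for one page added to a Scsi_op or IOV pool. */
struct pool_page {
    void *mem;     /* backing memory for the page */
    int capacity;  /* number of objects that fit in the page */
    int in_use;    /* objects currently handed out from this page */
};

/* Allocate one page and compute how many objects it holds. */
static int page_init(struct pool_page *p, size_t obj_size)
{
    if (obj_size == 0 || obj_size > PAGE_BYTES)
        return -1;
    p->mem = malloc(PAGE_BYTES);
    if (p->mem == NULL)
        return -1;
    p->capacity = (int)(PAGE_BYTES / obj_size);
    p->in_use = 0;
    return 0;
}

/* Hand out one object from the page; -1 if the page is exhausted. */
static int page_take(struct pool_page *p)
{
    if (p->in_use >= p->capacity)
        return -1;
    p->in_use++;
    return 0;
}

/* Return one object. The page is released only when every object taken
 * from it has been returned. Returns 1 if the page was released. */
static int page_give(struct pool_page *p)
{
    if (p->in_use == 0)
        return -1;
    if (--p->in_use == 0) {
        free(p->mem);
        p->mem = NULL;
        return 1;
    }
    return 0;
}
```

A workload that repeatedly takes and returns the last object on a page would allocate and free a page each time, which is the situation in which the text suggests tuning the pool-size parameters.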

【００５３】リソース不足の処置はすべてイベントを通
じて記録される。All resource shortage actions are recorded through events.

【００５４】（ｉｉ）その日の第一の処理その日の最初に、ＨＬＤ５０２は必要な構造とプールを
初期化し、アダプタ固有のドライバ及びハードウェアを
初期化するために呼び出しを行う。その日の第一の処理
は、ｃｓ_ｉｎｉｔ（）の呼び出しによって始まり、こ
れは（１）Ｓｃｓｉ_Ｏｐプールの割り当て、（２）Ｉ
ＯＶプールの割り当て、（３）ファイバ・チャンネル構
造及びハードウェアの初期化を行うＦＣＩｈｗ_ｉｎｔ
（）の呼び出し、（４）割込みサービスルーチンｃｓ_
ｉｎｔｒ（）と該当する割込みベクトルとの結びつけを
行う。(ii) Start-of-day processing. At start of day, the HLD 502 initializes its necessary structures and pools and makes calls to initialize adapter-specific drivers and hardware. Start-of-day processing begins with a call to cs_init(), which (1) allocates the Scsi_Op pool, (2) allocates the IOV pool, (3) calls FCIhw_init() to initialize the Fiber Channel structures and hardware, and (4) binds the interrupt service routine cs_intr() to the appropriate interrupt vectors.

【００５５】（ｉｉｉ）フェールオーバ処理ＩＯＮ２１２のダイポールの半分はどちらも共通のディ
スクデバイスセットに接続されている。ダイポール２２
６のＩＯＮ２１２及び２１４は常にすべてのデバイスに
アクセスできなくてはいけない。ＨＬＤ５０２から見
て、特別なフェールオーバの処理は存在しない。(iii) Failover processing. The two halves of the ION dipole 226 are attached to a common set of disk devices. Both IONs 212 and 214 in the dipole 226 must be able to access all devices at all times. From the perspective of the HLD 502, there is no special handling for failover.

【００５６】（２）コマンドページ本発明のＩＯＮ２１２は、ＳＣＳＩコマンドの実際の組
立から、共通部分とデバイス固有部分を抽象化するコマ
ンドページ法を使用する。コマンドページは機能のポイ
ンタのリストで、各機能はＳＣＳＩコマンド（ＳＣＳＩ
_2_Ｔｅｓｔ_ＵｎｉｔＲｅａｄｙなど）を意味する。上
述のように、デバイスには、第一のオープン時又はその
デバイスのアクセス時に特定のコマンドページが結びつ
けられる。ベンダー独自や非対応の特殊なＳＣＳＩデバ
イスは、そのデバイスの固有コマンドページで参照され
る機能によって管理する。通常のシステムは、出荷時
に、コマンドコントロールセット（ＣＣＳ）、ＳＣＳＩ
Ｉ及びＳＣＳＩＩＩページ、ベンダー独自のページが
入っており、非対応ＳＣＳＩデバイス又はベンダ独自の
ＳＣＳＩコマンドの統合を可能にしている。(2) Command pages. The ION 212 of the present invention uses a command page method that abstracts the common part and the device-specific parts from the actual building of the SCSI command. A command page is a list of pointers to functions, where each function represents a SCSI command (for example, SCSI_2_Test_UnitReady). As described above, a specific command page is bound to a device on its initial open or access. Vendor-unique and non-compliant SCSI device quirks are managed by the functions referenced via that device's specific command page. A typical system is shipped with the command control set (CCS), SCSI I and SCSI II pages, and vendor-unique pages to allow integration of non-compliant SCSI devices or vendor-unique SCSI commands.
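A command page, as described above, is a table of function pointers bound to a device. The following C sketch shows one way such a table could look; the names `command_page`, `device`, `build_cmd`, and the two builder functions are hypothetical and are not taken from the driver itself. Only the opcodes (0x00 for TEST UNIT READY, 0x12 for INQUIRY) are standard SCSI values.

```c
#include <assert.h>
#include <stddef.h>

/* Indexes of the commands a page must provide (illustrative subset). */
enum cmd_index { CMD_TEST_UNIT_READY, CMD_INQUIRY, CMD_COUNT };

/* Each entry builds one SCSI CDB into cdb[] and returns its length,
 * or 0 if the buffer is too small. */
typedef size_t (*cmd_fn)(unsigned char *cdb, size_t cap);

struct command_page {
    cmd_fn fn[CMD_COUNT];
};

static size_t scsi2_test_unit_ready(unsigned char *cdb, size_t cap)
{
    if (cap < 6)
        return 0;
    for (size_t i = 0; i < 6; i++)
        cdb[i] = 0;             /* opcode 0x00, six-byte CDB */
    return 6;
}

static size_t scsi2_inquiry(unsigned char *cdb, size_t cap)
{
    if (cap < 6)
        return 0;
    for (size_t i = 0; i < 6; i++)
        cdb[i] = 0;
    cdb[0] = 0x12;              /* INQUIRY opcode */
    cdb[4] = 36;                /* allocation length */
    return 6;
}

/* A hypothetical SCSI II page; a vendor page would swap in its own functions. */
static const struct command_page scsi2_page = {
    { scsi2_test_unit_ready, scsi2_inquiry }
};

/* The page is bound to the device when it is first opened or accessed. */
struct device {
    const struct command_page *page;
};

/* Callers name the intended function, not the SCSI dialect. */
static size_t build_cmd(const struct device *dev, enum cmd_index which,
                        unsigned char *cdb, size_t cap)
{
    return dev->page->fn[which](cdb, cap);
}
```

Supporting a non-compliant device then amounts to binding a different page to that device, leaving the callers of `build_cmd` unchanged.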

【００５７】コマンドページ機能は、デバイス共通部分
５０３、デバイス固有部分５０４、ＦＣＩ低レベルドラ
イバ５０６（ＲｅｑｕｅｓｔＳｅｎｓｅ）から、仮想
デバイス（ＶＤＥＶ）インターフェースと呼ばれるイン
ターフェースを通じて実施される。このレベルでは、ソ
フトウェアは、デバイスがどのＳＣＳＩ言語を使用して
いるかではなく、デバイスが意図した機能を実施してい
るかどうかのみを問題にする。Command page functions are invoked from the device common part 503, the device-specific part 504, and the FCI low-level driver 506 (Request Sense) through an interface called the virtual device (VDEV) interface. At this level, the software cares only whether the device performs the intended function, not which SCSI dialect the device uses.

【００５８】各コマンドページ機能はＳＣＳＩコマンド
を組み立て、必要があればダイレクト・メモリ・アクセ
ス（ＤＭＡ）データの転送にメモリを割り当てる。次に
この機能はドライバ共通部分５０３にコントロールを戻
す。ドライバ共通部分５０３は、ＳＣＳＩ作業を待ち行
列に加え（必要であればここでソートする）、ＦＣＩ低
レベルドライバ５０６の開始ルーチンを呼び出して、こ
のコマンドを実施する。コマンドが実行された後、「コ
ール・オン・インタラプト」（ＣＯＩ）ルーチンがコマ
ンドページ機能に存在していたら、ドライバのドライバ
共通部分５０３が完了したコマンドのデータ／情報を検
査する前にＣＯＩが呼び出される。返信データ／情報を
操作することで、ＣＯＩは一致しないＳＣＳＩデータ／
情報を標準のＳＣＳＩデータ／情報に変換できる。例え
ば、デバイスの問合わせデータに、バイト８ではなくバ
イト１２で始まるベンダＩＤが含まれる場合、問い合わ
せのコマンドページ機能には、返信の問合わせデータの
ベンダＩＤをバイト８に変えるＣＯＩが含まれる。ドラ
イバ共通部分５０３は常にバイト８で始まるベンダＩＤ
情報を抽出するため、一致しないデバイスについて知る
必要はなくなる。Each command page function builds a SCSI command and, if necessary, allocates memory for direct memory access (DMA) data transfers. The function then returns control to the driver common part 503. The driver common part 503 places the SCSI operation on a queue (sorting it here if required) and calls the start routine of the FCI low-level driver 506 to execute the command. After the command has executed, if a "call on interrupt" (COI) routine exists in the command page function, the COI is called before the driver common part 503 examines the data/information of the completed command. By manipulating the returned data/information, the COI can transform non-conforming SCSI data/information into standard SCSI data/information. For example, if a device's inquiry data contains the vendor ID starting at byte 12 instead of byte 8, the command page function for inquiry contains a COI that shifts the vendor ID to byte 8 in the returned inquiry data. The driver common part 503 always extracts the vendor ID information beginning at byte 8 and therefore need not know about the non-conforming device.
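The vendor ID example can be made concrete with a short C sketch of such a COI routine. The function name `inquiry_coi_fix_vendor` is hypothetical; the 8-byte vendor ID field at byte offset 8 is the standard INQUIRY layout, and the offset 12 is the non-conforming position named in the text.

```c
#include <assert.h>
#include <stddef.h>
#include <string.h>

#define VENDOR_ID_OFFSET 8   /* standard INQUIRY vendor ID offset */
#define VENDOR_ID_LEN    8   /* standard INQUIRY vendor ID length */
#define NONSTD_OFFSET    12  /* offset used by the non-conforming device */

/* Hypothetical COI: move the vendor ID from byte 12 to byte 8 so that
 * the common part can read it at the standard position.
 * Returns 0 on success, -1 if the buffer is missing or too short. */
static int inquiry_coi_fix_vendor(unsigned char *data, size_t len)
{
    if (data == NULL || len < NONSTD_OFFSET + VENDOR_ID_LEN)
        return -1;
    /* Source (12..19) and destination (8..15) overlap, so memmove is used. */
    memmove(data + VENDOR_ID_OFFSET, data + NONSTD_OFFSET, VENDOR_ID_LEN);
    return 0;
}
```

Because the fix-up runs before the common part examines the returned data, the common part can keep a single code path for every device.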

【００５９】（３）ＩＢＯＤとＳＣＳＩ構成モジュー
ルＲＡＩＤコントローラの重要な機能はデータを損失から
守ることである。この機能を実行するために、ＲＡＩＤ
ソフトウェアはディスクデバイスがどこにあり、そのケ
ーブルがどのように接続されているかを、物理的に知ら
なくてはいけない。したがって、ＲＡＩＤコントローラ
のテクニックを実施する上での重要な要件は、ストレー
ジ・デバイスの構成をコントロールする能力である。(3) JBOD and SCSI configuration module. An important function of a RAID controller is to secure data from loss. To perform this function, the RAID software must know physically where a disk device resides and how it is cabled. Hence, an important requirement of implementing RAID controller techniques is the ability to control the configuration of the storage devices.

【００６０】ＪＢＯＤ及びＳＣＳＩ構成モジュール５１
６のＪＢＯＤ部分にはＩＯＮ２１２のスタティックＩＢ
ＯＤ構成の定義のタスクがある。ＪＢＯＤ及びＳＣＳＩ
構成モジュール５１６が記述する構成情報を表ＩＩＩに
示す。The JBOD portion of the JBOD and SCSI configuration module 516 is tasked with defining a static JBOD configuration for the ION 212. The configuration information described by the JBOD and SCSI configuration module 516 is shown in Table III.

【００６１】[0061]

【表３】 [Table 3]

【００６２】アダプタ、ＪＢＯＤエンクロージャ２２
２、ストレージディスク２２４の物理的な位置情報に加
え、ＦＣＩ低レベルドライバ５０６及びドライバデバイ
ス固有部分５０４の入口点といったその他の構成情報や
コマンドページの定義も記述しなくてはいけない。この
情報の提供にはｓｐａｃｅ．ｃファイルが使用され、Ｉ
ＯＮ２１２はＩＯＮ物理ディスクドライバ５００のコン
パイル時に構成情報を組み立てる。サポートするＩＯＮ
２１２の構成が変化した場合は、新しいバージョンのＩ
ＯＮ物理ディスクドライバ５００をコンパイルしなくて
はいけない。In addition to the physical location information of the adapters, JBOD enclosures 222, and storage disks 224, other configuration information must be described, such as the entry points of the FCI low-level driver 506 and the driver device-specific part 504, as well as the command page definitions. A space.c file is used to provide this information, and the ION 212 builds the configuration information when the ION physical disk driver 500 is compiled. If the supported ION 212 configuration changes, a new version of the ION physical disk driver 500 must be compiled.

【００６３】（４）ファイバ・チャンネル・インター
フェース（ＦＣＩ）低レベルドライバＦＣＩ低レベルドライバ５０６は高レベルドライバ５０
２のためにＳＣＳＩインターフェースを管理する。ドラ
イバ共通部分５０３とＦＣＩ低レベルドライバ５０６の
インターフェースには、以下のルーチンが含まれる。な
お「ｘｘ」の指定箇所には、ＦＣＩ低レベルドライバ５
０６がコントロールするハードウェア固有の識別子が入
る（ＦＣＩｈｗ_ｉｎｉｔなど）。(4) Fiber Channel Interface (FCI) low-level driver. The FCI low-level driver 506 manages the SCSI interface for the high-level driver 502. The interface between the driver common part 503 and the FCI low-level driver 506 includes the following routines, where the "xx" designation is a unique identifier for the hardware that the FCI low-level driver 506 controls (for example, FCIhw_init).

【００６４】・ｘｘｈｗ_ｉｎｉｔハードウェアを
初期化する。xxhw_init: initializes the hardware.

【００６５】・ｘｘｈｗ_ｏｐｅｎホストアダプタ
の現在の状況を判断する。xxhw_open: determines the current status of the host adapter.

【００６６】・ｘｘｈｗ_ｃｏｎｆｉｇホストアダ
プタの構成情報をセットアップする（ＳＣＳＩＩＤな
ど）。xxhw_config: sets up the host adapter's configuration information (SCSI ID, etc.).

【００６７】・ｘｘｈｗ_ｓｔａｒｔ可能ならば、
ＳＣＳＩ作業を開始する。xxhw_start: initiates a SCSI operation, if possible.

【００６８】・ｘｘｈｗ_ｉｎｔｒすべてのＳＣＳ
Ｉ割込みを処理する。xxhw_intr: processes all SCSI interrupts.

【００６９】低レベルドライバは、デバイスの仕様を認
識したり問題にしたりせず、単純に上のレベルからのＳ
ＣＳＩコマンドの経路となっている点では、純粋なＳＣ
ＳＩドライバである。このレイヤには割込みサービスル
ーチン、ハードウェアの初期化、マッピング、アドレス
・トランスレート、エラー復旧ルーチンがある。加え
て、同じシステムに複数タイプの低レベルドライバが共
存できる。こうした、ドライバのハードウェア管理レイ
ヤとその他の分割により、同じ高レベルドライバを異な
るマシンで実行することが可能になる。The low-level driver is a pure SCSI driver in that it neither knows nor cares about the specifics of a device and is simply a conduit for SCSI commands from the upper level. This layer contains the interrupt service routines, hardware initialization, mapping, address translation, and error recovery routines. In addition, multiple types of low-level drivers can coexist in the same system. This split between the hardware-controlling layer and the remainder of the driver allows the same high-level driver to run on different machines.

【００７０】ＦＣＩモジュールの基本機能は（１）ＳＣ
ＳＩＯｐをＦＣＩ作業オブジェクト構造（Ｉ／Ｏブロ
ック（ＩＯＢ））にトランスレートするためのＳＣＳＩ
高レベルドライバ（ＳＨＬＤ）とのインターフェース、
（２）異なるＨＩＭ５０８を通じた新しいファイバ・チ
ャンネル・アダプタのサポートを容易にする共通インタ
ーフェースの提供、（３）いずれかのＦＣ−４プロトコ
ルレイヤ（図のファイバ・チャンネル・プロトコル（Ｆ
ＣＰ））が使用するＦＣ−３共通サービスの提供、
（４）ＨＩＭ508又はハードウェアが応答しない場合に
ＨＩＭに送る非同期コマンド（ＦＣＰコマンド、ＦＣ−
３コマンド、ＬＩＰコマンド）を保護するためのタイマ
ーサービスの提供、（５）（ａ）Ｉ／Ｏ要求ブロック
（ＩＯＢ）、（ｂ）ベクトル表、（ｃ）ＨＩＭ５０８リ
ソース（ホストアダプタメモリ、ＤＭＡチャンネル、Ｉ
／Ｏポート、スクラッチメモリなど）を含むファイバ・
チャンネル・ドライバ全体（ＦＣＩ及びＨＩＭ）のリソ
ース管理、（６）ファイバ・チャンネルの（ファイバ・
チャンネル・ファブリックに対する）アービトレイト・
ループ使用の最適化ＦＣＩ低レベルドライバ５０６の重要なデータ構造のリ
ストを下の表ＩＶに示す。The basic functions of the FCI module are:
(1) interfacing with the SCSI high-level driver (SHLD) to translate SCSI Ops into the FCI work object structure (the I/O block (IOB));
(2) providing a common interface to facilitate support of new Fiber Channel adapters through different HIMs 508;
(3) providing FC-3 common services that may be used by any FC-4 protocol layer (the Fiber Channel Protocol (FCP) in the illustrated embodiment);
(4) providing timer services to protect asynchronous commands sent to the HIM (FCP commands, FC-3 commands, LIP commands) in case the HIM 508 or the hardware does not respond;
(5) managing resources for the entire Fiber Channel driver (FCI and HIM), including (a) I/O request blocks (IOBs), (b) vector tables, and (c) HIM 508 resources (host adapter memory, DMA channels, I/O ports, scratch memory, etc.);
(6) optimizing for Fiber Channel arbitrated loop use (as opposed to a Fiber Channel fabric).
A list of important data structures of the FCI low-level driver 506 is shown in Table IV below.

【００７１】[0071]

【表４】 [Table 4]

【００７２】（ａ）エラー処理ＦＣＩ低レベルドライバ５０６が扱うエラーは、ファイ
バチャンネルやＦＣＩ自身に固有のエラーとなる傾向が
ある。(A) Error Processing Errors handled by the FCI low-level driver 506 tend to be errors specific to the Fiber Channel or FCI itself.

【００７３】（ｉ）複数段階のエラー処理ＦＣＩ低レベルドライバ506は特定のエラーを複数ステ
ージ処理で扱う。これにより、エラーのタイプによって
エラー処理テクニックを最適化することができる。例え
ば、破壊的でない手順を使用して効果がない場合は、さ
らに抜本的なエラー処理テクニックを行ったりする。(i) Multi-stage error handling. The FCI low-level driver 506 handles certain errors in multiple stages. This permits the error-handling technique to be optimized for the type of error. For example, if a less destructive procedure is tried and does not work, more drastic error-handling measures may be taken.

【００７４】（ｉｉ）失敗したＩＯＢすべてのＩ／Ｏ要求は、Ｉ／Ｏ要求ブロックを通じて、
ＨＩＭ５０８へ送られる。以下は、ＨＩＭ５０８が返信
する可能性のあるエラーである。(ii) Failed IOBs. All I/O requests are sent to the HIM 508 through an I/O request block. The following are the possible errors that the HIM 508 can send back.

【００７５】[0075]

【表５】 [Table 5]

【００７６】（ｉｉｉ）リソース不足ＦＣＩ低レベルドライバ５０６はＩＯＢのリソースプー
ルとベクトル表を管理する。こうしたプールのサイズは
ＩＯＮ２１２の構成に同調するため、こうしたリソース
が足りなくなる可能性はないはずであるため、簡単な復
旧手順が実施される。(iii) Insufficient resources. The FCI low-level driver 506 manages resource pools for IOBs and vector tables. Because the sizes of these pools are tuned to the ION 212 configuration, running out of these resources should not be possible, so only a simple recovery procedure is implemented.

【００７７】ＩＯＢやベクトル表の要求が行われ、この
要求を満たすのに十分なリソースがない場合は、Ｉ／Ｏ
を待ち行列に戻し、Ｉ／Ｏを再試行するためのタイマー
を設定する。リソース不足の発生は記録される。If a request for an IOB or vector table is made and there are not enough resources to fulfill it, the I/O is placed back on the queue and a timer is set to restart the I/O. Occurrences of insufficient resources are logged.

【００７８】（ｂ）その日の第一の処理その日の最初に、高レベルドライは５０２は、サポート
するそれぞれの低レベルドライバ（ＦＣＩ低レベルドラ
イバ５０６を含む）に対して呼び出しを行う。ＦＣＩの
低レベルドライバ５０６のその日の第一の処理は、ＦＣ
Ｉｈｗ_ｉｎｔ（）ルーチンの呼び出しで始まり、これ
は以下の作業を行う。まず、指定されたＰＣＩバスとデ
バイスのために、ＨＩＭ_ＦｉｎｄＣｏｎｔｒｏｌｌｅ
ｒ（）関数が呼び出される。これはＦｉｎｄＣｏｎｔｒ
ｏｌｌｅｒ（）のバージョンを呼び出す。ＪＢＯＤ及び
ＳＣＳＩ構成モジュール５１６は、探索のためにＰＣＩ
バスとデバイスを指定する。次に、アダプタ（ＡＤＡＰ
ＴＥＣから￥利用できるものなど）が見つかった場合
は、ＨＣＢを割り当て、アダプタ用に初期化する。そし
て、スクラッチメモリ、メモリマップＩ／Ｏ、ＤＭＡチ
ャンネルといったアダプタ固有リソースを得るために、
ＨＩＭ_ＧｅｔＣｏｎｆｉｇｕｒａｔＩＯＮ（）を呼び
出す。次に、リソースを割り当てて初期化し、ＡＤＡＰ
ＴＥＣＨＩＭとハードウェアを初期化するために、Ｈ
ＩＭ_Ｉｎｉｔｉａｌｉｚｅ（）を呼び出す。最後に、
ＩＯＢとベクトル表を割り当てて初期化する。(b) Start-of-day processing. At start of day, the high-level driver 502 calls each supported low-level driver (including the FCI low-level driver 506). Start-of-day processing for the FCI low-level driver 506 begins with a call to the FCIhw_init() routine, which performs the following operations. First, the HIM_FindController() function is called for the specified PCI bus and device. This calls a version of FindController(). The JBOD and SCSI configuration module 516 specifies the PCI bus and devices to be searched. Next, if an adapter (such as one available from ADAPTEC) is found, an HCB is allocated and initialized for the adapter. Then, HIM_GetConfiguration() is called to obtain the adapter-specific resources such as scratch memory, memory-mapped I/O, and DMA channels. Next, the resources are allocated and initialized, and HIM_Initialize() is called to initialize the ADAPTEC HIM and hardware. Finally, the IOBs and vector tables are allocated and initialized.

【００７９】（ｃ）フェールオーバ処理ＩＯＮ２１２のダイポールの半分はどちらも共通のディ
スクデバイスセットに接続されている。両方のＩＯＮ２
１２は常にすべてのデバイスにアクセスできなくてはい
けない。ＦＣＩ低レベルドライバ５０６から見て、特別
なフェールオーバの処理は存在しない。(c) Failover processing. The two halves of the ION dipole 226 are attached to a common set of disk devices. Both IONs 212, 214 must be able to access all devices at all times. From the perspective of the FCI low-level driver 506, there is no special handling for failover.

【００８０】（５）ハードウェア・インターフェース
・モジュール（ＨＩＭ）ハードウェア・インターフェース・モジュール（ＨＩ
Ｍ）５０８はＡＤＡＰＴＥＣのＳｌｉｍＨＩＭ５０９と
インターフェースするように設計されている。ＨＩＭ５
０８モジュールは、ＦＣＩ低レベルドライバ５０６から
の要求を、ＳｌｉｍＨＩＭ５０９が理解してハードウェ
アに発行できる要求にトランスレートすることに主な責
任を有している。これには、Ｉ／Ｏブロック（ＩＯＢ）
要求の取り込みと、ＳｌｉｍＨＩＭ５０９が理解可能
な、対応する転送コントロールブロック（ＴＣＢ）要求
へのトランスレートが含まれる。(5) Hardware Interface Module (HIM). The Hardware Interface Module (HIM) 508 is designed to interface with ADAPTEC's SlimHIM 509. The HIM 508 is primarily responsible for translating requests from the FCI low-level driver 506 into requests that the SlimHIM 509 can understand and issue to the hardware. This includes taking I/O block (IOB) requests and translating them into the corresponding transfer control block (TCB) requests that the SlimHIM 509 understands.

【００８１】ＨＩＭ５０８の基本機能に含まれるのは
（１）検索、構成、初期化、アダプタへのＩ／Ｏの送信
を行うハードウェア固有関数に対する、低レベル・アプ
リケーション・プログラム・インターフェース（ＡＰ
Ｉ）の定義、（２）Ｉ／ＯブロックをＳｌｉｍＨＩＭ／
ハードウェアが理解できるＴＣＢ要求（ＦＣプリミティ
ブＴＣＢ、ＦＣ拡張リンクサービス（ＥＬＳ）ＴＣＢ、
ＳＣＳＩ−ＦＣＰ作業ＴＣＢなど）にトランスレートす
るためのＦＣＩ低レベルドライバ５０６とのインターフ
ェース、（３）ＳｌｉｍＨＩＭに対して発行されたコマ
ンド（ＴＣＢ）の配信と完了の追跡、（４）ＳｌｉｍＨ
ＩＭ５０９からの割り込みやイベント情報の解釈やＦＣ
Ｉ低レベルドライバ５０６に関する適切な割り込み処理
やエラー復旧の実行である。ＴＣＢのデータ構造を下の
表ＶＩに示す。The basic functions of the HIM 508 include:
(1) defining a low-level application program interface (API) to the hardware-specific functions that find, configure, initialize, and send I/O to the adapter;
(2) interfacing with the FCI low-level driver 506 to translate I/O blocks (IOBs) into TCB requests that the SlimHIM/hardware can understand (for example, FC primitive TCBs, FC extended link service (ELS) TCBs, and SCSI-FCP operation TCBs);
(3) tracking the delivery and completion of commands (TCBs) issued to the SlimHIM;
(4) interpreting interrupt and event information from the SlimHIM 509 and carrying out the appropriate interrupt handling and error recovery in conjunction with the FCI low-level driver 506.
The data structure of the TCB is shown in Table VI below.

【００８２】[0082]

【表６】 [Table 6]

【００８３】（ａ）その日の第一の処理ＨＩＭ５０８は、その日の第一の処理中に使用する３つ
の入口点を定義する。第一の入口点はＨＩＭ_Ｆｉｎｄ
Ａｄａｐｔｅｒで、これはＦＣＩｈｗ_ｉｎｔ（）によ
って呼び出され、ＰＣＩＢＩＯＳルーチンを使用し
て、与えられたＰＣＩバス及びデバイス上にアダプタが
あるかどうかを判断する。アダプタの存在の判断には、
アダプタのＰＣＩベンダ及び製品ＩＤを使用する。(a) Start-of-day processing. The HIM 508 defines three entry points used during start of day. The first entry point is HIM_FindAdapter, which is called by FCIhw_init() and uses PCI BIOS routines to determine whether an adapter resides on the given PCI bus and device. The adapter's PCI vendor and product ID are used to determine whether the adapter is present.

【００８４】二番目の入口点はＨＩＭ_ＧｅｔＣｏｎｆ
ｉｇｕｒａｔＩＯＮで、これはアダプタが存在する場合
にＦＣＩｈｗ_ｉｎｉｔ（）によって呼び出され、提供
されたＨＣＢにリソース要件を加える。ＡＤＡＰＴＥＣ
の場合、このリソースにはＩＲＱ、スクラッチ、ＴＣＢ
メモリが含まれる。この情報はＳｌｉｍＨＩＭ５０９へ
の呼び出しを行うことで見つかる。The second entry point is HIM_GetConfiguration, which is called by FCIhw_init() if an adapter is present and places the resource requirements into the provided HCB. For the ADAPTEC adapter, these resources include IRQ, scratch, and TCB memory. This information is found by making calls to the SlimHIM 509.

【００８５】三番目の入口点はＨＩＭ_Ｉｎｉｔｉａｌ
ｉｚｅで、これはリソースの割り当てと初期化が終わっ
た後でＦＣｉｈｗ_ｉｎｔ（）によって呼び出され、ス
クラッチメモリ、ＴＣＢ、ハードウェアの初期化を行う
ためにＳｌｉｍＨＩＭへの呼び出しを行うＴＣＢメモリ
プールを初期化する。The third entry point is HIM_Initialize, which is called by FCIhw_init() after the resources have been allocated and initialized. It initializes the TCB memory pool and calls the SlimHIM to initialize the scratch memory, the TCBs, and the hardware.

【００８６】（ｂ）フェールオーバ処理ＩＯＮ２１２のダイポールの半分はどちらも共通のディ
スクデバイスセットに接続されている。ＩＯＮ２１２、
２１４の両方は常にすべてのデバイスにアクセスできな
くてはいけない。ＨＩＭ５０９から見て、特別なフェー
ルオーバの処理は存在しない。(b) Failover processing. The two halves of the ION dipole 226 are attached to a common set of disk devices. Both IONs 212, 214 must be able to access all devices at all times. From the perspective of the HIM 508, there is no special handling for failover.

【００８７】（６）ＡＩＣ−１１６０ＳｌｉｍＨＩ
ＭＳｌｉｍＨＩＭ５０９モジュールの最終的な目的は、ア
ダプタ（図ではＡＤＡＰＴＥＣＡＩＣ−１１６０）の
ハードウェア抽象を提供することである。ＳｌｉｍＨＩ
Ｍ５０９の主な役割には、ＡＩＣ−１１６０アダプタへ
のファイバ・チャンネル要求の転送、割り込みのサービ
ス、ＳｌｉｍＨＩＭ５０９インターフェースを通じての
ＨＩＭモジュールへのステータス報告の返信がある。(6) AIC-1160 SlimHIM. The overall objective of the SlimHIM 509 module is to provide a hardware abstraction of the adapter (the ADAPTEC AIC-1160 in the illustrated embodiment). The primary role of the SlimHIM 509 is to transport Fiber Channel requests to the AIC-1160 adapter, service interrupts, and report status back to the HIM module through the SlimHIM 509 interface.

【００８８】さらに、ＳｌｉｍＨＩＭ５０９は、ＡＩＣ
−１１６０ハードウェアの管理と初期化、ファームウェ
アのロード、ランタイム作業の実行を行い、ＡＩＣ−１
１６０のエラーの際にはＡＩＣ−１１６０ハードウェア
をコントロールする。Further, the SlimHIM 509 manages and initializes the AIC-1160 hardware, loads the firmware, performs run-time operations, and takes control of the AIC-1160 hardware in the event of an AIC-1160 error.

【００８９】２．外部インターフェースとプロトコルＩＯＮ物理ディスクドライバ・サブシステム５００のす
べての要求は、共通高レベルドライバ５０２を通じて行
われる。2. External Interfaces and Protocols All requests for the ION physical disk driver subsystem 500 are made through a common high level driver 502.

【００９０】ａ）初期化（ｃｓ_ｉｎｉｔ）サブシステムへの単一呼び出しにより、デバイスがＩ／
Ｏの準備をするのに必要なすべての初期化が行われる。
サブシステムの初期化中には、すべてのドライバ構造と
デバイス又はアダプタ・ハードウェアの割り当てと初期
化が行われる。a) Initialization (cs_init). A single call into the subsystem performs all initialization required to prepare a device for I/O. During subsystem initialization, all driver structures are allocated and initialized, as is any device or adapter hardware.

【００９１】ｂ）オープン／クローズ（ｃｓ_ｏｐｅ
ｎ/ｃｓ_ｃｌｏｓｅ）オープン／クローズ・インターフェース５１０は、デバ
イスにアクセスするのに必要な構造の初期化と分類を行
う。インターフェース５１０は一般的なオープン／クロ
ーズ・ルーチンと異なり、すべての「オープン」と「ク
ローズ」が明確に階層化されている。したがって、Ｉ／
Ｏ物理インターフェースドライバ５００が受領したすべ
ての「オープン」には、受領した関連する「クローズ」
が伴わなくてはならず、デバイス関連構造はすべての
「オープン」が「クローズ」するまではフリーにならな
い。オープン／クローズ・インターフェース５１０は、
要求の完了を示す「オープン」又は「クローズ」の返信
については同期式である。b) Open/close (cs_open/cs_close). The open/close interface 510 initializes and tears down the structures required to access a device. The interface 510 differs from typical open/close routines in that all "opens" and "closes" are explicitly layered. Therefore, every "open" received by the I/O physical interface driver 500 must be accompanied by a received and associated "close", and device-related structures are not freed until all "opens" have been "closed". The open/close interface 510 is synchronous in that the return of the "open" or "close" indicates the completion of the request.
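The layered open/close rule behaves like a reference count on the device structures. A minimal C sketch follows, assuming hypothetical names (`dev_state`, `dev_open`, `dev_close`) and a boolean flag standing in for the real device-related structures.

```c
#include <assert.h>
#include <stdbool.h>

/* Hypothetical per-device state; the flag stands in for real structures. */
struct dev_state {
    int open_count;             /* number of "opens" not yet "closed" */
    bool structures_allocated;  /* device-related structures exist */
};

/* Each "open" is counted; the first one sets up the structures. */
static void dev_open(struct dev_state *d)
{
    if (d->open_count == 0)
        d->structures_allocated = true;
    d->open_count++;
}

/* Each "close" must match an earlier "open". Structures are released
 * only when the last outstanding "open" has been closed.
 * Returns 0 on success, -1 for a "close" without a matching "open". */
static int dev_close(struct dev_state *d)
{
    if (d->open_count == 0)
        return -1;
    if (--d->open_count == 0)
        d->structures_allocated = false;
    return 0;
}
```

Both calls return only after their work is done, which mirrors the synchronous behavior described for interface 510.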

【００９２】ｃ）Ｂｕｆ_ｔ（ｃｓ_ｓｔｒａｔｅｇ
ｙ）Ｂｕｆ_ｔインターフェース５１２はデバイスに対する
論理ブロックの読み出し及び書き込み要求の発行を可能
にする。要求者は、そのＩ／Ｏを記述したＢｕｆ_ｔ構
造を伝える。デバイスＩＤ、論理ブロックアドレス、デ
ータアドレス、Ｉ／Ｏタイプ（読み出し／書き込み）、
コールバック・ルーチンといった属性はＢｕｆ_ｔによ
って記述される。要求が完了すると、要求者がコールバ
ックで指定した関数が呼び出される。Ｂｕｆ_ｔンター
フェース５１２は非同期インターフェースである。要求
者への関数の返信は要求の完了を示すものではない。関
数が戻された時には、Ｉ／Ｏがデバイスで実行中の可能
性もあり、そうではない可能性もある。その要求は待ち
行列上で実行されるのを待っている可能性もある。その
要求は、コールバック関数が呼び出されるまでは完了し
ない。c) Buf_t (cs_strategy). The Buf_t interface 512 allows logical block read and write requests to be issued to devices. The requester passes down a Buf_t structure that describes the I/O. Attributes such as the device ID, logical block address, data address, I/O type (read/write), and callback routine are described by the Buf_t. When the request completes, the function specified by the requester's callback is called. The Buf_t interface 512 is an asynchronous interface. The return of the function to the requester does not indicate that the request has completed. When the function returns, the I/O may or may not be executing on the device. The request may be on a queue waiting to be executed. The request is not complete until the callback function is called.
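The asynchronous contract of the Buf_t interface can be illustrated with a small C model. Everything here is an assumption made for illustration: the structure `buf` and its field names only mirror the attributes listed in the text, and `demo_strategy` and `demo_complete_one` are stand-ins for the real queueing and completion paths.

```c
#include <assert.h>
#include <stddef.h>

struct buf;
typedef void (*buf_done_fn)(struct buf *);

/* Hypothetical request descriptor mirroring the attributes in the text. */
struct buf {
    int dev_id;         /* device ID */
    unsigned long lba;  /* logical block address */
    void *data;         /* data address */
    int is_write;       /* I/O type: 0 = read, 1 = write */
    buf_done_fn done;   /* callback invoked on completion */
    int completed;      /* set by the demo callback */
    struct buf *next;   /* queue linkage */
};

static struct buf *queue_head, *queue_tail;

/* Asynchronous submit: the request is only queued; returning from this
 * function says nothing about completion. */
static void demo_strategy(struct buf *b)
{
    b->next = NULL;
    if (queue_tail)
        queue_tail->next = b;
    else
        queue_head = b;
    queue_tail = b;
}

/* Stand-in for the completion path: finish one queued request and call
 * its callback. Returns 1 if a request was completed, 0 if none queued. */
static int demo_complete_one(void)
{
    struct buf *b = queue_head;
    if (b == NULL)
        return 0;
    queue_head = b->next;
    if (queue_head == NULL)
        queue_tail = NULL;
    b->done(b);
    return 1;
}

/* Example callback supplied by the requester. */
static void mark_done(struct buf *b)
{
    b->completed = 1;
}
```

The requester learns that the I/O has finished only when `mark_done` runs, not when `demo_strategy` returns.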

【００９３】ｄ）ＳＣＳＩＬｉｂＳＣＳＩＬｉｂ５１４は、通常の読み出し及び書き込み
を除くＳＣＳＩコマンド記述子ブロック（ＣＤＢ）のデ
バイスへの送信を可能にするインターフェースを提供す
る。このインターフェースを通じて、ディスクのスピン
やスピンダウンのためにユニットの開始及び停止といっ
た要求が利用されたり、エンクロージャ・デバイスのモ
ニタや管理のために診断の送信又は受領といった要求が
利用される。すべてのＳＣＳＩＬｉｂルーチンは同期式
である。呼び出した関数の返信が要求の完了を意味す
る。d) SCSILib. SCSILib 514 provides an interface that allows SCSI command descriptor blocks (CDBs) other than normal reads and writes to be sent to devices. Through this interface, requests such as Start and Stop Unit are used to spin disks up and down, and Send and Receive Diagnostic requests are used to monitor and manage enclosure devices. All SCSILib routines are synchronous. The return of the called function indicates the completion of the request.

【００９４】ｅ）割り込み（ｃｓ_ｉｎｔｒ）ＩＯＮ物理ディスクドライバ５００は、すべてのＳＣＳ
Ｉ及びファイバ・チャンネル・アダプタの割り込みの中
心的なディスパッチャである。実施形態では、フロント
エンド／バックエンド割り込みスキムが利用されてい
る。この場合、割り込みがサービスされると、フロント
エンド割り込みサービスルーチンが呼び出される。フロ
ントエンドは割り込みスタックから実行され、割り込み
のソースのクリア、アダプタによる更なる割り込み生成
の無効化、バックエンド割り込みサービスルーチンのス
ケジュールに責任を有する。バックエンドは、（アダプ
タの割り込み無効化とバックエンド・タスク開始の間に
発生した他の割り込みと共に）実際に割り込みを扱う優
先度の高いタスクとして実行される。e) Interrupts (cs_intr). The ION physical disk driver 500 is the central dispatcher for all SCSI and Fiber Channel adapter interrupts. In one embodiment, a front-end/back-end interrupt scheme is used. In this case, when an interrupt is serviced, a front-end interrupt service routine is called. The front end executes from the interrupt stack and is responsible for clearing the source of the interrupt, disabling the adapter from generating further interrupts, and scheduling a back-end interrupt service routine. The back end executes as a high-priority task that actually handles the interrupt (along with any other interrupts that occurred between the disabling of adapter interrupts and the start of the back-end task).
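The front-end/back-end split can be modeled in a few lines of C. This is a simplified sketch: the structure `adapter` and both functions are hypothetical, events are modeled as a simple counter, and re-enabling adapter interrupts at the end of the back end is an assumption that the text does not state.

```c
#include <assert.h>
#include <stdbool.h>

/* Hypothetical adapter model: events accumulate in `pending`. */
struct adapter {
    bool irq_enabled;        /* adapter may raise interrupts */
    int pending;             /* events waiting to be handled */
    int handled;             /* events handled so far */
    bool backend_scheduled;  /* back-end task has been queued */
};

/* Front end: runs on the interrupt stack and does the minimum work.
 * Clearing the interrupt source is not modeled here. */
static void frontend_isr(struct adapter *a)
{
    a->irq_enabled = false;       /* stop further interrupt generation */
    a->backend_scheduled = true;  /* hand off to the high-priority task */
}

/* Back end: handles the original event plus anything that arrived
 * between the disable and the start of this task. */
static void backend_task(struct adapter *a)
{
    a->handled += a->pending;
    a->pending = 0;
    a->backend_scheduled = false;
    a->irq_enabled = true;        /* assumption: interrupts re-enabled here */
}
```

Keeping the front end this small limits time spent on the interrupt stack, while the back end drains all accumulated events in one pass.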

【００９５】３．ＩＯＮの機能ＩＯＮ２１２は主に５つの機能を果たす。この機能は以
下の通りである。3. ION functions. The ION 212 performs five primary functions, as follows.

【００９６】ストレージのネーミングと射影：ストレー
ジディスク２２４に保存されるストレージ・リソース・
オブジェクトのイメージを計算ノード２００に射影し
て、一貫したストレージのネーミングを行うために、計
算ノード２００と協力する。Storage naming and projection: coordinates with the compute nodes 200 to provide uniform and consistent storage naming by projecting images of the storage resource objects stored on the storage disks 224 to the compute nodes 200.

【００９７】ディスク管理：ＩＯＮ２１２と運用上で対
になっているストレージディスクドライブ２２４にデー
タ分配及びデータ冗長テクニックを実施する。Disk management: Performs data distribution and data redundancy techniques on the storage disk drives 224 that are operationally paired with the ION 212.

【００９８】ストレージ管理：ストレージのセットアッ
プ、計算ノード２００からのＩ／Ｏ要求の処理を含むデ
ータ移動、性能の計測、イベント分配を取り扱う。Storage management: Handles storage setup, data movement including processing of I / O requests from the computing nodes 200, performance measurement, and event distribution.

【００９９】キャッシュ管理：アプリケーション・ヒン
ト・プレフェッチのようなキャッシュ充足作業を含むキ
ャッシュデータの読み出しと書き込み。Cache management: read and write of cache data including cache sufficiency operations such as application hint prefetch.

【０１００】インターコネクト管理：性能の最適化のた
めの計算ノード２００のデータフローの制御及び要求の
経路選択の制御と、それに伴うダイポール２２６の２つ
のＩＯＮ２１２の間のストレージ分配の制御。Interconnect management: control of the data flow of the compute node 200 for performance optimization and control of the routing of requests, and consequently control of the storage distribution between the two IONs 212 of the dipole 226.

【０１０１】ａ）ストレージのネーミングと射影ＩＯＮ２１２はストレージディスク２２４に保存される
ストレージ・リソース・オブジェクトを計算ノード２０
０に射影する。この機能の重要な部分は、ＩＯＮ２１２
が管理する各ストレージリソース（仮想ファブリックデ
ィスクを含む）に、大域的にユニークな名前、ファブリ
ックでユニークなＩＤ、ボリュームセット識別子（ＶＳ
Ｉ）のいずれかの作成と割り当てである。a) Storage naming and projection. The ION 212 projects images of the storage resource objects stored on the storage disks 224 to the compute nodes 200. An important part of this function is the creation and allocation of a globally unique name, fabric-unique ID, or volume set identifier (VSI) 602 for each storage resource (including virtual fabric disks) managed by the ION 212.

【０１０２】図６は、ＶＳＩ６０２と関連データの構造
と内容を示す図である。ＶＳＩ６０２がユニークで競合
がないことが重要であるため、各ＩＯＮ２１２は、その
ＩＯＮ２１２がローカルで管理するストレージリソース
のために、大域的にユニークな名前を作成し割り当てる
ことに責任を有する。また、ストレージ・リソース・オ
ブジェクトを保存しているストレージリソースを管理す
るＩＯＮ２１２のみが、そのストレージリソースにＶＳ
Ｉ６０２を割り当てることができる。ＶＳＩ602の作成
と割り当てができるのは常駐するストレージリソースを
管理するＩＯＮのみだが、他のＩＯＮ２１２もその後の
ストレージリソースの保存と検索の管理をすることがあ
る。これは、ＩＯＮが割り当てたＶＳＩ６０２が、他の
ＩＯＮによって管理されるストレージリソースに移動し
たとしても、特定のデータオブジェクトのＶＳＩ６０２
を変更する必要がないためである。FIG. 6 is a diagram showing the structure and contents of the VSI 602 and associated data. Because it is important that the VSI 602 be unique and non-conflicting, each ION 212 is responsible for creating and allocating globally unique names for the storage resources that it manages locally. Only the ION 212 that manages the storage resource storing a storage resource object is permitted to allocate a VSI 602 to that storage resource. Although only the ION currently managing the resident storage resource can create and allocate a VSI 602, other IONs 212 may subsequently manage the storage and retrieval of that storage resource. This is because the VSI 602 of a particular data object does not have to change if an ION-assigned VSI 602 is later moved to a storage resource managed by another ION.

【０１０３】ＶＳＩ６０２は、ＩＯＮ識別子６０４とシ
ーケンス番号５０６の２つの部分を含む64ビットの数字
として導入される。ＩＯＮ識別子は各ＩＯＮ２１２に割
り当てられた、大域的にユニークなＩＤ番号である。大
域的にユニークなＩＯＮ識別子６０４を入手するテクニ
ックの１つでは、実時間時計チップにストアされること
の多い電子的に読み出し可能なマザーボードのシリアル
番号を使う。唯一のマザーボードに割り当てられるた
め、シリアル番号はユニークである。ＩＯＮ識別子６０
４は大域的にユニークな番号であるため、各ＩＯＮ２１
２は、ローカルでのみユニークなシーケンス番号６０６
を割り当てることで、大域的にユニークなＶＳＩ６０２
を作成できる。The VSI 602 is implemented as a 64-bit number that contains two parts: an ION identifier 604 and a sequence number 606. The ION identifier 604 is a globally unique identification number assigned to each ION 212. One technique for obtaining a globally unique ION identifier 604 is to use the electronically readable motherboard serial number that is often stored in the real-time clock chip. This serial number is unique because it is assigned to only one motherboard. Because the ION identifier 604 is a globally unique number, each ION 212 can create a globally unique VSI 602 by allocating a sequence number 606 that is only locally unique.
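The composition of a VSI from its two parts can be sketched in C. The text gives only the total width of 64 bits; the even 32/32 split between ION identifier and sequence number used below is an assumption, and the function names are hypothetical.

```c
#include <assert.h>
#include <stdint.h>

/* Assumed layout: upper 32 bits hold the ION identifier, lower 32 bits
 * hold the locally unique sequence number. The split is illustrative. */
static uint64_t vsi_make(uint32_t ion_id, uint32_t seq)
{
    return ((uint64_t)ion_id << 32) | (uint64_t)seq;
}

/* Extract the globally unique ION identifier. */
static uint32_t vsi_ion_id(uint64_t vsi)
{
    return (uint32_t)(vsi >> 32);
}

/* Extract the sequence number allocated by that ION. */
static uint32_t vsi_sequence(uint64_t vsi)
{
    return (uint32_t)(vsi & 0xFFFFFFFFu);
}
```

Two IONs that both allocate sequence number 1 still produce different VSIs because their identifiers differ, which is why a locally unique sequence number suffices.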

【０１０４】ＶＳＩ６０２をＩＯＮ２１２上のストレー
ジリソースに結びつけた後、ＩＯＮ２１２はストレージ
リソース１０４へのアクセスを可能にするために、同報
メッセージを通じて、ファブリックのすべてのノードに
ＶＳＩ６０２をエキスポートする。このプロセスは、本
文書のＩＯＮ名のエキスポートのセクションで詳しく論
じる。After binding the VSI 602 to storage resources on the ION 212, the ION 212 exports the VSI 602 to all nodes of the fabric via broadcast messages to enable access to the storage resource 104. This process is discussed in detail in the ION Name Export section of this document.

【０１０５】エキスポートしたＶＳＩ６０２を使用し
て、計算ノード２００のソフトウェアは、ローカル接続
された他のすべてのストレージデバイスからは区別でき
ないという点で意味的に透過なストレージリソースのた
めに、ローカル入口点を作成する。例えば、計算ノード
のＯＳ２０２がＵＮＩＸであれば、周辺機器１０８やデ
ィスク２１０といったローカル接続デバイスと同様に、
ブロックデバイスと未使用デバイスの入口がデバイスデ
ィレクトリに作成される。その他のＯＳ２０２でも、同
様の意味的な同値性が守られる。異なるＯＳ２０２を実
行する計算ノード２００間でも、この複合コンピューテ
ィング環境での最善のサポートを行うために、ルート名
の一貫性が維持される。計算ノード２００のローカル入
口点は、エキスポートしたストレージリソース１０４の
現在の可用性を追跡するために、動的に更新される。計
算ノード２００で稼働するＯＳ依存のアルゴリズムはＶ
ＳＩ６０２を使用して、インポートしたストレージリソ
ースのデバイス入口点名を作成する。この方法により、
共通のＯＳを共有するノード間での名前の一貫性が保証
される。これによりシステムは、各計算ノード２００上
の大域的な名前の付いたストレージリソースのローカル
入口点を（静的ではなく）動的に作成することで、複合
コンピューティング環境をサポートするルート名の一貫
性を維持できる。Using the exported VSI 602, the compute node 200 software creates a local entry point for the storage resource that is semantically transparent in that it is indistinguishable from any other locally attached storage device. For example, if the compute node OS 202 is UNIX, both block device and raw device entry points are created in the device directory, just as for locally attached devices such as peripherals 108 or disks 210. Similar semantic equivalence is maintained for other OSs 202. Root name consistency is maintained among compute nodes 200 running different OSs 202 to best support the heterogeneous computing environment. Local entry points in the compute nodes 200 are dynamically updated to track the current availability of the exported storage resources 104. An OS-dependent algorithm running on the compute node 200 uses the VSI 602 to create device entry point names for imported storage resources. This approach guarantees name consistency among nodes that share a common OS. It allows the system to maintain root name consistency in support of a heterogeneous computing environment by dynamically (rather than statically) creating local entry points for globally named storage resources on each compute node 200.

【０１０６】As described above, the details of creating the VSI 602 for the storage resources 104 are directly controlled by the ION 212 that is exporting the storage resources 104. To account for potential differences among the OSs 202 on the compute nodes 200, one or more descriptive headers are associated with each VSI 602 and stored with the VSI 602 on the ION 212. Each VSI 602 descriptor 608 includes an operating system (OS) dependent data section 610 that stores enough OS 202 dependent data to create device entry points on the compute nodes 200 for that particular VSI 602 consistently (that is, with the same name and the same operational semantics across the compute nodes 200). This OS dependent data 610 includes, for example, data describing local access rights 612 and ownership information 614. After a VSI 602 has been established by the ION 212 and imported by the compute node 200, but before the entry point for the storage resource 104 associated with that VSI 602 is created, the ION 212 sends the appropriate OS-specific data 610 to the compute node 200. The multiple descriptive headers per VSI 602 enable both concurrent support of heterogeneous compute nodes 200 running different OSs (each OS has its own descriptive header) and support of disjoint access rights among different groups of compute nodes 200. Compute nodes 200 that share the same descriptive header create device entry points in a common, consistent manner. Thus, both names and operational semantics are kept consistent on all compute nodes 200 that share a common set of access rights.

【０１０７】The VSI descriptor 608 also includes an alias field 616, which is used to present a human-readable VSI 602 name on the compute nodes. For example, if the alias for VSI 1984 is "soma", the compute node 200 will have directory entries for both 1984 and "soma". Because the VSI descriptor 608 is stored with the VSI 602 on the ION, the same alias and local access rights appear on every compute node that imports the VSI 602.
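One way to picture the descriptor described in the two paragraphs above is the following Python sketch. The class and field names are hypothetical; only the reference numerals in the comments come from the text. The OS-dependent data is held as opaque bytes because the ION stores it without interpreting it.

```python
from dataclasses import dataclass, field
from typing import Dict, List, Optional


@dataclass
class OSDependentData:      # OS dependent data section 610
    access_rights: bytes    # local access rights 612, opaque to the ION
    owner_info: bytes       # ownership information 614, opaque to the ION


@dataclass
class VSIDescriptor:        # VSI descriptor 608
    vsi: int
    alias: Optional[str] = None  # alias field 616
    # One descriptive header per OS (assumed here to be keyed by OS name).
    headers: Dict[str, OSDependentData] = field(default_factory=dict)

    def header_for(self, os_name: str) -> OSDependentData:
        """Return the header a compute node running os_name would be sent."""
        return self.headers[os_name]

    def directory_entries(self) -> List[str]:
        """Names that appear on an importing node: the VSI and, if set, its alias."""
        names = [str(self.vsi)]
        if self.alias:
            names.append(self.alias)
        return names
```

With `VSIDescriptor(1984, "soma", ...)`, `directory_entries()` yields both `"1984"` and `"soma"`, matching the example in the text.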

【０１０８】As described above, the present invention uses a naming approach suited to a distributed allocation scheme. In this approach, names are generated locally following an algorithm that guarantees global uniqueness. A variation would be a locally centralized approach, in which a central name server exists for each system, but its availability and robustness requirements weigh heavily compared with a purely distributed approach. With the approach described above, the present invention is able to create a locally executed algorithm that guarantees global uniqueness.
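A minimal sketch of such a locally executed allocator follows. It assumes the construction mentioned later in paragraph 0122 as an example (a unique ID tied to the ION hardware combined with an ION-managed sequence number); the 32-bit sequence width and the class name `LocalVSIAllocator` are assumptions made for illustration.

```python
import itertools


class LocalVSIAllocator:
    """Generate names locally; uniqueness relies on hardware IDs being unique."""

    SEQ_BITS = 32  # assumed width of the per-ION sequence number

    def __init__(self, hardware_id: int):
        self.hardware_id = hardware_id
        self._seq = itertools.count()

    def allocate(self) -> int:
        seq = next(self._seq)
        if seq >= 1 << self.SEQ_BITS:
            raise OverflowError("sequence space exhausted")
        # High bits identify the allocating ION, low bits the local sequence.
        return (self.hardware_id << self.SEQ_BITS) | seq
```

No name server is consulted, so allocation stays available even when other nodes are down. The scheme is only as sound as its inputs, which is why the start-up conflict checks described below are still needed.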

【０１０９】Creating a globally consistent storage system requires more support than simply preserving name consistency across the compute nodes 200. Hand in hand with names goes the issue of security, which takes two forms in the present invention. The first is the security of the interface between the IONs 212 and the compute nodes 200; the second is the security of storage from within the compute node 200.

【０１１０】b) Storage Authentication and Authorization
A VSI 602 resource is protected by two distinct mechanisms: authentication and authorization. If a compute node 200 is authenticated by the ION 212, the VSI names are exported to the compute node 200. An exported VSI 602 appears as a device name on the compute node 200. Application threads running on the compute node can attempt to perform operations on this device name. The access rights of the device entry point and the OS semantics of the compute node 200 determine whether an application thread is authorized to perform any given operation.

【０１１１】This approach to authorization extends compute node 200 authorization to storage resources 104 located anywhere accessible through the interconnect fabric 106. However, the present invention differs from other computer architectures in that the storage resources 104 are not directly managed by the compute nodes 200. This difference makes it impractical to simply bind local authorization data to file system entities. Instead, the present invention binds the compute node 200 authorization policy data to the VSI 602 at the ION 212, and uses a two-stage approach in which the compute node 200 and the ION 212 share a level of mutual trust. The ION 212 authorizes each compute node 200 to access a specific VSI 602, but further refinement of the authorization of a specific application thread to the data designated by the VSI is the responsibility of the compute node 200. The compute nodes 200 enforce the authorization policy for the storage 104 using the policies contained in the authorization metadata stored by the ION 212. Hence, the compute nodes 200 must trust the ION 212 to preserve the metadata, and the ION 212 must trust the compute nodes 200 to enforce the authorization. One advantage of this approach is that the ION 212 does not need to know how to interpret the metadata. The ION 212 is therefore isolated from enforcing the specific authorization semantics imposed by the different OSs 202 used by the compute nodes 200.

【０１１２】All data associated with a VSI 602 (including access rights) is stored on the ION 212, but the burden of managing the contents of the access rights data rests with the compute nodes 200. More specifically, when the list of VSIs 602 being exported by an ION 212 is sent to a compute node 200, associated with each VSI 602 is all of the OS-specific data the compute node 200 requires to enforce local authorization. For example, a compute node 200 running UNIX would be sent the name, the group name, and the user ID, that is, enough data to create a device entry node in the file system. Alternative names for a VSI 602 that are specific to the compute node OS 202 (or to an individual compute node 200) are included with each VSI 602. Local OS-specific commands that change the access rights of a storage device are captured by the compute node 200 software and converted into a message sent to the ION 212. This message updates the VSI access rights specific to that OS version. When this change has completed, the ION 212 transmits the update to all compute nodes 200 in the system that use that OS.
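The update path in the preceding paragraph (a local command is captured, forwarded to the ION, stored, then fanned out to nodes running the same OS) might be modeled as below. The class `IonAccessStore`, its methods, and the `outbox` list standing in for fabric messages are all hypothetical; the rights blob is kept opaque, as the text requires.

```python
class IonAccessStore:
    """Toy model of the ION side of an access-rights update (names assumed)."""

    def __init__(self):
        self.rights = {}   # (vsi, os_name) -> opaque access-rights bytes
        self.nodes = {}    # node_id -> os_name
        self.outbox = []   # (node_id, vsi, blob) updates "sent" over the fabric

    def register(self, node_id, os_name):
        self.nodes[node_id] = os_name

    def update_rights(self, vsi, os_name, blob):
        # Store the new rights for this OS without interpreting them.
        self.rights[(vsi, os_name)] = blob
        # Propagate only to compute nodes that run the same OS.
        for node_id, node_os in self.nodes.items():
            if node_os == os_name:
                self.outbox.append((node_id, vsi, blob))
```

A compute node's wrapper around its local permission-changing command would call `update_rights` with the new blob; nodes running a different OS never see the update, since their own descriptive header is unaffected.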

【０１１３】When a compute node (CN) 200 comes online, it transmits an "I'm here" message to each ION. This message includes a digital signature that identifies the compute node 200. If the compute node 200 is known to the ION 212 (that is, the ION 212 authenticates the compute node 200), the ION 212 exports every VSI name to which the compute node 200 has access rights. The compute node 200 uses this list of VSI 602 names to build the local access entry points for system storage. When an application 204 running on the compute node 200 first references a local entry point, the compute node 200 requests the access rights description data for that VSI 602 from the ION 212 over the interconnect fabric 106. The request message includes the digital signature of the requesting compute node 200. The ION 212 receives the message, uses the digital signature to locate the appropriate set of VSI access rights to return, and transmits that data to the requesting compute node 200 over the interconnect fabric 106. The ION 212 does not interpret the access rights it sends to the compute node 200; it simply sends the data. The compute node 200 software uses this data to bind the appropriate set of local access rights to the local entry point for the target storage object.

【０１１４】A set of compute nodes 200 can share the same set of access rights, either by using the same digital signature or by having the ION 212 bind multiple different signatures to the same set of access rights. The present invention uses authentication both to identify the compute node 200 and to specify which set of local authorization data is used to create the local entry points. Authorization data is pulled to the compute node only when the VSI 602 is first referenced by an application. This "pull when needed" model avoids the startup cost of moving large quantities of access rights metadata on very large systems.
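The authentication exchange and the "pull when needed" model described above can be sketched as follows. The classes `Ion` and `ComputeNode`, the use of plain strings as signatures, and the `fetches` counter are illustrative assumptions; a real system would verify a cryptographic signature rather than look up a string.

```python
class Ion:
    def __init__(self):
        self.signature_to_set = {}  # signature -> access-rights set id
        self.exports = {}           # set id -> {vsi: opaque rights blob}

    def hello(self, signature):
        """Handle an "I'm here" message; return (auth_failed, vsi_names)."""
        set_id = self.signature_to_set.get(signature)
        if set_id is None:
            return True, []  # unknown node: no names, failure flag set
        return False, sorted(self.exports.get(set_id, {}))

    def fetch_rights(self, signature, vsi):
        # The blob is returned uninterpreted.
        return self.exports[self.signature_to_set[signature]][vsi]


class ComputeNode:
    def __init__(self, signature, ion):
        self.signature, self.ion = signature, ion
        self.auth_failed, names = ion.hello(signature)
        # Entry points exist immediately; rights are not fetched yet.
        self.rights = {name: None for name in names}
        self.fetches = 0

    def open(self, vsi):
        if self.rights[vsi] is None:  # first reference: pull on demand
            self.rights[vsi] = self.ion.fetch_rights(self.signature, vsi)
            self.fetches += 1
        return self.rights[vsi]
```

Two signatures mapped to the same set id share one set of access rights, and a node that never opens a VSI never transfers its metadata, which is where the startup saving comes from.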

【０１１５】If a compute node 200 fails authentication, the ION 212 returns a message containing no VSI 602 names, and an authentication-failed flag is set. The compute node 200 can silently continue with no VSI device names from that ION 212, and may report the failed authentication as the system administrator desires. Of course, even a successful authentication may result in no VSI device names being transmitted to the compute node.

【０１１６】c) Start-Up Deconfliction
When an ION 212 starts up, it attempts to export its VSIs 602 to the interconnect fabric 106. When this occurs, the data integrity of the system must be protected from any disruption caused by the new ION 212. To accomplish this, the new ION 212 is checked before it is allowed to export storage. This is done as follows. First, the ION 212 examines its local storage to create a list of the VSIs 602 that it can export. The VSI 602 metadata includes a VSI generation or mutation number. The VSI mutation number is incremented whenever there is a major state change related to that VSI 602 (such as when the VSI is successfully exported to the network). All nodes that take part in VSI conflict detection, including the compute nodes 200 and the IONs 212, keep in memory a history of the VSIs exported and their mutation numbers. All nodes on the interconnect fabric are required to monitor exported VSIs 602 for VSI conflicts at all times. Initially (when the storage extent is first created), the VSI mutation number is set to zero. The mutation number provides a deconfliction criterion in that an exported VSI 602 with a lower mutation number than the last time it was exported can be presumed to be an impostor VSI, even if the ION 212 associated with the real VSI 602 is out of service. An impostor VSI 602 attached to an ION 212 with a mutation number higher than that of the real VSI 602 is treated as the real VSI 602 unless I/O has already been performed on the real VSI. An ION 212 newly introduced into the interconnect fabric must start with a mutation number of 0.
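The mutation-number criterion above reduces to a small decision function. The sketch below is one reading of the rule, with assumed names (`Verdict`, `classify_export`); in particular, accepting an advertisement whose number equals the last one seen, and accepting a VSI that has never been seen, are assumptions that the text does not state explicitly.

```python
from enum import Enum
from typing import Optional


class Verdict(Enum):
    ACCEPT = "accept"
    REJECT = "reject"


def classify_export(advertised: int, last_seen: Optional[int], io_done: bool) -> Verdict:
    """Judge an advertised VSI against this node's in-memory history.

    advertised: mutation number carried by the export
    last_seen:  mutation number recorded when the VSI was last exported, if any
    io_done:    whether I/O has already been performed on the currently known VSI
    """
    if last_seen is None:
        return Verdict.ACCEPT   # no history for this VSI (assumption)
    if advertised < last_seen:
        return Verdict.REJECT   # lower number: presumed impostor
    if advertised > last_seen and io_done:
        return Verdict.REJECT   # higher number loses once I/O has been performed
    return Verdict.ACCEPT
```

The `io_done` guard is what keeps a late arrival with a higher number from displacing a VSI that applications are already using.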

【０１１７】After announcing that it wishes to join the system, the ION 212 transmits its list of VSIs 602 and their associated mutation numbers. All other IONs 212 and compute nodes 200 obtain this list and check the validity of the ION 212 exporting the listed VSIs 602.

【０１１８】Other IONs that are currently exporting the same VSI 602 are assumed to be valid, and send the new ION a message disallowing the export of the specific VSI or VSIs in conflict. If the new ION 212 has a generation or mutation number greater than the one currently in use in the system (an event that should not occur in ordinary operation, since VSIs are globally unique), this is noted and reported to the system administrator, who takes whatever action is necessary. If there are no conflicts, each ION 212 and compute node 200 responds with a proceed vote. When responses have been received from all IONs 212 and compute nodes 200, the generation numbers of all of the new ION's VSIs 602 that are not in conflict are incremented, and those VSIs are made available to the system for export.

【０１１９】When a compute node 200 has an application reference to, and access to, a VSI 602, the compute node 200 tracks the current generation number locally. Whenever an ION 212 advertises (attempts to export) a VSI 602, the compute node 200 checks the generation number advertised for the VSI 602 against the generation number stored locally for that VSI 602. If the generation numbers agree, the compute node 200 votes to proceed. If the generation numbers conflict (as would be the case if an older version of the VSI had been brought online), the compute node 200 sends a disallow message. A compute node 200 whose generation number is older than the generation number advertised by the new ION for that VSI votes to proceed and updates its local generation number for that VSI 602. Compute nodes 200 do not preserve generation numbers across reboots, because the basic design is that the system across the interconnect fabric 106 is stable and that all newcomers, including compute nodes 200 and IONs 212, are checked for consistency.
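The compute node's vote can be written as a few lines of Python. The function name `vote`, the string results, and the plain dictionary used as the node's in-memory table are assumptions made for illustration; the table is deliberately not persisted, matching the statement that generation numbers are not kept across reboots.

```python
from typing import Dict


def vote(local: Dict[int, int], vsi: int, advertised: int) -> str:
    """Return "proceed" or "disallow" for an advertised VSI generation number.

    local maps VSI -> generation number tracked by this compute node.
    """
    known = local.get(vsi)
    if known is not None and advertised < known:
        # An older version of the VSI has come online: conflict.
        return "disallow"
    # Equal numbers agree; a newer advertised number updates the local copy.
    local[vsi] = advertised
    return "proceed"
```

A node that has never referenced the VSI has nothing to compare against, so in this sketch it simply records the advertised number and votes to proceed.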

【０１２０】At first power-up, situations may arise in which the stability of the VSI 602 name space is in question. This problem is addressed by powering up the IONs 212 first and allowing them to continue resolving name conflicts before the compute nodes 200 are allowed to join. Out-of-date versions of the VSIs 602 (resulting from old data on disk drives or other degenerative conditions) can then be resolved by means of the generation number. As long as no compute node 200 is using the VSI 602, a newcomer with a higher generation number can be allowed to invalidate the current exporter of that VSI 602.

【０１２１】(1) Name Service
(a) ION Name Export
An ION 212 exports the working set of VSIs 602 that it exclusively owns, to enable access to the associated storage. The working set of VSIs exported by an ION 212 is determined dynamically through VSI ownership negotiation with the buddy ION (the other ION 212, 214 in the dipole 216), and is globally unique among all nodes communicating over the interconnect fabric 106. The set is typically the default or primary set of VSIs 602 assigned to the ION 212. VSI migration for dynamic load balancing, and exceptional conditions including buddy ION 214 failure and I/O path failure, may cause the exported VSI 602 set to differ from the primary set.

【０１２２】The working set of VSIs is exported by the ION 212 via a broadcast message whenever the working set changes, providing the compute nodes 200 with the latest VSI 602 configuration. A compute node 200 may also interrogate an ION 212 for its working set of VSIs 602. I/O access to the VSIs 602 can be initiated by the compute nodes 200 once the ION 212 enters or re-enters the online state for the exported VSIs 602. As described above, an ION 212 is not permitted to enter the online state if there are any conflicts in the exported VSIs 602. The VSIs 602 associated with a chunk of storage should all be unique, but conflicts may arise in which multiple chunks of storage have the same VSI (for example, if the VSI is constructed from a unique ID associated with the ION 212 hardware and an ION 212 managed sequence number, and the ION 212 hardware is physically moved).

【０１２３】Once the working set has been exported, the exporting ION 212 sets a conflict check timer (2 seconds) before entering the online state and enabling I/O access to the exported VSIs 602. The conflict check timer attempts to give the importers enough time to perform the conflict check processing and to notify the exporter of conflicts, but this cannot be guaranteed unless the timer is set to a very large value. Therefore, an ION 212 needs explicit approval from all nodes (compute nodes 200 and IONs 212) to officially go online. All nodes respond synchronously to the online broadcast message, and the results are merged and returned. The ION 212 officially enters the online state if the merged response is an ACK. If the ION 212 is not allowed to go online, the newly exported set of VSIs 602 cannot be accessed. The node or nodes that sent the NAK subsequently send a VSI conflict message to the exporter so that it can resolve the conflict. Once the conflict is resolved, the ION 212 exports its adjusted working set and attempts to go online again.
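The approval loop in the paragraph above can be sketched as follows. The names `merge_responses` and `go_online` are hypothetical, each node is modeled as a callable that returns the VSIs it considers in conflict, and resolving a conflict is modeled simply as dropping the conflicting VSIs from the working set. The conflict check timer is omitted, since the text notes that it cannot by itself guarantee correctness.

```python
from typing import Callable, Dict, Set

ConflictCheck = Callable[[Set[int]], Set[int]]


def merge_responses(conflicts: Dict[str, Set[int]]) -> bool:
    """The merged result is an ACK only if every node reported no conflict."""
    return all(not c for c in conflicts.values())


def go_online(working_set: Set[int], nodes: Dict[str, ConflictCheck]) -> Set[int]:
    """Export, collect responses, adjust on NAK, and retry until ACK."""
    working = set(working_set)
    while True:
        # Each node answers the online broadcast with the VSIs it disputes.
        conflicts = {name: check(set(working)) & working for name, check in nodes.items()}
        if merge_responses(conflicts):
            return working  # officially online with this set
        # Nodes that sent a NAK follow up with conflict messages; here the
        # exporter resolves them by withdrawing the disputed VSIs.
        for disputed in conflicts.values():
            working -= disputed
```

Each NAK removes at least one VSI from the working set, so the loop ends after a bounded number of rounds, in the worst case with an empty set.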

【０１２４】(b) CN Name Import
The compute nodes 200 are responsible for taking action to import all VSIs 602 exported by the IONs 212. During start-of-day processing, a compute node 200 requests the previously exported VSIs 602 from all online IONs 212 so that it can obtain an up-to-date view of the name space. From then on, the compute node 200 listens for VSI 602 exports.

【０１２５】Control information associated with a VSI 602 is contained in a vsnode that is maintained by the ION 212. The compute node 200 portion of the vsnode contains information used for the construction and management of the names presented to applications 204. The vsnode information includes user access rights and name aliases.

【０１２６】(i) Name Domains and Aliases
VSIs 602 may be configured to have an application-defined name alias that provides an alternate name for accessing the associated storage. Name aliases can be attached to a virtual storage domain to logically group a set of names. Name aliases must be unique within a virtual storage domain.

【０１２７】(ii) vsnode
Modifications to the vsnode by a compute node 200 are sent to the owning ION for immediate update and processing. The vsnode changes are then propagated by the ION 212 to all nodes by exporting the changes and re-entering the online state.

【０１２８】d) Storage Disk Management
The JBOD enclosure 222 is responsible for providing the physical environment for the disk devices, as well as providing several services to the disk devices and to enclosure management applications. These services include (1) notification of component failures (power supply, fan, and so on); (2) notification of thresholds (temperature and voltage); (3) enabling and disabling of fault and status lights; (4) enabling and disabling of audible alarms; and (5) setting of device IDs for the disk devices.

【０１２９】In the past, management applications typically interfaced with enclosures through an out-of-band connection. A serial or Ethernet attachment to the remote enclosure, together with a protocol such as the Simple Network Management Protocol (SNMP), allowed status information about the enclosure's health to be received. In the present invention, disk enclosures may be physically distant from the host system, so it is not practical to monitor enclosure configuration and status via a direct connection such as a separate serial path. To avoid extra cabling, the present invention uses an in-band connection that allows enclosure status to be monitored and enclosure configuration to be controlled over the normal, existing Fibre Channel loop.

【０１３０】The in-band connection uses a set of SCSI commands, sent from the host to a SCSI device, for querying and controlling configuration status, together with a mechanism by which the device communicates this information to the enclosure itself. The portion of the protocol between the host and the disk devices is detailed in the SCSI-3 Enclosure Services (SES) specification, which is incorporated herein by reference.

【０１３１】Three SCSI commands are used to implement the SES interface: INQUIRY, SEND DIAGNOSTIC, and RECEIVE DIAGNOSTIC RESULTS. The INQUIRY command specifies whether a particular device is an enclosure services device or a device that can transport SES commands to an enclosure services process. SEND DIAGNOSTIC and RECEIVE DIAGNOSTIC RESULTS are used, respectively, to control enclosure elements and to receive status information from them.

【０１３２】When using the SEND DIAGNOSTIC or RECEIVE DIAGNOSTIC RESULTS commands, a page code must be specified. The page code specifies the type of status or information being requested. The full set of defined SES pages that can be requested via the SEND DIAGNOSTIC and RECEIVE DIAGNOSTIC RESULTS commands is listed in Table VII below. Items in bold are required by the SES event monitor.

【０１３３】[0133]

【表７】 [Table 7]

【０１３４】The application client may periodically poll the enclosure by executing a RECEIVE DIAGNOSTIC RESULTS command requesting the enclosure status page with a minimum allocation length greater than 1. The information returned in that one byte includes five bits that summarize the status of the enclosure. If one of these bits is set, the application client can reissue the command with a greater allocation length to obtain the complete status.
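The two-step poll described above can be sketched as follows. The command descriptor block layout used here (operation code 0x1C, PCV bit, page code, two-byte allocation length, control byte) and the Enclosure Status page code 0x02 reflect the author of this sketch's understanding of the SCSI and SES specifications and should be checked against them. The low five bits of byte 1 are assumed to be the summary bits, and `transport` is a hypothetical callable standing in for whatever interface actually delivers the command to the device.

```python
import struct
from typing import Callable, Optional

RECEIVE_DIAGNOSTIC_RESULTS = 0x1C  # SCSI operation code
ENCLOSURE_STATUS_PAGE = 0x02       # SES page code (to be verified against SES)
PCV = 0x01                         # "page code valid" bit in CDB byte 1
SUMMARY_MASK = 0x1F                # assumed: five summary bits in page byte 1

Transport = Callable[[bytes, int], bytes]  # (cdb, alloc_len) -> returned data


def receive_diag_cdb(page_code: int, alloc_len: int) -> bytes:
    """Build the 6-byte CDB: opcode, PCV, page code, allocation length, control."""
    return struct.pack(">BBBHB", RECEIVE_DIAGNOSTIC_RESULTS, PCV, page_code, alloc_len, 0)


def poll_enclosure(transport: Transport, full_len: int = 0xFFFF) -> Optional[bytes]:
    """Return the full status page only if a summary bit is set, else None."""
    # Short request: page code byte plus the one summary byte.
    short = transport(receive_diag_cdb(ENCLOSURE_STATUS_PAGE, 2), 2)
    if short[1] & SUMMARY_MASK == 0:
        return None
    # Something needs attention: reissue with a larger allocation length.
    return transport(receive_diag_cdb(ENCLOSURE_STATUS_PAGE, full_len), full_len)
```

The short request keeps routine polling cheap on the shared loop; the larger transfer is paid for only when the summary byte indicates a condition worth reading in full.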

【０１３５】e) ION Enclosure Management
FIG. 7 shows the relationship between the ION enclosure management modules and the ION physical disk driver architecture 500. This subsystem consists of two components: the SES event monitor 702 and the SCC2+ to SES gasket 704. The SES event monitor 702 is responsible for monitoring the service processes of all attached enclosures and, when a status changes, for reporting it via the event logging subsystem. This report can be forwarded to the management service layer 706 if necessary. The SCC2+ to SES gasket 704 is responsible for taking SCC2+ commands coming from configuration and maintenance applications and translating them into one or more SES commands for the enclosure service process. This removes the need for the application client to know the specifics of the JBOD configuration.

【０１３６】(1) SES Event Monitor
The SES event monitor 702 reports changes in the enclosure 222 service process status back to the management service layer 706. Status information is reported via the event logging subsystem. The SES event monitor 702 periodically polls each enclosure process by executing a RECEIVE DIAGNOSTIC RESULTS command requesting the enclosure status page. The RECEIVE DIAGNOSTIC RESULTS command is sent via the SCSLib interface 514 provided by the ION physical device disk driver 500. The statuses that may be reported include the status items listed in Table VIII below.

【０１３７】[0137]

【表８】 [Table 8]

【０１３８】On startup, the SES event monitor 702 reads the status of each element 402-424 contained in the enclosure. This status is the current status. When a status change is detected, each status that has changed from the current status is reported back to the management service layer 706. The new status then becomes the current status. For example, if the current status of a fan element is OK and a status change now reports the element as fan failed, an event is reported indicating a fan failure. If another status change then reports the element as not installed, an event is reported indicating that the fan has been removed from the enclosure. If yet another status change reports the fan element as OK, an event is generated indicating that a fan has been hot-plugged and is working properly.
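The change-detection step described above amounts to comparing each poll against the stored current status. A minimal sketch follows; the function name `diff_status`, the element keys, and the status strings are assumptions chosen to mirror the fan example in the text.

```python
from typing import Dict, List, Optional, Tuple

Event = Tuple[str, Optional[str], str]  # (element, old_status, new_status)


def diff_status(current: Dict[str, str], polled: Dict[str, str]) -> List[Event]:
    """Report every element whose status changed, then adopt the new status."""
    events = [
        (element, current.get(element), status)
        for element, status in polled.items()
        if current.get(element) != status
    ]
    current.update(polled)  # the new status becomes the current status
    return events
```

Feeding the fan example through successive polls (OK, then failed, then not installed, then OK again) produces one event per transition and none when a poll repeats the current status.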

【０１３９】(a) Start-of-Day Processing
The SES event monitor 702 is started after the ION physical disk driver 500 has initialized successfully. After starting, the SES event monitor 702 reads the JBOD and SCSI configuration module 516 to find the correlation between disk devices and enclosure service devices, and how the devices are addressed. Next, the status of each enclosure status device is read. Events are then generated for all error conditions and missing elements. When these steps are complete, that status becomes the current status and polling begins.

【０１４０】(2) SCC2+ to SES Gasket
SCC2+ is the protocol used by the ION 212 to configure and manage virtual and physical devices. The plus (+) in SCC2+ denotes the additions to SCC2 that allow full management of the ION 212 devices and components and a consistent mapping of the SCC2-defined commands to SES.

【０１４１】The service layer 706 addresses the JBOD enclosure 222 elements through SCC2 maintenance in and maintenance out commands. The following sections describe the service actions that provide the mechanisms for configuring, controlling, and reporting the status of components. Each of these commands is implemented on the ION 212 as a series of send diagnostic and receive diagnostic results SCSI commands.

【０１４２】Components are configured using the following service actions.

【０１４３】Component Device Addition: The component device addition command is used to configure component devices in the system and to define their LUN addresses. The ION 212 assigns each LUN address based on the component's position in the SES configuration page. This command is followed by a component device report service action to obtain the result of the LUN assignment.

【０１４４】Component Device Status Report: The component device status report service action is a vendor-specific command for retrieving the complete status of a component device. SES provides four bytes of status for each element type. This new command is required because the status report and component device report service actions allocate only one byte for status information, and the defined status codes conflict with those defined in the SES standard.

【０１４５】Component Device Connection: The component device connection service action requests that one or more units be logically connected to the specified component device. This command may be used to establish a logical association between a volume set and the component devices on which it depends, such as fans and power supplies.

【０１４６】Component Device Replacement: The component device replacement service action requests that one component device be replaced with another.

【０１４７】Component Device Removal: The peripheral device/component device removal service action requests the removal of a peripheral or component device from the system configuration. If a component device that has logical units connected to it is removed, the command terminates with a check condition status. The sense key is illegal request, with an additional sense qualifier of remove of logical unit failed.

【０１４８】Component status and other information are obtained with the following service actions.

【０１４９】Component Status Report: The component device status report service action is a vendor-specific command for retrieving complete status information about a component device. SES provides four bytes of status for each element type. The status report and component device report service actions allocate only one byte for status information, and the defined status codes conflict with those defined in the SES standard. This new command is therefore required.

【０１５０】Status Report: The status report service action requests status information for the selected logical units. One or more statuses are returned for each logical unit.

【０１５１】Component Device Report: The component device report service action requests information about the component devices in the JBOD. An ordered list of LUN descriptors is returned, reporting the LUN address, component type, and overall status. This command is used during the initial configuration process to determine the LUN addresses assigned by the component device addition service action.

【０１５２】Component Device Connection Report: The component device connection report service action requests information about the logical units connected to the specified component device. A list of component device descriptors is returned, each containing a list of LUN descriptors. The LUN descriptors specify the type and LUN address of each logical unit connected to the corresponding component.

【０１５３】Component Device Identifier Report: The component device identifier report service action requests the location of the specified component device. An ASCII value indicating the component's location is returned. This value must have been set beforehand by the component device identifier setting service action.

【０１５４】Components are managed with the following.

【０１５５】Component Device Command: The component device command is used to send control instructions, such as power on or off, to a component device. The actions that can be applied to a particular device depend on the component type and are vendor specific.

【０１５６】Component Device Suspension: The component device suspension service action places the specified component in a suspended (inactive) state.

【０１５７】C. Interconnect Fabric
1. Overview
To permit the movement of large amounts of data, the fabric-attached storage model of the present invention must address I/O performance concerns arising from data copies and interrupt processing costs. The present invention addresses data copy, interrupt, and flow control issues with a unique combination of methods. Unlike the destination-based addressing model used by most networks, the present invention uses a sender-based addressing model in which the sender selects the target buffer at the destination before the data is sent over the fabric. In a sender-based model, the destination gives the sender a list of destination addresses to which messages can be sent before any message is sent. To send a message, the sender first selects a destination buffer from this list. This is possible because the target-side application has already given the addresses of these buffers to the OS for use by the target network hardware, so the network hardware has enough information to transfer the data by a DMA operation directly into the correct target buffer without a copy.

【０１５８】Although beneficial in some respects, sender-based addressing also raises several problems. First, sender-based addressing extends the protection domain across the fabric from the destination to include the sender, which results in a general loss of isolation and raises data security and integrity concerns. Pure sender-based addressing releases memory addresses to the sender and requires the destination to trust the sender, which is a major issue in a high-availability system. For example, suppose a destination node has given a list of destination addresses to a sender. Before the sender has used all of these addresses, the destination node crashes and reboots. The sender now holds a set of buffer addresses that are no longer valid, and the destination may be using those addresses for a different purpose. A message sent to any of them could have serious consequences, because critical data could be destroyed at the destination.

【０１５９】Second, implementing sender-based addressing requires the cooperation of the network to extract the destination address from the message before it can start the DMA of the data, and most network interfaces are not designed to operate this way.

【０１６０】What is needed is an addressing model that retains the advantages of the sender-based model while avoiding its problems. The present invention solves this problem with a hybrid addressing model that uses a unique "Put It There" (PIT) protocol over an interconnect fabric based on BYNET.

【０１６１】2. BYNET and the BYNET Interface
BYNET has three important attributes that are useful in implementing the present invention.

【０１６２】First, BYNET is inherently scalable: additional connections and bandwidth can be added easily and are immediately available to all entities in the system. This contrasts with bus-oriented interconnect technologies, which do not add bandwidth when connections are added. Compared with other interconnects, BYNET scales not only in fan-out (the number of ports available in a single fabric) but also has a bisection bandwidth that grows with fan-out.

【０１６３】Second, BYNET can be extended by software into an active message interconnect: under the direction of its users (the computation resources 102 and the storage resources 104), BYNET can move data between nodes with minimal disruption to their operations. It uses DMA to move data directly to predetermined memory addresses, avoiding unnecessary interrupts and internal data copies. This basic technique can be extended to optimize the movement of smaller data blocks by multiplexing them into one larger interconnect message. Each individual data block can be processed using a modified DMA-based technique, which optimizes use of the interconnect without sacrificing the operational efficiency of the node.

【０１６４】Third, because BYNET can be configured to provide multiple fabrics, the interconnect can be optimized further with traffic shaping. This is essentially a mechanism provided by the BYNET software for assigning certain interconnect channels (fabrics) to certain kinds of traffic, for example to reduce the interference that random combinations of long and short messages cause on heavily used shared channels. Traffic shaping is enabled by BYNET and can be selected by the user for predictable traffic patterns.
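As a rough illustration of traffic shaping, the sketch below assigns short and long messages to separate fabrics so that they do not interfere on a shared channel. The fabric names, the `select_fabric` function, and the 4096-byte threshold are hypothetical; the text does not say how BYNET software classifies traffic.

```python
# Hypothetical fabric names and size threshold; the source specifies neither.
SHORT_FABRIC = "fabric-0"
LONG_FABRIC = "fabric-1"


def select_fabric(message_len: int, threshold: int = 4096) -> str:
    """Route short and long messages onto separate fabrics."""
    if message_len < 0:
        raise ValueError("message length must be non-negative")
    return SHORT_FABRIC if message_len < threshold else LONG_FABRIC
```

A policy based on message type rather than length would follow the same pattern with a different classification key.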

【０１６５】FIG. 8 is a diagram of BYNET and its host-side interface 802. The BYNET host-side interface 802 includes a processor 804 that executes channel programs whenever a circuit is created. Channel programs are executed by this processor 804 at both the sending interface 806 and the destination interface 808 of each node. The sending interface 806 hardware executes a channel program, created on the down call, that controls the creation of the circuit, the transmission of the data, and the eventual shutdown of the circuit. The destination interface 808 hardware executes a channel program to deliver the data into memory at the destination and then completes the circuit.

【０１６６】BYNET comprises a network that interconnects the compute nodes 200 and the IONs 212, which operate as processors within the network. BYNET comprises a plurality of switch nodes 810 with input/output ports 814. The switch nodes 810 are arranged into at least g(log_b N) switch node stages 812, where b is the total number of switch node input/output ports, N is the total number of network input/output ports 816, and g(x) is a ceiling function that gives the smallest integer not less than the argument x. The switch nodes 810 therefore provide multiple paths between any network input port 816 and any network output port 816, which improves fault tolerance and reduces contention. BYNET also comprises a plurality of bounce points on the bounce surface 818 along the highest switch node stage of the network, which manage message transmission throughout the network. The bounce points logically distinguish the switch nodes 810 that load-balance messages through the network from the switch nodes 810 that direct messages to the receiving processors.
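The minimum stage count g(log_b N) can be computed as in the following sketch. The function name `min_switch_stages` is an assumption for illustration. The loop uses integer multiplication rather than a floating-point logarithm so that exact powers of b are not misrounded.

```python
def min_switch_stages(b: int, n: int) -> int:
    """Return the smallest integer s such that b**s >= n.

    This equals ceil(log_b(n)), the minimum number of switch node stages
    for switch nodes with b ports and a network with n ports.
    """
    if b < 2 or n < 1:
        raise ValueError("require b >= 2 and n >= 1")
    stages, reach = 0, 1
    while reach < n:
        reach *= b  # each added stage multiplies the reachable ports by b
        stages += 1
    return stages
```

For example, with b = 8, a network of 64 ports needs at least two stages, and a network of 65 ports needs at least three.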

【０１６７】Processors such as the compute nodes 200 and the IONs 212 can be partitioned into one or more superclusters, each consisting of a logically independent, predefined subset of processors. Communication between processors can be point-to-point or multicast. In multicast mode, a single processor can broadcast a message to all other processors or to a supercluster. Multicast commands within different superclusters can occur simultaneously. The sending processor issues its multicast command, which propagates through the forward channel to all of the processors or the group of processors. Multicast messages are steered to a particular bounce point on the bounce surface 818, from which they are routed to the processors in the supercluster. Because only one multicast message is allowed through a particular bounce point at a time, this prevents deadlock in the network and keeps multicast messages to different superclusters from interfering with one another. Processors that receive a multicast message reply to it, for example by transmitting their current status through the return channel. BYNET can combine the replies in various ways.

【０１６８】BYNET currently supports two basic types of messages: in-band messages and out-of-band messages. A BYNET in-band message delivers the message into a kernel buffer (or buffers) in the memory of the destination host, completes the circuit, and posts an up-call interrupt. With a BYNET out-of-band message, the header data in the circuit message causes the interrupt handler in the BYNET driver to create the channel program that is used to process the rest of the circuit data being received. For both message types, the success or failure of the channel program is returned to the sender in a small message on the BYNET return channel. This return channel message is processed as part of the circuit shutdown operation by the sender's channel program. (The return channel is the constant-bandwidth return path of a BYNET circuit.) After the circuit is shut down, an up-call interrupt is (optionally) posted at the destination to signal the arrival of a new message.

【０１６９】The use of BYNET out-of-band messages is not an optimal configuration, because the sender must wait for the channel program to be created and then executed. BYNET in-band messages do not allow the sender to target the application buffer directly and therefore require a data copy. To resolve this problem, the present invention uses the BYNET hardware in a unique way. Instead of having the destination interface 808 create the channel program needed to process the data, the sending interface 806 creates both the sender-side and the destination-side channel programs. The sender-side channel program transfers, as part of the message, a very small channel program that the destination side executes. This channel program describes how the destination side is to move the data into the specified destination buffer of the target application thread. Because the sender knows the destination thread to which the message is to be delivered, this technique lets the sender control both how and where a message is delivered, and it avoids most of the cost of conventional up-call processing at the destination. This form of BYNET message is called a directed band message. Unlike the active messages used in the active message interprocess communication model (which contain the data and a small message-handling routine used to process the message at the destination), the present invention uses BYNET directed band messages, in which the BYNET I/O processor executes the simple channel program. With active messages, the host CPU usually executes the active message handler.

【０１７０】The use of the return channel allows the sending interface to suppress the conventional interrupt method of signaling message delivery completion. For both out-of-band and directed band messages, an indication of success at the sender indicates only that the message has been reliably delivered into the destination's memory.

【０１７１】This guarantees that the message has been reliably moved into the memory space of the destination node, but it does not guarantee that the destination application has processed the message. For example, the destination node may have a functional memory system while the destination application thread has a failure that permanently prevents the message from being processed. To handle messages reliably, the present invention independently employs several methods to detect and correct message processing failures. In the communication protocol of the present invention, timeouts are used at the sender to detect lost messages. Messages are retransmitted as required, and recovery operations are started if software or hardware failures are detected.
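The sender-side timeout and retransmission logic can be sketched as a simple retry loop. The function and parameter names (`send_with_retry`, `send`, `wait_for_reply`) and the retry limit are assumptions for illustration; the text does not give timeout values or retry counts. Raising `TimeoutError` stands in for handing control to the recovery procedure.

```python
def send_with_retry(send, wait_for_reply, max_attempts=3, timeout=0.5):
    """Send a message, retransmitting when no reply arrives in time.

    send:            callable that transmits the message once.
    wait_for_reply:  callable taking a timeout in seconds and returning
                     True if the destination confirmed processing.
    Returns the number of attempts used.
    """
    for attempt in range(1, max_attempts + 1):
        send()
        if wait_for_reply(timeout):
            return attempt
    # No confirmation after all attempts: treat as a failure to recover from.
    raise TimeoutError(f"no reply after {max_attempts} attempts")
```

Because a retransmitted message may reach a destination that already processed the original, the receiver needs a way to recognize replays.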

【０１７２】Even with directed band messages, the present invention must provide a mechanism that allows message delivery to a specific target at the destination and gives the sender enough data to send a message to the correct target application thread buffer. The present invention accomplishes this with a ticket-based authentication scheme. A ticket is a data structure that cannot be forged and that grants rights to its holder. In essence, a ticket is a one-time permission or right to use a certain resource. In the present invention, the ION 212 can control the distribution of service to the compute nodes 200 through the distribution of tickets. In addition, a ticket specifies a specific target, which is a prerequisite for implementing a sender-based flow control model.

【０１７３】D. The "Put It There" (PIT) Protocol
1. Overview
The PIT protocol is a ticket-based authentication scheme in which the ticket and the data payload are transmitted in an active message using the BYNET directed band message protocol. The PIT protocol is a unique combination of ticket-based authentication, sender-based addressing, debit/credit flow control, zero memory copy, and active messages.

【０１７４】2. PIT Messages
FIG. 9 shows the basic features of a PIT message or packet 901, which contains a PIT header 902 followed by payload data 904. The PIT header 902 includes a PIT ID 906, which represents an abstraction of the target data buffer and is a limited-life ticket that represents the right to access a buffer of a specified size. The element that holds the PIT ID 906 is the one that has the right to use the buffer, and the PIT ID 906 must be relinquished when the PIT buffer is used. When a destination receives a PIT message, the PIT ID 906 in the PIT header specifies the target buffer to the BYNET hardware, and the payload is moved into that buffer by a DMA operation.

【０１７５】Flow control under the PIT protocol is a debit/credit model using sender-based addressing. Sending a PIT message represents a flow control debit to the sender and a flow control credit to the destination. In other words, if a device sends a PIT ID 906 to a thread, that thread is credited with a PIT buffer in the address space. If the device returns a PIT ID 906 to its sender, the device is either giving up its rights or freeing the buffer specified by the PIT ID 906. When a device sends a message to a destination buffer abstracted by a PIT ID 906, the device also gives up its rights to that PIT buffer. When a device receives a PIT ID 906, it is a credit for a PIT buffer in the address space of the sender (unless the PIT ID 906 is the device's own PIT ID 906 being returned).

【０１７６】At the top of the header 902 is the BYNET channel program 908 (sender side and destination side) that processes the PIT packet 901. Next are two fields for transmitting PIT ID tickets: the credit field 910 and the debit field 912. The debit field 912 contains the PIT ID 906 to which the payload data is transferred by the destination network interface via the channel program. It is called the debit field because the PIT ID 906 is a debit for the sending application thread (and a credit at the destination thread). The credit field 910 is where the sending thread transfers, or credits, a PIT buffer to the destination thread. The credit field 910 typically holds the PIT ID 906 to which the sending thread expects a reply message to be sent. This use of a credit PIT is called a SASE (self-addressed stamped envelope) PIT. The command field 914 describes the operation that the target is to perform on the payload data (for example, a disk read or write command). The argument field 916 holds data related to the command (for example, the disk and block number on the disk at which to perform the read or write operation). The sequence number 918 is a monotonically increasing integer that is unique for each source and destination node pair. (Each pair of nodes has one sequence number for each direction.) The length field 920 specifies the length of the PIT payload data in bytes. The flags field 922 contains various flags that modify the processing of the PIT message. One example is the duplicate message flag, which is used when a potentially lost message is retransmitted, to prevent an event from being processed more than once.
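The header fields listed above can be modelled as a small record, together with one possible way for a receiver to use the sequence number 918 and the duplicate message flag. This is a sketch under stated assumptions: the text gives no field widths or encodings, so plain Python types are used, and `FLAG_DUPLICATE`, `PitHeader`, `Receiver`, and the replay rule itself are hypothetical.

```python
from dataclasses import dataclass

FLAG_DUPLICATE = 0x1  # hypothetical bit for the duplicate message flag


@dataclass(frozen=True)
class PitHeader:
    channel_program: bytes  # BYNET channel program 908
    credit_pit_id: int      # credit field 910 (e.g. a SASE PIT for the reply)
    debit_pit_id: int       # debit field 912 (target buffer for the payload)
    command: str            # command field 914
    arguments: tuple        # argument field 916
    sequence: int           # sequence number 918, per node pair and direction
    length: int             # length field 920, payload bytes
    flags: int = 0          # flags field 922


class Receiver:
    """Drops retransmissions of messages that were already processed."""

    def __init__(self):
        self._last_seq = {}  # source node -> highest sequence processed

    def accept(self, source, header):
        """Return True if the message should be processed."""
        seen = self._last_seq.get(source, -1)
        if header.flags & FLAG_DUPLICATE and header.sequence <= seen:
            return False  # replay of a message that did arrive earlier
        self._last_seq[source] = max(seen, header.sequence)
        return True
```

A retransmission whose original was genuinely lost carries a sequence number the receiver has not yet seen, so it is processed normally.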

【０１７７】When the system first starts up, no node has PIT IDs 906 for any other node. The BYNET software driver prevents the delivery of any directed band messages until the PIT first open protocol has completed. The distribution of PIT IDs 906 begins when an application thread on a compute node 200 first opens any virtual disk device located on an ION 212. During the first open, the ION 212 and the compute node 200 enter a negotiation stage in which operating parameters are exchanged. Part of the first open protocol is the exchange of PIT IDs 906. A PIT ID 906 can designate more than one buffer, because the interface supports both gather DMA at the sender and scatter DMA at the destination. The application is free to distribute the PIT IDs 906 to any application on any other node.

[0178] The size and number of the PIT buffers exchanged between the compute node 200 and the ION 212 are tunable values. The exchange of debit and credit PIT IDs 906 (those in debit field 912 and credit field 910) forms the basis of the system's flow control model. A sender can send only as many messages to a destination as it holds credited PIT IDs 906. This bounds the number of messages a given host can send. It also ensures fairness, because each node has its own pool of PIT IDs 906 and each sender can consume at most the PIT IDs 906 assigned to it.
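The sender-side rule, that a message may be sent only while a credited PIT ID is held, can be sketched as follows. This is a minimal model under stated assumptions: the class name, the method names, and the use of plain integers as PIT IDs are hypothetical and chosen only for illustration.

```python
from collections import deque


class CreditedSender:
    """Sender that may transmit only while it holds credited PIT IDs."""

    def __init__(self, initial_pit_ids):
        self.credits = deque(initial_pit_ids)  # PIT IDs credited by the destination
        self.backlog = deque()                 # requests waiting for a credit
        self.sent = []                         # (debit PIT ID, request) pairs, for inspection

    def submit(self, request):
        self.backlog.append(request)
        self._drain()

    def on_completion(self, credit_pit_id=None):
        """A completion may or may not carry a fresh credit."""
        if credit_pit_id is not None:
            self.credits.append(credit_pit_id)
        self._drain()

    def _drain(self):
        # Each send consumes one credit, which becomes the debit PIT ID.
        while self.backlog and self.credits:
            self.sent.append((self.credits.popleft(), self.backlog.popleft()))
```

With two initial credits and three submitted requests, the third request waits in the backlog until a completion returns a credit.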

[0179] The ION 212 manages the pool of PIT tickets it has issued to the compute nodes 200. The initial allocation of PIT IDs 906 to a compute node 200 takes place during the first open protocol. The number of PIT IDs 906 distributed is based on an estimate of the number of concurrently active compute nodes 200 using the ION 212 at one time and on the memory resources in the ION 212. Because this is only an estimate, the size of the PIT pool is also adjusted dynamically by the ION 212 during operation. This redistribution of PIT resources is necessary to ensure fairness in serving requests from multiple compute nodes 200.

[0180] PIT reallocation for active compute nodes 200 proceeds as follows. Because an active compute node 200 is constantly making I/O requests, PIT resources are redistributed by controlling the flow of PIT credits in completed I/O messages. Until the proper level is reached, PIT credits are not sent with ION 212 completions (which shrinks the PIT pool of that compute node 200). The situation is more complicated for compute nodes 200 that already hold a PIT allocation but are inactive (and so tie up the resources). In this case, the ION 212 can send each inactive compute node 200 a message invalidating a PIT (or a list of PIT IDs). If an inactive compute node 200 does not respond, the ION 212 may invalidate all of that node's PIT IDs and then redistribute them to other compute nodes 200. When an inactive compute node 200 attempts to use a reallocated PIT, that compute node 200 is forced back into the first open protocol.

[0181] Increasing the PIT allocation to a compute node 200 is done as follows. A PIT allocation message can be used to send newly allocated PIT IDs to any compute node. An alternative technique is to send more than one PIT credit with each I/O completion message.
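The two adjustment techniques for active nodes, withholding credits to shrink an allocation and sending more than one credit per completion to grow it, can be combined into one decision function. The sketch below is an assumption-laden illustration: the function names, the notion of a numeric "target" allocation, and the choice of exactly two credits when growing are not taken from this description.

```python
def credits_to_return(outstanding, target):
    """How many PIT credits to attach to one I/O completion.

    outstanding: PIT IDs the compute node holds, counting the one just used.
    target: allocation the ION currently wants this node to have.
    """
    after_completion = outstanding - 1   # the completed request consumed one PIT
    if after_completion >= target:
        return 0                         # shrink: withhold the credit
    if after_completion + 1 < target:
        return 2                         # grow: send an extra credit (assumed value)
    return 1                             # steady state: replace what was used


def rebalance(outstanding, target, completions):
    """Apply the policy over several completions and return the resulting pool size."""
    for _ in range(completions):
        outstanding += credits_to_return(outstanding, target) - 1
    return outstanding
```

Because the adjustment rides on completions, it only works for nodes that are actively issuing I/O; inactive nodes need the explicit invalidation message described above.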

[0182] 3. How the PIT Protocol Works: Disk Reads and Writes

To illustrate the PIT protocol, consider the case in which a compute node 200 requests a read operation on a storage disk 224 from the ION 212. It is assumed here that the first open has already taken place and that a sufficient number of free PIT buffers exist on both the compute node 200 and the ION 212. An application thread performs a read system call, passing the address of the buffer into which the disk data is to be transferred to the compute node high-level SCSI driver (the CN system driver). The CN system driver creates a PIT packet containing this request (including the virtual disk name, block number, and data length). The upper half of the CN system driver fills in the debit and credit PIT ID fields 912, 910. The debit PIT field 912 holds the PIT ID 906 on the destination ION 212 to which the read request is sent. Because this is a read request, the ION 212 needs a way to designate the application's buffer (the one provided as part of the read system call) when it creates the I/O completion packet. Because PIT packets use sender-based addressing, the ION 212 can address this application buffer only if it holds a PIT ID 906 for it. Since the application buffer is not part of the normal PIT pool, the buffer is pinned in memory and a PIT ID 906 is created for it. Since a read request also requires a return status from the disk operation, a scatter buffer for the PIT is created to hold the return status. This SASE PIT is sent in the credit field as part of the read PIT packet. The PIT packet is then placed on the output queue.

When the BYNET interface 802 sends the PIT packet, it moves the packet from the sender side by a DMA operation and transfers it across the interconnect fabric 106. At the destination-side BYNET interface 808, the arrival of the PIT packet triggers execution of the PIT channel program by the BYNET channel processor 804. The BYNET channel processor 804 in the host-side interface 802 extracts the debit PIT ID 906 and locates the endpoint on the ION 212. The channel program extracts the buffer address and programs the interface DMA engine to move the payload data directly into the PIT buffer, which allows the PIT protocol to provide zero-copy semantics. The BYNET interface 802 posts an interrupt to the receiving application on the ION 212. No interrupt occurs on the compute node 200.

If the reply channel message indicates that the transfer failed, the I/O is retried, depending on the reason for the failure. After several attempts, an ION error state is entered (see the details of ION 212 recovery and failover operations in this document), and the compute node 200 attempts to have the buddy ION 214 in the dipole handle the request. Once the message has been reliably delivered into the destination node's memory, the host side sets a retransmission timeout (longer than the worst-case I/O service time) to ensure that the ION 212 successfully processes the message. When this timer expires, the compute node 200 retransmits the PIT message to the ION 212. If the I/O is still in progress, the duplicate request is simply dropped; otherwise, the retransmitted request is processed normally. Optionally, the protocol may require an explicit acknowledgment of the retransmitted request in order to reset the expiration timer and to spare the application the harm of a failed I/O.
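The handling of a retransmitted request on the ION side (drop it if the original I/O is still in progress, otherwise process it normally) can be sketched as below. The tracker class, the use of a (source, sequence number) pair as the lookup key, and the string return values are assumptions made for illustration; the description does not say how the ION identifies an in-progress I/O.

```python
class IonRequestTracker:
    """Tracks in-progress I/Os so that retransmitted duplicates can be dropped."""

    def __init__(self):
        self.in_progress = set()   # (source, sequence) of I/Os still running
        self.started = []          # log of I/Os actually started

    def receive(self, source, sequence, is_duplicate=False):
        key = (source, sequence)
        if is_duplicate and key in self.in_progress:
            return "dropped"       # the original is still being serviced
        self.in_progress.add(key)
        self.started.append(key)
        return "started"           # new request, or retransmission of a lost one

    def complete(self, source, sequence):
        self.in_progress.discard((source, sequence))
```

A retransmission whose original never arrived is indistinguishable here from a new request, which matches the "processed normally" case in the text.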

[0183] FIG. 10 is a block diagram of the ION 212 functional modules. The inputs to the IONs 212 and 214 are data lines 1002 and 1004 and control line 1006. Each module in the ION 212 comprises a control module 1008 that communicates with control line 1006. The control module 1008 accepts commands from data line 1002 and provides module control functions. A system function module 1010 implements the ION functions described in this document. The IONs 212 and 214 comprise a fabric module 1020, a cache module 1014, a data restoration module 1016, and a storage module 1018. Each of these modules comprises a control module, a work injector 1020 for inserting data on and retrieving data from the data lines 1002 and 1004, and a data fence 1022 for inhibiting the passage of data.

[0184] After a PIT read request is sent to the ION 212, it is transferred to the work injector of the ION cache module 1014. The work injector inserts the request into the ION cache module 1014, which returns the data directly if it is cached; otherwise it allocates a buffer for the data and passes the request to the ION storage module 1018. The ION storage module 1018 translates this request into one (or more) physical disk requests and sends the requests to the appropriate disk drives 224. When the disk read operation completes, the disk controller posts an interrupt to signal completion of the disk read.

The ION work injector creates an I/O completion PIT packet. The debit PIT ID (stored in debit field 912) is the credit PIT ID (stored in credit field 910) from the SASE PIT of the read request (the place where the application wants the disk data put). The credit PIT ID is either the same PIT ID to which the compute node 200 sent this request or, if that buffer is not free, a replacement PIT ID. This credit PIT gives the compute node credit for sending a future request (because the current PIT request has just completed, it increases the queue depth of this compute node 200 to this ION 212 by one). There are three reasons why the ION 212 may not return a PIT credit after processing a PIT. First, the ION 212 may want to reduce the number of outstanding requests queued from that compute node 200. Second, the ION 212 may want to redistribute the PIT credit to another compute node 200. Third, a single PIT packet may contain multiple requests (see the discussion of super PIT packets in this document). The command field 914 is a read complete message, and the argument is the return code from the disk drive read operation.

This PIT packet is queued to the BYNET interface 702 and sent back to the compute node 200. The BYNET hardware moves the PIT packet to the compute node 200 by DMA. This triggers the compute node 200 BYNET channel program to extract the debit PIT ID 912 and validate it before starting the DMA into the target PIT buffer (in this case the application's pinned buffer). When the DMA completes, the compute node BYNET hardware raises an interrupt to notify the application that the disk read has completed. On the ION 212, the BYNET driver returns the buffer to the cache system.
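The way the completion packet reuses the request's fields can be made concrete with a short sketch. The names `PitPacket` and `build_read_completion`, the keyword parameters, and the use of `None` to mean "no credit returned" are hypothetical; only the field roles (debit field 912, credit field 910, command field 914, argument field 916) come from the description.

```python
from dataclasses import dataclass
from typing import Optional


@dataclass
class PitPacket:
    debit_pit_id: int             # debit field 912: buffer the packet lands in
    credit_pit_id: Optional[int]  # credit field 910: PIT ID granted to the receiver
    command: str                  # command field 914
    argument: object              # argument field 916


def build_read_completion(request, return_code, request_buffer_free=True,
                          replacement_pit_id=None, withhold_credit=False):
    """Build the I/O completion for a read request (illustrative only)."""
    if withhold_credit:
        credit = None                   # e.g. shrinking or redistributing credits
    elif request_buffer_free:
        credit = request.debit_pit_id   # same PIT ID the request was sent to
    else:
        credit = replacement_pit_id     # substitute when that buffer is busy
    return PitPacket(
        debit_pit_id=request.credit_pit_id,  # the request's SASE PIT
        credit_pit_id=credit,
        command="read_complete",
        argument=return_code,
    )
```

The swap is the essential point: the SASE PIT carried in the request's credit field becomes the debit PIT of the reply, so the reply lands directly in the application's pinned buffer.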

[0185] A write request is handled in much the same way as a read. The application calls the compute node high-level driver, passing the address that contains the data, the virtual disk name, the disk block number, and the data length. The compute node high-level driver selects a PIT ID 906 on the destination ION 212 and uses this data to create a PIT write request. The SASE PIT contains only the return status of the write operation from the ION 212. At the ION 212, an interrupt is posted when the PIT packet arrives. The request is processed in the same way as a PIT read operation: the write request is passed to the cache routines, which later write the data to disk. When the disk write completes (or when the data is safely stored in the write caches of both ION nodes 212 and 214), an I/O completion message is sent back to the compute node 200. When the ION 212 is running with the write cache enabled, the I/O completion message is returned by the other ION 214 in the dipole, not by the ION 212 to which the request was sent. This is described further in the part of this document dealing with the Bermuda Triangle protocol.
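The three-legged path of a cached write (compute node to ION 212, ION 212 to ION 214, ION 214 back to the compute node) can be modelled in a few lines. This sketch only shows which node answers; the class, its methods, and the dictionary used as a completion message are assumptions, and the actual Bermuda Triangle protocol is described elsewhere in the document.

```python
class Ion:
    """Toy ION with a write cache and a dipole buddy."""

    def __init__(self, name):
        self.name = name
        self.write_cache = {}
        self.buddy = None

    def handle_write(self, block, data, log):
        """Receiving ION: cache locally, then forward the data to the buddy."""
        self.write_cache[block] = data
        log.append(f"{self.name}: cached block {block}")
        return self.buddy.mirror_write(block, data, log)

    def mirror_write(self, block, data, log):
        """Buddy ION: store the copy and issue the completion itself."""
        self.write_cache[block] = data
        log.append(f"{self.name}: mirrored block {block}")
        return {"from": self.name, "command": "write_complete", "status": 0}
```

Letting the buddy acknowledge means the compute node learns in a single message that both caches hold the data, instead of waiting for an extra acknowledgment to travel back through the first ION.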

[0186] 4. Stale PIT IDs and Recovery Issues

The exchange of PIT IDs during first open is the mechanism by which stale PIT IDs 906 created by a hardware or software failure are invalidated. Consider the situation in which an ION 212 and a compute node 200 have exchanged PIT IDs and the ION 212 suddenly crashes. A PIT ID 906 represents a target buffer pinned in memory, and unless they are invalidated, outstanding PIT IDs 906 for an ION 212 or a compute node 200 that has just rebooted could cause a serious software integrity problem, because they are no longer valid (that is, stale). The BYNET hardware and the directed band message support provide the basic mechanism for invalidating stale PIT IDs 906.

[0187] At the end of the first open protocol, each side must give the compute node high-level SCSI driver a list of the hosts to which PIT IDs 906 are distributed. Stated another way, the host gives the compute node high-level SCSI driver a list of hosts that will receive PIT packets. The compute node high-level driver uses this list to create a table that controls the delivery of directed band messages. The table specifies the ION 212 pairs that are allowed to send directed band messages to each other (the table can also specify one-way PIT message flows). The compute node high-level driver keeps this table internally on the host (as driver-private data) as part of the BYNET configuration process. The PIT protocol can add hosts to or remove hosts from this list at any time by a simple notification message to the compute node high-level driver. When a node fails, shuts down, or stops responding, the BYNET hardware detects this and notifies all other nodes on the fabric. The BYNET host driver on each node responds to this notification by deleting all references to that host from the directed band host table. This action invalidates all PIT IDs 906 that the host has distributed to other hosts. This is the key to protecting a node from previously distributed PIT packets. Until the compute node high-level driver on that host has been reconfigured, the BYNET delivers no messages to that host. Even after the first reconfiguration, the BYNET does not allow any directed band message to be sent to the newly restarted or reconfigured host until it is told to do so by the local PIT protocol. This protects against the delivery of stale PIT packets until the PIT protocol has been properly initialized through the first open protocol.
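The directed band host table and its role in invalidation can be sketched as a set of permitted (sender, receiver) pairs. The class and method names are hypothetical, and representing the table as a Python set is an assumption made purely for illustration.

```python
class DirectedBandTable:
    """Which hosts may exchange directed band messages (illustrative)."""

    def __init__(self):
        self._allowed = set()   # ordered (sender, receiver) pairs

    def allow(self, sender, receiver, both_ways=True):
        self._allowed.add((sender, receiver))
        if both_ways:
            self._allowed.add((receiver, sender))

    def node_failed(self, host):
        """Drop every reference to a failed host, which voids its PIT IDs."""
        self._allowed = {pair for pair in self._allowed if host not in pair}

    def may_send(self, sender, receiver):
        return (sender, receiver) in self._allowed
```

Storing ordered pairs allows the one-way PIT message flows mentioned above to be expressed directly. After `node_failed`, nothing can be sent to or from the host until `allow` is called again, which stands in here for re-running the first open protocol.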

[0188] When a host attempts to send a directed band message to an invalid host (using a now-invalidated PIT ID 906), the compute node high-level driver on the sending side rejects the message and returns an error condition to the sender. This rejection triggers a first open handshake between the two nodes. After the first open handshake completes, any I/O operations for the ION 212 that are still pending (from the point of view of the compute node 200) will have to be resent. However, unless this was a warm restart, the ION 212 is likely to have been down for a long time, so the pending I/O operations will probably have been restarted as part of failover processing and sent to the other ION 212 in the dipole (see the section on ION failure handling for details). If the crashed node was a compute node 200, the unexpected arrival at the ION 212 of a first open request from a compute node 200 that has already performed a first open triggers PIT ID recovery operations. The ION 212 invalidates all PIT IDs 906 credited to the compute node 200 (or, in practice, simply reissues the old ones). Any pending I/O operations for the compute node 200 are allowed to complete (although this is an unlikely event unless the node restart is extremely fast). The completion messages will be rejected, because the SASE PIT they use is stale (and the application thread that issued the I/O request no longer exists).

[0189] 5. Super PIT (SPIT): Improving Small I/O Performance

The PIT protocol can use ordinary SCSI commands. Because the core of the present invention is a communication network, not a storage network, the system can use network protocols to improve performance beyond what a storage model allows. The processing overhead of handling upcalls is a performance wall for workloads dominated by small I/O requests. There are several ways to improve small I/O performance. The first is to improve the path length of the interrupt handling code. The second is to collapse the vectoring of multiple interrupts into a single invocation of the interrupt handler, using techniques similar to those employed in device drivers. The third is to reduce the number of individual I/O operations by clustering (convoying) them into a single request. Nodes that must repackage incoming and outgoing data flows because the source and destination physical links have different MTU sizes tend to accumulate data. The problem is made worse by a speed mismatch between the sending and destination networks (especially when the destination network is slower). Such nodes are constantly subject to flow control from the destination. As a result, data flows out of the router in bursts. This is called a data convoy.

[0190] The present invention uses data convoys as a technique for reducing the number of upcall-generated interrupts in both the ION 212 and the compute node 200. By way of illustration, consider the data flow from the ION 212 to the compute node 200. In the debit/credit flow control model used by the present invention, I/O requests queue at both the compute node 200 and the ION 212. Queuing starts with the PIT packets stored in the ION 212; when these are exhausted, queuing continues back at the compute node 200. This is called an overflow condition. Overflow usually occurs when a node has more requests than it has PIT buffer credits. Each time an I/O completes, the ION 212 sends a completion message back to the compute node 200. This completion message usually includes a credit for the PIT buffer resource that was just released. This is the basis of debit/credit flow control. When the system is swamped with I/O requests, each I/O that completes at the ION 212 is immediately replaced by a new I/O request. Therefore, during periods of heavy load, I/O requests flow into the ION 212 one at a time and wait in the ION 212 queue for an unspecified period. Each of these requests generates an upcall interrupt, increasing the load on the ION 212.

[0191] This dual queue model has a number of advantages. The number of PIT buffers allocated to a compute node 200 is a careful tradeoff. Enough work should be queued locally at the ION 212 that new work can be dispatched immediately when a request completes. However, the memory resources consumed by requests queued on the ION 212 may be put to better use if they are assigned to the cache system. When the PIT queues on the ION 212 are kept short to conserve memory, performance suffers if the ION 212 runs out of work and has to wait for work to be sent from the compute nodes 200.

[0192] Super PIT is a feature of the PIT protocol designed to exploit the flow control of the debit/credit system at high loads in order to reduce the number of upcall interrupts. Super PIT improves the performance of OLTP and similar workloads dominated by high rates of relatively small I/Os. Instead of sending requests one at a time, a super PIT packet is a collection of I/O requests that are all delivered in a single, larger super PIT request. Each super PIT packet is transported in the same way as a normal PIT buffer. The individual I/O requests contained in the super PIT packet are extracted by the PIT work injector and inserted into the normal ION 212 queuing mechanism as ION 212 resources become available. These individual I/O requests can be either read or write requests.

[0193] The PIT work injector acts as a local proxy (on the ION 212) for application requests transported to the ION 212. The PIT work injector is also used by the RT-PIT and FRAG-PIT protocols discussed in later sections. When a super PIT has been drained of its individual requests, the resource is released to the compute node, and another super PIT packet can be sent to replace it. The number of super PIT packets allowed per host is determined during first open negotiation. Naturally, the amount of work queued on the ION 212 must be sufficient to keep the ION 212 busy until another super PIT packet can be delivered.

[0194] Consider the situation in which a compute node 200 has queued enough work at an ION 212 to exhaust its PIT credits and has begun to queue requests locally. The number of requests queued in a super PIT request is bounded only by the size of the buffer to which the super PIT is transported. Super PIT packets operate differently from normal PIT packets. In the control model of the present invention, a device can send a request (a debit) only if it holds a credit for the destination. Which particular PIT packet the device uses is of no concern, because the device is not targeting a specific application thread within the ION 212. PIT packets to the ION 212 merely regulate buffer utilization (and, as a side effect, flow control). The SASE PIT within a PIT request, in contrast, is different. The SASE PIT ID represents the address space of a particular thread within the compute node 200. Each request in a super PIT contains a SASE PIT, but when the I/O for that request completes, the I/O completion message that is created does not include a credit PIT. Only when the super PIT has been drained of all its requests is a credit PIT issued for its address space.
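The rule that individual completions inside a super PIT carry no credit, and that a single credit is issued only when the super PIT is fully drained, can be sketched as follows. The class name, the counter-based bookkeeping, and the use of `None` for "no credit" are assumptions for illustration.

```python
class SuperPitOnIon:
    """ION-side view of one received super PIT (illustrative)."""

    def __init__(self, super_pit_id, requests):
        self.super_pit_id = super_pit_id
        self.remaining = len(requests)   # individual I/Os not yet completed

    def complete_one(self):
        """Return the credit PIT ID to send with this completion, or None."""
        self.remaining -= 1
        return self.super_pit_id if self.remaining == 0 else None
```

For a super PIT carrying three requests, the first two completions return no credit and the third returns the credit for the whole super PIT buffer.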

[0195] A super PIT is created on a compute node 200 as follows. A super PIT can be created whenever at least two I/O requests to a single ION 212 are queued within the compute node 200. If the compute node 200 has already reached its limit of super PIT packets for this ION 212, the compute node 200 continues to queue requests until a super PIT is returned to it. The compute node then issues another super PIT message. Within the system driver, once queuing begins, a per-ION queue is required in order to create super PIT packets.
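The creation rule on the compute node (at least two queued requests for the same ION, subject to a per-ION limit on outstanding super PITs) can be sketched with per-ION queues. The class name, the parameter names, and the use of a request count to stand in for the transport buffer size are assumptions, not details from this description.

```python
from collections import defaultdict, deque


class SuperPitBatcher:
    """Compute-node side batching of queued requests into super PITs."""

    def __init__(self, super_pit_limit, max_requests_per_super_pit):
        self.limit = super_pit_limit                 # per-ION limit from first open
        self.capacity = max_requests_per_super_pit   # stands in for buffer size
        self.queues = defaultdict(deque)             # one queue per ION
        self.outstanding = defaultdict(int)          # super PITs in flight per ION

    def enqueue(self, ion, request):
        self.queues[ion].append(request)

    def try_send(self, ion):
        """Return a batch to send as one super PIT, or None if not possible."""
        queue = self.queues[ion]
        if len(queue) < 2 or self.outstanding[ion] >= self.limit:
            return None
        batch = [queue.popleft() for _ in range(min(len(queue), self.capacity))]
        self.outstanding[ion] += 1
        return batch

    def on_super_pit_returned(self, ion):
        self.outstanding[ion] -= 1
```

While the limit is reached, requests simply keep accumulating in the per-ION queue, so the next super PIT tends to be larger; this is the convoy effect the protocol relies on.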

[0196] As described above, super PIT messages can reduce the processing load on an ION 212 under workloads dominated by a large volume of small I/O requests. By increasing the average message size, super PIT messages improve the performance of the destination node and the utilization of the interconnect fabric. However, the super PIT concept can also be applied at the ION 212 to reduce the load that small I/O workloads place on the compute node 200. Creating super PIT messages on the ION 212 is a very different problem from creating them on the compute node 200. On the compute node 200, the application threads that create I/O requests are subject to flow control so that the ION 212 is not overwhelmed. The service rate of the disk subsystem is far lower than that of the rest of the ION 212 and is always the ultimate limit on ION 212 performance. Requests are blocked from entering the system until the ION 212 has sufficient resources to queue and service them. The point is that requests queue on the compute node (or the application is blocked) until resources become available on the ION 212.

Resource starvation is not an issue on the compute node 200. When an application on the compute node 200 submits an I/O request to the system, the request includes the compute node 200 memory resources needed to complete the I/O (the application thread buffer). Every I/O completion message that the ION 212 needs to send to the compute node 200 already has a PIT ID allocated to it (the SASE PIT ID). From the point of view of the ION 212, an I/O completion message already has its target buffer allocated and can be sent as soon as the data is ready. The I/O completion message succeeds once it is delivered (the ION 212 does not have to wait for the service time of a disk storage system at the compute node). Hence the ION 212 cannot be blocked by flow control pressure from a compute node. To create super PIT messages, the compute node takes advantage of flow control queuing, but the ION 212 does not have this option. Because the ION 212 has no resource to wait for other than access to the BYNET, it has far fewer opportunities to create super PIT messages.

【０１９７】ＩＯＮ２１２でスーパーＰＩＴメッセージ
を作成するために、いくつかの方法が利用できる。一つ
は、Ｉ／Ｏ完了要求をわずかに遅らせて、スーパーＰＩ
Ｔパケット作成の機会を増やすことである。わずかな遅
れの後、同じノードのために新しい完了メッセージが準
備されていなければ、そのメッセージは通常のＰＩＴメ
ッセージとして送られる。このテクニックの問題は、
（計算ノードのアップ・コール・オーバーヘッドを減ら
すために）スーパーＰＩＴ作成を期待して要求を遅らせ
る時間を取れば、それに対応して要求サービス時間の合
計が増加することである。最終的には計算ノード２００
の負荷を減らすことになるが、アプリケーションの速度
も遅くなってしまう。適応性のある遅延時間が有効であ
る（計算ノード２００に対する平均サービス時間と特定
の要求にかかった合計サービス時間によって決定）。二
つ目の方法は、一つ目をわずかに変化させたものであ
る。この場合、それぞれの計算ノード２００は、計算ノ
ードでの小Ｉ／Ｏの割合が上昇するにしたがって増加す
る遅延時間を、それぞれのＩＯＮ２１２に提供する必要
がある。これにより、必要な時に特定のＩＯＮ２１２が
スーパーＰＩＴメッセージを作成する機会を増やすこと
になる。三つ目の方法は、特定のトラフィックのタイ
プ、例えばキャッシュが直接サービスを行い、ストレー
ジ２２４のディスク作業を待たなくていい小さな読み出
し又は書き込みなどを遅らせることである。キャッシュ
は要求の一部に関して、ディスク・トラフィックを回避
して、平均Ｉ／Ｏ待ち時間を減らすが、待ち時間の配分
はキャッシュのヒットによって変化する。キャッシュの
ヒット要求における短い待ち行列遅延時間は、ディスク
作業を含むものに比べ、サービス時間の大きな増加には
つながらない。サービス時間の配分に敏感なアプリケー
ション（均一な反応時間が性能にとって重要なもの）で
は、ＩＯＮ２１２でのスーパーＰＩＴパケット作成のた
めのわずかな遅延によって、全体的なシステム性能が改
善される可能性がある。６．大型ブロックのサポートと断片ＰＩＴパケットデータベース・アプリケーションの性能要件は、データ
ベースのサイズに関係しない場合が多い。データベース
のサイズが大きくすれば、アプリケーション性能の浸食
を防ぐために、ディスクストレージを調べる速度も比例
して大きくしなくてはいけない。別の言い方をすれば、
サイズが大きくなるカスタマ・データベースにおいて、
一定の問い合わせに対する反応時間は一定に保たなくて
はいけない。この要件を満たす上で問題は、これが現在
のディスクドライブ・テクノロジの傾向と直接対立する
点にある。ディスクドライブの容量が増加する一方で、
ランダムＩ／Ｏ性能は変化していない。この傾向を緩和
する方法の一つは、ディスクデバイスの容量増加にした
がって、ディスクＩ／Ｏ作業の平均サイズを増加させる
ことである。ストレージ容量と性能要件の現在の傾向か
ら言って、平均Ｉ／Ｏサイズである２４ＫＢは、非常に
近い将来に１２８ＫＢに増加する可能性がある。より積
極的なキャッシュ使用と遅延書き込みテクニックが、多
くの仕事にとって有効になるかもしれない。ディスクド
ライブにおける不均一なテクノロジの成長だけが、Ｉ／
Ｏ要求サイズの増加を生み出しているわけではない。Ｂ
ＬＯＢＳ（バイナリ大型オブジェクト）を利用するデー
タベースが普及するにつれ、１ＭＢ以上のサイズに達す
るオブジェクトが一般的になりつつある。特定の原因と
は関係なく、システムには大きなＩ／Ｏオブジェクトを
サポートする必要が生じ、そのサイズはディスクストレ
ージの経済学に従っていくことになる。Several methods are available for creating super PIT messages at the ION 212. The first is to delay I/O completion requests slightly to increase the opportunity to create a super PIT packet. If, after a short delay, no new completion message for the same node is ready, the message is sent as a normal PIT message. The problem with this technique is that whatever time a request is delayed in the hope of creating a super PIT (to reduce up-call overhead at the compute node) adds correspondingly to the total request service time. The net effect is a reduced load on the compute node 200, but the application may also be slowed. An adaptive delay time would help here (determined by the average service time for a compute node 200 and the total service time accumulated by a specific request). The second method is a slight variation of the first. Here, each compute node 200 supplies each ION 212 with a delay time that increases as the rate of small I/O at the compute node increases. This increases the opportunity for a particular ION 212 to create super PIT messages when they are needed. The third method is to delay certain types of traffic, such as small reads or writes that are serviced directly by the cache and do not have to wait for a storage 224 disk operation. The cache reduces average I/O latency by avoiding disk traffic for some fraction of the requests, but the distribution of latencies is altered by cache hits. A small queuing delay for a cache-hit request does not increase service time significantly compared with a request that involves a disk operation. For applications that are sensitive to the service time distribution (where uniform response time is important to performance), a small delay to create super PIT packets at the ION 212 has the potential to improve overall system performance.

6. Large Block Support and Fragmented PIT Packets

Performance requirements for database applications are often independent of database size. As the database grows, the rate at which disk storage is examined must increase proportionately to prevent erosion of application performance. In other words, for a customer database that grows in size, the response time for a given query must remain constant. The difficulty in meeting this requirement is that it conflicts directly with the current trend in disk drive technology: disk drive capacity is increasing while random I/O performance remains constant. One way to mitigate this trend is to increase the average size of disk I/O operations as disk capacity increases. Given current trends in storage capacity and performance requirements, the average I/O size of 24 KB may increase to 128 KB in the very near future. More aggressive caching and delayed-write techniques may also prove useful for many workloads. Uneven technology growth in disk drives is not the only driver behind increasing I/O request sizes. As databases that use BLOBs (binary large objects) become popular, objects of 1 MB and larger are becoming common. Regardless of the specific cause, systems will need to support large I/O objects whose size will continue to track the economics of disk storage.
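The first super PIT method described earlier in this paragraph (hold a completion briefly and merge it with others bound for the same compute node) can be sketched as follows. This is a minimal illustration, not the patented implementation: the class name `CompletionBatcher`, the message labels, and the use of integer microsecond timestamps are all assumptions introduced here.

```python
from dataclasses import dataclass, field


@dataclass
class CompletionBatcher:
    """Hypothetical helper: holds I/O completions per compute node for a short window."""
    max_delay_us: int                              # longest time a completion may be held
    pending: dict = field(default_factory=dict)    # node -> (first_arrival_us, [messages])

    def add(self, node, msg, now_us):
        # Remember when the first completion for this node arrived.
        first, msgs = self.pending.get(node, (now_us, []))
        msgs.append(msg)
        self.pending[node] = (first, msgs)

    def flush(self, now_us):
        """Return (node, kind, messages) for every batch whose delay has expired."""
        out = []
        for node in list(self.pending):
            first, msgs = self.pending[node]
            if now_us - first >= self.max_delay_us:
                # Several completions become one super PIT; a lone one goes out as a normal PIT.
                kind = "SUPER_PIT" if len(msgs) > 1 else "PIT"
                out.append((node, kind, msgs))
                del self.pending[node]
        return out
```

The fixed `max_delay_us` is the simplest choice; the adaptive and per-compute-node delays mentioned in the text would replace it with a value computed from observed service times.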

【０１９８】ＰＩＴプロトコルを使用したＩＯＮ２１２
と計算ノード２００による大型データオブジェクトの送
信に関しては、いくつかの問題がある。本文書で述べた
通り、ＰＩＴプロトコルの利点は、目的地バッファを先
に割り当て、フロー制御と端点決定の問題に対処するこ
とにある。しかし、アップ・コール意味もメッセージを
保管する十分なバッファスペースを確認（又は割り当
て）する必要がある。ＰＩＴプロトコルでは、送信側で
各メッセージを保管するターゲットＰＩＴＩＤ９０６
を送信側に選択させることで、この問題に対処する。大
きなＩ／Ｏ書き込みは、メッセージのサイズが利用でき
るプールから特定のＰＩＴＩＤ９０６を選択する際の
基準になるため、確実にプロトコルが複雑になる。負荷
が大きい時期には、送信者が利用できるＰＩＴＩＤ９
０６クレジットを持っているにも関わらず、そのすべて
が大型Ｉ／Ｏ要求のバッファサイズ条件に合わない状況
が生じる可能性がある。ＰＩＴプロトコルでは、送信す
るデータサイズの幅が広い場合、送信側は受信側と協力
してＰＩＴバッファの数とサイズの両方を管理しなくて
はいけない。これによりＰＩＴバッファ割り当てサイズ
の問題が生じる。つまり、ＰＩＴバッファのプールを作
成する時に、特定の仕事のＰＩＴバッファのプールにと
って、バッファサイズの適切な配分とは何か、である。
ＢＹＮＥＴソフトウェアは、書き込みだけでなく大型Ｉ
／Ｏ読み出しも複雑にする追加最大転送単位（ＭＴＵ）
を強制する。ＢＹＮＥＴＭＴＵを超えるＩ／Ｏ要求
（読み出しと書き込みの両方）は、ソフトエアのプロト
コル（この場合はＰＩＴプロトコル）によって、送信側
で断片化し、受信側で組み立て直す必要がある。これに
よってメモリ断片化の問題が生じる。簡単に言えば、内
部断片化は割り当てられたバッファ内部の無駄なスペー
スであり、外部断片化は割り当てられたバッファ外部
の、小さすぎてまったく要求を満たすことができない無
駄なスペースである。解決策の一つは、大型ＰＩＴバッ
ファの一部のみを使用することだが、大きなＰＩＴバッ
ファを使用した場合に不必要な内部断片化が生じること
になる。大型ＰＩＴバッファはメモリを無駄にし、コス
ト／性能を悪化させる。Transmitting large data objects between the ION 212 and the compute node 200 using the PIT protocol raises several issues. As described in this document, the advantage of the PIT protocol is that destination buffers are allocated in advance, which addresses the problems of flow control and end-point location. However, up-call semantics also require that sufficient buffer space be identified (or allocated) to hold the message. The PIT protocol addresses this problem by having the sending side select the target PIT ID 906 in which each message is to be deposited. Large I/O writes clearly complicate the protocol, because message size becomes a criterion for selecting a specific PIT ID 906 from the available pool. Under heavy load, a sender may hold available PIT ID 906 credits, none of which meets the buffer size requirement of a large I/O request. Under the PIT protocol, if the data sizes to be sent span a wide range, the sending side must cooperate with the receiving side to manage both the number and the size of the PIT buffers. This creates a PIT buffer allocation size problem: when creating a pool of PIT buffers, what is the proper distribution of buffer sizes for a given workload? The BYNET software also imposes an additional maximum transfer unit (MTU) limit that complicates large I/O reads as well as writes. I/O requests (both reads and writes) that exceed the BYNET MTU must be fragmented on the sending side and reassembled on the receiving side by the software protocol (in this case, the PIT protocol). This creates a memory fragmentation problem. Briefly, internal fragmentation is wasted space inside an allocated buffer, and external fragmentation is wasted space outside the allocated buffers that is too small to satisfy any request. One solution would be to use only part of a larger PIT buffer, but this causes unnecessary internal fragmentation when large PIT buffers are used. Large PIT buffers waste memory, which hurts cost/performance.
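The credit-selection problem described above can be made concrete with a small sketch. It assumes, purely for illustration, that the sender tracks its credits as a mapping from PIT ID to buffer size and uses a best-fit policy; the function name `pick_credit` and the example PIT ID values are hypothetical and do not come from the patent.

```python
def pick_credit(credits, msg_size):
    """Best fit: return (pit_id, wasted_bytes) for the smallest credited buffer
    that can hold msg_size, or None when no credited buffer is large enough.

    credits: dict mapping PIT ID -> buffer size in bytes (assumed representation).
    """
    fitting = [(size, pit_id) for pit_id, size in credits.items() if size >= msg_size]
    if not fitting:
        # The sender holds credits, but none satisfies the size requirement.
        return None
    size, pit_id = min(fitting)
    # The unused tail of the chosen buffer is internal fragmentation.
    return pit_id, size - msg_size
```

With credits `{1: 4096, 2: 8192, 3: 65536}`, a 6000-byte message lands in buffer 2 and wastes 2192 bytes, while a 100000-byte message cannot be sent at all even though three credits are available. The second case is the situation that motivates the additional message types introduced in the next paragraph.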

【０１９９】本発明では、ＢＹＮＥＴＭＴＵ及びＰＩ
Ｔバッファ割り当ての問題は、ＰＩＴメッセージのタイ
プを２つ追加することで解決している。これはＲＴ−Ｐ
ＩＴ（ラウンド・トリップＰＩＴ）とＦＲＡＧ−ＰＩＴ
（断片ＰＩＴ）である。ＦＲＡＧ−ＰＩＴとＲＴ−ＰＩ
Ｔはどちらも、ＰＩＴデータ・プッシュ・モデルではな
く、データ・プル・モデルを利用している（プッシュ・
データでは、送信側がデータを目的地にプッシュする。
プル・データでは、目的地がデータをソースからプルす
る）。ＦＲＡＧ−ＰＩＴメッセージは大型データ読み出
しをサポートする設計になっており、ＲＴ−ＰＩＴメッ
セージは大型データ書き込みをサポートしている。ＦＲ
ＡＧ−ＰＩＴとＲＴ−ＰＩＴはどちらもスーパーＰＩＴ
と同じように、ＩＯＮＰＩＴ仕事インジェクタを使用
してデータ・フローを管理する。In the present invention, the BYNET MTU and PIT buffer allocation problems are solved by adding two more types of PIT message: the RT-PIT (round-trip PIT) and the FRAG-PIT (fragmented PIT). Both FRAG-PIT and RT-PIT use a data pull model instead of the PIT data push model. (With push data, the sending side pushes the data to the destination. With pull data, the destination pulls the data from the source.) FRAG-PIT messages are designed to support large data reads, while RT-PIT messages support large data writes. Like super PIT, both FRAG-PIT and RT-PIT use the ION PIT work injector to manage the flow of data.

【０２００】ａ）ＲＴ−ＰＩＴメッセージ計算ノード２００がＩＯＮ２１２に対して大きなディス
ク書き込み作業の実行を求め、そのＩ／Ｏ書き込みがＢ
ＹＮＥＴＭＴＵ又は利用できる任意のＩＯＮ２１２Ｐ
ＩＴバッファのどちらかより大きいとき、計算ノード２
００はＲＴ−ＰＩＴ作成メッセージを作成する。ＲＴ−
ＰＩＴメッセージは二つの段階で働く。ブースト段階と
ラウンド・トリップ段階である。ブースト段階では、書
き込まれるデータのためのソース・バッファのリストが
計算ノード２００一連のＰＩＴＩＤに割り当てられる。
ソース・バッファの断片化サイズがＢＹＮＥＴＭＴＵ
とＩＯＮ初期オープンプロトコルで指定されたサイズ制
約によって決定する。このＰＩＴＩＤ（及び地合おう
するバッファサイズ）のリストは、単一のＲＴ−ＰＩＴ
要求メッセージのペイロードに置かれ、目的地ＩＯＮ２
１２に対するＰＩＴクレジットになる。ＲＴ−ＰＩＴプ
ロトコルが直接使用するために、追加ＰＩＴバッファが
計算ノードプールから割り当てられる。この追加バッフ
ァのＰＩＴＩＤは、ＰＩＴヘッダのクレジット・フィー
ルドに置かれる。残りのＲＴ−ＰＩＴ要求は、通常のＰ
ＩＴ書き込みメッセージと同じである。次に計算ノード
２００は、このＲＴ−ＰＩＴ要求メッセージをＩＯＮ２
１２に送る（ブーストする）。
a) RT-PIT Messages

When a compute node 200 wants an ION 212 to perform a large disk write operation, and the I/O write is larger than either the BYNET MTU or any available ION 212 PIT buffer, the compute node 200 creates an RT-PIT create message. An RT-PIT message operates in two phases: the boost phase followed by the round-trip phase. In the boost phase, a list of source buffers for the data to be written is assigned a series of PIT IDs on the compute node 200. The fragmentation size of the source buffers is determined by the BYNET MTU and the size constraints specified in the ION initial open protocol. This list of PIT IDs (with the corresponding buffer sizes) is placed in the payload of a single RT-PIT request message and serves as PIT credits for the destination ION 212. An additional PIT buffer is allocated from the compute node pool for direct use by the RT-PIT protocol. The PIT ID of this additional buffer is placed in the credit field of the PIT header. The rest of the RT-PIT request is the same as a normal PIT write message. The compute node 200 then sends (boosts) this RT-PIT request message to the ION 212.
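The boost phase described above can be sketched as follows. This is an illustrative model under stated assumptions: the fragment size is taken as the smaller of the BYNET MTU and a size limit negotiated during the ION open protocol, PIT IDs are handed out sequentially, and the function name, dictionary keys, and the starting PIT ID of 100 are all hypothetical.

```python
def build_rt_pit_request(total_len, mtu, ion_max_fragment, first_pit_id=100):
    """Split a large write into fragments and build a single RT-PIT request (sketch)."""
    frag_size = min(mtu, ion_max_fragment)      # both limits must be respected
    fragments = []
    offset = 0
    pit_id = first_pit_id
    while offset < total_len:
        size = min(frag_size, total_len - offset)
        # One source-side PIT ID per fragment of the source buffer.
        fragments.append({"pit_id": pit_id, "offset": offset, "size": size})
        offset += size
        pit_id += 1
    return {
        "type": "RT_PIT_REQUEST",
        "credit_pit_id": pit_id,   # the additional buffer used directly by the protocol
        "payload": fragments,      # list of PIT IDs and sizes, carried in the payload
    }
```

For a 150000-byte write with a 64 KB MTU, the sketch yields two 65536-byte fragments and one 18928-byte fragment, plus one extra PIT ID in the credit field.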

【０２０１】ＩＯＮ２１２では、ＰＩＴ仕事インジェク
タがＲＴ−ＰＩＴ要求メッセージを二つのステップで処
理する。それそれのソース側ＰＩＴＩＤ９０６に関し
て、仕事インジェクタはサイズの一致するＰＩＴバッフ
ァをＩＯＮキャッシュから要求しなくてはいけない（Ｉ
ＯＮバッファキャッシュで利用できるスペースに応じ
て、これがすべて一緒に又は一度に一つずつ行われ
る）。ＰＩＴバッファを一致させることで、ＩＯＮ２１
２は動的にリソースを割り当て、書き込み要求に組み合
わせる。これで、通常のＰＩＴ転送を修正したシーケン
スを利用して、Ｉ／Ｏを進めることができるようにな
る。ここでＲＴ−ＰＩＴメッセージの処理はラウンド・
トリップ段階に入り、仕事インジェクタはソースと目的
地ＰＩＴＩＤの一致した一つ（以上）のペアのため
に、ＲＴ−ＰＩＴ開始メッセージを作成する（ＩＯＮ２
１２には、一致したＰＩＴＩＤの一つ又は一部を送信
するオプションも残されている）。単一のＲＴ−ＰＩＴ
開始メッセージに含まれるＰＩＴＩＤ９０６の数は、
ＩＯＮ２１２内のデータ転送の細かさをコントロールす
る（以下で解説）。At the ION 212, the PIT work injector processes the RT-PIT request message in two steps. For each source-side PIT ID 906, the work injector must request a PIT buffer of matching size from the ION cache. (This can be done all at once or one at a time, depending on the space available in the ION buffer cache.) By matching the PIT buffers, the ION 212 dynamically allocates resources to match the write request. The I/O can now proceed using a modified sequence of normal PIT transfers. Processing of the RT-PIT message then enters the round-trip phase, in which the work injector creates an RT-PIT start message for one or more matched pairs of source and destination PIT IDs. (The ION 212 retains the option of sending one or a subset of the matched PIT IDs.) The number of PIT IDs 906 in a single RT-PIT start message controls the granularity of data transfer inside the ION 212 (as discussed below).

【０２０２】このＲＴ−ＰＩＴ開始メッセージは計算ノ
ード２００に送り返され、ＲＴ−ＰＩＴメッセージのブ
ースト段階が終了する。ＲＴ−ＰＩＴ開始メッセージを
受領すると、計算ノード２００は、通常ＰＩＴ書き込み
メッセージを使用して、一度に一つのＰＩＴペアずつ、
ＩＯＮ２１２にデータを転送する。計算ノード２００と
ＩＯＮ２１２の両方が失われた断片を処理するのに十分
なデータを持っているので、計算ノード２００は断片を
順番に送る必要はない（一致したＰＩＴペアが再組立の
順番を指定する）。ＩＯＮ２１２がＰＩＴ書き込みメッ
セージを受領すると、仕事インジェクタは通知を受け、
この書き込み要求がもっと大きなＲＴ−ＰＩＴＩ／Ｏ
作業の一部であることを認識する。仕事インジェクタに
はＰＩＴ書き込み処理に関して二つの選択肢があり、断
片をキャッシュ・ルーチンに渡して書き込み作業を開始
させる、又は書き込みを開始する前に最後の断片の送信
を待つ。さきにＩ／Ｏを開始すると、キャッシュがパイ
プライン式にディスクドライブへデータフローを送るこ
とができるが（書き込みキャッシュの方針による）、小
さなＩ／Ｏサイズによる性能低下のリスクがある。しか
し、すべての断片が到着するまでＩ／Ｏを保持すると、
キャッシュシステムに過度の負荷がかかる可能性があ
る。断片の合計サイズと数量は最初から分かっているの
で、現在の稼働状況で大型Ｉ／Ｏ要求を最適化するのに
必要なすべてのデータは、キャッシュシステムによって
作られる。計算ノード２００側では、ＰＩＴ書き込み作
業の送信が成功する度に、単一のＲＴ−ＰＩＴ開始メッ
セージに複数の断片が含まれるときには、次の断片書き
出しの開始が起こる。単一のＲＴ−ＰＩＴ開始コマンド
内の最後の断片が受領されると、要求インジェクタは、
通常の書き込み要求と同様の処理のために、データをキ
ャッシュシステムに渡す。データが安全であれば、キャ
ッシュシステムはＩ／Ｏ完了メッセージを作成し、計算
ノード２００に送り返して、（ＲＴ−ＰＩＴ開始作業
の）処理のこの段階が完了したことを知らせる。断片が
まだ残っている場合は、ＲＴ−ＰＩＴ開始コマンドが作
成され、計算ノードに送られ、すべての断片が処理され
るまで上述のサイクルが繰り返される。作業インジェク
タとキャッシュが最後の断片の処理を完了すると、最終
Ｉ／Ｏ完了メッセージがステータスと共に計算ノードに
戻され、ＲＴ−ＰＩＴ要求の処理の終了を同期させる。This RT-PIT start message is sent back to the compute node 200, ending the boost phase of the RT-PIT message. On receipt of the RT-PIT start message, the compute node 200 transfers the data to the ION 212 one PIT pair at a time, using normal PIT write messages. The compute node 200 does not have to send the fragments in order, because both the compute node 200 and the ION 212 have enough data to handle lost fragments (the matched PIT pairs specify the reassembly order). When the ION 212 receives a PIT write message, the work injector is notified and recognizes that this write request is part of a larger RT-PIT I/O operation. The work injector has two options for processing the PIT write: pass the fragment to the cache routines to start the write operation, or wait for the last fragment to be transmitted before starting the write. Starting the I/O early may allow the cache to pipeline the data flow to the disk drives (depending on the write cache policy), but risks a performance loss from the smaller I/O size. Holding the I/O until all fragments have arrived, however, may place an undue burden on the cache system. Because the total size and number of fragments are known from the start, the cache system has all the data it needs to optimize the large I/O request under current operating conditions. On the compute node 200 side, when a single RT-PIT start message contains multiple fragments, each successful transmission of a PIT write operation causes the next fragment write to begin. When the last fragment within a single RT-PIT start command has been received, the request injector passes the data to the cache system for processing similar to that of a normal write request. When the data is safe, the cache system creates an I/O completion message and sends it back to the compute node 200 to signal that this phase of processing (for the RT-PIT start operation) is complete. If fragments remain, another RT-PIT start command is created and sent to the compute node, and the cycle described above repeats until all fragments have been processed. When the work injector and the cache have finished processing the last fragment, a final I/O completion message with status is returned to the compute node to synchronize the end of processing for the RT-PIT request.
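The statement that fragments need not arrive in order, because the matched PIT pairs fix the reassembly order, can be illustrated with a short sketch. The representation is an assumption made here for clarity: `pairs` is a list of (source PIT ID, destination PIT ID) tuples in reassembly order, and each arriving PIT write is modeled as a (destination PIT ID, data) tuple.

```python
def reassemble(pairs, arrivals):
    """Rebuild a large write from fragments that may arrive in any order (sketch).

    Returns (data, []) when every fragment has arrived, or (None, missing_dst_ids)
    when some destination buffers are still empty.
    """
    buffers = {}
    for dst_pit_id, data in arrivals:
        buffers[dst_pit_id] = data          # each PIT write fills its own buffer
    missing = [dst for _src, dst in pairs if dst not in buffers]
    if missing:
        return None, missing                # lost or not-yet-sent fragments are identifiable
    # The order of the matched pairs, not the arrival order, defines the result.
    return b"".join(buffers[dst] for _src, dst in pairs), []
```

Because the missing destination IDs can be listed at any time, either side can tell which fragments still need to be sent or resent.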

【０２０３】ＲＴ−ＰＩＴメッセージはいくつかの変更
でＢＹＮＥＴに最適化できる。ＩＯＮ２１２がＲＴ−Ｐ
ＩＴ要求を受領した直後の状況を考える。ＩＯＮ２１２
の仕事インジェクタは、計算ノードのバッファとＩＯＮ
２１２を一致させ、大型Ｉ／Ｏ要求をたくさんの小さな
通常書き込み要求にトランスレートしようとしている。
同期は中間ＲＴ−ＰＩＴ開始コマンドによって行われ
る。しかし、ＢＹＮＥＴによって、受領したチャンネル
プログラムのデータ引き出しが可能になれば、ＲＴ−Ｐ
ＩＴ開始コマンド送信の中間のステップが排除できる。
説明のために、このＢＹＮＥＴ作業のモードをループ・
バンド・メッセージと呼ぶことにする。ループ・バンド
・メッセージは実際は二つの有向バンド・メッセージ
で、片方が他方に組み込まれている。例えば、ＲＴ−Ｐ
ＩＴ要求を受領した仕事インジェクタは、計算ノードで
二番目のＰＩＴ書き込みメッセージを作成するのに必要
なデータを含んだＲＴ−ＰＩＴ開始メッセージを作成す
ることで、それぞれの断片を処理することになる。ＲＴ
−ＰＩＴ開始メッセージはＰＩＴ書き込み作業の断片化
のテンプレートを計算ノード２００に転送する。計算ノ
ード２００で実行されるチャンネルプログラム（ＲＴ−
ＰＩＴ開始メッセージと一緒に送られる）は、計算ノー
ドＢＹＮＥＴドライバの送信待ち行列上にペイロードを
保管する。このペイロードは、第一のＲＴ−ＰＩＴ要求
をするアプリケーションスレッドからの要求待ち行列に
似ている。このペイロードは、ソースと目的地のＰＩＴ
ＩＤのペアを使って、仕事インジェクタが送信するこ
の断片のために、ＰＩＴ書き込み要求を作成する。ＰＩ
Ｔ書き込みは断片をＩＯＮ２１２で保管し、仕事インジ
ェクタに到着を知らせる。仕事インジェクタは、それぞ
れの断片についてこのサイクルを繰り返して、すべてを
処理する。ループ・バンド・メッセージによる性能改善
は、それぞれのＲＴ−ＰＩＴ開始メッセージに必要な割
り込みと計算ノードの処理の排除に由来する。The RT-PIT message can be optimized with some changes to the BYNET. Consider the situation in which the ION 212 has just received an RT-PIT request. The work injector on the ION 212 is matching buffers on the compute node with buffers on the ION 212 in order to translate the large I/O request into a number of smaller normal write requests. Synchronization is performed through the intermediate RT-PIT start commands. However, if the BYNET allowed a received channel program to perform a data pull, the intermediate step of sending an RT-PIT start command could be eliminated. For the sake of discussion, this mode of BYNET operation is called a loop-band message. A loop-band message is really two directed-band messages, one nested inside the other. For example, when the work injector receives an RT-PIT request, it processes each fragment by creating an RT-PIT start message that contains the data needed to create a second PIT write message on the compute node. The RT-PIT start message transfers the template for the PIT write operation for a fragment to the compute node 200. The channel program executed on the compute node 200 (sent with the RT-PIT start message) deposits the payload on the send queue of the compute node BYNET driver. The payload looks like a request queued by the application thread that made the initial RT-PIT request. Using the pair of source and destination PIT IDs sent by the work injector, the payload creates a PIT write request for this fragment. The PIT write deposits the fragment on the ION 212 and notifies the work injector that it has arrived. The work injector repeats this cycle for each fragment until all have been processed. The performance improvement of loop-band messages comes from eliminating the interrupt and compute node processing required for each RT-PIT start message.

【０２０４】ＦＲＡＧ−ＰＩＴメッセージは、計算ノー
ドからの大型Ｉ／Ｏ読み出し要求の作業をサポートする
ように設計されている。アプリケーションが大型Ｉ／Ｏ
読み出しを要求すると、計算ノードはターゲットバッフ
ァを固定し、各断片のターゲットバッファを意味するＰ
ＩＴＩＤのリストを作成する。それぞれのＰＩＴＩＤ
は、断片のターゲットバッファと関連するステータスバ
ッファで構成される分散リストを記述する。ステータス
バッファはデータ送信時に更新され、これにより計算ノ
ードは各断片がいつ処理されたかを判断できる。各断片
のサイズは、ＲＴ−ＰＩＴメッセージと同じアルゴリズ
ムで決定する（上のＲＴ−ＰＩＴのセクションを参
照）。これらのフィールドを組み合わせてＦＲＡＧ−Ｐ
ＩＴを作成する。FRAG-PIT messages are designed to support large I/O read requests from a compute node. When an application makes a large I/O read request, the compute node pins the target buffer and creates a list of PIT IDs that represent the target buffers of each fragment. Each PIT ID describes a scatter list composed of the target buffer for that fragment and an associated status buffer. The status buffer is updated when the data is sent, which allows the compute node to determine when each fragment has been processed. The size of each fragment is determined by the same algorithm as for RT-PIT messages (see the RT-PIT section above). These fields are assembled to create a FRAG-PIT.

【０２０５】ＦＲＡＧ−ＰＩＴ要求は計算ノード２００
によってＩＯＮ２１２に送信され、ここで仕事インジェ
クタによって処理される。この要求には、仮想ディスク
名、開始ブロック番号、ＩＯＮ２１２上のデータソース
のデータの長さが含まれる。仕事インジェクタはＲＴ−
ＰＩＴ要求と同様の形でＦＲＡＧ−ＰＩＴ要求について
の作業を行う。ＦＲＡＧ−ＰＩＴ要求内の各断片は、別
々のＰＩＴ読み出し要求として、キャッシュシステムと
の協力によって処理される。キャッシュシステムは、各
断片を個別に扱うか、又は単一の読み出し要求として扱
うかを選択でき、可能になったときにディスクデータを
仕事インジェクタに供給する。キャッシュがデータ断片
を（個別に又は単一のＩ／Ｏ作業の一部として）供給す
ると、大型読み出し要求のデータが計算ノードにフロー
バックし始める。キャッシュがデータを利用可能にした
各断片ごとに、仕事インジェクタはデータ断片をＦＲＡ
Ｇ−ＰＩＴ部分完了メッセージに入れて計算ノードに送
り返す。それぞれのＦＲＡＧ−ＰＩＴ部分完了メッセー
ジは、通常のＰＩＴ読み出し要求完了と同様にデータを
伝送するが、ＦＲＡＧ−ＰＩＴ部分完了メッセージは伝
達されたときに計算ノードで割り込みを生成しない。最
後の完了断片は、ＦＲＡＧ−ＰＩＴ完全完了メッセージ
と一緒に計算ノードへ戻される。ＦＲＡＧ−ＰＩＴ完全
完了メッセージと部分完了メッセージの違いは、割り込
みによってＦＲＡＧ−ＰＩＴ読み出し要求全体の完了を
知らせる点にある（フル・アップ・コール）。The compute node 200 sends the FRAG-PIT request to the ION 212, where it is processed by the work injector. The request includes the virtual disk name, the starting block number, and the data length of the data source on the ION 212. The work injector operates on a FRAG-PIT request in a manner similar to an RT-PIT request. Each fragment within the FRAG-PIT request is processed as a separate PIT read request in cooperation with the cache system. The cache system can choose to handle each fragment independently or as a single read request, and supplies the disk data to the work injector when it becomes available. When the cache supplies a data fragment (either individually or as part of a single I/O operation), the data for the large read request begins to flow back to the compute node. For each fragment for which the cache has made data available, the work injector sends that data fragment back to the compute node in a FRAG-PIT partial-completion message. Each FRAG-PIT partial-completion message transmits data in the same way as a normal PIT read request completion, except that it does not generate an interrupt at the compute node when it is delivered. The last completed fragment is returned to the compute node with a FRAG-PIT full-completion message. A full-completion message differs from a partial-completion message in that it signals the completion of the entire FRAG-PIT read request with an interrupt (a full up-call).
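The difference between partial-completion and full-completion messages can be sketched as follows. This is a simplified model: it assumes the fragments become available in list order, and the function name and message field names are hypothetical.

```python
def frag_pit_completions(fragments):
    """Return the completion messages for a large read, one per fragment (sketch).

    Only the final message raises an interrupt at the compute node.
    """
    msgs = []
    for i, data in enumerate(fragments):
        last = i == len(fragments) - 1
        msgs.append({
            "type": "FRAG_PIT_FULL" if last else "FRAG_PIT_PARTIAL",
            "data": data,
            "interrupt": last,   # partial completions are delivered silently
        })
    return msgs
```

A read split into N fragments therefore costs the compute node one interrupt rather than N, which is the point of separating the two completion types.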

【０２０６】７．他のネットワークデバイスへのＰＩ
Ｔプロトコルの導入ネットワーク接続ストレージに対する上述のアプローチ
による性能の大部分は、インターコネクト・ファブリッ
ク１０６がＰＩＴプロトコルをサポートする能力に依存
している。ＢＹＮＥＴの場合、低レベルインターフェー
スが作成され、これはＰＩＴプロトコルと密接に調和し
ている。ファイバ・チャンネル等の他のネットワーク・
インターフェースにも、同様にＰＩＴプロトコルをサポ
ートする能力がある。
7. Implementation of the PIT Protocol on Other Network Devices

Much of the performance of the above approach to network-attached storage depends on the ability of the interconnect fabric 106 to support the PIT protocol. In the case of the BYNET, a low-level interface was created that closely matches the PIT protocol. Other network interfaces, such as Fibre Channel, are similarly capable of supporting the PIT protocol.

【０２０７】Ｅ．バミューダ・トライアングル・プロ
トコル本発明では、ＩＯＮクリーク２２６と書き込みキャッシ
ュを使用して、データとＩ／Ｏ冗長を提供している。Ｉ
ＯＮクリーク２２６を構成するのは複数のＩＯＮ（通常
はペア又はダイポールとして設置）、例えばプライマリ
ＩＯＮ２１２及びバディＩＯＮ２１４から成るＩＯＮ２
１２及び２１４である。
E. Bermuda Triangle Protocol

The present invention provides data and I/O redundancy by using ION cliques 226 and write caching. An ION clique 226 comprises a plurality of IONs (typically deployed in pairs, or dipoles), such as IONs 212 and 214, which consist of a primary ION 212 and a buddy ION 214.

【０２０８】バディＩＯＮ２１４は、プライマリＩＯＮ
２１２が修正したキャッシュページのコピーの一時格納
場所として働き、データとＩ／Ｏ冗長を提供する。ＩＯ
Ｎクリーク２２６（図のＩＯＮのペア又はダイポール）
内のそれぞれのＩＯＮ２１２は、ボリューム・セットの
一つのグループにとってのプライマリＩＯＮ２１２とし
て機能し、他にとってのバディＩＯＮ２１４として機能
する。The buddy ION 214 provides data and I/O redundancy by acting as a temporary store for copies of the cache pages modified by the primary ION 212. Each ION 212 in an ION clique 226 (shown as a pair of IONs, or dipole) functions as the primary ION 212 for one group of volume sets and as the buddy ION 214 for another.

【０２０９】高い可用性と書き込みキャッシュを提供す
るためには、アプリケーションが書き込みを確認できる
までの間、最低２カ所にデータを安全に格納する必要が
ある。これは、キャッシュメモリのバックアップコピー
や高速度シーケンシャルディスク用ログを使用して行う
こともある。書き込みが確認された後、データが恒久ス
トレージに記録される前に、ストレージ・コントローラ
が働かなくなった場合、この冗長コピー供給の失敗はデ
ータの損失につながる可能性がある。To provide high availability together with write caching, data must be stored safely in at least two locations before a write can be acknowledged to the application. The redundant copy may be kept as a backup copy in cache memory or in a log on a high-speed sequential disk. Failure to provide this redundant copy can lead to data loss if the storage controller fails after the write has been acknowledged but before the data has been recorded on permanent storage.

【０２１０】しかし、ＩＯＮ２１２とＩＯＮ２１４は物
理的に分離したコンピュータで構成されるため、インタ
ーコネクト・ファブリック１０６上の通信には、こうし
たバックアップコピーを維持する必要がある。最高のシ
ステム性能のためには、書き込みキャッシュを利用しな
がら、ＢＹＮＥＴ送信と書き込みプロトコルに伴う割り
込みを最小化する必要がある。However, because the IONs 212 and 214 are physically separate computers, maintaining these backup copies requires communication over the interconnect fabric 106. For the best system performance, the BYNET transmissions and interrupts associated with the write protocol must be minimized while still using write caching.

【０２１１】図１１は、ダイポール２２６におけるディ
スク２２４へのデータ書き込みのために可能なプロトコ
ルの一つを示している。ステップ１及び３では、計算ノ
ード２００は要求をプライマリＩＯＮ２１２とバディＩ
ＯＮ２１４に送る。ステップ２及び４では、ＩＯＮが書
き込み要求に応答する。計算ノード２００がＩＯＮ２１
２と２１４の両方から応答を受領すると、書き込みが完
了したとみなされる。データがその後ディスクに書き込
まれるとき、プライマリＩＯＮ２１２はバディＩＯＮ２
１４に排除要求を送り、書き込みデータのページのコピ
ーを保存しておく必要がないことを知らせる。「送信完
了」割り込みを送信側で抑制すると仮定すると、それぞ
れの送信メッセージが計算ノード又はＩＯＮ２１２及び
ＩＯＮ２１４で割り込みを生成するため、このプロトコ
ルでは最低５回の割り込みが必要になる。また、このプ
ロトコルのもう一つの欠点は、書き込みが開始になった
時に片方のＩＯＮがダウンした場合に二つ目の応答を永
遠に待つのを避けるために、計算ノード２００がプライ
マリＩＯＮ２１２及びバディＩＯＮ２１４の状況を知っ
ておく必要があることである。FIG. 11 shows one possible protocol for writing data to a disk 224 in a dipole 226. In steps 1 and 3, the compute node 200 sends the request to the primary ION 212 and to the buddy ION 214. In steps 2 and 4, the IONs respond to the write request. When the compute node 200 has received a response from both IONs 212 and 214, the write is considered complete. When the data is later written to disk, the primary ION 212 sends a purge request to the buddy ION 214 to indicate that it no longer needs to keep its copy of the page of write data. Assuming that "send complete" interrupts are suppressed on the sending side, this protocol requires at least five interrupts, because each message sent generates an interrupt at the compute node or at the IONs 212 and 214. Another disadvantage of this protocol is that the compute node 200 must know the status of the primary ION 212 and the buddy ION 214, so that it does not wait forever for a second response if one ION is down when the write is initiated.

【０２１２】図１２は、考えられる別のプロトコルを表
している。このプロトコルでは、プライマリＩＯＮ２１
２に対して、書き込み要求をバディＩＯＮ２１４へ送
り、応答を待ち、計算ノード２００へ確認を送り返すよ
うに指示する。このプロトコルでも最低５回の割り込み
が必要になる。第一の割り込みは、ステップ１で示され
るように、計算ノード２００がプライマリＩＯＮ２１２
に書き込み要求を送るときに発生する。第二の割り込み
はステップ２で、プライマリＩＯＮ２１２がデータをバ
ディＩＯＮ２１４に送るときに発生する。３回目の割り
込みはステップ３で、バディＩＯＮ２１４がデータの受
領を知らせるときに発生する。４回目の割り込みはステ
ップ４で、プライマリＩＯＮ２１２が計算ノード２００
に応答するときに発生し、最後の割り込みはステップ５
で、データが安全にディスクに転送された後、プライマ
リＩＯＮ２１２が排除要求をバディＩＯＮ２１４に送る
ときに発生する。FIG. 12 shows another possible protocol. This protocol instructs the primary ION 212 to send the write request to the buddy ION 214, wait for a response, and then send the acknowledgment back to the compute node 200. This protocol also requires at least five interrupts. The first interrupt occurs when the compute node 200 sends the write request to the primary ION 212, as shown in step 1. The second occurs in step 2, when the primary ION 212 sends the data to the buddy ION 214. The third occurs in step 3, when the buddy ION 214 acknowledges receipt of the data. The fourth occurs in step 4, when the primary ION 212 responds to the compute node 200, and the final interrupt occurs in step 5, when the primary ION 212 sends a purge request to the buddy ION 214 after the data has been safely transferred to disk.

【０２１３】図１３は、書き込み要求処理に必要な割り
込み回数を最小化する、本発明で使用されているプロト
コルを表している。このプロトコルをバミューダ・トラ
イアングル・プロトコルと呼ぶ。まず、計算ノード２０
０が書き込みデータと共に書き込み要求をプライマリＩ
ＯＮ２１２に対して発行する。この書き込み要求は、イ
ンターコネクト・ファブリック１０６を介して、プライ
マリＩＯＮ２１２に伝送される。これはステップ１で表
される。プライマリＩＯＮ２１２はメモリ３０４に位置
する書き込みキャッシュに書き込みデータを格納し、こ
の書き込みデータをバディＩＯＮ２１４に送る。これは
ステップ２で表される。次に、バディＩＯＮ２１４は確
認メッセージを計算ノード２００に送り、書き込み要求
を確認する。最後に、データが安全にディスクに保存さ
れると、プライマリＩＯＮ２１２は排除要求をＩＯＮ２
１４に送る。この排除ステップは図１３のステップ３で
示している。図１１及び図１２で示した方法では５つの
ステップが必要なのに対して、上述のプロトコルで必要
なのは４つのプロセスであるため、これはデータ処理ア
ーキテクチャ１００の通信要件を減少し、処理能力を増
加させる。FIG. 13 shows the protocol used in the present invention, which minimizes the number of interrupts required to process a write request. This protocol is called the Bermuda Triangle protocol. First, the compute node 200 issues a write request, together with the write data, to the primary ION 212. The write request is transmitted to the primary ION 212 over the interconnect fabric 106. This is shown as step 1. The primary ION 212 stores the write data in a write cache located in memory 304 and sends the write data to the buddy ION 214. This is shown as step 2. Next, the buddy ION 214 sends an acknowledgment message to the compute node 200, acknowledging the write request. Finally, once the data has been safely stored on disk, the primary ION 212 sends a purge request to the buddy ION 214. This purge step is shown as step 3 in FIG. 13. Whereas the methods shown in FIGS. 11 and 12 require five steps, the protocol described above requires only four, which reduces the communication requirements of the data processing architecture 100 and increases throughput.
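The interrupt counts for the three protocols can be checked with a short tally. The sketch assumes, as the text does, that "send complete" interrupts are suppressed and that each message causes exactly one interrupt at its receiver. The message lists are a reading of the descriptions of FIGS. 11 to 13, and the constant and function names are introduced here for illustration only.

```python
CN, PRI, BUD = "compute node 200", "primary ION 212", "buddy ION 214"

# (sender, receiver, kind) for each message in the protocol.
FIG11 = [(CN, PRI, "write"), (PRI, CN, "ack"),
         (CN, BUD, "write"), (BUD, CN, "ack"),
         (PRI, BUD, "purge")]
FIG12 = [(CN, PRI, "write"), (PRI, BUD, "write data"), (BUD, PRI, "ack"),
         (PRI, CN, "ack"), (PRI, BUD, "purge")]
BERMUDA = [(CN, PRI, "write"), (PRI, BUD, "write data"),
           (BUD, CN, "ack"), (PRI, BUD, "purge")]


def interrupts(messages):
    """Count receive-side interrupts per node: one per delivered message."""
    counts = {}
    for _sender, receiver, _kind in messages:
        counts[receiver] = counts.get(receiver, 0) + 1
    return counts
```

Summing the per-node counts gives five interrupts for each of the FIG. 11 and FIG. 12 protocols and four for the Bermuda Triangle protocol. The saving comes from the buddy ION 214 acknowledging the compute node 200 directly, so the primary ION 212 never receives an acknowledgment of its own.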

【０２１４】図１４は、上述の作業をフローチャートの
形で示した図である。まず、プライマリＩＯＮ２１２が
計算ノード２００から書き込み要求を受領する１４０
２。次に、プライマリＩＯＮ２１２からバディＩＯＮ２
１４に書き込み要求の書き込みデータが転送される１４
０４。バディＩＯＮ２１４から計算ノード２００に確認
メッセージが伝送され１４０６、排除論理を実行して１
４０８、バディＩＯＮ２１４に格納された書き込みデー
タを排除する。FIG. 14 is a flow chart of the operations described above. First, the primary ION 212 receives a write request from the compute node 200 (1402). Next, the write data of the write request is transferred from the primary ION 212 to the buddy ION 214 (1404). An acknowledgment message is then transmitted from the buddy ION 214 to the compute node 200 (1406), and purge logic is executed (1408) to purge the write data stored in the buddy ION 214.

【０２１５】図１５は、排除論理の実施形態を示してい
る。この実施形態では、プライマリＩＯＮ２１２の不揮
発性ストレージに書き込みデータが格納されたときに、
プライマリＩＯＮ２１２からバディＩＯＮに排除コマン
ドが送られる１５０２。通常、これはデータがメディア
に書き込まれたときに発生する。図１６は、排除論理の
もう一つの実施形態を示している。この実施形態は、不
揮発性メモリに書き込みデータが格納されるまで１６０
２、排除コマンドが送信されない点では、図１５の実施
形態と同じである。しかし、この実施形態では、排除コ
マンドを送信する前に、プライマリＩＯＮ２１２が二番
目の書き込み要求を受領するのも待つ１６０４。したが
って、排除要求を遅らせ、３回の割り込みプロトコルを
発生させる次の書き込みデータ送信と組み合わせること
で、さらに割り込みが減少する。このプロトコルのもう
一つの利点は、書き込み要求を受領したときにバディＩ
ＯＮ２１４がダウンした場合でも、プライマリＩＯＮ２
１２がライト・バック・モードで要求を処理し、データ
が安全にディスクに保存された後、書き込みを知らせる
ことができる点にある。計算ノード２００はバディＩＯ
Ｎ２１４のステータスを知る必要はない。実施形態で
は、プライマリＩＯＮ２１２の受領する書き込み要求が
なくなった場合でも、最後の排除命令の送信を確実に行
うために、ソフトウェア・タイマ又はその他のデバイス
を導入することもできる。FIG. 15 shows one embodiment of the purge logic. In this embodiment, a purge command is sent from the primary ION 212 to the buddy ION 214 when the write data has been stored in the non-volatile storage of the primary ION 212 (1502). Typically, this occurs when the data has been written to the media. FIG. 16 shows another embodiment of the purge logic. This embodiment is the same as that of FIG. 15 in that the purge command is not sent until the write data has been stored in non-volatile memory (1602). In this embodiment, however, the primary ION 212 also waits until it receives a second write request before sending the purge command (1604). Delaying the purge request and combining it with the next write data transmission yields a three-interrupt protocol, reducing interrupts further. Another advantage of this protocol is that even if the buddy ION 214 is down when the write request is received, the primary ION 212 can process the request in write-back mode and acknowledge the write after the data has been safely stored on disk. The compute node 200 does not need to know the status of the buddy ION 214. In some embodiments, a software timer or other device can be used to ensure that the last purge command is sent even when the primary ION 212 receives no further write requests.
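The FIG. 16 embodiment (piggyback the purge on the next write data transmission, with a timer as a fallback) can be sketched as follows. This is an assumption-laden model rather than the patented logic: the class `PrimaryIon`, its method names, the message layout, and the abstract integer time units are all hypothetical.

```python
class PrimaryIon:
    """Sketch of delayed purge handling on the primary ION (hypothetical names)."""

    def __init__(self, purge_timeout):
        self.purge_timeout = purge_timeout
        self.pending_purges = []   # pages safely on disk whose purge has not been sent
        self.oldest = None         # time at which the oldest pending purge was queued
        self.sent = []             # messages sent to the buddy ION, kept for inspection

    def on_disk_commit(self, page, now):
        # The data is on non-volatile storage: the buddy copy is no longer needed,
        # but the purge is held back instead of being sent as its own message.
        if not self.pending_purges:
            self.oldest = now
        self.pending_purges.append(page)

    def on_write_request(self, page, data):
        # Forward the new write data to the buddy and attach any waiting purges.
        self.sent.append({"write": (page, data), "purge": self.pending_purges})
        self.pending_purges, self.oldest = [], None

    def on_timer(self, now):
        # Fallback: if no further write arrives, send the purge on its own.
        if self.pending_purges and now - self.oldest >= self.purge_timeout:
            self.sent.append({"write": None, "purge": self.pending_purges})
            self.pending_purges, self.oldest = [], None
```

While writes keep arriving, no standalone purge message is ever sent, which is where the three-interrupt figure comes from. The timer only matters when the write stream stops.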

【０２１６】バミューダ・トライアングル・プロトコル
では、従来のプロトコルよりも少ない割り込みによって
書き込みキャッシュを利用でき、同時にデータの可用性
を維持できる。これが可能なのは、バディＩＯＮ２１４
がプライマリＩＯＮ２１２に送られた書き込み要求の確
認を行うからである。現代のパイプライン式プロセッサ
において割り込み処理のコストが高いことを考えれば、
このプロトコルは、幅広い分散ストレージシステム・ア
ーキテクチャでの使用が可能であり、システム全体のオ
ーバーヘッド低下と性能の向上を生み出すことになる。The Bermuda Triangle protocol allows write caching with fewer interrupts than conventional protocols while maintaining data availability. This is possible because the buddy ION 214 acknowledges the write request that was sent to the primary ION 212. Given the high cost of interrupt handling on modern pipelined processors, this protocol, which can be used in a wide variety of distributed storage system architectures, lowers overall system overhead and improves performance.
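The triangular message path can be traced with a short sketch. It models messages rather than interrupts, and the node labels and the function name `triangle_write` are illustrative assumptions, not terms from the specification.

```python
from typing import Dict, List, Tuple

# (sender, receiver, kind); all labels are illustrative
Message = Tuple[str, str, str]


def triangle_write(block: int, data: bytes,
                   primary_cache: Dict[int, bytes],
                   buddy_cache: Dict[int, bytes]) -> List[Message]:
    """Trace one cached write; the buddy acknowledges directly to the compute node."""
    trace: List[Message] = [("compute", "primary", "write_request")]
    primary_cache[block] = data                      # first copy, at the primary
    trace.append(("primary", "buddy", "write_data"))
    buddy_cache[block] = data                        # second copy, at the buddy
    trace.append(("buddy", "compute", "confirmation"))
    return trace
```

The point of the sketch is that the confirmation travels along the third side of the triangle: it is not relayed back through the primary, so the compute node learns that the write is complete only after both IONs hold a copy.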

【０２１７】Ｆ．計算ノード１．概要計算ノード２００はユーザアプリケーション２０４を動
かす。従来のシステムでは、使用される専用共通ＳＣＳ
Ｉバスの数と、クラスタ又はクリーク内のノードにアク
セスできるストレージの数は同じだった。本発明では、
ストレージは一つ以上の通信ファブリック１０６を通じ
て計算ノード２００に接続している。このネットワーク
接続ストレージは、計算ノード２００間に分散するユー
ザアプリケーション２０４内のプロセス間通信（ＩＰ
Ｃ）トラフィックと、通信ファブリック１０６を共有し
ている。ユーザアプリケーション２０４からのストレー
ジ要求は、ファブリック／ストレージ・インターフェー
スによって、ＩＯＮ２１２に位置するストレージ管理ア
プリケーションに対するＩＰＣメッセージに入れられ
る。こうしたストレージ・ノードの専用アプリケーショ
ンは、このＩＰＣメッセージをローカル・キャッシュ又
はＩ／Ｏ作業に変換し、必要に応じて結果を計算ノード
２００に送り返す。ユーザアプリケーション２０４にと
って、ネットワーク接続ストレージとローカル接続スト
レージは区別が付かない。
F. Compute Node
1. Overview
The compute node 200 runs user applications 204. In conventional systems, the number of dedicated shared SCSI buses in use equaled the number of storage units accessible to the nodes in a cluster or clique. In the present invention, storage is attached to the compute nodes 200 through one or more communication fabrics 106. This network-attached storage shares the communication fabric 106 with the inter-process communication (IPC) traffic among the user applications 204 distributed across the compute nodes 200. Storage requests from a user application 204 are encapsulated by the fabric/storage interface into IPC messages addressed to the storage management application located on the ION 212. This dedicated application on the storage node converts the IPC messages into local cache or I/O operations and returns the results to the compute node 200 as needed. To the user application 204, network-attached storage and locally attached storage are indistinguishable.
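The encapsulation of storage requests in IPC messages can be sketched as follows. This is a minimal illustration under stated assumptions: the message layout (`IpcMessage`), the `StorageManager` class, and the choice to update both the cache and the disk on a write are all hypothetical and are not taken from the specification.

```python
from dataclasses import dataclass
from typing import Dict, Optional, Tuple


@dataclass
class IpcMessage:
    """Hypothetical IPC message wrapping one block storage request."""
    op: str                          # "read" or "write"
    vdisk: str                       # fabric virtual disk name
    block: int
    payload: Optional[bytes] = None  # write data, if any


class StorageManager:
    """Stands in for the storage management application on an ION."""

    def __init__(self) -> None:
        self.cache: Dict[Tuple[str, int], Optional[bytes]] = {}
        self.disk: Dict[Tuple[str, int], Optional[bytes]] = {}
        self.disk_reads = 0

    def handle(self, msg: IpcMessage) -> Optional[bytes]:
        key = (msg.vdisk, msg.block)
        if msg.op == "write":
            self.cache[key] = msg.payload   # cache operation
            self.disk[key] = msg.payload    # I/O operation (simplified)
            return None
        if msg.op == "read":
            if key not in self.cache:       # cache miss: fall through to disk
                self.disk_reads += 1
                self.cache[key] = self.disk.get(key)
            return self.cache[key]
        raise ValueError(f"unknown operation: {msg.op}")
```

From the caller's side a request is just a message and a reply, which is why the application cannot tell whether the block came from a local cache or from a disk behind the fabric.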

【０２１８】仮想ディスクブロックの読み出し及び書き
込み要求は、インターコネクト・ファブリック１０６を
介してＩＯＮ２１２に到着する。特定のＩＯＮ２１２へ
の要求の経路は、計算ノード２００においてソースが行
った選択によって決定できる。すべての計算ノード２０
０は、システムのそれぞれのファブリック仮想ディスク
への要求を、どのＩＯＮ２１２が受け取るのかを知って
いる。ファブリック仮想ディスクは独自のストレージ拡
張子が記述される仮想ディスクモデルを反映している
が、このストレージ拡張子は、その名前によって物理デ
ィスクの物理的な位置を意味したり記号化するものでは
ない。Read and write requests for virtual disk blocks arrive at the ION 212 via the interconnect fabric 106. The route a request takes to a particular ION 212 can be determined by a selection made at the source, in the compute node 200. Every compute node 200 knows which ION 212 accepts requests for each fabric virtual disk in the system. A fabric virtual disk reflects a virtual disk model in which a unique storage extent is represented, but the name of that storage extent neither implies nor encodes the physical location of the physical disks.

【０２１９】それぞれの計算ノード２００は、ファブリ
ック仮想ディスク名をＩＯＮダイポール２２６にマップ
するリストを持っている。このリストは、計算ノード２
００とＩＯＮ２１２との調整により、動的に作成され
る。電源投入及び失敗復旧作業時、ダイポール２２６内
のＩＯＮ２１２は、仮想（及び物理）ディスク同士を区
分し、どのＩＯＮ２１２がどの仮想ディスクを所有して
いるかについてのリストを作成する。ダイポール２２６
の他のＩＯＮ２１４（仮想ディスク又はストレージリソ
ースを所有していない）は、異常に備えて仮想ディスク
に代替経路を提供する。Each compute node 200 maintains a list that maps fabric virtual disk names to ION dipoles 226. This list is created dynamically through coordination between the compute nodes 200 and the IONs 212. During power-up and failure recovery operations, the IONs 212 within a dipole 226 partition the virtual (and physical) disks between them and create a list of which ION 212 owns which virtual disk. The other ION 214 in the dipole 226 (the one that does not own the virtual disk or storage resource) provides an alternate path to the virtual disk in case of failure.

【０２２０】このリストは、インターコネクト・ファブ
リック１０６を通して、他のすべてのダイポール２２６
と計算ノード２００に、定期的にエキスポート又は同報
される。計算ノード２００は、このデータを使って、シ
ステム内の各仮想ディスクへの第一の及び第二のパスの
マスタ・テーブルを作成する。その後、計算ノード２０
０内のインターコネクト・ファブリック・ドライバは、
ダイポール２２６と調整し、Ｉ／Ｏ要求の経路を決め
る。ダイポール２２６は、この「自己発見」テクニック
を使用して、稼働システムにおいてダイポール２２６が
追加・除去されたときに発生する可能性のある仮想ディ
スク名の不一致を検知し、訂正する。This list is periodically exported or broadcast over the interconnect fabric 106 to all other dipoles 226 and compute nodes 200. The compute nodes 200 use this data to build a master table of the primary and secondary paths to each virtual disk in the system. The interconnect fabric driver in the compute node 200 then coordinates with the dipole 226 to route I/O requests. The dipoles 226 use this "self-discovery" technique to detect and correct virtual disk name mismatches that can occur when dipoles 226 are added to or removed from a running system.
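The master table of primary and secondary paths might be assembled as in the following sketch. The record layout (`DipoleExport`, `Paths`) and the function names are assumptions made for illustration; the specification does not define the format of the exported lists.

```python
from typing import AbstractSet, Dict, Iterable, NamedTuple, Tuple


class Paths(NamedTuple):
    primary: str     # ION that owns the virtual disk
    secondary: str   # other ION in the same dipole (alternate path)


class DipoleExport(NamedTuple):
    """Hypothetical ownership list exported by one dipole."""
    ions: Tuple[str, str]
    owners: Dict[str, str]   # fabric virtual disk name -> owning ION


def build_master_table(exports: Iterable[DipoleExport]) -> Dict[str, Paths]:
    """Merge the exported lists into one table of paths per virtual disk."""
    table: Dict[str, Paths] = {}
    for export in exports:
        first, second = export.ions
        for vdisk, owner in export.owners.items():
            other = second if owner == first else first
            table[vdisk] = Paths(primary=owner, secondary=other)
    return table


def route(table: Dict[str, Paths], vdisk: str,
          down: AbstractSet[str] = frozenset()) -> str:
    """Pick the primary path, falling back to the secondary if the owner is down."""
    for ion in table[vdisk]:
        if ion not in down:
            return ion
    raise RuntimeError(f"no path available to {vdisk}")
```

For example, with a single dipole whose IONs are labeled `ion_a` and `ion_b`, a request for a disk owned by `ion_a` is routed to `ion_a` normally and to `ion_b` when `ion_a` is marked as down.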

【０２２１】計算ノード２００上で動くアプリケーショ
ンは、ブロック・インターフェース・モデルを、計算ノ
ード２００にエキスポートされた各ファブリック仮想デ
ィスクのローカルディスクのように見る。本文書で前に
述べたように、計算ノード２００はブート時に各ファブ
リック仮想ディスクの入口点を作成し、計算ノード２０
０とＩＯＮ２１２の間で確立したネーミング・プロトコ
ルを使って、こうした入口点を動的に更新する。An application running on a compute node 200 sees a block interface model that looks like a local disk for each fabric virtual disk exported to the compute node 200. As mentioned earlier in this document, the compute node 200 creates an entry point for each fabric virtual disk at boot time and updates these entry points dynamically using the naming protocol established between the compute node 200 and the IONs 212.

【０２２２】この文書では、データ・ストレージとデー
タ処理システムにおける書き込みキャッシュデータの転
送方法及び装置について説明した。この方法は、計算ノ
ードからの書き込みデータを含む書き込み要求を第一の
Ｉ／Ｏノードで受領し、この書き込みデータを第一のＩ
／Ｏノードから第二のＩ／Ｏノードに転送し、第二のＩ
／Ｏノードが書き込みデータを受領した後で、確認メッ
セージを第二のＩ／Ｏノードから計算ノードへ送るステ
ップで構成されている。第一のＩ／Ｏノードの不揮発性
ストレージにデータを書き込んだ後、排除要求又はコマ
ンドを第二のＩ／Ｏノードに送り、第二のＩ／Ｏノード
の揮発性メモリから書き込みデータを排除する。実施形
態によっては、排除要求は第一のＩ／Ｏノードが第二の
書き込み要求を受領するまで送信されず、この場合、排
除要求は第二の書き込み要求の書き込みデータと同じ割
り込みにおいて送信される。このデータ処理システム
は、第一の及び第二のＩ／Ｏノードで構成され、各ノー
ドは計算ノードからの書き込み要求を受領し、別のＩ／
Ｏノードへ書き込みノードを転送するための手段を有し
ている。また、各Ｉ／Ｏノードには、書き込みデータを
送ったＩ／Ｏノードを介して確認メッセージを送信する
のではなく、計算ノードに直接確認メッセージを送る手
段も有している。この成果が、データ保存に要する割込
み回数を減らしながら、書き込みキャッシュを導入して
ストレージ速度及びターンアラウンドを改善するＩ／Ｏ
プロトコルである。本発明は、本発明を実施すべき命令
を遂行するために保存された命令を確実に具現するハー
ドディスク、フロッピーディスク、ＣＤといったプログ
ラム・ストレージ・デバイスの見地から説明することも
可能である。This document has described a method and apparatus for transferring write cache data in a data storage and data processing system. The method comprises the steps of receiving, at a first I/O node, a write request including write data from a compute node; transferring the write data from the first I/O node to a second I/O node; and sending a confirmation message from the second I/O node to the compute node after the second I/O node has received the write data. After the data has been written to the non-volatile storage of the first I/O node, an eviction request or command is sent to the second I/O node to remove the write data from the volatile memory of the second I/O node. In some embodiments, the eviction request is not sent until the first I/O node receives a second write request, in which case the eviction request is sent in the same interrupt as the write data of the second write request. The data processing system comprises first and second I/O nodes, each having means for receiving a write request from a compute node and for transferring the write data to the other I/O node. Each I/O node also has means for sending a confirmation message directly to the compute node rather than through the I/O node that sent the write data. The result is an I/O protocol that uses write caching to improve storage speed and turnaround while reducing the number of interrupts required to store data. The present invention may also be described in terms of a program storage device, such as a hard disk, floppy disk, or CD, tangibly embodying stored instructions for carrying out the invention.

[Brief description of the drawings]

【図１】本発明の主要なアーキテクチャ要素を示す最上
部のブロック図。FIG. 1 is a top-level block diagram showing the main architectural elements of the present invention.

【図２】本発明のシステム・ブロック図。FIG. 2 is a system block diagram of the present invention.

【図３】ＩＯＮ及びシステムのインターコネクト構造を
示すブロック図。FIG. 3 is a block diagram showing an ION and an interconnect structure of the system.

【図４】ＪＢＯＤエンクロージャの要素のブロック図。FIG. 4 is a block diagram of elements of a JBOD enclosure.

【図５】ＩＯＮ物理ディスクドライバの機能ブロック
図。FIG. 5 is a functional block diagram of an ION physical disk driver.

【図６】ファブリック固有ＩＤの構造を示す図。FIG. 6 is a diagram showing a structure of a fabric unique ID.

【図７】ＩＯＮエンクロージャ管理モジュールとＩＯＮ
物理ディスクドライバの関係を示す機能ブロック図。FIG. 7 is a functional block diagram showing the relationship between the ION enclosure management module and the ION physical disk driver.

【図８】ＢＹＮＥＴホスト側のインターフェースの図。FIG. 8 is a diagram of an interface on the BYNET host side.

【図９】ＰＩＴヘッダの図。FIG. 9 is a diagram of a PIT header.

【図１０】ＩＯＮ２１２機能モジュールのブロック図。FIG. 10 is a block diagram of the functional modules of the ION 212.

【図１１】ダイポールのディスクにデータを書き込むた
めのプロトコルを示す図。FIG. 11 is a diagram showing a protocol for writing data to a dipole disk.

【図１２】ダイポールのディスクにデータを書き込むた
めの第二のプロトコルを示す図。FIG. 12 is a diagram showing a second protocol for writing data to a dipole disk.

【図１３】ＩＯＮダイポールのディスクにデータを書き
込むための効率的なプロトコルを示す図。FIG. 13 is a diagram showing an efficient protocol for writing data to an ION dipole disk.

【図１４】本発明の書き込みキャッシュ・プロトコルの
実施形態を実行する際に利用する動作を示すフローチャ
ート。FIG. 14 is a flowchart illustrating operations utilized in executing an embodiment of the write cache protocol of the present invention.

【図１５】第一のＩＯＮの不揮発性ストレージにデータ
が書き込まれた後、他方のＩＯＮのメモリを排除する際
に利用する動作を示すフローチャート。FIG. 15 is a flowchart showing the operations used to evict the write data from the memory of the other ION after the data has been written to the non-volatile storage of the first ION.

【図１６】第一のＩＯＮの不揮発性ストレージにデータ
が書き込まれた後、他方のＩＯＮのメモリを排除する際
に利用する代替動作を示すフローチャート。FIG. 16 is a flowchart showing alternative operations used to evict the write data from the memory of the other ION after the data has been written to the non-volatile storage of the first ION.

フロントページの続き (72)発明者キットエムチョウアメリカ合衆国カリフォルニア州 92009 カールスバッドコルビダエストリート 1336 (72)発明者ピーキースミュラーアメリカ合衆国カリフォルニア州 92103 サンディエゴマリルイスウェイ 2440 (72)発明者マイケルダブリューメイヤーアメリカ合衆国カリフォルニア州 92024 エンシニタスサマーヒルドライブ 2323 (72)発明者ギャリーエルボッグスアメリカ合衆国カリフォルニア州 92064 ポウェイシカモアツリーレーン 13743
Continued from the front page
(72) Inventor: Kit M. Chow, 1336 Corvidae Street, Carlsbad, California 92009, United States
(72) Inventor: Keith P. Muller, 2440 Marilouis Way, San Diego, California 92103, United States
(72) Inventor: Michael W. Meyer, 2323 Summerhill Drive, Encinitas, California 92024, United States
(72) Inventor: Gary L. Boggs, 13743 Sycamore Tree Lane, Poway, California 92064, United States

Claims

1. A method for transferring write cache data in a data storage and data processing system, comprising the steps of: receiving, at a first I/O node, a write request including write data from a compute node; transferring the write data from the first I/O node to a second I/O node; and transmitting a confirmation message from the second I/O node to the compute node.

2. The method of claim 1, further comprising the step of transmitting an eviction request from the first I/O node to the second I/O node.

3. The method according to claim 2, wherein the eviction request is transmitted when the write data has been stored in the non-volatile storage of the first I/O node.

4. The method of claim 2, wherein the eviction request is sent after the first I/O node receives a second write request following the first write request.

5. The method of claim 2, wherein the second write request includes second write data, and the method further comprises the step of transmitting the second write data and the eviction request to the second I/O node with a single interrupt.

6. The method of claim 1, wherein the compute node, the first I/O node, and the second I/O node are communicatively coupled via an interconnect fabric.

7. The described method, wherein the compute node is coupled to the first I/O node and the second I/O node via an interconnect fabric, the interconnect fabric comprising a network that connects the compute node and the I/O nodes via a plurality of network input ports and a plurality of network output ports; the network comprises a plurality of switch nodes arranged in at least g(log_b N) switch node stages, where b is the total number of input/output ports of a switch node, N is the total number of network input/output ports, and g(x) is a ceiling (round-up) function giving the smallest integer not less than its argument x; the switch node stages provide a plurality of paths between any network input port and any network output port; and the switch node stages provide a plurality of bounce points at the highest switch node stage of the network, the bounce points logically distinguishing switch nodes that load-balance messages through the network from switch nodes that direct messages within the network.

8. An apparatus for transferring write cache data in a data storage and data processing system, comprising: means for receiving, at a first I/O node, a write request including write data from a compute node; means for transferring the write data from the first I/O node to a second I/O node; and means for transmitting a confirmation message from the second I/O node to the compute node.

9. The apparatus according to claim 8, further comprising means for transmitting an eviction request from the first I/O node to the second I/O node.

10. The apparatus according to claim 8, further comprising: means for determining when the write data has been stored in the non-volatile storage of the first I/O node; and means for transmitting an eviction request when the write data has been stored in the non-volatile storage.

11. The apparatus according to claim 9, further comprising: means for determining when a second write request is received; and means for transmitting the eviction request to the second I/O node together with the second write request.

12. The apparatus of claim 8, wherein the compute node, the first I/O node, and the second I/O node are communicatively coupled via an interconnect fabric.

13. The apparatus according to claim 8, wherein the compute node, the first I/O node, and the second I/O node are connected via an interconnect fabric, the interconnect fabric comprising a network that connects the compute node and the I/O nodes via a plurality of network input ports and a plurality of network output ports; the network comprises a plurality of switch nodes arranged in at least g(log_b N) switch node stages, where b is the total number of input/output ports of a switch node, N is the total number of network input/output ports, and g(x) is a ceiling (round-up) function giving the smallest integer not less than its argument x; the switch node stages provide a plurality of paths between any network input port and any network output port; and the switch node stages provide a plurality of bounce points at the highest switch node stage of the network, the bounce points logically distinguishing switch nodes that load-balance messages through the network from switch nodes that direct messages within the network.

14. A program storage device, readable by a computer, tangibly embodying one or more programs of instructions executable by the computer to perform the steps of transferring write cache data in a data storage system according to any one of claims 1 to 7.