JPS6246026B2

Info

Publication number: JPS6246026B2
Application number: JP4220382A
Authority: JP
Inventors: Hiroshi Hatsuda
Original assignee: Nippon Electric Co Ltd
Current assignee: NEC Corp
Priority date: 1982-03-17
Filing date: 1982-03-17
Publication date: 1987-09-30
Also published as: JPS58159168A

Description

[Detailed description of the invention]

[Technical field to which the invention pertains]

The present invention relates to a parallel processing scheme, and in particular to a parallel processing scheme in a data processing apparatus.

Parallel processing is, in general, one way of speeding up arithmetic processing. In this scheme, the portions of a program that can be executed in parallel are each executed on a different processor, with the aim of obtaining, ideally, N times the performance from N processors. (In practice, less than N times the performance is obtained, because of the portions that cannot be executed in parallel and the extra time, or overhead, needed to control parallel operation.)

[Prior art]

A conventional parallel processing scheme comprises a control processor, a plurality of data memories each storing data, a plurality of processors connected in parallel to the control processor, and a memory switch for interconnecting the plurality of processors and the plurality of data memories in parallel. Each of the processors comprises a processor element, a control processor interface for connecting the processor element to the control processor, and a memory switch interface for connecting the processor element to the memory switch.

The conventional scheme is described in detail below with reference to the drawings.

FIG. 1 is a system configuration diagram showing an example of a conventional parallel processing system, and FIG. 2 is a detailed block diagram showing an example of the processor shown in FIG. 1.

The parallel processing system shown in FIG. 1 includes a control processor CP; control-only memories CPM1 and CPM2 dedicated to the control processor CP; processors PP1 to PP16 connected in parallel to the control processor CP; memories MM1 to MM32 storing programs and data; and a memory switch having 16 × 32 = 512 connection points for interconnecting the 16 processors and the 32 memories in parallel.

Processors PP1 to PP16 all have the same configuration and, as shown in FIG. 2, each includes a processor element PE, a memory switch interface MSI, and a control processor interface CPI. The memory switch interface MSI passes access requests for reading data or programs from the processor element PE to memories MM1 to MM32 via the memory switch MS, passes the data read from memories MM1 to MM32 to the processor element PE, and passes results computed by the processor element PE to memories MM1 to MM32 for storage. The control processor interface CPI is connected to the control processor CP via interface a. It receives the program execution start instruction START and the program execution stop instruction STOP from the control processor CP and passes them to the processor element PE, and it passes the processing end notification END from the processor element PE to the control processor CP.

Thus the 16 processors PP1 to PP16 can access the 32 memories MM1 to MM32 via the memory switch MS, and each of processors PP1 to PP16 can execute a program independently. The control processor CP issues the program execution start instruction START to processors PP1 to PP16 through interface a and receives the processing end notification END when a processor completes execution.

Under the control of the control processor CP, processors PP1 to PP16 share the execution of the parallel-processing portion of the program to be solved. For example, for the calculation a_1 + b_1, a_2 + b_2, ..., a_n + b_n, the i-th processor PP_i calculates a_i + b_i.

To raise the performance of such a conventional parallel processing system, either the performance of each processor must be raised or the number of processors must be increased.

However, raising processor performance increases the physical size of each unit, making it difficult to install many of them side by side. Increasing the number of processors also requires expanding the memories so that they can be used in parallel, and the memory switch grows with the product of the number of processors and the number of memories, becoming so complex and large that it, too, is difficult to realize. (For a crossbar switch, for example, doubling both the number of processors and the number of memories makes the switch 2 × 2 = 4 times as large.) Because of these drawbacks, large-scale, ultra-high-performance parallel processing systems have hardly been put into practical use.

In short, the conventional parallel processing scheme has the drawback that its degree of parallelism is difficult to increase.

[Object of the invention]

An object of the present invention is to provide a parallel processing scheme in which the degree of parallelism can be increased.

More specifically, the object is to provide a large-scale, ultra-high-performance parallel processing system that overcomes the above drawback by making each processor that shares the parallel processing itself a parallel processor made up of a plurality of processor elements, thereby raising the degree of parallelism without enlarging the memory switch.

[Configuration of the invention]

The parallel processing scheme of the present invention comprises a control processor, a plurality of data memories each storing data, a plurality of processors connected in parallel to the control processor, and a memory switch for interconnecting the plurality of processors and the plurality of data memories in parallel. Each of the plurality of processors comprises a plurality of processor elements provided in parallel; a plurality of program memories, one provided for each processor element, that store programs; a control processor interface for connecting the processor elements to the control processor; a memory switch interface for connecting the processor elements to the memory switch; and a data cache memory, connected to the memory switch interface, that stores a copy of part of the data stored in the data memories.

In other words, the scheme comprises a plurality of arithmetic processing units, a plurality of data memories, and a memory switch that allows any of the arithmetic processing units to access any of the data memories. Each arithmetic processing unit consists of a plurality of processor elements each having a program memory, a data cache memory shared by those processor elements, and a circuit that, at each data memory access timing, selects and processes one of the data memory access requests generated by those processor elements.

In addition to the above, the scheme comprises a control processor, communication means by which the control processor instructs all of the processor elements to start program execution, and means by which each processor element notifies the control processor of the end of program execution. It is configured so that, under the control of the control processor, the parallel-processing portions of a single program are executed in parallel by all of the processor elements.

That is, each processor sharing the parallel processing is made up of a plurality of processor elements operating in parallel, so the effective number of parallel processors is increased without enlarging the memory switch.

Concretely, the parallel processing system of the present invention includes n processors; m data memories, where m is n or more (for example n or 2n); and a memory switch having n × m connection points for connecting the n processors to the m data memories. Each of the n processors internally includes l processor elements; l program memories, each used exclusively by one processor element and holding the program that the processor element is to execute; and a memory switch interface that receives and processes the access requests issued by each of the l processor elements to the m data memories. At each memory access timing, the memory switch interface selects one of the access requests from whichever one or more of the l processor elements have issued them, and sends the selected request to the data memory via the memory switch. If the request is a read request, it passes the data returned from the data memory to the requesting processor element. Because the memory switch interface narrows the access interface to the data memories down to a single one, the scale of the memory switch (the number of interfaces for connecting processors) can be reduced to 1/l. In this arrangement the l processor elements contend for access to the data memories, which may become a performance bottleneck.

This problem is mitigated, first, by giving each processor element a program memory dedicated to programs. In an ordinary computer, programs and data are stored in the same memory. In the processor used in the present invention, programs are stored in the program memory dedicated to each processor element, so accesses to memory through the memory switch interface are limited to data accesses, and the access frequency is reduced to, at best, roughly half that of an ordinary computer.

Second, the data cache memory connected to the memory switch interface further reduces the frequency of access to the data memories.
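
The interface-count argument above can be checked with a few lines of arithmetic. The following is a minimal sketch, not part of the patent: the class and function names are hypothetical, n = 16 and m = 32 follow the figures quoted for FIG. 1, l = 8 is an illustrative value, and a full crossbar (one connection point per processor and memory pair) is assumed.

```python
from dataclasses import dataclass


@dataclass(frozen=True)
class Config:
    n: int  # processors attached to the memory switch
    m: int  # data memories attached to the memory switch
    l: int  # processor elements inside each processor

    @property
    def processor_elements(self) -> int:
        # effective degree of parallelism
        return self.n * self.l

    @property
    def connection_points(self) -> int:
        # a full crossbar needs one connection point per (processor, memory) pair
        return self.n * self.m


def flat_equivalent(cfg: Config) -> Config:
    """Same number of processor elements, each wired to the switch directly."""
    return Config(n=cfg.n * cfg.l, m=cfg.m, l=1)


grouped = Config(n=16, m=32, l=8)
flat = flat_equivalent(grouped)

print(grouped.processor_elements, grouped.connection_points)  # 128 512
print(flat.processor_elements, flat.connection_points)        # 128 4096
print(grouped.connection_points / flat.connection_points)     # 0.125, i.e. 1/l
```

Under these assumptions, grouping eight processor elements behind one memory switch interface keeps the switch at 512 connection points, where wiring all 128 elements directly would need 4096.
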
The data cache memory holds data that need not necessarily be kept in the data memories, such as data used in common by the l processor elements (constants, for example) and intermediate results of calculations, and so reduces the number of accesses to the data memories.

Accordingly, when a processor element issues a request to access a data memory, the memory switch interface checks whether the data is already stored in the data cache memory. If it is, the data is read from there, and a request is issued to the data memory only when it is not.

[Description of the embodiment]

An embodiment of the present invention is now described in detail with reference to the drawings.

FIG. 3 is a system configuration diagram showing one embodiment of the present invention, and FIG. 4 is a detailed block diagram of the processor shown in FIG. 3.

Processors PP1' to PP16' are parallel processors each containing eight processor elements PE1 to PE8, and each is capable of executing eight programs in parallel. The number of processors and the number of processor elements in each are not, however, limited to this example.

Each of processors PP1' to PP16' can read and write data in any of data memories DM1 to DM32 via the memory switch MS. FIG. 3 shows 32 data memories, but this number is determined by the number of processors, the performance of the data memories, and how often the data memories are used, and is not limited to this example.

There are many ways of constructing the memory switch MS, the full crossbar being one, and the invention is not limited to any of them. A full crossbar is assumed here as an example, so that even when several processors issue data memory access requests simultaneously, no contention arises unless they access the same data memory. Using a memory switch MS of another construction does not affect the benefit of the present invention.

The control processor has control-only memories CPM1 and CPM2 and can also access data memories DM1 to DM32 via the memory switch MS. There are two control-only memories in this example, but the number is not limited to two. The control processor CP can communicate with each of processors PP1' to PP16' through interface a and the control processor interface CPI' of each processor.

FIG. 4 is a block diagram showing an example of the processor shown in FIG. 3.

Processor elements PE1 to PE8 are each capable of executing a program.
Each program is stored in the dedicated program memory, PM1 to PM8, connected to the corresponding processor element PE1 to PE8.

The memory switch interface MSI' is the control circuit through which processor elements PE1 to PE8 access data memories DM1 to DM32 shown in FIG. 3. When several of processor elements PE1 to PE8 issue access requests at the same time, it selects one of them according to a fixed algorithm and sends the selected request through the memory switch MS to one of data memories DM1 to DM32. For a read operation, it also controls the hand-over of the data returned from the data memory addressed by the request to the requesting processor element.

The data cache memory DC operates in the same way as the cache memory of an ordinary computer.

That is, when one of processor elements PE1 to PE8 requests access to data memories DM1 to DM32, the memory switch interface MSI' examines the contents of the data cache memory DC and, if the requested data is already stored there, reads it from there and passes it to the processor element. If it is not, MSI' issues an access request to data memories DM1 to DM32, hands the data returned from the data memory to the requesting processor element, and also retains it at MSI', in the data cache memory DC, in case the same data is requested again (this later request may come from a different processor element).

In addition, when data is written to data memories DM1 to DM32, the same data is also stored in the data cache memory DC in case it is read again later. General techniques used in the caches of general-purpose computers, such as the algorithm for evicting data from the cache, can be applied. However, since this computer system is in the nature of a special-purpose machine, the cache could instead be controlled by the programs of processor elements PE1 to PE8. For example, the program could specify which data should be kept in the cache and which need not be, or DC could be made an addressable memory rather than a cache (in which case processor elements PE1 to PE8 see it as a memory separate from data memories DM1 to DM32, and what is stored in it is specified entirely by the processor element programs).

The control processor interface CPI' is a circuit for communicating with the control processor CP. It controls communication between each of processor elements PE1 to PE8 and the control processor CP, and between the processor PP1' to PP16' itself and the control processor CP. (In this scheme, what software sees is the individual processor elements PE1 to PE8. Processors PP1' to PP16' have meaning only as physical units of equipment, so communication with the control processor CP is, logically, mainly between the processor elements and the control processor CP.)
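
The read and write paths through the memory switch interface MSI' and the data cache memory DC described above might be modelled as in the following sketch. It is illustrative only: the class and method names are hypothetical, round-robin selection stands in for the unspecified "fixed algorithm", least-recently-used eviction is one of the general cache techniques the text allows but does not name, and the data memories behind the memory switch are reduced to a single dictionary that counts accesses.

```python
from collections import OrderedDict


class DataMemories:
    """Stand-in for DM1..DM32 behind the memory switch; counts accesses."""

    def __init__(self):
        self.cells = {}
        self.accesses = 0

    def read(self, addr):
        self.accesses += 1
        return self.cells.get(addr, 0)

    def write(self, addr, value):
        self.accesses += 1
        self.cells[addr] = value


class MemorySwitchInterface:
    """Sketch of MSI' with its shared data cache DC."""

    def __init__(self, memories, cache_lines=64):
        self.memories = memories
        self.cache = OrderedDict()   # addr -> value, least recently used first
        self.cache_lines = cache_lines
        self.next_pe = 0             # round-robin pointer (an assumption)

    def _fill(self, addr, value):
        self.cache[addr] = value
        self.cache.move_to_end(addr)
        if len(self.cache) > self.cache_lines:
            self.cache.popitem(last=False)   # evict the least recently used entry

    def select(self, requests, num_pes=8):
        """Pick one pending request per access timing; the others wait."""
        for offset in range(num_pes):
            pe = (self.next_pe + offset) % num_pes
            if pe in requests:
                self.next_pe = (pe + 1) % num_pes
                return pe, requests.pop(pe)
        return None

    def read(self, addr):
        if addr in self.cache:               # hit: no data memory access
            self.cache.move_to_end(addr)
            return self.cache[addr]
        value = self.memories.read(addr)     # miss: go through the memory switch
        self._fill(addr, value)
        return value

    def write(self, addr, value):
        self.memories.write(addr, value)     # write to the data memory ...
        self._fill(addr, value)              # ... and keep a copy in DC


dm = DataMemories()
msi = MemorySwitchInterface(dm)
# Three processor elements (indices 0, 3, 5) issue requests at the same time.
pending = {0: ("read", 100), 3: ("read", 100), 5: ("write", 200, 7)}
order = []
while pending:
    pe, req = msi.select(pending)
    order.append(pe)
    if req[0] == "read":
        msi.read(req[1])
    else:
        msi.write(req[1], req[2])

print(order)        # [0, 3, 5]
print(dm.accesses)  # 2
```

In this run the second read of address 100 comes from a different processor element and is served from DC, so only the first read and the write reach the data memories.
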
Examples of this communication are the program execution start instruction START, which instructs each of processor elements PE1 to PE8 to start executing its program, and the program execution stop instruction STOP. On receiving START, processor elements PE1 to PE8 begin executing their programs, and they halt when a predetermined condition is satisfied or when STOP is received. The control processor interface CPI' also controls the transfer of information from processor elements PE1 to PE8 to the control processor CP through interface a. For example, once execution has begun in response to START, it is CPI' that informs the control processor CP when some condition is satisfied, such as a particular processor element among PE1 to PE8 having finished execution.

The configuration of each of processor elements PE1 to PE8 is basically no different from that of an ordinary computer, except that instruction words are read from the corresponding program memory PM1 to PM8. In an ordinary computer, instruction words and data are stored in the same memory. In a parallel processing system using the present invention, instruction words are stored in program memories PM1 to PM8 in order to lighten the load on the access paths to data memories DM1 to DM32. This exploits the fact that data must be exchanged among processor elements PE1 to PE8 and also among processors PP1' to PP16', and therefore has to be stored in common data memories, whereas programs have no such need and can be kept in a memory dedicated to each processor element.

Each of processor elements PE1 to PE8 repeats the cycle of reading data from the data cache memory DC or data memories DM1 to DM32 in accordance with the program stored in its program memory PM1 to PM8, processing it, and returning the result to data memories DM1 to DM32 and the data cache memory DC.

A program is executed on the parallel processing system shown in FIG. 3 as follows.

As an example, take 128 data items each of A_i and B_i (i = 1 to 128).

Consider the case of calculating [Formula] for these data (judging from the steps that follow, the sum of the products A_i × B_i for i = 1 to 128).

Before the calculation starts, the control processor CP loads data A_i and B_i into data memories DM1 to DM32. For example, data A_1 to A_8 are stored in data memory DM1, A_9 to A_16 in DM2, and so on up to A_121 to A_128 in DM16. Similarly, data B_1 to B_8 are stored in data memory DM17, B_9 to B_16 in DM18, and so on up to B_121 to B_128 in DM32.

In this example the system contains 16 (the number of processors) × 8 (the number of processor elements in each processor) = 128 processor elements, and the i-th processor element PE_i calculates A_i × B_i and stores the result C_i in a data memory. The program for this calculation is stored in each of the program memories PM1 to PM8 attached to processor elements PE1 to PE8, and the instruction address register in each processor element is set to the program memory address of the first instruction word that the processor element is to execute. This set-up is performed under the control of the control processor CP, either from data memories DM1 to DM32 through the memory switch MS and the memory switch interface MSI', or through interface a and the control processor interface CPI'.

The control processor CP carries out the above preparations and, when they are complete, sends the program execution start instruction START, addressed to all 128 processor elements, to processors PP1' to PP16' through interface a. Every processor element PE1 to PE8 then reads instruction words from its program memory PM1 to PM8 according to the value of its instruction address register, decodes them, and executes them.

Taking processor element PE1 in processor PP1' as an example, it calculates A_1 × B_1 from data A_1 read from data memory DM1 and data B_1 read from data memory DM17, and stores the result C_1 in a data memory.

Similarly, processor element PE2 calculates A_2 × B_2 and stores the result C_2, and so on, with processor element PE8 performing A_8 × B_8 → C_8. Processor elements PE1 to PE8 carry out these operations concurrently.

In this example all processor elements are assumed to execute the same program, but the programs may differ. Even with the same program, if it contains conditional branches, each processor element may end up executing a different instruction sequence partway through.

Furthermore, access requests from processor elements PE1 to PE8 to data memories DM1 to DM32 (requests to read A_i and B_i or to store C_i) are arbitrated by the memory switch interface MSI'. When they conflict, only one is selected and the others are made to wait. The instruction execution timing of processor elements PE1 to PE8 may therefore drift apart, and the processor elements do not all perform the same operation at the same instant in perfect synchronization.

When a processor element has completed the operation A_i × B_i → C_i, the control processor CP is notified to that effect through the control processor interface CPI' and interface a. The control processor CP waits for completion notifications from all 128 processor elements.
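
The data placement and the per-element work in this example can be traced with a short simulation. This is a sketch under stated assumptions rather than the patent's own procedure: the function names are hypothetical, the loop runs sequentially and only shuffles the completion order to mimic the timing skew caused by arbitration, and the results C_i are kept in a separate dictionary because the text does not say which data memory receives them.

```python
import random

NUM_PROCESSORS = 16      # PP1'..PP16'
PES_PER_PROCESSOR = 8    # PE1..PE8 in each processor
N = NUM_PROCESSORS * PES_PER_PROCESSOR  # 128 processor elements


def memory_for_a(i):
    """Data memory number (1..16) holding A_i, eight items per memory."""
    return (i - 1) // 8 + 1


def memory_for_b(i):
    """Data memory number (17..32) holding B_i, eight items per memory."""
    return (i - 1) // 8 + 17


def run_example(a, b, seed=0):
    # Control processor: load A_i and B_i into data memories DM1..DM32.
    dm = {k: {} for k in range(1, 33)}
    for i in range(1, N + 1):
        dm[memory_for_a(i)][("A", i)] = a[i - 1]
        dm[memory_for_b(i)][("B", i)] = b[i - 1]

    # After START every processor element runs; the order in which they
    # finish is shuffled here as a stand-in for arbitration delays.
    order = list(range(1, N + 1))
    random.Random(seed).shuffle(order)

    c = {}                 # results C_i
    end_received = set()   # END notifications seen by the control processor
    for i in order:
        a_i = dm[memory_for_a(i)][("A", i)]
        b_i = dm[memory_for_b(i)][("B", i)]
        c[i] = a_i * b_i
        end_received.add(i)

    # The control processor proceeds only after all 128 notifications.
    assert len(end_received) == N
    return c


a = list(range(1, N + 1))
b = [2] * N
c = run_example(a, b)
print(memory_for_a(1), memory_for_b(1))      # 1 17
print(memory_for_a(128), memory_for_b(128))  # 16 32
print(c[1], c[128])                          # 2 256
```

Whatever the completion order, the set of results is the same, which is why the control processor only needs to count the completion notifications before moving on.
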

The control processor CP then computes [Formula] (the sum of the results C_i). Since the results C_i are stored in data memories DM1 to DM32, the control processor CP accesses data memories DM1 to DM32 via the memory switch MS and adds the results C_i in the order in which they are read. This operation is the same as addition in an ordinary computer: a program in the control processor CP reads the results C_1, C_2, ..., C_128 one at a time and adds them. When this addition is finished, the required answer has been obtained.

As for notification from processor elements PE1 to PE8 to the control processor CP, each processor element may notify the control processor CP as it finishes, as described above, but the amount of communication with the control processor CP could also be reduced by collecting the notifications within each of processors PP1' to PP16' and sending them together.

Also, as above

[Effect of the invention]

In the parallel processing scheme of the present invention, each of the processors that are connected in parallel to the control processor, and interconnected in parallel with the plurality of data memories via the memory switch, contains a plurality of processor elements operating in parallel instead of a single processor element. Seen from the memory switch, each processor appears to have only a single processor element, yet a plurality of processor elements can be connected to the memory switch by time division, so the degree of parallelism can be increased.

That is, by arranging processors that each contain a plurality of processor elements in parallel and operating them in parallel under the control of the control processor, the scheme makes computation with a high degree of parallelism easy to realize. Because the portions that cannot be computed in parallel are handled by the control processor, flexibility is also increased and the range of applications is widened.

[Brief explanation of the drawing]

FIG. 1 is a system configuration diagram showing a conventional example, FIG. 2 is a detailed block diagram of the processor shown in FIG. 1, FIG. 3 is a system configuration diagram showing one embodiment of the present invention, and FIG. 4 is a detailed block diagram of the processor shown in FIG. 3.

CP: control processor
PP1 to PP16, PP1' to PP16': processors
CPM1, CPM2: control-only memories
MS: memory switch
MM1 to MM32: memories
MSI, MSI': memory switch interface
CPI, CPI': control processor interface
PE, PE1 to PE8: processor elements
DM1 to DM32: data memories
DC: data cache memory
PM1 to PM8: program memories
a: interface

Claims

[Claims]

1. A parallel processing scheme comprising a control processor, a plurality of data memories each storing data, a plurality of processors connected in parallel to the control processor, and a memory switch for interconnecting the plurality of processors and the plurality of data memories in parallel, wherein each of the plurality of processors comprises: a plurality of processor elements provided in parallel; a plurality of program memories, provided in correspondence with the respective processor elements, that store programs; a control processor interface for connecting the plurality of processor elements to the control processor; a memory switch interface for connecting the plurality of processor elements to the memory switch; and a data cache memory, connected to the memory switch interface, that stores a copy of part of the data stored in the data memories.