JP2004334297A

JP2004334297A - Parallel operation processor and parallel operation processing method

Info

Publication number: JP2004334297A
Application number: JP2003125408A
Authority: JP
Inventors: Isamu Kozuka; 勇小塚; Shiro Kobayashi; 士朗小林
Original assignee: Asahi Kasei Corp
Current assignee: Asahi Kasei Corp
Priority date: 2003-04-30
Filing date: 2003-04-30
Publication date: 2004-11-25

Abstract

PROBLEM TO BE SOLVED: To provide a parallel operation processor suitable for reducing the number of processing steps and simplifying programs while also reducing production cost. SOLUTION: This SIMD type processor 100 comprises four PEs (PE0 to PE3) and a controller 110 that decodes an instruction code to control the operation of each of PE0 to PE3. Each of PE0 to PE3 has a flag bit storage part 14 for storing a flag bit, and a calculation part 12 that selects the data to be operated on by that PE from the operation target operands on the basis of the flag bit in the flag bit storage part 14. COPYRIGHT: (C)2005, JPO&NCIPI

Description

【０００１】
【発明の属する技術分野】
本発明は、単一の命令コードに基づいて複数のデータを並列に処理する装置および方法に係り、特に、処理ステップ数の低減およびプログラムの簡素化を図るとともに製造コストを低減するのに好適な並列演算処理装置および並列演算処理方法に関する。
【０００２】
【従来の技術】
ＳＩＭＤ（ＳｉｎｇｌｅＩｎｓｔｒｕｃｔｉｏｎｓｔｒｅａｍＭｕｌｔｉｐｌｅＤａｔａｓｔｒｅａｍ）型プロセッサは、複数のプロセッシング・エレメント（ＰｒｏｃｅｓｓｉｎｇＥｌｅｍｅｎｔ）（以下、ＰＥと略記する。）を備え、演算対象となる複数のデータを同一の命令で並列に処理するプロセッサである。そのため、条件分岐命令のように各ＰＥごとに異なる処理を単一の命令では実行することができない。例えば、条件が真なら第１の値を用いて演算を行い、条件が偽なら第２の値を用いて演算を行うということを各ＰＥごとに独立に行うためには、マスク演算を利用して４つの処理を経なければならない。
【０００３】
従来のＳＩＭＤ型プロセッサの構成を図６を参照しながら説明する。なお、説明の理解を容易にするため、以下、ＰＥが４つの場合の構成を例にとって説明する。実際には、さらに複数のＰＥを備えているものもある。
図６は、従来のＳＩＭＤ型プロセッサ２００の構成を示すブロック図である。
ＳＩＭＤ型プロセッサ２００は、図６に示すように、４つのＰＥ０〜３と、命令コードをデコードして各ＰＥ０〜３の動作を制御する制御装置２１０とで構成されており、データを転送するための信号線であるバス３９でＤ−ＲＯＭ３２ａ、Ｐ−ＲＯＭ３２ｂ、Ｄ−ＲＡＭ３４ａおよびＰ−ＲＡＭ３４ｂと相互にかつデータ授受可能に接続されている。そして、単一の命令コードに基づいて、演算対象となる複数のデータ（以下、演算対象オペランドという。）を読み込み、読み込んだ演算対象オペランドの各データをＰＥ０〜３で並列に処理するようになっている。
【０００４】
各ＰＥ０〜３は、複数のレジスタと、マスク演算に必要なマスク演算回路とを有し、制御装置２１０の制御により、Ｄ−ＲＡＭ３４ａのデータをレジスタに読み込み、レジスタのデータに対してマスク演算その他の演算を行い、演算結果をレジスタに格納するようになっている。
マスク演算回路は、マスクベクトル（後述）を生成するマスクベクトル生成回路と、論理積（ＡＮＤ）演算を行うＡＮＤ回路と、論理和（ＯＲ）演算を行うＯＲ回路とで構成されている。
【０００５】
各ＰＥ０〜３ごとに異なる処理を実行する場合を図７ないし図１３を参照しながら説明する。
図７は、プロセッサを用いて処理させるための演算処理プログラムの例である。
図７の演算処理プログラムの目的は、各々Ｎ（＝８０）個の要素をもつデータＡ，Ｂ，Ｘ，Ｙをもとに、要素を特定する添字「ｉ」ごとに、所定条件の成立の有無に応じて演算対象となるデータを選択し、データＺに格納することである。
【０００６】
図７の演算処理プログラムのデータと処理は、すべて添字「ｉ」に対して相関をもたない。すなわち、異なる「ｉ」ごとに独立して演算処理を行うことが可能であり、図７の演算処理プログラムは、Ｎ回の繰り返しを伴う処理をＭ個の並列プロセッサを用いることで、ＮがＭの倍数の場合、繰り返しを伴う処理をＮ／Ｍ回に低減させることが可能である（それ以外ではＮ／Ｍ＋１回となる。以下、単純化のためＮはプロセッサ並列度の倍数と仮定する。）。
【０００７】
図８は、図７の演算処理プログラムを４並列、すなわちＰＥを４個もつＳＩＭＤプロセッサ上で処理する際に、その過程を実際のＳＩＭＤプロセッサ上での処理に即したかたちで書き直したものである。
まず、図８においては、ループカウンタ「ｉ」による繰り返しは、４個のＰＥで並列処理されることから、ループカウンタ「ｉ」の増分ステップは、「１」から「４」となり、図７中で「ｉ＋＋」と記述されている部分が「ｉ＋＝４」と変更される。これは、並列数４のプロセッサを用いることで繰り返しを伴う処理がＮ／４回に減少することに相当する。
【０００８】
ここで、図８中で宣言されている変数「ｊ」に関連した処理はＳＩＭＤプロセッサ上では、各々の演算が「ｊ」番目のＰＥで処理されることを表しており、繰り返しのためのループカウンタではない。また、配列変数「ａ」、「ｂ］、「ｘ」、［ｙ」、「ｚ」は、ＰＥが取り扱うレジスタに相当する。ａ［０］、ｂ［０］、ｘ［０］、ｙ［０］、ｚ［０］は、ＰＥ０に属するレジスタを表し、ａ［１］、ｂ［１］、ｘ［１］、ｙ［１］、ｚ［１］は、ＰＥ１に属するレジスタを表し、ａ［２］、ｂ［２］、ｘ［２］、ｙ［２］、ｚ［２］は、ＰＥ２に属するレジスタを表し、ａ［３］、ｂ［３］、ｘ［３］、ｙ［３］、ｚ［３］は、ＰＥ３に属するレジスタを表す。
【０００９】
ステップＳ１００は、Ｄ−ＲＡＭ３４ａのデータを各ＰＥ０〜３のレジスタに格納するロード処理相当部であり、ステップＳ１０２は、所定条件の成立の有無に応じて演算対象となるデータを選択する演算処理相当部であり、ステップＳ１０４は、各ＰＥ０〜３のレジスタの内容をＤ−ＲＡＭ３４ａに格納するストア処理相当部である。
【００１０】
次に、ステップＳ１００のロード処理相当部について詳細に説明する。
ステップＳ１００のロード処理相当部は、ループカウンタ変数「ｊ」によって繰り返し処理が行われるのではなく、実際には「ｊ」番目のＰＥに処理が割り振られ並列に動作する。すなわち、ステップＳ２００の処理は、「ａ［ｊ］＝Ａ［ｉ＋ｊ］」という処理がｊ＝０〜３の４回繰り返して処理が行われるのではなく、「ａ［０］＝Ａ［ｉ＋０］」、「ａ［１］＝Ａ［ｉ＋１］」、「ａ［２］＝Ａ［ｉ＋２］」および「ａ［３］＝Ａ［ｉ＋３］」という処理が各ＰＥ０〜３で同時並行して実行される。
【００１１】
同様に、ステップＳ２０２，Ｓ２０４，Ｓ２０６についても各ＰＥ０〜３で同時並列処理される。このように、Ｓ２００，Ｓ２０２，Ｓ２０４，Ｓ２０６のような単一の命令で各ＰＥ０〜３が複数のデータを取り扱うのがＳＩＭＤプロセッサの基本原理である。
次に、ステップＳ１０２の処理の前にステップＳ４００のストア処理相当部について説明する。
【００１２】
ステップＳ４００の処理もステップＳ１００のロード処理相当部と同じように、ループカウンタ変数「ｊ」によって繰り返し処理が行われるのではなく、実際には「ｊ」番目のＰＥに処理が割り振られ並列に動作する。すなわち、「Ｚ［ｉ＋ｊ］＝ｚ［ｊ］」という処理がｊ＝０〜３の４回繰り返して処理が行われるのではなく、「Ｚ［ｉ＋０］＝ｚ［０］」、「Ｚ［ｉ＋１］＝ｚ［１］」、「Ｚ［ｉ＋２］＝ｚ［２］」および「Ｚ［ｉ＋３］＝ｚ［３］」という処理が各ＰＥ０〜３で同時並行して実行される。
【００１３】
戻って、ステップＳ１０２の処理について詳細に説明する。
ステップＳ１０２の処理内容も各々のデータ間に相関がないため、ステップＳ１００やステップＳ１０４と同じように一見するとＳＩＭＤプロセッサによる処理が可能なように見受けられるが、ステップＳ３００の処理結果に応じてステップＳ３０２を実行しなければならないＰＥと、ステップＳ３０４を実行しなければならないＰＥが現れるため、そのままでは単一の命令を旨とするＳＩＭＤプロセッサでの処理は行えない。
【００１４】
この問題を解決するために従来のＳＩＭＤプロセッサでは、マスク演算処理と呼ばれる技法を用いて処理をしていた。
ステップＳ１０２の処理をマスク演算を用いて記述し直したものを図１３に示し、その詳細について説明する。
図９は、ＰＥ０〜３でマスクベクトルを生成する場合を示す図である。
【００１５】
マスク演算のうち第１の処理では、図９に示すように、各ＰＥ０〜３ごとに、条件の成立の有無に応じたマスクベクトルを生成する。マスクベクトルとは、変数ｘ［］，ｙ［］，ｚ［］のデータ長に応じたビット数を有し、それらすべてのビットが「１」または「０」となるデータである。各ＰＥ０〜３では、変数ａ［ｊ］の値が変数ｂ［ｊ］の値よりも小さいときは、条件が真であるとして、すべてのビットが「１」となるマスクベクトルを生成する。また、変数ａ［ｊ］の値が変数ｂ［ｊ］の値以上であるときは、条件が偽であるとして、すべてのビットが「０」となるマスクベクトルを生成する。図９の例では、ＰＥ０、ＰＥ１およびＰＥ３については条件が真となっているので、すべてのビットが「１」となるマスクベクトルが生成されるのに対して、ＰＥ２については条件が偽となっているので、すべてのビットが「０」となるマスクベクトルが生成される。
【００１６】
図１０は、ＰＥ０〜３で変数ｘ［ｊ］の値とマスクベクトルとを論理積演算する場合を示す図である。
マスク演算のうち第２の処理では、図１０に示すように、各ＰＥ０〜３ごとに、条件が真の場合にＰＥ０〜３の各レジスタに代入する値である変数ｘ［ｊ］の値と、第１の処理で生成したマスクベクトルとの論理積をとる。図１０の例では、ＰＥ０、ＰＥ１およびＰＥ３についてはマスクベクトルが「１」となっているので、論理積演算の結果として変数ｘ［ｊ］の値が出力されるのに対して、ＰＥ２についてはマスクベクトルが「０」となっているので、論理積演算の結果としてマスクベクトルの値「０」が出力される。
【００１７】
図１１は、ＰＥ０〜３で変数ｙ［ｊ］の値とマスクベクトルの反転値とを論理積演算する場合を示す図である。
マスク演算のうち第３の処理では、図１１に示すように、各ＰＥ０〜３ごとに、条件が偽の場合にＰＥ０〜３の各レジスタに代入する値である変数ｙ［ｊ］の値と、第１の処理で生成したマスクベクトルの反転値との論理積をとる。図１１の例では、ＰＥ０、ＰＥ１およびＰＥ３についてはマスクベクトルの反転値が「０」となるので、論理積演算の結果としてマスクベクトルの反転値「０」が出力されるのに対して、ＰＥ２についてはマスクベクトルの反転値が「１」となるので、論理積演算の結果として変数ｙ［ｊ］の値が出力される。
【００１８】
図１２は、ＰＥ０〜３で第２の処理の演算結果と第３の処理の演算結果とを論理和演算する場合を示す図である。
マスク演算のうち第４の処理では、図１２に示すように、各ＰＥ０〜３ごとに、第２の処理の演算結果と第３の処理の演算結果との論理和をとる。図１２の例では、ＰＥ０、ＰＥ１およびＰＥ３については、第２の処理の演算結果が変数ｘ［ｊ］の値であり、第３の処理の演算結果がマスクベクトルの反転値「０」であるので、論理和演算の結果として変数ｘ［ｊ］の値が出力されるのに対して、ＰＥ２については、第２の処理の演算結果がマスクベクトル「０」であり、第３の処理の演算結果が変数ｙ［ｊ］の値であるので、論理和演算の結果として変数ｙ［ｊ］の値が出力される。
【００１９】
したがって、ステップＳ１０２のデータ演算処理をマスク演算を用いた処理として書き換えると、図１３に示すようになる。
図１３は、マスク演算を用いたデータ演算処理を示すプログラムである。
データ演算処理は、ステップＳ１０２において実行されると、図１３に示すように、まず、配列型の変数ｔ［４］を確保し、ループカウンタ変数ｊに「０」を設定し、ステップＳ３１０に移行するようになっている。
【００２０】
ステップＳ３１０では、変数ａ［ｊ］の値が変数ｂ［ｊ］の値よりも小さいときは、すべてのビットが「１」となるマスクベクトルを変数ｔ［ｊ］に設定し、変数ａ［ｊ］の値が変数ｂ［ｊ］の値以上であるときは、すべてのビットが「０」となるマスクベクトルを変数ｔ［ｊ］に設定し、ステップＳ３１２に移行する。なお、図１３においては、すべてのビットが「１」となるマスクベクトルを補数表現しているため「−１」と表記している。
【００２１】
ステップＳ３１２では、変数ｘ［ｊ］の値と変数ｔ［ｊ］の値との論理積をとったものを変数ｘ［ｊ］の新たな値として設定し、ステップＳ３１４に移行して、変数ｙ［ｊ］の値と変数ｔ［ｊ］の反転値との論理積をとったものを変数ｙ［ｊ］の新たな値として設定し、ステップＳ３１６に移行して、変数ｘ［ｊ］の値と変数ｙ［ｊ］の値との論理和をとったものを変数ｚ［ｊ］の値として設定する。
【００２２】
なお、従来のＳＩＭＤ型プロセッサについては、例えば、非特許文献１に開示されている技術が広く知られている。
【００２３】
【非特許文献１】
「インテル（Ｒ）アーキテクチャ最適化リファレンスマニュアル」５−２２ページ「条件付移動」、ｈｔｔｐ：／／ｗｗｗ．ｉｎｔｅｌ．ｃｏ．ｊｐ／ｊｐ／ｄｅｖｅｌｏｐｅｒ／ｄｏｗｎｌｏａｄ／ｉｎｄｅｘ．ｈｔｍ、資料番号：７３０７９５Ｊ−００１
【００２４】
【発明が解決しようとする課題】
このように、従来のＳＩＭＤ型プロセッサ２００では、各ＰＥ０〜３ごとに条件の成立の有無に応じて演算対象となるデータを選択する処理を実行する場合に、マスク演算として４つの処理を経なければならないため、プログラムとしてマスクベクトル生成命令語および論理演算命令語を用意し、さらに、ＳＩＭＤ型プロセッサ２００にマスク演算回路を実装する必要がある。したがって、処理ステップ数が大きくなるとともにプログラムが煩雑となり、マスク演算回路の実装にコストを要するという問題があった。
【００２５】
そこで、本発明は、このような従来の技術の有する未解決の課題に着目してなされたものであって、処理ステップ数の低減およびプログラムの簡素化を図るとともに製造コストを低減するのに好適な並列演算処理装置および並列演算処理方法を提供することを目的としている。
【００２６】
【課題を解決するための手段】
上記目的を達成するために、本発明に係る請求項１記載の並列演算処理装置は、複数の演算器を備え、単一の命令コードに基づいて、前記複数の演算器を並列に動作させて複数のデータを並列に処理する装置であって、前記複数のデータは、前記各演算器に対応したサブセットを含み、前記各演算器は、フラグ情報を記憶するためのフラグ情報記憶手段と、前記フラグ情報記憶手段のフラグ情報に基づいて当該演算器の演算対象となるデータを当該演算器に対応するサブセットのなかから選択するデータ選択手段とを有する。
【００２７】
このような構成であれば、各演算器についてのフラグ情報が各フラグ情報記憶手段に記憶され、単一の命令コードが与えられると、各演算器では、データ選択手段により、フラグ情報記憶手段のフラグ情報に基づいてその演算器の演算対象となるデータがその演算器に対応するサブセットのなかから選択される。
これにより、条件の成立の有無を示すフラグ情報を各フラグ情報記憶手段に設定するだけで、各演算器ごとに、条件の成立の有無に応じて演算対象となるデータを単一の命令コードで選択することができる。
【００２８】
ここで、フラグ情報は、演算器の演算対象となるデータをサブセットのなかから選択するのに必要な情報であって、例えば、「０」または「１」の２値の状態を取り得る情報であってもよいし、より多値の状態を取り得る情報であってもよい。以下、請求項３記載の並列演算処理方法において同じである。
さらに、本発明に係る請求項２記載の並列演算処理装置は、請求項１記載の並列演算処理装置において、前記データ選択手段は、前記フラグ情報が第１の状態であるときは、前記複数のデータのうち当該演算器に対応する第１の位置のデータを取得し、前記フラグ情報が第２の状態であるときは、前記複数のデータのうち当該演算器に対応する第２の位置のデータを取得するようになっている。
【００２９】
このような構成であれば、各演算器では、設定されたフラグ情報が第１の状態であると、データ選択手段により、複数のデータのうちその演算器に対応する第１の位置のデータが取得される。また、設定されたフラグ情報が第２の状態であると、データ選択手段により、複数のデータのうちその演算器に対応する第２の位置のデータが取得される。
【００３０】
一方、上記目的を達成するために、本発明に係る請求項３記載の並列演算処理方法は、単一の命令コードに基づいて、複数の演算器を並列に動作させて複数のデータを並列に処理する方法であって、前記複数のデータは、前記各演算器に対応したサブセットを含み、前記各演算器ごとにフラグ情報を記憶するフラグ情報記憶ステップと、前記各演算器ごとに、前記フラグ情報記憶ステップで記憶したフラグ情報のうち当該演算器に対応するものに基づいて、当該演算器の演算対象となるデータを当該演算器に対応するサブセットのなかから選択するデータ選択ステップとを含む。
【００３１】
さらに、本発明に係る請求項４記載の並列演算処理方法は、請求項３記載の並列演算処理方法において、前記データ選択ステップは、前記各演算器ごとに、当該演算器に対応するフラグ情報が第１の状態であるときは、前記複数のデータのうち当該演算器に対応する第１の位置のデータを取得し、当該演算器に対応するフラグ情報が第２の状態であるときは、前記複数のデータのうち当該演算器に対応する第２の位置のデータを取得する。
【００３２】
【発明の実施の形態】
以下、本発明の実施の形態を図面を参照しながら説明する。図１ないし図５は、本発明に係る並列演算処理装置および並列演算処理方法の実施の形態を示す図である。
本実施の形態は、本発明に係る並列演算処理装置および並列演算処理方法を、図１に示すように、条件の成立の有無に応じて演算対象となるデータを選択する処理をＳＩＭＤ型プロセッサ１００に実行させる場合について適用したものである。
【００３３】
まず、本発明に係るＳＩＭＤ型プロセッサ１００の構成を図１を参照しながら説明する。なお、発明の理解を容易にするため、以下、ＰＥが４つの場合の構成を例にとって説明する。実際には、さらに複数のＰＥを備えていてもよい。
図１は、本発明に係るＳＩＭＤ型プロセッサ１００の構成を示すブロック図である。
【００３４】
ＳＩＭＤ型プロセッサ１００は、図１に示すように、４つのＰＥ０〜３と、命令コードをデコードして各ＰＥ０〜３の動作を制御する制御装置１１０とで構成されており、データを転送するための信号線であるバス３９でＤ−ＲＯＭ３２ａ、Ｐ−ＲＯＭ３２ｂ、Ｄ−ＲＡＭ３４ａおよびＰ−ＲＡＭ３４ｂと相互にかつデータ授受可能に接続されている。そして、単一の命令コードに基づいて、演算対象オペランドを読み込み、読み込んだ演算対象オペランドの各データをＰＥ０〜３で並列に処理するようになっている。
【００３５】
次に、ＰＥ０の構成を図２を参照しながら詳細に説明する。なお、各ＰＥ０〜３は、いずれも同一機能を有して構成されているため、以下、ＰＥ０の構成について説明し、ＰＥ１〜３の構成については説明を省略する。
図２は、ＰＥ０の構成を示すブロック図である。
ＰＥ０は、図２に示すように、複数のレジスタ１０と、レジスタ１０のデータに対して演算を行う演算部１２と、フラグビットを記憶するためのフラグビット記憶部１４とで構成されている。
【００３６】
演算部１２は、制御装置１１０の制御により、Ｄ−ＲＡＭ３４ａのデータをレジスタ１０に読み込み、レジスタ１０のデータに対して演算を行い、演算結果をレジスタ１０に格納するようになっている。また、フラグビットが「０」であるときは、演算対象オペランドのうちＰＥ０に対応する第１の位置のデータをレジスタ１０に読み込み、フラグビットが「１」であるときは、演算対象オペランドのうちＰＥ０に対応する第２の位置のデータをレジスタ１０に読み込むようになっている。
【００３７】
次に、ＳＩＭＤ型プロセッサ１００に実行させるための処理を図３を参照しながら説明する。
図３は、図８の演算処理のうち、ステップＳ１０２の部分を本発明によるＳＩＭＤプロセッサに実装する際の動作を表したものである。
図８の演算処理のうち、ステップＳ１００，Ｓ１０４については従来のＳＩＭＤプロセッサと同じように処理が行われる。
【００３８】
図３に表したデータ処理の特徴は、各ＰＥ内にプログラム上では、＿Ｂｏｏｌ型（Ｃ言語では、ＩＳＯ／ＩＥＣ９８９９：１９９９から導入された整数型であり、真偽値のみを表す）に相当するフラグを設け、条件判断の結果を格納することにある。
ここで、図８中で宣言されている変数「ｊ」に関連した処理は従来のＳＩＭＤプロセッサに関する説明で用いた意味と同じであり、各々の演算が「ｊ］番目のＰＥで処理されることを表しており、繰り返しのためのループカウンタではない。プログラム上の変数についても同様で、配列変数「ａ」、「ｂ」、「ｘ」、「ｙ」、「ｚ」は、ＰＥ取り扱うレジスタに相当する。ａ［０］、ｂ［０］、ｘ［０］、ｙ［０］、ｚ［０］は、ＰＥ０に属するレジスタを表し、ａ［１］、ｂ［１］、ｘ［１］、ｙ［１］、ｚ［１］は、ＰＥ１に属するレジスタを表し、ａ［２］、ｂ［２］、ｘ［２］、ｙ［２］、ｚ［２］は、ＰＥ２に属するレジスタを表し、ａ［３］、ｂ［３］、ｘ［３］、ｙ［３］、ｚ［３］は、ＰＥ３に属するレジスタを表す。加えて、「ｔ」は各ＰＥに含まれるフラグを表し、ｔ［０］はＰＥ０に属するフラグ、ｔ［１］はＰＥ１に属するフラグ、ｔ［２］はＰＥ２に属するフラグ、ｔ［３］はＰＥ３に属するフラグを意味する。
【００３９】
図３中ステップＳ３０２では、各ＰＥ０〜３がａ［ｊ］とｂ［ｊ］に相当するレジスタの値を比較し、ａ［ｊ］の値がｂ［ｊ］の値より小さい場合にはフラグｔ［ｊ］に真値「ｔｒｕｅ」を設定し、ａ［ｊ］の値がｂ［ｊ］の値以上である場合にはフラグｔ［ｊ］に偽値「ｆａｌｓｅ」を設定するという動作を同時並列に行う。
図４は、ステップＳ３２０に対応する本発明のＳＩＭＤプロセッサの動作とデータの流れを表したものである。
【００４０】
図４中では、演算結果のフラグ値として、ｔ［０］＝１、ｔ［１］＝１、ｔ［２］＝０、ｔ［３］＝１となる場合を表している。
次に、図３のステップＳ３２２では、各ＰＥ０〜３がステップＳ３２０で設定されたフラグ値ｔ［ｊ］に応じてｔ［ｊ］が真値であるときは、ｘ［ｊ］の値を演算結果レジスタｚ［ｊ］に代入し、ｔ［ｊ］が偽値であるときは、ｙ［ｊ］の値を演算結果レジスタｚ［ｊ］に代入する動作を同時並列に行う。
【００４１】
図５は、ステップＳ３２２に対応する本発明のＳＩＭＤプロセッサの動作とデータの流れを表したものである。
各ＰＥ０〜３では、フラグビットが設定され、単一の命令コードが与えられると、図５に示すように、演算部１２により、フラグビットが「１」であるときは、演算対象オペランドのうち変数ｘ［ｊ］の値がレジスタ１０に設定される。これに対して、フラグビットが「０」であるときは、演算対象オペランドのうち変数ｙ［ｊ］の値がレジスタ１０に設定される。図５の例では、ＰＥ０、ＰＥ１およびＰＥ３についてはフラグビットが「１」なので、それらＰＥの各レジスタ１０に変数ｘ［ｊ］の値が設定されるのに対して、ＰＥ２についてはフラグビットが「０」なので、ＰＥ２のレジスタ１０に変数ｙ［ｊ］の値が設定される。
【００４２】
したがって、従来のＳＩＭＤ型プロセッサ２００と比較した場合、従来のＳＩＭＤ型プロセッサ２００は、図１３に示すように、１つのＰＥにつきステップＳ３１０〜Ｓ３１６の４つの処理を要していたのに対して、本発明に係るＳＩＭＤ型プロセッサ１００は、図３に示すように、１つのＰＥにつきステップＳ３２０，Ｓ３２２の２つの処理で足りる。
【００４３】
このようにして、本実施の形態では、各ＰＥ０〜３は、フラグビットを記憶するためのフラグビット記憶部１４と、フラグビット記憶部１４のフラグビットに基づいてそのＰＥの演算対象となるデータを演算対象オペランドのなかから選択する演算部１２とを備える。
これにより、条件の成立の有無を示すフラグビットをフラグビット記憶部１４に設定するだけで、条件の成立の有無に応じて演算対象となるデータを単一の命令コードで選択することができる。したがって、従来に比して、処理ステップ数を低減することができるとともにプログラムを簡素化することができる。また、ＳＩＭＤ型プロセッサ１００にマスク演算回路を実装する必要がないので、従来に比して、製造コストを比較的低減することができる。
【００４４】
さらに、本実施の形態では、演算部１２は、フラグビットが「０」であるときは、演算対象オペランドのうちＰＥ０に対応する第１の位置のデータをレジスタ１０に読み込み、フラグビットが「１」であるときは、演算対象オペランドのうちＰＥ０に対応する第２の位置のデータをレジスタ１０に読み込むようになっている。
【００４５】
これにより、条件が真なら第１の値を用いて演算を行い、条件が偽なら第２の値を用いて演算を行うという条件分岐処理を単一の命令コードで実現することができる。
上記実施の形態において、ＰＥ０〜３は、請求項１ないし４記載の演算器に対応し、フラグビット記憶部１４は、請求項１記載のフラグ情報記憶手段に対応し、演算部１２は、請求項１または２記載のデータ選択手段に対応している。
【００４６】
なお、上記実施の形態においては、４つのＰＥを設けて構成したが、これに限らず、２つまたは３つのＰＥを設けて構成することもできるし、５つ以上のＰＥを設けて構成することもできる。
また、上記実施の形態においては、本発明に係る並列演算処理装置および並列演算処理方法を、図１に示すように、条件の成立の有無に応じて演算対象となるデータを選択する処理をＳＩＭＤ型プロセッサ１００に実行させる場合について適用したが、これに限らず、本発明の主旨を逸脱しない範囲で他の場合にも適用可能である。
【００４７】
【発明の効果】
以上説明したように、本発明に係る請求項１または２記載の並列演算処理装置によれば、条件の成立の有無を示すフラグ情報をフラグ情報記憶手段に設定するだけで、条件の成立の有無に応じて演算対象となるデータを単一の命令コードで選択することができる。したがって、従来に比して、処理ステップ数を低減することができるとともにプログラムを簡素化することができるという効果が得られる。また、マスク演算回路を実装する必要がないので、従来に比して、製造コストを比較的低減することができるという効果も得られる。
【００４８】
さらに、本発明に係る請求項２記載の並列演算処理装置によれば、条件が真なら第１の値を用いて演算を行い、条件が偽なら第２の値を用いて演算を行うという条件分岐処理を単一の命令コードで実現することができるという効果も得られる。
一方、本発明に係る請求項３または４記載の並列演算処理方法によれば、請求項１の並列演算処理装置と同等の効果が得られる。
【００４９】
さらに、本発明に係る請求項４記載の並列演算処理方法によれば、請求項２の並列演算処理装置と同等の効果も得られる。
【図面の簡単な説明】
【図１】本発明に係るＳＩＭＤ型プロセッサ１００の構成を示すブロック図である。
【図２】ＰＥ０の構成を示すブロック図である。
【図３】図８の演算処理のうち、ステップＳ１０２の部分を本発明によるＳＩＭＤプロセッサに実装する際の動作を表したものである。
【図４】ステップＳ３２０に対応する本発明のＳＩＭＤプロセッサの動作とデータの流れを表したものである。
【図５】ステップＳ３２２に対応する本発明のＳＩＭＤプロセッサの動作とデータの流れを表したものである。
【図６】従来のＳＩＭＤ型プロセッサ２００の構成を示すブロック図である。
【図７】プロセッサを用いて処理させるための演算処理プログラムの例である。
【図８】図７の演算処理プログラムを４並列、すなわちＰＥを４個もつＳＩＭＤプロセッサ上で処理する際に、その過程を実際のＳＩＭＤプロセッサ上での処理に即したかたちで書き直したものである。
【図９】ＰＥ０〜３でマスクベクトルを生成する場合を示す図である。
【図１０】ＰＥ０〜３で変数ｘ［ｊ］の値とマスクベクトルとを論理積演算する場合を示す図である。
【図１１】ＰＥ０〜３で変数ｙ［ｊ］の値とマスクベクトルの反転値とを論理積演算する場合を示す図である。
【図１２】ＰＥ０〜３で第２の処理の演算結果と第３の処理の演算結果とを論理和演算する場合を示す図である。
【図１３】マスク演算を用いたデータ演算処理を示すプログラムである。
【符号の説明】
１０レジスタ
１２演算部
１４フラグビット記憶部
３２ａＤ−ＲＯＭ
３２ｂＰ−ＲＯＭ
３４ａＤ−ＲＡＭ
３４ｂＰ−ＲＡＭ
１００，２００ＳＩＭＤ型プロセッサ
１１０，２１０制御装置
ＰＥ０〜３プロセッシング・エレメント
[0001]
TECHNICAL FIELD OF THE INVENTION
The present invention relates to an apparatus and a method for processing a plurality of data in parallel based on a single instruction code, and in particular to a parallel operation processing device and a parallel operation processing method suitable for reducing the number of processing steps, simplifying programs, and reducing manufacturing cost.
[0002]
[Prior art]
A SIMD (Single Instruction stream, Multiple Data stream) type processor includes a plurality of processing elements (hereinafter abbreviated as PEs) and processes a plurality of data to be operated on in parallel with the same instruction. Consequently, processing that differs from PE to PE, as with a conditional branch instruction, cannot be executed by a single instruction. For example, to perform an operation using a first value if a condition is true and using a second value if the condition is false, independently for each PE, four processes must be carried out using a mask operation.
[0003]
The configuration of a conventional SIMD type processor will be described with reference to FIG. 6. To make the description easier to follow, a configuration with four PEs is used as an example below. In practice, some processors have more PEs.
FIG. 6 is a block diagram showing a configuration of a conventional SIMD type processor 200.
As shown in FIG. 6, the SIMD type processor 200 includes four PEs (PE0 to PE3) and a control device 210 that decodes an instruction code and controls the operation of each of PE0 to PE3. It is connected to the D-ROM 32a, the P-ROM 32b, the D-RAM 34a, and the P-RAM 34b via a bus 39, a signal line for transferring data, so that data can be exchanged among them. Based on a single instruction code, the processor reads a plurality of data to be operated on (hereinafter referred to as the operand to be operated on) and processes each data item of the read operand in parallel in PE0 to PE3.
[0004]
Each of PE0 to PE3 has a plurality of registers and a mask operation circuit required for mask operations. Under the control of the control device 210, each PE reads data from the D-RAM 34a into a register, performs mask operations and other operations on the register data, and stores the result in a register.
The mask operation circuit includes a mask vector generation circuit that generates a mask vector (described later), an AND circuit that performs a logical product (AND) operation, and an OR circuit that performs a logical sum (OR) operation.
[0005]
The case where different processing is executed for each of PEs 0 to 3 will be described with reference to FIGS. 7 to 13.
FIG. 7 is an example of an arithmetic processing program for processing using a processor.
The purpose of the arithmetic processing program in FIG. 7 is, based on data A, B, X, and Y each having N (= 80) elements, to select, for each subscript "i" identifying an element, the data to be operated on according to whether a predetermined condition is satisfied, and to store the selected data in data Z.
[0006]
None of the data or processing in the arithmetic processing program of FIG. 7 has any dependence across the subscript "i". That is, the processing can be performed independently for each "i". By running the N-iteration loop of FIG. 7 on M parallel processors, the number of iterations can be reduced to N/M when N is a multiple of M (otherwise it is N/M + 1, with N/M taken as the integer quotient; hereinafter, for simplicity, N is assumed to be a multiple of the processor parallelism).
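FIG. 7 itself is not reproduced in this text, so the following is only a rough sketch of the kind of scalar loop the description implies. The function names `select_scalar` and `iterations_needed`, the `int` element type, and the use of `A[i] < B[i]` as the condition (taken from the comparison described later for the mask vector) are assumptions made for illustration.

```c
#include <assert.h>
#include <stddef.h>

/* Hypothetical scalar reference: for each i, pick X[i] when A[i] < B[i],
   otherwise Y[i], and store the choice in Z[i]. No iteration depends on
   any other, which is what makes the loop a candidate for SIMD. */
void select_scalar(const int *A, const int *B, const int *X, const int *Y,
                   int *Z, size_t n)
{
    for (size_t i = 0; i < n; i++) {
        if (A[i] < B[i])
            Z[i] = X[i];
        else
            Z[i] = Y[i];
    }
}

/* Loop iterations needed when m PEs each handle one element per pass:
   n / m when n is a multiple of m, otherwise n / m + 1 (integer division). */
size_t iterations_needed(size_t n, size_t m)
{
    assert(m > 0);
    return n / m + (n % m != 0);
}
```

With N = 80 and four PEs, `iterations_needed(80, 4)` gives 20 passes instead of 80.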
[0007]
FIG. 8 shows the arithmetic processing program of FIG. 7 rewritten to reflect how it is actually processed on a four-way parallel SIMD processor, that is, a SIMD processor having four PEs.
First, in FIG. 8, the iterations over the loop counter "i" are processed in parallel by the four PEs, so the increment step of "i" changes from "1" to "4", and the part written as "i++" in FIG. 7 is changed to "i += 4". This corresponds to the number of iterations being reduced to N/4 by using a processor with a parallelism of four.
[0008]
Here, the processing related to the variable "j" declared in FIG. 8 indicates that, on the SIMD processor, each operation is processed by the "j"-th PE; "j" is not a loop counter for iteration. The array variables "a", "b", "x", "y", and "z" correspond to registers handled by the PEs. a[0], b[0], x[0], y[0], and z[0] represent registers belonging to PE0; a[1], b[1], x[1], y[1], and z[1] represent registers belonging to PE1; a[2], b[2], x[2], y[2], and z[2] represent registers belonging to PE2; and a[3], b[3], x[3], y[3], and z[3] represent registers belonging to PE3.
[0009]
Step S100 corresponds to load processing, which stores the data of the D-RAM 34a in the registers of PE0 to PE3; step S102 corresponds to arithmetic processing, which selects the data to be operated on according to whether a predetermined condition is satisfied; and step S104 corresponds to store processing, which stores the contents of the registers of PE0 to PE3 in the D-RAM 34a.
[0010]
Next, a portion corresponding to the loading process in step S100 will be described in detail.
The portion corresponding to the load process in step S100 is not executed iteratively under the loop counter variable "j"; in practice, the processing is assigned to the "j"-th PE and the PEs operate in parallel. That is, in step S200, the operation "a[j] = A[i + j]" is not repeated four times for j = 0 to 3; instead, the operations "a[0] = A[i + 0]", "a[1] = A[i + 1]", "a[2] = A[i + 2]", and "a[3] = A[i + 3]" are executed simultaneously in parallel by PE0 to PE3.
[0011]
Similarly, steps S202, S204, and S206 are also simultaneously processed in parallel by the PEs 0 to 3. As described above, the basic principle of the SIMD processor is that each of the PEs 0 to 3 handles a plurality of data with a single instruction such as S200, S202, S204, and S206.
Next, before the processing in step S102, a part corresponding to the store processing in step S400 will be described.
[0012]
The processing of step S400, like the portion corresponding to the load process in step S100, is not executed iteratively under the loop counter variable "j"; in practice, the processing is assigned to the "j"-th PE and the PEs operate in parallel. That is, the operation "Z[i + j] = z[j]" is not repeated four times for j = 0 to 3; instead, the operations "Z[i + 0] = z[0]", "Z[i + 1] = z[1]", "Z[i + 2] = z[2]", and "Z[i + 3] = z[3]" are executed simultaneously in parallel by PE0 to PE3.
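The load, select, and store structure described for FIG. 8 can be modelled in plain C as below. This is a sketch under assumptions, not the figure itself: the function name `select_simd_model` and the `int` element type are hypothetical, and the inner `j` loops only stand in for work that the four PEs would perform simultaneously.

```c
#include <assert.h>
#include <stddef.h>

#define NUM_PE 4 /* four PEs, as in the example configuration */

/* Sequential model of the FIG. 8 structure. Each inner j loop represents
   one instruction that all PEs execute at the same time on the hardware. */
void select_simd_model(const int *A, const int *B, const int *X, const int *Y,
                       int *Z, size_t n)
{
    int a[NUM_PE], b[NUM_PE], x[NUM_PE], y[NUM_PE], z[NUM_PE];

    assert(n % NUM_PE == 0); /* N is assumed to be a multiple of the PE count */

    for (size_t i = 0; i < n; i += NUM_PE) {
        /* Load: PE j reads element i + j into its own registers. */
        for (int j = 0; j < NUM_PE; j++) {
            a[j] = A[i + j];
            b[j] = B[i + j];
            x[j] = X[i + j];
            y[j] = Y[i + j];
        }
        /* Select: the per-PE choice that a single SIMD instruction
           cannot express as a branch. */
        for (int j = 0; j < NUM_PE; j++)
            z[j] = (a[j] < b[j]) ? x[j] : y[j];
        /* Store: PE j writes its result back to element i + j. */
        for (int j = 0; j < NUM_PE; j++)
            Z[i + j] = z[j];
    }
}
```

The outer loop advances by `NUM_PE`, matching the change from "i++" to "i += 4" in the description.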
[0013]
Returning to step S102, its processing will now be described in detail.
Since the data handled in step S102 are also mutually independent, it appears at first glance that the SIMD processor could handle this step in the same way as steps S100 and S104. However, depending on the result of step S300, some PEs must execute step S302 while others must execute step S304, so a SIMD processor, which is premised on a single instruction, cannot perform the processing as it stands.
[0014]
In order to solve this problem, a conventional SIMD processor performs processing using a technique called mask operation processing.
FIG. 13 shows the processing of step S102 rewritten using a mask operation, and the details thereof will be described.
FIG. 9 is a diagram illustrating a case where a mask vector is generated by PEs 0 to 3.
[0015]
In the first processing of the mask operation, as shown in FIG. 9, a mask vector is generated for each of PEs 0 to 3 according to whether or not the condition is satisfied. The mask vector is data having a number of bits corresponding to the data length of the variables x [], y [], z [], and all the bits being “1” or “0”. In each of the PEs 0 to 3, when the value of the variable a [j] is smaller than the value of the variable b [j], it is determined that the condition is true, and a mask vector in which all bits are “1” is generated. When the value of the variable a [j] is equal to or greater than the value of the variable b [j], it is determined that the condition is false, and a mask vector in which all bits are “0” is generated. In the example of FIG. 9, since the condition is true for PE0, PE1, and PE3, a mask vector in which all bits are “1” is generated, whereas the condition is false for PE2. Therefore, a mask vector in which all bits are “0” is generated.
[0016]
FIG. 10 is a diagram illustrating a case where a logical product operation is performed between the value of the variable x [j] and the mask vector in PEs 0 to 3.
In the second process of the mask operation, as shown in FIG. 10, each of PE0 to PE3 takes the logical product of the value of the variable x[j], which is the value to be assigned to the register of that PE when the condition is true, and the mask vector generated in the first process. In the example of FIG. 10, the mask vector is "1" for PE0, PE1, and PE3, so the value of the variable x[j] is output as the result of the logical product operation, whereas the mask vector is "0" for PE2, so the mask vector value "0" is output as the result.
[0017]
FIG. 11 is a diagram illustrating a case where the values of the variable y [j] and the inversion value of the mask vector are subjected to the logical product operation in PE0 to PE3.
In the third process of the mask operation, as shown in FIG. 11, each of PE0 to PE3 takes the logical product of the value of the variable y[j], which is the value to be assigned to the register of that PE when the condition is false, and the inverted value of the mask vector generated in the first process. In the example of FIG. 11, the inverted value of the mask vector is "0" for PE0, PE1, and PE3, so the inverted value "0" is output as the result of the logical product operation, whereas the inverted value of the mask vector is "1" for PE2, so the value of the variable y[j] is output as the result.
[0018]
FIG. 12 is a diagram illustrating a case where the operation result of the second process and the operation result of the third process are ORed by the PEs 0 to 3.
In the fourth process of the mask operation, as shown in FIG. 12, each of PE0 to PE3 takes the logical sum of the result of the second process and the result of the third process. In the example of FIG. 12, for PE0, PE1, and PE3, the result of the second process is the value of the variable x[j] and the result of the third process is the inverted mask value "0", so the value of the variable x[j] is output as the result of the logical sum operation. For PE2, the result of the second process is the mask vector "0" and the result of the third process is the value of the variable y[j], so the value of the variable y[j] is output as the result of the logical sum operation.
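The four processes of the mask operation can be illustrated for a single PE as follows. This is a minimal sketch assuming 32-bit data; the function names `make_mask` and `mask_select` are hypothetical and do not come from the patent.

```c
#include <assert.h>
#include <stdint.h>

/* First process: an all-ones mask when a < b, an all-zeros mask otherwise. */
uint32_t make_mask(int32_t a, int32_t b)
{
    return (a < b) ? UINT32_MAX : 0u;
}

/* Second to fourth processes: (x AND mask) OR (y AND NOT mask). */
uint32_t mask_select(uint32_t x, uint32_t y, uint32_t mask)
{
    assert(mask == 0u || mask == UINT32_MAX); /* only the two mask vectors */
    uint32_t kept_x = x & mask;   /* second process: keeps x when true  */
    uint32_t kept_y = y & ~mask;  /* third process: keeps y when false  */
    return kept_x | kept_y;       /* fourth process: one side is zero   */
}
```

Because exactly one of `kept_x` and `kept_y` is zero, the logical sum passes the other through unchanged, which is why no branch is needed.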
[0019]
Therefore, when the data operation process of step S102 is rewritten as a process using mask operations, the result is as shown in FIG. 13.
FIG. 13 is a program showing a data operation process using a mask operation.
When executed in step S102, the data operation process, as shown in FIG. 13, first allocates an array variable t[4], sets the loop counter variable j to "0", and proceeds to step S310.
[0020]
In step S310, when the value of the variable a[j] is smaller than the value of the variable b[j], a mask vector in which all bits are "1" is set in the variable t[j]; when the value of a[j] is equal to or greater than the value of b[j], a mask vector in which all bits are "0" is set in t[j]. The process then proceeds to step S312. In FIG. 13, the mask vector in which all bits are "1" is written as "-1" because it is expressed in complement representation.
[0021]
In step S312, the logical product of the value of the variable x[j] and the value of the variable t[j] is set as the new value of x[j]. The process proceeds to step S314, where the logical product of the value of the variable y[j] and the inverted value of t[j] is set as the new value of y[j]. The process then proceeds to step S316, where the logical sum of the value of x[j] and the value of y[j] is set as the value of the variable z[j].
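FIG. 13 is not reproduced in this text; the sketch below models steps S310 to S316 as described above. The function name `mask_select_fig13` and the `int` register type are assumptions, and each `j` loop represents one instruction executed by all four PEs together.

```c
#include <assert.h>

#define NUM_PE 4 /* four PEs, as in the example configuration */

/* Model of the step sequence S310 to S316. As in the description,
   x[] and y[] are overwritten with their masked values along the way. */
void mask_select_fig13(const int a[NUM_PE], const int b[NUM_PE],
                       int x[NUM_PE], int y[NUM_PE], int z[NUM_PE])
{
    int t[NUM_PE];

    assert(~0 == -1); /* writing all-ones as -1 relies on two's complement */

    for (int j = 0; j < NUM_PE; j++)   /* S310: build the mask vector */
        t[j] = (a[j] < b[j]) ? -1 : 0;
    for (int j = 0; j < NUM_PE; j++)   /* S312: x AND mask */
        x[j] = x[j] & t[j];
    for (int j = 0; j < NUM_PE; j++)   /* S314: y AND inverted mask */
        y[j] = y[j] & ~t[j];
    for (int j = 0; j < NUM_PE; j++)   /* S316: logical sum into z */
        z[j] = x[j] | y[j];
}
```

Four instruction steps are spent on one selection, and the original contents of `x` and `y` are lost unless they are copied first.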
[0022]
As for the conventional SIMD type processor, for example, the technology disclosed in Non-Patent Document 1 is widely known.
[0023]
[Non-patent document 1]
"Intel (R) Architecture Optimization Reference Manual", page 5-22, "Conditional Movement", http://www.intel.co.jp/jp/developer/download/index.htm, document number: 730795J-001
[0024]
[Problems to be solved by the invention]
As described above, in the conventional SIMD type processor 200, when performing the process of selecting the data to be operated according to whether or not the condition is satisfied for each of the PEs 0 to 3, four processes must be performed as the mask operation. Therefore, it is necessary to prepare a mask vector generation instruction and a logical operation instruction as a program, and to implement a mask operation circuit in the SIMD type processor 200. Therefore, there is a problem that the number of processing steps increases and the program becomes complicated, and the cost for mounting the mask operation circuit is high.
[0025]
Therefore, the present invention has been made in view of these unresolved problems of the conventional technology, and its object is to provide a parallel operation processing device and a parallel operation processing method suitable for reducing the number of processing steps, simplifying programs, and reducing manufacturing cost.
[0026]
[Means for Solving the Problems]
In order to achieve the above object, the parallel operation processing device according to claim 1 of the present invention is a device that includes a plurality of arithmetic units and operates the plurality of arithmetic units in parallel based on a single instruction code to process a plurality of data in parallel, wherein the plurality of data includes a subset corresponding to each of the arithmetic units, and each of the arithmetic units has flag information storage means for storing flag information, and data selection means for selecting, from the subset corresponding to that arithmetic unit, the data to be operated on by that arithmetic unit based on the flag information in the flag information storage means.
[0027]
With such a configuration, the flag information for each arithmetic unit is stored in its flag information storage means, and when a single instruction code is given, the data selection means in each arithmetic unit selects, based on the flag information in the flag information storage means, the data to be operated on by that arithmetic unit from the subset corresponding to that arithmetic unit.
Thus, simply by setting flag information indicating whether a condition is satisfied in each flag information storage means, the data to be operated on can be selected for each arithmetic unit, according to whether the condition is satisfied, with a single instruction code.
[0028]
Here, the flag information is the information needed to select the data to be operated on by an arithmetic unit from the subset. For example, it may be information that can take the two states "0" and "1", or information that can take more than two states. The same applies to the parallel operation processing method according to claim 3.
Further, the parallel operation processing device according to claim 2 of the present invention is the parallel operation processing device according to claim 1, wherein the data selection means acquires, when the flag information is in a first state, the data at a first position corresponding to that arithmetic unit among the plurality of data, and acquires, when the flag information is in a second state, the data at a second position corresponding to that arithmetic unit among the plurality of data.
[0029]
With such a configuration, in each arithmetic unit, when the set flag information is in the first state, the data selection means acquires the data at the first position corresponding to that arithmetic unit among the plurality of data. When the set flag information is in the second state, the data selection means acquires the data at the second position corresponding to that arithmetic unit among the plurality of data.
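One way to picture the data selection means, including the case where the flag information takes more than two states, is to treat the flag value as a position within the subset belonging to an arithmetic unit. This is an interpretation for illustration only: the function name `select_from_subset` and the idea of using the flag directly as an index are assumptions, not details given in the text.

```c
#include <assert.h>
#include <stddef.h>

/* Hypothetical data selection for one arithmetic unit: the flag value
   picks a position within that unit's subset of the operand data.
   With subset_len == 2, flag 0 is the "first state" (first position)
   and flag 1 is the "second state" (second position). */
int select_from_subset(const int *subset, size_t subset_len, unsigned flag)
{
    assert(flag < subset_len); /* the flag must name an existing position */
    return subset[flag];
}
```

The two-state case of claim 2 is then `subset_len == 2`; a flag with more states simply addresses a longer subset.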
[0030]
Meanwhile, in order to achieve the above object, the parallel operation processing method according to claim 3 of the present invention is a method of operating a plurality of arithmetic units in parallel based on a single instruction code to process a plurality of data in parallel, wherein the plurality of data includes a subset corresponding to each of the arithmetic units, the method including: a flag information storing step of storing flag information for each of the arithmetic units; and a data selecting step of selecting, for each of the arithmetic units, the data to be operated on by that arithmetic unit from the subset corresponding to that arithmetic unit, based on the flag information corresponding to that arithmetic unit among the flag information stored in the flag information storing step.
[0031]
Further, in the parallel operation processing method according to a fourth aspect of the present invention, in the parallel operation processing method according to the third aspect, the data selecting step includes, for each of the operation units, flag information corresponding to the operation unit. When in the first state, the data at the first position corresponding to the arithmetic unit is obtained from the plurality of data, and when the flag information corresponding to the arithmetic unit is in the second state, The data at the second position corresponding to the computing unit is obtained from the plurality of data.
[0032]
BEST MODE FOR CARRYING OUT THE INVENTION
Hereinafter, embodiments of the present invention will be described with reference to the drawings. FIGS. 1 to 5 are diagrams showing an embodiment of a parallel operation processing device and a parallel operation processing method according to the present invention.
In the present embodiment, the parallel operation processing device and the parallel operation processing method according to the present invention are applied, as shown in FIG. 1, to the case where the SIMD type processor 100 executes processing that selects the data to be operated on according to whether a condition is satisfied.
[0033]
First, the configuration of the SIMD type processor 100 according to the present invention will be described with reference to FIG. 1. Note that, in order to facilitate understanding of the invention, a configuration with four PEs will be described below as an example. In practice, more PEs may be provided.
FIG. 1 is a block diagram showing a configuration of a SIMD type processor 100 according to the present invention.
[0034]
As shown in FIG. 1, the SIMD type processor 100 includes four PEs 0 to 3 and a control device 110 that decodes an instruction code and controls the operation of each of the PEs 0 to 3. It is connected to the D-ROM 32a, the P-ROM 32b, the D-RAM 34a, and the P-RAM 34b via a bus 39, which is a signal line for transferring data, so that data can be exchanged among them. Based on a single instruction code, the processor reads the operands to be operated on and processes each data item of the read operands in parallel in PE0 to PE3.
[0035]
Next, the configuration of PE0 will be described in detail with reference to FIG. Since each of the PEs 0 to 3 has the same function, the configuration of the PE 0 will be described below, and the description of the configuration of the PEs 1 to 3 will be omitted.
FIG. 2 is a block diagram showing the configuration of PE0.
As shown in FIG. 2, PE0 includes a plurality of registers 10, an operation unit 12 that performs an operation on the data in the register 10, and a flag bit storage unit 14 that stores flag bits.
[0036]
Under the control of the control device 110, the operation unit 12 reads data from the D-RAM 34a into the register 10, performs an operation on the data in the register 10, and stores the operation result in the register 10. When the flag bit in the flag bit storage unit 14 is "0", the data at the first position corresponding to PE0 among the operands to be operated on is read into the register 10; when the flag bit is "1", the data at the second position corresponding to PE0 among the operands to be operated on is read into the register 10.
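The flag-dependent read described above can be sketched in C as follows. This is only an illustrative software model under stated assumptions, not the circuit of the embodiment: the type `pe_t`, the function `pe_load_operand`, the register count `NUM_REGS`, and the representation of the two operand positions as a two-element array are all hypothetical names and choices introduced here.

```c
#include <stdbool.h>
#include <stdint.h>

#define NUM_REGS 8 /* assumed count; the text only says "a plurality of registers" */

/* Hypothetical model of one PE: the registers 10 and the flag bit storage unit 14. */
typedef struct {
    int32_t reg[NUM_REGS];
    bool    flag;
} pe_t;

/*
 * Hypothetical model of the read performed by the operation unit 12.
 * operands[0] is the data at the first position for this PE,
 * operands[1] is the data at the second position.
 * Flag bit "0" selects the first position, flag bit "1" the second.
 */
static void pe_load_operand(pe_t *pe, int dst, const int32_t operands[2])
{
    pe->reg[dst] = pe->flag ? operands[1] : operands[0];
}
```

In this model, the same call issued to every PE yields a different register value per PE, depending only on the flag bit each PE holds.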
[0037]
Next, processing to be executed by the SIMD type processor 100 will be described with reference to FIG.
FIG. 3 shows the operation when the part corresponding to step S102 in the arithmetic processing of FIG. 8 is implemented in the SIMD processor according to the present invention.
In FIG. 8, steps S100 and S104 are performed in the same manner as in the conventional SIMD processor.
[0038]
The feature of the data processing shown in FIG. 3 is that each PE is provided with a flag equivalent to a _Bool type variable in a program (in the C language, an integer type introduced in ISO/IEC 9899:1999 that represents only Boolean values), and the result of the condition determination is stored in that flag.
Here, the variable "j" declared in FIG. 8 has the same meaning as in the description of the conventional SIMD processor: it indicates that each operation is processed by the "j"-th PE, and it is not a loop counter for repetition. The same applies to the other variables in the program: the array variables "a", "b", "x", "y", and "z" correspond to the registers handled by the PEs. a[0], b[0], x[0], y[0], and z[0] represent registers belonging to PE0; a[1], b[1], x[1], y[1], and z[1] represent registers belonging to PE1; a[2], b[2], x[2], y[2], and z[2] represent registers belonging to PE2; and a[3], b[3], x[3], y[3], and z[3] represent registers belonging to PE3. In addition, "t" represents the flag included in each PE: t[0] is the flag belonging to PE0, t[1] the flag belonging to PE1, t[2] the flag belonging to PE2, and t[3] the flag belonging to PE3.
[0039]
In step S320 in FIG. 3, the PEs 0 to 3 simultaneously perform the following operation in parallel: each PE compares the values of the registers corresponding to a[j] and b[j], sets the flag t[j] to the true value "true" if the value of a[j] is smaller than the value of b[j], and sets the flag t[j] to the false value "false" if the value of a[j] is equal to or larger than the value of b[j].
FIG. 4 shows the operation and data flow of the SIMD processor of the present invention corresponding to step S320.
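As a rough software model of step S320, the flag setting can be written in C as below. The loop over `j` only simulates, on a sequential machine, what the four PEs do at the same time; in the SIMD program `j` is a PE index, not a loop counter. The function name `step_s320` and the constant `NUM_PE` are assumptions made for this sketch.

```c
#include <stdbool.h>

#define NUM_PE 4 /* four PEs, as in the embodiment */

/*
 * Sketch of step S320: each PE j compares its own a[j] and b[j]
 * and stores the result of the condition in its own flag t[j].
 * t[j] is true when a[j] < b[j], and false when a[j] >= b[j].
 */
static void step_s320(const int a[NUM_PE], const int b[NUM_PE], bool t[NUM_PE])
{
    for (int j = 0; j < NUM_PE; j++) {
        t[j] = a[j] < b[j];
    }
}
```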
[0040]
FIG. 4 illustrates a case where the flag values resulting from the operation are t[0] = 1, t[1] = 1, t[2] = 0, and t[3] = 1.
Next, in step S322 in FIG. 3, the PEs 0 to 3 perform the following operation in parallel according to the flag value t[j] set in step S320: when t[j] is a true value, each PE assigns the value of x[j] to the operation result register z[j], and when t[j] is a false value, it assigns the value of y[j] to the operation result register z[j].
[0041]
FIG. 5 shows the operation and data flow of the SIMD processor of the present invention corresponding to step S322.
In each of the PEs 0 to 3, when a single instruction code is given with the flag bit already set, as shown in FIG. 5, the value of the variable x[j] among the operands to be operated on is set in the register 10 if the flag bit is "1". On the other hand, if the flag bit is "0", the value of the variable y[j] among the operands to be operated on is set in the register 10. In the example of FIG. 5, since the flag bits of PE0, PE1, and PE3 are "1", the value of the variable x[j] is set in the register 10 of each of those PEs, whereas since the flag bit of PE2 is "0", the value of the variable y[j] is set in the register 10 of PE2.
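The selection in step S322 can be modeled in C as follows. This is a sketch under assumptions, not the instruction set of the processor: `step_s322` and `NUM_PE` are hypothetical names, and the sequential loop stands in for the four PEs executing one instruction code in parallel.

```c
#include <stdbool.h>

#define NUM_PE 4 /* four PEs, as in the embodiment */

/*
 * Sketch of step S322: each PE j writes x[j] to its result register z[j]
 * when its flag t[j] is true, and y[j] when its flag is false.
 */
static void step_s322(const bool t[NUM_PE], const int x[NUM_PE],
                      const int y[NUM_PE], int z[NUM_PE])
{
    for (int j = 0; j < NUM_PE; j++) {
        z[j] = t[j] ? x[j] : y[j];
    }
}
```

With the flag pattern of FIG. 5 (flags "1" for PE0, PE1, and PE3 and "0" for PE2), this model gives z[0] = x[0], z[1] = x[1], z[2] = y[2], and z[3] = x[3].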
[0042]
Therefore, whereas the conventional SIMD processor 200 required four processes, steps S310 to S316, per PE, the SIMD type processor 100 according to the present invention needs only two processes, steps S320 and S322, per PE, as shown in FIG. 3.
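The difference in the number of processes can be illustrated by placing the two approaches side by side in C. The mask-based function follows the four operations listed for FIGS. 9 to 12 (mask generation, AND with the mask, AND with the inverted mask, OR of the two results). The function names, the 32-bit data width, and the all-ones encoding of a true mask are assumptions for this sketch, and the loops simulate the parallel PEs sequentially.

```c
#include <stdbool.h>
#include <stdint.h>

#define NUM_PE 4 /* four PEs, as in the embodiment */

/* Conventional selection with mask operations: four processes per PE. */
static void select_with_mask(const int32_t a[NUM_PE], const int32_t b[NUM_PE],
                             const uint32_t x[NUM_PE], const uint32_t y[NUM_PE],
                             uint32_t z[NUM_PE])
{
    for (int j = 0; j < NUM_PE; j++) {
        uint32_t m  = (a[j] < b[j]) ? UINT32_MAX : 0u; /* 1: generate the mask  */
        uint32_t zx = x[j] & m;                        /* 2: x AND mask          */
        uint32_t zy = y[j] & ~m;                       /* 3: y AND inverted mask */
        z[j] = zx | zy;                                /* 4: OR of 2 and 3       */
    }
}

/* Flag-based selection: two processes per PE (steps S320 and S322). */
static void select_with_flag(const int32_t a[NUM_PE], const int32_t b[NUM_PE],
                             const uint32_t x[NUM_PE], const uint32_t y[NUM_PE],
                             uint32_t z[NUM_PE])
{
    for (int j = 0; j < NUM_PE; j++) {
        bool t = a[j] < b[j];      /* S320: set the flag    */
        z[j] = t ? x[j] : y[j];    /* S322: select by flag  */
    }
}
```

Both functions produce the same z[j] for any input; the flag-based version simply reaches it without the intermediate mask and the two AND results.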
[0043]
As described above, in the present embodiment, each of the PEs 0 to 3 has the flag bit storage unit 14 for storing a flag bit, and the operation unit 12 that selects, from the operands to be operated on, the data to be processed by that PE based on the flag bit in the flag bit storage unit 14.
Thus, simply by setting a flag bit indicating whether a condition is satisfied in the flag bit storage unit 14, the data to be operated on can be selected by a single instruction code in accordance with whether the condition is satisfied. Therefore, the number of processing steps can be reduced and the program can be simplified as compared with the related art. Further, since it is not necessary to mount a mask operation circuit on the SIMD type processor 100, the manufacturing cost can be reduced as compared with the related art.
[0044]
Further, in the present embodiment, when the flag bit is "0", the operation unit 12 reads the data at the first position corresponding to PE0 among the operands to be operated on into the register 10, and when the flag bit is "1", it reads the data at the second position corresponding to PE0 among the operands to be operated on into the register 10.
[0045]
This makes it possible to realize, with a single instruction code, a conditional branch process in which the operation is performed using the first value if the condition is true, and the operation is performed using the second value if the condition is false.
In the above embodiment, the PEs 0 to 3 correspond to the arithmetic units according to claims 1 to 4, the flag bit storage unit 14 corresponds to the flag information storage means according to claim 1, and the operation unit 12 corresponds to the data selecting means according to claim 1 or 2.
[0046]
In the above embodiment, four PEs are provided. However, the present invention is not limited to this: two or three PEs may be provided, or five or more PEs may be provided.
Further, in the above embodiment, as shown in FIG. 1, the parallel operation processing device and the parallel operation processing method according to the present invention are applied to the case where the SIMD type processor 100 executes processing for selecting the data to be operated on in accordance with whether or not a condition is satisfied. However, the present invention is not limited to this and may be applied to other cases without departing from the gist of the present invention.
[0047]
[Effects of the invention]
As described above, according to the parallel operation processing device according to claim 1 or 2 of the present invention, simply by setting flag information indicating whether a condition is satisfied in the flag information storage means, the data to be operated on can be selected with a single instruction code in accordance with whether the condition is satisfied. Therefore, the number of processing steps can be reduced and the program can be simplified as compared with the related art. Further, since there is no need to mount a mask operation circuit, the manufacturing cost can be reduced as compared with the related art.
[0048]
Further, according to the parallel operation processing device of the second aspect of the present invention, an effect is also obtained that conditional branch processing, in which the operation is performed using the first value if the condition is true and using the second value if the condition is false, can be realized by a single instruction code.
On the other hand, according to the parallel operation processing method of the third or fourth aspect of the present invention, the same effect as that of the parallel operation processing device of the first aspect can be obtained.
[0049]
Further, according to the parallel operation processing method of the fourth aspect of the present invention, the same effect as that of the parallel operation processing device of the second aspect can be obtained.
[Brief description of the drawings]
FIG. 1 is a block diagram showing a configuration of a SIMD type processor 100 according to the present invention.
FIG. 2 is a block diagram showing a configuration of PE0.
FIG. 3 illustrates an operation when the part of step S102 in the arithmetic processing of FIG. 8 is implemented in the SIMD processor according to the present invention.
FIG. 4 shows the operation and data flow of the SIMD processor of the present invention corresponding to step S320.
FIG. 5 illustrates an operation and a data flow of the SIMD processor of the present invention corresponding to step S322.
FIG. 6 is a block diagram showing a configuration of a conventional SIMD type processor 200.
FIG. 7 is an example of an arithmetic processing program for processing using a processor.
FIG. 8 is a diagram in which the arithmetic processing program of FIG. 7 is rewritten in a form suited to processing on an actual SIMD processor with four-way parallelism, that is, with four PEs.
FIG. 9 is a diagram showing a case where a mask vector is generated by PE0 to PE3.
FIG. 10 is a diagram illustrating a case where a logical product operation is performed between the value of the variable x[j] and the mask vector in PE0 to PE3.
FIG. 11 is a diagram illustrating a case where a logical product operation is performed between the value of the variable y[j] and the inverted value of the mask vector in PE0 to PE3.
FIG. 12 is a diagram illustrating a case where the operation result of the second process and the operation result of the third process are ORed in PE0 to PE3.
FIG. 13 is a program showing a data operation process using a mask operation.
[Explanation of symbols]
10 Register
12 Operation unit
14 Flag bit storage unit
32a D-ROM
32b P-ROM
34a D-RAM
34b P-RAM
100,200 SIMD type processor
110, 210 control device
PE0-3 Processing element

Claims

An apparatus comprising a plurality of arithmetic units, which processes a plurality of data in parallel by operating the plurality of arithmetic units in parallel based on a single instruction code, wherein
the plurality of data includes a subset corresponding to each of the arithmetic units, and
each of the arithmetic units includes flag information storage means for storing flag information, and data selecting means for selecting, from the subset corresponding to that arithmetic unit, the data to be operated on by that arithmetic unit based on the flag information in the flag information storage means.

In claim 1,
the data selecting means acquires, from the plurality of data, the data at a first position corresponding to the arithmetic unit when the flag information is in a first state, and acquires, from the plurality of data, the data at a second position corresponding to the arithmetic unit when the flag information is in a second state.

A method of processing a plurality of data in parallel by operating a plurality of arithmetic units in parallel based on a single instruction code, wherein
the plurality of data includes a subset corresponding to each of the arithmetic units, the method comprising:
a flag information storing step of storing flag information for each of the arithmetic units; and
a data selecting step of selecting, for each of the arithmetic units, from the subset corresponding to that arithmetic unit, the data to be operated on by that arithmetic unit based on the flag information corresponding to that arithmetic unit among the flag information stored in the flag information storing step.

In claim 3,
in the data selecting step, for each of the arithmetic units, the data at a first position corresponding to that arithmetic unit is acquired from the plurality of data when the flag information corresponding to that arithmetic unit is in a first state, and the data at a second position corresponding to that arithmetic unit is acquired from the plurality of data when the flag information corresponding to that arithmetic unit is in a second state.