JP3991590B2

JP3991590B2 - Computer system and fault processing method in computer system

Info

Publication number: JP3991590B2
Application number: JP2000601532A
Authority: JP
Inventors: 知紀関口; 利明新井; 博古川; 和美池田
Original assignee: Hitachi Ltd
Current assignee: Hitachi Ltd
Priority date: 1999-02-24
Filing date: 1999-02-24
Publication date: 2007-10-17
Anticipated expiration: 2019-02-24
Also published as: EP1172732A1; US6948100B1; US20050172169A1; EP1172732A4; US7426662B2; WO2000051000A1; TW449687B

Description

技術分野
本発明は、計算機システムに関し、特に、障害処理を効率よく行なう計算機システムに関する。
背景技術
遠隔管理用の入出力装置であるリモート管理装置をＰＣＩバス等のＩ／Ｏバスを介して計算機に接続して、リモート管理装置により計算機を管理する方法がある。リモート管理装置は、ネットワークアダプタやモデムといった通信用の入出力装置を有し、ＬＡＮや電話回線等により他の計算機と接続して、遠隔地にある他の計算機から計算機を管理している。
リモート管理装置は、Ｉ／Ｏバス、あるいは、管理対象の計算機の管理情報を転送する専用のバスを経由して、計算機の稼動情報を取得する。リモート管理装置は、管理対象の計算機のＣＰＵがＩ／Ｏバス経由でアクセス可能なレジスタやメモリを保持している。
また、特開平９−５０３８６や特開平５−２５７９１４、および、特開平５−２５０２８４のように、リモート管理装置は、ＣＰＵ、メモリ、および、ネットワークアダプタやモデムといった通信装置を含むＩ／Ｏ装置を持つ計算機（管理装置計算機）として構成される場合もある。この場合、管理装置計算機上のＣＰＵは、管理対象の計算機とは独立して管理用のプログラムを実行でき、管理対象の計算機の実行状態に関わらず管理プログラムを実行することができる。つまり、計算機のオペレーティングシステム（ＯＳ）の起動前、障害停止時、外部からの操作を受け付けない状態（ハングアップ）時でも、管理装置計算機は実行可能になっている。
Ｉ／Ｏバスに接続される従来の管理装置は、管理対象の計算機がハングアップする障害が発生した場合、ＣＰＵのリセット、あるいは、管理対象の計算機の電源の遮断等の方法により計算機を再起動している。この再起動は、管理装置と管理対象の計算機を専用の信号線で接続して、その信号線を経由して管理対象の計算機のＣＰＵにリセット信号を送ったり、あるいは、管理対象の計算機上のファームウェアに制御を移す割り込みを送ることにより実現している。専用線が必要なのは、Ｉ／ＯバスにはＯＳの実行を強制的に停止させるような割り込みを送る信号線がないためである。
この再起動方法を実施するには、管理装置と管理対象の計算機との間にＩ／Ｏバス以外の信号線を設置しなければならない。このため、管理装置を接続可能な管理対象の計算機が限定されてしまう問題がある。つまり、管理装置と管理対象の計算機を専用線で接続できる組み合わせでなければ、障害発生時に管理装置から管理対象の計算機を再起動できない。
また、従来の管理装置の再起動方法は、ＣＰＵのリセットによるためＯＳが介在する機会がなく、加えて、ＯＳの再起動により管理対象の計算機の主記憶の内容が失われてしまう。このため、障害原因の解析を困難している。さらに再現性のない障害の場合、障害解析をすることができず問題である。
一方、ＰＣＩバスのような汎用のＩ／Ｏバスについてみると、前に述べたように、ＯＳの実行を強制的に障害処理へ移行させる割り込みを管理装置から管理対象の計算機に送ることができない。しかし、Ｉ／Ｏバスが、Ｉ／Ｏバス経由で転送されるアドレス、コマンド、および、データ等の正確性を保証するための付加情報（例えばパリティビット）を転送する信号線を持っている場合もある（ＰＣＩＨａｒｄｗａｒｅａｎｄＳｏｆｔｗａｒｅＡｒｃｈｉｔｅｃｔｕｒｅＤｅｓｉｇｎ，ｐｐ１７２〜１７４，Ａｎｎａｂｏｏｋｓ，１９９４）。このような付加情報を転送できるＩ／Ｏバスであれば、管理対象の計算機や入出力装置は、Ｉ／Ｏバス経由のデータ転送においてＩ／Ｏバス上のデータの正確性を検証することは可能である。
更に、前記の機能を持つＩ／Ｏバスを使用している場合、Ｉ／Ｏバスの付加情報により不正な信号を検出した時に、障害をＣＰＵに通知するための信号線を持つＩ／Ｏバス制御装置もある（ＭｉｃｒｏｐｒｏｃｅｓｓｏｒＲｅｐｏｒｔ，ｐｐ１１〜１２，Ｖｏｌ．１２，Ｎｕｍｂｅｒ９，Ｊｕｌｙ，１９９８）。
管理対象の計算機のＣＰＵについてみると、バスに障害が発生すると、メモリアクセスができなくなって、ＣＰＵが動作できない状況が発生し得る。このようにバスがロックしている場合、ＣＰＵに割り込み信号を送っただけでは、ＣＰＵの実行を再開することはできない。これは、バス障害のためにメモリアクセスができないため、割り込みハンドラを起動できないためである。
このような障害に対して、バスに関する障害信号を検出した場合に、ＣＰＵをリセットするのではなくバスだけを再初期化して、その後に内部的に割り込みを生成して割り込みハンドラに制御を渡すＣＰＵがある（ＭｉｃｒｏｐｒｏｃｅｓｓｏｒＲｅｐｏｒｔ，ｐｐ１，６〜１０，Ｖｏｌ．１２，Ｎｕｍｂｅｒ９，Ｊｕｌｙ，１９９８）。このＣＰＵに依れば、バスがロックしてしまってもＣＰＵの実行を再開させることができ、ＯＳの障害処理を開始させることも可能となる。
従来のＩ／Ｏバスに接続する計算機の管理装置では、ＯＳの障害処理が実行できなくなる障害が計算機に発生した時、Ｉ／Ｏバス以外の信号線により計算機のＣＰＵをリセットする、あるいは、計算機上のファームウェアによりＣＰＵをリセットして、計算機全体を再起動している。これら方法では、ＣＰＵがリセットされてしまうため、ＯＳは障害処理を実行することができず、障害情報が取得できなくなるという問題があった。
また、従来の管理装置では、Ｉ／Ｏバスとは別の信号線、あるいは、計算機上にＣＰＵのリセット処理を実行する回路やファームウェアが必要であった。この方式には、管理装置の接続可能な計算機が限定されるという問題があった。
本発明の目的は、ＯＳの障害処理が実行できなくなる障害が計算機に発生した場合でも、障害情報を取得可能な計算機システムを提供することにある。
また、本発明の別の目的は、Ｉ／Ｏバスを介して管理対象の計算機のバスを初期化可能な計算機システムを提供することにある。
発明の開示
上記目的を達成するために、本発明では、計算機と管理装置がＩ／Ｏバスにより接続された計算機システムにおいて、ＯＳの障害処理が実行できなくなる障害が計算機に発生した場合、障害管理装置から計算機内のＩ／Ｏバス管理装置にＩ／Ｏバス障害の発生を通知するＩ／Ｏバス信号を送る。そして、Ｉ／Ｏバス管理装置は、Ｉ／Ｏバスを初期化した後、Ｉ／Ｏバス障害を計算機のＣＰＵにＯＳが処理する割り込みとして通知する。
このようにして、従来、ＯＳの障害処理が実行できなくなる障害が計算機に発生した場合でも、ＯＳへの割り込みを契機として障害情報を取得可能な計算機システムを提供することができる。また、Ｉ／Ｏバスを介して管理対象の計算機のバスを初期化可能な計算機システムを提供できる。
発明を実施するための最良の形態
以下、図面を用いて本発明の実施例を説明する。
（１）第１の実施形態
第１図は、本発明の実施形態のシステム構成を示す図である。計算機１００は、管理装置１２０の管理対象となる計算機である。
計算機１００の構成について説明する。ＣＰＵ１０１と主記憶１０２は、バス１０３により接続している。バス１０３には、Ｉ／Ｏバス１０７を制御するＩ／Ｏバス制御装置１０４が接続している。バス１０３には、ＣＰＵ１０１やＩ／Ｏバス制御装置１０４に、バス１０３に関する内部状態のリセットを指示する信号線が含まれる。Ｉ／Ｏバス制御装置１０４からはＩ／Ｏバス１０７が伸びている。Ｉ／Ｏバス１０７には、管理装置１２０、外部記憶装置１０５、キーボード、ディスプレイ等の対話型デバイスから構成されるコンソール１０６等が接続される。
Ｉ／Ｏバス制御装置１０４は、ＣＰＵ１０１が実行する入出力操作のＩ／Ｏバス１０７への転送や、Ｉ／Ｏバス１０７に接続する入出力機器からのデータの、主記憶１０２やＣＰＵ１０１内のレジスタへの転送、割り込みのＣＰＵ１０１への転送等を実施する。
Ｉ／Ｏバス制御装置１０４とＣＰＵ１０１は、バスエラー通知線１０８により接続している。バスエラー通知線１０８は、Ｉ／Ｏバス制御装置１０４が、Ｉ／Ｏバス１０７上でエラーを検出した時に、ＣＰＵ１０１にバスエラーを通知するためのバス信号線である。
次に、管理装置１２０について説明する。管理装置１２０は、計算機１００のＩ／Ｏバス１０７に接続する外部入出力装置の一種で、遠隔から計算機１００の実行状況の監視や起動・停止等の運用操作を実現する。管理装置１２０は、それ自体で計算機を構成しており、そこで実行するプログラムは、計算機１００のＯＳが停止している時でも独立して実行可能である。管理装置１２０で実行するプログラムは、モデム１２７やネットワークアダプタ１２８を制御して、計算機１５１、および、１７０のような遠隔にある計算機と連携して、遠隔にある計算機からの計算機１００の運用のための操作を実現する。
管理装置１２０上のＣＰＵ１２１と主記憶１２２は、バス１２３で接続している。バス１２３には、Ｉ／Ｏバス制御装置１２４が接続し、Ｉ／Ｏバス制御装置１２４からはＩ／Ｏバス１２５が伸びている。Ｉ／Ｏバス１２５には、モデム１２７やネットワークアダプタ１２８があり、遠隔の計算機と通信可能となっている。
管理装置１２０は、デバイス制御装置１２６を介して、計算機１００のＩ／Ｏバス１０７と接続する。デバイス制御装置１２６は、ＣＰＵ１０１が実行する管理装置１２０に対する入出力操作要求を受信して、要求に応じた制御を実施する。例えば、主記憶１２２の内容を変更する、ＣＰＵ１２１に割り込みを送信する等の操作である。
デバイス制御装置１２６は、ＣＰＵ１２１からも入出力装置として見えるように構成する。デバイス制御装置１２６は、ＣＰＵ１２１の実行する入出力操作を受けてＩ／Ｏバス１０７にデータを書き出す等の操作を実施する。
デバイス制御装置１２６の中に、障害生成装置１３０がある。障害生成装置１３０は、ＣＰＵ１２１の指示を受けてＩ／Ｏバス１０７に不正な信号を送出する装置である。計算機１００のＩ／Ｏバス制御装置１０４は、Ｉ／Ｏバス１０７上で不正な信号を検出した場合、バスエラー通知線１０８によりＣＰＵ１０１に障害を通知する。
第２図は、本発明の実施形態のソフトウェア構成図である。ここでは、計算機１００のＩ／Ｏバス１０７に管理装置１２０が接続されており、管理装置１２０のネットワークアダプタ１２８がネットワークを介して管理計算機１５１に接続されている。
計算機１００と１５１、および、管理装置１２０のそれぞれには、ＯＳ２０１、ＯＳ２２１、および、ＯＳ２１３がローデイングされ、動作している。計算機１００では、通常のアプリケーションプログラム群２０２が実行している。加えて、計算機１００では、管理装置１２０と連携して実行する管理エージェントプログラム２０３が動作している。管理エージェント２０３は、計算機１００で実行するプログラム２０２、およびＯＳ２０１の実行状況の収集、管理装置１２０への実行状況送信、管理装置１２０への動作指示、管理装置１２０が収集した計算機１００の実行状況情報の取得、運用管理処理を実施する。運用管理処理とは、計算機１００の自動起動・停止時刻の設定、計算機１００のシャットダウン、リブート、電源断、管理情報の表示やネットワークへの管理情報送信等である。
管理装置１２０では、遠隔の計算機１５１との通信を行う通信制御プログラム２１２と、計算機１００の運用管理処理をする管理プログラム２１１が実行している。管理プログラム２１１は、計算機１００の動作状況の取得、時刻指定による計算機１００の電源制御、ＯＳ２０１の自動起動・停止処理、管理エージェント２０３収集情報の遠隔管理計算機１５１への転送、遠隔計算機１５１からの運用操作要求の処理等を実行する。
管理装置１２０上のプログラム２１１ないし２１３は、計算機１００のＯＳ２０１が停止していても実行可能である。計算機１００がＯＳ２０１の障害のため停止している時、管理プログラム２１１は、Ｉ／Ｏバス１０７経由で主記憶１０２の内容を取得、遠隔計算機１５１へ障害情報の送信等の障害処理を実施する。加えて、本実施形態では、障害生成装置１３０を駆動してＩ／Ｏバス１０７に障害信号を送出し、ＯＳ２０１の障害処理を起動させる処理を実施する。
遠隔の計算機１５１や１７０は、ＬＡＮのようなネットワーク１５０、あるいは、電話回線といった通信回線１４０で管理装置１２０と接続している。遠隔計算機１５１では、遠隔計算機管理プログラム２２０が実行している。このプログラム２２０は、管理装置１２０上の管理プログラム２１１と通信により管理情報を交換して、計算機１００の運用管理操作を実行する。例えば、計算機１００の運用管理情報の表示、遠隔からの停止・リブート、ＯＳ２０１の障害処理開始指示などを実行する。
バス１０３やＩ／Ｏバス１０７で障害が発生すると、ＣＰＵ１０１はバスエラー割り込みを生成して障害処理を実行する。ＯＳ２０１内には、バスエラー割り込みを処理する割り込みハンドラ２０４がある。割り込みハンドラ２０４は、ＣＰＵ１０１の割り込みベクタに登録されて、バスエラー割り込み発生時に実行されるように設定される。
第３図は、本実施形態におけるデバイス制御装置１２６の構成を示した図である。デバイス制御装置１２６は、Ｉ／Ｏバスインターフェイス回路３０１を介して管理装置１２０のＩ／Ｏバス１２５、および、計算機１００のＩ／Ｏバス１０７と接続している。回路３０１は、各Ｉ／Ｏバスからのデバイス制御装置１２６宛てデータの取出し、あるいは、ＣＰＵからのＩ／Ｏバスへのデータの送出を実施する回路である。回路３０１は、Ｉ／Ｏバス１０７より取得したデータ内容に従って、デバイス制御装置１２６内の他の回路を駆動する。
制御装置１２６には、Ｉ／Ｏバス１０７用のパリティ生成回路３０２と、障害生成装置１３０が組み込まれている。本実施形態では、パリティ生成回路３０２は、Ｉ／Ｏバス１０７に送出するアドレス信号１０７ｂに関するパリティ信号１０７ａを、排他的論理和回路の組み合わせにより生成している。通常実行時は、パリティ生成回路３０２で生成したパリティ信号をそのままＩ／Ｏバス１０７に送出する。
障害生成装置１３０は、パリティ生成回路３０２が生成したパリティ信号を反転して、Ｉ／Ｏバス１０７で障害と定義される信号を生成する。障害信号の生成は、障害生成レジスタ３０３で制御する。通常動作時は、レジスタ３０３は０に設定する。レジスタ３０３を１に設定すると、障害生成装置１３０はパリティ生成回路３０２で生成された信号を反転して、Ｉ／Ｏバス１０７に障害となる信号を送出する。
レジスタ３０３は、管理装置１２０のＣＰＵ１２１の入出力命令によりアクセス可能なように構成する。管理プログラム２１１は、レジスタ３０３を１にセットしてＩ／Ｏバス１０７にアクセスする操作を実行することで計算機１００のＯＳ２０１を強制停止できる。
障害生成装置１３０は、パリティ信号１０７ａに不正な信号を送出した時点で障害生成状態レジスタ３０４を１にセットする。また、Ｉ／Ｏバス１０７への障害注入が連続して発生しないように、レジスタ３０３を０にリセットする。
本実施形態では、アドレス信号のパリティを不正な値にすることでＩ／Ｏバスに障害を送出したが、不正なバス信号の生成の仕方はこの限りではない。
Ｉ／Ｏバス制御装置１０４について説明する。第４図は、本実施形態におけるＩ／Ｏバス制御装置１０４の構成の一部を示す図である。
Ｉ／Ｏバス制御装置１０４は、Ｉ／Ｏバス１０７へのデータの送出、および、Ｉ／Ｏバス１０７からのデータの取り込みを実施する。データ取り込みの際、Ｉ／Ｏバス１０７上のデータが不正になっていないかを検査するため、アドレス信号１０７ｂに関するパリティ信号１０７ａを参照する。Ｉ／Ｏバス制御装置１０４内のパリティ計算回路４０１は、アドレス信号１０７ｂよりパリティ値を求める。このパリティ値とＩ／Ｏバス１０７のパリティ信号１０７ａを比較する。一致しない場合、バスエラー通知線１０８により、ＣＰＵ１０１にバス障害を通知する。
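When the fault generation device 130 sends a fault signal onto the I/O bus 107, the parity value becomes invalid, so a bus fault is reported to the CPU 101.
FIG. 5 shows the configuration on the CPU 101 side related to bus fault processing. When the CPU 101 is notified of a bus fault through the bus error signal line 108, it initializes the bus 103 by means of the bus initialization circuit 501. Here, initializing the bus 103 means setting the bus-related state inside the CPU 101 to its initial state; it is not a reset of the CPU 101. The other devices connected to the bus 103 also need this bus initialization, so the CPU 101 instructs them to initialize as well by means of the bus initialization signal 103b.
The CPU 101 also delays the error notification signal 108 in the delay circuit 502 and, at the point when initialization of the bus 103 has finished, drives the interrupt control circuit 504 to generate a bus error interrupt internally.
Ordinary external interrupts are reported to the processor by the external interrupt signal 103a. External interrupts are masked according to the value of the interrupt disable register 503. If the interrupt caused by a bus error notification is arranged to bypass the mask control of the interrupt disable register 503 and drive the interrupt control circuit 504 directly, a bus fault interrupt can be generated even while the CPU 101 has external interrupts disabled.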
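The bus initialization processing of the CPU 101 will now be described. FIG. 6 shows an example configuration of the bus initialization circuit 501 of the CPU 101.
The bus-related circuits of the CPU 101 are driven in synchronization with a clock signal 604.
The CPU 101 contains circuits that control the bus 103. Some of them hold state related to data that previously flowed over the bus 103. In this example, a register 603 built from flip-flops is assumed to hold the bus state. The register 603 latches the bus state in synchronization with the clock signal 604.
During normal operation, the value of the register 603 is determined by the bus control circuit 601. When the bus initialization signal 103b is inactive, that is, 0, the switch circuit 605 is arranged so that the output of the bus control circuit 601 reaches the register 603.
When the bus initialization signal 103b is active, the switch circuit 605 is arranged so that the value held in the initial state register 602 reaches the register 603. The value of the initial state register 602 is either preset in the CPU 101 or set by initialization when the computer 100 is powered on. The CPU 101 can thus set the register 603 to its initial state in response to the bus initialization signal 103b.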
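The selection performed by the switch circuit 605 can be pictured with a small behavioral model. The following Python sketch is illustrative only and is not part of the disclosure: the class name `BusStateStage`, the method `clock`, and the integer encoding of the bus state are assumptions made for the example.

```python
from dataclasses import dataclass


@dataclass
class BusStateStage:
    """Register 603 fed by switch circuit 605 (reference numerals follow FIG. 6)."""

    initial_state: int  # value held in initial state register 602
    state: int = 0      # current contents of register 603

    def clock(self, bus_control_output: int, bus_init_active: bool) -> int:
        """Model one edge of clock signal 604 and return the latched value."""
        # Switch circuit 605: select register 602 while signal 103b is active,
        # otherwise pass through the output of bus control circuit 601.
        if bus_init_active:
            self.state = self.initial_state
        else:
            self.state = bus_control_output
        return self.state


stage = BusStateStage(initial_state=0x0)
stage.clock(bus_control_output=0x5A, bus_init_active=False)  # normal operation
stage.clock(bus_control_output=0x5A, bus_init_active=True)   # bus initialization
```

In this model only the bus-related state is overwritten; nothing else about the CPU is touched, which mirrors the distinction between bus initialization and a CPU reset.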
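In the present embodiment the CPU 101 sends the bus initialization signal 103b onto the bus 103, but each device connected to the bus 103 may instead detect the bus error notification signal 108 and perform initialization itself.
In the present embodiment, with the hardware configuration described above, the management device 120 connected to the I/O bus 107 of the computer 100 can, at any time and independently of the execution state of the computer 100, send onto the I/O bus 107 a signal that the I/O bus 107 defines as a fault. This initializes the internal state related to the bus 103 that is held by each device connected to the bus 103, and causes the CPU 101 to generate a bus error interrupt.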
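Next, the software processing of the present embodiment will be described. FIG. 7 is a flowchart showing the processing of the bus error interrupt handler 204 in the OS 201 running on the computer 100.
When the CPU 101 traps a bus error interrupt, it passes control to the interrupt handler 204, which starts at step 701. A bus error interrupt may or may not have been generated intentionally by the management device 120. The interrupt handler 204 first reads the value of the fault generation status register 304 of the management device 120 (step 701). The register 304 is configured to be accessible from the CPU 101 via the I/O bus 107.
The handler then examines the value read from the register 304 (step 702). If the value is 0, that is, if the management device 120 did not send the bus fault, the handler executes normal bus error processing (step 705): for example, displaying fault information on the console 106, dumping the main memory 102 to the external storage device 105, and restarting the computer 100.
If the register 304 is 1, that is, if the bus error was caused by the management device 120 injecting a fault into the I/O bus 107, the handler resets the fault generation status register 304 (step 703) and displays a message to that effect on the console 106 (step 704). Reference numeral 720 is an example of the console screen display.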
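The branch taken by the interrupt handler 204 in FIG. 7 can be sketched as follows. This is a minimal Python illustration under stated assumptions, not the handler of any actual OS: the class `ManagementDeviceRegisters`, the function name, the return values, and the console message strings are all hypothetical, and the real register 304 would be read over the I/O bus 107.

```python
class ManagementDeviceRegisters:
    """Stand-in for the registers of management device 120 seen over the I/O bus."""

    def __init__(self, fault_status: int = 0):
        self.reg304 = fault_status  # fault generation status register 304


def bus_error_handler(device: ManagementDeviceRegisters, console: list) -> str:
    """Follow steps 701-705 of FIG. 7 and report which branch was taken."""
    status = device.reg304                      # step 701: read register 304
    if status == 0:                             # step 702: fault not injected by 120
        # step 705: normal bus error processing (display, dump, restart)
        console.append("bus error: running normal fault processing")
        return "normal"
    device.reg304 = 0                           # step 703: reset register 304
    console.append("stopped by management device")  # step 704: notify console 106
    return "forced-stop"
```

Reading register 304 first is what lets the same bus error vector serve both genuine bus faults and deliberate stops requested by the management device.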
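The processing of the management program 211 in the management device 120 will now be described. FIG. 8 is a flowchart showing an example of that processing.
First, in step 801, the program checks whether there is a stop request for the computer 100. A stop request arises, for example, when one is sent from the remote computer 151 or 170 over a communication line to the modem 127 or the network adapter 128, or when the emergency stop button 129 is pressed.
If there is no stop request, the program collects the operating status of the computer 100 and stores it in the management data 210 (step 802). From the acquired data 210 it determines whether the computer 100 is running normally (step 803). If it is running, the program sends the operating status to the remote computer (step 804). If it has stopped, the program proceeds to step 807, acquires fault information, and sends it to the remote computer.
If there is a stop request, the program executes step 805: it sets the fault generation register 303 to 1 and then executes an instruction that accesses the I/O bus 107 (step 806). This generates a bus error interrupt in the CPU 101, and control passes to the bus error interrupt handler 204.
The program then proceeds to step 807 and sends fault information to the remote computer.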
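One pass through the FIG. 8 flow can be written out as a short Python sketch. It is illustrative only: the function name, the `computer_running` callback, and the action strings are assumptions introduced for the example, and the actual register writes and network transmission are replaced by entries in a returned list.

```python
from typing import Callable, List


def management_program_pass(
    stop_requested: bool,
    computer_running: Callable[[], bool],
) -> List[str]:
    """Run one pass of the FIG. 8 flow and return the actions taken, in order."""
    actions: List[str] = []
    if stop_requested:                               # step 801
        actions.append("set register 303 to 1")      # step 805
        actions.append("access I/O bus 107")         # step 806
        actions.append("send fault information")     # step 807
        return actions
    actions.append("collect operating status")       # step 802
    if computer_running():                           # step 803
        actions.append("send operating status")      # step 804
    else:
        actions.append("send fault information")     # step 807
    return actions
```

For instance, `management_program_pass(True, lambda: True)` follows the stop-request path regardless of whether the computer is still running.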
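With the hardware configuration and software procedure described above, the management device 120 connected to the I/O bus 107 can forcibly stop execution of the OS 201 running on the computer 100 and cause the bus error interrupt handler 204, which is the fault processing of the OS, to be executed.
In the present embodiment, the fault generation device 130 of the management device 120 sends a fault signal onto the I/O bus 107 at any time, regardless of the execution state of the computer 100, and thereby forcibly stops the OS 201 running on the computer 100. In this embodiment the computer 100 and the management device 120 are connected only by the I/O bus 107. Compared with the conventional scheme in which a dedicated signal line connects the management device and the computer, the restrictions on which computers 100 the management device 120 can be connected to are relaxed.
Moreover, the conventional management device restarted the computer by resetting the CPU when OS execution stopped because of a fault, which made it difficult to analyze the cause of the fault. In the present embodiment, by contrast, the I/O bus control device 104 reports a bus error to the CPU 101, and the CPU 101 in response generates an interrupt and executes the interrupt handler 204. As an extension of the interrupt handler 204, fault processing such as storing the contents of the main memory 102 in the external storage device 105, analyzing the cause of the fault, and removing the cause, as well as stop processing for the OS 201, can be executed, which makes later fault analysis and recovery easier.
Also, because the CPU 101 generates the interrupt only after the CPU 101 and each device connected to the bus 103 have initialized their internal state related to the bus 103, the interrupt handler 204 is more likely to be able to run.
In the present embodiment the bus error interrupt handler 204 stores the contents of the main memory 102 in the external storage device 105, but all or part of the contents of the main memory 102, and the fault analysis information produced by the interrupt handler 204, may instead be stored in the main memory 122 of the management device 120.
In this embodiment the management device 120 sends the fault signal onto the I/O bus 107, but the fault generation device 130 may instead be built into a device such as a network adapter or a modem so that the device sends a fault signal onto the I/O bus 107 when it receives a specific packet or data.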
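(2) Second embodiment
Next, a second embodiment of the present invention will be described.
In the first embodiment, the management device 120 connected to the I/O bus 107 had to send onto the I/O bus 107 a signal that is recognized as a fault. To do so, the management device 120 must acquire the right to access the I/O bus 107; that is, it must win arbitration for the bus 107.
However, there are cases in which the management device 120 cannot acquire the right to use the I/O bus 107. When the CPU 101 performs a continuous, indivisible operation on a device connected to the I/O bus 107, it acquires the bus for exclusive use. This is called locking the bus. If the target device cannot respond at this point, for example because it has failed, the right to use the bus 107 is never released.
In such a case the first embodiment cannot inject a fault signal into the I/O bus 107, so the management device 120 cannot start the fault processing of the OS 201 of the computer 100.
The second embodiment of the present invention describes means and a procedure for sending the fault signal after releasing the locked state of the bus. In this embodiment the management device 120 is able to inspect the lock state of the I/O bus 107. In addition, for an I/O bus request that holds the bus locked and never completes, the management device 120 sends arbitrary data so that the requested operation appears to have completed, causing the request issuer to release the bus lock.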
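The flow of data on the I/O bus will now be described. FIG. 9 is a timing chart showing the flow of data on the I/O bus 107 in the present embodiment.
FIG. 9 shows the state of the bus signals when data is actually exchanged, after arbitration for access to the I/O bus 107 has completed. A device accessing the I/O bus 107 acquires the access right and then outputs the address signal 107b, which designates the device to be accessed.
If the access is to be exclusive, the device activates the I/O bus lock signal 107c at the same time. Devices connected to the I/O bus 107 are configured so that they cannot issue the next request to the I/O bus 107 while the bus lock signal 107c is active. The requesting device keeps the bus lock signal 107c active until the operation finishes.
The device designated by the address signal 107b, on completing the operation, activates the response signal 107d and, if there is data, outputs it on the data signal line 107e.
The requesting device detects that the response signal 107d has become active, takes in the data from the data signal line 107e, and deactivates the bus lock signal 107c.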
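FIG. 10 shows the configuration of the management device 120 in the second embodiment. The description assumes that the CPU 101 has issued an indivisible continuous I/O request to a device 1020 and that the device 1020 cannot respond.
When the CPU 101 issues an indivisible I/O request, the I/O bus control device 104 activates the bus lock signal 107c of the I/O bus 107.
The management device 120 is provided with a bus lock status register 1006, which holds the current value of the bus lock signal 107c. The bus lock status register 1006 is configured so that the CPU 121 on the management device 120 can read it, and the management program 211 can therefore learn its value.
During normal operation, the management device 120 is configured to output the response signal 107d only when the address signal 107b of the I/O bus 107 designates the management device 120. In addition, it has means for sending the response signal 107d onto the I/O bus 107 at any time at the direction of the management program 211.
The response signal 107d is controlled by a proxy response control register 1001. When the proxy response control register 1001 is 0, the response signal 1003 output by the device control circuit 1002 is output as the response signal 107d of the I/O bus.
The I/O bus data signal 107e is also controlled by the proxy response control register 1001. According to the value of the register 1001, a switch circuit 1005 outputs either the output of the device control circuit 1002 or the output of the proxy response value register 1004 onto the data signal 107e.
That is, when the proxy response control register 1001 is set to 1, the response signal 107d becomes active and the value stored in the proxy response value register 1004 is sent out on the bus data signal 107e.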
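Next, the processing of the management program 211 in the present embodiment will be described. FIG. 11 is a flowchart showing the processing by which the management program 211 forcibly stops the OS 201.
First, the management program 211 reads the bus lock status register 1006 and checks whether the I/O bus 107 is locked (step 1101). If it is not locked, the program proceeds to step 1103 and, by the same procedure as in the first embodiment, sets the fault generation register 303 to 1 and injects a fault signal into the I/O bus 107.
If it is locked, the program proceeds to step 1102, where it sets the proxy response control register 1001 to 1. This attempts to release the lock on the I/O bus 107; the program then returns to step 1101 and checks the bus lock state again. If the bus lock has been released, it proceeds to step 1103 and injects the fault signal.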
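The loop of FIG. 11 can be sketched in Python against a toy bus model. Everything named here (`SimulatedIOBus`, `force_stop`, `requester_releases`) is a hypothetical stand-in for the hardware registers 1006, 1001 and 303. The `max_attempts` bound is an assumption added for the example; the flowchart as described simply returns to step 1101 and does not state a retry limit.

```python
class SimulatedIOBus:
    """Toy model of I/O bus 107 as seen through registers 1006, 1001 and 303."""

    def __init__(self, locked: bool, requester_releases: bool = True):
        self.locked = locked                      # mirrors bus lock status register 1006
        self.requester_releases = requester_releases
        self.fault_injected = False

    def set_proxy_response(self) -> None:
        """Model writing 1 to proxy response control register 1001."""
        if self.requester_releases:
            self.locked = False                   # requester sees a response and unlocks

    def inject_fault(self) -> None:
        """Model writing 1 to fault generation register 303 and accessing the bus."""
        self.fault_injected = True


def force_stop(bus: SimulatedIOBus, max_attempts: int = 3) -> bool:
    """Return True once the fault has been injected, False if the lock never clears."""
    for _ in range(max_attempts):
        if not bus.locked:            # step 1101: inspect register 1006
            bus.inject_fault()        # step 1103: inject the fault signal
            return True
        bus.set_proxy_response()      # step 1102, then back to step 1101
    return False
```

A caller that receives `False` would need some other recovery path, such as the dedicated reset line described for the fourth embodiment.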
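With the means and procedure described above, the management device 120 can inject a fault signal into the I/O bus 107 even when the I/O bus 107 is locked by another device. This widens the range of faults for which the OS 201 can be forcibly stopped from the management device 120, which is connected to the computer 100 only by the I/O bus 107.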
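(3) Third embodiment
Next, a third embodiment of the present invention will be described. In the second embodiment, releasing the lock on the I/O bus 107 and injecting the fault into the I/O bus 107 were controlled separately. This embodiment describes means for implementing both in the management device 120 as a single circuit.
FIG. 12 shows the configuration of the fault generation device 1201 of the present embodiment. The fault generation device 1201 contains a fault generation circuit 1202 and a bus lock release circuit 1203. The fault generation circuit 1202 has the same configuration as the fault generation device 130 shown in FIG. 3 for the first embodiment. The bus lock release circuit 1203 likewise has the same configuration as that shown in FIG. 10 for the second embodiment.
The fault generation device 1201 samples the bus lock signal 107c of the I/O bus 107 in synchronization with the clock 604 and stores it in a bus lock status register 1204.
The fault generation device 1201 controls fault signal injection by means of a fault generation register 1205. When the fault generation register 1205 is 0, neither the fault generation circuit 1202 nor the bus lock release circuit 1203 operates. When stopping execution of the OS 201, the management program 211 sets the fault generation register 1205 to 1.
If the bus lock signal 107c is not active when the fault generation register 1205 is set to 1, the fault generation circuit 1202 operates and sends a fault signal onto the I/O bus 107.
If the bus lock signal 107c is active when the register 1205 is set to 1, the bus lock release circuit 1203 operates. The circuit 1203 sends the bus response signal 107d and the bus data signal 107e onto the I/O bus 107 in an attempt to release the bus lock.
When the bus lock is released, that is, when the bus lock signal 107c becomes inactive, the fault generation circuit 1202 operates and sends the fault signal onto the I/O bus 107.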
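The selection between the two circuits of the fault generation device 1201 amounts to a small truth table. The Python sketch below is only an illustration of that selection; the function name and the tuple return value are assumptions, and timing relative to the clock 604 is not modeled.

```python
from typing import Tuple


def fault_device_1201_outputs(reg1205: int, bus_lock_active: bool) -> Tuple[bool, bool]:
    """Return (fault circuit 1202 active, bus lock release circuit 1203 active)."""
    if reg1205 == 0:
        return (False, False)   # neither circuit operates
    if bus_lock_active:
        return (False, True)    # try to release the lock with a proxy response
    return (True, False)        # bus is free: send the fault signal
```

Evaluating the function on successive samples of the lock signal reproduces the sequence described above: lock release first, then fault injection once the lock signal 107c goes inactive.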
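According to the present embodiment, execution of the computer 100 can be stopped more reliably than when software monitors the lock signal and injects the fault signal as in the second embodiment. The software control portion of the second embodiment can also be eliminated.
In the second and third embodiments, the management device 120 released the bus lock by sending a pseudo response signal onto the I/O bus 107. Depending on the design of the I/O bus 107, the response may have to specify its destination. In that case, the management device 120 need only record the bus identifier of the device that issued the bus transaction requiring the bus lock.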
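(4) Fourth embodiment
Next, a fourth embodiment of the present invention will be described. The embodiments described so far stop execution of the computer 100 using only the I/O bus 107 connection, but the management device 120 may also be equipped with the conventional dedicated signal line. For example, when stopping execution of the computer 100, the management device first attempts to stop the OS 201 by the means of the present invention and, if that fails, resets the computer 100 by the conventional means. The configuration of the computer 100 and the management device 120 that realizes this is described below.
FIG. 13 shows the configuration of the computer 100 and the management device 120 in the fourth embodiment. The computer 100 has a reset circuit 1302 that resets the CPU 101. The reset circuit 1302 is connected to the management device 120 by a reset control line 1303. When the reset control line 1303 becomes active, the reset circuit 1302 operates and resets the CPU 101, which resets the entire computer.
The management device 120 has a reset control register 1301. The reset control register 1301 is configured so that it can be set by the CPU 121. When the reset control register 1301 is set to 1, the reset control line 1303 becomes active.
Next, the processing flow by which the management program 211 stops the computer 100 will be described. FIG. 14 is the corresponding flowchart. First, the program drives the fault generation device 130 and tries sending a fault signal onto the I/O bus 107 (step 1401). After waiting a predetermined time (step 1402), it checks whether the OS 201 has executed fault processing (step 1403). If the processing has not been executed, the program sets the reset control register 1301 to 1 in step 1404 and resets the computer 100.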
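The fallback order of FIG. 14 can be sketched as follows. This Python example is a sketch under stated assumptions: the three callbacks, the return strings, and the `wait_seconds` default are hypothetical, and the description does not say how the management program detects that the OS 201 has executed fault processing, so that check is left to the caller.

```python
import time
from typing import Callable


def stop_computer(
    inject_fault: Callable[[], None],
    os_ran_fault_processing: Callable[[], bool],
    set_reset_register: Callable[[], None],
    wait_seconds: float = 1.0,
    sleep: Callable[[float], None] = time.sleep,
) -> str:
    """Try the bus-fault stop first and fall back to a hardware reset."""
    inject_fault()                     # step 1401: drive fault generation device 130
    sleep(wait_seconds)                # step 1402: wait a predetermined time
    if os_ran_fault_processing():      # step 1403: did OS 201 handle the fault?
        return "os-fault-processing"
    set_reset_register()               # step 1404: set reset control register 1301 to 1
    return "hard-reset"
```

Passing `sleep` as a parameter keeps the sketch testable without real delays.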
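(5) Fifth embodiment
In the embodiments described so far, a remote computer or an operator provides the trigger for sending a fault onto the I/O bus 107, but the management device 120 or the management program 211 may itself decide whether to send the fault. The fifth embodiment of the present invention describes a scheme in which the management agent program 203 and the management program 211 cooperate to send the fault. The management device 120 has an agent activation register, which indicates that the management agent 203 is running. The agent activation register is configured to be accessible from both the CPU 101 of the computer 100 and the CPU 121 of the management device 120 (not illustrated).
The management agent 203 is configured to run at fixed time intervals and to set the agent activation register each time it runs (flowchart omitted). The management device 120 determines whether the computer 100 is running normally by reading the agent activation register.
FIG. 15 is a flowchart showing the processing of the management program 211 running on the management device 120. The processing shown in FIG. 15 is configured to be executed at fixed time intervals.
The management program 211 maintains a variable (the not-running count) that records the number of times the agent activation register was found not to be set when inspected.
The processing of the management program 211 is as follows. First, it inspects the agent activation register of the management device 120 (step 1501). If the register is set, the program clears the register (step 1504), sets the not-running count to 0 (step 1505), and ends.
If the register is not set, the program examines the not-running count (step 1502). If the count equals a predetermined positive integer X, the program sends a fault signal onto the I/O bus 107 (step 1503). If it does not equal X, the program adds 1 to the count (step 1506) and ends.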
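The counting logic of FIG. 15 can be sketched in Python as shown below. The class and method names are hypothetical, and the agent activation register is modeled as a simple boolean attribute rather than a hardware register shared over the I/O bus.

```python
class AgentMonitor:
    """Model of the periodic check that management program 211 runs (FIG. 15)."""

    def __init__(self, x: int):
        self.x = x                     # predetermined positive integer X
        self.not_running_count = 0     # the "not-running count" variable
        self.agent_register = False    # agent activation register

    def agent_heartbeat(self) -> None:
        """Called by management agent 203 each time it runs."""
        self.agent_register = True

    def periodic_check(self) -> bool:
        """Return True when a fault signal should be sent onto I/O bus 107."""
        if self.agent_register:              # step 1501: register is set
            self.agent_register = False      # step 1504: clear the register
            self.not_running_count = 0       # step 1505: reset the count
            return False
        if self.not_running_count == self.x:  # step 1502: compare with X
            return True                       # step 1503: send the fault signal
        self.not_running_count += 1           # step 1506: count one more miss
        return False
```

Following the flow as written, the fault is sent on the check that finds the count already equal to X, that is, on the (X+1)th consecutive check without a heartbeat.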
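In this way the management program 211 can inspect the execution state of the computer 100 and send a fault onto the I/O bus 107 on its own initiative. When sending the fault, it may also send the remote computers 151 and 170 a message indicating that the computer 100 has been forcibly stopped.
In the fifth embodiment the fault is sent onto the I/O bus 107 by software, but the management device 120 may instead be provided with a watchdog timer configured to drive the fault generation device 130 if it is not rearmed within a fixed time.
In this case the management agent 203 is configured to run at fixed time intervals and to rearm the watchdog timer each time it runs. No special processing is needed on the management program 211 side.
The management program 211 may also read the contents of the main memory 102 of the computer 100 to check the execution status of the OS 201 and send a fault signal onto the I/O bus 107 accordingly.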
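Industrial applicability
As described above, the computer fault processing method and apparatus according to the present invention are suited to building a computer system in which a management device sends a fault signal to a managed computer via an I/O bus, and the managed computer, triggered by receipt of this signal, initializes its bus and generates an interrupt.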
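Brief description of the drawings
FIG. 1 is a system configuration diagram of an embodiment of the present invention.
FIG. 2 is a program configuration diagram of an embodiment of the present invention.
FIG. 3 is a configuration diagram of the device control device.
FIG. 4 is a configuration diagram of the I/O bus control device.
FIG. 5 is a configuration diagram of the fault processing portion in the CPU.
FIG. 6 is a configuration diagram of the bus initialization portion in the CPU.
FIG. 7 is a flowchart of the processing of the bus error interrupt handler of the OS.
FIG. 8 is a flowchart of the processing of the management program executed on the management device.
FIG. 9 is a diagram showing the timing of signals on the I/O bus.
FIG. 10 is a configuration diagram of the bus lock release device in the management device in the second embodiment of the present invention.
FIG. 11 is a flowchart of the processing of the management program executed on the management device in the second embodiment of the present invention.
FIG. 12 is a configuration diagram of the fault generation device in the management device in the third embodiment of the present invention.
FIG. 13 is a configuration diagram of the computer and the management device in the fourth embodiment of the present invention.
FIG. 14 is a flowchart of the computer stop processing executed on the management device in the fourth embodiment of the present invention.
FIG. 15 is a flowchart of the computer stop processing executed on the management device in the fifth embodiment of the present invention.
Technical field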
The present invention relates to a computer system, and more particularly to a computer system that efficiently performs fault processing.
Background art
There is a method of connecting a remote management device, which is an input/output device for remote management, to a computer via an I/O bus such as a PCI bus and managing the computer through that device. The remote management device has a communication input/output device such as a network adapter or a modem and connects to other computers via a LAN, a telephone line, or the like, so that the computer can be managed from another computer at a remote location.
The remote management device obtains operating information about the computer via the I/O bus or via a dedicated bus that transfers management information of the managed computer. The remote management device holds registers and memory that the CPU of the managed computer can access via the I/O bus.
Further, as disclosed in JP-A-9-50386, JP-A-5-257914, and JP-A-5-250284, the remote management device may be configured as a computer (management device computer) having a CPU, a memory, and I/O devices including a communication device such as a network adapter or a modem. In this case, the CPU on the management device computer can execute a management program independently of the managed computer, regardless of the execution state of the managed computer. That is, the management device computer can run even before the operating system (OS) of the managed computer has started, while the computer is stopped by a fault, or while it accepts no external operation (hang-up).
When a fault causes the managed computer to hang, a conventional management device connected to the I/O bus restarts the computer by resetting the CPU or by cutting off the power of the managed computer. This restart is implemented by connecting the management device and the managed computer with a dedicated signal line and, via that line, either sending a reset signal to the CPU of the managed computer or sending an interrupt that transfers control to firmware on the managed computer. The dedicated line is needed because the I/O bus has no signal line for sending an interrupt that forcibly stops execution of the OS.
To implement this restart method, a signal line other than the I/O bus must be installed between the management device and the managed computer. This limits the managed computers to which the management device can be connected. In other words, unless the management device and the managed computer are a combination that can be connected by a dedicated line, the management device cannot restart the managed computer when a fault occurs.
In addition, because the conventional restart method relies on resetting the CPU, the OS has no opportunity to intervene, and restarting the OS loses the contents of the main memory of the managed computer. This makes it difficult to analyze the cause of the fault. Moreover, a fault that cannot be reproduced then cannot be analyzed at all, which is a problem.
Turning to general-purpose I/O buses such as the PCI bus, as described above, the management device cannot send the managed computer an interrupt that forcibly shifts execution of the OS to fault processing. However, some I/O buses have signal lines that carry additional information (for example, parity bits) for guaranteeing the correctness of the addresses, commands, and data transferred over the I/O bus (PCI Hardware and Software Architecture Design, pp. 172-174, Annabooks, 1994). On an I/O bus that can carry such additional information, the managed computer and the input/output devices can verify the correctness of data on the I/O bus during data transfers.
Furthermore, where an I/O bus with this capability is used, some I/O bus control devices have a signal line for notifying the CPU of a fault when an illegal signal is detected by means of the additional information on the I/O bus (Microprocessor Report, pp. 11-12, Vol. 12, Number 9, July 1998).
As for the CPU of the managed computer, a fault on the bus can make memory access impossible and leave the CPU unable to operate. When the bus is locked in this way, merely sending an interrupt signal to the CPU cannot resume CPU execution, because the bus fault prevents memory access and the interrupt handler therefore cannot be started.
To deal with such faults, there is a CPU that, on detecting a fault signal related to the bus, reinitializes only the bus instead of resetting the CPU, then generates an interrupt internally and passes control to the interrupt handler (Microprocessor Report, pp. 1, 6-10, Vol. 12, Number 9, July 1998). With this CPU, execution can be resumed even if the bus has locked up, and OS fault processing can be started.
In a conventional management device connected to the I/O bus of a computer, when a fault occurs that prevents the OS from executing fault processing, the entire computer is restarted either by resetting the CPU through a signal line other than the I/O bus or by having firmware on the computer reset the CPU. With these methods the CPU is reset, so the OS cannot execute fault processing and fault information cannot be acquired.
The conventional management device also requires a signal line separate from the I/O bus, or a circuit or firmware on the computer that executes CPU reset processing. This limits the computers to which the management device can be connected.
An object of the present invention is to provide a computer system that can acquire fault information even when a fault occurs in a computer that prevents the OS from executing fault processing.
Another object of the present invention is to provide a computer system in which the bus of the managed computer can be initialized via the I/O bus.
Disclosure of the invention
To achieve the above objects, in a computer system according to the present invention in which a computer and a management device are connected by an I/O bus, when a fault occurs that prevents the OS of the computer from executing fault processing, the fault management device sends, to the I/O bus management device in the computer, an I/O bus signal that reports the occurrence of an I/O bus fault. The I/O bus management device then initializes the I/O bus and afterwards reports the I/O bus fault to the CPU of the computer as an interrupt to be processed by the OS.
In this way, it is possible to provide a computer system that can acquire fault information, triggered by an interrupt to the OS, even when a fault has occurred that would conventionally prevent the OS from executing fault processing. It is also possible to provide a computer system in which the bus of the managed computer can be initialized via the I/O bus.
BEST MODE FOR CARRYING OUT THE INVENTION
Embodiments of the present invention will be described below with reference to the drawings.
(1) First embodiment
FIG. 1 is a diagram showing a system configuration of an embodiment of the present invention. The computer 100 is a computer to be managed by the management device 120.
The configuration of the computer 100 is as follows. The CPU 101 and the main memory 102 are connected by a bus 103. An I/O bus control device 104, which controls the I/O bus 107, is connected to the bus 103. The bus 103 includes a signal line that instructs the CPU 101 and the I/O bus control device 104 to reset their internal state related to the bus 103. The I/O bus 107 extends from the I/O bus control device 104. Connected to the I/O bus 107 are the management device 120, an external storage device 105, a console 106 made up of interactive devices such as a keyboard and a display, and other devices.
The I/O bus control device 104 forwards input/output operations executed by the CPU 101 to the I/O bus 107, transfers data from input/output devices on the I/O bus 107 to the main memory 102 or to registers in the CPU 101, forwards interrupts to the CPU 101, and so on.
The I/O bus control device 104 and the CPU 101 are connected by a bus error notification line 108. The bus error notification line 108 is a bus signal line by which the I/O bus control device 104 notifies the CPU 101 of a bus error when it detects an error on the I/O bus 107.
Next, the management device 120 will be described. The management device 120 is a kind of external input/output device connected to the I/O bus 107 of the computer 100, and it enables remote monitoring of the execution status of the computer 100 and remote operations such as starting and stopping it. The management device 120 is itself a computer, and the programs that run on it can execute independently even when the OS of the computer 100 is stopped. A program running on the management device 120 controls the modem 127 and the network adapter 128 and cooperates with remote computers such as the computers 151 and 170, so that the computer 100 can be operated from those remote computers.
The CPU 121 and the main memory 122 of the management device 120 are connected by a bus 123. An I/O bus control device 124 is connected to the bus 123, and an I/O bus 125 extends from the I/O bus control device 124. A modem 127 and a network adapter 128 are attached to the I/O bus 125, enabling communication with remote computers.
The management device 120 is connected to the I/O bus 107 of the computer 100 via the device control device 126. The device control device 126 receives input/output operation requests that the CPU 101 issues to the management device 120 and performs control according to each request, for example changing the contents of the main memory 122 or sending an interrupt to the CPU 121.
The device control device 126 is configured so that it also appears as an input/output device to the CPU 121. In response to input/output operations executed by the CPU 121, the device control device 126 performs operations such as writing data to the I/O bus 107.
The device control device 126 contains a fault generation device 130. The fault generation device 130 sends an illegal signal onto the I/O bus 107 in response to an instruction from the CPU 121. When the I/O bus control device 104 of the computer 100 detects an illegal signal on the I/O bus 107, it notifies the CPU 101 of the fault through the bus error notification line 108.
FIG. 2 is a software configuration diagram of the embodiment of the present invention. Here, the management device 120 is connected to the I/O bus 107 of the computer 100, and the network adapter 128 of the management device 120 is connected to the management computer 151 via a network.
The OS 201, the OS 221, and the OS 213 are loaded and running on the computer 100, the computer 151, and the management device 120, respectively. An ordinary group of application programs 202 runs on the computer 100. In addition, a management agent program 203, which runs in cooperation with the management device 120, operates on the computer 100. The management agent 203 collects the execution status of the programs 202 and the OS 201 running on the computer 100, sends that status to the management device 120, issues operation instructions to the management device 120, acquires the execution status information about the computer 100 collected by the management device 120, and performs operation management processing. Operation management processing includes setting automatic start and stop times for the computer 100; shutting down, rebooting, and powering off the computer 100; displaying management information; and sending management information to the network.
On the management device 120, a communication control program 212, which communicates with the remote computer 151, and a management program 211, which performs operation management processing for the computer 100, are running. The management program 211 acquires the operating status of the computer 100, controls the power of the computer 100 at specified times, automatically starts and stops the OS 201, forwards information collected by the management agent 203 to the remote management computer 151, processes operation requests from the remote computer 151, and so on.
The programs 211 to 213 on the management device 120 can run even when the OS 201 of the computer 100 is stopped. When the computer 100 is stopped by a fault in the OS 201, the management program 211 performs fault processing such as reading the contents of the main memory 102 via the I/O bus 107 and sending fault information to the remote computer 151. In addition, in the present embodiment, it drives the fault generation device 130 to send a fault signal onto the I/O bus 107 and thereby start the fault processing of the OS 201.
The remote computers 151 and 170 are connected to the management device 120 through a network 150 such as a LAN or a communication line 140 such as a telephone line. A remote computer management program 220 runs on the remote computer 151. The program 220 exchanges management information with the management program 211 on the management device 120 and carries out operation management of the computer 100. For example, it displays operation management information for the computer 100, stops or reboots the computer remotely, and instructs the OS 201 to start fault processing.
When a fault occurs on the bus 103 or the I/O bus 107, the CPU 101 generates a bus error interrupt and executes fault processing. The OS 201 contains an interrupt handler 204 that processes bus error interrupts. The interrupt handler 204 is registered in the interrupt vector of the CPU 101 so that it is executed when a bus error interrupt occurs.
FIG. 3 is a diagram showing the configuration of the device control device 126 in the present embodiment. The device control device 126 is connected to the I / O bus 125 of the management device 120 and the I / O bus 107 of the computer 100 via the I / O bus interface circuit 301. The circuit 301 is a circuit that extracts data addressed to the device control device 126 from each I / O bus or transmits data from the CPU to the I / O bus. The circuit 301 drives other circuits in the device control device 126 according to the data content acquired from the I / O bus 107.
In the control device 126, a parity generation circuit 302 for the I / O bus 107 and a failure generation device 130 are incorporated. In this embodiment, the parity generation circuit 302 generates a parity signal 107a related to the address signal 107b to be sent to the I / O bus 107 by a combination of exclusive OR circuits. During normal execution, the parity signal generated by the parity generation circuit 302 is sent to the I / O bus 107 as it is.
The failure generation device 130 inverts the parity signal generated by the parity generation circuit 302 and generates a signal defined as a failure on the I / O bus 107. The generation of the fault signal is controlled by the fault generation register 303. During normal operation, register 303 is set to zero. When the register 303 is set to 1, the failure generator 130 inverts the signal generated by the parity generation circuit 302 and sends a signal that causes a failure to the I / O bus 107.
The register 303 is configured to be accessible by an input / output command of the CPU 121 of the management device 120. The management program 211 can forcibly stop the OS 201 of the computer 100 by executing an operation of setting the register 303 to 1 and accessing the I / O bus 107.
When the fault generation device 130 sends an invalid value on the parity signal 107a, it sets the fault generation state register 304 to 1. It also resets the register 303 to 0 so that failures are not injected into the I / O bus 107 continuously.
In this embodiment, the failure is sent to the I / O bus by setting the parity of the address signal to an illegal value, but the method of generating an illegal bus signal is not limited to this.
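The behavior of the parity generation circuit 302, the fault generation register 303, and the fault generation state register 304 can be sketched in software as follows. This is only an illustrative model, not the circuit itself: the 32-bit address width, the class name, and the method names are assumptions introduced for the example.

```python
def address_parity(address: int, width: int = 32) -> int:
    """XOR of all address bits, as a chain of exclusive OR gates would compute.

    The 32-bit width is an assumption made for this sketch.
    """
    parity = 0
    for bit in range(width):
        parity ^= (address >> bit) & 1
    return parity


class FaultGeneratorModel:
    """Hypothetical software model of the failure generation device 130."""

    def __init__(self) -> None:
        self.fault_generation = 0  # models register 303
        self.fault_state = 0       # models register 304

    def drive_parity(self, address: int) -> int:
        """Return the value driven on the parity signal for one bus access."""
        parity = address_parity(address)
        if self.fault_generation == 1:
            parity ^= 1                 # invert: the bus sees an illegal parity
            self.fault_state = 1        # record that a fault was injected
            self.fault_generation = 0   # one-shot: do not inject continuously
        return parity
```

In this model, setting `fault_generation` to 1 corrupts exactly one subsequent access; later accesses carry correct parity again.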
The I / O bus control device 104 will be described. FIG. 4 is a diagram showing a part of the configuration of the I / O bus control device 104 in the present embodiment.
The I / O bus control device 104 performs transmission of data to the I / O bus 107 and fetching of data from the I / O bus 107. In order to check whether or not the data on the I / O bus 107 is illegal at the time of data capture, the parity signal 107a related to the address signal 107b is referred to. The parity calculation circuit 401 in the I / O bus control device 104 obtains a parity value from the address signal 107b. This parity value is compared with the parity signal 107a of the I / O bus 107. If they do not match, the bus error notification line 108 notifies the CPU 101 of a bus failure.
When a failure signal is sent to the I / O bus 107 by the failure generator 130, the parity value becomes invalid, so the CPU 101 is notified of the bus failure.
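The comparison performed on the receiving side can be illustrated with a short sketch. It assumes the same XOR-of-all-bits parity and a 32-bit address; the function names are hypothetical and stand in for the parity calculation circuit 401 and the decision to assert the bus error notification line 108.

```python
def computed_parity(address: int, width: int = 32) -> int:
    """Parity recomputed from the address bits (width is an assumption)."""
    masked = address & ((1 << width) - 1)
    return bin(masked).count("1") & 1


def bus_error_detected(address: int, parity_signal: int) -> bool:
    """True when the received parity disagrees with the recomputed parity.

    A True result corresponds to notifying the CPU of a bus failure.
    """
    return computed_parity(address) != parity_signal
```

An access whose parity bit was inverted by the sender is reported as a bus failure, while an unmodified access passes the check.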
FIG. 5 shows a configuration relating to bus failure processing on the CPU 101 side. When a bus error is notified through the bus error signal line 108, the CPU 101 initializes the bus 103 by the bus initialization circuit 501. Here, the initialization of the bus 103 means that the state relating to the bus in the CPU 101 is set to the initial state; it is not a reset of the CPU 101. This bus initialization process is also necessary for other devices connected to the bus 103, and the other devices are instructed to initialize the bus by a bus initialization signal 103b.
Further, the CPU 101 delays the error notification signal 108 by the delay circuit 502 and, when the initialization of the bus 103 is completed, drives the interrupt control circuit 504 to internally generate a bus error interrupt.
A normal external interrupt is notified to the processor by an external interrupt signal 103a. The external interrupt is masked according to the value of the interrupt prohibit register 503. If the interrupt due to the bus error notification is configured to bypass the mask control by the interrupt prohibit register 503 and drive the interrupt control circuit 504 directly, an interrupt due to a bus fault can be generated even when interrupts are disabled in the CPU 101.
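The ordering and masking behavior described for FIG. 5 can be modeled roughly as below. The sketch assumes the configuration in which the bus error path bypasses the interrupt prohibit register 503; the class, method, and event names are hypothetical.

```python
class CpuBusErrorPath:
    """Hypothetical model of the bus error path in the CPU 101."""

    def __init__(self) -> None:
        self.interrupts_disabled = False  # models interrupt prohibit register 503
        self.events = []                  # ordered record of what the CPU did

    def external_interrupt(self, source: str) -> bool:
        """Ordinary external interrupt: dropped while interrupts are disabled."""
        if self.interrupts_disabled:
            return False
        self.events.append(f"interrupt:{source}")
        return True

    def bus_error_notified(self) -> bool:
        """Bus error notification: initialize the bus first, then interrupt.

        The mask is deliberately not consulted here (assumed bypass).
        """
        self.events.append("bus_init")             # bus initialization circuit 501
        self.events.append("interrupt:bus_error")  # raised after the delay
        return True
```

Even with `interrupts_disabled` set, `bus_error_notified()` still records the bus initialization followed by the bus error interrupt.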
The bus initialization process of the CPU 101 will be described. FIG. 6 is a diagram showing a configuration example of the bus initialization circuit 501 of the CPU 101.
Circuits related to the bus of the CPU 101 are driven in synchronization with the clock signal 604.
Within the CPU 101 is a circuit that controls the bus 103. Among them, there is a portion that holds a state related to data that has flowed through the bus 103 in the past. In this example, it is assumed that the register 603 constituted by a flip-flop stores the bus state. The register 603 captures the bus state in synchronization with the clock signal 604.
The value of the register 603 during normal operation is determined by the bus control circuit 601. When the bus initialization signal 103 b is not active, that is, 0, the switch circuit 605 is configured so that the output value of the bus control circuit 601 reaches the register 603.
When the bus initialization signal 103 b is active, the switch circuit 605 is configured so that the value set in the initial state register 602 reaches the register 603. The value of the initial state register 602 is set in advance in the CPU 101 or is set by initialization when the computer 100 is turned on. Thus, the CPU 101 can set the register 603 to an initial state upon receiving the bus initialization signal 103b.
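The selection made by the switch circuit 605 amounts to a two-input multiplexer in front of a clocked register. A minimal sketch, with a hypothetical class name and integer-valued states chosen only for illustration:

```python
class BusStateRegisterModel:
    """Hypothetical model of register 603 fed by switch circuit 605."""

    def __init__(self, initial_state: int) -> None:
        self.initial_state = initial_state  # models initial state register 602
        self.value = initial_state          # models register 603

    def clock(self, bus_init_active: bool, control_output: int) -> int:
        """Capture the next state on one clock edge.

        bus_init_active models signal 103b; control_output models the
        output of the bus control circuit 601.
        """
        self.value = self.initial_state if bus_init_active else control_output
        return self.value
```

While the initialization signal is inactive the register follows the bus control circuit; one clock with the signal active restores the initial state.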
In this embodiment, the CPU 101 sends a bus initialization signal 103b to the bus 103. However, each device connected to the bus 103 may instead detect the bus error notification signal 108 and perform the initialization itself.
In the present embodiment, with the above hardware configuration, the management device 120 connected to the I / O bus 107 of the computer 100 can send a signal defined as a failure to the I / O bus 107 at any time, independent of the execution state of the computer 100. This makes it possible to initialize the internal state related to the bus 103 held by each device connected to the bus 103 and to generate a bus error interrupt in the CPU 101.
Next, software processing according to this embodiment will be described. FIG. 7 is a flowchart showing the processing of the bus error interrupt handler 204 in the OS 201 executed by the computer 100.
When the CPU 101 captures the bus error interrupt, it passes control to the interrupt handler 204 starting from step 701. The bus error interrupt may or may not be intentionally generated by the management device 120. The interrupt handler 204 first acquires the value of the fault generation status register 304 of the management device 120 (step 701). The register 304 is configured to be accessible from the CPU 101 via the I / O bus 107.
Subsequently, the obtained value of the register 304 is inspected (step 702). If the value of the register 304 is 0, that is, if the management device 120 has not sent a bus fault, normal bus error processing is performed (step 705). Examples include displaying failure information on the console 106, dumping the main memory 102 to the external storage device 105, and restarting the computer 100.
If the register 304 is 1, that is, if the bus error was caused by the management device 120 injecting a fault into the I / O bus 107, the fault generation state register 304 is reset (step 703), and a message to that effect is displayed on the console 106 (step 704). Reference numeral 720 denotes an example of the console screen display.
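The decision made by the bus error interrupt handler 204 in FIG. 7 can be sketched as follows. The sketch assumes that register 304 reads 1 only after the management device has injected a fault, as the register's description indicates. The callback parameters and the message strings are hypothetical; the actual console text of screen 720 is not given here.

```python
def bus_error_interrupt_handler(read_fault_state, reset_fault_state, console):
    """Hypothetical model of the handler flow (steps 701 to 705).

    read_fault_state:  returns the value of register 304 (0 or 1)
    reset_fault_state: clears register 304
    console:           callable that displays one message
    """
    value = read_fault_state()              # step 701
    if value == 1:                          # step 702
        reset_fault_state()                 # step 703
        console("stopped by management device (illustrative message)")  # step 704
        return "forced_stop"
    console("bus error (illustrative message)")  # step 705: normal processing
    return "normal_bus_error"
```

A real handler would follow either branch with the failure processing the OS provides, such as a memory dump; that part is left out of this sketch.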
Processing of the management program 211 in the management apparatus 120 will be described. FIG. 8 is a flowchart illustrating a processing example of the management program 211.
First, in step 801, it is checked whether there is a stop request for the computer 100. A stop request is either received from the remote computer 151 or 170 through the modem 127 or the network adapter 128 via the communication line, or raised when the emergency stop button 129 is pressed.
If there is no stop request, the operation status of the computer 100 is collected and stored in the management data 210 (step 802). It is determined from the acquired data 210 whether the computer 100 is executing normally (step 803). If so, the operating status is transmitted to the remote computer (step 804). If it is stopped, the process proceeds to step 807, where failure information is acquired and transmitted to a remote computer.
If there is a stop request, step 805 is executed. Here, the fault generation register 303 is set to 1 and an instruction to access the I / O bus 107 is executed (step 806). As a result, a bus error interrupt is generated by the CPU 101, and control is passed to the bus error interrupt handler 204.
Thereafter, the process proceeds to step 807, and the failure information is transmitted to the remote computer.
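One pass through the flow of FIG. 8 can be sketched as a single function. The callbacks stand in for hardware and network operations and are hypothetical names introduced for the example; only the step order follows the description above.

```python
def management_program_step(stop_requested, collect_status, is_running,
                            send_status, inject_fault, send_failure_info):
    """Hypothetical model of one iteration of the management program 211."""
    if stop_requested():            # step 801
        inject_fault()              # steps 805-806: set register 303, access the bus
        send_failure_info()         # step 807
        return "stop_requested"
    status = collect_status()       # step 802: store into management data
    if is_running(status):          # step 803
        send_status(status)         # step 804
        return "running"
    send_failure_info()             # step 807
    return "stopped"
```

The three return values are labels added for the sketch so that a caller or a test can tell which branch was taken.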
With the hardware configuration and software procedure described above, the management apparatus 120 connected to the I / O bus 107 can forcibly stop the execution of the OS 201 running on the computer 100 and cause the bus error interrupt handler 204, which is the failure processing of the OS, to be executed.
In the present embodiment, the failure generation device 130 of the management device 120 realizes the forced stop of the OS 201 executed by the computer 100 by sending a signal that causes a failure to the I / O bus 107 at an arbitrary time, regardless of the execution state of the computer 100. In this embodiment, the computer 100 and the management apparatus 120 are connected only by the I / O bus 107. Compared with the conventional method of connecting the management apparatus and the computer with a dedicated signal line, the restriction on the computers 100 to which the management apparatus 120 can be connected is relaxed.
Further, since the conventional management apparatus restarts the computer by a CPU reset when OS execution stops due to a failure, it is difficult to analyze the cause of the failure. In this embodiment, on the other hand, the I / O bus control device 104 notifies the CPU 101 of a bus error, and the CPU 101 generates an interrupt in response to the bus error and executes the interrupt handler 204. By extending the interrupt handler 204, failure processing such as storing the contents of the main memory 102 in the external storage device 105, analyzing and removing the cause of the failure, and stopping the OS 201 can be executed, so that later failure analysis and recovery become easier.
Further, since the CPU 101 generates an interrupt after the CPU 101 and each device connected to the bus 103 initialize the internal state related to the bus 103, the possibility that the interrupt handler 204 can be executed increases.
In this embodiment, the contents of the main memory 102 are stored in the external storage device 105 by the bus error interrupt handler 204. However, all or part of the contents of the main memory 102 and the failure analysis information produced by the interrupt handler 204 may instead be stored in the main memory 122 of the management apparatus 120.
In this embodiment, the management device 120 sends a failure signal to the I / O bus 107. However, the failure generation device 130 may instead be incorporated into a device such as a network adapter or a modem so that a failure signal is sent to the I / O bus 107 when that device receives a specific packet or data.
(2) Second embodiment
Next, a second embodiment of the present invention will be described.
In the first embodiment, the management apparatus 120 connected to the I / O bus 107 needs to send a signal recognized as a failure to the I / O bus 107. For this purpose, the management apparatus 120 must acquire the right to access the I / O bus 107. That is, it must win arbitration for the bus 107 and acquire the right to use it.
However, the management device 120 may not be able to acquire the right to use the I / O bus 107. When the CPU 101 executes an indivisible sequence of operations on a device connected to the I / O bus 107, it acquires the bus use right so as to use the I / O bus 107 exclusively. This is called locking the bus. If the target device then cannot respond because of a failure, the right to use the bus 107 is never released.
In such a case, in the first embodiment, since the failure signal cannot be injected into the I / O bus 107, the failure processing of the OS 201 of the computer 100 cannot be started from the management device 120.
In the second embodiment of the present invention, a means and procedure for sending a failure signal after releasing the locked state of the bus will be described. In the present embodiment, the management device 120 can check the lock state of the I / O bus 107. Further, when an I / O bus request remains uncompleted while the bus is locked, the management device 120 sends arbitrary data in response to that request so that the request appears to have completed, which causes the request issuer to release the bus lock.
A data flow on the I / O bus will be described. FIG. 9 is a timing chart showing the flow of data on the I / O bus 107 in this embodiment.
FIG. 9 shows the state of the bus signal when the access right arbitration of the I / O bus 107 is completed and data is actually transferred. After acquiring the access right, the device that accesses the I / O bus 107 outputs an address signal 107b that designates the access target device.
If this access is to be executed exclusively, the I / O bus lock signal 107c is simultaneously activated. A device connected to the I / O bus 107 is configured so that it cannot issue the next request to the I / O bus 107 while the bus lock signal 107c is active. The request source device keeps the bus lock signal 107c active until the operation is completed.
The device designated by the address signal 107b activates the response signal 107d when the operation is completed, and outputs data to the data signal line 107e when there is data.
The request source device detects that the response signal 107d becomes active, fetches data from the data signal line 107e, and cancels the bus lock signal 107c.
FIG. 10 is a diagram showing the configuration of the management device 120 in the second embodiment. It is assumed that the CPU 101 has issued an indivisible I / O request to the device 1020 but the device 1020 cannot respond.
When the CPU 101 issues an indivisible I / O request, the I / O bus control device 104 activates the bus lock signal 107 c of the I / O bus 107.
The management device 120 is provided with a bus lock state register 1006 that holds the current value of the bus lock signal 107c. The bus lock state register 1006 is configured so that it can be read by the CPU 121 on the management apparatus 120, and the management program 211 can thereby know its value.
In normal operation, the management device 120 is configured to output the response signal 107d only when the address signal 107b of the I / O bus 107 designates the management device 120. In addition to this, it has means for sending a response signal 107d to the I / O bus 107 at an arbitrary time according to an instruction from the management program 211.
The response signal 107d is controlled by the proxy response control register 1001. When the proxy response control register 1001 is 0, the response signal 1003 output from the device control circuit 1002 is output as the response signal 107d of the I / O bus.
The I / O bus data signal 107e is also controlled by the proxy response control register 1001. The switch circuit 1005 outputs the output value of the device control circuit 1002 or the output value of the proxy response value register 1004 to the data signal 107e according to the value of the register 1001.
That is, when the proxy response control register 1001 is set to 1, the response signal 107d becomes active, and the value stored in the proxy response value register 1004 is sent to the bus data signal 107e.
Next, processing of the management program 211 of this embodiment will be described. FIG. 11 is a flowchart showing the forced stop processing of the OS 201 by the management program 211.
First, the management program 211 checks whether or not the I / O bus 107 is locked by referring to the bus lock state register 1006 (step 1101). If it is not locked, the process proceeds to step 1103, where the fault generation register 303 is set to 1 and the fault signal is injected into the I / O bus 107 by the same procedure as in the first embodiment.
If it is locked, the process proceeds to step 1102, where the proxy response control register 1001 is set to 1. As a result, the I / O bus 107 is unlocked. The process then returns to step 1101, and the bus lock state is checked again. If the bus lock has been released, the process proceeds to step 1103 and a failure signal is injected.
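The loop of FIG. 11 can be sketched against a small simulated bus. The simulation assumes that a proxy response always makes the requester drop the lock, and the `max_attempts` bound is an addition for the sketch; the flowchart as described simply returns to step 1101. All class and method names are hypothetical.

```python
class SimulatedLockedBus:
    """Hypothetical stand-in for the registers seen by the management program."""

    def __init__(self, locked: bool) -> None:
        self.locked = locked
        self.proxy_response = 0     # models proxy response control register 1001
        self.fault_generation = 0   # models fault generation register 303

    def read_lock_state(self) -> int:
        """Models reading the bus lock state register 1006."""
        return 1 if self.locked else 0

    def set_proxy_response(self) -> None:
        self.proxy_response = 1
        self.locked = False         # assumption: the requester releases the lock

    def set_fault_generation(self) -> None:
        self.fault_generation = 1


def force_stop(bus: SimulatedLockedBus, max_attempts: int = 8) -> bool:
    """Release the bus lock if needed, then inject the fault."""
    for _ in range(max_attempts):
        if bus.read_lock_state() == 0:   # step 1101
            bus.set_fault_generation()   # step 1103
            return True
        bus.set_proxy_response()         # step 1102, then re-check
    return False                         # lock never released within the bound
```

With an unlocked bus the proxy response register is never touched, so the procedure degenerates to that of the first embodiment.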
With the above means and procedure, the management apparatus 120 can inject a failure signal into the I / O bus 107 even if the I / O bus 107 is locked to another device. As a result, the failure range in which the OS 201 can be forcibly stopped from the management apparatus 120 connected to the computer 100 using only the I / O bus 107 is expanded.
(3) Third embodiment
Next, a third embodiment of the present invention will be described. In the second embodiment, the unlocking of the I / O bus 107 and the control of the fault injection to the I / O bus 107 are executed individually. In the present embodiment, a means of realizing both as a single circuit in the management device 120 will be described.
FIG. 12 is a diagram showing the configuration of the fault generation device 1201 of this embodiment. The fault generation device 1201 includes a fault generation circuit 1202 and a bus lock release circuit 1203. The fault generation circuit 1202 has the same configuration as that of the fault generation apparatus 130 shown in FIG. 3 of the first embodiment. The bus lock release circuit 1203 has the same configuration as that shown in FIG. 10 of the second embodiment.
The fault generation device 1201 collects the bus lock signal 107 c of the I / O bus 107 in synchronization with the clock 604 and stores it in the bus lock status register 1204.
The fault generation device 1201 controls the fault signal injection using the fault generation register 1205. When the fault generation register 1205 is 0, the fault generation circuit 1202 and the bus lock release circuit 1203 do not operate. The management program 211 sets the fault generation register 1205 to 1 when the execution of the OS 201 is to be stopped.
If the bus lock signal 107c is not active when the fault generation register 1205 is set to 1, the fault generation circuit 1202 operates. The circuit 1202 sends a signal that constitutes a failure to the I / O bus 107.
If the bus lock signal 107c is active when the register 1205 is set to 1, the bus lock release circuit 1203 operates. The circuit 1203 sends a bus response signal 107d and a bus data signal 107e to the I / O bus 107 to try to release the bus lock.
When the bus lock is released, that is, when the bus lock signal 107 c becomes inactive, the fault generation circuit 1202 operates and sends a fault signal to the I / O bus 107.
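The selection between the two circuits of the fault generation device 1201 can be modeled as a per-clock decision. This is a sketch under the assumption that the decision depends only on the fault generation register and the sampled bus lock signal; the function name and the returned labels are hypothetical.

```python
def fault_device_action(fault_generation: int, bus_lock_active: bool) -> str:
    """Which part of the fault generation device operates on this clock."""
    if fault_generation == 0:
        return "idle"            # neither circuit operates
    if bus_lock_active:
        return "release_lock"    # bus lock release circuit drives a proxy response
    return "inject_fault"        # fault generation circuit drives the fault signal


def run_clocks(fault_generation: int, lock_samples):
    """Apply the decision to a sequence of sampled bus lock values."""
    return [fault_device_action(fault_generation, bool(s)) for s in lock_samples]
```

For a bus that stays locked for two clocks and is then released, the model reports two lock release attempts followed by the fault injection.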
According to the present embodiment, the execution of the computer 100 can be stopped more reliably than when the lock signal is monitored and the failure signal is injected by software, as in the second embodiment. Further, the software control used in the second embodiment becomes unnecessary.
In the second and third embodiments, the management device 120 sends a pseudo response signal to the I / O bus 107 to release the bus lock. Depending on the configuration of the I / O bus 107, a response may have to specify its destination. In this case, the management device 120 may record the bus identifier of the device that issued the bus transaction requiring the bus lock.
(4) Fourth embodiment
Next, a fourth embodiment of the present invention will be described. In the embodiments described so far, methods of stopping the execution of the computer 100 through the I / O bus 107 connection alone have been described. However, the management apparatus 120 may also include a conventional dedicated signal line. For example, when the execution of the computer 100 is to be stopped, an attempt is first made to stop the OS 201 by the means of the present invention, and if that fails, the computer 100 is reset by the conventional means. The configurations of the computer 100 and the management device 120 that realize this will be described.
FIG. 13 is a diagram illustrating configurations of the computer 100 and the management apparatus 120 according to the fourth embodiment. The computer 100 includes a reset circuit 1302 that resets the CPU 101. The reset circuit 1302 is connected to the management apparatus 120 through a reset control line 1303. When the reset control line 1303 becomes active, the reset circuit 1302 operates to reset the CPU 101. This resets the entire computer.
The management device 120 has a reset control register 1301. The reset control register 1301 is configured to be set by the CPU 121. The reset control line is configured to be active when the reset control register 1301 is set to 1.
Next, a processing flow of the management program 211 for stopping the computer 100 will be described. FIG. 14 shows the flowchart. First, the fault generation device 130 is driven to send a fault signal to the I / O bus 107 (step 1401). After waiting for a predetermined time (step 1402), it is checked whether the OS 201 has executed the failure processing (step 1403). If the processing has not been executed, the reset control register 1301 is set to 1 in step 1404 to reset the computer 100.
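The two-stage stop of FIG. 14 can be sketched as follows. How the management program determines that the OS executed its failure processing is not specified here, so `failure_processing_done` is a hypothetical callback; the five-second default wait and the injectable `sleep` parameter are also assumptions made for the example.

```python
import time


def stop_computer(inject_fault, failure_processing_done, assert_reset,
                  wait_seconds: float = 5.0, sleep=time.sleep) -> str:
    """Try the bus fault first; fall back to a reset if the OS did not react."""
    inject_fault()                     # step 1401
    sleep(wait_seconds)                # step 1402: predetermined wait
    if failure_processing_done():      # step 1403
        return "os_failure_processing"
    assert_reset()                     # step 1404: set the reset control register
    return "reset"
```

Passing a no-op `sleep` makes the function easy to exercise without real delays.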
(5) Fifth embodiment
In the embodiments described so far, a remote computer or an operator triggers the sending of a fault to the I / O bus 107. However, the management device 120 or the management program 211 may itself determine whether or not to send a fault. In the fifth embodiment of the present invention, a method in which the management agent program 203 and the management program 211 cooperate to perform fault transmission will be described. The management device 120 has an agent activation register that indicates that the management agent 203 is executing. The agent activation register is configured to be accessible from both the CPU 101 of the computer 100 and the CPU 121 of the management apparatus 120 (not shown).
The management agent 203 is configured to execute at regular time intervals and set an agent activation register at the time of execution (the flowchart is omitted). On the management device 120 side, by referring to the agent activation register, it is determined whether the computer 100 is executing normally.
FIG. 15 is a flowchart showing the processing of the management program 211 executed by the management device 120. The process shown in FIG. 15 is configured to be executed at regular time intervals.
The management program 211 holds a variable (the non-activation count) that records the number of consecutive checks in which the agent activation register was found not to be set.
Processing of the management program 211 will be described. First, the agent activation register of the management apparatus 120 is inspected (step 1501). If this register is set, it is cleared (step 1504), the non-activation count is set to 0 (step 1505), and the process ends.
If the register is not set, the non-activation count is checked (step 1502). If the count equals a predetermined positive integer X, a failure signal is sent to the I / O bus 107 (step 1503). If it is not X, 1 is added to the count (step 1506), and the process ends.
As described above, the management program 211 can check the execution state of the computer 100 and send a fault to the I / O bus 107 spontaneously. When sending a fault, a message indicating that the computer 100 has been forcibly stopped may be transmitted to the remote computers 151 and 170.
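The periodic check of FIG. 15 can be sketched as a small class. The class and attribute names are hypothetical; `threshold_x` corresponds to the predetermined positive integer X, and `missed` to the count of consecutive checks in which the agent activation register was not set.

```python
class AgentHeartbeatMonitor:
    """Hypothetical model of the periodic check in the management program 211."""

    def __init__(self, threshold_x: int, send_fault) -> None:
        self.agent_active = 0        # models the agent activation register
        self.missed = 0              # consecutive checks with the register not set
        self.threshold_x = threshold_x
        self.send_fault = send_fault

    def agent_tick(self) -> None:
        """Called by the management agent 203 each time it runs."""
        self.agent_active = 1

    def check(self) -> bool:
        """One periodic check; returns True if a fault was sent."""
        if self.agent_active:                  # step 1501
            self.agent_active = 0              # step 1504
            self.missed = 0                    # step 1505
            return False
        if self.missed == self.threshold_x:    # step 1502
            self.send_fault()                  # step 1503
            return True
        self.missed += 1                       # step 1506
        return False
```

Following the steps literally, the fault is sent on the check after the count has reached X, and the count is not modified by step 1503.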
In the fifth embodiment, the fault is sent to the I / O bus 107 by software. However, the management apparatus 120 may instead be provided with a watchdog timer that drives the fault generator 130 unless the timer is reset within a certain period of time.
In this case, the management agent 203 is configured to execute at regular time intervals and reset the watchdog timer at the time of execution. On the management program 211 side, no special processing is required.
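A watchdog of this kind can be modeled as a countdown that is reloaded by the agent and fires once on expiry. The tick-based timing, the class name, and the `on_expire` callback (standing in for driving the fault generator 130) are assumptions made for the sketch.

```python
class WatchdogTimerModel:
    """Hypothetical countdown watchdog that fires once when it expires."""

    def __init__(self, timeout_ticks: int, on_expire) -> None:
        self.timeout_ticks = timeout_ticks
        self.remaining = timeout_ticks
        self.on_expire = on_expire   # e.g. drive the fault generator

    def reset(self) -> None:
        """Called by the management agent 203 each time it runs."""
        self.remaining = self.timeout_ticks

    def tick(self) -> None:
        """Advance time by one tick; fire when the countdown reaches zero."""
        if self.remaining > 0:
            self.remaining -= 1
            if self.remaining == 0:
                self.on_expire()
```

As long as `reset()` is called at least once every `timeout_ticks` ticks, `on_expire` is never invoked.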
The management program 211 may check the execution status of the OS 201 by referring to the contents of the main memory 102 of the computer 100 and send a failure signal to the I / O bus 107 accordingly.
Industrial applicability
As described above, the failure processing method and apparatus for a computer according to the present invention send a failure signal from the management device to the management target computer via the I / O bus, and are suitable for building a computer system in which the management target computer, on receiving the signal, initializes the bus and generates an interrupt.
[Brief description of the drawings]
FIG. 1 is a system configuration diagram of an embodiment of the present invention.
FIG. 2 is a block diagram of a program according to the embodiment of the present invention.
FIG. 3 is a block diagram of the device control apparatus.
FIG. 4 is a block diagram of the I / O bus control device.
FIG. 5 is a block diagram of the failure processing part in the CPU.
FIG. 6 is a block diagram of the bus initialization part in the CPU.
FIG. 7 is a flowchart of the processing of the OS bus error interrupt handler.
FIG. 8 is a flowchart of the processing of the management program executed by the management device.
FIG. 9 is a diagram showing the timing of signals on the I / O bus.
FIG. 10 is a block diagram of the bus lock releasing device in the management device in the second embodiment of the present invention.
FIG. 11 is a flowchart of the processing of the management program executed by the management device in the second embodiment of the present invention.
FIG. 12 is a configuration diagram of the failure generation device in the management device according to the third embodiment of the present invention.
FIG. 13 is a block diagram of a computer and a management device in the fourth embodiment of the present invention.
FIG. 14 is a flowchart of a computer stop process executed by the management apparatus in the fourth embodiment of the present invention.
FIG. 15 is a flowchart of computer stop processing executed by the management apparatus in the fifth embodiment of the present invention.

Claims

A failure processing method in a computer system in which a computer and a management device are connected by an I / O bus controlled by an I / O bus control device, wherein, at the time a failure occurs in the computer, an I / O bus signal that causes an I / O bus failure is sent from the management device to the I / O bus in the computer, and the I / O bus control device, in response to the I / O bus failure generated based on the I / O bus signal, initializes the I / O bus and then notifies the CPU of the computer of the I / O bus failure as an interrupt processed by an OS operating on the CPU.

A failure processing method in a computer system in which a computer and a management device are connected by an I / O bus controlled by an I / O bus control device, wherein, when illegal data is sent from the computer to the management device, an I / O bus signal that causes an I / O bus failure is sent from the management device to the I / O bus in the computer, and the I / O bus control device, in response to the I / O bus failure generated based on the I / O bus signal, initializes the I / O bus and then notifies the CPU of the computer of the I / O bus failure as an interrupt processed by an OS operating on the CPU.

A failure processing method in a computer system in which a computer and a management device are connected by an I / O bus controlled by an I / O bus control device, wherein, when the computer does not update the contents of a predetermined storage device within a predetermined time, an I / O bus signal that causes an I / O bus failure is sent from the management device to the I / O bus in the computer, and the I / O bus control device, in response to the I / O bus failure generated based on the I / O bus signal, initializes the I / O bus and then notifies the CPU of the computer of the I / O bus failure as an interrupt processed by an OS operating on the CPU.

A computer system comprising a computer and a management device, wherein the computer and the management device are connected to each other via an I / O bus controlled by an I / O bus control device, the management device sends, at the time a failure occurs in the computer, an I / O bus signal that causes an I / O bus failure to the I / O bus in the computer, and the I / O bus control device, in response to detecting the I / O bus failure based on the I / O bus signal, initializes the I / O bus and then notifies the CPU of the computer of the I / O bus failure as an interrupt processed by an OS operating on the CPU.

A computer system comprising a computer and a management device, wherein the computer and the management device are interconnected by an I / O bus controlled by an I / O bus control device, the management device sends, when illegal data is sent from the computer to the management device, an I / O bus signal that causes an I / O bus failure to the I / O bus in the computer, and the I / O bus control device, in response to detecting the I / O bus failure based on the I / O bus signal, initializes the I / O bus and then notifies the CPU of the computer of the I / O bus failure as an interrupt processed by an OS operating on the CPU.

A computer system comprising a computer and a management device, wherein the computer and the management device are interconnected by an I / O bus controlled by an I / O bus control device, the management device sends, when the computer does not update the contents of a predetermined storage device within a predetermined time, an I / O bus signal that causes an I / O bus failure to the I / O bus in the computer, and the I / O bus control device, in response to detecting the I / O bus failure based on the I / O bus signal, initializes the I / O bus and then notifies the CPU of the computer of the I / O bus failure as an interrupt processed by an OS operating on the CPU.