JP2003032253A

JP2003032253A - In-advance countermeasure on-line diagnosis in manageable network

Info

Publication number: JP2003032253A
Application number: JP2001198027A
Authority: JP
Inventors: Igor Chirashnya; イゴール・シラシュヤ; Lee Shalev; リー・シャレフ; Kirill Shoikhet; キリル・ショイケット
Original assignee: International Business Machines Corp
Current assignee: International Business Machines Corp
Priority date: 2001-06-29
Filing date: 2001-06-29
Publication date: 2003-01-31
Anticipated expiration: 2021-06-29
Also published as: JP3579834B2

Abstract

PROBLEM TO BE SOLVED: To provide proactive on-line diagnosis in a manageable network. SOLUTION: A method for diagnosing a system made up of a plurality of interlinked modules includes receiving an alarm from the system indicating a fault in one of the modules. In response to the alarm, a causal network is constructed that associates the fault with malfunctions in one or more of the modules that may have led to the fault, and that relates the conditional probability of the fault to the respective probabilities of the malfunctions. Based on the alarm and the causal network, at least one of the probabilities of the malfunctions is updated. A diagnosis of the alarm is proposed in response to the updated probabilities.

Description

Detailed Description of the Invention

[0001] [Cross-Reference to Related Applications] This application claims the benefit of U.S. Provisional Patent Application No. 60/214,971, which is incorporated herein by reference.

[0002]

[Field of the Invention] The present invention relates generally to methods and systems for fault diagnosis in communication networks, and specifically to methods for identifying faulty components in such networks while normal communication activity is in progress.

[Prior art]

[0003] The complexity of computer networks continues to grow, and the reliability, availability, and service demanded of these networks continue to rise as well. These factors increase the burden on the diagnostic systems used to identify and isolate network faults in computer networks. To prevent failures that can seriously interfere with network activity, it is important to detect the intermittent and sporadic problems that are precursors of incipient failures, and to pinpoint the devices causing them. To maintain high network availability, these problems must be identified while the network is online and running in its normal activity mode. Service personnel can then be instructed to replace the defective element before it fails completely.

[0004] Modern networks typically provide large amounts of diagnostic information, such as topology files, system-wide error logs, and component-specific trace files. Analyzing this information to identify network faults is beyond the capabilities of all but the most skilled network administrators. Most automated approaches to network diagnosis attempt to overcome this problem by framing expert knowledge in the form of if-then rules and applying the rules automatically to the diagnostic information. The rules are usually heuristic and must be crafted specifically for the system to which they apply. As a result, the rules themselves are difficult to devise and cannot be applied generally to every error condition that may occur. Such rules are not globally applicable and must generally be updated whenever the system configuration changes.

[0005] Model-based diagnostic approaches begin with a functional model of the system in question and, in the event of a malfunction, analyze it to identify the faulty components. A functional model (also called a forward model or causal model) is often readily available as part of the system specification or of a reliability analysis model. Developing such a model is usually a straightforward part of system design or system analysis. Creating the model therefore does not require the designer to be an expert in system fault diagnosis. Instead, automated algorithms are applied to the functional model to reach diagnostic conclusions. As long as the system model is updated to reflect configuration changes, these algorithms automatically adapt the diagnosis to the changes made.

[0006] Switched computing networks and switched communication networks, such as System Area Networks (SANs), pose particular challenges for diagnostic applications because of their complexity and inherent uncertainty. Complexity means dealing with the large number of components used, the existence of multiple dynamic paths between devices in the network, and the large volume of information the network carries. Uncertainty arises, among other things, from the fact that alarm messages are themselves carried through the network as packets. As a result, alarm transmission may be subject to unknown delays, alarms may arrive out of order, and some alarm packets may be lost.

[0007] One paradigm known in the art for model-based diagnosis in the presence of uncertainty is the Bayesian network. Cowell et al. give a general account of Bayesian network theory in Probabilistic Networks and Expert Systems (Springer-Verlag, New York, 1999), which is incorporated herein by reference. A Bayesian network is a directed acyclic graph with nodes corresponding to domain variables and a conditional probability table attached to each node. When the directions of the edges in the graph correspond to cause-effect relationships between the nodes, the Bayesian network is also called a causal network. The absence of an edge between a pair of nodes represents an assumption that the nodes are conditionally independent. The product of the probability tables gives the joint probability distribution of the variables. The probabilities are updated as new evidence is gathered regarding the co-occurrence of faults and malfunctions in the system under test. When the diagnostic system receives a new alarm or set of alarms, it uses the Bayesian network to determine automatically the most probable malfunction behind the alarm.
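As a minimal illustration of the update described in paragraph [0007], the following Python sketch builds a tiny causal network by hand (two malfunction nodes and one alarm node) and computes the posterior probability of each malfunction by enumerating the product of the probability tables. The node names, prior values, and conditional probability table are hypothetical and chosen only for illustration; they do not come from the patent.

```python
from itertools import product

# Hypothetical prior probabilities for two independent malfunctions (root nodes).
PRIOR = {"cable_fault": 0.01, "switch_fault": 0.002}

# Hypothetical conditional probability table:
# P(alarm | cable_fault, switch_fault).
P_ALARM = {
    (False, False): 0.001,
    (True, False): 0.7,
    (False, True): 0.9,
    (True, True): 0.95,
}


def joint(cable, switch, alarm):
    """Joint probability as the product of the per-node probability tables."""
    p = PRIOR["cable_fault"] if cable else 1 - PRIOR["cable_fault"]
    p *= PRIOR["switch_fault"] if switch else 1 - PRIOR["switch_fault"]
    p_alarm = P_ALARM[(cable, switch)]
    return p * (p_alarm if alarm else 1 - p_alarm)


def posterior_given_alarm():
    """P(malfunction | alarm observed), by enumeration over all states."""
    states = [False, True]
    evidence = sum(joint(c, s, True) for c, s in product(states, repeat=2))
    p_cable = sum(joint(True, s, True) for s in states) / evidence
    p_switch = sum(joint(c, True, True) for c in states) / evidence
    return {"cable_fault": p_cable, "switch_fault": p_switch}


if __name__ == "__main__":
    print(posterior_given_alarm())
```

Enumeration is practical only for very small networks; it is used here because it makes the "product of the probability tables" explicit.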

[0008] U.S. Patent No. 6,076,083, whose disclosure is incorporated herein by reference, describes an exemplary application of Bayesian networks to the diagnosis of a communication network. The communication network is represented as a Bayesian network, with the devices and communication links in the communication network represented as nodes of the Bayesian network. Faults in the communication network are identified and recorded in the form of trouble tickets, and one or more probable causes of each fault are given based on Bayesian network calculations. When a fault is corrected, the Bayesian network is updated with the knowledge learned in correcting it. The updated trouble ticket information is used to update the appropriate probability matrices of the Bayesian network automatically. The Bayesian network of U.S. Patent No. 6,076,083 is static and makes no provision for changes in the configuration of the communication network. Moreover, because this Bayesian network models the entire communication network, it quickly becomes unmanageable when it must deal with a large, complex switched network.

[0009] Another approach to applying Bayesian networks to fault diagnosis in computer systems is described by Pizza et al. in "Optimal Discrimination between Transient and Permanent Faults," Proceedings of the Third IEEE High Assurance System Engineering Symposium, 1998, which is incorporated herein by reference. The authors propose applying the principles of reliability theory to discriminate between transient and permanent faults in the components of a computer system. Reliability theory predicts the probability of failure of a given device in terms of a failure rate or a failure distribution over time (for example, in terms of mean time between failures (MTBF)). Standard reliability theory techniques are based on sampling device behavior under known conditions. In the scheme proposed by Pizza et al., by contrast, the probabilities of permanent and transient faults in a system component are estimated and updated by inference using a Bayesian network. This scheme has only limited practical applicability, however, because, in order to reach accurate and optimal decisions about failure probabilities, it examines each module in the computer system separately, with no error propagation from one module to another. That is not an assumption that can reasonably be made in real-world switched networks.

[0010]

[Problems to be Solved by the Invention]

[Means for Solving the Problems] In preferred embodiments of the present invention, Bayesian networks are combined with reliability theory to provide diagnostic methods and systems that can handle large, complex switched networks in a realistic and efficient manner. The diagnostic system maintains local fault models for the devices in the network, together with up-to-date topology information about the network as a whole. The local fault models include estimated malfunction rates of the modules in the network, expressed in reliability theory terms. When an alarm (or a sequence of alarms) is received from the network, the diagnostic system uses the local fault models, the estimated malfunction rates, and the topology information to construct a Bayesian network that represents the possible causes of the alarm and their probabilities. The malfunction rate estimates are then updated based on the observed alarms and their arrival times. When the estimated malfunction rate of a given module exceeds a certain threshold, the diagnostic system declares the module fault-suspect and issues to the user of the system a recommendation to test or replace the suspect module.

[0011] Thus, unlike model-based diagnostic methods known in the art, preferred embodiments of the present invention use a dynamic Bayesian network model that is built specifically in response to each alarm or group of alarms received. As a result, the model reflects the actual, current network state fully and accurately, without incurring the extremely high computational cost and memory requirements of maintaining a complete model of the entire network. A given model generated by the diagnostic system takes into account interactions and error propagation between connected modules, unlike the approach of Pizza et al. described above, in which device models are considered only in isolation. In embodiments of the present invention, regular patterns in the network topology, such as cascaded switches, are preferably identified and exploited to limit the size of the Bayesian network that must be used to model error propagation between modules correctly.

[0012] In some preferred embodiments of the present invention, the diagnostic system assesses second-order failure probabilities for the modules in the network; that is, it considers both the estimated mean failure rate and a moment (the standard deviation) of the probability distribution. The mean and the moment of the probability distribution for a given module are updated each time a Bayesian network is constructed and evaluated with respect to that module. The use of second-order probabilities is characteristic of Bayesian reliability theory (which is distinct from Bayesian networks). Bayesian reliability theory treats failure rate assessment as a process of initial assessment and correction, unlike the simpler first-order, sampling-based methods used in diagnostic systems known in the art. The second-order approach is better suited to fault diagnosis modeling.
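One common way to realize a second-order failure rate assessment of the kind described in paragraph [0012] is a conjugate Gamma-Poisson model, in which the belief about a module's failure rate is a gamma distribution whose mean and standard deviation are both revised as evidence arrives. The sketch below assumes that model; the patent text quoted here does not name a specific distribution, so the gamma prior, the class name `FailureRateBelief`, and the numeric values are all illustrative assumptions.

```python
import math
from dataclasses import dataclass


@dataclass(frozen=True)
class FailureRateBelief:
    """Gamma(shape, rate) belief over a module's failure rate (assumed model).

    `rate` is measured in hours of exposure, so the mean is in failures/hour.
    """
    shape: float
    rate: float

    @property
    def mean(self):
        return self.shape / self.rate

    @property
    def std(self):
        return math.sqrt(self.shape) / self.rate

    def update(self, failures, exposure_hours):
        """Conjugate update after observing `failures` over `exposure_hours`.

        `failures` may be fractional, e.g. a posterior probability that the
        module was responsible for an alarm.
        """
        return FailureRateBelief(self.shape + failures,
                                 self.rate + exposure_hours)


if __name__ == "__main__":
    prior = FailureRateBelief(shape=2.0, rate=20000.0)
    posterior = prior.update(failures=3, exposure_hours=1000.0)
    print(prior.mean, prior.std)
    print(posterior.mean, posterior.std)
```

Tracking the standard deviation alongside the mean lets a diagnostic system distinguish a module whose high estimated rate rests on little evidence from one whose rate is well established.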

[0013] Although preferred embodiments are described herein with reference to fault diagnosis in switched computer networks, those skilled in the art will appreciate that the principles of the present invention are equally applicable to locating faults not only in other types of communication networks but also in other systems, including other kinds of electrical and mechanical systems as well as medical and financial systems.

[0014] There is therefore provided, in accordance with a preferred embodiment of the present invention, a method for diagnosis of a system made up of a plurality of interlinked modules, the method including: receiving from the system an alarm indicating a fault in one of the modules; in response to the alarm, constructing a causal network that associates the fault with malfunctions in one or more of the modules that may have led to the fault, and that relates a conditional probability of the fault to respective probabilities of the malfunctions; updating at least one of the probabilities of the malfunctions based on the alarm and the causal network; and proposing a diagnosis of the alarm in response to the updated probabilities.

[0015] Preferably, receiving the alarm includes gathering event reports from the plurality of modules in the system and extracting the alarm from the event reports; gathering the event reports includes receiving a report of a change in the configuration of the system; and constructing the causal network includes constructing the causal network based on the changed configuration. Most preferably, constructing the causal network based on the changed configuration includes maintaining a database in which the configuration is recorded, and updating the database in response to the report of the change in the configuration, for use in constructing the causal network.
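The handling of event reports in paragraph [0015] can be sketched as a single pass that applies configuration-change reports to a configuration database and passes alarms on for diagnosis. The report layout (dictionaries with `type`, `module`, and `links` keys) and the use of a plain dictionary as the database are hypothetical simplifications, not details given in the patent.

```python
def process_reports(reports, config_db):
    """Apply configuration changes to config_db and return the alarms.

    `reports` is an iterable of dicts in a hypothetical layout:
      {"type": "config_change", "module": str, "links": list[str]}
      {"type": "alarm", "module": str, ...}
    `config_db` maps each module to the modules it is linked to.
    """
    alarms = []
    for report in reports:
        if report["type"] == "config_change":
            # Keep the recorded topology current before any network is built.
            config_db[report["module"]] = list(report["links"])
        elif report["type"] == "alarm":
            alarms.append(report)
    return alarms


if __name__ == "__main__":
    db = {"switchA": ["adapter1"]}
    incoming = [
        {"type": "config_change", "module": "switchA",
         "links": ["adapter1", "switchB"]},
        {"type": "alarm", "module": "adapter1", "kind": "timeout"},
    ]
    print(process_reports(incoming, db), db)
```

Because the database is updated as reports arrive, a causal network built afterwards sees the changed configuration rather than a stale one.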

[0016] Alternatively or additionally, extracting the alarm includes extracting a sequence of alarms occurring at mutually proximate times, including the alarm indicating the fault in the one of the modules, and updating the at least one of the probabilities includes processing the sequence of alarms so as to update the probabilities. Preferably, extracting the sequence of alarms includes defining a respective lifetime for each of the alarms, responsive to an expected delay in receiving the alarm from the system, and selecting the alarms to extract from the sequence responsive to the respective lifetimes. Most preferably, selecting the alarms to extract includes selecting the alarms that occurred within their respective lifetimes of the time of occurrence of the alarm indicating the fault in the one of the modules, in response to which the causal network is constructed.
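The lifetime-based selection in paragraph [0016] can be sketched as a filter around the triggering alarm: an alarm is kept when its time of occurrence lies within its own lifetime of the trigger's time of occurrence. The `Alarm` fields and the example values are hypothetical; in particular, how lifetimes are derived from expected delivery delays is left outside the sketch.

```python
from dataclasses import dataclass


@dataclass(frozen=True)
class Alarm:
    module: str
    kind: str
    time: float      # time of occurrence, in seconds
    lifetime: float  # seconds; assumed to reflect the expected delivery delay


def select_sequence(trigger, alarms):
    """Return the alarms that fall within their own lifetime of the trigger.

    The result is ordered by time of occurrence, which may differ from the
    order of arrival when alarms are delayed in the network.
    """
    selected = [a for a in alarms if abs(a.time - trigger.time) <= a.lifetime]
    return sorted(selected, key=lambda a: a.time)


if __name__ == "__main__":
    trigger = Alarm("switchA", "link_down", 100.0, 5.0)
    received = [
        trigger,
        Alarm("adapter1", "timeout", 97.0, 5.0),
        Alarm("adapter2", "crc_error", 90.0, 5.0),
    ]
    print([a.module for a in select_sequence(trigger, received)])
```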

[0017] Further additionally or alternatively, constructing the causal network includes defining an expected alarm that would be caused by one of the malfunctions in the one or more of the modules, and processing the sequence of alarms includes updating the probabilities responsive to an occurrence of the expected alarm in the extracted sequence of alarms.

[0018] In a preferred embodiment, the plurality of interlinked modules includes multiple instances of a given one of the modules, interlinked in a regular pattern, and constructing the causal network includes defining a template that includes a group of nodes in the network corresponding to the given one of the modules, and instantiating the template with respect to the one or more of the modules in response to the alarm. Preferably, defining the template includes identifying an expected alarm caused by one of the malfunctions in one of the instances of the given one of the modules, and instantiating the template includes adding an instance of the template to the network responsive to an occurrence of the expected alarm.
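The template mechanism of paragraph [0018] can be sketched as a reusable group of node names and cause-effect edges that is copied once per module instance, with each name qualified by the instance identifier. The `Template` structure, the node names, and the dotted naming scheme are hypothetical choices made for this illustration.

```python
from dataclasses import dataclass


@dataclass(frozen=True)
class Template:
    module_type: str
    nodes: tuple  # node names local to one module of this type
    edges: tuple  # (cause, effect) pairs between local nodes


def instantiate(template, instance_id):
    """Copy the template's nodes and edges for one module instance."""
    def qualify(name):
        return f"{instance_id}.{name}"

    nodes = [qualify(n) for n in template.nodes]
    edges = [(qualify(c), qualify(e)) for c, e in template.edges]
    return nodes, edges


# Hypothetical template for a switch: a malfunction causes a port fault
# condition, which in turn causes an expected alarm.
SWITCH = Template(
    module_type="switch",
    nodes=("malfunction", "port_fault", "alarm"),
    edges=(("malfunction", "port_fault"), ("port_fault", "alarm")),
)

if __name__ == "__main__":
    print(instantiate(SWITCH, "sw3"))
```

Instantiating only the modules implicated by an observed alarm keeps the resulting network small even when the topology contains many identical switches.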

[0019] Preferably, constructing the causal network includes identifying a local fault condition in the one of the modules in which the fault occurred and, responsive to the local fault condition, linking the fault in the causal network to one of the malfunctions occurring in the one of the modules. Additionally or alternatively, constructing the causal network includes identifying a first fault condition that occurs in a first one of the modules due to a connection with a second one of the modules in the system and, responsive to the first fault condition, linking the fault in the causal network to a second fault condition occurring in the second one of the modules. Preferably, linking the fault includes determining whether a possible cause of the second fault condition is attributable to a further connection between the second one of the modules and a third one of the modules in the system and, responsive to the further connection, linking the fault in the causal network to a third fault condition occurring in the third one of the modules.
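The chained linking in paragraph [0019] (a fault in a first module traced to a fault condition in a connected second module, and from there to a third) can be sketched as a breadth-first walk over the recorded topology. This is a simplified illustration under stated assumptions: the topology is a dictionary of neighbor lists, each module has a single `malfunction` node and a single `fault` node, and each module is linked only once so that the result stays acyclic. None of these details are taken from the patent.

```python
from collections import deque


def build_causal_edges(topology, start, max_hops=2):
    """Collect (cause, effect) edges that could explain a fault at `start`.

    topology maps a module to the modules it is connected to. Each visited
    module contributes a local edge (its malfunction causes its fault) and,
    within `max_hops`, an edge from each newly reached neighbor's fault.
    """
    edges = []
    visited = {start}
    frontier = deque([(start, 0)])
    while frontier:
        module, hops = frontier.popleft()
        # Local fault condition: the module's own malfunction.
        edges.append((f"{module}.malfunction", f"{module}.fault"))
        if hops == max_hops:
            continue
        for neighbor in topology.get(module, []):
            if neighbor in visited:
                continue  # link each module once to keep the graph acyclic
            visited.add(neighbor)
            # Fault condition propagated over the connection.
            edges.append((f"{neighbor}.fault", f"{module}.fault"))
            frontier.append((neighbor, hops + 1))
    return edges


if __name__ == "__main__":
    topo = {
        "adapter1": ["switchA"],
        "switchA": ["adapter1", "switchB"],
        "switchB": ["switchA"],
    }
    for edge in build_causal_edges(topo, "adapter1"):
        print(edge)
```

The `max_hops` limit is one simple way to bound the size of the network; the patent itself describes exploiting regular topology patterns for that purpose.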

[0020] In a preferred embodiment, constructing the causal network includes adding multiple occurrences of one of the malfunctions to the causal network, responsive to the respective probabilities of the malfunctions, and linking the fault to the multiple occurrences in the causal network. Preferably, linking the fault to the multiple occurrences includes determining one or more fault conditions caused by each of the occurrences, and linking at least some of the fault conditions to the fault.

[0021] In another preferred embodiment, updating the at least one of the probabilities of the malfunctions includes assessing a mean time between failures of the one or more of the modules.

[0022] Preferably, the probabilities of the malfunctions are defined in terms of a probability distribution having a mean and a moment, and updating the at least one of the probabilities includes reassessing the mean and the moment of the probability distribution. Most preferably, the probability distribution includes a failure rate distribution, and reassessing the mean and the moment includes updating the failure rate distribution using a Bayesian reliability theory model.

[0023] Preferably, proposing the diagnosis includes comparing one or more of the updated probabilities to a predetermined threshold, and invoking a diagnostic action when one of the probabilities exceeds the threshold. Typically, invoking the diagnostic action includes notifying a user of the system of the diagnosis, and notifying the user includes providing an explanation of the diagnosis based on the causal network. Additionally or alternatively, invoking the diagnostic action includes performing a diagnostic test to verify the malfunction, the diagnostic test being selected responsive to the one of the probabilities that exceeds the threshold. Preferably, the causal network is modified responsive to a result of the diagnostic test.
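The threshold comparison in paragraph [0023] amounts to a simple decision rule over the updated probabilities. The following sketch returns the modules whose updated malfunction probability exceeds a threshold, most probable first, so that a caller could notify the user or choose a diagnostic test. The function name, module names, and numeric values are hypothetical.

```python
def propose_diagnosis(updated, threshold):
    """Return (module, probability) pairs above the threshold, highest first.

    `updated` maps each module to its updated malfunction probability.
    An empty result means that no diagnostic action is invoked.
    """
    suspects = [(m, p) for m, p in updated.items() if p > threshold]
    return sorted(suspects, key=lambda item: item[1], reverse=True)


if __name__ == "__main__":
    probabilities = {"switchA": 0.72, "cable7": 0.18, "adapter1": 0.03}
    for module, p in propose_diagnosis(probabilities, threshold=0.1):
        print(f"suspect {module}: P(malfunction) = {p:.2f}")
```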

[0024] There is also provided, in accordance with a preferred embodiment of the present invention, a method for diagnosis of a system made up of a plurality of interlinked modules, the method including: constructing a causal network that associates a fault in one of the modules with malfunctions in two or more of the modules that may have led to the fault, and that relates a conditional probability of the fault to respective probability distributions of the malfunctions; updating the probability distributions of the malfunctions in response to an alarm from the system indicating the fault; and proposing a diagnosis of the alarm in response to the updated probability distributions.

[0025] There is additionally provided, in accordance with a preferred embodiment of the present invention, apparatus for diagnosis of a system made up of a plurality of interlinked modules, the apparatus including a diagnostic processor, which is coupled to receive from the system an alarm indicating a fault in one of the modules, and which is arranged, in response to the alarm, to construct a causal network that associates the fault with malfunctions in one or more of the modules that may have led to the fault and that relates a conditional probability of the fault to respective probabilities of the malfunctions, to update at least one of the probabilities of the malfunctions based on the alarm and the causal network, and to propose a diagnosis of the alarm in response to the updated probabilities.

[0026] Preferably, the apparatus includes a memory containing a database in which the configuration is recorded, and the diagnostic processor is coupled to update the database responsive to the report of the change in the configuration, for use in constructing the causal network.

[0027] Further preferably, the apparatus includes a user interface, and the diagnostic processor is coupled to notify the user of the system of the diagnosis via the user interface.

[0028] There is further provided, in accordance with a preferred embodiment of the present invention, apparatus for diagnosis of a system made up of a plurality of interlinked modules, the apparatus including a diagnostic processor, which is arranged to construct a causal network that associates a fault in one of the modules with malfunctions in two or more of the modules that may have led to the fault and that relates a conditional probability of the fault to respective probability distributions of the malfunctions, to update the probability distributions of the malfunctions in response to an alarm from the system indicating the fault, and to propose a diagnosis of the alarm in response to the updated probability distributions.

[0029] There is moreover provided, in accordance with a preferred embodiment of the present invention, a computer software product for diagnosis of a system made up of a plurality of interlinked modules, the product including a computer-readable medium in which program instructions are stored, which instructions, when read by a computer, cause the computer to receive from the system an alarm indicating a fault in one of the modules; to construct, in response to the alarm, a causal network that associates the fault with malfunctions in one or more of the modules that may have led to the fault and that relates a conditional probability of the fault to respective probabilities of the malfunctions; to update at least one of the probabilities of the malfunctions based on the alarm and the causal network; and to propose a diagnosis of the alarm in response to the updated probabilities.

【００３０】さらに、本発明の好ましい実施形態によれ
ば、相互リンクされた複数のモジュールから構成された
システムの診断のための製品であって、前記製品が、プ
ログラム命令が保管されたコンピュータ可読媒体を含
み、前記プログラム命令が、コンピュータによって読み
取られた時に、前記コンピュータに、前記モジュールの
１つでの障害を前記障害につながった可能性がある２つ
以上の前記モジュールでの誤動作と関連付け、前記障害
の条件つき確率を前記誤動作のそれぞれの確率分布に関
係付ける因果ネットワークを構成することと、前記障害
を示す前記システムからのアラームに応答して、前記誤
動作の前記確率分布を更新して、前記更新された確率分
布に応答して前記アラームの診断を提案することとを行
わせる製品が提供される。Furthermore, in accordance with a preferred embodiment of the present invention, there is provided a product for diagnosis of a system composed of a plurality of inter-linked modules, the product comprising a computer-readable medium in which program instructions are stored, which instructions, when read by a computer, cause the computer to construct a causal network that associates a fault in one of the modules with malfunctions in two or more of the modules that may have led to the fault and relates a conditional probability of the fault to respective probability distributions of the malfunctions; and, in response to an alarm from the system indicating the fault, to update the probability distributions of the malfunctions and to propose a diagnosis of the alarm in response to the updated probability distributions.

【００３１】[0031]

【発明の実施の形態】図１は、本発明の好ましい実施形
態による、管理可能な通信ネットワークであるネットワ
ーク２２と、ネットワークを監視するのに使用される診
断ユニット２０を概略的に示すブロック図である。ネッ
トワーク２２には、通常は、当技術分野で既知のよう
に、system/storage area network（ＳＡＮ）が含まれ
る。そのようなネットワークでは、ノード２４に、サー
バまたは他のコンピュータ・プロセッサ、入出力装置、
記憶装置、またはゲートウェイを含めることができ、こ
れらが、スイッチ２８によって相互接続される。そのよ
うなネットワークの例が、米国ニューヨーク州アーモン
クのIBM Corporation社が製造するRS/6000 SPシステム
である。ネットワーク２２は、診断ユニット２０によっ
て使用される次の２つの鍵となる特徴を提供するという
意味で、「管理可能な」といわれる。第１に、このネッ
トワークは、パケット破壊または装置無応答などのエラ
ーおよび障害と、異常な機能性を反映する可能性がある
統計に関して監視される。第２に、このネットワーク
は、特にアラームを生成する時を決定するのに使用され
るエラー閾値などの装置パラメータをシステム・オペレ
ータまたは自動コントローラがセットする能力に関し
て、構成可能である。FIG. 1 is a block diagram that schematically illustrates a network 22, which is a manageable communication network, and a diagnostic unit 20 used to monitor the network, in accordance with a preferred embodiment of the present invention. Network 22 typically comprises a system/storage area network (SAN), as is known in the art. In such a network, nodes 24 may comprise servers or other computer processors, input/output devices, storage devices, or gateways, which are interconnected by switches 28. An example of such a network is the RS/6000 SP system manufactured by IBM Corporation of Armonk, New York. Network 22 is said to be "manageable" in the sense that it provides the following two key features used by diagnostic unit 20. First, the network is monitored for errors and faults, such as packet corruption or device non-response, and for statistics that may reflect abnormal functionality. Second, the network is configurable, particularly with respect to the ability of a system operator or automatic controller to set device parameters, such as the error thresholds used to determine when an alarm is generated.

【００３２】ネットワーク２２の管理機能は、ノードの
うちで、１次ノード２６として働くように選択されたノ
ードを介して調整されることが好ましい。ノード２４に
は、イベント・コレクタ３０が含まれ、このイベント・
コレクタ３０は、すべてのノードで稼動するネットワー
ク管理ソフトウェアの一部として稼動するソフトウェア
・エージェントとして実施されることが好ましい。これ
らのエージェントは、アラームおよび構成変更を含む、
それぞれのノードで発生するシステム・イベントを集め
る。イベント・コレクタ３０は、これらのイベントを、
管理パケットの形で、１次ノード２６上で稼動する１次
イベント・コレクタ３２に送信する。１次イベント・コ
レクタ３２は、下で説明するように、イベントのストリ
ームを処理のために診断ユニット２０に渡す。The management functions of network 22 are preferably coordinated through one of the nodes that is selected to act as primary node 26. Nodes 24 include event collectors 30, which are preferably implemented as software agents running as part of the network management software that runs on all of the nodes. These agents collect the system events that occur at their respective nodes, including alarms and configuration changes. Event collectors 30 send these events, in the form of management packets, to a primary event collector 32 running on primary node 26. Primary event collector 32 passes the stream of events to diagnostic unit 20 for processing, as described below.

【００３３】概念的な明瞭さのために、診断ユニット２
０は、１次ノード２６とは別の機能ブロックとして図示
されているが、本発明の好ましい実施形態では、診断ユ
ニット２０が、１次ノード上で稼動するソフトウェア・
コンポーネントとして実施される。その代わりに、診断
ユニット・ソフトウェアを、１次ノードとは物理的に分
離された別のプロセッサ上で稼動させることができ、ま
た、ノードのグループまたはすべてのノードで分散アプ
リケーションとして稼動させることができる。このソフ
トウェアは、たとえば１次ノードまたは他のプロセッサ
へ電子的な形でネットワーク２２を介してダウンロード
することができ、その代わりに、ＣＤ−ＲＯＭなどの有
形の媒体上で供給することができる。For conceptual clarity, diagnostic unit 20 is shown as a functional block separate from primary node 26, but in a preferred embodiment of the present invention, diagnostic unit 20 is implemented as a software component running on the primary node. Alternatively, the diagnostic unit software may run on a separate processor that is physically separate from the primary node, or it may run as a distributed application on a group of nodes or on all of the nodes. The software may be downloaded in electronic form over network 22, for example to the primary node or another processor, or it may alternatively be supplied on tangible media, such as CD-ROM.

【００３４】図２は、本発明の好ましい実施形態によ
る、診断ユニット２０の詳細を概略的に示すブロック図
である。診断ユニット２０が、上で注記したようにソフ
トウェアで実施されると仮定すると、図２に示されたブ
ロックは、通常は、別々のハードウェア要素ではなく、
診断ソフトウェア・パッケージ内の機能要素またはプロ
セスを表す。１次イベント・コレクタ３２によって収集
されたイベントのストリームが、診断ユニット２０内
で、イベント・フォーマッタおよびマージャ４０によっ
て受け取られる。このブロックは、イベントを順番に、
好ましくはイベント・コレクタ３０によってイベントの
発生の時刻を示すために適用されたタイム・スタンプに
基づく日時順で配置する。その代わりに、順序を、１次
ノード２６でのイベントの受取の時刻に基づくものとす
ることができる。イベント・フォーマッタおよびマージ
ャ４０は、適宜、イベント・コレクタ３０から受け取っ
たイベント・メッセージ情報を、診断ユニット２０内の
後続ブロックによって効率的に処理できる統一されたフ
ォーマットで再フォーマットする。イベント・フォーマ
ッタおよびマージャ４０は、イベントを、構成変更イベ
ントとアラーム（すなわちエラー報告）に分離し、処理
のために２つのマージされたストリームを供給する。FIG. 2 is a block diagram that schematically shows details of diagnostic unit 20, in accordance with a preferred embodiment of the present invention. Assuming that diagnostic unit 20 is implemented in software, as noted above, the blocks shown in FIG. 2 typically represent functional elements or processes within a diagnostic software package, rather than separate hardware elements. The stream of events collected by primary event collector 32 is received in diagnostic unit 20 by an event formatter and merger 40. This block arranges the events in order, preferably in chronological order based on time stamps applied by event collectors 30 to indicate the time at which each event occurred. Alternatively, the order may be based on the time of receipt of the events at primary node 26. Event formatter and merger 40 reformats the event message information received from event collectors 30, as appropriate, into a unified format that can be processed efficiently by the subsequent blocks in diagnostic unit 20. Event formatter and merger 40 separates the events into configuration change events and alarms (i.e., error reports) and supplies two merged streams for processing.
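
The ordering and separation performed by event formatter and merger 40 can be illustrated with a minimal sketch. The `Event` record, its field names, and the "alarm"/"config" labels are assumptions made for illustration only; the specification does not define a message format. The sketch assumes each per-node stream is already ordered by its collector's time stamp.

```python
import heapq
from dataclasses import dataclass

@dataclass(frozen=True)
class Event:
    timestamp: float  # time stamp applied by the node's event collector
    node: str         # originating node (hypothetical field)
    kind: str         # "alarm" or "config" (hypothetical labels)
    detail: str

def merge_and_split(streams):
    """Merge time-ordered per-node streams, then split alarms from config changes."""
    merged = heapq.merge(*streams, key=lambda e: e.timestamp)
    alarms, config_changes = [], []
    for event in merged:
        if event.kind == "alarm":
            alarms.append(event)
        else:
            config_changes.append(event)
    return alarms, config_changes

node1 = [Event(1.0, "n1", "alarm", "USD"), Event(5.0, "n1", "config", "port disabled")]
node2 = [Event(2.0, "n2", "alarm", "EDC")]
alarms, config_changes = merge_and_split([node1, node2])
```

Both output lists remain in chronological order, which is what the downstream blocks are described as expecting.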

【００３５】構成トラッカ４２が、構成変更イベントを
受け取り、これらを処理して、システム・モデル４４に
基づいて構成データベース４６を更新する。構成データ
ベース４６は、現在使用可能なモジュール、その状況、
およびトポロジを含む、ネットワーク始動時のネットワ
ーク２２の完全な構成を用いて初期化される。このデー
タベースは、その後、たとえば、ノード２４の追加また
は除去、スイッチ２８上のポートの使用可能化または使
用不能化などの、発生したすべての変更を反映するため
に、リアル・タイムで自動的に更新される。システム・
モデル４４では、ネットワーク２２内で使用されるモジ
ュールが、その相互接続および階層を含めて記述され
る。用語「モジュール」は、本明細書では、通常は、特
定のエラー・レポートに関連付けることができる現場交
換可能ユニット（ＦＲＵ）またはＦＲＵの一部を指すの
に使用される。システム・モデル４４内のモジュールの
間の差異化によって、診断ユニット２０がエラー・レポ
ートを診断し、そのソースを局所化する際の粒度が決定
される。階層システム・モデルが、当技術分野で既知の
ように、Extensible Markup Language（ＸＭＬ）フォー
マットでネットワーク２２のオペレータによって診断ユ
ニット２０に供給されることが好ましい。A configuration tracker 42 receives the configuration change events and processes them to update a configuration database 46, based on a system model 44. Configuration database 46 is initialized with the complete configuration of network 22 at network start-up, including the currently available modules, their status, and the topology. The database is thereafter updated automatically, in real time, to reflect all changes that occur, such as the addition or removal of nodes 24 or the enabling or disabling of ports on switches 28. System model 44 describes the modules used in network 22, including their interconnections and hierarchy. The term "module" is generally used herein to refer to a field replaceable unit (FRU), or a part of an FRU, that can be associated with a particular error report. The differentiation between modules in system model 44 determines the granularity with which diagnostic unit 20 diagnoses error reports and localizes their source. The hierarchical system model is preferably supplied to diagnostic unit 20 by the operator of network 22 in Extensible Markup Language (XML) format, as is known in the art.

【００３６】診断エンジン４８は、イベント・フォーマ
ッタおよびマージャ４０からアラーム・ストリームを受
け取り、この情報を使用して、各アラームに関連するモ
ジュールの信頼性査定を判定し、更新する。信頼性査定
は、各アラームに対応するベイズ・ネットワークをオン
ザフライで構成し、ベイズ信頼性理論を使用して、モジ
ュールのそれぞれのさまざまな誤動作の誤動作率を査定
することによって、更新される。診断エンジンが使用す
る方法は、後で詳細に説明する。ベイズ・ネットワーク
を構成する際に、診断エンジンは、上で説明したよう
に、システム・モデル４４および構成データベース４６
によって供給される情報を使用する。診断エンジンは、
ネットワーク２２内の可能な障害を記述する障害モデル
５０にも頼る。この文脈での障害は、ローカルな問題ま
たは予想されない入力に起因して所与のモジュール内で
発生する可能性がある、異常な状態または振る舞いであ
る。Diagnostic engine 48 receives the alarm stream from event formatter and merger 40 and uses this information to determine and update reliability assessments for the modules associated with each alarm. The reliability assessments are updated by constructing, on the fly, a Bayesian network corresponding to each alarm and using Bayesian reliability theory to assess the malfunction rates of the various malfunctions of each of the modules. The method used by the diagnostic engine is described in detail below. In constructing the Bayesian network, the diagnostic engine uses the information provided by system model 44 and configuration database 46, as described above. The diagnostic engine also relies on a fault model 50, which describes the possible faults in network 22. A fault in this context is an abnormal state or behavior that may occur in a given module because of a local problem or an unexpected input.

【００３７】障害モデル５０は、好ましくはネットワー
ク・オペレータによって、最も好ましくはＸＭＬフォー
マットで供給される。障害モデルのサンプルのＤＴＤ
（Document Type Definition）を、付録Ａとして本明細
書に添付する。これには、通常は、グローバル障害情報
が、システム・モデル内の基本モジュールのすべてに関
する個別の障害モデルと共に含まれる。これらの基本モ
ジュールは、モジュール階層の最下位レベルにあるモジ
ュールである。Fault model 50 is preferably supplied by the network operator, most preferably in XML format. A sample DTD (Document Type Definition) for the fault model is attached hereto as Appendix A. The fault model typically includes global fault information, together with individual fault models for all of the basic modules in the system model. These basic modules are the modules at the lowest level of the module hierarchy.

【００３８】障害モデル５０のグローバル障害情報に
は、ネットワーク２２で可能なすべてのタイプの誤動作
と、その期待される率が記述される。この文脈での用語
「誤動作」は、モジュール内の障害の根本原因を指す。
モジュールで障害が検出される時に、その障害は、その
モジュール自体で発生した誤動作に起因する場合と、障
害が検出されたモジュールへネットワークを介して通信
トラフィックで伝搬された別のモジュールの誤動作に起
因する場合がある。障害モデル５０の誤動作確率は、通
常は、故障の間の推定平均時間（ＭＴＢＦ）などの故障
率に関して表現される。推定された率に、確率分布の標
準偏差（または第１積率）に関して表現された推定の信
頼性の尺度が付随することが好ましい。誤動作率査定
は、対数時間スケールでの正規分布によって記述するこ
とができる。したがって、たとえば、秒単位での誤動作
率査定（１０、１）は、誤動作発生の間の平均時間が１
０¹⁰秒であり、発生の間の実際の時間が区間［１０⁸、
１０¹²］秒である確率が０．９５であることを示す。診
断エンジン４８は、ネットワーク２２から受け取るアラ
ームを処理する際に、平均および標準偏差の両方を推論
によって更新する。The global fault information in fault model 50 describes all of the types of malfunction that are possible in network 22, together with their expected rates. The term "malfunction" in this context refers to the root cause of a fault in a module. When a fault is detected in a module, it may be due to a malfunction that occurred in that module itself, or to a malfunction in another module that was propagated in communication traffic through the network to the module in which the fault was detected. The malfunction probabilities in fault model 50 are typically expressed in terms of failure rates, such as the estimated mean time between failures (MTBF). Each estimated rate is preferably accompanied by a measure of the confidence of the estimate, expressed in terms of the standard deviation (or first moment) of the probability distribution. The malfunction rate assessment may be described by a normal distribution on a logarithmic time scale. Thus, for example, a malfunction rate assessment of (10, 1), in units of seconds, indicates that the mean time between occurrences of the malfunction is 10¹⁰ seconds, and that the probability is 0.95 that the actual time between occurrences lies in the interval [10⁸, 10¹²] seconds. Diagnostic engine 48 updates both the mean and the standard deviation by inference as it processes the alarms that it receives from network 22.
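
The (10, 1) example can be checked numerically. The sketch below assumes that the assessment is a normal distribution over log10 of the time between occurrences, which is one reading of "normal distribution on a logarithmic time scale"; the function name is hypothetical. Under that reading, the interval [10⁸, 10¹²] is the mean plus or minus two standard deviations, which carries a probability of about 0.954, in line with the stated 0.95.

```python
import math
from statistics import NormalDist

def prob_between(log10_mean, log10_sd, low_seconds, high_seconds):
    """Probability that the time between occurrences lies in [low, high] seconds,
    for an assessment that is normal in log10(seconds)."""
    dist = NormalDist(mu=log10_mean, sigma=log10_sd)
    return dist.cdf(math.log10(high_seconds)) - dist.cdf(math.log10(low_seconds))

# Assessment (10, 1): mean 10**10 s, one decade of standard deviation.
p = prob_between(10, 1, 1e8, 1e12)
```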

【００３９】各基本モジュールの個々の障害モデルに
は、以下の情報が含まれる。・そのモジュールで発生する可能性がある誤動作のそれ
ぞれについて、それがそのモジュール自体によって検出
され、そのモジュールによるアラームの生成につながる
かどうかと、その誤動作がそのモジュールの出力の障害
状態を引き起こすかどうか。障害状態は、障害すなわ
ち、上で注記したようにモジュールの異常な状態または
振る舞いの出現につながる誤動作の発生の結果である。
モジュール自体で障害を引き起こすモジュール内の障害
状態を、本明細書では「ローカル障害状態」と呼称す
る。別のモジュールの異常な入力状態を引き起こす、モ
ジュール出力での障害状態を、「接続障害状態」と呼称
する。・モジュールの入力に現れる可能性がある障害状態のそ
れぞれについて、その状態がそのモジュールによって伝
搬されるか、検出されるか、その両方であるか。・検出される障害状態のそれぞれについて、モジュール
がどのアラームを報告するか。The individual fault model of each basic module includes the following information:
• For each malfunction that can occur in the module, whether it is detected by the module itself, leading the module to generate an alarm, and whether the malfunction causes a fault condition at the output of the module. A fault condition is the result of the occurrence of a malfunction that leads to the appearance of a fault, i.e., an abnormal state or behavior of a module, as noted above. A fault condition within a module that causes a fault in the module itself is referred to herein as a "local fault condition." A fault condition at the module output that causes an abnormal input state in another module is referred to as a "connection fault condition."
• For each fault condition that may appear at the input of the module, whether the condition is propagated by the module, detected by it, or both.
• For each detected fault condition, which alarm the module reports.
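
One way to hold the three kinds of information listed above in memory is sketched below. All class and field names are hypothetical; the actual fault model is supplied in XML according to the DTD of Appendix A, which is not reproduced here. The example entry reflects the USD case described later in the specification, in which a switch receiver detects a corrupted BOP and does not retransmit the corrupted data.

```python
from dataclasses import dataclass, field
from typing import List

@dataclass
class Malfunction:
    name: str
    detected_locally: bool  # the module itself detects it and raises an alarm
    output_conditions: List[str] = field(default_factory=list)  # connection fault conditions caused

@dataclass
class InputCondition:
    name: str
    propagated: bool  # passed on to the module's output
    detected: bool    # noticed by the module
    alarms: List[str] = field(default_factory=list)  # alarms reported when detected

@dataclass
class ModuleFaultModel:
    module_type: str
    malfunctions: List[Malfunction] = field(default_factory=list)
    input_conditions: List[InputCondition] = field(default_factory=list)

    def alarms_for(self, condition_name: str) -> List[str]:
        """Alarms reported for a detected input fault condition (empty if none)."""
        for cond in self.input_conditions:
            if cond.name == condition_name and cond.detected:
                return list(cond.alarms)
        return []

receiver = ModuleFaultModel(
    module_type="switch receiver port",
    input_conditions=[
        InputCondition("corrupted BOP", propagated=False, detected=True, alarms=["USD"]),
    ],
)
```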

【００４０】勧告および説明ジェネレータ５２が、診断
エンジン４８によって計算された誤動作査定を受け取
り、ネットワーク２２内の異なるモジュールの査定を、
障害モデル５０に保持された期待されるベースライン値
と比較する。所与のモジュールの故障率査定が、そのベ
ースライン値より大幅に高い時には、勧告および説明ジ
ェネレータ５２は、通常は、さらに診断処置を講ずる
か、そのモジュールを含むＦＲＵを交換するようにユー
ザに勧告する。そのような勧告を行うための判断基準
は、下でさらに説明する。勧告は、ユーザ・インターフ
ェース５４を介して提示される。このユーザ・インター
フェースを用いて、ユーザが、勧告および説明ジェネレ
ータへの照会を入力でき、それに応答して勧告の根本的
理由の包括的な説明を受け取ることができることが好ま
しい。説明は、診断エンジン４８によって構成されたベ
イズ・ネットワークに基づいて、当技術分野で既知の説
明を生成する方法を使用して生成されることが好まし
い。このための例示的方法が、ドラズデル（Druzdel）
著、「Qualitative Verbal Explanations in Bayesian
Belief Networks」、Artificial Intelligence and Sim
ulation of Behavior Quarterly 94（１９９６年）、４
３ないし５４ページと、マディガン（Madigan）他著、
「Explanation in Belief Networks」、Journal of Com
putational and Graphical Statistics 6、１６０ない
し１８１ページ（１９９７年）に記載されている。これ
らの出版物の両方が、参照によって本明細書に組み込ま
れる。A recommendation and explanation generator 52 receives the malfunction assessments computed by diagnostic engine 48 and compares the assessments for the different modules in network 22 with the expected baseline values held in fault model 50. When the failure rate assessment of a given module is substantially higher than its baseline value, recommendation and explanation generator 52 typically recommends that the user take further diagnostic action or replace the FRU that contains the module. The criteria for making such recommendations are described further below. The recommendations are presented via a user interface 54. Preferably, the user can use the user interface to enter queries to the recommendation and explanation generator and, in response, receive a comprehensive explanation of the rationale underlying a recommendation. The explanation is preferably generated on the basis of the Bayesian network constructed by diagnostic engine 48, using methods of explanation generation known in the art. Exemplary methods for this purpose are described by Druzdel in "Qualitative Verbal Explanations in Bayesian Belief Networks," Artificial Intelligence and Simulation of Behavior Quarterly 94 (1996), pages 43-54, and by Madigan et al. in "Explanation in Belief Networks," Journal of Computational and Graphical Statistics 6 (1997), pages 160-181. Both of these publications are incorporated herein by reference.

【００４１】図３は、本発明の好ましい実施形態によ
る、診断ユニット２０でアラームを処理し、勧告を生成
する方法を概略的に示す流れ図である。この方法は、ア
ラーム受取のステップ６０で、診断エンジン４８がアラ
ームを受け取るたびに起動されることが好ましい。その
代わりに、この方法を、あるタイプまたはグループのア
ラームに応答して呼び出すことができる。シーケンス組
合せのステップ６２で、短い時間間隔で発生する関係す
るアラームを、処理のために組み合わせることが好まし
い。集合処理のためのシーケンス内でのアラームの組合
せに適用可能な方法および考慮事項を、図５に関して下
で詳細に説明する。FIG. 3 is a flow chart that schematically illustrates a method for processing alarms and generating recommendations in diagnostic unit 20, in accordance with a preferred embodiment of the present invention. The method is preferably invoked each time diagnostic engine 48 receives an alarm, at an alarm reception step 60. Alternatively, the method may be invoked in response to alarms of a certain type or group. At a sequence combination step 62, related alarms that occur within a short time interval are preferably combined for processing. Methods and considerations applicable to combining alarms in a sequence for collective processing are described in detail below with reference to FIG. 5.

【００４２】診断エンジン４８は、ネットワーク構築の
ステップ６４で、特定のアラームまたはアラーム・シー
ケンスに適用可能なベイズ・ネットワーク（または因果
ネットワーク）を構築する。単一のアラームに応答して
構成された通常のベイズ・ネットワークを図４に示し、
このネットワークを構成するのに使用される方法を、図
６ないし８に関して下で詳細に説明する。ベイズ・ネッ
トワークは、有向非輪状グラフであり、そのノードが、
問題のアラームにつながる可能なモジュール誤動作、障
害状態、および障害を含む変数に対応する。誤動作ノー
ドは、期待される誤動作率または査定された誤動作率に
基づく、指定された確率分布を有する。残りの変数の確
率は、グラフ内の親に対して、対応する変数の条件つき
確率を表す確率テーブルによって記述される。Diagnostic engine 48 builds a Bayesian network (or causal network) applicable to the particular alarm or alarm sequence, at a network building step 64. A typical Bayesian network constructed in response to a single alarm is shown in FIG. 4, and the method used to construct the network is described in detail below with reference to FIGS. 6 to 8. A Bayesian network is a directed acyclic graph whose nodes correspond to variables, including the possible module malfunctions, fault conditions, and faults that lead to the alarm in question. The malfunction nodes have specified probability distributions, based on the expected or assessed malfunction rates. The probabilities of the remaining variables are described by probability tables, which give the conditional probability of each variable given its parents in the graph.

【００４３】グラフを構成した後に、診断エンジン４８
は、更新のステップ６６で、シーケンス内のアラームに
基づいてノードの確率テーブルを更新する。指定された
時間枠内で発生する異なるアラームを相関させることに
よって、診断エンジンは、ノードの条件つき確率を調整
することができ、その後、グラフを作って、誤動作ノー
ドの誤動作率査定を更新することができる。言い換える
と、すべての所与の観察されたアラームＡについて、そ
の確率Ｐ（Ａ＝真）に、１をセットする。期待されるア
ラームの確率は、その寿命分布に従って判定される。そ
の後、ベイズ・ネットワークのノードの確率テーブルを
再計算して、これらの結果との一貫性を有するようにす
る。この手順を、ベイズ・ネットワークの分野では、
「証拠伝搬（evidence propagation）」と称する。After constructing the graph, diagnostic engine 48 updates the probability tables of the nodes based on the alarms in the sequence, at an update step 66. By correlating the different alarms that occur within a specified time frame, the diagnostic engine can adjust the conditional probabilities of the nodes and then work up the graph to update the malfunction rate assessments of the malfunction nodes. In other words, for every observed alarm A, the probability P(A = true) is set to 1. The probabilities of the expected alarms are determined according to their lifetime distributions. The probability tables of the nodes of the Bayesian network are then recomputed so as to be consistent with these results. In the field of Bayesian networks, this procedure is known as "evidence propagation."
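
The effect of fixing P(A = true) = 1 and recomputing the other probabilities can be seen in a toy example. The sketch below is not the engine's algorithm: it uses brute-force enumeration over a two-cause network, and the priors, malfunction names, and the deterministic-OR alarm model are all assumptions chosen for illustration. Practical evidence propagation uses more efficient message-passing schemes.

```python
from itertools import product

def posterior_given_alarm(priors, p_alarm_given):
    """Posterior P(malfunction occurred | alarm observed) by full enumeration.

    priors:        {name: prior probability that the malfunction occurred}
    p_alarm_given: function mapping {name: bool} to P(alarm = true | state)
    """
    names = list(priors)
    weights = {}
    for values in product([False, True], repeat=len(names)):
        state = dict(zip(names, values))
        p = 1.0
        for n in names:
            p *= priors[n] if state[n] else 1.0 - priors[n]
        weights[values] = p * p_alarm_given(state)  # condition on alarm = true
    total = sum(weights.values())
    if total == 0.0:
        raise ValueError("the observed alarm is impossible under this model")
    return {n: sum(w for v, w in weights.items() if v[i]) / total
            for i, n in enumerate(names)}

priors = {"link_corruption": 0.01, "local_defect": 0.001}  # illustrative values
alarm_model = lambda state: 1.0 if any(state.values()) else 0.0
post = posterior_given_alarm(priors, alarm_model)
```

After the alarm is observed, most of the probability mass moves to the malfunction with the larger prior, which is the behavior the update step relies on when ranking possible causes.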

【００４４】更新された誤動作査定は、勧告のステップ
６８で、勧告および説明ジェネレータ５２がユーザに勧
告を提供するための基礎として働く。ユーザが、各モジ
ュールに適用される２つの閾値レベルすなわち、モジュ
ールに「障害の疑いあり」としてフラグが立てられる低
閾値と、疑わしいモジュールが疑わしくないものとして
再分類される高閾値を定義することが好ましい。これら
の閾値は、各モジュールの査定された誤動作率と、シス
テム仕様に基づくそのモジュールの期待される故障率の
間の差に関係する。ユーザは、この２つの閾値の信頼性
レベルも定義する。この信頼性レベルは、モジュールの
誤動作率査定に関連する標準偏差値に対して検査され
る。したがって、たとえば、ユーザは、そのＭＴＢＦ
（誤動作率の逆数）が１０⁸未満に低下したことが１０
％の信頼性レベルである時に、所与のモジュールに障害
の疑いありとしてフラグを立てることを指定することが
できる。あるアラーム・シーケンスに続くステップ６６
の後に、そのモジュールについて査定されたＭＴＢＦ
が、上で説明した対数表記を使用して（９、２）である
と仮定する。そのような場合には、実際のＭＴＢＦが閾
値１０⁸未満に低下した確率が１０％を超えるので、そ
のモジュールにそれ相応にフラグが立てられる。ユーザ
は、通常、問題のＦＲＵを交換するか他の形でサービス
するコストに応じて、ネットワーク動作中のモジュール
の障害の結果の深刻さに対して重みをつけて、閾値およ
び信頼性レベルを設定する。The updated malfunction assessments serve as the basis on which recommendation and explanation generator 52 provides recommendations to the user, at a recommendation step 68. Preferably, the user defines two threshold levels that are applied to each module: a low threshold, at which the module is flagged as "suspected of being faulty," and a high threshold, at which a suspect module is reclassified as non-suspect. These thresholds relate to the difference between the assessed malfunction rate of each module and the expected failure rate of that module according to the system specification. The user also defines confidence levels for the two thresholds. The confidence level is checked against the standard deviation value associated with the malfunction rate assessment of the module. Thus, for example, the user may specify that a given module is to be flagged as suspected of being faulty when there is a 10% confidence level that its MTBF (the inverse of the malfunction rate) has dropped below 10⁸. Assume that after step 66, following a certain alarm sequence, the MTBF assessed for the module is (9, 2), using the logarithmic notation described above. In this case, the probability that the actual MTBF has dropped below the threshold of 10⁸ is greater than 10%, and the module is flagged accordingly. The user typically sets the thresholds and confidence levels by weighing the severity of the consequences of a module failure during network operation against the cost of replacing or otherwise servicing the FRU in question.
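
The (9, 2) example works out as follows, again assuming the assessment is normal in log10(MTBF). The threshold 10⁸ lies half a standard deviation below the mean, so the probability that the actual MTBF is below it is Φ(-0.5), roughly 0.31, which exceeds the 10% confidence level. The function name and return shape are hypothetical.

```python
from statistics import NormalDist

def is_suspected(log10_mean, log10_sd, log10_threshold, confidence):
    """Flag a module when P(MTBF < threshold) exceeds the user's confidence level."""
    p_below = NormalDist(mu=log10_mean, sigma=log10_sd).cdf(log10_threshold)
    return p_below > confidence, p_below

# Assessed MTBF (9, 2), threshold 10**8, confidence level 10%.
flagged, p_below = is_suspected(9, 2, 8, 0.10)
```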

【００４５】所与のモジュールが、障害の疑いありとし
てフラグを立てられている時に、勧告および説明ジェネ
レータ５２が、そのモジュールの状況を検証するために
そのモジュールに適用することができるオンラインの非
破壊試験手順があるかどうかを判定するために検査す
る。そうである場合には、ジェネレータが、その手順を
自動的に呼び出すか、その代わりに、その手順を実行す
るようにユーザに促すことが好ましい。この手順の結果
が、診断エンジン４８にフィード・バックされることが
好ましく、この診断エンジン４８は、適用可能なベイズ
・ネットワークにその結果を組み込み、その誤動作率査
定をそれ相応に更新する。この手順の次に、勧告および
説明ジェネレータ５２が、ＦＲＵを交換しなければなら
ないかどうかを判定することができる。その代わりに、
問題のモジュールに関連する可能な誤動作のすべてに関
するＭＴＢＦ査定が、高閾値未満に低下する（おそらく
はネットワーク２２からの追加のアラームの受取および
処理の後に）場合に、そのモジュールの障害の疑いフラ
グをリセットする。When a given module has been flagged as suspected of being faulty, recommendation and explanation generator 52 checks to determine whether there is an on-line, non-destructive test procedure that can be applied to the module in order to verify its status. If so, the generator preferably invokes the procedure automatically or, alternatively, prompts the user to carry it out. The results of the procedure are preferably fed back to diagnostic engine 48, which incorporates them into the applicable Bayesian network and updates its malfunction rate assessments accordingly. Following this procedure, recommendation and explanation generator 52 can determine whether the FRU must be replaced. Alternatively, if the MTBF assessments for all of the possible malfunctions associated with the module in question drop below the high threshold (possibly after additional alarms from network 22 have been received and processed), the suspected-fault flag of the module is reset.

【００４６】図４は、本発明の好ましい実施形態によ
る、診断エンジン４８によって生成される例示的ベイズ
・ネットワークであるネットワーク７０を概略的に示す
グラフである。この例では、診断エンジン４８が、図３
の方法のステップ６０で受け取る、観察されたＵＳＤ
（非送信請求データ）アラーム７１に応答して、ネット
ワーク７０を構成する。このアラームは、ＵＳＤ障害７
２が発生し、これによって、スイッチ２８の１つの受信
器ポートが、データの前に送信されなければならない、
正しいパケットの先頭（ＢＯＰ）文字が先行していなか
ったデータを受信したことを意味する。この障害を引き
起こす可能性がある、障害モデル５０に記述されたシナ
リオには、次の２つがある。・破壊されたＢＯＰ − このエラーを報告したスイッ
チにデータを送信した、ネットワーク２２内の先行する
スイッチの受信器部分と、エラーが検出された実際の受
信器ポートとの間のどのモジュールでも発生する可能性
がある。・ローカル設計欠陥 − メモリ破壊以外の、報告する
スイッチのローカルな問題。FIG. 4 is a graph that schematically illustrates a network 70, which is an exemplary Bayesian network generated by diagnostic engine 48, in accordance with a preferred embodiment of the present invention. In this example, diagnostic engine 48 constructs network 70 in response to an observed USD (unsolicited data) alarm 71, which it receives at step 60 of the method of FIG. 3. This alarm means that a USD fault 72 has occurred, whereby a receiver port of one of switches 28 received data that was not preceded by a correct beginning-of-packet (BOP) character, which must be transmitted ahead of the data. There are two scenarios described in fault model 50 that can cause this fault:
• Corrupted BOP: this can occur in any module between the receiver portion of the preceding switch in network 22, which sent the data to the switch that reported the error, and the actual receiver port at which the error was detected.
• Local design defect: a local problem in the reporting switch, other than memory corruption.

【００４７】ネットワーク７０を構築するために、診断
エンジン４８は、観察されたＵＳＤアラーム７１に対応
するノードと、そのアラームを引き起こしたＵＳＤ障害
７２に対応するノードから始める。障害モデル５０に基
づいて、報告するスイッチでＵＳＤ障害７２を引き起こ
した可能性がある障害状態７４に対応するノードを、ネ
ットワークに追加する。上で注記したように、これらの
障害状態には、リンク上またはスイッチ自体の中で破壊
されたビットと、破壊を引き起こした可能性があるロー
カル設計欠陥の両方が含まれる。その後、障害モデルを
使用して、再帰的な形でネットワーク７０にさらに障害
状態７６を追加する。追加される障害状態には、報告す
るスイッチ上の障害状態、またはそれに接続され、報告
するスイッチに伝搬され、したがって障害状態７４の１
つを引き起こした可能性がある先行するスイッチ上の障
害状態のすべてを含めなければならない。この処理は、
最終的に停止する。というのは、データ・フローが非輪
状であり、ネットワーク２２が有限だからである。そう
であっても、通信ネットワーク２２全体を介する障害状
態の伝搬によって、手におえないほど大きいベイズ・ネ
ットワーク７０がもたらされるはずである。現在の例で
は、スイッチ２８が、破壊されたデータを再送信しない
ので、伝搬が停止し、したがって、ＢＯＰ破壊が、ネッ
トワーク２２内で、先行するスイッチの受信器ポートよ
り遠い位置から発した可能性はない。下の図９で、本発
明の好ましい実施形態に従って作成されたベイズ・ネッ
トワークのサイズを制限する、もう１つの技法を示す。To build network 70, diagnostic engine 48 begins with a node corresponding to observed USD alarm 71 and a node corresponding to USD fault 72, which caused the alarm. Based on fault model 50, nodes corresponding to fault conditions 74 that could have caused USD fault 72 at the reporting switch are added to the network. As noted above, these fault conditions include both bits corrupted on the link or within the switch itself and a local design defect that could have caused the corruption. The fault model is then used to add further fault conditions 76 to network 70 in a recursive manner. The added fault conditions must include all of the fault conditions, on the reporting switch or on a preceding switch connected to it, that could have been propagated to the reporting switch and thus caused one of fault conditions 74. This process eventually stops, because the data flow is acyclic and network 22 is finite. Even so, propagation of fault conditions through the entire communication network 22 could lead to an unmanageably large Bayesian network 70. In the present example, the propagation stops because switches 28 do not retransmit corrupted data, so that the BOP corruption could not have originated in network 22 any farther away than the receiver port of the preceding switch. FIG. 9 below shows another technique for limiting the size of the Bayesian networks created in accordance with preferred embodiments of the present invention.
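
The recursive expansion of fault conditions can be sketched as a backward traversal. This is an illustration under assumptions, not the method of FIGS. 6 to 8: the `causes_of` mapping stands in for what the fault model and configuration database would supply, and the condition labels are hypothetical. The traversal terminates for the reason given above, namely that the set of conditions is finite and each is visited once.

```python
def collect_conditions(condition, causes_of, found=None):
    """Return the starting condition plus every upstream condition that could cause it."""
    if found is None:
        found = set()
    if condition in found:  # already expanded; avoids repeated work
        return found
    found.add(condition)
    for upstream in causes_of.get(condition, ()):
        collect_conditions(upstream, causes_of, found)
    return found

# Hypothetical causal relations for the USD example. Expansion ends at the
# preceding switch because corrupted data is not retransmitted beyond it.
causes_of = {
    "USD fault": ["USD on link", "local design defect"],
    "USD on link": ["corrupted BOP at preceding switch output"],
}
nodes = collect_conditions("USD fault", causes_of)
```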

【００４８】障害状態７４および７６のそれぞれについ
て、診断エンジン４８は、その状態を引き起こした可能
性がある誤動作８０に対応するノードをネットワーク７
０に追加する。誤動作ノードは、故障率分布をそれに関
連付けられ、これによって、特定の誤動作の連続的な確
率が示される。ネットワーク７０を完成させるために、
誤動作８０を、ブール発生７８に関して離散化する。言
い換えると、所与の誤動作８０を、離散化された故障率
分布を有する区間変数によって表す。各区間について、
連続故障率分布関数の値を計算して（通常は区間の中点
で）、区間の離散化された故障率分布の値を与える。出
現変数は、対応する誤動作が発生する確率の計算に使用
される。言い換えると、出現変数は、ｔが、ネットワー
ク７０がそれについて構成された観察されたアラームの
観察の時刻であるものとして、その項目がＰ（時刻ｔに
発生した誤動作｜ａ＜故障率＜ｂ）によって与えられる
条件つき確率テーブルを有するブール変数である。確率
テーブルは、ポアソン到着統計などの適当なモデルに従
って、対応する誤動作の推定された率によって決定され
ることが好ましい。ネットワーク７０の複雑さを減らす
ために、誤動作８０ごとに１つのブール発生のノード７
８だけがあることが好ましい。誤動作によって引き起こ
される障害状態７４および７６は、その誤動作に関連す
る出現変数に接続される。For each of fault conditions 74 and 76, diagnostic engine 48 adds to network 70 nodes corresponding to malfunctions 80 that could have caused the condition. Each malfunction node has a failure rate distribution associated with it, which indicates the continuous probability of the particular malfunction. To complete network 70, malfunctions 80 are discretized in terms of Boolean occurrences 78. In other words, a given malfunction 80 is represented by an interval variable with a discretized failure rate distribution. For each interval, the value of the continuous failure rate distribution function is computed (typically at the midpoint of the interval) to give the value of the discretized failure rate distribution for the interval. The occurrence variable is used to compute the probability that the corresponding malfunction occurred. In other words, the occurrence variable is a Boolean variable with a conditional probability table whose entries are given by P(malfunction occurred at time t | a < failure rate < b), where t is the time of observation of the observed alarm for which network 70 was constructed. The probability table is preferably determined by the estimated rate of the corresponding malfunction, according to a suitable model, such as Poisson arrival statistics. To reduce the complexity of network 70, there is preferably only one Boolean occurrence node 78 for each malfunction 80. The fault conditions 74 and 76 caused by a malfunction are connected to the occurrence variable associated with that malfunction.
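
The discretization and the occurrence table can be sketched as follows. Several points are assumptions: the continuous distribution is taken to be normal in log10(MTBF), the intervals are given as explicit edges on that scale, and the occurrence probability is evaluated over an observation window whose length the text does not specify. Under Poisson arrivals with rate 1/MTBF, the probability of at least one occurrence within a window w is 1 - exp(-w/MTBF).

```python
import math
from statistics import NormalDist

def discretize(log10_mean, log10_sd, edges):
    """Midpoint-rule discretization of a normal density over log10(MTBF) intervals.
    Returns [(midpoint, weight)] with weights normalized to sum to 1."""
    dist = NormalDist(mu=log10_mean, sigma=log10_sd)
    bins = []
    for a, b in zip(edges, edges[1:]):
        mid = (a + b) / 2.0
        bins.append((mid, dist.pdf(mid) * (b - a)))
    total = sum(w for _, w in bins)
    return [(mid, w / total) for mid, w in bins]

def p_occurrence(log10_mtbf, window_seconds):
    """P(at least one occurrence within the window) under Poisson arrivals."""
    rate = 1.0 / 10.0 ** log10_mtbf
    return -math.expm1(-rate * window_seconds)

def marginal_occurrence(bins, window_seconds):
    """Occurrence probability averaged over the discretized rate distribution."""
    return sum(w * p_occurrence(mid, window_seconds) for mid, w in bins)

bins = discretize(10, 1, [7, 8, 9, 10, 11, 12, 13])
```

Each row of the occurrence node's conditional probability table corresponds to one interval, with `p_occurrence` evaluated at that interval's midpoint.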

【００４９】ネットワーク２２内の前のスイッチと、Ｕ
ＳＤアラームを報告したスイッチとの間のリンクでＢＯ
Ｐビット破壊が発生した場合（ケーブル上と、ケーブル
を装置に接続する補助コンポーネントで発生した破壊を
含む）、破壊されたビットは、エラー検出コード（ＥＤ
Ｃ）障害８２も引き起こしていなければならない。この
情況は、「リンク上のＵＳＤ」障害状態ノードをＥＤＣ
障害ノードに接続する、ネットワーク７０に追加された
辺によって反映される。ＥＤＣ障害は、観察されたＵＳ
Ｄアラーム７１の他に、スイッチにＥＤＣアラーム８４
を発行させているはずである。このＥＤＣアラーム８４
が、「期待されるアラーム」としてネットワーク７０に
追加される。診断ユニット２０でのＥＤＣアラームの到
着または非到着は、ＵＳＤアラームの可能性の高い原因
を判定するのに重要な要因であり、したがって、ネット
ワーク７０内のノードの条件つき確率を調整するのに重
要な要因である。If the BOP bit corruption occurred on the link between the previous switch in network 22 and the switch that reported the USD alarm (including corruption that occurred on the cable and in the auxiliary components that connect the cable to the devices), the corrupted bits must also have caused an error detection code (EDC) fault 82. This situation is reflected by an edge added to network 70 that connects the "USD on link" fault condition node to the EDC fault node. The EDC fault should have caused the switch to issue an EDC alarm 84, in addition to observed USD alarm 71. EDC alarm 84 is added to network 70 as an "expected alarm." The arrival or non-arrival of the EDC alarm at diagnostic unit 20 is an important factor in determining the likely cause of the USD alarm, and thus in adjusting the conditional probabilities of the nodes in network 70.

【００５０】図５は、診断エンジン４８が受け取るアラ
ーム９０のシーケンスの処理を概略的に示すタイミング
図である。これらのアラームは、現在のアラームに関す
るベイズ・ネットワークを構築し、ネットワーク内のノ
ードの確率を評価するのに使用するために、ステップ６
２（図３）で組み合わせられる。適当な時間ウィンドウ
内のアラームのシーケンスを集めることが、たとえば、
期待されたアラーム８４が、観察されたアラーム７１と
共に到着したか否かの判定に使用される。時間ウィンド
ウの選択は、診断ユニット２０でのアラーム到着時刻お
よびアラームの到着の順序の不確実性を正しく扱うため
に重要である。FIG. 5 is a timing diagram that schematically illustrates the processing of a sequence of alarms 90 received by diagnostic engine 48. These alarms are combined at step 62 (FIG. 3) for use in building the Bayesian network for the current alarm and evaluating the probabilities of the nodes in the network. Collecting the sequence of alarms within an appropriate time window is used, for example, to determine whether or not expected alarm 84 arrived together with observed alarm 71. The choice of the time window is important in order to deal correctly with the uncertainty in the arrival times of alarms at diagnostic unit 20 and in their order of arrival.

【００５１】所与のシーケンス内のどのアラームを処理
のために組み合わせるかを判定するために、時間に対す
る正規分布を、アラームの各タイプに関連付ける。この
分布は、アラームの「寿命分布」と称するが、ネットワ
ーク２２内である時刻Ｔ＝０に発生したイベントに関連
するアラームの、診断ユニット２０での到着の時間に対
する確率を表す。言い換えると、図５を参照すると、ア
ラームＡ"の寿命分布によって、アラームＡが時刻Ｔ₀に
受け取られた時に、Ａと同一の障害状態によって生成さ
れたアラームＡ"が、時刻Ｔ₁に受け取られる推定確率が
与えられる。通常、各アラーム・タイプの寿命は、診断
ユニットのユーザによって指定されるが、その代わり
に、ネットワーク２２の実際の性能に基づいて、診断ユ
ニットによって寿命を計算することができる。To determine which alarms in a given sequence are to be combined for processing, a normal distribution over time is associated with each type of alarm. This distribution, referred to as the "lifetime distribution" of the alarm, represents the probability, as a function of time, of arrival at diagnostic unit 20 of an alarm related to an event that occurred in network 22 at a time T = 0. In other words, referring to FIG. 5, the lifetime distribution of alarm A" gives the estimated probability that, when alarm A is received at time T₀, alarm A", generated by the same fault condition as A, will be received at time T₁. Typically, the lifetime of each alarm type is specified by the user of the diagnostic unit, but alternatively the lifetimes may be computed by the diagnostic unit based on the actual performance of network 22.

【００５２】場合によっては、ネットワーク２２のモジ
ュールが、障害の発生のすべてでアラームを発生するの
ではなく、ある回数の発生を累算し、その後、バッチ・
アラームを発行する。この場合、単独のアラーム寿命に
閾値係数を掛け、その結果、アラームの寿命分布が広く
なるようにする必要がある。したがって、図５には、閾
値係数を有しない第１アラーム・タイプの狭い分布９２
と、低い閾値係数を有する第２アラーム・タイプの中間
の分布９４と、高い閾値係数を有する第３アラーム・タ
イプの広い分布９６が示されている。In some cases, a module of the network 22 does not raise an alarm on every occurrence of a fault, but instead accumulates a certain number of occurrences and then issues a batch alarm. In this case, the lifetime of a single alarm must be multiplied by a threshold factor, so that the lifetime distribution of the alarm becomes broader. Accordingly, FIG. 5 shows a narrow distribution 92 for a first alarm type with no threshold factor, an intermediate distribution 94 for a second alarm type with a low threshold factor, and a broad distribution 96 for a third alarm type with a high threshold factor.
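The effect of the threshold factor may be illustrated with a short sketch. The specification states only that the single-alarm lifetime is multiplied by the factor; scaling both the mean and the standard deviation, as done here, is an assumption, as are the names `Lifetime` and `batched_lifetime` and the numeric values.

```python
from dataclasses import dataclass


@dataclass(frozen=True)
class Lifetime:
    mean_s: float  # hypothetical mean arrival delay in seconds
    std_s: float   # hypothetical spread of the arrival delay in seconds


def batched_lifetime(single: Lifetime, threshold_factor: float) -> Lifetime:
    """Broaden a single-alarm lifetime for a module that batches its alarms.

    Assumption: the factor scales both parameters of the distribution.
    """
    if threshold_factor < 1.0:
        raise ValueError("threshold_factor must be at least 1")
    return Lifetime(single.mean_s * threshold_factor,
                    single.std_s * threshold_factor)


single = Lifetime(mean_s=2.0, std_s=0.5)
narrow = batched_lifetime(single, 1.0)   # no batching, like distribution 92
medium = batched_lifetime(single, 3.0)   # low threshold factor, like distribution 94
broad = batched_lifetime(single, 10.0)   # high threshold factor, like distribution 96
```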

【００５３】確率テーブルおよび誤動作率査定を更新す
るためにベイズ・ネットワークを処理する（図３の方法
のステップ６６）前に、診断エンジン４８が、シーケン
ス内の関係する観察されたアラームおよび期待されるア
ラームのすべてを受け取るまで、待つことが好ましい。
待つ時間の長さは、アラーム寿命によって決定される。
図５に示されているように、診断エンジン４８は、すべ
ての期待されるアラームの到着確率が、所定の閾値未満
になる時刻Ｔ_ENDまで待つことが好ましい。その場合
に、アラームＡ₀、…、Ａ'、Ａ"は、アラームＡに関連
するとみなされるが、Ｔ_ENDの後に到着するアラームＡ_N
は、そう見なされない。Before processing the Bayesian network to update the probability tables and malfunction rate assessments (step 66 of the method of FIG. 3), the diagnostic engine 48 preferably waits until it has received all of the relevant observed and expected alarms in the sequence. The length of time to wait is determined by the alarm lifetimes. As shown in FIG. 5, the diagnostic engine 48 preferably waits until a time T_END at which the arrival probability of all expected alarms falls below a predetermined threshold. In that case, alarms A₀, ..., A', A" are considered to be related to alarm A, whereas an alarm A_N arriving after T_END is not.
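One way to compute the waiting time T_END is sketched below. The sketch interprets "arrival probability below a threshold" as the tail probability of each normal lifetime distribution, which is an assumption; the function names and the sample values are likewise hypothetical.

```python
from statistics import NormalDist


def wait_deadline(t0: float, lifetimes: dict, threshold: float) -> float:
    """Return T_END for an alarm observed at time t0.

    lifetimes maps each expected alarm type to (mean_s, std_s) of its
    lifetime distribution. T_END is the earliest time at which, for every
    type, the probability of the alarm still arriving is below threshold.
    """
    if not 0.0 < threshold < 1.0:
        raise ValueError("threshold must lie strictly between 0 and 1")
    delays = [NormalDist(mean, std).inv_cdf(1.0 - threshold)
              for mean, std in lifetimes.values()]
    return t0 + max(delays, default=0.0)


def related_alarms(t0: float, arrivals: list, t_end: float) -> list:
    """Keep the alarms that arrived between t0 and t_end inclusive."""
    return [name for name, t in arrivals if t0 <= t <= t_end]


lifetimes = {"A'": (1.0, 0.2), 'A"': (2.0, 0.5)}
t_end = wait_deadline(10.0, lifetimes, threshold=0.01)
kept = related_alarms(
    10.0,
    [("A0", 10.0), ("A'", 11.1), ('A"', 12.4), ("A_N", 15.0)],
    t_end,
)
```

In this example the late alarm `A_N` falls outside the window and is left for a later diagnosis.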

【００５４】図６は、本発明の好ましい実施形態によ
る、ネットワーク構築のステップ６４（図３の方法の）
の詳細を概略的に示す流れ図である。これは、再帰的な
方法であり、好ましくは、図４に示されたネットワーク
７０などのベイズ・ネットワークの構成に使用される。
この方法は、初期化のステップ１００で、観察されたア
ラームＡ（観察されたＵＳＤアラーム７１など）が時刻
ＴにモジュールＭで受け取られることから始まる。ネッ
トワーク作成のステップ１０２で、診断エンジン４８
が、新しいベイズ・ネットワークＢＮを作成し、アラー
ムＡに対応するノードをＢＮに追加する。障害発見のス
テップ１０４で、エンジンが、Ａに対応する障害Ｆを見
つけるために、障害モデル５０でアラームを検索する。
Ｆに対応するノードを、辺（Ｆ、Ａ）と共にＢＮに追加
する。FIG. 6 is a flow chart that schematically illustrates details of the network construction step 64 (of the method of FIG. 3), in accordance with a preferred embodiment of the present invention. This is a recursive method, preferably used to construct a Bayesian network such as the network 70 shown in FIG. 4. The method begins at an initialization step 100, when an observed alarm A (such as the observed USD alarm 71) is received on module M at time T. At a network creation step 102, the diagnostic engine 48 creates a new Bayesian network BN and adds a node corresponding to alarm A to BN. At a fault finding step 104, the engine looks up the alarm in the fault model 50 to find the fault F corresponding to A. A node corresponding to F is added to BN, together with the edge (F, A).

【００５５】障害状態発見のステップ１０６で、診断エ
ンジン４８が、次に、Ｆを引き起こした可能性がある障
害状態Ｃを見つけるために、障害モデル５０で障害Ｆを
検索する。図４の例からわかるように、どの所与の障害
についても、通常は複数のそのような障害状態がある。
そのような障害状態Ｃのそれぞれについて、診断エンジ
ン４８が、障害状態追加のステップ１０８を実行し、こ
れによって、モジュールＭ上の状態Ｃに対応するノード
がＢＮに追加され、Ｃにつながった可能性がある追加の
障害状態が検索される。ステップ１０８には、再帰ルー
チンが含まれるが、これについては、図７に関して下で
詳細に説明する。このステップでは、各障害状態につな
がる誤動作および誤動作発生に対応するノードおよび辺
も追加される。辺追加のステップ１１０で、Ｆを引き起
こした可能性がある障害状態Ｃのそれぞれについて、対
応する辺（Ｃ、Ｆ）をＢＮに追加する。Ｆを引き起こし
た可能性がある可能な障害状態Ｃのすべてをこの形で処
理した後に、ベイズ・ネットワークが完成する。At a fault condition finding step 106, the diagnostic engine 48 next looks up fault F in the fault model 50 to find the fault conditions C that could have caused F. As the example of FIG. 4 shows, any given fault normally has multiple such fault conditions. For each such fault condition C, the diagnostic engine 48 performs a fault condition adding step 108, in which a node corresponding to condition C on module M is added to BN and additional fault conditions that could have led to C are sought. Step 108 comprises a recursive routine, which is described in detail below with reference to FIG. 7. This step also adds the nodes and edges corresponding to the malfunctions and malfunction occurrences that lead to each fault condition. At an edge adding step 110, for each fault condition C that could have caused F, the corresponding edge (C, F) is added to BN. Once all the possible fault conditions C that could have caused F have been processed in this way, the Bayesian network is complete.
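The top-level construction of FIG. 6 (steps 100 to 110) may be sketched as follows. The toy fault model, the condition names, and the placeholder for step 108 are assumptions introduced for illustration; only the order of the steps follows the description above.

```python
from dataclasses import dataclass, field


@dataclass
class BayesNet:
    nodes: set = field(default_factory=set)
    edges: set = field(default_factory=set)  # directed (cause, effect) pairs


# Hypothetical toy fault model: alarm -> fault, fault -> candidate conditions.
ALARM_TO_FAULT = {"USD_alarm": "USD_fault"}
FAULT_TO_CONDITIONS = {"USD_fault": ["bad_link", "bad_receiver"]}


def add_fault_condition(bn: BayesNet, module: str, condition: str) -> None:
    """Placeholder for step 108: here it only adds the condition node.

    The full routine would also add malfunctions and upstream conditions.
    """
    bn.nodes.add((module, condition))


def build_network(alarm: str, module: str) -> BayesNet:
    bn = BayesNet()                                   # step 102: new network
    bn.nodes.add((module, alarm))                     # node for alarm A
    fault = ALARM_TO_FAULT[alarm]                     # step 104: find fault F
    bn.nodes.add((module, fault))
    bn.edges.add(((module, fault), (module, alarm)))  # edge (F, A)
    for cond in FAULT_TO_CONDITIONS.get(fault, []):   # step 106: conditions C
        add_fault_condition(bn, module, cond)         # step 108
        bn.edges.add(((module, cond), (module, fault)))  # step 110: edge (C, F)
    return bn


bn = build_network("USD_alarm", "M")
```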

【００５６】図７は、本発明の好ましい実施形態によ
る、障害状態追加のステップ１０８で実行されるルーチ
ンの詳細を概略的に示す流れ図である。このルーチン
は、ノード追加のステップ１２０で、モジュールＭの状
態Ｃに対応するノードをＢＮに追加することから始ま
る。局所性検査のステップ１２２で、診断エンジン４８
が、障害モデル５０を検査して、状態Ｃがローカル障害
状態と接続障害状態のどちらであるかを判定する。ロー
カル障害状態の場合、状態Ｃを引き起こした、モジュー
ルＭの誤動作Ｎだけを検査すればよい。誤動作発見のス
テップ１２４で、診断エンジン４８が、障害モデル５０
の可能な誤動作を検索する。誤動作Ｎのそれぞれについ
て、誤動作検査のステップ１２６で、エンジンが、Ｎに
対応するノードがＢＮに既に存在するかどうかを検査す
る。そうでない場合には、ノード追加のステップ１２８
で、ノードＮをＢＮに追加する。その後、辺追加のステ
ップ１２９で、辺（Ｎ、Ｃ）をＢＮに追加する。可能な
誤動作のすべてをＢＮに追加した時に、ステップ１０８
が完了する。FIG. 7 is a flow chart that schematically illustrates details of the routine carried out at the fault condition adding step 108, in accordance with a preferred embodiment of the present invention. The routine begins at a node adding step 120 by adding a node corresponding to condition C on module M to BN. At a locality check step 122, the diagnostic engine 48 checks the fault model 50 to determine whether condition C is a local fault condition or a connection fault condition. For a local fault condition, only the malfunctions N of module M that could have caused condition C need to be examined. At a malfunction finding step 124, the diagnostic engine 48 looks up the possible malfunctions in the fault model 50. For each malfunction N, at a malfunction check step 126, the engine checks whether a node corresponding to N already exists in BN. If not, node N is added to BN at a node adding step 128. Then, at an edge adding step 129, the edge (N, C) is added to BN. When all the possible malfunctions have been added to BN, step 108 is complete.
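The local branch of the routine of FIG. 7 (steps 120 to 129) may be sketched as follows. The node encoding as tuples and the lookup table `malfunctions_for` are assumptions chosen for the example; the point illustrated is that a malfunction node is created only once and then shared by every condition it can cause.

```python
def add_local_fault_condition(nodes: set, edges: set, module: str,
                              condition: str, malfunctions_for: dict) -> None:
    """Add a local fault condition and the malfunctions that can cause it."""
    cond_node = (module, "condition", condition)
    nodes.add(cond_node)                                   # step 120
    # Step 124: look up the possible malfunctions in the (toy) fault model.
    for malfunction in malfunctions_for.get((module, condition), []):
        mal_node = (module, "malfunction", malfunction)
        if mal_node not in nodes:                          # step 126
            nodes.add(mal_node)                            # step 128
        edges.add((mal_node, cond_node))                   # step 129: edge (N, C)


# Hypothetical model: malfunction n2 can cause both conditions on module M.
model = {("M", "c1"): ["n1", "n2"], ("M", "c2"): ["n2"]}
nodes, edges = set(), set()
add_local_fault_condition(nodes, edges, "M", "c1", model)
add_local_fault_condition(nodes, edges, "M", "c2", model)
```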

【００５７】ステップ１２２で、接続障害状態が識別さ
れる時には、扱いがより複雑になる。この場合、モジュ
ール発見のステップ１３０で、診断エンジン４８が、シ
ステム・モデル４４および構成データベース４６でモジ
ュールＭを検索して、Ｍ'からＭへの接続で障害状態Ｃ
が現れる原因になった可能性がある形でＭに接続されて
いる１つまたは複数のモジュールＭ'を見つける。障害
状態追加のステップ１３２で、そのようなモジュール
Ｍ'のそれぞれについて、診断エンジン４８が、状態Ｃ
を引き起こした可能性がある、Ｍ'上およびＭ'につなが
る接続上の障害状態を見つけ、ＢＮに追加する。このス
テップには、図８に関して下で詳細に説明するルーチン
が含まれる。ステップ１３２のルーチンは、ステップ１
０８のルーチンの再帰の一部を形成する。このルーチン
は、障害状態Ｃの出現につながった可能性がある、Ｍ'
上およびその接続上（Ｍ'に接続された他のモジュール
などを含む）の障害状態のすべてに対応するノードおよ
び辺がＢＮに追加されるまで継続する。When a connection fault condition is identified at step 122, the handling is more complex. In this case, at a module finding step 130, the diagnostic engine 48 looks up module M in the system model 44 and the configuration database 46 to find one or more modules M' that are connected to M in such a way that they could have caused fault condition C to appear on the connection from M' to M. At a fault condition adding step 132, for each such module M', the diagnostic engine 48 finds the fault conditions on M' and on the connections leading to M' that could have caused condition C, and adds them to BN. This step comprises a routine that is described in detail below with reference to FIG. 8. The routine of step 132 forms part of the recursion of the routine of step 108. It continues until nodes and edges corresponding to all the fault conditions on M' and on its connections (including other modules connected to M', and so on) that could have led to the appearance of fault condition C have been added to BN.

【００５８】障害状態Ｃにつながる可能な接続障害状態
のすべてを探査した後に、予期される障害の発見のステ
ップ１３４で、診断エンジン４８が、障害モデル５０を
照会して、これらの障害状態が、ステップ１０４で見つ
かった障害Ｆ以外の別の障害Ｆ'につながる可能性があ
るかどうかを判定する。ＥＤＣ障害８２（図４）が、そ
のような障害の例である。障害ノード追加のステップ１
３６で、そのような期待される障害Ｆ'のそれぞれのノ
ードをＢＮに追加する。さらに、Ｆ'によって生成され
る期待されるアラームＡ'に対応するノードを、辺
（Ｃ、Ｆ'）および（Ｆ'、Ａ'）と共にＢＮに追加す
る。他のモジュール上のローカル障害状態に対応する辺
および障害Ｆ'につながった可能性があるモジュールに
関連する接続障害状態に対応する辺が、さらにネットワ
ークに追加される可能性もある。最初のアラームＡに対
する相対的な時間（指定された寿命によって与えられ
る）以内に期待されるアラームＡ'の発生または非発生
が、ステップ６６（図３）でのベイズ・ネットワークに
関する状態確率テーブルの書込に使用される。After all the possible connection fault conditions leading to fault condition C have been explored, the diagnostic engine 48, at an expected fault finding step 134, queries the fault model 50 to determine whether these fault conditions could lead to another fault F', other than the fault F found at step 104. EDC fault 82 (FIG. 4) is an example of such a fault. At a fault node adding step 136, a node for each such expected fault F' is added to BN. In addition, a node corresponding to the expected alarm A' generated by F' is added to BN, together with the edges (C, F') and (F', A'). Further edges may also be added to the network, corresponding to local fault conditions on other modules and to connection fault conditions associated with modules that could have led to fault F'. The occurrence or non-occurrence of the expected alarm A' within a time relative to the first alarm A (given by the specified lifetime) is used in filling in the state probability tables for the Bayesian network at step 66 (FIG. 3).
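Steps 134 and 136 may be sketched as follows. The dictionaries `faults_caused_by` and `alarm_for_fault` stand in for queries to the fault model 50 and are assumptions, as are the single-letter node names.

```python
def add_expected_faults(nodes: set, edges: set, conditions: list,
                        known_fault: str, faults_caused_by: dict,
                        alarm_for_fault: dict) -> list:
    """Add expected faults F' and alarms A'; return the expected alarms."""
    expected_alarms = []
    for cond in conditions:
        for fault in faults_caused_by.get(cond, []):   # step 134
            if fault == known_fault:
                continue                               # F is already in BN
            alarm = alarm_for_fault[fault]
            nodes.update({fault, alarm})               # step 136
            edges.add((cond, fault))                   # edge (C, F')
            edges.add((fault, alarm))                  # edge (F', A')
            if alarm not in expected_alarms:
                expected_alarms.append(alarm)
    return expected_alarms


nodes = {"C", "F", "A"}
edges = {("C", "F"), ("F", "A")}
expected = add_expected_faults(
    nodes, edges, ["C"], "F",
    faults_caused_by={"C": ["F", "F'"]},
    alarm_for_fault={"F": "A", "F'": "A'"},
)
```

The returned list identifies the alarms whose arrival or absence within the lifetime window would then be entered as evidence at step 66.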

【００５９】図８は、本発明の好ましい実施形態によ
る、障害状態追加のステップ１３２で実行されるルーチ
ンの詳細を概略的に示す流れ図である。上で注記したよ
うに、このルーチンは、接続障害状態ＣがＭ'とＭの間
の接続上に現れる可能性がある形でＭに接続されたモジ
ュールＭ'のそれぞれについて実行される。このルーチ
ンは、Ｍ'に接続されたモジュールＭ"についても再帰的
に実行される可能性がある。ローカル障害検査のステッ
プ１４０で、診断エンジン４８が、まず、障害モデル５
０を検査して、Ｍに接続されたＭ'の出力上で状態Ｃを
生じた可能性があるＭ'上のローカル障害状態Ｃ'がある
かどうかを確認する。そのような状態Ｃ'がある場合に
は、ステップ１０８のルーチンに従い、必要な変更を加
えて、診断エンジン４８が、モジュールＭ'上のＣ'に対
応するノードをベイズ・ネットワークＢＮに追加する。
このルーチンは、Ｃ'を引き起こした可能性があるロー
カル誤動作に対応するノードと、適当な辺の、ＢＮへの
追加にもつながる。辺追加のステップ１４２で、辺
（Ｃ'、Ｃ）もＢＮに追加する。FIG. 8 is a flow chart that schematically illustrates details of the routine carried out at the fault condition adding step 132, in accordance with a preferred embodiment of the present invention. As noted above, this routine is carried out for each module M' that is connected to M in such a way that connection fault condition C could appear on the connection between M' and M. The routine may also be carried out recursively for modules M" connected to M'. At a local fault check step 140, the diagnostic engine 48 first checks the fault model 50 to ascertain whether there is a local fault condition C' on M' that could have given rise to condition C on the output of M' that is connected to M. If there is such a condition C', the diagnostic engine 48 adds a node corresponding to C' on module M' to the Bayesian network BN, following the routine of step 108, mutatis mutandis. This routine also leads to the addition to BN of nodes corresponding to local malfunctions that could have caused C', along with the appropriate edges. At an edge adding step 142, the edge (C', C) is also added to BN.

【００６０】ローカル障害状態Ｃ'が見つかった場合で
もそうでない場合でも、Ｃを生じたＣ'を生じた可能性
がある、Ｍ'とＭ'に接続された他のモジュールＭ"との
間の接続障害状態Ｃ"もある場合がある。この情況は、
Ｍ'がＣに伝搬すると言うのと同等である。伝搬のステ
ップ１４４で、診断エンジン４８が、障害モデル５０を
参照することによって、Ｍ'がＣに伝搬するかどうかを
確認する。Ｍ'がＣに伝搬する場合には、入力検査のス
テップ１４６で、診断エンジン４８が、障害モデルを照
会して、接続障害状態Ｃ"が現れた可能性があるＭ'の入
力を判定する。そのような入力のそれぞれについて、
Ｍ'上の接続障害状態Ｃ"をＢＮに追加する。このステッ
プでも、必要な変更を加えて、ステップ１０８のルーチ
ンに従う。辺追加のステップ１４８で、障害状態Ｃ"の
それぞれについて、辺（Ｃ"、Ｃ）をＢＮに追加する。
ここでステップ１３２が終了し、ベイズ・ネットワーク
の構成は、すべての再帰が完了するまでステップ１３４
で継続される。Whether or not a local fault condition C' is found, there may also be a connection fault condition C", on a connection between M' and another module M" connected to M', that could have given rise to C' and hence to C. This situation is equivalent to saying that M' propagates C. At a propagation step 144, the diagnostic engine 48 ascertains, by reference to the fault model 50, whether M' propagates C. If it does, then at an input check step 146 the diagnostic engine 48 queries the fault model to determine the inputs of M' on which the connection fault condition C" could have appeared. For each such input, the connection fault condition C" on M' is added to BN, again following the routine of step 108, mutatis mutandis. At an edge adding step 148, for each fault condition C", the edge (C", C) is added to BN. Step 132 is then finished, and construction of the Bayesian network continues at step 134 until all recursions have been completed.
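The recursion of FIG. 8 across connected modules may be sketched as follows. The sketch simplifies the description above in two ways, both assumptions: the upstream connection fault condition C" is modeled as the same condition type on the upstream module, and a `visited` set is added as a defensive guard against revisiting a node. The switch names echo FIG. 9, but the topology and condition names are hypothetical.

```python
def add_upstream_conditions(edges: set, module: str, condition: str,
                            upstream: dict, local_causes: dict,
                            propagates: set, visited: set = None) -> None:
    """Link a connection fault condition on `module` to its upstream causes."""
    if visited is None:
        visited = set()
    if (module, condition) in visited:
        return
    visited.add((module, condition))
    for src in upstream.get(module, []):           # step 130: modules M'
        # Steps 140 and 142: local conditions C' on M' that could cause C.
        for local in local_causes.get((src, condition), []):
            edges.add(((src, local), (module, condition)))
        if (src, condition) in propagates:         # step 144: M' propagates C
            # Steps 146 and 148: the condition may have entered M' on an input.
            edges.add(((src, condition), (module, condition)))
            add_upstream_conditions(edges, src, condition, upstream,
                                    local_causes, propagates, visited)


upstream = {"sw170": ["sw172"], "sw172": ["sw174"]}
local_causes = {("sw172", "bad_data"): ["tx_port_fault"],
                ("sw174", "bad_data"): ["queue_fault"]}
propagates = {("sw172", "bad_data")}
edges = set()
add_upstream_conditions(edges, "sw170", "bad_data",
                        upstream, local_causes, propagates)
```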

【００６１】通信ネットワーク２２は有限でなければな
らないので、図６ないし８によって例示されたベイズ・
ネットワークを構成する方法は、最終的に必ず停止す
る。しかし、障害伝搬のためにベイズ・ネットワークが
非常に大きくなり、通信ネットワーク全体を表す点まで
大きくなる場合がありえる。そのような情況は、完全に
手におえない情況であり、回避しなければならない。Because the communication network 22 must be finite, the method of constructing the Bayesian network illustrated by FIGS. 6 to 8 is guaranteed to terminate eventually. Because of fault propagation, however, the Bayesian network may grow very large, to the point of representing the entire communication network. Such a situation is completely intractable and must be avoided.

【００６２】したがって、本発明の好ましい実施形態で
は、ＳＡＮなどの交換ネットワークに固有の規則性を利
用することによって、ステップ６４でベイズ・ネットワ
ークの増大を制限する。そのようなネットワークは、一
般に、少数の異なるモジュール・タイプを有し、これら
のタイプが、通常は規則的な構成に配置される。これら
の構造は、ベイズ・ネットワークではテンプレートによ
って表されることが好ましい。所与のテンプレートのす
べてのインスタンスが、所与の障害状態の下で同一の期
待されるアラームを生じる。通信ネットワークの構造に
物理的に存在する、所与のテンプレートの多数のインス
タンスが存在する可能性があるが、テンプレートの特定
のインスタンスが、その期待されるアラームの１つが実
際に観察された時に限って、インスタンス化される、す
なわち、ベイズ・ネットワークに追加されることが好ま
しい。Therefore, in preferred embodiments of the present invention, the growth of the Bayesian network at step 64 is limited by exploiting the regularity inherent in switched networks such as SANs. Such networks generally have a small number of different module types, which are usually arranged in a regular configuration. These structures are preferably represented in the Bayesian network by templates. All instances of a given template give rise to the same expected alarms under a given fault condition. Although many instances of a given template may physically exist in the structure of the communication network, a particular instance of the template is preferably instantiated, that is, added to the Bayesian network, only when one of its expected alarms has actually been observed.

【００６３】図９は、通信ネットワーク１６８と、本発
明の好ましい実施形態に従って診断エンジン４８によっ
て構成された、対応するベイズ・ネットワーク１７５と
の規則的な構造を示すグラフである。この例の通信ネッ
トワーク１６８には、カスケード接続されたスイッチ１
７０、１７２、および１７４が含まれ、スイッチ１７０
は、カスケードの第１層にあり、スイッチ１７２は第２
層、スイッチ１７４は第３層にある。ベイズ・ネットワ
ーク１７５の構成は、スイッチ１７０のポートの１つで
観察されたアラーム１７６に対応するノードから開始さ
れる。図６ないし８の手順に従って、アラーム１７６を
引き起こした責任を負う障害１７８に対応するノード
と、その障害を引き起こしたスイッチ１７０の受信器ポ
ートでの障害状態１８０のノードがベイズ・ネットワー
ク１７５に追加される。この状態は、スイッチの中央キ
ューでの障害状態１８２によって引き起こされた可能性
がある。これらは、スイッチ１７０でのアラーム１７６
を引き起こした可能性があるローカル障害である。FIG. 9 is a graph showing the regular structure of a communication network 168 and of a corresponding Bayesian network 175 constructed by the diagnostic engine 48, in accordance with a preferred embodiment of the present invention. The communication network 168 in this example comprises cascaded switches 170, 172, and 174: switch 170 is in the first tier of the cascade, switches 172 are in the second tier, and switches 174 are in the third tier. Construction of the Bayesian network 175 starts from a node corresponding to an alarm 176 observed at one of the ports of switch 170. Following the procedures of FIGS. 6 to 8, a node corresponding to the fault 178 responsible for causing alarm 176, and a node for the fault condition 180 at the receiver port of switch 170 that caused the fault, are added to the Bayesian network 175. This condition may have been caused by a fault condition 182 in the central queue of the switch. These are the local faults that could have caused alarm 176 at switch 170.

【００６４】アラーム１７６が、スイッチ１７２の１つ
からスイッチ１７０に伝搬した障害によって引き起こさ
れた可能性もある。そのような障害伝搬は、スイッチ１
７２の送信器ポートでの障害状態１８４、スイッチを接
続するケーブルでの障害状態１８６、スイッチ１７２の
受信器ポートでの障害状態１８８、またはスイッチ１７
２の中央キューでの障害状態１９０を含む、一連の障害
状態の１つによって引き起こされる可能性がある。スイ
ッチ１７０の場合と同様に、スイッチ１７２での障害状
態１８８または１９０は、スイッチ１７２の受信器ポー
トでの障害１９２を生じ、期待されるアラーム１９４に
つながる。Alarm 176 may also have been caused by a fault that propagated to switch 170 from one of the switches 172. Such fault propagation can be caused by one of a chain of fault conditions, including a fault condition 184 at the transmitter port of switch 172, a fault condition 186 in the cable connecting the switches, a fault condition 188 at the receiver port of switch 172, or a fault condition 190 in the central queue of switch 172. As in the case of switch 170, fault condition 188 or 190 at switch 172 gives rise to a fault 192 at the receiver port of switch 172, leading to an expected alarm 194.

【００６５】障害状態１８４、１８６、１８８、および
１９０が、障害１９２および期待されるアラーム１９４
と共に、通信ネットワーク１６８内のスイッチの１つに
対応するベイズ・ネットワーク・テンプレートを構成す
る（障害状態につながる可能性がある誤動作および誤動
作発生に対応するノードは、簡単にするためにここでは
省略する）。スイッチ１７２の１つが、アラーム１７６
の適当な時間制限内に期待されるアラーム１９４を発行
した場合には、アラーム１７６と期待されるアラーム１
９４が互いに関連すると仮定する慨然論的基礎がある。
この場合、アラームを発行するスイッチに対応するテン
プレートがインスタンス化される、すなわち、それがベ
イズ・ネットワーク１７５に追加される。期待されるア
ラームが発生しなかった場合には、対応するスイッチ
が、更新された誤動作査定の計算（ステップ６６）に影
響せず、テンプレートを、計算を妥協せずにベイズ・ネ
ットワークから省略することができる。この形で、所与
のアラームに応答して構成されたベイズ・ネットワーク
が、計算的に小さく、扱いやすい状態に保たれる。スイ
ッチ１７２の１つのテンプレートをインスタンス化する
場合には、診断エンジン４８が、第３層のスイッチ１７
４をベイズ・ネットワーク１７５に含める必要があるか
どうかを判定するために、第３層のスイッチ１７４に対
応する期待されるアラームを検討することが好ましい。
しかし、実際には、一般にごく少数のテンプレートをイ
ンスタンス化することだけが必要になる。Fault conditions 184, 186, 188, and 190, together with fault 192 and expected alarm 194, make up a Bayesian network template corresponding to one of the switches in the communication network 168 (the nodes corresponding to the malfunctions and malfunction occurrences that may lead to the fault conditions are omitted here for simplicity). If one of the switches 172 issues expected alarm 194 within the appropriate time limit of alarm 176, there is a probabilistic basis for assuming that alarm 176 and expected alarm 194 are related to one another. In this case, the template corresponding to the switch that issued the alarm is instantiated, that is, added to the Bayesian network 175. If the expected alarm does not occur, the corresponding switch has no effect on the calculation of the updated malfunction assessments (step 66), and the template can be omitted from the Bayesian network without compromising the calculation. In this way, the Bayesian network constructed in response to a given alarm is kept computationally small and manageable. When the template for one of the switches 172 is instantiated, the diagnostic engine 48 preferably considers the expected alarms corresponding to the third-tier switches 174 in order to determine whether they need to be included in the Bayesian network 175. In practice, however, it is generally necessary to instantiate only a very small number of templates.
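Template instantiation as described above may be sketched as follows. The node labels reuse the reference numerals of FIG. 9, but the switch identifiers, the fixed time window `window_s`, and the function names are assumptions; a fuller implementation would derive the window from the alarm lifetimes.

```python
# Nodes of the per-switch template of FIG. 9 (labels are illustrative).
TEMPLATE_NODES = [
    "tx_port_condition_184",
    "cable_condition_186",
    "rx_port_condition_188",
    "central_queue_condition_190",
    "fault_192",
    "expected_alarm_194",
]


def instantiate(switch_id: str) -> set:
    """Create one copy of the template for a specific switch."""
    return {f"{switch_id}:{name}" for name in TEMPLATE_NODES}


def grow_network(bn_nodes: set, alarm_time: float,
                 expected_alarm_times: dict, window_s: float) -> set:
    """Instantiate templates only for switches whose expected alarm arrived.

    expected_alarm_times maps a switch to the arrival time of its expected
    alarm, or to None if that alarm was never received.
    """
    nodes = set(bn_nodes)
    for switch_id, t in expected_alarm_times.items():
        if t is not None and abs(t - alarm_time) <= window_s:
            nodes |= instantiate(switch_id)
    return nodes


grown = grow_network(
    {"sw170:alarm_176"},
    alarm_time=100.0,
    expected_alarm_times={"sw172a": 100.4, "sw172b": None, "sw172c": 130.0},
    window_s=2.0,
)
```

Only `sw172a` contributes nodes here; the other two second-tier switches are omitted, which keeps the network small.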

【００６６】診断ユニット２０を使用するネットワーク
２２の障害診断に関して（発明人のRS/6000 SPシステム
での経験から採用した例を用いて）好ましい実施形態を
説明したが、当業者は、本発明の原理が、他のネットワ
ークおよびシステムの障害の突き止めに同様に適用可能
であることを諒解するであろう。ほとんどの現代の通信
ネットワーク、特にパケット・データ・ネットワーク
は、診断ユニット２０などの診断システムによって使用
することができる障害報告機能および構成機能を有す
る、扱いやすいものである。ネットワークまたはシステ
ムの要素のすべてがモデル化され、これらの要素の間の
データ・フローが非輪状である限り、ベイズ・ネットワ
ークおよびベイズ信頼性理論に基づく診断モデルを、本
発明の原理に基づいて適用することができる。この原理
は、通信ネットワークおよびコンピュータ・ネットワー
ク（およびそのようなネットワークのサブシステム）だ
けではなく、他の種類の電気システムおよび機械システ
ムならびに医療システムおよび金融システムにも適用可
能である。Although preferred embodiments have been described with reference to fault diagnosis of the network 22 using the diagnostic unit 20 (with examples drawn from the inventors' experience with the RS/6000 SP system), those skilled in the art will appreciate that the principles of the present invention are similarly applicable to locating faults in other networks and systems. Most modern communication networks, and packet data networks in particular, are manageable, having fault reporting and configuration functions that can be used by a diagnostic system such as the diagnostic unit 20. As long as all the elements of a network or system are modeled and the data flow between these elements is acyclic, a diagnostic model based on Bayesian networks and Bayesian reliability theory can be applied in accordance with the principles of the present invention. These principles are applicable not only to communication networks and computer networks (and subsystems of such networks), but also to other types of electrical and mechanical systems, as well as to medical and financial systems.
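The acyclicity condition stated above can be checked mechanically before a diagnostic model is applied. The following sketch uses the Python standard library `graphlib` module (Python 3.9 or later); the function name `is_acyclic` and the sample topologies are assumptions.

```python
from graphlib import CycleError, TopologicalSorter


def is_acyclic(data_flow: dict) -> bool:
    """Return True if the data-flow graph has no directed cycle.

    data_flow maps each element to the set of elements that feed data
    into it.
    """
    try:
        TopologicalSorter(data_flow).prepare()
    except CycleError:
        return False
    return True


cascade = {"sw170": {"sw172"}, "sw172": {"sw174"}}
loop = {"a": {"b"}, "b": {"a"}}
cascade_ok = is_acyclic(cascade)
loop_ok = is_acyclic(loop)
```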

【００６７】付録Ａ − 障害モデルＤＴＤ
[0067] Appendix A: Fault Model DTD
<?xml encoding="UTF-8"?>
<!ELEMENT Fault_Model_Information (g_files-list, g_malfunctions-list, g_alarms-list, g_fault-conditions-list, g_fault-condition-groupings)>
<!ELEMENT g_files-list (g_file)+>
<!ELEMENT g_file EMPTY>
<!ATTLIST g_file name CDATA #REQUIRED desc CDATA #IMPLIED>
<!ELEMENT g_malfunctions-list (g_malfunction*)>
<!ELEMENT g_malfunction (g_fault-condition-caused*)>
<!ATTLIST g_malfunction name CDATA #REQUIRED mean CDATA #REQUIRED deviation CDATA #REQUIRED>
<!ELEMENT g_fault-condition-caused EMPTY>
<!ATTLIST g_fault-condition-caused name CDATA #REQUIRED>
<!ELEMENT g_alarms-list (g_alarm*)>
<!ELEMENT g_alarm EMPTY>
<!ATTLIST g_alarm name CDATA #REQUIRED number CDATA #REQUIRED related-fault CDATA #REQUIRED desc CDATA #IMPLIED>
<!ELEMENT g_fault-conditions-list (g_fault-condition*)>
<!ELEMENT g_fault-condition (g_fault-caused*)>
<!ATTLIST g_fault-condition name CDATA #REQUIRED connection-type (Service-Traffic | Data-Traffic | Service-And-Data-Traffic | Sync-Traffic) #REQUIRED>
<!ELEMENT g_fault-caused EMPTY>
<!ATTLIST g_fault-caused name CDATA #REQUIRED on-input (true | false) #REQUIRED locally (true | false) #REQUIRED>
<!ELEMENT g_fault-condition-groupings (g_fault-condition-group*)>
<!ELEMENT g_fault-condition-group (g_member-fault-condition*, g_fault-condition-group*)>
<!ATTLIST g_fault-condition-group name CDATA #REQUIRED>
<!ELEMENT g_member-fault-condition EMPTY>
<!ATTLIST g_member-fault-condition name CDATA #REQUIRED>
<!ELEMENT module (malfunctions, reports, triggers, (barrier-to | propagates), converts?)>
<!ATTLIST module name CDATA #REQUIRED>
<!ELEMENT malfunctions (malfunction*)>
<!ELEMENT malfunction EMPTY>
<!ATTLIST malfunction name CDATA #REQUIRED>
<!ELEMENT reports (data-inputs, sync-inputs, service-inputs, local)>
<!ELEMENT data-inputs (fault*)>
<!ELEMENT sync-inputs (fault*)>
<!ELEMENT service-inputs (fault*)>
<!ELEMENT local (fault*)>
<!ELEMENT fault EMPTY>
<!ATTLIST fault name CDATA #REQUIRED>
<!ELEMENT converts (fault-condition-pair*)>
<!ELEMENT fault-condition-pair EMPTY>
<!ATTLIST fault-condition-pair fault-cond-input CDATA #REQUIRED fault-cond-output CDATA #REQUIRED>
<!ELEMENT triggers (fault-condition*)>
<!ELEMENT fault-condition EMPTY>
<!ATTLIST fault-condition name CDATA #REQUIRED>
<!ELEMENT barrier-to (fault-condition*)>
<!ELEMENT propagates (fault-condition*)>
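To show how a fault model written against the global part of this DTD might be consumed, the following sketch parses a small hypothetical instance document and derives the two lookups used when a Bayesian network is built: alarm to fault, and fault to candidate fault conditions. All element contents (names, numbers, rates) are invented for illustration, and `xml.etree.ElementTree` does not validate against the DTD, so conformance of the sample is by construction only.

```python
import xml.etree.ElementTree as ET

# Hypothetical instance document; every value below is illustrative.
SAMPLE = """\
<Fault_Model_Information>
  <g_files-list><g_file name="switch.xml"/></g_files-list>
  <g_malfunctions-list>
    <g_malfunction name="port_hw_failure" mean="0.001" deviation="0.0005">
      <g_fault-condition-caused name="bad_data"/>
    </g_malfunction>
  </g_malfunctions-list>
  <g_alarms-list>
    <g_alarm name="EDC_alarm" number="1" related-fault="EDC_fault"/>
  </g_alarms-list>
  <g_fault-conditions-list>
    <g_fault-condition name="bad_data" connection-type="Data-Traffic">
      <g_fault-caused name="EDC_fault" on-input="true" locally="false"/>
    </g_fault-condition>
  </g_fault-conditions-list>
  <g_fault-condition-groupings/>
</Fault_Model_Information>
"""

root = ET.fromstring(SAMPLE)

# Alarm name -> name of the fault that the alarm reports.
alarm_to_fault = {a.get("name"): a.get("related-fault")
                  for a in root.iter("g_alarm")}

# Fault name -> fault conditions that can cause it.
fault_to_conditions = {}
for cond in root.iter("g_fault-condition"):
    for caused in cond.iter("g_fault-caused"):
        fault_to_conditions.setdefault(caused.get("name"), []).append(
            cond.get("name"))
```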

【００６８】まとめとして、本発明の構成に関して以下
の事項を開示する。In summary, the following matters are disclosed with regard to the configuration of the present invention.

【００６９】（１）相互リンクされた複数のモジュール
から構成されたシステムの診断のための方法であって、
前記システムから、前記モジュールの１つの障害を示す
アラームを受け取るステップと、前記アラームに応答し
て、前記障害を前記障害につながった可能性がある１つ
または複数の前記モジュールでの誤動作に関連付け、前
記障害の条件つき確率を前記誤動作のそれぞれの確率に
関係付ける、因果ネットワークを構成するステップと、
前記アラームおよび前記因果ネットワークに基づいて、
前記誤動作の前記確率の少なくとも１つを更新するステ
ップと、前記更新された確率に応答して前記アラームの
診断を提案するステップとを含む方法。（２）前記アラームを受け取るステップが、前記システ
ム内の前記複数のモジュールからイベント・レポートを
集めるステップと、前記イベント・レポートから前記ア
ラームを抽出するステップとを含む、上記（１）に記載
の方法。（３）前記イベント・レポートを集めるステップが、前
記システムの構成の変更のレポートを受け取るステップ
を含み、前記因果ネットワークを構成するステップが、
前記変更された構成に基づいて前記因果ネットワークを
構成するステップを含む、上記（２）に記載の方法。（４）前記変更された構成に基づいて前記因果ネットワ
ークを構成するステップが、前記構成が記録されるデー
タベースを維持するステップと、前記因果ネットワーク
の構成に使用するために、前記構成の前記変更の前記レ
ポートに応答して前記データベースを更新するステップ
とを含む、上記（３）に記載の方法。（５）前記アラームを抽出するステップが、前記モジュ
ールの前記１つでの前記障害を示す前記アラームを含
む、相互に近接する時刻に発生するアラームのシーケン
スを抽出するステップを含み、前記確率の前記少なくと
も１つを更新するステップが、前記確率を更新するため
に前記アラームの前記シーケンスを処理するステップを
含む、上記（２）に記載の方法。（６）前記アラームの前記シーケンスを抽出するステッ
プが、前記システムからの前記アラームの受取の際の期
待される遅延に応答して、前記アラームのそれぞれの寿
命を定義するステップと、前記それぞれの寿命に応答し
て前記シーケンスから抽出する前記アラームを選択する
ステップとを含む、上記（５）に記載の方法。（７）抽出する前記アラームを選択するステップが、前
記因果ネットワークがそれに応答して構成された前記モ
ジュールの前記１つでの前記障害を示す前記アラームの
発生の時刻のそれぞれの寿命以内に発生した前記アラー
ムを選択するステップを含む、上記（６）に記載の方
法。（８）前記因果ネットワークを構成するステップが、前
記１つまたは複数の前記モジュールでの前記誤動作の１
つによって引き起こされる期待されるアラームを定義す
るステップを含み、前記アラームの前記シーケンスを処
理するステップが、アラームの前記抽出されたシーケン
ス内の前記期待されるアラームの発生に応答して前記確
率を更新するステップを含む、上記（５）に記載の方
法。（９）前記因果ネットワークを構成するステップが、前
記システム内の前記モジュールのカテゴリおよび前記カ
テゴリ内の前記モジュールでの前記誤動作の１つによっ
て引き起こされる期待されるアラームに対応する前記ネ
ットワーク内のノードのグループを含むテンプレートを
定義するステップと、アラームの前記抽出されたシーケ
ンス内の前記期待されるアラームの発生に応答して前記
因果ネットワーク内で前記テンプレートをインスタンス
化するステップとを含む、上記（５）に記載の方法。（１０）前記相互リンクされた複数のモジュールが、規
則的なパターンで相互リンクされた前記モジュールの所
与の１つの複数のインスタンスを含み、前記因果ネット
ワークを構成するステップが、前記モジュールの前記所
与の１つに対応する前記ネットワーク内のノードのグル
ープを含むテンプレートを定義するステップと、前記ア
ラームに応答して前記１つまたは複数のモジュールに関
して前記テンプレートをインスタンス化するステップと
を含む、上記（１）に記載の方法。（１１）前記テンプレートを定義するステップが、前記
モジュールの前記所与の１つの前記インスタンスの１つ
での前記誤動作の１つによって引き起こされる期待され
るアラームを識別するステップを含み、前記テンプレー
トをインスタンス化するステップが、前記期待されるア
ラームの発生に応答して前記ネットワークに前記テンプ
レートのインスタンスを追加するステップを含む、上記
（１０）に記載の方法。（１２）前記因果ネットワークを構成するステップが、
前記障害が発生した前記モジュールの前記１つでのロー
カル障害状態を識別するステップと、前記ローカル障害
状態に応答して、前記因果ネットワーク内で、前記モジ
ュールの前記１つで発生する前記誤動作の１つに前記障
害をリンクするステップとを含む、上記（１）に記載の
方法。（１３）前記因果ネットワークを構成するステップが、
前記システム内の前記モジュールの第２の１つとの接続
に起因して前記モジュールの第１の１つで発生する第１
障害状態を識別するステップと、前記第１障害状態に応
答して、前記因果ネットワーク内で、前記モジュールの
前記第２の１つで発生する第２障害状態に前記障害をリ
ンクするステップとを含む、上記（１）に記載の方法。（１４）前記障害をリンクするステップが、前記第２障
害状態の可能な原因が、前記モジュールの前記第２の１
つと前記システム内の前記モジュールの第３の１つとの
間のもう１つの接続に起因するかどうかを判定するステ
ップと、前記もう１つの接続に応答して、前記因果ネッ
トワーク内で、前記モジュールの前記第３の１つで発生
する第３障害状態に前記障害をリンクするステップとを
含む、上記（１３）に記載の方法。（１５）前記因果ネットワークを構成するステップが、
前記誤動作の前記それぞれの確率に応答して、前記誤動
作の１つの複数の発生を前記因果ネットワークに追加す
るステップと、前記因果ネットワーク内で前記複数の発
生に前記障害をリンクするステップとを含む、上記
（１）に記載の方法。（１６）前記複数の発生に前記障害をリンクするステッ
プが、前記発生のそれぞれによって引き起こされる１つ
または複数の障害状態を判定するステップと、前記障害
状態の少なくとも一部を前記障害にリンクするステップ
とを含む、上記（１５）に記載の方法。（１７）前記誤動作の前記確率の前記少なくとも１つを
更新するステップが、前記１つまたは複数の前記モジュ
ールの障害の間の平均時間を査定するステップを含む、
上記（１）に記載の方法。（１８）前記誤動作の前記確率が、平均および積率を有
する確率分布に関して定義され、前記確率の前記少なく
とも１つを更新するステップが、前記確率分布の前記平
均および前記積率を再査定するステップを含む、上記
（１）に記載の方法。（１９）前記確率分布が、故障率分布を含み、前記平均
および前記積率を再査定するステップが、ベイズ信頼性
理論モデルを使用して前記故障率分布を更新するステッ
プを含む、上記（１８）に記載の方法。（２０）前記診断を提案するステップが、前記更新され
た確率の１つまたは複数を所定の閾値と比較するステッ
プと、前記確率の前記１つが前記閾値を超える時に診断
アクションを起動するステップとを含む、上記（１）に
記載の方法。（２１）前記診断アクションを起動するステップが、前
記診断について前記システムのユーザに通知するステッ
プを含む、上記（２０）に記載の方法。（２２）前記ユーザに通知するステップが、前記因果ネ
ットワークに基づく前記診断の説明を提供するステップ
を含む、上記（２１）に記載の方法。（２３）前記診断アクションを起動するステップが、前
記誤動作を検証するために診断テストを実行するステッ
プを含み、前記診断テストが、前記閾値を超える前記確
率の前記１つに応答して選択される、上記（２０）に記
載の方法。（２４）前記診断テストの結果に応答して前記因果ネッ
トワークを変更するステップを含む、上記（２３）に記
載の方法。（２５）相互リンクされた複数のモジュールから構成さ
れたシステムの診断のための方法であって、前記モジュ
ールの１つでの障害を前記障害につながった可能性があ
る２つ以上の前記モジュールでの誤動作と関連付け、前
記障害の条件つき確率を前記誤動作のそれぞれの確率分
布に関係付ける因果ネットワークを構成するステップ
と、前記障害を示す前記システムからのアラームに応答
して、前記誤動作の前記確率分布を更新するステップ
と、前記更新された確率分布に応答して前記アラームの
診断を提案するステップとを含む方法。（２６）前記確率分布を更新するステップが、前記２つ
以上の前記モジュールの障害の間の平均時間を査定する
ステップを含む、上記（２５）に記載の方法。（２７）前記確率分布が、平均および積率を有し、前記
確率分布を更新するステップが、前記確率分布の前記平
均および前記積率を再査定するステップを含む、上記
（２５）に記載の方法。（２８）前記確率分布が、故障率分布を含み、前記平均
および前記積率を再査定するステップが、ベイズ信頼性
理論モデルを使用して前記故障率分布を更新するステッ
プを含む、上記（２７）に記載の方法。（２９）前記２つ以上の前記モジュールが、前記障害が
発生した前記モジュールの前記１つを含み、前記因果ネ
ットワークを構成するステップが、前記モジュールの前
記１つでのローカル障害状態を識別するステップと、前
記ローカル障害状態に応答して、前記因果ネットワーク
内で、前記モジュールの前記１つで発生する前記誤動作
の１つに前記障害をリンクするステップとを含む、上記
（２５）に記載の方法。（３０）前記２つ以上の前記モジュールが、第１モジュ
ールおよび第２モジュールを含み、前記因果ネットワー
クを構成するステップが、前記システム内の前記第２モ
ジュールとの接続に起因して前記第１モジュールで発生
する第１障害状態を識別するステップと、前記第１障害
状態に応答して、前記因果ネットワーク内で、前記第２
モジュールで発生する第２障害状態に前記障害をリンク
するステップとを含む、上記（２５）に記載の方法。（３１）前記２つ以上の前記モジュールが、第３モジュ
ールを含み、前記障害をリンクするステップが、前記第
２障害状態の可能な原因が、前記第２モジュールと前記
第３モジュールとの間の前記システム内のもう１つの接
続に起因するかどうかを判定するステップと、前記もう
１つの接続に応答して、前記因果ネットワーク内で、前
記第３モジュールで発生する第３障害状態に前記障害を
リンクするステップとを含む、上記（３０）に記載の方
法。（３２）相互リンクされた複数のモジュールから構成さ
れたシステムの診断のための装置であって、前記装置
が、診断プロセッサを含み、前記診断プロセッサが、前
記システムから、前記モジュールの１つの障害を示すア
ラームを受け取るように結合され、前記診断プロセッサ
が、前記アラームに応答して、前記障害を前記障害につ
ながった可能性がある１つまたは複数の前記モジュール
での誤動作に関連付け、前記障害の条件つき確率を前記
誤動作のそれぞれの確率に関係付ける、因果ネットワー
クを構成し、前記アラームおよび前記因果ネットワーク
に基づいて、前記誤動作の前記確率の少なくとも１つを
更新して、前記更新された確率に応答して前記アラーム
の診断を提案するように配置される、装置。（３３）前記診断プロセッサが、前記システム内の前記
複数のモジュールからイベント・レポートを受け取り、
前記イベント・レポートから前記アラームを抽出するよ
うにリンクされる、上記（３２）に記載の装置。（３４）前記イベント・レポートが、前記システムの構
成の変更のレポートを含み、前記診断プロセッサが、前
記変更された構成に基づいて前記因果ネットワークを構
成するように配置される、上記（３３）に記載の装置。（３５）前記構成が記録されるデータベースを含むメモ
リを含み、前記因果ネットワークの構成に使用するため
に、前記診断プロセッサが、前記構成の前記変更の前記
レポートに応答して前記データベースを更新するように
結合される、上記（３４）に記載の装置。（３６）前記診断プロセッサが、前記モジュールの前記
１つでの前記障害を示す前記アラームを含む、相互に近
接する時刻に発生するアラームのシーケンスを抽出し、
前記確率を更新するために前記アラームの前記シーケン
スを処理するように結合される、上記（３３）に記載の
装置。（３７）それぞれの寿命が、前記システムからの前記ア
ラームの受取の際の期待される遅延に応答して、前記ア
ラームに関して定義され、前記診断プロセッサが、前記
それぞれの寿命に応答して前記シーケンスから抽出する
前記アラームを選択するように配置される、上記（３
６）に記載の装置。（３８）前記診断プロセッサが、前記因果ネットワーク
がそれに応答して構成された前記モジュールの前記１つ
での前記障害を示す前記アラームの発生の時刻のそれぞ
れの寿命以内に発生した前記アラームを選択するように
配置される、上記（３７）に記載の装置。（３９）前記因果ネットワークを構成する際に、前記診
断プロセッサが、前記１つまたは複数の前記モジュール
での前記誤動作の１つによって引き起こされる期待され
るアラームを定義するように配置され、前記診断プロセ
ッサが、さらに、アラームの前記抽出されたシーケンス
内の前記期待されるアラームの発生に応答して前記確率
を更新するように配置される、上記（３６）に記載の装
置。（４０）前記システム内の前記モジュールのカテゴリお
よび前記カテゴリ内の前記モジュールでの前記誤動作の
１つによって引き起こされる期待されるアラームに対応
する前記ネットワーク内のノードのグループを含むテン
プレートが定義され、前記診断プロセッサが、アラーム
の前記抽出されたシーケンス内の前記期待されるアラー
ムの発生に応答して前記因果ネットワーク内で前記テン
プレートをインスタンス化するように配置される、上記
（３６）に記載の装置。（４１）前記相互リンクされた複数のモジュールが、規
則的なパターンで相互リンクされた前記モジュールの所
与の１つの複数のインスタンスを含み、前記モジュール
の前記所与の１つに対応する前記ネットワーク内のノー
ドのグループを含むテンプレートが定義され、前記診断
プロセッサが、前記アラームに応答して１つまたは複数
の前記モジュールに関して前記テンプレートをインスタ
ンス化するように配置される、上記（３２）に記載の装
置。（４２）前記テンプレートが、前記モジュールの前記所
与の１つの前記インスタンスの１つでの前記誤動作の１
つによって引き起こされる期待されるアラームを含み、
前記診断プロセッサが、前記期待されるアラームの発生
に応答して前記ネットワークに前記テンプレートのイン
スタンスを追加することによって前記テンプレートをイ
ンスタンス化するように配置される、上記（４１）に記
載の装置。（４３）前記診断プロセッサが、前記障害が発生した前
記モジュールの前記１つでのローカル障害状態を識別
し、前記ローカル障害状態に応答して、前記因果ネット
ワーク内で、前記モジュールの前記１つで発生する前記
誤動作の１つに前記障害をリンクするように配置され
る、上記（３２）に記載の装置。（４４）前記診断プロセッサが、前記システム内の前記
モジュールの第２の１つとの接続に起因して前記モジュ
ールの第１の１つで発生する第１障害状態を識別し、前
記第１障害状態に応答して、前記因果ネットワーク内
で、前記モジュールの前記第２の１つで発生する第２障
害状態に前記障害をリンクするように配置される、上記
（３２）に記載の装置。（４５）前記診断プロセッサが、前記第２障害状態の可
能な原因が、前記モジュールの前記第２の１つと前記シ
ステム内の前記モジュールの第３の１つとの間のもう１
つの接続に起因するかどうかを判定し、前記もう１つの
接続に応答して、前記因果ネットワーク内で、前記モジ
ュールの前記第３の１つで発生する第３障害状態に前記
障害をリンクするように配置される、上記（４４）に記
載の装置。（４６）前記診断プロセッサが、前記誤動作の前記それ
ぞれの確率に応答して、前記誤動作の１つの複数の発生
を前記因果ネットワークに追加し、前記因果ネットワー
ク内で前記複数の発生に前記障害をリンクするように配
置される、上記（３２）に記載の装置。（４７）前記診断プロセッサが、前記発生のそれぞれに
よって引き起こされる１つまたは複数の障害状態を判定
し、前記障害状態の少なくとも一部を前記障害にリンク
するように配置される、上記（４６）に記載の装置。（４８）前記誤動作の前記確率の前記少なくとも１つ
が、前記１つまたは複数の前記モジュールの障害の間の
平均時間として表される、上記（３２）に記載の装置。（４９）前記誤動作の前記確率が、平均および積率を有
する確率分布に関して定義され、前記診断プロセッサ
が、前記確率分布の前記平均および前記積率を更新する
ように配置される、上記（３２）に記載の装置。（５０）前記確率分布が、故障率分布を含み、前記診断
プロセッサが、ベイズ信頼性理論モデルを使用して前記
故障率分布を更新するように配置される、上記（４９）
に記載の装置。（５１）前記診断プロセッサが、前記更新された確率の
１つまたは複数を所定の閾値と比較し、前記確率の前記
１つが前記閾値を超える時に診断アクションを起動する
ように配置される、上記（３２）に記載の装置。（５２）ユーザ・インターフェースを含み、前記診断プ
ロセッサが、前記ユーザ・インターフェースを介して前
記診断について前記システムのユーザに通知するように
結合される、上記（５１）に記載の装置。（５３）前記診断プロセッサが、前記ユーザ・インター
フェースを介して、前記因果ネットワークに基づく前記
診断の説明を提供するように配置される、上記（５２）
に記載の装置。（５４）前記診断アクションが、前記誤動作を検証する
ために実行される診断テストを含み、前記診断テスト
が、前記閾値を超える前記確率の前記１つに応答して選
択される、上記（５１）に記載の装置。（５５）前記診断プロセッサが、前記診断テストの結果
に応答して前記因果ネットワークを変更するように配置
される、上記（５４）に記載の装置。（５６）相互リンクされた複数のモジュールから構成さ
れたシステムの診断のための装置であって、前記装置
が、診断プロセッサを含み、前記診断プロセッサが、前
記モジュールの１つでの障害を前記障害につながった可
能性がある２つ以上の前記モジュールでの誤動作と関連
付け、前記障害の条件つき確率を前記誤動作のそれぞれ
の確率分布に関係付ける因果ネットワークを構成し、前
記障害を示す前記システムからのアラームに応答して、
前記誤動作の前記確率分布を更新して、前記更新された
確率分布に応答して前記アラームの診断を提案するよう
に配置される、装置。（５７）前記確率分布が、前記２つ以上の前記モジュー
ルの障害の間の平均時間を示す、上記（５６）に記載の
装置。（５８）前記確率分布が、平均および積率を有し、前記
診断プロセッサが、前記アラームに応答して、前記確率
分布の前記平均および前記積率を再査定するように配置
される、上記（５６）に記載の装置。（５９）前記確率分布が、故障率分布を含み、前記診断
プロセッサが、ベイズ信頼性理論モデルを使用して前記
故障率分布を更新するように配置される、上記（５８）
に記載の装置。（６０）前記２つ以上の前記モジュールが、前記障害が
発生した前記モジュールの前記１つを含み、前記診断プ
ロセッサが、前記モジュールの前記１つでのローカル障
害状態を識別し、前記ローカル障害状態に応答して、前
記因果ネットワーク内で、前記モジュールの前記１つで
発生する前記誤動作の１つに前記障害をリンクするよう
に配置される、上記（５６）に記載の装置。（６１）前記２つ以上の前記モジュールが、第１モジュ
ールおよび第２モジュールを含み、前記診断プロセッサ
が、前記システム内の前記第２モジュールとの接続に起
因して前記第１モジュールで発生する第１障害状態を識
別し、前記第１障害状態に応答して、前記因果ネットワ
ーク内で、前記第２モジュールで発生する第２障害状態
に前記障害をリンクするように配置される、上記（５
６）に記載の装置。（６２）前記２つ以上の前記モジュールが、第３モジュ
ールを含み、前記診断プロセッサが、前記第２障害状態
の可能な原因が、前記第２モジュールと前記第３モジュ
ールとの間の前記システム内のもう１つの接続に起因す
るかどうかを判定し、前記もう１つの接続に応答して、
前記因果ネットワーク内で、前記第３モジュールで発生
する第３障害状態に前記障害をリンクするように配置さ
れる、上記（６１）に記載の装置。（６３）相互リンクされた複数のモジュールから構成さ
れたシステムの診断のためのコンピュータ・ソフトウェ
ア製品であって、前記コンピュータ・ソフトウェア製品
が、プログラム命令が保管されたコンピュータ可読媒体
を含み、前記プログラム命令が、コンピュータによって
読み取られた時に、前記コンピュータに、前記システム
から前記モジュールの１つの障害を示すアラームを受け
取ることと、前記アラームに応答して、前記障害を前記
障害につながった可能性がある１つまたは複数の前記モ
ジュールでの誤動作に関連付け、前記障害の条件つき確
率を前記誤動作のそれぞれの確率に関係付ける、因果ネ
ットワークを構成することと、前記アラームおよび前記
因果ネットワークに基づいて、前記誤動作の前記確率の
少なくとも１つを更新して、前記更新された確率に応答
して前記アラームの診断を提案することとを行わせる、
コンピュータ・ソフトウェア製品。（６４）前記プログラム命令が、前記コンピュータに、
前記システム内の前記複数のモジュールからイベント・
レポートを受け取ることと、前記イベント・レポートか
ら前記アラームを抽出することとを行わせる、上記（６
３）に記載のコンピュータ・ソフトウェア製品。（６５）前記イベント・レポートが、前記システムの構
成の変更のレポートを含み、前記プログラム命令が、前
記コンピュータに、前記変更された構成に基づいて前記
因果ネットワークを構成することを行わせる、上記（６
４）に記載のコンピュータ・ソフトウェア製品。（６６）前記プログラム命令が、前記コンピュータに、
前記構成の前記変更の前記レポートに応答して、前記因
果ネットワークの構成に使用するために、前記構成が記
録されるデータベースを更新することを行わせる、上記
（６５）に記載のコンピュータ・ソフトウェア製品。（６７）前記プログラム命令が、前記コンピュータに、
前記モジュールの前記１つでの前記障害を示す前記アラ
ームを含む、相互に近接する時刻に発生するアラームの
シーケンスを抽出することと、前記確率を更新するため
に前記アラームの前記シーケンスを処理することとを行
わせる、上記（６４）に記載のコンピュータ・ソフトウ
ェア製品。（６８）それぞれの寿命が、前記システムからの前記ア
ラームの受取の際の期待される遅延に応答して、前記ア
ラームに関して定義され、前記プログラム命令が、前記
コンピュータに、前記それぞれの寿命に応答して前記シ
ーケンスから抽出する前記アラームを選択することを行
わせる、上記（６７）に記載のコンピュータ・ソフトウ
ェア製品。（６９）前記プログラム命令が、前記コンピュータに、
前記因果ネットワークがそれに応答して構成された前記
モジュールの前記１つでの前記障害を示す前記アラーム
の発生の時刻のそれぞれの寿命以内に発生した前記アラ
ームを選択することを行わせる、上記（６８）に記載の
コンピュータ・ソフトウェア製品。（７０）前記プログラム命令が、前記コンピュータに、
前記因果ネットワークを構成する際に、前記１つまたは
複数の前記モジュールでの前記誤動作の１つによって引
き起こされる期待されるアラームを定義することと、ア
ラームの前記抽出されたシーケンス内の前記期待される
アラームの発生に応答して前記確率を更新することとを
行わせる、上記（６７）に記載のコンピュータ・ソフト
ウェア製品。（７１）前記システム内の前記モジュールのカテゴリお
よび前記カテゴリ内の前記モジュールでの前記誤動作の
１つによって引き起こされる期待されるアラームに対応
する前記ネットワーク内のノードのグループを含むテン
プレートが定義され、前記プログラム命令が、前記コン
ピュータに、アラームの前記抽出されたシーケンス内の
前記期待されるアラームの発生に応答して前記因果ネッ
トワーク内で前記テンプレートをインスタンス化するこ
とを行わせる、上記（６７）に記載のコンピュータ・ソ
フトウェア製品。（７２）前記相互リンクされた複数のモジュールが、規
則的なパターンで相互リンクされた前記モジュールの所
与の１つの複数のインスタンスを含み、前記モジュール
の前記所与の１つに対応する前記ネットワーク内のノー
ドのグループを含むテンプレートが定義され、前記プロ
グラム命令が、前記コンピュータに、前記アラームに応
答して前記モジュールの１つまたは複数に関して前記テ
ンプレートをインスタンス化することを行わせる、上記
（６３）に記載のコンピュータ・ソフトウェア製品。（７３）前記テンプレートが、前記モジュールの前記所
与の１つの前記インスタンスの１つでの前記誤動作の１
つによって引き起こされる期待されるアラームを含み、
前記プログラム命令が、前記コンピュータに、前記期待
されるアラームの発生に応答して前記ネットワークに前
記テンプレートのインスタンスを追加することによって
前記テンプレートをインスタンス化することを行わせ
る、上記（７２）に記載のコンピュータ・ソフトウェア
製品。（７４）前記プログラム命令が、前記コンピュータに、
前記障害が発生した前記モジュールの前記１つでのロー
カル障害状態を識別することと、前記ローカル障害状態
に応答して、前記因果ネットワーク内で、前記モジュー
ルの前記１つで発生する前記誤動作の１つに前記障害を
リンクすることとを行わせる、上記（６３）に記載のコ
ンピュータ・ソフトウェア製品。（７５）前記プログラム命令が、前記コンピュータに、
前記システム内の前記モジュールの第２の１つとの接続
に起因して前記モジュールの第１の１つで発生する第１
障害状態を識別することと、前記第１障害状態に応答し
て、前記因果ネットワーク内で、前記モジュールの前記
第２の１つで発生する第２障害状態に前記障害をリンク
することとを行わせる、上記（６３）に記載のコンピュ
ータ・ソフトウェア製品。（７６）前記プログラム命令が、前記コンピュータに、
前記第２障害状態の可能な原因が、前記モジュールの前
記第２の１つと前記システム内の前記モジュールの第３
の１つとの間のもう１つの接続に起因するかどうかを判
定することと、前記もう１つの接続に応答して、前記因
果ネットワーク内で、前記モジュールの前記第３の１つ
で発生する第３障害状態に前記障害をリンクすることと
を行わせる、上記（７５）に記載のコンピュータ・ソフ
トウェア製品。（７７）前記プログラム命令が、前記コンピュータに、
前記誤動作の前記それぞれの確率に応答して、前記誤動
作の１つの複数の発生を前記因果ネットワークに追加す
ることと、前記因果ネットワーク内で前記複数の発生に
前記障害をリンクすることとを行わせる、上記（６３）
に記載のコンピュータ・ソフトウェア製品。（７８）前記プログラム命令が、前記コンピュータに、
前記発生のそれぞれによって引き起こされる１つまたは
複数の障害状態を判定することと、前記障害状態の少な
くとも一部を前記障害にリンクすることとを行わせる、
上記（７７）に記載のコンピュータ・ソフトウェア製
品。（７９）前記誤動作の前記確率の前記少なくとも１つ
が、前記１つまたは複数の前記モジュールの障害の間の
平均時間として表される、上記（６３）に記載のコンピ
ュータ・ソフトウェア製品。（８０）前記誤動作の前記確率が、平均および積率を有
する確率分布に関して定義され、前記プログラム命令
が、前記コンピュータに、前記確率分布の前記平均およ
び前記積率を更新することを行わせる、上記（６３）に
記載のコンピュータ・ソフトウェア製品。（８１）前記確率分布が、故障率分布を含み、前記プロ
グラム命令が、前記コンピュータに、ベイズ信頼性理論
モデルを使用して前記故障率分布を更新することを行わ
せる、上記（８０）に記載のコンピュータ・ソフトウェ
ア製品。（８２）前記プログラム命令が、前記コンピュータに、
前記更新された確率の１つまたは複数を所定の閾値と比
較することと、前記確率の前記１つが前記閾値を超える
時に診断アクションを起動することとを行わせる、上記
（６３）に記載のコンピュータ・ソフトウェア製品。（８３）前記プログラム命令が、前記コンピュータに、
前記診断について前記システムのユーザに通知すること
を行わせる、上記（８２）に記載のコンピュータ・ソフ
トウェア製品。（８４）前記プログラム命令が、前記コンピュータに、
前記因果ネットワークに基づく前記診断の説明をユーザ
に提供することを行わせる、上記（８３）に記載のコン
ピュータ・ソフトウェア製品。（８５）前記診断アクションが、前記誤動作を検証する
ために実行される診断テストを含み、前記診断テスト
が、前記閾値を超える前記確率の前記１つに応答して選
択される、上記（８２）に記載のコンピュータ・ソフト
ウェア製品。（８６）前記プログラム命令が、前記コンピュータに、
前記診断テストの結果に応答して前記因果ネットワーク
を変更することを行わせる、上記（８５）に記載のコン
ピュータ・ソフトウェア製品。（８７）相互リンクされた複数のモジュールから構成さ
れたシステムの診断のための製品であって、前記製品
が、プログラム命令が保管されたコンピュータ可読媒体
を含み、前記プログラム命令が、コンピュータによって
読み取られた時に、前記コンピュータに、前記モジュー
ルの１つでの障害を前記障害につながった可能性がある
２つ以上の前記モジュールでの誤動作と関連付け、前記
障害の条件つき確率を前記誤動作のそれぞれの確率分布
に関係付ける因果ネットワークを構成することと、前記
障害を示す前記システムからのアラームに応答して、前
記誤動作の前記確率分布を更新して、前記更新された確
率分布に応答して前記アラームの診断を提案することと
を行わせる、製品。（８８）前記確率分布が、前記２つ以上の前記モジュー
ルの障害の間の平均時間を示す、上記（８７）に記載の
製品。（８９）前記確率分布が、平均および積率を有し、前記
プログラム命令が、前記コンピュータに、前記アラーム
に応答して、前記確率分布の前記平均および前記積率を
再査定することを行わせる、上記（８７）に記載の製
品。（９０）前記確率分布が、故障率分布を含み、前記プロ
グラム命令が、前記コンピュータに、ベイズ信頼性理論
モデルを使用して前記故障率分布を更新することを行わ
せる、上記（８９）に記載の製品。（９１）前記２つ以上の前記モジュールが、前記障害が
発生した前記モジュールの前記１つを含み、前記プログ
ラム命令が、前記コンピュータに、前記モジュールの前
記１つでのローカル障害状態を識別することと、前記ロ
ーカル障害状態に応答して、前記因果ネットワーク内
で、前記モジュールの前記１つで発生する前記誤動作の
１つに前記障害をリンクすることとを行わせる、上記
（８７）に記載の製品。（９２）前記２つ以上の前記モジュールが、第１モジュ
ールおよび第２モジュールを含み、前記プログラム命令
が、前記コンピュータに、前記システム内の前記第２モ
ジュールとの接続に起因して前記第１モジュールで発生
する第１障害状態を識別することと、前記第１障害状態
に応答して、前記因果ネットワーク内で、前記第２モジ
ュールで発生する第２障害状態に前記障害をリンクする
こととを行わせる、上記（８７）に記載の製品。（９３）前記２つ以上の前記モジュールが、第３モジュ
ールを含み、前記プログラム命令が、前記コンピュータ
に、前記第２障害状態の可能な原因が、前記第２モジュ
ールと前記第３モジュールとの間の前記システム内のも
う１つの接続に起因するかどうかを判定することと、前
記もう１つの接続に応答して、前記因果ネットワーク内
で、前記第３モジュールで発生する第３障害状態に前記
障害をリンクすることとを行わせる、上記（８７）に記
載の製品。
(1) A method for diagnosing a system composed of a plurality of mutually linked modules, the method including the steps of: receiving from the system an alarm indicating a fault in one of the modules; configuring, in response to the alarm, a causal network that associates the fault with malfunctions in one or more of the modules that may have led to the fault, and that relates a conditional probability of the fault to a respective probability of each of the malfunctions; updating at least one of the probabilities of the malfunctions based on the alarm and the causal network; and proposing a diagnosis of the alarm in response to the updated probability.
(2) The step of receiving the alarm includes the steps of collecting event reports from the plurality of modules in the system and extracting the alarm from the event reports, in the method according to (1) above.
(3) The step of collecting the event reports includes the step of receiving a report of a change in the configuration of the system, and the step of configuring the causal network includes the step of configuring the causal network based on the changed configuration, in the method according to (2) above.
(4) The step of configuring the causal network based on the changed configuration includes the steps of maintaining a database in which the configuration is recorded and, for use in configuring the causal network, updating the database in response to the report of the change in the configuration, in the method according to (3) above.
(5) The step of extracting the alarm includes the step of extracting a sequence of alarms occurring at mutually proximate times, the sequence including the alarm indicating the fault in the one of the modules, and the step of updating at least one of the probabilities includes the step of processing the sequence of alarms in order to update the probability, in the method according to (2) above.
(6) The step of extracting the sequence of alarms includes the steps of defining a respective lifetime for each of the alarms in response to an expected delay in receiving the alarm from the system, and selecting the alarms to be extracted from the sequence in response to the respective lifetimes, in the method according to (5) above.
(7) The step of selecting the alarms to be extracted includes the step of selecting the alarms that occurred within their respective lifetimes of the time of occurrence of the alarm indicating the fault in the one of the modules, in response to which the causal network was configured, in the method according to (6) above.
(8) The step of configuring the causal network includes
the step of defining an expected alarm caused by one of the malfunctions in the one or more modules, and the step of processing the sequence of alarms includes the step of updating the probability in response to an occurrence of the expected alarm in the extracted sequence of alarms, in the method according to (5) above.
(9) The step of configuring the causal network includes the steps of defining a template that contains a group of nodes in the network corresponding to a category of the modules in the system and to an expected alarm caused by one of the malfunctions in a module of that category, and instantiating the template in the causal network in response to an occurrence of the expected alarm in the extracted sequence of alarms, in the method according to (5) above.
(10) The plurality of mutually linked modules include a plurality of instances of a given one of the modules, linked to one another in a regular pattern, and the step of configuring the causal network includes the steps of defining a template that contains a group of nodes in the network corresponding to the given one of the modules, and instantiating the template for one or more of the modules in response to the alarm, in the method according to (1) above.
(11) The step of defining the template includes the step of identifying an expected alarm caused by one of the malfunctions in one of the instances of the given one of the modules, and the step of instantiating the template includes the step of adding an instance of the template to the network in response to an occurrence of the expected alarm, in the method according to (10) above.
(12) The step of configuring the causal network includes
the steps of identifying a local fault condition in the one of the modules in which the fault occurred and, in response to the local fault condition, linking the fault, within the causal network, to one of the malfunctions occurring in that one of the modules, in the method according to (1) above.
(13) The step of configuring the causal network includes the steps of identifying a first fault condition that occurs in a first one of the modules owing to a connection with a second one of the modules in the system and, in response to the first fault condition, linking the fault, within the causal network, to a second fault condition occurring in the second one of the modules, in the method according to (1) above.
(14) The step of linking the fault includes the steps of determining whether a possible cause of the second fault condition is attributable to another connection between the second one of the modules and a third one of the modules in the system and, in response to the other connection, linking the fault, within the causal network, to a third fault condition occurring in the third one of the modules, in the method according to (13) above.
(15) The step of configuring the causal network includes
the steps of adding a plurality of occurrences of one of the malfunctions to the causal network in response to the respective probabilities of the malfunctions, and linking the fault to the plurality of occurrences within the causal network, in the method according to (1) above.
(16) The step of linking the fault to the plurality of occurrences includes the steps of determining one or more fault conditions caused by each of the occurrences, and linking at least some of the fault conditions to the fault, in the method according to (15) above.
(17) The step of updating the at least one of the probabilities of the malfunctions includes the step of assessing a mean time between failures of the one or more of the modules, in the method according to (1) above.
(18) The probabilities of the malfunctions are defined in terms of a probability distribution having a mean and a moment, and the step of updating the at least one of the probabilities includes the step of reassessing the mean and the moment, in the method according to (1) above.
(19) The probability distribution includes a failure rate distribution, and the step of reassessing the mean and the moment includes the step of updating the failure rate distribution using a Bayesian reliability theory model, in the method according to (18) above.
(20) The step of proposing the diagnosis includes the steps of comparing one or more of the updated probabilities with a predetermined threshold, and activating a diagnostic action when the one of the probabilities exceeds the threshold, in the method according to (1) above.
(21) The step of activating the diagnostic action includes the step of notifying a user of the system of the diagnosis, in the method according to (20) above.
(22) The step of notifying the user includes the step of providing an explanation of the diagnosis based on the causal network, in the method according to (21) above.
(23) The step of activating the diagnostic action includes
the step of running a diagnostic test to verify the malfunction, the diagnostic test being selected in response to the one of the probabilities that exceeds the threshold, in the method according to (20) above.
(24) The method according to (23) above, including the step of modifying the causal network in response to a result of the diagnostic test.
(25) A method for diagnosing a system composed of a plurality of mutually linked modules, the method including the steps of: configuring a causal network that associates a fault in one of the modules with malfunctions in two or more of the modules that may have led to the fault, and that relates a conditional probability of the fault to a respective probability distribution of each of the malfunctions; updating the probability distributions of the malfunctions in response to an alarm from the system indicating the fault; and proposing a diagnosis of the alarm in response to the updated probability distributions.
(26) The step of updating the probability distributions includes the step of assessing a mean time between failures of the two or more modules, in the method according to (25) above.
(27) The probability distribution has a mean and a moment, and
the step of updating the probability distribution includes the step of reassessing the mean and the moment, in the method according to (25) above.
(28) The probability distribution includes a failure rate distribution, and the step of reassessing the mean and the moment includes the step of updating the failure rate distribution using a Bayesian reliability theory model, in the method according to (27) above.
(29) The two or more modules include the one of the modules in which the fault occurred, and the step of configuring the causal network includes the steps of identifying a local fault condition in the one of the modules and, in response to the local fault condition, linking the fault, within the causal network, to one of the malfunctions occurring in the one of the modules, in the method according to (25) above.
(30) The two or more modules include a first module and a second module, and the step of configuring the causal network includes the steps of identifying a first fault condition that occurs in the first module owing to a connection with the second module in the system and, in response to the first fault condition, linking the fault, within the causal network, to a second fault condition occurring in the second module, in the method according to (25) above.
(31) The two or more modules include a third module, and the step of linking the fault includes the steps of determining whether a possible cause of the second fault condition is attributable to another connection in the system between the second module and the third module and, in response to the other connection, linking the fault, within the causal network, to a third fault condition occurring in the third module, in the method according to (30) above.
(32) An apparatus for diagnosing a system composed of a plurality of mutually linked modules, the apparatus including a diagnostic processor, wherein the diagnostic processor is coupled to receive from the system an alarm indicating a fault in one of the modules, and is arranged, in response to the alarm, to configure a causal network that associates the fault with malfunctions in one or more of the modules that may have led to the fault and that relates a conditional probability of the fault to a respective probability of each of the malfunctions, to update at least one of the probabilities of the malfunctions based on the alarm and the causal network, and to propose a diagnosis of the alarm in response to the updated probability.
(33) The diagnostic processor is coupled to receive event reports from the plurality of modules in the system and to extract the alarm from the event reports, in the apparatus according to (32) above.
(34) The event reports include a report of a change in the configuration of the system, and the diagnostic processor is arranged to configure the causal network based on the changed configuration, in the apparatus according to (33) above.
(35) The apparatus according to (34) above, including a memory that contains a database in which the configuration is recorded, wherein the diagnostic processor is coupled to update the database in response to the report of the change in the configuration, for use in configuring the causal network.
(36) The diagnostic processor is coupled to extract a sequence of alarms occurring at mutually proximate times, the sequence including the alarm indicating the fault in the one of the modules, and to process the sequence of alarms in order to update the probability, in the apparatus according to (33) above.
(37) A respective lifetime is defined for each of the alarms in response to an expected delay in receiving the alarm from the system, and the diagnostic processor is arranged to select the alarms to be extracted from the sequence in response to the respective lifetimes, in the apparatus according to (36) above.
(38) The diagnostic processor is arranged to select the alarms that occurred within their respective lifetimes of the time of occurrence of the alarm indicating the fault in the one of the modules, in response to which the causal network was configured, in the apparatus according to (37) above.
(39) When configuring the causal network, the diagnostic
processor is arranged to define an expected alarm caused by one of the malfunctions in the one or more modules, and the diagnostic processor is further arranged to update the probability in response to an occurrence of the expected alarm in the extracted sequence of alarms, in the apparatus according to (36) above.
(40) A template is defined that contains a group of nodes in the network corresponding to a category of the modules in the system and to an expected alarm caused by one of the malfunctions in a module of that category, and the diagnostic processor is arranged to instantiate the template in the causal network in response to an occurrence of the expected alarm in the extracted sequence of alarms, in the apparatus according to (36) above.
(41) The plurality of mutually linked modules include a plurality of instances of a given one of the modules, linked to one another in a regular pattern; a template is defined that contains a group of nodes in the network corresponding to the given one of the modules; and the diagnostic processor is arranged to instantiate the template for one or more of the modules in response to the alarm, in the apparatus according to (32) above.
(42) The template includes an expected alarm caused by one of the malfunctions in one of the instances of the given one of the modules, and the diagnostic processor is arranged to instantiate the template by adding an instance of the template to the network in response to an occurrence of the expected alarm, in the apparatus according to (41) above.
(43) The diagnostic processor is arranged to identify a local fault condition in the one of the modules in which the fault occurred and, in response to the local fault condition, to link the fault, within the causal network, to one of the malfunctions occurring in the one of the modules, in the apparatus according to (32) above.
(44) The diagnostic processor is arranged to identify a first fault condition that occurs in a first one of the modules owing to a connection with a second one of the modules in the system and, in response to the first fault condition, to link the fault, within the causal network, to a second fault condition occurring in the second one of the modules, in the apparatus according to (32) above.
(45) The diagnostic processor is arranged to determine whether a possible cause of the second fault condition is attributable to another connection between the second one of the modules and a third one of the modules in the system and, in response to the other connection, to link the fault, within the causal network, to a third fault condition occurring in the third one of the modules, in the apparatus according to (44) above.
(46) The diagnostic processor is arranged to add a plurality of occurrences of one of the malfunctions to the causal network in response to the respective probabilities of the malfunctions, and to link the fault to the plurality of occurrences within the causal network, in the apparatus according to (32) above.
(47) The diagnostic processor is arranged to determine one or more fault conditions caused by each of the occurrences, and to link at least some of the fault conditions to the fault, in the apparatus according to (46) above.
(48) The at least one of the probabilities of the malfunctions is expressed as a mean time between failures of the one or more of the modules, in the apparatus according to (32) above.
(49) The probabilities of the malfunctions are defined in terms of a probability distribution having a mean and a moment, and the diagnostic processor is arranged to update the mean and the moment of the probability distribution, in the apparatus according to (32) above.
(50) The probability distribution includes a failure rate distribution, and the diagnostic
processor is arranged to update the failure rate distribution using a Bayesian reliability theory model, in the apparatus according to (49) above.
(51) The diagnostic processor is arranged to compare one or more of the updated probabilities with a predetermined threshold, and to activate a diagnostic action when the one of the probabilities exceeds the threshold, in the apparatus according to (32) above.
(52) The apparatus according to (51) above, including a user interface, wherein the diagnostic processor is coupled to notify a user of the system of the diagnosis via the user interface.
(53) The diagnostic processor is arranged to provide, via the user interface, an explanation of the diagnosis based on the causal network, in the apparatus according to (52) above.
(54) The diagnostic action includes a diagnostic test that is run to verify the malfunction, the diagnostic test being selected in response to the one of the probabilities that exceeds the threshold, in the apparatus according to (51) above.
(55) The diagnostic processor is arranged to modify the causal network in response to a result of the diagnostic test, in the apparatus according to (54) above.
(56) An apparatus for diagnosing a system composed of a plurality of mutually linked modules, the apparatus including a diagnostic processor, wherein the diagnostic processor is arranged to configure a causal network that associates a fault in one of the modules with malfunctions in two or more of the modules that may have led to the fault and that relates a conditional probability of the fault to a respective probability distribution of each of the malfunctions, to update the probability distributions of the malfunctions in response to an alarm from the system indicating the fault, and to propose a diagnosis of the alarm in response to the updated probability distributions.
(57) The probability distribution indicates a mean time between failures of the two or more modules, in the apparatus according to (56) above.
(58) The probability distribution has a mean and a moment, and the diagnostic processor is arranged to reassess the mean and the moment of the probability distribution in response to the alarm, in the apparatus according to (56) above.
(59) The probability distribution includes a failure rate distribution, and the diagnostic
processor is arranged to update the failure rate distribution using a Bayesian reliability theory model, in the apparatus according to (58) above.
(60) The two or more modules include the one of the modules in which the fault occurred, and the diagnostic processor is arranged to identify a local fault condition in the one of the modules and, in response to the local fault condition, to link the fault, within the causal network, to one of the malfunctions occurring in the one of the modules, in the apparatus according to (56) above.
(61) The two or more modules include a first module and a second module, and the diagnostic processor is arranged to identify a first fault condition that occurs in the first module owing to a connection with the second module in the system and, in response to the first fault condition, to link the fault, within the causal network, to a second fault condition occurring in the second module, in the apparatus according to (56) above.
(62) The two or more modules include a third module, and the diagnostic processor is arranged to determine whether a possible cause of the second fault condition is attributable to another connection in the system between the second module and the third module and, in response to the other connection, to link the fault, within the causal network, to a third fault condition occurring in the third module, in the apparatus according to (61) above.
(63) A computer software product for diagnosing a system composed of a plurality of mutually linked modules, the computer software product including a computer-readable medium in which program instructions are stored, wherein the program instructions, when read by a computer, cause the computer to: receive from the system an alarm indicating a fault in one of the modules; configure, in response to the alarm, a causal network that associates the fault with malfunctions in one or more of the modules that may have led to the fault, and that relates a conditional probability of the fault to a respective probability of each of the malfunctions; update at least one of the probabilities of the malfunctions based on the alarm and the causal network; and propose a diagnosis of the alarm in response to the updated probability.
(64) The program instructions cause the computer to
receive event reports from the plurality of modules in the system and extract the alarm from the event reports, in the computer software product according to (63) above.
(65) The event reports include a report of a change in the configuration of the system, and the program instructions cause the computer to configure the causal network based on the changed configuration, in the computer software product according to (64) above.
(66) The program instructions cause the computer to update a database in which the configuration is recorded, in response to the report of the change in the configuration, for use in configuring the causal network, in the computer software product according to (65) above.
(67) The program instructions cause the computer to
extract a sequence of alarms occurring at mutually proximate times, the sequence including the alarm indicating the fault in the one of the modules, and process the sequence of alarms in order to update the probability, in the computer software product according to (64) above.
(68) A respective lifetime is defined for each of the alarms in response to an expected delay in receiving the alarm from the system, and the program instructions cause the computer to select the alarms to be extracted from the sequence in response to the respective lifetimes, in the computer software product according to (67) above.
(69) The program instructions cause the computer to select the alarms that occurred within their respective lifetimes of the time of occurrence of the alarm indicating the fault in the one of the modules, in response to which the causal network was configured, in the computer software product according to (68) above.
(70) The program instructions cause the computer to
define, when configuring the causal network, an expected alarm caused by one of the malfunctions in the one or more modules, and update the probability in response to an occurrence of the expected alarm in the extracted sequence of alarms, in the computer software product according to (67) above.
(71) A template is defined that contains a group of nodes in the network corresponding to a category of the modules in the system and to an expected alarm caused by one of the malfunctions in a module of that category, and the program instructions cause the computer to instantiate the template in the causal network in response to an occurrence of the expected alarm in the extracted sequence of alarms, in the computer software product according to (67) above.
(72) The plurality of mutually linked modules include a plurality of instances of a given one of the modules, linked to one another in a regular pattern; a template is defined that contains a group of nodes in the network corresponding to the given one of the modules; and the program instructions cause the computer to instantiate the template for one or more of the modules in response to the alarm, in the computer software product according to (63) above.
(73) The template includes an expected alarm caused by one of the malfunctions in one of the instances of the given one of the modules, and the program instructions cause the computer to instantiate the template by adding an instance of the template to the network in response to an occurrence of the expected alarm, in the computer software product according to (72) above.
(74) The program instructions cause the computer to
identify a local fault condition in the one of the modules in which the fault occurred and, in response to the local fault condition, link the fault, within the causal network, to one of the malfunctions occurring in the one of the modules, in the computer software product according to (63) above.
(75) The program instructions cause the computer to identify a first fault condition that occurs in a first one of the modules owing to a connection with a second one of the modules in the system and, in response to the first fault condition, link the fault, within the causal network, to a second fault condition occurring in the second one of the modules, in the computer software product according to (63) above.
(76) The program instructions cause the computer to
determine whether a possible cause of the second fault condition is attributable to another connection between the second one of the modules and a third one of the modules in the system and, in response to the other connection, link the fault, within the causal network, to a third fault condition occurring in the third one of the modules, in the computer software product according to (75) above.
(77) The program instructions cause the computer to add a plurality of occurrences of one of the malfunctions to the causal network in response to the respective probabilities of the malfunctions, and link the fault to the plurality of occurrences within the causal network, in the computer software product according to (63) above.
(78) The program instructions cause the computer to
One or each caused by each of the above occurrences
Determining multiple fault conditions and
Linking at least part of the disorder
Made by the computer software described in (77) above.
Goods. (79) The at least one of the probabilities of the malfunction.
During a failure of the one or more of the modules
The computer according to (63) above, expressed as an average time.
Computer software products. (80) The probability of the malfunction has an average and a product moment.
The program instructions defined with respect to a probability distribution
Of the probability distribution to the computer.
And the above-mentioned product ratio are updated.
The listed computer software product. (81) The probability distribution includes a failure rate distribution,
Gram command tells the computer that Bayesian reliability theory
Done updating the failure rate distribution using a model
The computer software according to (80) above.
A product. (82) The program command causes the computer to
Ratio one or more of the updated probabilities to a predetermined threshold
And the one of the probabilities exceeds the threshold
Sometimes triggering diagnostic actions and doing the above
The computer software product according to (63). (83) The program command causes the computer to
Notifying users of the system about the diagnosis
The computer software according to (82) above.
Software products. (84) The program command causes the computer to
User explaining the diagnosis based on the causal network
The computer according to (83) above, which is configured to
Computer software products. (85) The diagnostic action verifies the malfunction.
Including a diagnostic test performed for
Is selected in response to the one of the probabilities above the threshold.
Computer software according to (82) above, which is selected
Clothing products. (86) The program command causes the computer to
The causal network in response to the results of the diagnostic test
The computer according to (85) above, wherein
Computer software products. (87) Consists of multiple modules linked together
For diagnosing a defective system, said product
Is a computer-readable medium on which the program instructions are stored.
Including the program instructions by a computer
When read, the computer
May have led to a disability in one of the
Associated with a malfunction in two or more of the modules,
The conditional probability of failure is defined as the probability distribution of each of the malfunctions.
Configuring a causal network relating to
In response to an alarm from the system indicating a failure,
Update the probability distribution of malfunctions to
Proposing a diagnosis of said alarm in response to a rate distribution
Let the product do the work. (88) The probability distribution is the two or more modules.
(87) above, showing the mean time between disorders
Product. (89) The probability distribution has a mean and a product moment, and
Program instructions cause the computer to generate the alarm
In response to the mean of the probability distribution and the product moment
The product according to (87) above, which allows reassessment to be performed.
Goods. (90) The probability distribution includes a failure rate distribution,
Gram command tells the computer that Bayesian reliability theory
Done updating the failure rate distribution using a model
The product according to (89) above. (91) The two or more modules are
Including the one of the generated modules,
RAM instructions to the computer before the module
Identifying a local fault condition in one of the
Within the causal network in response to
Of the malfunction that occurs in the one of the modules
Linking the faults to one, said
The product according to (87). (92) The two or more modules are combined into a first module.
And a second module, the program instructions
To the second computer in the system.
Occurs in the first module due to the connection with the module
Identifying a first fault condition that
In response to the second causal network in the causal network.
Link the fault to a second fault condition that occurs in the tool
The product according to (87), which is used to do the above. (93) The two or more modules are combined into a third module.
And the program instructions include the computer
In addition, the possible causes of the second fault condition are
In the system between the module and the third module.
Determining whether it is due to one connection
In response to another connection, within the causal network
Then, in the third failure state that occurs in the third module,
As described in (87) above, which causes linking of obstacles.
Listed products.
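Items (79) to (81) and (88) to (90) above refer to updating the mean and moment of a failure rate distribution with a Bayesian reliability theory model, without giving formulas at this point. The following is a minimal sketch under an assumption that is not stated in the text: the failure rate is given a gamma distribution and failures are counted over an observation period, so the update is the standard gamma-Poisson conjugate update. The class name `FailureRateBelief`, its fields, and the numbers are all hypothetical.

```python
from dataclasses import dataclass


@dataclass(frozen=True)
class FailureRateBelief:
    """Gamma(shape, rate) belief about a failure rate, in failures per hour.

    The gamma form and both parameter names are assumptions of this sketch.
    """
    shape: float  # acts like a count of failures seen so far
    rate: float   # acts like the hours of operation observed so far

    @property
    def mean(self) -> float:
        return self.shape / self.rate

    @property
    def second_moment(self) -> float:
        # E[r^2] = variance + mean^2 = shape * (shape + 1) / rate^2
        return self.shape * (self.shape + 1.0) / self.rate ** 2

    @property
    def mtbf_at_mean_rate(self) -> float:
        # Mean time between failures if the rate equals its current mean.
        return self.rate / self.shape

    def updated(self, failures: int, hours: float) -> "FailureRateBelief":
        # Conjugate gamma-Poisson update: add observed failures and exposure.
        return FailureRateBelief(self.shape + failures, self.rate + hours)


# Hypothetical numbers: 2 failures in 1000 hours, then 1 more in 500 hours.
prior = FailureRateBelief(shape=2.0, rate=1000.0)
posterior = prior.updated(failures=1, hours=500.0)
print(posterior.mean, posterior.second_moment, posterior.mtbf_at_mean_rate)
```

Keeping only two numbers per malfunction makes each update a constant-time operation, which is one plausible reason to track a mean and a moment rather than a full distribution.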

[Brief description of drawings]

FIG. 1 is a block diagram that schematically illustrates a manageable computer network having a model-based diagnostic unit according to a preferred embodiment of the present invention.

FIG. 2 is a block diagram that schematically illustrates details of the diagnostic unit of FIG. 1 according to a preferred embodiment of the present invention.

FIG. 3 is a flow chart that schematically illustrates a method for network diagnosis according to a preferred embodiment of the present invention.

FIG. 4 is a graph showing an exemplary Bayesian network configured in response to an alarm in a communication network according to a preferred embodiment of the present invention.

FIG. 5 is a timing diagram that schematically illustrates a method of processing a sequence of alarms according to a preferred embodiment of the present invention.

FIG. 6 is a flow chart that schematically illustrates a method of configuring a Bayesian network in response to an alarm according to a preferred embodiment of the present invention.

FIG. 7 is a flow chart that schematically illustrates a method of adding a fault condition to a Bayesian network configured according to the method of FIG. 6.

FIG. 8 is a flow chart that schematically illustrates a method of adding a fault condition to a Bayesian network configured according to the method of FIG. 6.

FIG. 9 is a graph showing a method of constructing a Bayesian network that exploits the regularity of the communication network being modeled, according to a preferred embodiment of the present invention.
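The alarm-sequence processing that FIG. 5 depicts is recited in claims 5 to 7 as selecting alarms that fall within their respective lifetimes of the alarm that triggered construction of the network. A minimal sketch of that selection follows. The `Alarm` type, its field names, the sample alarm names, and the symmetric time window are assumptions made for illustration and are not taken from the specification.

```python
from dataclasses import dataclass
from typing import List


@dataclass(frozen=True)
class Alarm:
    name: str
    time: float      # time of occurrence, in seconds
    lifetime: float  # hypothetical lifetime, derived from expected delivery delay


def select_sequence(trigger: Alarm, candidates: List[Alarm]) -> List[Alarm]:
    """Keep the alarms that occurred within their own lifetime of the trigger.

    Assumption: the window is symmetric around the trigger time.
    """
    return [a for a in candidates if abs(a.time - trigger.time) <= a.lifetime]


trigger = Alarm("link_down", time=100.0, lifetime=5.0)
candidates = [
    Alarm("crc_errors", time=98.0, lifetime=5.0),   # 2 s away, lifetime 5 s: kept
    Alarm("fan_warning", time=90.0, lifetime=5.0),  # 10 s away, lifetime 5 s: dropped
    Alarm("port_reset", time=102.0, lifetime=3.0),  # 2 s away, lifetime 3 s: kept
]
selected = select_sequence(trigger, candidates)
print([a.name for a in selected])
```

A per-alarm lifetime, rather than one global window, lets slowly delivered alarm types still be grouped with the event that caused them.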

[Explanation of symbols]

20 Diagnostic unit
40 Event formatter and merger
42 Configuration tracker
44 System model
46 Configuration database
48 Diagnostic engine
50 Fault model
52 Recommendation and explanation generator
54 User interface
60 Step of receiving an alarm
62 Step of combining the alarm with other alarms in a sequence
64 Step of building a Bayesian network for the alarm sequence using existing malfunction rate assessments
66 Step of updating the rate assessments for the malfunctions in the network
68 Step of making recommendations based on the malfunction rates
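Steps 64 to 68 above can be illustrated with a small numerical sketch. It is not the diagnostic engine of the specification: it assumes that exactly one candidate malfunction caused the alarm (a single-fault simplification), applies Bayes' rule directly instead of building a full network, and uses hypothetical malfunction names, probabilities, and threshold.

```python
from typing import Dict, List, Tuple


def diagnose(
    priors: Dict[str, float],
    likelihoods: Dict[str, float],
    threshold: float,
) -> Tuple[Dict[str, float], List[str]]:
    """Update malfunction probabilities given one alarm and pick recommendations.

    priors[m]      : probability of malfunction m before the alarm
    likelihoods[m] : conditional probability of the alarm given malfunction m
    Assumption: exactly one of the listed malfunctions caused the alarm.
    """
    joint = {m: priors[m] * likelihoods[m] for m in priors}
    total = sum(joint.values())
    if total == 0.0:
        raise ValueError("no listed malfunction can explain the alarm")
    posterior = {m: p / total for m, p in joint.items()}
    # Recommend the malfunctions whose updated probability exceeds the threshold.
    recommended = sorted(
        (m for m, p in posterior.items() if p > threshold),
        key=lambda m: -posterior[m],
    )
    return posterior, recommended


posterior, recommended = diagnose(
    priors={"port_failure": 0.02, "cable_failure": 0.01},
    likelihoods={"port_failure": 0.9, "cable_failure": 0.2},
    threshold=0.5,
)
print(posterior, recommended)
```

With these numbers the port failure carries most of the updated probability, so it is the only malfunction that crosses the threshold and would be reported to the user.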

Continuation of the front page
(51) Int.Cl.7, identification code, FI, theme code (reference): H04L 29/14, H04L 13/00 313
(72) Inventor: Igor Chirashnya, 34A Simkin Street, Haifa, Israel
(72) Inventor: Lee Shalev, 819/5 Derek Sarah Street, Zikhron Ya'akov, Israel
(72) Inventor: Kirill Shoikhet, 11 Azrai Yakov Street, Haifa, Israel
F-terms (reference): 5B048 AA18 CC15 DD11 FF02; 5B089 GA21 GB02 HA10 JB17 KA12; 5K030 GA12 JA10 MA01 MC06 MC08; 5K035 BB04 DD01 FF01 HH07 JJ01

Claims

[Claims]

1. A method for diagnosing a system composed of a plurality of interconnected modules, the method comprising: receiving from the system an alarm indicating a failure in one of the modules; configuring, in response to the alarm, a causal network that associates the failure with malfunctions in one or more of the modules that may have led to the failure, and that relates a conditional probability of the failure to respective probabilities of the malfunctions; updating at least one of the probabilities of the malfunctions based on the alarm and the causal network; and proposing a diagnosis of the alarm in response to the updated probabilities.

2. The method of claim 1, wherein receiving the alarm comprises collecting event reports from the plurality of modules in the system and extracting the alarm from the event reports.

3. The method of claim 2, wherein collecting the event reports includes receiving a report of a change in the configuration of the system, and configuring the causal network includes configuring the causal network based on the changed configuration.

4. The method of claim 3, wherein configuring the causal network based on the changed configuration includes maintaining a database in which the configuration is recorded, for use in configuring the causal network, and updating the database in response to the report of the change in the configuration.

5. The method of claim 2, wherein extracting the alarm comprises extracting a sequence of alarms occurring at times close to one another, including the alarm indicating the failure in the one of the modules, and updating at least one of the probabilities comprises processing the sequence of alarms to update the probabilities.

6. The method of claim 5, wherein extracting the sequence of alarms includes defining respective lifetimes of the alarms in response to expected delays in receipt of the alarms from the system, and selecting the alarms to extract for the sequence in response to the respective lifetimes.

7. The method of claim 6, wherein selecting the alarms to extract includes selecting the alarms that occurred within their respective lifetimes of the time of occurrence of the alarm indicating the failure in the one of the modules, in response to which the causal network is configured.

8. The method of claim 5, wherein configuring the causal network comprises defining an expected alarm caused by one of the malfunctions in the one or more modules, and processing the sequence of alarms comprises updating the probabilities in response to the occurrence of the expected alarm in the extracted sequence of alarms.

9. The method of claim 5, wherein configuring the causal network comprises defining a template containing a group of nodes in the network that corresponds to a category of the modules in the system and to an expected alarm caused by one of the malfunctions in the modules in the category, and instantiating the template in the causal network in response to the occurrence of the expected alarm in the extracted sequence of alarms.

10. The method of claim 1, wherein the interconnected modules include multiple instances of a given one of the modules interconnected in a regular pattern, and configuring the causal network comprises defining a template containing a group of nodes in the network corresponding to the given one of the modules, and instantiating the template for the one or more modules in response to the alarm.

11. The method of claim 10, wherein defining the template comprises identifying an expected alarm caused by one of the malfunctions in one of the instances of the given one of the modules, and instantiating the template comprises adding an instance of the template to the network in response to occurrence of the expected alarm.

12. The method of claim 1, wherein configuring the causal network comprises identifying a local fault condition in the one of the modules that has failed, and, in response to the local fault condition, linking the failure, within the causal network, to one of the malfunctions occurring in the one of the modules.

13. The method of claim 1, wherein configuring the causal network comprises identifying a first fault condition occurring in a first one of the modules due to a connection with a second one of the modules in the system, and, in response to the first fault condition, linking the failure, within the causal network, to a second fault condition occurring in the second one of the modules.

14. The method of claim 13, wherein linking the failure comprises determining whether a possible cause of the second fault condition is due to another connection between the second one of the modules and a third one of the modules in the system, and, in response to the other connection, linking the failure, within the causal network, to a third fault condition occurring in the third one of the modules.

15. The method of claim 1, wherein configuring the causal network comprises adding multiple occurrences of one or more of the malfunctions to the causal network in response to the respective probabilities of the malfunctions, and linking the failure to the multiple occurrences within the causal network.

16. The method of claim 15, wherein linking the failure to the multiple occurrences comprises determining one or more fault conditions caused by each of the occurrences, and linking at least some of the fault conditions to the failure.

17. The method of claim 1, wherein updating the at least one of the probabilities of the malfunction comprises assessing a mean time between failures of the one or more modules.

18. The method of claim 1, wherein the probability of the malfunction is defined with respect to a probability distribution having a mean and a product moment, and updating the at least one of the probabilities comprises reassessing the mean and the product moment of the probability distribution.

19. The method of claim 18, wherein the probability distribution comprises a failure rate distribution, and reassessing the mean and the product moment comprises updating the failure rate distribution using a Bayesian reliability theory model.

20. The method of claim 1, wherein proposing the diagnosis comprises comparing one or more of the updated probabilities with a predetermined threshold, and invoking a diagnostic action when the one of the probabilities exceeds the threshold.

21. The method of claim 20, wherein invoking the diagnostic action comprises notifying a user of the system about the diagnosis.

22. The method of claim 21, wherein informing the user comprises providing a description of the diagnosis based on the causal network.

23. The method of claim 20, wherein invoking the diagnostic action includes performing a diagnostic test to verify the malfunction, the diagnostic test being selected in response to the one of the probabilities that exceeds the threshold.

24. The method of claim 23, further comprising modifying the causal network in response to the results of the diagnostic test.

25. A method for diagnosing a system composed of a plurality of interconnected modules, the method comprising: configuring a causal network that associates a failure in one of the modules with malfunctions in two or more of the modules that may have led to the failure, and that relates a conditional probability of the failure to respective probability distributions of the malfunctions; updating the probability distributions in response to an alarm from the system indicating the failure; and proposing a diagnosis of the alarm in response to the updated probability distributions.

26. The method of claim 25, wherein updating the probability distribution comprises assessing an average time between failures of the two or more modules.

27. The method of claim 25, wherein the probability distribution has a mean and a product moment, and updating the probability distribution comprises reassessing the mean and the product moment of the probability distribution.

28. The method of claim 27, wherein the probability distribution comprises a failure rate distribution, and reassessing the mean and the product moment comprises updating the failure rate distribution using a Bayesian reliability theory model.

29. The method of claim 25, wherein the two or more of the modules include the one of the modules that has failed, and configuring the causal network comprises identifying a local fault condition in the one of the modules, and, in response to the local fault condition, linking the failure, within the causal network, to one of the malfunctions occurring in the one of the modules.

30. The method of claim 25, wherein the two or more of the modules include a first module and a second module, and configuring the causal network comprises identifying a first fault condition occurring in the first module due to a connection with the second module in the system, and, in response to the first fault condition, linking the failure, within the causal network, to a second fault condition occurring in the second module.

31. The method of claim 30, wherein the two or more of the modules include a third module, and linking the failure comprises determining whether a possible cause of the second fault condition is due to another connection in the system between the second module and the third module, and, in response to the other connection, linking the failure, within the causal network, to a third fault condition occurring in the third module.

32. An apparatus for diagnosing a system composed of a plurality of interconnected modules, the apparatus including a diagnostic processor that is coupled to receive from the system an alarm indicating a failure in one of the modules, and that is arranged to configure, in response to the alarm, a causal network that associates the failure with malfunctions in one or more of the modules that may have led to the failure and that relates a conditional probability of the failure to respective probabilities of the malfunctions, to update at least one of the probabilities of the malfunctions based on the alarm and the causal network, and to propose a diagnosis of the alarm in response to the updated probabilities.

33. The apparatus of claim 32, wherein the diagnostic processor is linked to receive event reports from the plurality of modules in the system and extract the alarms from the event reports.

34. The apparatus of claim 33, wherein the event reports include a report of a change in the configuration of the system, and the diagnostic processor is arranged to configure the causal network based on the changed configuration.

35. The apparatus of claim 34, further comprising a memory containing a database in which the configuration is recorded, wherein the diagnostic processor is coupled to update the database in response to the report of the change in the configuration, for use in configuring the causal network.

36. The apparatus of claim 33, wherein the diagnostic processor is arranged to extract a sequence of alarms occurring at times close to one another, including the alarm indicating the failure in the one of the modules, and to process the sequence of alarms to update the probabilities.

37. The apparatus of claim 36, wherein respective lifetimes are defined for the alarms in response to expected delays in receipt of the alarms from the system, and the diagnostic processor is arranged to select the alarms to extract for the sequence in response to the respective lifetimes.

38. The apparatus of claim 37, wherein the diagnostic processor is arranged to select the alarms that occurred within their respective lifetimes of the time of occurrence of the alarm indicating the failure in the one of the modules, in response to which the causal network is configured.

39. The apparatus of claim 36, wherein, in configuring the causal network, the diagnostic processor is arranged to define an expected alarm caused by one of the malfunctions in the one or more of the modules, and is further arranged to update the probabilities in response to the occurrence of the expected alarm in the extracted sequence of alarms.

40. The apparatus of claim 36, wherein a template is defined that contains a group of nodes in the network corresponding to a category of the modules in the system and to an expected alarm caused by one of the malfunctions in the modules in the category, and the diagnostic processor is arranged to instantiate the template in the causal network in response to the occurrence of the expected alarm in the extracted sequence of alarms.

41. The apparatus of claim 32, wherein the interlinked modules include multiple instances of a given one of the modules interconnected in a regular pattern, a template is defined that contains a group of nodes in the network corresponding to the given one of the modules, and the diagnostic processor is arranged to instantiate the template for one or more of the modules in response to the alarm.

42. The apparatus of claim 41, wherein the template includes an expected alarm caused by one of the malfunctions in one of the instances of the given one of the modules, and the diagnostic processor is arranged to instantiate the template by adding an instance of the template to the network in response to the occurrence of the expected alarm.

43. The apparatus of claim 32, wherein the diagnostic processor is arranged to identify a local fault condition in the one of the modules that has failed and, in response to the local fault condition, to link the failure, within the causal network, to one of the malfunctions occurring in the one of the modules.

44. The apparatus of claim 32, wherein the diagnostic processor is arranged to identify a first fault condition occurring in a first one of the modules due to a connection with a second one of the modules in the system and, in response to the first fault condition, to link the failure, within the causal network, to a second fault condition occurring in the second one of the modules.

45. The apparatus of claim 44, wherein the diagnostic processor is arranged to determine whether a possible cause of the second fault condition is due to another connection between the second one of the modules and a third one of the modules in the system and, in response to the other connection, to link the failure, within the causal network, to a third fault condition occurring in the third one of the modules.

46. The apparatus of claim 32, wherein the diagnostic processor is arranged to add multiple occurrences of one or more of the malfunctions to the causal network in response to the respective probabilities of the malfunctions, and to link the failure to the multiple occurrences within the causal network.

47. The apparatus of claim 46, wherein the diagnostic processor is arranged to determine one or more fault conditions caused by each of the occurrences and to link at least some of the fault conditions to the failure.

48. The apparatus of claim 32, wherein the at least one of the probabilities of the malfunction is expressed as an average time between failures of the one or more modules.

49. The apparatus of claim 32, wherein the probability of the malfunction is defined with respect to a probability distribution having a mean and a product moment, and the diagnostic processor is arranged to update the mean and the product moment of the probability distribution.

50. The apparatus of claim 49, wherein the probability distribution comprises a failure rate distribution and the diagnostic processor is arranged to update the failure rate distribution using a Bayesian reliability theory model.

51. The apparatus of claim 32, wherein the diagnostic processor is arranged to compare one or more of the updated probabilities with a predetermined threshold and to invoke a diagnostic action when the one of the probabilities exceeds the threshold.

52. The apparatus of claim 51, including a user interface, wherein the diagnostic processor is coupled to notify a user of the system about the diagnostic via the user interface.

53. The apparatus of claim 52, wherein the diagnostic processor is arranged to provide, via the user interface, an explanation of the diagnostics based on the causal network.

54. The apparatus of claim 51, wherein the diagnostic action comprises a diagnostic test performed to verify the malfunction, the diagnostic test being selected in response to the one of the probabilities that exceeds the threshold.

55. The apparatus of claim 54, wherein the diagnostic processor is arranged to modify the causal network in response to a result of the diagnostic test.

56. An apparatus for diagnosing a system composed of a plurality of modules linked together, the apparatus including a diagnostic processor that is arranged to configure a causal network that associates a failure in one of the modules with malfunctions in two or more of the modules that may have led to the failure and that relates a conditional probability of the failure to respective probability distributions of the malfunctions, to update the probability distributions of the malfunctions in response to an alarm from the system indicating the failure, and to propose a diagnosis of the alarm in response to the updated probability distributions.

57. The apparatus of claim 56, wherein the probability distribution indicates a mean time between failures of the two or more modules.

58. The apparatus of claim 56, wherein the probability distribution has a mean and a product moment, and the diagnostic processor is arranged to reassess the mean and the product moment of the probability distribution in response to the alarm.

59. The apparatus of claim 58, wherein the probability distribution comprises a failure rate distribution and the diagnostic processor is arranged to update the failure rate distribution using a Bayesian reliability theory model.

60. The apparatus of claim 56, wherein the two or more modules include the one of the modules that has failed, and the diagnostic processor is arranged to identify a local fault condition in the one of the modules and, in response to the local fault condition, to link the failure, within the causal network, to one of the malfunctions occurring in the one of the modules.

61. The apparatus of claim 56, wherein the two or more of the modules include a first module and a second module, and the diagnostic processor is arranged to identify a first fault condition occurring in the first module due to a connection with the second module in the system and, in response to the first fault condition, to link the failure, within the causal network, to a second fault condition occurring in the second module.

62. The apparatus of claim 61, wherein the two or more of the modules include a third module, and the diagnostic processor is arranged to determine whether a possible cause of the second fault condition is due to another connection in the system between the second module and the third module and, in response to the other connection, to link the failure, within the causal network, to a third fault condition occurring in the third module.

63. A computer software product for diagnosing a system composed of a plurality of interconnected modules, the product comprising a computer-readable medium on which program instructions are stored, wherein the program instructions, when read by a computer, cause the computer to receive from the system an alarm indicating a failure in one of the modules, to configure, in response to the alarm, a causal network that associates the failure with malfunctions in one or more of the modules that may have led to the failure and that relates a conditional probability of the failure to respective probabilities of the malfunctions, to update at least one of the probabilities of the malfunctions based on the alarm and the causal network, and to propose a diagnosis of the alarm in response to the updated probabilities.

64. The computer software product of claim 63, wherein the program instructions cause the computer to receive event reports from the plurality of modules in the system and to extract the alarm from the event reports.

65. The computer software product of claim 64, wherein the event reports include a report of a change in the configuration of the system, and the program instructions cause the computer to configure the causal network based on the changed configuration.

66. The computer software product of claim 65, wherein the program instructions cause the computer to update, in response to the report of the change in the configuration, a database in which the configuration is recorded for use in configuring the causal network.

67. The computer software product of claim 64, wherein the program instructions cause the computer to extract a sequence of alarms occurring at times close to one another, including the alarm indicating the failure in the one of the modules, and to process the sequence of alarms to update the probabilities.

68. The computer software product of claim 67, wherein respective lifetimes are defined for the alarms in response to expected delays in receipt of the alarms from the system, and the program instructions cause the computer to select the alarms to extract for the sequence in response to the respective lifetimes.

69. The computer software product of claim 68, wherein the program instructions cause the computer to select the alarms that occurred within their respective lifetimes of the time of occurrence of the alarm indicating the failure in the one of the modules, in response to which the causal network is configured.

70. The computer software product of claim 67, wherein the program instructions cause the computer, in configuring the causal network, to define an expected alarm caused by one of the malfunctions in the one or more modules, and to update the probabilities in response to the occurrence of the expected alarm in the extracted sequence of alarms.

71. The computer software product of claim 67, wherein a template is defined that contains a group of nodes in the network corresponding to a category of the modules in the system and to an expected alarm caused by one of the malfunctions in the modules in the category, and the program instructions cause the computer to instantiate the template in the causal network in response to the occurrence of the expected alarm in the extracted sequence of alarms.

72. The computer software product of claim 63, wherein the interlinked modules include multiple instances of a given one of the modules interconnected in a regular pattern, a template is defined that contains a group of nodes in the network corresponding to the given one of the modules, and the program instructions cause the computer to instantiate the template for one or more of the modules in response to the alarm.

73. The computer software product of claim 72, wherein the template includes an expected alarm caused by one of the malfunctions in one of the instances of the given one of the modules, and the program instructions cause the computer to instantiate the template by adding an instance of the template to the network in response to occurrence of the expected alarm.

74. The computer software product of claim 63, wherein the program instructions cause the computer to identify a local fault condition in the one of the modules that has failed and, in response to the local fault condition, to link the failure, within the causal network, to one of the malfunctions occurring in the one of the modules.

75. The computer software product of claim 63, wherein the program instructions cause the computer to identify a first fault condition occurring in a first one of the modules due to a connection with a second one of the modules in the system and, in response to the first fault condition, to link the fault in the causal network to a second fault condition occurring in the second one of the modules.

76. The computer software product of claim 75, wherein the program instructions further cause the computer to determine whether a possible cause of the second fault condition is another connection, between the second one of the modules and a third one of the modules in the system, and, in response to said another connection, to link the fault in the causal network to a third fault condition occurring in the third one of the modules.

77. The computer software product of claim 63, wherein the program instructions cause the computer to add one or more occurrences of the malfunctions to the causal network in response to the respective probabilities of the malfunctions, and to link the fault to the occurrences within the causal network.

78. The computer software product of claim 77, wherein the program instructions cause the computer to determine one or more fault conditions caused by each of the occurrences, and to link at least some of the fault conditions to the fault.

79. The computer software product of claim 63, wherein at least one of the probabilities of the malfunctions is expressed in terms of a mean time between failures of the one or more modules.

80. The computer software product of claim 63, wherein the probabilities of the malfunctions are defined in terms of a probability distribution having a mean and a moment, and wherein the program instructions cause the computer to update the mean and the moment of the probability distribution.

81. The computer software product of claim 80, wherein the probability distribution comprises a failure rate distribution, and wherein the program instructions cause the computer to update the failure rate distribution using a Bayesian reliability theory model.
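As an illustrative sketch only (not part of the claim language), one common Bayesian reliability model that fits claims 80 and 81 is a gamma prior on the failure rate with Poisson-distributed failure counts. The choice of this conjugate pair, the class name `FailureRateBelief`, and the numbers used are assumptions; the claims do not name a specific model. The update revises both the mean and the variance (second central moment) of the distribution.

```python
from dataclasses import dataclass


@dataclass(frozen=True)
class FailureRateBelief:
    """Gamma(shape, rate) belief about a module's failure rate (failures/hour).

    Assumed model: failures arrive as a Poisson process, so the gamma
    distribution is the conjugate prior for the failure rate.
    """
    shape: float           # prior "pseudo-count" of failures
    rate: float            # prior "pseudo-exposure" in operating hours

    @property
    def mean(self) -> float:
        """Expected failure rate (failures per hour)."""
        return self.shape / self.rate

    @property
    def variance(self) -> float:
        """Second central moment of the failure rate distribution."""
        return self.shape / self.rate ** 2

    def updated(self, failures: int, exposure_hours: float) -> "FailureRateBelief":
        """Posterior after observing `failures` in `exposure_hours` of operation."""
        if failures < 0 or exposure_hours < 0:
            raise ValueError("failures and exposure_hours must be non-negative")
        return FailureRateBelief(self.shape + failures, self.rate + exposure_hours)


# Hypothetical numbers: prior mean of one failure per 10,000 hours.
prior = FailureRateBelief(shape=2.0, rate=20_000.0)
posterior = prior.updated(failures=1, exposure_hours=5_000.0)
```

In this example the observed failure arrives sooner than the prior expected, so the posterior mean rate rises, while the added evidence narrows the distribution. The reciprocal of the mean rate gives a rough mean time between failures.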

82. The computer software product of claim 63, wherein the program instructions cause the computer to compare one or more of the updated probabilities to a predetermined threshold and to initiate a diagnostic action when one of the probabilities exceeds the threshold.

83. The computer software product of claim 82, wherein the program instructions cause the computer to notify a user of the system of the diagnosis.

84. The computer software product of claim 83, wherein the program instructions cause the computer to provide the user with an explanation of the diagnosis based on the causal network.

85. The computer software product of claim 82, wherein the diagnostic action comprises a diagnostic test performed to verify the malfunction, the diagnostic test being selected in response to the one of the probabilities that exceeds the threshold.
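As an illustrative sketch only (not part of the claim language), the threshold comparison of claim 82 and the test selection of claim 85 might be combined as below. The function name, the mapping from malfunctions to tests, and the fallback action `"notify-operator"` are hypothetical.

```python
from typing import Dict, List, Tuple


def select_diagnostic_actions(
    posteriors: Dict[str, float],
    tests: Dict[str, str],
    threshold: float,
) -> List[Tuple[str, str]]:
    """Pair each malfunction whose probability exceeds the threshold with a test.

    Results are ordered from most to least probable malfunction. A
    malfunction with no registered test falls back to "notify-operator"
    (an assumed default, not taken from the source).
    """
    flagged = sorted(
        ((p, name) for name, p in posteriors.items() if p > threshold),
        reverse=True,
    )
    return [(name, tests.get(name, "notify-operator")) for _, name in flagged]
```

A caller would typically run the returned tests in order and feed each result back into the causal network, as claim 86 describes.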

86. The computer software product of claim 85, wherein the program instructions cause the computer to change the causal network in response to a result of the diagnostic test.

87. A product for diagnosing a system comprising a plurality of interconnected modules, the product comprising a computer-readable medium in which program instructions are stored, which instructions, when read by a computer, cause the computer to configure a causal network that associates a fault in one of the modules with malfunctions in two or more of the modules that may have led to the fault, and that relates a conditional probability of the fault to respective probability distributions of the malfunctions; to update the probability distributions of the malfunctions in response to an alarm from the system indicating the fault; and to propose a diagnosis of the alarm in response to the updated probability distributions.
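As an illustrative sketch only (not part of the claim language), the update recited in claim 87 can be pictured with a very small causal network in which a single fault node has several candidate malfunctions as parents. The noisy-OR form of the conditional probability, the function name, and all numbers are assumptions made for this example; for simplicity each malfunction is given a point prior probability rather than a full distribution.

```python
from itertools import product
from typing import Dict


def posterior_given_fault(
    priors: Dict[str, float],
    link: Dict[str, float],
    leak: float = 0.0,
) -> Dict[str, float]:
    """Posterior probability of each malfunction, given that the fault was observed.

    priors[m] is the prior probability that malfunction m is present.
    link[m] is the assumed probability that m alone produces the fault.
    leak is the probability of the fault with no modeled malfunction present.
    The fault's conditional probability is a noisy-OR of the active parents,
    and the posterior is computed by enumerating all malfunction states.
    """
    names = list(priors)
    evidence = 0.0
    joint = {n: 0.0 for n in names}
    for states in product([False, True], repeat=len(names)):
        p_states = 1.0
        p_no_fault = 1.0 - leak
        for n, present in zip(names, states):
            p_states *= priors[n] if present else 1.0 - priors[n]
            if present:
                p_no_fault *= 1.0 - link[n]
        p = p_states * (1.0 - p_no_fault)  # P(states, fault observed)
        evidence += p
        for n, present in zip(names, states):
            if present:
                joint[n] += p
    if evidence == 0.0:
        raise ValueError("the fault has zero probability under this model")
    return {n: joint[n] / evidence for n in names}


# Hypothetical example: a fault that either of two ports could explain.
updated = posterior_given_fault(
    priors={"local_port": 0.01, "remote_port": 0.02},
    link={"local_port": 0.9, "remote_port": 0.5},
)
```

Exhaustive enumeration is exponential in the number of malfunctions and is used here only to keep the sketch short; it is workable because the network built around one alarm is small.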

88. The product of claim 87, wherein the probability distribution indicates a mean time between failures of the two or more modules.

89. The product of claim 87, wherein the probability distribution has a mean and a moment, and wherein the program instructions cause the computer to reassess the mean and the moment of the probability distribution in response to the alarm.

90. The product of claim 89, wherein the probability distribution comprises a failure rate distribution, and wherein the program instructions cause the computer to update the failure rate distribution using a Bayesian reliability theory model.

91. The product of claim 87, wherein the two or more modules include the one of the modules in which the fault occurred, and wherein the program instructions cause the computer to identify a local fault condition in the one of the modules and, in response to the local fault condition, to link the fault in the causal network to one of the malfunctions occurring in the one of the modules.

92. The product of claim 87, wherein the two or more of the modules include a first module and a second module, and wherein the program instructions cause the computer to identify a first fault condition occurring in the first module due to a connection with the second module in the system and, in response to the first fault condition, to link the fault in the causal network to a second fault condition occurring in the second module.

93. The product of claim 92, wherein the two or more of the modules include a third module, and wherein the program instructions cause the computer to determine whether the second fault condition is due to another connection in the system, between the second module and the third module, and, in response to the other connection, to link the fault in the causal network to a third fault condition occurring in the third module.