
CN101589370B - A parallel computer system and fault recovery method therefor - Google Patents

A parallel computer system and fault recovery method therefor

Info

Publication number
CN101589370B
CN101589370B CN2008800032537A CN200880003253A
Authority
CN
China
Prior art keywords
node
computing node
computing
hardware
network
Prior art date
Legal status (The legal status is an assumption and is not a legal conclusion. Google has not performed a legal analysis and makes no representation as to the accuracy of the status listed.)
Active
Application number
CN2008800032537A
Other languages
Chinese (zh)
Other versions
CN101589370A (en)
Inventor
D. Darrington
P. J. McCarthy
A. Peters
A. Sidelnik
Current Assignee (The listed assignees may be inaccurate. Google has not performed a legal analysis and makes no representation or warranty as to the accuracy of the list.)
International Business Machines Corp
Original Assignee
International Business Machines Corp
Priority date (The priority date is an assumption and is not a legal conclusion. Google has not performed a legal analysis and makes no representation as to the accuracy of the date listed.)
Filing date
Publication date
Application filed by International Business Machines Corp
Publication of CN101589370A
Application granted
Publication of CN101589370B
Legal status: Active (current)
Anticipated expiration

Classifications

    • GPHYSICS
    • G06COMPUTING; CALCULATING OR COUNTING
    • G06FELECTRIC DIGITAL DATA PROCESSING
    • G06F11/00Error detection; Error correction; Monitoring
    • G06F11/22Detection or location of defective computer hardware by testing during standby operation or during idle time, e.g. start-up testing
    • G06F11/26Functional testing
    • G06F11/267Reconfiguring circuits for testing, e.g. LSSD, partitioning
    • GPHYSICS
    • G06COMPUTING; CALCULATING OR COUNTING
    • G06FELECTRIC DIGITAL DATA PROCESSING
    • G06F13/00Interconnection of, or transfer of information or other signals between, memories, input/output devices or central processing units
    • GPHYSICS
    • G06COMPUTING; CALCULATING OR COUNTING
    • G06FELECTRIC DIGITAL DATA PROCESSING
    • G06F1/00Details not covered by groups G06F3/00 - G06F13/00 and G06F21/00
    • G06F1/24Resetting means
    • GPHYSICS
    • G06COMPUTING; CALCULATING OR COUNTING
    • G06FELECTRIC DIGITAL DATA PROCESSING
    • G06F11/00Error detection; Error correction; Monitoring
    • G06F11/07Responding to the occurrence of a fault, e.g. fault tolerance
    • G06F11/0703Error or fault processing not based on redundancy, i.e. by taking additional measures to deal with the error or fault not making use of redundancy in operation, in hardware, or in data representation
    • G06F11/0706Error or fault processing not based on redundancy, i.e. by taking additional measures to deal with the error or fault not making use of redundancy in operation, in hardware, or in data representation the processing taking place on a specific hardware platform or in a specific software environment
    • G06F11/0721Error or fault processing not based on redundancy, i.e. by taking additional measures to deal with the error or fault not making use of redundancy in operation, in hardware, or in data representation the processing taking place on a specific hardware platform or in a specific software environment within a central processing unit [CPU]
    • G06F11/0724Error or fault processing not based on redundancy, i.e. by taking additional measures to deal with the error or fault not making use of redundancy in operation, in hardware, or in data representation the processing taking place on a specific hardware platform or in a specific software environment within a central processing unit [CPU] in a multiprocessor or a multi-core unit
    • GPHYSICS
    • G06COMPUTING; CALCULATING OR COUNTING
    • G06FELECTRIC DIGITAL DATA PROCESSING
    • G06F11/00Error detection; Error correction; Monitoring
    • G06F11/07Responding to the occurrence of a fault, e.g. fault tolerance
    • G06F11/0703Error or fault processing not based on redundancy, i.e. by taking additional measures to deal with the error or fault not making use of redundancy in operation, in hardware, or in data representation
    • G06F11/0793Remedial or corrective actions
    • GPHYSICS
    • G06COMPUTING; CALCULATING OR COUNTING
    • G06FELECTRIC DIGITAL DATA PROCESSING
    • G06F11/00Error detection; Error correction; Monitoring
    • G06F11/07Responding to the occurrence of a fault, e.g. fault tolerance
    • G06F11/14Error detection or correction of the data by redundancy in operation
    • G06F11/1402Saving, restoring, recovering or retrying
    • G06F11/1415Saving, restoring, recovering or retrying at system level
    • G06F11/1441Resetting or repowering
    • GPHYSICS
    • G06COMPUTING; CALCULATING OR COUNTING
    • G06FELECTRIC DIGITAL DATA PROCESSING
    • G06F11/00Error detection; Error correction; Monitoring
    • G06F11/07Responding to the occurrence of a fault, e.g. fault tolerance
    • G06F11/16Error detection or correction of the data by redundancy in hardware
    • GPHYSICS
    • G06COMPUTING; CALCULATING OR COUNTING
    • G06FELECTRIC DIGITAL DATA PROCESSING
    • G06F15/00Digital computers in general; Data processing equipment in general

Landscapes

  • Engineering & Computer Science (AREA)
  • Theoretical Computer Science (AREA)
  • General Engineering & Computer Science (AREA)
  • Physics & Mathematics (AREA)
  • General Physics & Mathematics (AREA)
  • Quality & Reliability (AREA)
  • Computer Hardware Design (AREA)
  • Debugging And Monitoring (AREA)
  • Hardware Redundancy (AREA)
  • Multi Processors (AREA)

Abstract

A method and apparatus for fault recovery on a parallel computer system from a soft failure without ending an executing job on a partition of nodes. In preferred embodiments, a failed hardware recovery mechanism on a service node uses a heartbeat monitor to determine when a node failure occurs. Where possible, the failed node is reset and re-loaded with software without ending the software job being executed by the partition containing the failed node.

Description

A parallel computer system and a method for performing fault recovery thereon
Technical field
The present invention relates generally to fault recovery on parallel computer systems, and more particularly to fault recovery on a massively parallel supercomputer that handles node failures without ending an executing job.
Background
Supercomputers continue to evolve to tackle complex computing tasks. These computers are particularly useful to scientists engaged in high-performance computing (HPC) applications, including: life sciences, financial modeling, fluid dynamics, quantum chemistry, molecular dynamics, astronomy, space research, and climate modeling. Supercomputer developers have focused on massively parallel computer architectures to address these ever-increasing demands for complex computation.
One such massively parallel computer developed by International Business Machines Corporation is the Blue Gene system. Blue Gene is a scalable system in which the maximum number of compute nodes is 65,536. Each node consists of a single ASIC (application-specific integrated circuit) and memory. Each node typically has 512 megabytes or 1 gigabyte of local memory. The full computer would be housed in 64 racks or cabinets connected by several networks and arranged at a common site. Each rack has 32 node boards, each node board has 32 nodes, and each node has 2 processors.
The 65,536 compute nodes and 1,024 I/O processors of the Blue Gene/L supercomputer are arranged into both a logical tree network and a logical three-dimensional torus network. The logical tree network is a logical network over a collective network topology. Blue Gene/L can be described as a compute-node core with an I/O-node surface. Each I/O node handles the input and output functions of 64 compute nodes. The I/O nodes have no local storage. The I/O nodes are connected to the compute nodes through the logical tree network and also have functional wide-area-network capability through built-in gigabit Ethernet. Nodes can be allocated into partitions of nodes so that a single application or job can be executed on a group of Blue Gene/L nodes within a partition.
A soft failure in a computer system is an error or fault that is not caused by a repeating hardware failure or hard fault. Random events such as alpha particles and noise can cause soft failures. In most computer systems, such soft failures are quite rare and can be handled in conventional ways. In a massively parallel computer system like Blue Gene/L, the problems of soft and hard failures increase significantly because of the complexity of the system and the number of compute nodes in the system. Moreover, in the prior art, a failure in a node could cause the entire partition of the computer system to become unavailable, or require that a job executing on the partition be aborted or restarted.
Since computer downtime and restarting jobs waste valuable system resources, without a more efficient method of recovering from system failures caused by soft faults, parallel computer systems will continue to suffer from inefficient hardware utilization and unnecessary computer downtime.
Summary of the invention
According to the preferred embodiments, a method and apparatus are described for fault recovery from a single-node failure caused by a soft fault on a parallel computer system, without ending a job executing on a partition of nodes. In preferred embodiments, a failed hardware recovery mechanism on a service node uses a heartbeat monitor to determine when a node failure occurs. Where possible, the failed node is reset and re-loaded with software without ending the software job being executed by the partition containing the failed node.
The disclosed embodiments are directed to the Blue Gene/L architecture, but can be implemented on any parallel computer system having multiple processors arranged in a network structure. The preferred embodiments are particularly advantageous for massively parallel computer systems.
The foregoing and other features and advantages of the invention will be apparent from the following more particular description of the preferred embodiments of the invention, as illustrated in the accompanying drawings.
Brief description of the drawings
The preferred embodiments of the present invention will hereinafter be described in conjunction with the appended drawings, where like designations denote like elements, and:
Fig. 1 is a block diagram of a massively parallel computer system according to preferred embodiments;
Fig. 2 is a block diagram of a compute node in a massively parallel computer system according to preferred embodiments;
Fig. 3 is a block diagram of node reset hardware according to preferred embodiments;
Fig. 4 is a method flow diagram for setting up a heartbeat timer on a compute node of a massively parallel computer system according to preferred embodiments; and
Fig. 5 is a method flow diagram for fault recovery of a failed node on a massively parallel computer system according to preferred embodiments.
Detailed description
The present invention relates to an apparatus and method for recovering from a soft failure on a node of a parallel computer system without ending a job executing on the partition of nodes that contains the failed node. The preferred embodiments will be described with respect to the Blue Gene/L massively parallel computer developed by International Business Machines Corporation.
Fig. 1 shows a block diagram that represents a massively parallel computer system 100 such as the Blue Gene/L computer system. The Blue Gene/L system is a scalable system in which the maximum number of compute nodes is 65,536. Each node has an application-specific integrated circuit (ASIC) 112, also called a Blue Gene/L compute chip 112. The compute chip incorporates two processors or central processing units (CPUs) and is mounted on a node daughter card 114. The node typically has 512 megabytes of local memory. A node board 120 accommodates 32 node daughter cards 114, each having a node 110. Thus, each node board has 32 nodes, with 2 processors for each node and the associated memory for each processor. A rack 130 is a housing that contains 32 node boards 120. Each of the node boards 120 is connected into a midplane printed circuit board 132 with a midplane connector 134. The midplane 132 is inside the rack and not shown in Fig. 1. The full Blue Gene/L computer system would be housed in 64 racks 130 or cabinets with 32 node boards 120 in each. The full system would then have 65,536 nodes and 131,072 CPUs (64 racks × 32 node boards × 32 nodes × 2 CPUs).
The Blue Gene/L computer system structure can be described as a compute-node core with an I/O-node surface, where communication to 1,024 compute nodes 110 is handled by each I/O node that has an I/O processor 170 connected to the service node 140. The I/O nodes have no local storage. The I/O nodes are connected to the compute nodes through the logical tree network and also have functional wide-area-network capability through a gigabit Ethernet (not shown). The gigabit Ethernet network is connected to an I/O processor (or Blue Gene/L link chip) 170 located on a node board 120 that handles communication from the service node to a number of nodes. The Blue Gene/L system has one or more I/O processors 170 connected to the node board 120 on an I/O board (not shown). The I/O processors can be configured to communicate with 8, 32, or 64 nodes. The service node uses the gigabit network to control connectivity by communicating to link cards on the compute nodes. The connections to the I/O nodes are similar to the connections to the compute nodes, except that the I/O nodes are not connected to the torus network.
Again referring to Fig. 1, the computer system 100 includes a service node 140 that handles the loading of the nodes with software and controls the operation of the whole system. The service node 140 is typically a minicomputer system with a console (not shown), such as an IBM pSeries server running Linux. The service node 140 is connected to the racks 130 of the compute nodes 110 with a control system network 150. The control system network provides the control, test, and bring-up infrastructure for the Blue Gene/L system. The control system network 150 includes various network interfaces that provide the necessary communication for the massively parallel computer system. The network interfaces are described further below.
The service node 140 manages the control system network 150 that is dedicated to system management. The control system network 150 is a private 100-Mb/s Ethernet connected to an Ido chip 180 located on a node board 120 that handles communication from the service node to a number of nodes. This network is sometimes called the JTAG network because it communicates using the JTAG protocol. All control, test, and bring-up of the compute nodes 110 on the node board 120 is governed through the JTAG port communicating with the service node. This network is described further below with reference to Fig. 2.
The Blue Gene/L supercomputer communicates over several additional communication networks. The 65,536 compute nodes are arranged into both a logical tree network and a physical three-dimensional torus network. The logical tree network connects the compute nodes in a binary tree structure so that each node communicates with a parent and two children. The torus network logically connects the compute nodes in a three-dimensional lattice-like structure that allows each compute node to communicate with its closest 6 neighbors in a section of the computer. Other communication networks connected to the node include a barrier network. The barrier network uses the barrier communication system to implement software barriers for synchronizing similar processes on the compute nodes so that they move to a different processing phase upon completion of some task. There is also a global interrupt connection to each node.
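The torus wiring described above can be sketched in a few lines. This is an illustrative model only, not part of the patent: the lattice dimensions and the `torus_neighbors` helper are assumptions chosen to show how each node reaches exactly six neighbors with wrap-around at the lattice edges.

```python
# Sketch of 3-D torus neighbor addressing: each node (x, y, z) talks to
# its six nearest neighbors, with coordinates wrapping at the lattice
# edges. The 8x8x8 dimensions are illustrative; the full Blue Gene/L
# torus is much larger.
def torus_neighbors(x, y, z, dims=(8, 8, 8)):
    X, Y, Z = dims
    return [
        ((x + 1) % X, y, z), ((x - 1) % X, y, z),  # +/- x direction
        (x, (y + 1) % Y, z), (x, (y - 1) % Y, z),  # +/- y direction
        (x, y, (z + 1) % Z), (x, y, (z - 1) % Z),  # +/- z direction
    ]

# A corner node still has six neighbors because the torus wraps around.
print(len(torus_neighbors(0, 0, 0)))  # 6
```

The wrap-around is why resetting a node's network hardware is disruptive: traffic between its neighbors is routed through it.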
Again referring to Fig. 1, the service node 140 includes a failed hardware recovery mechanism 142. The failed hardware recovery mechanism comprises software in the service node 140 that operates to recover from a node failure in accordance with the preferred embodiments claimed herein. The failed hardware recovery mechanism uses a heartbeat monitor 144 to determine when a node has failed. As described further below, the heartbeat monitor reads and then clears a heartbeat flag placed in memory on the node. When the heartbeat is no longer present, meaning the heartbeat flag was not set, then the node has failed and, as described further below, the failed hardware recovery mechanism attempts to recover the node without ending any job executing on the partition of nodes that includes the failed node.
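The read-and-clear protocol between the heartbeat monitor and the per-node flags can be sketched as follows. This is an illustrative model only, not IBM's implementation: the `HeartbeatMonitor` class, the node identifiers, and the dictionary standing in for the SRAM mailboxes are all assumptions; in the actual system the flags would be read and cleared over the JTAG control network.

```python
# Illustrative sketch of the service-node heartbeat monitor (144).
# The mailbox dict stands in for the per-node SRAM mailboxes reached
# over JTAG; the names and structure are assumptions, not patent code.

class HeartbeatMonitor:
    def __init__(self, node_ids):
        # One flag per compute node; True means the node's timer has
        # set its heartbeat flag since our last check.
        self.mailbox = {n: False for n in node_ids}

    def node_sets_flag(self, node_id):
        # Models the compute-node side: the timer fires and sets the flag.
        self.mailbox[node_id] = True

    def poll(self):
        # Read and then clear each node's heartbeat flag; a flag that
        # was never re-set marks a failed node needing recovery.
        failed = []
        for node_id, flag in self.mailbox.items():
            if not flag:
                failed.append(node_id)
            self.mailbox[node_id] = False   # clear for the next interval
        return failed

monitor = HeartbeatMonitor(["n0", "n1", "n2"])
monitor.node_sets_flag("n0")
monitor.node_sets_flag("n2")
print(monitor.poll())  # ['n1'] -- n1 never set its flag
```

Clearing the flag on every poll is what makes a stuck node visible: a healthy node re-sets the flag before the next check, while a hung node leaves it cleared.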
Fig. 2 shows a block diagram of a compute node 110 in the Blue Gene/L computer system according to the prior art. The compute node 110 has a node compute chip 112 that has two processing units 210A, 210B. Each processing unit 210 has a processing core 212 with an on-chip first-level cache (L1 cache) 214. The processing units 210 each also have a second-level cache (L2 cache) 216. The processing units 210 are connected to a third-level cache (L3 cache) 220 and to an SRAM memory bank 230. Data from the L3 cache 220 is loaded into a bank of DDR SDRAM 240 by a DDR controller 250.
Again referring to Fig. 2, the SRAM memory 230 is connected to a JTAG interface 260 that allows the compute chip 112 to communicate with an Ido chip 180. The service node communicates with the compute node through the Ido chip 180 over an Ethernet link that is part of the control system network 150 (described above with reference to Fig. 1). In the Blue Gene/L system there is one Ido chip per node board 120, and others are on boards in each of the midplanes 132 (Fig. 1). The Ido chips receive commands from the service node using raw UDP packets over a trusted private 100-Mb/s Ethernet control network. The Ido chips support a variety of serial protocols for communicating with the compute nodes. The JTAG protocol is used for reading and writing from the service node 140 (Fig. 1) to any address of the SRAM 230 in the compute node 110, and is used for the system initialization and boot process. The JTAG interface 260 also communicates with a configuration register 270 that, as described further below, holds reset bits for resetting the various portions of the node compute chip 112.
Again referring to Fig. 2, the compute node 110 further includes a timer 280 with an alarm time 285 that can be set under software control. In the preferred embodiments herein, the timer is used to produce a heartbeat to notify the heartbeat monitor 144 of the service node 140 (Fig. 1) that the node is operating properly. The node receives the alarm time 285 from the service node. The timer 280 is set to alarm periodically with a period equal to the alarm time 285. When the timer detects the passing of the alarm time 285, and if the node is operating properly, a heartbeat flag 236 is set in a mailbox 235 of the SRAM 230. The heartbeat monitor 144 of the service node 140 periodically checks for the presence of the heartbeat flag 236 on all the nodes and, as described in detail below, performs operations to recover the failed node when a heartbeat flag is not present.
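The compute-node side of the heartbeat can be sketched with a re-arming timer. This is a hedged illustration only: `HeartbeatTimer`, the shared dictionary standing in for the SRAM mailbox, and the use of `threading.Timer` are assumptions for demonstration, not the node's actual firmware, which sets the flag from a hardware timer interrupt.

```python
# Sketch of the compute-node heartbeat timer (280): at every alarm-time
# interval the node, if healthy, sets its heartbeat flag (236) in the
# SRAM mailbox (235) for the service node's monitor to find and clear.
import threading

class HeartbeatTimer:
    def __init__(self, alarm_time, mailbox):
        self.alarm_time = alarm_time    # received from the service node (285)
        self.mailbox = mailbox          # dict standing in for the SRAM mailbox

    def tick(self):
        # Set the flag now, then re-arm so it is refreshed every period.
        # A hung node stops calling tick(), so the flag stays cleared
        # after the monitor's next read-and-clear pass.
        self.mailbox["heartbeat_flag"] = True
        t = threading.Timer(self.alarm_time, self.tick)
        t.daemon = True     # don't keep the process alive for the timer
        t.start()

mailbox = {"heartbeat_flag": False}
HeartbeatTimer(alarm_time=0.05, mailbox=mailbox).tick()
print(mailbox["heartbeat_flag"])  # True
```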
The node compute chip 112 shown in Fig. 2 also includes network hardware 290. The network hardware 290 includes the hardware for the torus 292, tree 294, and global interrupt 296 networks. As briefly described above, these Blue Gene/L networks are used for communication between the compute node 110 and the other nodes in the system.
Fig. 3 is a block diagram that illustrates the reset capability of the compute chip 112. The compute chip 112 includes a number of individual resets that were intended to enhance the diagnostic capability of the compute chip 112. In the preferred embodiments, these resets are used for the fault recovery described herein. For the purposes of reset, the hardware on the compute chip is generally divided into the ASIC hardware 310, the network hardware 290, and the DDR controller 250. The ASIC hardware 310 is the remainder of the ASIC hardware that is not included as part of the network hardware 290 or the DDR controller 250. The configuration register 270 holds reset bits (not shown) for resetting the hardware as described above. A reset bit in the configuration register 270 drives a reset output as shown in Fig. 3. The ASIC hardware 310 is reset by the ASIC reset 312, the network hardware 290 is reset by the network hardware reset 314, and the DDR controller is reset by the DDR reset 316. The resets provide a typical reset feature to set the associated hardware to a known condition for initialization.
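The three independent reset outputs can be modeled as bits of the configuration register. Note that the bit positions below are invented for illustration; the patent does not disclose the register layout, only that separate bits drive the ASIC, network, and DDR resets.

```python
# Sketch of the configuration-register (270) reset bits driving the
# three reset outputs of Fig. 3. Bit positions are assumptions.
ASIC_RESET    = 0b001   # drives ASIC reset 312 (ASIC hardware 310)
NETWORK_RESET = 0b010   # drives network hardware reset 314 (290)
DDR_RESET     = 0b100   # drives DDR reset 316 (DDR controller 250)

def soft_recovery_reset_mask():
    # For the fault recovery described here, everything EXCEPT the
    # network hardware is reset, so neighbor traffic routed through
    # this node keeps flowing during recovery.
    return ASIC_RESET | DDR_RESET

mask = soft_recovery_reset_mask()
print(bool(mask & NETWORK_RESET))  # False -- network hardware left running
```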
In the preferred embodiments herein, the multiple resets on the compute chip 112 are used to recover from a soft failure without ending an application or job executing on a partition of the parallel computer system. Application software running on a partition that has a failing node may be suspended during the recovery, but if the recovery is successful, the application can continue without restarting after the node recovery. In the preferred embodiments, a timer is set to set a heartbeat flag in the mailbox of each node at a predetermined interval. A heartbeat monitor in the service node monitors and resets the heartbeat flag in each node to determine whether a node failure has occurred. If there is no heartbeat flag on a node, the failed hardware recovery mechanism on the service node attempts to recover the node without resetting the network hardware, so that other nodes in the partition that are using the network hardware on the failed node are not disrupted. Resetting the network hardware would require restarting the application executing on the partition, because it would interrupt the flow of information through the node between adjacent nodes in the torus and logical tree networks. It should be noted that the fault recovery described here is not directed to failures associated with the network hardware. A network hardware failure would be indicated by multiple failures of interconnected nodes and would require other means not described herein.
After detecting the absence of a heartbeat, if the failed hardware recovery mechanism can successfully load diagnostic code into the SRAM and the DDR controller and memory are operable, then the DDR controller is reset and a functional software kernel is re-loaded into the node. The node can then continue without resetting the whole ASIC. If the failed hardware recovery mechanism is unable to load the diagnostic code into the SRAM, then the ASIC reset is used to reset the ASIC except for the network hardware, the DDR is reset, and a functional software kernel is re-loaded into the node. This process allows the minimum number of resets while recovering the node from the failure. The compute node can then continue to operate, and the application running on the remaining nodes in the partition can continue executing without restarting the application from scratch.
Fig. 4 shows a method 400 for setting up a heartbeat on the compute nodes for fault recovery according to the embodiments herein. The method refers to operations performed on the compute nodes to provide a heartbeat to the heartbeat monitor in the service node, but the method could be initiated by the service node or by other parts of the compute node's boot process. The compute node receives a heartbeat time from the control system of the service node (step 410) and uses the heartbeat time to set up a timer (step 420). When the timer in each compute node detects the passing of the heartbeat time, a heartbeat flag is set in the SRAM mailbox so that the heartbeat monitor can check the compute node's heartbeat (step 430). The method is then done.
Fig. 5 shows a method 500 for fault recovery on a parallel computer system according to the embodiments herein. The operations described in the method are performed by the failed hardware recovery mechanism 142 and the heartbeat monitor 144 described above with reference to Fig. 1. The heartbeat monitor monitors the heartbeat of each node in the computer system by checking the heartbeat flag in each node as described above (step 510). If there are no failed nodes (step 520=no), then the method returns to step 510 and continues monitoring. If there is a failed node, as indicated by the absence of a heartbeat flag (step 520=yes), then the other nodes in the partition and the application software are notified that this node is unavailable (step 530). Next, an attempt is made to load diagnostic code into the SRAM of the failed node to check the operation of the node (step 540). If the load is not successful (step 550=no), then the ASIC except for the network hardware is reset (step 555), code is loaded into the SRAM to reset the DDR (step 560), and a special system kernel is then re-loaded so the node can continue executing and processing (step 565). If the load is successful (step 550=yes), then diagnostics are executed to check the DDR (step 570). If the DDR is good (step 575=yes), then an ASIC error is output to the service node (step 580), and the special system kernel is re-loaded so the node can continue executing and processing (step 565). If there is a problem with the DDR (step 575=no), then code is loaded into the SRAM to reset the DDR (step 560), and the special system kernel is re-loaded so the node can continue executing and processing (step 565). The method is then done.
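The branching in the method-500 flow above can be sketched as a single decision function. This is an illustrative skeleton only: the callback names (`load_diag`, `run_diag_ddr_ok`, and the reset and load operations) are hypothetical stand-ins for the JTAG and control-network operations the patent describes, and the returned action list exists purely so the flow can be inspected.

```python
# Sketch of the method-500 recovery decision flow for one failed node.
# Every callback is a hypothetical stand-in; each branch is annotated
# with the corresponding step number from Fig. 5.

def recover_failed_node(load_diag, run_diag_ddr_ok,
                        reset_asic_except_network, reset_ddr,
                        load_kernel, report_asic_error):
    actions = []
    if not load_diag():                     # steps 540/550: diag load failed
        reset_asic_except_network()         # step 555: spare the network hw
        actions.append("reset_asic")
        reset_ddr()                         # step 560
        actions.append("reset_ddr")
    elif run_diag_ddr_ok():                 # steps 570/575: DDR checks out
        report_asic_error()                 # step 580: tell the service node
        actions.append("asic_error")
    else:                                   # the DDR is the problem
        reset_ddr()                         # step 560
        actions.append("reset_ddr")
    load_kernel()                           # step 565: reload system kernel
    actions.append("load_kernel")
    return actions

# Example: diagnostic code loads, but diagnostics show the DDR is faulty.
noop = lambda: None
print(recover_failed_node(lambda: True, lambda: False,
                          noop, noop, noop, noop))
# ['reset_ddr', 'load_kernel']
```

All three paths end with the kernel reload of step 565, which is what lets the node rejoin the partition without the application restarting.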
As described above, the embodiments provide a method and apparatus for fault recovery from a soft failure on a node of a parallel computer system without ending a job executing on a partition of nodes in a massively parallel supercomputer system. The embodiments herein allow the service node to reset the non-network portions of a failed node without affecting the other nodes in the partition, thereby decreasing system downtime and increasing the efficiency of the computer system.
One skilled in the art will appreciate that many variations are possible within the scope of the present invention. Thus, while the invention has been particularly shown and described with reference to preferred embodiments thereof, it will be understood by those skilled in the art that these and other changes in form and details may be made therein without departing from the spirit and scope of the invention.

Claims (9)

1. A parallel computer system comprising:
a plurality of compute nodes, each compute node having reset hardware for resetting a network hardware portion of the compute node that is separate from reset hardware for resetting the remaining portion of the compute node;
a service node for controlling the operation of the compute nodes over a network, the service node including a failed hardware recovery mechanism that detects a failed compute node;
and wherein the failed hardware recovery mechanism resets the remaining portion of the failed compute node without resetting the network hardware portion, in order to recover from the failure of the failed compute node.
2. The parallel computer system of claim 1, wherein the plurality of compute nodes further comprise: a timer for setting a heartbeat flag in memory of the compute node at a predetermined interval to indicate that the compute node is operating properly.
3. The parallel computer system of claim 2, wherein the failed hardware recovery mechanism further comprises: a heartbeat monitor for monitoring the heartbeat flags in the compute nodes to detect a failed compute node among the plurality of nodes by the absence of a set heartbeat flag.
4. The parallel computer system of claim 3, wherein the failure on the failed compute node is detected by the heartbeat monitor.
5. The parallel computer system of claim 3, wherein the heartbeat flag is stored in a static memory on the compute node, and the failed hardware recovery mechanism reads the static memory over an Ethernet that accesses the static memory on the compute node through a JTAG interface, wherein the parallel computer system is a massively parallel computer system having a large number of compute nodes housed in a plurality of densely arranged computer racks.
6. The parallel computer system of claim 1, wherein the remaining portion of the compute node is a DDR memory controller of an ASIC processor chip, the parallel computer system being a massively parallel computer system having a large number of compute nodes housed in a plurality of densely arranged computer racks.
7. A computer-implemented method for operating a parallel computer system having a plurality of compute nodes connected to a service node through a control system network, the method comprising the steps of:
a) providing a heartbeat from each node;
b) monitoring the heartbeat of each compute node in the service node of the computer system; and
c) attempting to recover from a failure in a compute node indicated by the absence of a heartbeat from the compute node, without aborting an application running on the partition of nodes that includes the compute node with the failure,
wherein the step of attempting to recover from the failure in the compute node comprises the steps of:
g) attempting to load diagnostic code into the compute node; and
h) if the load is unsuccessful, then: resetting a portion of the compute node comprising all sections of the compute node except a network hardware section; resetting a memory controller in the compute node; and loading a system kernel into the compute node.
8. according to the computer implemented method of claim 7, the step of wherein keeping watch on said computing node may further comprise the steps:
D) said computing node receives heart time from said service node;
E) use said heart time that timer is set; And
F) detect said timer process heart time and in the storer of said computing node, heart beat flag is set.
9. The computer-implemented method of claim 7, wherein the step of attempting to recover from the fault in said compute node comprises the steps of:
I) if said loading is successful, executing said diagnostic code to check whether a memory controller is working correctly; and
J) if said memory controller is working correctly: loading code into said compute node to reset said memory controller; resetting said memory controller; and loading said compute node with a system kernel.
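Claims 7 (steps G/H) and 9 (steps I/J) together form a two-path decision: a hard reset when the diagnostic load fails, and a softer memory-controller-only reset when diagnostics run and the controller checks out. A sketch of that decision flow, with all callables as hypothetical hooks; the "unrecoverable" branch for a faulty memory controller is an assumption, since the claims do not state what happens in that case.

```python
def recover_node(load_diagnostics, run_diagnostics, actions):
    """Decision flow combining claim 7 (steps G/H) and claim 9 (steps I/J).
    `load_diagnostics` and `run_diagnostics` stand in for control-system
    operations on the real hardware; `actions` collects the steps taken."""
    if not load_diagnostics():
        # Claim 7, step H: hard path - reset everything but the network
        # hardware, reset the memory controller, reload the kernel.
        actions.extend(["reset-non-network", "reset-memory-controller", "load-kernel"])
        return "hard-reset"
    if run_diagnostics():
        # Claim 9, step J: soft path - the memory controller works, so only
        # load reset code, reset the controller, and reload the kernel.
        actions.extend(["load-reset-code", "reset-memory-controller", "load-kernel"])
        return "soft-reset"
    # Not specified by the claims: controller faulty, node left for service.
    return "unrecoverable"
```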
CN2008800032537A 2007-02-02 2008-02-01 A parallel computer system and fault recovery method therefor Active CN101589370B (en)

Applications Claiming Priority (3)

Application Number Priority Date Filing Date Title
US11/670,803 US7631169B2 (en) 2007-02-02 2007-02-02 Fault recovery on a massively parallel computer system to handle node failures without ending an executing job
US11/670,803 2007-02-02
PCT/EP2008/051266 WO2008092952A2 (en) 2007-02-02 2008-02-01 Fault recovery on a massively parallel computer system to handle node failures without ending an executing job

Publications (2)

Publication Number Publication Date
CN101589370A CN101589370A (en) 2009-11-25
CN101589370B true CN101589370B (en) 2012-05-02

Family

ID=39671965

Family Applications (1)

Application Number Title Priority Date Filing Date
CN2008800032537A Active CN101589370B (en) 2007-02-02 2008-02-01 A parallel computer system and fault recovery method therefor

Country Status (5)

Country Link
US (1) US7631169B2 (en)
EP (1) EP2115588B1 (en)
KR (1) KR101081092B1 (en)
CN (1) CN101589370B (en)
WO (1) WO2008092952A2 (en)

Families Citing this family (20)

* Cited by examiner, † Cited by third party
Publication number Priority date Publication date Assignee Title
US7644254B2 (en) 2007-04-18 2010-01-05 International Business Machines Corporation Routing data packets with hint bit for each six orthogonal directions in three dimensional torus computer system set to avoid nodes in problem list
US9880970B2 (en) * 2007-10-03 2018-01-30 William L. Bain Method for implementing highly available data parallel operations on a computational grid
US20120171111A1 (en) * 2009-09-15 2012-07-05 Osaka University Process for production of oxidation reaction product of aromatic compound
JP2012230597A (en) * 2011-04-27 2012-11-22 Fujitsu Ltd Processor, control device and processing method
CN103034507A (en) * 2011-10-09 2013-04-10 上海共联通信信息发展有限公司 Chip program repair method of telephone stored program control exchange
US8903893B2 (en) * 2011-11-15 2014-12-02 International Business Machines Corporation Diagnostic heartbeating in a distributed data processing environment
WO2013078558A1 (en) * 2011-11-28 2013-06-06 Jigsee Inc. Method of determining transport parameters for efficient data transport across a network
US9389657B2 (en) * 2011-12-29 2016-07-12 Intel Corporation Reset of multi-core processing system
CN102902615B (en) * 2012-09-18 2016-12-21 曙光信息产业(北京)有限公司 A kind of Lustre parallel file system false alarm method and system thereof
CN103136086A (en) * 2013-03-06 2013-06-05 中国人民解放军国防科学技术大学 Plug-in frame tight-coupling monitoring and management system for supper parallel computers
US10102032B2 (en) 2014-05-29 2018-10-16 Raytheon Company Fast transitions for massively parallel computing applications
US10007586B2 (en) 2016-01-08 2018-06-26 Microsoft Technology Licensing, Llc Deferred server recovery in computing systems
US10078559B2 (en) * 2016-05-27 2018-09-18 Raytheon Company System and method for input data fault recovery in a massively parallel real time computing system
JP6885193B2 (en) * 2017-05-12 2021-06-09 富士通株式会社 Parallel processing device, job management method, and job management program
CN109067567B (en) * 2018-07-12 2021-11-05 中铁磁浮科技(成都)有限公司 Network communication interruption diagnosis method
CN112527710B (en) * 2020-12-17 2023-07-25 西安邮电大学 JTAG data capturing and analyzing system
CN113726553A (en) * 2021-07-29 2021-11-30 浪潮电子信息产业股份有限公司 Node fault recovery method and device, electronic equipment and readable storage medium
CN113760592B (en) * 2021-07-30 2024-02-27 郑州云海信息技术有限公司 Node kernel detection method and related device
CN113917999A (en) * 2021-08-31 2022-01-11 湖南同有飞骥科技有限公司 Control panel redundancy switching and recovering method and device
CN115658368B (en) * 2022-11-11 2023-03-28 北京奥星贝斯科技有限公司 Fault processing method and device, storage medium and electronic equipment

Citations (1)

* Cited by examiner, † Cited by third party
Publication number Priority date Publication date Assignee Title
US6711700B2 (en) * 2001-04-23 2004-03-23 International Business Machines Corporation Method and apparatus to monitor the run state of a multi-partitioned computer system

Family Cites Families (14)

* Cited by examiner, † Cited by third party
Publication number Priority date Publication date Assignee Title
US6134655A (en) * 1992-05-13 2000-10-17 Comverge Technologies, Inc. Method and apparatus for initializing a microprocessor to insure fault-free operation
US5513319A (en) * 1993-07-02 1996-04-30 Dell Usa, L.P. Watchdog timer for computer system reset
US6530047B1 (en) * 1999-10-01 2003-03-04 Stmicroelectronics Limited System and method for communicating with an integrated circuit
US6629257B1 (en) * 2000-08-31 2003-09-30 Hewlett-Packard Development Company, L.P. System and method to automatically reset and initialize a clocking subsystem with reset signaling technique
US6678840B1 (en) * 2000-08-31 2004-01-13 Hewlett-Packard Development Company, Lp. Fault containment and error recovery in a scalable multiprocessor
US20020065646A1 (en) * 2000-09-11 2002-05-30 Waldie Arthur H. Embedded debug system using an auxiliary instruction queue
US7555566B2 (en) * 2001-02-24 2009-06-30 International Business Machines Corporation Massively parallel supercomputer
US7421478B1 (en) * 2002-03-07 2008-09-02 Cisco Technology, Inc. Method and apparatus for exchanging heartbeat messages and configuration information between nodes operating in a master-slave configuration
EP1351145A1 (en) * 2002-04-04 2003-10-08 Hewlett-Packard Company Computer failure recovery and notification system
JP3640187B2 (en) 2002-07-29 2005-04-20 日本電気株式会社 Fault processing method for multiprocessor system, multiprocessor system and node
US7191372B1 (en) * 2004-08-27 2007-03-13 Xilinx, Inc. Integrated data download
US7360253B2 (en) * 2004-12-23 2008-04-15 Microsoft Corporation System and method to lock TPM always ‘on’ using a monitor
KR100687739B1 (en) * 2005-03-29 2007-02-27 한국전자통신연구원 Method for monitoring link performance and diagnosing active status of link for Ethernet Passive Optical Network
US7523350B2 (en) * 2005-04-01 2009-04-21 Dot Hill Systems Corporation Timer-based apparatus and method for fault-tolerant booting of a storage controller

Patent Citations (1)

* Cited by examiner, † Cited by third party
Publication number Priority date Publication date Assignee Title
US6711700B2 (en) * 2001-04-23 2004-03-23 International Business Machines Corporation Method and apparatus to monitor the run state of a multi-partitioned computer system

Non-Patent Citations (3)

* Cited by examiner, † Cited by third party
Title
A. Gara et al. Overview of the Blue Gene/L system architecture. 《IBM Journal of Research and Development》. International Business Machines Corporation, 2005, Vol. 49 (No. 2/3), 195-212. *
R.A. Haring et al. Blue Gene/L compute chip: control, test and bring-up infrastructure. 《IBM Journal of Research and Development》. International Business Machines Corporation, 2005, Vol. 49 (No. 2/3), 289-291. *
R.A. Haring et al. Blue Gene/L compute chip: control, test and bring-up infrastructure. 《IBM Journal of Research and Development》. International Business Machines Corporation

Also Published As

Publication number Publication date
US20080189573A1 (en) 2008-08-07
EP2115588B1 (en) 2015-06-10
CN101589370A (en) 2009-11-25
WO2008092952A3 (en) 2008-11-20
WO2008092952A2 (en) 2008-08-07
EP2115588A2 (en) 2009-11-11
US7631169B2 (en) 2009-12-08
KR101081092B1 (en) 2011-11-07
KR20090084897A (en) 2009-08-05

Similar Documents

Publication Publication Date Title
CN101589370B (en) A parallel computer system and fault recovery method therefor
US7620841B2 (en) Re-utilizing partially failed resources as network resources
US7644254B2 (en) Routing data packets with hint bit for each six orthogonal directions in three dimensional torus computer system set to avoid nodes in problem list
US9477564B2 (en) Method and apparatus for dynamic node healing in a multi-node environment
US7765385B2 (en) Fault recovery on a parallel computer system with a torus network
CN110807064B (en) Data recovery device in RAC distributed database cluster system
US11226753B2 (en) Adaptive namespaces for multipath redundancy in cluster based computing systems
CN105700907A (en) Leverage offload programming model for local checkpoints
CN101324877A (en) System and manufacture method of multi-node configuration of processor cards connected via processor fabrics
JP2004532442A5 (en)
JP2002041348A (en) Communication pass through shared system resource to provide communication with high availability, network file server and its method
CN111414268A (en) Fault processing method and device and server
US20040216003A1 (en) Mechanism for FRU fault isolation in distributed nodal environment
US20090083467A1 (en) Method and System for Handling Interrupts Within Computer System During Hardware Resource Migration
CN112199240A (en) Method for switching nodes during node failure and related equipment
US20100085871A1 (en) Resource leak recovery in a multi-node computer system
US7512836B2 (en) Fast backup of compute nodes in failing midplane by copying to nodes in backup midplane via link chips operating in pass through and normal modes in massively parallel computing system
CN100449494C (en) State tracking and recovering method and system in multi-processing computer system
CN105446818B (en) A kind of method of business processing, relevant apparatus and system
US8537662B2 (en) Global detection of resource leaks in a multi-node computer system
CN100472457C (en) Method and system to recover from control block hangs in a heterogenous multiprocessor environment
JP5832408B2 (en) Virtual computer system and control method thereof
CN117527564A (en) Distributed cluster deployment method, device, equipment and readable storage medium
CA2251455A1 (en) Computing system having fault containment

Legal Events

Date Code Title Description
C06 Publication
PB01 Publication
C10 Entry into substantive examination
SE01 Entry into force of request for substantive examination
C14 Grant of patent or utility model
GR01 Patent grant