CN101589370B - A parallel computer system and fault recovery method therefor - Google Patents
A parallel computer system and fault recovery method therefor
- Publication number
- CN101589370B (application CN2008800032537A / CN200880003253A)
- Authority
- CN
- China
- Prior art keywords
- node
- computing node
- computing
- hardware
- network
- Prior art date
- Legal status (The legal status is an assumption and is not a legal conclusion. Google has not performed a legal analysis and makes no representation as to the accuracy of the status listed.)
- Active
Images
Classifications
-
- G—PHYSICS
- G06—COMPUTING; CALCULATING OR COUNTING
- G06F—ELECTRIC DIGITAL DATA PROCESSING
- G06F11/00—Error detection; Error correction; Monitoring
- G06F11/22—Detection or location of defective computer hardware by testing during standby operation or during idle time, e.g. start-up testing
- G06F11/26—Functional testing
- G06F11/267—Reconfiguring circuits for testing, e.g. LSSD, partitioning
-
- G—PHYSICS
- G06—COMPUTING; CALCULATING OR COUNTING
- G06F—ELECTRIC DIGITAL DATA PROCESSING
- G06F13/00—Interconnection of, or transfer of information or other signals between, memories, input/output devices or central processing units
-
- G—PHYSICS
- G06—COMPUTING; CALCULATING OR COUNTING
- G06F—ELECTRIC DIGITAL DATA PROCESSING
- G06F1/00—Details not covered by groups G06F3/00 - G06F13/00 and G06F21/00
- G06F1/24—Resetting means
-
- G—PHYSICS
- G06—COMPUTING; CALCULATING OR COUNTING
- G06F—ELECTRIC DIGITAL DATA PROCESSING
- G06F11/00—Error detection; Error correction; Monitoring
- G06F11/07—Responding to the occurrence of a fault, e.g. fault tolerance
- G06F11/0703—Error or fault processing not based on redundancy, i.e. by taking additional measures to deal with the error or fault not making use of redundancy in operation, in hardware, or in data representation
- G06F11/0706—Error or fault processing not based on redundancy, i.e. by taking additional measures to deal with the error or fault not making use of redundancy in operation, in hardware, or in data representation the processing taking place on a specific hardware platform or in a specific software environment
- G06F11/0721—Error or fault processing not based on redundancy, i.e. by taking additional measures to deal with the error or fault not making use of redundancy in operation, in hardware, or in data representation the processing taking place on a specific hardware platform or in a specific software environment within a central processing unit [CPU]
- G06F11/0724—Error or fault processing not based on redundancy, i.e. by taking additional measures to deal with the error or fault not making use of redundancy in operation, in hardware, or in data representation the processing taking place on a specific hardware platform or in a specific software environment within a central processing unit [CPU] in a multiprocessor or a multi-core unit
-
- G—PHYSICS
- G06—COMPUTING; CALCULATING OR COUNTING
- G06F—ELECTRIC DIGITAL DATA PROCESSING
- G06F11/00—Error detection; Error correction; Monitoring
- G06F11/07—Responding to the occurrence of a fault, e.g. fault tolerance
- G06F11/0703—Error or fault processing not based on redundancy, i.e. by taking additional measures to deal with the error or fault not making use of redundancy in operation, in hardware, or in data representation
- G06F11/0793—Remedial or corrective actions
-
- G—PHYSICS
- G06—COMPUTING; CALCULATING OR COUNTING
- G06F—ELECTRIC DIGITAL DATA PROCESSING
- G06F11/00—Error detection; Error correction; Monitoring
- G06F11/07—Responding to the occurrence of a fault, e.g. fault tolerance
- G06F11/14—Error detection or correction of the data by redundancy in operation
- G06F11/1402—Saving, restoring, recovering or retrying
- G06F11/1415—Saving, restoring, recovering or retrying at system level
- G06F11/1441—Resetting or repowering
-
- G—PHYSICS
- G06—COMPUTING; CALCULATING OR COUNTING
- G06F—ELECTRIC DIGITAL DATA PROCESSING
- G06F11/00—Error detection; Error correction; Monitoring
- G06F11/07—Responding to the occurrence of a fault, e.g. fault tolerance
- G06F11/16—Error detection or correction of the data by redundancy in hardware
-
- G—PHYSICS
- G06—COMPUTING; CALCULATING OR COUNTING
- G06F—ELECTRIC DIGITAL DATA PROCESSING
- G06F15/00—Digital computers in general; Data processing equipment in general
Landscapes
- Engineering & Computer Science (AREA)
- Theoretical Computer Science (AREA)
- General Engineering & Computer Science (AREA)
- Physics & Mathematics (AREA)
- General Physics & Mathematics (AREA)
- Quality & Reliability (AREA)
- Computer Hardware Design (AREA)
- Debugging And Monitoring (AREA)
- Hardware Redundancy (AREA)
- Multi Processors (AREA)
Abstract
A method and apparatus for fault recovery on a parallel computer system from a soft failure without ending an executing job on a partition of nodes. In preferred embodiments, a failed hardware recovery mechanism on a service node uses a heartbeat monitor to determine when a node failure occurs. Where possible, the failed node is reset and re-loaded with software without ending the software job being executed by the partition containing the failed node.
Description
Technical Field
The present invention relates generally to fault recovery on parallel computer systems and, more particularly, to fault recovery on a massively parallel supercomputer that handles node failures without ending an executing job.
Background Art
Supercomputers continue to evolve to tackle increasingly complex computational tasks. These machines are particularly useful to scientists working on high-performance computing (HPC) applications, including life sciences, financial modeling, fluid dynamics, quantum chemistry, molecular dynamics, astronomy and space research, and climate modeling. Supercomputer developers have focused on massively parallel computer architectures to meet these ever-increasing demands for computational power.
One such massively parallel computer developed by International Business Machines Corporation (IBM) is the Blue Gene system. Blue Gene is a scalable system with a maximum of 65,536 compute nodes. Each node consists of a single ASIC (application-specific integrated circuit) and memory, and typically has 512 megabytes or 1 gigabyte of local memory. The full computer is housed in 64 densely packed racks or cabinets located at a common site and linked together by several networks. Each rack holds 32 node boards, each node board carries 32 nodes, and each node has two processors.
The 65,536 compute nodes and 1,024 I/O processors of the Blue Gene/L supercomputer are arranged into both a logical tree network and a logical three-dimensional torus network. The logical tree network is a logical network over a collective network topology. Blue Gene/L can be described as a compute-node core with an I/O-node surface, where each I/O node handles the input and output functions of 64 compute nodes. The I/O nodes have no local storage. They are connected to the compute nodes through the logical tree network and also have functional wide-area network capability through their built-in gigabit Ethernet. Nodes can be assigned to node partitions so that a single application or job can execute on the group of Blue Gene/L nodes in a partition.
A soft fault in a computer system is an error or failure that is not caused by a repeating hardware failure or hard fault. Random events such as alpha particles and noise can cause soft faults. In most computer systems such soft faults are rare and can be handled in conventional ways. In a massively parallel computer system like Blue Gene/L, the complexity of the system and the sheer number of compute nodes greatly increase the incidence of soft and hard faults. Moreover, in the prior art, a fault in a single node can render an entire partition of the computer system unavailable, or force a job executing on the partition to be ended or restarted.
Because system downtime and restarted jobs waste valuable system resources, without a more efficient way to recover from system failures caused by soft faults, parallel computer systems will continue to suffer from inefficient hardware utilization and unnecessary downtime.
Summary of the Invention
According to the preferred embodiments, a method and apparatus are described for fault recovery from a single-node failure caused by a soft fault on a parallel computer system, without ending the job executing on the node partition. In the preferred embodiments, a failed hardware recovery mechanism on a service node uses a heartbeat monitor to determine when a node failure occurs. Where possible, the failed node is reset and reloaded with software without ending the software job being executed by the partition that contains the failed node.
The disclosed embodiments are directed to the Blue Gene architecture but can be implemented on any parallel computer system with multiple processors arranged in a network structure. The preferred embodiments are particularly advantageous for massively parallel computer systems.
The foregoing and other features and advantages of the invention will be apparent from the following more particular description of the preferred embodiments of the invention, as illustrated in the accompanying drawings.
Brief Description of the Drawings
The preferred embodiments of the present invention will hereinafter be described in conjunction with the appended drawings, where like designations denote like elements, and:
Fig. 1 is a block diagram of a massively parallel computer system according to preferred embodiments;
Fig. 2 is a block diagram of a compute node in a massively parallel computer system according to preferred embodiments;
Fig. 3 is a block diagram of node reset hardware according to preferred embodiments;
Fig. 4 is a method flow diagram for setting a heartbeat timer on a compute node of a massively parallel computer system according to preferred embodiments; and
Fig. 5 is a method flow diagram for fault recovery of a failed node on a massively parallel computer system according to preferred embodiments.
Detailed Description
The present invention relates to an apparatus and method for recovering from a soft fault on a node of a parallel computer system without ending the job executing on the node partition that contains the failed node. The preferred embodiments will be described with respect to the Blue Gene/L massively parallel computer developed by International Business Machines Corporation (IBM).
Fig. 1 shows a block diagram that represents a massively parallel computer system 100 such as the Blue Gene/L computer system. The Blue Gene/L system is a scalable system with a maximum of 65,536 compute nodes. Each node 110 has an application-specific integrated circuit (ASIC) 112, also called the Blue Gene/L compute chip 112. The compute chip incorporates two processors or central processing units (CPUs) and is mounted on a node daughter card 114. A node typically has 512 megabytes of local memory. A node board 120 accommodates 32 node daughter cards 114, each having a node 110. Thus, each node board has 32 nodes, with two processors per node and the memory associated with each processor. A rack 130 is a housing that contains 32 node boards 120. Each node board 120 is connected into a midplane printed circuit board 132 with a midplane connector 134; the midplane 132 is inside the rack and is not shown in Fig. 1. The full Blue Gene/L computer system would be housed in 64 racks 130 or cabinets with 32 node boards 120 in each rack. The full system would then have 65,536 nodes and 131,072 CPUs (64 racks x 32 node boards x 32 nodes x 2 CPUs).
The Blue Gene/L computer system structure can be described as a compute node core with an I/O node surface, where communication to 1,024 compute nodes 110 is handled by each I/O node having an I/O processor 170 connected to the service node 140. The I/O nodes have no local storage. The I/O nodes are connected to the compute nodes through the logical tree network and also have functional wide-area network capability through a gigabit Ethernet network (not shown). The gigabit Ethernet network is connected to an I/O processor (or Blue Gene/L link chip) 170 located on a node board 120 that handles communication from the service node 160 to a number of nodes. The Blue Gene/L system has one or more I/O processors 170 on an I/O board (not shown) connected to the node board 120. The I/O processors can be configured to communicate with 8, 32, or 64 nodes. The service node uses the gigabit network to control connectivity by communicating to link cards on the compute nodes. The connections to the I/O nodes are similar to the connections to the compute nodes, except that the I/O nodes are not connected to the torus network.
Again referring to Fig. 1, the computer system 100 includes a service node 140 that handles loading the nodes with software and controls the operation of the whole system. The service node 140 is typically a minicomputer system with a console (not shown), such as an IBM pSeries server running Linux. The service node 140 is connected to the racks 130 of compute nodes 110 with a control system network 150. The control system network provides the control, test, and bring-up infrastructure for the Blue Gene/L system. The control system network 150 includes various network interfaces that provide the necessary communication for the massively parallel computer system. The network interfaces are described further below.
The Blue Gene/L supercomputer communicates over several additional communication networks. The 65,536 compute nodes are arranged into both a logical tree network and a physical three-dimensional torus network. The logical tree network connects the compute nodes in a binary tree structure so that each node communicates with a parent and two children. The torus network logically connects the compute nodes in a lattice-like structure that resembles a 3D grid and allows each compute node to communicate with its closest 6 neighbors in a section of the computer. Other communication networks connected to the node include a barrier network. The barrier network uses the barrier communication system to implement software barriers that synchronize similar processes on the compute nodes so they move to a different processing stage upon completing some task. There is also a global interrupt connection to each node.
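For illustration only, the sketch below shows how the six nearest neighbors of a node on such a 3D torus can be computed from its (x, y, z) coordinates with wrap-around at the edges; the coordinate type, function name, and example dimensions are assumptions and are not taken from the patent.

```c
/* Illustrative sketch (not from the patent): the six torus neighbors of a
 * compute node at coordinates (x, y, z) on a torus of dimensions X x Y x Z. */
#include <stdio.h>

typedef struct { int x, y, z; } coord_t;

static void torus_neighbors(coord_t n, int X, int Y, int Z, coord_t out[6])
{
    out[0] = (coord_t){ (n.x + 1) % X, n.y, n.z };      /* +x */
    out[1] = (coord_t){ (n.x - 1 + X) % X, n.y, n.z };  /* -x */
    out[2] = (coord_t){ n.x, (n.y + 1) % Y, n.z };      /* +y */
    out[3] = (coord_t){ n.x, (n.y - 1 + Y) % Y, n.z };  /* -y */
    out[4] = (coord_t){ n.x, n.y, (n.z + 1) % Z };      /* +z */
    out[5] = (coord_t){ n.x, n.y, (n.z - 1 + Z) % Z };  /* -z */
}

int main(void)
{
    coord_t nbr[6];
    torus_neighbors((coord_t){ 0, 0, 0 }, 64, 32, 32, nbr);  /* example dimensions */
    for (int i = 0; i < 6; i++)
        printf("neighbor %d: (%d,%d,%d)\n", i, nbr[i].x, nbr[i].y, nbr[i].z);
    return 0;
}
```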
Refer again to Fig. 1, service node 140 comprises failed hardware recovery mechanism 142.Failed hardware recovery mechanism is included in the software in the service node 140, and it can be operated to recover from node failure according to the preferred embodiment in this requirement protection.Failed hardware recovery mechanism uses heart beat monitor 144 to confirm when node breaks down.Further describe as following, heart beat monitor reads and removes then the heart beat flag that places storer on the node.When heartbeat no longer exists; Mean heart beat flag is not set; Then node break down and as below further describe, failed hardware recovery mechanism is attempted recovery nodes, and does not finish anyly comprising the operation of carrying out on the partition of nodes of this malfunctioning node.
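A minimal sketch of this check-and-clear cycle is shown below; the accessor functions, flag storage, and node-count constant are stand-ins for the control-system calls and are assumptions for illustration, not the service node's actual implementation.

```c
/* Sketch of one heartbeat-monitor polling pass over all compute nodes. */
#include <stdbool.h>
#include <stdio.h>

#define NUM_NODES 65536

/* Stand-in accessors: in the real system these would reach each node's
 * SRAM mailbox over the control network (JTAG over Ethernet). */
static bool heartbeat_flags[NUM_NODES];

static bool read_heartbeat_flag(int node)  { return heartbeat_flags[node]; }
static void clear_heartbeat_flag(int node) { heartbeat_flags[node] = false; }

static void recover_failed_node(int node)
{
    /* Entry point for the recovery flow of Fig. 5 (sketched later). */
    printf("node %d: heartbeat missing, starting recovery\n", node);
}

/* One pass: a set flag means the node is alive and is cleared so the node
 * must set it again before the next pass; a missing flag triggers recovery. */
void heartbeat_monitor_pass(void)
{
    for (int node = 0; node < NUM_NODES; node++) {
        if (read_heartbeat_flag(node))
            clear_heartbeat_flag(node);
        else
            recover_failed_node(node);
    }
}
```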
Fig. 2 shows a block diagram of a compute node 110 in the Blue Gene/L computer system according to the prior art. The compute node 110 has a node compute chip 112 with two processing units 210A, 210B. Each processing unit 210 has a processing core 212 with a level-one cache (L1 cache) 214 and a level-two cache (L2 cache) 216. The processing units 210 are connected to a level-three cache (L3 cache) 220 and to an SRAM memory bank 230. Data from the L3 cache 220 is loaded into a bank of DDR SDRAM 240 by a DDR controller 250.
Again referring to Fig. 2, the SRAM memory 230 is connected to a JTAG interface 260 that allows the compute chip 112 to communicate with an Ido chip 180. The service node communicates with the compute node through the Ido chip 180 over an Ethernet link that is part of the control system network 150 (described above with reference to Fig. 1). In the Blue Gene/L system there is an Ido chip on each node board 120, and others on boards in each midplane 132 (Fig. 1). The Ido chips receive commands from the service node using raw UDP packets over a trusted private 100 Mb/s Ethernet control network. The Ido chips support a variety of serial protocols for communicating with the compute nodes. The JTAG protocol is used for reading and writing to any address of the SRAM 230 in the compute node 110 from the service node 140 (Fig. 1), and is used for the system initialization and boot process. The JTAG interface 260 also communicates with a configuration register 270 that, as described further below, holds reset bits used to reset various portions of the node compute chip 112.
Again referring to Fig. 2, the compute node 110 further includes a timer 280 with an alarm time 285 that can be set under software control. In the preferred embodiments herein, the timer is used to generate a heartbeat that notifies the heartbeat monitor 144 of the service node 140 (Fig. 1) that the node is working properly. The node receives the alarm time 285 from the service node. The timer 280 is set to alarm periodically with a period equal to the alarm time 285. When the timer detects that the alarm time 285 has passed, and if the node is working properly, a heartbeat flag 236 is set in a mailbox 235 of the SRAM 230. The heartbeat monitor 144 of the service node 140 periodically checks for the presence of the heartbeat flag 236 on all the nodes and, as described in detail below, takes recovery action on a failed node when the heartbeat flag is not present.
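A node-side sketch of this heartbeat generation follows; the simulated mailbox word, the self-check, and the timer API are assumptions standing in for the hardware timer 280 and the SRAM mailbox 235, not code from the patent.

```c
/* Sketch: periodic timer alarm sets the heartbeat flag when the node is healthy. */
#include <stdint.h>
#include <stdbool.h>

/* Simulated SRAM mailbox word standing in for heartbeat flag 236 in mailbox 235;
 * on real hardware this would be a fixed SRAM address read by the service node. */
static volatile uint32_t heartbeat_flag;

static uint32_t alarm_time;                  /* alarm time 285, received from the service node */

/* Assumed local self-check and timer API, for illustration only. */
static bool node_is_healthy(void)            { return true; }
static void timer_set_alarm(uint32_t period) { (void)period; /* arm hardware timer 280 */ }

void heartbeat_init(uint32_t alarm_time_from_service_node)
{
    alarm_time = alarm_time_from_service_node;
    timer_set_alarm(alarm_time);             /* timer 280 alarms every alarm_time */
}

/* Called each time timer 280 signals that the alarm time has elapsed. */
void timer_alarm_handler(void)
{
    if (node_is_healthy())
        heartbeat_flag = 1;                  /* set heartbeat flag for the monitor to read */
    timer_set_alarm(alarm_time);             /* rearm for the next interval */
}
```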
Fig. 3 is a block diagram that illustrates the reset capability of the compute chip 112. The compute chip 112 includes a number of individual resets that were intended to enhance the diagnostic capability of the compute chip 112. In the preferred embodiments, these resets are used for the fault recovery described herein. For the purposes of reset, the hardware on the compute chip is broadly divided into the ASIC hardware 310, the network hardware 290, and the DDR controller 250. The ASIC hardware 310 is the remaining ASIC hardware that is not included as part of the network hardware 290 or the DDR controller 250. The configuration register 270 holds reset bits (not shown) for resetting the hardware as described above. The reset bits in the configuration register 270 drive the reset outputs shown in Fig. 3: the ASIC hardware 310 is reset by the ASIC hardware reset 312, the network hardware 290 is reset by the network hardware reset 314, and the DDR controller is reset by the DDR reset 316. The resets provide the typical reset feature of placing the associated hardware in a known condition for initialization.
In the preferred embodiments herein, the multiple resets on the compute chip 112 are used to recover from a soft fault without ending the application or job executing on the partition of the parallel computer system. Application software running on the partition with the failed node may be suspended during the recovery, but if the recovery is successful the application can continue without being restarted after the node recovery. In the preferred embodiments, a timer is set to set a heartbeat flag in the mailbox of each node at a predetermined interval. A heartbeat monitor in the service node monitors and resets the heartbeat flag in each node to determine whether a node failure has occurred. If the heartbeat flag is not present on a node, the failed hardware recovery mechanism on the service node attempts to recover the node without resetting the network hardware, so as not to disturb the other nodes in the network that are using the network hardware on the failed node. Resetting the network hardware would require restarting the application executing on the partition, because it would interrupt the flow of information through the node between adjacent nodes in the torus and logical tree networks. It should be noted that the fault recovery described here is not directed to faults associated with the network hardware; a network hardware fault would be indicated by multiple faults on interconnected nodes and would require other means not described here.
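The sketch below illustrates such a selective reset. The bit positions and the simulated register are invented for illustration; the patent only states that the configuration register 270 holds separate reset bits for the ASIC hardware, the network hardware, and the DDR controller.

```c
/* Sketch of a selective reset that leaves the network hardware untouched. */
#include <stdint.h>

/* Simulated configuration register 270; real bit assignments are not given
 * in the patent, so these positions are assumptions. */
static volatile uint32_t config_reg_270;

#define RESET_ASIC_HW    (1u << 0)   /* drives ASIC hardware reset 312 */
#define RESET_NETWORK_HW (1u << 1)   /* drives network hardware reset 314 (never asserted here) */
#define RESET_DDR        (1u << 2)   /* drives DDR reset 316 */

/* Recover the node while leaving the network hardware alone so that traffic
 * from neighboring torus and tree nodes keeps flowing through this node. */
void reset_node_except_network(void)
{
    config_reg_270 |= (RESET_ASIC_HW | RESET_DDR);    /* assert the selected resets */
    config_reg_270 &= ~(RESET_ASIC_HW | RESET_DDR);   /* release; the network bit is never set */
}
```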
After detecting the lack of a heartbeat, if the failed hardware recovery mechanism can successfully load diagnostic code into the SRAM and the DDR controller and memory are operational, then the DDR controller is reset and a functional software kernel is reloaded into the node. The node can then continue without resetting the entire ASIC. If the failed hardware recovery mechanism fails to load the diagnostic code into the SRAM, then the ASIC reset is used to reset the ASIC except for the network hardware, the DDR is reset, and the functional software kernel is reloaded into the node. This process allows the minimum amount of the node to be reset when recovering from the fault. The compute node can then continue to operate, and the application executing on the remaining nodes of the partition can continue to run without restarting the application from the beginning.
Fig. 4 shows a method 400 for setting up a heartbeat on a compute node for fault recovery according to the embodiments herein. The method describes operations performed on the compute node to provide a heartbeat to the heartbeat monitor in the service node, but the method could be initiated by the service node or by other parts of the compute node's boot process. The compute node receives the heartbeat time from the control system on the service node (step 410) and uses the heartbeat time to set a timer (step 420). Each time the timer in the compute node detects the heartbeat time, it sets the heartbeat flag in the SRAM mailbox so that the heartbeat monitor can check the compute node's heartbeat (step 430). The method is then done.
Fig. 5 shows a method 500 for fault recovery on a parallel computer system according to the embodiments herein. As described above with reference to Fig. 1, the operations in the method are performed by the failed hardware recovery mechanism 142 and the heartbeat monitor 144. The heartbeat monitor monitors the heartbeat of each node in the computer system by checking the heartbeat flag in each node as described above (step 510). If there is no failed node (step 520 = no), the method returns to step 510 and continues monitoring. If there is a failed node, as indicated by a missing heartbeat flag (step 520 = yes), the other nodes in the partition and the application software are notified that the node is unavailable (step 530). Next, an attempt is made to load diagnostic code into the SRAM of the failed node to check the operation of the node (step 540). If the load is not successful (step 550 = no), the ASIC except for the network hardware is reset (step 555), code is loaded into the SRAM to reset the DDR (step 560), and the special system kernel is then reloaded so the node can continue processing (step 565). If the load is successful (step 550 = yes), diagnostics are executed to check the DDR (step 570). If the DDR is good (step 575 = yes), an ASIC error is output to the service node (step 580), and the special system kernel is then reloaded so the node can continue processing (step 565). If there is a problem with the DDR (step 575 = no), code is loaded into the SRAM to reset the DDR (step 560), and the special system kernel is then reloaded so the node can continue processing (step 565). The method is then done.
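A compact sketch of this recovery flow follows, with the step numbers from the text in comments. The helper functions are stand-ins for the service-node control-system calls and are assumptions for illustration.

```c
/* Sketch of the per-node recovery flow of Fig. 5 (steps 530-580). */
#include <stdbool.h>
#include <stdio.h>

/* Stand-in helpers; names are assumptions, step numbers follow the text. */
static void notify_partition_node_unavailable(int n) { printf("node %d unavailable\n", n); }  /* 530 */
static bool load_diagnostics_into_sram(int n)        { (void)n; return true; }                /* 540/550 */
static bool ddr_diagnostics_pass(int n)              { (void)n; return true; }                /* 570/575 */
static void reset_asic_except_network(int n)         { (void)n; }                             /* 555 */
static void load_code_and_reset_ddr(int n)           { (void)n; }                             /* 560 */
static void report_asic_error(int n)                 { printf("node %d: ASIC error\n", n); }  /* 580 */
static void reload_system_kernel(int n)              { (void)n; }                             /* 565 */

/* Recovery for one failed node detected by the heartbeat monitor. */
void recover_failed_node(int node)
{
    notify_partition_node_unavailable(node);       /* step 530 */

    if (!load_diagnostics_into_sram(node)) {       /* step 550 = no */
        reset_asic_except_network(node);           /* step 555 */
        load_code_and_reset_ddr(node);             /* step 560 */
    } else if (ddr_diagnostics_pass(node)) {       /* step 575 = yes: DDR is good */
        report_asic_error(node);                   /* step 580 */
    } else {                                       /* step 575 = no: DDR has a problem */
        load_code_and_reset_ddr(node);             /* step 560 */
    }
    reload_system_kernel(node);                    /* step 565 on every path */
}
```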
As described above, the embodiments provide a method and apparatus for recovering from a soft fault on a node of a parallel computer system without ending the job executing on a node partition of the massively parallel supercomputer system. The embodiments herein allow the service node to reset the non-network portions of a failed node without affecting the other nodes in the partition, thereby reducing system downtime and improving the efficiency of the computer system.
One skilled in the art will appreciate that many variations are possible within the scope of the present invention. Thus, while the invention has been particularly shown and described with reference to preferred embodiments thereof, it will be understood by those skilled in the art that these and other changes in form and details may be made therein without departing from the spirit and scope of the invention.
Claims (9)
1. A parallel computer system comprising:
a plurality of compute nodes, each compute node having reset hardware for resetting a network hardware portion of the compute node, said reset hardware being separate from reset hardware for resetting the remainder of the compute node;
a service node for controlling the operation of the compute nodes over a network, the service node including a failed hardware recovery mechanism that detects a failed compute node;
and wherein the failed hardware recovery mechanism resets the remainder of the failed compute node without resetting the network hardware portion, in order to recover from the fault in the failed compute node.
2. The parallel computer system of claim 1, wherein each of the plurality of compute nodes further comprises a timer for setting a heartbeat flag in a memory of the compute node at a predetermined interval to indicate that the compute node is working properly.
3. The parallel computer system of claim 2, wherein the failed hardware recovery mechanism further comprises a heartbeat monitor that monitors the heartbeat flags of the compute nodes to detect a failed compute node among the plurality of nodes by the absence of a set heartbeat flag.
4. The parallel computer system of claim 3, wherein the fault on the failed compute node is detected by the heartbeat monitor.
5. The parallel computer system of claim 3, wherein the heartbeat flag is stored in a static memory on the compute node, and the failed hardware recovery mechanism reads the static memory over an Ethernet network that accesses the static memory on the compute node via a JTAG interface, and wherein the parallel computer system is a massively parallel computer system having a large number of compute nodes housed in a plurality of densely arranged computer racks.
6. The parallel computer system of claim 1, wherein the remainder of the compute node is a DDR memory controller of an ASIC processor chip, and the parallel computer system is a massively parallel computer system having a large number of compute nodes housed in a plurality of densely arranged computer racks.
7. A computer-implemented method for operating a parallel computer system having a plurality of compute nodes connected to a service node by a control system network, the method comprising the steps of:
a) providing a heartbeat on each node;
b) monitoring the heartbeat of each compute node in the service node of the computer system; and
c) attempting to recover from a fault in a compute node indicated by a missing heartbeat from the compute node, without ending an application running on a node partition that contains the compute node with the fault,
wherein the step of attempting to recover from the fault in the compute node comprises the steps of:
g) attempting to load diagnostic code into the compute node; and
h) if the load is unsuccessful: resetting a portion of the compute node comprising all sections of the compute node except a network hardware section; resetting a memory controller in the compute node; and loading a system kernel into the compute node.
8. The computer-implemented method of claim 7, wherein the step of monitoring the compute nodes comprises the steps of:
d) receiving, at the compute node, a heartbeat time from the service node;
e) setting a timer using the heartbeat time; and
f) detecting that the timer has passed the heartbeat time and setting a heartbeat flag in a memory of the compute node.
9. The computer-implemented method of claim 7, wherein the step of attempting to recover from the fault in the compute node further comprises the steps of:
i) if the load is successful, executing the diagnostic code to check whether the memory controller is working correctly; and
j) if the memory controller is not working correctly: loading code into the compute node to reset the memory controller; resetting the memory controller; and loading the compute node with the system kernel.
Applications Claiming Priority (3)
Application Number | Priority Date | Filing Date | Title |
---|---|---|---|
US11/670,803 US7631169B2 (en) | 2007-02-02 | 2007-02-02 | Fault recovery on a massively parallel computer system to handle node failures without ending an executing job |
US11/670,803 | 2007-02-02 | ||
PCT/EP2008/051266 WO2008092952A2 (en) | 2007-02-02 | 2008-02-01 | Fault recovery on a massively parallel computer system to handle node failures without ending an executing job |
Publications (2)
Publication Number | Publication Date |
---|---|
CN101589370A CN101589370A (en) | 2009-11-25 |
CN101589370B true CN101589370B (en) | 2012-05-02 |
Family
ID=39671965
Family Applications (1)
Application Number | Title | Priority Date | Filing Date |
---|---|---|---|
CN2008800032537A Active CN101589370B (en) | 2007-02-02 | 2008-02-01 | A parallel computer system and fault recovery method therefor |
Country Status (5)
Country | Link |
---|---|
US (1) | US7631169B2 (en) |
EP (1) | EP2115588B1 (en) |
KR (1) | KR101081092B1 (en) |
CN (1) | CN101589370B (en) |
WO (1) | WO2008092952A2 (en) |
Families Citing this family (20)
Publication number | Priority date | Publication date | Assignee | Title |
---|---|---|---|---|
US7644254B2 (en) | 2007-04-18 | 2010-01-05 | International Business Machines Corporation | Routing data packets with hint bit for each six orthogonal directions in three dimensional torus computer system set to avoid nodes in problem list |
US9880970B2 (en) * | 2007-10-03 | 2018-01-30 | William L. Bain | Method for implementing highly available data parallel operations on a computational grid |
US20120171111A1 (en) * | 2009-09-15 | 2012-07-05 | Osaka University | Process for production of oxidation reaction product of aromatic compound |
JP2012230597A (en) * | 2011-04-27 | 2012-11-22 | Fujitsu Ltd | Processor, control device and processing method |
CN103034507A (en) * | 2011-10-09 | 2013-04-10 | 上海共联通信信息发展有限公司 | Chip program repair method of telephone stored program control exchange |
US8903893B2 (en) * | 2011-11-15 | 2014-12-02 | International Business Machines Corporation | Diagnostic heartbeating in a distributed data processing environment |
WO2013078558A1 (en) * | 2011-11-28 | 2013-06-06 | Jigsee Inc. | Method of determining transport parameters for efficient data transport across a network |
US9389657B2 (en) * | 2011-12-29 | 2016-07-12 | Intel Corporation | Reset of multi-core processing system |
CN102902615B (en) * | 2012-09-18 | 2016-12-21 | 曙光信息产业(北京)有限公司 | A kind of Lustre parallel file system false alarm method and system thereof |
CN103136086A (en) * | 2013-03-06 | 2013-06-05 | 中国人民解放军国防科学技术大学 | Plug-in frame tight-coupling monitoring and management system for supper parallel computers |
US10102032B2 (en) | 2014-05-29 | 2018-10-16 | Raytheon Company | Fast transitions for massively parallel computing applications |
US10007586B2 (en) | 2016-01-08 | 2018-06-26 | Microsoft Technology Licensing, Llc | Deferred server recovery in computing systems |
US10078559B2 (en) * | 2016-05-27 | 2018-09-18 | Raytheon Company | System and method for input data fault recovery in a massively parallel real time computing system |
JP6885193B2 (en) * | 2017-05-12 | 2021-06-09 | 富士通株式会社 | Parallel processing device, job management method, and job management program |
CN109067567B (en) * | 2018-07-12 | 2021-11-05 | 中铁磁浮科技(成都)有限公司 | Network communication interruption diagnosis method |
CN112527710B (en) * | 2020-12-17 | 2023-07-25 | 西安邮电大学 | JTAG data capturing and analyzing system |
CN113726553A (en) * | 2021-07-29 | 2021-11-30 | 浪潮电子信息产业股份有限公司 | Node fault recovery method and device, electronic equipment and readable storage medium |
CN113760592B (en) * | 2021-07-30 | 2024-02-27 | 郑州云海信息技术有限公司 | Node kernel detection method and related device |
CN113917999A (en) * | 2021-08-31 | 2022-01-11 | 湖南同有飞骥科技有限公司 | Control panel redundancy switching and recovering method and device |
CN115658368B (en) * | 2022-11-11 | 2023-03-28 | 北京奥星贝斯科技有限公司 | Fault processing method and device, storage medium and electronic equipment |
Citations (1)
Publication number | Priority date | Publication date | Assignee | Title |
---|---|---|---|---|
US6711700B2 (en) * | 2001-04-23 | 2004-03-23 | International Business Machines Corporation | Method and apparatus to monitor the run state of a multi-partitioned computer system |
Family Cites Families (14)
Publication number | Priority date | Publication date | Assignee | Title |
---|---|---|---|---|
US6134655A (en) * | 1992-05-13 | 2000-10-17 | Comverge Technologies, Inc. | Method and apparatus for initializing a microprocessor to insure fault-free operation |
US5513319A (en) * | 1993-07-02 | 1996-04-30 | Dell Usa, L.P. | Watchdog timer for computer system reset |
US6530047B1 (en) * | 1999-10-01 | 2003-03-04 | Stmicroelectronics Limited | System and method for communicating with an integrated circuit |
US6629257B1 (en) * | 2000-08-31 | 2003-09-30 | Hewlett-Packard Development Company, L.P. | System and method to automatically reset and initialize a clocking subsystem with reset signaling technique |
US6678840B1 (en) * | 2000-08-31 | 2004-01-13 | Hewlett-Packard Development Company, Lp. | Fault containment and error recovery in a scalable multiprocessor |
US20020065646A1 (en) * | 2000-09-11 | 2002-05-30 | Waldie Arthur H. | Embedded debug system using an auxiliary instruction queue |
US7555566B2 (en) * | 2001-02-24 | 2009-06-30 | International Business Machines Corporation | Massively parallel supercomputer |
US7421478B1 (en) * | 2002-03-07 | 2008-09-02 | Cisco Technology, Inc. | Method and apparatus for exchanging heartbeat messages and configuration information between nodes operating in a master-slave configuration |
EP1351145A1 (en) * | 2002-04-04 | 2003-10-08 | Hewlett-Packard Company | Computer failure recovery and notification system |
JP3640187B2 (en) | 2002-07-29 | 2005-04-20 | 日本電気株式会社 | Fault processing method for multiprocessor system, multiprocessor system and node |
US7191372B1 (en) * | 2004-08-27 | 2007-03-13 | Xilinx, Inc. | Integrated data download |
US7360253B2 (en) * | 2004-12-23 | 2008-04-15 | Microsoft Corporation | System and method to lock TPM always ‘on’ using a monitor |
KR100687739B1 (en) * | 2005-03-29 | 2007-02-27 | 한국전자통신연구원 | Method for monitoring link performance and diagnosing active status of link for Ethernet Passive Optical Network |
US7523350B2 (en) * | 2005-04-01 | 2009-04-21 | Dot Hill Systems Corporation | Timer-based apparatus and method for fault-tolerant booting of a storage controller |
- 2007-02-02: US application US11/670,803 filed, published as US7631169B2 (not active: Expired - Fee Related)
- 2008-02-01: PCT application PCT/EP2008/051266 filed, published as WO2008092952A2 (active: Application Filing)
- 2008-02-01: CN application CN2008800032537A filed, granted as CN101589370B (active)
- 2008-02-01: EP application EP08708573.4A filed, granted as EP2115588B1 (active)
- 2008-02-01: KR application KR1020097010832A filed, granted as KR101081092B1 (not active: IP Right Cessation)
Patent Citations (1)
Publication number | Priority date | Publication date | Assignee | Title |
---|---|---|---|---|
US6711700B2 (en) * | 2001-04-23 | 2004-03-23 | International Business Machines Corporation | Method and apparatus to monitor the run state of a multi-partitioned computer system |
Non-Patent Citations (3)
- A. Gara et al., "Overview of the Blue Gene/L system architecture," IBM Journal of Research and Development (International Business Machines Corporation), vol. 49, no. 2/3, pp. 195-212, 2005. *
- R. A. Haring et al., "Blue Gene/L compute chip: control, test and bring-up infrastructure," IBM Journal of Research and Development (International Business Machines Corporation), vol. 49, no. 2/3, pp. 289-291, 2005. *
- R. A. Haring et al., "Blue Gene/L compute chip: control, test and bring-up infrastructure," IBM Journal of Research and Development (International Business Machines Corporation).
Also Published As
Publication number | Publication date |
---|---|
US20080189573A1 (en) | 2008-08-07 |
EP2115588B1 (en) | 2015-06-10 |
CN101589370A (en) | 2009-11-25 |
WO2008092952A3 (en) | 2008-11-20 |
WO2008092952A2 (en) | 2008-08-07 |
EP2115588A2 (en) | 2009-11-11 |
US7631169B2 (en) | 2009-12-08 |
KR101081092B1 (en) | 2011-11-07 |
KR20090084897A (en) | 2009-08-05 |
Similar Documents
Publication | Publication Date | Title |
---|---|---|
CN101589370B (en) | A parallel computer system and fault recovery method therefor | |
US7620841B2 (en) | Re-utilizing partially failed resources as network resources | |
US7644254B2 (en) | Routing data packets with hint bit for each six orthogonal directions in three dimensional torus computer system set to avoid nodes in problem list | |
US9477564B2 (en) | Method and apparatus for dynamic node healing in a multi-node environment | |
US7765385B2 (en) | Fault recovery on a parallel computer system with a torus network | |
CN110807064B (en) | Data recovery device in RAC distributed database cluster system | |
US11226753B2 (en) | Adaptive namespaces for multipath redundancy in cluster based computing systems | |
CN105700907A (en) | Leverage offload programming model for local checkpoints | |
CN101324877A (en) | System and manufacture method of multi-node configuration of processor cards connected via processor fabrics | |
JP2004532442A5 (en) | ||
JP2002041348A (en) | Communication pass through shared system resource to provide communication with high availability, network file server and its method | |
CN111414268A (en) | Fault processing method and device and server | |
US20040216003A1 (en) | Mechanism for FRU fault isolation in distributed nodal environment | |
US20090083467A1 (en) | Method and System for Handling Interrupts Within Computer System During Hardware Resource Migration | |
CN112199240A (en) | Method for switching nodes during node failure and related equipment | |
US20100085871A1 (en) | Resource leak recovery in a multi-node computer system | |
US7512836B2 (en) | Fast backup of compute nodes in failing midplane by copying to nodes in backup midplane via link chips operating in pass through and normal modes in massively parallel computing system | |
CN100449494C (en) | State tracking and recovering method and system in multi-processing computer system | |
CN105446818B (en) | A kind of method of business processing, relevant apparatus and system | |
US8537662B2 (en) | Global detection of resource leaks in a multi-node computer system | |
CN100472457C (en) | Method and system to recover from control block hangs in a heterogenous multiprocessor environment | |
JP5832408B2 (en) | Virtual computer system and control method thereof | |
CN117527564A (en) | Distributed cluster deployment method, device, equipment and readable storage medium | |
CA2251455A1 (en) | Computing system having fault containment |
Legal Events
Date | Code | Title | Description |
---|---|---|---|
C06 | Publication | ||
PB01 | Publication | ||
C10 | Entry into substantive examination | ||
SE01 | Entry into force of request for substantive examination | ||
C14 | Grant of patent or utility model | ||
GR01 | Patent grant |