KR100411978B1

KR100411978B1 - Fault tolerant system and duplication method thereof

Info

Publication number: KR100411978B1
Application number: KR10-2001-0003513A
Authority: KR
Inventors: 조보영
Original assignee: (주) 로커스네트웍스
Priority date: 2001-01-22
Filing date: 2001-01-22
Publication date: 2003-12-24
Also published as: KR20020062483A

Abstract

The present invention relates to a fault-tolerant system and a duplication method for it. In particular, it provides a redundant system and method that can recover and continue service smoothly, without interruption, even when the host operating as active is switched over.

In addition to the redundant network, the invention provides a dedicated heartbeat line between the duplication guardian processes (DGP) running on the redundant servers, so that each monitors the state of the other. Automatic switchover can then occur when the host fails, when the network fails, when the database fails, or when process start-up failures occur at least a predetermined number of times. Manual switchover by an operator is also possible. The result is a fault-tolerant server system.

As a result, the invention can be applied to communication systems that require reliability, where it can provide uninterrupted communication service.

Description

Fault-tolerant system and duplication method {FAULT TOLERANT SYSTEM AND DUPLICATION METHOD THEREOF}

The present invention relates to a fault-tolerant system and a duplication method for it, and more particularly to a redundant system and method that can recover and continue service smoothly, without interruption, even when the host operating as active is switched over.

Data communication over networks is now widespread and is used throughout society, in finance and medicine, mobile communications, public institutions and elsewhere, and a variety of services and information are provided over networks. A network contains one or more servers and clients, and the network server manages and processes shared resources and data.

However, when a network failure occurs at the network server providing a service, or when a device that makes up the network server malfunctions, the resources and data shared on that server become unavailable, and the resulting service interruption causes problems.

Moreover, for a network server that must process a vast amount of data every second, a failure can cause data to be lost until the server is recovered and operating normally again.

For example, for the network server of a financial institution that provides financial services over a network, or for a server that manages the call records and billing of a telecommunications provider's customers, a service interruption caused by a network or server failure leads to enormous financial damage. There is therefore a need for a network server system and method that can preserve data safely and provide service without interruption even when a server fails.

To solve these problems, techniques have been proposed that use duplication with an active server and a standby server to keep providing service without interruption even when a network server fails.

Such server duplication systems are described in Korean Patent Application No. 10-2000-0014192, U.S. Patent No. 4,853,875, and elsewhere. The duplication technique disclosed in U.S. Patent No. 4,853,875, however, requires a broadband data transmission means in order to share the data stored in the active server's main memory with the data stored in the shared storage means, which is a technical difficulty. In addition, when a failure occurs, the switchover from the active server to the standby server is delayed and data is lost.

As a result, system operators have had to check and restore, item by item, the data lost during a server switchover. In the server duplication technique disclosed in Korean Patent Application No. 10-2000-0014192, meanwhile, the active and standby servers determine whether the system has failed by having their duplication guardian processes (DGP) exchange heartbeats. If all the hubs of the redundant network fail, however, the heartbeats cannot be exchanged.

Furthermore, with the server duplication technique disclosed in Korean Patent Application No. 10-2000-0014192, it has been technically difficult to switch the system over quickly and continue providing service when the database fails.

Accordingly, a first object of the present invention is to provide a fault-tolerant system and duplication method that can continue to provide service without data loss even when a network or server failure occurs.

A second object of the present invention, in addition to the first object, is to provide a highly reliable fault-tolerant system and duplication method that do not require broadband data communication for server duplication.

A third object of the present invention, in addition to the first object, is to provide a fault-tolerant system and duplication method that, without synchronizing process sequences, can switch quickly to the waiting server on a server switchover without losing data or processes.

A fourth object of the present invention, in addition to the first object, is to provide a fault-tolerant system and duplication method that can continue to provide service without data loss even when the redundant network fails.

A fifth object of the present invention, in addition to the first object, is to provide a fault-tolerant system and duplication method having a new communication channel and protocol with which, in a redundant server system, a failure of the active server is detected quickly so that the standby server continues to provide service without interruption.

A sixth object of the present invention, in addition to the first object, is to provide a fault-tolerant system and duplication method that, in a redundant server system, determine the conditions for automatic switchover and so provide continuous service.

FIG. 1 is a diagram showing the configuration of a fault-tolerant server system according to the present invention.

FIGS. 2A to 2H are diagrams showing the automatic switchover method of the redundant servers according to the present invention.

<Explanation of reference numerals for the main parts of the drawings>

100, 101: redundant network

110: active server (first server)

120: standby server (second server)

320: process

311: process and resource manager (PNR)

330, 331: duplication guardian process (DGP)

350: heartbeat network

To achieve the above objects, the present invention provides a redundant server apparatus for a redundant server system in which a first server and a second server are duplicated and connected through redundant network lines as an active server and a standby server, respectively. The first server and the second server each comprise: a process and resource manager (PNR) that, while the server is operating as the active server, starts the processes set in a configuration table, monitors the state of the network interface cards provided for network redundancy, and manages the processes; a duplication guardian process (DGP) that, whenever the server has finished booting and is in operation, whether active or standby, is started by the PNR and determines whether the partner server has failed through periodic communication with the partner server, to which it is directly connected through a heartbeat network line provided separately from the redundant network; and a reflective memory (RFM) connected to the partner server by an optical cable so that data produced by processes running on the active server can be synchronized and accessed in real time on the partner standby server.

Hereinafter, the fault-tolerant server system and duplication method according to the present invention are described in detail with reference to the accompanying FIGS. 1 and 2.

The fault-tolerant server system according to the present invention is fundamentally aimed at uninterrupted service. To this end, the server consists of a first server and a second server that monitor each other and, when an abnormality occurs, switch over quickly to restore service.

FIG. 1 is a diagram showing the configuration of a fault-tolerant server system according to the present invention. Referring to FIG. 1, two servers (110, 120) form one duplicated server, and the fault-tolerant structure according to the present invention has an active state and a standby state.

The active server (first server; 110) and the standby server (second server; 120) that make up the fault-tolerant server system are the same in that both have booted and are running the operating system (OS). Functionally, however, the active server (110) differs from the standby server (120) in that the required processes (320) have been started on it and it provides the service.

The standby server (120) continuously monitors its partner, the active server (110). The processes needed for monitoring the active server (110) are therefore started on the standby server (120) as well. That is, the process and resource manager (PNR; 311) and the duplication guardian process (DGP; 331) are running on the standby server (120).

On the active server (110), all the processes (320) described in the configuration table are started. On the standby server (120), by contrast, none of the other processes (321) are started apart from the PNR (311), the DGP (331), and the virtual message process (VMP; not shown).

That is, the processes (321) on the standby side are started immediately after the switchover that follows detection that the active server (110) is down, and then function as the active processes. In addition, because the duplicated first server (110) and second server (120) each hold data that must be preserved across a switchover, reflective memory (RFM; not shown) is used to synchronize the memory contents of the two systems.

The fault-tolerant server system according to the present invention is therefore characterized by having, as its components, the PNR (310, 311), the DGP (330, 331), and the reflective memory (RFM).

The PNR (310, 311) is the process that manages the processes (320, 321), and it must be started when the system first boots. Accordingly, if a server has finished booting and is in operation, its PNR (310, 311) can be assumed to be running.

Once started, the PNR (310, 311) first reads the process management table, that is, the configuration table described above. It reads the information needed to start processes from the designated configuration file and starts them. When a process it has started (a child process of the PNR) terminates, whether normally or abnormally, the PNR detects this and restarts the process according to the settings in the configuration file.
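The restart behaviour described above can be sketched as follows. This is only an illustration under stated assumptions: the source does not give the format of the configuration table, so the `name | command` line format, the `ProcessEntry` and `RestartTracker` names, and the restart counter are all hypothetical.

```python
from dataclasses import dataclass


@dataclass
class ProcessEntry:
    name: str      # hypothetical process name
    command: str   # command line used to (re)start the process


def parse_config_table(text):
    """Parse lines of the assumed form 'name | command'; '#' starts a comment."""
    entries = []
    for raw in text.splitlines():
        line = raw.strip()
        if not line or line.startswith("#"):
            continue
        name, command = (field.strip() for field in line.split("|", 1))
        entries.append(ProcessEntry(name, command))
    return entries


class RestartTracker:
    """Decides what to re-run when a child process of the manager exits."""

    def __init__(self, entries):
        self.commands = {e.name: e.command for e in entries}
        self.restart_count = {e.name: 0 for e in entries}

    def on_child_exit(self, name, exit_status):
        # Normal (0) and abnormal exits are treated alike, as in the text:
        # the process is restarted from its configured command either way.
        self.restart_count[name] += 1
        return self.commands[name]
```

A real manager would pass the returned command to something like `subprocess.Popen`; the sketch stops short of spawning processes so that the policy itself stays easy to follow.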

In other words, the PNR (310, 311) is the parent of the other processes (320, 321), and when the PNR terminates, all its child processes terminate. The PNR (310, 311) also checks and manages the state of the network interface cards (NICs) installed in its own system for the redundant network (100, 101).

To implement a complete fault-tolerant system, the information managed by each process must also be duplicated. If that information were kept in shared memory or local memory within main memory, the new active server could not access it after a sudden switchover. To solve this, the present invention uses reflective memory.

The reflective memory (RFM) is a memory board provided separately from main memory, and it may use the PCI bus. In a preferred embodiment, a PCI RFM board is installed in each of the active and standby servers, and the two boards may be connected to each other by an optical cable. When the processes (320) running on the active server (110) write data to the RFM on one side, the same contents are written to the RFM of the partner standby server (120), which makes data synchronization possible.

The first process that the PNR starts is the duplication guardian process (DGP; 330, 331). The DGP communicates periodically with the partner DGP over the heartbeat network (350), detects whether the partner has gone out of service, and has the switchover carried out.

That is, in the fault-tolerant server system according to the present invention, the DGPs (330, 331) are responsible for duplication monitoring. Each communicates periodically with the partner DGP over the heartbeat network (350), which is separate from the redundant service networks (100, 101), and so determines whether the partner server has failed.

Switching the active/standby state between the two duplicated hosts (110, 120) can be done in two ways: automatic switchover and manual switchover. In automatic switchover, the standby host detects an abnormality in the active host and the switchover takes place automatically. In manual switchover, an operator's action exchanges the roles of the active and standby hosts at will. The conditions for automatic switchover of the redundant servers are described in detail below.

First, when a failure occurs in the power supply, the central processing unit (CPU) or the like of the active host (110), the standby host (120) detects it and transitions to the active host. Likewise, when the active host's network line fails, the host can no longer function properly as the active server because communication with the outside is cut off, even though the host itself is operating normally and its processes can run correctly.

In this case the switchover is carried out just as when an abnormality is detected in the host. The fault-tolerant server system according to the present invention can also switch over when the database fails. Furthermore, when a process running on the active server goes down at least a predetermined number of times, this too can be recognized as a system failure and the switchover can proceed.
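The four automatic switchover conditions listed above can be combined into a single predicate. The sketch below is a hedged illustration, not the patented logic: `ActiveStatus`, its field names, and the default `crash_limit` of 3 are assumptions, since the source says only "a predetermined number of times".

```python
from dataclasses import dataclass


@dataclass
class ActiveStatus:
    host_ok: bool          # power, CPU and similar host-level health
    network_ok: bool       # service network line reachable
    database_ok: bool      # database reachable
    process_crashes: int   # how many times a managed process has gone down


def should_switch_over(status, crash_limit=3):
    """Return True if any condition from the text calls for switchover.

    crash_limit is an assumed value standing in for the unspecified
    'predetermined number' of process failures.
    """
    return (
        not status.host_ok
        or not status.network_ok
        or not status.database_ok
        or status.process_crashes >= crash_limit
    )
```

Keeping the conditions in one function makes it easy to see that a network-only failure triggers the same switchover as a host failure, which is the point the paragraph makes.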

FIGS. 2A to 2H are diagrams showing the automatic switchover method of the redundant servers according to the present invention. Referring to FIG. 2A, the description starts by assuming that the first server (110) is the active system and the second server (120) is the standby system. The initial state is normally determined by the DGP configuration file.

The DGPs (330, 331) exchange heartbeats with each other over the dedicated heartbeat network. In a preferred embodiment the period may be 0.5 seconds, and it may be made adjustable in the configuration file. The DGP creates a thread that keeps sending a heartbeat message every 0.5 seconds, so that heartbeat messages can always be exchanged while the DGP is running.
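A dedicated sender thread of the kind described above might look like the following minimal sketch. The `HeartbeatSender` class, the `send` callback, and the `b"HB"` payload are assumptions made for illustration; the source specifies only the 0.5-second default period and that a thread does the sending.

```python
import threading


class HeartbeatSender:
    """Calls send(message) once per period until stop() is called."""

    def __init__(self, send, period=0.5):
        self._send = send          # hypothetical transport callback
        self._period = period      # 0.5 s is the example period in the text
        self._stop = threading.Event()
        self._thread = threading.Thread(target=self._run, daemon=True)

    def start(self):
        self._thread.start()

    def stop(self):
        self._stop.set()
        self._thread.join()

    def _run(self):
        while not self._stop.is_set():
            self._send(b"HB")              # placeholder heartbeat payload
            self._stop.wait(self._period)  # sleeps, but wakes early on stop()
```

Waiting on the event rather than calling `time.sleep` lets `stop()` return promptly instead of blocking for up to a full period.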

Referring to FIG. 2B, if a failure occurs in the first server (110), which is operating as the active host, and the DGP (330) that was running on it stops operating normally, this can be detected.

That is, when the first server (110) is active and a failure there prevents the DGP (330) from operating normally, the DGP (330) of the first server can no longer send heartbeat messages. The DGP (331) of the second server receives no heartbeat message before the timeout expires and therefore detects the partner's failure.

In a preferred embodiment, the heartbeat timeout may be one second, twice the heartbeat period of, for example, 0.5 seconds. This event is defined as the first timeout. When the first timeout occurs, it may be that the host itself is fine and only the heartbeat network (350) has failed, so the standby system's DGP (331) must check over the service networks (100, 101) whether the active host is actually down.

It therefore sends a heartbeat over every service network and waits for a response. If the first server itself is fine and the heartbeat transmission failed only because the heartbeat network failed, the DGP (330) in the first server (110) will receive the heartbeat message over a service network. From then on, the two DGPs (330, 331) exchange heartbeats over that network.

Even in this case, transmission over the heartbeat network continues to be attempted. Once the heartbeat network is restored, heartbeat messages are again exchanged over the heartbeat network rather than the service network.
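The fallback and fail-back behaviour can be summarized as a channel preference: use the dedicated heartbeat network whenever it responds, and otherwise the first service network that does. The sketch below assumes hypothetical network names (`hb0`, `svc0`, `svc1`) and a `reachable` mapping that some probing code would fill in; neither appears in the source.

```python
def choose_channel(reachable, heartbeat="hb0", services=("svc0", "svc1")):
    """Return the network to carry heartbeats, or None if none responds.

    reachable maps a network name to True if the partner answered on it.
    The dedicated heartbeat network always wins when it is available, so
    traffic moves back to it as soon as it is restored.
    """
    if reachable.get(heartbeat, False):
        return heartbeat
    for name in services:
        if reachable.get(name, False):
            return name
    return None
```

Calling the function on every heartbeat cycle, with fresh probe results, reproduces the text's rule that attempts on the heartbeat network never stop.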

If a timeout occurs while heartbeats are being exchanged over a service network, it is treated in the same way as a timeout on the heartbeat network. It too is defined as a first timeout.

If another timeout occurs after the first timeout, whether over a service network or the heartbeat network, it is defined as the second timeout. The second timeout value may likewise be set to twice the heartbeat interval, that is, one second with the current setting. When two timeouts have occurred, the DGP (331) concludes that the DGP (330) of the partner system is not operating normally and carries out the switchover.

Referring to FIG. 2C, when two timeouts have occurred, the standby-side DGP (331) regards the active system (110) as having failed and begins the switchover. Detecting the abnormality in the first server takes two seconds, four times the heartbeat interval. This value depends on the heartbeat period, so it can be changed by changing the heartbeat period.
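The timing described above works out as 0.5 s × 2 × 2 = 2 s: each timeout is twice the heartbeat period, and two timeouts are required. A minimal sketch of that counting follows; the `FailureDetector` class and its method names are assumptions, and timestamps are passed in explicitly so the logic can be followed without a real clock.

```python
def detection_time(period=0.5, timeout_factor=2, timeouts_needed=2):
    """Worst-case time to declare the partner down, in seconds."""
    return period * timeout_factor * timeouts_needed


class FailureDetector:
    """Counts consecutive heartbeat timeouts; two mean the partner is down."""

    def __init__(self, period=0.5):
        self.timeout = 2 * period   # timeout is twice the heartbeat period
        self.last_seen = 0.0
        self.timeouts = 0

    def on_heartbeat(self, now):
        # Any received heartbeat resets the count.
        self.last_seen = now
        self.timeouts = 0

    def poll(self, now):
        """Return True once a second consecutive timeout has expired."""
        if now - self.last_seen >= self.timeout:
            self.timeouts += 1
            self.last_seen = now    # start the next timeout window
        return self.timeouts >= 2
```

With the default period, a detector that last heard from its partner at t = 0 registers the first timeout at t = 1 s and reports failure at t = 2 s.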

Referring to FIG. 2D, the DGP (331) of the second server, having detected that the first server is down, sends the SIGUSR1 signal to the PNR (311) to start the switchover. On receiving SIGUSR1, the PNR (311) of the second server starts the transition from standby to active. In a preferred embodiment, when the PNR running on a server receives the SIGUSR2 signal, it transitions from active to standby.

The PNR keeps the system's state information in reflective memory. When the DGP begins a switchover, it writes the state change to the synchronized reflective memory so that the PNR can track the correct state. In the case above, the DGP records in reflective memory that the current state is that of the active system.

Referring to FIG. 2E, on receiving SIGUSR1, the PNR (311) of the second server (120) first terminates the processes configured to run only in the standby state and then starts the processes configured to run in the active state. It then carries out the switchover to the active system.
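The signal-driven transition can be sketched as a small state machine. The signal names SIGUSR1 and SIGUSR2 come from the text; everything else here (the `Pnr` class, the `PROCESS_MODES` table, and the process names) is hypothetical, and the sketch only records which processes would be stopped and started rather than managing real ones. It assumes a POSIX platform, where Python's `signal` module defines both constants.

```python
import signal

# Hypothetical configuration: process name -> state in which it should run.
PROCESS_MODES = {
    "standby_probe": "standby",
    "call_handler": "active",
    "billing": "active",
}


class Pnr:
    def __init__(self, modes, state="standby"):
        self.modes = modes
        self.state = state
        self.running = {n for n, m in modes.items() if m == state}
        self.log = []   # ordered record of ("stop" | "start", name)

    def handle_signal(self, signum, frame=None):
        # SIGUSR1: standby -> active; SIGUSR2: active -> standby.
        if signum == signal.SIGUSR1:
            self._transition("active")
        elif signum == signal.SIGUSR2:
            self._transition("standby")

    def _transition(self, new_state):
        if new_state == self.state:
            return
        # First stop everything that belongs only to the old state...
        for name in sorted(self.running):
            self.log.append(("stop", name))
        self.running.clear()
        # ...then start the processes configured for the new state.
        for name, mode in self.modes.items():
            if mode == new_state:
                self.log.append(("start", name))
                self.running.add(name)
        self.state = new_state
```

In a real process the handler would be registered with `signal.signal(signal.SIGUSR1, pnr.handle_signal)` and likewise for SIGUSR2; the method signature matches what Python passes to signal handlers.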

When the switchover is complete, the second server system functions as the active system. Even though the first server system is down, the DGP (331) of the second server (120) keeps trying to send heartbeat messages, so after two seconds (four times the heartbeat interval) it detects that the partner is down. The DGP (331) of the second server (120) then recognizes that its partner, now the standby server (110), is down and does the work this requires, for example recording the state of the partner standby system as down.

Referring to FIG. 2F, when the first server that went down is recovered and its PNR (310) is restarted, the DGP (330) is started and begins operating. On start-up, the DGP (330) of the recovered first server (110) sets its initial state to Init and sends the DGP (331) of the partner host (120) a heartbeat message asking which state it should transition to. In a preferred embodiment, this is done by sending a heartbeat message with ack_req set to REQUIRED and the code set to REQ_CHG_TO_STANDBY.

At this time, the heartbeat message is transmitted over the heartbeat network and all service networks. The heartbeat message timeout value is twice the heartbeat transmission period. If a response arrives before the timeout expires, the DGP 330 of the first server 110 performs the state transition accordingly. If the timeout occurs, it can assume that the DGP of the partner host is not operating normally, so it transitions its own state to active; that is, it sends the SIGUSR1 signal to the PNR process. In the case above, the second server is alive, so a heartbeat message is received from the DGP 331 of the second server 120.
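The decision made by a DGP that has just started in the Init state can be sketched as below. This is an illustrative Python model under stated assumptions: the function name `resolve_init_state`, its parameters, and the handling of an unrecognized code are hypothetical. The factor of two for the timeout and the INST_CHG_TO_* codes come from the description.

```python
from typing import Optional

TIMEOUT_FACTOR = 2  # timeout is twice the heartbeat transmission period


def resolve_init_state(reply_code: Optional[str], elapsed: float, period: float) -> str:
    """Decide the next state of a DGP that started in Init and queried its partner.

    reply_code: code carried by the partner's answer, or None if nothing arrived.
    elapsed:    seconds between sending the query and receiving the answer.
    period:     heartbeat transmission period in seconds.
    """
    timeout = TIMEOUT_FACTOR * period
    if reply_code is None or elapsed > timeout:
        # No timely answer: assume the partner DGP is not running and go active.
        return "ACTIVE"
    if reply_code == "INST_CHG_TO_STANDBY":
        return "STANDBY"
    if reply_code == "INST_CHG_TO_ACTIVE":
        return "ACTIVE"
    return "INIT"  # unknown code: remain in Init (an assumption of this sketch)


print(resolve_init_state("INST_CHG_TO_STANDBY", elapsed=0.3, period=0.5))
print(resolve_init_state(None, elapsed=1.0, period=0.5))
```

In the real system the transition to active would be carried out by sending SIGUSR1 to the PNR; the sketch returns a state label instead so that the decision itself is easy to inspect.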

Referring to FIG. 2G, the DGP 331 of the second server 120 receives a heartbeat message from the DGP 330 of the first server 110. On analyzing it, the DGP 331 finds that the code value is REQ_CHG_TO_STANDBY. As explained above, this asks which state the sender should transition to, so the DGP of the second server checks the state of the second server host and takes the appropriate action depending on its own state.

(1) If the second server 120 is in the active state, it tells the other side to transition to the standby state by setting the code value to INST_CHG_TO_STANDBY.

(2) If the second server 120 is in the standby state, it tells the other side to transition to the active state, again by setting the code value to INST_CHG_TO_ACTIVE and sending it in a heartbeat message.

(3) If the second server 120 is in the Init state, it consults the configuration file (dgp configuration file). If the second server 120 is configured as active by default, it transitions itself to the active state and tells the other side to transition to standby; that is, it sends the DGP 330 of the first server 110 a heartbeat whose code carries INST_CHG_TO_STANDBY. If the initial system state is configured as standby, then by the same principle it transitions itself to the standby state and tells the other side to transition to the active state (INST_CHG_TO_ACTIVE).
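The three cases above amount to a small decision table on the answering side. The Python sketch below is illustrative only: `answer_state_query` and its parameters are hypothetical names, and the assumption that the answering server keeps its own state in cases (1) and (2) is not stated explicitly in the description.

```python
def answer_state_query(own_state: str, default_state: str):
    """Return (new_own_state, reply_code) for a received REQ_CHG_TO_STANDBY query.

    own_state:     "ACTIVE", "STANDBY", or "INIT" (state of the answering server).
    default_state: "ACTIVE" or "STANDBY", the default from the DGP configuration file.
    """
    if own_state == "ACTIVE":
        # Case (1): an active server tells the querying side to become standby.
        return "ACTIVE", "INST_CHG_TO_STANDBY"
    if own_state == "STANDBY":
        # Case (2): a standby server tells the querying side to become active.
        return "STANDBY", "INST_CHG_TO_ACTIVE"
    if own_state == "INIT":
        # Case (3): both sides are undecided, so the configured default breaks the tie.
        if default_state == "ACTIVE":
            return "ACTIVE", "INST_CHG_TO_STANDBY"
        return "STANDBY", "INST_CHG_TO_ACTIVE"
    raise ValueError(f"unknown state: {own_state}")


for state in ("ACTIVE", "STANDBY", "INIT"):
    print(state, answer_state_query(state, default_state="ACTIVE"))
```

Whatever the branch, the two servers end up in opposite roles, which is the property the protocol relies on.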

Meanwhile, the DGP 331 of the second server 120 can recognize from the received heartbeat message that the first server 110, which had been down, has been restored. It notifies the PNR 311 of this fact so that the PNR 311 can manage state, which it does by sending the SIGTTIN signal to the PNR 311. In a preferred embodiment of the present invention, the PNR 311 recognizes that the partner standby system has been restored when it receives the SIGTTIN signal, and that the partner standby system is down when it receives the SIGTTOU signal.

Meanwhile, when the standby system that was down is restored, the DGP of the active server performs an RFM Sync to make the RFM contents consistent. This is necessary because the RFM consists of DRAM, so its contents are entirely erased when power is removed. Therefore, when a standby system that was down comes back up, the RFM Sync is performed on the assumption that all of that system's RFM contents have been erased.
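The reason a full sync is needed after recovery can be shown with a toy model. The Python sketch below is not an RFM driver: the `ReflectiveMemory` class, its `cells` dictionary, and the `rfm_sync` function are hypothetical stand-ins for a volatile memory region and the copy that the active side's DGP initiates.

```python
class ReflectiveMemory:
    """Toy model of a volatile (DRAM-backed) memory region."""

    def __init__(self):
        self.cells = {}

    def power_cycle(self):
        # DRAM loses its contents when power is removed.
        self.cells.clear()


def rfm_sync(active: ReflectiveMemory, standby: ReflectiveMemory) -> None:
    """Copy the full contents of the active side to the restored standby side."""
    standby.cells = dict(active.cells)


active, standby = ReflectiveMemory(), ReflectiveMemory()
active.cells.update({"session_1": "in progress", "server_state": "ACTIVE"})
standby.power_cycle()      # the standby host went down and came back up
rfm_sync(active, standby)  # the active side's DGP triggers the sync
print(standby.cells == active.cells)
```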

Referring to FIG. 2H, the transition is complete, with the first server 110 in standby and the second server 120 active. The manual switchover method according to the present invention is as follows. When the system operator requests a manual switchover, the manual switchover command is issued to the PNR process of the active host, and the PNR performs the switchover by sending the DGP process the SIGUSR1 signal (a command to transition to standby).

The functions of the process resource management process (PNR) and the redundancy monitoring process (DGP) software modules according to the present invention are described in detail below. The PNR manages the processes according to the active and standby states and raises an alarm when a fault occurs in a process or resource. It also creates and deletes processes or resources, checks the state of the network, and operates in conjunction with the DGP and the database management process (DBMGR). The PNR according to the present invention is listed in the initial configuration file inittab and is started at boot by the Init process.

The PNR process according to the present invention initially reads the configuration file (pnr.tab) and builds the list of processes that run in the active or standby state. The PNR reads pnr.tab when it is first started, and it also rereads this configuration file when it receives SIGHUP (described in detail below).

The PNR process according to the present invention also reads the pnr_conf.ini file to obtain the host name of the current system. It receives the SIGUSR1 signal from the DGP on a transition to the active state and the SIGUSR2 signal on a transition to the standby state. On receiving one of these signals, the PNR starts the processes that run in the active state or the processes that run in the standby state, respectively. The processes must be able to be terminated by SIGTERM. During operation, the PNR interfaces with other processes through signals. A preferred embodiment of the PNR's signal handling is as follows.

SIGUSR1 signifies a transition to the active state. It is sent from the DGP to the PNR when a state transition to active has occurred. The PNR sends the SIGTERM signal to terminate the processes that were used while the state (STATE) was standby, and then creates the processes whose state is set to active.

SIGUSR2 signifies a transition to the standby state. It is sent from the DGP to the PNR when a state transition to standby has occurred. The PNR sends the SIGTERM signal to terminate the processes that were used while the state was active, and then creates the processes whose state is set to standby.

SIGTTIN indicates that the partner server has moved from the initial (INIT) state to the standby (STANDBY) state. If the current system is active, it learns of the partner's standby state through the DGP. The signal is generated when the state of the partner system becomes standby.

SIGTTOU indicates that the partner server has moved from the standby state to the initial state (INIT), or has gone down. It is generated by the DGP when the partner system has not reached the standby state or is in an abnormal state. On this signal, an alarm message is generated reporting that the standby system is down.

SIGHUP signifies reconfiguration. It is sent by the user when the contents of pnr.tab have changed: a user who has added new entries to pnr.tab and wants processes to run under the new configuration sends it to the PNR. On receiving this signal, the PNR rereads the process table and terminates or creates processes accordingly.

SIGCHLD signifies the termination of a child process and is generated when a process running in the RESPAWN state terminates. The PNR raises an alarm (ALARM) for the process termination, then re-creates the process and sends an alarm clear (ALARM CLEAR).
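The signal handling described above can be summarized as a dispatch table. The Python sketch below is a summary aid rather than a working PNR: it installs no handlers, the names `PNR_SIGNAL_ACTIONS` and `describe` are hypothetical, and the action strings paraphrase the description. SIGUSR2 is treated as the transition to standby, following the statement that the PNR receives SIGUSR2 from the DGP for the standby state. The signal constants come from Python's standard `signal` module and are available on Unix-like systems only.

```python
import signal

# Signal -> action taken by the PNR, paraphrased from the description.
PNR_SIGNAL_ACTIONS = {
    signal.SIGUSR1: "terminate standby processes with SIGTERM, then start active processes",
    signal.SIGUSR2: "terminate active processes with SIGTERM, then start standby processes",
    signal.SIGTTIN: "record that the partner standby system is up",
    signal.SIGTTOU: "record that the partner standby system is down and raise an alarm",
    signal.SIGHUP: "reread pnr.tab and terminate or create processes to match",
    signal.SIGCHLD: "raise an alarm, re-create the RESPAWN process, send alarm clear",
}


def describe(signum) -> str:
    """Return the PNR action for a signal, or a note if the sketch does not cover it."""
    return PNR_SIGNAL_ACTIONS.get(signum, "not handled by this sketch")


for signum, action in PNR_SIGNAL_ACTIONS.items():
    print(f"{signal.Signals(signum).name:8s} {action}")
```

SIGTTIN and SIGTTOU normally relate to terminal access by background jobs; the embodiment repurposes them as partner-up and partner-down notifications between the DGP and the PNR.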

A server in service can be in one of three states: active (ACTIVE), standby (STAND-BY), and initial (INIT). The INIT state precedes the standby state and is the initial state of a server. A server that has been restored always enters the INIT state first. It then asks the active server whether it should change to the standby state or to the active state.

After receiving confirmation from the partner server in the INIT state, the server changes to the standby state or the active state. The standby server monitors the state of the active server and takes over as the active server when a fault occurs. Manual switchover is also supported: when a state change is requested of the DGP, it asks the standby server to change state and changes its own state as well.

In the fault-tolerant system according to the present invention, when the heartbeat interface 350 fails, a reply-request packet is sent over the other networks, and subsequent reply-request packets are sent over the network on which a reply was received. The standby system resets its timer whenever a heartbeat arrives over any available network.

If a timeout occurs while packets were being transmitted normally over the heartbeat network according to the present invention, a fault in the heartbeat network can be suspected. A packet requesting an acknowledgment (Ack) is therefore sent over all networks, that is, the two redundant networks 100 and 101 and the heartbeat network 350.

If the reply packet arrives over a network other than the heartbeat network, the next heartbeat is sent over the network on which the reply arrived. This is the state in which the heartbeat network 350 has failed. If the network being used for retransmission also fails in this state, a reply-request packet is again sent over all networks and a reply is awaited. If no reply returns within twice the heartbeat period (2*t1), the server switches to active. In any case, once a packet is detected on the heartbeat network 350, normal operation resumes.
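The choice of network for the next heartbeat can be sketched as a pure function. This Python illustration rests on assumptions: the network labels, the `choose_network` name, and the representation of replies as a set are hypothetical, and the preference for the first service network in list order is an arbitrary choice made for the sketch.

```python
from typing import Optional, Set

HEARTBEAT_NET = "hb350"              # hypothetical label for heartbeat network 350
SERVICE_NETS = ["lan100", "lan101"]  # hypothetical labels for networks 100 and 101


def choose_network(current: str, replies: Set[str]) -> Optional[str]:
    """Pick the network for the next heartbeat.

    current: network used for the previous heartbeat.
    replies: networks on which a packet arrived within the last 2*t1 window.
    Returns None when nothing arrived on any network, which is the case
    in which the caller would switch to active.
    """
    if HEARTBEAT_NET in replies:
        return HEARTBEAT_NET  # a packet on the heartbeat network restores normal operation
    if current in replies:
        return current        # the fallback network in use is still working
    for net in SERVICE_NETS:
        if net in replies:
            return net        # move to a network on which a reply arrived
    return None


print(choose_network("hb350", {"lan101"}))
print(choose_network("lan101", {"hb350", "lan101"}))
print(choose_network("lan100", set()))
```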

Meanwhile, when heartbeat transmission times out after twice the heartbeat period (2*t1), or when heartbeat transmission over the redundant LAN networks 100 and 101 times out after 2*t1, a heartbeat requesting a reply is sent over all networks.

If a further 2*t1 then elapses and another timeout occurs, the active server is regarded as down and a switchover is attempted. If a reply arrives over one of the redundant LAN networks 100 and 101, heartbeat transmission continues over that LAN. In this way, the failure of the active server is confirmed within 4*t1 seconds after it goes down.
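The 4*t1 bound follows from two consecutive timeouts of 2*t1 each. The short Python sketch below works through the arithmetic; the function name is hypothetical, and t1 = 0.5 s is an assumed value chosen because it reproduces the 2-second figure that the description gives for four heartbeat intervals.

```python
def worst_case_detection_time(t1: float) -> float:
    """Time until the active server is declared down, in seconds."""
    first_timeout = 2 * t1   # the heartbeat times out, then all networks are probed
    second_timeout = 2 * t1  # no reply arrives on any network either
    return first_timeout + second_timeout


print(worst_case_detection_time(0.5))  # prints 2.0
```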

Meanwhile, when 2*t1 elapses without a heartbeat over the heartbeat network 350, the system is in the state of heartbeat transmission over the redundant LAN networks 100 and 101. In this state, heartbeat transmission over the heartbeat network continues to be attempted, and a heartbeat requesting an acknowledgment (Ack) is answered when received. If no heartbeat arrives within 2*t1 during heartbeat transmission over the redundant LANs 100 and 101, the standby server is regarded as not operating and StandbyDown is performed. When a heartbeat arrives again, StandbyUp is performed.
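The StandbyDown and StandbyUp behavior on the active side can be sketched as a small monitor driven by timestamps. This Python model is illustrative only: the `PartnerMonitor` class, its method names, and the use of explicit timestamps instead of a real timer are assumptions of the sketch.

```python
class PartnerMonitor:
    """Tracks whether the standby partner is up, based on heartbeat arrival times."""

    def __init__(self, t1: float):
        self.timeout = 2 * t1  # a heartbeat must arrive within 2*t1
        self.last_seen = 0.0
        self.partner_up = True
        self.events = []

    def on_heartbeat(self, now: float) -> None:
        # Any arriving heartbeat resets the timer and, if needed, marks the partner up.
        self.last_seen = now
        if not self.partner_up:
            self.partner_up = True
            self.events.append("StandbyUp")

    def on_tick(self, now: float) -> None:
        # Called periodically; reports StandbyDown once per outage.
        if self.partner_up and now - self.last_seen > self.timeout:
            self.partner_up = False
            self.events.append("StandbyDown")


m = PartnerMonitor(t1=0.5)
m.on_heartbeat(0.5)
m.on_tick(1.0)       # 0.5 s of silence: still within the 1.0 s timeout
m.on_tick(2.0)       # 1.5 s of silence: StandbyDown
m.on_tick(2.5)       # already down: no duplicate event
m.on_heartbeat(3.0)  # heartbeat returns: StandbyUp
print(m.events)      # prints ['StandbyDown', 'StandbyUp']
```

In the described system these two events would be relayed to the PNR as the SIGTTOU and SIGTTIN signals, respectively.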

When a system first starts operating, it goes through initialization by Init. Each DGP then sends a packet that carries REQ_TO_STANDBY in its code and requests a reply. If no packet arrives within 2*t1, the DGP assumes that there is no partner DGP and operates as the active server. Thus, when only one system is running, it operates as the active server after 2*t1. If an active server is already running, that active server replies with INST_CHG_TO_STANDBY, so the new server immediately operates as the standby server.

When the two systems start operating at the same time, a server receives REQ_TO_STANDBY while it is waiting for a reply to the REQ_TO_STANDBY message it sent itself. In this case, if the start state (Start State) in the initial system configuration is set to active, the server operates as the active server and sends INST_CHG_TO_STANDBY. In the opposite case, it operates as the standby server and sends INST_CHG_TO_ACTIVE.

The foregoing has outlined the features and technical advantages of the present invention rather broadly so that the claims that follow can be better understood. Additional features and advantages forming the subject of the claims are described below. Those skilled in the art should appreciate that the disclosed concept and specific embodiments can readily be used as a basis for designing or modifying other structures that serve purposes similar to those of the present invention.

The inventive concept and embodiments disclosed herein may likewise be used by those skilled in the art as a basis for modifying or designing other structures to carry out the same purposes of the present invention. Such modified or altered equivalent structures admit various changes, substitutions, and alterations by those skilled in the art without departing from the spirit or scope of the invention described in the claims.

As described above, the present invention provides, in addition to the redundant networks, a dedicated heartbeat line between the redundancy monitoring processes (DGP) running on the redundant servers so that each monitors the other's state. Automatic switchover can thus take place when a host fails, when a network fails, when the database fails, or when process start-up fails at least a predetermined number of times, and manual switchover by an operator is also possible, thereby realizing a fault-tolerant server system.

Claims

A redundant server system comprising a first server on which user processes run and a second server duplicated with it through redundant network lines, a heartbeat line, separate from the redundant network lines, for checking whether the counterpart server has failed, and memory means synchronized in real time between the first server and the second server so that data generated by a user process running on one server can be accessed without interruption after a switchover, wherein the first server and the second server each start a process and resource manager (PNR) and a duplication guardian process (DGP) and each operate as an active server or a standby server,

wherein the process resource management process starts the user processes set in a configuration table on the server operating as the active server, monitors the state of the network interface cards provided for network redundancy, and manages the user processes,

wherein the redundancy monitoring process is started by the process resource management process once booting is complete and runs at all times regardless of whether the server is an active server or a standby server, determines whether the counterpart server has failed through periodic communication with the redundancy monitoring process of the counterpart server, and, on detecting a failure of the counterpart server that was operating as the active server, transmits a signal requesting a switchover to its own process resource management process so that the user processes are started again with reference to the configuration file recorded in the memory means, and requests a state change from the redundancy monitoring process of the counterpart server, and wherein the memory means is an RFM installed separately in each of the active server and the standby server, and data recorded in the memory means of the active server is recorded in the memory means of the counterpart server in real-time synchronization.

delete

In a first server and a second server that are duplicated, each equipped with a process resource management process (PNR) that runs and manages user application processes and monitors the state of the network interface cards provided for network redundancy, and a Duplication Guardian Process (DGP) that monitors the state of the counterpart server and detects faults through a heartbeat line provided separately from the redundant networks, wherein data generated by the processes is stored in reflective memories that are connected to each other by an optical cable and interfaced via PCI so that data is recorded in synchronization, a method in which the active first server is switched to the standby state and the second server is switched to the active state,

(a) when no heartbeat message is received from the redundancy monitoring process (DGP) running on the first server through the heartbeat line for a predetermined multiple of the heartbeat communication period, the redundancy monitoring process of the second server sending a heartbeat to the redundancy monitoring process of the first server through the network lines other than the heartbeat line;

(b) when the heartbeat sent through the other network lines in step (a) also fails, so that no heartbeat message is received from the redundancy monitoring process of the first server within a predetermined multiple of the heartbeat communication period, the redundancy monitoring process of the second server sending a signal requesting a transition from the standby state to the active state to the process resource management process (PNR) of the second server, recording in the reflective memory that a state change is in progress during the switchover, and thereafter recording in the reflective memory that the second server is the active server;

(c) the process resource management process of the second server receiving the state transition request signal from the redundancy monitoring process of the second server, stopping the user application processes set to run only in the standby state, and starting the processes to be run in the active state; and

(d) the redundancy monitoring process of the second server attempting to transmit a heartbeat message to the counterpart first server and recognizing that the standby server is down when there is no response within the predetermined multiple of the heartbeat communication period.

A server redundancy switching method comprising the above steps.

The method of claim 6, wherein the server redundancy switching method

restoring the first server that was down, starting the process resource management process (PNR) of the first server, and then starting the redundancy monitoring process (DGP), which sets itself to the initial state;

the redundancy monitoring process of the restored first server sending, to the redundancy monitoring process of the currently active second server, a heartbeat message asking which state it should transition to, through any one or a combination of the heartbeat line and the redundant networks; and

Transitioning the state of the first server to a standby or active state according to the response from the duplication monitoring process of the second server;

further comprises the above steps.

In a first server and a second server that are duplicated, each equipped with a process resource management process (PNR) that runs and manages user application processes and monitors the state of the network interface cards provided for network redundancy, and a Duplication Guardian Process (DGP) that monitors the state of the counterpart server and detects faults through a heartbeat line provided separately from the redundant networks, wherein data generated by the processes is stored in reflective memories that are connected to each other by an optical cable and interfaced via PCI so that data is recorded in synchronization, a method of switching over from the active first server to the second server in the standby state,

(a) the process resource management process (PNR) of the first server, on receiving the switchover command of the system operator, sending a signal requesting a change to the standby state to the redundancy monitoring process of the first server;

(b) the redundancy monitoring process of the first server sending a heartbeat message requesting a state change to the active state to the redundancy monitoring process of the second server through the heartbeat line; and

(c) the process resource management process of the second server receiving the state transition request signal from the redundancy monitoring process of the second server, stopping the user application processes set to run only in the standby state, and starting the processes to be run in the active state.

A server redundancy switching method comprising the above steps.

(a) when the database server fails, the process resource management process (PNR) of the first server transmitting a signal requesting a transition to the standby state to the redundancy monitoring process of the first server;

A server redundancy switching method comprising the above step.

(a) when starting any one or a combination of the user application processes fails at least a predetermined number of times, the process resource management process (PNR) of the first server transmitting a signal requesting a transition to the standby state to the redundancy monitoring process of the first server;

A server redundancy switching method comprising the above step.

In a first server and a second server that are duplicated, each equipped with a process resource management process (PNR) that runs and manages user application processes and monitors the state of the network interface cards provided for network redundancy, and a Duplication Guardian Process (DGP) that monitors the state of the counterpart server and detects faults through a heartbeat line provided separately from the redundant networks, wherein data generated by the processes is stored in reflective memories that are connected to each other by an optical cable and interfaced via PCI so that data is recorded in synchronization,

(a) the redundancy monitoring process of the first server or the second server transmitting a first heartbeat at a predetermined period to the redundancy monitoring process of the counterpart server;

(b) if the transmitted heartbeat requests an acknowledgment (ACK) from the receiving server, the receiving server transmitting a second heartbeat at the predetermined period through the network channel on which the heartbeat was received;

(c) if the transmitted first heartbeat requests a transition of the state of the receiving server to one of initial, active, or standby, the redundancy monitoring process on the receiving side requesting its process resource management process to perform the state transition of the system;

(d) if the second heartbeat is not received from the receiving server within twice the predetermined period after the first heartbeat is transmitted, transmitting a third heartbeat to the receiving server using one or a combination of the heartbeat line and the other redundant networks;

(e) if no heartbeat is received from the receiving server through any communication channel of the heartbeat line or the redundant networks within twice the predetermined period after the third heartbeat is transmitted, recognizing that the counterpart server is down and switching itself to active;

(f) when the downed server is restored, the restored server being set to the initial state (INIT), and the redundancy monitoring process of the restored server querying the redundancy monitoring process of the counterpart server about its own state transition;

(g) the restored server transitioning to the standby or active state according to the instruction contained in the heartbeat sent by the redundancy monitoring process of the counterpart server in response to the state transition query of step (f), and transitioning its own state to active if there is no response.

A server redundancy switching method comprising the above steps.