CN1547121A

CN1547121A - Method for monitoring large-scale cluster system

Info

Publication number: CN1547121A
Application number: CNA200310119410XA
Authority: CN
Inventors: 博李; 李博; 马捷
Original assignee: Institute of Computing Technology of CAS
Current assignee: Huawei Technologies Co Ltd
Priority date: 2003-12-10
Filing date: 2003-12-10
Publication date: 2004-11-17
Anticipated expiration: 2023-12-10
Also published as: CN1270240C

Abstract

The invention is a monitoring method for a large-scale cluster system. It divides the monitoring system into four layers and five devices. Software and hardware information collectors (the node information collection layer) periodically gather system status information; the group information manager periodically collects and manages the status information of each group member; the cluster information manager collects and organizes the status data from every group information manager and stores it in a MySQL database. Finally, the cluster monitoring terminal reads the data from the database and displays the status data of each kind of monitored object to the administrator graphically. A large-scale cluster system can thus be monitored.

Description

A method for monitoring a large-scale cluster system

Technical field

The present invention relates to the field of high-performance cluster server technology, and in particular to a method for monitoring a large-scale cluster system.

Technical background

Clusters are a highly cost-effective solution in current high-performance computing. As cluster technology matures and costs fall, cluster sizes are growing faster and faster. Given the large scale of a cluster and its large quantity of resources, their state must be understood promptly and effectively; whether they run normally and deliver computing power is of great significance to a computing environment. An effective monitoring method is therefore needed for computing resources of this scale.

Previous cluster monitoring methods have several shortcomings. First, they generally adopt a two-layer client/server structure, which places too many restrictions on the scale of the cluster itself: once the cluster scale changes, and particularly when it grows several-fold, the monitoring system has difficulty adapting. Second, they mostly describe the running state of one or a few nodes inside the cluster from the perspective of a physical view, and cannot describe the state of a given class of resource across the whole cluster from the perspective of a logical view. Third, they often monitor only operating-system-level status information such as central processing unit utilization and memory usage, and do not monitor status information about the cluster environment such as temperature, voltage and fan speed.

Summary of the invention

In view of the deficiencies of existing cluster monitoring methods, the invention provides a method for monitoring a large-scale cluster system. The method provides an implementation scheme for the monitoring system of a large-scale cluster and also constructs a multi-level monitoring environment which, in a network environment, collects, aggregates, organizes, stores and displays the status information of the monitored cluster servers.

The invention is implemented as follows:

A. Framework of the cluster monitoring method

Structurally, the method divides the monitoring system as a whole into 4 levels and 5 devices: the node information collection layer (software information collector, hardware information collector), the group information management layer (group information manager), the cluster information management layer (cluster information manager), and the cluster monitoring layer (cluster monitoring terminal). See the structural diagram of the Linux cluster super server monitoring system for details. This multi-level architecture lets the monitoring system adapt easily to clusters of various scales, from a few nodes to thousands of nodes.

B. Collection of cluster system status information

The status information of the whole large-scale cluster system is aggregated from the status information of each node. Collecting each node's status information is the job of the node information collection layer, which consists of 2 devices: the software information collector and the hardware information collector.

The software information collector obtains system status data by reading operating system parameters at regular intervals. The system status data it collects mainly comprises: central processing unit usage; system memory capacity and usage; swap partition size and usage; disk usage (how busy read and write operations are); the state (up or down) of each network, its packets sent and received, and its packet loss rate; and the running state of applications.
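As a minimal sketch of what one periodic read by the software information collector might look like on a Linux node, the following Python computes central processing unit utilization from two samples in the `/proc/stat` text format. The patent does not say which operating system interface is read; the use of `/proc/stat`, the function names and the sample values are all assumptions made for illustration.

```python
def parse_cpu_line(stat_text):
    """Return (busy, total) jiffies from the aggregate 'cpu' line of /proc/stat text."""
    for line in stat_text.splitlines():
        fields = line.split()
        if fields and fields[0] == "cpu":
            # Fields: user nice system idle iowait irq softirq steal guest guest_nice.
            # Guest time is already counted in user, so only the first eight are summed.
            values = [int(v) for v in fields[1:9]]
            idle = values[3] + (values[4] if len(values) > 4 else 0)
            total = sum(values)
            return total - idle, total
    raise ValueError("no aggregate 'cpu' line found")


def cpu_utilization(prev_text, curr_text):
    """Fraction of CPU time spent busy between two samples (0.0 to 1.0)."""
    prev_busy, prev_total = parse_cpu_line(prev_text)
    curr_busy, curr_total = parse_cpu_line(curr_text)
    elapsed = curr_total - prev_total
    if elapsed <= 0:
        return 0.0
    return (curr_busy - prev_busy) / elapsed


# Two made-up samples taken one collection period apart (hypothetical values).
SAMPLE_1 = "cpu  100 0 50 800 50 0 0 0 0 0\ncpu0 100 0 50 800 50 0 0 0 0 0\n"
SAMPLE_2 = "cpu  160 0 90 880 70 0 0 0 0 0\ncpu0 160 0 90 880 70 0 0 0 0 0\n"

utilization = cpu_utilization(SAMPLE_1, SAMPLE_2)
```

On a real node the two text samples would come from reading `/proc/stat` at the start and end of a collection period; the difference between samples is used because the counters are cumulative since boot.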

The hardware information collector is a hardware device. It collects the status information of the hardware inside the cluster system by means of a data monitoring card (acquisition card), temperature probes, voltage measuring devices and fan speed measuring devices. The data it collects mainly comprises the voltage and working temperature of each hardware device, the speed of each fan, and so on.
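The hardware readings described above could be represented and checked roughly as follows. This is only a sketch: the record layout, the function name `find_faults` and every threshold value are hypothetical, since the patent names the measured quantities (voltage, temperature, fan speed) but gives no data format or limits.

```python
from dataclasses import dataclass


@dataclass
class HardwareReading:
    """One sample from a node's hardware information collector (assumed layout)."""
    node: str
    voltage_v: float      # supply voltage in volts
    temperature_c: float  # working temperature in degrees Celsius
    fan_rpm: int          # fan speed in revolutions per minute


# Hypothetical limits; real values depend on the monitored hardware.
VOLTAGE_RANGE_V = (11.4, 12.6)
MAX_TEMPERATURE_C = 70.0
MIN_FAN_RPM = 1000


def find_faults(reading):
    """Return the list of fault labels raised by one hardware reading."""
    faults = []
    low, high = VOLTAGE_RANGE_V
    if not low <= reading.voltage_v <= high:
        faults.append("voltage out of range")
    if reading.temperature_c > MAX_TEMPERATURE_C:
        faults.append("temperature abnormal")
    if reading.fan_rpm < MIN_FAN_RPM:
        faults.append("fan stalled or slow")
    return faults


healthy = HardwareReading("node01", 12.0, 45.0, 3200)
failing = HardwareReading("node02", 12.1, 82.5, 0)
healthy_faults = find_faults(healthy)
failing_faults = find_faults(failing)
```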

C. Aggregation of cluster system status information

The architecture of this monitoring method is divided into 4 levels. After the node information collection layer at the bottom has collected the cluster system status information, the information is organized and aggregated at each level in turn.

The status information of each node collected by the node information collection layer is aggregated for the first time at the group information manager. The group information manager periodically sends requests to its group members (nodes), asking for the software and hardware status information of each node. The software information collector on each node reports the node's application status information to the group information manager over socket-based communication, while the hardware information collector on each node delivers the node's hardware status data to the group information manager through a serial port using the I2C protocol.
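The request and reply exchange between a group information manager and a node's software information collector might be sketched as below. The patent states only that the communication is socket-based; the JSON encoding, the newline framing, the message field names and the helper functions are assumptions, and `socket.socketpair()` stands in for a real network connection so that the example runs on one machine.

```python
import json
import socket


def recv_line(sock):
    """Read from sock until a newline arrives; return the bytes before it."""
    buf = b""
    while not buf.endswith(b"\n"):
        chunk = sock.recv(4096)
        if not chunk:
            raise ConnectionError("peer closed before a full message arrived")
        buf += chunk
    return buf[:-1]


def send_json(sock, obj):
    """Send one JSON object terminated by a newline."""
    sock.sendall(json.dumps(obj).encode("utf-8") + b"\n")


def serve_one_request(sock, node_state):
    """Node side: answer one status request from the group information manager."""
    request = json.loads(recv_line(sock))
    if request.get("type") == "get_status":
        send_json(sock, {"type": "status", "state": node_state})
    else:
        send_json(sock, {"type": "error", "reason": "unknown request"})


# A connected socket pair stands in for the manager-to-node network link.
manager_end, node_end = socket.socketpair()
try:
    send_json(manager_end, {"type": "get_status"})                # manager polls the node
    serve_one_request(node_end, {"node": "node01", "cpu": 0.42})  # node replies
    reply = json.loads(recv_line(manager_end))                    # manager reads the reply
finally:
    manager_end.close()
    node_end.close()
```

In a deployment the manager would repeat this exchange for every group member once per collection period; the hardware path over I2C is not modelled here.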

Each group information manager reports the status data of all the group members it manages to the cluster information manager; this is the second aggregation of status data inside the cluster. The cluster information manager periodically sends a request to each group information manager, asking for the summary information it holds for each of its nodes. On receiving the request, each group information manager sends the status information of all members of its group to the cluster information manager over socket-based communication, so that the software and hardware status information of all nodes inside the cluster is aggregated at the cluster information manager.
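The two aggregation stages can be illustrated with plain data structures. This sketch leaves out the network transport and shows only how per-node state might be merged first per group and then per cluster; the dictionary layout and function names are assumptions, not something the patent specifies.

```python
def summarize_group(group_name, node_states):
    """First aggregation: a group information manager collects its members' states."""
    return {"group": group_name, "nodes": dict(node_states)}


def summarize_cluster(group_summaries):
    """Second aggregation: the cluster-level manager merges all group summaries."""
    cluster = {}
    for summary in group_summaries:
        for node, state in summary["nodes"].items():
            # Record which group each node belongs to alongside its state.
            cluster[node] = {"group": summary["group"], **state}
    return cluster


# Made-up node states for two groups.
group_a = summarize_group("group-a", {"node01": {"cpu": 0.40}, "node02": {"cpu": 0.80}})
group_b = summarize_group("group-b", {"node03": {"cpu": 0.10}})
cluster_state = summarize_cluster([group_a, group_b])
```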

D. Storage of cluster system status information

After the node's operating system has started successfully, the software information collector on the node sets aside a region of the node's memory to hold the node's status data. The software information collector periodically refreshes the status data held in this memory.

Similarly, after its node's operating system has started successfully, the group information manager sets aside a region of the node's memory to hold the set of status data reported by all nodes in the group. The group information manager periodically refreshes the status data held in this memory.
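The periodically refreshed in-memory area described for the collector and the group information manager could look something like the sketch below. The class name `StateBuffer`, the `refresh_if_due` method and the injectable clock are hypothetical; the patent does not describe the data structure or the refresh period.

```python
import time


class StateBuffer:
    """In-memory area holding the latest status data, overwritten on each refresh."""

    def __init__(self, read_state, clock=time.monotonic):
        self._read_state = read_state  # callable returning the current status dict
        self._clock = clock            # injectable so the example can use a fake clock
        self.state = {}
        self.refreshed_at = None

    def refresh(self):
        """Replace the stored state with a fresh reading."""
        self.state = self._read_state()
        self.refreshed_at = self._clock()

    def refresh_if_due(self, period_s):
        """Refresh only if period_s seconds have passed since the last refresh."""
        now = self._clock()
        if self.refreshed_at is None or now - self.refreshed_at >= period_s:
            self.refresh()
            return True
        return False


# Drive the buffer with made-up readings and a fake clock.
readings = iter([{"cpu": 0.2}, {"cpu": 0.6}])
fake_time = [0.0]
buffer = StateBuffer(lambda: next(readings), clock=lambda: fake_time[0])

first = buffer.refresh_if_due(period_s=5.0)   # nothing stored yet, so it refreshes
fake_time[0] = 2.0
second = buffer.refresh_if_due(period_s=5.0)  # only 2 s elapsed, so it does not
fake_time[0] = 5.0
third = buffer.refresh_if_due(period_s=5.0)   # a full period elapsed, so it refreshes
```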

The cluster information manager manages the status data of the whole cluster, covering both current status data and historical data. Current status data is managed much as in the software information collector: after its node's operating system has started successfully, the cluster information manager sets aside a region of the node's memory to hold the set of status data reported by all group information managers inside the cluster, and periodically refreshes it. The cluster information manager also manages the cluster's historical status data, using a MySQL database. In each cycle it stores the status data collected from the group information managers in a MySQL table. Tables are created per day: a new table is created every day to hold all of the cluster's status data for that day.
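The one-table-per-day scheme can be sketched as follows. To keep the example runnable without a database server, Python's built-in `sqlite3` module is used here as a stand-in for MySQL; the table naming pattern, the column layout and the function names are assumptions, since the patent states only that a new table is created each day.

```python
import sqlite3
from datetime import date


def table_name_for(day):
    """Name of the table holding one day's status data, e.g. status_20031210."""
    return "status_" + day.strftime("%Y%m%d")


def store_cycle(conn, day, rows):
    """Create the day's table if needed and append one collection cycle's rows."""
    table = table_name_for(day)  # built only from a date, so safe to interpolate
    conn.execute(
        f"CREATE TABLE IF NOT EXISTS {table} "
        "(collected_at TEXT, node TEXT, metric TEXT, value REAL)"
    )
    conn.executemany(f"INSERT INTO {table} VALUES (?, ?, ?, ?)", rows)
    conn.commit()
    return table


# sqlite3 in-memory database used in place of MySQL (assumption for this sketch).
conn = sqlite3.connect(":memory:")
store_cycle(conn, date(2003, 12, 10),
            [("2003-12-10T23:59:00", "node01", "cpu", 0.42)])
store_cycle(conn, date(2003, 12, 11),
            [("2003-12-11T00:00:00", "node01", "cpu", 0.37),
             ("2003-12-11T00:00:00", "node02", "cpu", 0.91)])
tables = [row[0] for row in conn.execute(
    "SELECT name FROM sqlite_master WHERE type='table' ORDER BY name")]
```

Splitting by day keeps each table bounded in size and makes it simple to archive or drop old data a whole day at a time, at the cost of needing several queries for trends that span days.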

E. Display of cluster system status information

The cluster system status information is displayed by the cluster monitoring terminal, which is located in the cluster monitoring layer.

The interface of the cluster monitoring terminal consists of a set of views in three classes: static information views, real-time information views and historical data analysis views. They present the information of the monitored cluster graphically; the cluster monitoring terminal's data comes from the database server. The static information view shows, cluster by cluster, configuration-related information such as central processing unit information, memory size and hard disk capacity. The real-time information view dynamically shows, as histograms or line charts, the central processing unit utilization, memory usage, swap partition utilization and hard disk utilization of each node in the cluster, as well as hardware faults, including unstable voltage or current, fan stalls and abnormal temperature. The historical data analysis view takes the cluster as a whole as its object of analysis and orders data by time. It gives the trends in central processing unit usage, hard disk activity, memory usage and swap partition utilization of all nodes in the cluster, so that one can analyze whether the current cluster's performance meets the demands of current applications. It also counts software and hardware fault points and fault frequencies over time, to assist with software and hardware upgrades. These views are likewise displayed as histograms and line charts.
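The difference between looking at one node (a physical view) and looking at one class of resource across the whole cluster (a logical view) can be illustrated with a small sketch. The function names, the dictionary layout and the choice of minimum, maximum and mean as summary figures are assumptions; the patent describes the views only in terms of what they display.

```python
def physical_view(cluster_state, node):
    """Physical view: all monitored resources of a single node."""
    return dict(cluster_state[node])


def logical_view(cluster_state, resource):
    """Logical view: one class of resource across every node in the cluster."""
    per_node = {
        node: state[resource]
        for node, state in cluster_state.items()
        if resource in state
    }
    if not per_node:
        raise KeyError(f"no node reports resource {resource!r}")
    values = list(per_node.values())
    return {
        "resource": resource,
        "per_node": per_node,
        "min": min(values),
        "max": max(values),
        "mean": sum(values) / len(values),
    }


# Made-up utilization figures (fractions between 0 and 1).
cluster_state = {
    "node01": {"cpu": 0.25, "memory": 0.50},
    "node02": {"cpu": 0.75, "memory": 0.25},
    "node03": {"cpu": 0.50, "memory": 0.75},
}
cpu_view = logical_view(cluster_state, "cpu")
node_view = physical_view(cluster_state, "node02")
```

The `per_node` mapping is the kind of data a histogram over nodes would plot, while the summary figures describe the resource for the cluster as a whole.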

Description of drawings

Fig. 1 is a structural diagram of the large-scale cluster system monitoring method of the present invention;

Fig. 2 is a deployment diagram of a cluster monitoring system that applies the method of Fig. 1;

Fig. 3 is a flow chart of the large-scale cluster system monitoring method of the present invention.

As shown in Fig. 1, the large-scale cluster system monitoring method is structurally divided into 4 levels and 5 devices: node information collection layer 1 (software information collector, hardware information collector), group information management layer 2 (group information manager), cluster information management layer 3 (cluster information manager), and cluster monitoring layer 4 (cluster monitoring terminal).

Node information collection is divided into two parts: software information collection and hardware information collection. The hardware information collector on each node delivers the node hardware information it collects to the group information manager over a dedicated I2C network; likewise, the software information collector on each node passes the corresponding node system status information to the group information manager. Each group information manager can manage the information of 0 to 128 nodes. The information from several group information managers is aggregated at the cluster information manager, which collects and processes the continuously arriving data and stores it in a database, providing the data the administrator needs to monitor the state of each node inside the cluster and to understand the nodes' operating history. The cluster monitoring terminal is a set of graphical management tools. It obtains the current and historical status information of the cluster's nodes from the database and presents it to the administrator through a graphical interface, so that the administrator obtains the current and historical status information of the monitored cluster intuitively, promptly and accurately.
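Given the stated limit of 128 nodes per group information manager, the number of groups a cluster needs follows from a ceiling division. The short sketch below works this out and splits a node list into groups; the function names and the consecutive assignment policy are assumptions, as the patent does not say how nodes are allocated to groups.

```python
import math

MAX_NODES_PER_GROUP = 128  # upper bound stated in the description


def groups_needed(node_count, group_size=MAX_NODES_PER_GROUP):
    """Smallest number of group information managers that can cover node_count nodes."""
    if node_count < 0:
        raise ValueError("node_count must not be negative")
    return math.ceil(node_count / group_size)


def assign_groups(node_names, group_size=MAX_NODES_PER_GROUP):
    """Split node names into consecutive groups of at most group_size members."""
    return [node_names[i:i + group_size]
            for i in range(0, len(node_names), group_size)]


# A hypothetical 300-node cluster.
nodes = [f"node{i:04d}" for i in range(300)]
groups = assign_groups(nodes)
group_sizes = [len(g) for g in groups]
```

For example, a cluster of 1000 nodes would need at least 8 group information managers, since 7 groups cover only 896 nodes.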

As shown in Fig. 2, a monitoring system that applies this method deploys each module on the corresponding node in the cluster, forming a complete monitoring system whose parts work in coordination.

The software and hardware information collectors are deployed on each compute node inside the cluster and collect that node's software and hardware status information. The group information manager is deployed on a group management node inside the cluster and aggregates the status information of each node in the group. The cluster information manager is deployed on the node at the cluster's network exit (which has both external and internal network connections); it aggregates the status information of each group and stores the data in the database. The cluster monitoring terminal is deployed on a terminal that has a network connection to the database and displays the various status information held there.

The steps of the large-scale cluster system monitoring method of Fig. 3 are as follows. Step S1: the software information collector and the hardware information collector each periodically collect the software and hardware running-state information of their node, and the status information of each node is periodically aggregated at the group information manager it belongs to. Step S2: each group information manager collects and organizes the status information of the nodes it manages and periodically aggregates it at the cluster information manager. Step S3: the cluster information manager collects and stores the information of the groups it manages, periodically organizing the cluster status information and storing it in the database. Step S4: the cluster monitoring terminal obtains the required information from the database and displays it.

The effects of the present invention are as follows:

1. The four-level architecture proposed by this monitoring method adapts more easily to clusters of different scales, particularly large-scale clusters, and is more scalable than the two-layer client/server structure adopted by previous cluster monitoring.

2. This cluster monitoring method uses database technology to manage the large volume of status data and back it up regularly, and provides source data for data analysis tools, greatly facilitating administrators' analysis of the monitored cluster's historical running-state data.

3. This cluster monitoring method introduces scalability at the logical level of the views, giving the administrator different perspectives from which to observe the state of the various resources of the monitored grid site: the administrator can treat all nodes in the cluster as a whole and observe the state of a given class of resource, or inspect the usage of a particular resource on any node within the grid site.

4. This cluster monitoring method monitors not only operating-system-level status information such as central processing unit utilization and memory usage, but also hardware status information about the cluster environment such as temperature, voltage and fan speed, which previous cluster monitoring did not do.

Claims

1. A method for monitoring a large-scale cluster system, divided into four levels and five devices and comprising a node information collection layer, a group information management layer, a cluster information management layer and a cluster monitoring layer, characterized in that: the software information collector and the hardware information collector periodically collect system status information; the group information manager periodically collects and organizes, from the software and hardware information collectors, the status information of each group member (node); the cluster information manager in turn periodically collects, organizes and stores (using a MySQL database) the status data managed by each group information manager; finally the cluster monitoring terminal reads these status data from the MySQL database and displays the status data of the various types of monitored objects to the administrator graphically and from the perspective of a logical view; in this method, communication between the cluster monitoring terminal and the MySQL database uses a communication mode based on JDBC (Java DataBase Connectivity), and communication between the modules of the other levels uses a communication mode based on sockets.

2. The method for monitoring a large-scale cluster system as claimed in claim 1, characterized in that: the method divides the monitoring system into four levels and five devices, comprising the node information collection layer (software information collector, hardware information collector), the group information management layer (group information manager), the cluster information management layer (cluster information manager) and the cluster monitoring layer (cluster monitoring terminal).

3. The method for monitoring a large-scale cluster system as claimed in claim 1, characterized in that: the software information collector periodically collects the application status information of the monitored system.

4. The method for monitoring a large-scale cluster system as claimed in claim 1, characterized in that: the hardware information collector periodically collects the hardware status information of the monitored system.

5. The method for monitoring a large-scale cluster system as claimed in claim 1, characterized in that: the group information manager periodically collects and organizes, from the software and hardware information collectors, the status information of each group member (node).

6. The method for monitoring a large-scale cluster system as claimed in claim 1, characterized in that: communication between the group information manager and the lower-layer software information collector uses a socket-based communication mode.

7. The method for monitoring a large-scale cluster system as claimed in claim 1, characterized in that: the cluster information manager periodically collects, organizes and stores (using a MySQL database) the status data managed by each group information manager.

8. The method for monitoring a large-scale cluster system as claimed in claim 1, characterized in that: communication between the cluster information manager and the lower-layer group information managers uses a socket-based communication mode.

9. The method for monitoring a large-scale cluster system as claimed in claim 1, characterized in that: the cluster monitoring terminal reads these status data from the MySQL database and displays the status data of the various types of monitored objects to the administrator graphically.

10. The method for monitoring a large-scale cluster system as claimed in claim 1, characterized in that: the cluster monitoring terminal reads these status data from the MySQL database and displays the status data of these resources to the administrator from the logical perspective of cluster resources.

11. The method for monitoring a large-scale cluster system as claimed in claim 1, characterized in that: communication between the cluster monitoring terminal and the lower-layer MySQL database uses a communication mode based on JDBC.

12. A method for monitoring a large-scale cluster system, the steps of which are as follows. Step S1: the software information collector and the hardware information collector each periodically collect the software and hardware running-state information of their node, and the status information of each node is periodically aggregated at the group information manager it belongs to. Step S2: each group information manager collects and organizes the status information of the nodes it manages and periodically aggregates it at the cluster information manager. Step S3: the cluster information manager collects and stores the information of the groups it manages, periodically organizing the cluster status information and storing it in the database. Step S4: the cluster monitoring terminal obtains the required information from the database and displays it.