CN1547121A - Method for monitoring large-scale cluster system - Google Patents

Info

Publication number
CN1547121A
Authority
CN
China
Prior art keywords
group
information
planes
supervising
workstation
Prior art date
Legal status (The legal status is an assumption and is not a legal conclusion. Google has not performed a legal analysis and makes no representation as to the accuracy of the status listed.)
Granted
Application number
CNA200310119410XA
Other languages
Chinese (zh)
Other versions
CN1270240C (en)
Inventor
李博
马捷
Current Assignee (The listed assignees may be inaccurate. Google has not performed a legal analysis and makes no representation or warranty as to the accuracy of the list.)
Huawei Technologies Co Ltd
Original Assignee
Institute of Computing Technology of CAS
Priority date (The priority date is an assumption and is not a legal conclusion. Google has not performed a legal analysis and makes no representation as to the accuracy of the date listed.)
Filing date
Publication date
Application filed by Institute of Computing Technology of CAS
Priority to CN 200310119410 (CN1270240C)
Publication of CN1547121A
Application granted
Publication of CN1270240C
Anticipated expiration
Expired - Lifetime (current legal status)

Landscapes

  • Debugging And Monitoring (AREA)
  • Computer And Data Communications (AREA)

Abstract

The invention is a method for monitoring a large-scale cluster system. It divides the monitoring system into four layers and five devices. The software and hardware information collectors (node information collection layer) periodically gather system state information; the group information manager periodically collects and manages the state information of each group member; the cluster information manager collects and organizes the state data from each group information manager and stores it in a MySQL database. Finally, the cluster monitoring terminal reads the data from the database and displays the state data of each kind of monitored object to the administrator graphically. In this way a large-scale cluster system can be monitored.

Description

A method for monitoring a large-scale cluster system
Technical field
The present invention relates to the field of high-performance cluster server technology, and in particular to a method for monitoring a large-scale cluster system.
Technical background
A cluster is a highly cost-effective solution in current high-performance computing. As cluster technology matures and costs fall, cluster sizes are growing faster and faster. Given a cluster's huge scale and large volume of resources, their state must be understood in a timely and effective way; whether this can be done has great bearing on whether a computing environment can run normally and deliver computing power. This calls for an effective method of monitoring computing resources on such a scale.
Earlier cluster monitoring methods have several shortcomings. First, earlier cluster monitoring generally adopts the two-tier Client/Server structure, which restricts the scale of the cluster too much: once the cluster scale changes, especially when it grows several-fold, the monitoring system has difficulty adapting. Second, earlier monitoring methods mostly describe the running state of one or several nodes inside the cluster from the physical view, and cannot describe the state of a class of resources across the whole cluster from the logical view. Third, earlier cluster monitoring methods often monitor only state information at the operating system and software level, such as CPU utilization and memory usage, and do not monitor state information of the cluster environment such as temperature, voltage, and fan speed.
Summary of the invention
In view of the shortcomings of existing cluster monitoring methods, the invention provides a method for monitoring a large-scale cluster system. The method provides an implementation scheme for the monitoring system of a large-scale cluster and also constructs a multi-level monitoring environment that, in a network environment, collects, aggregates, organizes, stores, and displays the state information of the monitored cluster servers.
The invention is implemented as follows:
A. Framework of the cluster monitoring method
Structurally, the method divides the monitoring system as a whole into 4 levels and 5 devices: the node information collection layer (software information collector, hardware information collector), the group information management layer (group information manager), the cluster information management layer (cluster information manager), and the cluster monitoring layer (cluster monitoring terminal). See the structural diagram of the Linux cluster superserver monitoring system for details. This multi-level architecture lets the monitoring system adapt easily to clusters of various scales, from a few nodes to thousands of nodes.
B. Collection of cluster system state information
The state information of the whole large-scale cluster system is aggregated from the state information of each node. Collecting each node's state information is the job of the node information collection layer, which consists of 2 devices: the software information collector and the hardware information collector.
The software information collector obtains system state data by reading operating system parameters at regular intervals. The system state data it collects mainly includes: CPU usage, system memory capacity and usage, swap partition size and usage, disk usage (how busy read/write operations are), the state (up/down) of each network, packets sent and received, the packet loss rate, and the running state of applications.
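The periodic reading of operating system parameters can be illustrated with a minimal sketch. It assumes a Linux node that exposes counters in the /proc/stat and /proc/meminfo text formats; the patent does not say which interface the collector reads, and the function names below are hypothetical. The parsers take text as input so that they can be exercised without a live system.

```python
def parse_cpu_times(stat_text):
    """Return (busy, total) jiffies from the aggregate 'cpu' line.

    Assumes /proc/stat-style text: user nice system idle iowait irq softirq steal ...
    Only the first eight fields are summed, since guest time is already
    counted inside user time.
    """
    for line in stat_text.splitlines():
        fields = line.split()
        if fields and fields[0] == "cpu":
            values = [int(v) for v in fields[1:9]]
            idle = values[3] + (values[4] if len(values) > 4 else 0)  # idle + iowait
            total = sum(values)
            return total - idle, total
    raise ValueError("no aggregate cpu line found")


def cpu_utilization(prev, curr):
    """Fraction of CPU time spent busy between two (busy, total) samples."""
    busy_delta = curr[0] - prev[0]
    total_delta = curr[1] - prev[1]
    return busy_delta / total_delta if total_delta > 0 else 0.0


def parse_meminfo(meminfo_text):
    """Return a dict of numeric values (kB) from /proc/meminfo-style text."""
    info = {}
    for line in meminfo_text.splitlines():
        key, _, rest = line.partition(":")
        parts = rest.split()
        if parts:
            info[key.strip()] = int(parts[0])
    return info
```

A collector built this way would take one sample per cycle, keep the previous sample, and report the difference; swap usage follows from the SwapTotal and SwapFree entries in the same way as memory usage.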
The hardware information collector is a hardware device. It collects state information about the hardware inside the cluster system through a data monitoring card (acquisition card), temperature probes, voltage measuring devices, and fan speed measuring devices. The data it collects mainly includes the voltage and operating temperature of each hardware device and the speed of each fan.
C. Aggregation of cluster system state information
The architecture of this monitoring method is divided into 4 levels. After the node information collection layer at the bottom has collected the cluster system state information, the information is organized and aggregated level by level.
The state information that the node information collection layer collects from each node is aggregated for the first time at the group information manager. The group information manager periodically sends requests to its group members (nodes), asking for the software and hardware state information of each node. The software information collector on each node reports that node's application state information to the group information manager using socket-based communication, while the hardware information collector on each node delivers the node's hardware state data to the group information manager through a serial port using the I2C protocol.
Each group information manager reports the state data of all the group members it manages to the cluster information manager; this is the second aggregation of state data inside the cluster. The cluster information manager periodically sends requests to each group information manager, asking for the summary information that the manager holds for each of its nodes. On receiving a request, each group information manager sends the state information of all members of its group to the cluster information manager using socket-based communication, so that the software and hardware state information of all nodes in the cluster is aggregated at the cluster information manager.
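The request-and-report exchange over sockets can be sketched as follows. This is an illustration under stated assumptions, not the patented implementation: the newline-delimited JSON message format, the "get_state" request type, and all function names are hypothetical, and a local socket pair with one thread per node stands in for real network connections between hosts.

```python
import json
import socket
import threading


def send_msg(sock, obj):
    """Send one JSON object terminated by a newline."""
    sock.sendall(json.dumps(obj).encode("utf-8") + b"\n")


def recv_msg(sock):
    """Read until the buffer ends with a newline; adequate for strict request/reply."""
    buf = bytearray()
    while not buf.endswith(b"\n"):
        chunk = sock.recv(4096)
        if not chunk:
            raise ConnectionError("peer closed before a full message arrived")
        buf.extend(chunk)
    return json.loads(buf.decode("utf-8"))


def collector_serve_once(sock, node_id, read_state):
    """Node side: wait for one request and answer with the current state."""
    request = recv_msg(sock)
    if request.get("type") == "get_state":
        send_msg(sock, {"node": node_id, "state": read_state()})


def poll_group(node_states):
    """Manager side: request state from every node and build the group table.

    node_states maps node id -> state dict. Each simulated collector runs in
    its own thread and is reached through a local socket pair.
    """
    table = {}
    for node_id, state in node_states.items():
        manager_sock, node_sock = socket.socketpair()
        worker = threading.Thread(
            target=collector_serve_once,
            args=(node_sock, node_id, lambda state=state: state),
        )
        worker.start()
        try:
            send_msg(manager_sock, {"type": "get_state"})
            reply = recv_msg(manager_sock)
        finally:
            worker.join()
            manager_sock.close()
            node_sock.close()
        table[reply["node"]] = reply["state"]
    return table
```

The same pull pattern applies one level up: the cluster information manager would send a request to each group information manager and merge the group tables it receives.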
D. Storage of cluster system state information
After the node's operating system has started successfully, the software information collector on the node sets aside a region of the node's memory to hold the node's state data; the software information collector periodically refreshes the state data held in this memory.
Like the software information collector, the group information manager sets aside a region of its node's memory after the operating system has started successfully, to hold the set of state data reported by all nodes in the group; the group information manager periodically refreshes the state data held in this memory.
The cluster information manager manages the state data of the whole cluster, covering both current state data and historical data. Current state data is managed much as in the software information collector: after the node's operating system has started successfully, the cluster information manager sets aside a region of the node's memory to hold the set of state data reported by all group information managers in the cluster, and periodically refreshes the state data held in this memory. The cluster information manager also manages the cluster's historical state data, which is done with a MySQL database. In each cycle it stores the state data collected from the group information managers in a MySQL table. Tables are created per day: a new table is created each day to hold all of the cluster's state data for that day.
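The table-per-day scheme can be sketched as below. To keep the example self-contained, Python's built-in sqlite3 module stands in for the MySQL server described in the text; the table naming pattern (state_YYYYMMDD) and the column layout are assumptions made for illustration and do not come from the patent.

```python
import sqlite3
from datetime import date


def table_name_for(day):
    """Hypothetical naming scheme: one table per calendar day."""
    return "state_" + day.strftime("%Y%m%d")


def store_cycle(conn, day, cycle_time, samples):
    """Store one collection cycle in that day's table, creating it on first use.

    samples is an iterable of (node, metric, value) tuples. The table name is
    derived only from a date object, so formatting it into the SQL text is safe.
    """
    table = table_name_for(day)
    conn.execute(
        f"CREATE TABLE IF NOT EXISTS {table} "
        "(ts TEXT, node TEXT, metric TEXT, value REAL)"
    )
    conn.executemany(
        f"INSERT INTO {table} VALUES (?, ?, ?, ?)",
        [(cycle_time, node, metric, value) for node, metric, value in samples],
    )
    conn.commit()


def tables(conn):
    """List the per-day tables that exist so far (sqlite-specific catalog query)."""
    rows = conn.execute(
        "SELECT name FROM sqlite_master WHERE type = 'table' ORDER BY name"
    )
    return [name for (name,) in rows]
```

Keeping one table per day bounds the size of each table and lets old data be archived or dropped a whole day at a time, at the cost of queries that span several days having to visit several tables.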
E. Display of cluster system state information
Cluster system state information is displayed by the cluster monitoring terminal, which sits in the cluster monitoring layer.
The interface of the cluster monitoring terminal consists of a set of views in three classes: static information views, real-time information views, and historical data analysis views. Information about the monitored cluster is presented graphically, and the terminal's data comes from the database server. The static information view shows, cluster by cluster, configuration-related information such as CPU information, memory size, and hard disk capacity. The real-time information view dynamically shows, as bar charts or line charts, the CPU utilization, memory usage, swap partition utilization, and hard disk utilization of each node in the cluster, together with hardware faults, including unstable voltage or current, fan stalls, and abnormal temperature. The historical data analysis view orders data by time and takes the cluster as a whole as the object of analysis. It shows the trends in CPU usage, hard disk load, memory usage, and swap partition utilization across all nodes in the cluster, to help analyse whether the cluster's current performance meets the demands of current applications. It also tallies software and hardware fault points and fault frequencies over time, to assist with software and hardware upgrades. These views are likewise shown as bar charts and line charts.
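The difference between inspecting one node and viewing a class of resource across the whole cluster can be made concrete with a small sketch. The data layout (a mapping from node id to a dict of metrics) and the function names are assumptions for illustration; the patent does not specify how the terminal computes its views.

```python
def physical_view(node_states, node_id):
    """Per-node view: every metric recorded for one node."""
    return dict(node_states[node_id])


def logical_view(node_states, metric):
    """Cluster-wide view: summary of one class of resource over all nodes."""
    values = [state[metric] for state in node_states.values() if metric in state]
    if not values:
        return None
    return {
        "metric": metric,
        "nodes": len(values),
        "min": min(values),
        "max": max(values),
        "mean": sum(values) / len(values),
    }
```

A bar chart of the per-node values and a line chart of the mean over successive cycles would correspond to the real-time and historical views respectively.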
Description of drawings
Fig. 1 is a structural diagram of the large-scale cluster system monitoring method of the invention;
Fig. 2 is a deployment diagram of a cluster monitoring system applying the method of Fig. 1;
Fig. 3 is a flowchart of the large-scale cluster system monitoring method of the invention.
As shown in Fig. 1, the large-scale cluster system monitoring method is structurally divided into 4 levels and 5 devices: node information collection layer 1 (software information collector, hardware information collector), group information management layer 2 (group information manager), cluster information management layer 3 (cluster information manager), and cluster monitoring layer 4 (cluster monitoring terminal).
Node information collection is divided into two parts: software information collection and hardware information collection. The hardware information collector on each node delivers the node hardware information it collects to the group information manager over a dedicated I2C network. Likewise, the software information collector on each node passes the corresponding node system state information to the group information manager. Each group information manager can manage the information of 0 to 128 nodes. The information from several group information managers is aggregated in the cluster information manager, which collects and processes the continuously arriving data and stores it in a database, providing the data the administrator needs to monitor the state of each node in the cluster and to review the nodes' running history. The cluster monitoring terminal is a set of graphical management tools. It obtains the current and historical state information of the cluster's nodes from the database and presents it to the administrator graphically, so that the administrator obtains the current and historical state of the monitored cluster intuitively, promptly, and accurately.
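Since each group information manager handles at most 128 nodes, a cluster has to be partitioned into groups before deployment. A minimal sketch of one possible partitioning follows; the simple consecutive split and the function name are assumptions, as the patent does not say how nodes are assigned to groups.

```python
def assign_groups(node_ids, max_per_group=128):
    """Split a list of node ids into consecutive groups of at most max_per_group."""
    if max_per_group < 1:
        raise ValueError("max_per_group must be at least 1")
    return [
        node_ids[start:start + max_per_group]
        for start in range(0, len(node_ids), max_per_group)
    ]
```

Under this scheme a cluster of 300 nodes needs three group information managers, and growing the cluster only adds groups without changing the layers above.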
As shown in Fig. 2, a monitoring system applying this method deploys each module on the corresponding node in the cluster, forming a complete monitoring system whose parts work in coordination.
The software and hardware information collectors are deployed on every compute node in the cluster and are responsible for collecting that node's software and hardware state information. The group information manager is deployed on a group management node inside the cluster and is responsible for aggregating the state information of each node in the group. The cluster information manager is deployed on the node at the cluster's network egress (which has both external and internal network connections) and is responsible for aggregating the state information of each group and storing the data in the database. The cluster monitoring terminal is deployed on a terminal with a network connection to the database and displays the state information held in the database.
The steps of the large-scale cluster system monitoring method of Fig. 3 are as follows:
Step S1: the software information collector and the hardware information collector each periodically collect the running state information of the node's software and hardware; each node's state information is periodically aggregated to the group information manager it belongs to.
Step S2: each group information manager collects and organizes the state information of the nodes it manages and periodically aggregates it to the cluster information manager.
Step S3: the cluster information manager collects and stores the information of the groups it manages, and periodically organizes the cluster state information it manages and stores it in the database.
Step S4: the cluster monitoring terminal obtains the required information from the database and displays it.
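One collection cycle through steps S1 to S4 can be sketched in memory as follows. Everything here is a simplification under stated assumptions: a plain list stands in for the database, a callback stands in for the collectors, and the function and parameter names are hypothetical.

```python
def run_cycle(groups, read_node_state, database, cycle_id):
    """Run one monitoring cycle and return the record the terminal would display.

    groups maps group id -> list of node ids; read_node_state(node_id) returns
    that node's state dict; database is a list used as a stand-in for storage.
    """
    # S1 and S2: each group manager gathers the state of its own nodes.
    group_tables = {
        group_id: {node_id: read_node_state(node_id) for node_id in node_ids}
        for group_id, node_ids in groups.items()
    }

    # S3: the cluster manager merges the group tables and stores the result.
    cluster_table = {}
    for table in group_tables.values():
        cluster_table.update(table)
    database.append((cycle_id, cluster_table))

    # S4: the monitoring terminal reads the latest record back from storage.
    return database[-1]
```

Calling run_cycle once per period yields one stored record per cycle, which is the history that the analysis views draw on.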
The effects of the invention are as follows:
1. The four-level architecture proposed by this monitoring method adapts more easily to clusters of different scales, especially large-scale clusters, and is more scalable than the two-tier Client/Server structure used by earlier cluster monitoring.
2. This cluster monitoring method uses database technology to manage large volumes of state data, backs the data up regularly, and provides source data for data analysis tools, which makes it much easier for administrators to analyse the historical running state data of the monitored cluster.
3. This cluster monitoring method proposes scalability at the logical level of views, giving the administrator different perspectives from which to observe the state of the monitored cluster's resources: the administrator can treat all nodes in the cluster as a whole and observe the state of one class of resource, or inspect the usage of a particular resource on any node in the cluster.
4. This cluster monitoring method monitors not only state information at the operating system and software level, such as CPU utilization and memory usage, but also hardware state information of the cluster environment, such as temperature, voltage, and fan speed, which earlier cluster monitoring did not cover.

Claims (12)

1. A method for monitoring a large-scale cluster system, divided into four levels and five devices and comprising a node information collection layer, a group information management layer, a cluster information management layer, and a cluster monitoring layer, characterized in that: the software information collector and the hardware information collector periodically collect system state information; the group information manager periodically collects and organizes the state information of each group member (node) from the software and hardware information collectors; the cluster information manager in turn periodically collects, organizes, and stores (using a MySQL database) the state data managed by each group information manager; finally, the cluster monitoring terminal reads these state data from the MySQL database and displays the state data of the various kinds of monitored objects to the administrator graphically and from the perspective of a logical view; in this method, communication between the cluster monitoring terminal and the MySQL database uses a communication mode based on JDBC (Java DataBase Connectivity), and communication between the modules of the other levels uses a socket-based communication mode.
2. The method for monitoring a large-scale cluster system as claimed in claim 1, characterized in that: the method divides the monitoring system into four levels and five devices, comprising a node information collection layer (software information collector, hardware information collector), a group information management layer (group information manager), a cluster information management layer (cluster information manager), and a cluster monitoring layer (cluster monitoring terminal).
3. The method for monitoring a large-scale cluster system as claimed in claim 1, characterized in that: the software information collector periodically collects the application state information of the monitored system.
4. The method for monitoring a large-scale cluster system as claimed in claim 1, characterized in that: the hardware information collector periodically collects the hardware state information of the monitored system.
5. The method for monitoring a large-scale cluster system as claimed in claim 1, characterized in that: the group information manager periodically collects and organizes the state information of each group member (node) from the software and hardware information collectors.
6. The method for monitoring a large-scale cluster system as claimed in claim 1, characterized in that: communication between the group information manager and the lower-layer software information manager uses a socket-based communication mode.
7. The method for monitoring a large-scale cluster system as claimed in claim 1, characterized in that: the cluster information manager periodically collects, organizes, and stores (using a MySQL database) the state data managed by each group information manager.
8. The method for monitoring a large-scale cluster system as claimed in claim 1, characterized in that: communication between the cluster information manager and the lower-layer group information managers uses a socket-based communication mode.
9. The method for monitoring a large-scale cluster system as claimed in claim 1, characterized in that: the cluster monitoring terminal reads these state data from the MySQL database and displays the state data of the various kinds of monitored objects to the administrator graphically.
10. The method for monitoring a large-scale cluster system as claimed in claim 1, characterized in that: the cluster monitoring terminal reads these state data from the MySQL database and presents the state data of these resources to the administrator from the logical perspective of cluster resources.
11. The method for monitoring a large-scale cluster system as claimed in claim 1, characterized in that: communication between the cluster monitoring terminal and the lower-layer MySQL database uses a JDBC-based communication mode.
12. A method for monitoring a large-scale cluster system, comprising the following steps: step S1, the software information collector and the hardware information collector each periodically collect the running state information of the node's software and hardware, and each node's state information is periodically aggregated to the group information manager it belongs to; step S2, each group information manager collects and organizes the state information of the nodes it manages and periodically aggregates it to the cluster information manager; step S3, the cluster information manager collects and stores the information of the groups it manages, and periodically organizes the cluster state information it manages and stores it in the database; step S4, the cluster monitoring terminal obtains the required information from the database and displays it.
CN 200310119410 2003-12-10 2003-12-10 Method for monitoring large-scale cluster system Expired - Lifetime CN1270240C (en)

Priority Applications (1)

Application Number Priority Date Filing Date Title
CN 200310119410 CN1270240C (en) 2003-12-10 2003-12-10 Method for monitoring large-scale cluster system

Applications Claiming Priority (1)

Application Number Priority Date Filing Date Title
CN 200310119410 CN1270240C (en) 2003-12-10 2003-12-10 Method for monitoring large-scale cluster system

Publications (2)

Publication Number Publication Date
CN1547121A (en) 2004-11-17
CN1270240C (en) 2006-08-16

Family

ID=34338241

Family Applications (1)

Application Number Title Priority Date Filing Date
CN 200310119410 Expired - Lifetime CN1270240C (en) 2003-12-10 2003-12-10 Method for monitoring large-scale cluster system

Country Status (1)

Country Link
CN (1) CN1270240C (en)

Families Citing this family (2)

* Cited by examiner, † Cited by third party
Publication number Priority date Publication date Assignee Title
US8642831B2 (en) 2008-02-29 2014-02-04 Ferrosan Medical Devices A/S Device for promotion of hemostasis and/or wound healing
EP2977066A3 (en) 2012-06-12 2016-07-27 Ferrosan Medical Devices A/S Dry haemostatic composition

Cited By (9)

* Cited by examiner, † Cited by third party
Publication number Priority date Publication date Assignee Title
CN100440159C (en) * 2005-01-06 2008-12-03 富士通株式会社 Method and apparatus for providing monitoring-information, and computer product
CN100449326C (en) * 2005-03-16 2009-01-07 西门子(中国)有限公司 Recording method and system of monitoring journal
CN101572631B (en) * 2008-04-30 2012-03-28 新奥特(北京)视频技术有限公司 Data transmission state monitoring method based on Eclipse RCP
CN102141934A (en) * 2011-02-28 2011-08-03 浪潮(北京)电子信息产业有限公司 Method and device for controlling process on fat node
CN102662762A (en) * 2012-03-30 2012-09-12 浪潮电子信息产业股份有限公司 Method for effectively controlling use of memory resource of fat node
CN103825753A (en) * 2012-11-19 2014-05-28 英业达科技有限公司 Server system
CN103345414A (en) * 2013-07-26 2013-10-09 广州广电运通金融电子股份有限公司 Method for controlling hardware equipment by self-service terminal, equipment manager and processor
CN103345414B (en) * 2013-07-26 2016-08-24 广州广电运通金融电子股份有限公司 Self-aided terminal controls the method for hardware device, equipment manager and processor
CN103473164A (en) * 2013-09-25 2013-12-25 浪潮电子信息产业股份有限公司 Monitoring and early-warning method for linux server

Also Published As

Publication number Publication date
CN1270240C (en) 2006-08-16

Legal Events

Date Code Title Description
C06 Publication
PB01 Publication
C10 Entry into substantive examination
SE01 Entry into force of request for substantive examination
C14 Grant of patent or utility model
GR01 Patent grant
EE01 Entry into force of recordation of patent licensing contract

Assignee: Beijing Shidai Shanyuan Automation Control Technology Co.,Ltd.

Assignor: Institute of Computing Technology, Chinese Academy of Sciences

Contract fulfillment period: 2007.10.9 to 2012.10.9

Contract record no.: 2009990000366

Denomination of invention: Method for monitoring large-scale cluster system

Granted publication date: 20060816

License type: Exclusive license

Record date: 20090424

LIC Patent licence contract for exploitation submitted for record

Free format text: EXCLUSIVE LICENSE; TIME LIMIT OF IMPLEMENTING CONTACT: 2007.10.9 TO 2012.10.9; CHANGE OF CONTRACT

Name of requester: BEIJING SHIDAI SHANYUAN AUTOMATION CONTROL TECHNOL

Effective date: 20090424

ASS Succession or assignment of patent right

Owner name: HUAWEI TECHNOLOGY CO., LTD.

Free format text: FORMER OWNER: INSTITUTE OF COMPUTING TECHNOLOGY, CHINESE ACADEMY OF SCIENCES

Effective date: 20130605

C41 Transfer of patent application or patent right or utility model
COR Change of bibliographic data

Free format text: CORRECT: ADDRESS; FROM: 100080 HAIDIAN, BEIJING TO: 518129 SHENZHEN, GUANGDONG PROVINCE

TR01 Transfer of patent right

Effective date of registration: 20130605

Address after: 518129 Bantian HUAWEI headquarters office building, Longgang District, Guangdong, Shenzhen

Patentee after: HUAWEI TECHNOLOGIES Co.,Ltd.

Address before: 100080 No. 6 South Road, Zhongguancun Academy of Sciences, Beijing

Patentee before: Institute of Computing Technology, Chinese Academy of Sciences

CX01 Expiry of patent term

Granted publication date: 20060816