CN109165138A - Method and apparatus for monitoring equipment faults - Google Patents

Method and apparatus for monitoring equipment faults

Info

Publication number
CN109165138A
Authority
CN
China
Prior art keywords
failure
target critical
monitoring
index
fault
Prior art date
Legal status (The legal status is an assumption and is not a legal conclusion. Google has not performed a legal analysis and makes no representation as to the accuracy of the status listed.)
Granted
Application number
CN201810866734.6A
Other languages
Chinese (zh)
Other versions
CN109165138B (en
Inventor
陈涛
Current Assignee (The listed assignees may be inaccurate. Google has not performed a legal analysis and makes no representation or warranty as to the accuracy of the list.)
Wangsu Science and Technology Co Ltd
Original Assignee
Wangsu Science and Technology Co Ltd
Priority date (The priority date is an assumption and is not a legal conclusion. Google has not performed a legal analysis and makes no representation as to the accuracy of the date listed.)
Filing date
Publication date
Application filed by Wangsu Science and Technology Co Ltd
Priority to CN201810866734.6A
Publication of CN109165138A
Application granted
Publication of CN109165138B
Legal status: Active
Anticipated expiration

Classifications

    • G: PHYSICS
    • G06: COMPUTING; CALCULATING OR COUNTING
    • G06F: ELECTRIC DIGITAL DATA PROCESSING
    • G06F11/00: Error detection; Error correction; Monitoring
    • G06F11/30: Monitoring
    • G06F11/3055: Monitoring arrangements for monitoring the status of the computing system or of the computing system component, e.g. monitoring if the computing system is on, off, available, not available
    • G: PHYSICS
    • G06: COMPUTING; CALCULATING OR COUNTING
    • G06F: ELECTRIC DIGITAL DATA PROCESSING
    • G06F11/00: Error detection; Error correction; Monitoring
    • G06F11/30: Monitoring
    • G06F11/3051: Monitoring arrangements for monitoring the configuration of the computing system or of the computing system component, e.g. monitoring the presence of processing resources, peripherals, I/O links, software programs
    • G: PHYSICS
    • G06: COMPUTING; CALCULATING OR COUNTING
    • G06F: ELECTRIC DIGITAL DATA PROCESSING
    • G06F11/00: Error detection; Error correction; Monitoring
    • G06F11/30: Monitoring
    • G06F11/34: Recording or statistical evaluation of computer activity, e.g. of down time, of input/output operation; Recording or statistical evaluation of user activity, e.g. usability assessment
    • G06F11/3409: Recording or statistical evaluation of computer activity, e.g. of down time, of input/output operation; Recording or statistical evaluation of user activity, e.g. usability assessment for performance assessment
    • G06F11/3419: Recording or statistical evaluation of computer activity, e.g. of down time, of input/output operation; Recording or statistical evaluation of user activity, e.g. usability assessment for performance assessment by assessing time
    • G06F11/3423: Recording or statistical evaluation of computer activity, e.g. of down time, of input/output operation; Recording or statistical evaluation of user activity, e.g. usability assessment for performance assessment by assessing time where the assessed time is active or idle time

Landscapes

  • Engineering & Computer Science (AREA)
  • Theoretical Computer Science (AREA)
  • General Engineering & Computer Science (AREA)
  • Quality & Reliability (AREA)
  • Physics & Mathematics (AREA)
  • General Physics & Mathematics (AREA)
  • Computing Systems (AREA)
  • Computer Hardware Design (AREA)
  • Debugging And Monitoring (AREA)

Abstract

The invention discloses a method and apparatus for monitoring equipment faults, belonging to the field of computer technology. The method includes: after a tool set script is run, monitoring a target key indicator, at intervals of that indicator's monitoring sleep duration, using a basic tool included in the tool set script; if the target key indicator is abnormal, detecting through the tool set script whether a fault currently exists, and otherwise adjusting the monitoring sleep duration of the target key indicator based on a first preset duration; if a fault currently exists, adjusting the monitoring sleep duration of the target key indicator based on a second preset duration, and determining and reporting the fault information of the current fault through the tool set script; if no fault currently exists, adjusting the monitoring sleep duration of the target key indicator based on a third preset duration. The invention avoids both overly frequent monitoring of key indicators and repeated reporting of the same fault, while still allowing equipment faults to be discovered in a timely manner.

Description

Method and apparatus for monitoring equipment faults
Technical field
The present invention relates to the field of computer technology, and in particular to a method and apparatus for monitoring equipment faults.
Background
During operation, equipment often develops faults because of hardware or software problems. These faults can reduce its processing capacity, cause logic errors, or even lead to downtime or component damage. To find and resolve such faults as early as possible, users often use performance monitoring programs (referred to as monitoring tools) to view the equipment's performance indicators and understand its operating state.
There currently exists a tool set script that integrates a variety of monitoring tools and can monitor the operating state of equipment automatically and in a unified way. Specifically, a user installs and runs the tool set script on the equipment, so that the equipment periodically monitors multiple key indicators using the basic tools included in the script. When a key indicator becomes abnormal, the equipment uses some of the data collection tools in the tool set script to collect equipment operating parameters, and judges from the collected parameters whether a fault has occurred and, if so, its fault type. The equipment can then report the fault to remind technicians to repair it.
In the course of implementing the present invention, the inventor found that the prior art has at least the following problem:
Once a fault occurs, it generally persists for some time. If the period for monitoring key indicators is short, the equipment will keep detecting and reporting the same fault while it persists, consuming a large amount of processing resources on performance monitoring. If the period is long, faults may not be discovered in time.
Summary of the invention
To solve the problems of the prior art, embodiments of the present invention provide a method and apparatus for monitoring equipment faults. The technical solution is as follows:
In a first aspect, a method for monitoring equipment faults is provided, the method comprising:
at intervals of the monitoring sleep duration of a target key indicator, monitoring the target key indicator using a basic tool included in a tool set script;
if the target key indicator is abnormal, detecting through the tool set script whether a fault currently exists, and otherwise adjusting the monitoring sleep duration of the target key indicator based on a first preset duration;
if a fault currently exists, adjusting the monitoring sleep duration of the target key indicator based on a second preset duration, and determining and reporting the fault information of the current fault through the tool set script;
if no fault currently exists, adjusting the monitoring sleep duration of the target key indicator based on a third preset duration, wherein the second preset duration is greater than the first preset duration, and the first preset duration is greater than the third preset duration.
Optionally, otherwise adjusting the monitoring sleep duration of the target key indicator based on the first preset duration comprises:
otherwise counting the number of consecutive normal results of the target key indicator, and adjusting the monitoring sleep duration of the target key indicator to the product of the consecutive normal count and the first preset duration.
Optionally, if no fault currently exists, adjusting the monitoring sleep duration of the target key indicator based on the third preset duration comprises:
if no fault currently exists, counting the number of consecutive fault-free results obtained while the target key indicator has been continuously observed as abnormal, and adjusting the monitoring sleep duration of the target key indicator to the product of the consecutive fault-free count and the third preset duration.
Optionally, determining and reporting the fault information of the current fault through the tool set script comprises:
determining the fault information of the current fault through the tool set script, and incrementing the short-term repetition count of the current fault by one;
when the short-term repetition count equals the fault report threshold corresponding to the current fault, reporting the fault information of the current fault, and increasing the fault report threshold of the current fault according to a preset rule.
Optionally, determining the fault information of the current fault through the tool set script and incrementing the short-term repetition count of the current fault by one comprises:
selecting, through the tool set script, the fault cause of the current fault from the preset fault causes corresponding to the target key indicator, and determining the fault features of that fault cause;
if the fault cause is already recorded locally, and the similarity between the locally recorded fault features of that fault cause and the fault features determined this time is greater than a preset threshold, incrementing the short-term repetition count of the locally recorded fault cause by one; otherwise recording the fault cause and fault features determined this time, and setting the short-term repetition count of the fault cause to one.
Optionally, the fault causes are recorded in the form of a linked list, wherein the linked list includes multiple nodes, each node corresponds to one key indicator, each key indicator corresponds to one or more sublists, each sublist includes multiple list heads for recording fault causes, and each list head corresponds to multiple child nodes, which respectively store the fault features, the short-term repetition count, and the fault report threshold of the fault cause.
Optionally, the key indicators include one or more of CPU usage, memory usage, load value, I/O wait time, and the CPU usage of each process.
In a second aspect, an apparatus for monitoring equipment faults is provided, the apparatus comprising:
a monitoring module, configured to monitor a target key indicator, at intervals of the monitoring sleep duration of the target key indicator, using a basic tool included in a tool set script;
an adjustment module, configured to: if the target key indicator is abnormal, detect through the tool set script whether a fault currently exists, and otherwise adjust the monitoring sleep duration of the target key indicator based on a first preset duration; if a fault currently exists, adjust the monitoring sleep duration of the target key indicator based on a second preset duration, and determine and report the fault information of the current fault through the tool set script; and if no fault currently exists, adjust the monitoring sleep duration of the target key indicator based on a third preset duration;
wherein the second preset duration is greater than the first preset duration, and the first preset duration is greater than the third preset duration.
Optionally, the adjustment module is specifically configured to:
otherwise count the number of consecutive normal results of the target key indicator, and adjust the monitoring sleep duration of the target key indicator to the product of the consecutive normal count and the first preset duration.
Optionally, the adjustment module is specifically configured to:
if no fault currently exists, count the number of consecutive fault-free results obtained while the target key indicator has been continuously observed as abnormal, and adjust the monitoring sleep duration of the target key indicator to the product of the consecutive fault-free count and the third preset duration.
Optionally, the adjustment module is specifically configured to:
determine the fault information of the current fault through the tool set script, and increment the short-term repetition count of the current fault by one;
when the short-term repetition count equals the fault report threshold corresponding to the current fault, report the fault information of the current fault, and increase the fault report threshold of the current fault according to a preset rule.
Optionally, the adjustment module is specifically configured to:
select, through the tool set script, the fault cause of the current fault from the preset fault causes corresponding to the target key indicator, and determine the fault features of that fault cause;
if the fault cause is already recorded locally, and the similarity between the locally recorded fault features of that fault cause and the fault features determined this time is greater than a preset threshold, increment the short-term repetition count of the locally recorded fault cause by one; otherwise record the fault cause and fault features determined this time, and set the short-term repetition count of the fault cause to one.
Optionally, the fault causes are recorded in the form of a linked list, wherein the linked list includes multiple nodes, each node corresponds to one key indicator, each key indicator corresponds to one or more sublists, each sublist includes multiple list heads for recording fault causes, and each list head corresponds to multiple child nodes, which respectively store the fault features, the short-term repetition count, and the fault report threshold of the fault cause.
Optionally, the key indicators include one or more of CPU usage, memory usage, load value, I/O wait time, and the CPU usage of each process.
In a third aspect, a device is provided, the device comprising a processor and a memory, the memory storing at least one instruction, at least one program, a code set, or an instruction set, which is loaded and executed by the processor to implement the method for monitoring equipment faults described in the first aspect.
In a fourth aspect, a computer-readable storage medium is provided, the storage medium storing at least one instruction, at least one program, a code set, or an instruction set, which is loaded and executed by a processor to implement the method for monitoring equipment faults described in the first aspect.
The technical solution provided by the embodiments of the present invention has the following beneficial effects:
In the embodiments of the present invention, the target key indicator is monitored, at intervals of its monitoring sleep duration, using a basic tool included in the tool set script. If the target key indicator is abnormal, the tool set script is used to detect whether a fault currently exists; otherwise the monitoring sleep duration of the target key indicator is adjusted based on the first preset duration. If a fault currently exists, the monitoring sleep duration is adjusted based on the second preset duration, and the fault information of the current fault is determined and reported through the tool set script. If no fault currently exists, the monitoring sleep duration is adjusted based on the third preset duration. The second preset duration is greater than the first preset duration, and the first preset duration is greater than the third preset duration. In this way, when the tool set script is used, different key indicators are given different monitoring sleep durations, the monitoring of one key indicator does not interfere with that of another, and monitoring sleep durations of different lengths are set and adjusted according to the different monitoring results. This avoids both overly frequent monitoring of key indicators and repeated reporting of the same fault, while still allowing equipment faults to be discovered in a timely manner.
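The three-way adjustment summarized above can be sketched as a small state update. This is a minimal illustration, not the patented implementation: the names `IndicatorState` and `next_sleep` and the concrete durations are hypothetical, the fault branch simply sets the sleep to the second preset duration, and resetting the counters whenever a streak is broken is an assumption about what "consecutive" means. The only constraint taken from the text is that the second preset duration exceeds the first, which exceeds the third.

```python
from dataclasses import dataclass

# Hypothetical preset durations in seconds; the source only requires
# SECOND_PRESET > FIRST_PRESET > THIRD_PRESET.
FIRST_PRESET = 60.0
SECOND_PRESET = 300.0
THIRD_PRESET = 10.0


@dataclass
class IndicatorState:
    """Per-indicator monitoring state (illustrative names)."""
    sleep: float = FIRST_PRESET
    consecutive_normal: int = 0
    consecutive_no_fault: int = 0


def next_sleep(state: IndicatorState, abnormal: bool, fault: bool) -> float:
    """Update the state after one monitoring pass and return the new sleep."""
    if not abnormal:
        # Indicator normal: back off in multiples of the first preset duration.
        state.consecutive_normal += 1
        state.consecutive_no_fault = 0
        state.sleep = state.consecutive_normal * FIRST_PRESET
    elif fault:
        # Fault confirmed: sleep longest so the same fault is not re-detected
        # immediately (assumption: use the second preset duration directly).
        state.consecutive_normal = 0
        state.consecutive_no_fault = 0
        state.sleep = SECOND_PRESET
    else:
        # Abnormal but no fault yet: check again soon, in multiples of the
        # third preset duration.
        state.consecutive_normal = 0
        state.consecutive_no_fault += 1
        state.sleep = state.consecutive_no_fault * THIRD_PRESET
    return state.sleep
```

Each key indicator would hold its own `IndicatorState`, which is what keeps the monitoring of different indicators independent.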
Brief description of the drawings
To describe the technical solutions in the embodiments of the present invention more clearly, the drawings needed for describing the embodiments are briefly introduced below. The drawings described below show only some embodiments of the present invention; those of ordinary skill in the art can obtain other drawings from them without creative effort.
Fig. 1 is a flowchart of a method for monitoring equipment faults provided by an embodiment of the present invention;
Fig. 2 is a logical schematic diagram of monitoring key indicators provided by an embodiment of the present invention;
Fig. 3 is a structural schematic diagram of a linked list provided by an embodiment of the present invention;
Fig. 4 is a structural schematic diagram of an apparatus for monitoring equipment faults provided by an embodiment of the present invention;
Fig. 5 is a structural schematic diagram of a device provided by an embodiment of the present invention.
Detailed description of the embodiments
To make the objectives, technical solutions, and advantages of the present invention clearer, the embodiments of the present invention are described in further detail below with reference to the accompanying drawings.
An embodiment of the present invention provides a method for monitoring equipment faults. The method may be executed by any device capable of running programs, such as a server or a terminal. The device can load and run the tool set script mentioned in the background section, and through that script can use different monitoring tools to monitor its operating state from different angles, so that hardware or software faults arising during operation are discovered in time. The device may include a processor, a memory, and a transceiver. The processor may perform the fault monitoring processing described in the following flow. The memory may store data needed and generated during processing, such as the tool set script and recorded equipment operating parameters. The transceiver may send and receive data related to the processing, for example receiving instructions entered by a user and reporting fault information for equipment faults. The device may support multiple processes running simultaneously; while running, each process occupies CPU processing resources to a varying degree, uses a certain amount of memory, and generates disk I/O.
The processing flow shown in Fig. 1 is described in detail below with reference to specific embodiments. The content may be as follows:
Step 101: at intervals of the monitoring sleep duration of the target key indicator, monitor the target key indicator using a basic tool included in the tool set script.
In implementation, after a technician installs the tool set script on the device, the device can load and run it, and then monitor multiple key indicators using the basic tools the script includes. The key indicators may be preset so that, taken together, they reveal simply and promptly whether the device has developed a fault. Each key indicator can be monitored in real time by a small number of basic tools that reflect whether the indicator is abnormal. Because only a few basic tools are executed, monitoring consumes little of the device's processing resources and has little impact on its performance. A monitoring sleep duration can be set separately for each key indicator; that is, once per monitoring sleep duration, the device monitors the corresponding key indicator using a basic tool included in the tool set script. The monitoring sleep durations of different key indicators may differ, and accordingly so may the moments at which they are monitored. Taking the target key indicator as an example, after the tool set script is run, the device monitors the target key indicator, at intervals of its monitoring sleep duration, using a basic tool included in the tool set script.
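One way to picture indicators being monitored on independent schedules is a priority queue keyed by each indicator's next due time. The sketch below only simulates the schedule; the function name `simulate`, the indicator names, and the durations are hypothetical, and a real monitor would run the basic tool where the comment indicates. All sleep durations are assumed to be positive.

```python
import heapq
from typing import Dict, List, Tuple


def simulate(sleeps: Dict[str, float], horizon: float) -> List[Tuple[float, str]]:
    """Return (time, indicator) monitoring events up to `horizon` seconds."""
    # Each indicator is first due after one full sleep duration.
    queue = [(sleep, name) for name, sleep in sleeps.items()]
    heapq.heapify(queue)
    events = []
    while queue and queue[0][0] <= horizon:
        due, name = heapq.heappop(queue)
        events.append((due, name))
        # A real monitor would run the basic tool here and could change
        # sleeps[name] before the indicator is rescheduled.
        heapq.heappush(queue, (due + sleeps[name], name))
    return events
```

For example, `simulate({"cpu": 2.0, "mem": 3.0}, 6.0)` interleaves the two indicators according to their own intervals rather than polling both on a shared period.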
Optionally, the key indicators include one or more of CPU usage, memory usage, load value, I/O wait time, and the CPU usage of each process. It is understood that in other embodiments the key indicators are not limited to those listed above.
In implementation, the five indicators CPU usage, memory usage, load value, I/O wait time, and per-process CPU usage may be chosen as key indicators. Specifically, CPU usage can be detected with the "mpstat" tool; memory usage can be detected by checking the "used" and "free" fields of "free -m"; the load value can be detected by checking the 1-minute load field of the "/proc/loadavg" file; I/O wait time can be detected with the "mpstat" tool; and the CPU usage of each process can be detected with the "top" tool.
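As an illustration of the load value check, the sketch below parses the text of `/proc/loadavg`, whose first whitespace-separated field on Linux is the 1-minute load average. The function names and the rule of comparing the load against the CPU count times a per-CPU threshold are assumptions made for this example; the source does not state which threshold is used.

```python
def one_minute_load(loadavg_text: str) -> float:
    """Extract the 1-minute load average from the text of /proc/loadavg."""
    return float(loadavg_text.split()[0])


def load_is_abnormal(loadavg_text: str, cpu_count: int,
                     per_cpu_threshold: float = 1.0) -> bool:
    """Hypothetical rule: abnormal when the load exceeds the CPU capacity."""
    return one_minute_load(loadavg_text) > cpu_count * per_cpu_threshold
```

On a Linux device the text could be obtained with `pathlib.Path("/proc/loadavg").read_text()` and the CPU count with `os.cpu_count()`.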
Step 102: if the target key indicator is abnormal, detect through the tool set script whether a fault currently exists; otherwise adjust the monitoring sleep duration of the target key indicator based on the first preset duration.
In implementation, when the device monitors the target key indicator using a basic tool in the tool set script, it can judge by threshold comparison, based on empirical data used in routine analysis, whether the monitored indicator is abnormal, and thus whether subsequent processing needs to be triggered; the specific processing is shown in Fig. 2. If the target key indicator is found to be abnormal, the device can further detect through the tool set script whether a fault currently exists. Specifically, the tool set script is preset with data collection tools to use when each key indicator is abnormal. The device first uses these data collection tools to collect the equipment operating parameters related to the abnormal target key indicator, and then uses those parameters to confirm whether a fault currently exists. If the target key indicator is not abnormal, its monitoring sleep duration can be adjusted based on the first preset duration.
Optionally, if a key indicator is detected as normal several times in a row, its monitoring sleep duration can be extended appropriately. Accordingly, part of the processing of step 102 may be as follows: otherwise count the consecutive normal results of the target key indicator, and adjust the monitoring sleep duration of the target key indicator to the product of the consecutive normal count and the first preset duration.
In implementation, after monitoring the target key indicator, if the device finds that it is not abnormal, the device can count the consecutive normal results and then adjust the monitoring sleep duration to the product of that count and the first preset duration. For example, assume the first preset duration is 1 min. If the target key indicator was abnormal at the previous monitoring and is normal at this one, the consecutive normal count is 1 and the monitoring sleep duration is adjusted to 1 × 1 min. If the indicator was normal at each of the previous N monitorings and is normal again this time, the consecutive normal count is N+1 and the monitoring sleep duration is adjusted to (N+1) × 1 min. In addition, a maximum monitoring sleep duration can be set, so that whatever the consecutive normal count, the monitoring sleep duration never exceeds that maximum. This avoids the situation where the indicator has been normal for a long time, the count is large, the sleep duration becomes excessive, and an abnormality cannot be noticed in time.
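The capped back-off described here reduces to a single expression. In this minimal sketch the function name `sleep_after_normal` and the 10-minute maximum used in the usage note are hypothetical; the 1-minute first preset duration follows the example in the text.

```python
def sleep_after_normal(consecutive_normal: int, first_preset: float,
                       max_sleep: float) -> float:
    """Sleep grows linearly with the normal streak but never exceeds the cap."""
    return min(consecutive_normal * first_preset, max_sleep)
```

With a first preset duration of 60 seconds and an assumed cap of 600 seconds, a streak of 5 normal results gives 300 seconds, while a streak of 50 is held at 600 seconds instead of growing to 3000.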
Step 103: if a fault currently exists, adjust the monitoring sleep duration of the target key indicator based on the second preset duration, and determine and report the fault information of the current fault through the tool set script.
In implementation, if step 102 confirms from the equipment operating parameters that a fault currently exists, the device can adjust the monitoring sleep duration of the target key indicator based on the second preset duration. At the same time, the device can determine and report the fault information of its current fault through the tool set script. Here, technicians can anticipate the various faults the device may develop and record the characteristics of the equipment operating parameters when each fault occurs; the parameter characteristics and the corresponding fault information can then be written into the source code of the tool set script. After collecting equipment operating parameters, the device can thus determine the fault information that corresponds to them.
Optionally, if the same fault is detected repeatedly within a short time, the corresponding fault information can be reported intermittently. Accordingly, part of the processing of step 103 may be as follows: determine the fault information of the current fault through the tool set script, and increment the short-term repetition count of the current fault by one; when the short-term repetition count equals the fault report threshold corresponding to the current fault, report the fault information of the current fault, and increase the fault report threshold of the current fault according to a preset rule.
In implementation, the device determines the fault information of the current fault through the tool set script and increments the fault's short-term repetition count by one; the short-term repetition count reflects how many times the device has repeatedly detected the fault within a short time. The device then compares the incremented count with the fault report threshold corresponding to the current fault. If they are equal, the device reports the fault information of the current fault and increases the fault report threshold according to the preset rule; otherwise no fault information is reported. For example, assume the preset rule is that the fault report threshold grows exponentially with base 3, so that the thresholds are successively 1, 3, 9, 27, and so on. Combined with the short-term repetition count, this means that the fault is reported the first time its fault information is determined, not reported the second time, reported the third time, not reported from the fourth through the eighth time, reported again the ninth time, and so on.
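The intermittent reporting rule can be sketched as follows, using the base-3 growth from the example above. The class `FaultRecord` and the function `should_report` are hypothetical names; the only behavior taken from the text is that a report is sent when the short-term repetition count equals the threshold, after which the threshold is increased.

```python
from dataclasses import dataclass


@dataclass
class FaultRecord:
    """Counters kept for one fault (illustrative names)."""
    repeat_count: int = 0
    report_threshold: int = 1


def should_report(record: FaultRecord, factor: int = 3) -> bool:
    """Count one more detection and decide whether to report it."""
    record.repeat_count += 1
    if record.repeat_count == record.report_threshold:
        # Raise the threshold so the next report comes after a longer gap.
        record.report_threshold *= factor
        return True
    return False
```

Calling `should_report` on the same record for successive detections yields a report on detections 1, 3, 9, and 27, matching the sequence in the text.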
Optionally, the processing of determining the fault information and updating the short-term repetition count may specifically be as follows: select, through the tool set script, the fault cause of the current fault from the preset fault causes corresponding to the target key indicator, and determine the fault features of that fault cause; if the fault cause is already recorded locally, and the similarity between the locally recorded fault features of that fault cause and the fault features determined this time is greater than a preset threshold, increment the short-term repetition count of the locally recorded fault cause by one; otherwise record the fault cause and fault features determined this time, and set the short-term repetition count of the fault cause to one.
In implementation, while determining the fault information of the current fault, the device can select, through the tool set script, the fault cause of the current fault from the preset fault causes corresponding to the target key indicator, and determine the fault features of that fault cause. Taking CPU usage, load value, per-process CPU usage, and I/O wait time as example key indicators, the specific preset fault causes and the way their fault features are determined are listed in Table 1 below. The device can then judge whether the same fault cause has already been recorded locally. If it has, the device can determine the similarity between the locally recorded fault features of that fault cause and the fault features determined this time; for example, if there are 4 fault features and only one of them matches, the similarity is 1/4. If the similarity is greater than the preset threshold, the device can increment the short-term repetition count of the locally recorded fault cause by one. If the fault cause is not recorded locally, or the similarity is less than the preset threshold, the device can record the fault cause and fault features determined this time and set the short-term repetition count of the fault cause to one. Note that a locally recorded fault cause has a certain storage duration; once that duration is exceeded, the device automatically deletes the corresponding fault cause and fault features.
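The matching step can be sketched as below. Comparing features position by position is an assumption that happens to reproduce the 1/4 example in the text; the source does not say how similarity is computed. The names `similarity` and `update_record`, the dictionary layout, and the 0.5 threshold are hypothetical, and the expiry of stored records is left out for brevity.

```python
from typing import Dict, Sequence


def similarity(recorded: Sequence[str], current: Sequence[str]) -> float:
    """Fraction of feature positions whose values match (assumed measure)."""
    if not recorded or len(recorded) != len(current):
        return 0.0
    matches = sum(1 for a, b in zip(recorded, current) if a == b)
    return matches / len(recorded)


def update_record(records: Dict[str, dict], cause: str,
                  features: Sequence[str], threshold: float = 0.5) -> int:
    """Increment or reset the repetition count for `cause` and return it."""
    entry = records.get(cause)
    if entry is not None and similarity(entry["features"], features) > threshold:
        # Same cause with sufficiently similar features: a repeat detection.
        entry["repeat_count"] += 1
    else:
        # New cause, or the features changed too much: start a fresh record.
        records[cause] = {"features": list(features), "repeat_count": 1}
    return records[cause]["repeat_count"]
```

The returned count is the short-term repetition count that would then be compared with the fault report threshold.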
Table 1
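The record-or-increment rule described above can be illustrated with a minimal Python sketch. It is not the patented implementation: the names `FaultRecord`, `similarity`, and `update_record`, the default threshold of 0.5, and the one-hour storage duration are all assumptions made for illustration. Similarity is taken as the fraction of matching fault features, in line with the 1/4 example in the text.

```python
import time
from dataclasses import dataclass, field


@dataclass
class FaultRecord:
    """One locally recorded fault cause (field names are illustrative)."""
    features: tuple            # fault features captured when the cause was recorded
    repeat_count: int = 1      # short-term repetition count
    recorded_at: float = field(default_factory=time.monotonic)


def similarity(old, new):
    """Fraction of positions whose fault features match, e.g. 1 of 4 gives 0.25."""
    if not old or len(old) != len(new):
        return 0.0
    return sum(a == b for a, b in zip(old, new)) / len(old)


def update_record(records, cause, features, threshold=0.5, ttl=3600.0, now=None):
    """Increment the count of a similar recorded cause, otherwise (re)record it.

    `records` maps a fault cause to its FaultRecord. `threshold` and `ttl`
    (the storage duration, in seconds) are assumed values, not from the source.
    """
    now = time.monotonic() if now is None else now
    # Drop records whose storage duration has elapsed.
    for key in [k for k, r in records.items() if now - r.recorded_at > ttl]:
        del records[key]
    rec = records.get(cause)
    if rec is not None and similarity(rec.features, features) > threshold:
        rec.repeat_count += 1
    else:
        rec = FaultRecord(tuple(features), 1, now)
        records[cause] = rec
    return rec.repeat_count
```

A caller would keep one `records` dictionary per device and pass in the cause and features produced by the toolkit script at each detection.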
Optionally, the fault causes are recorded in the form of a linked list, where the linked list includes multiple nodes, each node corresponds to one key index, each key index corresponds to one or more sub-lists, each sub-list includes multiple list heads for recording fault causes, and each list head corresponds to multiple child nodes, which respectively store the fault features, the short-term repetition count, and the fault reporting threshold of the fault cause.
In implementation, a linked list is a data structure that is convenient to traverse and easy to extend (sub-lists can be nested in a linked list without limit), and it has good data-type compatibility, since it can store data of any type; the fault causes can therefore be recorded in the form of a linked list. Again taking CPU usage, load value, per-process CPU usage, and I/O wait time as the key indexes, the linked list is as shown in Fig. 3. The trunk of the linked list consists of four nodes, namely CPU, LOAD, PROCESS, and IO, and each key index corresponds to one or more sub-lists. The sub-list of the CPU branch includes as many list heads as there are preset fault causes for abnormal CPU usage. The LOAD branch is divided into three sub-lists, for disks (SDA, SDB, ...), processes (PROCESS_A, PROCESS_B, ...), and CPUs (CPU0, CPU1, ...): the LOAD-disk sub-list includes as many list heads as the device has disks, the LOAD-process sub-list includes N list heads, and the LOAD-CPU sub-list includes as many list heads as the device has logical CPUs. The sub-list of the PROCESS branch includes N list heads, and the sub-list of the IO branch includes as many list heads as the device has disks. Each list head may correspond to multiple child nodes that respectively store the fault features, the short-term repetition count, and the fault reporting threshold of the fault cause.
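One possible way to lay out this nested structure is sketched below in Python. The class and function names (`CauseHead`, `IndexNode`, `build_sub_list`, `build_trunk`, `walk`) are hypothetical, and the three child nodes of each list head are simplified into three fields on the head itself; the trunk order CPU, LOAD, PROCESS, IO and the sub-list breakdown follow the description of Fig. 3.

```python
from dataclasses import dataclass, field
from typing import Optional


@dataclass
class CauseHead:
    """List head for one fault cause; the three fields stand in for its child nodes."""
    name: str
    fault_features: tuple = ()
    repeat_count: int = 0          # short-term repetition count
    report_threshold: int = 1      # fault reporting threshold
    next: Optional["CauseHead"] = None


@dataclass
class IndexNode:
    """Trunk node for one key index, holding one or more named sub-lists."""
    name: str
    sub_lists: dict = field(default_factory=dict)   # sub-list name -> first CauseHead
    next: Optional["IndexNode"] = None


def build_sub_list(names):
    """Link one CauseHead per name and return the first head (None if empty)."""
    first = None
    for name in reversed(list(names)):
        first = CauseHead(name, next=first)
    return first


def walk(node):
    """Yield each element of a singly linked list starting at `node`."""
    while node is not None:
        yield node
        node = node.next


def build_trunk(disks, cpus, processes, cpu_causes):
    """Build the CPU -> LOAD -> PROCESS -> IO trunk with its sub-lists."""
    io = IndexNode("IO", {"disk": build_sub_list(disks)})
    process = IndexNode("PROCESS", {"process": build_sub_list(processes)}, next=io)
    load = IndexNode("LOAD", {
        "disk": build_sub_list(disks),
        "process": build_sub_list(processes),
        "cpu": build_sub_list(cpus),
    }, next=process)
    return IndexNode("CPU", {"cause": build_sub_list(cpu_causes)}, next=load)
```

Traversal then amounts to walking the trunk, picking a sub-list by name, and walking its list heads, which is the property the text cites as the reason for choosing a linked list.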
Step 104: if no fault currently exists, adjust the monitoring sleep duration of the target key index based on a third preset duration.
Here, the second preset duration is greater than the first preset duration, and the first preset duration is greater than the third preset duration.
In implementation, if the device confirms in step 102, from the device operating parameters, that no fault currently exists, the device may adjust the monitoring sleep duration of the target key index based on the third preset duration. Note that the second preset duration is greater than the first preset duration, and the first preset duration is greater than the third preset duration. The reasons are as follows.
First, a device fault generally persists for some time before it is repaired, and the corresponding key index remains abnormal throughout. Therefore, after the target key index has been detected as abnormal and the fault information of the current fault has been determined, the target key index can be monitored again only after a relatively long interval, to avoid detecting the same fault over and over; the monitoring sleep duration is thus adjusted based on the longer second preset duration.
Second, a running device has a relatively low probability of failing and is in a normal state most of the time, so the key indexes do not need to be monitored frequently; at the same time, so that a fault can be detected promptly after it occurs, the monitoring interval should not be too long either. Therefore, if the target key index is found to be normal, the first preset duration, of moderate length, is used to adjust the monitoring sleep duration.
Third, monitoring a key index mainly serves as early fault warning. When the target key index is found to be abnormal, the device has very likely already failed. If further detection does not find a fault, it is quite possible that the fault is at an early stage and some device operating parameters have not yet been affected, although other causes such as a temporary fluctuation of the target key index are also possible. Because the device state cannot be determined in this case, the target key index needs to be monitored again within a short time; that is, the shorter third preset duration is used to adjust the monitoring sleep duration of the target key index.
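The three-way choice of sleep duration can be summarized in a short Python sketch. The names `Outcome` and `next_sleep` are hypothetical; the 10-second third duration matches the example used later in the text, while the 60-second and 300-second values for the first and second durations are assumed purely for illustration.

```python
from enum import Enum


class Outcome(Enum):
    NORMAL = "normal"                # target key index is normal
    ABNORMAL_FAULT = "fault"         # index abnormal and a fault was confirmed
    ABNORMAL_NO_FAULT = "no_fault"   # index abnormal but no fault was found


def next_sleep(outcome, first=60.0, second=300.0, third=10.0):
    """Return the monitoring sleep duration (seconds) for one monitoring result.

    The default durations are assumed example values; only the ordering
    second > first > third comes from the source.
    """
    if not (second > first > third):
        raise ValueError("expected second > first > third")
    return {
        Outcome.NORMAL: first,
        Outcome.ABNORMAL_FAULT: second,
        Outcome.ABNORMAL_NO_FAULT: third,
    }[outcome]
```

Keeping the ordering check inside the function makes a misconfigured set of durations fail loudly rather than silently inverting the intended behavior.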
Optionally, if a key index is repeatedly detected as abnormal and no fault is found in the further detection each time, the monitoring sleep duration of that key index may be extended appropriately. Accordingly, step 104 may be processed as follows: if no fault currently exists, count the number of consecutive fault-free detections since the target key index was continuously monitored as abnormal, and adjust the monitoring sleep duration of the target key index to the product of the consecutive fault-free count and the third preset duration.
In implementation, if the target key index is found to be abnormal but no fault is found in the further detection, the device may count the number of consecutive fault-free detections since the target key index was continuously monitored as abnormal, and then adjust the monitoring sleep duration of the target key index to the product of that consecutive fault-free count and the third preset duration. For example, assume the third preset duration is 10 s. If in the previous monitoring round the target key index was normal, or it was abnormal and a device fault was confirmed in the further detection, and in the current round the target key index is abnormal but no fault is found, then the consecutive fault-free count is 1 and the monitoring sleep duration is adjusted to 1 × 10 s. If in each of the previous N rounds the target key index was abnormal and no fault was found in the further detection, and in the current round the target key index is again abnormal and again no fault is found, then the consecutive fault-free count is N + 1 and the monitoring sleep duration is adjusted to (N + 1) × 10 s. In addition, a maximum value may be set for the monitoring sleep duration of the target key index, so that whatever the consecutive fault-free count is, the monitoring sleep duration does not exceed that maximum.
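The consecutive fault-free rule, including the optional cap, can be written as a small Python function. This is a sketch under stated assumptions: the name `adjust_sleep` and the 120-second maximum are hypothetical, and the function returns `None` for the sleep duration in the cases that the text handles with the first or second preset duration.

```python
def adjust_sleep(no_fault_streak, abnormal, fault_found, third=10.0, max_sleep=120.0):
    """Return (new_streak, sleep_seconds) after one monitoring round.

    `no_fault_streak` is the consecutive fault-free count before this round.
    `max_sleep` is an assumed cap; the source only says a maximum may be set.
    """
    if abnormal and not fault_found:
        streak = no_fault_streak + 1
        return streak, min(streak * third, max_sleep)
    # Index normal, or fault confirmed: the streak resets and the sleep
    # duration is decided by the first or second preset duration instead.
    return 0, None
```

With the third preset duration at 10 s, a first abnormal-but-fault-free round yields 10 s, and a round following three such rounds yields 40 s, matching the (N + 1) × 10 s example in the text.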
In the embodiments of the present invention, at intervals of the monitoring sleep duration of a target key index, the target key index is monitored by a basic tool included in the toolkit script; if the target key index is abnormal, the toolkit script detects whether a fault currently exists, and otherwise the monitoring sleep duration of the target key index is adjusted based on the first preset duration; if a fault currently exists, the monitoring sleep duration of the target key index is adjusted based on the second preset duration, and the toolkit script determines and reports the fault information of the current fault; if no fault currently exists, the monitoring sleep duration of the target key index is adjusted based on the third preset duration, where the second preset duration is greater than the first preset duration, and the first preset duration is greater than the third preset duration. In this way, when the toolkit script is used, different key indexes are given different monitoring sleep durations, the monitoring of the individual key indexes does not interfere with one another, and monitoring sleep durations of different lengths are set and adjusted according to the monitoring results. This avoids both overly frequent monitoring of key indexes and frequent repeated reporting of the same fault, while still allowing device faults to be discovered reasonably promptly.
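To illustrate what it means for each key index to run on its own monitoring sleep duration without affecting the others, the following Python sketch simulates the resulting check times with a priority queue. The function name `simulate` and the index names are hypothetical, and the sleep durations are held fixed here for simplicity, whereas the method described above adjusts them after every round.

```python
import heapq


def simulate(sleeps, horizon):
    """Return the (time, index) checks up to `horizon` seconds.

    `sleeps` maps each key index to its own monitoring sleep duration, so one
    index being checked often does not change when the others are checked.
    """
    queue = [(0.0, name) for name in sorted(sleeps)]
    heapq.heapify(queue)
    checks = []
    while queue and queue[0][0] <= horizon:
        t, name = heapq.heappop(queue)
        checks.append((t, name))
        # Schedule the next check of this index only.
        heapq.heappush(queue, (t + sleeps[name], name))
    return checks
```

For example, `simulate({"cpu": 10.0, "io": 25.0}, 30.0)` checks the CPU index at 0, 10, 20, and 30 seconds and the I/O index at 0 and 25 seconds.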
Based on the same technical concept, an embodiment of the present invention further provides an apparatus for monitoring equipment faults. As shown in Fig. 4, the apparatus includes:
a monitoring module 401, configured to monitor a target key index, at intervals of the monitoring sleep duration of the target key index, by a basic tool included in a toolkit script;
an adjustment module 402, configured to: if the target key index is abnormal, detect by the toolkit script whether a fault currently exists, and otherwise adjust the monitoring sleep duration of the target key index based on a first preset duration; if a fault currently exists, adjust the monitoring sleep duration of the target key index based on a second preset duration, and determine and report the fault information of the current fault by the toolkit script; and if no fault currently exists, adjust the monitoring sleep duration of the target key index based on a third preset duration;
where the second preset duration is greater than the first preset duration, and the first preset duration is greater than the third preset duration.
Optionally, the adjustment module 402 is specifically configured to:
otherwise, count the number of consecutive times the target key index has been normal, and adjust the monitoring sleep duration of the target key index to the product of the consecutive normal count and the first preset duration.
Optionally, the adjustment module 402 is specifically configured to:
if no fault currently exists, count the number of consecutive fault-free detections since the target key index was continuously monitored as abnormal, and adjust the monitoring sleep duration of the target key index to the product of the consecutive fault-free count and the third preset duration.
Optionally, the adjustment module 402 is specifically configured to:
determine the fault information of the current fault by the toolkit script, and increment the short-term repetition count of the current fault by one;
when the short-term repetition count equals the fault reporting threshold corresponding to the current fault, report the fault information of the current fault, and increase the fault reporting threshold of the current fault according to a preset rule.
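The report-then-raise behavior of the fault reporting threshold can be sketched as follows in Python. The source states only that the threshold is increased "according to a preset rule"; the doubling rule used here, the function name `record_occurrence`, and the dictionary keys are assumptions chosen for illustration.

```python
def record_occurrence(state, growth=2):
    """Count one more occurrence of a fault and decide whether to report it.

    `state` holds 'repeat_count' (short-term repetition count) and
    'report_threshold' (fault reporting threshold). Multiplying the threshold
    by `growth` is an assumed preset rule, not one given in the source.
    """
    state["repeat_count"] += 1
    if state["repeat_count"] == state["report_threshold"]:
        state["report_threshold"] *= growth
        return True    # report the fault information now
    return False       # suppress this repeat
```

Under the assumed doubling rule, a persistent fault is reported at its 1st, 2nd, 4th, and 8th occurrences, so repeated reports of the same fault become progressively sparser.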
Optionally, the adjustment module 402 is specifically configured to:
select, by the toolkit script, the fault cause of the current fault from the preset fault causes corresponding to the target key index, and determine the fault features of the fault cause;
if the fault cause has been recorded locally, and the similarity between the locally recorded fault features of the fault cause and the fault features determined this time is greater than a preset threshold, increment the short-term repetition count of the locally recorded fault cause by one; otherwise, record the fault cause and fault features determined this time, and set the short-term repetition count of the fault cause to one.
Optionally, the fault causes are recorded in the form of a linked list, where the linked list includes multiple nodes, each node corresponds to one key index, each key index corresponds to one or more sub-lists, each sub-list includes multiple list heads for recording fault causes, and each list head corresponds to multiple child nodes, which respectively store the fault features, the short-term repetition count, and the fault reporting threshold of the fault cause.
Optionally, the key index includes at least one or more of CPU usage, memory usage, load value, I/O wait time, and per-process CPU usage.
In the embodiments of the present invention, at intervals of the monitoring sleep duration of a target key index, the target key index is monitored by a basic tool included in the toolkit script; if the target key index is abnormal, the toolkit script detects whether a fault currently exists, and otherwise the monitoring sleep duration of the target key index is adjusted based on the first preset duration; if a fault currently exists, the monitoring sleep duration of the target key index is adjusted based on the second preset duration, and the toolkit script determines and reports the fault information of the current fault; if no fault currently exists, the monitoring sleep duration of the target key index is adjusted based on the third preset duration, where the second preset duration is greater than the first preset duration, and the first preset duration is greater than the third preset duration. In this way, when the toolkit script is used, different key indexes are given different monitoring sleep durations, the monitoring of the individual key indexes does not interfere with one another, and monitoring sleep durations of different lengths are set and adjusted according to the monitoring results. This avoids both overly frequent monitoring of key indexes and frequent repeated reporting of the same fault, while still allowing device faults to be discovered reasonably promptly.
It should be noted that, when the apparatus for monitoring equipment faults provided by the above embodiment monitors equipment faults, the division into the functional modules described above is only an example. In practical applications, the above functions may be assigned to different functional modules as needed; that is, the internal structure of the apparatus may be divided into different functional modules to complete all or part of the functions described above. In addition, the apparatus for monitoring equipment faults provided by the above embodiment and the method embodiment for monitoring equipment faults belong to the same concept; for its specific implementation process, see the method embodiment, which is not repeated here.
Fig. 5 is a schematic structural diagram of a device provided by an embodiment of the present invention. The device 500 may vary considerably depending on configuration or performance, and may include one or more central processing units 522 (for example, one or more processors), a memory 532, and one or more storage media 530 (for example, one or more mass storage devices) storing application programs 552 or data 554. The memory 532 and the storage medium 530 may be transient or persistent storage. A program stored in the storage medium 530 may include one or more modules (not shown in the figure), and each module may include a series of instruction operations on the device. Further, the central processing unit 522 may be configured to communicate with the storage medium 530 and to execute, on the device 500, the series of instruction operations in the storage medium 530.
The device 500 may also include one or more power supplies 525, one or more wired or wireless network interfaces 550, one or more input/output interfaces 558, one or more keyboards 555, and/or one or more operating systems 551, such as Windows Server, Mac OS X, Unix, Linux, FreeBSD, and so on.
The device 500 may include a memory and one or more programs, where the one or more programs are stored in the memory and are configured to be executed by one or more processors, and the one or more programs include instructions for performing the above monitoring of equipment faults.
A person of ordinary skill in the art will understand that all or part of the steps of the above embodiments may be implemented by hardware, or by a program instructing the relevant hardware; the program may be stored in a computer-readable storage medium, and the storage medium may be a read-only memory, a magnetic disk, an optical disc, or the like.
The above are merely preferred embodiments of the present invention and are not intended to limit it. Any modification, equivalent replacement, improvement, and the like made within the spirit and principles of the present invention shall fall within its scope of protection.

Claims (16)

1. A method for monitoring equipment faults, characterized in that the method comprises:
at intervals of the monitoring sleep duration of a target key index, monitoring the target key index by a basic tool included in a toolkit script;
if the target key index is abnormal, detecting by the toolkit script whether a fault currently exists; otherwise, adjusting the monitoring sleep duration of the target key index based on a first preset duration;
if a fault currently exists, adjusting the monitoring sleep duration of the target key index based on a second preset duration, and determining and reporting the fault information of the current fault by the toolkit script;
if no fault currently exists, adjusting the monitoring sleep duration of the target key index based on a third preset duration, wherein the second preset duration is greater than the first preset duration, and the first preset duration is greater than the third preset duration.
2. The method according to claim 1, characterized in that the otherwise adjusting the monitoring sleep duration of the target key index based on a first preset duration comprises:
otherwise, counting the number of consecutive times the target key index has been normal, and adjusting the monitoring sleep duration of the target key index to the product of the consecutive normal count and the first preset duration.
3. The method according to claim 1, characterized in that the adjusting, if no fault currently exists, the monitoring sleep duration of the target key index based on a third preset duration comprises:
if no fault currently exists, counting the number of consecutive fault-free detections since the target key index was continuously monitored as abnormal, and adjusting the monitoring sleep duration of the target key index to the product of the consecutive fault-free count and the third preset duration.
4. The method according to claim 1, characterized in that the determining and reporting the fault information of the current fault by the toolkit script comprises:
determining the fault information of the current fault by the toolkit script, and incrementing the short-term repetition count of the current fault by one;
when the short-term repetition count equals the fault reporting threshold corresponding to the current fault, reporting the fault information of the current fault, and increasing the fault reporting threshold of the current fault according to a preset rule.
5. The method according to claim 4, characterized in that the determining the fault information of the current fault by the toolkit script and incrementing the short-term repetition count of the current fault by one comprises:
selecting, by the toolkit script, the fault cause of the current fault from the preset fault causes corresponding to the target key index, and determining the fault features of the fault cause;
if the fault cause has been recorded locally, and the similarity between the locally recorded fault features of the fault cause and the fault features determined this time is greater than a preset threshold, incrementing the short-term repetition count of the locally recorded fault cause by one; otherwise, recording the fault cause and fault features determined this time, and setting the short-term repetition count of the fault cause to one.
6. The method according to claim 5, characterized in that the fault causes are recorded in the form of a linked list, wherein the linked list includes multiple nodes, each node corresponds to one key index, each key index corresponds to one or more sub-lists, each sub-list includes multiple list heads for recording fault causes, and each list head corresponds to multiple child nodes, which respectively store the fault features, the short-term repetition count, and the fault reporting threshold of the fault cause.
7. The method according to any one of claims 1 to 6, characterized in that the key index includes at least one or more of CPU usage, memory usage, load value, I/O wait time, and per-process CPU usage.
8. An apparatus for monitoring equipment faults, characterized in that the apparatus comprises:
a monitoring module, configured to monitor a target key index, at intervals of the monitoring sleep duration of the target key index, by a basic tool included in a toolkit script;
an adjustment module, configured to: if the target key index is abnormal, detect by the toolkit script whether a fault currently exists, and otherwise adjust the monitoring sleep duration of the target key index based on a first preset duration; if a fault currently exists, adjust the monitoring sleep duration of the target key index based on a second preset duration, and determine and report the fault information of the current fault by the toolkit script; and if no fault currently exists, adjust the monitoring sleep duration of the target key index based on a third preset duration;
wherein the second preset duration is greater than the first preset duration, and the first preset duration is greater than the third preset duration.
9. The apparatus according to claim 8, characterized in that the adjustment module is specifically configured to:
otherwise, count the number of consecutive times the target key index has been normal, and adjust the monitoring sleep duration of the target key index to the product of the consecutive normal count and the first preset duration.
10. The apparatus according to claim 8, characterized in that the adjustment module is specifically configured to:
if no fault currently exists, count the number of consecutive fault-free detections since the target key index was continuously monitored as abnormal, and adjust the monitoring sleep duration of the target key index to the product of the consecutive fault-free count and the third preset duration.
11. The apparatus according to claim 8, characterized in that the adjustment module is specifically configured to:
determine the fault information of the current fault by the toolkit script, and increment the short-term repetition count of the current fault by one;
when the short-term repetition count equals the fault reporting threshold corresponding to the current fault, report the fault information of the current fault, and increase the fault reporting threshold of the current fault according to a preset rule.
12. The apparatus according to claim 11, characterized in that the adjustment module is specifically configured to:
select, by the toolkit script, the fault cause of the current fault from the preset fault causes corresponding to the target key index, and determine the fault features of the fault cause;
if the fault cause has been recorded locally, and the similarity between the locally recorded fault features of the fault cause and the fault features determined this time is greater than a preset threshold, increment the short-term repetition count of the locally recorded fault cause by one; otherwise, record the fault cause and fault features determined this time, and set the short-term repetition count of the fault cause to one.
13. The apparatus according to claim 12, characterized in that the fault causes are recorded in the form of a linked list, wherein the linked list includes multiple nodes, each node corresponds to one key index, each key index corresponds to one or more sub-lists, each sub-list includes multiple list heads for recording fault causes, and each list head corresponds to multiple child nodes, which respectively store the fault features, the short-term repetition count, and the fault reporting threshold of the fault cause.
14. The apparatus according to any one of claims 8 to 13, characterized in that the key index includes at least one or more of CPU usage, memory usage, load value, I/O wait time, and per-process CPU usage.
15. A device, characterized in that the device comprises a processor and a memory, the memory storing at least one instruction, at least one program, a code set, or an instruction set, and the at least one instruction, the at least one program, the code set, or the instruction set is loaded and executed by the processor to implement the method for monitoring equipment faults according to any one of claims 1 to 7.
16. A computer-readable storage medium, characterized in that the storage medium stores at least one instruction, at least one program, a code set, or an instruction set, and the at least one instruction, the at least one program, the code set, or the instruction set is loaded and executed by a processor to implement the method for monitoring equipment faults according to any one of claims 1 to 7.
CN201810866734.6A 2018-08-01 2018-08-01 Method and device for monitoring equipment fault Active CN109165138B (en)

Priority Applications (1)

Application Number Priority Date Filing Date Title
CN201810866734.6A CN109165138B (en) 2018-08-01 2018-08-01 Method and device for monitoring equipment fault

Applications Claiming Priority (1)

Application Number Priority Date Filing Date Title
CN201810866734.6A CN109165138B (en) 2018-08-01 2018-08-01 Method and device for monitoring equipment fault

Publications (2)

Publication Number Publication Date
CN109165138A true CN109165138A (en) 2019-01-08
CN109165138B CN109165138B (en) 2022-06-17

Family

ID=64898638

Family Applications (1)

Application Number Title Priority Date Filing Date
CN201810866734.6A Active CN109165138B (en) 2018-08-01 2018-08-01 Method and device for monitoring equipment fault

Country Status (1)

Country Link
CN (1) CN109165138B (en)

Cited By (8)

* Cited by examiner, † Cited by third party
Publication number Priority date Publication date Assignee Title
CN110048932A (en) * 2019-04-03 2019-07-23 北京奇安信科技有限公司 Validation checking method, apparatus, equipment and the storage medium of mail Monitoring function
CN110995519A (en) * 2020-02-28 2020-04-10 北京信安世纪科技股份有限公司 Load balancing method and device
CN111143134A (en) * 2019-12-30 2020-05-12 深圳Tcl新技术有限公司 Fault processing method, equipment and computer storage medium
CN111464372A (en) * 2019-01-18 2020-07-28 广东天创同工大数据应用有限公司 Method for improving communication refreshing speed
CN112446978A (en) * 2019-08-29 2021-03-05 长鑫存储技术有限公司 Monitoring method and device of semiconductor equipment, storage medium and computer equipment
CN113840122A (en) * 2021-11-25 2021-12-24 南方电网数字电网研究院有限公司 Monitoring shooting control method and device, electronic equipment and storage medium
CN115904917A (en) * 2023-02-22 2023-04-04 湖北泰跃卫星技术发展股份有限公司 Internet of things exception handling method and device, computer equipment and storage medium
CN116795196A (en) * 2023-08-25 2023-09-22 深圳市德航智能技术有限公司 Implementation method for reinforcing ultra-long standby of handheld tablet computer

Citations (6)

* Cited by examiner, † Cited by third party
Publication number Priority date Publication date Assignee Title
US20120290882A1 (en) * 2011-05-10 2012-11-15 Corkum David L Signal processing during fault conditions
CN105549508A (en) * 2015-12-25 2016-05-04 北京奇虎科技有限公司 Alarm method based on information combination and apparatus thereof
CN106357469A (en) * 2016-11-16 2017-01-25 郑州云海信息技术有限公司 Dynamic adjustment method and device of resource monitoring mode
CN106502868A (en) * 2016-11-18 2017-03-15 国云科技股份有限公司 A kind of dynamic adjustment monitoring frequency method suitable for cloud computing
CN106878111A (en) * 2017-03-15 2017-06-20 郑州云海信息技术有限公司 The cloud monitoring system and monitoring method of a kind of High Availabitity
CN107612756A (en) * 2017-10-31 2018-01-19 广西宜州市联森网络科技有限公司 A kind of operation management system with intelligent trouble analyzing and processing function

Cited By (13)

* Cited by examiner, † Cited by third party
Publication number Priority date Publication date Assignee Title
CN111464372B (en) * 2019-01-18 2021-09-24 广东天创同工大数据应用有限公司 Method for improving communication refreshing speed
CN111464372A (en) * 2019-01-18 2020-07-28 广东天创同工大数据应用有限公司 Method for improving communication refreshing speed
CN110048932A (en) * 2019-04-03 2019-07-23 北京奇安信科技有限公司 Validation checking method, apparatus, equipment and the storage medium of mail Monitoring function
CN112446978A (en) * 2019-08-29 2021-03-05 长鑫存储技术有限公司 Monitoring method and device of semiconductor equipment, storage medium and computer equipment
CN111143134B (en) * 2019-12-30 2024-06-04 深圳Tcl新技术有限公司 Fault processing method, device and computer storage medium
CN111143134A (en) * 2019-12-30 2020-05-12 深圳Tcl新技术有限公司 Fault processing method, equipment and computer storage medium
CN110995519B (en) * 2020-02-28 2020-06-26 北京信安世纪科技股份有限公司 Load balancing method and device
CN110995519A (en) * 2020-02-28 2020-04-10 北京信安世纪科技股份有限公司 Load balancing method and device
CN113840122A (en) * 2021-11-25 2021-12-24 南方电网数字电网研究院有限公司 Monitoring shooting control method and device, electronic equipment and storage medium
CN113840122B (en) * 2021-11-25 2022-03-08 南方电网数字电网研究院有限公司 Monitoring shooting control method and device, electronic equipment and storage medium
CN115904917A (en) * 2023-02-22 2023-04-04 湖北泰跃卫星技术发展股份有限公司 Internet of things exception handling method and device, computer equipment and storage medium
CN116795196A (en) * 2023-08-25 2023-09-22 深圳市德航智能技术有限公司 Implementation method for reinforcing ultra-long standby of handheld tablet computer
CN116795196B (en) * 2023-08-25 2023-11-17 深圳市德航智能技术有限公司 Implementation method for reinforcing ultra-long standby of handheld tablet computer

Also Published As

Publication number Publication date
CN109165138B (en) 2022-06-17

Similar Documents

Publication Publication Date Title
CN109165138A (en) A kind of method and apparatus of monitoring equipment fault
CN109783262B (en) Fault data processing method, device, server and computer readable storage medium
Tan et al. Adaptive system anomaly prediction for large-scale hosting infrastructures
US7444263B2 (en) Performance metric collection and automated analysis
US8949671B2 (en) Fault detection, diagnosis, and prevention for complex computing systems
CN114500250B (en) System linkage comprehensive operation and maintenance system and method in cloud mode
CN106789306B (en) Method and system for detecting, collecting and recovering software fault of communication equipment
US20080195369A1 (en) Diagnostic system and method
US20110314138A1 (en) Method and apparatus for cause analysis configuration change
CN110287081A (en) Service monitoring system and method
JPH04230538A (en) Method and apparatus for detecting faulty software component
CN105224888B (en) Disk array data protection system based on security early-warning technology
CN114154035A (en) Data processing system for dynamic loop monitoring
CN114356499A (en) Kubernetes cluster alarm root cause analysis method and device
CN104574219A (en) System and method for monitoring and early warning of operation conditions of power grid service information system
CN116755992B (en) Log analysis method and system based on OpenStack cloud computing
CN116126621A (en) Task monitoring method of big data cluster and related equipment
US8601318B2 (en) Method, apparatus and computer program product for rule-based directed problem resolution for servers with scalable proactive monitoring
CN117194142A (en) Integrated application performance diagnosis system and method based on link tracking
CN113505044A (en) Database warning method, device, equipment and storage medium
CN112905431A (en) Method, device and equipment for automatically positioning system performance problem
CN114118991A (en) Third-party system monitoring system, method, device, equipment and storage medium
CN115408271A (en) One-stop closed loop test method, system, equipment and medium
CN113342596A (en) Distributed monitoring method, system and device for equipment indexes
CN112527594A (en) Hard disk inspection method, device and system

Legal Events

Date Code Title Description
PB01 Publication
SE01 Entry into force of request for substantive examination
GR01 Patent grant