CN112214409B - Operation and maintenance method and device used in test environment - Google Patents
- Publication number
- CN112214409B (application CN202011090055.8A)
- Authority
- CN
- China
- Prior art keywords
- test environment
- index
- environment
- information
- repairing
- Prior art date
- Legal status (The legal status is an assumption and is not a legal conclusion. Google has not performed a legal analysis and makes no representation as to the accuracy of the status listed.)
- Active
Classifications
- G06F11/3664
- G—PHYSICS
- G06—COMPUTING; CALCULATING OR COUNTING
- G06F—ELECTRIC DIGITAL DATA PROCESSING
- G06F11/00—Error detection; Error correction; Monitoring
- G06F11/30—Monitoring
- G06F11/3058—Monitoring arrangements for monitoring environmental properties or parameters of the computing system or of the computing system component, e.g. monitoring of power, currents, temperature, humidity, position, vibrations
Landscapes
- Engineering & Computer Science (AREA)
- Theoretical Computer Science (AREA)
- Computing Systems (AREA)
- Quality & Reliability (AREA)
- Physics & Mathematics (AREA)
- General Engineering & Computer Science (AREA)
- General Physics & Mathematics (AREA)
- Debugging And Monitoring (AREA)
Abstract
The application provides an operation and maintenance method and device for use in a test environment. The method comprises: acquiring node type information of the test environment in real time; judging whether the test environment is normal according to the node type information; and, if the test environment is abnormal, repairing the test environment according to a pre-generated hierarchical self-healing model. For large-scale, multi-type test environments, the application can greatly improve the success rate of self-healing when test environment problems occur and reduce the environment problem analysis steps required of operation and maintenance personnel.
Description
Technical Field
The application relates to the technical field of computers, in particular to operation and maintenance monitoring and self-healing systems, and more specifically to an operation and maintenance method and device for use in a test environment.
Background
In prior-art environment operation and maintenance, a small maintenance staff cannot proactively discover abnormal conditions across a large number of test and production environments. Typically, the business side must first notice a service abnormality and notify operation and maintenance personnel, who then troubleshoot manually, so an abnormal environment cannot be handled in real time. In addition, one operation and maintenance person can handle only one abnormality at a time, so abnormality handling is inefficient and cannot meet the requirement of high environment availability. A suitable automated operation and maintenance method is therefore needed to improve environment availability and reduce the cost of manual maintenance.
In existing environment monitoring systems, the monitoring module must be deployed on each server manually at the stage of collecting monitoring indexes. For different types of test environments, testers must first determine the type of the test environment and then select the corresponding monitoring module to deploy. Such manual operations increase the workload of operation and maintenance personnel, and the process is very time-consuming and labor-intensive in a large-scale test environment. Moreover, exception handling supports only a single custom strategy, which cannot cope with exception handling under complex conditions.
Disclosure of Invention
The operation and maintenance method and device for a test environment provided by the application are aimed at large-scale, multi-type test environments. They can greatly improve the success rate of self-healing when test environment problems occur and reduce the environment problem analysis steps required of operation and maintenance personnel.
To achieve the above object, in a first aspect, the application provides an operation and maintenance method for use in a test environment, including:
acquiring node type information of a test environment in real time;
judging whether the test environment is normal or not according to the node type information;
and if the test environment is abnormal, repairing the test environment according to the pre-generated hierarchical self-healing model.
Preferably, the determining whether the test environment is normal according to the node type information includes:
acquiring index information according to the node type information;
judging whether the test environment is normal or not according to the index information and a first preset threshold value.
Preferably, the index information includes: a central processing unit (CPU) performance index, a memory index, a disk index, an operating system index, a network IO index, a file system IO index, a server availability index, a database performance index, a port process survival index and a mirror image survival index of a server in the test environment.
Preferably, the obtaining the index information according to the node type information includes:
acquiring the index information from an index database according to the node type information; the index database is a time series database.
Preferably, the step of generating the hierarchical self-healing model comprises:
judging whether the environment type in the test environment is normal;
if not, repairing the test environment according to a preset environment type repairing method;
if so, searching the test environment for a process whose load exceeds a second preset threshold;
if such a process exists, repairing the test environment according to a preset load repairing method;
if no such process exists, judging whether a plurality of servers in the test environment have the same problem;
if they do, repairing the test environment according to a preset server repairing method;
if they do not, reading error reporting information in the log information of the test environment;
and repairing the test environment according to the error reporting information.
Preferably, the acquiring node type information of the test environment in real time includes:
and acquiring node type information of the test environment in real time through the configuration management database.
In a second aspect, the present application provides an operation and maintenance device for use in a test environment, the device comprising:
the information acquisition unit is used for acquiring node type information of the test environment in real time;
the environment judging unit is used for judging whether the test environment is normal or not according to the node type information;
and the environment restoration unit is used for restoring the test environment according to the pre-generated hierarchical self-healing model.
Preferably, the environment judgment unit includes:
the index information acquisition module is used for acquiring index information according to the node type information;
the environment judging module is used for judging whether the test environment is normal or not according to the index information and a first preset threshold value;
the index information includes: a central processing unit (CPU) performance index, a memory index, a disk index, an operating system index, a network IO index, a file system IO index, a server availability index, a database performance index, a port process survival index and a mirror image survival index of a server in the test environment;
the index information acquisition module is specifically used for acquiring the index information from an index database according to the node type information; the index database is a time series database.
The operation and maintenance device used in the test environment further comprises: the model generation module is used for generating the hierarchical self-healing model, and the model generation module comprises:
the type judging module is used for judging whether the environment type in the test environment is normal or not;
the type restoration module is used for restoring the test environment according to a preset environment type restoration method;
the process searching module is used for searching whether a process with load exceeding a second preset threshold exists in the test environment;
the process repairing module is used for repairing the test environment according to a preset load repairing method;
the server judging module is used for judging whether the plurality of servers in the test environment have the same problem or not;
the server repair module is used for repairing the test environment according to a preset server repair method;
the error reporting reading module is used for reading error reporting information in the log information of the test environment;
the error reporting and repairing module is used for repairing the test environment according to the error reporting information;
the information acquisition unit is specifically used for acquiring node type information of the test environment in real time through the configuration management database.
In a third aspect, the application provides an electronic device comprising a memory, a processor and a computer program stored on the memory and executable on the processor, wherein the processor implements the steps of the operation and maintenance method used in a test environment when executing the program.
In a fourth aspect, the application provides a computer readable storage medium having stored thereon a computer program which, when executed by a processor, implements the steps of the operation and maintenance method used in a test environment.
As can be seen from the above description, the operation and maintenance method and device for a test environment provided by the embodiments of the application first acquire node type information of the test environment in real time, then judge whether the test environment is normal according to the node type information, and, if the test environment is abnormal, repair the test environment according to the pre-generated hierarchical self-healing model. For large-scale, multi-type test environments, the application can greatly improve the success rate of self-healing when test environment problems occur and reduce the environment problem analysis steps required of operation and maintenance personnel.
Drawings
In order to more clearly illustrate the embodiments of the present application or the technical solutions in the prior art, the drawings that are required in the embodiments or the description of the prior art will be briefly described, and it is obvious that the drawings in the following description are some embodiments of the present application, and other drawings may be obtained according to these drawings without inventive effort for a person skilled in the art.
FIG. 1 is a flowchart of a method for operation and maintenance in a test environment according to an embodiment of the present application;
FIG. 2 is a flowchart of step 200 of the operation and maintenance method for a test environment according to an embodiment of the present application;
FIG. 3 is a flowchart of step 201 of the operation and maintenance method for a test environment according to an embodiment of the present application;
FIG. 4 is a second flowchart of the operation and maintenance method for a test environment according to an embodiment of the present application;
FIG. 5 is a flowchart of step 400 of the operation and maintenance method for a test environment according to an embodiment of the present application;
FIG. 6 is a flowchart of step 100 of the operation and maintenance method for a test environment according to an embodiment of the present application;
FIG. 7 is a schematic diagram of an operation and maintenance system for use in a test environment in an embodiment of the present application;
FIG. 8 is a flow chart of an operation and maintenance method for a test environment in an embodiment of the present application;
FIG. 9 is a flow chart illustrating the concept of an operation and maintenance method for a test environment in an embodiment of the present application;
FIG. 10 is a schematic flow chart of a hierarchical self-healing strategy in a specific application example of the present application;
FIG. 11 is a schematic diagram of a structure of an operation and maintenance device for use in a testing environment according to an embodiment of the present application;
FIG. 12 is a schematic diagram of an environment determining unit according to an embodiment of the present application;
FIG. 13 is a schematic diagram of a second embodiment of an operation and maintenance device for use in a test environment;
FIG. 14 is a schematic diagram of a model generation module according to an embodiment of the present application;
FIG. 15 is a schematic structural diagram of an electronic device in an embodiment of the application.
Detailed Description
For the purpose of making the objects, technical solutions and advantages of the embodiments of the present application more apparent, the technical solutions of the embodiments of the present application will be clearly and completely described below with reference to the accompanying drawings in the embodiments of the present application, and it is apparent that the described embodiments are some embodiments of the present application, but not all embodiments of the present application. All other embodiments, which can be made by those skilled in the art based on the embodiments of the application without making any inventive effort, are intended to be within the scope of the application.
It will be appreciated by those skilled in the art that embodiments of the present application may be provided as a method, system, or computer program product. Accordingly, the present application may take the form of an entirely hardware embodiment, an entirely software embodiment or an embodiment combining software and hardware aspects. Furthermore, the present application may take the form of a computer program product embodied on one or more computer-usable storage media (including, but not limited to, disk storage, CD-ROM, optical storage, and the like) having computer-usable program code embodied therein.
It is noted that the terms "comprises" and "comprising," and any variations thereof, in the description and claims of the present application and in the foregoing figures, are intended to cover a non-exclusive inclusion, such that a process, method, system, article, or apparatus that comprises a list of steps or elements is not necessarily limited to those steps or elements expressly listed or inherent to such process, method, article, or apparatus.
It should be noted that, without conflict, the embodiments of the present application and features of the embodiments may be combined with each other. The application will be described in detail below with reference to the drawings in connection with embodiments.
An embodiment of the present application provides a specific implementation manner of an operation and maintenance method used in a test environment, referring to fig. 1, the method specifically includes the following steps:
Step 100: acquiring node type information of the test environment in real time.
It will be appreciated that a set of test environments includes a plurality of nodes, and the environment nodes are divided into different types, including application servers, Oracle databases, MySQL databases, batch servers and the like. Each environment node contains the node's type information, operating system information, server IP address, server user information, type element information and the like.
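The node record described above can be sketched as a small data structure. This is only an illustration: the class names `NodeType`, `EnvNode` and `Environment`, the field names and the sample values are assumptions introduced here, not names used by the application.

```python
from dataclasses import dataclass, field
from enum import Enum


class NodeType(Enum):
    # Node types named in the text; the enum itself is an illustrative assumption.
    APPLICATION_SERVER = "application_server"
    ORACLE_DATABASE = "oracle_database"
    MYSQL_DATABASE = "mysql_database"
    BATCH_SERVER = "batch_server"


@dataclass
class EnvNode:
    # One environment node with the attributes listed in the description.
    node_type: NodeType
    operating_system: str
    ip_address: str
    server_user: str
    type_elements: dict = field(default_factory=dict)


@dataclass
class Environment:
    # A set of test environments is modelled here as a named list of nodes.
    name: str
    nodes: list = field(default_factory=list)

    def nodes_of_type(self, node_type):
        """Return the nodes of this environment that have the given type."""
        return [node for node in self.nodes if node.node_type is node_type]
```

Grouping nodes by type in this way is what lets later steps pick type-specific indexes and repair methods.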
Step 200: judging whether the test environment is normal or not according to the node type information.
Specifically, the environment indexes in the index database are queried according to the node type information, and whether the environment is normally available is judged according to the set threshold value.
Step 300: if the test environment is abnormal, repairing the test environment according to the pre-generated hierarchical self-healing model.
As noted in the background, test environment problems in the prior art are varied and their causes are complex, and a user-defined script alone cannot handle every specific problem. Step 300 therefore adds a hierarchical self-healing strategy as a supplement, which can complete deep self-healing of the test environment. Specifically, problems found during monitoring are handled by a graded strategy. The strategies support configuration, on/off switch control, recording of the self-healing process, filtering out of environments that are legitimately unavailable, repeated self-healing, verification after self-healing and the like. This solves the problems of monitoring deployment and abnormality handling in large-scale, multi-type environments.
As can be seen from the above description, the operation and maintenance method for a test environment provided by the embodiment of the application first acquires node type information of the test environment in real time, then judges whether the test environment is normal according to the node type information, and, if the test environment is abnormal, repairs the test environment according to the pre-generated hierarchical self-healing model. For large-scale, multi-type test environments, the application can greatly improve the success rate of self-healing when test environment problems occur and reduce the environment problem analysis steps required of operation and maintenance personnel.
In one embodiment, referring to fig. 2, step 200 further comprises:
Step 201: acquiring index information according to the node type information.
The index information in step 201 includes a central processing unit (CPU) performance index, a memory index, a disk index, an operating system index, a network IO index, a file system IO index, a server availability index, a database performance index, a port process survival index and a mirror image survival index of the server in the test environment.
Step 202: judging whether the test environment is normal according to the index information and a first preset threshold value.
Specifically, each item of index information is compared with its corresponding preset threshold value to judge whether the test environment is normal.
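The comparison in step 202 can be sketched as follows. The description does not state the direction of the comparison, so this sketch assumes that a larger value is worse and that an index is abnormal when it exceeds its threshold; the function names and index names such as `cpu_percent` are hypothetical.

```python
def abnormal_indexes(index_info, thresholds):
    """Return the sorted names of indexes whose value exceeds its preset threshold.

    Assumption: higher values are worse. Indexes without a configured
    threshold are not judged.
    """
    return sorted(
        name
        for name, value in index_info.items()
        if name in thresholds and value > thresholds[name]
    )


def environment_is_normal(index_info, thresholds):
    """The environment is judged normal when no index exceeds its threshold."""
    return not abnormal_indexes(index_info, thresholds)
```

Returning the names of the offending indexes, rather than only a boolean, gives the later repair step something to act on.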
In one embodiment, referring to fig. 3, step 201 further comprises:
Step 2011: acquiring the index information from an index database according to the node type information.
Preferably, the environment indexes in the index database are queried according to the environment information registered in the configuration management database. In addition, the index database in step 2011 is a time series database, that is, a database for processing time-stamped data (data that changes in chronological order), also called time series data. Large volumes of time series data have often been stored and processed in relational databases, but because of the characteristics of time series data a relational database cannot store and query it efficiently, so a database system specially optimized for time series data, namely a time series database, is required. By using a special storage mode, a time series solution allows large volumes of time series data to be stored efficiently and processed quickly, and it is an important technology for mass data processing. This storage mode greatly improves the processing capacity for time-related data, reduces the storage space by half relative to a relational database, and greatly improves query speed.
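To make the idea of time-stamped index data concrete, the following toy store keeps each series sorted by timestamp so that range queries reduce to two binary searches. It is an in-memory stand-in written for illustration, not the time series database used by the application; `InMemorySeriesStore` and the `(server, index name)` key layout are assumptions.

```python
import bisect
from collections import defaultdict


class InMemorySeriesStore:
    """Toy stand-in for a time series database: one timestamp-sorted list per series."""

    def __init__(self):
        self._timestamps = defaultdict(list)
        self._values = defaultdict(list)

    def write(self, series_key, timestamp, value):
        # Insert at the position that keeps the series sorted by timestamp.
        timestamps = self._timestamps[series_key]
        pos = bisect.bisect_right(timestamps, timestamp)
        timestamps.insert(pos, timestamp)
        self._values[series_key].insert(pos, value)

    def query_range(self, series_key, start, end):
        """Return (timestamp, value) pairs with start <= timestamp <= end."""
        timestamps = self._timestamps.get(series_key, [])
        values = self._values.get(series_key, [])
        lo = bisect.bisect_left(timestamps, start)
        hi = bisect.bisect_right(timestamps, end)
        return list(zip(timestamps[lo:hi], values[lo:hi]))

    def latest(self, series_key):
        """Return the most recent (timestamp, value) pair, or None if the series is empty."""
        timestamps = self._timestamps.get(series_key)
        if not timestamps:
            return None
        return timestamps[-1], self._values[series_key][-1]
```

A missing series returns `None` from `latest`, which corresponds to the "index data cannot be acquired" case that the abnormality detection module treats separately.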
In one embodiment, referring to fig. 4, the operation and maintenance method used in the test environment further includes:
Step 400: generating the hierarchical self-healing model. Referring to fig. 5, step 400 further includes:
step 401: judging whether the environment type in the test environment is normal;
step 402: if not, repairing the test environment according to a preset environment type repairing method;
step 403: if so, searching the test environment for a process whose load exceeds a second preset threshold;
step 404: if such a process exists, repairing the test environment according to a preset load repairing method;
step 405: if no such process exists, judging whether a plurality of servers in the test environment have the same problem;
step 406: if they do, repairing the test environment according to a preset server repairing method;
step 407: if they do not, reading error reporting information in the log information of the test environment;
step 408: and repairing the test environment according to the error reporting information.
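The branching order of steps 401 to 408 can be sketched as a single selection function. The function name, parameter names and returned labels are hypothetical, and the sketch assumes that the diagnostic results (type check, overloaded processes, servers sharing a problem, log errors) have already been gathered by the caller.

```python
def choose_repair(env_type_ok, overloaded_processes, servers_with_same_problem, log_errors):
    """Pick a repair method following the order of steps 401 to 408.

    Returns a (repair_method, detail) pair. The labels are illustrative only.
    """
    if not env_type_ok:
        # Steps 401 and 402: environment type abnormal.
        return "environment_type_repair", None
    if overloaded_processes:
        # Steps 403 and 404: at least one process exceeds the load threshold.
        return "load_repair", list(overloaded_processes)
    if len(servers_with_same_problem) > 1:
        # Steps 405 and 406: several servers show the same problem.
        return "server_repair", list(servers_with_same_problem)
    # Steps 407 and 408: fall back to the error information found in the logs.
    return "log_error_repair", list(log_errors)
```

Each earlier check short-circuits the later ones, so log analysis is reached only when the cheaper checks find nothing.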
Steps 401 through 408 are described below in more specific examples.
Step 401 and step 402: primary policy (custom policy): the user defines a corresponding repair script for each environment type. When a problem with the environment type is detected, the script is run to perform self-healing; if the self-healing fails, the secondary self-healing policy is executed.
Step 403 and step 404: secondary policy (server level policy): set by the system. First, the load of the corresponding server is detected, and the name and process number of each process exceeding the threshold are returned. If a process is in the white list preset by the system, it is ignored; otherwise the process is killed and the load is then monitored every 15 seconds. If the load remains high, the corresponding environment maintainer is notified to intervene manually. The server disk is also checked, and if disk usage exceeds a threshold, the automatic disk cleaning policy is started. After cleaning is finished, the primary policy is executed again. If the primary policy still fails, the tertiary policy is started.
Step 405 and step 406: tertiary policy (associated environment policy): other servers associated with the server are looked up in the CMDB system, and it is checked whether any associated server has a problem and, if so, whether it can be repaired by the primary and secondary policies. If the associated servers have no problem, or the current environment still has a problem after repair, the fourth-level policy is started.
Step 407 and step 408: fourth-level policy (log analysis policy): the corresponding startup log on the environment is read and the log information is analyzed.
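The escalation between the four policy levels described above can be sketched as a loop that tries each level in order and verifies the environment after each attempt. This is a simplification made for illustration: it does not model the secondary policy re-running the primary policy, and the function name and the `(name, repair)` pair layout are assumptions.

```python
def run_hierarchical_self_healing(env, levels, verify):
    """Try each (name, repair) level in order and stop once verify(env) passes.

    Returns the name of the level after which the environment verified as
    healthy, or None if every level failed and manual handling is needed.
    """
    for name, repair in levels:
        try:
            repair(env)
        except Exception:
            # A failing repair counts as a failed level, so escalate to the next one.
            continue
        if verify(env):
            return name
    return None
```

Passing `verify` in as a callable keeps the post-repair check independent of any particular policy level.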
In one embodiment, referring to fig. 6, step 100 further comprises:
Step 101: acquiring node type information of the test environment in real time through the configuration management database.
It will be appreciated that the configuration management database (Configuration Management Database, CMDB) is a logical database containing information about the full lifecycle of configuration items and the relationships between configuration items (including physical relationships, real-time communication relationships, non-real-time communication relationships and dependencies). The CMDB stores and manages the configuration information of devices in the enterprise IT architecture. It is closely connected with all service support and service delivery processes: it supports the operation of those processes and realizes the value of the configuration information, while relying on the related processes to ensure the accuracy of its data.
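A CMDB of this kind can be pictured as configuration items plus typed relationships between them. The following in-memory sketch is an assumption-laden illustration, not the CMDB used by the application: `SimpleCMDB`, its method names, the item identifiers and the relation labels are all hypothetical.

```python
class SimpleCMDB:
    """Minimal in-memory model of configuration items and their relationships."""

    def __init__(self):
        self._items = {}      # item id -> attribute dict
        self._relations = {}  # item id -> set of (relation kind, other item id)

    def register(self, item_id, **attributes):
        """Register or overwrite a configuration item."""
        self._items[item_id] = dict(attributes)
        self._relations.setdefault(item_id, set())

    def relate(self, item_a, item_b, kind):
        """Record a relationship of the given kind in both directions."""
        self._relations[item_a].add((kind, item_b))
        self._relations[item_b].add((kind, item_a))

    def node_type(self, item_id):
        """Return the registered node type of an item, or None if not set."""
        return self._items[item_id].get("node_type")

    def associated(self, item_id, kind=None):
        """Return sorted ids of items related to item_id, optionally filtered by kind."""
        return sorted(
            other
            for relation, other in self._relations.get(item_id, ())
            if kind is None or relation == kind
        )
```

The `associated` lookup is the sort of query an associated-environment repair policy would need in order to find servers related to a failing one.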
The operation and maintenance method for a test environment provided by the embodiment of the application adopts a hierarchical self-healing strategy that takes into account the various factors affecting the self-healing result. It can greatly improve the self-healing success rate and reduce the environment problem analysis steps. In addition, monitoring deployment is itself treated as a self-healing strategy: the type of monitoring to deploy is selected automatically according to the environment type, so the self-healing function includes the deployment of environment detection, and the data obtained by environment detection serve as the data basis for self-healing. The method reduces manual handling of environment abnormalities and avoids manual deployment and maintenance of the monitoring module.
To further illustrate the present solution, the application provides a specific application example of the operation and maintenance method for a test environment, which includes the following.
The specific application example also provides an operation and maintenance system for use in a test environment. The test environment consists of servers in a server cluster, which run different operating systems and operating system versions. Depending on their functions, servers of different environment types are deployed with different middleware. As shown in fig. 7, the operation and maintenance system includes: a monitoring module deployed on the servers, a time series database for storing check data, an abnormality detection module, a self-healing module, and a CMDB module for storing environment information.
An environment list is registered in the CMDB system. A set of environments includes multiple nodes, and the environment nodes are divided into different types, including application servers, Oracle databases, MySQL databases, batch servers and the like. Each environment node contains the node's type information, operating system information, server IP address, server user information, type element information and the like.
The monitoring module is a Python or Go program compiled in advance into a binary executable file. It is divided into a system index monitoring module and a service index monitoring module. The system index monitoring module is a general-purpose module that collects information such as CPU, memory, disk and IO on the server. The service index monitoring module checks different service indexes according to the environment type, such as an application server availability index, a database performance index, a port process survival index and a mirror image survival index. Deployment of the monitoring module is initiated by the monitoring module deployment strategy in the self-healing module, which deploys different monitoring modules according to the environment type and sets a timed task to initiate checks. After a check obtains the index data, the data and the server information are sent to the index database through an HTTP request.
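A minimal sketch of a system index check and its report is shown below, using only the Python standard library. The payload layout, the function names and the URL in the usage note are assumptions; the description only states that index data and server information are sent to the index database through an HTTP request. The sketch builds the request without sending it.

```python
import json
import os
import shutil
import time
import urllib.request


def collect_system_indexes(path="/"):
    """Collect a few system indexes with the standard library only."""
    usage = shutil.disk_usage(path)
    used_percent = round(100.0 * usage.used / usage.total, 2) if usage.total else 0.0
    indexes = {"cpu_count": os.cpu_count(), "disk_used_percent": used_percent}
    try:
        # One-minute load average; not available on every platform.
        indexes["load_1m"] = os.getloadavg()[0]
    except (AttributeError, OSError):
        pass
    return indexes


def build_report_request(url, server_ip, node_type, indexes, timestamp=None):
    """Build (but do not send) a JSON POST request carrying one check result."""
    payload = {
        "server_ip": server_ip,
        "node_type": node_type,
        "timestamp": time.time() if timestamp is None else timestamp,
        "indexes": indexes,
    }
    return urllib.request.Request(
        url,
        data=json.dumps(payload).encode("utf-8"),
        headers={"Content-Type": "application/json"},
        method="POST",
    )
```

A timed task could call `collect_system_indexes()`, pass the result to `build_report_request("http://index-db.example/api/indexes", ...)` (a placeholder address), and send the request with `urllib.request.urlopen`.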
The index database is a time series database for storing environment state index data. The abnormality detection module comprises detection sub-modules for the different environment types; it queries the environment indexes in the index database according to the environment information registered in the CMDB system and judges whether the environment is normally available according to a set threshold value. If the environment is abnormal, an abnormality self-healing request is initiated. If the index data cannot be acquired, a monitoring module deployment request is initiated. The self-healing module stores the abnormality self-healing strategies set by the system and configured by the user. According to the actual situation, the user can configure whether a given abnormality is ignored, raises an alarm, or triggers self-healing. The strategies support configuration, on/off switch control, recording of the self-healing process, filtering out of environments that are legitimately unavailable, repeated self-healing, verification after self-healing and the like. The self-healing module also contains the monitoring module deployment strategy used to deploy the monitoring module automatically.
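The configurable handling described above (ignore, alarm or self-heal, with a switch and filtering of environments that are expected to be unavailable) might be represented as follows. This is a sketch under several assumptions: the class `AbnormalityPolicy`, its fields, the `(node type, abnormality)` key, the `planned_downtime` set, and the choice to fall back to an alarm when no enabled policy exists are all introduced here for illustration.

```python
from dataclasses import dataclass

HANDLING_MODES = ("ignore", "alarm", "self_heal")


@dataclass
class AbnormalityPolicy:
    handling: str = "alarm"         # one of HANDLING_MODES
    enabled: bool = True            # on/off switch for this policy
    max_attempts: int = 1           # how many self-healing attempts are allowed
    verify_after_heal: bool = True  # re-check the environment after healing

    def __post_init__(self):
        if self.handling not in HANDLING_MODES:
            raise ValueError(f"unknown handling mode: {self.handling}")


def resolve_handling(policies, env_name, node_type, abnormality, planned_downtime=()):
    """Decide how an abnormality is handled for one environment node.

    Assumptions: environments listed in planned_downtime are filtered out,
    and a missing or switched-off policy falls back to raising an alarm.
    """
    if env_name in planned_downtime:
        return "ignore"
    policy = policies.get((node_type, abnormality))
    if policy is None or not policy.enabled:
        return "alarm"
    return policy.handling
```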
Based on the operation and maintenance system used in the test environment, the operation and maintenance method used in the test environment provided by the present embodiment includes the following steps, see fig. 8 and fig. 9:
S1: Environment node type information is entered in the CMDB system.
S2: Abnormality recovery strategies for the different node types are customized in the self-healing module.
S3: The abnormality detection module obtains the current environment state from the environment state index database according to the environment type information in the CMDB system. Depending on whether index information can be acquired and whether it is normal, the module decides whether to initiate an automatic deployment request for the monitoring module or a self-healing strategy request.
Further, if the index data cannot be acquired, the server is considered to be newly registered, and an automatic deployment task request for the monitoring module is initiated. If the index data is acquired and is normal, the server is considered to be operating normally, no processing is needed, and the process ends. If the index data is acquired but is abnormal, the server is considered to be operating abnormally, and a self-healing request is initiated.
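The three-way decision in step S3 can be illustrated with a short sketch. The names `Action` and `decide`, the metric keys, and the "value above threshold means abnormal" rule are assumptions made for illustration; the source only states that a set threshold is used.

```python
from enum import Enum
from typing import Mapping, Optional


class Action(Enum):
    DEPLOY_MONITOR = "deploy_monitor"  # no index data: treat server as newly registered
    NONE = "none"                      # index data present and normal
    SELF_HEAL = "self_heal"            # index data present but abnormal


def decide(metrics: Optional[Mapping[str, float]],
           thresholds: Mapping[str, float]) -> Action:
    """Map the latest index data of one server to the request to initiate."""
    if not metrics:
        return Action.DEPLOY_MONITOR
    for name, limit in thresholds.items():
        value = metrics.get(name)
        # Assumption: an index is abnormal when it exceeds its threshold.
        if value is not None and value > limit:
            return Action.SELF_HEAL
    return Action.NONE
```

For example, `decide(None, {"cpu": 90.0})` yields `Action.DEPLOY_MONITOR`, while `decide({"cpu": 97.0}, {"cpu": 90.0})` yields `Action.SELF_HEAL`.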
S4: the self-healing module executes corresponding actions according to the request initiated by the abnormality detection module.
Specifically, when an automatic deployment task request for the monitoring module is handled, the self-healing module first checks whether deployment of the monitoring module has already been initiated for the server. If not, the corresponding monitoring module is deployed on the inspected server, and abnormality detection is restarted after deployment. If deployment has already been initiated, automatic deployment of the monitoring module on the server is considered to have failed, and the operation and maintenance user is notified to handle it manually.
A self-healing task request is handled in the same way. If the self-healing strategy has not yet been initiated, the self-healing module executes the self-healing strategy of the corresponding type and, after execution finishes, runs abnormality detection again. If the self-healing strategy has already been initiated, the server is considered to still be operating abnormally because executing the strategy did not resolve the problem, and the operation and maintenance user is notified to handle it manually.
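The "try once automatically, then hand over to a person" behaviour of step S4 can be sketched as below. The class name, the request strings and the callback signatures are hypothetical, and the source does not say when the record of an earlier attempt is cleared, so this sketch never clears it.

```python
from typing import Callable, Dict, Set, Tuple


class SelfHealingDispatcher:
    """Run each request type at most once per server, then escalate."""

    def __init__(self,
                 deploy: Callable[[str], None],
                 heal: Callable[[str], None],
                 notify: Callable[[str, str], None]) -> None:
        self._actions: Dict[str, Callable[[str], None]] = {
            "deploy_monitor": deploy,
            "self_heal": heal,
        }
        self._notify = notify
        self._attempted: Set[Tuple[str, str]] = set()

    def handle(self, server: str, request: str) -> str:
        if request not in self._actions:
            raise ValueError(f"unknown request type: {request}")
        key = (server, request)
        if key in self._attempted:
            # The automatic action was already tried and the request came
            # back, so notify the operation and maintenance user instead.
            self._notify(server, request)
            return "manual"
        self._attempted.add(key)
        self._actions[request](server)
        return "executed"
```

After `handle` returns `"executed"`, the caller would rerun abnormality detection; a second identical request for the same server is what triggers the notification.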
Further, referring to fig. 10, the self-healing policy in step S4 specifically includes:
Primary policy (custom policy): the user defines a repair script for each environment type. When a problem of that environment type is detected, the script is run to perform self-healing; if self-healing fails, the secondary policy is executed.
Secondary policy (server-level policy): set by the system. First, the load on the server is checked, and the names and process IDs of processes exceeding a threshold are returned. A process on the whitelist preset by the system is ignored; any other such process is killed, after which the load is monitored every 15 seconds. If the load remains high, the corresponding environment maintainer is notified to intervene manually. The server disk is then checked, and if disk usage exceeds a threshold, the automatic disk cleaning policy is started. After cleaning finishes, the primary policy is executed again. If the primary policy still fails, the tertiary policy is started.
Tertiary policy (associated environment policy): other servers associated with this server are looked up in the CMDB system and checked for problems, and for whether those problems can be repaired by the primary and secondary policies. If the associated servers have no problem, or the current environment still has a problem after they are repaired, the fourth-level policy is started.
Fourth-level policy (log analysis policy): the corresponding startup log on the environment is read and analyzed. (1) If the log shows a program error, the name of the failing program is used to find the person responsible for modifying that program, who is notified to fix it. (2) If the log shows that a connection to an IP address failed, the node information for that IP is looked up, the environment maintainer of that node is found, and the maintainer is notified. (3) A log classification program is started, using a log model obtained through offline learning. The log is matched against previously defined logs, and if a log of the same type exists, the custom environment repair is started. (4) The corresponding environment maintainer is notified.
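The escalation through the four policy levels can be pictured as a loop that tries each level in order and stops as soon as the environment checks out healthy. The sketch below is a simplification under stated assumptions: `run_hierarchy`, the policy callables and the `is_healthy` check are hypothetical names, and the re-execution of the primary policy after disk cleaning is not modelled.

```python
from typing import Callable, Optional, Sequence, Tuple

# A policy is a (level name, repair function) pair; both are illustrative.
Policy = Tuple[str, Callable[[str], None]]


def run_hierarchy(server: str,
                  policies: Sequence[Policy],
                  is_healthy: Callable[[str], bool]) -> Optional[str]:
    """Apply policies in order; return the level that fixed the server.

    Returns None when every level has been tried without success, which is
    the point where a maintainer would have to be notified.
    """
    for name, repair in policies:
        repair(server)
        if is_healthy(server):
            return name
    return None
```

Keeping the levels in an ordered sequence makes it straightforward to switch a level off or insert a new one, which matches the configurable and extensible intent described in the text.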
From the above description it can be seen that the operation and maintenance method for a test environment provided by this specific application example achieves automatic discovery and automatic handling of test environment abnormalities, keeps the test environment highly available, and reduces manual intervention and maintenance by maintenance personnel. It spares operation and maintenance personnel from manually deploying different monitoring modules to different types of test environments and from writing different monitoring programs or scripts for different operating system platforms. The application provides a method that automatically deploys the monitoring module and performs abnormality detection and self-healing. An automatic monitoring module deployment strategy is added to the self-healing module so that different monitoring modules are deployed according to the environment type. For problems found during monitoring, a hierarchical strategy is used to complete deep self-healing of the test environment. The strategies support configuration, switch control, recording of the self-healing process, filtering out of environments that are normally unavailable and of repeated self-healing, and verification after self-healing. The method addresses monitoring deployment and exception handling in large-scale, multi-type environments. Specifically, the application has the following beneficial effects:
1. By monitoring the availability of the test environment, abnormalities are found quickly; self-healing of the test environment then repairs the abnormality automatically or notifies operation and maintenance personnel to handle it, which improves the efficiency of troubleshooting and handling abnormalities and improves environment availability.
2. Operation and maintenance personnel do not need to intervene in the deployment and maintenance of the monitoring module; they only need to maintain the environment type registered in the CMDB. When the self-healing system finds that data cannot be acquired, it automatically deploys the required monitoring module according to the environment type, and the deployment is idempotent, so the monitoring system maintains itself.
3. The monitoring module is written in Python or Go and compiled once into a binary file for each system, which provides cross-platform support and reduces the work of writing different scripts for different systems; the monitoring module is started by a timed task and exits after the data is pushed successfully, so it does not stay resident in memory and has little impact on system performance.
4. A hierarchical self-healing strategy is provided that takes into account the various factors affecting the self-healing result. It can greatly improve the success rate of self-healing, reduce the steps needed to analyze environment problems, and provide an extensible and configurable self-healing method.
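The idempotent deployment mentioned in item 2 can be illustrated with a local-file sketch. It assumes deployment amounts to copying the compiled monitoring binary into a target directory, which the source does not specify; the function name and paths are hypothetical.

```python
import shutil
from pathlib import Path


def deploy_monitor(binary: Path, target_dir: Path) -> bool:
    """Install the monitoring binary only if it is missing or different.

    Returns True when the target was written and False when the identical
    binary was already in place, so repeated calls leave the same end state.
    """
    target = target_dir / binary.name
    if target.exists() and target.read_bytes() == binary.read_bytes():
        return False
    target_dir.mkdir(parents=True, exist_ok=True)
    shutil.copy2(binary, target)
    return True
```

Because a second call with an unchanged binary does nothing, the self-healing module can issue the deployment request again without first checking what is already installed.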
Based on the same inventive concept, the embodiments of the present application also provide an operation and maintenance device for use in a test environment, which can be used to implement the method described in the above embodiments, as described in the following embodiments. Because the device solves the problem on a principle similar to that of the operation and maintenance method, its implementation can refer to the implementation of the method, and repeated details are omitted. As used below, the term "unit" or "module" may be a combination of software and/or hardware that implements the intended function. While the system described in the following embodiments is preferably implemented in software, implementation in hardware, or in a combination of software and hardware, is also possible and contemplated.
An embodiment of the present application provides a specific implementation of an operation and maintenance device for a test environment that can implement the operation and maintenance method for a test environment. Referring to fig. 11, the device specifically includes:
an information acquisition unit 10 for acquiring node type information of the test environment in real time;
an environment judging unit 20 for judging whether the test environment is normal according to the node type information;
and an environment restoration unit 30, configured to restore the test environment according to a pre-generated hierarchical self-healing model.
Preferably, referring to fig. 12, the environment judging unit 20 includes:
an index information obtaining module 201, configured to obtain index information according to the node type information;
an environment judging module 202, configured to judge whether the test environment is normal according to the index information and a first preset threshold;
the index information includes: a central processing unit performance index, a memory index, a disk index, an operating system index, a network IO index, a file system IO index, a server availability index, a database performance index, a port process survival index and a mirror image survival index of a server in the test environment;
the index information obtaining module 201 is specifically configured to obtain index information from an index database according to the node type information; the index database is a time-series database.
Referring to fig. 13, the operation and maintenance device for use in the test environment further includes a model generation module 40 for generating the hierarchical self-healing model. Referring to fig. 14, the model generation module 40 includes:
a type judging module 401, configured to judge whether an environment type in the test environment is normal;
the type repairing module 402 is configured to repair the test environment according to a preset environment type repairing method;
a process searching module 403, configured to search whether a process with a load exceeding a second preset threshold exists in the test environment;
the process repairing module 404 is configured to repair the test environment according to a preset load repairing method;
a server judging module 405, configured to judge whether the plurality of servers in the test environment have the same problem;
the server repair module 406 is configured to repair the test environment according to a preset server repair method;
an error reading module 407, configured to read error information in the log information of the test environment;
an error reporting repair module 408, configured to repair the test environment according to the error reporting information;
the information obtaining unit 10 is specifically configured to obtain node type information of the test environment in real time through the configuration management database.
As can be seen from the above description, the operation and maintenance device for use in a test environment provided by the embodiment of the present application first obtains node type information of the test environment in real time, then judges whether the test environment is normal according to the node type information, and, if the test environment is abnormal, repairs the test environment according to the pre-generated hierarchical self-healing model. For large-scale, multi-type test environments, the application can greatly improve the success rate of self-healing when problems occur in the test environment and reduce the steps operation and maintenance personnel need to analyze environment problems.
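The order in which modules 401 to 408 hand over to one another can be read as a decision cascade. The sketch below encodes that reading; the `EnvState` fields and the returned labels are hypothetical stand-ins for the judging and repairing modules, not names used by the application.

```python
from dataclasses import dataclass, field
from typing import List


@dataclass
class EnvState:
    """Findings about one test environment (illustrative fields only)."""
    env_type_ok: bool
    overloaded_pids: List[int] = field(default_factory=list)
    same_problem_on_servers: bool = False


def select_repair(state: EnvState) -> str:
    """Pick the repair method in the order the modules are listed."""
    if not state.env_type_ok:
        return "environment_type_repair"   # type repairing module 402
    if state.overloaded_pids:
        return "load_repair"               # process repairing module 404
    if state.same_problem_on_servers:
        return "server_repair"             # server repair module 406
    return "log_error_repair"              # error reporting repair module 408
```

Each branch is reached only when the checks before it found nothing, so the log-based repair acts as the last resort.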
The embodiment of the present application further provides a specific implementation manner of an electronic device capable of implementing all the steps in the operation and maintenance method for use in the test environment in the foregoing embodiment, and referring to fig. 15, the electronic device specifically includes the following contents:
a processor 1201, a memory 1202, a communication interface (Communications Interface) 1203, and a bus 1204;
wherein the processor 1201, the memory 1202 and the communication interface 1203 communicate with each other through the bus 1204; the communication interface 1203 is configured to implement information transmission between related devices such as a server device, a power measurement device, and a user device.
The processor 1201 is configured to invoke a computer program in the memory 1202, and when the processor executes the computer program, the processor implements all the steps in the operation and maintenance method for use in the test environment in the above embodiment, for example, when the processor executes the computer program, the processor implements the following steps:
step 100: acquiring node type information of a test environment in real time;
step 200: judging whether the test environment is normal or not according to the node type information;
step 300: and if the test environment is abnormal, repairing the test environment according to the pre-generated hierarchical self-healing model.
The embodiment of the present application also provides a computer-readable storage medium capable of implementing all the steps in the operation and maintenance method for use in a test environment in the above embodiment, on which a computer program is stored, which when executed by a processor implements all the steps in the operation and maintenance method for use in a test environment in the above embodiment, for example, the processor implements the following steps when executing the computer program:
step 100: acquiring node type information of a test environment in real time;
step 200: judging whether the test environment is normal or not according to the node type information;
step 300: and if the test environment is abnormal, repairing the test environment according to the pre-generated hierarchical self-healing model.
In this specification, the embodiments are described in a progressive manner; for identical and similar parts, the embodiments may refer to one another, and each embodiment focuses on its differences from the others. In particular, the hardware-plus-program embodiments are described relatively briefly because they are substantially similar to the method embodiments; for relevant details, see the corresponding description of the method embodiments.
The foregoing describes specific embodiments of the present disclosure. Other embodiments are within the scope of the following claims. In some cases, the actions or steps recited in the claims can be performed in a different order than in the embodiments and still achieve desirable results. In addition, the processes depicted in the accompanying figures do not necessarily require the particular order shown, or sequential order, to achieve desirable results. In some embodiments, multitasking and parallel processing are also possible or may be advantageous.
Although the application provides method operational steps as described in the embodiments or flowcharts, more or fewer operational steps may be included based on conventional or non-inventive means. The order of steps recited in the embodiments is merely one of many possible execution orders and does not represent the only one. When an actual device or client product executes the method, the steps may be executed sequentially or in parallel (e.g., in a parallel processor or multi-threaded processing environment) according to the embodiments or the methods shown in the figures.
The present application is described with reference to flowchart illustrations and/or block diagrams of methods, apparatus (systems) and computer program products according to embodiments of the application. It will be understood that each flow and/or block of the flowchart illustrations and/or block diagrams, and combinations of flows and/or blocks in the flowchart illustrations and/or block diagrams, can be implemented by computer program instructions. These computer program instructions may be provided to a processor of a general purpose computer, special purpose computer, embedded processor, or other programmable data processing apparatus to produce a machine, such that the instructions, which execute via the processor of the computer or other programmable data processing apparatus, create means for implementing the functions specified in the flowchart flow or flows and/or block diagram block or blocks.
These computer program instructions may also be stored in a computer-readable memory that can direct a computer or other programmable data processing apparatus to function in a particular manner, such that the instructions stored in the computer-readable memory produce an article of manufacture including instruction means which implement the function specified in the flowchart flow or flows and/or block diagram block or blocks.
These computer program instructions may also be loaded onto a computer or other programmable data processing apparatus to cause a series of operational steps to be performed on the computer or other programmable apparatus to produce a computer implemented process such that the instructions which execute on the computer or other programmable apparatus provide steps for implementing the functions specified in the flowchart flow or flows and/or block diagram block or blocks.
The principles and embodiments of the present application have been described in detail with reference to specific examples, which are provided to facilitate understanding of the method and core ideas of the present application. Meanwhile, those skilled in the art may, in accordance with the ideas of the present application, make changes to the specific embodiments and the scope of application. In view of the above, the content of this description should not be construed as limiting the present application.
Claims (9)
1. An operation and maintenance method for a test environment, comprising:
acquiring node type information of a test environment in real time;
judging whether the test environment is normal or not according to the node type information;
if not, repairing the test environment according to a pre-generated hierarchical self-healing model;
the step of generating the hierarchical self-healing model comprises the steps of:
judging whether the environment type in the test environment is normal or not;
if not, repairing the test environment according to a preset environment type repairing method;
if so, searching whether a process with load exceeding a second preset threshold exists in the test environment;
if such a process exists, repairing the test environment according to a preset load repairing method;
if no such process exists, judging whether a plurality of servers in the test environment have the same problem;
if so, repairing the test environment according to a preset server repairing method;
if not, reading error reporting information in the log information of the test environment;
and repairing the test environment according to the error reporting information.
2. The operation and maintenance method according to claim 1, wherein the determining whether the test environment is normal according to the node type information includes:
acquiring index information according to the node type information;
judging whether the test environment is normal or not according to the index information and a first preset threshold value.
3. The operation and maintenance method according to claim 2, wherein the index information includes: a central processing unit performance index, a memory index, a disk index, an operating system index, a network IO index, a file system IO index, a server availability index, a database performance index, a port process survival index and a mirror image survival index of a server in the test environment.
4. The operation and maintenance method according to claim 2, wherein the obtaining the index information according to the node type information includes:
acquiring index information from an index database according to the node type information; the index database is a time-series database.
5. The operation and maintenance method according to claim 1, wherein the acquiring node type information of the test environment in real time includes:
and acquiring node type information of the test environment in real time through the configuration management database.
6. An operation and maintenance device for use in a test environment, comprising:
the information acquisition unit is used for acquiring node type information of the test environment in real time;
the environment judging unit is used for judging whether the test environment is normal or not according to the node type information;
the environment restoration unit is used for restoring the test environment according to the pre-generated hierarchical self-healing model;
the model generation module is used for generating the hierarchical self-healing model, and the model generation module comprises:
the type judging module is used for judging whether the environment type in the test environment is normal or not;
the type restoration module is used for restoring the test environment according to a preset environment type restoration method;
the process searching module is used for searching whether a process with load exceeding a second preset threshold exists in the test environment;
the process repairing module is used for repairing the test environment according to a preset load repairing method;
the server judging module is used for judging whether the plurality of servers in the test environment have the same problem or not;
the server repair module is used for repairing the test environment according to a preset server repair method;
the error reporting reading module is used for reading error reporting information in the log information of the test environment;
and the error reporting and repairing module is used for repairing the test environment according to the error reporting information.
7. The operation and maintenance device according to claim 6, wherein the environment judgment unit includes:
the index information acquisition module is used for acquiring index information according to the node type information;
the environment judging module is used for judging whether the test environment is normal or not according to the index information and a first preset threshold value;
the index information includes: a central processing unit performance index, a memory index, a disk index, an operating system index, a network IO index, a file system IO index, a server availability index, a database performance index, a port process survival index and a mirror image survival index of a server in the test environment;
the index information acquisition module is specifically used for acquiring index information from an index database according to the node type information; the index database is a time-series database;
the information acquisition unit is specifically used for acquiring node type information of the test environment in real time through the configuration management database.
8. An electronic device comprising a memory, a processor and a computer program stored on the memory and executable on the processor, characterized in that the processor implements the steps of the method for operation and maintenance in a test environment according to any of claims 1 to 5 when the program is executed by the processor.
9. A computer readable storage medium, on which a computer program is stored, characterized in that the computer program, when being executed by a processor, carries out the steps of the operation and maintenance method for use in a test environment according to any one of claims 1 to 5.
Priority Applications (1)
Application Number | Priority Date | Filing Date | Title |
---|---|---|---|
CN202011090055.8A CN112214409B (en) | 2020-10-13 | 2020-10-13 | Operation and maintenance method and device used in test environment |
Publications (2)
Publication Number | Publication Date |
---|---|
CN112214409A CN112214409A (en) | 2021-01-12 |
CN112214409B true CN112214409B (en) | 2023-11-24 |
Family
ID=74053765
Family Applications (1)
Application Number | Title | Priority Date | Filing Date |
---|---|---|---|
CN202011090055.8A Active CN112214409B (en) | 2020-10-13 | 2020-10-13 | Operation and maintenance method and device used in test environment |
Country Status (1)
Country | Link |
---|---|
CN (1) | CN112214409B (en) |
Citations (3)
Publication number | Priority date | Publication date | Assignee | Title |
---|---|---|---|---|
CN110430071A (en) * | 2019-07-19 | 2019-11-08 | 云南电网有限责任公司信息中心 | Service node fault self-recovery method, apparatus, computer equipment and storage medium |
WO2019223062A1 (en) * | 2018-05-22 | 2019-11-28 | 平安科技(深圳)有限公司 | Method and system for processing system exceptions |
CN111176879A (en) * | 2019-12-31 | 2020-05-19 | 中国建设银行股份有限公司 | Fault repairing method and device for equipment |
Non-Patent Citations (1)
Title |
---|
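Remote test method and engineering application of a power grid security and stability control system; Guo Qi, et al.; Automation of Electric Power Systems; full text *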
Also Published As
Publication number | Publication date |
---|---|
CN112214409A (en) | 2021-01-12 |
Legal Events
Date | Code | Title | Description |
---|---|---|---|
PB01 | Publication | |
SE01 | Entry into force of request for substantive examination | |
GR01 | Patent grant | |