CN108683569B - Service monitoring method and system for cloud service infrastructure - Google Patents
- Publication number
- CN108683569B (application CN201810585690.XA)
- Authority
- CN
- China
- Prior art keywords
- log
- dial testing
- data
- dial
- server
- Prior art date
- Legal status (The legal status is an assumption and is not a legal conclusion. Google has not performed a legal analysis and makes no representation as to the accuracy of the status listed.)
- Active
Classifications
-
- H—ELECTRICITY
- H04—ELECTRIC COMMUNICATION TECHNIQUE
- H04L—TRANSMISSION OF DIGITAL INFORMATION, e.g. TELEGRAPHIC COMMUNICATION
- H04L43/00—Arrangements for monitoring or testing data switching networks
- H04L43/08—Monitoring or testing based on specific metrics, e.g. QoS, energy consumption or environmental parameters
- H04L43/0805—Monitoring or testing based on specific metrics, e.g. QoS, energy consumption or environmental parameters by checking availability
- H04L43/0811—Monitoring or testing based on specific metrics, e.g. QoS, energy consumption or environmental parameters by checking availability by checking connectivity
-
- H—ELECTRICITY
- H04—ELECTRIC COMMUNICATION TECHNIQUE
- H04L—TRANSMISSION OF DIGITAL INFORMATION, e.g. TELEGRAPHIC COMMUNICATION
- H04L41/00—Arrangements for maintenance, administration or management of data switching networks, e.g. of packet switching networks
- H04L41/02—Standardisation; Integration
- H04L41/0246—Exchanging or transporting network management information using the Internet; Embedding network management web servers in network elements; Web-services-based protocols
- H04L41/0273—Exchanging or transporting network management information using the Internet; Embedding network management web servers in network elements; Web-services-based protocols using web services for network management, e.g. simple object access protocol [SOAP]
-
- H—ELECTRICITY
- H04—ELECTRIC COMMUNICATION TECHNIQUE
- H04L—TRANSMISSION OF DIGITAL INFORMATION, e.g. TELEGRAPHIC COMMUNICATION
- H04L41/00—Arrangements for maintenance, administration or management of data switching networks, e.g. of packet switching networks
- H04L41/06—Management of faults, events, alarms or notifications
- H04L41/069—Management of faults, events, alarms or notifications using logs of notifications; Post-processing of notifications
-
- H—ELECTRICITY
- H04—ELECTRIC COMMUNICATION TECHNIQUE
- H04L—TRANSMISSION OF DIGITAL INFORMATION, e.g. TELEGRAPHIC COMMUNICATION
- H04L41/00—Arrangements for maintenance, administration or management of data switching networks, e.g. of packet switching networks
- H04L41/14—Network analysis or design
-
- H—ELECTRICITY
- H04—ELECTRIC COMMUNICATION TECHNIQUE
- H04L—TRANSMISSION OF DIGITAL INFORMATION, e.g. TELEGRAPHIC COMMUNICATION
- H04L43/00—Arrangements for monitoring or testing data switching networks
- H04L43/16—Threshold monitoring
-
- H—ELECTRICITY
- H04—ELECTRIC COMMUNICATION TECHNIQUE
- H04L—TRANSMISSION OF DIGITAL INFORMATION, e.g. TELEGRAPHIC COMMUNICATION
- H04L61/00—Network arrangements, protocols or services for addressing or naming
- H04L61/45—Network directories; Name-to-address mapping
- H04L61/4505—Network directories; Name-to-address mapping using standardised directories; using standardised directory access protocols
- H04L61/4511—Network directories; Name-to-address mapping using standardised directories; using standardised directory access protocols using domain name system [DNS]
-
- H—ELECTRICITY
- H04—ELECTRIC COMMUNICATION TECHNIQUE
- H04L—TRANSMISSION OF DIGITAL INFORMATION, e.g. TELEGRAPHIC COMMUNICATION
- H04L67/00—Network arrangements or protocols for supporting network services or applications
- H04L67/01—Protocols
- H04L67/10—Protocols in which an application is distributed across nodes in the network
- H04L67/1001—Protocols in which an application is distributed across nodes in the network for accessing one among a plurality of replicated servers
Landscapes
- Engineering & Computer Science (AREA)
- Computer Networks & Wireless Communication (AREA)
- Signal Processing (AREA)
- Environmental & Geological Engineering (AREA)
- Data Exchanges In Wide-Area Networks (AREA)
Abstract
The invention provides a service monitoring method and system for cloud service infrastructure, and belongs to the field of cloud service infrastructure. The system comprises a control center server and dial testing servers arranged in various regions. The control center server is provided with a dial testing task issuing module, a data acquisition module, a dial testing data analysis module, a dial testing alarm module and a database, and each dial testing server is provided with a cloud service dial testing module. In the method, dial testing servers are arranged in different regions, monitoring tasks are configured on the WEB interface of the control center server, and the dial testing task issuing module issues the configuration files of the monitoring tasks to each dial testing server, which verifies the correctness of the target IP of each task. A large-range asynchronous dial testing monitoring method is adopted, aimed at routers and at DNS recursive servers providing services, so that the cloud service infrastructure is monitored, the loss of data packets is reduced, load balancing is achieved, and robustness and stability are higher.
Description
Technical Field
The invention belongs to the field of cloud service infrastructure, and relates to a method and a system for testing and verifying network connectivity of cloud service infrastructure.
Background
With the rapid development of cloud computing and the mobile internet, more and more enterprise services are deployed in the cloud. Cloud services have become indispensable to everyday work, which makes the underlying infrastructure all the more important: the robustness of the system and the accuracy of its data can be guaranteed only when that foundation is sound.
Cloud services are a business delivery model that emerged as computing moved into the internet era. They cover several layers of service, including storage, reading and downloading of computing, network and information resources, and information security monitoring and analysis. Because they are secure, stable and offer mass storage, cloud services are increasingly popular with enterprises and individuals. Accordingly, ensuring their stability has become a focus of attention; checking the connectivity of a cloud service across its large environment and verifying its security, stability and data integrity are important tasks.
Cloud service vendors operate large-scale networks with many local points and many servers in different regions. Poor network conditions and poor lines make the cloud service unstable and cause data loss, which is very costly for users and enterprises.
Most current monitoring methods are black-box tests: a problem can be detected only from the final result, but it cannot be located. Different methods suit different application scenarios, and when data goes wrong, accurately locating the problem is the key to resolving a cloud service fault efficiently. Therefore, the network connectivity of a cloud service vendor's infrastructure can be tested by dial testing the routers, the DNS (Domain Name System) servers and the service modules.
The prior art that monitors the operating states of different service systems by actively probing with network messages constructed for specific TCP/IP (Transmission Control Protocol/Internet Protocol) protocols mainly includes:
1) The National Laboratory for Applied Network Research (NLANR) studies distributed, high-performance connections of interest [reference 1: McGregor T, Braun H W, Brown J. The NLANR network analysis infrastructure [J]. IEEE Communications Magazine, 2000, 38(5): 122-]. In AMP, monitors send ICMP (Internet Control Message Protocol) messages to each other every minute and trace the route to the other monitors every ten minutes. Throughput can be measured by bulk TCP data transmission, bulk UDP (User Datagram Protocol) data transmission, ping -F and treno.
2) Ethernet OAM message probing [reference 2: The information of Chinese society of science and society of society and development academic discussion is divided into meeting places 2008]. Continuity Check Messages are used as heartbeat signals to detect connectivity between terminals; Link Trace Messages record the hop-by-hop path between endpoints, similar to the Traceroute tool at the IP layer; Loopback Messages are similar to the Ping function of ICMP and are used to probe connectivity between terminals.
3) The Cisco Service Assurance Agent (SAA) / Cisco IOS IP Service Level Agreements (IP SLAs) feature built into Cisco IOS devices allows active probing and active monitoring and offers a large number of configuration options, such as UDP/TCP port number, ToS field, VRF instance, source IP, destination IP and web URL. The tool can measure the following performance parameters: one-way delay, round-trip delay, delay variation, packet loss rate, packet ordering, voice quality score, network resource availability, application performance and server response time.
None of the above technologies meets the requirements of service testing for cloud service infrastructure. Because a cloud service network is large in scale, wide in range and has many local points, a cloud service vendor needs to perform large-scale dial testing at multiple local points and to run accurate asynchronous tests across different protocols (UDP, TCP and HTTP), different local points and different operators. The dial testing system therefore has to be separated from the data analysis system, so that the cause of a problem can be found accurately from the observed anomaly and staff can be helped to locate problems in the cloud service system.
Disclosure of Invention
To meet these requirements, the invention provides a service monitoring method and system for cloud service infrastructure, which monitor the data of routers and DNS recursive servers in a cloud service environment and implement asynchronous, large-range, time-sharing dial testing suited to the wide range and many local points of a cloud service.
The invention provides a service monitoring system for cloud service infrastructure, which comprises a control center server and dial testing servers arranged in various regions. A cloud service dial testing module is arranged on each dial testing server, and a dial testing task issuing module, a data acquisition module, a dial testing data analysis module, a dial testing alarm module and a database are arranged on the control center server.
A user configures a monitoring task through the dial testing task issuing module, which issues a configuration file to the corresponding dial testing server. The configuration file records the monitoring tasks configured by the user, covering a plurality of source IPs, a plurality of destination IPs and a plurality of protocols. The dial testing task issuing module verifies the target IP in the configuration file and does not issue any task that fails verification.
After receiving the configuration file, the cloud service dial testing module traverses the tasks in it, verifies the correctness of the target IP of each task, constructs data packets that conform to the rules for the tasks that pass verification, and performs dial testing in an asynchronous time-sharing mode. The cloud service dial testing module sends two types of dial testing data: one is a data packet carrying a domain name in a specified format, sent to a DNS recursive server providing services in the cloud service; the other is a set of data packets sent to the destination IP, whose number is set according to the obtained list of routers through which the cloud service traffic passes and the sampling ratio of those routers. After sending data packets to a target IP or to a DNS recursive server providing services, the cloud service dial testing module records a packet sending log S-Log and sends it to the dial testing data analysis module of the control center server.
The data acquisition module obtains data fingerprint information from the packet sending log S-Log and traverses the local point databases. Before each data query it verifies the connection to the local point database: if the connection fails or the query times out, the problem is recorded in the problem log E-Log; if the connection succeeds, the data warehoused in that local point database is queried. After all local point databases have been traversed, a data acquisition file and a data acquisition file log R-Log are generated. The data fingerprint information is a six-tuple (source IP, destination IP, source port, destination port, protocol number, rule ID).
The dial testing data analysis module obtains the data acquisition file log R-Log, the problem log E-Log and the packet sending log S-Log corresponding to a task. It first traverses the problem log E-Log, marks each problematic local point database in the database and records the corresponding problem. It then traverses the packet sending log S-Log and compares it with the log R-Log: if the dial testing data is stream data sent to a target IP, it calculates the average sampling ratio of the routers passed through according to the data fingerprint information and then calculates the warehousing rate of the task; if the dial testing data is data sent to a DNS recursive server, it compares the records by data fingerprint information and calculates the warehousing rate of the task.
The dial testing alarm module presets thresholds for router traffic monitoring and for monitoring of the DNS recursive servers providing services, compares the warehousing rate of each task calculated by the dial testing data analysis module with the corresponding threshold, and raises an alarm for tasks whose warehousing rate is below the threshold.
The invention also provides a service monitoring method for cloud service infrastructure, which comprises the following steps:
Step 1: dial testing servers are set up in different regions and operate in an asynchronous time-sharing dial testing mode, in which dial testing data is sent asynchronously and a set number of data packets is sent at set time intervals;
Step 2: a user configures a monitoring task on the WEB interface of the control center server, and the dial testing task issuing module verifies the correctness of the target IP of the task; if the target IP is correct, a configuration file of the monitoring task is generated. The correctness of the target IP is verified by querying an IP library to check whether the information of the IP (country, province, city and operator) is correct;
Step 3: the dial testing task issuing module issues the configuration file of the monitoring task to each dial testing server; each dial testing server verifies the correctness of the target IP in the configuration file and, if the target IP is wrong, does not dial test it and feeds the error information back to the control center server;
Step 4: the list of routers through which the cloud service traffic passes and the sampling ratio of the routers are obtained, and the number of dial testing data packets that the dial testing server sends to a target IP is set according to the sampling ratio of the routers and the required probability that the flow log is sampled; each time it has sent dial testing data packets, the dial testing server generates a packet sending log S-Log and sends it to the control center server.
If the sampling ratio of the router is 1/X, the probability that the flow log is sampled is required to be greater than G%, and the number of dial testing data packets is Y, then the following relationship holds: when Y data packets are sent, the probability that the flow log is sampled is $P = 1 - (1 - 1/X)^{Y}$, and Y is chosen so that $P > G\%$.
Step 5: the dial testing server obtains the list of DNS recursive servers providing services in the cloud service, sends data packets carrying a domain name in a specified format to the DNS recursive servers, and captures the packets to generate a PCAP file;
Step 6: the dial testing server sends the PCAP file to the man-in-the-middle machine, which verifies the data packets to ensure security; the dial testing server sends out the data packets through the man-in-the-middle machine, generates a packet sending log S-Log and sends it to the control center server;
Step 7: the control center server verifies the connection state of each local point database, and if the connection fails or the query times out, the problem is recorded in the problem log E-Log;
Step 8: the control center server obtains the fingerprint information of the data packets from the received packet sending log, queries data from each local point database and generates a data acquisition file; the fingerprint information is (source IP, destination IP, source port, destination port, protocol number, rule ID);
Step 9: for a given task, the control center server obtains the data acquisition file log R-Log and the problem log E-Log, and looks up the corresponding packet sending log S-Log; by traversing the problem log E-Log, it marks each problematic local point database in the database of the control center and records the existing problems;
Step 10: the control center server traverses the packet sending log S-Log and, for each normal local point database, compares it with the log R-Log; if the dial testing data is stream data sent to a target IP, it calculates the average sampling ratio of the routers passed through according to the data fingerprint information and then calculates the warehousing rate of the task; if the dial testing data is data sent to a DNS recursive server, it compares the records by data fingerprint information and calculates the warehousing rate of the task;
Step 11: the calculated warehousing rate is compared with the corresponding preset warehousing rate threshold, and a prompt is shown on the WEB interface if the calculated warehousing rate is below the threshold.
Compared with the traditional service monitoring technology, the method and the system of the invention have the following advantages and positive effects:
(1) The method and the system provide a data monitoring scheme for routers and DNS recursive servers. They use the sampling behavior of routers and the domain name resolution function of recursive servers to identify, compare and analyze specific data, and apply a black-box test method to fault-test the service system without affecting the existing service system. By calculating the warehousing rate for router traffic monitoring, problems on the cloud service link can be identified; DNS data forwarded by the man-in-the-middle machine is verified, monitored and analyzed, which provides secure DNS monitoring and verifies the resolution function of the DNS servers.
(2) The method and the system adopt asynchronous, large-range, time-sharing monitoring suited to the wide range and many local points of a cloud service, and separate dial testing from the monitoring and analysis platform. This reduces the pressure on servers and the loss of data packets, greatly improves the flow data acquisition rate, achieves load balancing, has little influence on the system under test, and improves the robustness, stability and effectiveness of the service monitoring system.
(3) The system is deployed in a distributed manner, with the cloud service dial testing module separated from the dial testing data analysis module, so accurate asynchronous tests can be run against different local points and against the service local points of different operators. The invention introduces load balancing into the method and processes data in parallel. Because the dial testing module is separated from the statistics module, monitoring is asynchronous and the statistical data is not affected by the dial testing module, which improves the availability, performance and scalability of the system.
Drawings
FIG. 1 is an overall structural diagram of the service monitoring system for cloud service infrastructure of the present invention;
FIG. 2 is a flow chart of a functional implementation of a cloud service dial testing module in the system of the present invention;
FIG. 3 is a flow chart of the functional implementation of the data acquisition module in the system of the present invention;
FIG. 4 is a flow chart of the functional implementation of the dial-up test data analysis module in the system of the present invention.
Detailed Description
The technical solution of the present invention is described below with reference to the accompanying drawings and examples.
The service monitoring method and system for cloud service infrastructure provided by the invention adopt a large-range asynchronous dial testing monitoring method and monitor the cloud service infrastructure through the routers and the DNS recursive servers providing services. They monitor and raise prompts in real time, pinpoint problems down to the infrastructure, and provide effective support for cloud service troubleshooting.
As shown in FIG. 1, the service monitoring system for cloud service infrastructure of the invention comprises dial testing servers and a control center server. The dial testing servers are arranged in various regions, and a cloud service dial testing module is arranged on each of them. A dial testing task issuing module, a data acquisition module, a dial testing data analysis module, a dial testing alarm module and a database are arranged on the control center server.
The control center server is a service cluster, and each module arranged on it can be implemented on its own server. Alternatively, when resources are limited, several modules can be integrated on one server.
The user configures a monitoring task through the dial testing task issuing module, which issues the configuration file to the corresponding dial testing server. In the dial testing task issuing module, the user configures the tasks to be issued through a WEB interface, specifying a plurality of source IPs (in different provinces), destination IPs at multiple local points and dial testing tasks for a plurality of protocols (TCP, UDP and HTTP); the different tasks are then issued to the corresponding dial testing servers for dial testing. The dial testing task issuing module provides an IP verification function: it verifies the target IP and does not issue tasks whose IP fails verification.
The cloud service dial testing module receives the configuration file sent by the dial testing task issuing module and verifies the target IP information. It dial tests in time-shared batches separated by intervals to keep the dial testing stable, sends a large number of data packets to the corresponding target IPs and to the recursive servers providing services, records a packet sending log, and compresses the log and sends it to the dial testing data analysis module. As shown in FIG. 2, the cloud service dial testing module reads the configuration file and traverses the tasks in it. For each task it verifies the correctness of the destination IP address; when verification passes, it configures the rules corresponding to the task, sends the data packets to the destination IP and appends the sending record to the packet sending log S-Log. After all tasks in the configuration file are finished, it returns the packet sending log S-Log to the control center server.
Both the dial testing task issuing module and the cloud service dial testing module verify the correctness of the target IP address by querying an IP library to check whether the information of the target IP (country, province, city and operator) is correct. If the information is correct, verification passes; otherwise it fails.
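The IP-library check can be pictured with a minimal Python sketch. The in-memory `IP_LIBRARY` dictionary, the `IpInfo` record and the function name `verify_target_ip` are hypothetical stand-ins for the real IP library, which the patent does not specify; the addresses are taken from documentation ranges.

```python
from dataclasses import dataclass
from typing import Dict, Optional


@dataclass(frozen=True)
class IpInfo:
    country: str
    province: str
    city: str
    operator: str


# Hypothetical IP library: maps an address to its registered location and operator.
IP_LIBRARY: Dict[str, IpInfo] = {
    "192.0.2.10": IpInfo("CN", "Beijing", "Beijing", "OperatorA"),
}


def verify_target_ip(ip: str, expected: IpInfo,
                     library: Dict[str, IpInfo] = IP_LIBRARY) -> bool:
    """Return True only if the library knows the IP and all four fields match."""
    found: Optional[IpInfo] = library.get(ip)
    return found is not None and found == expected
```

A task whose target IP returns `False` here would simply not be issued (or not be dial tested), mirroring the behavior described for both modules.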
The cloud service dial testing module adopts an asynchronous time-sharing dial testing mode. For router monitoring, it sets the number of data packets to dial test according to the obtained list of routers through which the cloud service traffic passes and the sampling ratio of those routers. For DNS monitoring, it sends data packets carrying a domain name in the specified format to the DNS recursive servers providing services within the cloud service. The batches and intervals at which the module sends data packets can be configured flexibly for the network environment of each region, so as to avoid affecting the normal service of the target server and consuming the bandwidth of the local network.
The data acquisition module obtains data fingerprint information from the packet sending log S-Log and traverses the local point databases. Before each data query it verifies the connection to the local point database: if the connection fails or the query times out, the problem is recorded in the problem log E-Log; if the connection succeeds, the data warehoused in that local point database is queried. When all local point databases have been traversed and the query task is complete, the module generates a data acquisition file and a log R-Log of the data acquisition file, whose file name suffix is ok. The data fingerprint information is a six-tuple (source IP, destination IP, source port, destination port, protocol number, rule ID). The data fingerprint information, also called dyeing information, uniquely identifies a dial testing data message. Because the dyeing method and the marks are composed from information belonging to the test system itself, the approach is highly flexible and practical.
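As an illustration of the six-tuple fingerprint, the sketch below models it as a hashable Python record and parses it out of S-Log lines. The comma-separated line layout and the names `Fingerprint` and `fingerprints_from_s_log` are assumptions made for this example; the patent does not define the S-Log format.

```python
from typing import Iterable, NamedTuple, Set


class Fingerprint(NamedTuple):
    src_ip: str
    dst_ip: str
    src_port: int
    dst_port: int
    protocol: int  # IP protocol number, e.g. 6 = TCP, 17 = UDP
    rule_id: int


def fingerprints_from_s_log(lines: Iterable[str]) -> Set[Fingerprint]:
    """Parse S-Log lines (assumed CSV layout) into a set of unique fingerprints."""
    result: Set[Fingerprint] = set()
    for line in lines:
        line = line.strip()
        if not line:
            continue  # skip blank lines
        src_ip, dst_ip, sport, dport, proto, rule = line.split(",")
        result.add(Fingerprint(src_ip, dst_ip, int(sport), int(dport),
                               int(proto), int(rule)))
    return result
```

Because the tuple is hashable, it can serve directly as the lookup key when the warehoused records of a local point database are matched against what was sent.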
The dial testing data analysis module obtains the data acquisition file log R-Log, the problem log E-Log and the packet sending log S-Log corresponding to a task. As shown in FIG. 4, it first marks the problematic local point databases according to the problem log E-Log; then, for each task, it traverses the packet sending log S-Log and compares the R-Log with the S-Log. When a problematic local point database is encountered, it is marked in the database of the control center server and the corresponding problem is recorded. For a normal local point database, if the dial testing data is stream data sent to a target IP, the module calculates the average sampling ratio of the routers according to the data fingerprint information and then calculates the warehousing rate of each task; if the dial testing data is data sent to a DNS recursive server, it compares the records by data fingerprint information and calculates the warehousing rate of each task. The dial testing data analysis module also statistically analyzes how the warehousing rate changes across time periods and reports the trend. In FIG. 4, after traversal of the packet sending log of a task is finished, the R-Log, E-Log and S-Log that took part in the analysis are moved from the current analysis directory to the backup directory.
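The patent does not give a formula for the warehousing rate. One plausible reading, used in the sketch below purely as an assumption, is the fraction of sent fingerprints (from the S-Log) that are also found among the warehoused fingerprints (from the R-Log). For router stream data the patent additionally factors in the average sampling ratio, which this sketch does not model. The function name `warehousing_rate` is hypothetical.

```python
from typing import Hashable, Iterable


def warehousing_rate(sent: Iterable[Hashable], stored: Iterable[Hashable]) -> float:
    """Fraction of distinct sent fingerprints that also appear in storage.

    Assumed definition: |S-Log ∩ R-Log| / |S-Log|; returns 0.0 if nothing was sent.
    """
    sent_set = set(sent)
    if not sent_set:
        return 0.0
    return len(sent_set & set(stored)) / len(sent_set)
```

Records present in storage but never sent (for example ordinary production traffic) do not affect the result, since only the intersection with the sent set is counted.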
The dial testing alarm module is preset with thresholds for router traffic monitoring and for monitoring of the DNS recursive servers providing services. It compares the warehousing rate of each task calculated by the dial testing data analysis module with the corresponding threshold and raises an alarm for tasks whose warehousing rate is below the threshold.
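The threshold comparison can be sketched in a few lines of Python. The threshold values, the two monitoring kinds `"router"` and `"dns"`, and the function name `tasks_to_alarm` are illustrative assumptions; the patent states only that separate thresholds are preset for the two kinds of monitoring.

```python
from typing import Dict, List

# Hypothetical preset thresholds, one per kind of monitoring.
THRESHOLDS: Dict[str, float] = {"router": 0.95, "dns": 0.99}


def tasks_to_alarm(rates: Dict[str, float], kinds: Dict[str, str],
                   thresholds: Dict[str, float] = THRESHOLDS) -> List[str]:
    """Return the IDs of tasks whose warehousing rate is below their threshold."""
    return sorted(task for task, rate in rates.items()
                  if rate < thresholds[kinds[task]])
```

For example, with rates `{"t1": 0.97, "t2": 0.90, "t3": 0.98}` where `t1` and `t2` are router tasks and `t3` is a DNS task, the tasks flagged are `t2` and `t3`.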
The database is a local database of the control center server, and stores the collected flow log information, wherein the flow log information contains the content in a part of data packets besides the fingerprint information (source IP, destination IP, source port, destination port, protocol number and rule ID).
The service monitoring method for the cloud service infrastructure comprises the following steps 1-11.
Step 1: asynchronous, large-range, time-sharing dial testing is adopted to realize an asynchronous test mode in which dial testing and monitoring analysis are separated; time-sharing dial testing reduces the pressure on the target IP server and the loss of data packets.
In this step, a plurality of dial testing servers are provided: for P different provinces, each province has a dial testing server, which provides load balancing and ensures the robustness and convergence of the algorithm. P is an integer greater than 2.
Each dial testing server adopts an asynchronous time-sharing dial testing mode: dial testing data is sent asynchronously, and a dial testing interval t is inserted after every K data packets, where K is a positive integer. This reduces the pressure on the target IP server and the loss of data packets, and improves the warehousing rate to a great extent. In the embodiment of the invention, the dial testing interval is set to 1 s after every 1000 data packets.
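The batching behavior (K packets, then a pause of t) can be sketched with Python's `asyncio`. This is a minimal sketch under stated assumptions: the `send` coroutine is a hypothetical placeholder for whatever actually transmits a packet, and the defaults mirror the embodiment's K = 1000 and t = 1 s.

```python
import asyncio
from typing import Awaitable, Callable, Iterable


async def dial_test(packets: Iterable[bytes],
                    send: Callable[[bytes], Awaitable[None]],
                    batch_size: int = 1000,
                    interval_s: float = 1.0) -> int:
    """Send packets, pausing interval_s seconds after every batch_size packets."""
    sent = 0
    for packet in packets:
        await send(packet)
        sent += 1
        if sent % batch_size == 0:
            # Time-sharing gap between batches; other coroutines can run meanwhile.
            await asyncio.sleep(interval_s)
    return sent
```

Because the pause is an `await`, several such coroutines (one per target IP, say) could share one event loop without blocking each other, which is one way to realize the asynchronous sending the text describes.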
The asynchronous mode of the method is also reflected in the separation of cloud service dial testing from dial testing data analysis: because the warehousing of dial testing data is delayed (10 min to 30 min), dial testing data analysis and the dial testing servers are not located on the same platform and do not affect each other.
Step 2: a user issues a monitoring task through the WEB interface of the control center server, and the dial testing task issuing module verifies the correctness of the target IP of the monitoring task by querying an IP library to check whether the information of the IP (country, province, city and operator) is correct. If it is incorrect, a prompt is given; if it is correct, a monitoring task configuration file is generated and the method proceeds to step 3.
Step 3: the dial testing task issuing module issues the configuration file of the monitoring task to a dial testing server. The dial testing server also verifies the correctness of the target IP in the configuration file by querying an IP library to check whether the information of the IP (country, province, city and operator) is correct; if it is wrong, the dial testing server does not dial test the target IP and feeds an error prompt back to the control center server.
Step 4: the number of data packets that each dial testing server sends to a target IP is set so that the flow records formed by the data packets can be sampled and captured by the routers on the path to the target IP address. The list of routers through which the cloud service traffic passes and the sampling ratio of those routers are obtained. If a router has a sampling ratio of 1/X, its log is likewise sampled data. For router traffic testing, because the flow log generated by the router is sampled, it must be ensured that the flow log generated by the dial testing messages can still be captured under sampling.
Let the sampling ratio of the router be 1/X, let the required probability of the flow log being sampled be greater than G%, and let the number of dial testing data packets, that is, the number of flows sent, be Y. X and Y are positive integers and G is a positive number less than 100.
When the number of data packets is Y, the probability that the flow log is sampled is $P = 1 - (1 - 1/X)^{Y}$. When Y is 1, this is 1/X; for example, if X is 1000, the capture probability is 1/1000.
To make the probability of the flow log being sampled greater than 99.99%:
under a sampling ratio of 1:1000, Y takes the value 10000, and the probability is 99.995483%;
under a sampling ratio of 1:2000, Y takes the value 20000, and the probability is 99.995471%;
under a sampling ratio of 1:5000, Y takes the value 50000, and the probability is 99.995465%.
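The figures above are consistent with treating each packet as independently sampled with probability 1/X, so that the chance of at least one capture among Y packets is 1 - (1 - 1/X)^Y. The short sketch below recomputes them under that assumption; the function names are illustrative and `min_packets` is an added helper that the source does not describe.

```python
import math


def capture_probability(x: int, y: int) -> float:
    """Probability that at least one of y packets is sampled at ratio 1/x."""
    return 1.0 - (1.0 - 1.0 / x) ** y


def min_packets(x: int, g_percent: float) -> int:
    """Smallest y whose capture probability exceeds g_percent (0 < G < 100)."""
    if not 0.0 < g_percent < 100.0:
        raise ValueError("G must be a positive number less than 100")
    if x == 1:
        return 1  # every packet is sampled
    target = g_percent / 100.0
    y = max(1, math.ceil(math.log(1.0 - target) / math.log(1.0 - 1.0 / x)))
    # Guard against floating-point rounding at the boundary.
    while capture_probability(x, y) <= target:
        y += 1
    return y


for x, y in ((1000, 10000), (2000, 20000), (5000, 50000)):
    print(f"1:{x}  Y={y}  P={capture_probability(x, y) * 100:.6f}%")
```

The values of Y chosen in the embodiment (ten times X) are round numbers that satisfy the 99.99% requirement; `min_packets` shows that somewhat smaller values would also satisfy it.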
In addition, because the router outputs flow logs, the sending interval t must also be considered to keep packet reception stable. For large-range dial testing of the cloud service, 50000 data packets are sent to each target IP. Each time the dial testing server sends dial testing data packets, it generates a packet sending log S-Log and sends it to the control center server.
Step 5: The dial testing server obtains the list of DNS recursive servers that provide services in the cloud service, sends data packets carrying a domain name in a specified format to the recursive DNS servers, and captures the packets to obtain a PCAP file.
The PCAP file format is a common packet storage format, and mainstream packet capturing software, including Wireshark, can generate files in this format.
Step 6: The dial testing server sends the PCAP file to the man-in-the-middle machine, which verifies the data packets to ensure security. The man-in-the-middle machine checks whether the source IP of each data packet is in the white list of the control center and whether the domain name conforms to the specified format. If the source IP is not in the white list or the domain name does not conform, it parses the PCAP file and reassembles the data packet, rewriting the source IP to an IP in the white list and rewriting the domain name to one that conforms to the specified format. The man-in-the-middle machine then sends out both the data packets that passed verification and the reassembled data packets. Each time the dial testing server sends dial testing data packets through the man-in-the-middle machine, it generates a packet sending log S-Log and sends it to the control center server.
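The verify-or-rewrite decision made by the man-in-the-middle machine can be sketched on a simplified record, without parsing real PCAP data. Everything specific here is an assumption for illustration: the `DnsProbe` record, the domain pattern `*.probe.example.com`, and the fallback values are hypothetical, since the source does not give the domain name format or the white list contents.

```python
import re
from dataclasses import dataclass, replace
from typing import FrozenSet

# Hypothetical "specified format" for dial testing domain names.
QNAME_PATTERN = re.compile(r"[a-z0-9-]+\.probe\.example\.com")


@dataclass(frozen=True)
class DnsProbe:
    src_ip: str
    qname: str


def verify_or_rewrite(
    probe: DnsProbe,
    whitelist: FrozenSet[str],
    fallback_ip: str,
    fallback_qname: str,
) -> DnsProbe:
    """Pass a valid probe through unchanged; otherwise rewrite the failing fields."""
    if fallback_ip not in whitelist:
        raise ValueError("fallback_ip must itself be in the white list")
    if QNAME_PATTERN.fullmatch(fallback_qname) is None:
        raise ValueError("fallback_qname must match the specified format")
    src_ok = probe.src_ip in whitelist
    name_ok = QNAME_PATTERN.fullmatch(probe.qname) is not None
    if src_ok and name_ok:
        return probe
    return replace(
        probe,
        src_ip=probe.src_ip if src_ok else fallback_ip,
        qname=probe.qname if name_ok else fallback_qname,
    )
```

In a real deployment the record would be extracted from and written back into the captured packets; that part is omitted here.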
Step 7: Verify the connection state of the local point databases. Database verification is added because the data is stored in the corresponding local point databases at multiple sites. The control center server verifies the connection state of each database and, if the connection fails or a query times out, records the problem in a problem log E-Log.
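A minimal sketch of this check is shown below, assuming each local point database is represented by a probe callable that raises `ConnectionError` or `TimeoutError` on failure. The `ELogEntry` record and the function name are hypothetical; the source does not specify the E-Log format or the database client.

```python
import time
from dataclasses import dataclass
from typing import Callable, Dict, List


@dataclass
class ELogEntry:
    site: str
    problem: str


def verify_site_databases(
    probes: Dict[str, Callable[[], None]],
    timeout_s: float = 5.0,
) -> List[ELogEntry]:
    """Run each site's probe and collect connection or timeout problems."""
    e_log: List[ELogEntry] = []
    for site, probe in probes.items():
        start = time.monotonic()
        try:
            probe()  # e.g. open a connection and run a trivial query
        except TimeoutError:
            e_log.append(ELogEntry(site, "query timeout"))
            continue
        except ConnectionError as exc:
            e_log.append(ELogEntry(site, f"connection failed: {exc}"))
            continue
        # A probe that returns but took too long is also treated as a timeout.
        if time.monotonic() - start > timeout_s:
            e_log.append(ELogEntry(site, "query timeout"))
    return e_log
```

Sites that do not appear in the returned list would be treated as normal local point databases in the later steps.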
Step 8: The control center obtains the fingerprint information of the data from the packet sending log sent by the dial testing server, namely the fingerprint six-tuple (source IP, destination IP, source port, destination port, protocol number and rule ID), and then queries the local point databases at multiple sites according to the fingerprint information to generate a data acquisition ok file.
Step 9: The control center server obtains the data acquisition file log R-Log of a task, looks up the corresponding packet sending log S-Log, traverses the problem log E-Log, determines which local point databases have problems and what the problems are, and marks them in the database of the control center.
Step 10: Traverse the packet sending log S-Log and, for the normal local point databases, compare its fingerprint information with the data acquisition file log R-Log. If the dial testing data is flow data, calculate the average sampling ratio of the routers through which the flow passes, then analyze the number of records sent out and the number warehoused, and calculate the warehousing rate of the task. If the data is DNS data, compare directly by fingerprint and calculate the warehousing rate of the task.
Step 11: Compare the calculated warehousing rate with a preset threshold; if the warehousing rate is smaller than the threshold, a prompt is shown on the WEB interface together with the corresponding problem. In the embodiment of the present invention, the warehousing rate threshold is set to 65%.
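The comparison and alerting in steps 8 to 11 can be sketched as follows. The source does not give the warehousing rate formulas, so both are assumptions: for DNS data, the fraction of sent fingerprints that were found in storage; for flow data, stored records divided by the number expected after sampling (packets sent multiplied by the average sampling ratio). Only the six-tuple fields and the 65% threshold are taken from the text.

```python
from typing import Iterable, NamedTuple


class Fingerprint(NamedTuple):
    """The six-tuple used to match S-Log entries against stored data."""
    src_ip: str
    dst_ip: str
    src_port: int
    dst_port: int
    protocol: int
    rule_id: str


def dns_warehousing_rate(sent: Iterable[Fingerprint], stored: Iterable[Fingerprint]) -> float:
    """Assumed formula: share of sent fingerprints that were found in storage."""
    sent_set = set(sent)
    if not sent_set:
        return 0.0
    return len(sent_set & set(stored)) / len(sent_set)


def flow_warehousing_rate(packets_sent: int, stored_records: int, avg_sampling_ratio: float) -> float:
    """Assumed formula: stored records relative to the count expected after sampling."""
    expected = packets_sent * avg_sampling_ratio
    if expected <= 0:
        return 0.0
    return min(1.0, stored_records / expected)


def needs_alert(rate: float, threshold: float = 0.65) -> bool:
    """True when the warehousing rate falls below the threshold (65% in the embodiment)."""
    return rate < threshold
```

Because `Fingerprint` is a tuple, it is hashable and can be compared with set operations, which keeps the DNS comparison simple.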
Compared with the prior art, the method of the invention has the following advantages. For router flow monitoring, problems on a cloud service link are identified by calculating the warehousing rate. The DNS data monitoring and analysis method, in which traffic is forwarded through a man-in-the-middle machine, performs identity verification, provides secure DNS monitoring, and verifies the resolution function of the DNS server. The asynchronous, large-range, time-sharing monitoring method of the invention dial tests in time slices, which reduces packet loss, balances load, has little impact on the system under test, and offers higher robustness and stability.
Claims (5)
1. A service monitoring system for cloud service infrastructure, characterized by comprising a control center server and dial testing servers arranged in various regions; the control center server comprises a dial testing task issuing module, a data acquisition module, a dial testing data analysis module, a dial testing alarm module and a database, and each dial testing server is provided with a cloud service dial testing module;
a user configures a monitoring task through a dial testing task issuing module and issues a configuration file to a corresponding dial testing server; recording a plurality of source IPs, a plurality of destination IPs and a plurality of protocol monitoring tasks configured by a user in a configuration file; the dial testing task issuing module verifies a target IP in the configuration file and does not issue tasks which fail to be verified;
after the cloud service dial testing module receives the configuration file, traversing tasks in the configuration file, verifying the correctness of a target IP of the tasks, configuring data packets meeting the rules for the verified tasks, and carrying out data dial testing in an asynchronous time-sharing dial testing mode; the cloud service dial testing module comprises two types of dial testing data: one is to send a data packet of a domain name with a specified format to a DNS recursive server providing services in the cloud service; one is that the number of data packets to be dialed and measured is set according to the router list through which the obtained cloud service flow passes and the sampling ratio of the router, and the data packets are sent to a target IP; the cloud service dial testing module records a packet sending Log S-Log and sends the packet sending Log S-Log to a dial testing data analysis module of the control center server after sending a data packet to a target IP and a DNS recursive server providing services;
the data acquisition module acquires data fingerprint information according to the packet sending log S-Log, traverses the local point databases, and verifies the connection of each local point database before data query; if the connection fails or the query times out, the problem is recorded in the problem log E-Log; if the connection succeeds, the warehoused data of the local point database is queried; after the traversal of the local point databases is completed, a data acquisition file and a data acquisition file log R-Log are generated;
the dial testing data analysis module acquires the data acquisition file log R-Log, the problem log E-Log and the packet sending log S-Log corresponding to a certain task; it first traverses the problem log E-Log, marks each problematic local point database in the database, and records the corresponding local point database problems; it then traverses the packet sending log S-Log and compares it with the log R-Log; if the dial testing data is flow data sent to a target IP, it calculates the average sampling ratio of the routers passed through according to the data fingerprint information and calculates the warehousing rate of the task; if the dial testing data is data sent to the DNS recursive server, it compares according to the data fingerprint information and calculates the warehousing rate of the task;
the dial testing alarm module presets threshold values aiming at the flow monitoring of the router and the monitoring of the DNS recursive server providing services, compares the warehousing rate of the tasks calculated by the dial testing data analysis module with the threshold values, and carries out alarm prompt on the tasks with the warehousing rate lower than the threshold values.
2. The service monitoring system for cloud service infrastructure according to claim 1, wherein the control center server is a service cluster, and the modules disposed in the control center server are each implemented by a separate server.
3. The service monitoring system for cloud service infrastructure according to claim 1, wherein the dial testing task issuing module queries an IP library to check whether the information of the target IP of the monitoring task is correct, and if the information is correct, the verification is passed.
4. The service monitoring system for cloud service infrastructure according to claim 1, wherein the data fingerprint information is represented as a six-tuple: source IP, destination IP, source port, destination port, protocol number, rule ID.
5. A service monitoring method facing to cloud service infrastructure is characterized by comprising the following steps:
step 1: setting dial testing servers in different areas, wherein the dial testing servers adopt an asynchronous time-sharing dial testing mode; the asynchronous time-sharing dial testing mode is that dial testing data is set to be sent asynchronously, and a set number of data packets are sent at set time intervals;
step 2: a user configures a monitoring task on a WEB interface of a control center server, a dial-up test task issuing module verifies the correctness of a target IP of the task, and if the target IP of the task is correct, a configuration file of the monitoring task is generated;
step 3: the dial testing task issuing module issues the configuration file of the monitoring task to each dial testing server, the dial testing server verifies the correctness of the target IP in the configuration file, and if the target IP is wrong, the dial testing is not performed on the target IP and the error is fed back to the control center server;
step 4: acquiring a router list through which cloud service traffic passes and the sampling ratio of each router, and setting the number of dial testing data packets sent by a dial testing server to a target IP according to the sampling ratio of the router and a set probability condition that a flow log is sampled; the dial testing server generates a packet sending log S-Log after sending dial testing data packets each time and sends the packet sending log S-Log to the control center server;
if the sampling ratio of the router is 1/X, the probability that the flow log is sampled is required to be greater than G%, and the number of the dial testing data packets is Y, the following relationship exists: when the number of data packets is Y, the probability that the flow log is sampled is $1 - (1 - 1/X)^{Y}$;
step 5: the dial testing server obtains a list of DNS recursive servers providing services in the cloud service, sends a data packet carrying a domain name in a specified format to the recursive DNS server, and performs packet capturing to generate a PCAP file;
step 6: the dial testing server sends the PCAP file to the man-in-the-middle machine, and the man-in-the-middle machine performs data packet verification to ensure security; the dial testing server sends out the data packet through the man-in-the-middle machine, generates a packet sending log S-Log and sends the packet sending log S-Log to the control center server;
step 7: the control center server verifies the connection state of the local point database, and if the connection fails or the query times out, the problem is recorded in a problem log E-Log;
step 8: the control center server acquires fingerprint information from the packet sending log sent by the dial testing server, and performs data query of multiple local points according to the fingerprint information to generate a data acquisition file; the fingerprint information is: source IP, destination IP, source port, destination port, protocol number, and rule ID;
step 9: for a certain task, the control center server acquires the data acquisition file log R-Log and the problem log E-Log, and looks up the corresponding packet sending log S-Log; by traversing the problem log E-Log, it marks each problematic local point database in the database of the control center and records the existing problems;
step 10: the control center server traverses the packet sending Log S-Log, compares the packet sending Log S-Log with the Log R-Log for a normal local point database, calculates the average sampling ratio of a router to be passed by according to data fingerprint information if the dial-test data is streaming data sent to a target IP, calculates the warehousing rate of the task, and compares the data fingerprint information if the dial-test data is data sent to a DNS recursive server to calculate the warehousing rate of the task;
step 11: and comparing the calculated warehousing rate with a preset corresponding warehousing rate threshold, and prompting on a WEB interface if the calculated warehousing rate is smaller than the threshold.
Priority Applications (1)
Application Number | Priority Date | Filing Date | Title |
---|---|---|---|
CN201810585690.XA CN108683569B (en) | 2018-06-06 | 2018-06-06 | Service monitoring method and system for cloud service infrastructure |
Publications (2)
Publication Number | Publication Date |
---|---|
CN108683569A CN108683569A (en) | 2018-10-19 |
CN108683569B true CN108683569B (en) | 2020-06-09 |
Family
ID=63810284
Family Applications (1)
Application Number | Title | Priority Date | Filing Date |
---|---|---|---|
CN201810585690.XA Active CN108683569B (en) | 2018-06-06 | 2018-06-06 | Service monitoring method and system for cloud service infrastructure |
Country Status (1)
Country | Link |
---|---|
CN (1) | CN108683569B (en) |
Families Citing this family (8)
Publication number | Priority date | Publication date | Assignee | Title |
---|---|---|---|---|
CN109921925B (en) * | 2019-02-15 | 2022-04-22 | 北京奇艺世纪科技有限公司 | Dial testing method and device |
CN112463572B (en) * | 2019-09-06 | 2023-09-15 | 福建天泉教育科技有限公司 | Cross-border multi-service dial testing software testing system and method thereof |
CN110519303B (en) * | 2019-09-30 | 2022-02-18 | 北京市天元网络技术股份有限公司 | Communication method and system across isolation devices |
CN114257518A (en) * | 2020-09-11 | 2022-03-29 | 中兴通讯股份有限公司 | Communication network testing method and device |
CN112100133A (en) * | 2020-11-04 | 2020-12-18 | 广州市玄武无线科技股份有限公司 | Distributed log processing system |
CN112866053A (en) * | 2020-12-31 | 2021-05-28 | 天翼物联科技有限公司 | Internet of things testing method, system and device and storage medium |
CN113572644B (en) * | 2021-07-26 | 2024-01-23 | 武汉众邦银行股份有限公司 | Internet cloud dial testing automatic monitoring method and device |
CN118590425A (en) * | 2023-02-24 | 2024-09-03 | 华为云计算技术有限公司 | Dial test method, device, system and computing equipment cluster |
Family Cites Families (5)
Publication number | Priority date | Publication date | Assignee | Title |
---|---|---|---|---|
CN101727389B (en) * | 2009-11-23 | 2012-11-14 | 中兴通讯股份有限公司 | Automatic test system and method of distributed integrated service |
CN201601833U (en) * | 2009-12-28 | 2010-10-06 | 福建邮科通信技术有限公司 | Automatic detecting system for wireless network |
CN102546269B (en) * | 2010-12-07 | 2015-08-19 | 中国移动通信集团广东有限公司 | A kind of method and system of Fast Monitoring IP network |
KR101847199B1 (en) * | 2012-09-25 | 2018-05-28 | 에스케이텔레콤 주식회사 | Apparatus and method for providing quality analysis of data service |
CN104753735B (en) * | 2013-12-31 | 2018-09-07 | 中国移动通信集团上海有限公司 | A kind of call-testing system and method |
Also Published As
Publication number | Publication date |
---|---|
CN108683569A (en) | 2018-10-19 |
Similar Documents
Publication | Publication Date | Title |
---|---|---|
CN108683569B (en) | Service monitoring method and system for cloud service infrastructure | |
Sherwood et al. | Discarte: a disjunctive internet cartographer | |
US7076547B1 (en) | System and method for network performance and server application performance monitoring and for deriving exhaustive performance metrics | |
US8443074B2 (en) | Constructing an inference graph for a network | |
US9210050B2 (en) | System and method for a testing vector and associated performance map | |
US7804787B2 (en) | Methods and apparatus for analyzing and management of application traffic on networks | |
US8135828B2 (en) | Cooperative diagnosis of web transaction failures | |
CN109617743B (en) | Network performance monitoring and service testing system and testing method | |
EP2081321A2 (en) | Sampling apparatus distinguishing a failure in a network even by using a single sampling and a method therefor | |
US20030005145A1 (en) | Network service assurance with comparison of flow activity captured outside of a service network with flow activity captured in or at an interface of a service network | |
US20070171827A1 (en) | Network flow analysis method and system | |
Azzouni et al. | Fingerprinting OpenFlow controllers: The first step to attack an SDN control plane | |
CN109995582B (en) | Asset equipment management system and method based on real-time state | |
CN109067938B (en) | Method and device for testing DNS (Domain name Server) | |
US20140280904A1 (en) | Session initiation protocol testing control | |
CN111934936B (en) | Network state detection method and device, electronic equipment and storage medium | |
CN114389792B (en) | WEB log NAT (network Address translation) front-back association method and system | |
CN114157554A (en) | Troubleshooting method and device, storage medium and computer equipment | |
WO2012002849A1 (en) | Apparatus and method for monitoring of connectivity services | |
CN112532614A (en) | Safety monitoring method and system for power grid terminal | |
Mahmood et al. | Network traffic analysis and SCADA security | |
Aceto et al. | Open source platforms for Internet Monitoring and Measurement | |
Polverini et al. | Investigating on black holes in segment routing networks: Identification and detection | |
Viipuri | Traffic analysis and modeling of IP core networks | |
Marchetta et al. | Measuring networks using IP options |
Legal Events
Date | Code | Title | Description |
---|---|---|---|
PB01 | Publication | ||
SE01 | Entry into force of request for substantive examination | ||
GR01 | Patent grant | ||