
CN110445662B - Method and device for adaptively switching OpenStack control node into computing node - Google Patents


Info

Publication number
CN110445662B
CN110445662B (application number CN201910810282.4A)
Authority
CN
China
Prior art keywords
node
computing
computing node
nodes
control node
Prior art date
Legal status
Active
Application number
CN201910810282.4A
Other languages
Chinese (zh)
Other versions
CN110445662A (en)
Inventor
刘梦可 (Liu Mengke)
刘超 (Liu Chao)
Current Assignee
Inesa R&d Center
Original Assignee
Inesa R&d Center
Priority date
Filing date
Publication date
Application filed by Inesa R&d Center filed Critical Inesa R&d Center
Priority to CN201910810282.4A priority Critical patent/CN110445662B/en
Publication of CN110445662A publication Critical patent/CN110445662A/en
Application granted granted Critical
Publication of CN110445662B publication Critical patent/CN110445662B/en

Classifications

    • H04L 41/0668: Management of faults, events, alarms or notifications using network fault recovery by dynamic selection of recovery network elements, e.g. replacement by the most appropriate element after failure
    • H04L 41/147: Network analysis or design for predicting network behaviour
    • H04L 41/30: Decision processes by autonomous network management units using voting and bidding
    • H04L 43/0817: Monitoring or testing based on specific metrics, e.g. QoS, energy consumption or environmental parameters, by checking availability and functioning
    • H04L 43/12: Network monitoring probes
    • H04L 41/145: Network analysis or design involving simulating, designing, planning or modelling of a network


Abstract

The invention relates to a method and a device for adaptively switching an OpenStack control node into a computing node, wherein the OpenStack deployment comprises a plurality of control node groups and computing node groups, and the method comprises the following steps. S1: dividing the control node groups into switchable control node groups and non-switchable control node groups, and selecting a control node to be switched from the switchable control node groups through an election algorithm. S2: periodically triggering monitoring; if a node fault or an excessively high total load occurs in a computing node group, triggering the adaptive upgrading process, otherwise ending the process. The adaptive upgrading process specifically comprises: switching the control node to be switched into a computing node through an automated management tool combined with container technology, and adding it to the computing node group of step S2. Compared with the prior art, the invention has advantages such as high efficiency.

Description

Method and device for adaptively switching OpenStack control node into computing node
Technical Field
The invention relates to the technical field of OpenStack cloud platforms, in particular to a method and a device for adaptively switching an OpenStack control node into a computing node.
Background
OpenStack is an open-source cloud computing management platform that manages large pools of distributed compute, storage and network resources and provides a unified management dashboard. Its purpose is to help organizations run clouds offering virtual computing or storage services and to provide scalable, flexible cloud computing for public and private clouds; after years of production practice, OpenStack has become very mature.
In small and medium-scale cloud platforms, the usual deployment architecture is a model of multiple control nodes and multiple computing nodes; the control nodes may be multiplexed as network nodes, and distributed storage services may be deployed on the control nodes, the computing nodes or other independent nodes. As the servers of a cloud platform age, their failure rate rises, and server failures are frequently encountered in production environments. When a computing node suffers a functional failure, particularly one that also hosts reused storage services, the virtual machines on that node cannot provide service for some time; in addition, when the load of the computing node group becomes too high, virtual machine performance degrades and user experience suffers. To improve the operational continuity of a virtualization cluster, services should be restored promptly when a virtual machine running a service fails, so that the service interruption time is minimized. In existing high-availability techniques, each server in the virtualization cluster runs an agent; the agents continuously detect the state of the other servers by sending heartbeat signals at fixed intervals, and if a server fails to respond to three consecutive heartbeats it is considered faulty, the fault is reported, and all virtual machines on the faulty server are restarted on other servers in the cluster so that virtual machine service is restored and business continuity is ensured. Alternatively, a new server is added or the failed server is replaced, and the computing services and network agent services of the cloud platform are deployed directly on the operating system of the new server as RPM packages. However, because racking and unracking servers and deploying the services are complex, this approach recovers slowly and has poor timeliness, and it can hardly meet the high timeliness requirements of production environments.
To address these problems, the prior art also offers a solution: Chinese patent CN108089911A provides a method and a device for controlling computing nodes in an OpenStack environment, wherein the control node periodically monitors status data of the computing nodes, determines from the monitored status data whether each computing node is available, and, if not, evacuates the virtual machines running on that computing node to computing nodes determined to be available in the environment. By determining whether a computing node is available and promptly evacuating the virtual machines running on an unavailable node, the method minimizes the RTO and RPO and resumes virtual machine traffic in the shortest time, maintaining high availability. However, the increased load on the computing nodes that receive the virtual machines also slows the services on those nodes; dispersing the virtual machines of the unavailable node onto other available computing nodes increases their operational burden, so the method cannot effectively solve the problem of excessive computing-node load and it affects the service quality of tenant virtual machines.
Disclosure of Invention
The present invention is directed to providing a method and an apparatus for adaptively switching an OpenStack control node into a computing node, so as to overcome the above-mentioned drawbacks of the prior art.
The purpose of the invention can be realized by the following technical scheme:
a method for adaptively switching OpenStack control nodes into computing nodes is provided, wherein the OpenStack is a topological structure formed by a control node group and a computing node group, and the method comprises the following steps:
S1: dividing the control node groups into switchable control node groups and non-switchable control node groups, and selecting a control node to be switched from the switchable control node groups through an election algorithm;
S2: periodically triggering monitoring; if a node fault or an excessively high total load occurs in the computing node group, triggering the adaptive upgrading process, otherwise ending the process;
the adaptive upgrading process specifically comprises: switching the control node to be switched into a computing node through an automated management tool combined with container technology, and adding the computing node to the computing node group of step S2.
Furthermore, the control node group provides highly available cloud platform management and control services; the switchable control node group serves as a supplement to the control node group and can provide management services and virtual network services, improving the management performance of the cloud platform and increasing the service bandwidth of the centralized virtual network. Because every service of the control node group as a whole is highly available, switching a single control node does not interrupt cloud platform services.
Furthermore, the computing nodes provide services such as computing resource virtualization, computing resource management and the virtual layer-2 network agent, and supply computing resources such as CPU and memory to the virtual machines.
Furthermore, the OpenStack cloud platform adopts a containerized deployment mode: all services are packaged into corresponding Docker images and started as containers, which avoids dependency conflicts between different services, facilitates upgrade and rollback of each service, and effectively solves the problems of difficult deployment and difficult upgrading of the cloud platform;
the container image of every service is kept in a local private Docker registry. Exploiting the layered nature of container images, all customized images of the cloud platform are built in four layers: the operating system base image, the cloud platform base image, the base image of each function module, and the image of each service within the module. Image layering avoids repeated installation of dependency packages, reduces the total storage size of the images and improves deployment efficiency.
Further, the control node groups are grouped by marking custom labels; the number of non-switchable control node groups is at least 3, meeting the cloud platform's minimum-number requirement for high availability.
Furthermore, when the control nodes in the switchable control node group are deployed, the container images required by a computing node are pre-installed, and these images are kept synchronized whenever the cloud platform services are upgraded, so that the cloud platform computing services can be started directly from the local images, avoiding the performance degradation caused by transferring large files over the network;
further, the election algorithm selects the node with the minimum reference index value, the reference index being CPU load, network traffic, or a Cost value. The Cost value is obtained through a weighted summation, calculated as:
Cost = Σ_{i=1}^{N} W_i · X_i
where W_i is a weight value, X_i is any combination of the input parameters, which include CPU usage, memory usage and network traffic, and N is the number of input parameters.
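As an illustration, a minimal sketch of this election step is given below; the node names, metric values and weight values are illustrative assumptions, not values fixed by the patent.

```python
# Minimal sketch of the election step: choose the switchable control node
# with the smallest weighted Cost value. Node names, metric values and
# weights below are illustrative assumptions, not values from the patent.
def cost(metrics, weights):
    """Cost = sum over i of W_i * X_i for the selected input parameters."""
    return sum(weights[name] * metrics[name] for name in weights)

def elect_node_to_switch(switchable_nodes, weights):
    """Return the switchable control node with the minimum Cost value."""
    return min(switchable_nodes, key=lambda node: cost(node["metrics"], weights))

if __name__ == "__main__":
    weights = {"cpu_usage": 0.5, "mem_usage": 0.3, "net_traffic": 0.2}
    nodes = [
        {"name": "control03", "metrics": {"cpu_usage": 0.41, "mem_usage": 0.55, "net_traffic": 0.20}},
        {"name": "control04", "metrics": {"cpu_usage": 0.18, "mem_usage": 0.32, "net_traffic": 0.12}},
    ]
    print(elect_node_to_switch(nodes, weights)["name"])  # prints: control04
```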
Further, switching the control node to be switched into a computing node quickly through the automated management tool combined with container technology specifically comprises:
cleaning all containers on the control node to be switched using the automated deployment tool Ansible, keeping the operating system layer and the Docker service layer consistent with the computing nodes, and quickly starting the services of the resulting computing node, including nova-libvirtd, nova-compute and neutron-openvswitch-agent.
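The patent drives this switch with the automated deployment tool; as a rough illustration of the operations performed on the target host, here is a sketch using the Docker CLI from Python. The container and image names, registry address and tag are assumptions made for the example, not names fixed by the patent.

```python
# Rough sketch of the node switch as performed on the control node to be
# switched: remove every existing container, then start the compute-node
# services from locally cached images. Container/image names, registry
# address and tag are illustrative assumptions.
import subprocess

def run(cmd):
    return subprocess.run(cmd, check=True, capture_output=True, text=True).stdout

def clean_all_containers():
    container_ids = run(["docker", "ps", "-aq"]).split()
    if container_ids:
        run(["docker", "rm", "-f", *container_ids])

def start_compute_services(registry="registry.local:4000", tag="train"):
    # The images are expected to be pre-pulled on switchable control nodes,
    # so these runs start almost immediately without network transfers.
    for name in ("nova_libvirt", "nova_compute", "neutron_openvswitch_agent"):
        run(["docker", "run", "-d", "--net=host", "--privileged",
             "--name", name, f"{registry}/{name}:{tag}"])

if __name__ == "__main__":
    clean_all_containers()
    start_compute_services()
```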
Further, the method for judging a node fault in the computing node group specifically comprises:
the monitoring system sends heartbeat packets to each computing node in the computing node group; if any computing node fails to respond to the heartbeat packets, a node fault has occurred in the computing node group.
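A minimal illustration of that check follows, assuming the monitoring agent on each node exposes a TCP port that the monitoring system can probe; the port, timeout and retry count are assumptions.

```python
# Minimal heartbeat-style liveness check: a node is treated as failed only
# after several consecutive missed probes. The agent port, timeout and retry
# count are illustrative assumptions.
import socket

def node_alive(host, port=5666, timeout=2.0, retries=3):
    for _ in range(retries):
        try:
            with socket.create_connection((host, port), timeout=timeout):
                return True
        except OSError:
            continue
    return False

def failed_nodes(compute_nodes):
    """Return the compute nodes that did not answer any heartbeat probe."""
    return [host for host in compute_nodes if not node_alive(host)]
```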
Further, the method for judging that the load of the computing node group is too high comprises total load calculation and total load prediction.
Further, the total load calculation method specifically comprises:
collecting the load of each computing node in the computing node group through the monitoring agent on that node, the load comprising CPU, memory and network traffic; when the total load of the computing node group exceeds a preset threshold, the total load of the computing node group is too high;
further, the total load prediction method specifically comprises:
predicting, based on historical monitoring data of the computing nodes, through a multi-input single-output neural network linear regression model:
Z = WX + B
where Z is the predicted load of a computing node, X = {x1, x2, …, xN} is an input sample comprising time, number of virtual machines and number of tenants, W = {w1, w2, …, wN} is the weight matrix, and B = {b1} is the offset matrix; the total load of the computing node group is obtained from the predicted load Z of each computing node in the group, and if it exceeds a preset threshold, the total load of the computing node group is too high.
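A minimal sketch of this prediction is shown below, assuming the weights and offset have already been fitted offline on the historical monitoring data; the parameter values and feature vectors in the example are illustrative only.

```python
# Minimal sketch of the multi-input single-output linear model Z = W*X + B
# used to predict per-node load, and of the group-level threshold check.
# The fitted weights/offset and the example features are illustrative
# assumptions; in practice they come from historical monitoring data.
import numpy as np

def predict_node_load(x, w, b):
    """x: features such as (hour of day, VM count, tenant count); returns Z."""
    return float(np.dot(w, x) + b)

def group_total_load_too_high(feature_vectors, w, b, threshold):
    total = sum(predict_node_load(x, w, b) for x in feature_vectors)
    return total > threshold

if __name__ == "__main__":
    w = np.array([0.02, 0.015, 0.01])   # hypothetical fitted weights
    b = 0.1                              # hypothetical fitted offset
    features = [np.array([14, 20, 5]), np.array([14, 35, 8]), np.array([14, 12, 3])]
    print(group_total_load_too_high(features, w, b, threshold=2.5))  # False
```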
An apparatus for adaptive switching of an OpenStack control node to a computing node comprises a memory having a computer program stored therein and a processor that invokes the computer program to perform the steps of the method.
Compared with the prior art, the invention has the following beneficial effects:
(1) The state of the computing node group is periodically monitored for node faults and excessively high total load, and the process of switching a control node into a computing node is triggered automatically, achieving self-healing or capacity expansion of the computing node group and maintaining the high availability of the cloud platform. The total load of the computing node group is obtained either by real-time calculation or by prediction; the prediction is performed by a neural network linear regression model based on historical data, so that node switching is carried out before the total load of the computing node group reaches the set threshold, avoiding impact on the cloud platform;
(2) Each service of the cloud platform is deployed in containers, so during the node switching process only the original container services on the control node need to be removed, and the environment can be cleaned quickly;
(3) The services of the cloud platform are stored in the private Docker registry as layered images, the layers being, in order, the operating system base image, the cloud platform base image, the base image of each function module and the image of each service within the module, which reduces the total storage size of the images; because the control node to be switched has already downloaded all images related to the computing node services from the private Docker registry in advance, without starting the corresponding containers, it can quickly start the corresponding container services after the environment is cleaned, giving high deployment efficiency;
(4) The control node groups are divided into switchable control node groups and non-switchable control node groups, where the non-switchable control node groups guarantee the high availability of the cloud platform, so switching a single node does not affect the continuity of cloud platform services.
Drawings
FIG. 1 is a flow chart of an adaptive switching node;
FIG. 2 is a block diagram of an adaptive switching node;
FIG. 3 is a flow diagram of a switching node according to an embodiment;
FIG. 4 is a Docker container deployment diagram of three classes of nodes.
Detailed Description
The invention is described in detail below with reference to the figures and specific embodiments. The present embodiment is implemented on the premise of the technical solution of the present invention, and a detailed implementation manner and a specific operation process are given, but the scope of the present invention is not limited to the following embodiments.
Example one
A method for adaptively switching OpenStack control nodes into computing nodes, wherein OpenStack is a topology structure formed by a control node group and a computing node group, as shown in FIG. 1, the method comprises the following steps:
S1: dividing the control node group into a switchable control node group and a non-switchable control node group, and selecting a control node to be switched from the switchable control node group through an election algorithm;
S2: periodically triggering monitoring; if a node fault or an excessively high total load occurs in the computing node group, triggering the adaptive upgrading process, otherwise ending the process;
the adaptive upgrading process specifically comprises: switching the control node to be switched into a computing node through the automated management tool combined with container technology, and adding the computing node to the computing node group of step S2.
The election algorithm selects the node with the minimum reference index value, the reference index being CPU load, network traffic, or a Cost value. The Cost value is obtained through a weighted summation, calculated as:
Cost = Σ_{i=1}^{N} W_i · X_i
where W_i is a weight value, X_i is any combination of the input parameters, which include CPU usage, memory usage and network traffic, and N is the number of input parameters.
Switching the control node to be switched into a computing node quickly through the automated management tool combined with container technology specifically comprises:
cleaning all containers on the control node to be switched using the automated deployment tool Ansible, keeping the operating system layer and the Docker service layer consistent with the computing nodes, and quickly starting the services of the resulting computing node, including nova-libvirtd, nova-compute and neutron-openvswitch-agent.
The method for judging a node fault in the computing node group specifically comprises:
the monitoring system sends heartbeat packets to each computing node in the computing node group; if any computing node fails to respond to the heartbeat packets, a node fault has occurred in the computing node group.
The method for judging that the load of the computing node group is too high comprises:
collecting the load of each computing node in the computing node group through the monitoring agent on that node, the load comprising CPU, memory and network traffic; when the total load of the computing node group exceeds a preset threshold, the total load of the computing node group is too high.
Specifically, in this embodiment, as shown in fig. 3, step S2 comprises the following sub-steps (a simplified sketch of the resulting monitoring loop is given after the list):
101) a timer triggers the monitoring system every five minutes to collect load information from each computing node of the cloud platform and to send heartbeat packets to the computing nodes;
102) judging whether the computing node group has a node fault or an excessively high total load; if so, executing step 103), otherwise ending the process;
103) if the monitoring system is configured in silent mode, directly executing step 104); otherwise notifying an administrator by mail or short message, and executing step 104) if the administrator agrees, otherwise ending the process;
104) selecting the control node to be switched from the switchable control node group through the election algorithm;
105) automatically cleaning the containers on the control node to be switched through Ansible, retaining the operating system layer;
106) automatically starting the compute-node container services on the control node to be switched through Ansible, switching it into a computing node, adding the computing node to the computing node group, and ending the process.
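Putting the sub-steps together, a simplified sketch of the periodic trigger loop is shown below. The five-minute interval follows this embodiment; the `checks` and `actions` objects stand in for the monitoring system, the election algorithm and the Ansible-driven switch described above, and the silent-mode flag models the administrator-approval step.

```python
# Simplified sketch of the step-S2 trigger loop (sub-steps 101-106).
# `checks` must provide failed_nodes() and group_overloaded(); `actions` must
# provide notify_admin(), elect_node() and switch_to_compute(); both stand in
# for the components described earlier and are assumptions of this sketch.
import time

def monitoring_loop(cluster, checks, actions, silent_mode=True, interval=300):
    while True:
        faulty = checks.failed_nodes(cluster.compute_nodes)            # 101/102
        overloaded = checks.group_overloaded(cluster.compute_nodes)    # 101/102
        if faulty or overloaded:                                       # 102
            if silent_mode or actions.notify_admin():                  # 103
                node = actions.elect_node(cluster.switchable_control_nodes)  # 104
                actions.switch_to_compute(node)                        # 105/106
                cluster.compute_nodes.append(node)                     # 106
        time.sleep(interval)  # the embodiment triggers every five minutes
```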
Example two
In this embodiment, the total load of the computing node group is obtained through a prediction algorithm; everything else is the same as in the first embodiment. The prediction algorithm is specifically as follows:
based on historical monitoring data of the computing nodes, the prediction is performed through a multi-input single-output neural network linear regression model:
Z = WX + B
where Z is the predicted load of a computing node, X = {x1, x2, …, xN} is an input sample comprising time, number of virtual machines and number of tenants, W = {w1, w2, …, wN} is the weight matrix, and B = {b1} is the offset matrix; the total load of the computing node group is obtained from the predicted load Z of each computing node in the group, and if it exceeds a preset threshold, the total load of the computing node group is too high.
EXAMPLE III
An apparatus for adaptively switching an OpenStack control node into a computing node, corresponding to the first embodiment, the OpenStack comprising control node groups and computing node groups, the apparatus comprising:
the fault monitoring module is used for detecting each computing node in the computing node group by sending heartbeat packets and judging whether the computing node group has a node fault;
the load detection module is used for collecting the load information of each computing node in the computing node group, calculating the total load of the computing node group from the collected load information or predicting it from historical load information, and judging whether the computing node group is overloaded according to a set load threshold;
the node processing module is used for dividing the control node groups into switchable control node groups and non-switchable control node groups, selecting the control node to be switched from the switchable control node groups through the election algorithm, switching it into a computing node through the automated management tool combined with container technology, and adding the computing node to the computing node group that has a node fault or an excessively high total load;
and the timing trigger module is used for setting a monitoring period and periodically triggering the fault monitoring module and the load detection module according to that period.
The device of this embodiment is used as a peripheral device of the cloud platform; it monitors the load information and fault states of the computing nodes, where the load information comprises CPU, memory and traffic on key networks, and it manages the node switching process.
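A minimal skeleton of these four modules might look as follows; only the responsibilities come from the description above, while the class and method names are assumptions.

```python
# Skeleton of the apparatus of this embodiment. The four responsibilities come
# from the description above; class and method names are illustrative assumptions.
class FaultMonitoringModule:
    def check(self, compute_nodes):
        """Send heartbeat packets and report nodes that fail to respond."""
        raise NotImplementedError

class LoadDetectionModule:
    def overloaded(self, compute_nodes):
        """Calculate or predict the group's total load and compare it to the threshold."""
        raise NotImplementedError

class NodeProcessingModule:
    def switch(self, switchable_control_nodes, target_group):
        """Elect a control node, switch it into a compute node and add it to the group."""
        raise NotImplementedError

class TimingTriggerModule:
    def __init__(self, period_seconds, fault_module, load_module, node_module):
        # Periodically drives the fault-monitoring and load-detection modules.
        self.period_seconds = period_seconds
        self.fault_module = fault_module
        self.load_module = load_module
        self.node_module = node_module
```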
The infrastructure of the cloud platform adopts an M+N node topology, i.e. a model of M control nodes and N computing nodes, where control nodes may be multiplexed as network nodes. The M control nodes are divided into several control node groups, which are separated into switchable control node groups and non-switchable control node groups by marking custom labels; the number of non-switchable control node groups is at least 3, meeting the cloud platform's minimum-number requirement for high availability.
The control node group provides highly available cloud platform management and control services: it provides the API services, including the computing module, the cloud hard disk (volume) management module and the image management module, together with internal working components such as the controller and scheduler components; these services are stateless and achieve load-balanced high availability through HAProxy and Keepalived. It also provides the shared database and message queue services, which are stateful: the database service achieves a multi-master high-availability cluster through MySQL Galera, and the RabbitMQ cluster achieves high availability of the message queue through mirrored queues;
meanwhile, the control node group provides centralized virtual network services through the L3 agent, including the tenant network gateway, external network access, floating IPs and virtual firewalls, with active/standby high availability of the virtual routers achieved through Keepalived. The switchable control node group serves as a supplement to the control node group and can provide the management services and virtual network services, improving the management performance of the cloud platform and increasing the service bandwidth of the centralized virtual network.
The computing node group provides services such as computing resource virtualization, computing resource management and the virtual layer-2 network agent, and supplies computing resources such as CPU and memory to the virtual machines.
The OpenStack cloud platform adopts a containerized deployment mode: all services are packaged into corresponding Docker images and started as containers, and all nodes keep consistent operating system and Docker service versions, which avoids dependency conflicts between different services, facilitates upgrade and rollback of each service, and effectively solves the problems of difficult deployment and difficult upgrading of the cloud platform.
As shown in fig. 2, the container image of each service is stored in the local private Docker registry. Exploiting the layered nature of container images, all customized images of the cloud platform are built in four layers: the operating system base image, the cloud platform base image, the base image of each function module, and the image of each service within the module. Image layering avoids repeated installation of dependency packages, reduces the total storage size of the images and improves deployment efficiency.
All services of the cloud platform are started in Docker containers, and all nodes keep consistent operating system and Docker service versions, ensuring the smoothness and stability of node switching.
As shown in fig. 4, the Docker containers of a control node include the API services and internal components of each cloud platform module; the shared database, the message queue and the load balancing service all implement a high-availability scheme.
The Docker containers of a computing node include the nova-compute computing service, the neutron-openvswitch-agent virtual layer-2 network service and so on; all images of the computing node services are pre-downloaded by the nodes in the switchable control node group, without starting the corresponding containers, so that the corresponding container services can be started quickly when such a node is upgraded to a computing node.
When the control nodes in the switchable control node group are deployed, the container images required by a computing node are pre-installed, and these images are kept synchronized whenever the cloud platform services are upgraded, so that the cloud platform computing services can be started directly from the local images, avoiding the performance degradation caused by transferring large files over the network.
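As a rough illustration of that pre-installation step, assuming the images live in a local private registry, a sketch is given below; the registry address, image names and tag are illustrative assumptions.

```python
# Sketch of pre-pulling the compute-service images onto a switchable control
# node at deployment/upgrade time, so that a later switch only has to start
# containers. Registry address, image names and tag are illustrative assumptions.
import subprocess

COMPUTE_IMAGES = ("nova_libvirt", "nova_compute", "neutron_openvswitch_agent")

def prepull_compute_images(registry="registry.local:4000", tag="train"):
    for name in COMPUTE_IMAGES:
        # Pull only; the containers are not started until the node is switched.
        subprocess.run(["docker", "pull", f"{registry}/{name}:{tag}"], check=True)

if __name__ == "__main__":
    prepull_compute_images()
```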
Embodiments one, two and three trigger the process of switching a control node into a computing node based on the current state, i.e. a node fault or an excessively high load in the computing node group, or based on a prediction from historical data by the multi-input single-output neural network linear regression model. Because the platform services are deployed in containers, node switching only requires deleting the old container services and then quickly starting the new services from the pre-downloaded Docker images, and this fast process guarantees the timeliness and high availability of cloud platform switching.
The foregoing detailed description of the preferred embodiments of the invention has been presented. It should be understood that numerous modifications and variations could be devised by those skilled in the art in light of the present teachings without departing from the inventive concepts. Therefore, the technical solutions available to those skilled in the art through logic analysis, reasoning and limited experiments based on the prior art according to the concept of the present invention should be within the scope of protection defined by the claims.

Claims (8)

1. A method for OpenStack control node to adaptively switch to a computing node, the OpenStack comprising a plurality of groups of control nodes and a plurality of groups of computing nodes, the method comprising:
S1: dividing the plurality of control node groups into switchable control node groups and non-switchable control node groups, and selecting a control node to be switched from the switchable control node groups through an election algorithm;
S2: triggering monitoring periodically; if a node fault or an excessively high total load is found in a computing node group, triggering the adaptive upgrading process, otherwise ending the process;
the adaptive upgrading process specifically comprises: switching the control node to be switched into a computing node through an automated management tool combined with container technology, and adding the computing node to the computing node group of step S2;
the combination container technology for rapidly switching the control nodes to be switched into the computing nodes through an automatic management tool specifically comprises the following steps:
and cleaning all containers on the control node to be switched by utilizing an automatic deployment tool, wherein the automatic deployment tool comprises an anchor, an operating system layer and a Docker service layer which are consistent with the computing node, and rapidly starts the services of the computing node obtained by switching, the services comprise nova-libvirtd, nova-computer and neutron-openvswitch-agent, the container services related to the control node to be switched and the computing node are automatically started through the anchor, the computing node is switched into the computing node, and the computing node is added into the computing node group, wherein all mirror images of the computing node services are pre-downloaded by the nodes in the switchable control node group, but the corresponding containers are not started.
2. The method of claim 1, wherein the grouping of control node groups is performed by tagging custom tags, and the number of the non-switchable control node groups is at least 3.
3. The method for adaptively switching an OpenStack control node into a computing node according to claim 1, wherein the election algorithm selects the node with the minimum reference index value, the reference index being a CPU load, a network traffic, or a Cost value;
the Cost value is obtained through a weighted summation algorithm, with the calculation formula:
Cost = Σ_{i=1}^{N} W_i · X_i
where W_i is a weight value, X_i is one or more of the input parameters, which include CPU usage, memory usage and network traffic, and N is the number of input parameters.
4. The method for adaptively switching an OpenStack control node to a computing node according to claim 1, wherein the method for determining a node failure in a computing node group specifically comprises:
the monitoring system sends heartbeat packets to each computing node in the computing node group; if any computing node fails to respond to the heartbeat packets, a node fault has occurred in the computing node group.
5. The method of claim 1, wherein the method for determining that the load of the computing node group is too high comprises total load calculation and total load prediction.
6. The method for OpenStack control node adaptive switching to a compute node according to claim 5, wherein the method for computing the total load of a compute node group specifically comprises:
the method comprises the steps that loads of current computing nodes are collected through monitoring agents on the computing nodes in a computing node group, the loads comprise a CPU, a memory and network flow, and when the total loads of the computing node group exceed a preset threshold value, the total loads of the computing node group are too high.
7. The method for OpenStack control node adaptive switching to a compute node according to claim 5, wherein the total load prediction method of a compute node group specifically is: based on historical monitoring data of the computing nodes, predicting through a multi-input single-output neural network linear regression model;
wherein the neural network linear regression model is:
Z = WX + B
where Z is the predicted load of a computing node, X = {x1, x2, …, xN} is an input sample comprising time, number of virtual machines and number of tenants, W = {w1, w2, …, wN} is the weight matrix, and B = {b1} is the offset matrix; the total load of the computing node group is obtained from the predicted load Z of each computing node in the group, and if it exceeds a preset threshold, the total load of the computing node group is too high.
8. An apparatus for adaptive switching of an OpenStack control node to a computing node, comprising a memory and a processor, the memory storing a computer program, wherein the processor invokes the computer program to perform the steps of the method according to any of claims 1-7.
CN201910810282.4A 2019-08-29 2019-08-29 Method and device for adaptively switching OpenStack control node into computing node Active CN110445662B (en)

Priority Applications (1)

Application Number Priority Date Filing Date Title
CN201910810282.4A CN110445662B (en) 2019-08-29 2019-08-29 Method and device for adaptively switching OpenStack control node into computing node

Applications Claiming Priority (1)

Application Number Priority Date Filing Date Title
CN201910810282.4A CN110445662B (en) 2019-08-29 2019-08-29 Method and device for adaptively switching OpenStack control node into computing node

Publications (2)

Publication Number Publication Date
CN110445662A CN110445662A (en) 2019-11-12
CN110445662B true CN110445662B (en) 2022-07-12

Family

ID=68438379

Family Applications (1)

Application Number Title Priority Date Filing Date
CN201910810282.4A Active CN110445662B (en) 2019-08-29 2019-08-29 Method and device for adaptively switching OpenStack control node into computing node

Country Status (1)

Country Link
CN (1) CN110445662B (en)

Families Citing this family (6)

* Cited by examiner, † Cited by third party
Publication number Priority date Publication date Assignee Title
CN111447146B (en) * 2020-03-20 2022-04-29 上海中通吉网络技术有限公司 Method, device, equipment and storage medium for dynamically updating physical routing information
CN112269694B (en) * 2020-10-23 2023-12-22 北京浪潮数据技术有限公司 Management node determining method and device, electronic equipment and readable storage medium
CN112925609B (en) * 2021-03-01 2022-03-15 浪潮云信息技术股份公司 OpenStack cloud platform upgrading method and device
CN114500554B (en) * 2022-02-09 2024-04-26 南京戎光软件科技有限公司 Internet of things system management method
CN115242688B (en) * 2022-07-27 2024-06-14 郑州浪潮数据技术有限公司 Network fault detection method, device and medium
CN116980346B (en) * 2023-09-22 2023-11-28 新华三技术有限公司 Container management method and device based on cloud platform

Family Cites Families (8)

* Cited by examiner, † Cited by third party
Publication number Priority date Publication date Assignee Title
JP5078347B2 (en) * 2006-12-28 2012-11-21 インターナショナル・ビジネス・マシーンズ・コーポレーション Method for failing over (repairing) a failed node of a computer system having a plurality of nodes
CN102035862B (en) * 2009-09-30 2013-11-06 国际商业机器公司 Configuration node fault transfer method and system in SVC cluster
US9106537B1 (en) * 2013-06-05 2015-08-11 Parallels IP Holdings GmbH Method for high availability of services in cloud computing systems
CN105245381B (en) * 2015-10-22 2019-08-16 上海斐讯数据通信技术有限公司 Cloud Server delay machine monitors migratory system and method
CN107544839B (en) * 2016-06-27 2021-05-25 腾讯科技(深圳)有限公司 Virtual machine migration system, method and device
CN106603696B (en) * 2016-12-28 2019-06-25 华南理工大学 A kind of high-availability system based on super fusion basic framework
CN107526626B (en) * 2017-08-24 2020-12-01 武汉大学 Docker container thermal migration method and system based on CRIU
CN109617992B (en) * 2018-12-29 2021-08-03 杭州趣链科技有限公司 Block chain-based dynamic election method for edge computing nodes

Also Published As

Publication number Publication date
CN110445662A (en) 2019-11-12


Legal Events

Date Code Title Description
PB01 Publication
SE01 Entry into force of request for substantive examination
GR01 Patent grant