CN116248484B - Management method and device of cloud native all-in-one machine, electronic equipment and storage medium - Google Patents
Management method and device of cloud native all-in-one machine, electronic equipment and storage medium
- Publication number
- CN116248484B CN116248484B CN202310221964.8A CN202310221964A CN116248484B CN 116248484 B CN116248484 B CN 116248484B CN 202310221964 A CN202310221964 A CN 202310221964A CN 116248484 B CN116248484 B CN 116248484B
- Authority
- CN
- China
- Prior art keywords
- node
- network connection
- state
- connection state
- storage service
- Prior art date
- Legal status (The legal status is an assumption and is not a legal conclusion. Google has not performed a legal analysis and makes no representation as to the accuracy of the status listed.)
- Active
Classifications
-
- H—ELECTRICITY
- H04—ELECTRIC COMMUNICATION TECHNIQUE
- H04L—TRANSMISSION OF DIGITAL INFORMATION, e.g. TELEGRAPHIC COMMUNICATION
- H04L43/00—Arrangements for monitoring or testing data switching networks
- H04L43/08—Monitoring or testing based on specific metrics, e.g. QoS, energy consumption or environmental parameters
- H04L43/0805—Monitoring or testing based on specific metrics, e.g. QoS, energy consumption or environmental parameters by checking availability
- H04L43/0811—Monitoring or testing based on specific metrics, e.g. QoS, energy consumption or environmental parameters by checking availability by checking connectivity
-
- H—ELECTRICITY
- H04—ELECTRIC COMMUNICATION TECHNIQUE
- H04L—TRANSMISSION OF DIGITAL INFORMATION, e.g. TELEGRAPHIC COMMUNICATION
- H04L43/00—Arrangements for monitoring or testing data switching networks
- H04L43/10—Active monitoring, e.g. heartbeat, ping or trace-route
Abstract
The application discloses a management method of a cloud native all-in-one machine. The cloud native all-in-one machine comprises a first node and a second node, each of which has a networked cluster mode and an off-network stand-alone mode. The management method comprises the following steps: when the first node is in the cluster mode, the first node acquires the heartbeat network connection state from itself to the second node; when the heartbeat network connection state from the first node to the second node is normal, the first node acquires the system working state and the storage service working state of the second node; and the first node manages its own working mode based on the system working state and the storage service working state of the second node. According to the method, when a node fails, the working mode of the node can be switched automatically, so that storage failover is achieved and high availability of the storage service is guaranteed.
Description
Technical Field
The present application belongs to the technical field of computers, and particularly relates to a management method and device of a cloud native all-in-one machine, an electronic device and a storage medium.
Background
A two-node cloud native container all-in-one machine consists of two computers, each of which is called a node of the cluster. When the network fails, for example when a switch in the network fails, the cluster may split into two node groups, a phenomenon called split-brain. After the cluster splits, neither of the two node groups can detect the existence of the other through heartbeat or lease information, so each group regards the nodes of the other group as failed. As a result, within the same time period, nodes in both groups may initiate access to the same shared storage resource, such as a storage disk, which causes data access errors.
Disclosure of Invention
The purpose of the present application is to provide a management method and device of a cloud native all-in-one machine, an electronic device and a storage medium, so as to solve the problem in the prior art that, after split-brain occurs in a cloud native all-in-one machine cluster, nodes in a node group may initiate access to shared storage resources and thereby cause data access errors.
In order to achieve the above object, one technical solution adopted in the present application is as follows:
A management method of a cloud native all-in-one machine is provided, wherein the cloud native all-in-one machine comprises a first node and a second node, and the first node and the second node each have a networked cluster mode and an off-network stand-alone mode; the method comprises:
when the first node is in the cluster mode, the first node acquires a heartbeat network connection state from the first node to the second node;
when the heartbeat network connection state from the first node to the second node is normal, acquiring the system working state and the storage service working state of the second node;
the first node manages its own operation mode based on the system operation state and the storage service operation state of the second node.
In one or more embodiments, in parallel with the step of acquiring the system working state and the storage service working state of the second node, the method further comprises:
when the heartbeat network connection state from the first node to the second node is a fault, acquiring the management network connection states from the first node to the second node and to the third-party gateway respectively;
when the management network connection state from the first node to the second node is normal, judging whether the first node is the default node; if yes,
the first node switches itself to the stand-alone mode.
In one or more embodiments, in parallel with the step of acquiring the management network connection states from the first node to the second node and to the third-party gateway respectively, the method further comprises:
when the management network connection state from the first node to the second node is a fault and the management network connection state from the first node to the third-party gateway is normal, the first node switches itself to the stand-alone mode.
In one or more embodiments, the step of the first node managing its own working mode based on the system working state and the storage service working state of the second node includes:
when the storage network connection state of the first node is a fault, judging whether the first node is the default node; if yes,
the first node switches itself to the stand-alone mode.
In one or more embodiments, the step of the first node managing its own working mode based on its own storage network connection state, the system working state of the second node and the storage service working state of the second node includes:
the first node switches itself to the stand-alone mode when the system working state of the second node is a fault and/or the storage service working state of the second node is a fault.
In one or more embodiments, further comprising:
when the first node is in the stand-alone mode, judging whether the first node can be networked with the second node; if yes,
the first node joins the second node to form a cluster.
In one or more embodiments, the step of determining, when the first node is in the stand-alone mode, whether the first node is capable of networking with the second node includes:
the first node acquires the working mode of the second node, the storage service state of the first node and the management network connection state of the first node;
and when the second node is in a stand-alone mode and the storage service state and the management network connection state of the first node are normal, the first node determines that the first node can be networked with the second node.
In one or more embodiments, the step of determining, when the first node is in the stand-alone mode, whether the first node is capable of networking with the second node further includes:
when the second node is not in the stand-alone mode, and/or the storage service state of the second node is a fault, and/or the management network connection state of the second node is a fault, the first node waits for a preset time and then reacquires the working mode of the second node, the storage service state of the second node and the management network connection state of the second node.
In one or more embodiments, after the step of the first node joining the second node to form a cluster, in which the first node and the second node respectively switch themselves to the cluster mode, the method further comprises:
the first node judges whether it has successfully formed a cluster with the second node; if not,
the first node waits for a preset time and then judges again whether it can be networked with the second node.
In one or more embodiments, further comprising:
the first node reads and writes its own system disk;
if the read-write fails, the first node triggers the system kdump service to dump the running memory and restarts the system.
In one or more embodiments, further comprising:
the first node acquires the working state of its own management platform;
when the management platform of the first node is in an enabled state, acquiring the service network connection state of the first node, and the system working state, the storage service working state, the management network connection state and the service network connection state of the second node;
and when one or more of the service network connection state of the first node, and the system working state, the storage service working state, the management network connection state and the service network connection state of the second node are faults, the first node performs fault evacuation on the second node.
In one or more embodiments, the step of the first node performing fault evacuation for the second node includes:
the first node acquires a management network connection state, a business network connection state and a storage service working state of the second node;
when the storage service working state of the second node is a fault, the first node shuts down and evacuates the second node and sends alarm information;
when the storage service working state of the second node is normal and the management network connection state and the service network connection state of the second node are both faults, shutting down and evacuating the second node and sending alarm information;
when the storage service working state and the management network connection state of the second node are normal and the service network connection state of the second node is a fault, the first node shuts down the service POD of the second node, cold-migrates it to the first node, and sends alarm information;
and when the storage service working state and the service network connection state of the second node are normal and the management network connection state of the second node is a fault, the first node does not process the second node.
In order to achieve the above object, another technical solution adopted in the present application is as follows:
A management device of a cloud native all-in-one machine is provided, wherein the cloud native all-in-one machine comprises a first node and a second node, the first node and the second node each have a networked cluster mode and an off-network stand-alone mode, the management device is applied to the first node, and the management device comprises:
the first acquisition module is used for acquiring the heartbeat network connection state from the first node to the second node when the first node is in the cluster mode;
the second acquisition module is used for acquiring the system working state and the storage service working state of the second node when the heartbeat network connection state from the first node to the second node is normal;
and the management module is used for managing the working mode of the first node based on the system working state and the storage service working state of the second node.
In order to achieve the above object, another technical solution adopted in the present application is:
there is provided an electronic apparatus characterized by comprising:
at least one processor; and
and a memory storing instructions that, when executed by the at least one processor, cause the at least one processor to perform the method of managing a cloud native all-in-one machine as described in any of the above embodiments.
In order to achieve the above object, another technical solution adopted in the present application is:
there is provided a machine-readable storage medium storing executable instructions that, when executed, cause the machine to perform a method of managing a cloud native all-in-one machine as described in any of the above embodiments.
Compared with the prior art, the beneficial effects of the present application are as follows:
according to the management method, when a node fails, the working mode of the node can be switched automatically, so that storage failover is realized and high availability of the storage service is ensured;
according to the management method, after the fault is recovered, the nodes can be controlled to re-network and rebuild the cluster, so that storage fault recovery is realized and high availability is ensured;
the management method can perform fault evacuation when a node fails, ensuring high availability of the system.
Drawings
FIG. 1 is a block diagram of one embodiment of a cloud native all-in-one machine of the present application;
FIG. 2 is a schematic flow chart of an embodiment of storage failover in the management method of the cloud native all-in-one machine of the present application;
FIG. 3 is a schematic flow chart of an embodiment of storage fault recovery in the management method of the cloud native all-in-one machine of the present application;
FIG. 4 is a flowchart of an embodiment corresponding to the step S100b in FIG. 3;
FIG. 5 is a schematic flow chart of an embodiment of fault evacuation in the management method of the cloud native all-in-one machine of the present application;
FIG. 6 is a flowchart of an embodiment corresponding to the step S300c in FIG. 5;
FIG. 7 is a block diagram of an embodiment of a management device of the cloud native all-in-one machine of the present application;
fig. 8 is a schematic structural diagram of an embodiment of the electronic device of the present application.
Detailed Description
The present application will be described in detail with reference to the embodiments shown in the drawings. These embodiments are not intended to be limiting; structural, methodological, or functional changes made by those of ordinary skill in the art based on these embodiments are intended to be included within the scope of the present application.
The cloud native container all-in-one machine is an integrated software and hardware solution that combines a container platform with storage functions. It flexibly meets the elastic configuration requirements of different services for computing, storage and I/O, and provides a secure, controllable, economical and effective data center infrastructure. By integrating container technology, the cloud native container all-in-one machine can easily host applications, has good scalability and service agility, and helps customers cope with the complex scenarios of multi-cloud deployment and hybrid cloud management.
A two-node cloud native container all-in-one machine consists of two computers, each of which is called a node of the cluster; a storage service runs in each node, and the two nodes together form the distributed storage.
When the network fails, for example when a switch in the network fails, the cluster may split into two node groups, a phenomenon called split-brain. After the cluster splits, neither of the two node groups can detect the existence of the other through heartbeat or lease information, so each group regards the nodes of the other group as failed. As a result, within the same time period, nodes in both groups may initiate access to the same shared storage resource, such as a storage disk, which causes data access errors.
In order to ensure normal operation of the distributed storage service of the cloud native container all-in-one machine and high availability of the storage service, the present application provides a management method of the cloud native container all-in-one machine, which can switch the node working mode when a storage fault occurs, thereby ensuring high availability of the storage service.
Referring to fig. 1, fig. 1 is a structural block diagram of an embodiment of the cloud native all-in-one machine of the present application. The cloud native all-in-one machine includes a first node and a second node, and a storage service and a service POD are deployed on each node. The first node and the second node are connected by a heartbeat line, which is also multiplexed to connect the nodes to the storage network, and the first node and the second node are further connected to a third-party gateway through a management network.
The third party gateway is used for issuing instructions to the first node and the second node through the management network so as to manage services provided by the first node and the second node.
The first node and the second node each have a networked cluster mode and an off-network stand-alone mode.
Referring to fig. 2, fig. 2 is a flowchart illustrating an embodiment of storage failover in a management method of a cloud native all-in-one machine according to the present application.
Storage failover includes:
and S100a, when the first node is in a cluster mode, the first node acquires the heartbeat network connection state from the first node to the second node.
When the first node and the second node are in a networking cluster mode, the first node can acquire the connection state of the first node and the second node through the heartbeat network, and the connection state of the first node and the storage network is synchronously acquired because the heartbeat network is multiplexed into the storage network.
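By way of illustration only, the following minimal Python sketch shows one way such a connectivity probe could be implemented, assuming the second node is reachable by ICMP ping over the heartbeat/storage network; the peer address, ping options and helper name are assumptions of this illustration and are not part of the embodiments described above.

```python
import subprocess

HEARTBEAT_PEER = "node2-heartbeat"  # hypothetical address of the second node on the heartbeat network


def heartbeat_link_ok(peer: str = HEARTBEAT_PEER, count: int = 3, timeout_s: int = 2) -> bool:
    """Probe the heartbeat link (multiplexed as the storage network) with ping.

    Returns True when the peer answers (connection state 'normal');
    False is treated as 'fault'.
    """
    result = subprocess.run(
        ["ping", "-c", str(count), "-W", str(timeout_s), peer],
        stdout=subprocess.DEVNULL,
        stderr=subprocess.DEVNULL,
    )
    return result.returncode == 0
```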
And S200a, when the heartbeat network connection state from the first node to the second node is normal, acquiring the system working state and the storage service working state of the second node.
When the heartbeat network connection state from the first node to the second node is normal, the first node can be judged to be connected with the storage network normally, namely the first node can provide storage service normally.
In order to avoid that the second node fails to influence the storage service of the cluster, the first node can further acquire the system working state and the storage service working state of the second node.
Specifically, in one embodiment, the first node may obtain, from the second node, the system operating state and the storage service operating state of the second node through the SSH security protocol.
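As a purely illustrative sketch of such a query, the snippet below uses the ssh command line to ask the second node for its states; the remote commands, the storage service unit name and the host alias are assumptions of this illustration (key-based SSH authentication is also assumed), not part of the embodiment.

```python
import subprocess


def query_peer_state(peer: str = "node2-mgmt") -> dict:
    """Ask the second node, over SSH, for its system working state and storage service working state."""
    def ssh(cmd: str) -> subprocess.CompletedProcess:
        return subprocess.run(
            ["ssh", "-o", "ConnectTimeout=3", peer, cmd],
            capture_output=True, text=True,
        )

    system_probe = ssh("echo ok")                               # crude liveness probe of the peer system
    storage_probe = ssh("systemctl is-active storage-service")  # hypothetical storage service unit name

    return {
        "system_ok": system_probe.returncode == 0 and system_probe.stdout.strip() == "ok",
        "storage_ok": storage_probe.returncode == 0 and storage_probe.stdout.strip() == "active",
    }
```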
S300a, the first node manages the working mode of the first node based on the system working state and the storage service working state of the second node.
Based on the state information obtained in step S200a, the first node can determine whether the storage service of the cluster is running normally at this time, that is, at least one node normally provides the storage service, and based on the state of the storage service of the current cluster, the working mode of the first node can be managed to ensure high availability of the storage service of the cluster.
Specifically, in one embodiment, the step of the first node managing its own operation mode may include:
and the first node switches the first node to a single machine mode when the system working state of the second node is a fault and/or the storage service working state of the second node is a fault.
When the system of the second node has a fault or the storage service of the second node cannot work normally, the first node can directly switch the first node to a single-machine mode, so that the storage service is provided in a single-machine mode, and the high availability of the storage service is ensured.
In parallel with step S200a, the method further comprises:
and S200a', when the heartbeat network connection state from the first node to the second node is a fault, acquiring the management network connection states from the first node to the second node and the third-party gateway respectively.
When the heartbeat network connection state from the first node to the second node is a fault, the first node cannot connect to the storage network, the storage services of the first node and the second node cannot reach each other and cannot detect each other's existence, and at this moment the storage services of the two nodes may cause data access errors.
Therefore, it is necessary to determine the state of the cluster at this time further based on the management network connection states of the first node and the second node and the management network connection states of the first node and the third party gateway.
And S300a', when the management network connection state from the first node to the second node is normal, judging whether the first node is the default node.
If the first node is the default node, the first node switches itself to the stand-alone mode.
When the heartbeat network connection state between the first node and the second node is a fault while the management network connection state from the first node to the second node is normal, it can be judged that both nodes are in a normal working state and only the heartbeat network has failed; the storage services of the first node and the second node cannot detect each other, i.e. split-brain occurs, which would cause data access errors.
Therefore, it is necessary to switch the default node to the stand-alone mode, preventing the two nodes from colliding with each other. The default node, i.e. the master node identified by the system, may be defined by a configuration file of the node.
The first node may determine whether itself is a default node by reading the configuration file and its node information. When the first node is a default node preset by the system configuration file, the first node can directly switch the first node to a single-machine mode by taking the first node as a main node, and at the moment, the storage service provided by the first node in the single-machine mode can normally run, so that the high availability of the storage service is ensured.
It can be appreciated that when the first node is not the default node, the first node may not perform related actions, wait for the default node to switch itself to the stand-alone mode, and also can implement normal operation of the storage service.
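A trivial sketch of such a check is shown below, assuming the default-node flag is kept in a small JSON configuration file at a hypothetical path; the file format and field name are illustrative only.

```python
import json

CONFIG_PATH = "/etc/node-agent/node.json"  # hypothetical location of the node configuration file


def is_default_node(path: str = CONFIG_PATH) -> bool:
    """Return True when this node is marked as the default (master) node in its configuration file."""
    with open(path) as f:
        cfg = json.load(f)
    return bool(cfg.get("default_node", False))
```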
In parallel with step S300a', the method further comprises:
S300a'', when the management network connection state from the first node to the second node is a fault and the management network connection state from the first node to the third-party gateway is normal, the first node switches itself to the stand-alone mode.
When the heartbeat network connection state and the management network connection state from the first node to the second node are both faults, and the management network connection state from the first node to the third party gateway is normal, the first node can be judged to be in a normal working state, the second node is in an abnormal working state, and the heartbeat network is faulty, so that the first node can directly switch the first node to a single machine mode in order to ensure high availability of storage service.
It can be understood that when the management network connection state from the first node to the second node, the heartbeat network connection state, and the management network connection state from the first node to the third party gateway are all faults, it can be determined that the first node is faulty, the state of the second node is unknown, and the network is completely disconnected.
By adopting this scheme, when the storage service or the system of one node fails, the other node can automatically switch itself to the stand-alone mode to provide the storage service; likewise, when the storage network fails, the default node can automatically switch itself to the stand-alone mode to provide the storage service. Storage failover is thereby realized and high availability of the storage service is ensured.
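Putting steps S100a to S300a'' together, the storage failover decision described above could be sketched as follows. This is only an illustration under assumed helpers: _ping and switch_to_standalone_mode are placeholders, the peer states are assumed to have been obtained as in the earlier SSH sketch, and the configuration keys are hypothetical.

```python
import subprocess


def _ping(host: str) -> bool:
    """Generic reachability probe used here for both the heartbeat and the management links."""
    return subprocess.run(
        ["ping", "-c", "2", "-W", "2", host],
        stdout=subprocess.DEVNULL, stderr=subprocess.DEVNULL,
    ).returncode == 0


def switch_to_standalone_mode() -> None:
    """Placeholder for the mode switch; a real node would reconfigure its storage service here."""
    print("switching this node to stand-alone mode")


def storage_failover_tick(cfg: dict, peer_state: dict) -> None:
    """One round of the storage failover check run by the first node while in cluster mode.

    cfg holds hypothetical addresses and the default-node flag, e.g.
    {"peer_heartbeat": "node2-hb", "peer_mgmt": "node2-mgmt",
     "gateway_mgmt": "gateway-mgmt", "default_node": True};
    peer_state is the second node's {"system_ok": ..., "storage_ok": ...}.
    """
    if _ping(cfg["peer_heartbeat"]):
        # S200a/S300a: heartbeat (and storage network) normal; decide on the peer's states.
        if not peer_state["system_ok"] or not peer_state["storage_ok"]:
            switch_to_standalone_mode()
        return

    # S200a': heartbeat faulty; fall back to the management network.
    if _ping(cfg["peer_mgmt"]):
        # S300a': both nodes alive, only the heartbeat failed (split-brain risk);
        # only the default node switches to stand-alone mode.
        if cfg.get("default_node", False):
            switch_to_standalone_mode()
    elif _ping(cfg["gateway_mgmt"]):
        # S300a'': the peer is unreachable but the third-party gateway is; serve alone.
        switch_to_standalone_mode()
    # Otherwise this node itself is isolated and no mode switch is made here.
```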
After the fault is repaired, in order to rebuild the cluster, the management method further comprises storage fault recovery. Specifically, referring to fig. 3, fig. 3 is a schematic flow chart of an embodiment of storage fault recovery in the management method of the cloud native all-in-one machine of the present application.
And S100b, when the first node is in the stand-alone mode, judging whether the first node can be networked with the second node.
And S200b, if the first node determines that it can be networked with the second node, the first node joins the second node to form a cluster.
When the first node is in the stand-alone mode, the first node can judge whether the first node can be connected with the second node. When the first node determines that it is able to network with the second node, the two nodes may reconnect to form a cluster while the two nodes switch to cluster mode.
Specifically, referring to fig. 4, fig. 4 is a flow chart of an embodiment corresponding to step S100b in fig. 3.
The step of the first node judging whether it can be networked with the second node comprises the following steps:
S101b, the first node acquires the working mode of the second node, its own storage service state and its own management network connection state.
S102b, when the second node is in the stand-alone mode and the storage service state and the management network connection state of the first node are normal, the first node determines that it can be networked with the second node.
First, the first node can determine whether the second node is in a stand-alone mode, and the first node can be connected with the second node to form a cluster only when the second node is also in the stand-alone mode.
When the second node is also in the stand-alone mode, the first node can check whether its own storage service is normal and whether its own connection to the management network is normal; if both are normal, the first node determines that it can be networked with the second node.
In one embodiment, in parallel with step S102b, the method further comprises:
s102b', when the second node is not in a stand-alone mode, and/or the storage service state of the second node is a fault, and/or the management network connection state of the second node is a fault, the first node waits for a preset time and then reacquires the working mode of the second node, the storage service state of the second node and the management network connection state of the second node.
It can be understood that the first node can reconnect with the second node to form a cluster only when all of the following conditions are satisfied at the same time: the second node is in the stand-alone mode, the storage service of the first node is normal, and the management network of the cluster is normal. When any one of these conditions is not met, the first node waits for a preset time and then judges again whether it can be networked with the second node, thereby realizing a periodic check and ensuring that the cluster is constructed as soon as the networking conditions are met.
The preset time may be selected based on an actual working condition, and may be set through a configuration file of the node, for example, may be 5s.
To ensure that the cluster is constructed successfully, the method further includes, after step S200b:
s300b, the first node judges whether the cluster is successfully formed with the second node.
And S400b, if the cluster is not successfully formed, the first node waits for a preset time and then judges whether the first node can be networked with the second node.
After the first node joins the second node to form the cluster, the first node can determine whether to successfully form the cluster with the second node, if not, the first node can wait for a preset time and then re-determine whether to be able to network with the second node.
The preset time may be selected based on an actual working condition, and may be set through a configuration file of the node, for example, may be 5s.
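A minimal sketch of this periodic recovery loop might look as follows; the five callables are assumed hooks for querying the peer's mode, the node's own storage and management network states, joining the cluster and verifying success, and the 5 s preset time is only the example value mentioned above.

```python
import time

PRESET_WAIT_S = 5  # example preset time taken from the node configuration file


def recovery_loop(get_peer_mode, own_storage_ok, own_mgmt_ok, join_cluster, cluster_formed) -> None:
    """Storage fault recovery: while in stand-alone mode, periodically try to rebuild the cluster."""
    while True:
        can_network = (
            get_peer_mode() == "standalone"   # S101b/S102b: second node must be in stand-alone mode
            and own_storage_ok()              # own storage service state normal
            and own_mgmt_ok()                 # own management network connection state normal
        )
        if can_network:
            join_cluster()                    # S200b: join the second node to form a cluster
            if cluster_formed():              # S300b: verify that the cluster was formed
                break                         # both nodes are back in cluster mode
        time.sleep(PRESET_WAIT_S)             # S102b'/S400b: wait the preset time and re-check
```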
In order to prevent damage to the system disk from affecting the operation of the storage service, the management method of the cloud native all-in-one machine may further include a periodic check of the system disk. Specifically, the management method may include:
the first node reads and writes own system disk; if the read-write fails, the first node triggers the system kdump service to restore the running memory and restarts the system.
The first node can periodically read and write its own system disk to judge whether the system disk is working. When the system disk cannot be read or written, the first node immediately triggers the kdump service of the system. kdump is a tool and service that dumps the running memory when the system crashes, deadlocks or hangs; by triggering the kdump service, the running memory is dumped and the system is restarted, which prevents the storage from becoming unavailable because of a half-dead system and ensures both the preservation of the runtime parameters and the high availability of the storage service.
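The periodic disk self-check and kdump trigger could be illustrated as below. This sketch assumes a Linux system with the magic SysRq interface enabled, so that writing 'c' to /proc/sysrq-trigger forces a kernel crash that kdump captures before the reboot; the probe file path is a placeholder.

```python
import os

PROBE_FILE = "/var/lib/node-agent/disk_probe"  # hypothetical scratch file on the system disk


def check_system_disk_or_crash() -> None:
    """Write and read back a probe file on the system disk; on failure force a kernel crash
    so that the kdump service dumps the running memory and the system restarts."""
    try:
        with open(PROBE_FILE, "w") as f:
            f.write("probe")
            f.flush()
            os.fsync(f.fileno())
        with open(PROBE_FILE) as f:
            if f.read() != "probe":
                raise OSError("system disk read-back mismatch")
    except OSError:
        # Trigger a crash via magic SysRq; kdump (if configured) saves the memory image
        # and the machine reboots, avoiding a half-dead node that blocks the storage service.
        with open("/proc/sysrq-trigger", "w") as sysrq:
            sysrq.write("c")
```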
In one embodiment, as shown in fig. 1, a management platform service is further deployed within the first node and the second node, the management platform service may be deployed within the container, and the management platform service may include a failure evacuation service.
The management method of the cloud native all-in-one machine may further include fault evacuation, specifically referring to fig. 5, fig. 5 is a schematic flow chart of an embodiment of fault evacuation in the management method of the cloud native all-in-one machine.
The fault evacuation includes:
s100c, the first node acquires the working state of a management platform of the first node;
and S200c, when the self management platform of the first node is in a starting state, acquiring the self service network connection state, the system working state of the second node, the storage service working state, the management network connection state and the service network connection state.
And S300c, when one or more of the service network connection state of the first node, the system working state of the second node, the storage service working state, the management network connection state and the service network connection state are faults, the first node performs fault evacuation on the second node.
Specifically, the first node first acquires its own management platform state, that is, determines whether the failure evacuation service in the first node is in an enabled state.
When the fault evacuation service of the first node is in an enabled state, the first node can further acquire the connection states between the service PODs of the two nodes and the service network, namely the service network connection states. The service network is used by users to issue service instructions to the nodes, and the service PODs execute the corresponding instructions after receiving them.
The first node can further acquire the system working state, the storage service working state and the connection state of the second node and the management network; based on the above state, it can be determined whether or not the second node needs to be subjected to failure evacuation, and the high availability of the storage service is ensured by performing failure evacuation on the second node.
Referring to fig. 6, fig. 6 is a flowchart of an embodiment corresponding to step S300c in fig. 5.
The method for the first node to carry out fault evacuation on the second node comprises the following steps:
s301c, the first node acquires a management network connection state, a business network connection state and a storage service working state of the second node.
After determining that the second node needs to be subjected to fault evacuation, the first node obtains the connection state of the second node and the management network and the service network and the working state of the storage service of the second node.
And S302c, when the storage service working state of the second node is a fault, the first node shuts down and evacuates the second node and sends alarm information.
If the storage service of the second node has a fault, that is, the second node cannot provide the storage service, in order to ensure high availability of the storage service, the first node can directly shut down and evacuate the second node, so that the second node is prevented from influencing the storage service operation of the first node in the cluster mode, and meanwhile, the first node can send alarm information to remind an intranet administrator that the second node is shut down.
And S303c, when the storage service working state of the second node is normal and the management network connection state and the business network connection state of the second node are both faults, the first node shuts down and evacuates the second node and sends alarm information.
When the storage service of the second node works normally but the connections between the second node and both the management network and the service network have failed, neither the administrator nor the user can access the second node. In this case, although the storage service of the second node works normally, the management instructions from the intranet and the user service instructions from the extranet cannot reach the second node, so the first node shuts down and evacuates the second node and sends alarm information to remind the intranet administrator that the second node has been shut down.
And S304c, when the storage service working state and the management network connection state of the second node are normal and the service network connection state of the second node is a fault, the first node shuts down the service POD of the second node, cold-migrates it to the first node, and sends alarm information.
When the storage service of the second node works normally and the connection between the second node and the management network is normal, but the connection state between the second node and the service network is a fault, the user service instructions from the external network cannot reach the second node. In this case the first node shuts down the service POD of the second node and cold-migrates it to the first node, so that the first node can execute the service instructions originally handled by the second node, and sends alarm information to remind the intranet administrator that the service POD of the second node has been migrated to the first node.
S305c, when the storage service working state and the business network connection state of the second node are normal, and the management network connection state of the second node is a fault, the first node does not process the second node.
When the storage service of the second node works normally and the second node is connected to the service network normally, but the connection state between the second node and the management network is a fault, the user service instructions from the external network can still reach the second node while the management instructions from the internal network cannot. The second node can still execute service instructions, so the first node neither migrates nor shuts down the second node.
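The decision logic of steps S301c to S305c can be summarised by the following sketch; the three boolean inputs stand for the second node's storage service, management network and service network states, and the shutdown, migration and alarm actions are stubbed with prints as placeholders.

```python
def evacuate_peer(storage_ok: bool, mgmt_net_ok: bool, svc_net_ok: bool) -> str:
    """Fault evacuation decision made by the first node for the second node (S302c-S305c)."""
    if not storage_ok:
        # S302c: storage service faulty -> shut down and evacuate the peer, raise an alarm.
        print("shutting down and evacuating the second node, sending alarm")
        return "shutdown_and_evacuate"
    if not mgmt_net_ok and not svc_net_ok:
        # S303c: storage is fine but neither administrators nor users can reach the peer.
        print("shutting down and evacuating the second node, sending alarm")
        return "shutdown_and_evacuate"
    if mgmt_net_ok and not svc_net_ok:
        # S304c: only the service network is down -> cold-migrate the peer's service PODs here.
        print("stopping the second node's service PODs, cold-migrating them to this node, sending alarm")
        return "cold_migrate_pods"
    # S305c: storage and service network are fine, only the management network is down -> do nothing.
    return "no_action"
```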
The present application also provides a management device of the cloud native all-in-one machine, wherein the cloud native all-in-one machine comprises a first node and a second node, and the first node and the second node each have a networked cluster mode and an off-network stand-alone mode. Referring to fig. 7, fig. 7 is a block diagram illustrating an embodiment of the management device of the cloud native all-in-one machine of the present application.
The management device is applied to a first node and comprises a first acquisition module 21, a second acquisition module 22 and a management module 23.
The first obtaining module 21 is configured to obtain, when the first node is in the cluster mode, the heartbeat network connection state from the first node to the second node; the second obtaining module 22 is configured to obtain the system working state and the storage service working state of the second node when the heartbeat network connection state from the first node to the second node is normal; the management module 23 is configured to manage the working mode of the first node based on the system working state and the storage service working state of the second node.
In one embodiment, the management device further includes a judging module 24 and a clustering module 25.
Wherein, the judging module 24 is configured to judge whether the first node can be networked with the second node when the first node is in the stand-alone mode;
the cluster module 25 is configured to join the second node to form a cluster when it is determined that the first node can be networked with the second node.
In one embodiment, the management device further includes a read/write module 26 and a restart module 27.
The read-write module 26 is used for reading and writing the system disk of the first node; the restarting module 27 is used for triggering the system kdump service to dump the running memory and restart the system when the read-write fails.
In an embodiment, the management device further comprises a third acquiring module 28, a fourth acquiring module 29 and a failure evacuation module 30.
The third obtaining module 28 is configured to obtain the working state of the management platform of the first node; the fourth obtaining module 29 is configured to obtain, when the management platform is in an enabled state, the service network connection state of the first node, and the system working state, the storage service working state, the management network connection state and the service network connection state of the second node; the fault evacuation module 30 is configured to perform fault evacuation on the second node when one or more of the service network connection state of the first node, and the system working state, the storage service working state, the management network connection state and the service network connection state of the second node are faulty.
The management method of the cloud native all-in-one machine according to the embodiments of the present specification has been described above with reference to fig. 1 to 6. The details mentioned in the above description of the method embodiments are equally applicable to the management device of the cloud native all-in-one machine of the embodiments of the present specification. The management device of the cloud native all-in-one machine can be implemented in hardware, in software, or in a combination of hardware and software.
Fig. 8 is a schematic structural diagram of an embodiment of the electronic device of the present application. As shown in fig. 8, the electronic device 40 may include at least one processor 41, a storage 42 (e.g., a non-volatile memory), a memory 43, and a communication interface 44, which are connected together via a bus 45. The at least one processor 41 executes at least one computer-readable instruction stored or encoded in the storage 42.
It should be appreciated that the computer-executable instructions stored in the storage 42, when executed, cause the at least one processor 41 to perform the various operations and functions described above in connection with fig. 1 to 7 in the various embodiments of the present specification.
In embodiments of the present description, electronic device 40 may include, but is not limited to: personal computers, server computers, workstations, desktop computers, laptop computers, notebook computers, mobile electronic devices, smart phones, tablet computers, cellular phones, personal Digital Assistants (PDAs), handsets, messaging devices, wearable electronic devices, consumer electronic devices, and the like.
According to one embodiment, a program product, such as a machine-readable medium, is provided. The machine-readable medium may have instructions (i.e., elements described above implemented in software) that, when executed by a machine, cause the machine to perform the various operations and functions described above in connection with fig. 1-7 in various embodiments of the specification. In particular, a system or apparatus provided with a readable storage medium having stored thereon software program code implementing the functions of any of the above embodiments may be provided, and a computer or processor of the system or apparatus may be caused to read out and execute instructions stored in the readable storage medium.
In this case, the program code itself read from the readable medium may implement the functions of any of the above embodiments, and thus the machine-readable code and the readable storage medium storing the machine-readable code form part of the present specification.
Examples of readable storage media include floppy disks, hard disks, magneto-optical disks, optical disks (e.g., CD-ROMs, CD-R, CD-RWs, DVD-ROMs, DVD-RAMs, DVD-RWs), magnetic tapes, nonvolatile memory cards, and ROMs. Alternatively, the program code may be downloaded from a server computer or cloud by a communications network.
It will be appreciated by those skilled in the art that various changes and modifications can be made to the embodiments disclosed above without departing from the spirit of the invention. Accordingly, the scope of protection of this specification should be limited by the attached claims.
It should be noted that not all the steps and units in the above flowcharts and the system configuration diagrams are necessary, and some steps or units may be omitted according to actual needs. The order of execution of the steps is not fixed and may be determined as desired. The apparatus structures described in the above embodiments may be physical structures or logical structures, that is, some units may be implemented by the same physical client, or some units may be implemented by multiple physical clients, or may be implemented jointly by some components in multiple independent devices.
In the above embodiments, the hardware units or modules may be implemented mechanically or electrically. For example, a hardware unit, module or processor may include permanently dedicated circuitry or logic (e.g., a dedicated processor, FPGA or ASIC) to perform the corresponding operations. The hardware unit or processor may also include programmable logic or circuitry (e.g., a general purpose processor or other programmable processor) that may be temporarily configured by software to perform the corresponding operations. The particular implementation (mechanical, or dedicated permanent, or temporarily set) may be determined based on cost and time considerations.
The detailed description set forth above in connection with the appended drawings describes exemplary embodiments, but does not represent all embodiments that may be implemented or fall within the scope of the claims. The term "exemplary" used throughout this specification means "serving as an example, instance, or illustration," and does not mean "preferred" or "advantageous over other embodiments. The detailed description includes specific details for the purpose of providing an understanding of the described technology. However, the techniques may be practiced without these specific details. In some instances, well-known structures and devices are shown in block diagram form in order to avoid obscuring the concepts of the described embodiments.
The previous description of the disclosure is provided to enable any person skilled in the art to make or use the disclosure. Various modifications to the disclosure will be readily apparent to those skilled in the art, and the generic principles defined herein may be applied to other variations without departing from the scope of the disclosure. Thus, the disclosure is not limited to the examples and designs described herein but is to be accorded the widest scope consistent with the principles and novel features disclosed herein.
Claims (12)
1. A management method of a cloud native all-in-one machine, characterized in that the cloud native all-in-one machine comprises a first node and a second node, and the first node and the second node each have a networked cluster mode and an off-network stand-alone mode; the method comprises:
when the first node is in the cluster mode, the first node acquires a heartbeat network connection state from the first node to the second node;
when the heartbeat network connection state from the first node to the second node is normal, acquiring the system working state and the storage service working state of the second node;
the first node manages the working mode of the first node based on the system working state and the storage service working state of the second node;
the first node acquires the working state of its own management platform;
when the management platform of the first node is in an enabled state, acquiring the service network connection state of the first node, and the system working state, the storage service working state, the management network connection state and the service network connection state of the second node;
when one or more of the service network connection state of the first node, and the system working state, the storage service working state, the management network connection state and the service network connection state of the second node are faults, the first node performs fault evacuation on the second node;
Wherein the step of performing fault evacuation by the first node on the second node includes:
the first node acquires a management network connection state, a business network connection state and a storage service working state of the second node;
when the storage service working state of the second node is a fault, the first node shuts down and evacuates the second node and sends alarm information;
when the storage service working state of the second node is normal and the management network connection state and the service network connection state of the second node are both faults, shutting down and evacuating the second node and sending alarm information;
when the storage service working state and the management network connection state of the second node are normal and the service network connection state of the second node is a fault, the first node shuts down the service POD of the second node, cold-migrates it to the first node, and sends alarm information;
and when the storage service working state and the service network connection state of the second node are normal and the management network connection state of the second node is a fault, the first node does not process the second node.
2. The method of managing as set forth in claim 1, wherein, in parallel with the step of acquiring the system working state and the storage service working state of the second node, the method further comprises:
when the heartbeat network connection state from the first node to the second node is a fault, acquiring the management network connection states from the first node to the second node and to the third-party gateway respectively;
when the management network connection state from the first node to the second node is normal, judging whether the first node is the default node; if yes,
the first node switches itself to the stand-alone mode.
3. The method of managing according to claim 2, wherein, in parallel with the step of acquiring the management network connection states from the first node to the second node and to the third-party gateway respectively, the method further comprises:
when the management network connection state from the first node to the second node is a fault and the management network connection state from the first node to the third-party gateway is normal, the first node switches itself to the stand-alone mode.
4. The method according to claim 1, wherein the step of the first node managing its own working mode based on the system working state and the storage service working state of the second node includes:
the first node switches itself to the stand-alone mode when the system working state of the second node is a fault and/or the storage service working state of the second node is a fault.
5. The method of managing as set forth in claim 1, further comprising:
when the first node is in the stand-alone mode, judging whether the first node can be networked with the second node; if yes,
the first node joins the second node to form a cluster.
6. The method of managing of claim 5, wherein the step of the first node determining whether itself can be networked with the second node while in the stand-alone mode comprises:
the first node acquires the working mode of the second node, the storage service state of the first node and the management network connection state of the first node;
and when the second node is in a stand-alone mode and the storage service state and the management network connection state of the first node are normal, the first node determines that the first node can be networked with the second node.
7. The method of managing of claim 6, wherein the step of the first node determining whether itself can network with the second node when in the stand-alone mode further comprises:
When the second node is not in the stand-alone mode, and/or the storage service state of the second node is a fault, and/or the management network connection state of the second node is a fault, the first node waits for a preset time and then reacquires the working mode of the second node, the storage service state of the second node and the management network connection state of the second node.
8. The method according to claim 5, wherein after the step of the first node joining the second node to form a cluster, in which the first node and the second node respectively switch themselves to the cluster mode, the method further comprises:
the first node judges whether it has successfully formed a cluster with the second node; if not,
the first node waits for a preset time and then judges again whether it can be networked with the second node.
9. The method of managing as set forth in claim 1, further comprising:
the first node reads and writes own system disk;
if the read-write fails, the first node triggers the system kdump service to dump the running memory and restarts the system.
10. A management device of a cloud native all-in-one machine, characterized in that the cloud native all-in-one machine comprises a first node and a second node, the first node and the second node each have a networked cluster mode and an off-network stand-alone mode, the management device is applied to the first node, and the management device comprises:
The first acquisition module is used for acquiring the heartbeat network connection state from the first node to the second node when the first node is in the cluster mode;
the second acquisition module is used for acquiring the system working state and the storage service working state of the second node when the heartbeat network connection state from the first node to the second node is normal;
the management module is used for managing the working mode of the first node based on the system working state and the storage service working state of the second node;
the third acquisition module is used for acquiring the working state of the management platform of the first node;
the fourth acquisition module is used for acquiring, when the management platform of the first node is in an enabled state, the service network connection state of the first node, and the system working state, the storage service working state, the management network connection state and the service network connection state of the second node;
the fault evacuation module is used for performing fault evacuation on the second node when one or more of the service network connection state of the first node, and the system working state, the storage service working state, the management network connection state and the service network connection state of the second node are faults;
Wherein said performing a failure evacuation of the second node comprises:
acquiring a management network connection state, a business network connection state and a storage service working state of the second node;
when the storage service working state of the second node is a fault, shutting down and evacuating the second node, and sending alarm information;
when the storage service working state of the second node is normal and the management network connection state and the service network connection state of the second node are both faults, shutting down and evacuating the second node and sending alarm information;
when the storage service working state and the management network connection state of the second node are normal and the service network connection state of the second node is a fault, shutting down the service POD of the second node, cold-migrating it to the first node, and sending alarm information;
and when the storage service working state and the service network connection state of the second node are normal and the management network connection state of the second node is a fault, not processing the second node.
11. An electronic device, comprising:
at least one processor; and
a memory storing instructions that, when executed by the at least one processor, cause the at least one processor to perform the method of managing a cloud native all-in-one machine of any of claims 1-9.
12. A machine-readable storage medium storing executable instructions that, when executed, cause the machine to perform the management method of the cloud native all-in-one machine as claimed in any one of claims 1 to 9.
Priority Applications (1)
Application Number | Priority Date | Filing Date | Title |
---|---|---|---|
CN202310221964.8A CN116248484B (en) | 2023-03-09 | 2023-03-09 | Management method and device of cloud primary integrated machine, electronic equipment and storage medium |
Applications Claiming Priority (1)
Application Number | Priority Date | Filing Date | Title |
---|---|---|---|
CN202310221964.8A CN116248484B (en) | 2023-03-09 | 2023-03-09 | Management method and device of cloud primary integrated machine, electronic equipment and storage medium |
Publications (2)
Publication Number | Publication Date |
---|---|
CN116248484A CN116248484A (en) | 2023-06-09 |
CN116248484B true CN116248484B (en) | 2024-03-22 |
Family
ID=86631075
Family Applications (1)
Application Number | Title | Priority Date | Filing Date |
---|---|---|---|
CN202310221964.8A Active CN116248484B (en) | 2023-03-09 | 2023-03-09 | Management method and device of cloud primary integrated machine, electronic equipment and storage medium |
Country Status (1)
Country | Link |
---|---|
CN (1) | CN116248484B (en) |
Citations (5)
Publication number | Priority date | Publication date | Assignee | Title |
---|---|---|---|---|
KR20050078931A (en) * | 2004-02-03 | 2005-08-08 | 엘지엔시스(주) | Method for dealing with system troubles through joint-owning of state information and control commands |
CN107147528A (en) * | 2017-05-23 | 2017-09-08 | 郑州云海信息技术有限公司 | One kind stores gateway intelligently anti-fissure system and method |
CN111274135A (en) * | 2020-01-18 | 2020-06-12 | 苏州浪潮智能科技有限公司 | High availability test method for computing nodes of openstack |
CN113377702A (en) * | 2021-07-06 | 2021-09-10 | 安超云软件有限公司 | Method and device for starting two-node cluster, electronic equipment and storage medium |
CN115269248A (en) * | 2022-07-28 | 2022-11-01 | 江苏安超云软件有限公司 | Method and device for preventing split brain under dual-node cluster, electronic equipment and storage medium |
-
2023
- 2023-03-09 CN CN202310221964.8A patent/CN116248484B/en active Active
Patent Citations (5)
Publication number | Priority date | Publication date | Assignee | Title |
---|---|---|---|---|
KR20050078931A (en) * | 2004-02-03 | 2005-08-08 | 엘지엔시스(주) | Method for dealing with system troubles through joint-owning of state information and control commands |
CN107147528A (en) * | 2017-05-23 | 2017-09-08 | 郑州云海信息技术有限公司 | One kind stores gateway intelligently anti-fissure system and method |
CN111274135A (en) * | 2020-01-18 | 2020-06-12 | 苏州浪潮智能科技有限公司 | High availability test method for computing nodes of openstack |
CN113377702A (en) * | 2021-07-06 | 2021-09-10 | 安超云软件有限公司 | Method and device for starting two-node cluster, electronic equipment and storage medium |
CN115269248A (en) * | 2022-07-28 | 2022-11-01 | 江苏安超云软件有限公司 | Method and device for preventing split brain under dual-node cluster, electronic equipment and storage medium |
Also Published As
Publication number | Publication date |
---|---|
CN116248484A (en) | 2023-06-09 |
Similar Documents
Publication | Publication Date | Title |
---|---|---|
US10078563B2 (en) | Preventing split-brain scenario in a high-availability cluster | |
US9052935B1 (en) | Systems and methods for managing affinity rules in virtual-machine environments | |
CN107526659B (en) | Method and apparatus for failover | |
CN102355369B (en) | Virtual clustered system as well as processing method and processing device thereof | |
CN109151045B (en) | Distributed cloud system and monitoring method | |
EP3210367B1 (en) | System and method for disaster recovery of cloud applications | |
CN102394914A (en) | Cluster brain-split processing method and device | |
EP3550436A1 (en) | Method and apparatus for detecting and recovering fault of virtual machine | |
CN107480014A (en) | A kind of High Availabitity equipment switching method and device | |
CN104158707A (en) | Method and device of detecting and processing brain split in cluster | |
US11223515B2 (en) | Cluster system, cluster system control method, server device, control method, and non-transitory computer-readable medium storing program | |
CN111865632B (en) | Switching method of distributed data storage cluster and switching instruction sending method and device | |
CN114138732A (en) | Data processing method and device | |
CN114840495A (en) | Database cluster split-brain prevention method, storage medium and device | |
CN105389231A (en) | Database dual-computer backup method and system | |
CN111342986B (en) | Distributed node management method and device, distributed system and storage medium | |
CN102073523A (en) | Method and device for implementing software version synchronization | |
CN103902401A (en) | Virtual machine fault tolerance method and device based on monitoring | |
CN116248484B (en) | Management method and device of cloud primary integrated machine, electronic equipment and storage medium | |
CN109617716B (en) | Data center exception handling method and device | |
CN111488247B (en) | High availability method and equipment for managing and controlling multiple fault tolerance of nodes | |
CN107181608B (en) | Method for recovering service and improving performance and operation and maintenance management system | |
US20180107502A1 (en) | Application continuous high availability solution | |
CN114301763B (en) | Distributed cluster fault processing method and system, electronic equipment and storage medium | |
CN116192885A (en) | High-availability cluster architecture artificial intelligent experiment cloud platform data processing method and system |
Legal Events
Date | Code | Title | Description |
---|---|---|---|
PB01 | Publication | ||
PB01 | Publication | ||
SE01 | Entry into force of request for substantive examination | ||
SE01 | Entry into force of request for substantive examination | ||
GR01 | Patent grant | ||
GR01 | Patent grant |