CN118747130A - Data transmission repair function verification method, device, electronic device and storage medium
- Publication number
- CN118747130A, CN202410813363.0A
- Authority
- CN
- China
- Prior art keywords
- server
- verified
- data transmission
- memory
- repair function
- Prior art date
- Legal status (The legal status is an assumption and is not a legal conclusion. Google has not performed a legal analysis and makes no representation as to the accuracy of the status listed.)
- Pending
Classifications
-
- G—PHYSICS
- G06—COMPUTING; CALCULATING OR COUNTING
- G06F—ELECTRIC DIGITAL DATA PROCESSING
- G06F11/00—Error detection; Error correction; Monitoring
- G06F11/07—Responding to the occurrence of a fault, e.g. fault tolerance
- G06F11/0703—Error or fault processing not based on redundancy, i.e. by taking additional measures to deal with the error or fault not making use of redundancy in operation, in hardware, or in data representation
- G06F11/0793—Remedial or corrective actions
-
- G—PHYSICS
- G06—COMPUTING; CALCULATING OR COUNTING
- G06F—ELECTRIC DIGITAL DATA PROCESSING
- G06F11/00—Error detection; Error correction; Monitoring
- G06F11/22—Detection or location of defective computer hardware by testing during standby operation or during idle time, e.g. start-up testing
- G06F11/26—Functional testing
Landscapes
- Engineering & Computer Science (AREA)
- Theoretical Computer Science (AREA)
- General Engineering & Computer Science (AREA)
- Quality & Reliability (AREA)
- Physics & Mathematics (AREA)
- General Physics & Mathematics (AREA)
- Computer Hardware Design (AREA)
- Debugging And Monitoring (AREA)
Abstract
The present invention provides a data transmission repair function verification method, device, electronic device and storage medium. The method comprises: obtaining server parameter information of a server to be verified; determining, according to the server parameter information, whether the server to be verified has a data transmission repair environment; when the server to be verified is determined to have the data transmission repair environment, injecting correctable errors into the memory of the running server to be verified and obtaining the number of correctable errors injected into the server to be verified; and verifying, based on the number of correctable errors, whether the server to be verified has a data transmission repair function during operation. The method can effectively verify whether the server to be verified has the data transmission repair function, which lays a foundation for invoking that function so that the server to be verified can repair data without interrupting operation.
Description
Technical Field
The present invention relates to the technical field of data transmission repair, and in particular to a data transmission repair function verification method, device, electronic device and storage medium.
Background Art
In the related art, server equipment needs to transmit data during operation. Once server equipment fails, the fault has to be repaired by powering off to replace the faulty equipment or by performing a hot restart. Such measures cannot guarantee the continuity of data transmission and degrade the user experience. If server equipment instead has a data transmission repair function during operation, the faulty equipment can be repaired without affecting system operation, which ensures the continuity and delivery of data reads and writes.
Therefore, verifying whether server equipment has the data transmission repair function has become an important research topic.
Summary of the Invention
The present invention provides a data transmission repair function verification method, device, electronic device and storage medium that can effectively verify whether server equipment has a data transmission repair function, which lays a foundation for invoking that function so that the server equipment can repair data without interrupting operation.
The present invention provides a data transmission repair function verification method, the method comprising: obtaining server parameter information of a server to be verified; determining, according to the server parameter information, whether the server to be verified has a data transmission repair environment; when the server to be verified is determined to have the data transmission repair environment, injecting correctable errors into the memory of the running server to be verified and obtaining the number of correctable errors injected into the server to be verified; and verifying, based on the number of correctable errors, whether the server to be verified has a data transmission repair function during operation.
According to the data transmission repair function verification method provided by the present invention, before verifying, based on the number of correctable errors, whether the server to be verified has the data transmission repair function during operation, the method further comprises: obtaining a tolerable error count threshold of the server to be verified. Verifying, based on the number of correctable errors, whether the server to be verified has the data transmission repair function during operation specifically comprises: when the number of correctable errors is greater than the tolerable error count threshold, determining that the server to be verified has triggered an error mechanism; when the server to be verified has triggered the error mechanism, judging whether the server to be verified is operating normally; when the server to be verified can operate normally, determining that it has the data transmission repair function during operation; and when the server to be verified cannot operate normally, determining that it does not have the data transmission repair function during operation.
According to the data transmission repair function verification method provided by the present invention, the tolerable error count threshold of the server to be verified includes any one or more of an error accumulation threshold of the server to be verified, an error storm threshold of the server to be verified, and a minimum threshold common divisor, where the minimum threshold common divisor is the minimum common divisor of the error accumulation threshold and the error storm threshold of the server to be verified.
According to the data transmission repair function verification method provided by the present invention, before injecting correctable errors into the memory of the running server to be verified, the method further comprises: when the central processor of the server to be verified is determined to be in an encrypted state, decrypting the central processor to obtain a server to be verified whose central processor is in a decrypted state. Injecting correctable errors into the memory of the running server to be verified then specifically comprises: injecting correctable errors into the memory of the running server to be verified whose central processor is in the decrypted state.
According to the data transmission repair function verification method provided by the present invention, the method further comprises: when the central processor of the server to be verified is determined to be in an unencrypted state, injecting correctable errors into the memory of the running server to be verified whose central processor is in the unencrypted state.
According to the data transmission repair function verification method provided by the present invention, the server parameter information includes at least target central processor type information, target memory type information and target memory operation repair information. Determining, according to the server parameter information, whether the server to be verified has a data transmission repair environment is implemented as follows: when the server to be verified satisfies a first condition, a second condition and a third condition, the server to be verified is determined to have the data transmission repair environment. The first condition is satisfied when a pre-maintained first mapping table shows that the target central processor type information and the target memory type information have a matching relationship, where the first mapping table includes matching relationships formed by the central processor type information and memory type information of target servers, and a target server is a server with a data transmission function. The second condition is satisfied when the serial port log of the server to be verified is found to contain the memory operation repair information, which shows that the server to be verified is able to capture memory operation repair information. The third condition is satisfied when the server to be verified is detected to be able to capture faulty memory information.
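The three-condition environment check described above can be sketched in Python as follows. This is only an illustrative sketch: the `ServerParams` fields, the contents of `SUPPORTED_PAIRS`, the `marker` string and the function name are all hypothetical placeholders, since the patent does not specify the mapping table entries or the exact log text.

```python
from dataclasses import dataclass

# Hypothetical stand-in for the pre-maintained "first mapping table":
# (CPU type, memory type) pairs taken from known-good target servers.
# The entries are placeholders, not values from the patent.
SUPPORTED_PAIRS = {("cpu-type-a", "ddr5"), ("cpu-type-b", "ddr4")}


@dataclass
class ServerParams:
    cpu_type: str                 # target central processor type information
    memory_type: str              # target memory type information
    serial_log: str               # text of the serial port log
    captures_faulty_memory: bool  # outcome of a separate probe (assumed given)


def has_repair_environment(p: ServerParams,
                           marker: str = "memory runtime repair") -> bool:
    """Return True only when all three conditions hold."""
    first = (p.cpu_type, p.memory_type) in SUPPORTED_PAIRS  # first condition
    second = marker in p.serial_log                         # second condition
    third = p.captures_faulty_memory                        # third condition
    return first and second and third
```

Keeping the three conditions as separate named booleans makes it easy to log which one failed when a server is rejected.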
According to the data transmission repair function verification method provided by the present invention, when the server to be verified is verified to have the data transmission repair function during operation, the method further comprises: when the data transmission repair function of the server to be verified is enabled, performing data transmission based on the server to be verified, so that the server to be verified can repair data without interrupting operation.
The present invention also provides a data transmission repair function verification device, comprising: an acquisition module, configured to obtain server parameter information of a server to be verified; a determination module, configured to determine, according to the server parameter information, whether the server to be verified has a data transmission repair environment; a processing module, configured to inject correctable errors into the memory of the running server to be verified when the server to be verified is determined to have the data transmission repair environment, and to obtain the number of correctable errors injected into the server to be verified; and a verification module, configured to verify, based on the number of correctable errors, whether the server to be verified has a data transmission repair function during operation.
The present invention also provides an electronic device, comprising a memory, a processor, and a computer program stored in the memory and executable on the processor, where the processor implements any of the data transmission repair function verification methods described above when executing the program.
The present invention also provides a non-transitory computer-readable storage medium storing a computer program that, when executed by a processor, implements any of the data transmission repair function verification methods described above.
The present invention also provides a computer program product comprising a computer program that, when executed by a processor, implements any of the data transmission repair function verification methods described above.
With the data transmission repair function verification method, device, electronic device and storage medium provided by the present invention, server parameter information of a server to be verified is obtained; whether the server to be verified has a data transmission repair environment is determined according to the server parameter information; when the server to be verified has the data transmission repair environment, correctable errors are injected into the memory of the running server to be verified and the number of correctable errors injected into the server to be verified is obtained; and whether the server to be verified has a data transmission repair function during operation is verified based on the number of correctable errors. This makes it possible to effectively verify whether the server to be verified has the data transmission repair function, which lays a foundation for invoking that function so that the server to be verified can repair data without interrupting operation.
Brief Description of the Drawings
To illustrate the technical solutions of the present invention or of the prior art more clearly, the drawings needed for describing the embodiments or the prior art are briefly introduced below. The drawings described below show some embodiments of the present invention; a person of ordinary skill in the art can derive other drawings from them without creative effort.
FIG. 1 is the first flow chart of the data transmission repair function verification method provided by the present invention.
FIG. 2 is a flow chart, provided by the present invention, of verifying whether a server to be verified has a data transmission repair function during operation based on the number of correctable errors.
FIG. 3 is a flow chart, provided by the present invention, of injecting correctable errors into the memory of the running server to be verified.
FIG. 4 is the second flow chart of the data transmission repair function verification method provided by the present invention.
FIG. 5 is a schematic diagram of an application scenario of the data transmission repair function verification method provided by the present invention.
FIG. 6 is a schematic structural diagram of the data transmission repair function verification device provided by the present invention.
FIG. 7 is a schematic structural diagram of the electronic device provided by the present invention.
Detailed Description
To make the purpose, technical solutions and advantages of the present invention clearer, the technical solutions of the present invention are described clearly and completely below with reference to the drawings. The described embodiments are some, not all, of the embodiments of the present invention. All other embodiments obtained by a person of ordinary skill in the art based on these embodiments without creative effort fall within the scope of protection of the present invention.
The data transmission repair function verification method provided by the present invention can verify whether a server to be verified has the data transmission repair function (North Bridge Interface Repair, also known as nBIF Repair). This lays a foundation for invoking the data transmission repair function of the server to be verified so that the server can repair data without interrupting operation and without affecting the customer's business. The aim is to ensure the continued reliability of the system and the integrity of the data, and thereby to cope effectively with the challenges posed by equipment failures.
The process of the data transmission repair function verification method provided by the present invention is described below with reference to FIG. 1.
In an exemplary embodiment of the present invention, as shown in FIG. 1, the data transmission repair function verification method may include steps 110 to 140, each of which is described below.
In step 110, server parameter information of the server to be verified is obtained.
In step 120, whether the server to be verified has a data transmission repair environment is determined according to the server parameter information.
In one embodiment, the server parameter information of the server to be verified can be obtained, and whether the server to be verified has a data transmission repair environment can be judged from it. Having a data transmission repair environment is a prerequisite for the server to be verified to have the data transmission repair function. The server parameter information may be information that characterizes the parameters of the server to be verified.
In step 130, when the server to be verified is determined to have the data transmission repair environment, correctable errors are injected into the memory of the running server to be verified, and the number of correctable errors injected into the server to be verified is obtained.
In step 140, whether the server to be verified has a data transmission repair function during operation is verified based on the number of correctable errors.
In another embodiment, once the server to be verified is judged to have the data transmission repair environment, it still has to be determined whether the server actually has the data transmission repair function. To verify this, correctable errors can be injected into the memory of the running server to be verified, and the number of correctable errors injected into the server can be obtained. The correctable errors may be simulated. Whether the server to be verified has the data transmission repair function during operation is then verified based on the number of correctable errors.
With the data transmission repair function verification method provided by the present invention, server parameter information of a server to be verified is obtained; whether the server to be verified has a data transmission repair environment is determined according to the server parameter information; when the server to be verified has the data transmission repair environment, correctable errors are injected into the memory of the running server to be verified and the number of correctable errors injected into the server to be verified is obtained; and whether the server to be verified has a data transmission repair function during operation is verified based on the number of correctable errors. This makes it possible to effectively verify whether the server to be verified has the data transmission repair function, which lays a foundation for invoking that function so that the server to be verified can repair data without interrupting operation.
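As a rough illustration of how steps 110 to 140 chain together, the following Python sketch wires the four steps through caller-supplied callables. All four parameter names are hypothetical: the patent does not define a programming interface. Returning `False` when the environment is missing is an assumption that follows from the environment being described as a prerequisite for the function.

```python
from typing import Callable


def verify_repair_function(
    get_params: Callable[[], dict],
    has_environment: Callable[[dict], bool],
    inject_errors: Callable[[], int],
    check_by_count: Callable[[int], bool],
) -> bool:
    """Run the four steps in order; all callables are hypothetical hooks."""
    params = get_params()              # step 110: obtain server parameters
    if not has_environment(params):    # step 120: environment is a prerequisite
        return False                   # assumption: no environment, no function
    injected = inject_errors()         # step 130: inject errors, get the count
    return check_by_count(injected)    # step 140: verify based on the count
```

Passing the steps in as callables keeps the control flow testable without real hardware, because each hook can be replaced by a stub.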
The process of verifying, based on the number of correctable errors, whether the server to be verified has the data transmission repair function during operation is described below with reference to FIG. 2.
In an exemplary embodiment of the present invention, as shown in FIG. 2, verifying whether the server to be verified has the data transmission repair function during operation based on the number of correctable errors may include steps 210 to 250, each of which is described below.
In step 210, a tolerable error count threshold of the server to be verified is obtained.
In step 220, when the number of correctable errors is greater than the tolerable error count threshold, the server to be verified is determined to have triggered the error mechanism.
In one embodiment, the tolerable error count threshold of the server to be verified can be obtained. The threshold can be adjusted according to actual conditions and is not specifically limited in this embodiment.
In another exemplary embodiment of the present invention, the tolerable error count threshold may include any one or more of an error accumulation threshold (CE Accumulation Threshold) of the server to be verified, an error storm threshold (CE Storm Threshold) of the server to be verified, and a minimum threshold common divisor, where the minimum threshold common divisor is the minimum common divisor of the error accumulation threshold and the error storm threshold of the server to be verified.
In another embodiment, the server to be verified is determined to have triggered the error mechanism when the number of correctable errors is greater than the tolerable error count threshold. In practice, one can check whether a memory correctable error log has been generated. If a log has been generated and the number of correctable errors is greater than or equal to the tolerable error count threshold, the Basic Input Output System (BIOS) of the server to be verified has triggered a System Management Interrupt (SMI) and thereby reported the memory correctable error to the Baseboard Management Controller (BMC) of the server to be verified. When the expected result of this step is met, the device fault has been successfully induced and the error mechanism is triggered continuously, which means that the server to be verified has triggered the error mechanism.
In step 230, when the server to be verified has triggered the error mechanism, whether the server to be verified is operating normally is judged.
In step 240, when the server to be verified can operate normally, it is determined to have the data transmission repair function during operation.
In step 250, when the server to be verified cannot operate normally, it is determined not to have the data transmission repair function during operation.
In another embodiment, once the server to be verified has triggered the error mechanism, whether it is operating normally can be judged. In one example, if the server to be verified still operates normally after the error mechanism has been triggered, it has the data transmission repair function during operation. In another example, if the server to be verified cannot operate normally after the error mechanism has been triggered, it does not have the data transmission repair function during operation.
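The decision logic of steps 210 to 250 can be written as a small function. The sketch below is illustrative only: the `Verdict` enum and the `judge` function are hypothetical names, and the `NOT_TRIGGERED` outcome is an assumption added for the case in which the count does not exceed the threshold, which the text does not cover.

```python
from enum import Enum
from typing import Callable


class Verdict(Enum):
    HAS_REPAIR = "has data transmission repair function"
    NO_REPAIR = "lacks data transmission repair function"
    NOT_TRIGGERED = "error mechanism not triggered"  # assumed extra outcome


def judge(error_count: int, threshold: int,
          running_normally: Callable[[], bool]) -> Verdict:
    # Step 220: the error mechanism counts as triggered only when the
    # injected count is strictly greater than the threshold.
    if error_count <= threshold:
        return Verdict.NOT_TRIGGERED
    # Steps 230-250: after triggering, normal operation decides the verdict.
    return Verdict.HAS_REPAIR if running_normally() else Verdict.NO_REPAIR
```

For instance, `judge(11, 10, lambda: True)` yields `Verdict.HAS_REPAIR`, while the same call with a probe that returns `False` yields `Verdict.NO_REPAIR`.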
In another embodiment, whether the server to be verified is operating normally can be judged by whether the terminal outputs the text "RAS Runtime nBIF Repair" and the system information module reports a "correct/deferred error" log. If "RAS Runtime nBIF Repair" appears in the terminal output and the system information module has reported a "correct/deferred error" log, the device has been repaired successfully and the server to be verified is verified to have the nBIF Repair function. If the terminal does not output "RAS Runtime nBIF Repair", the device repair has failed and the server to be verified is verified not to have the nBIF Repair function. This completes the verification process for the nBIF Repair function.
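The two textual checks described above amount to simple substring tests. In the minimal sketch below, the two marker strings are taken from the description, while the function name and the assumption that both outputs are available as plain strings are hypothetical.

```python
REPAIR_MARKER = "RAS Runtime nBIF Repair"   # expected in the terminal output
ERROR_LOG_MARKER = "correct/deferred error"  # expected in the system log


def repair_succeeded(terminal_output: str, system_log: str) -> bool:
    """True only if both markers are present in their respective sources."""
    return REPAIR_MARKER in terminal_output and ERROR_LOG_MARKER in system_log
```

A real harness would first have to collect the terminal output and the system information module log; how that is done depends on the platform and is not covered here.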
FIG. 3 is a schematic flow chart, provided by the present invention, of injecting correctable errors into the memory of the running server to be verified.
The process of injecting correctable errors into the memory of the running server to be verified is described below with reference to FIG. 3.
In an exemplary embodiment of the present invention, as shown in FIG. 3, injecting correctable errors into the memory of the running server to be verified may include steps 310 to 330, each of which is described below.
In step 310, when it is determined that the central processing unit of the server to be verified is in an encrypted state, the central processing unit is decrypted, yielding a server to be verified whose central processing unit is in a decrypted state.
In step 320, correctable errors are injected into the memory of the running server to be verified whose central processing unit is in the decrypted state.
In step 330, when it is determined that the central processing unit of the server to be verified is in a non-encrypted state, correctable errors are injected into the memory of the running server to be verified whose central processing unit is in the non-encrypted state.
In one embodiment, it can also be determined whether the central processing unit of the server to be verified is in an encrypted state. If it is, correctable errors cannot be injected at this point, so the central processing unit must first be decrypted, yielding a server to be verified whose central processing unit is in a decrypted state. Correctable errors are then injected into the memory of that running server, which triggers the error mechanism of the server to be verified and lays the foundation for further verifying whether the server has the data transmission repair function.
In yet another embodiment, if it is determined that the central processing unit of the server to be verified is in a non-encrypted state, correctable errors can be injected directly into the memory of the running server. This triggers the error mechanism of the server to be verified and lays the foundation for further verifying whether the server has the data transmission repair function.
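Steps 310 to 330 can be sketched as a single branch on the encryption state. The sketch below is illustrative only: `ServerState`, `fake_decrypt`, and `fake_inject` are hypothetical stand-ins, since the description does not name the tools that decrypt the central processing unit or inject the errors.

```python
from dataclasses import dataclass
from typing import Callable


@dataclass
class ServerState:
    """Hypothetical stand-in for the server to be verified."""
    cpu_encrypted: bool
    injected_errors: int = 0


def inject_correctable_errors(
    server: ServerState,
    decrypt_cpu: Callable[[ServerState], None],
    inject_one: Callable[[ServerState], None],
    count: int = 1,
) -> ServerState:
    # Step 310: an encrypted CPU must be decrypted before any injection.
    if server.cpu_encrypted:
        decrypt_cpu(server)
        if server.cpu_encrypted:
            raise RuntimeError("CPU is still encrypted; injection is not possible")
    # Step 320 (after decryption) or step 330 (CPU was never encrypted).
    for _ in range(count):
        inject_one(server)
    return server


# Hypothetical callables used only to make the sketch runnable.
def fake_decrypt(server: ServerState) -> None:
    server.cpu_encrypted = False


def fake_inject(server: ServerState) -> None:
    server.injected_errors += 1
```

Passing the decrypt and inject operations in as callables keeps the control flow of FIG. 3 separate from platform-specific tooling.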
In yet another exemplary embodiment of the present invention, continuing with the embodiment described above, the server parameter information may include at least target central processing unit type information, target memory type information, and target memory operation repair information. Determining, according to the server parameter information, whether the server to be verified has a data transmission repair environment may be implemented as follows:
When the server to be verified satisfies a first condition, a second condition, and a third condition, it is determined that the server to be verified has a data transmission repair environment, wherein:
when it is determined, according to a pre-maintained first mapping table, that the target central processing unit type information and the target memory type information match, it is determined that the server to be verified satisfies the first condition, wherein the first mapping table includes matching relationships formed by the central processing unit type information and the memory type information of a target server, and the target server is a server with a data transmission function;
when memory operation repair information is found in the serial port log of the server to be verified, it is determined that the server to be verified is able to capture memory operation repair information, and this capability is taken to mean that the server to be verified satisfies the second condition;
when it is detected that the server to be verified is able to capture faulty memory information, it is determined that the server to be verified satisfies the third condition.
In one embodiment, when the server to be verified satisfies the first condition, the second condition, and the third condition, it can be determined that the server has a data transmission repair environment. In one example, when the pre-maintained first mapping table shows that the target central processing unit type information and the target memory type information match, the server to be verified satisfies the first condition, wherein the first mapping table includes matching relationships formed by the central processing unit type information and the memory type information of servers with a data transmission function. In other words, a prerequisite for a server with a data transmission function is that its central processing unit type and memory type have a corresponding matching relationship. In one example, if the CPU belongs to the Genoa platform, a command is used to check whether the memory is of the X4 type; if it is, the server to be verified satisfies the first condition. In another example, if the memory is not X4, it can be replaced with X4 memory so that the server to be verified satisfies the first condition.
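The first condition amounts to a lookup in the pre-maintained mapping table. A minimal sketch follows; the table layout and the names `CPU_MEMORY_TABLE` and `meets_first_condition` are assumptions, and the entries are limited to the Genoa and Turin pairings that this description itself gives as examples.

```python
from typing import Dict, Set

# Assumed layout of the "first mapping table": CPU platform -> allowed memory types.
# Entries are taken from the examples in this description only.
CPU_MEMORY_TABLE: Dict[str, Set[str]] = {
    "Genoa": {"X4"},
    "Turin": {"X4", "X8", "X16"},
}


def meets_first_condition(
    cpu_type: str,
    memory_type: str,
    table: Dict[str, Set[str]] = CPU_MEMORY_TABLE,
) -> bool:
    """True when the CPU type and memory type form a matching pair in the table."""
    return memory_type in table.get(cpu_type, set())
```

An unknown CPU type simply fails the check, which matches the idea that only pairings recorded in the table count as a match.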
In yet another embodiment, the server to be verified can also be booted into the operating system. A command can be executed to verify whether the register in the DRAM chips of the server to be verified has been enabled. It can be understood that this register being enabled indicates that the server to be verified has booted and entered the operating system. In one example, whether the register is enabled can be determined by checking whether the value of the EccChipKillCap register is 1.
In yet another embodiment, while the server to be verified is booting, serial port information can be collected through SOL (Serial Over LAN). When the server enters the operating system, collection stops and the serial port log is opened to check for memory operation repair information (also called DPPRCL information). If such information is found in the serial port log, the ABL function is enabled and DRAM-related information can be captured normally; that is, the server to be verified is able to capture memory operation repair information and is therefore determined to satisfy the second condition.
It can be understood that, if no memory operation repair information is found in the serial port log of the server to be verified, the ABL-related function options can be enabled separately to ensure firmware support for the Pointer Repair function, so that the server to be verified satisfies the second condition.
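The second condition can be sketched as a scan of the collected serial port log. The real format of DPPRCL entries is not given in this description, so the sketch assumes only that the string "DPPRCL" appears somewhere in a relevant line; the function name is likewise hypothetical.

```python
from typing import List


def find_repair_info_lines(serial_log: str, marker: str = "DPPRCL") -> List[str]:
    """Return the log lines that mention the marker (assumed to be 'DPPRCL')."""
    return [line for line in serial_log.splitlines() if marker in line]


def meets_second_condition(serial_log: str) -> bool:
    """True when the serial port log contains memory operation repair information."""
    return bool(find_repair_info_lines(serial_log))
```

When the check fails, the description suggests enabling the ABL-related function options and collecting the log again.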
In yet another embodiment, it can also be verified whether the operating system of the server to be verified includes a BERT functional module, that is, whether the server to be verified is able to capture faulty memory information. If the server is detected to be able to capture faulty memory information, its operating system includes the BERT functional module, and the server to be verified can be determined to satisfy the third condition.
In yet another embodiment, if the operating system of the server to be verified does not include the BERT functional module, it can be replaced with an operating system that includes the module, or the corresponding plug-in can be installed, to ensure that faulty memory information can be captured normally and thus that the server to be verified satisfies the third condition.
It can be understood that, when the server to be verified satisfies the first condition, the second condition, and the third condition, it can be determined that the server to be verified has a data transmission repair environment.
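Combining the three conditions is a plain conjunction. The sketch below is one possible way to record the outcome of each check and report which ones failed; the class and field names are assumptions introduced for illustration.

```python
from dataclasses import dataclass
from typing import List


@dataclass(frozen=True)
class EnvironmentCheck:
    """Hypothetical record of the three environment conditions."""
    cpu_memory_match: bool           # first condition
    serial_log_has_repair_info: bool  # second condition
    can_capture_faulty_memory: bool   # third condition

    def failed_conditions(self) -> List[str]:
        results = [
            ("first_condition", self.cpu_memory_match),
            ("second_condition", self.serial_log_has_repair_info),
            ("third_condition", self.can_capture_faulty_memory),
        ]
        return [name for name, ok in results if not ok]

    def has_repair_environment(self) -> bool:
        # The environment is in place only when all three conditions hold.
        return not self.failed_conditions()
```

Keeping the list of failed conditions, rather than a single boolean, makes it easier to decide which remedy to apply before re-running the check.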
It is also possible to check whether the register corresponding to the data transmission repair function of the server to be verified is set to 1; if it is not, the data transmission repair function of the server to be verified can be set to an unusable state.
FIG. 4 is a second schematic flow chart of the data transmission repair function verification method provided by the present invention.
Another data transmission repair function verification method is described below with reference to FIG. 4.
In one embodiment, the data transmission repair function verification method may include steps 410 to 450, wherein steps 410 to 440 are the same as or similar to steps 110 to 140 described above. For their specific implementation and beneficial effects, refer to the description above; they are not specifically limited in this embodiment. Step 450 is described below.
In step 450, when the data transmission repair function of the server to be verified is enabled, data transmission is performed based on the server to be verified, so that the server can repair data without interrupting operation.
In one embodiment, when the data transmission repair function of the server to be verified is enabled, data transmission can be performed based on the server to be verified, so that if a device fails the server can repair data without interrupting operation and the customer's business is not affected. The aim is to ensure the continuous reliability of the system and the integrity of the data, and to respond effectively to the challenges posed by device failures.
FIG. 5 is a schematic diagram of an application scenario of the data transmission repair function verification method provided by the present invention.
To further introduce the data transmission repair function verification method provided by the present invention, it is described below with reference to FIG. 5.
In one embodiment, as shown in FIG. 5, the data transmission repair function verification method can be implemented as follows: determine the CPU type of the server to be verified.
If the CPU type is Turin, confirm that the memory is any one of X4, X8, and X16.
If the CPU type is Genoa, determine whether the memory type of the server to be verified is X4.
If the memory type is not X4, replace the memory with X4 memory.
In practice, as CPUs are continually upgraded, the BIOS code is continually adapted, and different platforms do have different CPU and memory requirements, so it is important to choose a CPU and memory that suit the platform. For the Genoa platform, rank4 memory (corresponding to X4 memory) must be selected, and it must have the ChipKill feature. For the Turin platform, the choice of memory is wider: if the CPU performance is strong, a larger memory capacity, such as 16 GB or 32 GB, can be selected to keep the system running smoothly. For platforms that require fast data transmission, DDR4 or later memory can be selected, which offers a higher data transmission rate and lower latency.
When the CPU and memory meet the above requirements, the machine can be restarted to obtain the serial port information.
Determine whether the serial port information contains DPPRCL (corresponding to the memory operation repair information).
If the serial port information does not contain DPPRCL, the ABL function needs to be enabled so that DPPRCL can be captured and the serial port information contains DPPRCL.
When the serial port information contains DPPRCL, determine whether the server to be verified has a BERT module, so that the server to be verified is able to capture faulty memory information.
If the server to be verified does not have a BERT module, its operating system can be replaced with another OS so that the server to be verified has a BERT module after the replacement.
When the server to be verified meets the above conditions, it is determined that the server to be verified has a data transmission repair environment.
In one embodiment, the BIOS firmware version needs a basic serial port debug output mode so that it can be confirmed that the memory DPPRCL content is populated normally. The OS needs a BERT module: on older OS versions the driver can be installed directly, while newer versions ship with the module, which saves environment setup time. Once all conditions are met, the BIOS has built an environment suitable for verifying the nBIF Repair function. nBIF is mainly responsible for data transmission between the CPU and high-speed devices such as memory and graphics cards, so one of its error types is taken as an example: using the CMDAMD tool and the physical address (PhysicalAddress) of the memory, the normalized address of the memory and other related nBIF parameters, such as Socket, nbiohub, and nbio, can be obtained.
Inject correctable errors into the server to be verified.
Determine whether the memory reports an error.
If the memory reports an error, it is determined that the server to be verified has triggered the error mechanism.
If the memory does not report an error, it is determined that the server to be verified fails the nBIF Repair function test.
When it is determined that the server to be verified has triggered the error mechanism, determine whether the server to be verified is operating normally.
If it is operating normally, it is determined that the server to be verified passes the nBIF Repair function test.
If it is not operating normally, it is determined that the server to be verified fails the nBIF Repair function test.
In one embodiment, when the memory reports an error, it is determined that the server to be verified has triggered the error mechanism. The system can be observed to keep running normally, which effectively avoids paralysis of the BMC and the system. At the same time, the terminal outputs the message "RAS Runtime nBIF Repair", which verifies that the server to be verified passes the nBIF Repair function test and provides strong support for keeping the device's data read and write functions working normally. Otherwise, the server to be verified has not passed the nBIF Repair function test.
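The decision sequence of FIG. 5 after injection can be sketched as a two-stage check. This is an illustrative reading of the flow, with the `Verdict` names introduced here as assumptions; the two boolean inputs stand for observations that the description obtains from the memory error report and from the running system.

```python
from enum import Enum


class Verdict(Enum):
    PASS = "nBIF Repair function test passed"
    FAIL_NO_ERROR_REPORTED = "failed: memory did not report the injected error"
    FAIL_NOT_RUNNING = "failed: server did not keep running normally"


def judge_nbif_repair(memory_reported_error: bool, running_normally: bool) -> Verdict:
    # Without an error report the error mechanism was never triggered.
    if not memory_reported_error:
        return Verdict.FAIL_NO_ERROR_REPORTED
    # The error mechanism is triggered; the server must keep operating.
    if not running_normally:
        return Verdict.FAIL_NOT_RUNNING
    return Verdict.PASS
```

Distinguishing the two failure outcomes keeps enough information to trace the cause of a failed test afterwards.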
It should be noted that, to ensure that the server to be verified has the nBIF Repair function, the BIOS module of the server to be verified also needs to be customized. This module can be developed specifically for the functional requirements or obtained by optimizing existing firmware. In addition to conventional server management functions, and without changing the original code, the BIOS needs to support enabling and disabling (disabled/enabled) the NBIO function instruction set so that it can be optimized for different application scenarios. Through commands or a visual interface, the user can choose among a single repair while the service is running, repeated execution of the repair sequence while the service is running, a single repair during service restart, or multiple repairs during service restart.
In yet another embodiment, continuing with the embodiment described above, when it is determined that the server to be verified fails the nBIF Repair function test, the cause of the failure can also be traced back. That is, a first reason (the server to be verified does not satisfy the conditions of the data transmission repair environment) and/or a second reason (the server to be verified cannot operate normally even though the data transmission repair environment is satisfied) is determined, and from these the target reason for failing the nBIF Repair function test is identified. Corresponding rectification is then carried out based on the target reason, so that the rectified server can pass the nBIF Repair function test and thus has the data transmission repair function.
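Tracing the cause of a failed test can be sketched as collecting the first reason (unmet environment conditions) and the second reason (abnormal operation) into one list. The remedies in the table are the ones this description mentions for each condition; the function and key names are hypothetical.

```python
from typing import Dict, List

# Remedies mentioned in the description for each unmet environment condition.
REMEDIATIONS: Dict[str, str] = {
    "first_condition": "replace the memory with a type that matches the CPU (for example X4 on Genoa)",
    "second_condition": "enable the ABL-related function options so that DPPRCL information is logged",
    "third_condition": "switch to an OS that includes the BERT module or install the corresponding plug-in",
}


def diagnose_failure(failed_conditions: List[str], ran_normally: bool) -> List[str]:
    """List candidate causes of a failed nBIF Repair test (illustrative only)."""
    causes = [
        f"first reason: {name} not met; {REMEDIATIONS[name]}"
        for name in failed_conditions
        if name in REMEDIATIONS
    ]
    if not ran_normally:
        causes.append("second reason: server did not operate normally in the repair environment")
    return causes
```

The description leaves the remedy for the second reason open, so the sketch only reports it.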
As described above, the data transmission repair function verification method provided by the present invention obtains the server parameter information of the server to be verified; determines, according to the server parameter information, whether the server to be verified has a data transmission repair environment; when the server has a data transmission repair environment, injects correctable errors into the memory of the running server and obtains the number of correctable errors injected into it; and, based on the number of correctable errors, verifies whether the server to be verified has the data transmission repair function during operation. The method can thus effectively verify whether the server to be verified has the data transmission repair function, which lays the foundation for invoking that function so that the server can repair data without interrupting operation.
The data transmission repair function verification device provided by the present invention is described below. The device described below and the method described above may be read in correspondence with each other.
FIG. 6 is a schematic structural diagram of the data transmission repair function verification device provided by the present invention.
The structure of the data transmission repair function verification device provided by the present invention is described below with reference to FIG. 6.
In an exemplary embodiment of the present invention, as shown in FIG. 6, the data transmission repair function verification device may include an acquisition module 610, a determination module 620, a processing module 630, and a verification module 640, each of which is described below.
The acquisition module 610 may be configured to obtain server parameter information of the server to be verified;
The determination module 620 may be configured to determine, according to the server parameter information, whether the server to be verified has a data transmission repair environment;
The processing module 630 may be configured to, when it is determined that the server to be verified has a data transmission repair environment, inject correctable errors into the memory of the running server to be verified and obtain the number of correctable errors injected into the server to be verified;
The verification module 640 may be configured to verify, based on the number of correctable errors, whether the server to be verified has the data transmission repair function during operation.
In an exemplary embodiment of the present invention, the verification module 640 may also be configured to:
obtain the tolerable error count threshold of the server to be verified;
The verification module 640 may verify, based on the number of correctable errors, whether the server to be verified has the data transmission repair function during operation in the following manner:
when the number of correctable errors is greater than the tolerable error count threshold, determining that the server to be verified has triggered the error mechanism;
when the server to be verified has triggered the error mechanism, determining whether the server to be verified is operating normally;
when the server to be verified is able to operate normally, determining that the server to be verified has the data transmission repair function during operation;
when the server to be verified is not able to operate normally, determining that the server to be verified does not have the data transmission repair function during operation.
In an exemplary embodiment of the present invention, the tolerable error count threshold of the server to be verified includes any one or more of the error accumulation threshold of the server to be verified, the error storm threshold of the server to be verified, and a minimum threshold common divisor, wherein the minimum threshold common divisor is the minimum common divisor of the error accumulation threshold and the error storm threshold of the server to be verified.
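The threshold comparison performed by the verification module can be sketched as follows. The threshold is treated as a given integer, however it was derived; returning `None` when the error mechanism is not triggered is an assumption of this sketch, since the description only covers the case where the count exceeds the threshold.

```python
from typing import Optional


def error_mechanism_triggered(correctable_error_count: int, tolerable_threshold: int) -> bool:
    """The error mechanism is triggered when the count exceeds the threshold."""
    return correctable_error_count > tolerable_threshold


def has_repair_function(
    correctable_error_count: int,
    tolerable_threshold: int,
    running_normally: bool,
) -> Optional[bool]:
    if not error_mechanism_triggered(correctable_error_count, tolerable_threshold):
        # Assumption: not enough errors were injected, so nothing can be concluded.
        return None
    # Triggered: the server has the repair function only if it keeps running.
    return running_normally
```

Note the strict comparison: a count equal to the threshold does not trigger the error mechanism under this reading.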
In an exemplary embodiment of the present invention, the processing module 630 may also be configured to:
when it is determined that the central processing unit of the server to be verified is in an encrypted state, decrypt the central processing unit, yielding a server to be verified whose central processing unit is in a decrypted state;
The processing module 630 may inject correctable errors into the memory of the running server to be verified in the following manner: injecting correctable errors into the memory of the running server to be verified whose central processing unit is in the decrypted state.
In an exemplary embodiment of the present invention, the processing module 630 may also be configured to: when it is determined that the central processing unit of the server to be verified is in a non-encrypted state, inject correctable errors into the memory of the running server to be verified whose central processing unit is in the non-encrypted state.
In an exemplary embodiment of the present invention, the server parameter information includes at least target central processing unit type information, target memory type information, and target memory operation repair information;
The determination module 620 may be configured to determine, according to the server parameter information, whether the server to be verified has a data transmission repair environment in the following manner:
when the server to be verified satisfies a first condition, a second condition, and a third condition, determining that the server to be verified has a data transmission repair environment, wherein:
when it is determined, according to a pre-maintained first mapping table, that the target central processing unit type information and the target memory type information match, it is determined that the server to be verified satisfies the first condition, wherein the first mapping table includes matching relationships formed by the central processing unit type information and the memory type information of a target server, and the target server is a server with a data transmission function;
when the memory operation repair information is found in the serial port log of the server to be verified, it is determined that the server to be verified is able to capture memory operation repair information, and this capability is taken to mean that the server to be verified satisfies the second condition;
when it is detected that the server to be verified is able to capture faulty memory information, it is determined that the server to be verified satisfies the third condition.
In an exemplary embodiment of the present invention, the verification module 640 may also be configured to:
when the data transmission repair function of the server to be verified is enabled, perform data transmission based on the server to be verified, so that the server to be verified can repair data without interrupting operation.
FIG. 7 illustrates a schematic diagram of the physical structure of an electronic device. As shown in FIG. 7, the electronic device may include a processor 710, a communications interface 720, a memory 730, and a communication bus 740, wherein the processor 710, the communications interface 720, and the memory 730 communicate with one another through the communication bus 740. The processor 710 may call logic instructions in the memory 730 to execute a data transmission repair function verification method, the method comprising: obtaining server parameter information of a server to be verified; determining, according to the server parameter information, whether the server to be verified has a data transmission repair environment; when it is determined that the server to be verified has a data transmission repair environment, injecting correctable errors into the memory of the running server to be verified and obtaining the number of correctable errors injected into the server to be verified; and verifying, based on the number of correctable errors, whether the server to be verified has the data transmission repair function during operation.
In addition, the logic instructions in the memory 730 may be implemented as software functional units and, when sold or used as an independent product, may be stored in a computer-readable storage medium. On this understanding, the technical solution of the present invention, in essence, or the part of it that contributes to the prior art, or a part of the technical solution, may be embodied as a software product. The computer software product is stored in a storage medium and includes a number of instructions that cause a computer device (which may be a personal computer, a server, a network device, or the like) to perform all or some of the steps of the methods described in the embodiments of the present invention. The storage medium includes any medium that can store program code, such as a USB flash drive, a removable hard disk, read-only memory (ROM), random access memory (RAM), a magnetic disk, or an optical disc.
In another aspect, the present invention further provides a computer program product. The computer program product includes a computer program that can be stored on a non-transitory computer-readable storage medium. When the computer program is executed by a processor, the computer can perform the data transmission repair function verification method provided by the methods above, which includes: obtaining server parameter information of a server to be verified; determining, according to the server parameter information, whether the server to be verified has a data transmission repair environment; when the server to be verified is determined to have a data transmission repair environment, injecting correctable errors into the memory of the running server to be verified and obtaining the number of correctable errors injected into the server to be verified; and verifying, based on the number of correctable errors, whether the server to be verified has the data transmission repair function while running.
In yet another aspect, the present invention further provides a non-transitory computer-readable storage medium storing a computer program that, when executed by a processor, performs the data transmission repair function verification method provided by the methods above, which includes: obtaining server parameter information of a server to be verified; determining, according to the server parameter information, whether the server to be verified has a data transmission repair environment; when the server to be verified is determined to have a data transmission repair environment, injecting correctable errors into the memory of the running server to be verified and obtaining the number of correctable errors injected into the server to be verified; and verifying, based on the number of correctable errors, whether the server to be verified has the data transmission repair function while running.
The apparatus embodiments described above are merely illustrative. Units described as separate components may or may not be physically separate, and components shown as units may or may not be physical units; that is, they may be located in one place or distributed across multiple network units. Some or all of the modules may be selected according to actual needs to achieve the purpose of the solution of this embodiment. A person of ordinary skill in the art can understand and implement it without creative effort.
From the description of the embodiments above, a person skilled in the art can clearly understand that each embodiment may be implemented by software together with a necessary general-purpose hardware platform, or alternatively by hardware. On this understanding, the technical solution above, in essence, or the part of it that contributes to the prior art, may be embodied as a software product. The computer software product may be stored in a computer-readable storage medium, such as ROM/RAM, a magnetic disk, or an optical disc, and includes a number of instructions that cause a computer device (which may be a personal computer, a server, a network device, or the like) to perform the methods described in the embodiments or in certain parts of the embodiments.
Finally, it should be noted that the embodiments above are intended only to illustrate the technical solutions of the present invention, not to limit them. Although the present invention has been described in detail with reference to the foregoing embodiments, a person of ordinary skill in the art should understand that the technical solutions described in those embodiments may still be modified, and some of their technical features may be replaced with equivalents, without causing the essence of the corresponding technical solutions to depart from the spirit and scope of the technical solutions of the embodiments of the present invention.
Claims (10)
Priority Applications (1)
Application Number | Priority Date | Filing Date | Title |
---|---|---|---|
CN202410813363.0A CN118747130A (en) | 2024-06-21 | 2024-06-21 | Data transmission repair function verification method, device, electronic device and storage medium |
Publications (1)
Publication Number | Publication Date |
---|---|
CN118747130A (en) | 2024-10-08 |
Family
ID=92919096
Family Applications (1)
Application Number | Priority Date | Filing Date | Title |
---|---|---|---|
CN202410813363.0A Pending CN118747130A (en) | 2024-06-21 | 2024-06-21 | Data transmission repair function verification method, device, electronic device and storage medium |
Country Status (1)
Country | Link |
---|---|
CN (1) | CN118747130A (en) |
- 2024-06-21 CN CN202410813363.0A patent/CN118747130A/en active Pending
Legal Events
Date | Code | Title | Description |
---|---|---|---|
PB01 | Publication | ||
SE01 | Entry into force of request for substantive examination | ||