CN116069569A

CN116069569A - Sensitivity analysis method of Spark program on configuration parameters

Info

Publication number: CN116069569A
Application number: CN202111283681.3A
Authority: CN
Inventors: 苏子浩; 喻之斌; 陈超; 曾思棋; 杨永魁
Original assignee: Shenzhen Institute of Advanced Technology of CAS
Current assignee: Shenzhen Institute of Advanced Technology of CAS
Priority date: 2021-11-01
Filing date: 2021-11-01
Publication date: 2023-05-05

Abstract

The invention discloses a method for analyzing the sensitivity of a Spark program to configuration parameters. The method comprises the following steps: for the configuration parameter set of a Spark program, randomly generating a plurality of different configurations to form a sensitivity test group; running the Spark program with each configuration in the sensitivity test group to obtain the real running time corresponding to each configuration, and calculating the standard deviation of the running times; and judging the sensitivity of the current Spark program to the configuration parameters by comparing the standard deviation of the running times with a set threshold, and then using the sensitivity to determine how much optimization time to allocate. The invention can effectively characterize the sensitivity of the Spark program to be optimized to configuration parameters, and addresses the problem of high time cost with low optimization benefit.

Description

Sensitivity analysis method of Spark program on configuration parameters

Technical Field

The invention relates to the technical field of big data processing, and in particular to a method for analyzing the sensitivity of a Spark program to configuration parameters.

Background

In recent years, with the rapid development of internet technology, big data application scenarios have received more and more attention. Take Spark, a general-purpose parallel framework for big data systems, as an example: Spark covers various workloads, such as batch processing programs, interactive user programs, iterative algorithms and the like. Spark extends and refines the MapReduce model, and through an in-memory pipelined computation mode it reduces read/write operations on disk, which greatly improves data processing speed.

To meet usage requirements in different scenarios, the Spark framework exposes a large number of configuration parameters, as many as 240, to the end user. Because applications differ in their characteristics, running a program with default parameters often limits system performance and leaves system resources underused. Therefore, a number of automatic configuration parameter tuning methods based on machine learning or parameter search have been proposed. Such a tuning method first collects running times under different parameters and then trains on the relation between configuration parameters and running time to obtain a performance model. When new data is processed, a suitable configuration is obtained by searching the performance model, which achieves the goal of optimizing the configuration parameters. Such methods require a large amount of sample data to build the training set, and are computationally inefficient and time-consuming.

Experimental observation shows that the performance of some Spark programs fluctuates significantly as configuration parameters change. For example, when the same Spark job is run on the same Spark cluster with different configuration parameters, the running time can differ by up to 15 times. However, existing automatic tuning methods for Spark configuration parameters do not distinguish whether a Spark program is sensitive to configuration parameters; in other words, they assume by default that every Spark program is sensitive. Whether or not the current Spark program is actually sensitive, they collect a large number of running times under different configurations (typically hundreds to thousands of runs) and then train on this raw data. Since a single run of a Spark program is already very expensive for large data sizes, repeatedly running the program to collect the training samples required for machine learning is more expensive still. Furthermore, because the sensitivity of the application to configuration parameters is not considered when collecting training samples, a high time cost may be paid without obtaining a corresponding optimization benefit.

Disclosure of Invention

The object of the present invention is to overcome the above-mentioned drawbacks of the prior art and to provide a method for analyzing the sensitivity of a Spark program to configuration parameters. The method comprises the following steps:

step S1: for the configuration parameter set of a Spark program, randomly generating a plurality of different configurations to form a sensitivity test group;

step S2: running the Spark program with each configuration in the sensitivity test group to obtain the real running time corresponding to each configuration, and calculating the standard deviation of the running times;

step S3: judging the sensitivity of the current Spark program to the configuration parameters by comparing the standard deviation of the running times with a set threshold, and then determining an optimization space by using the sensitivity.

Compared with the prior art, the invention addresses the problem of high time cost and low optimization benefit by checking the sensitivity of the Spark program to configuration parameters before optimizing it, and then determining the program's optimizable space from the result of that check. The invention effectively avoids paying a very high optimization time cost when the optimization space of the Spark program is small, and markedly shortens the running-time collection process, so that better configuration parameters for the program can be found more efficiently.

Other features of the present invention and its advantages will become apparent from the following detailed description of exemplary embodiments of the invention, which proceeds with reference to the accompanying drawings.

Drawings

The accompanying drawings, which are incorporated in and constitute a part of this specification, illustrate embodiments of the invention and together with the description, serve to explain the principles of the invention.

FIG. 1 is a flow chart of a method for analyzing the sensitivity of a Spark program to configuration parameters according to one embodiment of the invention;

FIG. 2 is a process diagram of a method for analyzing the sensitivity of a Spark program to configuration parameters according to one embodiment of the invention;

FIG. 3 is a schematic diagram of a verification result according to one embodiment of the invention;

FIG. 4 is a comparison schematic of the optimization results according to one embodiment of the invention.

Detailed Description

Various exemplary embodiments of the present invention will now be described in detail with reference to the accompanying drawings. It should be noted that: the relative arrangement of the components and steps, numerical expressions and numerical values set forth in these embodiments do not limit the scope of the present invention unless it is specifically stated otherwise.

The following description of at least one exemplary embodiment is merely exemplary in nature and is in no way intended to limit the invention, its application, or uses.

Techniques, methods, and apparatus known to one of ordinary skill in the relevant art may not be discussed in detail, but are intended to be part of the specification where appropriate.

In all examples shown and discussed herein, any specific values should be construed as merely illustrative, and not a limitation. Thus, other examples of exemplary embodiments may have different values.

It should be noted that: like reference numerals and letters denote like items in the following figures, and thus once an item is defined in one figure, no further discussion thereof is necessary in subsequent figures.

The sensitivity analysis method provided by the invention proceeds as follows. A plurality of Spark configurations are generated using random numbers, and these configurations form a sensitivity test group. The current Spark program is then actually run with each configuration in the sensitivity test group, and the real running time is recorded. Next, the standard deviation of this set of real running times is calculated. Finally, whether the current Spark program is sensitive to the configuration parameters is judged from the calculated standard deviation and a set standard deviation threshold, which determines the optimization space.

In particular, as shown in connection with fig. 1 and 2, the provided method comprises the following steps.

Step S110, generating basic random configuration parameters for the configuration parameter set of the Spark program.

First, the range of values for each configuration parameter is determined, and then the parameters are randomly generated within the range. The policies for generating random configuration parameters are different for different types of configuration parameters.

For example, a Boolean parameter turns a function on or off, with True and False indicating the on and off states, such as whether spilled data is compressed; for this type of parameter, True or False may be randomly generated.

A category-type parameter selects one of several choices, such as one of 4 compression codec types (lz4, lzf, snappy and zstd); for this type of parameter, one of the choices can be randomly selected as the parameter value.

The continuous parameters are used for parameter configuration within a certain continuous range, and can be specifically classified into integer type (such as the number of required CPU cores) and floating point type (such as the percentage of the memory used for program running in the heap). In one embodiment, a range of such type parameters (e.g., the number of CPU cores available in a cluster) may be determined based on the actual physical resources of the system, and then the desired integer or floating point type parameters may be randomly generated within the range.
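The per-type generation strategy described above can be sketched in Python as follows. This is a minimal illustration rather than the implementation of the invention: the mapping of the described parameters to the Spark property names below, the numeric ranges, and the helper name random_config are all assumptions made for the example.

```python
import random

# Hypothetical search space. The property names are real Spark settings chosen
# to match the examples in the text; the numeric ranges are assumed values.
SPACE = {
    "spark.shuffle.spill.compress": ("bool", None),
    "spark.io.compression.codec": ("category", ["lz4", "lzf", "snappy", "zstd"]),
    "spark.executor.cores": ("int", (1, 20)),
    "spark.memory.fraction": ("float", (0.3, 0.9)),
}

def random_config(space, rng):
    """Draw one random value per parameter according to its type."""
    config = {}
    for name, (kind, domain) in space.items():
        if kind == "bool":
            config[name] = rng.choice([True, False])
        elif kind == "category":
            config[name] = rng.choice(domain)
        elif kind == "int":
            low, high = domain
            config[name] = rng.randint(low, high)  # inclusive on both ends
        elif kind == "float":
            low, high = domain
            config[name] = rng.uniform(low, high)
        else:
            raise ValueError(f"unknown parameter type: {kind}")
    return config

# A seeded generator makes the basic random configuration reproducible.
base = random_config(SPACE, random.Random(0))
```

In practice the integer and floating-point ranges would be derived from the physical resources of the cluster, as the paragraph above describes.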

Step S120, perturbing the basic configuration parameters multiple times to construct a sensitivity test group.

In this step, several parameters are randomly selected from the basic random configuration parameters generated above, and a "perturbation" is applied to each of them. For example, the perturbation rules are:

if the currently selected parameter is a Boolean parameter, the current value is inverted, i.e., True becomes False and False becomes True;

if the currently selected parameter is a category-type parameter, the new value is randomly selected from the other selectable categories. For example, there are 4 compression codec types: lz4, lzf, snappy and zstd. If the currently selected compression codec type is lz4, one of lzf, snappy and zstd is randomly selected as the new parameter value;

if the currently selected parameter is a continuous parameter, a feasible perturbation value is added to the current value such that the new value is still a legal, feasible parameter value. For example, if the current value of the required number of CPU cores is 12 and the number of CPU cores available in the cluster is 20, then a perturbation is added to the current number of CPU cores while ensuring that the new parameter value is still smaller than the number of CPU cores available in the cluster.
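The three perturbation rules can be sketched as a single Python function. This is an illustrative sketch under stated assumptions: the function name perturb, the (kind, domain) encoding, and the way the offset is drawn for continuous parameters are hypothetical choices, since the text only requires that the new value remain legal. Ranges are assumed to contain more than one value.

```python
import random

def perturb(kind, domain, value, rng):
    """Return a new legal value for one parameter, following the rules above."""
    if kind == "bool":
        return not value                      # True <-> False
    if kind == "category":
        others = [c for c in domain if c != value]
        return rng.choice(others)             # any category except the current one
    low, high = domain
    if kind == "int":
        # Nonzero offsets that keep value + delta inside [low, high].
        deltas = [d for d in range(low - value, high - value + 1) if d != 0]
        return value + rng.choice(deltas)
    if kind == "float":
        delta = rng.uniform(low - value, high - value)
        # Clamp to guard against floating-point rounding at the range edges.
        return min(high, max(low, value + delta))
    raise ValueError(f"unknown parameter type: {kind}")

# Example from the text: 12 cores requested; the range 1..20 is an assumed bound.
new_cores = perturb("int", (1, 20), 12, random.Random(0))
```

The integer branch always changes the value, while the floating-point branch may in rare cases return a value very close to the original; a minimum step size could be added if that matters.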

The process of adding a perturbation to several of the configuration parameters is described in specific examples below. Assuming that the current basic configuration parameters consist of 38 parameters, denoted Base, expressed as:

$$\mathrm{Base} = (C_1, C_2, C_3, C_4, C_5, \ldots, C_{38})$$

When adding perturbations, 5 of the 38 parameters are randomly selected. Assuming the current selection is $\{C_1, C_2, C_4, C_5, C_{38}\}$, the basic configuration parameters are modified according to the perturbation rules to obtain the following new configuration, denoted Current:

$$\mathrm{Current} = (C_1 + \Delta C_1,\ C_2 + \Delta C_2,\ C_3,\ C_4 + \Delta C_4,\ C_5 + \Delta C_5,\ \ldots,\ C_{38} + \Delta C_{38})$$

Through the above processing, one new configuration is obtained. Repeating these steps n times yields n feasible random configurations, which constitute the sensitivity test group. The configurations in the sensitivity test group are the configurations under which the Spark program under test will be run.
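The construction of the sensitivity test group can be sketched as below. This is a simplified illustration, not the patented implementation: build_test_group and shift are hypothetical names, every parameter is modeled as a plain integer for brevity, and each of the n configurations is assumed to be derived from the same unmodified Base.

```python
import random

def build_test_group(base, perturb_one, n, k, rng):
    """Create n configurations, each perturbing k randomly chosen parameters of base."""
    group = []
    for _ in range(n):
        current = dict(base)                      # never mutate the base configuration
        for name in rng.sample(sorted(base), k):  # k distinct parameter names
            current[name] = perturb_one(name, current[name], rng)
        group.append(current)
    return group

def shift(name, value, rng):
    """Toy perturbation: add a small nonzero integer offset (assumed rule)."""
    return value + rng.choice([-3, -2, -1, 1, 2, 3])

# 38 parameters C1..C38, 5 of which are perturbed per configuration, as in the text.
base = {f"C{i}": 50 for i in range(1, 39)}
group = build_test_group(base, shift, n=5, k=5, rng=random.Random(0))
```

Each entry of group differs from base in exactly 5 of the 38 parameters, which keeps the configurations similar overall while still differing from one another.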

Through multiple random perturbations, the resulting sensitivity test group covers the optional configuration parameter combinations and value ranges as broadly as possible, while the configurations remain similar overall yet differ from one another to some degree, which improves the accuracy of the subsequently optimized configuration parameters. It should be appreciated that the number of perturbed parameters may be determined based on the number of basic configuration parameters, computational efficiency requirements, and the like.

Step S130, repeatedly running the Spark program with the configurations in the sensitivity test group to obtain the corresponding running times.

In this step, feasible configurations are taken from the sensitivity test group one at a time, and the Spark program under test is actually run on a Spark computing cluster under the current configuration.

When the Spark program under test finishes running on the cluster under the current configuration, its real running time is recorded. Repeating this step n-1 more times traverses the current sensitivity test group, yielding the real running time of the Spark program under test for every configuration in the group.
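One possible way to collect the running times is to launch the program once per configuration through the spark-submit command line, passing each setting with --conf. The sketch below is an assumption about tooling, not something the source specifies: the function names are hypothetical, and the measured wall-clock time includes job submission overhead. The run argument is injectable so that the loop can be exercised without a cluster.

```python
import subprocess
import time

def spark_submit_command(app, config):
    """Build a spark-submit command that passes each setting via --conf."""
    cmd = ["spark-submit"]
    for name, value in config.items():
        # Spark expects lowercase true/false for Boolean properties.
        text = str(value).lower() if isinstance(value, bool) else str(value)
        cmd += ["--conf", f"{name}={text}"]
    return cmd + [app]

def measure_runtimes(app, test_group, run=subprocess.run):
    """Run the application once per configuration; return wall-clock seconds."""
    runtimes = []
    for config in test_group:
        start = time.perf_counter()
        run(spark_submit_command(app, config), check=True)  # raises if the job fails
        runtimes.append(time.perf_counter() - start)
    return runtimes
```

A call such as measure_runtimes("query04.py", group), where the application file name is a placeholder, would return one running time per configuration in the sensitivity test group.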

Step S140, measuring the sensitivity of execution performance to the configuration parameters by using the obtained running times, and then determining the optimizable space of the Spark program.

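Assume that, for the current sensitivity test group, the n runs of the Spark program take times $\{t_0, t_1, \ldots, t_{n-1}\}$. The mean of these n running times is

$$\bar{t} = \frac{1}{n}\sum_{i=0}^{n-1} t_i$$

and the variance var and standard deviation σ are expressed as:

$$\mathrm{var} = \frac{1}{n}\sum_{i=0}^{n-1}\left(t_i - \bar{t}\right)^2, \qquad \sigma = \sqrt{\mathrm{var}}$$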

If the standard deviation is smaller than the set value, the running time of the Spark program under test fluctuates little across the n runs with different configurations, and the current Spark program may be considered insensitive to configuration parameters. This means that the optimizable space of the current Spark program is small as far as configuration parameter tuning is concerned. In that case there is no need to spend a high optimization time cost on in-depth fine tuning, and a "shallow" optimization suffices. Conversely, a standard deviation above the set value indicates that the program is sensitive to configuration parameters and has a larger optimizable space. This judgment guides the allocation of time in the subsequent Spark program tuning strategy, which improves tuning efficiency and maintains the balance between "optimization time cost" and "optimization effect".
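The judgment in step S140 can be sketched in a few lines of Python. The sketch assumes the population form of the standard deviation (dividing by n), which is what statistics.pstdev computes; the function name, the sample running times and the threshold value are made up for illustration.

```python
import statistics

def is_sensitive(runtimes, threshold):
    """Return (sensitive, sigma), where sigma is the population standard deviation."""
    sigma = statistics.pstdev(runtimes)
    # Below the threshold the program is treated as insensitive.
    return sigma >= threshold, sigma

# Illustrative running times in milliseconds (not measured data).
stable = [100, 102, 98, 101, 99]      # mean 100, variance 2
volatile = [100, 300, 150, 900, 200]  # mean 330, variance 85600

print(is_sensitive(stable, threshold=10))    # sigma is sqrt(2), about 1.41
print(is_sensitive(volatile, threshold=10))  # sigma is about 292.6
```

With a threshold of 10 ms the first program would be judged insensitive and the second sensitive. How to choose the threshold is left open here, as the text only refers to a set value.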

To further verify the effect of the present invention, experiments were performed. Sensitivity analysis with respect to configuration parameters was performed on Spark programs from the TPC-DS big data benchmark suite. The Query04 and Query08 programs in TPC-DS were selected to test the invention, and a sensitivity test group of size 5 was generated for each of them. A computing cluster was built from 3 servers with the x86 architecture. Each machine has 64 GB of memory and 56 cores, and the CPU model is Intel(R) Xeon(R) CPU E5-2683 v3 @ 2.00GHz. The data used was generated by TPC-DS, with a data size of 50 GB. The Spark version used is 2.4.5.

FIG. 3 shows the real running times obtained for Query04 and Query08 after repeated runs with the configurations in the corresponding sensitivity test groups, where the ordinate indicates running time in milliseconds. As can be seen from FIG. 3, the running time of Query04 fluctuates more. Query04 is therefore considered to have more room for optimization, and more optimization time is allocated to it in the subsequent search for optimal configuration parameters so as to obtain a better configuration. Conversely, the running time of Query08 fluctuates less, so its optimization space is considered limited, and the optimization time allocated to Query08 in the subsequent search is shortened.

FIG. 4 compares the optimal configurations obtained for Query04 and Query08 with and without applying the invention before Spark configuration parameter tuning. "Default configuration" denotes the running time of the Spark program under the default configuration of the Spark framework. "With sensitivity analysis" means that the sensitivity of the Spark program to configuration parameters is analyzed using the invention. Because Query08 is insensitive to Spark configuration parameters, its configuration tuning time is reduced by half; specifically, the number of tuning iterations is reduced, which shortens the time. "Without sensitivity analysis" means that the invention is not used and the default number of tuning iterations is performed regardless of whether Query08 is sensitive to configuration parameters. As can be seen from FIG. 4, both the optimal configuration obtained without sensitivity analysis and the optimal configuration obtained with the invention significantly speed up the Spark program, and the two achieve similar results, which indicates that reducing the number of tuning iterations does not degrade the performance of the resulting optimal configuration.
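The way the sensitivity result feeds into the tuning budget can be sketched as follows. The halving factor mirrors the roughly 50% reduction in tuning time reported for Query08, but the function name, its parameters and the exact rule are assumptions for illustration only; the source does not give a formula for the reduced iteration count.

```python
def tuning_iterations(sigma, threshold, default_iterations, reduction=0.5):
    """Pick the number of tuning iterations from the sensitivity result."""
    if sigma < threshold:
        # Insensitive program: "shallow" optimization with fewer iterations.
        return max(1, int(default_iterations * reduction))
    # Sensitive program: keep the full default budget.
    return default_iterations

# Illustrative values: a threshold of 10 ms and a default budget of 100 iterations.
budget_insensitive = tuning_iterations(sigma=1.4, threshold=10, default_iterations=100)
budget_sensitive = tuning_iterations(sigma=292.6, threshold=10, default_iterations=100)
```

Under these assumed values an insensitive program receives 50 iterations and a sensitive one keeps all 100.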

The experimental results show that analyzing the sensitivity of the Spark program to configuration parameters before tuning greatly reduces the time required for tuning while preserving the tuning effect. Compared with the prior art, the invention provides the Spark program with optimal configuration parameters of similar performance while shortening the configuration parameter optimization time.

In summary, compared with the prior art, the invention has at least the following technical effects:

1) Existing automatic tuning methods for Spark configuration parameters need to run the Spark program repeatedly during tuning, so the tuning time cost is high and the return may be low. By analyzing sensitivity to configuration parameters before optimizing the Spark program, the invention avoids the situation of high time cost with low optimization benefit.

2) Existing methods lack real running-time information when judging the sensitivity of a Spark program to configuration parameters. Typically, an experienced developer manually reviews the business code and relies on experience and established rules. The invention provides a method for constructing a sensitivity test group that ensures the configurations in the group differ to some degree while remaining similar overall, thereby effectively characterizing the sensitivity of the Spark program to be optimized to configuration parameters.

3) The method for detecting the sensitivity of the Spark program to be optimized to configuration parameters is very simple: only one sensitivity test group needs to be constructed, which lays a foundation for a more efficient subsequent search for the preferred configuration.

It should be noted that the present invention can be used not only with the Spark framework but also with other similar frameworks, such as Hadoop and Flink, where it can likewise maintain the balance between "tuning time cost" and "tuning effect".

The present invention may be a system, method, and/or computer program product. The computer program product may include a computer readable storage medium having computer readable program instructions embodied thereon for causing a processor to implement aspects of the present invention.

The computer readable storage medium may be a tangible device that can hold and store instructions for use by an instruction execution device. The computer readable storage medium may be, for example, but not limited to, an electronic storage device, a magnetic storage device, an optical storage device, an electromagnetic storage device, a semiconductor storage device, or any suitable combination of the foregoing. More specific examples (a non-exhaustive list) of the computer readable storage medium include the following: portable computer disks, hard disks, random access memory (RAM), read-only memory (ROM), erasable programmable read-only memory (EPROM or flash memory), static random access memory (SRAM), portable compact disk read-only memory (CD-ROM), digital versatile disks (DVD), memory sticks, floppy disks, mechanically encoded devices such as punch cards or raised structures in a groove having instructions stored thereon, and any suitable combination of the foregoing. Computer readable storage media, as used herein, are not to be construed as transitory signals per se, such as radio waves or other freely propagating electromagnetic waves, electromagnetic waves propagating through waveguides or other transmission media (e.g., optical pulses through fiber optic cables), or electrical signals transmitted through wires.

The computer readable program instructions described herein may be downloaded from a computer readable storage medium to a respective computing/processing device or to an external computer or external storage device over a network, such as the internet, a local area network, a wide area network, and/or a wireless network. The network may include copper transmission cables, fiber optic transmissions, wireless transmissions, routers, firewalls, switches, gateway computers and/or edge servers. The network interface card or network interface in each computing/processing device receives computer readable program instructions from the network and forwards the computer readable program instructions for storage in a computer readable storage medium in the respective computing/processing device.

Computer program instructions for carrying out operations of the present invention may be assembly instructions, instruction set architecture (ISA) instructions, machine instructions, machine-dependent instructions, microcode, firmware instructions, state setting data, or source or object code written in any combination of one or more programming languages, including object oriented programming languages such as Smalltalk, C++, Python, and the like, and conventional procedural programming languages, such as the "C" programming language or similar programming languages. The computer readable program instructions may be executed entirely on the user's computer, partly on the user's computer, as a stand-alone software package, partly on the user's computer and partly on a remote computer, or entirely on the remote computer or server. In the case of a remote computer, the remote computer may be connected to the user's computer through any kind of network, including a local area network (LAN) or a wide area network (WAN), or may be connected to an external computer (for example, through the Internet using an Internet service provider). In some embodiments, electronic circuitry, such as programmable logic circuitry, field-programmable gate arrays (FPGAs), or programmable logic arrays (PLAs), is personalized with state information of the computer readable program instructions, and the electronic circuitry can then execute the computer readable program instructions to implement aspects of the present invention.

Various aspects of the present invention are described herein with reference to flowchart illustrations and/or block diagrams of methods, apparatus (systems) and computer program products according to embodiments of the invention. It will be understood that each block of the flowchart illustrations and/or block diagrams, and combinations of blocks in the flowchart illustrations and/or block diagrams, can be implemented by computer-readable program instructions.

These computer readable program instructions may be provided to a processor of a general purpose computer, special purpose computer, or other programmable data processing apparatus to produce a machine, such that the instructions, which execute via the processor of the computer or other programmable data processing apparatus, create means for implementing the functions/acts specified in the flowchart and/or block diagram block or blocks. These computer readable program instructions may also be stored in a computer readable storage medium that can direct a computer, programmable data processing apparatus, and/or other devices to function in a particular manner, such that the computer readable medium having the instructions stored therein includes an article of manufacture including instructions which implement the function/act specified in the flowchart and/or block diagram block or blocks.

The computer readable program instructions may also be loaded onto a computer, other programmable data processing apparatus, or other devices to cause a series of operational steps to be performed on the computer, other programmable apparatus or other devices to produce a computer implemented process such that the instructions which execute on the computer, other programmable apparatus or other devices implement the functions/acts specified in the flowchart and/or block diagram block or blocks.

The flowcharts and block diagrams in the figures illustrate the architecture, functionality, and operation of possible implementations of systems, methods and computer program products according to various embodiments of the present invention. In this regard, each block in the flowchart or block diagrams may represent a module, segment, or portion of instructions, which comprises one or more executable instructions for implementing the specified logical function(s). In some alternative implementations, the functions noted in the block may occur out of the order noted in the figures. For example, two blocks shown in succession may, in fact, be executed substantially concurrently, or the blocks may sometimes be executed in the reverse order, depending upon the functionality involved. It will also be noted that each block of the block diagrams and/or flowchart illustration, and combinations of blocks in the block diagrams and/or flowchart illustration, can be implemented by special purpose hardware-based systems which perform the specified functions or acts, or combinations of special purpose hardware and computer instructions. It is well known to those skilled in the art that implementation by hardware, implementation by software, and implementation by a combination of software and hardware are all equivalent.

The foregoing description of embodiments of the invention has been presented for purposes of illustration and description, and is not intended to be exhaustive or limited to the embodiments disclosed. Many modifications and variations will be apparent to those of ordinary skill in the art without departing from the scope and spirit of the various embodiments described. The terminology used herein was chosen in order to best explain the principles of the embodiments, the practical application, or the technical improvements in the marketplace, or to enable others of ordinary skill in the art to understand the embodiments disclosed herein. The scope of the invention is defined by the appended claims.

Claims

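1. A method of sensitivity analysis of a Spark program with respect to configuration parameters, comprising the steps of:

step S1: for the configuration parameter set of a Spark program, randomly generating a plurality of different configurations to form a sensitivity test group;

step S2: running the Spark program with each configuration in the sensitivity test group to obtain the real running time corresponding to each configuration, and calculating the standard deviation of the running times;

step S3: judging the sensitivity of the current Spark program to the configuration parameters by comparing the standard deviation of the running times with a set threshold, and then determining an optimization space by using the sensitivity.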

2. The method according to claim 1, wherein the sensitivity test set is obtained according to the steps of:

generating a basic random configuration parameter set aiming at the configuration parameter set of the Spark program;

applying n perturbation operations to the basic random configuration parameter set to obtain n Spark configurations, and forming a sensitivity test group from the n configurations.

3. The method of claim 2, wherein the generating the basic set of random configuration parameters for the Spark program comprises:

for the Boolean configuration parameters, randomly generating True or False values;

for category type configuration parameters, randomly selecting and generating corresponding parameters from selectable categories;

for continuous configuration parameters, determining the value range according to the actual physical resources, and randomly generating an integer or floating-point parameter within that range.

4. The method of claim 2, wherein the rules of perturbation operations are set to:

for Boolean configuration parameters, inverting the current value;

for category-type configuration parameters, randomly selecting from the selectable categories other than the current value;

for continuous configuration parameters, adding a feasible perturbation value to the current value to form a new value that is within the legal range.

5. The method of claim 2, wherein using n perturbation operations on the basic set of random configuration parameters comprises:

for the basic configuration parameter set $\mathrm{Base} = (C_1, C_2, C_3, C_4, C_5, \ldots, C_m)$, randomly selecting a set number of parameters from it and adding perturbations to them to obtain a new configuration parameter set;

repeating the steps for n times to obtain n feasible random configuration parameter sets, and forming a sensitivity test group by using the n feasible random configuration parameter sets.

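6. The method of claim 5, wherein the standard deviation of the running time is determined according to the following equations:

$$\bar{t} = \frac{1}{n}\sum_{i=0}^{n-1} t_i, \qquad \mathrm{var} = \frac{1}{n}\sum_{i=0}^{n-1}\left(t_i - \bar{t}\right)^2, \qquad \sigma = \sqrt{\mathrm{var}}$$

wherein $\{t_0, t_1, \ldots, t_{n-1}\}$ are the running times of the n runs of the Spark program, $\bar{t}$ is the mean of the n running times, var represents the variance and σ represents the standard deviation.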

7. The method of claim 6, wherein the current Spark program is deemed insensitive to the relevant configuration parameters in the sensitivity test group if the resulting standard deviation σ is less than a set threshold.

8. The method of claim 1, wherein determining an optimization space using the sensitivity comprises: reducing the number of tuning iterations for a configuration parameter combination that is judged to be insensitive.

9. A computer readable storage medium having stored thereon a computer program, wherein the program when executed by a processor realizes the steps of the method according to any of claims 1 to 8.

10. A computer device comprising a memory and a processor, on which memory a computer program is stored which can be run on the processor, characterized in that the processor implements the steps of the method according to any of claims 1 to 8 when the program is executed.