CN108647771A - A method for laying out scientific workflow data in a hybrid cloud environment - Google Patents
A method for laying out scientific workflow data in a hybrid cloud environment
- Publication number
- CN108647771A CN108647771A CN201810427966.1A CN201810427966A CN108647771A CN 108647771 A CN108647771 A CN 108647771A CN 201810427966 A CN201810427966 A CN 201810427966A CN 108647771 A CN108647771 A CN 108647771A
- Authority
- CN
- China
- Prior art keywords
- data
- particle
- scientific workflow
- cloud environment
- data center
- Prior art date
- Legal status (The legal status is an assumption and is not a legal conclusion. Google has not performed a legal analysis and makes no representation as to the accuracy of the status listed.)
- Pending
Classifications
-
- G—PHYSICS
- G06—COMPUTING; CALCULATING OR COUNTING
- G06N—COMPUTING ARRANGEMENTS BASED ON SPECIFIC COMPUTATIONAL MODELS
- G06N3/00—Computing arrangements based on biological models
- G06N3/004—Artificial life, i.e. computing arrangements simulating life
- G06N3/006—Artificial life, i.e. computing arrangements simulating life based on simulated virtual individual or collective life forms, e.g. social simulations or particle swarm optimisation [PSO]
-
- G—PHYSICS
- G06—COMPUTING; CALCULATING OR COUNTING
- G06F—ELECTRIC DIGITAL DATA PROCESSING
- G06F9/00—Arrangements for program control, e.g. control units
- G06F9/06—Arrangements for program control, e.g. control units using stored programs, i.e. using an internal store of processing equipment to receive or retain programs
- G06F9/46—Multiprogramming arrangements
- G06F9/48—Program initiating; Program switching, e.g. by interrupt
- G06F9/4806—Task transfer initiation or dispatching
- G06F9/4843—Task transfer initiation or dispatching by program, e.g. task dispatcher, supervisor, operating system
- G06F9/4881—Scheduling strategies for dispatcher, e.g. round robin, multi-level priority queues
-
- G—PHYSICS
- G06—COMPUTING; CALCULATING OR COUNTING
- G06F—ELECTRIC DIGITAL DATA PROCESSING
- G06F9/00—Arrangements for program control, e.g. control units
- G06F9/06—Arrangements for program control, e.g. control units using stored programs, i.e. using an internal store of processing equipment to receive or retain programs
- G06F9/46—Multiprogramming arrangements
- G06F9/50—Allocation of resources, e.g. of the central processing unit [CPU]
- G06F9/5005—Allocation of resources, e.g. of the central processing unit [CPU] to service a request
- G06F9/5027—Allocation of resources, e.g. of the central processing unit [CPU] to service a request the resource being a machine, e.g. CPUs, Servers, Terminals
-
- G—PHYSICS
- G06—COMPUTING; CALCULATING OR COUNTING
- G06N—COMPUTING ARRANGEMENTS BASED ON SPECIFIC COMPUTATIONAL MODELS
- G06N3/00—Computing arrangements based on biological models
- G06N3/12—Computing arrangements based on biological models using genetic models
- G06N3/126—Evolutionary algorithms, e.g. genetic algorithms or genetic programming
Abstract
The present invention discloses a method for laying out scientific workflow data in a hybrid cloud environment. A discrete particle swarm encoding scheme is used to encode the scientific workflow data layout problem; the mutation and crossover operations of the genetic algorithm are introduced into the particle update, effectively overcoming the premature convergence problem of discrete particle swarm optimization; an adaptive update method that adjusts the inertia weight factor based on the degree of difference between the global historical optimal particle and the current particle is constructed, effectively matching the complex and changeable nature of the data layout problem; and the fitness function is defined separately for infeasible particles that exceed data center storage capacity and for feasible particles with remaining capacity. The number of data movements between data centers is reduced and the data transmission volume is compressed, thereby effectively improving the execution efficiency of scientific workflows.
Description
Technical Field
The invention relates to the field of parallel and distributed high-performance computing, and in particular to a scientific workflow data layout method in a hybrid cloud environment.
Background
In scientific research fields such as astronomy and bioinformatics, scientists analyze data from existing data sources or collected from physical devices by executing thousands of tasks, generating large amounts of new data such as intermediate data or final results, often on the TB or even PB scale. In the past, scientists generally used simple methods (e.g., Perl scripts) to orchestrate tasks and manage data, but this approach is not only time-consuming but also prone to error. Computing tasks and scientific applications are therefore often modeled as workflows, with data analysis automated through data or control dependencies. Scientific workflows that contain a large number of computing tasks require not only high-performance computing resources but also large amounts of storage. Currently, some large scientific workflows are deployed on complex distributed computer systems such as supercomputers because of their high performance and mass storage; however, these systems are very expensive to build and time-consuming to gain access to. Cloud computing virtualizes geographically distributed resources into a resource pool and offers them to users on a pay-as-you-go basis; its efficiency, flexibility, and customizability provide a new way to address the problems encountered when running scientific workflows.
Bandwidth between data centers is typically limited and expensive, so an improper data placement strategy may cause excessive data transfer between data centers, resulting in excessive cost to the user. In recent years, many workflow scheduling algorithms have been developed for efficient workflow execution, but most of them focus on scheduling computation-intensive workflows. As more and more scientific workflows become data-intensive, new approaches are needed to handle them. At present, most data layout optimization algorithms adopt a clustering-matrix method that optimizes the workflow in two stages, build time and run time; when the sizes of the data sets differ greatly, the clustering matrix has clear limitations. Intelligent methods have also been applied to data layout, but most of this work assumes that all data sets are the same size, whereas in practice the data sets of different workflows differ greatly in size, and these factors can significantly affect the data layout.
Regarding the scientific workflow data layout problem that accounts for private data placement in a hybrid cloud environment, few studies exist at present, either domestically or abroad. The most relevant work considers data dependency relationships, screens out non-private data sets with high dependency on private data sets, and places them in the corresponding private cloud data centers, effectively shortening data transmission time; however, that work does not consider the influence of size differences among data sets on the data layout.
Disclosure of Invention
The invention aims to overcome the defects of the prior art, and provides a scientific workflow data layout method based on scientific workflow execution characteristics and self structure, data set size difference, data center capacity and other relevant factors under a mixed cloud environment.
In order to achieve the purpose of the invention, the technical scheme of the invention is as follows:
a scientific workflow data layout method in a mixed cloud environment comprises the following steps:
step 1: constructing a scientific workflow structure in a hybrid cloud environment, and calculating data transmission quantity generated when each task is executed according to the execution sequence of the tasks;
step 2: initializing a population size, maximum iteration times, inertia weight factors and cognitive factors based on a scientific workflow structure in the mixed cloud environment, and randomly generating an initial population; initializing self history optimal particles of the first generation particles and initial population global optimal particles;
step 3: updating the velocity and position of each particle;
step 4: calculating the fitness value of each particle;
step 5: updating each particle's self-history optimal particle;
if the fitness value of the current particle is smaller than that of its historical optimal particle, the current particle becomes the new historical optimal particle;
step 6: updating the global optimal particle of the population;
if the fitness value of the current particle is smaller than that of the population global optimal particle, the current particle becomes the new population global optimal particle;
step 7: checking whether the algorithm termination condition is met; if so, the algorithm ends; otherwise, go to step 3.
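The seven steps above can be sketched as a minimal discrete particle swarm loop. Everything below is a toy assumption for illustration — the instance, the transfer-count fitness, and the simplified random-reset move in step 3 stand in for the patent's GA-based update and full fitness definition.

```python
import random

random.seed(42)

# Toy instance (assumed): M data sets, Q data centers, and three tasks
# whose input data sets are listed by id.
M, Q = 6, 3
TASK_INPUTS = [[0, 1], [1, 2, 5], [3, 4]]

def fitness(position):
    # Count data sets that must be moved: each task runs where the majority
    # of its inputs live, and every input stored elsewhere costs one transfer.
    total = 0
    for inputs in TASK_INPUTS:
        centers = [position[d] for d in inputs]
        run_dc = max(set(centers), key=centers.count)
        total += sum(1 for c in centers if c != run_dc)
    return total

def run_pso(pop_size=20, max_iter=100):
    # steps 1-2: build the instance and a random initial population
    pop = [[random.randrange(Q) for _ in range(M)] for _ in range(pop_size)]
    pbest = [p[:] for p in pop]
    gbest = min(pop, key=fitness)[:]
    for _ in range(max_iter):                  # step 7: termination condition
        for i, p in enumerate(pop):
            p[random.randrange(M)] = random.randrange(Q)  # step 3 (simplified)
            if fitness(p) < fitness(pbest[i]):            # steps 4-5
                pbest[i] = p[:]
            if fitness(p) < fitness(gbest):               # step 6
                gbest = p[:]
    return gbest

best = run_pso()
```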
Further, the step 1 of constructing the scientific workflow structure in the hybrid cloud environment specifically includes the following steps:
step 1-1: generating the scientific workflow data set DS, each element of which is a tuple dsi = <ds_size, init, deploy, Ti, pri_flag, ini_flag>;
wherein dsi represents the data set numbered i; ds_size represents the size of data set dsi; init represents the initial position of the data set; deploy represents the final layout position of the data set; Ti represents the set of tasks that require data set dsi; pri_flag is the private data set storage flag: if dsi is a private data set, pri_flag is the number of the data center storing dsi, otherwise pri_flag is 0; ini_flag is the initial data set flag: if dsi is an initial data set, ini_flag is 1, otherwise ini_flag is 0;
step 1-2: constructing the data center set DC in the hybrid cloud environment based on the scientific workflow data set DS, each element of which is a tuple dci = <DSi, size, availsize, type>;
wherein dci represents the data center numbered i; DSi represents the set of data sets stored in data center i; size represents the storage capacity of the data center; availsize represents the currently available storage capacity of the data center; type represents the type of data center: when type is 0, the data center is a public cloud data center; when type is 1, it is a private cloud data center;
step 1-3: generating the task set T of the scientific workflow, each element of which is a tuple ti = <inDSi, outDSi, runDC>;
wherein ti represents the task numbered i; inDSi represents the set of input data sets of task ti; outDSi represents the set of output data sets of task ti; runDC represents the data center where task ti is executed;
step 1-4: constructing the scientific workflow triple:
G = <T, E, DS> (1)
wherein G is a directed acyclic graph; E is the control-flow set representing the execution order of tasks; T is the task set of the scientific workflow; DS is the set of data sets of the scientific workflow.
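As an illustration only, the triple G = <T, E, DS> and its tuple attributes might be held in code as below; the class and field names are assumptions for this sketch, not part of the patent.

```python
from dataclasses import dataclass

@dataclass
class DataSet:
    ds_id: int
    ds_size: float      # size, e.g. in GB
    pri_flag: int = 0   # data center number if private, else 0
    ini_flag: int = 0   # 1 if the data set exists before the workflow runs

@dataclass
class Task:
    t_id: int
    in_ds: list         # ids of input data sets (inDSi)
    out_ds: list        # ids of output data sets (outDSi)

@dataclass
class Workflow:         # the triple G = <T, E, DS>
    tasks: list
    edges: list         # control-flow pairs (ti, tj): ti executes before tj
    datasets: list

# Illustrative two-task workflow with one private initial data set (id 3)
wf = Workflow(
    tasks=[Task(1, [1, 2], [6]), Task(2, [6, 3], [])],
    edges=[(1, 2)],
    datasets=[DataSet(1, 1.0, ini_flag=1), DataSet(2, 1.0, ini_flag=1),
              DataSet(3, 2.0, pri_flag=2, ini_flag=1), DataSet(6, 30.0)],
)
```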
Further, in the step 1, the method for allocating the tasks and the generated data sets in the scientific workflow structure in the mixed cloud environment is as follows:
when a task contains a private data set, the task is placed in the data center that stores that private data; otherwise, the task is placed in the data center holding more of its input data sets.
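The allocation rule above can be sketched as follows; the function and argument names are illustrative assumptions, with ties broken by whichever center Counter returns first.

```python
from collections import Counter

def assign_task(input_sets, placement, private_dc_of):
    """Pick a run data center for a task: a private input pins the task to
    that private data center; otherwise run where most inputs are stored."""
    for ds in input_sets:
        if ds in private_dc_of:
            return private_dc_of[ds]
    centers = Counter(placement[ds] for ds in input_sets)
    return centers.most_common(1)[0][0]

# Example mirroring the description: data set 4 is private, pinned to center 2
private = {4: 2}
placement = {3: 1, 4: 2, 6: 1}
```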
Further, the pri_flag and ini_flag in step 1-1 take the following values: pri_flag = PriDC(dsi) if dsi ∈ DSpri, and pri_flag = 0 if dsi ∈ DSpub; ini_flag = 1 if dsi ∈ DSini, and ini_flag = 0 if dsi ∈ DSgen;
wherein DSpri is the set of private data sets that must be placed in specific private cloud data centers, PriDC(dsi) is the number of the private cloud data center storing data set dsi, DSpub is the set of all public data sets, DSini is the set of all initial data sets that exist before the scientific workflow runs, and DSgen is the set of data sets generated while the scientific workflow runs.
Further, the step 2 of initializing the population comprises the following steps:
step 2-1: particle encoding: a two-dimensional particle encoding strategy is used, in which each particle represents one data layout scheme of the scientific workflow in the hybrid cloud environment; the first dimension of a particle indexes the data sets and the second dimension gives the data center where each data set is placed; the number of data sets is M and the number of data centers is Q;
the position of particle i at time t is expressed as:
X_i^t = (x_{i,1}^t, x_{i,2}^t, …, x_{i,M}^t)
wherein x_{i,k}^t represents the data center where the kth data set is placed at time t;
step 2-2: initializing the population size, the maximum iteration times, the inertia weight factor, the individual cognition factor, the global cognition factor and the particle initial speed according to a standard PSO algorithm.
Further, the initialization of the corresponding parameters in step 2-2 are respectively as follows: the population size is set to 50, the maximum number of iterations is set to 500, the inertial weight factor is set to 0.4, the individual cognition factor is set to 0.4, and the global cognition factor is set to 0.6.
Further, the specific method for initializing the self-history optimal particles and the initial population global optimal particles of the first generation of particles in step 2 is as follows:
step 2-3: calculating the fitness value of each particle of the first generation;
step 2-4: selecting the particle with the minimum fitness value as the population global optimal particle, and setting each particle in the first generation as its own self-history optimal particle.
Further, the fitness of a particle is calculated piecewise:
fitness = evaluate(q1), if q1 ∈ QTuf and q2 ∈ QTul;
fitness = min(evaluate(q1), evaluate(q2)), if q1, q2 ∈ QTuf or q1, q2 ∈ QTul;
wherein particle q1 represents one coding scheme of the problem and particle q2 represents another; QTuf is the feasible solution set and QTul is the infeasible solution set; evaluate(q1) and evaluate(q2) represent the data transmission amounts of the placement strategies corresponding to particles q1 and q2;
further, the update formula for updating particle i in step 3 consists of three parts: an inertia part A_i^t, an individual cognition part B_i^t, and a global cognition part that yields the new position X_i^t;
combining the standard PSO algorithm with the mutation operation of the genetic algorithm, the inertia part of particle i at time t is:
A_i^t = Mu(X_i^{t-1}) if r1 < w, otherwise A_i^t = X_i^{t-1}
wherein w is the inertia weight factor, used to adjust the particles' search capability over the solution space; Mu() denotes a mutation operation that randomly selects one position of the particle and randomly changes its value within the corresponding value range; r1 is a random number between 0 and 1;
combining the standard PSO algorithm with the crossover operation of the genetic algorithm, the individual cognition part and the global cognition part of particle i at time t are:
B_i^t = Cp(A_i^t, pBest) if r2 < c1, otherwise B_i^t = A_i^t
X_i^t = Cg(B_i^t, gBest) if r3 < c2, otherwise X_i^t = B_i^t
wherein c1 is the individual cognition factor and c2 is the global cognition factor; pBest and gBest respectively denote the individual optimal position of the particle after multiple iterations and the global optimal position of the population; Cp() and Cg() denote crossover operations that randomly select two positions of the particle and cross the corresponding segment with pBest or gBest; r2 and r3 are random numbers between 0 and 1, used to enhance randomness in the iterative search process.
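The probabilistic mutation/crossover update described above can be sketched as follows. This is a hedged reading of the operators Mu(), Cp(), and Cg(); the exact operator details shown in the patent figures may differ, and the sizes are toy assumptions.

```python
import random

random.seed(7)
M, Q = 6, 3  # toy sizes: number of data sets and data centers (assumed)

def mutate(x):
    # Mu(): randomly pick one position and redraw its value within [0, Q)
    y = x[:]
    y[random.randrange(M)] = random.randrange(Q)
    return y

def crossover(x, best):
    # Cp()/Cg(): pick two cut points and copy that segment from pBest/gBest
    y = x[:]
    a, b = sorted(random.sample(range(M), 2))
    y[a:b + 1] = best[a:b + 1]
    return y

def update_particle(x, pbest, gbest, w=0.4, c1=0.4, c2=0.6):
    a = mutate(x) if random.random() < w else x[:]              # inertia part
    b = crossover(a, pbest) if random.random() < c1 else a      # individual cognition
    return crossover(b, gbest) if random.random() < c2 else b   # global cognition
```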
With the above technical scheme, a discrete particle swarm encoding scheme is adopted to encode the scientific workflow data layout problem; the mutation and crossover operations of the genetic algorithm are introduced into the particle update, effectively overcoming the premature convergence problem of discrete particle swarm optimization; an update method that adaptively adjusts the inertia weight factor based on the degree of difference between the global historical optimal particle and the current particle is constructed, effectively matching the complex and changeable nature of the data layout problem; and the fitness function is defined separately for infeasible particles exceeding data center storage capacity and for feasible particles with remaining capacity. The number of data movements between data centers is reduced and the data transmission volume is compressed, so the execution efficiency of the scientific workflow can be effectively improved.
Drawings
The invention is described in further detail below with reference to the accompanying drawings and the detailed description.
FIG. 1 is an example diagram of a scientific workflow of a method for laying out scientific workflow data in a hybrid cloud environment according to the present invention;
FIG. 2 is a first scientific workflow data layout scheme of a scientific workflow data layout method in a hybrid cloud environment according to the present invention;
FIG. 3 is a scientific workflow data layout scheme II of the scientific workflow data layout method in a hybrid cloud environment according to the present invention;
FIG. 4 is a schematic particle encoding diagram of a scientific workflow data layout method in a hybrid cloud environment according to the present invention;
FIG. 5 is a mutation operator operation diagram of an inertial part of the scientific workflow data layout method in a hybrid cloud environment according to the present invention;
fig. 6 is a schematic diagram of the operation of the crossover operator of the personal or social cognitive part of the method for laying out the scientific workflow data in the hybrid cloud environment.
Fig. 7 is an algorithm flowchart of a scientific workflow data layout method in a hybrid cloud environment according to the present invention.
Detailed Description
For a more detailed description of the present invention, reference is made to the following detailed description taken in conjunction with the accompanying drawings.
1. Problem analysis:
As shown in fig. 1, a scientific workflow example of the present invention is given, and fig. 2 and fig. 3 show two data layout schemes, respectively, to describe the problem in detail. The scientific workflow consists of 5 tasks t1, t2, t3, t4, t5, 5 input data sets ds1, ds2, ds3, ds4, ds5, and 1 intermediate data set ds6. Data set ds4 is a private data set that must be stored in data center dc2; the input data set of task t4 is {ds3, ds4, ds6}, so t4 must also run in data center dc2. If only the number of data transmissions is considered, the number of transmissions in fig. 2 is 2 and in fig. 3 is 5; but the sizes of the data sets cannot be ignored. Assuming ds1 = 1GB, ds2 = 1GB, and ds6 = 30GB, the data transfer amount in fig. 2 is 60GB while that in fig. 3 is 34GB.
Consider that: 1) data sets are large; 2) there are dependencies between data sets; 3) the computing and storage resources of a single data center are insufficient to process and store the data of the entire application; 4) for security and privacy, some data sets can only be stored in designated private cloud data centers. Different data sets must therefore be uploaded to different data centers, and data must be transmitted from other data centers when tasks execute, so that the scientific workflow can run normally. The invention therefore provides an adaptive particle swarm optimization algorithm based on genetic algorithm operations, which introduces the random two-point crossover operation and random single-point mutation operation of the genetic algorithm to improve diversity during population evolution, effectively improving the execution efficiency of the scientific workflow and reducing the number of data movements and the data transmission volume between data centers.
2. System modeling:
scientific workflows are generally described as a Directed Acyclic Graph (DAG), in which nodes represent tasks and edges represent control dependencies. But this model is not suitable for data intensive scientific workflows because it does not have an input data set. Thus, the present invention extends the DAG model by considering the input data set.
Definition 1: a scientific workflow is defined as a triple:
G=<T,E,DS>(1)
the control flow set G is a Directed Acyclic Graph (DAG), and E represents the order of execution of tasks. T represents the set of all tasks of the scientific workflow. The DS represents the set of all data sets of the scientific workflow.
Definition 2: the data center set in the hybrid cloud environment is defined as:
DC = {dc1, dc2, …, dcn}, dci = <DSi, size, availsize, type>
wherein dci represents the data center numbered i; DSi represents the set of data sets stored in data center i; size represents the storage capacity of the data center; availsize represents the currently available storage capacity of the data center; when type is 0, the data center is a public cloud data center; when type is 1, it is a private cloud data center.
Definition 3: the set of data sets of a scientific workflow is defined as:
DS = {ds1, ds2, …, dsM}, dsi = <ds_size, init, deploy, Ti, pri_flag, ini_flag>
wherein dsi represents the data set numbered i; ds_size represents the size of data set dsi; init represents the initial position of the data set; deploy represents the final layout position of the data set; Ti represents the set of tasks that require data set dsi; pri_flag and ini_flag take the following values:
pri_flag = PriDC(dsi) if dsi ∈ DSpri, 0 if dsi ∈ DSpub; ini_flag = 1 if dsi ∈ DSini, 0 if dsi ∈ DSgen.
The private data sets in DSpri must be placed in specific private cloud data centers; PriDC(dsi) is the private cloud data center storing data set dsi; DSpub denotes the set of all public data sets. Based on the generation time of a data set, the invention calls a data set that exists before the scientific workflow runs an initial data set, with DSini denoting the set of all initial data sets; a generated data set is one produced while the scientific workflow runs, with DSgen denoting the set of all generated data sets.
Definition 4: the task set of the scientific workflow is defined as:
T = {t1, t2, …, tn}, ti = <inDSi, outDSi, runDC>
wherein ti represents the task numbered i. The invention focuses on the data placement problem of scientific workflows in a hybrid cloud environment, so only the input and output data sets of tasks are considered. inDSi denotes the set of input data sets of task ti; outDSi denotes the set of output data sets of task ti; runDC denotes the data center where task ti is executed.
3. Data placement strategy of the invention
3.1 Traditional particle swarm optimization algorithm
The PSO algorithm is a population-based stochastic optimization algorithm inspired by the social behavior of bird flocks, proposed by Eberhart and Kennedy in 1995. A particle in the PSO algorithm represents a candidate solution of the optimization problem; it moves through the problem space at a certain velocity and iteratively adjusts its position and velocity according to its own experience and that of surrounding particles. The update formulas for velocity and position are expressed as:
V_i^{t+1} = w·V_i^t + c1·r1·(pBest_i^t − X_i^t) + c2·r2·(gBest^t − X_i^t) (7)
X_i^{t+1} = X_i^t + V_i^{t+1} (8)
where t is the current iteration number; X_i^t and V_i^t respectively denote the position and velocity of the ith particle after t iterations, and a maximum velocity Vmax is defined for all particles to keep the solution within the solution space; pBest_i^t and gBest^t respectively denote the best position found so far by the ith particle and the historical best position of the population. The inertia weight w adjusts the particles' search capability over the solution space and is important for the convergence of the algorithm. c1 and c2 are cognition factors representing the current particle's cognitive learning toward its own historical optimum and the population's global historical optimum. r1 and r2 are random variables in the range [0, 1] used to enhance randomness in the iterative search. In addition, to judge the quality of solutions, a fitness function is introduced to evaluate the solution quality of each particle.
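For reference, one iteration of the standard continuous PSO velocity and position update can be written as below; the parameter values are illustrative defaults, not values from the patent.

```python
import random

random.seed(0)

def pso_step(x, v, pbest, gbest, w=0.7, c1=1.5, c2=1.5, vmax=4.0):
    """v <- w*v + c1*r1*(pbest - x) + c2*r2*(gbest - x);  x <- x + v,
    with each velocity component clamped to [-vmax, vmax]."""
    r1, r2 = random.random(), random.random()
    v = [max(-vmax, min(vmax, w * vi + c1 * r1 * (p - xi) + c2 * r2 * (g - xi)))
         for xi, vi, p, g in zip(x, v, pbest, gbest)]
    x = [xi + vi for xi, vi in zip(x, v)]
    return x, v
```

Note that this continuous form cannot directly encode data-center indices, which is why the patent switches to a discrete, GA-operator-based update.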
3.2 Adaptive particle swarm optimization algorithm based on genetic algorithm operations
The mutation process of the genetic algorithm makes it easy to escape a local optimum, but it may also leave a promising region, prolonging the search time. As shown in equation (7), the PSO update includes three core components: an inertia part, an individual cognition part, and a global cognition part. To overcome the premature convergence of the particle swarm algorithm, the invention introduces the mutation and crossover operations of the genetic algorithm into the corresponding update operations of the particle swarm algorithm. The algorithm is described in the following six parts.
Particle encoding:
The encoding of particles affects the search efficiency and performance of the algorithm. The invention proposes a two-dimensional particle encoding strategy and adopts the integer encoding of the genetic algorithm to avoid invalid solutions. Each particle represents one data layout scheme of the scientific workflow in the hybrid cloud environment. The invention sets an M-dimensional search space for each particle, M being the number of data sets, and each dimension takes an integer value from a set of Q possibilities, Q being the number of data centers. The position of particle i at time t can be expressed as:
X_i^t = (x_{i,1}^t, x_{i,2}^t, …, x_{i,M}^t)
wherein x_{i,k}^t indicates the data center where the kth data set is placed at time t. Fig. 4 illustrates a particle encoding strategy comprising 6 data sets: the first dimension of the particle indexes the data sets and the second dimension gives each data set's placement position. The invention assumes there are 3 data centers in the hybrid cloud environment, i.e., values range from 0 to 2. As can be seen from fig. 4, the placement position of data set ds3 is dc1.
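In this encoding a particle is simply a length-M integer vector, so validity is easy to check; the values below are illustrative, not those of the patent's figure.

```python
M, Q = 6, 3  # number of data sets and data centers in the toy example
particle = [0, 2, 1, 0, 2, 1]  # particle[k] = data center holding data set k

def is_valid(position, num_centers=Q):
    # Integer coding drawn from range(Q) can never yield an invalid center
    # index, which is why GA-style integer coding avoids invalid solutions.
    return all(isinstance(c, int) and 0 <= c < num_centers for c in position)
```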
Setting parameters:
w in equation (7) affects the search ability and convergence of the particle swarm algorithm: when w is larger, the algorithm has stronger global search capability; when w is smaller, it has stronger local search capability. Due to the diversity and complexity of the initial problem space, the particles need strong global search capability at first, and as the number of iterations increases, the local search capability needs to grow significantly. The inertia weight factor w should therefore decrease linearly as the evolution proceeds. The adjustment formula of w is set as:
w^t = wmax − (wmax − wmin) × t / tmax
wherein wmax and wmin respectively represent the maximum and minimum values of w at initialization, t represents the current iteration number, and tmax the maximum number of iterations. The strategy further takes into account the difference between the current particle and the global optimal particle; diff_i^t is defined as:
diff_i^t = diffNum(X_i^t, gBest^t) / DSN
wherein diffNum(X_i^t, gBest^t) denotes the number of positions in which the current particle differs from the global optimal particle, and DSN denotes the number of data sets in the scientific workflow. When diff_i^t is small, the difference between the current particle and the global optimal particle is small, so w is reduced so that the particle searches for an optimal solution within a small range; otherwise, w is increased to enlarge the particle's search space in order to find the optimal solution faster.
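One plausible reading of the difference-based adaptation is sketched below: w grows with the fraction of positions in which the particle differs from gBest. This is an assumption for illustration; the exact formula in the patent may combine this with the linear decrease over iterations.

```python
def adaptive_w(position, gbest, w_min=0.4, w_max=0.9):
    # diff = (# differing positions) / DSN;  w = w_min + (w_max - w_min) * diff
    # (w_min/w_max values are illustrative, not taken from the patent)
    dsn = len(position)
    diff = sum(a != b for a, b in zip(position, gbest)) / dsn
    return w_min + (w_max - w_min) * diff
```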
Initialization strategy:
To ensure the integrity and diversity of the search space, the invention uses a simple initialization strategy: the population size, maximum number of iterations, inertia weight factor, cognition factors, and other parameters are initialized according to the rules mentioned above, and the initial population is generated randomly. When a data center is randomly assigned to each public data set, any data center may receive data sets. Private data sets are always assigned to their designated data centers.
Fitness function:
The main objective of the proposed data placement strategy is to reduce the total data transmission volume between data centers during scientific workflow execution. The fitness function of the particles is used to compare the quality of two particles; in general, the particle with the smaller fitness value is better, and the invention takes the global transmission volume of the data placement scheme as the evaluation function. Since the particle encoding strategy above does not guarantee soundness, i.e., the remaining capacity of a data center may be insufficient to store the data sets generated while the workflow runs, the invention distinguishes feasible and infeasible solutions and defines the fitness function in the following three cases. A particle q1 denotes one coding scheme of the problem, and a particle q2 denotes another.
Case 1: particle q1Belong to the feasible solution set QTufParticles q2Belonging to the infeasible solver set QTul. A feasible solution is chosen without any controversy, whose fitness function is defined as shown in equation (12), where evaluate (q)1) Represents feasible particles q1The amount of data transfer for the corresponding placement policy.
fitness=evaluate(q1),q1∈QTuf(12)
Case 2: particle q1And q is2All belong to a feasible solution set QTufThe particle with the least total data transmission amount is selected. The fitness function is defined as shown in formula (13):
fitness=min(evaluate(q1),evaluate(q2)),q1,q2∈QTuf(13)
case 3: particle q1And q is2All belong to the infeasible solution set QTulThe particle with the least total data transmission amount is selected. As the particle is more likely to become a viable solution after evolution. The fitness function is defined as shown in formula (14):
fitness=min(evaluate(q1),evaluate(q2)),q1,q2∈QTul(14)
Furthermore, each task in a scientific workflow must satisfy four requirements: (1) the task is scheduled to execute on a specific data center; (2) if the task's input data sets include a private data set, that data set must be placed on a private cloud data center; (3) the initial position of the task's output data sets is the data center that executes the task; (4) all data sets required by the task must reside on the same data node. Therefore, if a task's input data sets are stored on different data centers, data must be transferred between data nodes before the task executes. To reduce the data transfer between data centers caused by task scheduling, the present invention assumes that a task is placed on the data center that holds the larger number of its input data sets.
To calculate a particle's fitness value, the tasks and generated data sets are first assigned: if a task's inputs contain a private data set, the task is placed on the corresponding private data center; otherwise it is placed on the data center holding more of its input data sets. Next, the data transfer volume incurred by each task is computed in task execution order. Finally, the particle's fitness value, i.e., the total data transfer volume produced by the placement scheme corresponding to the particle, is returned.
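The fitness evaluation above can be sketched as follows. The sketch is simplified: it applies only the majority-placement rule for choosing a task's execution data center and omits the private-data-set override and the feasible/infeasible distinction of equations (12)-(14); all names are illustrative assumptions.

```python
from collections import Counter

def fitness(particle, tasks, ds_sizes):
    """Total inter-data-center transfer volume of one placement scheme.

    particle  : data set index -> data center index
    tasks     : execution-ordered list of input data set index lists
    ds_sizes  : data set index -> data set size

    Each task runs on the data center holding the most of its input
    data sets; every input stored elsewhere is transferred in first."""
    total = 0
    for inputs in tasks:
        placements = Counter(particle[i] for i in inputs)
        run_dc = placements.most_common(1)[0][0]  # majority data center
        total += sum(ds_sizes[i] for i in inputs if particle[i] != run_dc)
    return total
```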
Update strategy
As shown in equation (7), the PSO update includes three core components: an inertia part, an individual cognition part, and a global cognition part. The invention introduces the crossover and mutation operations of the genetic algorithm to overcome the premature convergence of the particle swarm algorithm. The update formula for particle i at time t is as follows:
For the inertia part, the invention applies the mutation operation of the genetic algorithm to update the corresponding term of equation (7); the inertia part of particle i at time t is:
where Mu() denotes the mutation operation and r1 is a random number between 0 and 1. Mu() randomly selects one position of the particle and randomly changes its value, subject to the corresponding value range. FIG. 5 shows the mutation operation applied to the encoded particle of FIG. 4: the placement position corresponding to data set ds2 is updated from 0 to 1, and the mutated particle still satisfies the scheduling criteria.
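The inertia-part mutation can be sketched as follows; treating r1 < w as the condition for applying the mutation is an assumed interpretation of the (figure-only) formula, and a full implementation would additionally skip pinned private-data-set positions:

```python
import random

def mutate(particle, num_dcs, r1, w):
    """GA-style mutation for the inertia part: if r1 falls below the
    inertia weight w, pick one placement position at random and assign
    it a new random data center within the valid range [0, num_dcs)."""
    if r1 < w:
        particle = particle[:]  # work on a copy, keep the input intact
        k = random.randrange(len(particle))
        particle[k] = random.randrange(num_dcs)
    return particle
```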
For the individual cognition part and the global cognition part, the invention applies the crossover operation of the genetic algorithm to update the corresponding terms of equation (7); the individual cognition part and global cognition part of particle i at time t are:
where Cp() and Cg() denote crossover operations and r2 and r3 are random numbers between 0 and 1. Cp() and Cg() randomly select two positions of the particle and exchange the values between those positions with the corresponding positions of pBest (or gBest).
FIG. 6 shows the crossover operation applied to the encoded particle of FIG. 4: the values of the placement positions in the interval from ds2 to ds4 are exchanged with the values at the corresponding positions of pBest (gBest), and the resulting particle satisfies the scheduling criteria.
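The cognition-part crossover can be sketched as follows; using the cognition factor c as the probability threshold is an assumed interpretation, since the exact formulas appear only as figures:

```python
import random

def crossover(particle, best, r, c):
    """GA-style two-point crossover for the cognition parts: if r falls
    below the cognition factor c, choose two cut positions and copy the
    segment between them (inclusive) from pBest or gBest into the
    particle."""
    if r < c:
        i, j = sorted(random.sample(range(len(particle)), 2))
        particle = particle[:i] + best[i:j + 1] + particle[j + 1:]
    return particle
```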
Algorithm flow
Fig. 7 is a flowchart of the scientific workflow data placement method of the present invention, which includes the following steps:
Step 1: construct a scientific workflow structure in the hybrid cloud environment, and calculate the data transfer volume generated when each task is executed according to the task execution order;
Step 2: initialize the population size, maximum number of iterations, inertia weight factor, and cognition factors based on the scientific workflow structure in the hybrid cloud environment, and randomly generate an initial population; initialize each first-generation particle's personal historical optimal particle and the initial population's global optimal particle;
Step 3: update the velocity and position of each particle;
Step 4: calculate the fitness value of each particle;
Step 5: update each particle's personal historical optimal particle:
if the fitness value of the current particle is smaller than that of its historical optimal particle, the current particle becomes the new historical optimal particle;
Step 6: update the population's global optimal particle:
if the fitness value of the current particle is smaller than that of the population's global optimal particle, the current particle becomes the new global optimal particle;
Step 7: check whether the algorithm termination condition is met; if so, terminate; otherwise, return to Step 3.
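Steps 1-7 above can be sketched end to end as follows. This is a self-contained illustrative sketch, not the claimed method itself: the probability-threshold reading of w, c1, and c2, the majority-placement fitness, and all names are assumptions; the hyperparameter defaults follow the values stated later for the parameter initialization (population 50, 500 iterations, w = 0.4, c1 = 0.4, c2 = 0.6).

```python
import random
from collections import Counter

def dpso_ga(datasets, num_dcs, tasks, ds_sizes,
            pop_size=50, max_iter=500, w=0.4, c1=0.4, c2=0.6):
    """Discrete PSO whose update is realized by GA mutation (inertia
    part) and GA crossover (cognition parts); returns the best placement
    found and its total inter-data-center transfer volume."""
    # positions that may be re-placed (private data sets stay pinned)
    free = [k for k, (_, dc) in enumerate(datasets) if dc is None]

    def fitness(p):
        # total transfer volume, tasks processed in execution order
        total = 0
        for inputs in tasks:
            run_dc = Counter(p[i] for i in inputs).most_common(1)[0][0]
            total += sum(ds_sizes[i] for i in inputs if p[i] != run_dc)
        return total

    def mutate(p, r):
        # inertia part: random re-placement of one non-pinned data set
        if r < w and free:
            p = p[:]
            p[random.choice(free)] = random.randrange(num_dcs)
        return p

    def crossover(p, best, r, c):
        # cognition parts: two-point crossover with pBest or gBest
        if r < c:
            i, j = sorted(random.sample(range(len(p)), 2))
            p = p[:i] + best[i:j + 1] + p[j + 1:]
        return p

    # Step 2: random initial population; private data sets stay pinned
    pop = [[dc if dc is not None else random.randrange(num_dcs)
            for _, dc in datasets] for _ in range(pop_size)]
    p_best = [p[:] for p in pop]
    g_best = min(pop, key=fitness)[:]
    for _ in range(max_iter):                        # Step 7: iterate
        for i in range(pop_size):
            p = mutate(pop[i], random.random())      # Step 3: update
            p = crossover(p, p_best[i], random.random(), c1)
            p = crossover(p, g_best, random.random(), c2)
            pop[i] = p
            f = fitness(p)                           # Step 4
            if f < fitness(p_best[i]):               # Step 5
                p_best[i] = p[:]
            if f < fitness(g_best):                  # Step 6
                g_best = p[:]
    return g_best, fitness(g_best)
```

Because both pBest and gBest keep private data sets at their pinned centers, the crossover segments never move a private data set off its designated private cloud data center.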
With the above technical scheme, a discrete particle swarm encoding is used to encode the scientific workflow data placement problem; the mutation and crossover operations of the genetic algorithm are introduced into the particle update, effectively alleviating premature convergence in discrete particle swarm optimization; an update method that adaptively adjusts the inertia weight factor according to the degree of difference between the global historical optimal particle and the current particle is constructed, matching the complex and changeable nature of the data placement problem; and by defining distinct fitness functions for infeasible particles that exceed data center storage capacity and for feasible particles with spare capacity, the number of data movements between data centers is reduced and the data transfer volume is compressed, thereby effectively improving the execution efficiency of scientific workflows.
Claims (9)
1. A method for placing scientific workflow data in a hybrid cloud environment, characterized by comprising the following steps:
Step 1: construct a scientific workflow structure in the hybrid cloud environment, and calculate the data transfer volume generated when each task is executed according to the task execution order;
Step 2: initialize the population size, maximum number of iterations, inertia weight factor, and cognition factors based on the scientific workflow structure in the hybrid cloud environment, and randomly generate an initial population; initialize each first-generation particle's personal historical optimal particle and the initial population's global optimal particle;
Step 3: update the velocity and position of each particle;
Step 4: calculate the fitness value of each particle;
Step 5: update each particle's personal historical optimal particle:
if the fitness value of the current particle is smaller than that of its historical optimal particle, the current particle becomes the new historical optimal particle;
Step 6: update the population's global optimal particle:
if the fitness value of the current particle is smaller than that of the population's global optimal particle, the current particle becomes the new global optimal particle;
Step 7: check whether the algorithm termination condition is met; if so, terminate; otherwise, return to Step 3.
2. The method for placing scientific workflow data in a hybrid cloud environment according to claim 1, wherein constructing the scientific workflow structure in the hybrid cloud environment in step 1 specifically comprises the following steps:
Step 1-1: generate the scientific workflow data set collection DS;
where dsi represents the data set numbered i; ds_size represents the size of data set dsi; init represents the initial position of the data set; dep represents the final placement position of the data set; Ti represents the set of tasks that require data set dsi; pri_flag is the private-data-set storage flag: if dsi is a private data set, pri_flag is the number of the data center storing dsi, otherwise pri_flag is 0; ini_flag is the initial-data-set flag: if dsi is an initial data set, ini_flag is 1, otherwise ini_flag is 0;
Step 1-2: construct the data center set DC in the hybrid cloud environment based on the scientific workflow data set collection DS;
where dci represents the data center numbered i; DSi represents the set of data sets stored in data center i; size represents the storage capacity of the data center; availsize represents the currently available storage capacity of the data center; type represents the type of the data center: when type is 0 the data center is a public cloud data center, and when type is 1 it is a private cloud data center;
Step 1-3: generate the task set T of the scientific workflow;
where ti represents the task numbered i; inDSi represents the set of input data sets of task ti; outDSi represents the set of output data sets of task ti; runDC represents the data center on which task ti executes;
Step 1-4: construct the scientific workflow as a triple:
G = <T, E, DS> (1);
where E is the set of control-flow edges representing the execution order of the tasks, and G is a directed acyclic graph; T is the task set of the scientific workflow; DS is the set of data sets of the scientific workflow.
3. The method for placing scientific workflow data in a hybrid cloud environment according to claim 1, wherein the tasks and generated data sets in the scientific workflow structure in the hybrid cloud environment in step 1 are distributed as follows:
when a task contains a private data set, the task is placed on the data center that stores that private data; otherwise, the task is placed on the data center that holds more of its input data sets.
4. The method for placing scientific workflow data in a hybrid cloud environment according to claim 2, wherein the values of pri_flag and ini_flag in step 1-1 are as follows:
DSpri represents the set of private data sets that must be placed on specific private cloud data centers; priDC(dsi) represents the private cloud data center on which data set dsi is stored; DSpub represents the set of all public data sets; DSini represents the set of all initial data sets that exist before the scientific workflow runs; DSgen represents the set of data sets generated while the scientific workflow runs.
5. The method for placing scientific workflow data in a hybrid cloud environment according to claim 1, wherein initializing the population in step 2 comprises the following steps:
Step 2-1: particle encoding: a two-dimensional particle encoding strategy is used, in which each particle represents one data placement scheme of the scientific workflow in the hybrid cloud environment; the first dimension of a particle indexes the data sets, and the second dimension records the data center where each data set is placed; the number of data sets is M and the number of data centers is Q;
the position of particle i at time t is expressed as:
where the kth component represents the data center on which the kth data set is placed at time t;
Step 2-2: initialize the population size, maximum number of iterations, inertia weight factor, individual cognition factor, global cognition factor, and initial particle velocities according to the standard PSO algorithm.
6. The method for placing scientific workflow data in a hybrid cloud environment according to claim 5, wherein the corresponding parameters in step 2-2 are initialized as follows: the population size is set to 50, the maximum number of iterations to 500, the inertia weight factor to 0.4, the individual cognition factor to 0.4, and the global cognition factor to 0.6.
7. The method for placing scientific workflow data in a hybrid cloud environment according to claim 1, wherein each first-generation particle's personal historical optimal particle and the initial population's global optimal particle in step 2 are initialized as follows:
Step 2-3: calculate the fitness value of each first-generation particle;
Step 2-4: select the particle with the smallest fitness value as the population's global optimal particle, and set each first-generation particle as its own historical optimal particle.
8. The method for placing scientific workflow data in a hybrid cloud environment according to claim 1 or 7, wherein the fitness of a particle is calculated as follows:
where particle q1 denotes one encoding scheme for the problem and particle q2 denotes another; QTuf is the feasible solution set and QTul is the infeasible solution set; evaluate(q1) denotes the data transfer volume of the placement scheme corresponding to feasible particle q1, and evaluate(q2) denotes that of the placement scheme corresponding to particle q2.
9. The method for placing scientific workflow data in a hybrid cloud environment according to claim 1, wherein the update formula for particle i in step 3 is as follows:
where the three terms are, respectively, the inertia part, the individual cognition part, and the global cognition part;
combining the standard PSO algorithm with the mutation operation of the genetic algorithm, the inertia part of particle i at time t is obtained as:
where w is the inertia weight factor, used to adjust the particle's search capability over the solution space; Mu() denotes the mutation operation; r1 is a random number between 0 and 1; Mu() randomly selects one position of the particle and randomly changes its value within the corresponding value range;
combining the standard PSO algorithm with the crossover operation of the genetic algorithm, the individual cognition part and the global cognition part of particle i at time t are obtained as:
where c1 is the individual cognition factor and c2 is the global cognition factor; pBest and gBest respectively denote the personal optimal position of a particle after multiple iterations and the global optimal position of the population; Cp() and Cg() denote crossover operations; r2 and r3 are random numbers between 0 and 1; Cp() and Cg() randomly select two positions of the particle and exchange the values between those positions with the same positions of pBest or gBest; r1 and r2 are random variables in the range [0, 1] used to enhance randomness during the iterative search.
Priority Applications (1)
Application Number | Priority Date | Filing Date | Title |
---|---|---|---|
CN201810427966.1A CN108647771A (en) | 2018-05-07 | 2018-05-07 | The layout method of research-on-research flow data under a kind of mixing cloud environment |
Publications (1)
Publication Number | Publication Date |
---|---|
CN108647771A true CN108647771A (en) | 2018-10-12 |
Family
ID=63749278
Family Applications (1)
Application Number | Title | Priority Date | Filing Date |
---|---|---|---|
CN201810427966.1A Pending CN108647771A (en) | 2018-05-07 | 2018-05-07 | The layout method of research-on-research flow data under a kind of mixing cloud environment |
Cited By (7)
Publication number | Priority date | Publication date | Assignee | Title |
---|---|---|---|---|
CN110033076A (en) * | 2019-04-19 | 2019-07-19 | 福州大学 | Mix the Work stream data layout method below cloud environment to cost optimization |
CN110058812A (en) * | 2019-03-08 | 2019-07-26 | 中国农业科学院农业信息研究所 | Scientific workflow data placement method under a kind of cloud environment |
CN111882234A (en) * | 2020-08-03 | 2020-11-03 | 浪潮云信息技术股份公司 | Scientific workflow task management method and device |
CN112579987A (en) * | 2020-12-04 | 2021-03-30 | 河南大学 | Migration deployment method and operation identity verification method of remote sensing program in hybrid cloud |
CN112632615A (en) * | 2020-12-30 | 2021-04-09 | 福州大学 | Scientific workflow data layout method based on mixed cloud environment |
CN113806039A (en) * | 2021-08-09 | 2021-12-17 | 杭州电子科技大学 | Particle swarm optimization workflow scheduling method based on directional search |
CN114049083A (en) * | 2021-11-10 | 2022-02-15 | 福建师范大学 | Scientific workflow data layout method and terminal based on privacy level classification |
Citations (3)
Publication number | Priority date | Publication date | Assignee | Title |
---|---|---|---|---|
GB201712010D0 (en) * | 2015-01-27 | 2017-09-06 | Beijing Didi Infinity Tech And Dev Co Ltd | Information providing method and system for on-demand service |
EP3261054A1 (en) * | 2015-02-10 | 2017-12-27 | Beijing Didi Infinity Technology and Development Co., Ltd. | Order pushing method and system |
CN107656799A (en) * | 2017-11-06 | 2018-02-02 | 福建师范大学 | The workflow schedule method of communication and calculation cost is considered under a kind of more cloud environments |
Patent Citations (4)
Publication number | Priority date | Publication date | Assignee | Title |
---|---|---|---|---|
GB201712010D0 (en) * | 2015-01-27 | 2017-09-06 | Beijing Didi Infinity Tech And Dev Co Ltd | Information providing method and system for on-demand service |
US20180017405A1 (en) * | 2015-01-27 | 2018-01-18 | Beijing Didi Infinity Technology And Development Co., Ltd. | Methods and systems for providing information for an on-demand service |
EP3261054A1 (en) * | 2015-02-10 | 2017-12-27 | Beijing Didi Infinity Technology and Development Co., Ltd. | Order pushing method and system |
CN107656799A (en) * | 2017-11-06 | 2018-02-02 | 福建师范大学 | The workflow schedule method of communication and calculation cost is considered under a kind of more cloud environments |
Non-Patent Citations (3)
Title |
---|
WU, Jiahao et al.: "A particle swarm and genetic optimization cloud workflow scheduling algorithm based on a multi-agent system", Journal of Nanjing University (Natural Science) * |
SUN, Ping'an: "Research on cloud computing resource allocation based on particle swarm optimization", Journal of Southwest China Normal University (Natural Science Edition) * |
LI, Xuejun et al.: "A data-center-oriented workflow data placement method in hybrid clouds", Journal of Software * |
Cited By (11)
Publication number | Priority date | Publication date | Assignee | Title |
---|---|---|---|---|
CN110058812A (en) * | 2019-03-08 | 2019-07-26 | 中国农业科学院农业信息研究所 | Scientific workflow data placement method under a kind of cloud environment |
CN110058812B (en) * | 2019-03-08 | 2022-11-22 | 中国农业科学院农业信息研究所 | Scientific workflow data placement method in cloud environment |
CN110033076A (en) * | 2019-04-19 | 2019-07-19 | 福州大学 | Mix the Work stream data layout method below cloud environment to cost optimization |
CN110033076B (en) * | 2019-04-19 | 2022-08-05 | 福州大学 | Workflow data layout method for cost optimization in mixed cloud environment |
CN111882234A (en) * | 2020-08-03 | 2020-11-03 | 浪潮云信息技术股份公司 | Scientific workflow task management method and device |
CN112579987A (en) * | 2020-12-04 | 2021-03-30 | 河南大学 | Migration deployment method and operation identity verification method of remote sensing program in hybrid cloud |
CN112579987B (en) * | 2020-12-04 | 2022-09-13 | 河南大学 | Migration deployment method and operation identity verification method of remote sensing program in hybrid cloud |
CN112632615A (en) * | 2020-12-30 | 2021-04-09 | 福州大学 | Scientific workflow data layout method based on mixed cloud environment |
CN112632615B (en) * | 2020-12-30 | 2023-10-31 | 福州大学 | Scientific workflow data layout method based on hybrid cloud environment |
CN113806039A (en) * | 2021-08-09 | 2021-12-17 | 杭州电子科技大学 | Particle swarm optimization workflow scheduling method based on directional search |
CN114049083A (en) * | 2021-11-10 | 2022-02-15 | 福建师范大学 | Scientific workflow data layout method and terminal based on privacy level classification |
Similar Documents
Publication | Publication Date | Title |
---|---|---|
CN108647771A (en) | The layout method of research-on-research flow data under a kind of mixing cloud environment | |
CN112016812B (en) | Multi-unmanned aerial vehicle task scheduling method, system and storage medium | |
Kaur et al. | Deep‐Q learning‐based heterogeneous earliest finish time scheduling algorithm for scientific workflows in cloud | |
CN108989098B (en) | Time delay optimization-oriented scientific workflow data layout method in hybrid cloud environment | |
Jat et al. | A memetic algorithm for the university course timetabling problem | |
CN106227599B (en) | The method and system of scheduling of resource in a kind of cloud computing system | |
Hardiansyah et al. | Solving economic load dispatch problem using particle swarm optimization technique | |
CN110033076B (en) | Workflow data layout method for cost optimization in mixed cloud environment | |
CN102662743A (en) | Heuristic type coarse grain parallel grid task scheduling method | |
US20150170052A1 (en) | Method of reducing resource fluctuations in resource leveling | |
Deng et al. | A data and task co-scheduling algorithm for scientific cloud workflows | |
CN104937544A (en) | Computing regression models | |
Liu et al. | A data placement strategy for scientific workflow in hybrid cloud | |
Chen et al. | Scheduling independent tasks in cloud environment based on modified differential evolution | |
Mirzayi et al. | A hybrid heuristic workflow scheduling algorithm for cloud computing environments | |
Xhafa | A hybrid evolutionary heuristic for job scheduling on computational grids | |
Pooranian et al. | Hybrid metaheuristic algorithm for job scheduling on computational grids | |
CN112632615B (en) | Scientific workflow data layout method based on hybrid cloud environment | |
Kune et al. | Genetic algorithm based data-aware group scheduling for big data clouds | |
CN114461368A (en) | Multi-target cloud workflow scheduling method based on cooperative fruit fly algorithm | |
CN113987936A (en) | Equipment test resource overall allocation method based on chaotic genetic algorithm | |
CN107155215B (en) | Distribution method and device of application home service cluster | |
Entezari-Maleki et al. | A genetic algorithm to increase the throughput of the computational grids | |
Yu | [Retracted] Research on Optimization Strategy of Task Scheduling Software Based on Genetic Algorithm in Cloud Computing Environment | |
Li et al. | An evolutionary framework for multi-agent organizations |
Legal Events
Date | Code | Title | Description |
---|---|---|---|
PB01 | Publication | |
SE01 | Entry into force of request for substantive examination | |
RJ01 | Rejection of invention patent application after publication | Application publication date: 20181012 |