
CN112287603A - Prediction model construction method and device based on machine learning and electronic equipment - Google Patents

Prediction model construction method and device based on machine learning and electronic equipment

Info

Publication number
CN112287603A
CN112287603A (application number CN202011177483.4A)
Authority
CN
China
Prior art keywords
prediction model
data
time prediction
data set
job
Prior art date
Legal status (The legal status is an assumption and is not a legal conclusion. Google has not performed a legal analysis and makes no representation as to the accuracy of the status listed.)
Pending
Application number
CN202011177483.4A
Other languages
Chinese (zh)
Inventor
吴恩慈
Current Assignee (The listed assignees may be inaccurate. Google has not performed a legal analysis and makes no representation or warranty as to the accuracy of the list.)
Shanghai Qiyue Information Technology Co Ltd
Original Assignee
Shanghai Qiyue Information Technology Co Ltd
Priority date (The priority date is an assumption and is not a legal conclusion. Google has not performed a legal analysis and makes no representation as to the accuracy of the date listed.)
Filing date
Publication date
Application filed by Shanghai Qiyue Information Technology Co Ltd filed Critical Shanghai Qiyue Information Technology Co Ltd
Priority to CN202011177483.4A priority Critical patent/CN112287603A/en
Publication of CN112287603A publication Critical patent/CN112287603A/en
Pending legal-status Critical Current

Classifications

    • G PHYSICS
    • G06 COMPUTING; CALCULATING OR COUNTING
    • G06F ELECTRIC DIGITAL DATA PROCESSING
    • G06F 30/00 Computer-aided design [CAD]
    • G06F 30/20 Design optimisation, verification or simulation
    • G06F 30/27 Design optimisation, verification or simulation using machine learning, e.g. artificial intelligence, neural networks, support vector machines [SVM] or training a model
    • G PHYSICS
    • G06 COMPUTING; CALCULATING OR COUNTING
    • G06F ELECTRIC DIGITAL DATA PROCESSING
    • G06F 16/00 Information retrieval; Database structures therefor; File system structures therefor
    • G06F 16/20 Information retrieval; Database structures therefor; File system structures therefor of structured data, e.g. relational data
    • G06F 16/24 Querying
    • G06F 16/245 Query processing
    • G06F 16/2458 Special types of queries, e.g. statistical queries, fuzzy queries or distributed queries
    • G06F 16/2465 Query processing support for facilitating data mining operations in structured databases
    • G PHYSICS
    • G06 COMPUTING; CALCULATING OR COUNTING
    • G06F ELECTRIC DIGITAL DATA PROCESSING
    • G06F 18/00 Pattern recognition
    • G06F 18/20 Analysing
    • G06F 18/21 Design or setup of recognition systems or techniques; Extraction of features in feature space; Blind source separation
    • G06F 18/214 Generating training patterns; Bootstrap methods, e.g. bagging or boosting
    • G PHYSICS
    • G06 COMPUTING; CALCULATING OR COUNTING
    • G06N COMPUTING ARRANGEMENTS BASED ON SPECIFIC COMPUTATIONAL MODELS
    • G06N 3/00 Computing arrangements based on biological models
    • G06N 3/02 Neural networks
    • G06N 3/08 Learning methods
    • G06N 3/082 Learning methods modifying the architecture, e.g. adding, deleting or silencing nodes or connections
    • Y GENERAL TAGGING OF NEW TECHNOLOGICAL DEVELOPMENTS; GENERAL TAGGING OF CROSS-SECTIONAL TECHNOLOGIES SPANNING OVER SEVERAL SECTIONS OF THE IPC; TECHNICAL SUBJECTS COVERED BY FORMER USPC CROSS-REFERENCE ART COLLECTIONS [XRACs] AND DIGESTS
    • Y02 TECHNOLOGIES OR APPLICATIONS FOR MITIGATION OR ADAPTATION AGAINST CLIMATE CHANGE
    • Y02D CLIMATE CHANGE MITIGATION TECHNOLOGIES IN INFORMATION AND COMMUNICATION TECHNOLOGIES [ICT], I.E. INFORMATION AND COMMUNICATION TECHNOLOGIES AIMING AT THE REDUCTION OF THEIR OWN ENERGY USE
    • Y02D 10/00 Energy efficient computing, e.g. low power processors, power management or thermal management

Landscapes

  • Engineering & Computer Science (AREA)
  • Theoretical Computer Science (AREA)
  • Physics & Mathematics (AREA)
  • Evolutionary Computation (AREA)
  • Data Mining & Analysis (AREA)
  • General Physics & Mathematics (AREA)
  • General Engineering & Computer Science (AREA)
  • Artificial Intelligence (AREA)
  • Software Systems (AREA)
  • Mathematical Physics (AREA)
  • Computer Vision & Pattern Recognition (AREA)
  • Computational Linguistics (AREA)
  • Life Sciences & Earth Sciences (AREA)
  • Databases & Information Systems (AREA)
  • Fuzzy Systems (AREA)
  • Geometry (AREA)
  • Probability & Statistics with Applications (AREA)
  • Bioinformatics & Computational Biology (AREA)
  • Bioinformatics & Cheminformatics (AREA)
  • Medical Informatics (AREA)
  • Computer Hardware Design (AREA)
  • Evolutionary Biology (AREA)
  • Health & Medical Sciences (AREA)
  • Biomedical Technology (AREA)
  • Biophysics (AREA)
  • General Health & Medical Sciences (AREA)
  • Molecular Biology (AREA)
  • Computing Systems (AREA)
  • Management, Administration, Business Operations System, And Electronic Commerce (AREA)

Abstract

The invention relates to the technical field of computers, and in particular to a prediction model construction method and device based on machine learning and an electronic device. The prediction model construction method comprises the following steps: extracting a job data set from a target job, the job data set comprising feature vector data and metric index data; selecting executable feature vector data and metric index data from the job data set to construct a time prediction model; performing parameter configuration on the time prediction model based on a cross validation method; and performing overfitting optimization on the time prediction model to obtain a final time prediction model. Through the prediction results of the time prediction model, the method can be applied to the computational analysis and optimization of data mining, machine learning and deep learning on a distributed computing platform, effectively improves the utilization efficiency of computing resources, and at the same time realizes parameter configuration adjustment and overfitting optimization of the model.

Description

Prediction model construction method and device based on machine learning and electronic equipment
Technical Field
The invention relates to the technical field of computers, in particular to a prediction model construction method and device based on machine learning and electronic equipment.
Background
The distributed computing platform can extend the cluster size to thousands of nodes with the help of its core engine, and the Catalyst Optimizer provides a rule-based and cost-based optimizer that pushes the computing power of the data warehouse to a new height. However, on very large-scale data sets there are usability and scalability problems: a structured query language (SQL) statement or a Dataset program is parsed into a logical plan before being executed, an executable physical plan is then generated, and different execution plans have a great influence on performance.
In the prior art, a multiple linear regression method has been proposed that predicts cluster performance from Spark performance indexes, but its model index R² is smaller than the average value; moreover, the cluster indexes are static, and once the hardware resources of the cluster are expanded the prediction model needs to be retrained. The prior art has also studied the influence of changing Spark cluster configuration parameters on performance: key performance indexes are captured in a model training stage and users are allowed to query the influence of the configuration parameters on performance, but users need a deep knowledge and understanding of the computing platform's parameters and indexes, so the technical threshold for applying the method is high. Meanwhile, in the prior art, a two-stage regression method has been proposed that predicts task completion time by analyzing data collected during the execution of Map Reduce tasks, but the implementation process is complex, data consistency is compromised, and it is difficult to meet the real-time requirements of a computing platform.
Disclosure of Invention
The invention provides a prediction model construction method and device based on machine learning and electronic equipment, which effectively improve the use efficiency of computing resources and simultaneously realize parameter configuration adjustment and optimized overfitting of a model.
The embodiment of the specification provides a prediction model construction method based on machine learning, which comprises the following steps:
extracting a job data set in a target job, the job data set comprising: feature vector data, metric index data;
selecting the executable feature vector data and the executable measurement index data from the operation data set to construct a time prediction model;
performing parameter configuration on the time prediction model based on a cross validation method;
and performing overfitting optimization on the time prediction model to obtain a final time prediction model.
Preferably, the method further comprises the following steps: and when the operation data set is larger than a preset training data set, carrying out hyper-parameter adjustment on the time prediction model by adopting a method of randomly dividing a training set and a test set, and carrying out parameter configuration on the time prediction model.
Preferably, the extracting the job data set in the target job includes:
acquiring the operation data set through any one of an operation scheduling page, a REST interface and an external monitoring tool;
extracting the feature vector data by a listener bus mechanism and the metric data by an indicator system.
Preferably, the overfitting optimization of the temporal prediction model includes:
iteratively training the temporal prediction model based on a combinatorial algorithm, the combinatorial algorithm comprising: a random forest algorithm and a gradient lifting tree algorithm;
verifying the result of the iterative training by a training method of a verification set;
and stopping the iterative training when the verification result is lower than the tolerance set by the strategy, and obtaining the time prediction model after overfitting optimization.
Preferably, the feature vector data is selected according to a chi-square selection method.
Preferably, the metric index is selected according to job data sets of different sizes, the Shuffle and interface operations of different types of operators, and the time overhead of network traffic.
An embodiment of the present specification further provides a prediction model building apparatus based on machine learning, including: a data extraction module that extracts a job data set in a target job, the job data set comprising: feature vector data and measurement index data;
the data selection module is used for selecting the executable feature vector data and the measurement index data in the operation data set and constructing a time prediction model;
the parameter configuration module is used for carrying out parameter configuration on the time prediction model based on a cross verification method;
and the goodness-of-fit adjustment module is used for carrying out overfitting optimization on the time prediction model to obtain a final time prediction model.
Preferably, the method further comprises the following steps: and when the operation data set is larger than a preset training data set, carrying out hyper-parameter adjustment on the time prediction model by adopting a method of randomly dividing a training set and a test set, and carrying out parameter configuration on the time prediction model.
Preferably, the extracting the job data set in the target job includes:
acquiring the operation data set through any one of an operation scheduling page, a REST interface and an external monitoring tool;
extracting the feature vector data by a listener bus mechanism and the metric data by an indicator system.
Preferably, the overfitting optimization of the temporal prediction model includes:
iteratively training the temporal prediction model based on a combinatorial algorithm, the combinatorial algorithm comprising: a random forest algorithm and a gradient lifting tree algorithm;
verifying the result of the iterative training by a training method of a verification set;
and stopping the iterative training when the verification result is lower than the tolerance set by the strategy, and obtaining the time prediction model after overfitting optimization.
Preferably, the feature vector data is selected according to a chi-square selection method.
Preferably, the metric index is selected according to job data sets of different sizes, the Shuffle and interface operations of different types of operators, and the time overhead of network traffic.
An electronic device, wherein the electronic device comprises:
a processor and a memory storing computer executable instructions that, when executed, cause the processor to perform the method of any of the above.
A computer readable storage medium, wherein the computer readable storage medium stores one or more programs which, when executed by a processor, implement the method of any of the above.
The beneficial effects are that:
the method can be applied to computational analysis and optimization of data mining, machine learning and deep learning based on a distributed computing platform through the prediction result of the time prediction model, effectively improves the use efficiency of computing resources, and simultaneously realizes parameter configuration adjustment and optimized overfitting of the model.
Drawings
The accompanying drawings, which are included to provide a further understanding of the application and are incorporated in and constitute a part of this application, illustrate embodiment(s) of the application and together with the description serve to explain the application and not to limit the application. In the drawings:
fig. 1 is a schematic diagram of a principle of a prediction model construction method based on machine learning according to an embodiment of the present disclosure;
fig. 2 is a schematic structural diagram of a prediction model building apparatus based on machine learning according to an embodiment of the present disclosure;
fig. 3 is a schematic structural diagram of an electronic device provided in an embodiment of the present disclosure;
fig. 4 is a schematic diagram of a computer-readable medium provided in an embodiment of the present specification.
Detailed Description
Exemplary embodiments of the present invention will now be described more fully with reference to the accompanying drawings. The exemplary embodiments, however, may be embodied in many different forms and should not be construed as limited to the embodiments set forth herein. Rather, these exemplary embodiments are provided so that this disclosure will be thorough and complete, and will fully convey the concept to those skilled in the art. The same reference numerals denote the same or similar elements, components, or parts in the drawings, and thus their repetitive description will be omitted.
Features, structures, characteristics or other details described in a particular embodiment do not preclude the fact that the features, structures, characteristics or other details may be combined in a suitable manner in one or more other embodiments in accordance with the technical idea of the invention.
In describing particular embodiments, the present invention has been described with reference to features, structures, characteristics or other details that are within the purview of one skilled in the art to provide a thorough understanding of the embodiments. One skilled in the relevant art will recognize, however, that the invention may be practiced without one or more of the specific features, structures, characteristics, or other details.
The diagrams depicted in the figures are exemplary only, and do not necessarily include all of the contents and operations/steps, nor do they necessarily have to be performed in the order depicted. For example, some operations/steps may be decomposed, and some operations/steps may be combined or partially combined, so that the actual execution sequence may be changed according to the actual situation.
The block diagrams depicted in the figures are functional entities only and do not necessarily correspond to physically separate entities. I.e. these functional entities may be implemented in the form of software, or in one or more hardware modules or integrated circuits, or in different networks and/or processor means and/or microcontroller means.
The term "and/or" and/or "includes all combinations of any one or more of the associated listed items.
Referring to fig. 1, a schematic diagram of a method for building a prediction model based on machine learning according to an embodiment of the present disclosure includes:
s101: extracting a job data set in a target job, the job data set comprising: feature vector data, metric index data;
in the preferred embodiment of the present invention, the job data set is extracted from the target job, wherein the job data set includes feature vector data and metric index data, and in this embodiment, the job data set is obtained through the REST interface, the feature vector data is extracted through the listener bus mechanism, the metric index data is extracted through the index system, and the job data set extraction event is submitted to the corresponding event listener by using the asynchronous thread.
S102: selecting the executable feature vector data and the measurement index data from the operation data set to construct a time prediction model;
in a preferred embodiment of the present invention, executable feature vector data and the metric index data are selected from the job data set, specifically, the feature selection method mainly includes two modes, namely supervision and unsupervised, and the feature vector data is selected by using a chi-square selection feature selection method, and the relevance is determined by performing chi-square test between features and real tags, as shown in table 1, the source and value of a Task feature vector X are exemplified as follows:
[Table 1: sources and values of the Task feature vector X (table image not reproduced)]
The numbers 1-3 represent network flow characteristics, the numbers 4-11 represent Shuffle and interface characteristics in the operation execution process, and the numbers 12-15 represent data scale characteristics; the acquisition of the job data set can be achieved through any one of a job scheduling page, a REST interface and an external monitoring tool, the feature vector data is extracted through a listener bus mechanism, the measurement index data is extracted through an index system, the extraction of the target job feature data is completed, and an event is submitted to a corresponding event listener through an asynchronous thread.
The selection of the metric index data needs to reflect data sets of different scales, the Shuffle and interface operations of different types of operators, and the time overhead of network traffic. The dynamic nature of the job execution time index is fully considered: the available resources of each job execution plan differ, resource competition exists when jobs run in parallel, and the job's garbage collection time, data serialization and deserialization time, and network transmission all carry a certain randomness and correlation; for example, the interface operations of operators during a Shuffle are time-consuming. The sources and values of the Task metric index Y are shown in Table 2:
[Table 2: sources and values of the Task metric index Y (table image not reproduced)]
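One possible way to wire this selection step up with Spark MLlib is sketched below; the column names are hypothetical stand-ins for the feature and metric entries of Tables 1 and 2 (whose images are not reproduced here), the execution-time label is bucketed because the chi-square test expects a categorical label, and the number of retained features is an arbitrary choice.

```python
# Sketch: assemble a Task feature vector and keep the most relevant features with
# chi-square selection in pyspark.ml. Column names are hypothetical stand-ins for
# the entries of Tables 1 and 2; the execution-time label is bucketed because the
# chi-square test expects a categorical label.
from pyspark.sql import SparkSession
from pyspark.ml.feature import VectorAssembler, QuantileDiscretizer, ChiSqSelector

spark = SparkSession.builder.appName("feature-selection-sketch").getOrCreate()
df = spark.read.parquet("job_dataset.parquet")   # hypothetical extracted job data set

feature_cols = [
    "net_bytes_in", "net_bytes_out", "net_latency_ms",    # network traffic
    "shuffle_read_bytes", "shuffle_write_bytes",          # Shuffle / interface
    "gc_time_ms", "serialize_ms", "deserialize_ms",
    "input_rows", "input_bytes", "num_partitions",        # data scale
]

assembled = VectorAssembler(inputCols=feature_cols,
                            outputCol="features").transform(df)

# Discretize the execution-time label so the chi-square test can be applied.
bucketed = QuantileDiscretizer(numBuckets=10, inputCol="task_time_ms",
                               outputCol="time_bucket").fit(assembled).transform(assembled)

selector = ChiSqSelector(numTopFeatures=8, featuresCol="features",
                         labelCol="time_bucket", outputCol="selected_features")
selected = selector.fit(bucketed).transform(bucketed)
selected.select("selected_features", "task_time_ms").show(5, truncate=False)
```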
S103: performing parameter configuration on the time prediction model based on a cross validation method;
in the preferred embodiment of the invention, pruning is achieved by minimizing the loss function of the decision tree, which to a certain extent can avoid overfitting, and the decision tree stops growing when the node Depth equals the Max Depth parameter. The Min Info Gain parameter is the minimum value of splitting information Gain which needs to be improved, the decision tree stops growing when the information Gain is larger than the minimum value due to the fact that candidate items are not divided, machine learning can find the best hyper-parameter of a specific problem through a data set, the best hyper-parameter can be completed in an independent Estimator, or the best hyper-parameter can be completed in a workflow containing various algorithms and feature selection, model training and testing steps are repeatedly performed, K similar mutually exclusive subsets are selected by adopting a random sampling method, consistency of data distribution is kept as much as possible for each subset, models are trained and tested respectively, the problem of over-fitting is avoided by taking the mean value of the K models, the process is called cross validation, and stability and fidelity of an evaluation result depend on the value of K to a great extent.
When the cross validation method evaluates the model parameters, the average evaluation metric of the estimator over the different data-set pairs it was fitted on is calculated, the estimator is then re-fitted on the whole data set with those parameters, the optimal configuration parameters are found, and an optimal time prediction model with strong generalization capability and a relatively small error is trained on the whole training set.
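A minimal sketch of this cross-validated parameter configuration, assuming a Spark MLlib random forest regressor and the feature-selected data set from the previous sketch; the grid values, the metric and numFolds=5 are illustrative choices, not values prescribed by the method.

```python
# Sketch: K-fold cross-validation over tree hyper-parameters such as maxDepth and
# minInfoGain. The grid values and numFolds=5 are illustrative choices only.
from pyspark.ml.regression import RandomForestRegressor
from pyspark.ml.tuning import CrossValidator, ParamGridBuilder
from pyspark.ml.evaluation import RegressionEvaluator

rf = RandomForestRegressor(featuresCol="selected_features", labelCol="task_time_ms")

grid = (ParamGridBuilder()
        .addGrid(rf.maxDepth, [5, 10, 15])
        .addGrid(rf.minInfoGain, [0.0, 0.01, 0.05])
        .addGrid(rf.numTrees, [50, 100])
        .build())

evaluator = RegressionEvaluator(labelCol="task_time_ms",
                                predictionCol="prediction", metricName="rmse")

cv = CrossValidator(estimator=rf, estimatorParamMaps=grid,
                    evaluator=evaluator, numFolds=5, parallelism=2)

cv_model = cv.fit(selected)     # 'selected' is the feature-selected data set above
best_rf = cv_model.bestModel    # re-fitted on the full data with the best parameters
print(best_rf.extractParamMap())
```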
S104: and performing overfitting optimization on the time prediction model to obtain a final time prediction model.
In a preferred embodiment of the invention, the time prediction model is iteratively trained based on a random forest algorithm and a gradient boosting tree algorithm. The random forest adopts the idea of the bootstrap aggregation (bagging) algorithm and trains a set of decision trees in parallel; the training process is randomized: the original data set is re-sampled at each iteration, a different random feature subset is divided at each tree node (a random subset containing K attributes is selected first, and the optimal attribute is then chosen from it), and the degree of introduced randomness is controlled by parameters. Because the random forest does not build the ensemble from model residuals, it can obtain a lower variance, and its prediction aggregates the predictions of the decision tree set: the prediction of each tree is counted as one vote, the category with the most votes is taken as the classification result, and the average value is taken as the regression result. This captures nonlinear features, improves discrimination precision, and avoids overfitting to a certain extent.
The gradient boosting tree algorithm iteratively trains decision trees and requires a longer training time than the random forest. At each iteration the current ensemble predicts the label of every training example, the prediction is compared with the real label, and the data set is then re-labelled so that the decision tree corrects the previous deviation in the next iteration; the deviation on the training data is thus further reduced at each iteration, and the re-labelling mechanism is determined by a loss function. The gradient boosting tree algorithm adjusts the job execution time prediction according to the observed job execution time and is easily affected by noise points; during training it is verified with a validation-set based training method, training stops when the improvement of the validation error does not exceed the tolerance set by the strategy, and overfitting can thus be effectively prevented.
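The sketch below shows one way such a combination could look in Spark MLlib: a gradient-boosted-tree regressor with a validation indicator column and a validation tolerance for early stopping (supported in recent Spark releases), whose predictions are averaged with those of the random forest trained above; the split ratio, tolerance, learning rate and iteration count are assumptions.

```python
# Sketch: gradient-boosted trees with validation-based early stopping (available in
# recent Spark releases via validationIndicatorCol / validationTol), averaged with
# the random forest from the previous sketch. Split ratio, tolerance, step size and
# iteration count are illustrative assumptions.
from pyspark.sql import functions as F
from pyspark.ml.regression import GBTRegressor

train, valid = selected.randomSplit([0.8, 0.2], seed=42)
with_flag = (train.withColumn("is_val", F.lit(False))
             .unionByName(valid.withColumn("is_val", F.lit(True))))

gbt = GBTRegressor(featuresCol="selected_features", labelCol="task_time_ms",
                   maxIter=200, stepSize=0.05,
                   validationIndicatorCol="is_val",  # rows reserved for validation
                   validationTol=0.01)               # stop when improvement falls below this
gbt_model = gbt.fit(with_flag)

# Average the two ensembles' predictions for the final execution-time estimate.
valid_id = valid.withColumn("row_id", F.monotonically_increasing_id()).cache()
rf_pred = best_rf.transform(valid_id).select("row_id",
                                             F.col("prediction").alias("rf_pred"))
gbt_pred = gbt_model.transform(valid_id).select("row_id", "task_time_ms",
                                                F.col("prediction").alias("gbt_pred"))
combined = (rf_pred.join(gbt_pred, "row_id")
            .withColumn("prediction", (F.col("rf_pred") + F.col("gbt_pred")) / 2.0))
```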
Decision trees can fit data using complex nonlinear models for regression analysis by varying the metric of purity. Variance is a measure used to measure the label uniformity at the nodes of the regression model.
As shown in equation (1), the root mean square error RMSE is the square root of the mean square error MSE and further amplifies large errors; a value closer to zero indicates a more accurate prediction. Here w^T x^(i) is the predicted value of the execution time of job i and y^(i) is its actual execution time:
$$\mathrm{RMSE}=\sqrt{\mathrm{MSE}}=\sqrt{\frac{1}{m}\sum_{i=1}^{m}\left(w^{T}x^{(i)}-y^{(i)}\right)^{2}}\qquad(1)$$
as shown in formula (2), the average absolute error MAE is an average value of absolute values of differences between the predicted value and the actual value, and the MAE avoids mutual cancellation of positive and negative errors and better reflects the actual situation of the predicted value error.
$$\mathrm{MAE}=\frac{1}{m}\sum_{i=1}^{m}\left|w^{T}x^{(i)}-y^{(i)}\right|\qquad(2)$$
As shown in equation (3), the goodness of fit (the R-squared coefficient) evaluates how well the model fits the data and measures the degree of variation of the target variable, indicating the portion of the dependent variable's variation that can be explained by the variation of the independent variables. The closer R² is to 1, the higher the degree to which the independent variables explain the dependent variable.
$$R^{2}=1-\frac{\sum_{i=1}^{m}\left(y^{(i)}-w^{T}x^{(i)}\right)^{2}}{\sum_{i=1}^{m}\left(y^{(i)}-\bar{y}\right)^{2}}\qquad(3)$$
where R² is the goodness of fit.
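For reference, the three metrics can be computed directly on the prediction DataFrame; a brief sketch, reusing the column names assumed in the earlier snippets:

```python
# Sketch: computing RMSE, MAE and R^2 for the combined execution-time predictions,
# reusing the 'combined' DataFrame and column names from the snippets above.
from pyspark.ml.evaluation import RegressionEvaluator

metric_eval = RegressionEvaluator(labelCol="task_time_ms", predictionCol="prediction")
rmse = metric_eval.setMetricName("rmse").evaluate(combined)
mae = metric_eval.setMetricName("mae").evaluate(combined)
r2 = metric_eval.setMetricName("r2").evaluate(combined)
print(f"RMSE={rmse:.1f} ms  MAE={mae:.1f} ms  R2={r2:.3f}")
```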
Further, the method further includes: when the operation data set is larger than a preset training data set, hyper-parameter adjustment is carried out on the time prediction model by adopting a method of randomly dividing the training set and the test set, and parameter configuration is carried out on the time prediction model.
In a preferred embodiment of the present invention, when the training cost of the cross validation method is high, hyper-parameter adjustment can instead be performed by randomly dividing the data into a training set and a test set: a single training/test data-set pair is created by splitting the data set in two with a training-ratio parameter (for example, a ratio of 75%), and the estimator is finally fitted with the optimal parameter configuration on the complete data set. Whereas the cross validation method evaluates each parameter combination K times, the random train/test split evaluates each combination only once, and the reliability of the result is low when the training data set is not large enough.
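A minimal sketch of this cheaper alternative, reusing the estimator, parameter grid and data set from the cross-validation sketch; the 75% training ratio follows the text, while the seed is arbitrary.

```python
# Sketch: a single random train/test pair with a 75% training ratio, as the cheaper
# alternative to K-fold cross-validation. 'rf', 'grid' and 'selected' come from the
# cross-validation sketch above.
from pyspark.ml.tuning import TrainValidationSplit
from pyspark.ml.evaluation import RegressionEvaluator

tvs = TrainValidationSplit(
    estimator=rf,
    estimatorParamMaps=grid,
    evaluator=RegressionEvaluator(labelCol="task_time_ms",
                                  predictionCol="prediction", metricName="rmse"),
    trainRatio=0.75,                  # 75% training-scale parameter from the text
    seed=42)
tvs_model = tvs.fit(selected)         # each parameter combination is evaluated once
best_model_tvs = tvs_model.bestModel  # finally re-fitted with the best configuration
```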
Further, the extracting the job data set in the target job includes:
acquiring the operation data set through any one of an operation scheduling page, a REST interface and an external monitoring tool;
extracting the feature vector data by a listener bus mechanism and the metric data by an indicator system.
In a preferred embodiment of the present invention, the job data set may be obtained through any one of a job scheduling page, a REST interface, and an external monitoring tool, the feature vector data is extracted through a listener bus mechanism, the metric index data is extracted through an index system, the extraction of the feature data of the target job is completed, and the established time prediction model is trained through the extracted feature data.
Further, the overfitting optimization of the temporal prediction model includes:
iteratively training the temporal prediction model based on a combinatorial algorithm, the combinatorial algorithm comprising: a random forest algorithm and a gradient lifting tree algorithm;
verifying the result of the iterative training by a training method of a verification set;
and stopping the iterative training when the verification result is lower than the tolerance set by the strategy, and obtaining the time prediction model after overfitting optimization.
In a preferred embodiment of the invention, the time prediction model is iteratively trained based on a combined algorithm, a training method based on a verification set is used for verification during training, the training is stopped when the improvement of verification errors does not exceed the tolerance set by a strategy, the time prediction model after overfitting optimization is obtained, and overfitting can be effectively prevented.
Further, the feature vector data is selected according to a chi-square selection method.
In the preferred embodiment of the present invention, chi-square selection is a statistical feature selection method: a chi-square test is performed between each feature and the real label to determine the degree of association, and the features with the stronger association are selected.
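As a small, self-contained illustration of the test itself (on synthetic data, not data from the patent), the chi-square statistic between a discretized feature and a discretized execution-time label can be computed as follows; a small p-value suggests the feature is associated with the label and is worth keeping.

```python
# Sketch: the chi-square association test behind the selection, on synthetic data.
import numpy as np
from scipy.stats import chi2_contingency

rng = np.random.default_rng(0)
feature_bucket = rng.integers(0, 4, size=1000)                        # e.g. shuffle-size bucket
label_bucket = (feature_bucket + rng.integers(0, 2, size=1000)) % 4   # correlated time bucket

contingency = np.zeros((4, 4), dtype=int)
for f, y in zip(feature_bucket, label_bucket):
    contingency[f, y] += 1            # count co-occurrences of feature and label buckets

chi2, p_value, dof, _ = chi2_contingency(contingency)
print(f"chi2={chi2:.1f}, dof={dof}, p={p_value:.3g}")
```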
Further, the metric index is selected according to job data sets of different sizes, the Shuffle and interface operations of different types of operators, and the time overhead of network traffic.
In the preferred embodiment of the present invention, the selection of the metric index data needs to reflect data sets of different scales, the Shuffle and interface operations of different types of operators, and the time overhead of network traffic. The metric indexes are selected with full consideration of the dynamic nature of the job execution time index: the available resources of each job execution plan differ, resource competition exists when jobs run in parallel, and the job's garbage collection time, data serialization and deserialization time, and network transmission all carry a certain randomness and correlation.
In this embodiment of the specification, the scheduling program identifies the dependency relationships between the elastic distributed data sets and the directed acyclic graph, the application program compiles the target job into a job execution plan, and the main basis for dividing the scheduling stages is whether the current input of the calculation factor is determined.
In this embodiment of the specification, the directed acyclic graph scheduler sequences the target jobs; the job scheduler schedules the job sets produced by the directed acyclic graph scheduler, creates a job set manager that is added to the scheduling pool, sequences all job set managers in the scheduling pool, allocates resources according to the data locality principle, and runs an execution plan on each allocated node. The intermediate results and the final result of the execution plan are stored in a storage system; a job monitor monitors the success or failure of each task in the job, reports the execution status of each task to the directed acyclic graph scheduler through a monitoring event, and a retry and fault-tolerance mechanism exists for failed tasks. The data locality principle is as follows: if the job execution plan is in the scheduling stage at the start of the job, the data locality of the preferred position of the corresponding elastic distributed data set partition is Node Local; if the task is in a scheduling stage other than the start of the job, the preferred position is obtained from the running position of the parent scheduling stage; and if the Executor is active, the data locality is Process Local.
In the preferred embodiment of the present invention, memory request and allocation problems exist in the task scheduling process. Tungsten is a memory allocation and release implementation that directly operates on system memory through a Memory Block data structure similar to the page cache of an operating system. It requests and releases off-heap memory precisely, accurately calculates the space occupied by serialized data, and reduces the difficulty of and errors in management. The data in a memory block is located in the virtual machine heap memory or in off-heap memory and mainly involves two attributes, obj and offset: the obj attribute of the memory block holds the address of the object in the virtual machine heap, the offset attribute holds the offset of the start address of the page cache relative to that object address, and the length attribute holds the size of the page cache. When Tungsten is in on-heap memory mode, the data is stored as an object in the virtual machine heap and the specific position of the data is located from the heap using the offset; in off-heap memory mode, the data is located in off-heap memory by the offset attribute, and a fixed-length contiguous memory block is obtained from the start positions given by obj and offset. If the requested memory block is larger than or equal to 1MB and a memory block of the specified size exists in the Memory Buffer Pools, the memory block is obtained from the memory cache pool; otherwise a memory block is allocated independently.
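A heavily simplified, purely illustrative model of this pooling policy is sketched below; it is not Spark's actual Tungsten implementation, only a toy rendering of the obj/offset/length attributes and the 1MB pooling rule described above.

```python
# Illustrative sketch only (not Spark's Tungsten code): a simplified memory block
# with obj/offset/length attributes and the pooling rule described above, where
# blocks of at least 1 MB are reused from a size-keyed buffer pool when available.
from collections import defaultdict
from dataclasses import dataclass
from typing import Optional

POOL_THRESHOLD = 1 << 20  # 1 MB

@dataclass
class MemoryBlock:
    obj: Optional[object]  # heap object holding the data (None in off-heap mode)
    offset: int            # start offset of the page relative to obj, or a raw address
    length: int            # page size in bytes

class BufferPools:
    def __init__(self):
        self._pools = defaultdict(list)  # block size -> free blocks of that size

    def allocate(self, size: int) -> MemoryBlock:
        pool = self._pools[size]
        if size >= POOL_THRESHOLD and pool:
            return pool.pop()                                # reuse a pooled block
        return MemoryBlock(obj=None, offset=0, length=size)  # otherwise allocate anew

    def free(self, block: MemoryBlock) -> None:
        if block.length >= POOL_THRESHOLD:
            self._pools[block.length].append(block)          # keep large blocks pooled

pools = BufferPools()
block = pools.allocate(2 * 1024 * 1024)
pools.free(block)
assert pools.allocate(2 * 1024 * 1024) is block   # the 2 MB block is reused from the pool
```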
Through the prediction results of the time prediction model, the method can be applied to the computational analysis and optimization of data mining, machine learning and deep learning on a distributed computing platform, effectively improves the utilization efficiency of computing resources, and at the same time realizes parameter configuration adjustment and overfitting optimization of the model.
Fig. 2 is a schematic structural diagram of a prediction model building apparatus based on machine learning according to an embodiment of the present specification, including:
the data extraction module 201 extracts a job data set from a target job, where the job data set includes: feature vector data and measurement index data;
in a preferred embodiment of the present invention, the data extraction module 201 extracts a job data set from a target job, where the job data set includes feature vector data and metric index data, and in this embodiment, the job data set is obtained through a REST interface, the feature vector data is extracted through a snooper bus mechanism, the metric index data is extracted through an index system, and an asynchronous thread is adopted to submit the job data set extraction event to a corresponding event snooper.
A data selecting module 202, configured to select the executable feature vector data and the metric index data from the job data set, and construct a time prediction model;
in a preferred embodiment of the present invention, the data selecting module 202 selects executable feature vector data and the metric index data from the job dataset, specifically, the feature selection method mainly includes two modes, namely, supervised mode and unsupervised mode, and the feature vector data is selected by using a feature selection method selected by chi-square, and the relevance is determined by performing chi-square test between the features and the real tags, as shown in table 1, the Task feature vector X source and value example are shown.
The numbers 1-3 represent network flow characteristics, the numbers 4-11 represent Shuffle and interface characteristics in the operation execution process, and the numbers 12-15 represent data scale characteristics; the acquisition of the job data set can be achieved through any one of a job scheduling page, a REST interface and an external monitoring tool, the feature vector data is extracted through a listener bus mechanism, the measurement index data is extracted through an index system, the extraction of the target job feature data is completed, and an event is submitted to a corresponding event listener through an asynchronous thread.
The selection of the metric index data needs to reflect data sets of different scales, the Shuffle and interface operations of different types of operators, and the time overhead of network traffic. The dynamic nature of the job execution time index is fully considered: the available resources of each job execution plan differ, resource competition exists when jobs run in parallel, and the job's garbage collection time, data serialization and deserialization time, and network transmission all carry a certain randomness and correlation; for example, the interface operations of operators during a Shuffle are time-consuming. The sources and values of the Task metric index Y are shown in Table 2.
The parameter configuration module 203 is used for performing parameter configuration on the time prediction model based on a cross validation method;
in the preferred embodiment of the invention, pruning is achieved by minimizing the loss function of the decision tree, which to a certain extent can avoid overfitting, and the decision tree stops growing when the node Depth equals the Max Depth parameter. The Min Info Gain parameter is the minimum value of splitting information Gain which needs to be improved, the decision tree stops growing when the information Gain is larger than the minimum value due to the fact that candidate items are not divided, machine learning can find the best hyper-parameter of a specific problem through a data set, the best hyper-parameter can be completed in an independent Estimator, or the best hyper-parameter can be completed in a workflow containing various algorithms and feature selection, model training and testing steps are repeatedly performed, K similar mutually exclusive subsets are selected by adopting a random sampling method, consistency of data distribution is kept as much as possible for each subset, models are trained and tested respectively, the problem of over-fitting is avoided by taking the mean value of the K models, the process is called cross validation, and stability and fidelity of an evaluation result depend on the value of K to a great extent.
When the cross validation method evaluates the model parameters, the average evaluation metric of the estimator over the different data-set pairs it was fitted on is calculated, the estimator is then re-fitted on the whole data set with those parameters, the optimal configuration parameters are found, and an optimal time prediction model with strong generalization capability and a relatively small error is trained on the whole training set.
And the goodness-of-fit adjustment module 204 is used for performing overfitting optimization on the time prediction model to obtain a final time prediction model.
In a preferred embodiment of the present invention, the goodness-of-fit adjustment module 204 iteratively trains the time prediction model based on a random forest algorithm and a gradient boosting tree algorithm. The random forest adopts the idea of the bootstrap aggregation (bagging) algorithm and trains a set of decision trees in parallel; the training process is randomized: the original data set is re-sampled at each iteration, a different random feature subset is divided at each tree node (a random subset containing K attributes is selected first, and the optimal attribute is then chosen from it), and the degree of introduced randomness is controlled by parameters. Because the random forest does not build the ensemble from model residuals, it can obtain a lower variance, and its prediction aggregates the predictions of the decision tree set: the prediction of each tree is counted as one vote, the category with the most votes is taken as the classification result, and the average value is taken as the regression result. This captures nonlinear features, improves discrimination precision, and avoids overfitting to a certain extent.
The gradient boosting tree algorithm iteratively trains decision trees and requires a longer training time than the random forest. At each iteration the current ensemble predicts the label of every training example, the prediction is compared with the real label, and the data set is then re-labelled so that the decision tree corrects the previous deviation in the next iteration; the deviation on the training data is thus further reduced at each iteration, and the re-labelling mechanism is determined by a loss function. The gradient boosting tree algorithm adjusts the job execution time prediction according to the observed job execution time and is easily affected by noise points; during training it is verified with a validation-set based training method, training stops when the improvement of the validation error does not exceed the tolerance set by the strategy, and overfitting can thus be effectively prevented.
Further, the apparatus further includes: when the operation data set is larger than a preset training data set, hyper-parameter adjustment is carried out on the time prediction model by adopting a method of randomly dividing the training set and the test set, and parameter configuration is carried out on the time prediction model.
Further, the extracting the job data set in the target job includes:
acquiring the operation data set through any one of an operation scheduling page, a REST interface and an external monitoring tool;
extracting the feature vector data by a listener bus mechanism and the metric data by an indicator system.
Further, the overfitting optimization of the temporal prediction model includes:
iteratively training the temporal prediction model based on a combinatorial algorithm, the combinatorial algorithm comprising: a random forest algorithm and a gradient lifting tree algorithm;
verifying the result of the iterative training by a training method of a verification set;
and stopping the iterative training when the verification result is lower than the tolerance set by the strategy, and obtaining the time prediction model after overfitting optimization.
Further, the feature vector data is selected according to a chi-square selection method.
Further, the metric index is selected according to job data sets of different sizes, the Shuffle and interface operations of different types of operators, and the time overhead of network traffic.
Through the prediction results of the time prediction model, the method can be applied to the computational analysis and optimization of data mining, machine learning and deep learning on a distributed computing platform, effectively improves the utilization efficiency of computing resources, and at the same time realizes parameter configuration adjustment and overfitting optimization of the model.
Based on the same inventive concept, the embodiment of the specification further provides the electronic equipment.
In the following, embodiments of the electronic device of the present invention are described, which may be regarded as specific physical implementations for the above-described embodiments of the method and apparatus of the present invention. Details described in the embodiments of the electronic device of the invention should be considered supplementary to the embodiments of the method or apparatus described above; for details not disclosed in the embodiments of the electronic device of the invention, reference may be made to the above-described embodiments of the method or the apparatus.
Fig. 3 is a schematic structural diagram of an electronic device provided in an embodiment of the present specification. An electronic device 300 according to this embodiment of the invention is described below with reference to fig. 3. The electronic device 300 shown in fig. 3 is only an example, and should not bring any limitation to the functions and the scope of use of the embodiments of the present invention.
As shown in fig. 3, electronic device 300 is embodied in the form of a general purpose computing device. The components of electronic device 300 may include, but are not limited to: at least one processing unit 310, at least one memory unit 320, a bus 330 connecting different device components (including the memory unit 320 and the processing unit 310), a display unit 340, and the like.
Wherein the storage unit stores program code executable by the processing unit 310 to cause the processing unit 310 to perform the steps according to various exemplary embodiments of the present invention described in the above-mentioned processing method section of the present specification. For example, the processing unit 310 may perform the steps as shown in fig. 1.
The storage unit 320 may include readable media in the form of volatile storage units, such as a random access storage unit (RAM)3201 and/or a cache storage unit 3202, and may further include a read-only storage unit (ROM) 3203.
The memory unit 320 may also include a program/utility 3204 having a set (at least one) of program modules 3205, such program modules 3205 including, but not limited to: an operating system, one or more application programs, other program modules, and program data, each of which, or some combination thereof, may comprise an implementation of a network environment.
Bus 330 may be one or more of several types of bus structures, including a memory unit bus or memory unit controller, a peripheral bus, an accelerated graphics port, a processing unit, or a local bus using any of a variety of bus architectures.
The electronic device 300 may also communicate with one or more external devices 400 (e.g., keyboard, pointing device, bluetooth device, etc.), with one or more devices that enable a user to interact with the electronic device 300, and/or with any devices (e.g., router, modem, etc.) that enable the electronic device 300 to communicate with one or more other computing devices. Such communication may occur via an input/output (I/O) interface 350. Also, the electronic device 300 may communicate with one or more networks (e.g., a Local Area Network (LAN), a Wide Area Network (WAN), and/or a public network, such as the internet) via the network adapter 360. Network adapter 360 may communicate with other modules of electronic device 300 via bus 330. It should be appreciated that although not shown in FIG. 3, other hardware and/or software modules may be used in conjunction with electronic device 300, including but not limited to: microcode, device drivers, redundant processing units, external disk drive arrays, RAID devices, tape drives, and data backup storage devices, to name a few.
Through the above description of the embodiments, those skilled in the art will readily understand that the exemplary embodiments of the present invention described herein may be implemented by software, or by software in combination with necessary hardware. Therefore, the technical solution according to the embodiment of the present invention can be embodied in the form of a software product, which can be stored in a computer-readable storage medium (which can be a CD-ROM, a usb disk, a removable hard disk, etc.) or on a network, and includes several instructions to make a computing device (which can be a personal computer, a server, or a network device, etc.) execute the above-mentioned method according to the present invention. When executed by a data processing device, the computer program enables the computer-readable medium to implement the above-described method of the invention, namely: such as the method shown in fig. 1.
Fig. 4 is a schematic diagram of a computer-readable medium provided in an embodiment of the present disclosure.
A computer program implementing the method shown in fig. 1 may be stored on one or more computer readable media. The computer readable medium may be a readable signal medium or a readable storage medium. The readable storage medium may be, for example, but not limited to, an electronic, magnetic, optical, electromagnetic, infrared, or semiconductor system, apparatus, or device, or any combination of the foregoing. More specific examples (a non-exhaustive list) of the readable storage medium include: an electrical connection having one or more wires, a portable disk, a hard disk, a Random Access Memory (RAM), a read-only memory (ROM), an erasable programmable read-only memory (EPROM or flash memory), an optical fiber, a portable compact disc read-only memory (CD-ROM), an optical storage device, a magnetic storage device, or any suitable combination of the foregoing.
A readable signal medium may include a propagated data signal with readable program code embodied therein, for example, in baseband or as part of a carrier wave. Such a propagated data signal may take many forms, including, but not limited to, electro-magnetic, optical, or any suitable combination thereof. A readable signal medium may also be any readable medium that is not a readable storage medium and that can communicate, propagate, or transport a program for use by or in connection with an instruction execution system, apparatus, or device. Program code embodied on a readable medium may be transmitted using any appropriate medium, including but not limited to wireless, wireline, optical fiber cable, RF, etc., or any suitable combination of the foregoing.
Program code for carrying out operations for aspects of the present invention may be written in any combination of one or more programming languages, including an object oriented programming language such as Java, C++ or the like and conventional procedural programming languages, such as the "C" programming language or similar programming languages. The program code may execute entirely on the user's computing device, partly on the user's device, as a stand-alone software package, partly on the user's computing device and partly on a remote computing device, or entirely on the remote computing device or server. In the case of a remote computing device, the remote computing device may be connected to the user computing device through any kind of network, including a Local Area Network (LAN) or a Wide Area Network (WAN), or may be connected to an external computing device (e.g., through the internet using an internet service provider).
In summary, the invention may be implemented in hardware, or in software modules running on one or more processors, or in a combination thereof. Those skilled in the art will appreciate that some or all of the functionality of some or all of the components in accordance with embodiments of the present invention may be implemented in practice using general purpose data processing equipment such as a microprocessor or a Digital Signal Processor (DSP). The present invention may also be embodied as apparatus or device programs (e.g., computer programs and computer program products) for performing a portion or all of the methods described herein. Such programs implementing the present invention may be stored on computer-readable media or may be in the form of one or more signals. Such a signal may be downloaded from an internet website or provided on a carrier signal or in any other form.
While the foregoing embodiments have described the objects, aspects and advantages of the present invention in further detail, it should be understood that the present invention is not inherently related to any particular computer, virtual machine or electronic device, and various general-purpose machines may be used to implement the present invention. The present invention is not limited to the above embodiments, and any modifications, equivalent replacements, improvements, etc. within the spirit and principle of the present invention should be included in the protection scope of the present invention.
The embodiments in the present specification are described in a progressive manner, and portions that are similar to each other in the embodiments are referred to each other, and each embodiment focuses on differences from other embodiments.
The above description is only an example of the present application and is not intended to limit the present application. Various modifications and changes may occur to those skilled in the art. Any modification, equivalent replacement, improvement, etc. made within the spirit and principle of the present application should be included in the scope of the claims of the present application.

Claims (10)

1. A prediction model construction method based on machine learning is characterized by comprising the following steps:
extracting a job data set in a target job, the job data set comprising: feature vector data and measurement index data;
selecting the executable feature vector data and the executable measurement index data from the operation data set to construct a time prediction model;
performing parameter configuration on the time prediction model based on a cross validation method;
and performing overfitting optimization on the time prediction model to obtain a final time prediction model.
2. The method of machine learning-based predictive model construction according to claim 1, further comprising: and when the operation data set is larger than a preset training data set, carrying out hyper-parameter adjustment on the time prediction model by adopting a method of randomly dividing the training set and the test set, and carrying out parameter configuration on the time prediction model.
3. The method for constructing prediction model based on machine learning according to claim 1 or 2, wherein the extracting of job data set in target job comprises:
acquiring the operation data set in any mode of an operation scheduling page, an REST interface and an external monitoring tool;
extracting the feature vector data by a listener bus mechanism and the metric data by an indicator system.
4. A method of machine learning based prediction model construction according to any of claims 1-3, wherein said overfitting optimization of the temporal prediction model comprises:
iteratively training the temporal prediction model based on a combinatorial algorithm, the combinatorial algorithm comprising: a random forest algorithm and a gradient lifting tree algorithm;
verifying the result of the iterative training by a training method of a verification set;
and stopping the iterative training when the verification result is lower than the tolerance set by the strategy, and obtaining the time prediction model after overfitting optimization.
5. The method of any one of claims 1-4, wherein the feature vector data is selected according to a chi-squared selection method.
6. The machine-learning-based predictive model building method of any one of claims 1-5, wherein the metric is selected according to the job data sets of different sizes, the Shuffle and interface operations of the different types of job data sets, and the time overhead of network traffic.
7. A prediction model construction device based on machine learning is characterized by comprising:
a data extraction module that extracts a job data set in a target job, the job data set comprising: feature vector data and measurement index data;
the data selection module is used for selecting the executable feature vector data and the measurement index data in the operation data set and constructing a time prediction model;
the parameter configuration module is used for carrying out parameter configuration on the time prediction model based on a cross verification method;
and the goodness-of-fit adjustment module is used for carrying out overfitting optimization on the time prediction model to obtain a final time prediction model.
8. The machine-learning-based predictive model building apparatus of claim 7, further comprising: and when the operation data set is larger than a preset training data set, carrying out hyper-parameter adjustment on the time prediction model by adopting a method of randomly dividing the training set and the test set, and carrying out parameter configuration on the time prediction model.
9. An electronic device, wherein the electronic device comprises:
a processor and a memory storing computer-executable instructions that, when executed, cause the processor to perform the method of any of claims 1-6.
10. A computer readable storage medium, wherein the computer readable storage medium stores one or more programs which, when executed by a processor, implement the method of any of claims 1-6.
CN202011177483.4A 2020-10-29 2020-10-29 Prediction model construction method and device based on machine learning and electronic equipment Pending CN112287603A (en)

Priority Applications (1)

Application Number Priority Date Filing Date Title
CN202011177483.4A CN112287603A (en) 2020-10-29 2020-10-29 Prediction model construction method and device based on machine learning and electronic equipment

Applications Claiming Priority (1)

Application Number Priority Date Filing Date Title
CN202011177483.4A CN112287603A (en) 2020-10-29 2020-10-29 Prediction model construction method and device based on machine learning and electronic equipment

Publications (1)

Publication Number Publication Date
CN112287603A true CN112287603A (en) 2021-01-29

Family

ID=74373743

Family Applications (1)

Application Number Title Priority Date Filing Date
CN202011177483.4A Pending CN112287603A (en) 2020-10-29 2020-10-29 Prediction model construction method and device based on machine learning and electronic equipment

Country Status (1)

Country Link
CN (1) CN112287603A (en)

Cited By (5)

* Cited by examiner, † Cited by third party
Publication number Priority date Publication date Assignee Title
CN113377830A (en) * 2021-05-21 2021-09-10 北京沃东天骏信息技术有限公司 Method for determining hyper-parameters, method for training federal learning model and electronic equipment
CN113672169A (en) * 2021-07-19 2021-11-19 浙江大华技术股份有限公司 Data reading and writing method of stream processing system and stream processing system
CN113704599A (en) * 2021-07-14 2021-11-26 大箴(杭州)科技有限公司 Marketing conversion user prediction method and device and computer equipment
CN113988488A (en) * 2021-12-27 2022-01-28 上海一嗨成山汽车租赁南京有限公司 Method for predicting ETC passing probability of vehicle by multiple factors
CN114358445A (en) * 2022-03-21 2022-04-15 山东建筑大学 Business process residual time prediction model recommendation method and system

Citations (3)

* Cited by examiner, † Cited by third party
Publication number Priority date Publication date Assignee Title
CN108647137A (en) * 2018-05-10 2018-10-12 华东师范大学 A kind of transaction capabilities prediction technique, device, medium, equipment and system
US20190007410A1 (en) * 2017-06-30 2019-01-03 Futurewei Technologies, Inc. Quasi-agentless cloud resource management
CN111126668A (en) * 2019-11-28 2020-05-08 中国人民解放军国防科技大学 Spark operation time prediction method and device based on graph convolution network

Patent Citations (3)

* Cited by examiner, † Cited by third party
Publication number Priority date Publication date Assignee Title
US20190007410A1 (en) * 2017-06-30 2019-01-03 Futurewei Technologies, Inc. Quasi-agentless cloud resource management
CN108647137A (en) * 2018-05-10 2018-10-12 华东师范大学 A kind of transaction capabilities prediction technique, device, medium, equipment and system
CN111126668A (en) * 2019-11-28 2020-05-08 中国人民解放军国防科技大学 Spark operation time prediction method and device based on graph convolution network

Non-Patent Citations (4)

* Cited by examiner, † Cited by third party
Title
张鹏 (ZHANG PENG): "Research on Straggler Strategies for Heterogeneous Spark Clusters", China Master's Theses Full-text Database, Information Science and Technology, no. 01, pages 1-2 *
李克果 (LI KEGUO): "Research and Implementation of a Data Support Tool for Cloud Platform Optimization", China Master's and Doctoral Theses Full-text Database (Master), Information Science and Technology, no. 01, 15 January 2019 (2019-01-15), pages 2-4 *
李克果 (LI KEGUO): "Research and Implementation of a Data Support Tool for Cloud Platform Optimization", China Master's and Doctoral Theses Full-text Database (Master), Information Science and Technology, no. 01, pages 2-4 *
雷祖尔•卡里姆 (REZAUL KARIM): "Scala Machine Learning: Build Real-World Machine Learning and Deep Learning Projects", pages 42-43 *

Cited By (5)

* Cited by examiner, † Cited by third party
Publication number Priority date Publication date Assignee Title
CN113377830A (en) * 2021-05-21 2021-09-10 北京沃东天骏信息技术有限公司 Method for determining hyper-parameters, method for training federal learning model and electronic equipment
CN113704599A (en) * 2021-07-14 2021-11-26 大箴(杭州)科技有限公司 Marketing conversion user prediction method and device and computer equipment
CN113672169A (en) * 2021-07-19 2021-11-19 浙江大华技术股份有限公司 Data reading and writing method of stream processing system and stream processing system
CN113988488A (en) * 2021-12-27 2022-01-28 上海一嗨成山汽车租赁南京有限公司 Method for predicting ETC passing probability of vehicle by multiple factors
CN114358445A (en) * 2022-03-21 2022-04-15 山东建筑大学 Business process residual time prediction model recommendation method and system

Similar Documents

Publication Publication Date Title
CN112287603A (en) Prediction model construction method and device based on machine learning and electronic equipment
KR102485179B1 (en) Method, device, electronic device and computer storage medium for determining description information
CN111143226B (en) Automatic test method and device, computer readable storage medium and electronic equipment
CN109891438B (en) Numerical quantum experiment method and system
JP5791149B2 (en) Computer-implemented method, computer program, and data processing system for database query optimization
CN110995459A (en) Abnormal object identification method, device, medium and electronic equipment
CN110727437A (en) Code optimization item acquisition method and device, storage medium and electronic equipment
Li et al. A scenario-based approach to predicting software defects using compressed C4. 5 model
CN113609008B (en) Test result analysis method and device and electronic equipment
JP2023036773A (en) Data processing method, data processing apparatus, electronic apparatus, storage medium and computer program
CN111913931A (en) Method and device for constructing vehicle fault database, storage medium and electronic equipment
CN111582488A (en) Event deduction method and device
US11886779B2 (en) Accelerated simulation setup process using prior knowledge extraction for problem matching
CN112783508B (en) File compiling method, device, equipment and storage medium
US11593700B1 (en) Network-accessible service for exploration of machine learning models and results
CN116560984A (en) Test case clustering grouping method based on call dependency graph
CN112115234A (en) Question bank analysis method and device
CN108959454B (en) Prompting clause specifying method, device, equipment and storage medium
US11868436B1 (en) Artificial intelligence system for efficient interactive training of machine learning models
CN111400414A (en) Decision-making method and system based on standardized enterprise data and electronic equipment
Asaduzzaman Visualization and analysis of software clones
CN113656292B (en) Multi-dimensional cross-space-time basic software performance bottleneck detection method
CN112286990A (en) Method and device for predicting platform operation execution time and electronic equipment
CN114610648A (en) Test method, device and equipment
Banu et al. Study of software reusability in software components

Legal Events

Date Code Title Description
PB01 Publication
SE01 Entry into force of request for substantive examination
RJ01 Rejection of invention patent application after publication

Application publication date: 20210129