CN117076098B - Dynamic tensor compiling optimization method and device, electronic equipment and medium - Google Patents
Dynamic tensor compiling optimization method and device, electronic equipment and medium
- Publication number
- CN117076098B CN202310538630.3A CN202310538630A
- Authority
- CN
- China
- Prior art keywords
- microkernel
- target
- candidate
- dynamic
- tensor
- Prior art date
- Legal status (The legal status is an assumption and is not a legal conclusion. Google has not performed a legal analysis and makes no representation as to the accuracy of the status listed.)
- Active
Links
- 238000000034 method Methods 0.000 title claims abstract description 66
- 238000005457 optimization Methods 0.000 title claims abstract description 49
- 238000013210 evaluation model Methods 0.000 claims abstract description 133
- 238000012216 screening Methods 0.000 claims abstract description 29
- 238000012549 training Methods 0.000 claims description 48
- 230000006870 function Effects 0.000 claims description 30
- 238000009826 distribution Methods 0.000 claims description 25
- 230000008569 process Effects 0.000 claims description 10
- 230000008859 change Effects 0.000 claims description 9
- 238000004590 computer program Methods 0.000 claims description 4
- 238000012512 characterization method Methods 0.000 claims 1
- 230000000694 effects Effects 0.000 abstract description 19
- 238000013508 migration Methods 0.000 description 26
- 230000005012 migration Effects 0.000 description 26
- 238000013135 deep learning Methods 0.000 description 19
- 238000004422 calculation algorithm Methods 0.000 description 17
- 238000005259 measurement Methods 0.000 description 11
- 230000002829 reductive effect Effects 0.000 description 9
- 238000010586 diagram Methods 0.000 description 8
- 238000011156 evaluation Methods 0.000 description 6
- 238000004364 calculation method Methods 0.000 description 5
- 238000011161 development Methods 0.000 description 4
- 230000018109 developmental process Effects 0.000 description 4
- 238000003058 natural language processing Methods 0.000 description 4
- 238000010845 search algorithm Methods 0.000 description 4
- 238000012360 testing method Methods 0.000 description 4
- 238000012795 verification Methods 0.000 description 4
- 230000015556 catabolic process Effects 0.000 description 3
- 238000004891 communication Methods 0.000 description 3
- 230000008878 coupling Effects 0.000 description 3
- 238000010168 coupling process Methods 0.000 description 3
- 238000005859 coupling reaction Methods 0.000 description 3
- 238000006731 degradation reaction Methods 0.000 description 3
- 230000001133 acceleration Effects 0.000 description 2
- 230000003044 adaptive effect Effects 0.000 description 2
- 230000006872 improvement Effects 0.000 description 2
- 238000012423 maintenance Methods 0.000 description 2
- 239000011159 matrix material Substances 0.000 description 2
- 238000005070 sampling Methods 0.000 description 2
- 238000013526 transfer learning Methods 0.000 description 2
- 238000002679 ablation Methods 0.000 description 1
- 238000004458 analytical method Methods 0.000 description 1
- 230000005540 biological transmission Effects 0.000 description 1
- 230000003247 decreasing effect Effects 0.000 description 1
- 238000013136 deep learning model Methods 0.000 description 1
- 230000000593 degrading effect Effects 0.000 description 1
- 230000001934 delay Effects 0.000 description 1
- 238000013461 design Methods 0.000 description 1
- 238000005516 engineering process Methods 0.000 description 1
- 238000002474 experimental method Methods 0.000 description 1
- 230000014509 gene expression Effects 0.000 description 1
- 230000000670 limiting effect Effects 0.000 description 1
- 238000012417 linear regression Methods 0.000 description 1
- 238000013507 mapping Methods 0.000 description 1
- 230000003287 optical effect Effects 0.000 description 1
- 230000036961 partial effect Effects 0.000 description 1
- 238000012545 processing Methods 0.000 description 1
- 230000000135 prohibitive effect Effects 0.000 description 1
- 238000011160 research Methods 0.000 description 1
- 230000002441 reversible effect Effects 0.000 description 1
- 230000011218 segmentation Effects 0.000 description 1
- 230000003068 static effect Effects 0.000 description 1
Classifications
-
- G—PHYSICS
- G06—COMPUTING; CALCULATING OR COUNTING
- G06F—ELECTRIC DIGITAL DATA PROCESSING
- G06F9/00—Arrangements for program control, e.g. control units
- G06F9/06—Arrangements for program control, e.g. control units using stored programs, i.e. using an internal store of processing equipment to receive or retain programs
- G06F9/46—Multiprogramming arrangements
- G06F9/50—Allocation of resources, e.g. of the central processing unit [CPU]
- G06F9/5005—Allocation of resources, e.g. of the central processing unit [CPU] to service a request
- G06F9/5011—Allocation of resources, e.g. of the central processing unit [CPU] to service a request the resources being hardware resources other than CPUs, Servers and Terminals
- G06F9/5016—Allocation of resources, e.g. of the central processing unit [CPU] to service a request the resources being hardware resources other than CPUs, Servers and Terminals the resource being the memory
-
- G—PHYSICS
- G06—COMPUTING; CALCULATING OR COUNTING
- G06F—ELECTRIC DIGITAL DATA PROCESSING
- G06F17/00—Digital computing or data processing equipment or methods, specially adapted for specific functions
- G06F17/10—Complex mathematical operations
- G06F17/16—Matrix or vector computation, e.g. matrix-matrix or matrix-vector multiplication, matrix factorization
-
- G—PHYSICS
- G06—COMPUTING; CALCULATING OR COUNTING
- G06N—COMPUTING ARRANGEMENTS BASED ON SPECIFIC COMPUTATIONAL MODELS
- G06N3/00—Computing arrangements based on biological models
- G06N3/02—Neural networks
- G06N3/04—Architecture, e.g. interconnection topology
-
- G—PHYSICS
- G06—COMPUTING; CALCULATING OR COUNTING
- G06N—COMPUTING ARRANGEMENTS BASED ON SPECIFIC COMPUTATIONAL MODELS
- G06N3/00—Computing arrangements based on biological models
- G06N3/02—Neural networks
- G06N3/06—Physical realisation, i.e. hardware implementation of neural networks, neurons or parts of neurons
- G06N3/063—Physical realisation, i.e. hardware implementation of neural networks, neurons or parts of neurons using electronic means
Landscapes
- Engineering & Computer Science (AREA)
- Physics & Mathematics (AREA)
- Theoretical Computer Science (AREA)
- General Physics & Mathematics (AREA)
- Software Systems (AREA)
- Mathematical Physics (AREA)
- Data Mining & Analysis (AREA)
- General Engineering & Computer Science (AREA)
- Life Sciences & Earth Sciences (AREA)
- Biomedical Technology (AREA)
- Biophysics (AREA)
- Computing Systems (AREA)
- Health & Medical Sciences (AREA)
- Molecular Biology (AREA)
- General Health & Medical Sciences (AREA)
- Evolutionary Computation (AREA)
- Computational Linguistics (AREA)
- Artificial Intelligence (AREA)
- Computational Mathematics (AREA)
- Mathematical Analysis (AREA)
- Mathematical Optimization (AREA)
- Pure & Applied Mathematics (AREA)
- Neurology (AREA)
- Algebra (AREA)
- Databases & Information Systems (AREA)
- Devices For Executing Special Programs (AREA)
Abstract
The application provides a dynamic tensor compiling optimization method, a device, electronic equipment and a medium. Before a target model is deployed on a target hardware platform, the method acquires various hardware resource parameters of the target hardware platform; screens out a plurality of candidate microkernels meeting preset microkernel conditions according to the hardware resource parameters of the target hardware platform; predicts, through a target microkernel evaluation model, the performance of each candidate microkernel when it is used by the dynamic tensor in the target model, so as to obtain a performance score for each candidate microkernel; and screens out the candidate microkernel with the highest performance score as the target microkernel of the dynamic tensor of the target model, so that the dynamic tensor in the target model uses the target microkernel. This improves both the tuning effect and the tuning efficiency of the dynamic tensor.
Description
Technical Field
The application relates to the field of deep learning algorithms, and in particular to a dynamic tensor compiling optimization method and device, an electronic device and a medium.
Background
In recent years, the demand of deep learning applications for hardware computing power has grown exponentially. GPUs, TPUs and domain-specific accelerators, as hardware platforms, continue to improve the performance of deep learning applications in academia and industry. However, to achieve the performance inherent to the hardware, the deep learning algorithm has to be fine-tuned for that hardware. For example, hardware vendors hand-develop operator libraries for deep learning applications, such as cuBLAS, cuDNN and CUTLASS. Because both algorithms and hardware evolve rapidly, developing and maintaining such operator libraries is expensive and time-consuming.
Deep learning compilers (TVM, TC, Halide, MLIR) can implement high-performance deep learning algorithms for accelerators. Search-based deep learning compilers, such as Ansor, can automatically optimize some tensor programs to obtain the best performance. However, the performance of a tensor program is shape-sensitive: if tensor shapes are not available at compile time but only at runtime, efficient execution cannot be guaranteed for all possible shapes. For example, in computer vision algorithms (ViT), input features may have different resolutions; in speech recognition algorithms (DeepSpeech2), the length of the input speech is unpredictable; in natural language processing algorithms (BERT), the input sentence may vary from one word to hundreds of words. These are the so-called dynamic shape tensors. For such dynamic shape inputs, current compilers can only select the smallest tensor program in the existing operator library that is larger than the current shape and pad the part that exceeds the input shape with data, which leads to serious performance degradation. While an auto-tuner can automatically optimize a tensor program, the economic cost of performing the optimization for all possible shapes is prohibitive.
Disclosure of Invention
In view of the above, the present application aims to provide a dynamic tensor compiling optimization method, a device, an electronic device and a medium, which use hardware resource information to improve adaptability to the hardware, improve the efficiency of dynamic tensor compiling optimization in a deep learning model, and save optimization time and economic cost.
The embodiment of the application provides a dynamic tensor compiling optimization method, which comprises the following steps:
Before the target model is deployed on a target hardware platform, acquiring various hardware resource parameters of the target hardware platform; the hardware resource parameters comprise the maximum thread number in a thread block, the thread number in a thread cluster and the cache capacity;
Screening a plurality of candidate microkernels meeting preset microkernel conditions from a pre-constructed microkernel library according to the hardware resource parameters of the target hardware platform; the preset microkernel conditions are set according to at least one hardware resource parameter;
inputting the screened multiple candidate microkernels into a pre-trained target microkernel evaluation model, and predicting the performance of each candidate microkernel by using the dynamic tensor in the target model through the target microkernel evaluation model to obtain the performance score of each candidate microkernel; the target microkernel evaluation model is obtained based on target hardware platform training;
and screening out the candidate microkernel with the highest performance score as the target microkernel of the dynamic tensor of the target model, so as to obtain the dynamic tensor using the target microkernel in the target model.
In some embodiments, an embodiment of the present application also provides a dynamic tensor compilation optimization device, where the device includes:
The acquisition module is used for acquiring various hardware resource parameters of the target hardware platform before the target model is deployed on the target hardware platform; the hardware resource parameters comprise the maximum thread number in a thread block, the thread number in a thread cluster and the cache capacity;
The first screening module is used for screening a plurality of candidate microkernels meeting preset microkernel conditions from a pre-constructed microkernel library according to the hardware resource parameters of the target hardware platform; the preset microkernel conditions are set according to at least one hardware resource parameter;
The prediction module is used for inputting the screened multiple candidate microkernels into a pre-trained target microkernel evaluation model, and predicting the performance of each candidate microkernel by using the dynamic tensor in the target model through the target microkernel evaluation model to obtain the performance score of each candidate microkernel; the target microkernel evaluation model is obtained based on target hardware platform training;
And the second screening module is used for screening out the candidate microkernel with the highest performance score as the target microkernel of the dynamic tensor of the target model, so as to obtain the dynamic tensor using the target microkernel in the target model.
In some embodiments, the present application further provides an electronic device, including: the system comprises a processor, a memory and a bus, wherein the memory stores machine-readable instructions executable by the processor, the processor and the memory are communicated through the bus when the electronic device runs, and the machine-readable instructions are executed by the processor to execute the steps of the dynamic tensor compiling optimization method.
In some embodiments, the present application also provides a computer-readable storage medium having stored thereon a computer program which, when executed by a processor, performs the steps of the dynamic tensor compiling optimization method.
The application provides a dynamic tensor compiling optimization method, a device, electronic equipment and a medium. Before a target model is deployed on a target hardware platform, the method acquires various hardware resource parameters of the target hardware platform; screens out a plurality of candidate microkernels meeting preset microkernel conditions from a pre-constructed microkernel library according to the hardware resource parameters of the target hardware platform, the preset microkernel conditions being set according to at least one hardware resource parameter; inputs the screened candidate microkernels into a pre-trained target microkernel evaluation model, which predicts the performance of each candidate microkernel when it is used by the dynamic tensor in the target model to obtain a performance score for each candidate microkernel, the target microkernel evaluation model being trained for the target hardware platform; and screens out the candidate microkernel with the highest performance score as the target microkernel of the dynamic tensor of the target model, so that the dynamic tensor in the target model uses the target microkernel. In this way, a hardware-friendly microkernel candidate set is selected as the search space according to multiple kinds of hardware resource information, so that the search space is reasonably compressed and tuning time is effectively reduced; a hardware-resource-sensitive cost model is then used to evaluate the hardware resource utilization level and memory access efficiency of the different microkernels, the microkernel matching the hardware resources is accurately selected from the microkernel candidate set, and the tuning effect of the dynamic tensor is improved.
Drawings
In order to more clearly illustrate the technical solutions of the embodiments of the present application, the drawings that are needed in the embodiments will be briefly described below, it being understood that the following drawings only illustrate some embodiments of the present application and therefore should not be considered as limiting the scope, and other related drawings may be obtained according to these drawings without inventive effort for a person skilled in the art.
FIG. 1 is a flow chart of a method for dynamic tensor compilation optimization according to an embodiment of the present application;
FIG. 2 is a flow chart showing steps performed by the tuning system HAOTuner according to an embodiment of the present application;
FIG. 3 is a schematic diagram showing the effect of microkernel size on dynamic tensor program performance according to the present embodiment;
FIG. 4 is a schematic diagram showing the effect of cache utilization on microkernel parameter settings according to the present embodiment;
FIG. 5 shows a method flow diagram of a microkernel search algorithm in accordance with an embodiment of the present application;
FIG. 6 illustrates the ratio of memory access time to total program execution time for different microkernels measured on an NVIDIA Tesla T4, in accordance with an embodiment of the present application;
FIG. 7 illustrates the process by which HAOTuner migrates the microkernel evaluation model from a source hardware platform to a target hardware platform, in accordance with an embodiment of the present application;
FIG. 8 illustrates a latency comparison of various compilers across dynamic sequence lengths on the BERT-base model according to an embodiment of the present application;
FIG. 9 illustrates the runtime consumption of the dense operator to generate tensors at different sequence lengths according to an embodiment of the present application;
FIG. 10 illustrates the runtime consumption of the BatchMatmul operator to generate tensors at different sequence lengths according to an embodiment of the present application;
FIG. 11 shows the tuning time comparison of HAOTuner and DietCode on the BERT model according to an embodiment of the application;
FIG. 12 shows the tuning time comparison of HAOTuner and DietCode on the Dense operator according to an embodiment of the present application;
FIG. 13 is a block diagram of a dynamic tensor compilation optimization device according to an embodiment of the present application;
fig. 14 shows a schematic structural diagram of an electronic device according to an embodiment of the present application.
Detailed Description
For the purpose of making the objects, technical solutions and advantages of the embodiments of the present application more apparent, the technical solutions of the embodiments of the present application will be clearly and completely described with reference to the accompanying drawings in the embodiments of the present application, and it should be understood that the drawings in the present application are for the purpose of illustration and description only and are not intended to limit the scope of the present application. In addition, it should be understood that the schematic drawings are not drawn to scale. A flowchart, as used in this disclosure, illustrates operations implemented according to some embodiments of the present application. It should be understood that the operations of the flow diagrams may be implemented out of order and that steps without logical context may be performed in reverse order or concurrently. Moreover, one or more other operations may be added to or removed from the flow diagrams by those skilled in the art under the direction of the present disclosure.
In addition, the described embodiments are only some, but not all, embodiments of the application. The components of the embodiments of the present application generally described and illustrated in the figures herein may be arranged and designed in a wide variety of different configurations. Thus, the following detailed description of the embodiments of the application, as presented in the figures, is not intended to limit the scope of the application, as claimed, but is merely representative of selected embodiments of the application. All other embodiments, which can be made by a person skilled in the art without making any inventive effort, are intended to be within the scope of the present application.
It should be noted that the term "comprising" will be used in embodiments of the application to indicate the presence of the features stated hereafter, but not to exclude the addition of other features.
In recent years, the demand of deep learning applications for hardware computing power has grown exponentially. GPUs, TPUs and domain-specific accelerators, as hardware platforms, continue to improve the performance of deep learning applications in academia and industry. However, to achieve the performance inherent to the hardware, the deep learning algorithm has to be fine-tuned for that hardware. For example, hardware vendors hand-develop operator libraries for deep learning applications, such as cuBLAS, cuDNN and CUTLASS. Because both algorithms and hardware evolve rapidly, developing and maintaining such operator libraries is expensive and time-consuming.
Deep learning algorithms are typically developed using high-level programming languages such as Python, C or C++, and deep learning compilers represent the algorithms through intermediate representations in the form of computational graphs, in which the vertices are called deep learning operators, e.g. Conv2D, Matmul, etc. For each operator in these computational graphs, the compiler needs to find the corresponding optimized implementation from the operator library provided by the hardware vendor, ensuring that the computational graph is executed efficiently on the hardware.
Deep learning compilers (TVM, TC, Halide, MLIR) can implement high-performance deep learning algorithms for accelerators. Search-based deep learning compilers, such as Ansor, can automatically optimize some tensor programs to obtain the best performance. However, the performance of a tensor program is shape-sensitive: if tensor shapes are not available at compile time but only at runtime, efficient execution cannot be guaranteed for all possible shapes. For example, in computer vision algorithms (ViT), input features may have different resolutions; in speech recognition algorithms (DeepSpeech2), the length of the input speech is unpredictable; in natural language processing algorithms (BERT), the input sentence may vary from one word to hundreds of words. These are the so-called dynamic shape tensors. A dynamic shape tensor may also be referred to as a dynamic tensor program, a dynamic tensor, or the like.
For such dynamic shape inputs, the current compiler can only select the smallest tensor program in the existing operator library that is larger than the current shape; the compiler then pads the tensor program with additional bytes to match the smallest shape that is larger than the input shape, which results in serious performance degradation. Although an auto-tuner can in theory explore the whole search space and find the best tensor program template for every possible shape, the time and computational cost consumed by such automatic optimization is not negligible, and the economic cost of performing the optimization for all possible shapes is high.
In current research on auto-tuning, such as DietCode, dynamic shape tensors are supported by designing an intermediate unit, the microkernel, that represents hardware resources. DietCode maps computing tasks onto microkernels, which are then dispatched to the actual hardware for execution. That is, DietCode establishes a mapping between the tensor program and the hardware resources. A dynamic tensor program can be divided into a number of microkernels that are dispatched to the CUDA cores of the GPU, and microkernels can compose dynamic tensor programs of different shapes. Although such microkernels work well on specific hardware, once the hardware platform is replaced, the originally assumed performance can no longer be achieved.
Microkernels are the best solution for handling dynamic tensors, but DietCode does not consider their performance on different hardware. Illustratively, the tuning time of the dense operator in the BERT model on different devices was evaluated using DietCode; the results show that the tuning time increases significantly as hardware resources decrease. The reasons for the performance degradation are mainly the following two: 1. the cost model used in DietCode does not take into account the differences in computing resources between different hardware; 2. DietCode only considers the shape of the tensor when selecting microkernels for the dynamic tensor program, and ignores the hardware resources.
Therefore, it makes sense to add hardware resource information to the auto-tuner of the dynamic tensor program to improve its adaptability to the hardware. The present application provides a dynamic tensor compiling optimization method, a device, an electronic device and a medium, which consider both the tensor shape and the hardware resources when selecting microkernels for a dynamic tensor program, and take into account the differences in computing resources between different hardware, thereby improving the tuning efficiency of the dynamic tensor and reducing the tuning time.
Specifically, the application provides a dynamic tensor compiling optimization method, a device, electronic equipment and a medium. Before a target model is deployed on a target hardware platform, the method acquires various hardware resource parameters of the target hardware platform; screens out a plurality of candidate microkernels meeting preset microkernel conditions from a pre-constructed microkernel library according to the hardware resource parameters of the target hardware platform, the preset microkernel conditions being set according to at least one hardware resource parameter; inputs the screened candidate microkernels into a pre-trained target microkernel evaluation model, which predicts the performance of each candidate microkernel when it is used by the dynamic tensor in the target model to obtain a performance score for each candidate microkernel, the target microkernel evaluation model being trained for the target hardware platform; and screens out the candidate microkernel with the highest performance score as the target microkernel of the dynamic tensor of the target model, so that the dynamic tensor in the target model uses the target microkernel. In this way, a hardware-friendly microkernel candidate set is selected as the search space according to multiple kinds of hardware resource information, so that the search space is reasonably compressed and tuning time is effectively reduced; a hardware-resource-sensitive cost model is then used to evaluate the hardware resource utilization level and memory access efficiency of the different microkernels, the microkernel matching the hardware resources is accurately selected from the microkernel candidate set, and the tuning effect and efficiency of the dynamic tensor are improved.
Referring to fig. 1, fig. 1 is a flowchart illustrating a method for optimizing dynamic tensor compilation according to an embodiment of the present application; specifically, the method comprises the following steps S101-S104:
S101, before a target model is deployed on a target hardware platform, acquiring various hardware resource parameters of the target hardware platform; the hardware resource parameters comprise the maximum thread number in a thread block, the thread number in a thread cluster and the cache capacity;
S102, screening a plurality of candidate microkernels meeting the preset microkernel conditions from a microkernel library constructed in advance according to hardware resource parameters of a target hardware platform; the preset microkernel condition is set at least according to one hardware resource parameter;
S103, inputting the screened multiple candidate microkernels into a pre-trained target microkernel evaluation model, and predicting the performance of each candidate microkernel by using a dynamic tensor in the target model through the target microkernel evaluation model to obtain the performance score of each candidate microkernel; the target microkernel evaluation model is obtained based on target hardware platform training;
S104, screening out the candidate microkernel with the highest performance score as the target microkernel of the dynamic tensor of the target model, so as to obtain the dynamic tensor using the target microkernel in the target model.
In this way, a plurality of hardware-friendly candidate microkernels are selected as the search space according to multiple kinds of hardware resource information, so that the search space is reasonably compressed and tuning time is effectively reduced; a hardware-resource-sensitive cost model is then used to evaluate the hardware resource utilization level and memory access efficiency of the different microkernels, and the microkernel matching the hardware resources is accurately selected from the plurality of candidate microkernels, thereby improving the tuning effect of the dynamic tensor.
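For illustration only, the following minimal Python sketch outlines how steps S101-S104 could be wired together. All names (HardwareParams, select_target_microkernel, and the meets_conditions and score callables) are hypothetical and are not taken from the patent; the sketch only mirrors the flow described above.

```python
# Hypothetical end-to-end sketch of steps S101-S104; all names are illustrative.
from typing import Callable, Iterable


class HardwareParams:
    def __init__(self, max_threads_per_block: int, warp_size: int, cache_capacity: int):
        self.max_threads_per_block = max_threads_per_block  # max number of threads in a thread block
        self.warp_size = warp_size                          # number of threads in a thread cluster (warp)
        self.cache_capacity = cache_capacity                # cache capacity in bytes


def select_target_microkernel(hw: HardwareParams,
                              microkernel_library: Iterable,
                              meets_conditions: Callable,
                              score: Callable,
                              dynamic_tensor):
    # S101: the hardware resource parameters hw are assumed to have been queried
    #       from the target platform before the target model is deployed.
    # S102: keep only candidates that satisfy the preset, hardware-derived conditions.
    candidates = [mk for mk in microkernel_library if meets_conditions(mk, hw)]
    # S103: the pre-trained microkernel evaluation model scores each candidate
    #       when it is used by the dynamic tensor of the target model.
    scored = {mk: score(dynamic_tensor, mk, hw) for mk in candidates}
    # S104: the candidate with the highest performance score becomes the target microkernel.
    return max(scored, key=scored.get)
```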
It should be noted that, based on the dynamic tensor compiling optimization method described in the embodiment of the present application, HAOTuner is provided as an automatic optimizer for dynamic tensor programs. The dynamic tensor compiling optimization method is the method used in the implementation steps of the HAOTuner tuning system.
Specifically, referring to FIG. 2, the HAOTuner tuning system performs the following steps: 1. after graph-level optimization, the computation subgraph obtained by partitioning is used as the input to HAOTuner; 2. a schedule is generated for the computation subgraph according to the computation pattern of the program and the hardware characteristics: HAOTuner selects the size of the microkernel through a hardware-resource-friendly search algorithm, generates a schedule for the microkernel, and then selects a suitable microkernel for the dynamic tensor program through a hardware-related cost model; then, through sampling, different schedules are actually executed on the hardware to obtain the program throughput, and the cost model is trained and refined; 3. at the runtime stage, HAOTuner matches the best microkernel for different dynamic tensor shapes through an allocator; 4. after tuning the subgraph on specific hardware, HAOTuner provides a migration scheme for the cost model and migrates the tuner to a different hardware platform.
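The runtime matching described in step 3 above could, under assumptions, look like the small allocator sketch below; the table-lookup strategy and the fallback to the nearest larger tuned shape are illustrative guesses, not the patent's implementation.

```python
# Hypothetical sketch of the runtime allocator in step 3; interface and fallback policy are assumptions.
class MicrokernelAllocator:
    def __init__(self, tuned_table):
        # tuned_table: mapping from a dynamic tensor shape (e.g. a sequence length)
        # to the target microkernel selected for it during the tuning stage.
        self.tuned_table = tuned_table

    def dispatch(self, shape):
        # Exact hit: this shape was tuned directly.
        if shape in self.tuned_table:
            return self.tuned_table[shape]
        # Otherwise fall back to the closest tuned shape that covers the request.
        larger = [s for s in self.tuned_table if s >= shape]
        key = min(larger) if larger else max(self.tuned_table)
        return self.tuned_table[key]
```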
In the step S101, before the target model is deployed on the target hardware platform, acquiring a plurality of hardware resource parameters of the target hardware platform; the hardware resource parameters comprise the maximum thread number in the thread block, the thread number in the thread cluster and the cache capacity.
The embodiment of the application uses the concept of the microkernel and matches a microkernel to the dynamic shape tensor under specific hardware resources. Referring to FIG. 3, FIG. 3 is a schematic diagram showing the effect of microkernel size on dynamic tensor performance. By changing the configuration parameters of the microkernel across different hardware, it was found that different hardware achieves optimal performance under different microkernel settings, i.e. different hardware has a different optimal microkernel size. This not only illustrates the great impact of the microkernel on performance, but also shows that the microkernel should be set according to the hardware resources.
In the embodiment of the present application, the hardware resources are analyzed from two aspects: the cache and the CUDA threads. Referring to FIG. 4, FIG. 4 illustrates the effect of cache utilization on microkernel parameter settings. In a simple matrix multiplication operation, a suitable microkernel can achieve a higher data reuse rate, as shown in the right half of FIG. 4. Because different hardware has different cache sizes, the choice of microkernel should be adaptive in order to maximize the use of the cache resources.
In step S102, a plurality of candidate microkernels meeting preset microkernel conditions are screened from a pre-constructed microkernel library according to the hardware resource parameters of the target hardware platform; the preset microkernel conditions are set according to at least one hardware resource parameter.
In the embodiment of the application, the preset microkernel conditions include: the number of threads of the candidate microkernel is an integer multiple of the number of threads in the thread cluster; the number of threads of the candidate microkernel is smaller than or equal to the maximum number of threads in the thread block; and the total data access amount of the candidate microkernel is an integer factor of the cache capacity.
Screening the plurality of candidate microkernels meeting the preset microkernel conditions includes:
screening out the microkernels whose number of threads is an integer multiple of the number of threads in the thread cluster, whose number of threads is smaller than or equal to the maximum number of threads in the thread block, and whose total data access amount is an integer factor of the cache capacity.
A theoretical analysis is performed on the effect of CUDA threads on the microkernel. To ensure that programs execute with maximum parallelism, the microkernel size is aligned with the thread execution unit. Assuming that the number of threads in each streaming multiprocessor (SM) of the GPU is at most t, the number of threads in the microkernel should be set to t or a factor of t, so the number of threads of the candidate microkernels to be screened out is less than or equal to the maximum number of threads in the thread block.
For a GPU, the number of threads of the microkernel should be an integer multiple of the number of threads in the thread cluster (i.e. the warp size), or a value that evenly divides the maximum number of threads of an SM can be used as a candidate parameter for the microkernel. Similarly, the data load size of the microkernel is also set to an integer multiple of the register cache size to achieve the maximum data reuse rate. Therefore, the number of threads of the candidate microkernels to be screened out is an integer multiple of the number of threads in the thread cluster, and the total data access amount of the candidate microkernels is an integer factor of the cache capacity.
In the embodiment of the application, the screened candidate microkernels simultaneously meet three conditions: the number of threads is an integer multiple of the number of threads in the thread cluster, the number of threads is smaller than or equal to the maximum number of threads in the thread block, and the total data access amount is an integer factor of the cache capacity. The screened candidate microkernels form the microkernel candidate set, so that not all microkernels need to be searched, the search space is reasonably compressed, and the optimization efficiency is improved.
Referring to FIG. 5, FIG. 5 is a flowchart of the microkernel search algorithm according to an embodiment of the present application; the microkernel search algorithm is the method for screening candidate microkernels in step S102. Taking HAOTuner as an example, HAOTuner can execute the method of the embodiment of the present application by obtaining the computing resource parameters, automatically generating a hardware-resource-friendly microkernel candidate set: the number of threads of each microkernel in the candidate set is an integer multiple of the warp size, or a value that evenly divides the maximum number of threads of an SM can be used as the candidate parameter of the microkernel. Similarly, the data load size of the microkernel is also set to an integer multiple of the register cache size to achieve the maximum data reuse rate. A minimal sketch of these screening conditions is given below.
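The sketch assumes a Microkernel record with num_threads and data_access_bytes fields; these field names, and the hw parameter fields carried over from the earlier sketch, are assumptions and not the patent's data structures.

```python
# Hypothetical sketch of the preset microkernel conditions used in step S102.
class Microkernel:
    def __init__(self, num_threads: int, data_access_bytes: int):
        self.num_threads = num_threads              # threads launched by one microkernel
        self.data_access_bytes = data_access_bytes  # total data accessed by one microkernel


def meets_preset_conditions(mk: Microkernel, hw) -> bool:
    # Condition 1: thread count is an integer multiple of the warp (thread cluster) size.
    warp_aligned = mk.num_threads % hw.warp_size == 0
    # Condition 2: thread count does not exceed the maximum number of threads in a thread block.
    fits_block = mk.num_threads <= hw.max_threads_per_block
    # Condition 3: total data access is an integer factor of the cache capacity.
    divides_cache = mk.data_access_bytes > 0 and hw.cache_capacity % mk.data_access_bytes == 0
    return warp_aligned and fits_block and divides_cache
```

Only microkernels passing all three checks enter the candidate set, which is what allows the search space to be compressed before the evaluation model is ever invoked.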
In the step S103, inputting the screened multiple candidate microkernels into a pre-trained target microkernel evaluation model, and predicting the performance of each candidate microkernel by using the dynamic tensor in the target model through the target microkernel evaluation model to obtain the performance score of each candidate microkernel; the target microkernel evaluation model is obtained based on target hardware platform training.
In the embodiment of the application, the performance score of the candidate micro-kernel is specifically determined based on the memory access efficiency of the candidate micro-kernel by using the dynamic tensor.
Specifically, predicting, by the target microkernel evaluation model, the performance of the target model when each candidate microkernel is used by a dynamic tensor in the target model, to obtain a performance score of each candidate microkernel, including:
Predicting the throughput of the candidate microkernel used by the dynamic tensor in the target model through the target microkernel evaluation model;
predicting a resource utilization coefficient when the candidate microkernel is used by the dynamic tensor;
determining the memory access efficiency coefficient of the candidate microkernel when used by the dynamic tensor, based on the various memory access times of the candidate microkernel on the target hardware platform when used by the dynamic tensor and the required complete execution time; the memory access efficiency scores of microkernels of different sizes characterize the differences in memory access time between microkernels of different sizes;
and calculating the performance score of the candidate micro-kernel based on the throughput, the resource utilization coefficient and the memory access efficiency coefficient when the candidate micro-kernel is used by the dynamic tensor.
In the embodiment of the application, the resource utilization coefficient of the candidate microkernel when used by the dynamic tensor is determined based on the rate of change of the resource utilization coefficient of the target hardware platform, and the rate of change of the resource utilization coefficient of the target hardware platform is learned automatically by the target microkernel evaluation model during the test stage on the target hardware platform.
In the embodiment of the application, the performance score of the candidate microkernel is calculated by comprehensively considering the following four aspects.
First: the microkernel performance term f_MK(FeatureExtractor(M)) is used to evaluate the throughput of the dynamic tensor P when using microkernel M. Here, P denotes the number of multiply-add operations in a complete dynamic tensor program, and M denotes the number of multiply-add operations in one microkernel, i.e. the microkernel size M.
The throughput f_MK(FeatureExtractor(M)) of the dynamic tensor when using the candidate microkernel is predicted by the target microkernel evaluation model.
Second: the byte-padding penalty term f_pad(P, M), which defines the percentage of the complete dynamic tensor program P, using microkernel size M, that is occupied by the byte-padding part. The byte-padding penalty term is used to evaluate the degree of matching between the microkernel size and the shape of the tensor program, and is a hardware-independent term.
Third: the resource utilization coefficient f_OCCA of the candidate microkernel when used by the dynamic tensor. In the embodiment of the application, the resource utilization coefficient of the candidate microkernel is calculated by the following formula (1):
wherein f_OCCA(P/M) characterizes the resource utilization coefficient when the dynamic tensor P uses microkernel M; P/M characterizes the total number of microkernels M needed to execute a single dynamic tensor P; divmod((P/M), NumCores) characterizes the modulo of P/M with respect to the total number of compute units NumCores; ceil_by((P/M), NumCores) characterizes P/M rounded up to a multiple of the total number of compute units NumCores; k_1 characterizes the rate of change of dynamic tensor performance with microkernel size; and b_1 characterizes the dynamic parameter, where k_1 + b_1 = 1.
The resource utilization coefficient in the prior art is denoted f_OCC. The number NumCores of streaming multiprocessors (SMs) in the GPU is a fixed, hardware-dependent value, and the number of SMs directly affects the calculated value of f_OCC. Since the value of the resource utilization coefficient should lie within the range (0, 1), it is found by observation that DietCode cannot guarantee that the value of f_OCC is always smaller than 1. If the number of stream processors occupied by all the microkernels P/M used to execute the program P is greater than the total number of stream processors NumCores in the hardware, the value of the resource utilization coefficient becomes much greater than 1, and f_OCC can then no longer represent the resource utilization of the hardware.
Fourth: the memory access efficiency coefficient of the candidate microkernel when used by the dynamic tensor. Specifically, based on the various memory access times of the candidate microkernel on the target hardware platform when used by the dynamic tensor and the required complete execution time, the memory access efficiency coefficient of the candidate microkernel when used by the dynamic tensor is determined by the following formula (2):
wherein t characterizes the number of threads in the candidate microkernel; LS^R_t characterizes the access time of the candidate microkernel to the registers of the target hardware platform; LS^S_t characterizes the access time of the candidate microkernel to the shared memory of the target hardware platform; LS^G_t characterizes the access time of the candidate microkernel to the global memory of the target hardware platform; ExecutionTime_M(P) characterizes the complete execution time required by the dynamic tensor P using microkernel M; the parameters k_2 and b_2 are the rate of change of the actual memory access time fitted during the training of the target microkernel evaluation model; and f_LST(P, M) characterizes the memory access efficiency of the dynamic tensor P using microkernel M on the target hardware platform.
Based on this, the performance score Cost_M(P) of the dynamic tensor P when microkernel M is used is calculated by the following formula (3):
In a GPU, a thread is the smallest unit that accesses memory, and in theory the memory access time of program P depends on the number of threads. Since the number of threads in a microkernel is variable, there is an inherent difference in memory access time between microkernels of different sizes. As shown in FIG. 6, the ratio of memory access time to total program execution time for different microkernels was measured on an NVIDIA Tesla T4; during the measurement, the three different types of memory access and the computation time consumption of four microkernels were measured in clock cycles. The computation time of the different microkernels is essentially unchanged, while the three access times show obvious differences due to the different microkernel sizes. Therefore, in order to accurately predict the memory access efficiency of a microkernel, the embodiment of the present application adds the memory access efficiency f_LST, which depends on the memory access time, as a penalty coefficient in the microkernel evaluation model.
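Since formulas (1)-(3) themselves are not reproduced in this text, the sketch below only illustrates the structure described in the prose: an occupancy term kept within (0, 1], a memory-access penalty fitted by k2 and b2, and a combination of the four terms. The concrete expressions, including the affine forms and the multiplicative combination, are assumptions and not the patent's exact formulas.

```python
import math

# Structural sketch of a hardware-aware cost model; the expressions are assumptions.

def occupancy_term(P_ops: float, M_ops: float, num_cores: int, k1: float, b1: float) -> float:
    # cf. formula (1): relate the number of microkernels P/M needed by the program to the
    # number of compute units NumCores, rounded up to whole "waves". Assumed affine form with
    # k1 + b1 = 1, so the value stays within (0, 1] and reaches 1 when whole waves are filled.
    microkernels = P_ops / M_ops
    waves = math.ceil(microkernels / num_cores)
    return k1 * microkernels / (waves * num_cores) + b1


def memory_access_term(reg_time: float, shared_time: float, global_time: float,
                       exec_time: float, k2: float, b2: float) -> float:
    # cf. formula (2): relate the register, shared-memory and global-memory access times of
    # the microkernel's threads to the complete execution time ExecutionTime_M(P);
    # k2 and b2 are fitted while training the evaluation model. The ratio form is assumed.
    return k2 * (reg_time + shared_time + global_time) / exec_time + b2


def performance_score(f_mk: float, f_pad: float, f_occ: float, f_lst: float) -> float:
    # cf. formula (3): combine predicted throughput f_MK, byte-padding penalty f_pad,
    # occupancy f_OCCA and memory-access efficiency f_LST; the multiplicative combination
    # with the padded fraction removed is an assumption.
    return f_mk * (1.0 - f_pad) * f_occ * f_lst
```

Under these assumptions, the candidate microkernel with the largest performance_score would be the one selected in step S104.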
In the embodiment of the application, the target microkernel evaluation model is obtained based on training by the following method:
Acquiring a source microkernel evaluation model which is trained on a source hardware platform, and training the source microkernel evaluation model on a target hardware platform through a training data set comprising a dynamic tensor instance to obtain a weight distribution difference coefficient of the source microkernel evaluation model between the source hardware platform and the target hardware platform;
Judging whether the weight distribution difference coefficient is smaller than a preset difference threshold value or not;
if yes, the weight of the source microkernel evaluation model is reserved as a target kernel evaluation model; if not, modifying the weight of the source microkernel evaluation model to obtain a target kernel evaluation model;
training the target kernel evaluation model according to the loss function of the target kernel evaluation model until the target kernel evaluation model converges to obtain a target microkernel evaluation model.
Specifically, in the embodiment of the present application, the source microkernel evaluation model is a lifting tree structure;
Preserving the weight of the source microkernel evaluation model as a target kernel evaluation model, comprising:
The lifting tree structure of the source microkernel evaluation model is reserved as a target kernel evaluation model;
modifying the weight of the source microkernel evaluation model to obtain a target kernel evaluation model, comprising:
and constructing a new tree for the source microkernel evaluation model to obtain a target kernel evaluation model.
Current compilers need to train a microkernel evaluation model from scratch when tuning for diverse heterogeneous hardware, which greatly increases the tuning time. To solve this problem, the embodiment of the present application provides a novel cross-device microkernel evaluation model migration method, which improves the training convergence speed of the microkernel evaluation model by realizing the following three points:
First point: the target hardware platform (also called target hardware, target device or target domain) performs tuning using the lifting tree structure trained on the source device (also called source hardware platform, source hardware or source domain). This process does not require providing an offline data set or training the lifting tree from scratch, and it lets the lifting-tree training converge through fewer rounds of actual hardware measurement;
Second point: due to architecture differences, the target hardware platform and the source hardware platform may differ in actual performance. For example, these differences often manifest themselves in the number of CUDA compute cores, the size of the shared memory, and so on, and are referred to as hardware differences. The goal set for transfer learning in the embodiment of the present application is: make full use of source-domain knowledge, retain the domain-independent weights, and minimize the domain differences caused by hardware differences;
Third point: the difference between the source domain and the target domain is measured by constructing a target loss function and setting a weight distribution difference coefficient; when the weight distribution difference coefficient is smaller than the target loss, the source-domain weights are retained, and when the difference coefficient is larger than the target loss, a new tree is constructed and the weights are updated.
In the migration training process of the source microkernel evaluation model, domain-independent (hardware-independent) weights are retained by utilizing source-domain knowledge to minimize the domain differences caused by hardware differences. The microkernel evaluation model has a lifting-tree structure, and the lifting tree is a model that takes Q incremental regression trees as its output. The microkernel evaluation model is trained iteratively, and in each iteration the lifting tree is optimized using the target loss function. In the embodiment of the application, the target loss function of the microkernel evaluation model is defined over h(x) = Σ_{k=1}^{Q} T_k(x; Θ_k), where h(x) represents the prediction of the microkernel evaluation model, Q represents the total number of trees in the microkernel evaluation model, T_k represents the lifting tree added in the k-th iteration, and Θ_k represents the parameters for tensor program instance x in the k-th iteration. The weight of each tree can be accurately obtained from the output of each tree in the microkernel evaluation model and the tree structure function on the leaf nodes of each tree.
The source training data of the source microkernel evaluation model comes from the actual measurement stage on the source hardware platform; the target training data of the target microkernel evaluation model comes from the actual measurement stage on the target hardware platform. The source hardware platform and the target hardware platform contain different numbers of dynamic tensor instances, and each dynamic tensor instance has a plurality of features, so the features of the source training data and the target training data differ. The present application aims to determine, through training, the target loss function of the target hardware platform after migration based on the target loss function of the source hardware platform before migration; in addition, the same tree structure as on the source hardware platform is kept on the target hardware platform as far as possible, so as to reduce the difficulty of training a microkernel evaluation model matched to the target hardware platform. It should be noted that the same tree structure may use different node weights before and after migration. This design preserves the robustness and interpretability of the lifting tree.
In the stage of actually measuring tensor programs on the target hardware, if enough dynamic tensor instances are collected on the target hardware, the microkernel evaluation model can be trained to convergence directly on a large number of dynamic tensor instances; however, training the microkernel evaluation model on a large number of dynamic tensor instances is inefficient. If only a small number of hardware measurements are performed on the target hardware, the number of collected dynamic tensor instances is small, which makes it difficult for the microkernel evaluation model to converge on the target device; moreover, for high-dimensional dynamic tensor instances, directly optimizing the target loss function often leads to a poor prediction effect. Therefore, the embodiment of the present application uses transfer learning to exploit prior knowledge from the source domain to improve the prediction effect of the model on the target domain, so that a converged microkernel evaluation model matched to the target hardware can be obtained with only a small number of hardware measurements. In addition, by training the target loss function during migration, such migration optimization can adjust the distribution difference between domains with relatively low linear complexity, and it improves prediction accuracy by matching the joint distribution between domains instead of only the marginal distribution.
However, in real scenarios there is often a great difference between the source domain and the target domain, and simply placing the cost model of the source device directly onto the target device is not advisable. To solve this problem, in the embodiment of the present application, weights are reassigned to the nodes corresponding to dynamic tensor instances from the source domain according to the target loss function, and the new weight of each node is derived from the joint distribution ratio between the target hardware platform and the source hardware platform. Specifically, the joint distribution ratio is divided into two parts: the marginal distribution ratio and the conditional distribution ratio.
After the joint distribution ratio is divided, an overall loss objective function needs to be constructed based on the target loss functions before and after migration, and the optimization task during migration is defined as two parts: updating the weights of the source-domain part and assigning weights to the target domain. The aim is to quickly learn the feature weights of the target hardware platform through training. The overall loss objective function is L(h_t) = λ·L_S(h_t, β) + L_T(h_t) + Ω(h_t), where β represents the weights matched by the joint distribution ratio, λ is a hyper-parameter balancing the post-migration target loss function L_T and the pre-migration target loss function L_S, h_t is the classification result of the dynamic tensor instance after migration, and Ω is a regularization term on h_t.
The overall loss objective function takes into account the target loss functions before and after migration as well as the joint distribution ratio, and the joint distribution ratio characterizes the difference between the source domain and the target domain, so the overall loss objective function can more effectively help the microkernel evaluation model migrate quickly between a source domain and a target domain that differ.
Referring to FIG. 7, FIG. 7 illustrates migrating a microkernel evaluation model trained on the source device to the target device, where q(X) represents the lifting tree structure, the input tensor program instances X_t serve as the training data set collected by actual measurement on the target device, and the measurement results are Y_t. In the embodiment of the present application, the joint distribution ratio γ is used to represent the distribution difference between the source domain and the target domain; β is then compared with the preset difference threshold ε: a value smaller than ε means that the source-domain weights can be retained, and a value larger than ε means that a new lifting tree needs to be added to the microkernel evaluation model. Finally, the migrated microkernel evaluation model is trained according to the overall loss objective function until convergence.
FIG. 7 illustrates the computation process HAOTuner adopts to migrate the microkernel evaluation model from the source hardware platform to the target hardware platform. Deciding which parameters from the source domain can be retained in the target domain is the key issue to be addressed by the migration algorithm. The target domain uses the same tree structure as the source domain, but the weights of the nodes are allowed to differ. Furthermore, the same tree structure ensures that the marginal probability distributions of the two match. For a compiler, in the tuning stage different hardware usually adopts the same subgraph partitioning for the same model; that is, the target device uses the same tree structure as the source device, and the same marginal probability distribution can be ensured across different hardware. On the other hand, using different weights provides flexibility for learning different conditional probability distributions. Since β can be used to measure the difference between the distributions of the source domain and the target domain, a threshold ε is set. That is, when β is smaller than this threshold ε (e.g. 0.5), the weights of the source domain under this instance are retained; when β is greater than ε, a new tree is built for this instance. In the k-th iteration, the newly added tree of the target domain is trained to optimize the overall loss objective function. After the lifting tree is added in the k-th iteration, HAOTuner still needs to redefine the weight update of the objective function.
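A minimal sketch of this migration decision follows, assuming a boosted-tree model object with copy, add_tree and fit methods; these method names, and passing β through the fit call, are assumptions made only for illustration.

```python
# Hypothetical sketch of the cross-device migration of the microkernel evaluation model.
def migrate_evaluation_model(source_model, target_samples, target_labels,
                             joint_distribution_ratio, epsilon: float = 0.5, lam: float = 1.0):
    # beta is derived from the joint distribution ratio (marginal ratio x conditional ratio)
    # between the target and source domains for the measured target-domain instances.
    beta = joint_distribution_ratio(target_samples)
    if beta < epsilon:
        # Small domain difference: keep the source-domain tree structure and node weights.
        target_model = source_model
    else:
        # Large domain difference: keep the tree structure but add a new lifting tree
        # whose weights are learned from the target-domain measurements.
        target_model = source_model.copy()
        target_model.add_tree()
    # Train until convergence by minimizing the overall loss objective
    #   L(h_t) = lambda * L_S(h_t, beta) + L_T(h_t) + Omega(h_t)
    target_model.fit(target_samples, target_labels, source_weight=lam * beta, regularize=True)
    return target_model
```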
In the step S104, candidate microkernels with the highest performance scores are screened out as target microkernels of the dynamic tensor of the target model, so as to obtain the dynamic tensor using the target microkernels in the target model.
Specifically, the target microkernel evaluation model directly outputs candidate microkernels with performance scores ordered from high to low, so that the target microkernel with the highest performance score is rapidly determined.
The performance of HAOTuner constructed based on the dynamic tensor compilation optimization method in the embodiment of the present application is tested as follows.
BERT is the representative Transformer in the NLP field and has received the broadest industrial attention due to its high performance, high parallelism and portability. Upon release it refreshed the SOTA records of 11 NLP tasks, and the paper has been cited more than 40,000 times. However, the BERT model shows a large increase in both parameter count and complexity relative to earlier models, which also poses difficulties for industrial deployment. Specifically, using general TVM to accelerate BERT model inference with batch size 1 and sequence length 256 on a CPU yields a 2.8x speedup, so accelerating BERT model deployment with deep learning compilation is demonstrably important to the industry. Accordingly, the embodiment of the application evaluates the tuning effect of HAOTuner on end-to-end model inference using dynamic sequence lengths in the BERT-base model. After that, the embodiment of the application demonstrates the performance of HAOTuner over dynamic sequence lengths on the dense layer and the batch matrix multiplication layer, respectively. Completing the entire tuning process with an existing auto-tuner is impractical: a single BERT layer with sequence lengths in the range [1,128] takes 42 hours to finish tuning on the CPU. Therefore, HAOTuner compares performance and tuning time by uniformly sampling 8 shapes in [1,128], in the same manner as DietCode.
In the embodiment of the application, the evaluation index is defined as follows:
In terms of performance, HAOTuner was compared to three SOTA references:
a. the hand-developed operator libraries for GPUs, cuBLAS and cuDNN, referred to as Vendor;
b. the auto-tuner that currently handles static-shape workloads best, Ansor;
c. the auto-tuner that currently handles dynamic-shape workloads best, DietCode.
For each workload, every auto-tuner runs at least 1000 trials, which is typically sufficient to ensure convergence.
In terms of tuning time, HAOTuner is compared to two SOTA references:
a. during the tuning phase, Ansor is used to randomly initialize the cost model and train it from scratch, referred to as Ansor;
b. the cost model is trained from scratch using DietCode, referred to as DietCode.
The test results are presented in three ways:
a. for the end-to-end model delay test, the average of more than five runs is reported in microseconds;
b. for the runtime of the generated tensor program on a single compute layer, the average of more than one hundred runs is reported in microseconds;
c. the total time consumed by automatic tuning is measured in hours. For all indices, lower is better.
First: end-to-end model verification
Referring to FIG. 8, FIG. 8 shows the delay comparison of dynamic sequence lengths across the BERT-base model for the various compilers; all values in fig. 8 are normalized to the Vendor baseline. As shown in fig. 8, HAOTuner performs best or joint best in all cases. Since DietCode fits a linear regression curve of throughput versus microkernel size on T4 in advance, the optimization effect of HAOTuner on T4 is comparable to DietCode and Vendor. Meanwhile, DietCode and HAOTuner achieve 23% acceleration compared to Ansor. On the edge-side device AGX it can be seen that DietCode tunes very poorly on embedded devices: without any changes, its inference delay is on average 3.5 times worse than Vendor and 2.5 times worse than Ansor.
There are two reasons for the delay comparison results shown in fig. 8: 1. among the auto-tuners, Ansor optimizes at the subgraph level while DietCode optimizes at the microkernel level, and generally the granularity of a subgraph is much smaller than that of a microkernel, so Ansor can optimize even very small subgraphs. In contrast, DietCode optimizes the microkernel as a whole: when the entire microkernel cannot be optimized, none of the operators in the microkernel can be optimized; 2. the computing resources of embedded devices are limited; for example, there are only 8 SM units on AGX and only 2 on TX2 (insufficient for BERT to perform auto-tuning). Because the search space of the microkernel becomes small, the room for the microkernel to optimize the dynamic tensor program is limited, which is effectively equivalent to no optimization.
Furthermore, when evaluating BERT, some results may not be as good as DietCode, such as sequence lengths 81 and 128 on 1080Ti and sequence length 24 on 3060Ti. The reason may be data transmission delays. Across multiple evaluation results, this situation does not affect the improvement in overall model inference speed. Overall, across the various hardware evaluation results, HAOTuner achieves an average improvement of 39% over Ansor and an average improvement of 26% over DietCode. Notably, HAOTuner loses only 4% of performance compared to Vendor across all hardware evaluations. The dynamic tensor compiling optimization method provided by the embodiment of the application therefore has better performance.
Second: Single operator verification (Dense layer)
Referring to fig. 9, fig. 9 illustrates the runtime consumption of the generated tensor programs for the dense operator at different sequence lengths. As can be seen from fig. 9, HAOTuner achieves the same optimization effect as DietCode on T4. Its average runtime over all sequence lengths is 29% better than Ansor and comparable to Vendor. Overall, HAOTuner improves by 32% across all hardware and by 15% over DietCode. HAOTuner outperforms DietCode because it constructs a hardware-friendly microkernel candidate set and, in the on-device measurement stage on actual hardware, automatically fits the resource occupancy and memory access efficiency coefficient equations to the hardware, thereby making fuller use of the hardware's characteristics.
Third: Single operator verification (BatchMatmul layer)
Referring to fig. 10, fig. 10 illustrates the runtime consumption of the generated tensor programs for the BatchMatmul operator at different sequence lengths. When the data layout is NT, DietCode and HAOTuner improve by 10% over Vendor and by 19% over Ansor on T4. When the data layout is NN, DietCode and HAOTuner improve by 14% over Vendor and by 22% over Ansor on T4. Overall, across all hardware and both data layouts, HAOTuner is 33% better than Vendor, 30% better than Ansor, and 22% better than DietCode.
Fourth: Tuning time verification
Fig. 11 shows the tuning time comparison of HAOTuner and DietCode on the BERT model, demonstrating the effect of HAOTuner on tuning time. The cost model was pre-trained on T4 and the trained cost model was then migrated to 1080Ti and AGX, respectively. First, with the tuning time of Ansor as the reference, the effect of microkernels on dynamic tensor workload tuning time is demonstrated on T4. From the comparison results, DietCode and HAOTuner shorten the tuning time by an average of 6 times compared to Ansor. Second, from the results of migrating from T4 to 1080Ti and from T4 to AGX, HAOTuner achieves a 21% speedup in the total tuning time of the end-to-end model compared to DietCode without cost-model migration. Overall, in a multi-hardware deployment scenario, HAOTuner achieves cross-device migration of the cost model through model pre-training, which reduces the tuning time by 25% on average.
Fig. 12 shows the tuning time comparison of HAOTuner and DietCode on the Dense operator. Referring to fig. 12, the embodiment of the application also evaluates the tuning time of the dynamic dense layer at different sequence lengths, as shown in part b of fig. 12. A cost model pre-trained on 1080Ti is used on T4 to represent HAOTuner. HAOTuner reduces the tuning time by 6.2 times on average compared with Ansor and by 28% compared with DietCode. Meanwhile, to show that cost-model migration accelerates tuning without degrading performance, ablation experiments were also performed on HAOTuner on T4; as shown in part a of FIG. 12, HAOTuner-T, which uses cost-model migration, performs essentially the same as HAOTuner with a cost model trained from scratch.
The embodiment of the application provides HAOTuner, a new hardware-adaptive auto-tuner framework built on the dynamic tensor compiling optimization method. HAOTuner is evaluated on 6 hardware devices using the current SOTA dynamic-shape workloads. The evaluation shows that, in tuning the end-to-end model, HAOTuner improves performance by 39% on average compared to Ansor and by 26% on average compared to DietCode. With cost-model migration, the overall tuning time of HAOTuner is reduced by 6 times compared with Ansor and by 25% compared with DietCode.
The dynamic tensor compiling optimization method of the embodiment of the application uses the idea of the microkernel as the intermediate medium for task allocation. The performance of a tensor program on hardware is determined by the microkernel size. Thus, in the dynamic tensor compiling optimization method the shape of the microkernel is determined from the hardware resources, not just from the shape of the tensor. First, a microkernel candidate set well matched to the hardware resources is screened out to reduce the search space; then hardware-resource-related parameters are added to the microkernel evaluation model to predict the performance of tensor programs, and an effective migration training method is provided to reduce the training cost of the microkernel evaluation model. The contributions of the dynamic tensor compiling optimization method in the embodiment of the application are as follows: first, a hardware-adaptive automatic compiling technique for dynamic tensor programs is provided; second, a microkernel candidate-set algorithm that automatically selects candidates according to hardware resources is provided to shorten the tuning time, a hardware-resource-sensitive microkernel evaluation model is designed to support different hardware architectures, and a migration solution for the microkernel evaluation model is provided so that it can be rapidly deployed to different hardware platforms; third, HAOTuner is evaluated experimentally, demonstrating that HAOTuner, built on the dynamic tensor compiling optimization method of the embodiment of the application, leads the currently most advanced dynamic tensor compiling techniques in several respects.
In some embodiments, a dynamic tensor compiling optimization device is further provided, referring to fig. 13, the device includes:
An obtaining module 1301, configured to obtain multiple hardware resource parameters of the target hardware platform before the target model is deployed on the target hardware platform; the hardware resource parameters comprise the maximum thread number in a thread block, the thread number in a thread cluster and the cache capacity;
The first screening module 1302 is configured to screen a plurality of candidate microkernels that meet a preset microkernel condition from a microkernel library that is constructed in advance according to a hardware resource parameter of the target hardware platform; the preset microkernel condition is set at least according to one hardware resource parameter;
The prediction module 1303 is configured to input the screened multiple candidate microkernels into a pre-trained target microkernel evaluation model, and to predict, by the target microkernel evaluation model, the performance of the target model when the dynamic tensor in the target model uses each candidate microkernel, so as to obtain a performance score for each candidate microkernel; the target microkernel evaluation model is obtained based on target hardware platform training;
And a second screening module 1304, configured to screen out a candidate microkernel with the highest performance score as a target microkernel of a dynamic tensor of a target model, so as to obtain the dynamic tensor using the target microkernel in the target model.
According to the dynamic tensor compiling and optimizing device, a hardware-friendly micro-kernel candidate set is selected as a search space according to various hardware resource information, so that the search space is reasonably compressed, and tuning time is effectively reduced; and then, the hardware resource sensitive cost model is used for evaluating the resource utilization rate level and the memory access efficiency of hardware of different microkernels, microkernels matched with the hardware resources are accurately selected from the microkernel candidate set, and the tuning effect of the dynamic tensor is improved.
In some embodiments, in the dynamic tensor compilation optimization device, the presetting microkernel conditions includes: the number of threads of the candidate micro-cores is an integer multiple of the number of threads in the thread cluster, the number of threads of the candidate micro-cores is smaller than or equal to the maximum number of threads in the thread block, and the total data access amount of the candidate micro-cores is an integer factor of the cache capacity;
the first screening module is specifically configured to, when being configured to screen a plurality of candidate microkernels that meet a preset microkernel condition:
screening out candidate microkernels whose number of threads is an integer multiple of the number of threads in the thread cluster, whose number of threads is less than or equal to the maximum number of threads in the thread block, and whose total data access amount is an integer factor of the cache capacity.
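A minimal sketch of this screening step is given below, assuming each candidate microkernel is described by its thread count and total data access volume; the dictionary field names and the interpretation of the thread cluster as a warp-like unit are assumptions for illustration, not definitions from the patent.

```python
def screen_candidates(microkernel_library, cluster_threads, max_threads_per_block, cache_capacity):
    """Keep only microkernels whose shape matches the target hardware resource parameters."""
    candidates = []
    for mk in microkernel_library:
        threads_ok = (
            mk["num_threads"] % cluster_threads == 0          # integer multiple of the thread-cluster size
            and mk["num_threads"] <= max_threads_per_block    # fits within one thread block
        )
        cache_ok = (
            mk["data_access_bytes"] > 0
            and cache_capacity % mk["data_access_bytes"] == 0  # integer factor of the cache capacity
        )
        if threads_ok and cache_ok:
            candidates.append(mk)
    return candidates

# Usage example (hypothetical values):
# library = [{"num_threads": 128, "data_access_bytes": 4096}, {"num_threads": 96, "data_access_bytes": 5000}]
# screen_candidates(library, cluster_threads=32, max_threads_per_block=1024, cache_capacity=49152)
```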
In some embodiments, the prediction module in the dynamic tensor compiling optimization device, when predicting, by the target microkernel evaluation model, the performance of the target model when the dynamic tensor in the target model uses each candidate microkernel, so as to obtain the performance score of each candidate microkernel, is specifically configured to:
Predicting the throughput of the candidate microkernel used by the dynamic tensor in the target model through the target microkernel evaluation model;
predicting a resource utilization coefficient when the candidate microkernel is used by the dynamic tensor;
Determining a memory access efficiency coefficient of the dynamic tensor using the candidate microkernel based on the various memory access times of the candidate microkernel to the target hardware platform and the required complete execution time when the dynamic tensor uses the candidate microkernel; the memory access efficiency scores of microkernels of different sizes characterize the differences between microkernels of different sizes in memory access time;
and calculating the performance score of the candidate micro-kernel based on the throughput, the resource utilization coefficient and the memory access efficiency coefficient when the candidate micro-kernel is used by the dynamic tensor.
In some embodiments, the prediction module in the dynamic tensor compiling optimization device is specifically configured to, when predicting the resource utilization coefficient when the candidate microkernel is used by the dynamic tensor:
calculating a resource utilization coefficient of the dynamic tensor when the candidate microkernel is used through the following formula (1);
Wherein f_OCCA(P/M) characterizes the resource utilization coefficient of a dynamic tensor P using microkernel M; P/M characterizes the total number of microkernels M needed to execute a single dynamic tensor P; divmod(P/M, NumCores) characterizes the modulo function of P/M and the total number of computing units NumCores; k_1 characterizes the rate of change of dynamic tensor performance with microkernel size; and b_1 characterizes a dynamic parameter, where k_1 + b_1 = 1.
In some embodiments, the prediction module in the dynamic tensor compiling optimization device, when determining the memory access efficiency coefficient of the dynamic tensor using the candidate microkernel based on the various memory access times of the candidate microkernel to the target hardware platform and the required complete execution time, is specifically configured to determine it by the following formula (2):
Wherein t represents the number of threads in the candidate microkernel; LS_R^t represents the access time of the candidate microkernel to the target hardware platform registers; LS_S^t represents the access time of the candidate microkernel to the shared memory of the target hardware platform; LS_G^t represents the global memory access time of the candidate microkernel on the target hardware platform; ExecutionTime_M(P) represents the complete execution time required by the dynamic tensor P using microkernel M; the parameters k_2 and b_2 are the rate of change of the actual memory access time fitted during the training of the target microkernel evaluation model; f_LST(P, M) characterizes the memory access efficiency on the target hardware platform of the dynamic tensor P using microkernel M.
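Formulas (1) and (2) appear only as images in the published text, so the sketch below encodes just one plausible reading of the surrounding description: both coefficients are linear functions (k·x + b) of a hardware-derived quantity, and the final score combines them with the predicted throughput. The exact functional forms and the multiplicative combination are assumptions, not the patent's equations.

```python
def occupancy_coefficient(num_microkernels, num_cores, k1, b1):
    """Assumed form of f_OCCA(P/M): linear (k1 * x + b1, with k1 + b1 = 1) in how well
    the last wave of microkernels fills the compute units, via divmod(P/M, NumCores)."""
    _, remainder = divmod(num_microkernels, num_cores)
    last_wave_fill = 1.0 if remainder == 0 else remainder / num_cores
    return k1 * last_wave_fill + b1

def access_efficiency_coefficient(reg_times, shared_times, global_times,
                                  execution_time, k2, b2):
    """Assumed form of f_LST(P, M): linear fit of the per-thread access times
    (registers LS_R, shared memory LS_S, global memory LS_G) normalized by the
    complete execution time ExecutionTime_M(P)."""
    total_access = sum(reg_times) + sum(shared_times) + sum(global_times)
    return k2 * (total_access / execution_time) + b2

def performance_score(predicted_throughput, occ_coeff, lst_coeff):
    """Combine throughput with the two coefficients (combination rule assumed)."""
    return predicted_throughput * occ_coeff * lst_coeff
```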
In some embodiments, the dynamic tensor compiling optimization device further includes a training module, configured to train to obtain the target microkernel evaluation model by:
Acquiring a source microkernel evaluation model which is trained on a source hardware platform, and training the source microkernel evaluation model on a target hardware platform through a training data set comprising a dynamic tensor instance to obtain a weight distribution difference coefficient of the source microkernel evaluation model between the source hardware platform and the target hardware platform;
Judging whether the weight distribution difference coefficient is smaller than a preset difference threshold value or not;
if yes, the weight of the source microkernel evaluation model is reserved as a target kernel evaluation model; if not, modifying the weight of the source microkernel evaluation model to obtain a target kernel evaluation model;
training the target kernel evaluation model according to the loss function of the target kernel evaluation model until the target kernel evaluation model converges to obtain a target microkernel evaluation model.
In some embodiments, in the dynamic tensor compilation optimization device, the source microkernel evaluation model is a lifting tree structure;
the training module, when configured to retain the weights of the source microkernel evaluation model as the target kernel evaluation model, is specifically configured to:
The lifting tree structure of the source microkernel evaluation model is reserved as a target kernel evaluation model;
and when configured to modify the weights of the source microkernel evaluation model to obtain the target kernel evaluation model, the training module is specifically configured to:
and constructing a new tree for the source microkernel evaluation model to obtain a target kernel evaluation model.
In some embodiments, referring to fig. 14, an electronic device 1400 includes: a processor 1402, a memory 1401 and a bus, said memory 1401 storing machine readable instructions executable by said processor 1402, said processor 1402 and said memory 1401 communicating via the bus when the electronic device 1400 is running, said machine readable instructions when executed by said processor 1402 performing the steps of said dynamic tensor compilation optimization method.
In some embodiments, a computer readable storage medium is also provided, on which a computer program is stored, which computer program, when being executed by a processor, performs the steps of the dynamic tensor compilation optimization method.
It will be clear to those skilled in the art that, for convenience and brevity of description, specific working procedures of the above-described system and apparatus may refer to corresponding procedures in the method embodiments, and are not repeated in the present disclosure. In the several embodiments provided by the present application, it should be understood that the disclosed systems, devices, and methods may be implemented in other manners. The above-described apparatus embodiments are merely illustrative, and the division of the modules is merely a logical function division, and there may be additional divisions when actually implemented, and for example, multiple modules or components may be combined or integrated into another system, or some features may be omitted or not performed. Alternatively, the coupling or direct coupling or communication connection shown or discussed with each other may be through some communication interface, indirect coupling or communication connection of devices or modules, electrical, mechanical, or other form.
The modules described as separate components may or may not be physically separate, and components shown as modules may or may not be physical units, may be located in one place, or may be distributed over multiple network units. Some or all of the units may be selected according to actual needs to achieve the purpose of the solution of this embodiment.
In addition, each functional unit in the embodiments of the present application may be integrated in one processing unit, or each unit may exist alone physically, or two or more units may be integrated in one unit.
The functions, if implemented in the form of software functional units and sold or used as a stand-alone product, may be stored in a non-volatile computer readable storage medium executable by a processor. Based on this understanding, the technical solution of the present application may be embodied essentially or in a part contributing to the prior art or in a part of the technical solution, in the form of a software product stored in a storage medium, comprising several instructions for causing a computer device (which may be a personal computer, a platform server, or a network device, etc.) to perform all or part of the steps of the method according to the embodiments of the present application. And the aforementioned storage medium includes: a usb disk, a removable hard disk, a ROM, a RAM, a magnetic disk, or an optical disk, etc.
The foregoing is merely illustrative of the present application, and the present application is not limited thereto, and any person skilled in the art will readily appreciate variations or alternatives within the scope of the present application. Therefore, the protection scope of the application is subject to the protection scope of the claims.
Claims (8)
1. A method for dynamic tensor compilation optimization, the method comprising:
Before the target model is deployed on a target hardware platform, acquiring various hardware resource parameters of the target hardware platform; the hardware resource parameters comprise the maximum thread number in a thread block, the thread number in a thread cluster and the cache capacity;
Screening a plurality of candidate microkernels meeting the preset microkernel conditions from a microkernel library constructed in advance according to hardware resource parameters of a target hardware platform; the preset microkernel condition is set at least according to one hardware resource parameter;
inputting the screened multiple candidate microkernels into a pre-trained target microkernel evaluation model, and predicting, through the target microkernel evaluation model, the performance of the target model when the dynamic tensor in the target model uses each candidate microkernel, to obtain a performance score of each candidate microkernel; the target microkernel evaluation model is obtained based on target hardware platform training;
Screening out candidate microkernels with highest performance scores as target microkernels of dynamic tensors of a target model to obtain dynamic tensors using the target microkernels in the target model;
predicting, by the target microkernel evaluation model, performance of the target model when the dynamic tensor uses each candidate microkernel in the target model, to obtain a performance score for each candidate microkernel, including:
Predicting the throughput of the candidate microkernel used by the dynamic tensor in the target model through the target microkernel evaluation model;
predicting a resource utilization coefficient when the candidate microkernel is used by the dynamic tensor;
Determining a memory access efficiency coefficient of the dynamic tensor using the candidate microkernel based on the various memory access times of the candidate microkernel to the target hardware platform and the required complete execution time when the dynamic tensor uses the candidate microkernel; the memory access efficiency scores of microkernels of different sizes characterize the differences between microkernels of different sizes in memory access time;
calculating a performance score of the candidate microkernel based on the throughput, the resource utilization coefficient and the memory access efficiency coefficient when the candidate microkernel is used by the dynamic tensor;
Based on the various memory access times of the candidate microkernel to the target hardware platform and the required complete execution time when the dynamic tensor uses the candidate microkernel, the memory access efficiency coefficient of the dynamic tensor using the candidate microkernel is determined by the following formula (2):
(2)
Wherein t represents the number of threads in the candidate microkernel; LS_R^t represents the access time of the candidate microkernel to the target hardware platform registers; LS_S^t represents the access time of the candidate microkernel to the shared memory of the target hardware platform; LS_G^t represents the global memory access time of the candidate microkernel on the target hardware platform; ExecutionTime_M(P) represents the complete execution time required by the dynamic tensor P using microkernel M; the parameters k_2 and b_2 are the rate of change of the actual memory access time fitted during the training of the target microkernel evaluation model; f_LST(P, M) characterizes the memory access efficiency on the target hardware platform of the dynamic tensor P using microkernel M.
2. The dynamic tensor compilation optimization method according to claim 1, wherein the preset microkernel conditions include: the number of threads of the candidate micro-cores is an integer multiple of the number of threads in the thread cluster, the number of threads of the candidate micro-cores is smaller than or equal to the maximum number of threads in the thread block, and the total data access amount of the candidate micro-cores is an integer factor of the cache capacity;
screening out a plurality of candidate microkernels that meet the preset microkernel condition comprises:
screening out candidate microkernels whose number of threads is an integer multiple of the number of threads in the thread cluster, whose number of threads is less than or equal to the maximum number of threads in the thread block, and whose total data access amount is an integer factor of the cache capacity.
3. The method for optimizing dynamic tensor compilation according to claim 1, wherein predicting the resource utilization coefficient of the dynamic tensor when using the candidate microkernel comprises:
calculating a resource utilization coefficient of the dynamic tensor when the candidate microkernel is used through the following formula (1);
(1);
Wherein f_OCCA(P/M) characterizes the resource utilization coefficient of the dynamic tensor P using microkernel M; P/M characterizes the total number of microkernels M required to execute a single dynamic tensor P; divmod(P/M, NumCores) characterizes the modulo function of P/M and the total number of computing units NumCores; k_1 characterizes the rate of change of dynamic tensor performance with microkernel size; and b_1 characterizes a dynamic parameter, where k_1 + b_1 = 1.
4. The dynamic tensor compilation optimization method according to claim 1, wherein the target microkernel evaluation model is trained based on the following method:
Acquiring a source microkernel evaluation model which is trained on a source hardware platform, and training the source microkernel evaluation model on a target hardware platform through a training data set comprising a dynamic tensor instance to obtain a weight distribution difference coefficient of the source microkernel evaluation model between the source hardware platform and the target hardware platform;
Judging whether the weight distribution difference coefficient is smaller than a preset difference threshold value or not;
if yes, the weight of the source microkernel evaluation model is reserved as a target kernel evaluation model; if not, modifying the weight of the source microkernel evaluation model to obtain a target kernel evaluation model;
training the target kernel evaluation model according to the loss function of the target kernel evaluation model until the target kernel evaluation model converges to obtain a target microkernel evaluation model.
5. The dynamic tensor compilation optimization method according to claim 4, wherein the source microkernel evaluation model is a lifting tree structure;
Preserving the weight of the source microkernel evaluation model as a target kernel evaluation model, comprising:
The lifting tree structure of the source microkernel evaluation model is reserved as a target kernel evaluation model;
modifying the weight of the source microkernel evaluation model to obtain a target kernel evaluation model, comprising:
and constructing a new tree for the source microkernel evaluation model to obtain a target kernel evaluation model.
6. A dynamic tensor compilation optimization device, the device comprising:
The acquisition module is used for acquiring various hardware resource parameters of the target hardware platform before the target model is deployed on the target hardware platform; the hardware resource parameters comprise the maximum thread number in a thread block, the thread number in a thread cluster and the cache capacity;
The first screening module is used for screening a plurality of candidate microkernels which meet the preset microkernel conditions from a microkernel library constructed in advance according to the hardware resource parameters of the target hardware platform; the preset microkernel condition is set at least according to one hardware resource parameter;
The prediction module is used for inputting the screened multiple candidate microkernels into a pre-trained target microkernel evaluation model, and predicting, through the target microkernel evaluation model, the performance of the target model when the dynamic tensor in the target model uses each candidate microkernel, to obtain the performance score of each candidate microkernel; the target microkernel evaluation model is obtained based on target hardware platform training;
the second screening module is used for screening candidate microkernels with highest performance scores as target microkernels of dynamic tensors of the target model so as to obtain the dynamic tensors of the target microkernels in the target model;
The prediction module is specifically configured to, when predicting, by the target microkernel evaluation model, a performance of the target model when each candidate microkernel is used by a dynamic tensor in the target model to obtain a performance score of each candidate microkernel:
Predicting the throughput of the candidate microkernel used by the dynamic tensor in the target model through the target microkernel evaluation model;
predicting a resource utilization coefficient when the candidate microkernel is used by the dynamic tensor;
Determining a memory access efficiency coefficient of the dynamic tensor using the candidate microkernel based on the various memory access times of the candidate microkernel to the target hardware platform and the required complete execution time when the dynamic tensor uses the candidate microkernel; the memory access efficiency scores of microkernels of different sizes characterize the differences between microkernels of different sizes in memory access time;
calculating a performance score of the candidate microkernel based on the throughput, the resource utilization coefficient and the memory access efficiency coefficient when the candidate microkernel is used by the dynamic tensor;
The prediction module, when determining the memory access efficiency coefficient of the dynamic tensor using the candidate microkernel based on the various memory access times of the candidate microkernel to the target hardware platform and the required complete execution time, is specifically configured to determine it by the following formula (2):
(2)
Wherein t represents the number of threads in the candidate microkernel; LS_R^t represents the access time of the candidate microkernel to the target hardware platform registers; LS_S^t represents the access time of the candidate microkernel to the shared memory of the target hardware platform; LS_G^t represents the global memory access time of the candidate microkernel on the target hardware platform; ExecutionTime_M(P) represents the complete execution time required by the dynamic tensor P using microkernel M; the parameters k_2 and b_2 are the rate of change of the actual memory access time fitted during the training of the target microkernel evaluation model; f_LST(P, M) characterizes the memory access efficiency on the target hardware platform of the dynamic tensor P using microkernel M.
7. An electronic device, comprising: a processor, a memory and a bus, said memory storing machine readable instructions executable by said processor, said processor and said memory communicating over the bus when the electronic device is running, said machine readable instructions when executed by said processor performing the steps of the dynamic tensor compilation optimization method according to any of claims 1 to 5.
8. A computer readable storage medium, characterized in that the computer readable storage medium has stored thereon a computer program which, when executed by a processor, performs the steps of the dynamic tensor compilation optimization method according to any of claims 1 to 5.
Priority Applications (1)
Application Number | Priority Date | Filing Date | Title |
---|---|---|---|
CN202310538630.3A CN117076098B (en) | 2023-05-11 | 2023-05-11 | Dynamic tensor compiling optimization method and device, electronic equipment and medium |
Publications (2)
Publication Number | Publication Date |
---|---|
CN117076098A CN117076098A (en) | 2023-11-17 |
CN117076098B true CN117076098B (en) | 2024-07-30 |
Family
ID=88718166
Family Applications (1)
Application Number | Title | Priority Date | Filing Date |
---|---|---|---|
CN202310538630.3A Active CN117076098B (en) | 2023-05-11 | 2023-05-11 | Dynamic tensor compiling optimization method and device, electronic equipment and medium |
Country Status (1)
Country | Link |
---|---|
CN (1) | CN117076098B (en) |
Citations (2)
Publication number | Priority date | Publication date | Assignee | Title |
---|---|---|---|---|
CN110069284A (en) * | 2019-03-14 | 2019-07-30 | 成都恒创新星科技有限公司 | A kind of Compilation Method and compiler based on OPU instruction set |
CN113821208A (en) * | 2021-06-18 | 2021-12-21 | 清华大学 | Compiling optimization method and system for deep learning operator |
Family Cites Families (1)
Publication number | Priority date | Publication date | Assignee | Title |
---|---|---|---|---|
CN115222030A (en) * | 2022-07-15 | 2022-10-21 | 北京航空航天大学 | Automatic optimization accelerating method for deep learning network operator program |
Legal Events

Date | Code | Title | Description
---|---|---|---
| PB01 | Publication | |
| SE01 | Entry into force of request for substantive examination | |
| GR01 | Patent grant | |