
WO2021057746A1 - Neural network processing method, device, computer equipment and storage medium - Google Patents

Neural network processing method, device, computer equipment and storage medium

Info

Publication number
WO2021057746A1
Authority
WO
WIPO (PCT)
Prior art keywords
operator
tensor
neural network
splitting
split
Prior art date
Application number
PCT/CN2020/116933
Other languages
English (en)
French (fr)
Inventor
张潇
周玉松
孟小甫
Original Assignee
安徽寒武纪信息科技有限公司
Priority date (The priority date is an assumption and is not a legal conclusion. Google has not performed a legal analysis and makes no representation as to the accuracy of the date listed.)
Filing date
Publication date
Priority claimed from CN201910910117.6A external-priority patent/CN110674936A/zh
Priority claimed from CN201910910118.0A external-priority patent/CN110659728B/zh
Application filed by 安徽寒武纪信息科技有限公司 filed Critical 安徽寒武纪信息科技有限公司
Priority to EP20869294.7A priority Critical patent/EP4036810A4/en
Priority to US17/622,702 priority patent/US20220383082A1/en
Publication of WO2021057746A1 publication Critical patent/WO2021057746A1/zh

Classifications

    • G PHYSICS
    • G06 COMPUTING; CALCULATING OR COUNTING
    • G06N COMPUTING ARRANGEMENTS BASED ON SPECIFIC COMPUTATIONAL MODELS
    • G06N3/00 Computing arrangements based on biological models
    • G06N3/02 Neural networks
    • G06N3/06 Physical realisation, i.e. hardware implementation of neural networks, neurons or parts of neurons
    • G06N3/063 Physical realisation, i.e. hardware implementation of neural networks, neurons or parts of neurons using electronic means
    • G PHYSICS
    • G06 COMPUTING; CALCULATING OR COUNTING
    • G06N COMPUTING ARRANGEMENTS BASED ON SPECIFIC COMPUTATIONAL MODELS
    • G06N3/00 Computing arrangements based on biological models
    • G06N3/02 Neural networks
    • G06N3/04 Architecture, e.g. interconnection topology
    • G06N3/045 Combinations of networks
    • G PHYSICS
    • G06 COMPUTING; CALCULATING OR COUNTING
    • G06N COMPUTING ARRANGEMENTS BASED ON SPECIFIC COMPUTATIONAL MODELS
    • G06N3/00 Computing arrangements based on biological models
    • G06N3/02 Neural networks
    • G06N3/04 Architecture, e.g. interconnection topology
    • G06N3/044 Recurrent networks, e.g. Hopfield networks
    • G PHYSICS
    • G06 COMPUTING; CALCULATING OR COUNTING
    • G06N COMPUTING ARRANGEMENTS BASED ON SPECIFIC COMPUTATIONAL MODELS
    • G06N3/00 Computing arrangements based on biological models
    • G06N3/02 Neural networks
    • G06N3/04 Architecture, e.g. interconnection topology
    • G06N3/048 Activation functions

Definitions

  • the present invention relates to the field of information processing technology, in particular to a neural network processing method, device, computer equipment and storage medium.
  • multi-core processors based on the memory sharing model have become the mainstream architecture of current processors.
  • This multi-core architecture and the vector processing capabilities in each core can also be applied to neural network calculations.
  • data parallelism can usually be used to make full use of the additional hardware resources brought about by the multi-core processor architecture, that is, each processor core can execute calculations on the same neural network model with different data at the same time.
  • however, for small-batch neural network computing tasks that require low latency in inference scenarios, the multi-core processor structure cannot use this parallel method. How to unify data parallelism and neural network model parallelism so as to make full use of the hardware resources of the multi-core processor is therefore a technical problem that urgently needs to be solved.
  • the embodiment of the present invention provides a neural network processing method, device, computer equipment and storage medium.
  • the multi-core processor can directly call the computing library under the single-core architecture, making full use of the hardware resources of the multi-core processor and thereby avoiding the extra workload of re-implementation.
  • an embodiment of the present invention provides a neural network processing method, the method is applied to an artificial intelligence processor, the artificial intelligence processor includes M artificial intelligence processor cores, M is a positive integer greater than 1;
  • the method includes:
  • the neural network model includes a plurality of operators
  • splitting strategy set is a set of splitting methods corresponding to the target operator in the calculation graph
  • the sub-computing tasks are allocated to the corresponding artificial intelligence processor cores in the artificial intelligence processor for processing.
  • an embodiment of the present invention provides a neural network processing device, which includes a unit for executing the method of the above-mentioned first aspect. Specifically, the device is applied to an artificial intelligence processor.
  • the artificial intelligence processor includes M artificial intelligence processor cores, where M is a positive integer greater than 1.
  • the device includes:
  • the first obtaining unit is configured to obtain a calculation graph corresponding to a neural network model; wherein, the neural network model includes a plurality of operators;
  • the first determining unit is configured to determine the target splitting strategy of the neural network computing task in the splitting strategy set; wherein, the splitting strategy set is a set composed of the splitting methods corresponding to the target operator in the calculation graph;
  • a splitting unit configured to split the neural network computing task according to the target splitting strategy to obtain multiple sub-computing tasks
  • the execution unit is configured to allocate the sub-computing tasks to the corresponding artificial intelligence processor cores in the artificial intelligence processor for processing.
  • an embodiment of the present application provides a chip, and the chip includes the neural network model processing device provided in the second aspect.
  • an embodiment of the present application provides a computer device that includes the chip provided in the third aspect or the neural network model processing device provided in the second aspect.
  • an embodiment of the present application provides a computer device, including a processor and a memory, the processor and the memory being connected to each other, wherein the processor includes a general-purpose processor and an artificial intelligence processor, the memory is used for storing a computer program that supports the computer device to execute the above method, the computer program includes program instructions, and the processor is configured to invoke the program instructions to execute the method of the above first aspect.
  • an embodiment of the present application provides a computer-readable storage medium that stores a computer program, and the computer program includes program instructions that, when executed by a processor, cause the processor to execute the method of the first aspect described above.
  • an embodiment of the present application provides a computer program product, wherein the above-mentioned computer program product includes a non-transitory computer-readable storage medium storing a computer program, and the above-mentioned computer program is operable to cause a computer to execute part or all of the steps of the method described in the first aspect of the embodiments of the present application.
  • the computer program product may be a software installation package.
  • the neural network computing task is split into several smaller sub-computing tasks, so that the multi-core processor can directly call the computing library under the single-core architecture, making full use of the hardware resources of the multi-core processor, which can avoid the extra workload of re-implementation.
  • FIG. 1A is a schematic structural diagram of a multi-core processor provided by an embodiment of the present application.
  • FIG. 1B is a schematic diagram of the semantics of a reshape operator provided by an embodiment of the present application.
  • FIG. 1C is a schematic diagram of the semantics of a transpose operator provided by an embodiment of the present application.
  • FIG. 1D is a schematic diagram of the semantics of a concat operator provided by an embodiment of the present application.
  • FIG. 1E is a schematic diagram of the semantics of a split operator provided by an embodiment of the present application.
  • FIG. 1F is a schematic diagram of continuous storage of tensor data according to an embodiment of the present application.
  • FIG. 1G is a schematic diagram of ensuring the equivalence of operations provided by an embodiment of the present application.
  • FIG. 1H is a schematic diagram of a stride-containing memory distribution provided by an embodiment of the present application.
  • FIG. 1I is a schematic structural diagram of a software stack of an artificial intelligence processor provided by an embodiment of the present application.
  • Figure 2 is a schematic structural diagram of a computer device provided by an embodiment of the present application.
  • FIG. 3A is a schematic flowchart of a neural network processing method provided by an embodiment of the present application.
  • FIG. 3B is a schematic structural diagram of a facial recognition neural network model provided by an embodiment of the present application.
  • FIG. 3C is a schematic structural diagram of a neural network model for license plate character recognition provided by an embodiment of the present application.
  • Fig. 4 is a calculation diagram of a neural network convolution operator provided by an embodiment of the present application.
  • FIG. 5A is a schematic diagram obtained by splitting the input data according to the N dimension
  • FIG. 5B is a schematic diagram of splitting according to the C dimension of the output data
  • FIG. 5C is a schematic diagram obtained by splitting according to the C dimension of input data
  • FIG. 5D is a schematic diagram obtained by splitting according to the H dimension of the input data
  • FIG. 5E is a schematic diagram obtained by splitting according to the W dimension of the input data
  • FIG. 6A is a schematic flowchart of a neural network optimization method provided by an embodiment of the present application.
  • FIG. 6B is a schematic structural diagram of a glue operator extracted from an original calculation graph provided by an embodiment of the present application.
  • FIGS. 7A-7P are schematic diagrams of optimization of neural network models provided by embodiments of the present application.
  • FIG. 8A is a schematic structural diagram of a first calculation graph provided by an embodiment of the present application.
  • FIG. 8B is a schematic structural diagram of a glue subgraph provided by an embodiment of the present application.
  • FIG. 8C is a schematic structural diagram of an optimized equivalent optimized sequence provided by an embodiment of the present application.
  • FIG. 8D is a schematic structural diagram of an expanded first calculation graph provided by an embodiment of the present application.
  • FIG. 8E is a state collection diagram provided by an embodiment of the present application.
  • FIGS. 8F-8M are schematic diagrams of state transitions provided by embodiments of the present application.
  • FIG. 9 is a schematic structural diagram of a neural network processing device provided by an embodiment of the present application.
  • Fig. 10 is a schematic structural diagram of a neural network optimization device provided by an embodiment of the present application.
  • the term “if” can be interpreted as “when” or “once” or “in response to determination” or “in response to detection” depending on the context.
  • the phrase “if determined” or “if detected [described condition or event]” can be interpreted as meaning “once determined” or “in response to determination” or “once detected [described condition or event]” depending on the context ]” or “in response to detection of [condition or event described]”.
  • the so-called data parallelism refers to dividing data into several blocks and mapping them to different processors, and each processor runs the same processing program to process the allocated data.
  • most of the parallel processing uses this processing method, especially for problems with high computational complexity, such as fluid mechanics calculations, image processing, and so on.
  • data parallelism can be applied to large-scale neural network parallel training.
  • the core of data parallelism is to use multiple processors to simultaneously train the same neural network model.
  • each processor obtains the data used in this iteration from the data set, completes a round of inference and training calculations for the entire network, and returns the gradient data calculated in this round to update the model.
  • after the server maintaining the weights receives the gradients from all processors, it uses these gradients to update the model data.
  • the key to data parallelism lies in the batch size of the data to be processed in each iteration: the larger the batch, the more processors the data can be divided among for parallel processing.
  • model parallelism is another neural network parallel calculation method besides data parallelism.
  • model parallelism is to distribute the computational load to different processors by dividing the parameters of the neural network model.
  • the biggest difference between model parallelism and data parallelism is that the degree of model parallelism is statically determined at compile time and cannot be changed once compilation is done, which is an inherent property of the model; whereas the degree of data parallelism is dynamically specified at runtime, and different degrees of data parallelism can be specified for the same model.
  • data parallel programming tends to pursue the ultimate throughput, while model parallel programming is more inclined to pursue the ultimate low latency.
  • the most common structure currently adopted by multi-core processors is a multi-core structure based on storage sharing.
  • the processor contains multiple computing cores, each with an independent cache, register file, computing unit, and instruction control unit, and all computing cores share the same global storage.
  • a single core is sufficient to complete any complex logic calculation task, but its performance is limited by Moore's Law and chip technology.
  • multiple computing cores are introduced into the processor, and they can be used to process computing tasks with a high degree of parallelism.
  • the shared storage multi-core structure is a classic multi-core structure, and it is very suitable for data-parallel neural network training methods.
  • Each core can be used as a processor in data parallel, read different data respectively, and then complete the forward and reverse calculations of the network model in parallel. In the calculation phase, each core can still maintain its good performance-to-power ratio under the previous single-core architecture. At the same time, the throughput of the entire system can also increase with the expansion of the number of cores.
  • the original operator before the split and several sub-operators after the split are all operators supported by the artificial intelligence processor.
  • the original tensor data is also split into several new pieces of sub-tensor data along with the splitting of the operator. Reflected on the calculation graph, the original calculation graph containing a single operator is refined into a calculation graph containing more operators that can be executed in parallel.
  • operator splitting is not entirely limited to splitting model parameters, and data parallelism is also used to split data.
  • This method actually blurs the boundary between model parallelism and data parallelism.
  • take the convolution operator as an example. If the input data and the weight of the convolution operator are treated as equivalent low-order tensor data in the calculation graph, then in data parallelism the calculation is divided based on the division of the input data, while in model parallelism the calculation is divided based on the division of the weight; both achieve the division of the calculation load by dividing the tensor data associated with the convolution operator. From this perspective, data parallelism and model parallelism are unified.
  • the tensor is only a feature description of a piece of data stored, and the tensor records information such as the shape and type of the data.
  • tensor should be understood as tensor data, which can include input tensor data and output tensor data in the neural network model, and can also include feature tensor data.
  • for example, a tensor A with shape [6, 2] represents a two-dimensional matrix, specifically a matrix with 6 rows and 2 columns.
  • the first type of operators are responsible for obtaining output features from input features. They have their own specific calculation tasks, and perform multiplication, addition, non-linear calculation, comparison and selection, and other mathematical operations on the input data.
  • for example, the convolution operator uses the convolution kernel to perform convolution calculation on a local area of the input feature image, and obtains the output feature through linear calculation of the data in the input feature image; the fully connected operator uses matrix multiplication to linearly combine all the features of the input; the pooling operator samples the input data to obtain the output data; and so on.
  • the semantics of another type of operator does not involve any calculation logic. Its input data and output data have not changed in any way, whether it is the number of values or the value itself.
  • this type of operator is usually used to adjust the format, shape, and memory arrangement of the tensor data in the calculation graph of a neural network model, in order to turn the tensor data produced by upstream calculations into a form that is better and more convenient for downstream calculations.
  • these operators "glue" together the surrounding calculation parts of the neural network; accordingly, this type of operator is called a "glue" operator, and the part of the calculation graph composed of "glue" operators is called a "glue" subgraph.
  • the reshape operator refers to reinterpreting the shape of the tensor.
  • the reshape operator can be used to adjust the shape of tensor data.
  • for example, when the parameter shape is [-1], the tensor is flattened into a one-dimensional list.
  • when the parameter shape is [a, b, c, ..., n], where a, b, c, ..., n are all positive integers greater than 0, the tensor is transformed into a multi-dimensional matrix of that shape.
  • specifically, refer to the schematic diagram of the semantics of the reshape operator shown in FIG. 1B.
  • the transpose operator, that is, the tensor transposition operator, transposes a tensor.
  • the transpose operator can be used to adjust the dimensional order of tensor data.
  • the perm parameter is a complete permutation of the natural number sequence [1,2,3,...,n], and different complete permutations represent different transpose operators.
  • the transpose operator can change the order of dimensions.
  • the concat operator, that is, the concatenation operator, is used to concatenate multiple tensors into one tensor along a specified dimension; except for the specified dimension, the other dimensions of the input tensors should be consistent.
  • the neural network splices multiple tensors representing features from different upstream locations into one, so that these features can be processed together in downstream calculations. Specifically, refer to the schematic diagram of the semantics of the concat operator shown in FIG. 1D.
  • the split operator is used to split one tensor into multiple tensors along a specified dimension; except for the specified dimension, the resulting tensors remain consistent in the other dimensions.
  • through the split operator, features belonging to the same tensor data can be split into multiple copies so that they can be processed separately in subsequent calculations. Specifically, refer to the schematic diagram of split operator semantics shown in FIG. 1E.
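  • as a minimal illustration of the semantics of these four glue operators (using numpy arrays as stand-ins for tensor data; the shapes and values below are chosen only for demonstration), none of them changes the values themselves, only the shape, dimension order, or grouping of the data:

```python
import numpy as np

# A tensor of shape (2, 3, 4): 24 values in total.
x = np.arange(24).reshape(2, 3, 4)

# reshape: reinterpret the shape without changing the number or order of values.
flat = x.reshape(-1)            # shape [-1] expands the tensor into a list of 24 values
matrix = x.reshape(6, 4)        # shape [6, 4] views the same data as a 6x4 matrix

# transpose: change the order of dimensions according to a permutation (perm).
t = np.transpose(x, (2, 0, 1))  # perm = [2, 0, 1] -> new shape (4, 2, 3)

# concat: join two tensors along a specified dimension; other dimensions must match.
c = np.concatenate([x, x], axis=1)   # shape (2, 6, 4)

# split: divide one tensor into several tensors along a specified dimension.
parts = np.split(c, 2, axis=1)       # two tensors of shape (2, 3, 4)

assert flat.size == x.size and parts[0].shape == x.shape
```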
  • the glue operator is used to adjust at least one of the format of the tensor data in the neural network model, the shape of the tensor data, and the arrangement of the tensor data in memory.
  • the glue operator may include, but is not limited to, the aforementioned four different types of operators, and may also include other operators, which is not specifically limited in the embodiment of the present application.
  • Multi-dimensional tensors are used in neural network calculations as the basic unit of data transfer between operators.
  • data is stored in memory in a continuous manner.
  • for example, in FIG. 1F, the data is stored in 16 consecutive locations, I0 through I15.
  • the order of storage is the same as the order of the elements obtained by expanding all dimensions of the tensor, from the outermost to the innermost, into one-dimensional data. Access to the data in the tensor is determined by the coordinates of the elements in the different dimensions and by the dimension sizes themselves. For example, a tensor of shape (D0, D1, D2) is stored in a continuous memory block of size D0×D1×D2; to access the element at coordinates (n0, n1, n2), the address in memory can be determined from the starting address plus the computed offset (n0×D1+n1)×D2+n2.
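  • a small sketch of this addressing rule (the function name and shapes below are illustrative only):

```python
# Offset of element (n0, n1, n2) in a tensor of shape (D0, D1, D2)
# stored contiguously in row-major order. D0 is not needed for the
# offset itself; it only bounds the valid range of n0.
def offset_3d(n0, n1, n2, D0, D1, D2):
    return (n0 * D1 + n1) * D2 + n2

# Example: shape (2, 3, 4), element (1, 2, 3) lies at offset (1*3 + 2)*4 + 3 = 23,
# i.e. the last of the 2*3*4 = 24 contiguously stored values.
assert offset_3d(1, 2, 3, 2, 3, 4) == 23
```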
  • the tensor data in the calculation graph of a neural network model generally has 4 dimensions, which describe the data processed by the current calculation: N represents the batch size of the data, C represents the number of feature images, and H and W represent the size of the feature images.
  • the dimensional order of the tensor data may be NCHW, that is, N is the outermost dimension in the process of solving the offset, and W is the innermost dimension.
  • the default tensor data in Caffe uses this dimension order; MXNet and TensorFlow can support this dimension order.
  • the offset of the element with coordinates (n,c,h,w) in storage is ((n ⁇ C+c) ⁇ H+h) ⁇ W+w.
  • the dimensional order of the tensor data can also be NHWC (here, C is the innermost dimension), and the conversion method of the corresponding coordinate offset is ((n ⁇ H+h) ⁇ W+w) ⁇ C+c.
  • NHWC is closer to the BMP (full name: Bitmap) image data storage format than NCHW.
  • the BMP format file stores data according to individual pixels, and each pixel stores the color value of all channels. This eliminates the need for additional dimensional conversion when reading the input image.
  • the C dimension is easier to use vector calculation instructions for parallelization than the H and W dimensions.
  • in addition, when the convolution kernel is 1×1, the calculation of one value in the output tensor only requires a set of data along the C dimension of the input tensor. Placing the C dimension innermost therefore makes better use of the locality of the data, and the 1×1 convolution calculation can be replaced directly by a highly optimized matrix multiplication.
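  • a minimal sketch of this equivalence, assuming an NHWC input and a 1×1 kernel with illustrative shapes:

```python
import numpy as np

# A 1x1 convolution over an NHWC tensor is equivalent to a matrix
# multiplication between the flattened spatial positions and the kernel.
N, H, W, C_in, C_out = 1, 4, 4, 8, 16
x = np.random.rand(N, H, W, C_in)
kernel = np.random.rand(C_in, C_out)          # a 1x1 convolution kernel

# 1x1 convolution computed directly: each output value needs only the C slice
# of the input at the same (n, h, w) position.
conv = np.einsum('nhwc,co->nhwo', x, kernel)

# The same result as a single matrix multiplication over (N*H*W, C_in).
matmul = (x.reshape(-1, C_in) @ kernel).reshape(N, H, W, C_out)

assert np.allclose(conv, matmul)
```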
  • the dimensional order of the tensor data can also be CHWN (here, N is the innermost dimension), and the conversion method of the corresponding coordinate offset is ((c ⁇ H+h) ⁇ W+w) ⁇ N+n.
  • neon, developed by Nervana, uses tensors in this dimension order to perform convolution and pooling calculations.
  • putting the N dimension on the innermost side is the most intuitive parallel way, and its idea is consistent with the data parallelism in distributed training.
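  • the three coordinate-to-offset conversions described above can be summarized in a short sketch (function names are illustrative; each matches the corresponding formula in the text):

```python
# Offset of element (n, c, h, w) for the three dimension orders,
# assuming a contiguous, tightly packed layout.
def offset_nchw(n, c, h, w, N, C, H, W):
    return ((n * C + c) * H + h) * W + w

def offset_nhwc(n, c, h, w, N, C, H, W):
    return ((n * H + h) * W + w) * C + c

def offset_chwn(n, c, h, w, N, C, H, W):
    return ((c * H + h) * W + w) * N + n

# The three layouts place the same logical element at different memory positions.
print(offset_nchw(0, 1, 2, 3, 2, 4, 8, 8),
      offset_nhwc(0, 1, 2, 3, 2, 4, 8, 8),
      offset_chwn(0, 1, 2, 3, 2, 4, 8, 8))
```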
  • a difference in dimension order will not cause errors in the calculation results, but it will affect performance.
  • the artificial intelligence processor adopts a different dimensional order, as long as it is ensured that each operator achieves an operation equivalent to the abstract semantic meaning in the actual dimensional order during execution, the correctness of the final result can be guaranteed.
  • for example, if the tensor data actually uses the NCWH arrangement in storage while the neural network model is defined based on NCHW, then the result of each operator in the actual execution process should be transformed accordingly on the basis of its input data.
  • tensor data is stored in memory in a continuous and compact manner, but the artificial intelligence processor may adopt a non-continuous data storage method.
  • the non-continuous storage mode means that the dimension sizes actually used to calculate the offset in storage are not the mathematical dimension sizes of the tensor data; the actual dimension sizes used to calculate the offset are called strides.
  • for example, the inner W dimension of a two-dimensional tensor has a logical size of 4, but the actual storage is arranged with a stride of 6.
  • stride_n, stride_c, stride_h, and stride_w are used to indicate the offset that needs to be skipped to read the next value along the four dimensions of N, C, H, and W.
  • the offset of this element based on the starting address in storage is n ⁇ stride_n+c ⁇ stride_c+h ⁇ stride_h+w ⁇ stride_w.
  • the various layouts NCHW, NHWC, CHWN, etc. of tensors in continuous and tight arrangement can be regarded as special forms of stride.
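  • a minimal sketch of this stride-based addressing, with illustrative strides:

```python
# The offset of element (n, c, h, w) is computed from per-dimension strides
# rather than directly from the tensor shape.
def offset_strided(n, c, h, w, stride_n, stride_c, stride_h, stride_w):
    return n * stride_n + c * stride_c + h * stride_h + w * stride_w

# A contiguous, tightly packed NCHW tensor of shape (N, C, H, W) = (2, 4, 8, 8)
# is the special case stride_w = 1, stride_h = W, stride_c = H*W, stride_n = C*H*W.
N, C, H, W = 2, 4, 8, 8
assert offset_strided(1, 2, 3, 4, C * H * W, H * W, W, 1) == ((1 * C + 2) * H + 3) * W + 4

# A padded layout (e.g. a W dimension of logical size 4 stored with stride 6)
# simply uses a larger stride for the next-outer dimension; the formula is unchanged.
```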
  • the memory unit may have its own memory access alignment restriction, that is, the starting address of each memory access must be a multiple of a certain constant, which further increases the difficulty of instruction implementation.
  • a simpler method is to directly align the dimension of the tensor data up to the nearest integer multiple, and fill the supplemented part with 0.
  • the supplementary 0 has no effect on the final calculation result even if it participates in the calculation.
  • the stride of the corresponding dimension becomes an integral multiple of the calculation and memory access bit width, thus avoiding the trouble of processing the tail data separately.
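  • a minimal sketch of this up-alignment with zero filling (the alignment width of 8 is only an example):

```python
import numpy as np

# Round a dimension up to the nearest multiple of the hardware access width
# and fill the supplemented part with 0.
def align_up(size, align):
    return ((size + align - 1) // align) * align

w = 4                              # logical size of the innermost dimension
aligned_w = align_up(w, 8)         # e.g. an 8-element access width -> stride of 8
row = np.zeros(aligned_w)
row[:w] = np.arange(1, w + 1)      # real data in front, zeros as padding
# The padded zeros may participate in calculations such as sums without
# changing the result: row.sum() == 1 + 2 + 3 + 4.
assert row.sum() == 10
```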
  • in general, reshape is a zero-overhead operation that only needs to modify the shape information of the tensor; but when the dimensions involved include stride-aligned dimensions, the overhead introduced by the reshape operator cannot be ignored. For example, if the two dimensions of the tensor in FIG. 1G are merged into one, the storage locations of most of the elements need to be readjusted to eliminate the two zeros at the end of the W dimension.
  • in order to make use of the vector registers and SIMD (Single Instruction, Multiple Data) instructions of a general-purpose processor, the C dimension of the input tensor is further split into sub-segments according to the data width that the processor can process at a time, and these sub-segments are stored contiguously in memory, improving the utilization of the cache.
  • the SIMD instruction of the artificial intelligence processor can complete 8 floating point calculations at a time
  • the layout of N, C, H, and W will be adjusted to N, C/8, H, W, 8 after segmentation.
  • this segmentation idea is also applicable to the calculation optimization of some artificial intelligence processors. The difference is that the latter can process wider vector data at a time; the segmentation method likewise ensures the continuity of memory access in the calculation phase, which helps improve the efficiency of memory access.
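  • a minimal numpy sketch of this re-layout (shapes are illustrative; a real implementation would also handle a C that is not a multiple of 8):

```python
import numpy as np

# Reorganize an NCHW tensor into (N, C/8, H, W, 8) blocks so that the 8 channel
# values consumed by one SIMD instruction lie contiguously in memory. The factor
# 8 matches the example above (8 floating-point operations per instruction).
N, C, H, W = 1, 16, 4, 4
x = np.arange(N * C * H * W, dtype=np.float32).reshape(N, C, H, W)

blocked = (x.reshape(N, C // 8, 8, H, W)      # split C into C/8 groups of 8
             .transpose(0, 1, 3, 4, 2)        # move the group of 8 innermost
             .copy())                         # make the new layout contiguous
print(blocked.shape)                          # (1, 2, 4, 4, 8)
```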
  • the reconstruction of the subgraph refers to adding or deleting internal operators and intermediate result tensor data, and adjusting their topological relationship, under the condition that the input tensor data and output tensor data of the "glue" subgraph remain unchanged and the semantics represented by the entire "glue" subgraph remain unchanged.
  • the equivalent rule includes at least one of the equivalent rule of the reshape operator, the equivalent rule of the transpose operator, the equivalent rule of the concat operator, and the equivalent rule of the split operator. In the following embodiments, they will be explained one by one.
  • the equivalent rule describes the logical relationship of glue operators that can be optimized.
  • the logical relationship of the glue operator is that the output data of one of the at least two glue operators is handed over to the other operator as input data for calculation operations.
  • the artificial intelligence processor is also called a dedicated processor.
  • the artificial intelligence processor refers to a processor for a specific application or field.
  • currently, artificial intelligence processors include, for example, the GPU (Graphics Processing Unit, also known as the display core, visual processor, or display chip) and the NPU (Neural Processing Unit, neural network processor).
  • regarding the software stack of the artificial intelligence processor, referring to FIG. 1I, the software stack structure 10 includes an artificial intelligence application 100, an artificial intelligence framework 102, an artificial intelligence learning library 104, an artificial intelligence runtime library 106, and a driver 108. Each of these is elaborated in detail below.
  • the artificial intelligence application 100 corresponds to different application scenarios and provides corresponding artificial intelligence algorithm models.
  • the algorithm model can be directly analyzed by the programming interface of the artificial intelligence framework 102.
  • the artificial intelligence algorithm model is converted into binary instructions through the artificial intelligence learning library 104, the artificial intelligence runtime library 106 is called to convert the binary instructions into artificial intelligence learning tasks, the artificial intelligence learning tasks are placed in a task queue, and the artificial intelligence learning tasks in the task queue are scheduled by the driver 108 to be executed by the underlying artificial intelligence processor.
  • the artificial intelligence runtime library 106 can also be directly called to run the offline running files that have been solidified and generated previously, reducing the intermediate overhead of the software architecture and improving the operating efficiency.
  • the artificial intelligence framework is the first layer in the entire deep learning ecosystem.
  • in early frameworks such as Caffe, the Layer was regarded as the basic element for building neural networks.
  • later artificial intelligence frameworks such as TensorFlow and MXNet, although using different names such as Operator, share the same core idea as the Layer in Caffe: they all divide neural network calculations into various common operators oriented toward tensor data.
  • the artificial intelligence framework needs to map the deep learning tasks expressed by the computational graph structure of the neural network into instructions and data that can be executed by the CPU or the artificial intelligence processor.
  • the artificial intelligence framework uses operators as specific elements to implement computing tasks, and provides each operator with a kernel function (Kernel) executed on the CPU or artificial intelligence processor.
  • the artificial intelligence framework schedules and executes the kernel function corresponding to each operator in the calculation graph to complete the calculation of the entire neural network.
  • the problem of data parallelism is that its scalability depends on the size of the processed data batch. Although this is not usually a problem in the training phase, it is difficult to guarantee this premise in the inference phase.
  • the neural network model used in the real-time service field including video surveillance, autonomous driving, etc.
  • the processed data is usually serially input in a stream, resulting in a small scale of data being processed each time, or even a single picture.
  • data parallelism cannot provide any degree of parallelism, and all work tasks will be concentrated on a single core, which prevents the computing resources brought by multiple cores from being converted into the speed of processing tasks.
  • after the training of the neural network model is completed offline using the data set, the model will be deployed to a server in the cloud to process data sent from the outside world.
  • the application scenario will change from offline training to online reasoning.
  • a very important indicator is the time delay, that is, the time from the server receiving the data to be processed to the return of the processed result, and furthermore, the time to process the data using the neural network model.
  • the low latency ensures that the cloud server can respond to the data sent by the client in the shortest time. In some more sensitive scenarios, it directly determines whether the solution is available. Therefore, the requirements for artificial intelligence processors in the online reasoning stage have changed from processing large batches of data and high throughput to processing small batches of data with low latency.
  • the deep learning artificial intelligence processor adapts its own hardware design to adapt to the data parallel characteristics of the deep learning algorithm itself and improves the computational throughput.
  • the artificial intelligence processor often needs sufficient data scale to achieve high computational efficiency. Further splitting within the operator will reduce the calculation scale on each core. When the split reaches a certain granularity, the loss of computational efficiency on each core will exceed the benefits of splitting to increase the degree of parallelism. Therefore, between split parallelism and computational efficiency, sufficient parallelism must be provided while ensuring sufficient computational efficiency.
  • the neural network model can be regarded as a complex calculation graph composed of hundreds or even thousands of operators.
  • the algorithm logic in different types of operators is different, which leads to different methods of splitting these operators.
  • the division of each operator, in addition to balancing its own calculation efficiency and parallelism, also considers the combination with the front and rear operators, and even the overall impact.
  • the rapid development of deep learning has brought about increasingly large-scale and complex networks, and it is unrealistic to find a good parallel method manually; therefore, an automated method is needed to give a good splitting and parallel strategy for different networks.
  • one operator is split into multiple smaller-scale sub-operators, so that the computing library under the single-core architecture can be directly called, avoiding the extra workload of re-implementation.
  • for example, an activation operator can be split into many smaller activation operators, which means that each sub-task can be completed simply by calling the original single-core activation function on multiple cores, without the need to modify or re-implement a multi-core version of the activation function.
  • it is necessary to take into account the calculation efficiency and parallelism of each operator itself after the split, and also consider the mutual cooperation between the context operators in the split. The ultimate goal is to obtain a split and parallel solution that can effectively reduce the end-to-end reasoning delay of the entire neural network model.
  • the neural network processing method can avoid modifying the single-core processor calculation library as much as possible, and at the same time can realize the parallel execution of the neural network model on the multi-core processor.
  • the upper framework divides the operator in the neural network model into several sub-operators that can be executed in parallel.
  • the deep learning framework calls the computing library to generate the sub-operator that is executed on a single core.
  • by loading the machine instructions of the sub-operators onto different cores, the parallel calculation of the operator on the multi-core processor is realized.
  • the deep learning framework can use a single-core processor calculation library to generate calculation instructions for sub-operators
  • the input and output tensor data of the operators in the neural network model are also split into corresponding sub-tensor data as the operators are split into sub-operators.
  • FIG. 2 is a schematic structural diagram of a computer device provided by an embodiment of the present application.
  • the computer device 20 may include a general-purpose processor 201, a memory 202, a communication bus 203, a communication interface 204, and at least one artificial intelligence processor 205.
  • the general-purpose processor 201, the artificial intelligence processor 205, the memory 202, and the communication interface 204 are connected to each other through the communication bus 203.
  • the general-purpose processor 201 may be a central processing unit (CPU), and it may also be another general-purpose processor, a digital signal processor (DSP), an application-specific integrated circuit (ASIC), a field-programmable gate array (FPGA) or other programmable logic device, a discrete gate or transistor logic device, a discrete hardware component, or the like.
  • the general-purpose processor 201 may be a microprocessor or the general-purpose processor 201 may also be any conventional processor or the like.
  • the general-purpose processor 201 may also be an integrated circuit chip with signal processing capability. In the implementation process, each step of the neural network processing method of the present application can be completed by the integrated logic circuit of hardware in the general-purpose processor 201 or instructions in the form of software.
  • the memory 202 may be a read-only memory (Read-Only Memory, ROM), a random access memory (Random Access Memory, RAM), or other memories.
  • the memory 202 is used to store data and various software programs, such as a program for splitting the neural network model according to a determined target splitting strategy in the embodiment of the present application.
  • the memory may include a physical device for storing information, and the information is usually digitized and then stored in a medium using electrical, magnetic, or optical methods.
  • the memory described in this embodiment may also include: devices that use electrical energy to store information, such as RAM, ROM, etc.; devices that use magnetic energy to store information, such as hard disks, floppy disks, magnetic tapes, magnetic core memories, bubble memory, U disks ; A device that uses optical means to store information, such as CD or DVD.
  • of course, the memory may also be a quantum memory, a graphene memory, and so on.
  • the communication interface 204 uses a transceiving device such as but not limited to a transceiver to implement communication between the computer device 20 and other devices or a communication network. For example, a model file sent by another device can be received through the communication interface 204.
  • the artificial intelligence processor 205 can be mounted on a host CPU (Host CPU) as a coprocessor, and the host CPU allocates tasks to it. In practical applications, the artificial intelligence processor 205 can implement one or more operations. For example, taking a neural network processor (NPU) as an example, the core part of the NPU is an arithmetic circuit, and a controller controls the arithmetic circuit to extract matrix data from the memory 202 and perform multiplication and addition operations.
  • the artificial intelligence processor 205 may include 8 clusters, and each cluster includes 4 artificial intelligence processor cores.
  • the artificial intelligence processor 205 may be an artificial intelligence processor with a reconfigurable architecture.
  • a reconfigurable architecture means that if an artificial intelligence processor can use reusable hardware resources to flexibly change its own architecture according to different application requirements, so as to provide a matching architecture for each specific application requirement, then this artificial intelligence processor is called a reconfigurable computing system, and its architecture is called a reconfigurable architecture.
  • the computer device 20 is only an example provided by the embodiment of the present application, and the computer device 20 may have more or fewer components than shown, may combine two or more components, or may be realized with a different configuration of components.
  • the following, with reference to the schematic flowchart of a neural network processing method shown in FIG. 3A, specifically explains how the neural network model is split in the embodiment of the application.
  • the following takes Caffe as an example for detailed description; the method may include, but is not limited to, the following steps:
  • Step S310 Obtain a calculation graph corresponding to the neural network model; wherein the neural network model includes multiple operators, and the multiple operators are used to perform neural network calculation tasks.
  • the target operator may be a corresponding target layer in the neural network model, and the target layer is at least one layer in the neural network model.
  • the calculation graph refers to a way of describing the calculation process of the neural network model using the graph structure.
  • the neural network model may receive input data, and generate a prediction output according to the received input data and current model parameters.
  • the neural network model can be a regression model, a deep neural network (DNN), a convolutional neural network model (Convolutional Neural Networks, CNN), a recurrent neural network model (Recurrent Neural Networks, RNN), etc.
  • the embodiments of this application do not make specific limitations.
  • the input neurons and output neurons of the multi-layer operation do not refer to the neurons in the input layer and the neurons in the output layer of the entire neural network model.
  • the neuron in the lower layer of the network forward operation is the input neuron
  • the neuron in the upper layer of the network forward operation is the output neuron.
  • for example, the K-th layer is called the input layer and the neurons in it are the input neurons, while the K+1-th layer is called the output layer and the neurons in it are the output neurons. That is, except for the top layer, each layer can serve as an input layer, and the next layer is the corresponding output layer.
  • different neural network models correspond to different neural network computing tasks.
  • the neural network computing tasks corresponding to the deep learning neural network model can be image classification, text classification, etc.
  • the neural network computing tasks corresponding to the convolutional neural network model can be image recognition, video classification, etc.
  • the neural network computing tasks corresponding to the long short-term memory neural network model (Long Short Term Memory Network, LSTM) can be speech recognition, picture description, natural language processing, etc.
  • Step S312 Determine the target splitting strategy of the neural network computing task in the splitting strategy set; wherein, the splitting strategy set is a set of splitting methods corresponding to the target operator in the calculation graph.
  • the split strategy set when determining the split strategy set, it may include:
  • the set of splitting strategies is determined according to the splitting manner corresponding to the target operator.
  • the target operator is one of multiple operators.
  • the number of artificial intelligence processor cores used to process a single model with a single input is called the first degree of parallelism, that is, the degree of model parallelism.
  • the user only needs to specify the first degree of parallelism during compilation, and the artificial intelligence runtime library 106 will automatically divide the calculation graph corresponding to the original neural network model in multiple dimensions such as topology, input and output, and model parameters, so that after the division the model can be executed in parallel on multiple computing cores while data synchronization between the cores is ensured automatically.
  • model parallel technology can be used to divide the VGG16 classification network into multiple cores and process the same input picture in parallel, so that the classification delay of a single picture can be significantly reduced.
  • a single model processes multiple inputs at the same time, and each input uses a different computing core to process, which is called a single-model multi-data parallel computing mode. It can be simply understood as multiple copies of the same model, and each model uses one or more cores (depending on the first degree of parallelism) to process different input data. But in fact the model (instructions, weights, etc.) is not copied, but shared by all cores.
  • the degree of data parallelism refers to the number of pieces of input data processed, and the degree of data parallelism is also called the second degree of parallelism.
  • the parallelism of the target operator is the second parallelism.
  • the parallelism of the target operator is the first parallelism.
  • the two programming methods of data parallelism and model parallelism can be used superimposedly to meet application scenarios that require high throughput under certain delay constraints.
  • the degree of parallelism includes a first degree of parallelism and a second degree of parallelism.
  • the actual number of computing cores used is the data parallelism multiplied by the model parallelism, and the product cannot exceed the number of computing cores of the artificial intelligence processor in the artificial intelligence processor.
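  • a small sketch of this constraint (names and values are illustrative):

```python
# The number of cores actually used is the model parallelism (first degree)
# multiplied by the data parallelism (second degree), and the product may not
# exceed the number of available artificial intelligence processor cores.
def check_parallelism(model_parallelism, data_parallelism, num_cores):
    used = model_parallelism * data_parallelism
    if used > num_cores:
        raise ValueError(f"{used} cores requested but only {num_cores} available")
    return used

print(check_parallelism(model_parallelism=4, data_parallelism=2, num_cores=8))  # 8
```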
  • the degree of parallelism refers to how many sub-operators an operator will be split into. This variable is usually limited by the number of cores of the multi-core processor architecture; under the premise of not exceeding the upper limit on the number of cores, the degree of parallelism should be an integer power of 2.
  • the reason for keeping the degree of parallelism an integer power of 2 is that the number of cores in current multi-core processor architectures is usually an integer power of 2, for example 1, 2, 4, 8, 16, and so on. A task whose degree of parallelism is not an integer power of 2 will often cause "fragments" in the scheduling of the artificial intelligence processor cores.
  • the split dimension refers to the logical dimension along which the operator should split itself to obtain a series of sub-operators.
  • as mentioned above, the tensor data in the calculation graph of a neural network model generally has 4 dimensions describing the data currently processed: N represents the batch size, C represents the number of feature images, and H and W represent the size of the feature images.
  • the computer equipment can select any one of the above four dimensions for splitting.
  • the activation operator can allow its input data and output data to be split in any dimension.
  • if the input data of an activation operator is divided into several sub-blocks (from the point of view of consistency, the output data will be divided in the same way), denoted input0, input1, input2, ..., inputm-1 and output0, output1, output2, ..., outputm-1, then in the calculation stage the entire activation operator is actually split into m smaller activation operators. These activation operators have no dependencies on each other and can run on multiple cores.
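  • a minimal sketch of this idea (a plain ReLU stands in for the single-core activation kernel, and the sub-tasks are run sequentially here rather than on separate cores):

```python
import numpy as np

# Split an activation operator into m independent sub-operators along a chosen
# dimension; each sub-operator only needs the existing single-core implementation.
def relu_single_core(x):
    return np.maximum(x, 0)

def split_activation(input_data, m, axis=0):
    inputs = np.array_split(input_data, m, axis=axis)         # input0 ... input(m-1)
    outputs = [relu_single_core(part) for part in inputs]      # independent sub-tasks
    return np.concatenate(outputs, axis=axis)                  # output0 ... output(m-1)

x = np.random.randn(8, 16)
assert np.allclose(split_activation(x, m=4, axis=0), relu_single_core(x))
```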
  • the size of the split dimension refers to the specific value of each sub-operator in the dimension after the operator is split into a series of sub-operators along the split dimension.
  • the split mode corresponding to each target operator can be determined according to the degree of parallelism, the split dimension, and the size of the split dimension.
  • the parallelism, the splitting dimension, and the size of the splitting dimension corresponding to the operators can determine the splitting methods corresponding to multiple target operators, so as to form a set of splitting strategies.
  • the set of splitting strategies is determined according to the parallelism, splitting dimension, and size of the splitting dimension corresponding to each target operator.
  • for example, the face recognition neural network model contains a variety of different types of operators (convolution operators, pooling operators, fully connected operators), where the connection relationship between the operators is: convolutional layer 1 - pooling layer 1 - convolutional layer 2 - pooling layer 2 - fully connected layer 1 - fully connected layer 2. Since these operators all allow splitting in any dimension, the computer device can determine the corresponding splitting method for each operator according to the degree of parallelism, the splitting dimension, and the size of the splitting dimension, thereby forming the set of splitting strategies.
  • in another case, the neural network model contains many different types of operators, some of which allow splitting in any dimension while others only support splitting in limited dimensions; in this case, the computer device can first determine the splitting methods supported by each target operator, and then determine the intersection of the splitting methods supported by all target operators as the set of splitting strategies. In general, in this case, the set of splitting strategies is determined according to the splitting methods supported by each of the multiple target operators.
  • through this implementation, the negative effects of unreasonable splitting methods can be avoided, for example, increased resource consumption of the computer device, or time-consuming problems caused by the unbalanced scale of the sub-operators after splitting, and so on.
  • for example, the license plate character recognition neural network model contains many different types of operators (convolution operators, pooling operators, activation operators, softmax operators, etc.), where the connection relationship between the operators is: convolutional layer 1 - activation function ReLU - max pooling layer 1 - convolutional layer 2 - activation function ReLU - max pooling layer 2 - convolutional layer 3 - activation function ReLU - max pooling layer 3 - convolutional layer 4 - activation function - max pooling layer 4 - convolutional layer 5 - activation function - max pooling layer 5 - fully connected layer 1 - softmax layer - output layer.
  • in yet another case, the neural network model contains a variety of different types of operators, some of which do not support any form of splitting at all; in order to keep the data splitting format consistent with that of the other operators in the neural network model, in this case the neural network model is not split.
  • in a specific implementation, the computer device can determine the splitting method of an operator according to the type of the operator. For the splitting methods of the operators, please refer to Table 2:
  • the splitting methods supported by different types of operators are different.
  • in this way, the operator can be split in a targeted manner based on the characteristics of the operator, thereby avoiding the negative effects of unreasonable splitting methods, for example, increased resource consumption of the computer device, or time-consuming problems caused by the unbalanced scale of the sub-operators after splitting, and so on.
  • the different splitting methods of the convolution operator can be described as the following five types; these five methods can intersect with each other and exist at the same time, ensuring a sufficient degree of splitting:
  • referring to FIG. 4, it is a schematic diagram of an original calculation graph of a convolution operator provided by an embodiment of the present application.
  • for the convolution operator conv, it contains 4-dimensional input data (input), and under the action of the weight matrix, the output data (output) can be obtained.
  • referring to FIGS. 5A to 5E, there are multiple splitting methods for the convolution operator on the calculation graph provided in this embodiment of the application, under the condition of a parallelism of 2.
  • FIG. 5A is a schematic diagram obtained by splitting according to the N dimension of input data
  • FIG. 5B is a schematic diagram obtained by splitting according to the C dimension of output data
  • 5C is a schematic diagram obtained by splitting according to the C dimension of input data
  • 5D is a schematic diagram obtained by splitting according to the H dimension of the input data
  • FIG. 5E is a schematic diagram obtained by splitting according to the W dimension of the input data.
  • n represents the input data batch size
  • ic represents the number of input data feature images
  • ih represents the length of the input data feature image
  • iw represents the width of the input data feature image
  • oc represents the number of output data feature images
  • oh represents the length of the output data feature image, and ow represents the width of the output data feature image
  • kh represents the length of the convolution kernel window
  • kw represents the width of the convolution kernel window.
  • these splitting methods act on different dimensions and can also be combined with each other to form more splitting methods, which can provide sufficient parallelism to utilize the resources of the multi-core processor while, to a certain extent, avoiding excessive splitting of a single dimension that would affect the calculation efficiency of the computer device.
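  • a minimal sketch of one of these splitting methods, splitting along the N dimension of the input with a parallelism of 2 (the naive convolution below is only a stand-in for the single-core convolution kernel; shapes follow the (n, ic, ih, iw) and (oc, ic, kh, kw) notation above):

```python
import numpy as np

# A simple valid convolution used as the single-core kernel.
def conv_single_core(x, weight):
    n, ic, ih, iw = x.shape
    oc, _, kh, kw = weight.shape
    out = np.zeros((n, oc, ih - kh + 1, iw - kw + 1))
    for oh in range(out.shape[2]):
        for ow in range(out.shape[3]):
            patch = x[:, :, oh:oh + kh, ow:ow + kw]
            out[:, :, oh, ow] = np.tensordot(patch, weight, axes=([1, 2, 3], [1, 2, 3]))
    return out

x = np.random.randn(4, 3, 8, 8)
w = np.random.randn(6, 3, 3, 3)
halves = np.split(x, 2, axis=0)                       # split the input along N
partial = [conv_single_core(h, w) for h in halves]    # two independent sub-operators
merged = np.concatenate(partial, axis=0)              # concatenate outputs along N
assert np.allclose(merged, conv_single_core(x, w))
```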
  • the computer device can split the softmax operator in any one or several dimensions other than the dimension normalized by the probability of the softmax operator, and the result will be Several softmax operators that can be executed in parallel.
  • the determining the target splitting strategy of the neural network computing task in the splitting strategy set includes:
  • the weight value of each splitting strategy in the splitting strategy set is determined, and the target splitting strategy is determined according to the weight values.
  • the time taken for the target operator to be executed in parallel on the multi-core processor in a certain split mode can be characterized as a weight value.
  • the calculation time for a multi-core processor to complete an operator depends on the time of the core that takes the longest time to execute the split sub-calculation task.
  • the weight value of the target operator split mode can be determined through the following steps A11-A14:
  • A11. Determine the calculation amounts c1, c2, ..., cn of the n sub-operators obtained by splitting, where ci is calculated according to the type and scale of the i-th sub-operator;
  • A12. Determine the memory access amounts d1, d2, ..., dn of the n sub-operators, where di is likewise calculated according to the type and scale of the i-th sub-operator after splitting;
  • A13. Determine the calculation throughput α of each artificial intelligence processor core;
  • A14. Determine the memory access bandwidth β of each artificial intelligence processor core, which can be derived from B, the total memory access bandwidth of the multi-core artificial intelligence processor.
  • on this basis, the computer device can calculate the weight value t of the split mode of the target operator according to the following calculation formula (1): t = max( max(c1/α, d1/β), max(c2/α, d2/β), ..., max(cn/α, dn/β) )  (1)
  • the inner maximum value operation in the calculation formula is based on the fact that the calculation part and the memory access part realized by the operator can be hidden from each other, that is, the calculation and memory access can be performed concurrently as much as possible.
  • when the scale of the split sub-operators becomes small, the actual calculation throughput of each core will decrease, and α can be further modified to make the estimation more accurate.
  • the outer maximum value operation in the calculation formula means that the time for the multi-core artificial intelligence processor to complete the calculation of an operator depends on the time of the core that takes the longest time to execute the sub-calculation task.
  • the weight of the target operator under a certain splitting method is determined as the weight of the splitting strategy. It is understandable that the weight of the split strategy included in the split strategy set can be determined through the foregoing implementation manner.
  • the weight of the split strategy can be not only the time spent in executing the sub-computing tasks, but also the throughput of executing the sub-computing tasks.
  • the weight of the splitting strategy can also be determined by actually measuring the time for executing all the sub-computing tasks in the operator splitting mode corresponding to the splitting strategy on the multi-core artificial intelligence processor.
  • the computer device may determine the split strategy with the smallest weight value as the target split strategy of the neural network model.
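  • the following Python sketch (illustrative only; the strategy names, amounts, and hardware parameters are hypothetical) shows how the weight value of calculation formula (1) can be evaluated for each splitting strategy and how the strategy with the smallest weight can be selected:

      def strategy_weight(c, d, alpha, beta):
          # c[i]: calculation amount of sub-operator i, d[i]: memory access amount of sub-operator i
          # alpha: per-core calculation throughput, beta: per-core memory access bandwidth
          # inner max: calculation and memory access of one core can hide each other;
          # outer max: the slowest core determines when the operator finishes
          return max(max(ci / alpha, di / beta) for ci, di in zip(c, d))

      # hypothetical splitting strategies: name -> (calculation amounts, memory access amounts)
      strategies = {"split_N": ([5e9, 5e9], [2e8, 2e8]),
                    "split_C": ([6e9, 4e9], [3e8, 1e8])}
      alpha, beta = 1e12, 1e11
      target = min(strategies, key=lambda s: strategy_weight(*strategies[s], alpha, beta))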
  • Step S314 Split the neural network computing task according to the target split strategy to obtain multiple sub-computing tasks.
  • Step S316 Allocate the sub-computing task to the corresponding artificial intelligence processor core in the artificial intelligence processor for processing.
  • the core idea of the technical solution described in the embodiments of this application is: the computing task of the target operator in the neural network model is split into smaller sub-computing tasks that are assigned to multiple cores for parallel execution, so as to make full use of the hardware resources of the multi-core processor chip.
  • each sub-operator after splitting can reuse the instruction implementation of the operator under the single-core architecture for calculation, it is possible to avoid the reconstruction of the instruction implementation of the original operator.
  • the neural network model is used to perform a specific neural network computing task, such as face recognition; another example, edge detection; another example, semantic analysis, and so on.
  • the running result refers to the result when the computer device executes a specific neural network computing task, which may include, but is not limited to: the accuracy of the neural network model, the running time of the neural network model, and so on.
  • the computer device can output the running result, for example, the computer device can display the running result on the display screen.
  • the multi-core processor can directly call the computing library under the single-core architecture, making full use of the hardware resources of the multi-core processor. In this way, the extra workload of re-implementation can be avoided.
  • although the steps in the flowchart of FIG. 3A are displayed in sequence as indicated by the arrows, these steps are not necessarily executed in the order indicated by the arrows. Unless explicitly stated herein, there is no strict order restriction on the execution of these steps, and they may be executed in other orders. Moreover, at least some of the steps in FIG. 3A may include multiple sub-steps or multiple stages; these sub-steps or stages are not necessarily executed at the same time, but may be executed at different times, and they are not necessarily executed sequentially, but may be executed in turn or alternately with at least some of the other steps or with sub-steps or stages of other steps.
  • FIG. 6A is a schematic flowchart of a neural network optimization method provided by an embodiment of this application, which specifically explains how to optimize a neural network model in this embodiment of the application; the method may include, but is not limited to, the following steps:
  • the "neural network model" is also called a model, such as the "first neural network model", the "second neural network model" or the "third neural network model"; it can receive input data and generate a predictive output according to the received input data and the current model parameters.
  • the prediction output may include image detection output results, semantic analysis output results, image classification output results, and so on.
  • the neural network model can include deep learning neural network (DNN), convolutional neural network (Convolutional Neural Network, CNN), extreme learning machine (ELM) or other neural network models, etc. .
  • the neural network model includes a glue operator.
  • the glue operator may include a reshape operator, a transpose operator, a concat operator, a split operator, etc., and may also include other glue operators that can be used to adjust the format of tensor data, the shape of tensor data, and the arrangement of tensor data in memory in the neural network model, which is not specifically limited in the embodiment of the present application.
  • the calculation graph refers to a way of describing the calculation process of the neural network model using the graph structure.
  • the glue subgraph as a calculation graph containing glue operators.
  • the glue subgraph extracted by the general-purpose processor in the computer device from the calculation graph corresponding to the neural network model can be seen in Fig. 6B.
  • the glue subgraph contains a reshape operator and a concat operator. Each glue operator is associated with corresponding tensor data.
  • the reconstruction result sub-graph refers to a sub-graph that can replace the glue sub-graph.
  • the reconstruction result subgraph is obtained by traversing the state set graph.
  • the reconstruction result subgraph is a path from the initial state to the end state in the state set graph.
  • processing the glue subgraph in the calculation graph may include: on the premise that the input tensor data and output tensor data of the glue subgraph remain unchanged and the semantics represented by the entire glue subgraph remain unchanged, adding or deleting glue operators and intermediate result tensor data in the glue subgraph and adjusting their topological relationship.
  • when the number of glue subgraphs extracted by the computer device is multiple, the computer device can expand all of the glue subgraphs and obtain the optimized structure corresponding to each glue subgraph by reconstructing the subgraphs; it is also possible to expand only any one of the glue subgraphs and obtain the optimized structure corresponding to that glue subgraph by reconstructing the subgraph, which is not specifically limited in the embodiment of this application.
  • the processing of the glue sub-graphs in the calculation graph to obtain the reconstruction result sub-graph set may include but is not limited to the following steps A21 to A23, which will be described in detail in the following:
  • Step A21 Expand the glue subgraph according to the logical relationship of the glue operator to obtain an expanded glue subgraph.
  • the expansion of the glue sub-graph according to the logical relationship of the glue operator to obtain the expanded glue sub-graph includes: the logic between the glue operators in the glue sub-graph is analyzed according to equivalent rules The relationship is expanded to obtain a logical relationship equivalent to the semantics of the glue sub-graph; the glue sub-graph is expanded according to the logical relationship equivalent to the semantics of the glue sub-graph to obtain the expanded glue sub-graph Figure.
  • the expansion of the logical relationship between the glue operators in the glue subgraph according to the equivalent rule includes:
  • the equivalent rule includes at least one of the equivalent rule of the reshape operator, the equivalent rule of the transpose operator, the equivalent rule of the concat operator, and the equivalent rule of the split operator.
  • the equivalent rule is a rule optimized according to the logical relationship of the glue operator, which is described in detail below:
  • the logical relationship of glue operators may include the logical relationship between reshape operators, or the logical relationship between reshape operators and other operators of the first type;
  • the other operators of the first type may include any one of the transpose operator, the concat operator, and the split operator.
  • the logical relationship of the glue operator includes the logical relationship between the reshape operators, for example, multiple consecutive reshape operators; in another possible implementation, the logical relationship of the glue operator Including the logical relationship between the reshape operator and other operators of the first type, for example, the reshape operator is adjacent to the transpose operator; another example, the reshape operator is adjacent to the concat operator; another example, the reshape operator is adjacent to the split operator Adjacent, etc.
  • two operators being adjacent means that the output tensor data of one operator is the input tensor data of the other operator.
  • the logical relationship of the glue operator should be understood as the execution logic of the computer device in the process of executing the program code of the neural network model.
  • a computer device executes a certain piece of program code, it first executes the reshape operator, and then executes the transpose operator.
  • the computer device uses the output tensor data of the reshape operator as the input tensor data of the transpose operator.
  • the first case the output tensor data of the transpose operator is the input tensor data of the reshape operator.
  • the logical relationship of the glue operator includes that the output tensor data of the transpose operator is the input tensor data of the reshape operator.
  • the computer equipment determines the semantically equivalent logical relationship with the glue subgraph of "transpose operator and reshape operator" based on the logical relationship of the glue operator, which can include:
  • when the relative positions of the dimensions merged by the reshape operator remain unchanged during the execution of the transpose operator, the output tensor data of the reshape operator is used as the input tensor data of the transpose operator.
  • the dimension refers to the dimension of the tensor data in the calculation graph in the neural network model.
  • the dimensions of the tensor data in the calculation graph of the convolutional neural network model may generally include 4 dimensions, namely N representing the batch size of the data processed by the current calculation, C representing the number of feature images, and H and W representing the size of the feature images.
  • for example, the calculation graph corresponding to the neural network model includes a reshape operator and a transpose operator, where the output tensor data of the transpose operator is the input tensor data of the reshape operator.
  • in the case that the relative position of the dimensions merged by the reshape operator does not change during the execution of the transpose operator, in one implementation, as shown in b in Figure 7A, optimization can be performed according to optimization path (1), using part of the output tensor data of the reshape operator as the input tensor data of the transpose operator, so that a logical relationship equivalent to the semantics of the glue subgraph can be obtained; in another implementation, optimization can also be performed according to optimization path (2), using the output tensor data of the reshape operator as the input tensor data of the transpose operator, so that a logical relationship equivalent to the semantics of the glue subgraph can be obtained.
  • tensor A [3, 4, 5]
  • tensor B [5, 3, 4]
  • the operation of the reshape operator in the latter two dimensions can be considered as combining 3 and 4 first, and then splitting them, which can be split into 6 and 2.
  • when the processor (for example, a general-purpose processor CPU or a dedicated artificial intelligence processor) runs the neural network model optimized in this way, the resource consumption of the computer device can be reduced.
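  • a brief NumPy check of this equivalence, using the [3, 4, 5] example above and assuming the merged dimension 3*4 is re-split into 6 and 2 (illustrative only, not part of the original disclosure):

      import numpy as np
      A = np.arange(3 * 4 * 5).reshape(3, 4, 5)
      path1 = A.transpose(2, 0, 1).reshape(5, 6, 2)   # transpose first, then reshape
      path2 = A.reshape(6, 2, 5).transpose(2, 0, 1)   # reshape first, then transpose
      assert np.array_equal(path1, path2)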
  • the second case the output tensor data of the concat operator is the input tensor data of the reshape operator.
  • the logical relationship of the glue operator includes that the output tensor data of the concat operator is the input tensor data of the reshape operator.
  • the computer equipment determines the semantically equivalent logical relationship with the glue subgraph of "concat operator and reshape operator" according to the logical relationship of the glue operator, which can include:
  • the calculation graph corresponding to the neural network model includes the reshape operator and the concat operator, where the output tensor data of the concat operator is the input tensor data of the reshape operator.
  • the reshape operator can be considered in the execution process: first merge the dimensions, and then split the merged dimensions.
  • the dimension 10 is split into a series of factors {5, 2}, so the dimension 10 can be expressed in the form of (4/2+6/2)*2; then, in this case, each input tensor of the concat operator can first pass through its own reshape operator, and the output tensor data of these reshape operators is used as the input tensor data of the concat operator.
  • the logical relationship equivalent to the semantics of the glue subgraph obtained by optimization can improve the overall performance of the neural network model.
  • when the processor (for example, a general-purpose processor CPU or a dedicated artificial intelligence processor) runs the optimized neural network model, the resource consumption of the computer device can be reduced.
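  • a brief NumPy check of this equivalence, assuming two hypothetical inputs whose concatenated dimension 4+6=10 is split into {5, 2} as in the example above (illustrative only):

      import numpy as np
      A = np.arange(12).reshape(3, 4)
      B = np.arange(18).reshape(3, 6)
      path1 = np.concatenate([A, B], axis=1).reshape(3, 5, 2)                    # concat, then reshape
      path2 = np.concatenate([A.reshape(3, 2, 2), B.reshape(3, 3, 2)], axis=1)   # reshape each input, then concat
      assert np.array_equal(path1, path2)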
  • the third case the output tensor data of the split operator is the input tensor data of multiple reshape operators.
  • the logical relationship of the glue operator includes that the output tensor data of the split operator is the input tensor data of multiple reshape operators.
  • the computer equipment determines the semantically equivalent logical relationship with the glue subgraph of "split operator and multiple reshape operators" according to the logical relationship of the glue operator, which can include:
  • when each output tensor of the split operator, after passing through its corresponding reshape operator, differs in length in at most one dimension, the output tensor data of the multiple reshape operators is used as the input tensor data of the split operator.
  • for example, the calculation graph corresponding to the neural network model contains multiple reshape operators and a split operator, where the output tensor data of the split operator is the input tensor data of the multiple reshape operators; after all the output tensors of the split operator pass through their corresponding reshape operators, at most one dimension has a different length, for example, only the length of the C dimension is different.
  • the output tensor data of multiple reshape operators are used as the input tensor data of the split operator, so that a logical relationship equivalent to the semantics of the glue subgraph can be obtained.
  • for example, after tensor A [3,15,4] passes through the split operator, tensor B [3,6,4] and tensor C [3,9,4] can be obtained; after tensor B and tensor C pass through their respective reshape operators, tensor D [6,3,4] and tensor E [9,3,4] can be obtained.
  • analyzing tensor D and tensor E, we can see that the output tensors of the reshape operators differ in only one dimension (dimension 6 in tensor D and dimension 9 in tensor E).
  • when the processor (for example, a general-purpose processor CPU or a dedicated artificial intelligence processor) runs the neural network model optimized in this way, the resource consumption of the computer device can be reduced.
  • the fourth case multiple consecutive reshape operators.
  • the logical relationship of the glue operator may include N continuous reshape operators.
  • determining the semantically equivalent logical relationship with the glue subgraph of "multiple reshape operators" according to the logical relationship of the glue operator can include:
  • the calculation graph corresponding to the neural network model contains multiple continuous reshape operators.
  • the computer device merges the N continuous reshape operators, so that an optimized structure as shown in b in Fig. 7D can be obtained.
  • tensor B [B1,B2,B3,...,Bn].
  • the tensor C [C1,C2,C3,...,Cn].
  • the input of the reshape3 operator obtained by combining the reshape1 operator and the reshape2 operator is an A tensor
  • the output is a C tensor.
  • A [1,32,1,1]
  • B [1,4,4,2]
  • C [16,2] .
  • the reshape1 operator and the reshape2 operator can be combined to obtain the reshape3 operator.
  • when the processor (for example, a general-purpose processor CPU or a dedicated artificial intelligence processor) runs the neural network model optimized in this way, the resource consumption of the computer device can be reduced.
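  • a brief NumPy check of merging two consecutive reshape operators, using the A [1,32,1,1], B [1,4,4,2], C [16,2] example above (illustrative only):

      import numpy as np
      A = np.arange(32).reshape(1, 32, 1, 1)
      path1 = A.reshape(1, 4, 4, 2).reshape(16, 2)   # reshape1 followed by reshape2
      path2 = A.reshape(16, 2)                        # merged reshape3
      assert np.array_equal(path1, path2)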
  • the logical relationship of glue operators can include the logical relationship between transpose operators, or the logical relationship between a transpose operator and other operators of the second type; here, the other operators of the second type can include any one of the reshape operator, the concat operator, and the split operator.
  • the logical relationship of the glue operator includes the logical relationship between the transpose operators, for example, multiple consecutive transpose operators; in another possible implementation, the logical relationship of the glue operator Including the logical relationship between the transpose operator and other operators of the second type, for example, the transpose operator is adjacent to the reshape operator; another example, the transpose operator is adjacent to the concat operator; another example, the transpose operator is adjacent to the split operator Adjacent, etc.
  • two operators being adjacent means that the output tensor data of one operator is the input tensor data of the other operator.
  • the first case the output tensor data of the reshape operator is the input tensor data of the transpose operator.
  • the logical relationship of the glue operator includes that the output tensor data of the reshape operator is the input tensor data of the transpose operator.
  • the computer equipment determines the semantically equivalent logical relationship with the glue subgraph of "reshape operator and transpose operator" according to the logical relationship of the glue operator, which can include:
  • the output tensor data of the transpose operator is used as the input tensor data of the reshape operator.
  • the dimension refers to the dimension of the tensor data in the calculation graph in the neural network model.
  • the dimensions of the tensor data in the calculation graph of the convolutional neural network model may generally include 4 dimensions, namely N representing the batch size of the data processed by the current calculation, C representing the number of feature images, and H and W representing the size of the feature images.
  • the calculation graph corresponding to the neural network model includes a reshape operator and a transpose operator, where the output tensor data of the reshape operator is the input data of the transpose operator.
  • the relative position of the dimension split by the same dimension of the intermediate state during the splitting phase of the reshape operator does not change during the execution of the transpose operator.
  • the optimization can be performed according to the optimization path (1), and part of the output tensor data of the transpose operator is used as the input tensor data of the reshape operator, so that a logical relationship equivalent to the semantics of the glue subgraph can be obtained;
  • optimization can also be performed according to the optimization path (2), and the output tensor data of the transpose operator is used as the input tensor data of the reshape operator, so that a logical relationship equivalent to the semantics of the glue subgraph can be obtained.
  • tensor A [3, 4, 5]
  • tensor B [4, 3, 5]
  • the reshape operator can be considered in the execution process: first merge the dimensions, and then split the merged dimensions.
  • the logical relationship equivalent to the semantics of the glue subgraph obtained by optimization can improve the overall performance of the neural network model.
  • when the processor (for example, a general-purpose processor CPU or a dedicated artificial intelligence processor) runs the optimized neural network model, the resource consumption of the computer device can be reduced.
  • the second case the output tensor data of the concat operator is the input tensor data of the transpose operator.
  • the logical relationship of the glue operator includes that the output tensor data of the concat operator is the input tensor data of the transpose operator.
  • the computer device determines the semantically equivalent logical relationship with the glue subgraph of "concat operator and transpose" according to the logical relationship of the glue operator, which may include: converting the output tensor data of the transpose operator As the input tensor data of the concat operator.
  • the calculation graph corresponding to the neural network model includes transpose and concat operators, where the output tensor data of the concat operator is the input tensor data of the transpose operator
  • the output tensor data of the transpose operator is used as the input tensor data of the concat operator, so that a logical relationship equivalent to the semantics of the glue subgraph can be obtained.
  • tensor A [3,4,5]
  • when the processor (for example, a general-purpose processor CPU or a dedicated artificial intelligence processor) runs the neural network model optimized in this way, the resource consumption of the computer device can be reduced.
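  • a brief NumPy check of this equivalence, assuming tensor A [3,4,5] and a hypothetical second input B [3,6,5] concatenated on dimension 1 and then transposed with perm [1,0,2] (illustrative only):

      import numpy as np
      A = np.arange(3 * 4 * 5).reshape(3, 4, 5)
      B = np.arange(3 * 6 * 5).reshape(3, 6, 5)
      path1 = np.concatenate([A, B], axis=1).transpose(1, 0, 2)                      # concat, then transpose
      path2 = np.concatenate([A.transpose(1, 0, 2), B.transpose(1, 0, 2)], axis=0)   # transpose each input, then concat
      assert np.array_equal(path1, path2)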
  • the third case the output tensor data of the split operator is the input tensor data of multiple transpose operators.
  • the logical relationship of the glue operator includes that the output tensor data of the split operator is the input tensor data of multiple transpose operators; the general-purpose processor optimizes the calculation graph based on the logical relationship of the glue operators in the calculation graph.
  • the computer equipment determines the semantically equivalent logical relationship with the glue subgraph of "split operator and multiple transpose operators" according to the logical relationship of the glue operator, which can include:
  • when the perm parameters corresponding to the multiple transpose operators are the same, the output tensor data of the multiple transpose operators is used as the input tensor data of the split operator.
  • the perm parameter is a full permutation of the natural number sequence [1,2,3,...,n], and different full permutations represent different transpose operators.
  • a full permutation is defined as follows: taking any m (m less than or equal to n) elements from n different elements and arranging them in a certain order is called a permutation of m elements taken from the n different elements; when m = n, all such permutations are called full permutations.
  • the full arrangement of the three elements 1, 2, 3 can be: 1, 2, 3; 1, 3, 2; 2, 1, 3; 2, 3, 1; 3, 1, 2; 3, 2, 1.
  • for example, the calculation graph corresponding to the neural network model contains multiple transpose operators and a split operator, where the output tensor data of the split operator is the input tensor data of the multiple transpose operators.
  • when the perm parameters corresponding to the multiple transpose operators are the same, as shown in b in Figure 7G, the output tensor data of the multiple transpose operators is used as the input tensor data of the split operator, so that the logical relationship equivalent to the semantics of the glue subgraph can be obtained.
  • for example, after tensor A [3,10,5] passes through the split operator, tensor B [3,4,5] and tensor C [3,6,5] can be obtained.
  • the perm parameters corresponding to each transpose operator are both [1,0,2], so that tensor B becomes tensor D [4,3,5] after its transpose operator, and tensor C is transformed in the same way.
  • when the processor (for example, a general-purpose processor CPU or a dedicated artificial intelligence processor) runs the neural network model optimized in this way, the resource consumption of the computer device can be reduced.
  • the fourth case multiple consecutive transpose operators.
  • the logical relationship of the glue operator may include M continuous transpose operators.
  • the computer device determines the logical relationship semantically equivalent to the glue subgraph of "multiple transpose operators" according to the logical relationship of the glue operator, which may include: when the calculation graph corresponding to the neural network model contains M continuous transpose operators, the M transpose operators are combined to obtain one transpose operator.
  • the continuous M transpose operators include a first transpose operator and a second transpose operator; combining the M continuous transpose operators into one transpose operator includes: determining the perm parameters corresponding to each of the first transpose operator and the second transpose operator, and determining the first parameter according to the perm parameters corresponding to each of the first transpose operator and the second transpose operator, where the first parameter is the perm parameter corresponding to the merged transpose operator.
  • for example, if the first transpose operator (executed first) corresponds to perm1 and the second transpose operator corresponds to perm2, the merged perm parameter can be written as perm3[i] = perm1[perm2[i]], where the brackets [] indicate taking the element at the given position in the array.
  • the merged transpose operator switches the order of the tensor data under the determined perm3 parameters.
  • for example, the calculation graph corresponding to the neural network model contains multiple continuous transpose operators; the computer device merges the M continuous transpose operators, and an optimized structure as shown in b in Figure 7H can be obtained, that is, a logical relationship that is semantically equivalent to the glue subgraph of "multiple continuous transpose operators".
  • the transpose_1423 operator and the transpose_1243 operator can be combined to obtain the transpose_1432 operator.
  • in this way, the processor (for example, a general-purpose processor CPU or a dedicated artificial intelligence processor) does not need to execute two different transpose operators in sequence, but only executes the combined transpose operator, which can reduce redundant calculation and thereby reduce the resource consumption of the computer device.
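  • a brief NumPy check of merging two consecutive transpose operators, using the transpose_1423 and transpose_1243 example above written with 0-based perm parameters (illustrative only):

      import numpy as np
      A = np.arange(2 * 3 * 4 * 5).reshape(2, 3, 4, 5)
      perm1, perm2 = (0, 3, 1, 2), (0, 1, 3, 2)    # transpose_1423, then transpose_1243
      perm3 = tuple(perm1[i] for i in perm2)       # composed perm, here (0, 3, 2, 1), i.e. transpose_1432
      path1 = A.transpose(perm1).transpose(perm2)  # two consecutive transposes
      path2 = A.transpose(perm3)                   # single merged transpose
      assert np.array_equal(path1, path2)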
  • the logical relationship of glue operators may include the logical relationship between concat operators, or the logical relationship between the concat operator and other operators of the third type.
  • the third type of other operators includes any one of the reshape operator, the transpose operator, and the split operator.
  • the logical relationship of the glue operator includes the logical relationship between the concat operators, for example, multiple consecutive concat operators; in another possible implementation, the logic of the glue operator The relationship includes the logical relationship between the concat operator and other operators.
  • for example, the concat operator is adjacent to the reshape operator; for example, the concat operator is adjacent to the transpose operator; for example, the concat operator is adjacent to the split operator; and so on.
  • two operators being adjacent means that the output tensor data of one operator is the input tensor data of the other operator.
  • the first case the output tensor data of multiple reshape operators is the input tensor data of the concat operator.
  • the logical relationship of the glue operator includes that the output tensor data of multiple reshape operators is the input tensor data of the concat operator.
  • the computer device determines the logical relationship semantically equivalent to the glue subgraph of "multiple reshape operators and concat operator" according to the logical relationship of the glue operator, which may include: when the input tensors corresponding to the multiple reshape operators differ in length in at most one dimension, the output tensor data of the concat operator is used as the input tensor data of the multiple reshape operators.
  • for example, the calculation graph corresponding to the neural network model includes a concat operator and multiple reshape operators, where the output tensor data of the multiple reshape operators is the input tensor data of the concat operator.
  • when the input tensors corresponding to the multiple reshape operators differ in length in at most one dimension, for example, only the length in the W dimension is different,
  • the output tensor data of the concat operator is used as the input tensor data of multiple reshape operators, so that a logical relationship equivalent to the semantics of the glue subgraph can be obtained.
  • when the processor (for example, a general-purpose processor CPU or a dedicated artificial intelligence processor) runs the neural network model optimized in this way, the resource consumption of the computer device can be reduced.
  • the multiple reshape operators when the multiple reshape operators are consecutive multiple reshape operators, the multiple consecutive reshape operators may be combined to obtain one reshape operator.
  • the input of the reshape3 operator obtained by combining the reshape1 operator and the reshape2 operator is an A tensor
  • the output is a C tensor.
  • A [1,32,1,1]
  • B [1,4,4,2]
  • C [16,2] .
  • the reshape1 operator and the reshape2 operator can be combined to obtain the reshape3 operator.
  • when the processor (for example, a general-purpose processor CPU or a dedicated artificial intelligence processor) runs the merged, optimized neural network model, the goal of reducing the resource consumption of the computer device can be achieved.
  • the second case the output tensor data of multiple transpose operators is the input tensor data of the concat operator.
  • the logical relationship of the glue operator includes that the output tensor data of multiple transpose operators is the input tensor data of the concat operator.
  • the computer device determines the logical relationship semantically equivalent to the glue subgraph of "multiple transpose operators and concat operator" according to the logical relationship of the glue operator, which may include: when the perm parameters corresponding to the multiple transpose operators are the same, the output tensor data of the concat operator is used as the input tensor data of the multiple transpose operators.
  • the perm parameter is a full permutation of the natural number sequence [1,2,3,...,n], and different full permutations represent different transpose operators.
  • a full permutation is defined as follows: taking any m (m less than or equal to n) elements from n different elements and arranging them in a certain order is called a permutation of m elements taken from the n different elements; when m = n, all such permutations are called full permutations.
  • the full arrangement of the three elements 1, 2, 3 can be: 1, 2, 3; 1, 3, 2; 2, 1, 3; 2, 3, 1; 3, 1, 2; 3, 2, 1.
  • for example, the calculation graph corresponding to the neural network model includes a concat operator and multiple transpose operators, where the output tensor data of the multiple transpose operators is the input tensor data of the concat operator.
  • when the perm parameters corresponding to the multiple transpose operators are the same, as shown in b in Figure 7J, the output tensor data of the concat operator is used as the input tensor data of the multiple transpose operators, so that the logical relationship equivalent to the semantics of the glue subgraph can be obtained.
  • the perm parameters corresponding to each of the multiple transposes are [1, 0, 2]
  • when the processor (for example, a general-purpose processor CPU or a dedicated artificial intelligence processor) runs the optimized neural network model, the resource consumption of the computer device can be reduced.
  • the multiple transpose operators when the multiple transpose operators are consecutive multiple transpose operators, the multiple consecutive transpose operators can be combined to obtain one transpose operator.
  • the continuous M transpose operators include a first transpose operator and a second transpose operator; the combining the M continuous transpose operators into one transpose operator includes:
  • the first parameter is determined according to the perm parameters corresponding to the first transpose operator and the second transpose operator, where the first parameter is the perm parameter corresponding to the combined transpose operator.
  • the brackets [] indicate to take the elements in the array.
  • the merged transpose operator switches the order of the tensors under the determined perm3 parameters.
  • when the processor (for example, a general-purpose processor CPU or a dedicated artificial intelligence processor) runs the neural network model optimized in this way, the resource consumption of the computer device can be reduced.
  • the third case the output tensor data of the split operator is the input tensor data of the concat operator.
  • the logical relationship of the glue operator includes that the output tensor data of the split operator is the input tensor data of the concat operator.
  • the computer device determines the logical relationship semantically equivalent to the glue subgraph of "split operator and concat operator" according to the logical relationship of the glue operator, which may include: when the dimensions on which the concat operator and the split operator respectively operate are the same, the concat operator and the split operator are combined and eliminated.
  • for example, the calculation graph corresponding to the neural network model includes a concat operator and a split operator, where the output tensor data of the split operator is the input tensor data of the concat operator.
  • in the case that the concat operator and the split operator operate on the same dimension, for example, both operate on the C dimension during execution, as shown in b in Figure 7K, the concat operator and the split operator are combined and eliminated.
  • tensor A [3,10,5]
  • when the processor (for example, a general-purpose processor CPU or a dedicated artificial intelligence processor) runs the neural network model optimized in this way, the resource consumption of the computer device can be reduced.
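  • a brief NumPy check that a split operator followed by a concat operator on the same dimension can be eliminated, using tensor A [3,10,5]; the split point is assumed for illustration:

      import numpy as np
      A = np.arange(3 * 10 * 5).reshape(3, 10, 5)
      parts = np.split(A, [4], axis=1)            # split on dimension 1
      restored = np.concatenate(parts, axis=1)    # concat on the same dimension
      assert np.array_equal(A, restored)          # the split/concat pair can be removed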
  • the fourth case N consecutive concat operators.
  • the logical relationship of the glue operator may include N consecutive concat operators; where N is a positive integer greater than or equal to 2.
  • the computer device determines the logical relationship semantically equivalent to the glue subgraph of "multiple concat operators" according to the logical relationship of the glue operator, which may include:
  • when the calculation graph corresponding to the neural network model contains multiple concat operators and the multiple concat operators operate on the same dimension, for example, the N dimension, the computer device can merge these multiple concat operators to obtain one concat operator.
  • in this way, the optimized structure shown in b in Figure 7L can be obtained, that is, the optimized logical relationship that is semantically equivalent to the glue subgraph.
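  • a brief NumPy check of merging consecutive concat operators that operate on the same dimension (the tensor shapes are hypothetical, for illustration only):

      import numpy as np
      a = np.arange(2 * 3 * 4).reshape(2, 3, 4)
      b = np.arange(2 * 5 * 4).reshape(2, 5, 4)
      c = np.arange(2 * 7 * 4).reshape(2, 7, 4)
      path1 = np.concatenate([np.concatenate([a, b], axis=1), c], axis=1)   # two consecutive concats
      path2 = np.concatenate([a, b, c], axis=1)                              # single merged concat
      assert np.array_equal(path1, path2)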
  • the logical relationship of glue operators can include the logical relationship between split operators, or the logical relationship between a split operator and other operators of the fourth type; here, the other operators of the fourth type include any one of the reshape operator, the transpose operator, and the concat operator.
  • in a possible implementation, the logical relationship of the glue operator includes the logical relationship between the split operators, for example, multiple consecutive split operators; in another possible implementation, the logical relationship of the glue operator includes the logical relationship between the split operator and other operators, for example, the split operator is adjacent to the reshape operator; for example, the split operator is adjacent to the transpose operator; for example, the split operator is adjacent to the concat operator; and so on.
  • two operators being adjacent means that the output tensor data of one operator is the input tensor data of the other operator.
  • the first case the output tensor data of the reshape operator is the input tensor data of the split operator.
  • the logical relationship of the glue operator includes that the output tensor data of the reshape operator is the input tensor data of the split operator.
  • the computer device determines the logical relationship semantically equivalent to the glue subgraph of "reshape operator and split operator" according to the logical relationship of the glue operator, which may include: inversely deriving the reshape operator from its output to its input.
  • when, during the reverse derivation process, the dimension k0+k1+...+km on which the split operator operates as part of the output is split into the form p0 × p1 × ... × (k0/Πipi + k1/Πipi + ... + km/Πipi) × ... × pn-1 × pn, the output tensor data of the split operator is used as the input tensor data of the reshape operator.
  • for example, the calculation graph corresponding to the neural network model includes the split operator and the reshape operator, where the output tensor data of the reshape operator is the input tensor data of the split operator.
  • when, during the reverse derivation process, the dimension k0+k1+...+km on which the split operator operates as part of the output is split into the form described above, the output tensor data of the split operator is used as the input tensor data of the reshape operator, so that a logical relationship semantically equivalent to the glue subgraph can be obtained.
  • tensor A [3,10,5]
  • tensor C [6,2,5]
  • the second case the output tensor data of the transpose operator is the input tensor data of the split operator.
  • the logical relationship of the glue operator includes that the output tensor data of the transpose operator is the input tensor data of the split operator.
  • the computer equipment determines the semantically equivalent logical relationship with the glue subgraph of "transpose operator and split operator" based on the logical relationship of the glue operator, which can include:
  • the output tensor data of the split operator is used as the input tensor data of the transpose operator.
  • for example, the calculation graph corresponding to the neural network model includes a split operator and a transpose operator, where the output tensor data of the transpose operator is the input tensor data of the split operator.
  • in this case, the output tensor data of the split operator is used as the input tensor data of the transpose operator, so that a logical relationship equivalent to the semantics of the glue subgraph can be obtained.
  • tensor A [3, 10, 5]
  • tensor B [10, 3, 5]
  • and the split operator is then calculated on tensor B.
  • when the processor (for example, a general-purpose processor CPU or a dedicated artificial intelligence processor) runs the neural network model optimized in this way, the resource consumption of the computer device can be reduced.
  • the third case the output tensor data of the concat operator is the input tensor data of the split operator.
  • the logical relationship of the glue operator includes that the output tensor data of the concat operator is the input tensor data of the split operator.
  • the computer device determines the logical relationship semantically equivalent to the glue subgraph of "concat operator and split operator" according to the logical relationship of the glue operator, which may include: when the dimensions on which the concat operator and the split operator respectively operate are the same, the concat operator and the split operator are combined and eliminated.
  • for example, the calculation graph corresponding to the neural network model includes the split operator and the concat operator, where the output tensor data of the concat operator is the input tensor data of the split operator.
  • in the case that the concat operator and the split operator are semantically inverse operations, for example, the concat operator and the split operator operate on the same C dimension during execution, as shown in b in Figure 7O, the concat operator and the split operator are combined and eliminated.
  • when the processor (for example, a general-purpose processor CPU or a dedicated artificial intelligence processor) runs the neural network model optimized in this way, the resource consumption of the computer device can be reduced.
  • the fourth case N consecutive split operators.
  • the logical relationship of the glue operator includes N consecutive split operators; where N is a positive integer greater than or equal to 2.
  • the computer device determines the semantically equivalent logical relationship with the glue subgraph of "multiple split operators" according to the logical relationship of the glue operator, which may include: each of the N consecutive split operators When the dimension of the operation is the same dimension, the N consecutive split operators are combined.
  • the calculation graph corresponding to the neural network model contains multiple split operators, and the multiple split operators operate on the same dimension, for example, the N dimension,
  • the computer device can merge these multiple split operators to obtain a split operator.
  • in this way, the optimized structure shown in b in Figure 7P can be obtained, that is, a logical relationship that is semantically equivalent to the glue subgraph.
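  • a brief NumPy check of merging consecutive split operators that operate on the same dimension (the tensor shape and split points are hypothetical, for illustration only):

      import numpy as np
      A = np.arange(3 * 9 * 4).reshape(3, 9, 4)
      b, c = np.split(A, [4], axis=1)             # first split on dimension 1
      c1, c2 = np.split(c, [2], axis=1)           # second split on the same dimension
      merged = np.split(A, [4, 6], axis=1)        # single merged split
      assert all(np.array_equal(x, y) for x, y in zip([b, c1, c2], merged))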
  • based on the equivalent rules described in this application, the glue subgraph can be expanded to construct multiple new operator paths that are semantically equivalent to the glue subgraph.
  • the left side is the original structure of the glue subgraph, where the tensor data (A0, A1, A2, A3) is first transformed into tensor data (A0, A1*A2) through the reshape operator ,A3), and then transformed into tensor data (A0, A3, A1*A2) through the transpose operator, and finally split into two sub-tensor data through the split operator.
  • the expanded glue subgraph is shown in Figure 8A, where the bold part represents the original topological relationship in the glue subgraph.
  • what can be seen from Figure 8A is that, in addition to the original topological relationship of the glue subgraph, there are many different ways to obtain the output tensor data of the original subgraph from the input tensor data (A0, A1, A2, A3) of the original subgraph.
  • the method further includes: after the equivalent logical relationships are added, the content contained in the glue subgraph changes; according to the directed edges between the glue operators in the changed glue subgraph and the equivalent rules, the equivalent logical relationships corresponding to at least two adjacent glue operators at the corresponding positions in the changed glue subgraph are determined again, until the glue subgraph can no longer be expanded by the equivalent rules.
  • the expanded glue subgraph satisfies the constraint: for any group of operators in the glue subgraph that satisfies the equivalent rules, the transformed operator topology also exists in the expanded glue subgraph, that is, the expanded glue subgraph is a closure based on the equivalent rules.
  • this constraint makes it impossible for the expanded glue subgraph to be further expanded by the equivalent rules, so as to ensure that the expanded glue subgraph already contains as many topological structures of equivalent logical relationships as possible, which is beneficial to subsequently obtaining, from the expanded glue subgraph, the target subgraph that is optimal for the performance of the artificial intelligence processor.
  • Step A22 Convert the expanded glue subgraph to obtain a state set graph of the tensor data associated with the glue operator.
  • any path from the initial state to the end state in the state set graph of the tensor data associated with the glue operator is used to represent a reconstructed subgraph, and the reconstructed subgraph is an optimized form of the glue subgraph.
  • the reason for converting the expanded glue subgraph is that the expanded glue subgraph is used to describe the realization process of constructing equivalent logical relationships of the operator sequence, and the target subgraph cannot be determined directly based on the expanded glue subgraph.
  • the conversion of the expanded glue subgraph to obtain the state set graph of the tensor data associated with the glue operator includes:
  • the corresponding input tensor data corresponding to the glue operator in the expanded glue subgraph is determined.
  • the state set diagram of the tensor data associated with the glue operator is determined according to the input tensor data and the output tensor data of the glue operator in the expanded glue subgraph.
  • all tensors in the expanded glue subgraph have unique numbers {0,1,2,...,n}; the data in all input tensors in the graph is taken as a whole D, and D is divided and combined into different tensors, so each combination of tensors can be regarded as a state of D.
  • the state of D can be expressed as the numbered set of all input tensors {s0,s1,...,sm}, and the ultimate goal is to turn D into the state {e0,e1,...,en}, where ei is the number of the i-th output tensor.
  • each glue operator associated with the input tensor turns at least one of all the tensors corresponding to the current D into another one or more tensors, that is, the number set representing the state of D has occurred Change, for example, from one numbered state set to another numbered state set.
  • in this way, a graph structure composed of the various states of D and the directed edges between the states represented by the glue operators can be obtained, that is, the state set graph.
  • Fig. 8B is a schematic structural diagram of a glue subgraph provided in an embodiment of this application.
  • the glue subgraph includes two reshape operators and one concat operator.
  • tensor data (2, 3, 5) can pass through reshape operator 1 to obtain tensor data (2, 15, 1);
  • tensor data (2, 4, 5) can pass through reshape operator 2 to obtain tensor data (2, 20, 1).
  • tensor data (2,15,1) and tensor data (2,20,1) can get tensor data (2,35,1) after passing through the concat operator.
  • according to the foregoing equivalent rules, the output tensor data of the concat operator can be used as the input tensor data of the multiple reshape operators; specifically, the determined logical relationship equivalent to the semantics of the glue subgraph may be as shown in FIG. 8C. In this case, tensor data (2,3,5) and tensor data (2,4,5) can yield tensor data (2,7,5) after passing through the concat operator, and tensor data (2,7,5) can yield tensor data (2,35,1) after passing through the reshape operator.
  • based on the determined equivalent logical relationship, the computer device adds the above equivalent logical relationship to the glue subgraph to obtain an expanded glue subgraph; for details, please refer to FIG. 8D.
  • the computer device converts the expanded glue subgraph to obtain the state set graph.
  • the state of D can be expressed as a set of numbers of all input tensors, specifically, it can be as shown in FIG. 8E.
  • tensor data (2,3,5) is represented by number 1
  • tensor data (2,4,5) is represented by number 2
  • tensor data (2,15,1) is represented by number 3
  • tensor data (2,20,1) is represented by the number 4
  • the tensor data (2,7,5) is represented by the number 5
  • the tensor data (2,35,1) is represented by the number 6.
  • Step 1: Starting from the input, tensor data (2,3,5), numbered 1, and tensor data (2,4,5), numbered 2, constitute the numbered state set 1 of the input tensors; specifically, the numbered state set 1 can be represented as {1,2}, and the corresponding conversion diagram can be shown in Figure 8F;
  • Step 2: On the basis of step 1, the reshape operator associated with the input tensor data (2,3,5) converts the tensor corresponding to the current D to obtain the numbered state set 2; specifically, the numbered state set 2 can be expressed as {3,2}, and its corresponding conversion diagram can be shown in Figure 8G;
  • Step 3: On the basis of step 2, the reshape operator associated with the input tensor data (2,4,5) converts the tensor corresponding to the current D to obtain the numbered state set 3; specifically, the numbered state set 3 can be expressed as {1,4}, and its corresponding conversion diagram can be shown in Figure 8H;
  • Step 4: On the basis of step 3, the reshape operator associated with the input tensor data (2,4,5) converts the tensor corresponding to the current D to obtain the numbered state set 4; specifically, the numbered state set 4 can be expressed as {3,4}, and its corresponding conversion diagram can be shown in Figure 8I;
  • Step 5: On the basis of step 4, the reshape operator associated with the input tensor data (2,3,5) converts the tensor corresponding to the current D, and the numbered state {1,4} can be converted to the numbered state {3,4}; the corresponding conversion diagram can be shown in Figure 8J;
  • Step 6: On the basis of step 5, the concat operator associated with the input tensor data (2,15,1) and the input tensor data (2,20,1) converts the tensor corresponding to the current D, and the numbered state set 5 can be obtained; specifically, the numbered state set 5 can be expressed as {6}, and the corresponding conversion diagram can be shown in Figure 8K;
  • Step 7: On the basis of step 6, the concat operator associated with the input tensor data (2,3,5) and the input tensor data (2,4,5) converts the tensor corresponding to the current D, and the numbered state set 6 can be obtained; specifically, the numbered state set 6 can be expressed as {5}, and the corresponding conversion diagram can be shown in Figure 8L;
  • Step 8: On the basis of step 7, the reshape operator associated with the input tensor data (2,7,5) converts the tensor corresponding to D, and the numbered state {5} can be converted to the numbered state {6}; the corresponding conversion diagram can be shown in Figure 8M.
  • FIG. 8M is a state set diagram obtained after the computer device converts the expanded glue sub-graph. Then, in this case, the target subgraph can be determined in FIG. 8M.
  • Step A23 Traverse the state set graph to obtain the reconstruction result sub-graph set.
  • the state set graph is traversed to determine the state path between adjacent operators and the weight of the state path.
  • the weight of the state path is used to characterize the performance of the operator during execution; for example, the smaller the weight, the better the performance of the operator during execution, or, alternatively, the greater the weight, the better the performance of the operator during execution, which is not specifically limited in the embodiment of the present application.
  • when determining the weight of an operator, it is often necessary to consider the shape and scale of the input data of the operator; for ease of explanation, in the embodiments of the present application, the case where a smaller weight means better performance is taken as an example for description.
  • Figure 8M includes multiple paths from the start state to the end state.
  • any path from the start state to the end state corresponds to the structure of a reconstructed, semantically equivalent glue subgraph, and the goal is to determine the shortest path among the multiple state paths.
  • the state path between adjacent operators and the weight of the state path can be determined by traversing the state set diagram shown in FIG. 8M.
  • the state set shown in FIG. 8M includes three paths, namely path 1, path 2, and path 3.
  • the computer device determines that the sum of weights of operators on path 1 is 10, the sum of weights of operators on path 2 is 15, and the sum of weights of operators on path 3 is 17.
  • a path from the start state to the end state is used to characterize a reconstruction result subgraph.
  • the general-purpose processor can determine the target subgraph according to the weight of the state path, and optimize the neural network model according to the target subgraph to obtain an optimized neural network model.
  • S624: Determine a target subgraph from the set of reconstruction result subgraphs.
  • determining the target subgraph from the reconstruction result subgraph set includes: determining the reconstruction result subgraph with the smallest weight in the reconstruction result subgraph set as the target subgraph; or determining a reconstruction result subgraph whose weight is less than a preset threshold in the reconstruction result subgraph set as the target subgraph.
  • the computer device can select the path with the smallest weight from the multiple paths as the target subgraph. For example, the computer device determines that the sum of the weights of the operators on path 1 is 10, the sum of the weights of the operators on path 2 is 15, and the sum of the weights of the operators on path 3 is 17. In this case, the computer device determines path 1 as the target subgraph, that is, the computer device determines that path 1 is the subgraph with the best performance after reconstruction.
  • the above method of obtaining the target subgraph is similar to the Viterbi algorithm. This is only a partial list of examples, not an exhaustive one; those skilled in the art, upon understanding the essence of the technical solution of this application, may produce other deformations or transformations, for example, setting a threshold based on experience and taking a state path whose weight is less than the set threshold as the target subgraph, so that the neural network model can be optimized according to the target subgraph. However, as long as the realized functions and achieved technical effects are similar to those of this application, they shall all fall within the scope of protection of this application.
  • the computer device determines that the sum of the weights of operators on path 1 is 10, the sum of weights of operators on path 2 is 15, and the sum of weights of operators on path 3 is 17.
  • the computer equipment determines that path 1 is the target subgraph, that is, the computer equipment determines that path 1 is the reconstructed subgraph with the best performance.
  • further, the computer device replaces the original glue subgraph in the neural network model with the subgraph formed by path 1, which can realize the optimization of the neural network model and improve the overall performance of the neural network model.
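  • the selection of the minimum-weight path can be sketched as a shortest-path search over the state set graph; the following Python sketch is illustrative only, and the states, operator names, and weights are hypothetical:

      import heapq, itertools

      def min_weight_path(graph, start, end):
          # graph: {state: [(next_state, operator_name, weight), ...]}
          # returns (total_weight, [operator_name, ...]) of the minimum-weight path,
          # i.e. the reconstruction result subgraph with the best estimated performance
          counter = itertools.count()
          queue, seen = [(0, next(counter), start, [])], set()
          while queue:
              cost, _, state, ops = heapq.heappop(queue)
              if state == end:
                  return cost, ops
              if state in seen:
                  continue
              seen.add(state)
              for nxt, op, w in graph.get(state, []):
                  if nxt not in seen:
                      heapq.heappush(queue, (cost + w, next(counter), nxt, ops + [op]))
          return None

      # hypothetical state set graph: states are numbered sets of tensors
      g = {frozenset({1, 2}): [(frozenset({3, 2}), "reshape1", 4), (frozenset({5}), "concat'", 3)],
           frozenset({3, 2}): [(frozenset({3, 4}), "reshape2", 4)],
           frozenset({5}):    [(frozenset({6}), "reshape'", 5)],
           frozenset({3, 4}): [(frozenset({6}), "concat", 6)]}
      print(min_weight_path(g, frozenset({1, 2}), frozenset({6})))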
  • the general-purpose processor may call the compiled interface of the artificial intelligence learning library that has been set to compile according to the optimized calculation graph, and obtain the corresponding binary instruction.
  • the binary instructions are processed by the runtime library to generate machine learning processing tasks.
  • the general-purpose processor can put machine learning processing tasks into the task queue, and finally the driver schedules the machine learning processing tasks in the task queue to be executed by the artificial intelligence processor, and the running results are obtained.
  • the machine learning processing task refers to the neural network model acquiring learning ability to complete a certain task.
  • machine learning processing tasks may include image recognition, edge detection, semantic analysis, and so on.
  • different neural network models correspond to different machine learning processing tasks.
  • the machine learning processing tasks corresponding to the deep learning neural network model can be image classification, text classification, etc.
  • the machine learning processing tasks corresponding to the convolutional neural network model can be image recognition, video classification, etc.
  • the machine learning processing tasks corresponding to the long short-term memory network (LSTM) model can be speech recognition, picture description, natural language processing, etc.
  • the request of the machine learning processing task may be an execution instruction input by the user for the neural network model.
  • when a computer device receives a request for a machine learning processing task, it obtains the corresponding neural network model according to the type of the machine learning processing task, runs the neural network model on the artificial intelligence processor, and then obtains the running result for the machine learning processing task.
  • here, the neural network model is operated by a processor (for example, a general-purpose processor or an artificial intelligence processor).
  • the running result of the machine learning processing task refers to the result obtained when the computer device executes the machine learning processing task, which may include, but is not limited to: the accuracy of the neural network model when the machine learning processing task is executed, the running time of the neural network model when the machine learning processing task is executed, and so on.
  • the computer device may output the running result, for example, the computer device can display the running result through a display screen.
  • the original glue subgraph can be replaced with the subgraph that has better performance after reconstruction, which can improve the overall performance of the neural network model; when the artificial intelligence processor runs the optimized neural network model, redundant calculations can be reduced, which in turn can reduce the resource consumption of the computer device.
  • the computer device obtains the optimized structure corresponding to the glue subgraph by reconstructing the subgraph for the glue subgraph containing multiple glue operators, and optimizes the neural network model according to the reconstructed subgraph.
  • this implementation can improve the overall performance of the neural network model.
  • when the optimized neural network model is run on the computer device, the resource consumption of the computer device can be reduced.
  • although the steps in the flowchart of FIG. 6A are displayed in sequence as indicated by the arrows, these steps are not necessarily executed in the order indicated by the arrows. Unless explicitly stated herein, there is no strict order restriction on the execution of these steps, and these steps may be executed in other orders. Moreover, at least part of the steps in FIG. 6A may include multiple sub-steps or multiple stages; these sub-steps or stages are not necessarily executed at the same time, but may be executed at different times, and their execution order is not necessarily sequential, but may be performed in turn or alternately with at least a part of other steps or of the sub-steps or stages of other steps.
  • FIG. 9 is a schematic structural diagram of a neural network processing device provided by an embodiment of the present application.
  • the device 90 may at least include:
  • the first obtaining unit 910 is configured to obtain a calculation graph corresponding to a neural network model; wherein, the neural network model includes a plurality of operators;
  • the first determining unit 912 is configured to determine the target splitting strategy of the neural network computing task in the splitting strategy set; wherein the splitting strategy set is a set composed of the splitting methods corresponding to the target operator in the calculation graph;
  • a splitting unit 914 configured to split the neural network computing task according to the target splitting strategy to obtain multiple sub-computing tasks
  • the execution unit 916 is configured to call the multiple sub-computing tasks on the M artificial intelligence processor cores to obtain running results.
  • the device 90 may further include:
  • the second determining unit 918 is configured to determine the split mode corresponding to the target operator according to the parallelism, the splitting dimension, and the size of the splitting dimension corresponding to the target operator in the calculation graph;
  • the third determining unit 920 is configured to determine the set of splitting strategies according to the splitting mode corresponding to the target operator.
  • the third determining unit 920 is specifically configured to:
  • the intersection of the splitting modes supported by each target operator is determined as the set of splitting strategies.
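  • A minimal sketch of forming the splitting strategy set as the intersection of the splitting modes supported by each target operator is given below; representing a splitting mode as a simple label, and the example modes chosen, are assumptions made only for illustration.

```python
# Sketch: the splitting strategy set is the intersection of the splitting
# modes supported by every target operator in the calculation graph.

def build_splitting_strategy_set(supported_modes_per_operator):
    """supported_modes_per_operator: one set of splitting-mode labels per target operator."""
    if not supported_modes_per_operator:
        return set()
    strategy_set = set(supported_modes_per_operator[0])
    for modes in supported_modes_per_operator[1:]:
        strategy_set &= modes   # keep only the modes every operator supports
    return strategy_set

# Illustrative example: convolution and pooling allow splitting on any dimension,
# while a softmax operator only supports a limited subset of dimensions.
conv_modes = {"N", "C", "H", "W"}
pool_modes = {"N", "C", "H", "W"}
softmax_modes = {"N", "H", "W"}
print(build_splitting_strategy_set([conv_modes, pool_modes, softmax_modes]))
# -> {'N', 'H', 'W'} (set order may vary)
```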
  • the first determining unit 912 includes a first determining subunit and a second determining subunit; wherein,
  • the first determining subunit is configured to determine the weight value of the splitting mode corresponding to the target operator in the splitting strategy set;
  • the second determining subunit is configured to determine the target splitting strategy according to the weight value.
  • the weight value is determined according to the operation type of the target operator included in the split strategy, the data scale involved in the target operator, and the hardware parameters of the multi-core processor.
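  • The weight value of a splitting mode can be estimated from the computing load and memory-access volume of each sub-operator together with the hardware parameters of the multi-core processor, following the estimate of formula (1) later in the description, t = max_{i=1,...,n}(max(c_i/α, d_i/β)). The sketch below only illustrates that estimate; the argument names and the numbers are assumptions.

```python
# Sketch of the weight estimate for one splitting mode:
#   t = max_{i=1,...,n}( max(c_i / alpha, d_i / beta) )
# c_i: computing load of the i-th sub-operator
# d_i: memory-access volume of the i-th sub-operator
# alpha: per-core compute throughput; beta = B / n, with B the total bandwidth.

def splitting_mode_weight(loads, mem_volumes, alpha, total_bandwidth):
    n = len(loads)
    beta = total_bandwidth / n
    return max(max(c / alpha, d / beta) for c, d in zip(loads, mem_volumes))

# Illustrative numbers only: 4 sub-operators after splitting.
t = splitting_mode_weight(loads=[1.0e9, 1.0e9, 0.9e9, 1.1e9],
                          mem_volumes=[2.0e8, 2.0e8, 1.8e8, 2.2e8],
                          alpha=0.5e12, total_bandwidth=1.0e11)
```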
  • the device 90 may further include:
  • the second obtaining unit 922 is configured to obtain the operation type of the target operator
  • the fourth determining unit 924 is configured to determine the split mode of the target operator according to the operation type of the target operator.
  • FIG. 10 is a schematic structural diagram of a neural network optimization apparatus provided by an embodiment of the present application.
  • the apparatus 1000 may at least include:
  • the extracting unit 1010 is used to extract a glue subgraph from the calculation graph corresponding to the neural network model; wherein the glue subgraph is a subgraph containing glue operators, and the glue operators are used to adjust the tensor data of the calculation graph;
  • the processing unit 1012 is configured to process the glue subgraph in the calculation graph, while ensuring that the input tensor data and the output tensor data of the glue subgraph remain unchanged, to obtain a reconstruction result subgraph set; wherein the input tensor data and output tensor data of any reconstruction result subgraph in the reconstruction result subgraph set are respectively the same as the input tensor data and output tensor data of the glue subgraph;
  • the determining unit 1014 is configured to determine a target sub-picture from the set of reconstruction result sub-pictures
  • the optimization unit 1016 is configured to replace the target sub-graph with the corresponding glue sub-graph in the calculation graph to obtain an optimized calculation graph
  • the execution unit 1018 is configured to obtain corresponding binary instructions according to the optimized calculation graph, and assign them to the corresponding artificial intelligence processor to execute tasks.
  • the processing unit 1012 includes an expansion unit, a conversion unit, and a traversal unit; wherein,
  • the expansion unit is used to expand the glue sub-graph according to the logical relationship of the glue operator to obtain an expanded glue sub-graph; the conversion unit is used to convert the expanded glue sub-graph, Obtain a state set diagram of the tensor data associated with the glue operator; the traversal unit is used to traverse the state set diagram to obtain the reconstruction result sub-graph set.
  • the expansion unit includes: a first expansion unit and a second expansion unit; wherein,
  • the first expansion unit is used to expand the logical relationship between the glue operators in the glue subgraph according to the equivalence rules, to obtain logical relationships equivalent to the semantics of the glue subgraph; the second expansion unit is used to expand the glue subgraph according to a logical relationship equivalent to the semantics of the glue subgraph, to obtain the expanded glue subgraph.
  • the equivalent rule includes at least one of the equivalent rule of the reshape operator, the equivalent rule of the transpose operator, the equivalent rule of the concat operator, and the equivalent rule of the split operator.
  • the first expansion unit is specifically configured to: transform the operator sequence corresponding to the logical relationship, and ensure, according to the equivalence rules, that all logical relationships equivalent to the semantics of the glue subgraph are obtained.
  • the conversion unit is specifically configured to: determine the type of the glue operator in the expanded glue subgraph and the logical relationship between the glue operators; based on the expansion The type of glue operator in the glue subgraph and the logical relationship between the glue operators, and the corresponding output tensor is determined according to the input tensor data corresponding to the glue operator in the expanded glue subgraph Data; the state set diagram of the tensor data associated with the glue operator is determined according to the input tensor data and the output tensor data of the glue operator in the expanded glue subgraph.
  • the determining unit is specifically configured to: determine, as the target subgraph, the reconstruction result subgraph whose weight sum is the smallest in the reconstruction result subgraph set; or determine, as the target subgraph, a reconstruction result subgraph whose weight sum is less than a preset threshold in the reconstruction result subgraph set.
  • the above device embodiments are only illustrative, and the device of the present disclosure may also be implemented in other ways.
  • the division of units/modules in the above-mentioned embodiments is only a logical function division, and there may be other division methods in actual implementation.
  • multiple units, modules or components may be combined or integrated into another system, or some features may be omitted or not implemented.
  • the units or modules described as separate components may or may not be physically separate.
  • a component described as a unit or a module may be a physical unit or not a physical unit, that is, it may be located in one device, or may also be distributed on multiple devices.
  • the solutions of the embodiments of the present disclosure can be implemented by selecting some or all of the units according to actual needs.
  • the embodiment of the present application also provides a computer storage medium for storing the computer software instructions used by the computer device shown in FIG. 2, which includes a program for executing the above method. By executing the stored program, neural network model processing can be realized to make full use of multi-core processing resources.
  • the neural network processing method, device, computer equipment and storage medium provided by the embodiments of the present application divide the neural network computing task into several smaller sub-computing tasks, so that the multi-core processor can directly call the computing library under the single-core architecture, making full use of the hardware resources of the multi-core processor and thereby avoiding the extra workload of re-implementation.
  • this application can be provided as methods, systems, or computer program products. Therefore, this application may adopt the form of a complete hardware embodiment, a complete software embodiment, or an embodiment combining software and hardware. Moreover, this application may adopt the form of a computer program product implemented on one or more computer-usable storage media (including but not limited to disk storage, optical storage, etc.) containing computer-usable program codes.
  • These computer program instructions can also be stored in a computer-readable memory that can guide a computer or other programmable data processing equipment to work in a specific manner, so that the instructions stored in the computer-readable memory produce an article of manufacture including the instruction device.
  • the device implements the functions specified in one process or multiple processes in the flowchart and/or one block or multiple blocks in the block diagram.
  • these computer program instructions can also be loaded onto a computer or other programmable data processing equipment, so that a series of operation steps are executed on the computer or other programmable equipment to produce computer-implemented processing; the instructions executed on the computer or other programmable equipment thus provide steps for implementing the functions specified in one or more processes in the flowchart and/or one or more blocks in the block diagram.
  • Clause A1: A neural network processing method, characterized in that the method is applied to an artificial intelligence processor, the artificial intelligence processor includes M artificial intelligence processor cores, and M is a positive integer greater than 1;
  • the method includes:
  • obtaining a calculation graph corresponding to a neural network model; wherein the neural network model includes a plurality of operators;
  • determining the target splitting strategy of the neural network computing task in a splitting strategy set; wherein the splitting strategy set is a set of splitting methods corresponding to the target operator in the calculation graph;
  • splitting the neural network computing task according to the target splitting strategy to obtain multiple sub-computing tasks;
  • allocating the sub-computing tasks to the corresponding artificial intelligence processor cores in the artificial intelligence processor for processing.
  • the method further includes:
  • determining the splitting mode corresponding to the target operator according to the parallelism, the splitting dimension, and the size of the splitting dimension corresponding to the target operator in the calculation graph;
  • the set of splitting strategies is determined according to the splitting mode corresponding to the target operator.
  • the determining the splitting strategy set according to the splitting mode corresponding to the target operator includes:
  • the intersection of the splitting modes supported by each target operator is determined as the set of splitting strategies.
  • the determining the target splitting strategy of the neural network computing task in the splitting strategy set includes:
  • respectively determining the weight values of the splitting modes corresponding to the target operator in the splitting strategy set;
  • the target splitting strategy is determined according to the weight values.
  • the weight value is determined according to the operation type of the target operator included in the split strategy, the data scale involved in the target operator, and the hardware parameters of the multi-core processor.
  • A6 The method according to any one of A1-A4, the method further comprising:
  • obtaining the operation type of the target operator;
  • the splitting mode of the target operator is determined according to the operation type of the target operator.
  • A8 The method according to A2, wherein the degree of parallelism corresponding to the target operator includes a first degree of parallelism and a second degree of parallelism; wherein the product of the first degree of parallelism and the second degree of parallelism is less than or equal to the number of artificial intelligence processor cores in the artificial intelligence processor.
  • a neural network processing device characterized in that the device is applied to an artificial intelligence processor, the artificial intelligence processor includes M artificial intelligence processor cores, and M is a positive integer greater than 1; the device includes :
  • the first obtaining unit is configured to obtain a calculation graph corresponding to a neural network model; wherein, the neural network model includes a plurality of operators;
  • the first determining unit is configured to determine the target splitting strategy of the neural network computing task in the splitting strategy set; wherein the splitting strategy set is a set composed of the splitting methods corresponding to the target operator in the calculation graph;
  • a splitting unit configured to split the neural network computing task according to the target splitting strategy to obtain multiple sub-computing tasks
  • the execution unit is configured to allocate the sub-computing tasks to the corresponding artificial intelligence processor cores in the artificial intelligence processor for processing.
  • the second determining unit is configured to determine the split mode corresponding to the target operator according to the parallelism, the split dimension, and the size of the split dimension corresponding to the target operator in the calculation graph;
  • the third determining unit is configured to determine the set of splitting strategies according to the splitting mode corresponding to the target operator.
  • the intersection of the splitting modes supported by each target operator is determined as the set of splitting strategies.
  • the first determining unit includes a first determining subunit and a second determining subunit; wherein,
  • the first determining subunit is configured to determine the weight value of the splitting mode corresponding to the target operator in the splitting strategy set;
  • the second determining subunit is configured to determine the target splitting strategy according to the weight value.
  • B6 The device according to any one of B1-B4, the device further comprising:
  • the second obtaining unit is used to obtain the operation type of the target operator
  • the fourth determining unit is configured to determine the split mode of the target operator according to the operation type of the target operator.
  • a computer device including a processor and a memory, the processor and the memory are connected to each other, wherein the processor includes a general-purpose processor and an artificial intelligence processor, the memory is used to store a computer program, the computer The program includes program instructions, and the processor is configured to call the program instructions to execute the method according to any one of claims A1-A8.
  • a computer-readable storage medium storing a computer program, wherein the computer program includes program instructions that, when executed by a processor, cause the processor to execute the method according to any one of A1-A8.


Abstract

An embodiment of the present application discloses a neural network processing method, device, computer equipment and storage medium. By splitting an operator into multiple smaller operators, a multi-core processor can directly call the computing library under the single-core architecture, making full use of the hardware resources of the multi-core processor.

Description

神经网络处理方法、装置、计算机设备及存储介质 技术领域
本发明涉及信息处理技术领域,尤其涉及一种神经网络处理方法、装置、计算机设备及存储介质。
背景技术
随着人工智能技术的快速发展,基于内存共享模型的多核处理器已经成为了当前处理器的主流架构,这种多核架构和每个核内的向量处理能力同样可以应用到神经网络计算中。在实际应用中,通常可以采用数据并行的方式来充分利用多核处理器架构所带来的额外硬件资源,即令每个处理器核分别同时执行不同数据在同一个神经网络模型上的计算。然而,多核处理器结构并不能使用这种并行方法来处理推理场景下的小批量且要求低时延的神经网络计算任务。那么,如何保证数据并行与神经网络模型并行相统一,以充分利用多核处理器的硬件资源是亟需解决的技术问题。
发明内容
本发明实施例提供一种神经网络处理方法、装置、计算机设备及存储介质,通过将神经网络计算任务拆分成若干个规模更小的子计算任务,这样多核处理器可以直接调用单核架构下的计算库,充分利用了多核处理器的硬件资源,从而可以避免重现实现的额外工作量。
第一方面,本发明实施例提供了一种神经网络处理方法,所述方法应用于人工智能处理器,所述人工智能处理器包括M个人工智能处理器核,M为大于1的正整数;所述方法包括:
获取神经网络模型对应的计算图;其中,所述神经网络模型包含多个算子;
在拆分策略集合中确定所述神经网络计算任务的目标拆分策略;其中,所述拆分策略集合为所述计算图中目标算子对应的拆分方式组成的集合;
根据所述目标拆分策略对所述神经网络计算任务进行拆分,得到多个子计算任务;
将所述子计算任务分配到人工智能处理器中的对应人工智能处理器核上进行处理。
第二方面,本发明实施例提供了一种神经网络处理装置,该装置包括用于执行上述第一方面的方法的单元。具体地,该装置应用于人工智能处理器,所述人工智能处理器包括M个人工智能处理器核,M为大于1的正整数;所述装置包括:
第一获取单元,用于获取神经网络模型对应的计算图;其中,所述神经网络模型包含多个算子;
第一确定单元,用于在拆分策略集合中确定所述神经网络计算任务的目标拆分策略;其中,所述拆分策略集合为所述计算图中目标算子对应的拆分方式组成的集合;
拆分单元,用于根据所述目标拆分策略对所述神经网络计算任务进行拆分,得到多个子计算任务;
执行单元,用于将所述子计算任务分配到人工智能处理器中的对应人工智能处理器核上进行处理。
第三方面,本申请实施例提供了一种芯片,所述芯片包括第二方面提供的神经网络模型处理装置。
第四方面,本申请实施例提供了一种计算机设备,所述计算机设备包括第三方面提供的芯片或第二方面提供的神经网络模型处理装置。
第五方面,本申请实施例提供了一种计算机设备,包括处理器和存储器,所述处理器 和存储器相互连接,其中,所述处理器包括通用处理器和人工智能处理器,所述存储器用于存储支持计算机设备执行上述方法的计算机程序,所述计算机程序包括程序指令,所述处理器被配置用于调用所述程序指令,执行上述第一方面的方法。
第六方面,本申请实施例提供了一种计算机可读存储介质,所述计算机存储介质存储有计算机程序,所述计算机程序包括程序指令,所述程序指令当被处理器执行时使所述处理器执行上述第一方面的方法。
第七方面,本申请实施例提供了一种计算机程序产品,其中,上述计算机程序产品包括存储了计算机程序的非瞬时性计算机可读存储介质,上述计算机程序可操作来使计算机执行如本申请实施例第一方面所述的方法中所描述的部分或全部步骤。该计算机程序产品可以为一个软件安装包。
在本申请实施例中,通过将神经网络计算任务拆分成若干个规模更小的子计算任务,这样多核处理器可以直接调用单核架构下的计算库,充分利用了多核处理器的硬件资源,从而可以避免重现实现的额外工作量。
附图说明
为了更清楚地说明本申请实施例中的技术方案,下面将对实施例描述中所需要使用的附图作简单地介绍,显而易见地,下面描述中的附图是本申请的一些实施例,对于本领域普通技术人员来讲,在不付出创造性劳动的前提下,还可以根据这些附图获得其他的附图。
图1A是本申请实施例提供的一种多核处理器的结构示意图;
图1B是本申请实施例提供的一种reshape算子语义的示意图;
图1C是本申请实施例提供的一种transpose算子语义的示意图;
图1D是本申请实施例提供的一种concat算子语义的示意图;
图1E是本申请实施例提供的一种split算子语义的示意图;
图1F是本申请实施例提供的一种张量数据连续存储的示意图;
图1G是本申请实施例提供的一种保证操作的等价性的示意图;
图1H是本申请实施例提供的一种含stride的内存分布的示意图;
图1I是本申请实施例提供的一种人工智能处理器的软件栈的结构示意图;
图2是本申请实施例提供的一种计算机设备的结构示意图;
图3A是本申请实施例提供的一种神经网络处理方法的流程示意图;
图3B是本申请实施例提供的一种人脸识别神经网络模型的结构示意图;
图3C是本申请实施例提供的一种车牌字符识别的神经网络模型的结构示意图;
图4是本申请实施例提供的一种神经网络卷积算子的计算图;
图5A为按照输入数据的N维度进行拆分得到的示意图;
图5B为按照输出数据的C维度进行拆分的示意图;
图5C为按照输入数据C维度进行拆分得到的示意图;
图5D为按照输入数据的H维度进行拆分得到的示意图;
图5E为按照输入数据的W维度进行拆分得到的示意图;
图6A是本申请实施例提供的一种神经网络优化方法的流程示意图;
图6B是本申请实施例提供的一种在原始计算图中提取的胶水算子的结构示意图;
图7A-图7P是本申请实施例提供的神经网络模型的优化示意图;
图8A是本申请实施例提供的一种第一计算图的结构示意图;
图8B是本申请实施例提供的一种胶水子图的结构示意图;
图8C是本申请实施例提供的一种优化后的等效优化序列的结构示意图;
图8D是本申请实施例提供的一种扩充后的第一计算图的结构示意图;
图8E是本申请实施例提供的一种状态集合图;
图8F-图8M是本申请实施例提供的状态转换示意图;
图9是本申请实施例提供的一种神经网络处理装置的结构示意图;
图10是本申请实施例提供的一种神经网络优化装置的结构示意图。
具体实施方式
下面将结合本发明实施例中的附图,对本发明实施例中的技术方案进行描述。
应当理解,本披露的说明书和权利要求书中使用的术语“包括”和“包含”指示所描述特征、整体、步骤、操作、元素和/或组件的存在,但并不排除一个或多个其它特征、整体、步骤、操作、元素、组件和/或其集合的存在或添加。
还应当理解,在此本披露说明书中所使用的术语仅仅是出于描述特定实施例的目的,而并不意在限定本披露。如在本披露说明书和权利要求书中所使用的那样,除非上下文清楚地指明其它情况,否则单数形式的“一”、“一个”及“该”意在包括复数形式。还应当进一步理解,在本披露说明书和权利要求书中使用的术语“和/或”是指相关联列出的项中的一个或多个的任何组合以及所有可能组合,并且包括这些组合。
如在本说明书和权利要求书中所使用的那样,术语“如果”可以依据上下文被解释为“当...时”或“一旦”或“响应于确定”或“响应于检测到”。类似地,短语“如果确定”或“如果检测到[所描述条件或事件]”可以依据上下文被解释为意指“一旦确定”或“响应于确定”或“一旦检测到[所描述条件或事件]”或“响应于检测到[所描述条件或事件]”。
为了便于更好的理解本申请所描述的技术方案,下面先解释本申请实施例所涉及的技术术语:
(1)数据并行
具体来说,所谓数据并行是指把数据划分成若干块分别映像到不同的处理器上,每一个处理器运行同样的处理程序对所分派的数据进行处理。现有中,大部分并行处理均采用这种处理方式,尤其是对于计算复杂性很高的问题,如流体力学计算、图象处理等。
在本申请实施例中,数据并行可以应用于大规模的神经网络并行训练中。具体来说,数据并行的核心是使用多个处理器同时进行对于同一个神经网络模型的训练。在训练的每一轮迭代中,每个处理器从数据集中获取本轮迭代使用的数据,在每个处理器上完成一轮整个网络的推理及训练计算,并返回本轮计算得到的梯度数据来进行模型的更新。维护权值的服务器在收到所有处理器的梯度之后,使用这些梯度进行模型数据的更新。显然,由于多个处理器会并行地执行训练任务,其等价于在每轮迭代中一个更大批量的数据能够被处理,也就加快了系统完成这个训练任务所需要的时间。所以,数据并行的关键在于每一轮迭代中待处理数据的批量的大小,批量越大,尽可能划分到越多的处理器来并行处理。
(2)模型并行
在本申请实施例中,模型并行是数据并行之外的另一种神经网络并行计算方式。简单来说,模型并行是通过划分神经网络模型参数的方式把计算负载分配到不同的处理器上。
模型并行和数据并行的最大区别在于:模型并行度是在编译时期静态确定,一旦操作编译完成之后就不可更改,称为模型的固有属性;而数据并行是在运行时期动态指定,同样的模型可以指定不同的数据并行度。此外,受限于硬件的运算核心数和DDR访存带宽,两种并行技术在人工智能处理器上的应用场景和使用定位略有差别:数据并行编程更倾向于获得极致的吞吐率;而模型并行编程更倾向于获得极致的低延时。
(3)多核处理器
当前多核处理器采用的最普遍的结构是基于存储共享的多核结构,如图1A所示,处理器中包含了多个计算核,每个计算核上有独立的缓存,寄存器堆,计算单元以及指令控 制单元,所有的计算核共享同一全局存储。
现有中,单个核已经足够完成任何复杂逻辑的计算任务,但其性能受限于摩尔定律和芯片工艺。为了进一步提升处理器的性能,多个计算核被引入处理器中,它们可以被用于处理那些有着较高并行度的计算任务。
在实际应用中,共享存储多核结构是一种经典的多核结构,并且非常适合数据并行的神经网络训练方法。每个核可以作为数据并行中的一个处理器,分别读取不同的数据,然后并行完成网络模型的正反向计算。每个核在计算阶段仍能够保持其在之前单核架构下良好的性能功耗比,与此同时,整个系统的吞吐量也可以随着核数的扩展而增加。
(4)算子拆分
在本申请实施例中,我们采用算子拆分的方式来实现计算任务的拆分来达到模型并行,即把单个算子拆分成多个可以并行执行的子算子。需要说明的是,这里,拆分前的原始算子和拆分后的若干个子算子都是人工智能处理器所支持的算子,原始的张量数据随着算子的拆分也被拆分成若干个新的子张量数据。反映到计算图上,则是把原来的包含单个算子的计算图细化成了一张包含更多可并行执行的算子的计算图。通过这一实现方式,可以实现类似于模型并行的算子内任务拆分,同时又保证了拆分后的每个子算子都可以复用单核架构下算子的指令实现来进行计算,避免了对原有算子的指令实现的重构。
在本申请实施例中,算子拆分不完全局限于对模型参数的拆分,也会采用数据并行的方式对数据进行拆分,这种方法实际上模糊了模型并行和数据并行的界限。以卷积算子为例,如果把卷积算子的输入数据和权值作为计算图中等同低位的张量数据,那么,数据并行时基于对输入数据的划分来分割计算,而模型并行时基于权值的划分来分割计算,这二者都是通过划分卷积算子相关联的张量数据来实现对计算负载的划分。从这个角度来说,数据并行和模型并行是统一的。
(5)张量(tensor)
在本技术方案中,张量仅仅是对存储的一块数据的特征描述,张量记录了数据的形状、类型等信息。
本申请实施例中,张量应该理解为张量数据,可以包括神经网络模型中输入张量数据、输出张量数据,也可以包括特征张量数据等。
以人工智能深度学习框架TensorFlow为例,一般使用阶(rank),形状(shape)和维数(dimension number)来描述张量的维度,其关系可以表示为如表1所示:
表1
Figure PCTCN2020116933-appb-000001
如表1所示,张量A=4,其表示一个数。张量A=[6,2],其表示二维矩阵,具体地,该矩阵为6行2列的矩阵。
(6)算子的划分
现有中,算法设计者采用算子作为基本单位,辅以与算子关联的张量数据来搭建描述神经网络算法的计算图。在本申请实施例中,按照算子的语义进行划分,可以把目前深度学习中的算子分为两类。下面对其进行详细阐述。
第一类算子负责从输入特征中获取输出特征,他们有着各自特定的计算任务,会对输 入数据进行乘法、加法、非线性计算、比较挑选以及其他的数学运算。例如,卷积算子使用卷积核对输入特征图像的局部区域进行卷积计算,通过对输入特征图像里的数据的线性计算得到输出特征;又例如,全连接算子使用矩阵乘法的方式对输入的所有特征进行线性组合;又例如,池化算子对输入数据进行采样得到输出数据,等等。
另一类算子的语义中并不涉及任何计算逻辑,其输入数据和输出数据不管是数值的数量,亦或是数值本身都没有发生任何变化,这类算子通常是用来对神经网络模型的计算图中的张量数据的格式、形状以及内存中的排布进行调整,为的是把神经网络模型上游计算得到的张量数据调整成对下游的计算更好和方便的形式,起到了“粘合”神经网络上下文计算的部分。具体地,这一类算子被称为“胶水”算子。那么,相应地,计算图中由“胶水”算子构成的部分称为“胶水”子图。
(7)“胶水”算子
在本申请实施例中,“胶水”算子有4种,包括reshape算子、transpose算子、concat算子、split算子。接下来对其一一进行介绍:
A、reshape算子
在本申请实施例中,reshape算子,也即,张量重塑算子,是指对张量的形状进行重新诠释。在实际应用中,reshape算子可以用于对张量数据的形状进行调整。具体地,reshape算子可以表示为:tf.reshape(tensor,shape,name=None),用于将tensor变换为参数shape的形式。
在一种情形中,参数shape=[-1],表示将tensor展开成一个列表。
在一种情形中,参数shape=[a,b,c,...,n],其中,a,b,c,...n均大于0的正整数,表示将tensor变换为多维矩阵。在一种情形中,参数shape=[a,-1,c,...,n],这里,b=-1,a,c,...,n均为大于0的正整数,表示tf根据tensor的原尺寸,自动计算b的值。
以张量A=[3,2,4]为例,当对张量A执行reshape1算子操作之后,得到张量B,其中,张量B=[2,6,2]。具体地,可以参见如图1B所示的reshape算子语义的示意图。
B、transpose算子
在本申请实施例中,transpose算子,也即,张量转置算子,是指对张量进行转置。在实际应用中,transpose算子可以用于调整张量数据的维度顺序。具体地,transpose算子可以表示为:tf.transpose(a,perm=None,name=’transpose’),用于按照perm参数调换tensor的顺序。这里,perm参数为自然数列[1,2,3,...,n]的一个全排列,不同的全排列表示不同的transpose算子。
一般情况下,多维张量有多个维度且彼此之间存在先后顺序,transpose算子可以改变维度的先后顺序。此外,需要说明的是,在一些场景下,transpose算子又被称为permute算子。以张量A=[3,2,4]为例,当对张量A执行transpose算子操作之后,得到张量B,其中,张量B=[4,2,3]。具体地,可以参见如图1C所示的transpose算子语义的示意图。
C、concat算子
在本申请实施例中,concat算子,也即,拼接算子,用于将多个张量数据沿着指定的维度拼接成一个张量。除了在指定维度外,输入张量的其他维度应该保持一致。通过concat算子,神经网络将代表来自上游不同位置的特征的多个张量拼接成一个,从而可以在下游计算中对这些特征共同进行处理。具体地,可以参见图1D所示的concat算子语义的示意图。
D、split算子
在本申请实施例中,split算子,也即拆分算子,用于将一个张量在指定维度上拆分成多个张量。拆分后的多个张量除了指定维度之外,在其他维度上保持一致。通过split算子,可以把属于同一张量数据的特征拆成多份,从而在后续计算中分别进行针对性处理。具体 地,可以参见图1E所示的split算子语义的示意图。
总的来说,在本申请实施例中,胶水算子用于对神经网络模型中的张量数据的格式、张量数据的形状和张量数据在内存中的排布中的至少一种进行调整。
需要说明的是,在本申请实施例中,胶水算子可以包括但不限于上述4种不同类型的算子,还可以包括其他算子,本申请实实施例不作具体限定。
(8)张量数据在存储中的数据排布
神经网络计算中使用多维张量作为算子间数据传递的基本单位。一般情况下,数据以连续存储的方式在内存中。例如,如图1F所示,数据存储在I0-I15间连续的16个比特位中。
在本申请实施例中,存储数据的顺序与张量由外到内把所有维度一次展开到的一维数据中元素的顺序相同,访问张量中数据根据元素在不同维度的坐标以及维度本身来决定。例如,形状为(D0,D1,D2)的张量,存储在大小为D0×D1×D2的连续内存中,要访问张量中坐标(n0,n1,n2)的数据,可以基于数据在内存中的起始地址和通过计算得到的数据偏移(n0×D1+n1)×D2+n2来确定数据在内存中的地址。
可以理解的是,使用这种紧密连续的存储方式来存储多维张量数据非常直观且方便,元素坐标和其在内存中的偏移的换算也非常简洁。现有中,深度学习框架,例如,以Caffe、MXNet为例,都是使用这种方式来管理神经网络模型中张量数据的内存管理,并在此基础上实现卷积、池化等各种算子在通用处理器、人工智能处理器(例如,GPU)上的核函数。然而,这种内存排布对性能来说却远远不是最优的。为了满足硬件设计、提高性能,硬件厂商设计了不同的数据在内存中的排布,这些与众不同的排布是导致“胶水”子图在神经网络处理上出现性能浪费的主要原因。
(9)维度顺序
以卷积神经网络为例(具体地,该卷积神经网络用语图像分类或物体检测),神经网络模型的计算图中的张量数据一般有4个维度,分别是表示当前计算所处理的数据的批量大小的N,表示特征图像数量的C,表示特征图像尺寸的H和W。
在本申请实施例中,张量数据的维度顺序可以为NCHW,即N是求解偏移过程中最外侧的维度,而W是最内侧维度。例如,Caffe中默认张量数据使用该维度顺序;MXNet以及TensorFlow可以支持该维度顺序。坐标为(n,c,h,w)的元素在存储中的偏移为((n×C+c)×H+h)×W+w。
在本申请实施例中,张量数据的维度顺序还可以为NHWC(这里,C是最内侧维度),相应的坐标向偏移的换算方法是((n×H+h)×W+w)×C+c。在实际应用中,NHWC相比于NCHW更加接近BMP(全称:Bitmap)的图片数据存储格式,BMP格式的文件中按照一个个像素点来存储数据,每个像素点存储了所有通道的颜色值,这使得在读取输入图像时不需要进行额外的维度转换。此外,从神经网络模型中最常见的卷积算子的最直接的计算逻辑来看,C维度相比H和W维度更加易于使用向量计算指令来做并行化。例如,当卷积核为1×1时,计算输出张量中的一个值只需要输入张量沿着C维度的一组数据,这使得把C维度放在最内侧维度可以更好地利用数据的局部性,并且还可以直接使用优化程度高的矩阵乘法来代替1×1的卷积计算。
在本申请实施例中,张量数据的维度顺序也可以为CHWN(这里,N为最内侧维度),相应的坐标向偏移的换算方式是((c×H+h)×W+w)×N+n。例如,Nervana开发的neon使用该维度顺序的张量进行卷积核池化计算。显然,在具有合适的批量大小的情况下,把N维度放在最内侧是最直观的并行方式,其思想和分布式训练中的数据并行一致。
从人工智能处理器的角度来说,为了最大化性能上的收益,也会结合自身的微结构设计选择最合适的维度顺序来存储张量数据。
在实际应用中,算法设计者往往假定了原始的张量数据在内存中排序时采用了NCHW的维度顺序。例如,一个由transpose和reshape构成的算子序列实现了(N,C,H,W)→(N,H,W,C)→(N,C×W,1,1)的变化过程,其本意是将C,H,W维度上的数据合并到一个维度中,并且保证原始的C维度能够处于合并的维度的最内侧。
在本申请实施例中,对采用了NCHW之外的维度顺序来存储张量数据的人工智能处理器,维度的不同不会导致计算结果的错误,但是会对性能造成影响。当人工智能处理器采用了不同的维度顺序时,只要保证每个算子在执行过程中在实际的维度顺序上实现了与抽象语义意义对等的操作,就可以保证最终结果的正确性。例如,如图1G所示,张量数据在存储中实际采用了NCWH的数据排布,而神经网络模型的定义是基于NCHW给出的。在这种情况下,为了保证每个操作的等价性,实际执行过程中每个算子的结果应该是在输入数据的基础上先经过变换
Figure PCTCN2020116933-appb-000002
变回定义阶段假定的维度顺序,完成指定算子的操作,再通过
Figure PCTCN2020116933-appb-000003
的反变换得到与实际维度顺序NCWH对应的正确的输出张量的排布。因为假定的顺序是NCHW,而实际使用的张量数据的排布顺序是NCWH,所以变换
Figure PCTCN2020116933-appb-000004
和反变换
Figure PCTCN2020116933-appb-000005
都是参数为(0,1,3,2)的transpose操作。在具体实现中,transpose算子可以把内部的多个transpose过程进行合并,但reshape算子在实现中则多出了一个transpose过程,这种情况是算法设计者在设计算法之初不可能想到的,但又是保证实现和抽象语义的一致性所必需的。因此,在算法设计者缺乏对底层维度顺序了解的前提下,在人工智能处理器上照搬原始的计算图结构会对性能造成影响。
(10)步幅(stride)
如前所述,一般情况下,张量数据是按照连续紧密的方式存储在内存中,但人工智能处理器则可能采取了非连续的数据存储方式。
在本申请实施例中,非连续的存储方式是指:张量数据半身的数学维度大大小小用于计算存储中的偏移的实际维度的大小,其中,计算偏移使用的实际维度被称为stride。例如,如图1H所示,二维张量中的W维度,也是内侧维度本身为4,但实际存储中是按照6来布局的,相应地,当跨W读取同一H维度上的数据时,需要跳过6个数值而不是4个数值。更一般地,用stride_n、stride_c、stride_h和stride_w分别表示沿着N、C、H、W四个维度读取下一个数值需要跳过的偏移量,对于给定元素在张量中的坐标(n,c,h,w),该元素在存储中基于起始地址的偏移为n×stride_n+c×stride_c+h×stride_h+w×stride_w。张量在连续紧密排布下的各种布局NCHW、NHWC、CHWN等可以看作是stride的特殊形式。比如,NCHW的连续布局可以当做是stride布局下stride_n=C×H×W,stride_c=H×W,stride_h=W,stride_w=1。
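A short sketch of the offset computations described above (contiguous layout and stride-based layout) is given below; it only illustrates the formulas in this paragraph, and the variable names are chosen for illustration.

```python
# Sketch: memory offset of element (n, c, h, w).
# Contiguous NCHW layout:      offset = ((n*C + c)*H + h)*W + w
# General stride-based layout: offset = n*stride_n + c*stride_c + h*stride_h + w*stride_w

def offset_nchw_contiguous(n, c, h, w, C, H, W):
    return ((n * C + c) * H + h) * W + w

def offset_with_stride(n, c, h, w, stride_n, stride_c, stride_h, stride_w):
    return n * stride_n + c * stride_c + h * stride_h + w * stride_w

# The contiguous NCHW layout is the special case
# stride_n = C*H*W, stride_c = H*W, stride_h = W, stride_w = 1.
C, H, W = 4, 8, 6
assert offset_nchw_contiguous(1, 2, 3, 4, C, H, W) == \
       offset_with_stride(1, 2, 3, 4, C * H * W, H * W, W, 1)
```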
对人工智能处理器来说,在数据布局中采用stride往往处于数据对齐和访存位宽的考量。把向量计算用于神经网络模型中会遇到的对齐和取整的问题,比如硬件沿着C维度对卷积进行并行计算,向量计算指令以及长位宽寄存器允许一次处理64个浮点数的乘加,相应的就可以一次从存储中读取C维度宽度为64的数据进行计算。但神经网络模型中总是存在在C维度上不是64的整数倍的张量数据和算子。为了处理最尾部的余留部分,就需要单独实现访存和计算指令,这使得指令在设计上十分繁琐。更进一步来说,存储单元可能本身存在访存对齐的限制,即每次访存的起始地址必须是某一常数的倍数,这进一步加大了指令实现的难度。为了避免这种情况,一种更简单的方法是把张量数据的维度直接向上对齐到最接近的整倍数上,补充的部分用0填充。对包括卷积、池化、全连接算子在内的绝大部分算子而言,补充的0即便参与了计算也对最后的计算结果没有任何影响。通过补0 使得相应的维度的stride变成了计算及访存位宽的整倍数,因而避免了单独处理尾部数据的麻烦。
在实际应用中,对连续存储的张量数据来说,reshape是一个零开销的操作,只需要修改该张量的形状信息即可,但是当涉及的维度里涉及到了stride对齐的维度,reshape算子所引入的开销就不能被忽视。例如,假设将图1G中的张量的两个维度合并成一个,就需要重新调整绝大部分元素的存储位置,消除W维度最后的两个0。
(11)数据分段或维度分段(Blocking)
具体来说,向量寄存器和单指令多数据流SIMD(Single Instruction Multiple Data,SIMD)可以用来沿某一维度(通常是C)维度对卷积进行并行计算,但其一次能处理的数据位宽是有限的,为了能够保证寄存器内的中间结果可以被尽可能充分利用,输入张量把C维度进一步拆分,依照通用处理器能够处理的数据位宽分成一个个子段,并在内存中连续存储,提高了缓存的利用率。假设人工智能处理器的SIMD指令可以一次完成8个浮点计算,那么N,C,H,W的布局经过分段后会被调整为N,C/8,H,W,8。这种分段思路同样也适用于一些人工智能处理器的计算优化,区别在于后者可以一次处理更宽的向量数据,而分段的方法也能保证计算阶段访存的连续性,这有利于提高访存的效率。
在实际应用中,对采用了分段数据布局的人工智能处理器来说,涉及分段维度的数据布局调整需要考虑分段的影响,相对于前面提及的维度顺序和stride来说,针对分段布局所能使用的性能改进手段较少,但一些特殊情况下不同的神经网络计算图结构还是会对性能有一定的影响。
总的来说,存在各种各样的原因使人工智能处理器选择符合自身特点的存储数据排布方式,而算法设计者又很难知晓这些隐藏在底层中的细节,因此,在人工智能处理器上照搬原有的计算图结构就有可能会造成性能的浪费,而合理调整“胶水”子图(该“胶水”子图由“胶水”算子构成)的结构则可以避免大量的不必要的访存开销,优化整个神经网络模型的执行性能。
在本申请接下来的实施例中,将具体描述对包含多个胶水算子的“胶水”子图,如何进行子图重构来获取胶水子图对应的优化结构,并根据重构后的子图对神经网络模型进行优化,以提高神经网络模型的整体性能。这里,重构子图是指:在保证“胶水”子图的输入张量数据和输出张量数据不变,以及“胶水”子图整体所代表的语义不变的情况下,对内部的算子和中间结果张量数据进行增加、删除、拓扑关系调整。
(12)等效规则
在本申请实施例中,等效规则包括reshape算子的等效规则、transpose算子的等效规则、concat算子的等效规则以及split算子的等效规则中的至少一种。在接下来的实施例中,将一一进行阐述。
从本质上来看,等效规则描述的是可以优化的胶水算子的逻辑关系。在本申请实施例中,胶水算子的逻辑关系是至少两个胶水算子中一个算子的输出数据交由另一个算子作为输入数据进行运算操作。
(13)人工智能处理器
人工智能处理器,也称之为专用处理器,在本申请实施例中,人工智能处理器是指针对特定应用或者领域的处理器。例如:图形处理器(GPU,Graphics Processing Unit),又称显示核心、视觉处理器、显示芯片,是一种专门在个人电脑、工作站、游戏机和一些移动设备(如平板电脑、智能手机等)上进行图像运算工作的专用处理器。又例如:神经网络处理器(NPU,Neural Processing Unit),是一种在人工智能领域的应用中针对矩阵乘法运算的专用处理器,采用“数据驱动并行计算”的架构,特别擅长处理视频、图像类的海量多媒体数据。
(14)人工智能处理器的软件栈
人工智能处理器的软件栈:参见图1I,该软件栈结构10包括人工智能应用100、人工智能框架102、人工智能学习库104、人工智能运行时库106以及驱动108。接下来对其进行具体阐述。
人工智能应用100对应不同的应用场景,提供对应的人工智能算法模型。该算法模型可以直接被人工智能框架102的编程接口解析,在其中一个可能的实现方式中,通过人工智能学习库104将人工智能算法模型转换为二进制指令,调用人工智能运行时库106将二进制指令转换为人工智能学习任务,将该人工智能学习任务放在任务队列中,由驱动108调度任务队列中的人工智能学习任务让底层的人工智能处理器执行。在其中另一个可能的实现方式中,也可以直接调用人工智能运行时库106,运行先前已固化生成的离线运行文件,减少软件架构的中间开销,提高运行效率。
人工智能框架是整个深度学习生态体系中的第一层。早期在Caffe中,Layer被当做是构建神经网络的基本元素,而在之后的人工智能框架,例如TensorFlow、MXNet中,虽然采用了不同的称呼,例如Operator,但与Caffe的layer在核心思想上依旧是相似的,都是将神经网络计算进一步拆分为各类常见的面向张量数据的算子,人工智能框架需要将神经网络映射的计算图结构所表达的深度学习任务具体化成可以在CPU或者人工智能处理器执行的指令和数据。在这个过程中,人工智能框架采用算子作为落实计算任务的具体元素,为每个算子都提供了在CPU或者人工智能处理器上执行的核函数(Kernel),根据计算图,人工智能框架调度执行计算图中每个算子对应的核函数,完成整个神经网络的计算。
为了便于更好的理解本申请,下面具体阐述本申请所描述的技术方案的研究思路。
现有技术中,数据并行的问题在于,其扩展性依赖于处理的数据批量的大小。尽管在训练阶段这通常不会是一个问题,但是对于推理阶段这个前提则难以保证。一般来说,用于实时服务领域(包括视频监控,自动驾驶等)的神经网络模型,处理的数据通常是以流的方式串行输入,导致了每次处理的数据规模很小甚至往往是单张图片。在这种情况下,数据并行不能提供任何并行度,所有的工作任务会集中在单个核上,这使得多核带来的计算资源不能转化成处理任务的速度。
当在线下使用数据集完成了神经网络模型的训练后,就会把模型部署到云端的服务器上来处理外界发来的数据,此时的应用场景就由离线训练变成了在线推理。在在线推理阶段,一个非常重要的指标是时延,也就是从服务器收到待处理数据到返回处理后的结果的时间,进一步来说,是使用神经网络模型处理数据的时间。低时延保证云端服务器能够对客户端发来的数据在最短的时间内做出响应,在一些更加敏感的场景下,直接决定了方案是否可用。因此,在线推理阶段对于人工智能处理器的要求就由处理大批量数据、高吞吐量转变为处理小批量数据、低时延。
在这种情况下,传统的数据并行或者模型并行难以有效降低推理任务的时延。对于数据并行来说,大批量数据是前提,这本身与在线推理小批量数据的特点矛盾。对于模型并行来说,它通常是为了解决一个规模很大的神经网络模型超过了单个设备的内存限制而采用的方法,把算子分配到不同的核上并不能降低网络的时延。为了真正能够在多核人工智能处理器上降低推理任务的时延,必须寻找一种方法,能够把对小批量数据甚至单个数据的推理计算任务合理地分配到多核架构的各个核上,保证每一时刻都有尽可能多的核参与计算,才能充分利用多核架构的资源。一种方法是把神经网络中的每个算子的计算任务都拆分到多个核上计算,这种方法即使在处理单张图片的推理任务时也能保证每一时刻都有多个核参与计算,从而达到了利用多核资源降低时延的目的。
但是,对于多核人工智能处理器来说,还有很多要解决的问题。首先,深度学习人工智能处理器通过定制化自身的硬件设计来适配深度学习算法本身的数据并行特征,提高计 算吞吐量,人工智能处理器往往需要足够的数据规模才能达到较高的计算效率,而算子内的进一步拆分会减小每个核上的计算规模。当拆分达到一定粒度,每个核上计算效率的损失会超过拆分增加并行度所带来的收益。因此,必须在拆分并行和计算效率之间,在保证足够计算效率的同时提供足够的并行度。
另一方面,神经网络模型可以看作是一个由通常数以百计甚至千记的算子所构成的复杂计算图。不同种类的算子内的算法逻辑各不相同,这就导致对这些算子进行拆分的方法也不一样。每个算子的拆分,除了平衡自身的计算效率和并行度,还要考虑和前后算子的搭配,甚至于对全局的影响。深度学习的快速发展带来的是越来越多的大规模复杂网络,通过手动方式寻找一种好的并行方法是不现实的,因此需要一种自动化的方法来保证来对于不同的网络都能够给出一种较好的拆分并行策略。
此外,还需要考虑的是对于底层人工智能处理器的可移植性。对于没有足够良好的可编程性的人工智能处理器来说,由单核扩展到多核,并且实现算子内部的拆分并行所带来的修改软件栈的工作量是非常大的。传统的数据并行和模型并行的实现仍然是基于一个处理核完成一个算子的计算任务,所以并不会带来很多额外的工作,而单个算子的跨核并行需要对算子本身实现进行修改,这种修改的难易程度依赖于人工智能处理器的可编程性和原有算子实现逻辑的复杂程度。如何减小在多核架构上实现低时延推理过程中的额外开销,缓解实现过程中工作量对于人工智能处理器本身可编程性的依赖,使得方法能够在未来对于不同的多核人工智能处理器都有一定的通用性也是一个需要考虑的问题。
基于上述分析描述,在本申请实施例中,把一个算子拆分成多个规模更小的子算子,这样可以直接调用单核架构下的计算库,避免了重新实现的额外工作量。比如:一个激活算子在经过拆分后可以得到许多更小的激活算子,这意味着只需要在多个核上调用原有的单核激活函数完成每个子任务,而不需要修改或者重新实现一个多核版本的激活函数。在这个过程中,既需要兼顾每个算子本身的拆分后的计算效率和并行度,也要考虑上下文算子彼此之间在拆分上的相互配合。最终目标是得到一个能够有效降低整个神经网络模型端到端的推理时延的拆分并行方案。
此外,需要说明的是,本申请实施例所提供的神经网络处理方法能够尽量避免对单核处理器计算库进行修改,同时也能够实现神经网络模型在多核处理器上的并行执行。具体地,上层框架通过把神经网络模型中的算子拆分成若干个可以并行执行子算子,对每个子算子,深度学习框架调用计算库生成所述子算子在单个核上执行的机器指令,通过把所述子算子的机器指令加载到不同核上,实现算子在多核处理器上的并行计算。具体地,因为深度学习框架可以使用单核处理器计算库生成子算子的计算指令,神经网络模型中所述算子的输入和输出张量数据随着所述算子被拆分成子算子同样被拆分成相应的子张量数据。
基于上述分析,首先介绍一下本申请所描述的方法可以适用的硬件设备的结构示意图。参见图2,是本申请实施例提供的一种计算机设备的结构示意图。如图2所示,计算机设备20可以包括通用处理器201、存储器202、通信总线203、通信接口204和至少一个人工智能处理器205,通用处理器201、人工智能处理器205通过所述通信总线连接所述存储器202和所述通信接口203。
通用处理器201可以是中央处理单元(Central Processing Unit,CPU),该通用处理器201还可以是其他通用处理器、数字信号处理器(Digital Signal Processor,DSP)、专用集成电路(Application Specific Integrated Circuit,ASIC)、现成可编程门阵列(Field-Programmable Gate Array,FPGA)或者其他可编程逻辑器件、分立门或者晶体管逻辑器件、分立硬件组件等。通用处理器201可以是微处理器或者该通用处理器201也可以是任何常规的处理器等。
通用处理器201还可以是一种集成电路芯片,具有信号的处理能力。在实现过程中,本申请的神经网络处理方法的各个步骤可以通过通用处理器201中的硬件的集成逻辑电路 或者软件形式的指令完成。
存储器202可以是只读存储器(Read-Only Memory,ROM)、随机存取存储器(Random Access Memory,RAM)或其他存储器。本申请实施例中,存储器202用于存储数据以及各种软件程序,例如本申请实施例中根据确定好的目标拆分策略对神经网络模型进行拆分的程序等。
可选的,在本申请实施例中,所述存储器可以包括用于存储信息的物理装置,通常是将信息数字化后再以利用电、磁或者光学等方法的媒体加以存储。本实施方式所述的存储器又可以包括:利用电能方式存储信息的装置,如RAM、ROM等;利用磁能方式存储信息的装置,如硬盘、软盘、磁带、磁芯存储器、磁泡存储器、U盘;利用光学方式存储信息的装置,如CD或DVD。当然,还有其他方式的存储器,例如量子存储器、石墨烯存储器等等。
通信接口204使用例如但不限于收发器一类的收发装置,来实现计算机设备20与其他设备或通信网络之间的通信。例如,可以通过通信接口204接收其他设备发送的模型文件。
人工智能处理器205可以作为协处理器挂载到主CPU(Host CPU)上,由主CPU为其分配任务。在实际应用中,人工智能处理器205可以实现一种或多种运算。例如,以神经网络处理器(Network Processing Unit,NPU)NPU为例,NPU的核心部分为运算电路,通过控制器控制运算电路提取存储器202中的矩阵数据并进行乘加运算。
可选的,人工智能处理器205可以包括8个集群(cluster),每个cluster中包括4个人工智能处理器核。
可选的,人工智能处理器205可以是可重构体系结构的人工智能处理器。这里,可重构体系结构是指,如果某一人工智能处理器能够利用可重用的硬件资源,根据不同的应用需求,灵活的改变自身的体系结构,以便为每个特定的应用需求提供与之相匹配的体系结构,那么这一人工智能处理器就称为可重构的计算系统,其体系结构称为可重构的体系结构。
应当理解,计算机设备20仅为本申请实施例提供的一个例子,并且,计算机设备20可具有比示出的部件更多或更少的部件,可以组合两个或更多个部件,或者可具有部件的不同配置实现。
基于图2所示的计算机设备的结构示意图,下面结合图3A所示的本申请实施例提供的一种神经网络处理方法的流程示意图,具体说明在本申请实施例中是如何实现对神经网络模型的拆分的,下面以caffe为例进行详细描述,可以包括但不限于如下步骤:
步骤S310、获取神经网络模型对应的计算图;其中,所述神经网络模型包含多个算子,所述多个算子用于执行神经网络计算任务。
在caffe框架下,所述目标算子可以是神经网络模型中的对应目标层(layer),该目标层为所述神经网络模型中的至少一层。
在本申请实施例中,计算图是指:使用图结构对神经网络模型的计算过程进行描述的一种方式。
在本申请实施例中,神经网络模型可以接收输入数据,并根据接收的输入数据和当前的模型参数生成预测输出。在实际应用中,该神经网络模型可以是回归模型、深度神经网络(deep neural network,DNN)、卷积神经网络模型(Convolutional Neural Networks,CNN)、循环神经网络模型(Recurrent Neural Networks,RNN)等,本申请实施例不作具体限定。
在计算机设备执行神经网络计算任务时,如果该神经网络计算任务具有多层运算,多层运算的输入神经元和输出神经元并非是指整个神经网络模型的输入层中神经元和输出层中神经元,而是对于网络中任意相邻的两层,处于网络正向运算下层中的神经元即为输入神经元,处于网络正向运算上层中的神经元即为输出神经元。以卷积神经网络为例,设一 个卷积神经网络模型有L层,K=1,2,...,L-1,对于第K层和第K+1层来说,我们将第K层称为输入层,其中的神经元为所述输入神经元,第K+1层称为输出层,其中的神经元为所述输出神经元。即除最顶层外,每一层都可以作为输入层,其下一层为对应的输出层。
在本申请实施例中,不同的神经网络模型对应着不同的神经网络计算任务。例如,深度学习神经网络模型对应的神经网络计算任务可以为图像分类,文本分类等;卷积神经网络模型对应的神经网络计算任务可以为图像识别,视频分类等;长短时记忆神经网络模型(Long Short Term Memory Network,LSTM)对应的神经网络计算任务可以为语音识别、图片描述、自然语言处理等。
步骤S312、在拆分策略集合中确定所述神经网络计算任务的目标拆分策略;其中,所述拆分策略集合为所述计算图中目标算子对应的拆分方式组成的集合。
在本申请实施例中,在确定拆分策略集合时,可以包括:
根据所述计算图中目标算子对应的并行度、拆分维度、拆分维度大小确定所述目标算子对应的拆分方式;
根据所述目标算子对应的拆分方式确定所述拆分策略集合。
在本申请实施例中,目标算子为多个算子中的一个算子。
在单模型、单输入的场景下,通过增加模型本身的并行度以及使用多个人工智能处理器的运算核(core),获得处理性能的提升(降低延时,提升吞吐率)。我们把处理单模型、单输入的人工智能处理器的运算核(core)的数目称为第一并行度,亦即模型并行度。用户只需要在编译时期指定第一并行度,人工智能运行时库106会自动地将原始的神经网络模型对应的计算图在拓扑结构、输入输出、模型参数等多个维度进行划分,使得划分后的模型能够在多个运算核(core)上并行地执行,并自动的保证多核间的数据同步。举一个实际例子,可以用模型并行技术将VGG16分类网络划分到多个核上,并行地处理同一张输入图片,这样单张图片的分类延时可以获得显著降低。理论上,第一并行度越高,使用的核心数越多,人工智能处理器执行时间越短。
将单个模型同时处理多份输入,每份输入使用不同的运算核心处理,称之为单模型多数据并行计算模式。可以简单理解为把同样的模型复制了多份,每一份模型使用一个或者多个核(取决于第一并行度)处理不同的输入数据。但实际上模型(指令、权值等)并没有复制,而是被所有的核共享了。数据并行度就是指处理的输入数据份数,数据并行度亦称为第二并行度。举个实际例子,可以用数据并行技术,将同一份Alexnet模型复制到32个人工智能处理器的运算核上去执行,分别处理32张不同的图片,从而充分发挥人工智能处理器的算力。
可以理解的是,在仅满足追求高吞吐率的应用场景下,目标算子的并行度为第二并行度。在仅满足特定延时限制的应用场景下,目标算子的并行度为第一并行度。
在本申请实施例中,数据并行与模型并行两种编程方式可以叠加使用,用于满足特定延时限制下还需要追求高吞吐率的应用场景。并行度包括第一并行度和第二并行度。其中,在这种情况下,实际用到的运算核的数目是数据并行度乘以模型并行度,其乘积不能超过人工智能处理器中人工智能处理器运算核的数目。
在本申请实施例中,并行度,是指该算子将被拆分成多少个算子,这一变量通常受限于多核处理器架构的核数,在不超过核数上限的前提下,应该保证并行度为2的整数幂次。
在本申请实施中,保证并行度为2的整数幂次的原因在于:现有中,多核处理器架构中通常是2的整数次幂。如,1,2,4,8,16等等。一个并行度不是2的整数次幂的任务往往会导致人工智能处理器核的调度上产生“碎片”。
在本申请实施例中,拆分维度,是指算子应该沿着哪一逻辑维度对它自身进行拆分,得到一系列子算子。
这里,以卷积神经网络模型为例(具体地,该卷积神经网络用于图像分类或物体检测),神经网络模型的计算图中的张量数据一般有4个维度,分别是表示当前计算所处理的数据的批量大小的N,表示特征图像数量的C,表示特征图像尺寸的H和W。在实际应用中,计算机设备可以选择上述4个维度中的任意一个维度进行拆分。
需要说明的是,选择在何种维度上对算子进行拆分对于拆分方式特别敏感的算子是非常有意义的。例如,对激活算子来说,可以允许其输入数据和输出数据在任意维度上进行拆分。在实际应用中,当一个激活算子的输入数据被分成了若干个子块(从一致性的角度来考虑,输出数据也会进行同样的划分),不妨表示为input0、input1、input2、......、inputm-1和output0、output1、output2、......、outputm-1,则在计算阶段,整个激活算子实际上被拆分成了m个更小的激活算子,这些激活算子彼此之间没有依赖关系,可以运行在多个核上。
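A minimal sketch of the activation-operator splitting described above is given below (numpy is used only for illustration, and ReLU stands in for the single-core activation kernel): the input is divided into m sub-blocks, the output is divided in the same way, and the m smaller activation operators have no dependence on each other, so they can run on multiple cores.

```python
import numpy as np

# Sketch: split an activation operator into m smaller, independent activation
# operators by splitting its input (and, consistently, its output) into
# input0..input(m-1) / output0..output(m-1) along one dimension.

def relu_single_core(x):
    return np.maximum(x, 0)   # stands in for the single-core activation kernel

def split_activation(input_tensor, m, axis=0):
    sub_inputs = np.array_split(input_tensor, m, axis=axis)    # input0..input(m-1)
    sub_outputs = [relu_single_core(s) for s in sub_inputs]    # independent sub-operators
    return np.concatenate(sub_outputs, axis=axis)              # output0..output(m-1)

x = np.random.randn(8, 16, 32, 32)   # N, C, H, W
assert np.allclose(split_activation(x, m=4), relu_single_core(x))
```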
在本申请实施例中,拆分维度大小,是指算子沿着拆分维度拆分成一系列子算子之后,每个子算子在该维度上的具体数值。
进一步需要说明的是,将各个维度上拆分数量进行相乘可以得到算子的并行度。
在本申请实施例中,可以根据并行度、拆分维度以及拆分维度大小确定每个目标算子对应的拆分方式,在神经网络模型中包含多个算子的情况下,根据每个目标算子对应的并行度、拆分维度以及拆分维度大小可以确定多个目标算子对应的拆分方式,从而可以构成拆分策略集合。总的来说,在这种情况下,拆分策略集合为根据每个目标算子对应的并行度、拆分维度以及拆分维度大小确定的。
为了便于理解,下面结合具体的实例进行阐述,例如,以caffe为例参考附图3B详细描述。在图3B中,人脸识别神经网络模型中包含多种不同类型的算子(卷积算子、池化算子、全连接算子),其中,各算子之间的连接关系为:卷积层1-池化层1-卷积层2-池化层2-全连接层1-全连接层2。由于这些算子可以允许在任意维度上进行拆分,那么,在这种情况下,计算机设备可以根据并行度、拆分维度以及拆分维度大小确定每个算子各自对应的拆分方式,从而可以构成拆分策略集合。
在其中一种可能的实现方式中,神经网络模型中包含多种不同类型的算子,其中,一些算子可以允许在任意维度上进行拆分,一些算子只支持在有限维度上进行拆分,那么,在这种情况下,计算机设备可以分别确定每个目标算子各自对应的拆分方式,然后,将多个算子中的每个目标算子均支持的拆分方式的交集确定为拆分策略集合。总的来说,在这种情况下,拆分策略集合为根据多个算子中的每个目标算子均支持的拆分方式确定的。通过这一实现方式,可以避免不合理的拆分方式带来的负面影响,例如,加大了计算机设备的资源消耗、导致因拆分后的子算子的规模不均衡而带来的耗时问题等等。
为了便于理解,下面结合具体的实例进行阐述,例如,如图3C所示,车牌字符识别神经网络模型中包含多种不同类型的算子(卷积算子、池化算子、激活算子、softmax算子等),其中,各算子之间的连接关系为:卷积层1-激活函数Relu-最大池化层1-卷积层2-激活函数Relu-最大池化层2-卷积层3-激活函数Relu-最大池化层3-卷积层4-激活函数-最大池化层4-卷积层5-激活函数-最大池化层5-全连接层1-softmax层-输出层。由于卷积算子、池化算子、激活算子可以允许在任意维度上进行拆分,而softmax算子只支持在有限维度上进行拆分,那么,在这种情况下,计算机设备将这多个算子中的每个目标算子均支持的拆分方式的交集确定为拆分策略集合。
在一种可能的实现方式中,神经网络模型中包含多种不同类型的算子,其中,一些算子完全不支持任何形式的拆分,而神经网络模型中的其他算子为了在数据的拆分格式上保持一致,在这种情况下,不对神经网络模型进行拆分。通过这一实现方式,可以避免不合理的拆分方式带来的负面影响,例如,加大了计算机设备的资源消耗、导致因拆分后的子算子的规模不均衡而带来的耗时问题等等。
在本申请实施例中,考虑到不同的算子具有不同的特性,为了避免不合理的拆分方式带来的负面影响,在对算子进行拆分时,计算机设备可以根据算子的类型确定算子的拆分方式。具体地,请参见表2:
表2
Figure PCTCN2020116933-appb-000006
如表2所示,不同类型的算子支持的拆分方式是不同的。通过这一实现方式,可以结合算子的特性对算子进行有针对性地拆分,从而可以避免不合理的拆分方式带来的负面影响,例如,加大了计算机设备的资源消耗、导致因拆分后的子算子的规模不均衡而带来的耗时问题等等。
具体来说,以卷积算子为例,在本申请实施例中,卷积算子的不同拆分方式可以描述为以下5种,这5种情况可以相互交叉,同时存在,可以保证足够的拆分度:
(1)当卷积算子输入数据的N维度超过1时,在N维度上进行拆分;
(2)在卷积算子的输入数据的C维度上进行拆分;
(3)在卷积算子的输出数据的C维度上进行拆分;
(4)在卷积算子的输入数据的H维度上进行拆分;
(5)在卷积算子的输入数据的W维度上进行拆分。
可以知道的是,上述五种拆分方式都是把原始的卷积算子拆分成更小的卷积。
为了便于理解,下面结合具体的实例进行阐述。如图4所示,是本申请实施例提供的一种卷积算子的原始计算图的示意图。对于卷积算子conv来说,其包含4个维度上的输入数据(input),并在权值矩阵的作用下,可以得到输出数据(output)。如图5A-图5E所示,为本申请实施例提供的计算图上卷积算子在并行度为2的条件下的多种拆分方式。具体地,图5A为按照输入数据的N维度进行拆分得到的示意图;图5B为按照输出数据的C维度进行拆分的示意图;图5C为按照输入数据C维度进行拆分得到的示意图;图5D为按照输入数据的H维度进行拆分得到的示意图;图5E为按照输入数据的W维度进行拆分得到的示意图。需要说明的是,图中每个张量数据给出了各个维度的起点和终点,用来明确拆分后的子张量数据与原始张量数据之间的关系。图中n表示输入数据批量大小、ic表示输入数据特征图像数量、ih表示输入数据特征图像的长度、iw表示输入数据特征图像的宽度、oc表示输出数据特征图像数量、oh表示输出数据特征图像的长度、ow表示输出数据特征图像的宽度、kh表示卷积核窗口的长度、kw表示卷积核窗口宽度。在实际应用中,这些拆分方式执行在不同的维度上,同时彼此之间可以通过相互组合形成更多的拆分方式,从而可以提供足够的并行度来利用多核处理器的资源,同时在一定程度上可以避免单个维度的过度拆分影响计算机设备的计算效率。
又例如,以分类器(softmax)算子为例,计算机设备可以在softmax算子概率归一化的维度之外的任意一个或几个维度上对softmax算子进行拆分,拆分后将得到若干个可以并行执行的softmax算子。
在本申请实施例中,所述在拆分策略集合中确定所述神经网络计算任务的目标拆分策略,包括:
分别确定所述拆分策略集合中目标算子对应的拆分方式对应的权重值;
根据权重值确定所述目标拆分策略。
在本申请实施例中,可以将目标算子在某种拆分方式下在多核处理器上并行执行时所用的时间表征为权重值。这里,需要说明的是,多核处理器完成一个算子的计算时间取决于执行拆分后的子计算任务耗时最长的那个核的时间。
在本申请实施例中,可以通过如下步骤A11-A14确定目标算子拆分方式的权重值:
A11、确定拆分后的n个子算子的计算负载c1,c2,…,cn。其中,ci根据拆分后第i个子算子的类型和规模计算得到;
A12、确定n个子算子的访存数据量d1,d2,…,dn。其中,di根据拆分后第i个子算子的类型和规模计算得到;
A13、确定每个人工智能处理器核的计算吞吐速率α。α由人工智能处理器本身的性能参数所决定;
A14、确定每个人工智能处理器核的访存带宽β。通常来说,人工智能处理器的多个核共享有限的访存带宽,因此β=B/n。其中,B是多核人工智能处理器的总带宽。
基于上述确定好的参数,计算机设备可以根据如下计算公式(1)来计算目标算子的拆分方式的权重值:
t = max_{i=1,...,n}( max(c_i/α, d_i/β) )        (1)
其中,计算公式中内侧的取最大值操作是基于算子实现的计算部分和访存部分之间能够相互隐藏,即计算和访存可以做到尽量并发执行。对于一些人工智能处理器来说,当子算子的规模过小时会导致每个核的计算吞吐量降低,可以对α进行进一步修正使估值更加准确。计算公式中外侧的取最大值操作就是多核人工智能处理器完成一个算子的计算的时间取决于执行子计算任务耗时最长的那个核的时间。
最后,将目标算子在某种拆分方式下的权重确定为拆分策略的权重。可以理解的是,通过上述实现方式可以确定拆分策略集合中包含的拆分策略的权重。
需要说明的是,上述计算权重的方式仅仅是例举的部分情况,而不是穷举,本领域技术人员在理解本申请技术方案的精髓的情况下,可能会在本申请技术方案的基础上产生其它的变形或者变换,比如:衡量拆分策略的权重不仅仅可以是执行子计算任务的所花费的时间,也可以是执行子计算任务的吞吐量。或也可以通过实际测量在多核人工智能处理器上执行拆分策略对应的算子拆分方式下的所有子计算任务的时间来确定拆分策略权重。但只要其实现的功能以及达到的技术效果与本申请类似,那么均应当属于本申请的保护范围。
在本申请实施例中,当计算机设备根据上述描述的方法确定好了拆分策略集合中的目标算子对应的拆分方式的权重值之后,计算机设备可以将权重值最小的拆分策略确定为神经网络模型的目标拆分策略。
步骤S314、根据所述目标拆分策略对所述神经网络计算任务进行拆分,得到多个子计算任务。
步骤S316、将所述子计算任务分配到人工智能处理器中的对应人工智能处理器核上进行处理。
如前所述,本申请实施例所描述的技术方案的核心思想为:通过把神经网络模型中的 目标算子的计算任务拆分成更小的子计算任务分配到多个核上并行执行来充分利用多核处理器结构芯片的硬件资源。
这里,由于拆分后的每个子算子都可以复用单核架构下算子的指令实现来进行计算,从而可以避免对原有算子的指令实现的重构。
在本申请实施例中,神经网络模型用于执行某个特定的神经网络计算任务,例如,人脸识别;又例如,边缘检测;又例如,语义分析等等。这里,运行结果是指,计算机设备执行特定神经网络计算任务时的结果,可以包括但不限于:神经网络模型的精度、神经网络模型的运行时间等等。在实际应用中,计算机设备可以输出该运行结果,例如,计算机设备通过显示屏显示该运行结果。
实施本申请实施例,通过将神经网络计算任务拆分成若干个规模更小的子计算任务,这样多核处理器可以直接调用单核架构下的计算库,充分利用了多核处理器的硬件资源,从而可以避免重现实现的额外工作量。
需要说明的是,对于前述的各方法实施例,为了简单描述,故将其都表述为一系列的动作组合,但是本领域技术人员应该知悉,本披露并不受所描述的动作顺序的限制,因为依据本披露,某些步骤可以采用其他顺序或者同时进行。其次,本领域技术人员也应该知悉,说明书中所描述的实施例均属于可选实施例,所涉及的动作和模块并不一定是本披露所必须的。
进一步需要说明的是,虽然图3A的流程图中的各个步骤按照箭头的指示依次显示,但是这些步骤并不是必然按照箭头指示的顺序依次执行。除非本文中有明确的说明,这些步骤的执行并没有严格的顺序限制,这些步骤可以以其它的顺序执行。而且,图3A中的至少一部分步骤可以包括多个子步骤或者多个阶段,这些子步骤或者阶段并不必然是在同一时刻执行完成,而是可以在不同的时刻执行,这些子步骤或者阶段的执行顺序也不必然是依次进行,而是可以与其它步骤或者其它步骤的子步骤或者阶段的至少一部分轮流或者交替地执行。
在一种可能的实施例中,请参阅图6A,图6A为本申请实施例提供的一种神经网络优化方法的流程示意图,具体说明在本申请实施例中,是如何对神经网网络模型进行优化的,可以包括但不限于如下步骤:
S620、在神经网络模型对应的计算图中提取胶水子图;其中,所述胶水子图是包含胶水算子的子图;所述胶水算子用于对所述计算图的张量数据进行调整。
在本申请实施例中,“神经网络模型”也称模型,如“第一神经网络模型”、“第二神经网络模型”或“第三神经网络模型”,可以接收输入数据,并根据接收的输入数据和当前的模型参数生成预测输出。在实际应用中,预测输出可以包括图像检测输出结果、语义分析输出结果、图像分类输出结果等等。该神经网络模型可以包括深度学习神经网络模型(deep neural network,DNN)、卷积神经网络模型(Convolutional Neural Network,CNN)、极限学习机模型(extreme learning machine,ELM)或其他的神经网络模型等。
在本申请实施例中,神经网络模型中包含胶水算子。具体地,胶水算子可以包括reshape算子、transpose算子、concat算子、split算子等,还可以包括其他可以用于对神经网络模型中张量数据的格式、张量数据的形状以及张量数据在内存中排布进行调整的胶水算子,本申请实施例不作具体限定。
在本申请实施例中,计算图是指:使用图结构对神经网络模型的计算过程进行描述的一种方式。为了便于阐述,我们将胶水子图定义为包含胶水算子的计算图。例如,计算机设备中的通用处理器在神经网络模型对应的计算图中提取到的胶水子图可以参见图6B,如图6B所示,该胶水子图中包含reshape算子和concat算子,每个胶水算子均关联有对应的张量数据。
S622、在确保所述胶水子图的输入张量数据、输出张量数据不变的情况下,对所述计算图中的所述胶水子图进行处理,获得重构结果子图集合;其中,所述重构结果子图集合中的任意一个重构结果子图的输入张量数据和输出张量数据分别与所述胶水子图的输入张量数据和输出张量数据相同。
在本申请实施例中,重构结果子图是指可以对胶水子图进行替换的子图。具体地,重构结果子图为遍历状态集合图得到的。从本质上来看,重构结果子图为状态集合图中从起始状态到终点状态的一条路径。
在本申请实施例中,对计算图中的胶水子图进行处理可以包括:在保证胶水子图的输入张量数据和输出张量数据不变,以及胶水子图整体所代表的语义不变的情况下,对胶水子图内部的胶水算子和中间结果张量数据进行增加、删除、拓扑关系调整等等。
在本申请实施例中,计算机设备提取的胶水子图的数量为多个的情况下,计算机设备可以对这多个胶水子图进行扩充,通过重构子图的方式获取每个胶水子图对应的优化结构;也可以只对其中的任意一个胶水子图进行扩充,通过重构子图的方式获取胶水子图对应的优化结构,本申请实施例不作具体限定。
具体实现中,所述对所述计算图中的所述胶水子图进行处理,获得重构结果子图集合,可以包括但不限于如下步骤A21-步骤A23,接下来对其进行具体阐述:
步骤A21、根据胶水算子的逻辑关系对所述胶水子图进行扩充,获得扩充后的胶水子图。
具体实现中,所述根据胶水算子的逻辑关系对所述胶水子图进行扩充,获得扩充后的胶水子图,包括:根据等效规则对所述胶水子图中胶水算子之间的逻辑关系进行扩充,获得与所述胶水子图的语义等价的逻辑关系;根据与所述胶水子图的语义等价的逻辑关系对所述胶水子图进行扩充,获得所述扩充后的胶水子图。
这里,所述根据等效规则对所述胶水子图中胶水算子之间的逻辑关系进行扩充,包括:
对所述逻辑关系对应的算子序列进行变换,根据所述等效规则,确保获得所有与所述胶水子图的语义等价的逻辑关系。
在本申请实施例中,等效规则包括reshape算子的等效规则、transpose算子的等效规则、concat算子的等效规则、split算子的等效规则中的至少一种。从本质上来看,等效规则为根据胶水算子的逻辑关系进行优化的规则,下面对其进行具体阐述:
(1)reshape算子的等效规则:在本申请实施例中,胶水算子的逻辑关系可以包括reshape算子间的逻辑关系,或,reshape算子与第一类其他算子的逻辑关系;第一类其他算子可以包括transpose算子、concat算子、split算子中的任意一种算子。
在一种可能的实现方式中,胶水算子的逻辑关系包括reshape算子间的逻辑关系,例如,多个连续的reshape算子;在另一种可能的实现方式中,胶水算子的逻辑关系包括reshape算子与第一类其他算子的逻辑关系,例如,reshape算子与transpose算子相邻;又例如,reshape算子与concat算子相邻;又例如,reshape算子与split算子相邻,等等。在本申请实施例中,算子与算子相邻用于表征一个算子的输出张量数据作为另一个算子的输入张量数据。
在本申请实施例中,胶水算子的逻辑关系应该理解为计算机设备在执行神经网络模型这一程序代码过程中的执行逻辑。例如,计算机设备在执行某段程序代码过程中,先执行reshape算子,后执行transpose算子,在这种情况下,可以理解为:计算机设备将reshape算子的输出张量数据作为transpose算子的输入张量数据。
第一种情形:transpose算子的输出张量数据是reshape算子的输入张量数据。
具体实现中,所述胶水算子的逻辑关系包括transpose算子的输出张量数据是reshape算子的输入张量数据。在这种情况下,计算机设备根据胶水算子的逻辑关系确定与 “transpose算子和reshape算子”这一胶水子图语义等价的逻辑关系,可以包括:
在所述transpose算子的执行过程中,所述reshape算子进行维度合并的维度的相对位置不变,将reshape算子的输出张量数据作为所述transpose算子的输入张量数据。
在本申请实施例中,维度是指神经网络模型中的计算图中的张量数据的维度。例如,以卷积神经网络为例,卷积神经网络模型中的计算图中的张量数据的维度一般可以包括4个维度,分别为表示当前计算所处理的数据的批量大小的N,表示特征图像数量的C,表示特征图像尺寸的H和W。
在本申请实施例中,如图7A中的a所示,神经网络模型对应的计算图中包含reshape算子和transpose算子,其中,transpose算子的输出张量数据是reshape算子的输入张量数据,当reshape算子进行维度合并的维度的相对位置没有在transpose算子执行过程中发生变化,在一种实现方式中,如图7A中的b所示,可以按照优化路径(1)进行优化,将reshape算子的部分输出张量数据作为transpose算子的输入张量数据,从而可以得到与胶水子图语义等价的逻辑关系;在另一种实现方式中,也可以按照优化路径进行优化,将reshape算子的输出张量数据作为transpose算子的输入张量数据,从而可以得到与胶水子图语义等价的逻辑关系。
为了便于理解,下面结合具体的实例进行阐述,张量A=[3,4,5],张量A在经过transpose算子之后,可以得到张量B=[5,3,4],与此同时,当张量B在经过reshape算子之后,可以得到张量C=[5,6,2]。这里,reshape算子在后两个维度上的操作可以认为是先对3和4进行合并,然后将其进行拆分,可以拆分成6和2。分析张量A=[3,4,5]和张量B=[5,3,4]可以知道的是,3和4的相对位置在transpose算子前后并没有发生变化,那么,在这种情况下,可以将reshape算子的输出张量数据作为transpose算子的输入张量数据,从而其实现过程可以描述为:张量A=[3,4,5],张量A在经过reshape算子之后,可以得到张量B’=[6,2,5],与此同时,张量B’在经过transpose算子之后,可以得到张量C’=[5,6,2]。可以理解的是,由于优化得到的与胶水子图语义等价的逻辑关系可以提高神经网络模型的整体性能,那么,当处理器(例如,通用处理器CPU、专用人工智能处理器)在运行优化后的神经网络模型时,可以减少计算机设备的资源消耗。
第二种情形:concat算子的输出张量数据是reshape算子的输入张量数据。
具体实现中,所述胶水算子的逻辑关系包括concat算子的输出张量数据是reshape算子的输入张量数据。在这种情况下,计算机设备根据胶水算子的逻辑关系确定与“concat算子和reshape算子”这一胶水子图语义等价的逻辑关系,可以包括:
当所述concat算子所操作的维度k 0+k 1+...+k m在所述reshape算子的拆分阶段被拆分成p 0×p 1×...×(k 0/∏ ip i+k 1/∏ ip i+...+k m/∏ ip i)×...×p n-1×p n,将reshape算子的输出张量数据作为所述concat算子的输入张量数据;其中,k 0、k 1、k m表示所述concat算子拼接的维度大小。
在本申请实施例中,如图7B中的a所示,神经网络模型对应的计算图中包含reshape算子和concat算子,其中,concat算子的输出张量数据是reshape算子的输入张量数据,当concat算子所操作的维度k 0+k 1+...+k m在reshape算子的拆分阶段被拆分成形如p 0×p 1×...×(k 0/∏ ip i+k 1/∏ ip i+...+k m/∏ ip i)×...×p n-1×p n的形式,如图7B中的b所示,可以将reshape算子的输出张量数据作为所述concat算子的输入张量数据,从而可以得到与胶水子图语义等价的逻辑关系。
为了便于理解,下面结合具体的实例进行阐述,张量A=[3,4,5],张量B=[3,6,5],张量A和张量B在经过concat算子之后,可以得到张量C=[3,10,5],与此同时,当张量C在经过reshape算子之后,可以得到张量D=[15,2,5]。分析上述变化过程可以知道的是,concat输出张量(也即张量C)中维度10为对张量A中维度4和张量B中维度6进行累加而来。由于reshape算子在执行过程中可以认为是:先对维度进行合并,然后,对合并后的维度进行拆分。当张量C在经过reshape算子时,维度10被拆分成一系列因子{5,2},因而维度10可以表示为(4/2+6/2)*2的形式,那么,在这种情况下,可以将reshape算子的输出张量数据作为所述concat算子的输入张量数据,从而其实现过程可以描述为:张量A=[3,4,5],张量B=[3,6,5],这两个张量在经过reshape算子之后,可以得到张量C’=[6,2,5],张量D’=[9,2,5],那么,张量C’和张量D’在经过concat算子之后,可以得到张量E=[15,2,5]。可以理解的是,由于优化得到的与胶水子图语义等价的逻辑关系可以提高神经网络模型的整体性能,那么,当处理器(例如,通用处理器CPU、专用人工智能处理器)在运行优化后的神经网络模型时,可以减少计算机设备的资源消耗。
第三种情形:split算子的输出张量数据是多个reshape算子的输入张量数据。
具体实现中,所述胶水算子的逻辑关系包括split算子的输出张量数据是多个reshape算子的输入张量数据。在这种情况下,计算机设备根据胶水算子的逻辑关系确定与“split算子和多个reshape算子”这一胶水子图语义等价的逻辑关系,可以包括:
在所述split算子的输出张量经过各自对应的reshape算子之后,至多只有一个维度的长度不同,将所述多个reshape算子的输出张量数据作为所述split算子的输入张量数据。
在本申请实施例中,如图7C中的a所示,神经网络模型对应的计算图中包含多个reshape算子与split算子,其中,split算子的输出张量数据是多个reshape算子的输入张量数据,在split算子的所有输出张量经过各自对应的reshape算子之后,至多只有一个维度的长度不同,例如,只有C维度上的长度不同,在这种情况下,如图7C中的b所示,将多个reshape算子的输出张量数据作为split算子的输入张量数据,从而可以得到与胶水子图语义等价的逻辑关系。
为了便于理解,下面结合具体的实例进行阐述,张量A=[3,15,4],张量A在经过split算子之后,可以得到张量B=[3,6,4]和张量C=[3,9,4],张量B和张量C在经过各自对应的reshape算子之后,可以得到张量D=[6,3,4]和张量E=[9,3,4]。分析张量D和张量E可以知道的是,reshape算子的输出张量只有一个维度不同(张量D中的维度6和张量E中的维度9),那么,在这种情况下,可以将多个reshape算子的输出张量数据作为split算子的输入张量数据,从而其实现过程可以描述为:张量A=[3,15,4],张量A在经过reshape算子之后,可以得到张量B=[15,3,4],与此同时,张量B在经过split算子之后,可以得到张量C’=[6,3,4]和张量D’=[9,3,4]。可以理解的是,由于优化得到的与胶水子图语义等价的逻辑关系可以提高神经网络模型的整体性能,那么,当处理器(例如,通用处理器CPU、专用人工智能处理器)在运行优化后的神经网络模型时,可以减少计算机设备的资源消耗。
第四种情形:多个连续的reshape算子。
具体实现中,所述胶水算子的逻辑关系可以包括N个连续的reshape算子。在这种情况下,根据胶水算子的逻辑关系确定与“多个reshape算子”这一胶水子图语义等价的逻辑关系,可以包括:
当神经网络模型对应的计算图中包含连续N个reshape算子时,对N个reshape算子进行合并,得到一个reshape算子。这里,N为大于等于2的正整数,如N=2。
在本申请实施例中,如图7D中的a所示,神经网络模型对应的计算图中包含多个连续的reshape算子,在这种情况下,计算机设备对这N个连续的reshape算子进行合并,可以得到如图7D中的b所示的优化结构。
为了便于理解,下面结合具体的实例进行阐述,以张量A=[A1,A2,A3,...,An]为例,当对张量A执行reshape1算子之后,得到张量B,其中,张量B=[B1,B2,B3,...,Bn]。与此同时,当对张量B执行reshape2算子之后,得到张量C,其中,张量C=[C1,C2,C3,...,Cn]。可以理解是,将reshape1算子与reshape2算子合并得到的reshape3算子的输入是A张量,输出为C张量。例如,A=[1,32,1,1],经过reshape1算子之后,变为B=[1,4,4,2],再经过reshape2算子之后,变为C=[16,2]。采用本申请描述的技术方案,对reshape1算子以及reshape2算子进行合并,可以得到reshape3算子,张量A在经过reshape3算子之后,直接从张量A=[1,32,1,1]变为张量C=[16,2]。可以理解的是,由于优化得到的与胶水子图语义等价的逻辑关系可以提高神经网络模型的整体性能,那么,当处理器(例如,通用处理器CPU、专用处理器人工智能处理器)在运行优化后的神经网络模型时,可以减少计算机设备的资源消耗。
(2)transpose算子的等效规则:具体实现中,胶水算子的逻辑关系可以包括transpose算子间的逻辑关系,或,transpose算子与第二类其他算子的逻辑关系;这里,第二类其他算子可以包括reshape算子、concat算子、split算子中的任意一种算子。
在一种可能的实现方式中,胶水算子的逻辑关系包括transpose算子间的逻辑关系,例如,多个连续的transpose算子;在另一种可能的实现方式中,胶水算子的逻辑关系包括transpose算子与第二类其他算子的逻辑关系,例如,transpose算子与reshape算子相邻;又例如,transpose算子与concat算子相邻;又例如,transpose算子与split算子相邻,等等。这里,算子与算子相邻用于表征一个算子的输出张量数据作为另一个算子的输入张量数据。
第一种情形:reshape算子的输出张量数据是transpose算子的输入张量数据。
具体实现中,所述胶水算子的逻辑关系包括reshape算子的输出张量数据是transpose算子的输入张量数据。在这种情况下,计算机设备根据胶水算子的逻辑关系确定与“reshape算子和transpose算子”这一胶水子图语义等价的逻辑关系,可以包括:
当所述reshape算子在拆分阶段由中间状态的同一维度所拆分出的维度的相对位置在执行所述transpose算子的过程中不发生变化,将transpose算子的输出张量数据作为所述reshape算子的输入张量数据。
在本申请实施例中,维度是指神经网络模型中的计算图中的张量数据的维度。例如,以卷积神经网络为例,卷积神经网络模型中的计算图中的张量数据的维度一般可以包括4个维度,分别为表示当前计算所处理的数据的批量大小的N,表示特征图像数量的C,表示特征图像尺寸的H和W。
在本申请实施例中,如图7E中的a所示,神经网络模型对应的计算图中包含reshape算子和transpose算子,其中,reshape算子的输出张量数据是transpose算子的输入张量数据,当reshape算子在拆分阶段由中间状态的同一维度所拆分出的维度的相对位置在执行transpose算子的过程中不发生变化,在一种实现方式中,如图7E中的b所示,可以按照优化路径(1)进行优化,将transpose算子的部分输出张量数据作为reshape算子的输入张量数据,从而可以得到与胶水子图语义等价的逻辑关系;在另一种实现方式中,也可以按照优化路径(2)进行优化,将transpose算子的输出张量数据作为reshape算子的输入张量数据,从而可以得到与胶水子图语义等价的逻辑关系。
为了便于理解,下面结合具体的实例进行阐述,张量A=[3,4,5],张量A在经过reshape算子之后,可以得到张量B=[4,3,5],与此同时,当张量B在经过transpose算子之后,可以得到张量C=[5,4,3]。由于reshape算子在执行过程中可以认为是:先对维度进行合并,然后,对合并后的维度进行拆分。这里,在执行reshape算子的过程中,先对维度{3,4}进行合并,得到{12},然后对{12}进行拆分,可以得到维度{4,3}。分析张量B=[4,3,5]和张量C=[5,4,3]可以知道的是,在transpose算子的执行过程中,维度{4,3}的相对位置没有发生变化,那么, 在这种情况下,可以将transpose算子的输出张量数据作为reshape算子的输入张量数据,从而其实现过程可以描述为:张量A=[3,4,5],张量A在经过transpose算子之后,可以得到张量B’=[5,3,4],与此同时,当张量B’在经过reshape算子之后,可以得到张量C’=[5,4,3]。可以理解的是,由于优化得到的与胶水子图语义等价的逻辑关系可以提高神经网络模型的整体性能,那么,当处理器(例如,通用处理器CPU、专用处理器人工智能处理器)在运行优化后的神经网络模型时,可以减少计算机设备的资源消耗。
第二种情形:concat算子的输出张量数据是transpose算子的输入张量数据。
具体实现中,所述胶水算子的逻辑关系包括concat算子的输出张量数据是transpose算子的输入张量数据。在这种情况下,计算机设备根据胶水算子的逻辑关系确定与“concat算子和transpose”这一胶水子图语义等价的逻辑关系,可以包括:将所述transpose算子的输出张量数据作为所述concat算子的输入张量数据。
在本申请实施例中,如图7F中的a所示,神经网络模型对应的计算图中包含transpose和concat算子,其中,concat算子的输出张量数据是transpose算子的输入张量数据,在这种情况下,如图7F中的b所示,将transpose算子的输出张量数据作为concat算子的输入张量数据,从而可以得到与胶水子图语义等价的逻辑关系。
为了便于理解,下面结合具体的实例进行阐述,张量A=[3,4,5],张量B=[3,6,5],在张量A和张量B在经过concat算子之后,可以得到张量C=[3,10,5],与此同时,当张量C在经过transpose算子之后,可以得到张量D=[10,3,5]。那么,在这种情况下,可以将transpose算子的输出张量数据作为concat算子的输入张量数据,从而其实现过程可以描述为:张量A=[3,4,5],张量B=[3,6,5],当张量A和张量B经过各自对应的transpose算子之后,可以得到张量C’=[4,3,5]和张量D’=[6,3,5],与此同时,当张量C’和张量D’在经过concat算子之后,可以得到张量E=[10,3,5]。可以理解的是,由于优化得到的与胶水子图语义等价的逻辑关系可以提高神经网络模型的整体性能,那么,当处理器(例如,通用处理器CPU、专用处理器人工智能处理器)在运行优化后的神经网络模型时,可以减少计算机设备的资源消耗。
第三种情形:split算子的输出张量数据是多个transpose算子的输入张量数据。
具体实现中,所述胶水算子的逻辑关系包括split算子的输出张量数据是多个transpose算子的输入张量数据;所述通用处理器根据所述计算图中胶水算子的逻辑关系对所述计算图进行优化。在这种情况下,计算机设备根据胶水算子的逻辑关系确定与“split算子和多个transpose算子”这一胶水子图语义等价的逻辑关系,可以包括:
在所述多个transpose算子各自对应的perm参数相同时,将所述多个transpose算子的输出张量数据作为所述split算子的输入张量数据。
如前所述,transpose算子可以表示为:tf.transpose(a,perm=None,name=’transpose’),那么,可以知道的是,transpose算子包含有perm参数。在本申请实施例中,perm参数为自然数列[1,2,3,...,n]的一个全排列,不同的全排列表示不同的transpose算子。
具体地,全排队被定义为:从n个不同元素中任意取m(m小于等于n)个元素,按照一定的顺序排列起来,叫做从n个不同元素中取出m个元素的一个排列。当m=n时所有的排列情况叫做全排列。例如,1,2,3三个元素的全排列可以为:1,2,3;1,3,2;2,1,3;2,3,1;3,1,2;3,2,1。
在本申请实施例中,多个transpose算子各自对应的perm参数相同是指:多个transpose算子各自对应的全排队相同。
在本申请实施例中,如图7G中的a所示,神经网络模型对应的计算图中包含多个transpose算子和split算子,其中,split算子的输出张量数据是多个transpose算子的输入张量数据,在多个transpose算子各自对应的perm参数相同时,如图7G中的b所示,将多个 transpose算子的输出张量数据作为split算子的输入张量数据,从而可以得到与胶水子图语义等价的逻辑关系。
为了便于理解,下面结合具体的实例进行阐述,张量A=[3,10,5],张量A在经过split算子之后,可以得到张量B=[3,4,5]和张量C=[3,6,5],与此同时,当张量B和张量C在经过各自对应的transpose算子之后,具体地,transpose算子各自对应的perm参数均为[1,0,2],可以得到张量D=[4,3,5]和张量E=[6,3,5]。那么,在这种情况下,将多个transpose算子的输出张量数据作为split算子的输入张量数据,从而其实现过程可以描述为:张量A=[3,10,5],张量A在经过transpose算子之后,可以得到张量B’=[10,3,5],与此同时,当张量B’经过split算子之后,可以得到张量C’=[4,3,5]和张量D’=[6,3,5]。可以理解的是,由于优化得到的与胶水子图语义等价的逻辑关系可以提高神经网络模型的整体性能,那么,当处理器(例如,通用处理器CPU、专用处理器人工智能处理器)在运行优化后的神经网络模型时,可以减少计算机设备的资源消耗。
第四种情形:多个连续的transpose算子。
具体实现中,胶水算子的逻辑关系可以包括M个连续的transpose算子。在这种情况下,计算机设备根据胶水算子的逻辑关系确定与“多个transpose算子”这一胶水子图语义等价的逻辑关系,可以包括:当所述神经网络模型对应的计算图中包含M个连续的transpose算子时,将所述M个transpose算子进行合并,得到一个transpose算子。这里,M为大于等于2的正整数,如M=2。
具体实现中,所述连续M个transpose算子包括第一transpose算子和第二transpose算子;所述将所述连续M个transpose算子合并为一个transpose算子,包括:确定所述第一transpose算子以及所述第二transpose算子各自对应的perm参数;根据所述第一transpose算子以及所述第二transpose算子各自对应的perm参数确定第一参数,其中,所述第一参数为合并后的transpose算子对应的perm参数。
具体实现中,所述根据所述第一transpose算子以及所述第二transpose算子各自对应的perm参数确定第一参数,包括:在确定所述第一参数时,根据以下公式来计算:perm3[i]=perm1[perm2[i]],其中,perm3表示所述第一参数,perm1表示所述第一transpose算子对应的perm参数,perm2表示所述第二transpose算子对应的perm参数。这里,中括号[]表示取数组中的元素。
例如,第一transpose算子对应的perm参数为perm1=[1,2],第二transpose算子对应的perm参数为perm2=[2,1],当i=1时,perm3[1]=perm1[perm2[1]]=2。当i=2时,perm3[2]=perm1[perm2[2]]=1。从而可以得到合并后的transpose算子对应的perm参数perm3=[2,1]。进一步地,合并后的transpose算子在确定好的perm3参数下调换张量数据的顺序。
在本申请实施例中,如图7H中的a所示,神经网络模型对应的计算图中包含多个连续的transpose算子,在这种情况下,计算机设备对这M个连续的transpose算子进行合并,可以得到如图7H中的b所示的优化结构,也即与“多个连续的transpose算子”这一胶水子图语义等价的逻辑关系。
为了便于理解,下面结合具体的实例进行阐述。例如,张量A=[1,4,3,2],经过transpose_1423算子之后,变为张量B=[1,2,4,3],再经过transpose_1243算子之后,变为张量C=[1,2,3,4]。采用本申请所描述的技术方案,对transpose_1423算子以及transpose_1243算子进行合并,可以得到transpose_1432算子,张量A在经过transpose_1432算子之后,直接从张量A=[1,4,3,2]变为张量C=[1,2,3,4]。由于处理器(例如,通用处理器CPU、专用处理器人工智能处理器)在运行神经网络模型时,无需依次执行两次不同的transpose算子, 而是只执行合并后的transpose算子,可以减少冗余计算,以达到减少计算机设备的资源消耗的目的。
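A short sketch of merging two consecutive transpose operators with the rule perm3[i] = perm1[perm2[i]] is given below, reproducing the example above with 0-indexed perm parameters (numpy is used only for illustration).

```python
import numpy as np

# Sketch: merge two consecutive transpose operators into one.
# Merged permutation: perm3[i] = perm1[perm2[i]].

def merge_perm(perm1, perm2):
    return [perm1[i] for i in perm2]

# Example from the text, written 0-indexed:
# transpose_1423 -> perm1 = [0, 3, 1, 2], transpose_1243 -> perm2 = [0, 1, 3, 2],
# merged transpose_1432 -> perm3 = [0, 3, 2, 1].
perm1 = [0, 3, 1, 2]
perm2 = [0, 1, 3, 2]
perm3 = merge_perm(perm1, perm2)   # [0, 3, 2, 1]

a = np.arange(1 * 4 * 3 * 2).reshape(1, 4, 3, 2)
b = np.transpose(np.transpose(a, perm1), perm2)   # A=[1,4,3,2] -> B=[1,2,4,3] -> C=[1,2,3,4]
c = np.transpose(a, perm3)
assert b.shape == c.shape == (1, 2, 3, 4) and np.array_equal(b, c)
```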
(3)concat算子的等效规则:具体实现中,胶水算子的逻辑关系可以包括concat算子间的逻辑关系,或,所述concat算子与第三类其他算子的逻辑关系。这里,第三类其他算子包括reshape算子、transpose算子、split算子中的任意一种算子。
在其中一种可能的实现方式中,胶水算子的逻辑关系包括concat算子间的逻辑关系,例如,多个连续的concat算子;在另一种可能的实现方式中,胶水算子的逻辑关系包括concat算子与其他算子的逻辑关系,例如,concat算子与reshape算子相邻;又例如,concat算子与transpose算子相邻;又例如,concat算子与split算子相邻,等等。这里,算子与算子相邻用于表征一个算子的输出张量数据作为另一个算子的输入张量数据。
第一种情形:多个reshape算子的输出张量数据是concat算子的输入张量数据。
具体实现中,所述胶水算子的逻辑关系包括多个reshape算子的输出张量数据是concat算子的输入张量数据。在这种情况下,计算机设备根据胶水算子的逻辑关系确定与“多个reshape算子和concat算子”这一胶水子图语义等价的逻辑关系,可以包括:当所述多个reshape算子各自对应的输入张量至多只有一个维度的长度不同,将所述concat算子的输出张量数据作为所述多个reshape算子的输入张量数据。
在本申请实施例中,如图7I中的a所示,神经网络模型对应的计算图中包含concat算子和多个reshape算子,其中,多个reshape算子的输出张量数据是concat算子的输入张量数据,当多个reshape算子各自对应的输入张量至多只有一个维度的长度不同,例如,在W维度上的长度不同,在这种情况下,如图7I中的b所示,将concat算子的输出张量数据作为多个reshape算子的输入张量数据,从而可以得到与胶水子图语义等价的逻辑关系。
为了便于理解,下面结合具体的实例进行阐述,张量A=[3,4,5],张量B=[3,6,5],张量A和张量B在经过各自对应的reshape算子之后,可以得到张量C=[6,2,5]和张量D=[9,2,5],与此同时,当张量C和张量D在经过concat算子之后,可以得到张量E=[15,2,5]。分析张量A和张量B(张量A和张量B为reshape算子的输入张量)可以知道的是,张量A和张量B中只有一个维度的长度不同(张量A中的维度6和张量B中的维度4),那么,在这种情况下,将concat算子的输出张量数据作为多个reshape算子的输入张量数据,从而其实现过程可以描述为:张量A=A=[3,4,5],张量B=[3,6,5],张量A和张量B在经过concat算子之后,可以得到张量C’=[3,10,5],与此同时,当张量C’在经过reshape算子之后,可以得到张量D’=[15,2,5]。可以理解的是,由于优化得到的与胶水子图语义等价的逻辑关系可以提高神经网络模型的整体性能,那么,当处理器(例如,通用处理器CPU、专用处理器人工智能处理器)在运行优化后的神经网络模型时,可以减少计算机设备的资源消耗。
需要说明的是,在本申请实施例中,当多个reshape算子为连续的多个reshape算子时,可以对这多个连续的reshape算子进行合并,得到一个reshape算子。例如,reshape1算子与reshape2相邻,张量A=[A1,A2,A3,...,An],当对张量A经过reshape1算子之后,可以得到张量B,其中,张量B=[B1,B2,B3,...,Bn]。与此同时,当张量B经过reshape2算子之后,得到张量C,其中,张量C=[C1,C2,C3,...,Cn]。可以理解是,将reshape1算子与reshape2算子合并得到的reshape3算子的输入是A张量,输出为C张量。例如,A=[1,32,1,1],经过reshape1算子之后,变为B=[1,4,4,2],再经过reshape2算子之后,变为C=[16,2]。采用本申请描述的技术方案,对reshape1算子以及reshape2算子进行合并,可以得到reshape3算子,张量A在经过reshape3算子之后,直接从张量A=[1,32,1,1]变为张量C=[16,2]。可以理解的是,当处理器(例如,通用处理器CPU、专用处理器人工智能处理器)在运行神经网络模型时,这里,由于神经网络模型为优化后的模型,可以减少计算机设备的资源消耗的目的。
第二种情形:多个transpose算子的输出张量数据是concat算子的输入张量数据。
具体实现中,所述胶水算子的逻辑关系包括多个transpose算子的输出张量数据是concat算子的输入张量数据。在这种情况下,计算机设备根据胶水算子的逻辑关系确定与“多个transpose算子和concat算子”这一胶水子图语义等价的逻辑关系,可以包括:在所述多个transpose算子各自对应的perm参数相同的情况下,将所述concat算子的输出张量数据作为所述多个transpose算子的输入张量数据。
如前所述,transpose算子可以表示为:tf.transpose(a,perm=None,name=’transpose’),那么,可以知道的是,transpose算子包含有perm参数。在本申请实施例中,perm参数为自然数列[1,2,3,...,n]的一个全排列,不同的全排列表示不同的transpose算子。
具体地,全排队被定义为:从n个不同元素中任意取m(m小于等于n)个元素,按照一定的顺序排列起来,叫做从n个不同元素中取出m个元素的一个排列。当m=n时所有的排列情况叫做全排列。例如,1,2,3三个元素的全排列可以为:1,2,3;1,3,2;2,1,3;2,3,1;3,1,2;3,2,1。
在本申请实施例中,多个transpose算子各自对应的perm参数相同是指:多个transpose算子各自对应的全排队相同。
在本申请实施例中,如图7J中的a所示,神经网络模型对应的计算图中包含concat算子与多个transpose算子,其中,多个transpose算子的输出张量数据是concat算子的输入张量数据,当这多个transpose算子各自对应的perm参数相同的情况下,如图7J中的b所示,将concat算子的输出张量数据作为多个transpose算子的输入张量数据,从而可以得到与胶水子图语义等价的逻辑关系。
为了便于理解,下面结合具体的实例进行阐述,张量A=[3,4,5],张量B=[3,6,5],张量A和张量B在经过各自对应的transpose算子之后,具体地,多个transpose各自对应的perm参数为[1,0,2],可以得到张量C=[4,3,5]和张量D=[6,3,5],与此同时,当张量C和张量D在经过concat算子之后,可以得到张量E=[10,3,5]。那么,在这种情况下,将concat算子的输出张量数据作为多个transpose算子的输入张量数据,从而其实现过程可以描述为:张量A=[3,4,5],张量B=[3,6,5],张量A和张量B在经过concat算子之后,可以得到张量C’=[3,10,5],与此同时,当张量C’在经过transpose算子之后,可以得到张量D’=[10,3,5]。可以理解的是,由于优化得到的与胶水子图语义等价的逻辑关系,那么,当处理器(例如,通用处理器CPU、专用处理器人工智能处理器)在运行优化后的神经网络模型时,可以减少计算机设备的资源消耗。
需要说明的是,在本申请实施例中,当多个transpose算子为连续的多个transpose算子时,可以对这多个连续的transpose算子进行合并,得到一个transpose算子。具体地,连续M个transpose算子包括第一transpose算子和第二transpose算子;所述将所述连续M个transpose算子合并为一个transpose算子,包括:
确定所述第一transpose算子以及所述第二transpose算子各自对应的perm参数;
根据所述第一transpose算子以及所述第二transpose算子各自对应的perm参数确定第一参数,其中,所述第一参数为合并后的transpose算子对应的perm参数。
具体实现中,所述根据所述第一transpose算子以及所述第二transpose算子各自对应的perm参数确定第一参数,包括:在确定所述第一参数时,根据以下公式来计算:perm3[i]=perm1[perm2[i]],其中,perm3表示所述第一参数,perm1表示所述第一transpose算子对应的perm参数,perm2表示所述第二transpose算子对应的perm参数。这里,中括号[]表示取数组中的元素。
例如,第一transpose算子对应的perm参数为perm1=[1,2],第二transpose算子对应的perm参数为perm2=[2,1],当i=1时,perm3[1]=perm1[perm2[1]]=2。当i=2时,perm3[2]=perm1[perm2[2]]=1。从而可以得到合并后的transpose算子对应的perm参数perm3=[2,1]。进一步地,合并后的transpose算子在确定好的perm3参数下调换张量的顺序。
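上述perm参数的合并过程可以用如下示意性代码草图表示(函数命名为本说明引入的假设,下标采用0起始):

```python
def merge_transpose_perms(perm1, perm2):
    # 按公式 perm3[i] = perm1[perm2[i]] 合并两个相邻transpose算子的perm参数
    return [perm1[i] for i in perm2]

# 文中示例:transpose_1423(perm1)与transpose_1243(perm2)合并为transpose_1432
assert merge_transpose_perms([0, 3, 1, 2], [0, 1, 3, 2]) == [0, 3, 2, 1]
```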
为了便于理解,下面结合具体的实例进行阐述。例如,transpose_1423算子和transpose_1243算子相邻,张量A=[1,4,3,2],经过transpose_1423算子之后,变为张量B=[1,2,4,3],再经过transpose_1243算子之后,变为张量C=[1,2,3,4]。采用本申请所描述的技术方案,对transpose_1423算子以及transpose_1243算子进行合并,可以得到transpose_1432算子,张量A在经过transpose_1432算子之后,直接从张量A=[1,4,3,2]变为张量C=[1,2,3,4]。当处理器(例如,通用处理器CPU、专用处理器人工智能处理器)在运行神经网络模型时,由于神经网络模型为优化后的模型,可以达到减少计算机设备的资源消耗的目的。
第三种情形:split算子的输出张量数据是concat算子的输入张量数据。
具体实现中,所述胶水算子的逻辑关系包括split算子的输出张量数据是concat算子的输入张量数据。在这种情况下,计算机设备根据胶水算子的逻辑关系确定与“split算子和concat算子”这一胶水子图语义等价的逻辑关系,可以包括:在所述concat算子与所述split算子各自操作的维度相同的情况下,将所述concat算子与所述split算子合并消除。
在本申请实施例中,如图7K中的a所示,神经网络模型对应的计算图中包含concat算子与split算子,其中,split算子的输出张量数据是concat算子的输入张量数据,在满足concat算子与split算子各自操作的维度相同的情况下,例如,concat算子与split算子在执行过程中操作的都是C维度,在这种情况下,如图7K中的b所示,将concat算子与split算子合并消除。
为了便于理解,下面结合具体的实例进行阐述,张量A=[3,10,5],张量A在经过split算子之后,可以得到张量B=[3,4,5]和张量C=[3,6,5],与此同时,当张量B和张量C在经过concat算子之后,可以得到张量D=[3,10,5]。由于split算子和concat算子各自操作的维度相同,并且split算子的输出张量数据都是concat算子的输入张量数据,那么,在这种情况下,将concat算子与split算子合并消除。可以理解的是,由于上述优化操作可以提高神经网络模型的整体性能,那么,当处理器(例如,通用处理器CPU、专用处理器人工智能处理器)在运行优化后的神经网络模型时,可以减少计算机设备的资源消耗。
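下面给出一段基于numpy的示意性代码草图(numpy仅作示意性假设),用于说明操作同一维度的split算子与concat算子互为逆操作、可以合并消除:

```python
import numpy as np

A = np.random.rand(3, 10, 5)                   # 张量A=[3,10,5]
B, C = np.split(A, [4], axis=1)                # split算子:得到[3,4,5]和[3,6,5]
D = np.concatenate([B, C], axis=1)             # concat算子:沿同一维度拼接
assert np.array_equal(A, D)                    # 结果与输入相同,两个算子可合并消除
```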
第四种情形:N个连续的concat算子。
具体实现中,所述胶水算子的逻辑关系可以包括N个连续的concat算子;其中,N为大于等于2的正整数。在这种情况下,计算机设备根据胶水算子的逻辑关系确定与“多个concat算子”这一胶水子图语义等价的逻辑关系,可以包括:
在所述N个连续的concat算子各自操作的维度为同一个维度的情况下,将所述N个连续的concat算子进行合并。
在本申请实施例中,如图7L中的a所示,神经网络模型对应的计算图中包含多个concat算子,这多个concat算子所操作的是同一个维度,例如,N维度,在这种情况下,计算机设备可以对这多个concat算子进行合并,得到一个concat算子,具体地,请参见图7L中的b所示的优化结构,也即优化得到的与胶水子图语义等价的逻辑关系。
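对于操作同一维度的多个连续concat算子的合并,可以用如下基于numpy的示意性代码草图加以说明(numpy仅作示意性假设):

```python
import numpy as np

A = np.random.rand(3, 4, 5)
B = np.random.rand(3, 6, 5)
C = np.random.rand(3, 2, 5)

# 两个连续的concat算子(操作同一维度)
out1 = np.concatenate([np.concatenate([A, B], axis=1), C], axis=1)
# 合并后的一个concat算子
out2 = np.concatenate([A, B, C], axis=1)
assert np.array_equal(out1, out2)
```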
(4)split算子的等效规则:具体实现中,胶水算子的逻辑关系可以包括split算子间的逻辑关系,或,所述split算子与第四类其他算子的逻辑关系;这里,第四类其他算子包括reshape算子、transpose算子、concat算子中的任意一种算子。
在其中一种可能的实现方式中,胶水算子的逻辑关系包括split算子间的逻辑关系,例如,多个连续的split算子;在另一种可能的实现方式中,胶水算子的逻辑关系包括split算子与其他算子的逻辑关系,例如,split算子与reshape算子相邻;又例如,split算子与transpose算子相邻;又例如,split算子与concat算子相邻,等等。这里,算子与算子相邻用于表征一个算子的输出张量数据作为另一个算子的输入张量数据。
第一种情形:reshape算子的输出张量数据是split算子的输入张量数据。
具体实现中,所述胶水算子的逻辑关系包括reshape算子的输出张量数据是split算子的输入张量数据。在这种情况下,计算机设备根据胶水算子的逻辑关系确定与“reshape算子和split算子”这一胶水子图语义等价的逻辑关系,可以包括:在由输出到输入逆向推导所述reshape算子的过程中,作为所述输出的一部分的所述split算子所操作的维度k_0+k_1+...+k_m在所述逆向推导过程中被拆分成p_0×p_1×...×(k_0/∏_i p_i+k_1/∏_i p_i+...+k_m/∏_i p_i)×...×p_{n-1}×p_n,将所述split算子的输出张量数据作为所述reshape算子的输入张量数据。
在本申请实施例中,如图7M中的a所示,神经网络模型对应的计算图中包含split算子与reshape算子,其中,reshape算子的输出张量数据是split算子的输入张量数据,在由输出到输入逆向推导reshape算子的过程中,作为输出的一部分的split算子所操作的维度k_0+k_1+...+k_m在逆向推导过程中被拆分成形如p_0×p_1×...×(k_0/∏_i p_i+k_1/∏_i p_i+...+k_m/∏_i p_i)×...×p_{n-1}×p_n的形式,如图7M中的b所示,将split算子的输出张量数据作为reshape算子的输入张量数据,从而可以得到与胶水子图语义等价的逻辑关系。
为了便于理解,下面结合具体的实例进行阐述,张量A=[3,10,5],张量A在经过reshape算子之后,可以得到张量B=[15,2,5],与此同时,当张量B经过split算子之后,可以得到张量C=[6,2,5]和张量D=[9,2,5],也即将维度15拆分成维度6和维度9。当逆向推导reshape算子时,维度15在reshape算子的过程中被拆分成了{3,5},而维度15可以表示为3*(6/3+9/3),那么,在这种情况下,将split算子的输出张量数据作为reshape算子的输入张量数据,从而其实现过程可以描述为:张量A=[3,10,5],张量A在经过split算子之后,可以得到张量B’=[3,4,5]和张量C’=[3,6,5],与此同时,当张量B’和张量C’在经过各自对应的reshape算子之后,可以得到张量D’=[6,2,5]和张量E’=[9,2,5]。可以理解的是,由于上述优化操作可以提高神经网络模型的整体性能,那么,当处理器(例如,通用处理器CPU、专用处理器人工智能处理器)在运行优化后的神经网络模型时,可以减少计算机设备的资源消耗。
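下面给出一段基于numpy的示意性代码草图(numpy仅作示意性假设),用于核对上述reshape算子与split算子交换前后在形状层面的对应关系:

```python
import numpy as np

A = np.random.rand(3, 10, 5)                   # 张量A=[3,10,5]

# 原始胶水子图:先reshape,再split
B = A.reshape(15, 2, 5)
C, D = np.split(B, [6], axis=0)                # 张量C=[6,2,5],张量D=[9,2,5]

# 等价逻辑关系:先split,再分别reshape(此处仅示意形状层面的推导)
B2, C2 = np.split(A, [4], axis=1)              # 张量B'=[3,4,5],张量C'=[3,6,5]
D2, E2 = B2.reshape(6, 2, 5), C2.reshape(9, 2, 5)

assert C.shape == D2.shape and D.shape == E2.shape
```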
第二种情形:transpose算子的输出张量数据是split算子的输入张量数据。
具体实现中,所述胶水算子的逻辑关系包括transpose算子的输出张量数据是split算子的输入张量数据。在这种情况下,计算机设备根据胶水算子的逻辑关系确定与“transpose算子和split算子”这一胶水子图语义等价的逻辑关系,可以包括:
将所述split算子的输出张量数据作为所述transpose算子的输入张量数据。
在本申请实施例中,如图7N中的a所示,神经网络模型对应的计算图中包含split算子和transpose算子,其中,transpose算子的输出张量数据是split算子的输入张量数据,在这种情况下,如图7N中的b所示,将split算子的输出张量数据作为transpose算子的输入张量数据,从而可以得到与胶水子图语义等价的逻辑关系。
为了便于理解,下面结合具体的实例进行阐述,张量A=[3,10,5],张量A在经过transpose算子之后,可以得到张量B=[10,3,5],与此同时,当张量B在经过split算子之后,可以得到张量C=[4,3,5]和张量D=[6,3,5],那么,在这种情况下,将split算子的输出张量数据作为transpose算子的输入张量数据,从而其实现过程可以描述为:张量A=[3,10,5],张量A在经过split算子之后,可以得到张量B’=[3,4,5]和张量C’=[3,6,5],与此同时,当张量B’和张量C’在经过各自对应的transpose算子之后,可以得到张量D'=[4,3,5]和张量E'=[6,3,5]。可以理解的是,由于优化得到的与胶水子图语义等价的逻辑关系可以提高神经网络模型的整体性能,那么,当处理器(例如,通用处理器CPU、专用处理器人工智能处理器)在运行优化后的神经网络模型时,可以减少计算机设备的资源消耗。
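下面给出一段基于numpy的示意性代码草图(numpy仅作示意性假设),用于核对上述transpose算子与split算子交换前后的结果:

```python
import numpy as np

A = np.random.rand(3, 10, 5)                   # 张量A=[3,10,5]
perm = (1, 0, 2)

# 原始胶水子图:先transpose,再split
C, D = np.split(A.transpose(perm), [4], axis=0)        # [4,3,5]和[6,3,5]

# 等价逻辑关系:先split,再分别transpose
B2, C2 = np.split(A, [4], axis=1)                       # [3,4,5]和[3,6,5]
assert np.array_equal(C, B2.transpose(perm))
assert np.array_equal(D, C2.transpose(perm))
```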
第三种情形:concat算子的输出张量数据是split算子的输入张量数据。
具体实现中,所述胶水算子的逻辑关系包括concat算子的输出张量数据是split算子的输入张量数据。在这种情况下,计算机设备根据胶水算子的逻辑关系确定与“concat算子和split算子”这一胶水子图语义等价的逻辑关系,可以包括:在所述concat算子与所述split算子各自操作的维度相同的情况下,将所述concat算子与所述split算子合并消除。
在本申请实施例中,如图7O中的a所示,神经网络模型对应的计算图中包含split算子和concat算子,其中,concat算子的输出张量数据是split算子的输入张量数据,在满足concat算子与split算子在语义上互为逆操作的情况下,例如,concat算子与split算子在执行过程中操作的都是C维度,在这种情况下,如图7O中的b所示,将concat算子与split算子合并消除。
为了便于理解,下面结合具体的实例进行阐述,张量A=[3,4,5]和张量B=[3,6,5],张量A和张量B在经过concat算子之后,可以得到张量C=[3,10,5],与此同时,当张量C在经过split算子之后,可以得到张量D=[3,4,5]和E=[3,6,5]。由于concat算子与split算子各自操作的维度相同,并且在语义上互为逆操作,那么,在这种情况下,将concat算子与split算子合并消除。可以理解的是,由于上述优化操作可以提高神经网络模型的整体性能,那么,当处理器(例如,通用处理器CPU、专用处理器人工智能处理器)在运行优化后的神经网络模型时,可以减少计算机设备的资源消耗。
第四种情形:N个连续的split算子。
具体实现中,所述胶水算子的逻辑关系包括N个连续的split算子;其中,N为大于等于2的正整数。在这种情况下,计算机设备根据胶水算子的逻辑关系确定与“多个split算子”这一胶水子图语义等价的逻辑关系,可以包括:在所述N个连续的split算子各自操作的维度为同一个维度的情况下,将所述N个连续的split算子进行合并。
在本申请实施例中,如图7P中的a所示,神经网络模型对应的计算图中包含多个split算子,这多个split算子所操作的是同一个维度,例如,N维度,在这种情况下,计算机设备可以对这多个split算子进行合并,得到一个split算子,具体地,请参见图7P中的b所示的优化结构,也即与胶水子图语义等价的逻辑关系。
在本申请实施例中,基于本申请所描述的等效规则,我们可以对胶水子图进行扩充,从而搭建出多条与胶水子图语义等价的新的算子路径。例如,如图8A所示,左侧是胶水子图的原始结构,其中,形如张量数据(A0,A1,A2,A3)首先经过reshape算子变为张量数据(A0,A1*A2,A3),再经过transpose算子变为张量数据(A0,A3,A1*A2),最后通过split算子被拆分成两个子张量数据。右侧为基于预设的等效规则进行扩充后的胶水子图,其中,加粗部分代表的是胶水子图中原有的拓扑关系。从图8A可以知道的是,在胶水子图原有的拓扑关系之外,还存在多种不同的方式可以由原始子图的输入张量数据(A0,A1,A2,A3)得到原始子图的输出张量数据(A0,A3_0,A1*A2)和(A0,A3_1,A1*A2)。
在本申请实施例中,考虑到在胶水子图中加入与胶水子图的语义等价的逻辑关系之后,图中加入了新的算子或者图中原有的算子之间的连接关系发生了变化,在这种情况下,对新算子和被改变连接关系的算子的后继算子采用如上描述的方法来确定相应的等价逻辑关系,并将等价逻辑关系加入胶水子图中,以得到扩充后的胶水子图。
具体实现中,所述将所述至少两个胶水算子对应的等价逻辑关系加入所述胶水子图中之后,还包括:在满足加入的等价逻辑关系改变所述胶水子图中包含的胶水算子之间原先具有的有向边的情况下,根据改变后的胶水子图中胶水算子之间具有的有向边和所述等效规则确定所述改变后的胶水子图中位置相邻的至少两个胶水算子对应的等价逻辑关系,直至所述胶水子图无法通过所述等效规则进行扩充。
在本申请实施例中,在满足等效规则的情况下,将与胶水子图语义等价的逻辑关系加入胶水子图的过程中:
A211、如果当前算子和前一个算子互为逆操作,意味着当前算子和前一个算子构成的算子序列的起点张量数据和终点张量数据是同一个张量,在这种情况下,合并这两个张量,得到一个张量。
A212、如果将要加入胶水子图中的张量或算子已经存在于胶水子图中,在这种情况下,直接使用胶水子图中的该张量或算子。
A213、扩充得到的胶水子图中不存在重复的算子序列。
在本申请实施例中,经过扩充后的胶水子图满足约束:对胶水子图中任意一组满足等效规则的算子的拓扑结构,其经过变换后的算子拓扑结构同样存在于扩充后的胶水子图中,即扩充后的胶水子图是一个基于等效规则的闭包。这一约束使得扩充后的胶水子图不可能再次通过等效规则进行进一步的扩充,从而可以保证扩充后的胶水子图中已经包含了尽可能多的等价逻辑关系的拓扑结构,这有利于接下来从扩充后的胶水子图中获取对人工智能处理器性能最优的目标子图。
在本申请实施例中,通过这一实现方式,可以保证胶水子图中的每个胶水算子,无论是原始胶水子图中已有的,或者是之后添加的,都会确定位置相邻的至少两个胶水算子是否可以根据等效规则进行优化。其次,在确定了位置相邻的至少两个胶水算子的等价逻辑关系之后,将其加入胶水子图中。最后,会再次确定加入胶水子图的新算子或者改变已有算子的连接关系的算子的后一个算子是否可以根据等效规则进行优化,从而可以保证不会遗漏那些由于胶水子图的结构发生变化而引入的新的逻辑关系。
步骤A22、对所述扩充后的胶水子图进行转换,得到与胶水算子关联的张量数据的状态集合图。
在本申请实施例中,与胶水算子关联的张量数据的状态集合图中任意一条从起始状态到终点状态的路径用于表征重构后的子图,重构后的子图即为胶水子图的优化方式。
在本申请实施例中,将扩充后的胶水子图进行转换的原因在于:扩充后的胶水子图用于描述构建算子序列的等价逻辑关系的实现过程,并不能基于扩充后的胶水子图确定目标子图。
具体实现中,所述对所述扩充后的胶水子图进行转换,得到与胶水算子关联的张量数据的状态集合图,包括:
确定所述扩充后的胶水子图中的胶水算子的类型以及所述胶水算子之间的逻辑关系;
基于所述扩充后的胶水子图中的胶水算子的类型以及所述胶水算子之间的逻辑关系,根据所述扩充后的胶水子图中的胶水算子对应的输入张量数据确定对应的输出张量数据;
根据所述扩充后的胶水子图中的胶水算子的输入张量数据和输出张量数据确定与胶水算子关联的张量数据的状态集合图。
在本申请实施例中,扩充后的胶水子图中的所有张量都有唯一的编号{0,1,2,......,n},图中的所有输入张量中的数据被作为一个整体D,D的数据被划分并组合成不同的张量,每种张量的组合方式都可以被看成是D的一种状态。在最开始阶段,D的状态可以表示为所有输入张量的编号的集合{s0,s1,...,sm},其最终目标是使D变成状态{e0,e1,...,en},其中,ei是第i个输出张量的编号。由输入开始,每个与输入张量关联的胶水算子将当前D所对应的所有张量中的至少一个张量变成另外的一个或多个张量,也就是代表D的状态的编号集合发生了变化,例如,由一个编号状态集合变成了另一个编号状态集合。由此,可以得到一个由D的各种状态和胶水算子所表示的状态之间的有向边构成的图结构,也即状态集合图。
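为了便于理解由扩充后的胶水子图构造状态集合图的过程,下面给出一段示意性代码草图。该草图并非本申请给出的实现:其中以张量编号集合表示状态D,每个胶水算子表示为(算子名,输入编号集合,输出编号集合)的三元组,函数与数据结构命名均为本说明引入的假设:

```python
from collections import deque

def build_state_graph(start_ids, ops):
    """以张量编号集合作为状态,对扩充后的胶水子图进行状态展开,返回状态集合图。"""
    start = frozenset(start_ids)
    graph, queue, visited = {}, deque([start]), {start}
    while queue:
        state = queue.popleft()
        for name, ins, outs in ops:
            if set(ins) <= state:                          # 该算子的输入张量均处于当前状态中
                nxt = frozenset((state - set(ins)) | set(outs))
                graph.setdefault(state, []).append((name, nxt))
                if nxt not in visited:
                    visited.add(nxt)
                    queue.append(nxt)
    return graph

# 对应图8B/图8D的示例:①(2,3,5) ②(2,4,5) ③(2,15,1) ④(2,20,1) ⑤(2,7,5) ⑥(2,35,1)
ops = [("reshape1", {1}, {3}), ("reshape2", {2}, {4}),
       ("concat_3_4", {3, 4}, {6}), ("concat_1_2", {1, 2}, {5}), ("reshape_5", {5}, {6})]
g = build_state_graph({1, 2}, ops)
```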
为了便于理解,下面结合具体的实例进行阐述。参见图8B,为本申请实施例提供的一种胶水子图的结构示意图,如图8B所示,该胶水子图中包含两个reshape算子和一个concat算子。具体地,张量数据(2,3,5)在经过reshape算子1之后,可以得到张量数据(2,15,1);张量数据(2,4,5)在经过reshape算子2之后,可以得到张量数据(2,20,1)。此外,张量数据(2,15,1)和张量数据(2,20,1)在经过concat算子之后,可以得到张量数据(2,35,1)。从上述实现过程可以知道的是,由于多个reshape算子各自对应的输入张量至多只有一个维度的长度不同,在这种情况下,可以将concat算子的输出张量数据作为多个reshape算子的输入张量数据。具体地,确定好的与胶水子图语义等价的逻辑关系可以如图8C所示。那么,在这种情况下,张量数据(2,3,5)和张量数据(2,4,5)在经过concat算子之后,可以得到张量数据(2,7,5);张量数据(2,7,5)在经过reshape算子之后,可以得到张量数据(2,35,1)。此外,需要说明的是,该胶水子图中并无其他可以优化的逻辑关系。
基于上述确定好的等价逻辑关系之后,计算机设备将上述等价逻辑关系加入胶水子图中,得到扩充后的胶水子图,具体地,请参见图8D。在得到扩充后的胶水子图之后,计算机设备将扩充后的胶水子图进行转换,以得到状态集合图。在最开始阶段,D的状态可以表示为所有输入张量的编号的集合,具体地,可以如图8E所示。其中,张量数据(2,3,5)用编号①表示,张量数据(2,4,5)用编号②表示,张量数据(2,15,1)用编号③表示,张量数据(2,20,1)用编号④表示,张量数据(2,7,5)用编号⑤表示,张量数据(2,35,1)用编号⑥表示。接下来具体阐述将扩充后的胶水子图转换状态集合图的实现过程:
步骤1:由输入开始,张量数据(2,3,5)①和张量数据(2,4,5)②构成了输入张量的编号状态集合1,具体地,编号状态集合1可以表示为{①,②},其对应的转换示意图可以如图8F所示;
步骤2:在步骤1的基础上,与输入张量数据(2,3,5)关联的reshape算子将当前D所对应的张量进行转换,可以得到编号状态集合2,具体地,编号状态集合2可以表示为{③,②},其对应的转换示意图可以如图8G所示;
步骤3:在步骤2的基础上,与输入张量数据(2,4,5)关联的reshape算子将当前D所对应的张量进行转换,可以得到编号状态集合3,具体地,编号状态集合3可以表示为{①,④},其对应的转换示意图可以如图8H所示;
步骤4:在步骤3的基础上,与输入张量数据(2,4,5)关联的reshape算子将当前D所对应的张量进行转换,可以得到编号状态集合4,具体地,编号状态集合4可以表示为{③,④},其对应的转换示意图可以如图8I所示;
步骤5:在步骤4的基础上,与输入张量数据(2,3,5)关联的reshape算子将当前D所对应的张量进行转换,编号状态{①,④}可以转换为编号状态{③,④},其对应的转换示意图可以如图8J所示;
步骤6:在步骤5的基础上,与输入张量数据(2,15,1)、输入张量数据(2,20,1)关联的concat算子将当前D所对应的张量进行转换,可以得到编号状态集合5,具体地,编号状态集合5可以表示为{⑥},其对应的转换示意图可以如图8K所示;
步骤7:在步骤6的基础上,与输入张量数据(2,3,5)、输入张量数据(2,4,5)关联的concat算子将当前D所对应的张量进行转换,可以得到编号状态集合6,具体地,编号状态集合6可以表示为{⑤},其对应的转换示意图可以如图8L所示;
步骤8:在步骤7的基础上,与输入张量数据(2,7,5)关联的reshape算子将当前D所对应的张量进行转换,编号状态{⑤}可以转换为编号状态{⑥},其对应的转换示意图可以如图8M所示。
在本申请实施例中,图8M即为计算机设备将扩充后的胶水子图进行转换后得到状态集合图。那么,在这种情况下,可以在图8M中确定目标子图。
步骤A23、遍历所述状态集合图,获得所述重构结果子图集合。
在本申请实施例中,遍历所述状态集合图,确定相邻算子之间的状态路径以及状态路径的权重。
在本申请实施例中,状态路径的权重用于表征算子在执行过程中的性能优劣,例如,权重越小,表示算子在执行过程中的性能越优;又例如,权重越大,表示算子在执行过程中的性能越优,本申请实施例不作具体限定。在确定算子的权重时,往往需要结合算子的输入数据的形状、规模进行考虑。为了便于阐述,在本申请实施例中,以权重越小,性能越优作为一种示例进行说明。
在本申请实施例中,以图8M为例,其中,张量数据(2,3,5)和张量数据(2,4,5)为起始状态,张量数据(2,35,1)为终止状态。由图8M可以知道的是,图8M中包括多条从起始状态到终点状态的路径,这里,任意一条由起点状态到终点状态的路径都对应着一种重构后的语义等效的胶水子图的结构,我们的目标在于在多条状态路径中确定最短路径。
具体地,可以通过遍历图8M所示状态集合图,确定相邻算子之间的状态路径以及状态路径的权重。例如,图8M所示的状态集合图中包含3条路径,分别为路径1、路径2和路径3。其中,计算机设备确定路径1上的算子的权重和为10,路径2上的算子的权重和为15,路径3上的算子的权重和为17。这里,从起始状态到终止状态之间的一条路径用于表征一个重构结果子图。
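在状态集合图中按权重和最小确定目标子图的过程,可以用如下示意性代码草图表示(类似最短路径搜索;函数命名与权重函数op_weight均为本说明引入的假设,实际权重可结合算子类型、输入数据的形状与规模确定):

```python
import heapq
from itertools import count

def find_min_weight_path(graph, start, goal, op_weight):
    """在状态集合图中搜索由起始状态到终点状态、权重和最小的路径。"""
    seq = count()                                          # 仅用于避免直接比较状态集合
    heap, best = [(0, next(seq), start, [])], {}
    while heap:
        cost, _, state, path = heapq.heappop(heap)
        if state == goal:
            return cost, path                              # 该路径即对应目标子图
        if state in best and best[state] <= cost:
            continue
        best[state] = cost
        for name, nxt in graph.get(state, []):
            heapq.heappush(heap, (cost + op_weight(name), next(seq), nxt, path + [name]))
    return None

# 示例用法(接上文构造的状态集合图g,此处假设各算子权重均为1):
# cost, path = find_min_weight_path(g, frozenset({1, 2}), frozenset({6}), lambda name: 1)
```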
从而,通用处理器可以根据所述状态路径的权重确定目标子图,并根据所述目标子图对所述神经网络模型进行优化,得到优化后的神经网络模型。
S624、从所述重构结果子图集合中确定目标子图。
在本申请实施例中,所述从所述重构结果子图集合中确定目标子图,包括:根据所述重构结果子图集合中权重和最小的重构结果子图确定为所述目标子图;或根据所述重构结果子图集合中权重和小于预设阈值的重构结果子图确定为所述目标子图。
在本申请实施例中,当计算机设备确定每条路径上的权重和之后,计算机设备可以在多条路径中选择权重和最小的路径作为目标子图。例如,计算机设备确定路径1上的算子的权重和为10,路径2上的算子的权重和为15,路径3上的算子的权重和为17,在这种情况下,计算机设备确定路径1为目标子图,也即,计算机设备确定路径1为重构后的性能最优的子图。
需要说明的是,上述获取目标子图的方式类似于viterbi算法,此处仅仅是列举的部分情况,而不是穷举,本领域技术人员在理解本申请技术方案的精髓的情况下,可能会在本申请技术方案的基础上产生其它的变形或者变换,比如:根据经验设置一阈值,状态路径的权重小于设定的阈值,就可以将其作为目标子图,从而可以根据目标子图对神经网络模型进行优化。但只要其实现的功能以及达到的技术效果与本申请类似,那么均应当属于本申请的保护范围。
S626、将所述目标子图替换所述计算图中对应的胶水子图,获取优化后的计算图。
如前所述,例如,计算机设备确定路径1上的算子的权重和为10,路径2上的算子的权重和为15,路径3上的算子的权重和为17,在这种情况下,计算机设备确定路径1为目标子图,也即,计算机设备确定路径1为重构后的性能最优的子图,此时,计算机设备将神经网络模型中原始胶水子图替换为路径1构成的子图,从而可以实现对神经网络模型的优化,以提高神经网络模型的整体性能。
S628、根据所述优化后的计算图获取对应的二进制指令,以分配至对应人工智能处理器上执行任务。
在本申请实施例中,通用处理器可以根据优化后的计算图,调用已设置好的人工智能学习库的编译接口来编译,获得对应的二进制指令。该二进制指令经运行时库处理生成机器学习处理任务。在实际应用中,通用处理器可以将机器学习处理任务放入任务队列,最终由驱动器调度任务队列中的机器学习处理任务让人工智能处理器执行,得到运行结果。
本申请实施例中,机器学习处理任务是指,神经网络模型通过获取学习能力,以完成某项任务。这里,机器学习处理任务可以包括图像识别,边缘检测,语义分析,等等。具体地,为了提高神经网络模型的实用性,不同的神经网络模型对应不同的机器学习处理任务。例如,深度学习神经网络模型对应的机器学习处理任务可以为图像分类,文本分类等;卷积神经网络模型对应的机器学习处理任务可以为图像识别,视频分类等;长短时记忆神经网络模型(Long Short Term Memory Network,LSTM)对应的机器学习处理任务可以为语音识别、图片描述、自然语言处理等。
在本申请实施例中,机器学习处理任务的请求可以为用户针对神经网络模型输入的执行指令。当计算机设备在接收到机器学习处理任务的请求时,根据机器学习处理任务的类型获取对应的神经网络模型,并在人工智能处理器上运行神经网络模型,继而可以得到针对机器学习处理任务的运行结果。需要说明的是,处理器(例如,通用处理器,人工智能处理器)运行的神经网络模型为经过优化后的神经网络模型。
在本申请实施例中,机器学习处理任务的运行结果是指,计算机设备执行机器学习处理任务时的结果,可以包括但不限于:执行机器学习处理任务时,神经网络模型的精度;执行机器学习处理任务时,神经网络模型的运行时间等等。进一步可选的,计算机设备可以输出该运行结果,例如,计算机设备通过显示屏显示该运行结果。可以理解的是,由于对神经网络模型对应的计算图进行了优化,将重构后性能更优的子图替换原先的胶水子图,可以提高神经网络模型的整体性能,使得人工智能处理器在调用优化后的神经网络模型执行机器学习处理任务时,可以减少冗余计算,进而可以减少计算机设备的资源消耗。
实施本申请实施例,计算机设备对包含多个胶水算子的胶水子图,通过重构子图的方式获取胶水子图对应的优化结构,并根据重构后的子图对神经网络模型进行优化,这一实现方式可以提高神经网络模型的整体性能。此外,当在计算机设备运行优化后的神经网络模型时,可以减少计算机设备的资源消耗。
需要说明的是,对于前述的各方法实施例,为了简单描述,故将其都表述为一系列的动作组合,但是本领域技术人员应该知悉,本披露并不受所描述的动作顺序的限制,因为依据本披露,某些步骤可以采用其他顺序或者同时进行。其次,本领域技术人员也应该知悉,说明书中所描述的实施例均属于可选实施例,所涉及的动作和模块并不一定是本披露所必须的。
进一步需要说明的是,虽然图6A的流程图中的各个步骤按照箭头的指示依次显示,但是这些步骤并不是必然按照箭头指示的顺序依次执行。除非本文中有明确的说明,这些步骤的执行并没有严格的顺序限制,这些步骤可以以其它的顺序执行。而且,图6A中的至少一部分步骤可以包括多个子步骤或者多个阶段,这些子步骤或者阶段并不必然是在同一时刻执行完成,而是可以在不同的时刻执行,这些子步骤或者阶段的执行顺序也不必然是依次进行,而是可以与其它步骤或者其它步骤的子步骤或者阶段的至少一部分轮流或者交替地执行。
上述详细阐述了本申请实施例的方法,为了便于更好地实施本申请实施例的上述方案,相应地,下面还提供用于配合实施上述方案的相关装置。
参见图9,图9是本申请实施例提供的一种神经网络处理装置的结构示意图,该装置90至少可以包括:
第一获取单元910,用于获取神经网络模型对应的计算图;其中,所述神经网络模型包含多个算子;
第一确定单元912,用于在拆分策略集合中确定所述神经网络计算任务的目标拆分策略;其中,所述拆分策略集合为所述计算图中目标算子对应的拆分方式组成的集合;
拆分单元914,用于根据所述目标拆分策略对所述神经网络计算任务进行拆分,得到多个子计算任务;
执行单元916,用于在所述M个人工智能处理器核上分别调用所述多个子计算任务,得到运行结果。
在一种可能的实现方式中,所述装置90还可以包括:
第二确定单元918,用于根据计算图中目标算子对应的并行度、拆分维度、拆分维度大小确定所述目标算子对应的拆分方式;
第三确定单元920,用于根据所述目标算子对应的拆分方式确定所述拆分策略集合。
在一种可能的实现方式中,所述第三确定单元920具体用于:
将每个目标算子支持的拆分方式的交集确定为所述拆分策略集合。
在其中一种可能的实现方式中,所述第一确定单元912包括第一确定子单元和第二确定子单元;其中,
所述第一确定子单元,用于分别确定所述拆分策略集合中目标算子对应的拆分方式的权重值;
所述第二确定子单元,用于根据权重值确定所述目标拆分策略。
在一种可能的实现方式中,所述权重值为根据拆分策略中包含的目标算子的运算操作类型、目标算子涉及的数据规模和多核处理器的硬件参数确定的。
在一种可能的实现方式中,所述装置90还可以包括:
第二获取单元922,用于获取目标算子的运算操作类型;
第四确定单元924,用于根据所述目标算子的运算操作类型确定所述目标算子的拆分方式。
在一种可能的实施例中,参见图10,图10是本申请实施例提供的一种神经网络优化装置的结构示意图,该装置1000至少可以包括:
提取单元1010,用于在神经网络模型对应的计算图中提取胶水子图;其中,所述胶水子图是包含胶水算子的子图;所述胶水算子用于对所述计算图的张量数据进行调整;
处理单元1012,用于在确保所述胶水子图的输入张量数据、输出张量数据不变的情况下,对所述计算图中的所述胶水子图进行处理,获得重构结果子图集合;其中,所述重构结果子图集合中的任意一个重构结果子图的输入张量数据和输出张量数据分别与所述胶水子图的输入张量数据和输出张量数据相同;
确定单元1014,用于从所述重构结果子图集合中确定目标子图;
优化单元1016,用于将所述目标子图替换所述计算图中对应的胶水子图,获取优化后的计算图;
执行单元1018,用于根据所述优化后的计算图获取对应的二进制指令,以分配至对应人工智能处理器上执行任务。
在其中一种可能的实现方式中,所述处理单元1012包括扩充单元、转换单元和遍历单元;其中,
所述扩充单元,用于根据胶水算子的逻辑关系对所述胶水子图进行扩充,获得扩充后的胶水子图;所述转换单元,用于对所述扩充后的胶水子图进行转换,得到与胶水算子关联的张量数据的状态集合图;所述遍历单元,用于遍历所述状态集合图,获得所述重构结果子图集合。
在一种可能的实现方式中,所述扩充单元包括:第一扩充单元和第二扩充单元;其中,
第一扩充单元,用于根据等效规则对所述胶水子图中胶水算子之间的逻辑关系进行扩充,获得与所述胶水子图的语义等价的逻辑关系;第二扩充单元,用于根据与所述胶水子图的语义等价的逻辑关系对所述胶水子图进行扩充,获得所述扩充后的胶水子图。
在一种可能的实现方式中,所述等效规则包括reshape算子的等效规则、transpose算子的等效规则、concat算子的等效规则、split算子的等效规则中的至少一种。
在一种可能的实现方式中,所述第一扩充单元具体用于:对所述逻辑关系对应的算子序列进行变换,根据所述等效规则,确保获得所有与所述胶水子图的语义等价的逻辑关系。
在其中一种可能的实现方式中,所述转换单元具体用于:确定所述扩充后的胶水子图中的胶水算子的类型以及所述胶水算子之间的逻辑关系;基于所述扩充后的胶水子图中的胶水算子的类型以及所述胶水算子之间的逻辑关系,根据所述扩充后的胶水子图中的胶水算子对应的输入张量数据确定对应的输出张量数据;根据所述扩充后的胶水子图中的胶水算子的输入张量数据和输出张量数据确定与胶水算子关联的张量数据的状态集合图。
在其中一种可能的实现方式中,所述确定单元具体用于:根据所述重构结果子图集合中权重和最小的重构结果子图确定为所述目标子图;或根据所述重构结果子图集合中权重和小于预设阈值的重构结果子图确定为所述目标子图。
应该理解,上述的装置实施例仅是示意性的,本披露的装置还可通过其它的方式实现。例如,上述实施例中所述单元/模块的划分,仅仅为一种逻辑功能划分,实际实现时可以有另外的划分方式。例如,多个单元、模块或组件可以结合,或者可以集成到另一个系统,或一些特征可以忽略或不执行。
所述作为分离部件说明的单元或模块可以是物理上分开的,也可以不是物理上分开的。作为单元或模块说明的部件可以是物理单元,也可以不是物理单元,即可以位于一个装置中,或者也可以分布到多个装置上。本披露中实施例的方案可以根据实际的需要选择其中的部分或者全部单元来实现。
此外,这里需要指出的是,本申请实施例还提供了一种计算机存储介质,用于存储为上述图2所示的计算机设备所用的计算机软件指令,其包含用于执行上述方法实施例所涉及的程序。通过执行存储的程序,可以实现神经网络模型处理,以充分利用多核处理的资源。
由上可见,本申请实施例提供的神经网络处理方法、装置、计算机设备和存储介质,该方法通过将神经网络计算任务拆分成若干个规模更小的子计算任务,这样多核处理器可以直接调用单核架构下的计算库,充分利用了多核处理器的硬件资源,从而可以避免重现实现的额外工作量。
本领域内的技术人员应明白,本申请的实施例可提供为方法、系统、或计算机程序产品。因此,本申请可采用完全硬件实施例、完全软件实施例、或结合软件和硬件方面的实施例的形式。而且,本申请可采用在一个或多个其中包含有计算机可用程序代码的计算机可用存储介质(包括但不限于磁盘存储器和光学存储器等)上实施的计算机程序产品的形式。
本申请是参照根据本申请实施例的方法、设备(系统)、和计算机程序产品的流程图和/或方框图来描述的。应理解可由计算机程序指令实现流程图和/或方框图中的每一流程和/或方框、以及流程图和/或方框图中的流程和/或方框的结合。可提供这些计算机程序指令到通用计算机、专用计算机、嵌入式处理机或其他可编程数据处理设备的处理器以产生一个机器,使得通过计算机或其他可编程数据处理设备的处理器执行的指令产生用于实现在流程图一个流程或多个流程和/或方框图一个方框或多个方框中指定的功能的装置。
这些计算机程序指令也可存储在能引导计算机或其他可编程数据处理设备以特定方式工作的计算机可读存储器中,使得存储在该计算机可读存储器中的指令产生包括指令装置的制造品,该指令装置实现在流程图一个流程或多个流程和/或方框图一个方框或多个方框中指定的功能。
这些计算机程序指令也可装载到计算机或其他可编程数据处理设备上,使得在计算机或其他可编程设备上执行一系列操作步骤以产生计算机实现的处理,从而在计算机或其他可编程设备上执行的指令提供用于实现在流程图一个流程或多个流程和/或方框图一个方框或多个方框中指定的功能的步骤。
进一步地,依据以下条款可更好地理解前述内容:
例如,条款A1、一种神经网络处理方法,其特征在于,所述方法应用于人工智能处理器,所述人工智能处理器包括M个人工智能处理器核,M为大于1的正整数;所述方法包括:
获取神经网络模型对应的计算图;其中,所述神经网络模型包含多个算子;
在拆分策略集合中确定所述神经网络计算任务的目标拆分策略;其中,所述拆分策略集合为所述计算图中目标算子对应的拆分方式组成的集合;
根据所述目标拆分策略对所述神经网络计算任务进行拆分,得到多个子计算任务;
将所述子计算任务分配到人工智能处理器中的对应人工智能处理器核上进行处理。
A2、根据A1所述的方法,所述获取神经网络模型对应的计算图之后,所述在拆分策略集合中确定所述神经网络计算任务的目标拆分策略之前,还包括:
根据所述计算图中目标算子对应的并行度、拆分维度、拆分维度大小确定所述目标算子对应的拆分方式;
根据所述目标算子对应的拆分方式确定所述拆分策略集合。
A3、根据A2所述的方法,所述根据所述目标算子对应的拆分方式确定所述拆分策略集合,包括:
将每个目标算子支持的拆分方式的交集确定为所述拆分策略集合。
A4、根据A1所述的方法,所述在拆分策略集合中确定所述神经网络计算任务的目标拆分策略,包括:
分别确定所述拆分策略集合中目标算子对应的拆分方式的权重值;
根据权重值确定所述目标拆分策略。
A5、根据A4所述的方法,所述权重值为根据拆分策略中包含的目标算子的运算操作类型、目标算子涉及的数据规模和多核处理器的硬件参数确定的。
A6、根据A1-A4任一项所述的方法,所述方法还包括:
获取目标算子的运算操作类型;
根据所述目标算子的运算操作类型确定所述目标算子的拆分方式。
A7、根据A2所述的方法,所述目标算子对应的并行度包括第一并行度或第二并行度。
A8.根据A2所述的方法,所述目标算子对应的并行度包括第一并行度和第二并行度;其中,所述第一并行度乘以第二并行度的结果小于等于人工智能处理器中的人工智能处理器核的数目。
B1、一种神经网络处理装置,其特征在于,所述装置应用于人工智能处理器,所述人工智能处理器包括M个人工智能处理器核,M为大于1的正整数;所述装置包括:
第一获取单元,用于获取神经网络模型对应的计算图;其中,所述神经网络模型包含多个算子;
第一确定单元,用于在拆分策略集合中确定所述神经网络计算任务的目标拆分策略;其中,所述拆分策略集合为所述计算图中目标算子对应的拆分方式组成的集合;
拆分单元,用于根据所述目标拆分策略对所述神经网络计算任务进行拆分,得到多个子计算任务;
执行单元,用于将所述子计算任务分配到人工智能处理器中的对应人工智能处理器核上进行处理。
B2、根据B1所述的装置,所述装置还包括:
第二确定单元,用于根据计算图中目标算子对应的并行度、拆分维度、拆分维度大小确定所述目标算子对应的拆分方式;
第三确定单元,用于根据所述目标算子对应的拆分方式确定所述拆分策略集合。
B3、根据B2所述的装置,所述第三确定单元具体用于:
将每个目标算子支持的拆分方式的交集确定为所述拆分策略集合。
B4、根据B1所述的装置,所述第一确定单元包括第一确定子单元和第二确定子单元;其中,
所述第一确定子单元,用于分别确定所述拆分策略集合中目标算子对应的拆分方式的权重值;
所述第二确定子单元,用于根据权重值确定所述目标拆分策略。
B5、根据B4所述的装置,所述权重值为根据拆分策略中包含的目标算子的运算操作类型、目标算子涉及的数据规模和多核处理器的硬件参数确定的。
B6、根据B1-B4任一项所述的装置,所述装置还包括:
第二获取单元,用于获取目标算子的运算操作类型;
第四确定单元,用于根据所述目标算子的运算操作类型确定所述目标算子的拆分方式。
C1、一种计算机设备,包括处理器和存储器,所述处理器和存储器相互连接,其中,所述处理器包括通用处理器和人工智能处理器,所述存储器用于存储计算机程序,所述计算机程序包括程序指令,所述处理器被配置用于调用所述程序指令,执行如权利要求A1-A8任一项所述的方法。
D1、一种计算机可读存储介质,所述计算机可读存储介质存储有计算机程序,所述计算机程序包括程序指令,所述程序指令当被处理器执行时使所述处理器执行如权利要求A1-A8任一项所述的方法。
以上对本披露实施例进行了详细介绍,本文中应用了具体个例对本披露的原理及实施方式进行了阐述,以上实施例的说明仅用于帮助理解本披露的方法及其核心思想。同时,本领域技术人员依据本披露的思想,基于本披露的具体实施方式及应用范围上做出的改变或变形之处,都属于本披露保护的范围。综上所述,本说明书内容不应理解为对本披露的限制。

Claims (20)

  1. 一种神经网络处理方法,其特征在于,所述方法应用于人工智能处理器,所述人工智能处理器包括M个人工智能处理器核,M为大于1的正整数;所述方法包括:
    获取神经网络模型对应的计算图;其中,所述计算图中包含多个算子;
    在拆分策略集合中确定所述神经网络计算任务的目标拆分策略;其中,所述拆分策略集合为所述计算图中目标算子对应的拆分方式组成的集合;
    根据所述目标拆分策略对所述神经网络计算任务进行拆分,得到多个子计算任务;
    将所述子计算任务分配到人工智能处理器中的对应人工智能处理器核上进行处理。
  2. 根据权利要求1所述的方法,其特征在于,所述获取神经网络模型对应的计算图之后,所述在拆分策略集合中确定所述神经网络计算任务的目标拆分策略之前,还包括:
    根据所述计算图中目标算子对应的并行度、拆分维度、拆分维度大小确定所述目标算子对应的拆分方式;
    根据所述目标算子对应的拆分方式确定所述拆分策略集合。
  3. 根据权利要求2所述的方法,其特征在于,所述根据所述目标算子对应的拆分方式确定所述拆分策略集合,包括:
    将每个目标算子支持的拆分方式的交集确定为所述拆分策略集合。
  4. 根据权利要求1所述的方法,其特征在于,所述在拆分策略集合中确定所述神经网络计算任务的目标拆分策略,包括:
    分别确定所述拆分策略集合中目标算子对应的拆分方式的权重值;
    根据权重值确定所述目标拆分策略。
  5. 根据权利要求4所述的方法,其特征在于,所述权重值为根据拆分策略中包含的目标算子的运算操作类型、目标算子涉及的数据规模和多核处理器的硬件参数确定的。
  6. 根据权利要求1-5任一项所述的方法,其特征在于,所述方法还包括:
    获取目标算子的运算操作类型;
    根据所述目标算子的运算操作类型确定所述目标算子的拆分方式。
  7. 根据权利要求2所述的方法,其特征在于,所述目标算子对应的并行度包括第一并行度或第二并行度。
  8. 根据权利要求2所述的方法,其特征在于,所述目标算子对应的并行度包括第一并行度和第二并行度;其中,所述第一并行度乘以第二并行度的结果小于等于人工智能处理器中的人工智能处理器核的数目。
  9. 一种神经网络处理装置,其特征在于,所述装置应用于人工智能处理器,所述人工智能处理器包括M个人工智能处理器核,M为大于1的正整数;所述装置包括:
    第一获取单元,用于获取神经网络模型对应的计算图;其中,所述神经网络模型包含多个算子;
    第一确定单元,用于在拆分策略集合中确定所述神经网络计算任务的目标拆分策略;其中,所述拆分策略集合为所述计算图中目标算子对应的拆分方式组成的集合;
    拆分单元,用于根据所述目标拆分策略对所述神经网络计算任务进行拆分,得到多个子计算任务;
    执行单元,用于将所述子计算任务分配到人工智能处理器中的对应人工智能处理器核上进行处理。
  10. 根据权利要求9所述的装置,其特征在于,所述装置还包括:
    第二确定单元,用于根据计算图中目标算子对应的并行度、拆分维度、拆分维度大小确定所述目标算子对应的拆分方式;
    第三确定单元,用于根据所述目标算子对应的拆分方式确定所述拆分策略集合。
  11. 根据权利要求10所述的装置,其特征在于,所述第三确定单元具体用于:
    将每个目标算子支持的拆分方式的交集确定为所述拆分策略集合。
  12. 根据权利要求9所述的装置,其特征在于,所述第一确定单元包括第一确定子单元和第二确定子单元;其中,
    所述第一确定子单元,用于分别确定所述拆分策略集合中目标算子对应的拆分方式的权重值;
    所述第二确定子单元,用于根据权重值确定所述目标拆分策略。
  13. 根据权利要求12所述的装置,其特征在于,所述权重值为根据拆分策略中包含的目标算子的运算操作类型、目标算子涉及的数据规模和多核处理器的硬件参数确定的。
  14. 根据权利要求9-13任一项所述的装置,其特征在于,所述装置还包括:
    第二获取单元,用于获取目标算子的运算操作类型;
    第四确定单元,用于根据所述目标算子的运算操作类型确定所述目标算子的拆分方式。
  15. 根据权利要求10所述的装置,其特征在于,所述目标算子对应的并行度包括第一并行度或第二并行度。
  16. 一种芯片,其特征在于,所述芯片集成如权利要求9-15任一项所述的神经网络处理装置。
  17. 一种计算机设备,其特征在于,所述计算机设备包括如权利要求16所述的芯片或如权利要求9-15任一项所述的神经网络处理装置。
  18. 一种计算机设备,其特征在于,包括处理器和存储器,所述处理器和存储器相互连接,其中,所述处理器包括通用处理器和人工智能处理器,所述存储器用于存储计算机程序,所述计算机程序包括程序指令,所述处理器被配置用于调用所述程序指令,执行如权利要求1-8任一项所述的方法。
  19. 一种计算机可读存储介质,其特征在于,所述计算机可读存储介质存储有计算机程序,所述计算机程序包括程序指令,所述程序指令当被处理器执行时使所述处理器执行如权利要求1-8任一项所述的方法。
  20. 一种计算机程序产品,其特征在于,所述计算机程序产品包括存储了计算机程序的非瞬时性计算机可读存储介质,所述计算机程序可操作来使计算机执行如权利要求1-8任一项所述的方法。
PCT/CN2020/116933 2019-09-24 2020-09-22 神经网络处理方法、装置、计算机设备及存储介质 WO2021057746A1 (zh)

Priority Applications (2)

Application Number Priority Date Filing Date Title
EP20869294.7A EP4036810A4 (en) 2019-09-24 2020-09-22 NEURAL NETWORK PROCESSING METHOD AND APPARATUS, COMPUTER DEVICE AND STORAGE MEDIUM
US17/622,702 US20220383082A1 (en) 2019-09-24 2020-09-22 Neural network processing method and apparatus, computer device and storage medium

Applications Claiming Priority (4)

Application Number Priority Date Filing Date Title
CN201910910117.6 2019-09-24
CN201910910117.6A CN110674936A (zh) 2019-09-24 2019-09-24 一种神经网络处理方法、装置、计算机设备及存储介质
CN201910910118.0 2019-09-24
CN201910910118.0A CN110659728B (zh) 2019-09-24 2019-09-24 神经网络优化方法、装置、计算机设备及存储介质

Publications (1)

Publication Number Publication Date
WO2021057746A1 true WO2021057746A1 (zh) 2021-04-01

Family

ID=75165104

Family Applications (1)

Application Number Title Priority Date Filing Date
PCT/CN2020/116933 WO2021057746A1 (zh) 2019-09-24 2020-09-22 神经网络处理方法、装置、计算机设备及存储介质

Country Status (3)

Country Link
US (1) US20220383082A1 (zh)
EP (1) EP4036810A4 (zh)
WO (1) WO2021057746A1 (zh)


Families Citing this family (10)

* Cited by examiner, † Cited by third party
Publication number Priority date Publication date Assignee Title
US11714992B1 (en) 2018-12-13 2023-08-01 Amazon Technologies, Inc. Neural network processing based on subgraph recognition
WO2021063317A1 (en) * 2019-10-01 2021-04-08 Guangdong Oppo Mobile Telecommunications Corp., Ltd. Tensor processing method and apparatus, electronic device
KR20220064665A (ko) * 2020-11-12 2022-05-19 삼성전자주식회사 인공지능 모델을 분산 처리하는 전자 장치 및 그 동작 방법
DE102021202933A1 (de) * 2021-03-25 2022-09-29 Robert Bosch Gesellschaft mit beschränkter Haftung Verfolgung mehrerer Objekte in Zusammenarbeit mehrerer neuronaler Netzwerke
US11782706B1 (en) 2021-06-29 2023-10-10 Amazon Technologies, Inc. Reconfigurable neural network processing based on subgraph recognition
US20230004786A1 (en) * 2021-06-30 2023-01-05 Micron Technology, Inc. Artificial neural networks on a deep learning accelerator
US12118400B2 (en) * 2021-11-29 2024-10-15 International Business Machines Corporation Performing batched training for machine-learning pipelines
CN115858178B (zh) * 2023-02-21 2023-06-06 芯砺智能科技(上海)有限公司 一种卷积计算中资源共享的方法、装置、介质及设备
CN116560666B (zh) * 2023-07-10 2023-09-22 上海燧原科技有限公司 基于多层级代码生成的ai前端统一计算方法、装置及介质
CN117056068B (zh) * 2023-08-08 2024-03-19 杭州观远数据有限公司 ETL中JobEngine任务拆分方法


Patent Citations (6)

* Cited by examiner, † Cited by third party
Publication number Priority date Publication date Assignee Title
CN106155635A (zh) * 2015-04-03 2016-11-23 北京奇虎科技有限公司 一种数据处理方法和装置
CN109426553A (zh) * 2017-08-21 2019-03-05 上海寒武纪信息科技有限公司 任务切分装置及方法、任务处理装置及方法、多核处理器
US20190138891A1 (en) * 2017-11-09 2019-05-09 Samsung Electronics Co., Ltd. Apparatus and method with neural network
CN107862378A (zh) * 2017-12-06 2018-03-30 芯原微电子(上海)有限公司 基于多核的卷积神经网络加速方法及系统、存储介质及终端
CN109993299A (zh) * 2017-12-29 2019-07-09 中兴通讯股份有限公司 数据训练方法及装置、存储介质、电子装置
CN110674936A (zh) * 2019-09-24 2020-01-10 上海寒武纪信息科技有限公司 一种神经网络处理方法、装置、计算机设备及存储介质

Non-Patent Citations (1)

* Cited by examiner, † Cited by third party
Title
See also references of EP4036810A4 *

Cited By (10)

* Cited by examiner, † Cited by third party
Publication number Priority date Publication date Assignee Title
CN114327630A (zh) * 2022-01-05 2022-04-12 北京大学 一种适用于华为昇腾芯片的高性能算子生成方法
CN114327630B (zh) * 2022-01-05 2023-02-10 北京大学 一种适用于华为昇腾芯片的高性能算子生成方法
WO2023197857A1 (zh) * 2022-04-11 2023-10-19 华为技术有限公司 一种模型切分方法及其相关设备
CN114970847A (zh) * 2022-05-09 2022-08-30 清华大学 数据处理方法、装置和存储介质
CN114816773A (zh) * 2022-06-29 2022-07-29 浙江大华技术股份有限公司 数据处理方法、系统、电子装置和存储介质
CN114816773B (zh) * 2022-06-29 2022-09-23 浙江大华技术股份有限公司 数据处理方法、系统、电子装置和存储介质
CN115762515A (zh) * 2022-11-08 2023-03-07 北京百度网讯科技有限公司 用于语音识别的神经网络的处理和应用方法、装置及设备
CN115762515B (zh) * 2022-11-08 2023-12-01 北京百度网讯科技有限公司 用于语音识别的神经网络的处理和应用方法、装置及设备
CN116362316A (zh) * 2023-05-29 2023-06-30 成都阿加犀智能科技有限公司 一种模型转换方法、装置、存储介质及电子设备
CN116362316B (zh) * 2023-05-29 2023-12-12 成都阿加犀智能科技有限公司 一种模型转换方法、装置、存储介质及电子设备

Also Published As

Publication number Publication date
EP4036810A1 (en) 2022-08-03
US20220383082A1 (en) 2022-12-01
EP4036810A4 (en) 2023-10-18

Similar Documents

Publication Publication Date Title
WO2021057746A1 (zh) 神经网络处理方法、装置、计算机设备及存储介质
CN110659728B (zh) 神经网络优化方法、装置、计算机设备及存储介质
WO2021057720A1 (zh) 神经网络模型处理方法、装置、计算机设备及存储介质
WO2021057713A1 (zh) 用多核处理器实现神经网络模型拆分方法及相关产品
WO2021057722A1 (zh) 用多核处理器实现神经网络模型拆分方法及相关产品
CN110674936A (zh) 一种神经网络处理方法、装置、计算机设备及存储介质
CN110929627B (zh) 基于宽模型稀疏数据集的高效gpu训练模型的图像识别方法
US9953003B2 (en) Systems and methods for in-line stream processing of distributed dataflow based computations
CN111401510A (zh) 一种数据处理方法、装置、计算机设备及存储介质
CN110826708B (zh) 一种用多核处理器实现神经网络模型拆分方法及相关产品
CN111401538A (zh) 一种数据处理方法、装置、计算机设备及存储介质
US20190138373A1 (en) Multithreaded data flow processing within a reconfigurable fabric
CN111401539A (zh) 一种数据处理方法、装置、计算机设备及存储介质
CN111401511A (zh) 一种数据处理方法、装置、计算机设备及存储介质
US12079608B2 (en) Efficient optimization for neural network deployment and execution
US12079734B1 (en) Compilation time reduction for memory and compute bound neural networks
CN111401537A (zh) 一种数据处理方法、装置、计算机设备及存储介质
CN115860061A (zh) 图神经网络优化方法和图神经网络推理系统
US20220292334A1 (en) Efficient memory use optimization for neural network deployment and execution
US20220292300A1 (en) Efficient quantization for neural network deployment and execution
US11960982B1 (en) System and method of determining and executing deep tensor columns in neural networks
Fan et al. Accelerating Convolutional Neural Networks by Exploiting the Sparsity of Output Activation
WO2023030507A1 (zh) 编译优化方法、装置、计算机设备以及存储介质
US20230051344A1 (en) Optimization of memory use for efficient neural network execution
US20230043584A1 (en) Optimization of memory use for efficient neural network execution

Legal Events

Date Code Title Description
121 Ep: the epo has been informed by wipo that ep was designated in this application

Ref document number: 20869294

Country of ref document: EP

Kind code of ref document: A1

NENP Non-entry into the national phase

Ref country code: DE

ENP Entry into the national phase

Ref document number: 2020869294

Country of ref document: EP

Effective date: 20220425