CN116775149A - Cold start method and device for neural network - Google Patents
Cold start method and device for neural network
- Publication number
- CN116775149A (application number CN202310732822.8A)
- Authority
- CN
- China
- Prior art keywords
- operator
- operation process
- reading
- core processor
- kernel
- Prior art date
- Legal status (The legal status is an assumption and is not a legal conclusion. Google has not performed a legal analysis and makes no representation as to the accuracy of the status listed.)
- Pending
Classifications
-
- G—PHYSICS
- G06—COMPUTING; CALCULATING OR COUNTING
- G06F—ELECTRIC DIGITAL DATA PROCESSING
- G06F9/00—Arrangements for program control, e.g. control units
- G06F9/06—Arrangements for program control, e.g. control units using stored programs, i.e. using an internal store of processing equipment to receive or retain programs
- G06F9/44—Arrangements for executing specific programs
- G06F9/445—Program loading or initiating
- G06F9/44505—Configuring for program initiating, e.g. using registry, configuration files
-
- G—PHYSICS
- G06—COMPUTING; CALCULATING OR COUNTING
- G06F—ELECTRIC DIGITAL DATA PROCESSING
- G06F9/00—Arrangements for program control, e.g. control units
- G06F9/06—Arrangements for program control, e.g. control units using stored programs, i.e. using an internal store of processing equipment to receive or retain programs
- G06F9/46—Multiprogramming arrangements
- G06F9/48—Program initiating; Program switching, e.g. by interrupt
- G06F9/4806—Task transfer initiation or dispatching
- G06F9/4843—Task transfer initiation or dispatching by program, e.g. task dispatcher, supervisor, operating system
- G06F9/4881—Scheduling strategies for dispatcher, e.g. round robin, multi-level priority queues
-
- G—PHYSICS
- G06—COMPUTING; CALCULATING OR COUNTING
- G06N—COMPUTING ARRANGEMENTS BASED ON SPECIFIC COMPUTATIONAL MODELS
- G06N3/00—Computing arrangements based on biological models
- G06N3/02—Neural networks
- G06N3/06—Physical realisation, i.e. hardware implementation of neural networks, neurons or parts of neurons
- G06N3/063—Physical realisation, i.e. hardware implementation of neural networks, neurons or parts of neurons using electronic means
Abstract
The embodiment of the application discloses a method and a device for cold starting a neural network. An edge device with a multi-core processor is used, where the multi-core processor adopts a big-little core architecture. Taking the operator kernels of the neural network as the unit, the operation of the neural network is split into the operation processes of a plurality of operator kernels. Following the operator kernel running order, the operation process of parameter reading of the first operator kernel, the operation process of reading and converting the corresponding weights, and the operation process of running the first operator kernel are scheduled to be completed in the big core processor; the operation processes of running the remaining operator kernels are scheduled to the big core processor, while their operation processes of parameter reading and of reading and converting the corresponding weights are scheduled to selected small core processors. In this way, the delay time is reduced without affecting the operation accuracy of the neural network.
Description
Technical Field
The application relates to the technical field of neural networks, and in particular to a method and a device for cold starting a neural network.
Background
With the development of artificial intelligence technology, neural networks have been applied to recognition tasks in many fields, such as computer vision and natural language understanding. In applications such as image or semantic recognition, the deployment of neural networks is shifting from large data centers to edge devices in pursuit of low processing latency and privacy of the processed data. An edge device is a device that provides an entry point into an enterprise or service provider core network, for example: smart phones, Internet of Things access devices, wearable devices, and access devices in autonomous vehicles. The computational power of these edge devices can be fully utilized for neural network-based processing.
There are two significant trends in deploying neural networks on edge devices: 1) the number and variety of neural networks per edge device have exploded; 2) the structural complexity of neural networks keeps increasing, for example with the deployment of deep neural networks, which are neural networks with multiple hidden layers. These two trends intensify the contention among neural networks for the limited resources of edge devices. It is therefore unrealistic to preload all neural networks into the memory of the edge device and keep them waiting to run. That is, the cold start of a neural network on the edge device, namely loading it and then performing the subsequent initialization and execution, is becoming important. The speed of a neural network cold start on the edge device, like that of a hot start, is critical to the quality of service and the user experience of the edge device.
However, for neural networks, and deep neural networks in particular, a cold start on an edge device is far more costly than a hot start. Cold starting a neural network often introduces noticeable delays, and how to reduce the delay time without affecting the operation accuracy of the neural network is a problem that needs to be solved urgently.
Disclosure of Invention
In view of this, the embodiment of the application provides a method for cold starting a neural network on an edge device, which can reduce delay time without affecting the operation accuracy of the neural network.
The embodiment of the application also provides an apparatus for cold starting a neural network on an edge device, which can likewise reduce the delay time without affecting the operation accuracy of the neural network.
The application is realized in the following way:
In one embodiment of the present application, there is provided a method for cold starting a neural network on an edge device, the method including:
splitting the operation of the neural network into the operation processes of a plurality of operator kernels, taking the operator kernels of the neural network as the unit;

according to the operator kernel running order, scheduling the operation process of parameter reading of a first operator kernel, the operation process of reading and converting the corresponding weights, and the operation process of running the first operator kernel to a big core processor in the edge device for execution;

scheduling the operation processes of running the remaining operator kernels to be completed in the big core processor, and scheduling the operation processes of parameter reading of the remaining operator kernels and the operation processes of reading and converting the corresponding weights to small core processors selected in the edge device for execution;

and the big core processor and the small core processors of the edge device respectively executing the corresponding operation processes based on the scheduling.
In the above method, the scheduling of the operation processes of parameter reading of the remaining operator kernels and the operation processes of reading and converting the corresponding weights to the small core processors selected in the edge device includes:

according to the operator kernel running order, dispatching the operator kernels one by one to the small core processors selected in sequence, and scheduling for execution the operation process of parameter reading of each operator kernel and the operation process of reading and converting the corresponding weights.
In the above method, the scheduling for execution in a small core processor selected in the edge device includes:

after the small core processor acquires the information of the operator kernel scheduled to it, initializing an operation list of the small core processor, and storing in the operation list the operation information of parameter reading of the operator kernel and the operation information of reading and converting the corresponding weights;
the small core processor executing the corresponding operation process based on scheduling comprises:
and the small core processor sequentially executes the corresponding operation processes according to the operation list.
In the above method, the scheduling to the big core processor in the edge device further includes:

for each operator kernel except the first one, calculating the total time of executing the operation process of parameter reading of the operator kernel, the operation process of reading and converting the corresponding weights, and the operation process of running the operator kernel all on the big core processor, and judging whether this total time is smaller than the total time of executing the operation process of parameter reading of the operator kernel on a small core processor, the operation process of reading and converting the corresponding weights on the small core processor, and the operation process of running the operator kernel on the big core processor;

if yes, rescheduling the operation process of parameter reading of the operator kernel and the operation process of reading and converting the corresponding weights from the small core processor to the big core processor for completion;

if not, keeping the operation process of parameter reading of the operator kernel and the operation process of reading and converting the corresponding weights scheduled to the small core processor for completion.
In the above method, the scheduling to the small core processors selected in the edge device further includes:

determining the small core processor that finishes its operations first, and judging whether this small core processor can complete the operation process of parameter reading of the operator kernel with the longest running time and the operation process of reading and converting the corresponding weights;

if yes, scheduling the operation process of parameter reading of the operator kernel with the longest running time and the operation process of reading and converting the corresponding weights to be completed in that small core processor.

In the above method, judging whether the small core processor can complete the operation process of parameter reading of the operator kernel with the longest running time and the operation process of reading and converting the corresponding weights includes:

judging whether the time at which the small core processor would complete the operation process of parameter reading of the operator kernel with the longest running time and the operation process of reading and converting the corresponding weights is before the time at which the operation process of running that operator kernel starts; if so, scheduling the operation process of parameter reading of the operator kernel with the longest running time and the operation process of reading and converting the corresponding weights to that small core processor.
In another embodiment of the present application, there is provided an apparatus for performing cold start of a neural network on an edge device, the apparatus including: the decision module and the execution module, wherein,
the decision module is used for splitting the operation of the neural network into the operation processes of a plurality of operator kernels, taking the operator kernels of the neural network as the unit; according to the operator kernel running order, scheduling the operation process of parameter reading of a first operator kernel, the operation process of reading and converting the corresponding weights, and the operation process of running the first operator kernel to a big core processor in the edge device for execution; and scheduling the operation processes of running the remaining operator kernels to the big core processor, and scheduling the operation processes of parameter reading of the remaining operator kernels and the operation processes of reading and converting the corresponding weights to small core processors selected in the edge device for execution;

and the execution module is used for controlling the big core processor and the small core processors to respectively execute the corresponding operation processes based on the scheduling.
In still another embodiment of the present application, there is provided an electronic apparatus including:
a processor;
a memory storing a program configured to implement a method of cold starting a neural network on an edge device as described above when executed by the processor.
In yet another embodiment of the present application, a non-transitory computer readable storage medium is provided, the non-transitory computer readable storage medium storing instructions, wherein the instructions, when executed by a processor, cause the processor to perform a method for cold starting a neural network on an edge device as described above.
In a further embodiment of the application, a computer program product is provided, comprising a computer program or instructions which, when executed by a processor, implement the steps of the method for cold starting a neural network on an edge device as described above.
As seen above, the embodiment of the present application adopts an edge device with a multi-core processor in a big-little core architecture. Taking the operator kernels of the neural network as the unit, the operation process of the neural network is divided into the operation processes of a plurality of operator kernels; following the operator kernel running order, the operation process of parameter reading of the first operator kernel, the operation process of reading and converting the corresponding weights, and the operation process of running the first operator kernel are scheduled to be completed in the big core processor; the operation processes of running the remaining operator kernels are scheduled to the big core processor, while the operation processes of parameter reading of the remaining operator kernels and the operation processes of reading and converting the corresponding weights are scheduled to the selected small core processors. In this way, during the cold start on the edge device the neural network is executed through the cooperation of the big core processor cycle and the small core processor cycles, and the delay time is reduced without affecting the operation accuracy of the neural network.
Drawings
Fig. 1 is a process flow diagram of a cold start performed on an edge device according to an embodiment of the present application;

fig. 2 is a flowchart of a method for cold starting a neural network on an edge device according to an embodiment of the present application;

fig. 3 is a schematic diagram of the process of performing the cold start of an entire neural network on an edge device using the pipeline scheduling policy according to an embodiment of the present application;

fig. 4 is a schematic structural diagram of an apparatus for cold starting a neural network on an edge device according to an embodiment of the present application;
fig. 5 is a schematic diagram of an electronic device according to another embodiment of the present application.
Detailed Description
The following description of the embodiments of the present application will be made clearly and completely with reference to the accompanying drawings, in which it is apparent that the embodiments described are only some embodiments of the present application, but not all embodiments. All other embodiments, which can be made by those skilled in the art based on the embodiments of the application without making any inventive effort, are intended to be within the scope of the application.
The terms "first," "second," "third," "fourth" and the like in the description and in the claims and in the above drawings, if any, are used for distinguishing between similar objects and not necessarily for describing a particular sequential or chronological order. It is to be understood that the data so used may be interchanged where appropriate such that the embodiments of the application described herein may be implemented, for example, in sequences other than those illustrated or otherwise described herein. Furthermore, the terms "comprise" and "have," as well as any variations thereof, are intended to cover a non-exclusive inclusion. For example, a process, method, system, article, or apparatus that comprises a list of steps or elements is not necessarily limited to those elements but may include other steps or elements not expressly listed or inherent to such process, method, article, or apparatus.
The technical scheme of the application is described in detail below by specific examples. The following embodiments may be combined with each other, and some embodiments may not be repeated for the same or similar concepts or processes.
In the cold start process of a neural network, especially a deep neural network, on an edge device, the main time-consuming bottlenecks are: reading the original weights of the neural network from disk into the memory of the edge device at loading time, converting the original weights into executable-format weights at initialization time, and feeding the input data through the neural network based on the executable-format weights at run time. These processes are also referred to as the cold inference process of a deep neural network on an edge device.
Currently, there are two ways to mitigate the cold inference delay. One is to share the original weights among neural networks, so that more neural networks can be packed into the memory of the edge device and each neural network to be run later is kept warm; however, this method does not scale, and when the number and variety of neural networks to be run on the edge device increase, the operation accuracy of the neural network models drops noticeably. The other way is to estimate the execution time of each neural network and read and store the original weights in advance; however, when the edge device is to run multiple neural networks, the execution time of each neural network is difficult to estimate. Both approaches address the cold start delay indirectly and rely on changing the neural network structure or on external knowledge, which is difficult to realize in practice.
Therefore, the embodiment of the application optimizes the cold start delay of a neural network on an edge device in a direct manner, does not depend on any assumption about the structure or execution environment of the neural network, and guarantees zero accuracy loss. The embodiment of the application adopts an edge device with a multi-core processor, where the multi-core processor adopts a big-little core architecture, and the method may include: taking the operator kernels of the neural network as the unit, dividing the operation process of the neural network into the operation processes of a plurality of operator kernels; following the operator kernel running order, scheduling the operation process of parameter reading of the first operator kernel, the operation process of reading and converting the corresponding weights, and the operation process of running the first operator kernel to be completed in the big core processor; and scheduling the operation processes of running the remaining operator kernels to the big core processor, while scheduling the operation processes of parameter reading of the remaining operator kernels and the operation processes of reading and converting the corresponding weights to the selected small core processors.
The small core processors are selected according to the operator kernel running order: the remaining operator kernels are dispatched one by one to the small core processors selected in sequence, and the operation process of parameter reading of each operator kernel and the operation process of reading and converting the corresponding weights are scheduled there for completion.
In this way, during the cold start on the edge device the neural network is executed through the cooperation of the big core processor cycle and the small core processor cycles, and the delay time is reduced without affecting the operation accuracy of the neural network.
The embodiment of the application is based on balancing the workload of each small core processor during the small core processor cycle. When the small core processors are used in a cycle to complete the operation process of parameter reading of each operator kernel and the operation process of reading and converting the corresponding weights, the application further includes: determining the small core processor that finishes its operations first, and judging whether this small core processor can complete the operation process of parameter reading of the operator kernel with the longest running time and the operation process of reading and converting the corresponding weights; if so, scheduling these two operation processes of that operator kernel to this small core processor for completion. When making this judgment, it is checked whether the time at which the small core processor would finish the operation process of parameter reading of the operator kernel with the longest running time and the operation process of reading and converting the corresponding weights is before the time at which the operation process of running that operator kernel starts; if so, the two operation processes are scheduled to that small core processor. In this way, the workload of the individual small core processors can be balanced better, as sketched below.
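A minimal sketch of this rebalancing check follows. The per-core finish times, the estimated read/convert/run times, and the schedule structure are illustrative assumptions made for this example; they are not values or names defined in the patent text.

```python
# Hedged sketch of the small-core rebalancing rule described above.
# All timing estimates and the schedule structure are illustrative assumptions.
def rebalance_small_cores(core_finish_time, kernels, est_read, est_convert,
                          run_start_time, schedule):
    """Move the read + convert work of the kernel with the longest running time
    to the small core that finishes first, if that core can finish the work
    before the kernel's running operation is due to start on the big core."""
    earliest_core = min(core_finish_time, key=core_finish_time.get)
    longest = max(kernels, key=lambda k: k["est_run_time"])   # longest-running kernel
    finish = (core_finish_time[earliest_core]
              + est_read[longest["name"]] + est_convert[longest["name"]])
    if finish <= run_start_time[longest["name"]]:             # done before the run slot?
        schedule[earliest_core] += [("read_params", longest["name"]),
                                    ("read_convert_weights", longest["name"])]
        core_finish_time[earliest_core] = finish
    return schedule
```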
In the embodiment of the application, during the small core processor cycle, after a small core processor acquires the information of an operator kernel scheduled to it, an operation list of the small core processor is initialized, and the operation information of parameter reading of the operator kernel and the operation information of reading and converting the corresponding weights are stored in the operation list; when the schedule is executed, the small core processor executes the operations in order according to the operation list.
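As a concrete illustration of such an operation list, a minimal sketch is given below; the field names and the callback are assumptions made only for this example.

```python
# Illustrative per-small-core operation list; field names are assumptions.
def init_operation_list(kernel_name):
    """Built when a small core receives the scheduling information of a kernel."""
    return [{"op": "read_params", "kernel": kernel_name},
            {"op": "read_convert_weights", "kernel": kernel_name}]

def run_operation_list(op_list, execute):
    """The small core executes the stored operations strictly in list order."""
    for entry in op_list:
        execute(entry["op"], entry["kernel"])
```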
In the application, during the cold start on the edge device, the cooperation of the big core processor cycle and the small core processor cycles balances the workload of the big core processor and the small core processors, so as to minimize the finishing time of the big core processor. The big core processor cycle further includes: for each operator kernel except the first one, calculating the total time of executing the operation process of parameter reading of the operator kernel, the operation process of reading and converting the corresponding weights, and the operation process of running the operator kernel all on the big core processor, and judging whether this total time is smaller than the total time of executing the operation process of parameter reading of the operator kernel on a small core processor, the operation process of reading and converting the corresponding weights on the small core processor, and the operation process of running the operator kernel on the big core processor. If so, the operation process of parameter reading and the operation process of reading and converting the corresponding weights are rescheduled from the small core processor to the big core processor for completion; otherwise, they remain scheduled to the small core processor. In this way, the workload among all the core processors can be largely balanced.
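The comparison performed in the big core processor cycle can be sketched as below; the timing arguments are hypothetical estimates (for example from profiling), not quantities defined in the patent.

```python
# Illustrative sketch of the big-core balancing decision described above.
def assign_read_convert(t_read_big, t_convert_big, t_run_big,
                        t_read_small, t_convert_small):
    """Return 'big' when doing read + convert + run entirely on the big core is
    faster than read + convert on a small core followed by run on the big core."""
    all_on_big = t_read_big + t_convert_big + t_run_big
    split_across = t_read_small + t_convert_small + t_run_big
    return "big" if all_on_big < split_across else "small"

# Example: reads and conversion that are cheap on the big core stay there.
print(assign_read_convert(1.0, 2.0, 3.0, 4.0, 5.0))  # -> 'big'
```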
In an embodiment of the present application, an operator kernel of the neural network refers to a basic mathematical operation unit used by a neural network layer. The operator kernels form the basic computation units of the neural network; different operator kernels correspond to the computation logic of different neural network layers and are usually realized as matrix computations. For example, a convolution layer is an operator, and the computation unit that executes the convolution operation is an operator kernel; the weighted summation in a fully connected layer is an operator, and the computation unit that executes the fully connected operation is another operator kernel.
In the embodiment of the application, a big core processor of the edge device is a performance core (Performance Core) of the central processing unit (CPU) or a graphics processing unit (GPU) in the edge device; it refers to a high-performance processor core and carries the main computational load of the edge device. A small core processor of the edge device is an efficiency core (Efficiency Core) of the CPU, i.e., a CPU core designed for energy saving that is mainly used when the edge device is under low load.
The following describes embodiments of the present application in detail.
As shown in fig. 1, fig. 1 is a process flow diagram of a cold start performed on an edge device according to an embodiment of the present application, which includes a decision stage and an execution stage. In the decision stage, the operator kernels of the neural network, the cached converted weights, and the scheduling policy to adopt are selected, and the execution stage is entered under the control of a scheduler. In the execution stage, data is input into the corresponding operator kernels for execution and the execution result is obtained, where the operator kernels run based on the corresponding converted weights under the scheduling of the scheduler. The execution of the operator kernels is referred to as inference in the figure.
The embodiment of the application studies the optimization space of a neural network cold start on an edge device and identifies the following three effective optimization approaches.
First optimization mode
Selecting the optimal operator kernel implementation in the neural network. In neural networks, and in particular deep neural networks, each operator kernel typically has many different implementations. These operator kernel implementations are designed to increase the running speed, and the current selection strategy is based entirely on the running speed of the neural network during a hot start. However, the operator kernel that runs fastest during a hot start does not necessarily perform best during a cold start; for example, some operator kernels run fast, but the operation process of reading and converting their corresponding weights takes relatively long, which slows down the cold start. In this case, the selection strategy for the operator kernels in the neural network should be one that reduces the cold start delay, yielding the optimal operator kernels for the neural network.
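A hedged sketch of such a cold-start-aware selection is shown below; the candidate list and the per-phase time estimates are invented purely for illustration.

```python
# Pick the implementation minimising total cold-start cost (read + convert + run),
# rather than run time alone as in the usual hot-start criterion.
def select_kernel_implementation(candidates):
    return min(candidates, key=lambda c: c["t_read"] + c["t_convert"] + c["t_run"])

# Example: a fast-running implementation can lose to one with cheaper weight conversion.
impls = [{"name": "winograd_conv", "t_read": 5.0, "t_convert": 40.0, "t_run": 8.0},
         {"name": "direct_conv",   "t_read": 5.0, "t_convert": 2.0,  "t_run": 12.0}]
print(select_kernel_implementation(impls)["name"])  # -> direct_conv
```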
Second optimization mode
Caching the converted weights of the corresponding operator kernels. By storing the converted weights on the disk of the edge device for direct reading and execution, the weight conversion process can be bypassed. However, the converted weights may occupy more storage space and incur higher disk read/write costs. Whether to execute the operation process of reading and converting the weights of an operator kernel, or to execute the operation process of reading its already-converted weights, requires a trade-off between the disk read/write cost and the computation cost.
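A minimal sketch of this trade-off check follows, assuming made-up disk and conversion throughput numbers; it only illustrates the comparison and does not model a real device.

```python
# Cache the converted weights only if reading them back from disk is cheaper
# than reading the raw weights and converting them again.
def should_cache_converted_weights(raw_bytes, converted_bytes,
                                   disk_mb_per_s=100.0, convert_mb_per_s=50.0):
    mb = 1e6
    t_read_raw = raw_bytes / (disk_mb_per_s * mb)
    t_convert = raw_bytes / (convert_mb_per_s * mb)
    t_read_converted = converted_bytes / (disk_mb_per_s * mb)
    return t_read_converted < t_read_raw + t_convert

# Example: converted weights twice the raw size are still worth caching here.
print(should_cache_converted_weights(raw_bytes=50e6, converted_bytes=100e6))  # -> True
```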
Third optimization mode
Adopting an optimal pipeline scheduling policy. This concerns the pipelined execution of the operator kernels and the way they are bound to the cores of the edge device. The execution process of one operator kernel includes the operation process of reading and converting the corresponding weights and the operation process of running the operator kernel. The blocking time of transfers between the disk and the memory of the edge device can be reduced by a pipeline scheduling policy. The pipeline scheduling policy can also schedule asymmetric core processors on the edge device to handle different operation processes, for example between a central processing unit (CPU) and a graphics processing unit (GPU), or between big core CPUs and small core CPUs.
Considering the three optimization approaches together, their delay-reduction effects on the cold start of a neural network on an edge device are tightly coupled; for example, selecting different operator kernels may call for different pipeline scheduling policies. The present application addresses the following two challenges in order to provide a comprehensive and efficient solution for the cold start of a neural network on an edge device.
First, the search space is too large. Formulating the problem as jointly selecting the operator kernel implementations, deciding whether to bypass the weight-conversion operation processes of the operator kernels, and scheduling the many operation processes of running the operator kernels so as to obtain an optimal operator kernel schedule yields a non-deterministic polynomial (NP) problem that is difficult to solve. Second, because the disk and memory capacity of the edge device is limited, different operation processes can interfere with each other, which further complicates the problem.
The present application therefore employs a heuristic optimal pipeline scheduling strategy, inspired by several key observations on the cold start process of neural networks on edge devices, for example: 1) the different operation processes perform differently on different core processors; 2) executing the running operation processes with multiple threads on the big cores is more efficient than multithreading them on other cores. Therefore, the application uses the big core processor of the edge device to run the operation processes of running the operator kernels with multiple threads, while the operation processes of parameter reading of the operator kernels and the operation processes of reading and converting the corresponding weights are executed on the small core processors. The embodiment of the application also exploits the fact that the operation process of reading and converting the weights of an operator kernel has fewer storage-resource dependencies than the operation process of reading its already-converted weights, so that the execution scheduling of the operator kernels can be arranged easily.
Based on the above heuristics, the embodiment of the present application adopts an intuitive and effective optimal pipeline scheduling strategy: balancing the workload on the different core processors of the edge device so as to minimize the total execution time of the cold start. Meanwhile, in the scheduling planning process, each operation process in the execution of an operator kernel is analyzed and rescheduled where beneficial, so that the performance of each operation process is improved and the scheduling plan is further refined.
Fig. 2 is a flowchart of a method for cold starting a neural network on an edge device according to an embodiment of the present application. The method is applied to an edge device having a multi-core processor and includes the following specific steps:
Step 201, splitting the operation of the neural network into the operation processes of a plurality of operator kernels, taking the operator kernels of the neural network as the unit;

step 202, according to the operator kernel running order, scheduling the operation process of parameter reading of a first operator kernel, the operation process of reading and converting the corresponding weights, and the operation process of running the first operator kernel to a big core processor in the edge device for execution;

step 203, scheduling the operation processes of running the remaining operator kernels to the big core processor, and scheduling the operation processes of parameter reading of the remaining operator kernels and the operation processes of reading and converting the corresponding weights to small core processors selected in the edge device for execution;

step 204, the big core processor and the small core processors of the edge device respectively execute the corresponding operation processes based on the scheduling.
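A minimal end-to-end sketch of steps 201 to 204 is given below; the kernel representation, the process tuples, and the round-robin choice of small cores are illustrative assumptions rather than the patent's concrete implementation.

```python
# Illustrative sketch of steps 201-204: split the network into operator kernels
# and dispatch their operation processes to the big core and the small cores.
def build_cold_start_schedule(kernel_names, small_core_ids):
    big_core = []                                   # operation processes for the big core
    small_cores = {c: [] for c in small_core_ids}   # core id -> operation processes
    for i, name in enumerate(kernel_names):         # operator kernel running order
        if i == 0:
            # Step 202: first kernel - read parameters, read + convert weights,
            # and run, all on the big core processor.
            big_core += [("read_params", name),
                         ("read_convert_weights", name),
                         ("run", name)]
        else:
            # Step 203: remaining kernels - run on the big core; read parameters
            # and read + convert weights on a small core chosen in sequence.
            core = small_core_ids[(i - 1) % len(small_core_ids)]
            small_cores[core] += [("read_params", name),
                                  ("read_convert_weights", name)]
            big_core.append(("run", name))
    return big_core, small_cores                    # step 204: each core executes its list

big, small = build_cold_start_schedule(["conv1", "conv2", "fc1"], ["little0", "little1"])
```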
In the above method, the scheduling of the operation processes of parameter reading of the remaining operator kernels and the operation processes of reading and converting the corresponding weights to the small core processors selected in the edge device includes:

according to the operator kernel running order, dispatching the operator kernels one by one to the small core processors selected in sequence, and scheduling for execution the operation process of parameter reading of each operator kernel and the operation process of reading and converting the corresponding weights.
In the above method, the scheduling for execution in a small core processor selected in the edge device includes:

after the small core processor acquires the information of the operator kernel scheduled to it, initializing an operation list of the small core processor, and storing in the operation list the operation information of parameter reading of the operator kernel and the operation information of reading and converting the corresponding weights;
the small core processor executing the corresponding operation process based on scheduling comprises:
and the small core processor sequentially executes the corresponding operation processes according to the operation list.
In the above method, the scheduling to the big core processor in the edge device further includes:

for each operator kernel except the first one, calculating the total time of executing the operation process of parameter reading of the operator kernel, the operation process of reading and converting the corresponding weights, and the operation process of running the operator kernel all on the big core processor, and judging whether this total time is smaller than the total time of executing the operation process of parameter reading of the operator kernel on a small core processor, the operation process of reading and converting the corresponding weights on the small core processor, and the operation process of running the operator kernel on the big core processor;

if yes, rescheduling the operation process of parameter reading of the operator kernel and the operation process of reading and converting the corresponding weights from the small core processor to the big core processor for completion;

if not, keeping the operation process of parameter reading of the operator kernel and the operation process of reading and converting the corresponding weights scheduled to the small core processor for completion.
In the above method, the scheduling to the small core processors selected in the edge device further includes:

determining the small core processor that finishes its operations first, and judging whether this small core processor can complete the operation process of parameter reading of the operator kernel with the longest running time and the operation process of reading and converting the corresponding weights;

if yes, scheduling the operation process of parameter reading of the operator kernel with the longest running time and the operation process of reading and converting the corresponding weights to be completed in that small core processor.

Here, judging whether the small core processor can complete the operation process of parameter reading of the operator kernel with the longest running time and the operation process of reading and converting the corresponding weights includes:

judging whether the time at which the small core processor would complete the operation process of parameter reading of the operator kernel with the longest running time and the operation process of reading and converting the corresponding weights is before the time at which the operation process of running that operator kernel starts; if so, scheduling the operation process of parameter reading of the operator kernel with the longest running time and the operation process of reading and converting the corresponding weights to that small core processor.
By adopting the above method, a mode based on the small core processor cycle and the big core processor cycle is realized: the cold start process of the neural network on the edge device is executed with the operator kernels of the neural network as the unit, and the execution result is output. As shown in fig. 3, fig. 3 is a schematic diagram of the process of performing the cold start of an entire neural network on an edge device using the pipeline scheduling policy according to an embodiment of the present application.
Fig. 4 is a schematic structural diagram of an apparatus for cold starting a neural network on an edge device according to an embodiment of the present application, where the apparatus includes: the decision module and the execution module, wherein,
the decision module is used for splitting the operation of the neural network into the operation processes of a plurality of operator kernels, taking the operator kernels of the neural network as the unit; according to the operator kernel running order, scheduling the operation process of parameter reading of a first operator kernel, the operation process of reading and converting the corresponding weights, and the operation process of running the first operator kernel to a big core processor in the edge device for execution; and scheduling the operation processes of running the remaining operator kernels to the big core processor, and scheduling the operation processes of parameter reading of the remaining operator kernels and the operation processes of reading and converting the corresponding weights to small core processors selected in the edge device for execution;

and the execution module is used for controlling the big core processor and the small core processors to respectively execute the corresponding operation processes based on the scheduling.
In another embodiment of the application, a non-transitory computer readable storage medium is provided that stores instructions that, when executed by a processor, cause the processor to perform a method of cold starting a neural network on an edge device of one of the previous embodiments.
Fig. 5 is a schematic diagram of an electronic device according to another embodiment of the present application. As shown in fig. 5, another embodiment of the present application further provides an electronic device, which may include a processor 501, where the processor 501 is configured to perform a step of cold starting a neural network on an edge device. As can also be seen from fig. 5, the electronic device provided by the above embodiment further comprises a non-transitory computer readable storage medium 502, on which non-transitory computer readable storage medium 502 a computer program and a neural network model are stored, wherein the computer program, when executed by the processor 501, performs the steps in the method for cold starting a neural network on an edge device.
In particular, the non-transitory computer readable storage medium 502 can be a general purpose storage medium, such as a removable disk, a hard disk, a FLASH memory, a read-only memory (ROM), an erasable programmable read-only memory (EPROM or FLASH memory), or a portable compact disc read-only memory (CD-ROM), etc., and the computer program on the non-transitory computer readable storage medium 502, when executed by the processor 501, can cause the processor 501 to perform the steps of the method for cold starting a neural network on an edge device as described above.

In practice, the non-transitory computer readable storage medium 502 may be included in the apparatus/device/system described in the above embodiments, or may exist alone without being assembled into the apparatus/device/system. The computer readable storage medium carries one or more programs that, when executed, are capable of performing the steps of the method for cold starting a neural network on an edge device.
Yet another embodiment of the present application provides a computer program product comprising a computer program or instructions which, when executed by a processor, implement the steps of a method for cold starting a neural network on an edge device as described above.
The flowcharts and block diagrams in the figures illustrate the architecture, functionality, and operation of possible implementations of systems, methods and computer program products according to various embodiments of the present disclosure. In this regard, each block in the flowchart or block diagrams may represent a module, segment, or portion of code, which comprises one or more executable instructions for implementing the specified logical function(s). It should also be noted that, in some alternative implementations, the functions noted in the block may occur out of the order noted in the figures. For example, two blocks shown in succession may, in fact, be executed substantially concurrently, or the blocks may sometimes be executed in the reverse order, depending upon the functionality involved. It will also be noted that each block of the block diagrams or flowchart illustration, and combinations of blocks in the block diagrams or flowchart illustration, can be implemented by special purpose hardware-based systems which perform the specified functions or acts, or combinations of special purpose hardware and computer instructions.
Those skilled in the art will appreciate that the features recited in the various embodiments of the disclosure and/or in the claims may be combined in various ways, even if such combinations are not explicitly recited in the present application. In particular, the features recited in the various embodiments of the application and/or in the claims may be combined in various ways without departing from the spirit and teachings of the application, and all such combinations fall within the scope of the disclosure.
The principles and embodiments of the present application have been described herein with reference to specific examples, which are intended to be included herein for purposes of illustration only and not to be limiting of the application. It will be apparent to those skilled in the art that variations can be made in the present embodiments and applications within the spirit and principles of the application, and any modifications, equivalents, improvements, etc. are intended to be included within the scope of the present application.
Claims (10)
1. A method for cold starting a neural network on an edge device, the method comprising:

splitting the operation of the neural network into the operation processes of a plurality of operator kernels, taking the operator kernels of the neural network as the unit;

according to the operator kernel running order, scheduling the operation process of parameter reading of a first operator kernel, the operation process of reading and converting the corresponding weights, and the operation process of running the first operator kernel to a big core processor in the edge device for execution;

scheduling the operation processes of running the remaining operator kernels to be completed in the big core processor, and scheduling the operation processes of parameter reading of the remaining operator kernels and the operation processes of reading and converting the corresponding weights to small core processors selected in the edge device for execution;

and the big core processor and the small core processors of the edge device respectively executing the corresponding operation processes based on the scheduling.
2. The method of claim 1, wherein the scheduling of the operation processes of parameter reading of the remaining operator kernels and the operation processes of reading and converting the corresponding weights to the small core processors selected in the edge device comprises:

according to the operator kernel running order, dispatching the operator kernels one by one to the small core processors selected in sequence, and scheduling for execution the operation process of parameter reading of each operator kernel and the operation process of reading and converting the corresponding weights.
3. The method of claim 1, wherein the scheduling for execution in a small core processor selected in the edge device comprises:

after the small core processor acquires the information of the operator kernel scheduled to it, initializing an operation list of the small core processor, and storing in the operation list the operation information of parameter reading of the operator kernel and the operation information of reading and converting the corresponding weights;
the small core processor executing the corresponding operation process based on scheduling comprises:
and the small core processor sequentially executes the corresponding operation processes according to the operation list.
4. The method of claim 1, wherein the scheduling to the big core processor in the edge device further comprises:

for each operator kernel except the first one, calculating the total time of executing the operation process of parameter reading of the operator kernel, the operation process of reading and converting the corresponding weights, and the operation process of running the operator kernel all on the big core processor, and judging whether this total time is smaller than the total time of executing the operation process of parameter reading of the operator kernel on a small core processor, the operation process of reading and converting the corresponding weights on the small core processor, and the operation process of running the operator kernel on the big core processor;

if yes, rescheduling the operation process of parameter reading of the operator kernel and the operation process of reading and converting the corresponding weights from the small core processor to the big core processor for completion;

if not, keeping the operation process of parameter reading of the operator kernel and the operation process of reading and converting the corresponding weights scheduled to the small core processor for completion.
5. The method of claim 1, wherein the scheduling to the small core processors selected in the edge device further comprises:

determining the small core processor that finishes its operations first, and judging whether this small core processor can complete the operation process of parameter reading of the operator kernel with the longest running time and the operation process of reading and converting the corresponding weights;

if yes, scheduling the operation process of parameter reading of the operator kernel with the longest running time and the operation process of reading and converting the corresponding weights to be completed in that small core processor.
6. The method of claim 5, wherein judging whether the small core processor can complete the operation process of parameter reading of the operator kernel with the longest running time and the operation process of reading and converting the corresponding weights comprises:

judging whether the time at which the small core processor would complete the operation process of parameter reading of the operator kernel with the longest running time and the operation process of reading and converting the corresponding weights is before the time at which the operation process of running that operator kernel starts; if so, scheduling the operation process of parameter reading of the operator kernel with the longest running time and the operation process of reading and converting the corresponding weights to that small core processor.
7. An apparatus for cold starting a neural network on an edge device, the apparatus comprising: the decision module and the execution module, wherein,
the decision module is used for splitting the operation of the neural network into the operation processes of a plurality of operator kernels, taking the operator kernels of the neural network as the unit; according to the operator kernel running order, scheduling the operation process of parameter reading of a first operator kernel, the operation process of reading and converting the corresponding weights, and the operation process of running the first operator kernel to a big core processor in the edge device for execution; and scheduling the operation processes of running the remaining operator kernels to the big core processor, and scheduling the operation processes of parameter reading of the remaining operator kernels and the operation processes of reading and converting the corresponding weights to small core processors selected in the edge device for execution;

and the execution module is used for controlling the big core processor and the small core processors to respectively execute the corresponding operation processes based on the scheduling.
8. An electronic device, comprising:
a processor;
a memory storing a program configured to implement a method of cold starting a neural network on an edge device as claimed in any one of claims 1 to 6 when executed by the processor.
9. A non-transitory computer-readable storage medium storing instructions that, when executed by a processor, cause the processor to perform a method of cold starting a neural network on an edge device according to any one of claims 1 to 6.
10. A computer program product comprising a computer program or instructions which, when executed by a processor, implements the steps of a method of cold starting a neural network on an edge device as claimed in any one of claims 1 to 6.
Priority Applications (1)
Application Number | Priority Date | Filing Date | Title |
---|---|---|---|
CN202310732822.8A CN116775149A (en) | 2023-06-20 | 2023-06-20 | Cold start method and device for neural network |
Publications (1)
Publication Number | Publication Date |
---|---|
CN116775149A (en) | 2023-09-19
Family
ID=88009385
Family Applications (1)
Application Number | Title | Priority Date | Filing Date |
---|---|---|---|
CN202310732822.8A | Cold start method and device for neural network (published as CN116775149A, pending) | 2023-06-20 | 2023-06-20
Country Status (1)
Country | Link |
---|---|
CN (1) | CN116775149A (en) |
- 2023-06-20: Application CN202310732822.8A filed in CN; published as CN116775149A; status: active, pending
Similar Documents
Publication | Publication Date | Title |
---|---|---|
US9715408B2 (en) | Data-aware workload scheduling and execution in heterogeneous environments | |
US8914805B2 (en) | Rescheduling workload in a hybrid computing environment | |
CN103069389B (en) | High-throughput computing method and system in a hybrid computing environment | |
US9342355B2 (en) | Joint optimization of multiple phases in large data processing | |
US9262220B2 (en) | Scheduling workloads and making provision decisions of computer resources in a computing environment | |
Bicer et al. | Time and cost sensitive data-intensive computing on hybrid clouds | |
US11663051B2 (en) | Workflow pipeline optimization based on machine learning operation for determining wait time between successive executions of the workflow | |
EP3286647A1 (en) | Placement of a calculation task on a functionally asymmetric processor | |
US11275561B2 (en) | Mixed precision floating-point multiply-add operation | |
US10949259B2 (en) | System and method of scheduling and computing resource allocation optimization of machine learning flows | |
Pasdar et al. | Hybrid scheduling for scientific workflows on hybrid clouds | |
Choi et al. | Data-locality aware scientific workflow scheduling methods in HPC cloud environments | |
Ahmed et al. | Heterogeneous energy-aware load balancing for industry 4.0 and IoT environments | |
CN111061485A (en) | Task processing method, compiler, scheduling server, and medium | |
WO2022100439A1 (en) | Workflow patching | |
CN108139929A (en) | For dispatching the task dispatch of multiple tasks and method | |
CN116775149A (en) | Cold start method and device for neural network | |
Singh et al. | Critical path based scheduling algorithm for workflow applications in cloud computing | |
CN115964164A (en) | Computer-implemented method, hardware accelerator, and storage medium | |
US11372677B1 (en) | Efficient scheduling of load instructions | |
WO2022159300A1 (en) | Branching operation for neural processor circuit | |
CN114490002A (en) | Data processing system, task scheduling method, device, chip and electronic equipment | |
Rao et al. | Scheduling data intensive workloads through virtualization on MapReduce based clouds | |
WO2024131170A1 (en) | Operator processing method and apparatus, and chip, computing device and storage medium | |
Monge et al. | Logos: Enabling local resource managers for the efficient support of data-intensive workflows within grid sites |
Legal Events
Date | Code | Title | Description |
---|---|---|---|
PB01 | Publication | |
SE01 | Entry into force of request for substantive examination | |