CN116775149A - A method and device for cold start of neural network - Google Patents
A method and device for cold start of neural network
- Publication number
- CN116775149A (application CN202310732822.8A)
- Authority
- CN
- China
- Prior art keywords
- operator
- operation process
- reading
- core processor
- kernel
- Prior art date
- Legal status (The legal status is an assumption and is not a legal conclusion. Google has not performed a legal analysis and makes no representation as to the accuracy of the status listed.)
- Pending
Classifications
- G—PHYSICS
- G06—COMPUTING; CALCULATING OR COUNTING
- G06F—ELECTRIC DIGITAL DATA PROCESSING
- G06F9/00—Arrangements for program control, e.g. control units
- G06F9/06—Arrangements for program control, e.g. control units using stored programs, i.e. using an internal store of processing equipment to receive or retain programs
- G06F9/44—Arrangements for executing specific programs
- G06F9/445—Program loading or initiating
- G06F9/44505—Configuring for program initiating, e.g. using registry, configuration files
- G—PHYSICS
- G06—COMPUTING; CALCULATING OR COUNTING
- G06F—ELECTRIC DIGITAL DATA PROCESSING
- G06F9/00—Arrangements for program control, e.g. control units
- G06F9/06—Arrangements for program control, e.g. control units using stored programs, i.e. using an internal store of processing equipment to receive or retain programs
- G06F9/46—Multiprogramming arrangements
- G06F9/48—Program initiating; Program switching, e.g. by interrupt
- G06F9/4806—Task transfer initiation or dispatching
- G06F9/4843—Task transfer initiation or dispatching by program, e.g. task dispatcher, supervisor, operating system
- G06F9/4881—Scheduling strategies for dispatcher, e.g. round robin, multi-level priority queues
- G—PHYSICS
- G06—COMPUTING; CALCULATING OR COUNTING
- G06N—COMPUTING ARRANGEMENTS BASED ON SPECIFIC COMPUTATIONAL MODELS
- G06N3/00—Computing arrangements based on biological models
- G06N3/02—Neural networks
- G06N3/06—Physical realisation, i.e. hardware implementation of neural networks, neurons or parts of neurons
- G06N3/063—Physical realisation, i.e. hardware implementation of neural networks, neurons or parts of neurons using electronic means
Landscapes
- Engineering & Computer Science (AREA)
- Software Systems (AREA)
- Theoretical Computer Science (AREA)
- Physics & Mathematics (AREA)
- General Engineering & Computer Science (AREA)
- General Physics & Mathematics (AREA)
- Health & Medical Sciences (AREA)
- Life Sciences & Earth Sciences (AREA)
- Biomedical Technology (AREA)
- Biophysics (AREA)
- Neurology (AREA)
- Artificial Intelligence (AREA)
- Computational Linguistics (AREA)
- Data Mining & Analysis (AREA)
- Evolutionary Computation (AREA)
- General Health & Medical Sciences (AREA)
- Molecular Biology (AREA)
- Computing Systems (AREA)
- Mathematical Physics (AREA)
- Management, Administration, Business Operations System, And Electronic Commerce (AREA)
Abstract
The embodiments of the present application disclose a method and device for cold starting a neural network, using an edge device equipped with a multi-core processor that adopts a big–little core architecture. Taking the operator kernels of the neural network as the unit, the running of the neural network is split into the running of multiple operator kernels. Following the execution order of the operator kernels, the operation of reading the parameters of the first operator kernel, the operation of reading and converting its corresponding weights, and the operation of running the first operator kernel are scheduled onto the big-core processor. The operations of running the remaining operator kernels are likewise scheduled onto the big-core processor, while the operations of reading the parameters of the remaining operator kernels and of reading and converting their corresponding weights are scheduled onto selected little-core processors. In this way, the startup latency is reduced without affecting the inference accuracy of the neural network.
Description
Technical Field

The present application relates to the field of neural network technology, and in particular to a method and device for cold starting a neural network.
Background

With the development of artificial intelligence technology, neural networks have been applied to recognition tasks in numerous fields such as computer vision and natural language understanding. When applying neural networks to tasks such as image or semantic recognition, the deployment of neural networks has shifted from large data centers to edge devices in pursuit of low processing latency and data privacy. An edge device is a device that provides an entry point into the core network of an enterprise or service provider, such as a smartphone, an Internet of Things access device, a wearable device, or the access device of an autonomous vehicle. These edge devices have computing power, and it is necessary to make full use of that computing power for neural-network-based processing.

There are two notable trends in deploying neural networks on edge devices: 1) the number and variety of neural networks per edge device are growing explosively; 2) the structural complexity of neural networks is also increasing, for example with the deployment of deep neural networks, a deep neural network being a multi-layer unsupervised neural network. These two trends highlight the crowding of neural networks on resource-constrained edge devices. It is therefore impractical to preload all neural networks into the memory of an edge device and keep them waiting to run. In other words, the cold start of a neural network on an edge device — its loading, followed by initialization and execution — is becoming very important. The speed of a neural network's cold start on an edge device, just like that of a warm start, is crucial to the device's quality of service and user experience.

However, the cold start of a neural network, especially a deep neural network, on an edge device takes far longer than a warm start. Cold starts on edge devices often incur noticeable delays, and how to reduce this latency without affecting the inference accuracy of the neural network is an urgent problem to be solved.
Summary of the Invention

In view of this, embodiments of the present application provide a method for cold starting a neural network on an edge device. The method can reduce the startup latency without affecting the inference accuracy of the neural network.

Embodiments of the present application also provide a device for cold starting a neural network on an edge device, which can likewise reduce the startup latency without affecting the inference accuracy of the neural network.

The present application is implemented as follows:
In one embodiment of the present application, a method for cold starting a neural network on an edge device is provided. The method includes:

taking the operator kernels of the neural network as the unit, splitting the running of the neural network into the running of multiple operator kernels;

following the execution order of the operator kernels, scheduling the operation of reading the parameters of the first operator kernel, the operation of reading and converting its corresponding weights, and the operation of running the first operator kernel onto a big-core processor of the edge device;

scheduling the operations of running the remaining operator kernels onto the big-core processor, and scheduling the operations of reading the parameters of the remaining operator kernels and of reading and converting their corresponding weights onto little-core processors selected in the edge device;

the big-core processor and the little-core processors of the edge device each executing the corresponding operations according to the schedule.
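For illustration only, the following is a minimal Python sketch of the scheduling split described above, covering both the special handling of the first operator kernel and the one-by-one dispatch of the remaining kernels' loading work onto little cores. All identifiers (`OperatorKernel`, `Schedule`, the core ids) are hypothetical and are not part of the claimed implementation:

```python
from dataclasses import dataclass, field

@dataclass
class OperatorKernel:
    name: str  # e.g. "conv1", "fc2"; each kernel has three operations:
               # read_params -> read_convert_weights -> run

@dataclass
class Schedule:
    # maps a core id to an ordered list of (kernel name, operation) pairs
    plan: dict = field(default_factory=dict)

    def assign(self, core: str, kernel: OperatorKernel, op: str):
        self.plan.setdefault(core, []).append((kernel.name, op))

def build_cold_start_schedule(kernels, big_core, little_cores):
    """Split the network run into per-kernel operations and place them
    on big/little cores as the method describes."""
    sched = Schedule()
    first, rest = kernels[0], kernels[1:]
    # The first kernel's load, convert, and run all go to the big core,
    # since nothing else can overlap with them yet.
    for op in ("read_params", "read_convert_weights", "run"):
        sched.assign(big_core, first, op)
    for i, k in enumerate(rest):
        # Loading work is dispatched one by one over the little cores...
        little = little_cores[i % len(little_cores)]
        sched.assign(little, k, "read_params")
        sched.assign(little, k, "read_convert_weights")
        # ...while every kernel still runs on the big core.
        sched.assign(big_core, k, "run")
    return sched
```

A call such as `build_cold_start_schedule(kernels, "big0", ["little0", "little1"])` would return the per-core operation lists that the execution stage then consumes.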
In the above method, scheduling the operations of reading the parameters of the remaining operator kernels and of reading and converting their corresponding weights onto the selected little-core processors of the edge device includes:

following the execution order of the operator kernels, dispatching the operator kernels one by one onto the sequentially selected little-core processors, which are scheduled to perform the operation of reading the parameters of each assigned operator kernel and the operation of reading and converting its corresponding weights.
In the above method, the scheduling onto the selected little-core processors of the edge device for execution includes:

after a little-core processor receives the information scheduling an operator kernel to it, it initializes its own operation list and stores in that list the operation entries for reading the parameters of the operator kernel and for reading and converting the corresponding weights;

the little-core processor executing the corresponding operations according to the schedule includes:

the little-core processor executing the corresponding operations in the order given by its operation list.
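A minimal sketch of this per-core operation list, assuming a worker abstraction (`LittleCoreWorker`, `run_operation`) that is purely illustrative:

```python
import queue

class LittleCoreWorker:
    """A little-core worker that keeps its own operation list and
    executes entries strictly in order, as the method describes."""

    def __init__(self, core_id):
        self.core_id = core_id
        self.op_list = queue.Queue()  # the per-core operation list

    def on_kernel_scheduled(self, kernel_name):
        # On receiving the scheduling information, append the two
        # loading operations for this kernel to the list.
        self.op_list.put((kernel_name, "read_params"))
        self.op_list.put((kernel_name, "read_convert_weights"))

    def run(self, run_operation):
        # `run_operation` is a hypothetical stand-in for the real
        # parameter-read / weight-read-and-convert routines.
        while True:
            entry = self.op_list.get()
            if entry is None:  # sentinel: schedule finished
                break
            run_operation(self.core_id, *entry)
```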
In the above method, the scheduling onto the big-core processor of the edge device further includes:

for each operator kernel other than the first, computing the total time of performing, on the big-core processor, the operation of reading the kernel's parameters, the operation of reading and converting its corresponding weights, and the operation of running the kernel, and judging whether this total time is less than the total time of performing the parameter-reading operation on a little-core processor, performing the reading and conversion of the corresponding weights on the little-core processor, and running the kernel on the big-core processor;

if so, rescheduling the parameter-reading operation and the weight reading and conversion operations of the kernel from the little-core processor to the big-core processor;

otherwise, leaving the parameter-reading operation and the weight reading and conversion operations scheduled on the little-core processor.
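The time comparison above can be sketched as follows; `times` is a hypothetical table of measured or estimated durations, not an API of any real framework:

```python
def place_loading_ops(times):
    """Decide where a (non-first) kernel's loading operations go.
    `times["big"][op]` is the duration of op on the big core;
    `times["little"][op]` is the duration on a little core."""
    ops = ("read_params", "read_convert_weights")
    # Everything on the big core, including the run:
    all_big = sum(times["big"][op] for op in ops) + times["big"]["run"]
    # Loading on a little core, run on the big core:
    split = sum(times["little"][op] for op in ops) + times["big"]["run"]
    # Keep the cheaper placement.
    return "big" if all_big < split else "little"
```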
In the above method, the scheduling onto the selected little-core processors of the edge device further includes:

identifying the little-core processor that finishes its assigned operations first, and judging whether that little-core processor can additionally complete the parameter-reading operation and the weight reading and conversion operations of the operator kernel whose operations take the longest;

if so, scheduling the parameter-reading operation and the weight reading and conversion operations of that longest-running operator kernel onto this little-core processor.

In the above method, judging whether the little-core processor can additionally complete the parameter-reading operation and the weight reading and conversion operations of the longest-running operator kernel includes:

judging whether the little-core processor can finish the parameter-reading operation and the weight reading and conversion operations of the longest-running operator kernel before the time at which that kernel's run operation starts; if so, scheduling those operations onto this little-core processor.
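A sketch of this balancing step, under the assumption that each core object exposes an estimated `finish_time` and each kernel an estimated `load_time` and scheduled `run_start` (all hypothetical):

```python
def try_steal_longest(little_cores, longest_kernel):
    """Let the little core that finishes first also take over the
    loading operations of the longest-running kernel, but only if it
    can finish them before that kernel's run operation starts."""
    first_free = min(little_cores, key=lambda c: c.finish_time)
    deadline = longest_kernel.run_start
    if first_free.finish_time + longest_kernel.load_time <= deadline:
        first_free.assign_loading(longest_kernel)  # reschedule onto it
        return True
    return False
```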
In another embodiment of the present application, a device for cold starting a neural network on an edge device is provided. The device includes a decision module and an execution module, wherein:

the decision module is configured to split the running of the neural network into the running of multiple operator kernels, taking the operator kernels of the neural network as the unit; following the execution order of the operator kernels, to schedule the operation of reading the parameters of the first operator kernel, the operation of reading and converting its corresponding weights, and the operation of running the first operator kernel onto a big-core processor of the edge device; to schedule the operations of running the remaining operator kernels onto the big-core processor; and to schedule the operations of reading the parameters of the remaining operator kernels and of reading and converting their corresponding weights onto little-core processors selected in the edge device;

the execution module is configured to control the big-core processor and the little-core processors so that each executes the corresponding operations according to the schedule.
In yet another embodiment of the present application, an electronic device is provided, including:

a processor;

a memory storing a program configured, when executed by the processor, to implement the above method for cold starting a neural network on an edge device.

In yet another embodiment of the present application, a non-transitory computer-readable storage medium is provided. The medium stores instructions that, when executed by a processor, cause the processor to perform the above method for cold starting a neural network on an edge device.

In yet another embodiment of the present application, a computer program product is provided, including a computer program or instructions that, when executed by a processor, implement the steps of any of the above methods for cold starting a neural network on an edge device.
As can be seen above, embodiments of the present application use an edge device with a multi-core processor in a big–little core architecture. Taking the operator kernels of the neural network as the unit, the running of the neural network is split into the running of multiple operator kernels. Following the execution order of the operator kernels, the operation of reading the parameters of the first operator kernel, the operation of reading and converting its corresponding weights, and the operation of running the first operator kernel are scheduled onto the big-core processor; the operations of running the remaining operator kernels are scheduled onto the big-core processor, while the operations of reading the parameters of the remaining operator kernels and of reading and converting their corresponding weights are scheduled onto the selected little-core processors. In this way, the cold start of the neural network on the edge device is carried out through the cooperation of a big-core processor loop and a little-core processor loop, reducing the startup latency without affecting the inference accuracy of the neural network.
Brief Description of the Drawings

Figure 1 is a flow chart of the cold start process on an edge device provided by an embodiment of the present application;

Figure 2 is a flow chart of a method for cold starting a neural network on an edge device provided by an embodiment of the present application;

Figure 3 is a schematic diagram of executing the cold start of an entire neural network on an edge device using the pipeline scheduling strategy provided by an embodiment of the present application;

Figure 4 is a schematic structural diagram of a device for cold starting a neural network on an edge device provided by an embodiment of the present application;

Figure 5 is a schematic diagram of an electronic device provided by another embodiment of the present application.
Detailed Description

The technical solutions in the embodiments of the present application will be described clearly and completely below with reference to the accompanying drawings. Obviously, the described embodiments are only some, not all, of the embodiments of the present application. Based on the embodiments in this application, all other embodiments obtained by those of ordinary skill in the art without creative effort fall within the scope of protection of this application.

The terms "first", "second", "third", "fourth", and the like (if present) in the description, claims, and drawings of this application are used to distinguish similar objects and are not necessarily used to describe a particular order or sequence. It should be understood that data so used are interchangeable where appropriate, so that the embodiments of the application described herein can, for example, be practiced in orders other than those illustrated or described. Furthermore, the terms "including" and "having", and any variations thereof, are intended to cover non-exclusive inclusion. For example, a process, method, system, product, or device that includes a series of steps or units is not necessarily limited to those steps or units expressly listed, but may include other steps or units not expressly listed or inherent to such process, method, product, or device.

The technical solution of the present application is described in detail below with specific embodiments. The following specific embodiments may be combined with one another, and the same or similar concepts or processes may not be repeated in some embodiments.
During the cold start of a neural network, especially a deep neural network, on an edge device, the time-consuming bottlenecks mainly include: reading the original weights of the neural network from disk into the edge device's memory during loading; converting the original weights into an executable format during initialization; and feeding input data through the neural network built on the executable-format weights during execution. These processes together are also called the cold inference process of a deep neural network on an edge device.
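For illustration, a sketch of timing the three cold-inference phases named above; the three callables (`load_weights`, `convert_weights`, `run_model`) are hypothetical hooks standing in for a real inference framework:

```python
import time

def cold_inference(model_path, load_weights, convert_weights, run_model, x):
    """Time the three cold-inference phases the text identifies."""
    t0 = time.perf_counter()
    raw = load_weights(model_path)     # 1) loading: disk -> memory
    t1 = time.perf_counter()
    executable = convert_weights(raw)  # 2) init: raw -> executable format
    t2 = time.perf_counter()
    y = run_model(executable, x)       # 3) execution: first inference
    t3 = time.perf_counter()
    print(f"load {t1-t0:.3f}s, convert {t2-t1:.3f}s, run {t3-t2:.3f}s")
    return y
```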
Currently, there are two ways to mitigate cold inference latency. One is to share the original weights among neural networks, so that more networks can be packed into the edge device's memory and each network to be run later can be kept warm. However, this approach does not scale: as the number and variety of neural networks to be run on the edge device grows, the inference accuracy of the models drops significantly. The other is to estimate the execution time of each neural network and perform the reading and storing of the original weights in advance; however, when the edge device has to run multiple neural networks, the execution time of each one is hard to estimate. Both methods address the cold start latency problem indirectly, but they rely on changing the network structure or on external knowledge and are difficult to implement in practice.

Therefore, embodiments of the present application optimize the cold start latency of a neural network on an edge device in a direct way, without relying on any assumptions about the structure of the neural network or its execution environment, and with guaranteed zero accuracy loss. Embodiments of the present application use an edge device with a multi-core processor in a big–little core architecture, which may include multiple big-core processors and multiple little-core processors. Taking the operator kernels of the neural network as the unit, the running of the neural network is split into the running of multiple operator kernels. Following the execution order of the operator kernels, the operation of reading the parameters of the first operator kernel, the operation of reading and converting its corresponding weights, and the operation of running the first operator kernel are scheduled onto a big-core processor; the operations of running the remaining operator kernels are scheduled onto the big-core processor, while the operations of reading the parameters of the remaining operator kernels and of reading and converting their corresponding weights are scheduled onto the selected little-core processors.
Here, the little-core processors are selected as follows: following the execution order of the operator kernels, the related operations of the kernels are dispatched one by one onto the sequentially selected little-core processors, which are scheduled to complete the parameter-reading operation and the weight reading and conversion operations of each kernel.

In this way, the cold start of the neural network on the edge device is executed through the cooperation of the big-core processor loop and the little-core processor loop, reducing the startup latency without affecting the inference accuracy of the neural network.

The little-core processor loop in the embodiments of the present application is premised on balancing the workload across the little-core processors. When the little-core processors are used in turn to complete the parameter-reading operation and the weight reading and conversion operations of each operator kernel, the method further includes: identifying the little-core processor that finishes its operations first, and judging whether it can additionally complete the parameter-reading operation and the weight reading and conversion operations of the operator kernel whose operations take the longest; if so, those operations are scheduled onto that little-core processor. Here, the judgment is made by checking whether the little-core processor can finish the parameter-reading operation and the weight reading and conversion operations of the longest-running operator kernel before the time at which that kernel's run operation starts; if so, those operations are scheduled onto that little-core processor. In this way, the workload across the little-core processors can be balanced more evenly.

In the embodiments of the present application, during the little-core processor loop, when a little-core processor receives the information scheduling an operator kernel to it, it initializes its own operation list and stores in that list the operation entries for reading the kernel's parameters and for reading and converting the corresponding weights; when executing the schedule, the little-core processor executes the entries in the order given by the operation list.

While the cooperation of the edge device's big-core processor loop and little-core processor loop completes the cold start of the neural network, the present application balances the workload of the big-core and little-core processors so as to minimize the completion time of the big-core processor. The big-core processor loop further includes: for each operator kernel other than the first, computing the total time of performing, on the big-core processor, the parameter-reading operation, the weight reading and conversion operations, and the run operation of the kernel, and judging whether it is less than the total time of performing the parameter-reading operation on a little-core processor, performing the weight reading and conversion operations on the little-core processor, and running the kernel on the big-core processor; if so, the parameter-reading operation and the weight reading and conversion operations are rescheduled from the little-core processor to the big-core processor; otherwise, they remain scheduled on the little-core processor. In this way, the workload can be balanced to a large extent across all core processors.
In the embodiments of the present application, an operator kernel of a neural network refers to the basic mathematical computation unit used by each neural network layer. Operator kernels form the basic computation units of the neural network; different operator kernels correspond to the computation logic of different neural network layers, usually expressed as matrix computations. For example, a convolution layer is an operator, and the computation unit that executes the convolution layer is one operator kernel; the weighted summation in a fully connected layer is an operator, and the computation unit that executes the fully connected layer is another operator kernel.

In the embodiments of the present application, the big-core processor of the edge device is a performance-core (Performance Cores) central processing unit (CPU) or a graphics processing unit (GPU), meaning a high-performance processor core that serves as the main computing force of the edge device; the little-core processor of the edge device is an efficiency-core (Efficiency Cores) CPU, meaning a core designed for energy saving that is mainly used when the edge device is under low load.
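On Linux, binding work to performance or efficiency cores can be sketched with CPU affinity; which CPU ids correspond to big or little cores is platform-specific, and the ids used here are hypothetical (`os.sched_setaffinity` is Linux-only):

```python
import os
import threading

def run_on_cores(fn, core_ids, *args):
    """Run `fn` in a thread pinned to the given CPU ids, e.g. big
    (performance) or little (efficiency) cores on a big.LITTLE SoC."""
    def wrapper():
        os.sched_setaffinity(0, core_ids)  # pin the calling thread
        fn(*args)
    t = threading.Thread(target=wrapper)
    t.start()
    return t
```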
The embodiments of the present application are described in detail below.

As shown in Figure 1, Figure 1 is a flow chart of the cold start process on an edge device provided by an embodiment of the present application, comprising a decision stage and an execution stage. In the decision stage, the operator kernels of the neural network, the cached converted weights, and the scheduling strategy are selected, and the process enters the execution stage under the control of the scheduler. In the execution stage, input data is fed into the corresponding operator kernels for execution to obtain the execution result, where each operator kernel runs on its corresponding converted weights under the scheduler's dispatch. The execution of the operator kernels is referred to as inference in the figure.

The embodiments of the present application study the optimization space for cold starting a neural network on an edge device and identify the following three effective optimizations.
First optimization

Selecting the best operator kernel implementation from the neural network. In neural networks, especially deep neural networks, each operator kernel usually has many different implementations. Operator kernels are divided in this way to improve running speed, and the current kernel selection strategy is based entirely on the running speed of the neural network during a warm start. However, the kernel that runs fastest in a warm start does not necessarily perform best in a cold start; for example, some kernels run very fast but require a comparatively long time for the corresponding weight reading and conversion operations, which delays the cold start. In this case, the kernel selection strategy should be one that reduces cold start latency, thereby obtaining the best operator kernels for the neural network.
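A sketch of such a cold-start-aware selection; `implementations` is a hypothetical list of candidate kernels annotated with estimated per-phase times:

```python
def pick_kernel_for_cold_start(implementations):
    """Choose among candidate implementations of one operator.
    Warm-start selection would minimise `run` time only; for a cold
    start the load and convert costs count too."""
    return min(
        implementations,
        key=lambda impl: impl["read_params"]
                       + impl["read_convert_weights"]
                       + impl["run"],
    )
```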
Second optimization

Caching the converted weights of the corresponding operator kernel. By storing the converted weights on the edge device's disk so that they can be read and executed directly, the weight conversion operation can be bypassed. However, the converted weights may occupy more storage space and incur higher disk read/write costs. Whether to perform the reading and conversion of a kernel's original weights, or to read its already converted weights, requires a trade-off between disk I/O cost and computation cost.
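The trade-off can be sketched as a simple cost comparison; the byte counts, bandwidth, and conversion time are hypothetical estimates:

```python
def should_cache_converted(weights_info, disk_bw, convert_time):
    """Reading pre-converted weights skips the conversion but may
    read more bytes from disk. `weights_info` holds estimated sizes
    in bytes; `disk_bw` is bytes/second; `convert_time` is seconds."""
    read_raw = weights_info["raw_bytes"] / disk_bw + convert_time
    read_converted = weights_info["converted_bytes"] / disk_bw
    return read_converted < read_raw
```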
Third optimization

The best pipeline scheduling strategy: the pipelined execution of operator kernels and their binding to the cores of the edge device. The execution of an operator kernel comprises the operations of reading and converting the corresponding weights and the operation of running the kernel. A pipeline scheduling strategy can reduce the blocking time of transfers between disk and memory on the edge device. The pipeline scheduling strategy can also dispatch different operations to the asymmetric core processors of the edge device, for example between a central processing unit (CPU) and a graphics processing unit (GPU), or between big-core and little-core CPUs.
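A minimal two-stage pipeline in this spirit, with loader threads standing in for the little cores and the in-order runner standing in for the big core (`load_and_convert` and `run_kernel` are hypothetical hooks; core pinning is omitted for brevity):

```python
import queue
import threading

def pipelined_cold_start(kernels, load_and_convert, run_kernel):
    """Loader threads prepare each kernel's weights while the runner
    executes kernels in network order, overlapping I/O with compute."""
    ready = {k: queue.Queue(maxsize=1) for k in kernels}

    def loader(kernel):
        ready[kernel].put(load_and_convert(kernel))

    threads = [threading.Thread(target=loader, args=(k,)) for k in kernels]
    for t in threads:
        t.start()
    for k in kernels:             # run in network order
        weights = ready[k].get()  # blocks only if not yet loaded
        run_kernel(k, weights)
    for t in threads:
        t.join()
```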
Considering the three optimizations above together, their effects on reducing the cold start latency of a neural network on an edge device are tightly coupled. For example, selecting different operator kernels may call for different pipeline scheduling strategies. To set up a comprehensive and effective scheme for cold starting neural networks on edge devices, this application must address the following two challenges.

First, the search space is too large. The setup can be formulated as the problem of jointly choosing operator kernels, bypassing the weight conversion of selected kernels, and scheduling the many kernel operations to obtain an optimal schedule of operator kernels; this is a non-deterministic polynomial-time (NP) problem and hard to solve. Second, because the disk and memory capacity of edge devices is limited, different operations interfere with one another, further complicating the problem.

Therefore, this application adopts a heuristic best pipeline scheduling strategy inspired by several key observations of the cold start of neural networks on edge devices, for example: 1) operations behave differently on different core processors; 2) multi-threading across the big cores executes the run operations more efficiently than multi-threading on the other cores. Accordingly, this application uses the big-core processors of the edge device to run the operator kernels with multiple threads, while the parameter-reading operations and the weight reading and conversion operations are executed on the little-core processors. Embodiments of this application also exploit the fact that the operation of reading and converting a kernel's original weights has fewer storage-resource dependencies than the operation of reading its already converted weights, so the execution schedule of the operator kernels can be arranged easily.

Based on the above insights, embodiments of the present application set up an intuitive and effective best pipeline scheduling strategy, whose key is to balance the workload across the different core processors of the edge device so as to minimize the total execution time of the cold start. At the same time, during schedule planning, each operation in a kernel's execution is analyzed and rescheduled to improve its performance and thus enable better planning.
Figure 2 is a flow chart of a method for cold starting a neural network on an edge device provided by an embodiment of the present application. The method is applied to an edge device with a multi-core processor, and its specific steps include:

Step 201: taking the operator kernels of the neural network as the unit, split the running of the neural network into the running of multiple operator kernels;

Step 202: following the execution order of the operator kernels, schedule the operation of reading the parameters of the first operator kernel, the operation of reading and converting its corresponding weights, and the operation of running the first operator kernel onto a big-core processor of the edge device;

Step 203: schedule the operations of running the remaining operator kernels onto the big-core processor, and schedule the operations of reading the parameters of the remaining operator kernels and of reading and converting their corresponding weights onto little-core processors selected in the edge device;

Step 204: the big-core processor and the little-core processors of the edge device each execute the corresponding operations according to the schedule.
In the above method, scheduling the operations of reading the parameters of the remaining operator kernels and of reading and converting their corresponding weights onto the selected little-core processors of the edge device includes:

following the execution order of the operator kernels, dispatching the operator kernels one by one onto the sequentially selected little-core processors, which are scheduled to perform the parameter-reading operation and the weight reading and conversion operations of each kernel.

In the above method, the scheduling onto the selected little-core processors of the edge device for execution includes:

after a little-core processor receives the information scheduling an operator kernel to it, it initializes its own operation list and stores in that list the operation entries for reading the parameters of the operator kernel and for reading and converting the corresponding weights;

the little-core processor executing the corresponding operations according to the schedule includes:

the little-core processor executing the corresponding operations in the order given by its operation list.

In the above method, the scheduling onto the big-core processor of the edge device further includes:

for each operator kernel other than the first, computing the total time of performing, on the big-core processor, the operation of reading the kernel's parameters, the operation of reading and converting its corresponding weights, and the operation of running the kernel, and judging whether this total time is less than the total time of performing the parameter-reading operation on a little-core processor, performing the reading and conversion of the corresponding weights on the little-core processor, and running the kernel on the big-core processor;

if so, rescheduling the parameter-reading operation and the weight reading and conversion operations of the kernel from the little-core processor to the big-core processor;

otherwise, leaving the parameter-reading operation and the weight reading and conversion operations scheduled on the little-core processor.

In the above method, the scheduling onto the selected little-core processors of the edge device further includes:

identifying the little-core processor that finishes its assigned operations first, and judging whether that little-core processor can additionally complete the parameter-reading operation and the weight reading and conversion operations of the operator kernel whose operations take the longest;

if so, scheduling the parameter-reading operation and the weight reading and conversion operations of that longest-running operator kernel onto this little-core processor.

Here, judging whether the little-core processor can additionally complete the parameter-reading operation and the weight reading and conversion operations of the longest-running operator kernel includes:

judging whether the little-core processor can finish the parameter-reading operation and the weight reading and conversion operations of the longest-running operator kernel before the time at which that kernel's run operation starts; if so, scheduling those operations onto this little-core processor.
Using the above method, the cold start of the neural network on the edge device is executed on the basis of the little-core processor loop and the big-core processor loop, taking the operator kernels of the neural network as the unit, and the execution result is output. As shown in Figure 3, Figure 3 is a schematic diagram of executing the cold start of an entire neural network on an edge device using the pipeline scheduling strategy provided by an embodiment of the present application.
Figure 4 is a schematic structural diagram of a device for cold starting a neural network on an edge device provided by an embodiment of the present application. The device includes a decision module and an execution module, wherein:

the decision module is configured to split the running of the neural network into the running of multiple operator kernels, taking the operator kernels of the neural network as the unit; following the execution order of the operator kernels, to schedule the operation of reading the parameters of the first operator kernel, the operation of reading and converting its corresponding weights, and the operation of running the first operator kernel onto a big-core processor of the edge device; to schedule the operations of running the remaining operator kernels onto the big-core processor; and to schedule the operations of reading the parameters of the remaining operator kernels and of reading and converting their corresponding weights onto little-core processors selected in the edge device;

the execution module is configured to control the big-core processor and the little-core processors so that each executes the corresponding operations according to the schedule.
In another embodiment of the present application, a non-transitory computer-readable storage medium is provided. The medium stores instructions that, when executed by a processor, cause the processor to perform the method for cold starting a neural network on an edge device of the foregoing embodiments.
Figure 5 is a schematic diagram of an electronic device provided by another embodiment of the present application. As shown in Figure 5, another embodiment of the present application also provides an electronic device, which may include a processor 501, where the processor 501 is configured to perform the steps of the above method for cold starting a neural network on an edge device. As can also be seen from Figure 5, the electronic device provided by the above embodiment further includes a non-transitory computer-readable storage medium 502, on which a computer program and a neural network model are stored; when run by the processor 501, the computer program performs the steps of the above method for cold starting a neural network on an edge device.
Specifically, the non-transitory computer-readable storage medium 502 can be a general-purpose storage medium, such as a removable disk, a hard disk, FLASH, a read-only memory (ROM), an erasable programmable read-only memory (EPROM or flash memory), or a portable compact disc read-only memory (CD-ROM). When the computer program on the non-transitory computer-readable storage medium 502 is run by the processor 501, it causes the processor 501 to execute the steps of the above method for cold starting a neural network on an edge device.
In practical applications, the non-transitory computer-readable storage medium 502 may be included in the equipment/device/system described in the above embodiments, or it may exist on its own without being assembled into that equipment/device/system. The computer-readable storage medium carries one or more programs which, when executed, perform the steps of the above method for cold starting a neural network on an edge device.
Yet another embodiment of the present application further provides a computer program product, including a computer program or instructions that, when executed by a processor, implement the steps of the above method for cold starting a neural network on an edge device.
The flowcharts and block diagrams in the drawings of this application illustrate the architecture, functionality, and operation of possible implementations of systems, methods, and computer program products according to various embodiments disclosed in this application. In this regard, each block in the flowcharts or block diagrams may represent a module, a program segment, or a portion of code, which contains one or more executable instructions for implementing the specified logical functions. It should also be noted that in some alternative implementations, the functions noted in the blocks may occur in an order different from that noted in the figures. For example, two blocks shown in succession may in fact be executed substantially in parallel, or sometimes in the reverse order, depending on the functionality involved. It should also be noted that each block of the block diagrams or flowcharts, and combinations of blocks in the block diagrams or flowcharts, can be implemented by a dedicated hardware-based system that performs the specified functions or operations, or by a combination of dedicated hardware and computer instructions.
Those skilled in the art will understand that the features described in the various embodiments and/or claims of the present disclosure may be combined and/or merged in many ways, even if such combinations are not explicitly described in this application. In particular, without departing from the spirit and teachings of this application, the features described in the various embodiments and/or claims of this application may be combined and/or merged in many ways, and all such combinations fall within the scope disclosed by this application.
Specific examples have been used herein to explain the principles and implementations of the present application. The description of the above embodiments is only intended to help understand the method of the present application and its core ideas, and is not intended to limit the application. Those skilled in the art may make changes to the specific implementations and scope of application in accordance with the ideas, spirit, and principles of this application; any modifications, equivalent substitutions, improvements, and the like shall be included within the scope of protection of this application.
Claims (10)
Priority Applications (1)
Application Number | Priority Date | Filing Date | Title |
---|---|---|---|
CN202310732822.8A CN116775149A (en) | 2023-06-20 | 2023-06-20 | A method and device for cold start of neural network |
Applications Claiming Priority (1)
Application Number | Priority Date | Filing Date | Title |
---|---|---|---|
CN202310732822.8A CN116775149A (en) | 2023-06-20 | 2023-06-20 | A method and device for cold start of neural network |
Publications (1)
Publication Number | Publication Date |
---|---|
CN116775149A true CN116775149A (en) | 2023-09-19 |
Family
ID=88009385
Family Applications (1)
Application Number | Title | Priority Date | Filing Date |
---|---|---|---|
CN202310732822.8A Pending CN116775149A (en) | 2023-06-20 | 2023-06-20 | A method and device for cold start of neural network |
Country Status (1)
Country | Link |
---|---|
CN (1) | CN116775149A (en) |
- 2023-06-20: CN application CN202310732822.8A filed; published as CN116775149A (en), status active, pending
Legal Events
Date | Code | Title | Description |
---|---|---|---|
PB01 | Publication | |
SE01 | Entry into force of request for substantive examination | |