
CN115906935A - Parallel differentiable neural network architecture searching method - Google Patents

Parallel differentiable neural network architecture searching method

Info

Publication number
CN115906935A
Authority
CN
China
Prior art keywords
network
neural network
units
unit
basic
Prior art date
Legal status (The legal status is an assumption and is not a legal conclusion. Google has not performed a legal analysis and makes no representation as to the accuracy of the status listed.)
Granted
Application number
CN202211299553.2A
Other languages
Chinese (zh)
Other versions
CN115906935B (en)
Inventor
张秀伟
王文娜
尹翰林
邢颖慧
崔恒飞
张艳宁
Current Assignee (The listed assignees may be inaccurate. Google has not performed a legal analysis and makes no representation or warranty as to the accuracy of the list.)
Northwestern Polytechnical University
Original Assignee
Northwestern Polytechnical University
Priority date (The priority date is an assumption and is not a legal conclusion. Google has not performed a legal analysis and makes no representation as to the accuracy of the date listed.)
Filing date
Publication date
Application filed by Northwestern Polytechnical University filed Critical Northwestern Polytechnical University
Priority to CN202211299553.2A priority Critical patent/CN115906935B/en
Publication of CN115906935A publication Critical patent/CN115906935A/en
Application granted granted Critical
Publication of CN115906935B publication Critical patent/CN115906935B/en
Active legal-status Critical Current
Anticipated expiration legal-status Critical

Classifications

    • Y: GENERAL TAGGING OF NEW TECHNOLOGICAL DEVELOPMENTS; GENERAL TAGGING OF CROSS-SECTIONAL TECHNOLOGIES SPANNING OVER SEVERAL SECTIONS OF THE IPC; TECHNICAL SUBJECTS COVERED BY FORMER USPC CROSS-REFERENCE ART COLLECTIONS [XRACs] AND DIGESTS
    • Y02: TECHNOLOGIES OR APPLICATIONS FOR MITIGATION OR ADAPTATION AGAINST CLIMATE CHANGE
    • Y02D: CLIMATE CHANGE MITIGATION TECHNOLOGIES IN INFORMATION AND COMMUNICATION TECHNOLOGIES [ICT], I.E. INFORMATION AND COMMUNICATION TECHNOLOGIES AIMING AT THE REDUCTION OF THEIR OWN ENERGY USE
    • Y02D 10/00: Energy efficient computing, e.g. low power processors, power management or thermal management

Landscapes

  • Image Analysis (AREA)
  • Machine Translation (AREA)

Abstract

The invention discloses a parallel differentiable neural network architecture search method. The method first constructs a dual-path super network with binary gates; then relaxes the search space to be continuous using a sigmoid function; then optimizes the super network by gradient descent to obtain the optimal basic units, comprising a normal unit and a reduction unit; and finally stacks the obtained basic units into the required deep neural network, which is retrained until it converges. By designing a fast, parallel differentiable neural network architecture search method, the speed and performance of neural network architecture search are significantly improved.

Description

Parallel differentiable neural network architecture searching method
Technical Field
The invention belongs to the technical field of deep learning, and particularly relates to a parallel differentiable neural network architecture searching method.
Background
The rapid development of deep learning has established its dominant position in the field of artificial intelligence. Thanks to the diligent efforts of researchers, the performance of deep neural networks keeps improving. However, manually designing a neural network requires a continuous trial-and-error process and relies heavily on the design experience of experts, so creating neural network structures by hand is time-consuming and resource-intensive. To reduce manpower and cost, Neural Network Architecture Search (NAS) techniques have been proposed. NAS automatically searches for a neural network architecture by means of an algorithm to meet the requirements of different tasks, and has become a research hotspot in the field of automated machine learning.
The core of a NAS method is to construct a huge search space and then use an efficient search algorithm to explore that space, finding the optimal architecture under given training data and constraints. Early work was mainly based on reinforcement learning and evolutionary algorithms, which showed great potential in finding high-performance neural network architectures. However, NAS methods based on reinforcement learning and evolutionary algorithms usually carry a heavy computational burden, which seriously hinders the wide application and study of NAS. To reduce this burden, weight-sharing algorithms have been proposed: they formulate the search space as an over-parameterized super network and evaluate sampled architectures without additional optimization. By sharing weights, NAS is sped up by several orders of magnitude.
One particular type of weight-sharing method is the differentiable neural architecture search technique proposed in the document "DARTS: Differentiable architecture search". This technique first defines the search space as a super network stacked from basic units (a normal unit and a reduction unit) and finds the optimal neural network architecture by searching these basic units. DARTS then relaxes the discrete choice of operations into a weighted sum over a fixed set of operations, so the super network can be trained by a gradient-based bi-level optimization method. This gives NAS greater potential to explore optimal network architectures in a large architecture search space. Nevertheless, the prior art still has limitations: it must bear the huge computational cost brought by the enormous super network and its redundant search space, which restricts the wide application and study of NAS.
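For context, the DARTS relaxation described above (a softmax-weighted sum over a fixed set of operations on each edge) can be sketched as follows. This is a minimal PyTorch illustration under assumed names, not the patented method; the toy candidate operations are placeholders.

```python
import torch
import torch.nn as nn
import torch.nn.functional as F

class MixedOp(nn.Module):
    """DARTS-style edge: a softmax-weighted sum over all candidate operations."""
    def __init__(self, ops):
        super().__init__()
        self.ops = nn.ModuleList(ops)                     # candidate operations o in O
        self.alpha = nn.Parameter(torch.zeros(len(ops)))  # architecture parameters for this edge

    def forward(self, x):
        weights = F.softmax(self.alpha, dim=0)            # relax the discrete choice to continuous weights
        return sum(w * op(x) for w, op in zip(weights, self.ops))

# Example: one edge with three toy candidate operations on 16-channel feature maps.
edge = MixedOp([nn.Conv2d(16, 16, 3, padding=1),
                nn.MaxPool2d(3, stride=1, padding=1),
                nn.Identity()])
y = edge(torch.randn(2, 16, 32, 32))
```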
Disclosure of Invention
In order to overcome the defects of the prior art, the invention provides a parallel differentiable neural network architecture search method. The method first constructs a dual-path super network with binary gates; then relaxes the search space to be continuous using a sigmoid function; then optimizes the super network by gradient descent to obtain the optimal basic units, comprising a normal unit and a reduction unit; and finally stacks the obtained basic units into the required deep neural network, which is retrained until it converges. By designing a fast, parallel differentiable neural network architecture search method, the speed and performance of neural network architecture search are significantly improved.
The technical scheme adopted by the invention for solving the technical problem comprises the following steps:
Step 1: constructing a dual-path super network with binary gates;
the super network is formed by stacking L basic units;
the basic units comprise a normal unit and a reduction unit; both are directed acyclic graphs of 7 nodes, comprising 2 input nodes, 4 intermediate nodes and 1 output node, where an edge between two nodes represents a candidate operation, and the connection relationships between nodes differ between the normal unit and the reduction unit;
Step 1-1: let the operation pool be O, containing 8 basic operators: sep-conv-3×3, sep-conv-5×5, dil-conv-3×3, dil-conv-5×5, max-pool-3×3, avg-pool-3×3, skip-connection and none;
the operation pool O is decomposed into two operator subsets O by random sampling 1 And O 2 In which O is 1 And O 2 Satisfy | O 1 |=|O 2 |,|O 1 |+|O 2 I = O and
Figure BDA0003903563870000021
O 1 and O 2 Respectively used for constructing two sub-networks;
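A minimal sketch of how the operation pool might be split into two equal, disjoint operator subsets by random sampling, as described in step 1-1; the operator strings mirror the list above and the helper name is an assumption.

```python
import random

OPERATION_POOL = ["sep_conv_3x3", "sep_conv_5x5", "dil_conv_3x3", "dil_conv_5x5",
                  "max_pool_3x3", "avg_pool_3x3", "skip_connect", "none"]

def split_operation_pool(pool, seed=None):
    """Randomly split the pool into two disjoint halves O1 and O2 of equal size."""
    rng = random.Random(seed)
    shuffled = pool[:]          # copy so the original pool is untouched
    rng.shuffle(shuffled)
    half = len(shuffled) // 2
    return shuffled[:half], shuffled[half:]

O1, O2 = split_operation_pool(OPERATION_POOL)
assert len(O1) == len(O2) and not set(O1) & set(O2)   # |O1| = |O2|, O1 ∩ O2 = ∅
```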
two groups of channels are sampled from the input channels and adopted by the two sub-networks respectively, and the two sub-networks are finally merged into one by an addition operation; for two different nodes x_i and x_j in a basic unit of the super network, the information propagation from x_i to x_j is described as:
f^1_{i,j}(x_i) = Σ_{o∈O1} α^1_o · o(M^1_{i,j} * x_i) + (1 − M^1_{i,j}) * x_i
f^2_{i,j}(x_i) = Σ_{o∈O2} α^2_o · o(M^2_{i,j} * x_i) + (1 − M^2_{i,j}) * x_i
x_j = Σ_{i<j} ( f^1_{i,j}(x_i) + f^2_{i,j}(x_i) )
where x_i and x_j denote different nodes with 0 ≤ i < j ≤ 5; α^1_o and α^2_o denote the weights of the different operations in O1 and O2 respectively; M^1_{i,j} and M^2_{i,j} are two groups of channel sampling masks consisting only of 0s and 1s; M^k_{i,j} * x_i and (1 − M^k_{i,j}) * x_i denote the selected and unselected channels respectively; the two groups of selected channels M^1_{i,j} * x_i and M^2_{i,j} * x_i are adopted by the two operator subsets simultaneously;
the super network thus covers all candidate architectures in the form of two parallel paths;
Step 1-2: during training of the super network, binary gates are used to selectively activate each path to participate in training;
for two nodes x_i and x_j in a basic unit, the binary-gated super network is described as:
x_j = Σ_{i<j} ( gate1 · f^1_{i,j}(x_i) + gate2 · f^2_{i,j}(x_i) )
where gate1 and gate2 each take the value 0 or 1, excluding the case in which gate1 and gate2 are both 0; the binary gates take their values by random sampling, so that the corresponding paths are selectively activated to participate in training;
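A small sketch of the binary gating in step 1-2: (gate1, gate2) is drawn from the valid combinations, excluding (0, 0). Uniform sampling over the three remaining combinations is an assumption; the patent only states that the gates are set by random sampling.

```python
import random

def sample_binary_gates(rng=random):
    """Sample (gate1, gate2) in {0, 1} x {0, 1}, excluding the all-zero case."""
    gate1, gate2 = rng.choice([(0, 1), (1, 0), (1, 1)])
    return gate1, gate2

# During each training step, only the activated path(s) are computed,
# which is what reduces memory consumption.
gate1, gate2 = sample_binary_gates()
```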
Step 2: relaxing the search space to be continuous with a sigmoid function and redefining the two sub-networks as:
f^1_{i,j}(x_i) = Σ_{o∈O1} δ(α^1_o) · o(M^1_{i,j} * x_i) + (1 − M^1_{i,j}) * x_i
f^2_{i,j}(x_i) = Σ_{o∈O2} δ(α^2_o) · o(M^2_{i,j} * x_i) + (1 − M^2_{i,j}) * x_i
where δ(·) denotes the sigmoid function, calculated as:
δ(x) = 1 / (1 + e^(−x))
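To illustrate why step 2 uses the sigmoid: softmax normalizes each operator subset on its own, so the resulting weights of the two subsets are not directly comparable, whereas the sigmoid scores every operator independently. A hedged PyTorch sketch with illustrative values:

```python
import torch
import torch.nn.functional as F

alpha1 = torch.tensor([1.2, -0.3, 0.5, 0.1])   # architecture parameters for O1 (illustrative values)
alpha2 = torch.tensor([0.8,  0.9, 0.2, -1.0])  # architecture parameters for O2

# Softmax: each subset is normalized separately, so a weight of 0.4 in O1
# and 0.4 in O2 need not reflect the same operator strength.
w1_softmax, w2_softmax = F.softmax(alpha1, dim=0), F.softmax(alpha2, dim=0)

# Sigmoid: every operator gets an independent score in (0, 1),
# so operators from the two subsets can be compared on a common scale.
w1_sigmoid, w2_sigmoid = torch.sigmoid(alpha1), torch.sigmoid(alpha2)
```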
Step 3: optimizing the super network by gradient descent to obtain the optimal basic units, namely a normal unit and a reduction unit;
the optimal α is found by jointly optimizing the network weights w and the architecture parameters α, which determines the optimal basic units:
min_α L_val(w*(α), α)
s.t. w*(α) = argmin_w L_train(w, α)
where L_train is the training loss and L_val is the validation loss; cross-entropy loss is adopted for both;
after obtaining the architecture parameters α, the strongest operation on each edge is taken according to a one-hot selection:
o^(i,j) = argmax_{o∈O} α_o^(i,j)
and for each intermediate node of the basic unit, the two operations with the largest α values are selected as its inputs;
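A schematic sketch of the joint optimization in step 3 and the subsequent discretization. The first-order alternating scheme, the weight_parameters/arch_parameters accessors and the data loaders are assumptions for illustration, not the exact procedure of the invention.

```python
import torch

def search(supernet, train_loader, val_loader, epochs=50, lr_w=0.025, lr_alpha=3e-4):
    """Alternately update network weights w on the training set and
    architecture parameters alpha on the validation set (first-order scheme)."""
    # weight_parameters() / arch_parameters() are assumed accessors on the super network.
    w_opt = torch.optim.SGD(supernet.weight_parameters(), lr=lr_w, momentum=0.9)
    a_opt = torch.optim.Adam(supernet.arch_parameters(), lr=lr_alpha)
    loss_fn = torch.nn.CrossEntropyLoss()
    for _ in range(epochs):
        for (x_tr, y_tr), (x_val, y_val) in zip(train_loader, val_loader):
            # alpha step: minimize the validation loss L_val
            a_opt.zero_grad()
            loss_fn(supernet(x_val), y_val).backward()
            a_opt.step()
            # w step: minimize the training loss L_train
            w_opt.zero_grad()
            loss_fn(supernet(x_tr), y_tr).backward()
            w_opt.step()

def top2_inputs(alphas_per_edge):
    """For one intermediate node, keep the two incoming edges whose strongest
    operation has the largest alpha value (DARTS-style discretization).
    alphas_per_edge: list over incoming edges, each a list of per-operation alphas."""
    ranked = sorted(((max(a), i, a.index(max(a))) for i, a in enumerate(alphas_per_edge)),
                    reverse=True)
    return ranked[:2]   # [(alpha value, edge index, operation index), ...]
```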
Step 4: stacking the basic units obtained in step 3 to obtain the required deep neural network, and retraining this network until it converges.
Preferably, the super network optimized in step 3 is formed by stacking 6 normal units and 2 reduction units, where the 2 reduction units are located at 1/3 and 2/3 of the total network depth respectively.
Preferably, the deep neural network required in step 4 is a deep neural network for CIFAR-10, formed by stacking 20 basic units, comprising 2 reduction units and 18 normal units.
Preferably, the deep neural network required in step 4 is a deep neural network for ImageNet, formed by stacking 12 basic units, comprising 2 reduction units and 12 normal units, where the 2 reduction units are located at 1/3 and 2/3 of the total network depth respectively.
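A minimal sketch of the stacking rule used in the preferred configurations above, placing the reduction units at 1/3 and 2/3 of the depth; the cell factories are placeholders.

```python
def build_stack(num_cells, make_normal, make_reduction):
    """Place reduction cells at 1/3 and 2/3 of the depth, normal cells elsewhere."""
    reduction_positions = {num_cells // 3, 2 * num_cells // 3}
    return [make_reduction() if i in reduction_positions else make_normal()
            for i in range(num_cells)]

# e.g. 8 cells (6 normal + 2 reduction) for the search super network,
# or 20 cells (18 normal + 2 reduction) for the CIFAR-10 evaluation network.
search_cells = build_stack(8,  lambda: "normal", lambda: "reduction")
eval_cells   = build_stack(20, lambda: "normal", lambda: "reduction")
```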
The invention has the following beneficial effects:
the invention provides a rapid and parallel differentiable neural network architecture searching technology, which reduces the memory consumption in the training process and improves the neural network architecture searching speed by constructing a dual-path super network with a binary gate. At the same time, considering that softmax is used to select the best input for the intermediate nodes of the two operator subsets, unfair problems may be encountered. In order to solve the problem, a sigmoid function is introduced, and the performance of each operation operator is measured under the condition of no normalization, so that the performance of the neural network architecture search technology is ensured. The invention obviously improves the speed and the performance of the neural network architecture search.
Drawings
Fig. 1 is a diagram of implementation steps of a fast parallel search method for a differentiable neural network architecture according to the present invention.
Fig. 2 illustrates the search method of the present invention, taking one basic unit as an example.
FIG. 3 is a structural diagram of a basic unit searched on CIFAR-10 according to the present invention: wherein (a) is a normal unit and (b) is a reduction unit.
FIG. 4 is a diagram of the structure of the basic unit searched on ImageNet in the present invention: wherein (a) is a normal unit and (b) is a reduction unit.
Detailed Description
The invention is further illustrated with reference to the following figures and examples.
A parallel differentiable neural network architecture searching method comprises the following steps:
Step 1: constructing a dual-path super network with binary gates;
the super network is formed by stacking L basic units;
the basic units comprise a normal unit and a reduction unit; both are directed acyclic graphs of 7 nodes, comprising 2 input nodes, 4 intermediate nodes and 1 output node, where an edge between two nodes represents a candidate operation, and the connection relationships between nodes differ between the normal unit and the reduction unit;
Step 1-1: let the operation pool be O, containing 8 basic operators: sep-conv-3×3, sep-conv-5×5, dil-conv-3×3, dil-conv-5×5, max-pool-3×3, avg-pool-3×3, skip-connection and none;
the operation pool O is decomposed into two operator subsets O by random sampling 1 And O 2 In which O is 1 And O 2 Satisfy | O 1 |=|O 2 |,|O 1 |+|O 2 I = O and
Figure BDA0003903563870000051
O 1 and O 2 Respectively used for constructing two sub-networks;
two groups of channels are sampled from the input channels and adopted by the two sub-networks respectively, and the two sub-networks are finally merged into one by an addition operation; for two different nodes x_i and x_j in a basic unit of the super network, the information propagation from x_i to x_j is described as:
f^1_{i,j}(x_i) = Σ_{o∈O1} α^1_o · o(M^1_{i,j} * x_i) + (1 − M^1_{i,j}) * x_i
f^2_{i,j}(x_i) = Σ_{o∈O2} α^2_o · o(M^2_{i,j} * x_i) + (1 − M^2_{i,j}) * x_i
x_j = Σ_{i<j} ( f^1_{i,j}(x_i) + f^2_{i,j}(x_i) )
where x_i and x_j denote different nodes with 0 ≤ i < j ≤ 5; α^1_o and α^2_o denote the weights of the different operations in O1 and O2 respectively; M^1_{i,j} and M^2_{i,j} are two groups of channel sampling masks consisting only of 0s and 1s; M^k_{i,j} * x_i and (1 − M^k_{i,j}) * x_i denote the selected and unselected channels respectively; the two groups of selected channels M^1_{i,j} * x_i and M^2_{i,j} * x_i are adopted by the two operator subsets simultaneously;
the super network thus covers all candidate architectures in the form of two parallel paths;
Step 1-2: during training of the super network, binary gates are used to selectively activate each path to participate in training;
for two nodes x_i and x_j in a basic unit, the binary-gated super network is described as:
x_j = Σ_{i<j} ( gate1 · f^1_{i,j}(x_i) + gate2 · f^2_{i,j}(x_i) )
where gate1 and gate2 each take the value 0 or 1, excluding the case in which gate1 and gate2 are both 0; the binary gates take their values by random sampling, so that the corresponding paths are selectively activated to participate in training;
Step 2: relaxing the search space to be continuous with a sigmoid function and redefining the two sub-networks as:
f^1_{i,j}(x_i) = Σ_{o∈O1} δ(α^1_o) · o(M^1_{i,j} * x_i) + (1 − M^1_{i,j}) * x_i
f^2_{i,j}(x_i) = Σ_{o∈O2} δ(α^2_o) · o(M^2_{i,j} * x_i) + (1 − M^2_{i,j}) * x_i
where δ(·) denotes the sigmoid function, calculated as:
δ(x) = 1 / (1 + e^(−x))
the super network optimization process adopts a network formed by stacking 6 normal units and 2 reduction units, where the 2 reduction units are located at 1/3 and 2/3 of the total network depth respectively;
Step 3: optimizing the super network by gradient descent to obtain the optimal basic units, namely a normal unit and a reduction unit;
the optimal α is found by jointly optimizing the network weights w and the architecture parameters α, which determines the optimal basic units:
min_α L_val(w*(α), α)
s.t. w*(α) = argmin_w L_train(w, α)
where L_train is the training loss and L_val is the validation loss; cross-entropy loss is adopted for both;
after obtaining the architecture parameters α, the strongest operation on each edge is taken according to a one-hot selection:
o^(i,j) = argmax_{o∈O} α_o^(i,j)
and for each intermediate node of the basic unit, the two operations with the largest α values are selected as its inputs;
Step 4: stacking the basic units obtained in step 3 to obtain the required deep neural network, and retraining this network until it converges.
The deep neural network for CIFAR-10 is formed by stacking 20 basic units, comprising 2 reduction units and 18 normal units.
The deep neural network for ImageNet is formed by stacking 12 basic units, comprising 2 reduction units and 12 normal units, where the 2 reduction units are located at 1/3 and 2/3 of the total network depth respectively.
The specific embodiment is as follows:
the fast parallel differentiable neural network architecture searching method of the embodiment specifically comprises the following steps:
S1: constructing a super network, namely a dual-path super network with binary gates;
S2: relaxing the search space to be continuous using a sigmoid function;
S3: optimizing the super network by gradient descent to obtain the optimal basic units (a normal unit and a reduction unit);
S4: stacking the basic units obtained in S3 to obtain the required deep neural network, and retraining it until convergence.
By adopting this technical scheme, the memory consumption of architecture search is reduced by constructing a super network containing two parallel paths and a gating operation, so the search is accelerated. Using softmax to select the best inputs for the intermediate nodes across the two operator subsets may run into an unfairness problem. To solve this problem, a sigmoid function is introduced to relax the search space to be continuous; the sigmoid measures the strength of each operation without normalization. The invention significantly improves both the speed and the performance of neural network architecture search. In step S1, the super network is formed by stacking L basic units. The basic units comprise a normal unit and a reduction unit. Both are directed acyclic graphs of 7 nodes, comprising 2 input nodes, 4 intermediate nodes and 1 output node, where an edge between two nodes represents a candidate operation (e.g. a 3×3 convolution). The reduction unit uses convolutions with stride 2, so the spatial resolution of the feature map is halved. To increase the search speed, the invention constructs a dual-path super network with binary gates. The specific construction process of this dual-path super network is as follows:
s11: assuming that the whole operation pool is O, the operation pool O contains 8 basic operation operators, which are: sep-conv-3X 3, sep-conv-5X 5, dil-conv-3X 3, dil-conv-5X 5, max-pool-3X 3, avg-pool-3X 3, identity (skip-connection) and none. The operation pool O is decomposed into two smaller operator subsets O 1 And O 2 In which O is 1 And O 2 Satisfy | O 1 |=|O 2 |,|O 1 |+|O 2 I = O and
Figure BDA0003903563870000071
O 1 and O 2 Respectively for constructing two smaller sub-networks. In order to reduce the computational burden and to make the search space cover all possible architectures, the invention designs a dual-path super network with binary gates. First, the present invention employs a partial connection strategy. In particular, two groups of channels are sampled from the whole input channel, which are respectively employed by the two sub-networks. The two sub-networks are combined into one by addition, so that the super-network appears in a parallel fashion, as node x i To node x j For example, the super network may be described as follows:
f^1_{i,j}(x_i) = Σ_{o∈O1} α^1_o · o(M^1_{i,j} * x_i) + (1 − M^1_{i,j}) * x_i
f^2_{i,j}(x_i) = Σ_{o∈O2} α^2_o · o(M^2_{i,j} * x_i) + (1 − M^2_{i,j}) * x_i
x_j = Σ_{i<j} ( f^1_{i,j}(x_i) + f^2_{i,j}(x_i) )
where x_i and x_j denote different nodes with 0 ≤ i < j ≤ 5; α^1_o and α^2_o denote the weights of the different operations in O1 and O2 respectively; M^1_{i,j} and M^2_{i,j} are two groups of channel sampling masks consisting only of 0s and 1s; M^k_{i,j} * x_i and (1 − M^k_{i,j}) * x_i denote the selected and unselected channels respectively; and the two groups of selected channels are adopted by the two operator subsets simultaneously. This design brings an intuitive advantage: the super network can cover all possible architectures in the form of two parallel paths;
s12: in the process of training the super network, each path is selectively activated to participate in training by using binary gating. With node x i To node x j For example, the super network may be described as:
Figure BDA0003903563870000081
wherein the values of gate1 and gate2 are 0 or 1. In the actual operation, the case where gate1=0 and gate2=0 is excluded. Binary gating operation is performed in a random sampling mode to selectively activate corresponding paths to participate in training, so that the memory cost is greatly reduced.
In step S2, the sigmoid function is adopted to relax the search space to be continuous, and the two sub-networks are redefined as:
f^1_{i,j}(x_i) = Σ_{o∈O1} δ(α^1_o) · o(M^1_{i,j} * x_i) + (1 − M^1_{i,j}) * x_i
f^2_{i,j}(x_i) = Σ_{o∈O2} δ(α^2_o) · o(M^2_{i,j} * x_i) + (1 − M^2_{i,j}) * x_i
where δ(·) denotes the sigmoid function, calculated as:
δ(x) = 1 / (1 + e^(−x))
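Putting S11, S12 and S2 together, the sketch below shows one edge of the dual-path super network: two channel groups are sampled, each is processed by one operator subset with sigmoid-weighted operations, the unselected channels bypass the operators, and binary gates decide which paths contribute. The class, the channel sampling ratio and the way the two paths are merged are illustrative assumptions rather than the patent's exact implementation.

```python
import random
import torch
import torch.nn as nn

class DualPathEdge(nn.Module):
    """One edge of a dual-path super network with partial channels and binary gates (sketch)."""
    def __init__(self, ops1, ops2, channels, sample_ratio=0.25):
        super().__init__()
        self.ops1, self.ops2 = nn.ModuleList(ops1), nn.ModuleList(ops2)   # operator subsets O1, O2
        self.alpha1 = nn.Parameter(1e-3 * torch.randn(len(ops1)))         # architecture params for O1
        self.alpha2 = nn.Parameter(1e-3 * torch.randn(len(ops2)))         # architecture params for O2
        k = max(1, int(channels * sample_ratio))                          # assumed sampling ratio
        perm = torch.randperm(channels)
        mask = torch.zeros(2, channels)
        mask[0, perm[:k]] = 1.0            # M1: channels routed to the O1 path
        mask[1, perm[k:2 * k]] = 1.0       # M2: channels routed to the O2 path
        self.register_buffer("masks", mask)

    def _path(self, x, ops, alpha, mask):
        m = mask.view(1, -1, 1, 1)
        selected = x * m                                                   # selected channels go through the operators
        mixed = sum(torch.sigmoid(a) * op(selected) for a, op in zip(alpha, ops))
        return mixed + x * (1.0 - m)                                       # unselected channels bypass

    def forward(self, x):
        gate1, gate2 = random.choice([(0, 1), (1, 0), (1, 1)])             # never both zero
        out = 0.0
        if gate1:
            out = out + self._path(x, self.ops1, self.alpha1, self.masks[0])
        if gate2:
            out = out + self._path(x, self.ops2, self.alpha2, self.masks[1])
        return out

# Example usage with two toy operator subsets on 16-channel feature maps:
C = 16
ops1 = [nn.Conv2d(C, C, 3, padding=1), nn.Identity()]
ops2 = [nn.MaxPool2d(3, stride=1, padding=1), nn.Conv2d(C, C, 5, padding=2)]
edge = DualPathEdge(ops1, ops2, channels=C)
y = edge(torch.randn(2, C, 32, 32))
```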
In step S3, the super network is optimized: the optimal α is found by jointly optimizing the network weights w and the architecture parameters α, which determines the optimal basic units:
min_α L_val(w*(α), α)
s.t. w*(α) = argmin_w L_train(w, α)
where L_train is the training loss and L_val is the validation loss; cross-entropy loss is adopted for both. After obtaining the architecture parameters α, the strongest operation on each edge is taken according to a one-hot selection:
o^(i,j) = argmax_{o∈O} α_o^(i,j)
and for each intermediate node of the basic unit, the two operations with the largest α values are selected as its inputs.
In the super network optimization process of step S3, a network formed by stacking 6 normal units and 2 reduction units is adopted, where the 2 reduction units are located at 1/3 and 2/3 of the total network depth respectively.
In step S4, the deep neural network for CIFAR-10 is formed by stacking 20 basic units (2 reduction units and 18 normal units), and the deep neural network for ImageNet is constructed by stacking 12 basic units (2 reduction units and 12 normal units), where the 2 reduction units are located at 1/3 and 2/3 of the total network depth respectively.
The deep neural network constructed in step S4 is evaluated on the corresponding dataset to test its performance. On CIFAR-10, a classification accuracy of 97.47% is achieved using only 0.08 GPU-days. Compared with the result of the reference "DARTS" (97.24% accuracy in 1 GPU-day), both the search speed and the network performance are greatly improved; the search speed is 12.5 times higher. Moreover, because the search is fast, the method supports searching directly on ImageNet: using only 2.44 GPU-days on ImageNet, it achieves 76.1% Top-1 and 92.8% Top-5 classification accuracy.

Claims (4)

1. A parallel differentiable neural network architecture searching method is characterized by comprising the following steps:
Step 1: constructing a dual-path super network with binary gates;
the super network is formed by stacking L basic units;
the basic units comprise a normal unit and a reduction unit; both are directed acyclic graphs of 7 nodes, comprising 2 input nodes, 4 intermediate nodes and 1 output node, where an edge between two nodes represents a candidate operation, and the connection relationships between nodes differ between the normal unit and the reduction unit;
Step 1-1: let the operation pool be O, containing 8 basic operators: sep-conv-3×3, sep-conv-5×5, dil-conv-3×3, dil-conv-5×5, max-pool-3×3, avg-pool-3×3, skip-connection and none;
the operation pool O is decomposed into two operator subsets O by random sampling 1 And O 2 In which O is 1 And O 2 Satisfy | O 1 |=|O 2 |,|O 1 |+|O 2 I = O and
Figure FDA00039035638600000113
O 1 and O 2 Respectively used for constructing two sub-networks;
two groups of channels are sampled from the input channels and adopted by the two sub-networks respectively, and the two sub-networks are finally merged into one by an addition operation; for two different nodes x_i and x_j in a basic unit of the super network, the information propagation from x_i to x_j is described as:
f^1_{i,j}(x_i) = Σ_{o∈O1} α^1_o · o(M^1_{i,j} * x_i) + (1 − M^1_{i,j}) * x_i
f^2_{i,j}(x_i) = Σ_{o∈O2} α^2_o · o(M^2_{i,j} * x_i) + (1 − M^2_{i,j}) * x_i
x_j = Σ_{i<j} ( f^1_{i,j}(x_i) + f^2_{i,j}(x_i) )
where x_i and x_j denote different nodes with 0 ≤ i < j ≤ 5; α^1_o and α^2_o denote the weights of the different operations in O1 and O2 respectively; M^1_{i,j} and M^2_{i,j} are two groups of channel sampling masks consisting only of 0s and 1s; M^k_{i,j} * x_i and (1 − M^k_{i,j}) * x_i denote the selected and unselected channels respectively; the two groups of selected channels M^1_{i,j} * x_i and M^2_{i,j} * x_i are adopted by the two operator subsets simultaneously;
the super network thus covers all candidate architectures in the form of two parallel paths;
Step 1-2: during training of the super network, binary gates are used to selectively activate each path to participate in training;
for two nodes x_i and x_j in a basic unit, the binary-gated super network is described as:
x_j = Σ_{i<j} ( gate1 · f^1_{i,j}(x_i) + gate2 · f^2_{i,j}(x_i) )
where gate1 and gate2 each take the value 0 or 1, excluding the case in which gate1 and gate2 are both 0; the binary gates take their values by random sampling, so that the corresponding paths are selectively activated to participate in training;
Step 2: relaxing the search space to be continuous with a sigmoid function and redefining the two sub-networks as:
f^1_{i,j}(x_i) = Σ_{o∈O1} δ(α^1_o) · o(M^1_{i,j} * x_i) + (1 − M^1_{i,j}) * x_i
f^2_{i,j}(x_i) = Σ_{o∈O2} δ(α^2_o) · o(M^2_{i,j} * x_i) + (1 − M^2_{i,j}) * x_i
where δ(·) denotes the sigmoid function, calculated as:
δ(x) = 1 / (1 + e^(−x))
Step 3: optimizing the super network by gradient descent to obtain the optimal basic units, namely a normal unit and a reduction unit;
the optimal α is found by jointly optimizing the network weights w and the architecture parameters α, which determines the optimal basic units:
min_α L_val(w*(α), α)
s.t. w*(α) = argmin_w L_train(w, α)
where L_train is the training loss and L_val is the validation loss; cross-entropy loss is adopted for both;
after obtaining the architecture parameters α, the strongest operation on each edge is taken according to a one-hot selection:
o^(i,j) = argmax_{o∈O} α_o^(i,j)
and for each intermediate node of the basic unit, the two operations with the largest α values are selected as its inputs;
Step 4: stacking the basic units obtained in step 3 to obtain the required deep neural network, and retraining this network until it converges.
2. The parallel differentiable neural network architecture searching method according to claim 1, wherein the super network optimized in step 3 is formed by stacking 6 normal units and 2 reduction units, and the 2 reduction units are located at 1/3 and 2/3 of the total network depth respectively.
3. The method according to claim 1, wherein the deep neural network required in step 4 is a deep neural network for CIFAR-10, formed by stacking 20 basic units, comprising 2 reduction units and 18 normal units.
4. The parallel differentiable neural network architecture searching method according to claim 1, wherein the deep neural network required in step 4 is a deep neural network for ImageNet, formed by stacking 12 basic units, comprising 2 reduction units and 12 normal units, and the 2 reduction units are located at 1/3 and 2/3 of the total network depth respectively.
CN202211299553.2A 2022-10-23 2022-10-23 Parallel differentiable neural network architecture searching method Active CN115906935B (en)

Priority Applications (1)

Application Number Priority Date Filing Date Title
CN202211299553.2A CN115906935B (en) 2022-10-23 2022-10-23 Parallel differentiable neural network architecture searching method

Applications Claiming Priority (1)

Application Number Priority Date Filing Date Title
CN202211299553.2A CN115906935B (en) 2022-10-23 2022-10-23 Parallel differentiable neural network architecture searching method

Publications (2)

Publication Number Publication Date
CN115906935A true CN115906935A (en) 2023-04-04
CN115906935B CN115906935B (en) 2024-10-29

Family

ID=86490625

Family Applications (1)

Application Number Title Priority Date Filing Date
CN202211299553.2A Active CN115906935B (en) 2022-10-23 2022-10-23 Parallel differentiable neural network architecture searching method

Country Status (1)

Country Link
CN (1) CN115906935B (en)

Cited By (1)

* Cited by examiner, † Cited by third party
Publication number Priority date Publication date Assignee Title
CN117953296A (en) * 2024-02-01 2024-04-30 华东交通大学 Neural network architecture searching method for remote sensing image classification

Citations (3)

* Cited by examiner, † Cited by third party
Publication number Priority date Publication date Assignee Title
CN113361680A (en) * 2020-03-05 2021-09-07 华为技术有限公司 Neural network architecture searching method, device, equipment and medium
CN114359109A (en) * 2022-01-12 2022-04-15 西北工业大学 Twin network image denoising method, system, medium and device based on Transformer
WO2022121100A1 (en) * 2020-12-11 2022-06-16 华中科技大学 Darts network-based multi-modal medical image fusion method

Also Published As

Publication number Publication date
CN115906935B (en) 2024-10-29

Similar Documents

Publication Publication Date Title
CN110473592A (en) The multi-angle of view mankind for having supervision based on figure convolutional network cooperate with lethal gene prediction technique
CN115906935A (en) Parallel differentiable neural network architecture searching method
CN106647272A (en) Robot route planning method by employing improved convolutional neural network based on K mean value
CN111738477A (en) Deep feature combination-based power grid new energy consumption capability prediction method
CN113297429A (en) Social network link prediction method based on neural network architecture search
CN109800517A (en) Improved reverse modeling method for magnetorheological damper
CN110110447B (en) Method for predicting thickness of strip steel of mixed frog leaping feedback extreme learning machine
CN110972174A (en) Wireless network interruption detection method based on sparse self-encoder
CN113095479B (en) Multi-scale attention mechanism-based extraction method for ice underlying structure
CN117633512A (en) Reservoir porosity prediction method based on one-dimensional convolution gating circulation network
Tahmasebi et al. Comparison of optimized neural network with fuzzy logic for ore grade estimation
CN107256453B (en) Capillary quality forecasting method based on improved ELM algorithm
CN114897139B (en) Bearing fault diagnosis method for ordered stable simplified sparse quantum neural network
CN115953902A (en) Traffic flow prediction method based on multi-view space-time diagram convolution network
CN115062759A (en) Fault diagnosis method based on improved long and short memory neural network
CN115438784A (en) Sufficient training method for hybrid bit width hyper-network
CN116206304A (en) Tomato leaf disease identification method based on improved convolutional neural network
CN112801264B (en) Dynamic differentiable space architecture searching method and system
Hu et al. A classification surrogate model based evolutionary algorithm for neural network structure learning
Zhao et al. Optimizing radial basis probabilistic neural networks using recursive orthogonal least squares algorithms combined with micro-genetic algorithms
CN112860882A (en) Book concept front-rear order relation extraction method based on neural network
CN112348275A (en) Regional ecological environment change prediction method based on online incremental learning
CN118196600B (en) Neural architecture searching method and system based on differential evolution algorithm
Guo et al. Extracting fuzzy rules based on fusion of soft computing in oil exploration management
Li et al. Lasso regression based channel pruning for efficient object detection model

Legal Events

Date Code Title Description
PB01 Publication
PB01 Publication
SE01 Entry into force of request for substantive examination
SE01 Entry into force of request for substantive examination
GR01 Patent grant