Abstract
The focus of this work is on the application of classical Model Order Reduction techniques, such as Active Subspaces and Proper Orthogonal Decomposition, to Deep Neural Networks. We propose a generic methodology to reduce the number of layers in a pre-trained network by combining the aforementioned techniques for dimensionality reduction with input-output mappings, such as Polynomial Chaos Expansion and Feedforward Neural Networks. The motivation behind compressing the architecture of an existing Convolutional Neural Network arises from its usage in embedded systems with specific storage constraints. The conducted numerical tests demonstrate that the resulting reduced networks can achieve a level of accuracy comparable to the original Convolutional Neural Network being examined, while also saving memory allocation. Our primary emphasis lies in the field of image recognition, where we tested our methodology using VGG-16 and ResNet-110 architectures against three different datasets: CIFAR-10, CIFAR-100, and a custom dataset.
1 Introduction and motivations
Neural networks are a widespread machine learning technique, increasingly employed in various fields such as computer vision [1,2,3], natural language processing [4, 5], robotics [6, 7], and speech recognition [8, 9]. The accuracy of such models is strictly related to the number of layers, neurons, and inputs [10,11,12]; therefore, to tackle more complex problems, these architectures are forced to grow deeper. While on the one hand this brings increasing precision, on the other hand the high number of degrees of freedom translates into a longer optimization step and, from a practical point of view, into a larger architecture to manage. The dimension of the network is rarely considered a bottleneck of this methodology, but the diffusion of neural networks in many engineering fields has led to their employment also in embedded systems [13,14,15], which typically offer limited hardware resources. Deep vision algorithms are indeed developed using workstations with high computational resources, posing a challenge when deploying them in industrial applications. The vision devices in which these nets need to be integrated are often characterized by restricted memory sizes and low CPU performance [16,17,18]. In these contexts the size of the architecture can thus become an additional constraint, requiring a reduction in the network’s degrees of freedom.
Finding the intrinsic dimension of neural networks is a very challenging task and, to the best of the authors’ knowledge, lacks rigorous theoretical proofs. Various methods have been proposed, including network pruning and sharing [19,20,21,22,23], low-rank matrix and tensor factorization [24,25,26,27], parameter quantization [28,29,30], and knowledge distillation [31,32,33,34]. In this contribution (see Fig. 1), we present an extension of the idea explored in [35], where the Active Subspace (AS) property and Polynomial Chaos Expansion (PCE) are exploited to provide a reduced and more robust version of the original network. While that work focused on analyzing the capability of AS to reduce deep architectures, we aim here to provide a generic framework for neural network reduction, investigating mathematical tools other than AS and PCE. Mimicking the procedure presented in [35], the original architecture is initially split into two cascading parts: the pre- and the post-model. We assume that the second one brings a negligible contribution to the final outcome, which allows us to approximate that part of the model without introducing a significant error. A response surface (or, in more general terms, an input-output mapping) is indeed built to fit the data, replacing the last layers of the network. This response surface may belong to a high-dimensional space, since its input dimension is equal to the dimension of the output features of the pre-model. Consequently, in order to keep the reduction computationally affordable, we also need to reduce the dimensionality of the pre-model outputs, which, it should be noted, are also the input parameters of the response surface. By combining all these ingredients, we obtain a reduced version of the network that only includes a few of the initial layers, but achieves a level of accuracy comparable to the full model. It is important to specify that the numerical experiments we are about to present exclusively involve Convolutional Neural Networks (CNNs), but the methodology can potentially be applied to other models as well.
In this contribution, we examine various tools for dimensionality reduction and response surface construction. In addition to AS and PCE, already tested in the aforementioned reference, we employ Proper Orthogonal Decomposition (POD) and Feedforward Neural Networks (FNN). The former, similar to AS, is a well-established technique for Model Order Reduction [36,37,38], which compresses the data by projecting it onto a lower-dimensional space. On the other hand, FNN is employed to construct the response surface as an alternative to PCE. The advantage of FNN over PCE is twofold: (i) the simplified input-output mapping (thanks to the low-dimensional space) allows for a FNN with few layers and neurons, further reducing the already minimal space demanded by the PCE method; (ii) from a programming perspective, the possibility to approximate a part of the neural network with another network makes the software integration easier, especially when the hosting system is embedded.
The article is organized as follows. Section 2 provides an algorithmic overview of all the numerical methods involved in the reduction framework. This includes an analysis of AS in Section 2.1.1, POD in Section 2.1.2, PCE in Section 2.2.1, and FNN in Section 2.2.2. In Section 3, we delve into the details of the framework used to reduce the neural networks. Section 4 is dedicated to presenting the results obtained by reducing benchmark CNNs designed for image recognition with the proposed methodology. We conduct this analysis using three different datasets during the initial learning step, investigating the dependency of the results on the original problem. Finally, in Section 5 we summarize the entire procedure and propose some future perspectives to enhance the framework.
2 Numerical tools
In this section we introduce all the techniques employed for the reduction of the network, in order to facilitate the comprehension of the framework discussed in Section 3.
2.1 Dimensionality reduction techniques
This subsection provides an algorithmic overview of the reduction methods examined in this contribution: the Active Subspace (AS) property and the Proper Orthogonal Decomposition (POD). Widely employed in the reduced order modeling community, such techniques are used here to decrease the dimensionality of the intermediate convolutional features; the specific details will be discussed in the next section. We just specify that, while this work concentrates on AS and POD, the framework is generic, allowing these two methods to be replaced with other dimensionality reduction techniques.
2.1.1 Active subspaces
The Active Subspaces (AS) method [39, 40] is a reduction tool used to identify important directions in the parameter space by exploiting the gradients of the function of interest. Such information allows the application of a rotational transformation to the domain, in order to obtain an approximation of the original function in a lower dimension. Let \(\varvec{\mu } = [\mu _1 \dots \mu _n]^T \in \mathbb {R}^{n}\) represent an n-dimensional variable with an associated probability density function \(\rho (\varvec{\mu })\), and let g be the function of interest, \(g(\varvec{\mu }): {\mathbb {R}}^n \rightarrow {\mathbb {R}}\). We assume here that g is scalar and continuous (for the vector-valued extension see [41, 42]). Starting from this, an uncentered covariance matrix \({\textbf{C}}\) of the gradient of g can be constructed by considering the average of the outer product of the gradient with itself:
$$\begin{aligned} {\textbf{C}} = \mathbb {E}[\nabla _{\varvec{\mu }} g \, \nabla _{\varvec{\mu }} g^T] = \int \nabla _{\varvec{\mu }} g \, \nabla _{\varvec{\mu }} g^T \rho \, d\varvec{\mu }, \end{aligned}$$(1)
where the symbol \(\mathbb {E}[\cdot ]\) denotes the expected value, and \(\nabla _{\varvec{\mu }} g \equiv \nabla g(\varvec{\mu })\). We assume the gradients are computed during the simulation; if not provided, they can be approximated with techniques such as local linear models, global models, finite differences, or Gaussian process regression [43,44,45]. Since \({\textbf{C}}\) is symmetric, it admits the following eigenvalue decomposition:
$$\begin{aligned} {\textbf{C}} = {\textbf{V}} {\varvec{\Lambda }} {\textbf{V}}^T, \end{aligned}$$(2)
where \({\textbf{V}}\) is the \(n \times n\) orthogonal matrix whose columns \(\{\textbf{v}^1, \dots , \textbf{v}^n \}\) are the normalized eigenvectors of \({\textbf{C}}\), whereas \({\varvec{\Lambda }}\) is a diagonal matrix containing the corresponding non-negative eigenvalues \(\lambda _i\), for \(i=1,\dots , n\), arranged in descending order.
We can decompose these two matrices as:
$$\begin{aligned} {\textbf{V}} = \left[ {\textbf{V}}_1 \;\; {\textbf{V}}_2 \right] , \qquad {\varvec{\Lambda }} = \text {diag}({\varvec{\Lambda }}_1, {\varvec{\Lambda }}_2), \qquad {\textbf{V}}_1 \in {\mathbb {R}}^{n \times n_{\text {AS}}}, \end{aligned}$$(3)
The space spanned by the columns of \({\textbf{V}}_1\) is called the active subspace of dimension \(n_{\textrm{AS}} < n\), whereas the inactive subspace is defined as the range of the remaining eigenvectors in \({{\textbf{V}}}_2\). Once we have defined these spaces, the input \({\varvec{\mu }\in \mathbb {R}^n}\) can be reduced to a low-dimensional vector \(\tilde{\varvec{\mu }}_1\in {\mathbb {R}}^{n_{\text {AS}}}\) using \({\textbf{V}}_1\) as a projection map. To be more precise, any \({\varvec{\mu }\in {\mathbb {R}}^n}\) can be expressed as follows, using the decomposition in Eq. 3 and the properties of \({\textbf{V}}\):
$$\begin{aligned} \varvec{\mu } = {\textbf{V}}{\textbf{V}}^T\varvec{\mu } = {\textbf{V}}_1{\textbf{V}}_1^T\varvec{\mu } + {\textbf{V}}_2{\textbf{V}}_2^T\varvec{\mu } = {\textbf{V}}_1\tilde{\varvec{\mu }}_1 + {\textbf{V}}_2\tilde{\varvec{\mu }}_2, \end{aligned}$$(4)
where the two new variables \(\tilde{\varvec{\mu }}_1\) and \(\tilde{\varvec{\mu }}_2\) are the active and inactive variables, respectively:
$$\begin{aligned} \tilde{\varvec{\mu }}_1 = {\textbf{V}}_1^T\varvec{\mu } \in {\mathbb {R}}^{n_{\text {AS}}}, \qquad \tilde{\varvec{\mu }}_2 = {\textbf{V}}_2^T\varvec{\mu } \in {\mathbb {R}}^{n-n_{\text {AS}}}. \end{aligned}$$(5)
For the actual computations of the AS, we have used the open-source Python package called ATHENA [46].
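To make the construction above concrete, the following minimal NumPy sketch estimates the active subspace from a set of precomputed gradient samples; the random test data, the function name, and the chosen dimension are purely illustrative, while the actual experiments rely on ATHENA.

```python
import numpy as np

def active_subspace(gradients, n_as):
    """Estimate the active subspace from gradient samples.

    gradients: (n_samples, n) array whose rows approximate grad g(mu^i).
    """
    # Monte Carlo estimate of the uncentered covariance C = E[grad g grad g^T]
    C = gradients.T @ gradients / gradients.shape[0]
    eigvals, eigvecs = np.linalg.eigh(C)          # eigenvalues in ascending order
    order = np.argsort(eigvals)[::-1]             # reorder in descending order
    eigvals, eigvecs = eigvals[order], eigvecs[:, order]
    V1 = eigvecs[:, :n_as]                        # active directions
    return eigvals, V1

# illustrative usage on random data standing in for real gradients
rng = np.random.default_rng(0)
grads = rng.standard_normal((1000, 50))
eigvals, V1 = active_subspace(grads, n_as=5)
mu = rng.standard_normal(50)
mu_active = V1.T @ mu                             # reduced (active) variable
```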
2.1.2 Proper orthogonal decomposition
In this section, we discuss the Proper Orthogonal Decomposition (POD) approach to Reduced Order Modeling [36,37,38, 47] for reducing the number of degrees of freedom of a parametric system. Specifically, we focus on the POD with interpolation (PODI) method [48,49,50].
Let \(\textbf{S} = [{\textbf{u}}^1\dots {\textbf{u}}^{n_S}]\) be the matrix of snapshots, i.e. the full order system outputs \({\textbf{u}}^i\in {\mathbb {R}}^N\). Once these solutions are collected, we aim to describe them as a linear combination of a few main structures, the POD modes, and thus project them onto a low dimensional space spanned by these modes. To calculate the POD modes, we need to compute the singular value decomposition (SVD) of the snapshots matrix \(\textbf{S}\):
$$\begin{aligned} \textbf{S} = \varvec{\Psi } \varvec{\Sigma } \varvec{\Phi }^T, \end{aligned}$$(6)
where the left-singular vectors, i.e. the columns of the unitary matrix \(\varvec{\Psi }\), are the POD modes, and the diagonal matrix \(\varvec{\Sigma }\) contains the corresponding singular values in decreasing order. Therefore, by selecting the first modes we retain only the most energetic ones and we can construct a reduced space onto which we project the high-fidelity solutions. As a result, we obtain:
$$\begin{aligned} \textbf{S}^{\text {POD}} = \varvec{\Psi }_{N_{\text {POD}}}^T \textbf{S}, \end{aligned}$$(7)
where \(\varvec{\Psi }_{N_{\text {POD}}}\) is the matrix containing only the first \(N_{\text {POD}}\) modes, and the columns of \(\textbf{S}^{\text {POD}}\) represent the reduced snapshot \(\tilde{{\textbf{u}}}^i\in {\mathbb {R}}^{N_{\text {POD}}}\).
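A minimal NumPy sketch of this projection step is reported below; the snapshot matrix is random and purely illustrative.

```python
import numpy as np

def pod_projection(S, n_pod):
    """Compute the first POD modes of a snapshot matrix and project onto them.

    S: (N, n_s) matrix with one full-order solution per column.
    """
    Psi, sigma, _ = np.linalg.svd(S, full_matrices=False)
    Psi_r = Psi[:, :n_pod]            # most energetic POD modes
    S_pod = Psi_r.T @ S               # reduced snapshots, shape (n_pod, n_s)
    return Psi_r, sigma, S_pod

# illustrative usage on a random snapshot matrix
rng = np.random.default_rng(1)
S = rng.standard_normal((4096, 200))  # e.g. flattened pre-model outputs
Psi_r, sigma, S_pod = pod_projection(S, n_pod=50)
```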
2.2 Input–output mapping
After reducing the dimensions of the outputs from the intermediate layer, we need to establish a correlation between these outputs and the final output of the original network. For example, in an image classification problem, this would involve determining the classes to which the image belongs. To achieve this, we construct an input–output mapping using the input dataset. The following subsections provide an algorithmic overview of the two methods that were explored to approximate this mapping: the Polynomial Chaos Expansion (PCE) [51] and the Feedforward Neural Network (FNN) [52].
2.2.1 Polynomial chaos expansion
The Polynomial Chaos Expansion (PCE) theory was initially proposed by Wiener in [53], demonstrating that a real-valued random variable \(X:{\mathbb {R}}^R\rightarrow {\mathbb {R}}\) can be decomposed in the following manner:
$$\begin{aligned} X(\varvec{\xi }) = \sum _{j=0}^{\infty } c_j {\varvec{\phi }}_j(\varvec{\xi }), \end{aligned}$$(8)
i.e. as an infinite sum of orthogonal polynomials weighted by unknown deterministic coefficients \(c_j\) [54]. The vector \(\varvec{\xi } = (\xi _1, \dots , \xi _R)\) represents the multi-dimensional random vector, where each element is associated with uncertain input parameters, while \({\varvec{\phi }}_j(\varvec{\xi })\) are multivariate orthogonal polynomials, that can be decomposed into products of one-dimensional orthogonal polynomials with different variables.
We can approximate the infinite sum in Eq. 8 by truncating it at the \((P+1)\)-th term, such that:
$$\begin{aligned} X(\varvec{\xi }) \approx \sum _{j=0}^{P} c_j {\varvec{\phi }}_j(\varvec{\xi }). \end{aligned}$$(9)
The number of unknown coefficients in this summation is given by \(P+1 = \frac{(p+R)!}{p!R!}\) [55], where p is the degree of the polynomial we are considering in the R-dimensional space.
When the parameters \(\xi _1, \dots , \xi _R\) are independent, \( {\varvec{\phi }}_j(\varvec{\xi })\) can be decomposed into products of one-dimensional functions:
$$\begin{aligned} {\varvec{\phi }}_j(\varvec{\xi }) = \prod _{k=1}^{R} \phi _k^{d_k}(\xi _k). \end{aligned}$$(10)
To determine the PCE, we need to find out the polynomial chaos expansion coefficients \(c_j\) for \(j = 0, \dots , P\), and the one-dimensional orthogonal polynomials \(\phi _k^{d_k},~ k=1,\dots ,R\), of degree \(d_k\).
Based on the work of Askey and Wilson [56], we can associate families of orthogonal polynomials with different input distributions. One possible choice is the Gaussian distribution with the related Hermite polynomials.
The estimation of the coefficients of PCE can then be carried out in different ways [57, 58]. One method involves using a projection technique based on the orthogonality of the polynomials. Another method, which we will describe, is a regression-based approach.
In order to determine the coefficients \(c_j\), we need to solve a minimization problem:
$$\begin{aligned} \min _{c_0, \dots , c_P} \frac{1}{N_{\text {PCE}}} \sum _{i=1}^{N_{\text {PCE}}} \left( \hat{X}^i - \sum _{j=0}^{P} c_j {\varvec{\phi }}_j(\varvec{\xi }^i) \right) ^2, \end{aligned}$$(11)
where \(N_{\text {PCE}}\) indicates the total number of realizations of the input vector we are considering, whereas \(\hat{X}^i\) represents the real output of the model for the i-th realization \(\varvec{\xi }^i\). To solve Eq. 11 we consider the following matrix:
$$\begin{aligned} {\varvec{\Phi }} = \begin{pmatrix} {\varvec{\phi }}_0(\varvec{\xi }^1) & \dots & {\varvec{\phi }}_P(\varvec{\xi }^1) \\ \vdots & \ddots & \vdots \\ {\varvec{\phi }}_0(\varvec{\xi }^{N_{\text {PCE}}}) & \dots & {\varvec{\phi }}_P(\varvec{\xi }^{N_{\text {PCE}}}) \end{pmatrix}. \end{aligned}$$(12)
Thus, the solution of Eq. 11 is computed using a least-squares optimization approach, as shown in Eq. 13:
$$\begin{aligned} \textbf{c} = ({\varvec{\Phi }}^T{\varvec{\Phi }})^{-1}{\varvec{\Phi }}^T \hat{\textbf{X}}, \qquad \hat{\textbf{X}} = [\hat{X}^1 \dots \hat{X}^{N_{\text {PCE}}}]^T. \end{aligned}$$(13)
It is important to emphasize that if the matrix \({\varvec{\Phi }}^T{\varvec{\Phi }}\) is ill-conditioned, as may occur, then the singular value decomposition method should be employed.
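The regression-based fit can be sketched as follows for a standard Gaussian input with a probabilists' Hermite basis; the helper functions, the synthetic model, and the dimensions are illustrative assumptions, while dedicated packages offer more complete PCE machinery. Note that the number of basis terms reproduces the count \(P+1 = \frac{(p+R)!}{p!R!}\) given above.

```python
import numpy as np
from itertools import product
from math import comb, factorial
from numpy.polynomial.hermite_e import hermeval

def multi_indices(R, p):
    # all multi-indices alpha with |alpha| <= p (total-degree truncation)
    return [a for a in product(range(p + 1), repeat=R) if sum(a) <= p]

def pce_design_matrix(xi, p):
    # xi: (N, R) samples of a standard Gaussian input, Hermite (He) basis
    N, R = xi.shape
    alphas = multi_indices(R, p)
    Phi = np.ones((N, len(alphas)))
    for j, alpha in enumerate(alphas):
        for k, d in enumerate(alpha):
            c = np.zeros(d + 1)
            c[d] = 1.0                               # coefficients of He_d
            Phi[:, j] *= hermeval(xi[:, k], c)
    return Phi

rng = np.random.default_rng(2)
R, p, N = 3, 2, 500
assert comb(p + R, R) == factorial(p + R) // (factorial(p) * factorial(R))  # P + 1 terms
xi = rng.standard_normal((N, R))
X_hat = np.sin(xi[:, 0]) + xi[:, 1] * xi[:, 2]       # synthetic model output
Phi = pce_design_matrix(xi, p)
coeffs, *_ = np.linalg.lstsq(Phi, X_hat, rcond=None)  # regression-based PCE coefficients
```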
2.2.2 Feedforward neural network
A Feedforward Neural Network (FNN), also known as multilayer perceptron, is a popular neural network model commonly used for function regression [52]. As depicted in Fig. 2, it mainly comprises an input layer, an output layer, and a certain number of hidden layers (Footnote 1), where the processing units composing them are called neurons. Each neuron is then characterized by a weight vector that determines the strength of its connections with neurons in the subsequent layer.
From a more technical perspective, let \(\tilde{{\textbf{x}}}\in {\mathbb {R}}^{n_{\text {in}}}\) represent the input vector and M denote the total number of hidden layers in the FNN. The output vector \({\textbf{h}}\in {\mathbb {R}}^{n_{\text {out}}}\) is obtained by propagating the input through the layers, where each neuron applies an activation function to the weighted sum of all the inputs it receives. The role of this activation function is to introduce non-linearity in the network. There are numerous options available [10, 60]; common choices include the ReLU, sigmoid (logistic), and radial activation functions.
To better understand the derivation of the general formula (15), we start by considering a FNN that comprises a single output and one hidden layer. In this scenario, the final output can be expressed as:
where \(\sigma \) is the activation function, \(W = \{w_i\}_{i=1}^{n_{\text {in}}}\) represents the weights of the net and b the bias (Footnote 2). Therefore, when considering M layers, the final output can be seen as a weighted sum of its inputs followed by the activation function, where each input can be rewritten using the same approach described in Eq. 14:
where \(n_m\), \(m=1,\dots ,M\), represents the number of neurons in layer m, whereas \(n_{\text {in}}\) and \(n_{\text {out}}\) are the neurons in the input and output layers respectively. \(W^m= (w_{ki}^{(m)})_{ki},~ k=1,\dots ,n_m, ~ i=1,\dots ,n_{m-1}\) indicates then the weight matrix related to layer m. Note that the first number in any weight’s subscript matches the index of the neuron in the next layer and the second number matches the index of the neuron in the previous layer.
Once we have constructed an FNN by choosing its architecture, we need to obtain a performing model for the desired task. One of the main characteristics of an FNN is indeed its ability to learn from observational data during the so-called training process. In this phase, the net acquires knowledge from our dataset by minimizing the loss function \(\mathcal {L}\) (Footnote 3):
where \({\textbf{h}}= \{h_j \}_{j=0}^{n_{\text {out}}}\) represents the expected output and \(\hat{{\textbf{h}}}= \hat{{\textbf{h}}}(\tilde{{\textbf{x}}}; W)= \{\hat{h}_j(\tilde{{\textbf{x}}}; W) \}_{j=0}^{n_{\text {out}}}\) is the prediction made by our FNN. To solve this minimization problem, the Backpropagation algorithm [62] is commonly employed. Consequently, the model’s parameters are optimized by adjusting the network’s weights using the following procedure:
$$\begin{aligned} W^{t+1} = W^{t} - \epsilon \nabla _W \mathcal {L}(W^{t}), \end{aligned}$$(17)
where \(\epsilon \) is the learning rate, which is appropriately chosen according to the problem under consideration. The parameter t represents the training epoch, i.e. a complete pass of the parameter update over the entire training dataset. The gradients required for the weight update in Eq. 17 are then computed using the chain rule.
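A compact PyTorch sketch of such a feedforward regressor with one hidden layer and its gradient-based training loop is given below; the layer sizes, the Softplus activation, the mean-squared-error loss, and the random data are illustrative choices.

```python
import torch
import torch.nn as nn

class FNN(nn.Module):
    """Feedforward network: input layer, one hidden layer, output layer."""
    def __init__(self, n_in, n_hidden, n_out):
        super().__init__()
        self.net = nn.Sequential(
            nn.Linear(n_in, n_hidden),
            nn.Softplus(),                       # activation sigma
            nn.Linear(n_hidden, n_out),
        )

    def forward(self, x):
        return self.net(x)

model = FNN(n_in=50, n_hidden=20, n_out=10)
optimizer = torch.optim.SGD(model.parameters(), lr=1e-2)  # learning rate epsilon
loss_fn = nn.MSELoss()

x = torch.randn(256, 50)             # placeholder inputs
h = torch.randn(256, 10)             # placeholder targets
for epoch in range(100):             # one pass over the (tiny) dataset per epoch
    optimizer.zero_grad()
    loss = loss_fn(model(x), h)
    loss.backward()                  # backpropagation
    optimizer.step()                 # gradient-descent weight update
```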
3 The reduced artificial neural networks
In this section, we provide a rigorous description of the proposed framework, which is summarized in Figs. 1 and 3. The primary objective of our framework is to reduce, in terms of dimensionality, a generic Artificial Neural Network (ANN). Indeed, the only assumption we make about the original network is that it consists of L layers.
Network splitting
In the beginning, the original network, denoted as \({\mathcal {ANN}}: {\mathbb {R}}^{n_0} \rightarrow {\mathbb {R}}^{n_L}\), is split into two distinct parts. The first l layers constitute the pre-model, while the last \(L-l\) layers form the so-called post-model. By describing the network as a composition of functions \({\mathcal {ANN}}\equiv f_L \circ f_{L-1} \circ \dots \circ f_1\), we can formally define the pre- and the post-model as follows:
$$\begin{aligned} {\mathcal {ANN}}^l_{\text {pre}} = f_l \circ f_{l-1} \circ \dots \circ f_1, \qquad {\mathcal {ANN}}^l_{\text {post}} = f_L \circ f_{L-1} \circ \dots \circ f_{l+1}, \end{aligned}$$(18)
where each function \(f_j: {\mathbb {R}}^{n_{j-1}} \rightarrow {\mathbb {R}}^{n_j}\), for \(j=1,\dots ,L\), represents one layer of the network (e.g. a convolutional, fully connected, batch-normalization, ReLU, or pooling layer). The original model can then be rewritten as:
$$\begin{aligned} {\mathcal {ANN}}({\textbf{x}}^0) = ({\mathcal {ANN}}^l_{\text {post}} \circ {\mathcal {ANN}}^l_{\text {pre}})({\textbf{x}}^0), \end{aligned}$$(19)
for any \(1\le l < L\) and \({\textbf{x}}^0 \in {\mathbb {R}}^{n_0}\).
As described in [35], the reduction of the network is achieved by approximating the post-model, which means that the pre-model is actually copied from the original network to the reduced one. Before proceeding with the algorithmic explanation of how the post-model is approximated, we specify that the index l, denoting the cut-off layer, is the only parameter of this initial step, and it plays an important role in the final outcome. This index indeed determines how many layers of the original network are retained in the reduced architecture, controlling, in a few words, how much information of the original network we are discarding. As described in [35], it is then chosen empirically based on considerations about the network and the dataset at hand, balancing the final accuracy and the compression ratio.
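In PyTorch, the splitting step can be sketched by slicing the list of layers of an existing model; the torchvision VGG-16 instance and the cut-off index below are illustrative and do not reproduce the exact indices used in the experiments.

```python
import torch
import torch.nn as nn
from torchvision.models import vgg16

full_net = vgg16(weights=None)                  # untrained VGG-16, for illustration only
layers = list(full_net.features.children())     # convolutional part as a flat list

l = 11                                           # hypothetical cut-off layer
pre_model = nn.Sequential(*layers[:l])           # copied into the reduced network
post_model = nn.Sequential(*layers[l:],          # part to be approximated
                           full_net.avgpool,
                           nn.Flatten(),
                           full_net.classifier)

x0 = torch.randn(1, 3, 32, 32)                   # CIFAR-sized input
x_l = pre_model(x0)                              # pre-model output x^(l)
y = post_model(x_l)                              # matches full_net(x0) up to the split
```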
Dimensionality reduction
As mentioned earlier, our goal is to project the output \({\textbf{x}}^{(l)}\) of the pre-model onto a lower-dimensional space using one of the following reduction techniques:
-
Active Subspaces: as described in Section 2.1.1 and in [35], we consider a function \(g_l\) defined by:
$$\begin{aligned} g_l({\textbf{x}}^{(l)}) = \text {loss} ({\mathcal {ANN}}^l_{\text {post}}({\textbf{x}}^{(l)})), \end{aligned}$$(20)
in order to extract the most important directions and determine the projection matrix used to reduce the pre-model output.
-
Proper Orthogonal Decomposition: as discussed in Section 2.1.2, the SVD (6) is exploited to compute the projection matrix \(\varvec{\Psi }_r\) and subsequently obtain the reduced solution
$$\begin{aligned} {\textbf{z}}= \varvec{\Psi }^T_r{\textbf{x}}^{(l)}. \end{aligned}$$(21)
It is important to emphasize that in order to apply these methodologies to the pre-model output, a flattening of \({\textbf{x}}^{(l)}\) should be carried out. These approaches are specifically based on flat-view matrix models, requiring the transformation of \({\textbf{x}}^{(l)}\) from a tensorial structure to a two-dimensional one.
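The flattening and projection of the pre-model output can be sketched as follows for the POD case; the shapes, names, and reduced dimension are illustrative.

```python
import torch

def reduce_features(pre_model, images, Psi_r=None, r=50):
    """Flatten the pre-model output and project it onto r POD modes."""
    with torch.no_grad():
        x_l = pre_model(images)                  # tensorial pre-model output
    X = x_l.flatten(start_dim=1)                 # flat view: (batch, n_features)
    if Psi_r is None:                            # build the projection from this batch
        U, _, _ = torch.linalg.svd(X.T, full_matrices=False)
        Psi_r = U[:, :r]                         # POD modes of the flattened features
    z = X @ Psi_r                                # rows of z are Psi_r^T x^(l)
    return z, Psi_r

# illustrative usage with a dummy convolutional pre-model
pre = torch.nn.Sequential(torch.nn.Conv2d(3, 8, 3, padding=1), torch.nn.ReLU())
z, Psi_r = reduce_features(pre, torch.randn(16, 3, 32, 32), r=10)
```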
Input-Output mapping
The final part of the reduced neural network is dedicated to classifying the output generated by the reduction layer. Two different techniques have been employed for this purpose:
-
the Polynomial Chaos Expansion, as introduced in Section 2.2.1. According to Eq. 9, the final output of the network, denoted as \({\textbf{y}}={\mathcal {ANN}}({\textbf{x}}^0)\in {\mathbb {R}}^{n_L}\), which represents the true response of the model, can be approximated as follows:
$$\begin{aligned} \hat{{\textbf{y}}}\approx \sum _{|{\varvec{\alpha }}|=0}^{p}{\textbf{c}}_{{\varvec{\alpha }}}{\varvec{\phi }}_{{\varvec{\alpha }}}({\textbf{z}}), \qquad |{\varvec{\alpha }}|=\alpha _1+\dots +\alpha _r, \end{aligned}$$(22)
where \({\varvec{\phi }}_{{\varvec{\alpha }}}({\textbf{z}})\) are the multivariate polynomial functions chosen based on the probability density function \(\rho \) associated with \({\textbf{z}}\). Therefore, the estimation of the coefficients \({\textbf{c}}_{{\varvec{\alpha }}}\) is carried out by solving the minimization problem (11):
$$\begin{aligned} \min _{{\textbf{c}}_{{\varvec{\alpha }}}}\frac{1}{N_{\text {train}}}\sum _{j=1}^{N_{\text {train}}}\left\Vert {\textbf{y}}^j-\sum _{|{\varvec{\alpha }}|=0}^{p}{\textbf{c}}_{{\varvec{\alpha }}}{\varvec{\phi }}_{{\varvec{\alpha }}}({\textbf{z}}^j)\right\Vert ^2. \end{aligned}$$(23)
-
a Feedforward Neural Network, as described in Section 2.2.2. In this case, the output of the reduction layer \({\textbf{z}}\) coincides with the network input. By applying Eq. 15, we can determine the final output \(\hat{{\textbf{y}}}\) of the reduced net (Footnote 4), which is given by:
$$\begin{aligned} \hat{y}_j&= \sum _{i=1}^{n_1}w_{ji}^{(2)}z^{(1)}_i \nonumber \\ &= \sum _{i=1}^{n_1}w_{ji}^{(2)} \sigma \left( \sum _{m=1}^{r} w_{im}^{(1)} z_{m}\right) , \qquad j = 1,\dots ,n_{\text {out}}, \end{aligned}$$(24)
where \(n_{\text {out}}\) corresponds to the number of categories that compose the dataset under consideration, and \(\sigma \) is the Softplus function:
$$\begin{aligned} \text {Softplus}({\textbf{x}}) = \frac{1}{\beta }\log (1+\exp (\beta {\textbf{x}})). \end{aligned}$$(25)
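Putting the pieces together, the reduced network is the copied pre-model followed by the (fixed) projection and the small input-output mapping; the PyTorch sketch below uses the FNN variant, with all names and sizes chosen for illustration rather than taken from the actual implementation.

```python
import torch
import torch.nn as nn

class ReducedNet(nn.Module):
    """Pre-model + fixed POD projection + FNN input-output mapping."""
    def __init__(self, pre_model, Psi_r, n_hidden, n_out):
        super().__init__()
        self.pre_model = pre_model
        self.register_buffer("Psi_r", Psi_r)      # projection matrix, not trained
        r = Psi_r.shape[1]
        self.mapping = nn.Sequential(
            nn.Linear(r, n_hidden),
            nn.Softplus(),
            nn.Linear(n_hidden, n_out),
        )

    def forward(self, x0):
        x_l = self.pre_model(x0)                  # copied first l layers
        z = x_l.flatten(start_dim=1) @ self.Psi_r # reduced features z
        return self.mapping(z)                    # logits y_hat

# illustrative usage with a random matrix standing in for the POD basis
pre = nn.Sequential(nn.Conv2d(3, 8, 3, padding=1), nn.ReLU())
Psi_r = torch.randn(8 * 32 * 32, 50)
net = ReducedNet(pre, Psi_r, n_hidden=20, n_out=10)
logits = net(torch.randn(4, 3, 32, 32))           # (4, 10) class scores
```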
3.1 Training phase
Once the reduced version of the network is constructed, we need to train it. Following [35], for the training phase of the reduced ANN, we employ the technique of knowledge distillation [31]. A knowledge distillation framework involves a large pre-trained teacher model, which is our full network, and a small student model, in our case \({\mathcal {ANN}}^{\text {red}}\). Therefore, the main goal is to efficiently train the student network under the guidance of the teacher network to achieve comparable or even superior performance.
Let \({\textbf{y}}\) be the vector of logits, i.e. the output of the last layer of a deep neural network. The probability \(p_i\) that the input belongs to the i-th class is determined by the softmax function:
$$\begin{aligned} p_i = \frac{\exp (y_i)}{\sum _{j} \exp (y_j)}. \end{aligned}$$(26)
As described in [31], a temperature factor T needs to be introduced in order to control the importance of each target:
$$\begin{aligned} p_i = \frac{\exp (y_i/T)}{\sum _{j} \exp (y_j/T)}, \end{aligned}$$(27)
where if \(T\rightarrow \infty \) all classes have the same probability, whereas if \(T\rightarrow 0\) the targets \(p_i\) become one-hot labels.
First, we define the distillation loss, which matches the logits of the teacher and the student model, as mentioned in [35]. The knowledge transfer from the teacher to the student is accomplished by mimicking the final prediction of the full net, using response-based knowledge. Therefore, in this case, the distillation loss [31, 32] is given by:
where \({\textbf{y}}_t\) and \({\textbf{y}}_s\) indicate the logits of the teacher and student networks, respectively, while \(\mathcal {L}_{\text {KL}}\) represents the Kullback-Leibler (KL) divergence loss [63]:
The student loss is then defined as the cross-entropy loss between the ground truth label and the logits of the student network [32]:
where \(\hat{{\textbf{y}}}\) is a ground truth vector, characterized by having only the component corresponding to the ground truth label on the training sample set to 1, while the other components are set to 0. \(\mathcal {L}_{\text {CE}}\) represents instead the cross entropy loss, which is described as follows:
As can be observed, both losses, Eqs. 28 and 30, use the same logits of the student model but with different temperatures. In the distillation loss, the temperature T is set to a value greater than 1 (\(T=\tau >1\)) while in the student loss, the temperature is set to 1 (\(T=1\)). Finally, the final loss is calculated as a weighted sum between the distillation loss and the student loss:
where \(\lambda \) is the regularization parameter, \({\textbf{x}}^0\) represents an input vector from the training set, and W coincides with the parameters of the student model.
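The combined objective can be sketched in PyTorch as follows; the temperature value, the \(T^2\) rescaling of the KL term, and the weighting convention are illustrative assumptions rather than the exact choices used in the experiments.

```python
import torch
import torch.nn.functional as F

def distillation_objective(y_s, y_t, labels, tau=4.0, lam=0.5):
    # y_s, y_t: student / teacher logits; labels: ground-truth class indices.
    # Distillation loss: KL divergence between softened predictions (T = tau > 1).
    soft_s = F.log_softmax(y_s / tau, dim=1)
    soft_t = F.softmax(y_t / tau, dim=1)
    loss_d = F.kl_div(soft_s, soft_t, reduction="batchmean") * tau ** 2
    # Student loss: cross entropy with the hard labels (T = 1).
    loss_s = F.cross_entropy(y_s, labels)
    # Weighted sum of the two contributions.
    return lam * loss_d + (1.0 - lam) * loss_s

# illustrative usage with random logits and labels
y_t, y_s = torch.randn(8, 10), torch.randn(8, 10)
labels = torch.randint(0, 10, (8,))
loss = distillation_objective(y_s, y_t, labels)
```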
4 Numerical results
In this section, we present a comparison between the results obtained using different reduction methods in terms of final accuracy, memory allocation, and procedure speed.
4.1 Neural network architectures
We used Convolutional Neural Networks (CNNs) as test networks; CNNs are a type of ANN commonly applied to image recognition problems [64, 65]. In the past decade, several CNN architectures have been introduced [11, 61] to address this problem, such as AlexNet, ResNet, Inception, and VGGNet.
As a starting point for testing our methods, we employed one of the VGG network architectures, specifically VGG-16 [66]. As shown in Fig. 4, this architecture consists of the following components:
-
13 convolutional blocks. Each block includes a convolutional layer followed by a non-linear layer, where ReLU is used as the activation function.
-
5 max-pooling layers,
-
3 fully-connected layers.
The ConvNet used in our study is called VGG-16, as it is composed of a total of 16 layers with tunable parameters. Out of these 16 layers, 13 are convolutional layers, and the remaining 3 are fully connected layers.
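For orientation, the standard VGG-16 configuration from [66] can be written compactly as a list of channel sizes; the snippet below simply checks the layer counts listed above.

```python
# Standard VGG-16 configuration: integers are output channels of 3x3
# convolutional blocks (conv + ReLU), 'M' marks a 2x2 max-pooling layer.
cfg = [64, 64, 'M', 128, 128, 'M', 256, 256, 256, 'M',
       512, 512, 512, 'M', 512, 512, 512, 'M']

n_conv = len([v for v in cfg if v != 'M'])   # 13 convolutional blocks
n_pool = cfg.count('M')                      # 5 max-pooling layers
n_fc = 3                                     # fully-connected layers of the classifier
assert n_conv + n_fc == 16                   # hence the name VGG-16
```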
For comparison, we also tested our methodology on ResNet [67], and in particular on ResNet-110, as done in [35]. As the name suggests, ResNet-110 comprises a total of 110 layers. These layers are divided into 3 groups, each containing 18 basic residual blocks. We recall that these blocks consist of two convolutional layers, each followed by batch normalization, and a skip/shortcut connection that adds the input to the output of the block, as sketched below.
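A generic PyTorch sketch of such a basic residual block is reported here; it is a textbook formulation, not the exact implementation used in the experiments.

```python
import torch
import torch.nn as nn
import torch.nn.functional as F

class BasicBlock(nn.Module):
    """Two 3x3 convolutions with batch normalization and a skip connection."""
    def __init__(self, in_channels, out_channels, stride=1):
        super().__init__()
        self.conv1 = nn.Conv2d(in_channels, out_channels, 3, stride=stride, padding=1, bias=False)
        self.bn1 = nn.BatchNorm2d(out_channels)
        self.conv2 = nn.Conv2d(out_channels, out_channels, 3, stride=1, padding=1, bias=False)
        self.bn2 = nn.BatchNorm2d(out_channels)
        # 1x1 convolution on the shortcut when the feature-map shape changes
        self.shortcut = nn.Identity()
        if stride != 1 or in_channels != out_channels:
            self.shortcut = nn.Sequential(
                nn.Conv2d(in_channels, out_channels, 1, stride=stride, bias=False),
                nn.BatchNorm2d(out_channels),
            )

    def forward(self, x):
        out = F.relu(self.bn1(self.conv1(x)))
        out = self.bn2(self.conv2(out))
        return F.relu(out + self.shortcut(x))     # add the input to the block output

block = BasicBlock(16, 32, stride=2)
out = block(torch.randn(1, 16, 32, 32))           # -> (1, 32, 16, 16)
```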
4.2 Dataset
For training and testing our networks we used the following datasets (Footnote 5); a minimal loading sketch for the public benchmarks is given after the list:
-
CIFAR-10 dataset [68], a computer-vision dataset used for object recognition. It comprises 60000 color images of size \(32\times 32\), which are divided into 10 non-overlapping classes: airplane, automobile, bird, cat, deer, dog, frog, horse, ship, and truck.
-
Custom dataset, composed of 3448 color images of size \(32\times 32\), organized in 4 classes: 3 non-overlapping classes and a mixed one, characterized by pictures with objects of different categories present at the same time.
-
CIFAR-100 dataset [68], another benchmark computer-vision dataset for object recognition. It consists of 60000 color images of size \(32\times 32\), divided into 100 classes, with each class containing 600 images.
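For reference, a minimal torchvision loader for the two public benchmarks is reported below (the custom dataset is proprietary and not shown); the normalization statistics and batch size are common illustrative choices, not necessarily those used in our experiments.

```python
import torch
from torchvision import datasets, transforms

transform = transforms.Compose([
    transforms.ToTensor(),
    transforms.Normalize((0.4914, 0.4822, 0.4465),   # commonly used CIFAR means
                         (0.2470, 0.2435, 0.2616)),  # and standard deviations
])

train_set = datasets.CIFAR10(root="./data", train=True, download=True, transform=transform)
test_set = datasets.CIFAR10(root="./data", train=False, download=True, transform=transform)
# datasets.CIFAR100 follows the same interface for the 100-class benchmark

train_loader = torch.utils.data.DataLoader(train_set, batch_size=128, shuffle=True)
test_loader = torch.utils.data.DataLoader(test_set, batch_size=128, shuffle=False)
```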
4.3 Software and hardware configuration
To implement and construct the reduced version of the convolutional neural networks described in the previous sections, we utilized PyTorch [69] as our development environment. We also employed the open-source Python library SciPy [70] for scientific computing and the open-source Python package ATHENA [46] for the actual computation of the active subspaces.
Regarding the hardware configuration, we ran all experiments involving VGG-16, except for the CIFAR-100 dataset, on the CPU. All other tests were performed using an NVIDIA GPU. This decision was influenced by the availability of hardware resources during the development and testing phases for the selected architectures.
4.4 Results VGG-16
We now present the results of the reduced networks constructed starting from VGG-16 and based on CIFAR-10, CIFAR-100 and our custom dataset. First of all, the original network VGG-16 has been trained (Footnote 6) on each of the presented datasets. A training phase of 60 epochs was sufficient for CIFAR-10 and the custom case, whereas a longer training of 300 epochs was required for CIFAR-100. From Tables 2 and 3, it can be seen that at the end of these learning processes VGG-16 achieves good accuracy: \(77.98\%\) for CIFAR-10 and \(95.65\%\) for the custom dataset. Table 4 instead reports the accuracy achieved in the CIFAR-100 case, presenting the Top-1 and Top-5 scores, as done in [35]. It can be observed that the increase in the number of classes has resulted in a lower Top-1 value, as well as the need for longer training.
We report the results obtained with different reduced versions of VGG-16, constructed following the steps of Algorithm 1 and using several cut-off layers (Footnote 7) l, as reported in [35]: 5, 6, and 7 for CIFAR-10 and the custom case, and 7, 8 and 9 for CIFAR-100. We remark that in the case of dimensionality reduction using the Active Subspaces technique, we employed the Frequent Directions method [71], implemented within ATHENA, to compute the AS. We set the parameter r, representing the dimension of the reduced space, to 50 for both AS and POD, in accordance with [35], where considerations on the structural analysis of VGG-16 can be found.
When a FNN was employed to classify our images, we trained it for 500 epochs with the dataset at hand before re-training the entire reduced net. In Table 1, we provide a summary of the results obtained by training a reduced net using various FNN architectures, i.e. different numbers of hidden layers and a constant number of hidden neurons within each hidden layer. Specifically, we compare the storage requirements of the FNN with the accuracy of the reduced network POD+FNN at epoch 0, i.e. after its initialization, and at epoch 10, i.e. after the re-training of the whole reduced net. From the results, it can be observed that increasing the number of hidden layers and hidden neurons does not result in improved accuracy. Based on accuracy and memory allocation considerations (refer to Table 1 for details), we opted for the following architectures:
-
CIFAR-10: FNN with 50 input neurons, 10 output neurons, and one hidden layer with 20 hidden neurons.
-
Custom Dataset: FNN with 50 input neurons, 4 output neurons, and one hidden layer with 10 hidden neurons.
-
CIFAR-100: FNN with 50 input neurons, 100 output neurons, and one hidden layer with 70 hidden neurons.
After completing these steps, the reduced neural network was re-trained for a total of 10 epochs on CIFAR-10 and the custom dataset, and for 20 epochs on the CIFAR-100 dataset. The outcomes of this training process are summarized in Tables 2, 3, and 4, presenting a comparison among the various reduced neural networks in terms of accuracy (both before and after the final training, or using Top-1 and Top-5 scores), memory storage requirements, and the time needed for initialization and training of each reduced network. As mentioned earlier, we provide results for each reduced network, namely AS+PCE, AS+FNN, and POD+FNN, using three different cut-off layers: 5, 6, and 7, or 7, 8, and 9, depending on the case.
In our context, which specifically involves working with a custom dataset, understanding memory allocation is crucial, because we aim to include a CNN in an embedded system with specific storage constraints. Tables 2, 3 and 4 demonstrate that the memory allocation required for the reduced nets is lower than that of the original VGG-16. For instance, the checkpoint file (Footnote 8) needed to store the full net occupies approximately 56 MB, whereas that of its reduced versions is less than 10 MB in most cases. It is important to note that for CIFAR-100, opting for higher cut-off values results in a larger storage requirement due to the increased pre-model size; this emphasizes the significant role the cut-off index plays in the final model compression. Additionally, it is worth mentioning that replacing PCE with an FNN leads to a substantial memory saving of two orders of magnitude: \(10^{-4}\) as opposed to \(10^{-2}\).
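Storage figures of this kind can be measured directly from the size of the saved state dictionary; a small illustrative sketch follows, with a stand-in model in place of the actual reduced network.

```python
import os
import torch
import torch.nn as nn

# illustrative model standing in for the reduced input-output mapping
model = nn.Sequential(nn.Linear(50, 20), nn.Softplus(), nn.Linear(20, 10))
torch.save(model.state_dict(), "reduced_mapping.pt")        # weights-only checkpoint
size_mb = os.path.getsize("reduced_mapping.pt") / 1e6
print(f"checkpoint size: {size_mb:.3f} MB")
```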
Table 2 shows that in the case of POD+FNN, the net does not require an additional training with the entire dataset. This is because, after the initialization (epoch 0), the network’s accuracy is already acceptable, and for index 7, it is already high. Additionally, we observe that all proposed reduced nets require less time to achieve well-performing models. This is reasonable since the compression in size is strictly related to the decrease in the number of CNN parameters. However, while this holds true for CIFAR-10 and the custom dataset, the increased number of classes, and thus complexity, in CIFAR-100 necessitates longer training time.
Nevertheless, an interesting aspect of this reduction methodology is that a pre-trained starting model is not strictly necessary to obtain an exploitable net, as summarized in Table 5. We provide the results obtained for our proposed reduced net POD+FNN(7), constructed without starting from the pre-trained VGG-16. It can be inferred that, with all datasets, POD+FNN achieves a level of accuracy comparable to the previous cases where a pre-trained VGG-16 was employed. For CIFAR-10 and the custom dataset, we used the same number of epochs as in the pre-trained case, whereas for CIFAR-100 twice the number of epochs was required to achieve the same level of accuracy. The immediate consequence is a saving of the time needed to obtain a performing network, which amounts to approximately 5 hours. These considerations remain valid also for the custom dataset under consideration. Table 3 also shows that, after initialization, POD+FNN already achieves a higher accuracy than VGG-16 for all the choices of l.
In all cases, it can be observed that the proposed reduced CNN achieves a similar, if not higher, accuracy compared to the original VGG-16, while occupying significantly less storage. Moreover, increasing the cut-off layer index l results in improved accuracy, since more original features are retained; however, this also leads to a smaller compression ratio. Consequently, as previously mentioned, determining the appropriate value for l requires a trade-off between the desired levels of accuracy and reduction, considering also the specific field of application.
4.5 Results ResNet-110
After obtaining interesting results with VGGNet, we proceeded to test our reduction methodology on ResNet-110, following the approach described in [35]. Initially, the network has been trained on each dataset for 60 epochs, achieving a good level of accuracy, as reported in Tables 6, 7 and 8. Similarly to the VGG-16 case, we provide the Top-1 and Top-5 accuracy scores for CIFAR-100.
Also in this setting, we performed multiple experiments to determine the FNN architecture. In analogy with the approach outlined in Table 1, we used the same FNN structures described previously for VGG-16 across the different cases. Furthermore, for ResNet the chosen reduced dimension r is also set to 50, based on the eigenvalue analysis presented in [35]. Numerous tests confirmed that this choice of r was optimal, as increasing its value did not yield improved results.
Once we finalized the compression and input–output mapping techniques, we proceeded to construct the reduced versions of our original model. Algorithm 1 describes the entire procedure, with the last step corresponding to the training phase. During this phase, we re-trained the proposed networks using the aforementioned datasets: for 10 epochs in the case of CIFAR-10 and the custom dataset, and for 20 epochs with CIFAR-100. Tables 6, 7 and 8 provide the outcomes obtained with the described experimental setup, comparing them in terms of achieved accuracy, memory footprint, and time required for the initialization and learning processes. Similarly to what is explained in Section 4.4, we report the results for each proposed reduced net using three different cut-off values (Footnote 9): 31, 33, 35 for CIFAR-10 and the custom dataset, and 37, 39, 43 for CIFAR-100. By combining the reduction and input–output mapping methods, we have constructed the compressed models AS+PCE, AS+FNN, and POD+FNN, whose performance we now analyze.
In terms of memory allocation, it is worth noting that each of the aforementioned reduced nets requires less than 3 MB of space, resulting in a reduction of approximately \(60\%\) in the memory footprint. Furthermore, the introduction of an FNN in the final part of the method leads to a storage decrease of one order of magnitude.
In all cases, we can observe that the reduced networks achieve a level of accuracy comparable to the original ResNet-110. The advantage of constructing lightweight architectures is that they result in faster models in most situations. Specifically, we want to emphasize the POD+FNN net, since it consistently outperforms the other reduced networks in terms of achieved accuracy, storage requirements, and initialization and training times. Regarding the initialization process, we can observe that POD requires less time compared to AS, saving approximately one hour. Furthermore, the training duration is similar to AS in the case of CIFAR-100 and the custom dataset, while it is faster for CIFAR-10.
In conclusion, based on the aforementioned considerations, we can deduce that the results obtained with ResNet-110 are generally in line with those previously achieved with VGG-16. The proposed reduced methodology enables the creation of lightweight versions of ResNet-110 that are equally accurate to the original model but have fewer parameters, making them more manageable to train.
5 Conclusions and perspectives
In this paper, we propose a generic framework for compressing neural networks, specifically Convolutional Neural Networks, with the objective of reducing the number of layers in the network while minimizing the error in the final prediction. This reduction is achieved by replacing a finite set of network layers with a response surface, which also involves dimensionality reduction techniques in order to operate on a low-dimensional space. We analyze various dimensionality reduction methods and investigate how the combination of these techniques with different input-output mappings impacts the final accuracy.
The primary goal of creating this reduced network is to compress existing deep neural network architectures to be included in embedded systems with memory and space constraints. The numerical experiments conducted on two different CNNs, namely VGG-16 and ResNet-110, demonstrate that the proposed techniques can produce a compressed version of an existing network by reducing the number of layers and parameters. This reduction in size results in memory savings while maintaining a comparable level of accuracy to the original CNN. In comparison to VGG-16, the original ResNet-110 requires less storage space, approximately 7 MB, making it already suitable for many applications in vision-embedded systems. However, the use of smaller devices or specific requirements may necessitate a compressed and faster version of the network. Additionally, the results reveal that the combination of POD with FNN generally leads to reduced training time, making the proposed framework superior to the method presented in [35].
A potential drawback of this technique is the requirement to begin with a pre-trained network in order to reduce it. However, our experiments have demonstrated that this starting point is not strictly necessary to reach good accuracy with the proposed reduced architecture. Despite the saved space and memory, the actual bottleneck in many problems lies in the learning procedure. In such cases, our framework could be extended to reduce the architecture dimension during training, rather than only after its completion, potentially resulting in a significant speedup in the optimization step.
In conclusion, the conducted experiments illustrate the consistency of our proposed methodology when applied to different CNNs and datasets. While we cannot claim that this reduction framework can be universally applied to all existing types of ANNs, it has proven effective in compressing CNNs for image recognition tasks.
Data availability and access
Code for the reduction of neural networks is provided as part of the Smithers package, available at https://github.com/mathLab/Smithers. The CIFAR-10 and CIFAR-100 datasets can be downloaded from the official webpage: https://www.cs.toronto.edu/~kriz/cifar.html. Restrictions apply to the availability of the custom dataset, which was used under license from Electrolux Professional for the current study and therefore is not publicly available.
Notes
For simplicity the bias is put to zero in the following discussion.
It is important to emphasize that the implementation for both VGG-16 and ResNet-110 differs from the standard approach when applied to the CIFAR dataset. Therefore, as suggested in [35], we have considered this aspect while constructing the models to maintain consistency with the original works.
We have selected 60 and 300 as the number of epochs for the training phases, and for the reduced nets, we have chosen 10 and 20 epochs. This decision was made as a trade-off between achieving a high final accuracy and minimizing the required time. To ensure a fair comparison, we have maintained the same epoch values across all the different cases we are considering.
In [35] and its corresponding implementation, they refer to indices 5, 6, 7, 8 and 9. These indices represent the convolutional layers in a list where only convolutional and linear layers are taken into consideration as possible cut-off layers. Thus, if we consider the entire network with all the different layers, the corresponding layers would be 11, 13, 16, 18 and 20 respectively.
Note that in all cases (CIFAR-10, CIFAR-100 and custom dataset) the checkpoint file requires 56 MB of memory. However, if you need to store additional information, such as the architecture of the network, training epochs, and loss, the required allocation increases to around 220 MB.
The chosen cut-off indexes for ResNet-110 are determined in a similar manner as discussed for VGG-16. Specifically, they are based on the indexes used in [35]. It is important to note that these indexes refer solely to the convolutional layers. Hence, when considering the entire ResNet-110 structure, these indexes correspond to layers 61, 67, 73, 75, 81, and 87, respectively.
References
Krizhevsky A, Sutskever I, Hinton GE (2012) Imagenet classification with deep convolutional neural networks. Advances in neural information processing systems 25:1097–1105. https://doi.org/10.1145/3065386
Elgendy M (2020) Deep Learning for Vision Systems. Simon and Schuster, New York
Liu L, Ouyang W, Wang X, Fieguth P, Chen J, Liu X, Pietikäinen M (2020) Deep learning for generic object detection: A survey. International journal of computer vision 128:261–318. https://doi.org/10.1007/s11263-019-01247-4
Young T, Hazarika D, Poria S, Cambria E (2018) Recent trends in deep learning based natural language processing. IEEE Computational intelligenCe magazine 13(3):55–75. https://doi.org/10.1109/MCI.2018.2840738
Khurana, D., Koli, A., Khatter, K., Singh, S.: Natural Language Processing: State of The Art, Current Trends and Challenges. Multimedia Tools and Applications 82 (2022). DOI: https://doi.org/10.1007/s11042-022-13428-4
Noda K, Arie H, Suga Y, Ogata T (2014) Multimodal integration learning of robot behavior using deep neural networks. Robotics and Autonomous Systems 62(6):721–736. https://doi.org/10.1016/j.robot.2014.03.003
Kiyokawa T, Katayama H, Tatsuta Y, Takamatsu J, Ogasawara T (2021) Robotic Waste Sorter With Agile Manipulation and Quickly Trainable Detector. IEEE Access 9:124616–124631. https://doi.org/10.1109/ACCESS.2021.3110795
Wali A, Alamgir Z, Karim S, Fawaz A, Ali MB, Adan M, Mujtaba M (2022) Generative adversarial networks for speech processing: A review. Computer Speech & Language 72:101308. https://doi.org/10.1016/j.csl.2021.101308
Yu, D., Deng, L.: Automatic Speech Recognition vol. 1. Springer, London (2016). https://doi.org/10.1007/978-1-4471-5779-3
Goodfellow, I., Bengio, Y., Courville, A.: Deep Learning. MIT Press, Cambridge, USA (2016). http://www.deeplearningbook.org
Khan A, Sohail A, Zahoora U, Qureshi AS (2020) A survey of the recent architectures of deep convolutional neural networks. Artificial Intelligence Review 53(8):5455–5516. https://doi.org/10.1007/s10462-020-09825-6
Trenn S (2008) Multilayer Perceptrons: Approximation Order and Necessary Number of Hidden Units. IEEE Transactions on Neural Networks 19(5):836–44. https://doi.org/10.1109/TNN.2007.912306
Wang, E., Davis, J.J., Zhao, R., Ng, H.-C., Niu, X., Luk, W., Cheung, P.Y.K., Constantinides, G.A.: Deep neural network approximation for custom hardware: Where we’ve been, where we’re going. ACM Computing Surveys 52(2) (2019). https://doi.org/10.1145/3309551
Wuraola A, Patel N (2022) Resource efficient activation functions for neural network accelerators. Neurocomputing 482:163–185. https://doi.org/10.1016/j.neucom.2021.11.032
Huang J, Zhao J, Cai W (2019) Compressing convolutional neural networks using POD for the reconstruction of nonlinear tomographic absorption spectroscopy. Computer Physics Communications 241:33–39. https://doi.org/10.1016/j.cpc.2019.03.020
Messaoud S, Bouaafia S, Maraoui A, Ammari AC, Khriji L, Machhout M (2022) Deep convolutional neural networks-based hardware-software on-chip system for computer vision application. Computers & Electrical Engineering 98:107671. https://doi.org/10.1016/j.compeleceng.2021.107671
Udendhran R, Balamurugan M, Suresh A, Varatharajan R (2020) Enhancing image processing architecture using deep learning for embedded vision systems. Microprocessors and Microsystems 76:103094. https://doi.org/10.1016/j.micpro.2020.103094
da Silva ET, Sampaio F, da Silva LC, Medeiros DS, Correia GP (2020) A method for embedding a computer vision application into a wearable device. Microprocessors and Microsystems 76:103086. https://doi.org/10.1016/j.micpro.2020.103086
He, Y., Zhang, X., Sun, J.: Channel pruning for accelerating very deep neural networks. In: 2017 IEEE International Conference on Computer Vision (ICCV), pp. 1398–1406 (2017). https://doi.org/10.1109/ICCV.2017.155
Chen S, Zhao Q (2019) Shallowing deep networks: Layer-wise pruning based on feature representations. IEEE Transactions on Pattern Analysis and Machine Intelligence 41(12):3048–3056. https://doi.org/10.1109/TPAMI.2018.2874634
Li, Y., Adamczewski, K., Li, W., Gu, S., Timofte, R., Van Gool, L.: Revisiting random channel pruning for neural network compression. In: 2022 IEEE/CVF Conference on Computer Vision and Pattern Recognition (CVPR), pp. 191–201 (2022). https://doi.org/10.1109/CVPR52688.2022.00029
Molchanov, P., Mallya, A., Tyree, S., Frosio, I., Kautz, J.: Importance estimation for neural network pruning. In: 2019 IEEE/CVF Conference on Computer Vision and Pattern Recognition (CVPR), pp. 11256–11264 (2019). https://doi.org/10.1109/CVPR.2019.01152
Liu, Z., Li, J., Shen, Z., Huang, G., Yan, S., Zhang, C.: Learning efficient convolutional networks through network slimming. In: 2017 IEEE International Conference on Computer Vision (ICCV), pp. 2755–2763 (2017). https://doi.org/10.1109/ICCV.2017.298
Cichocki, A., Lee, N., Oseledets, I., Phan, A.-H., Zhao, Q., Mandic, D.P.: Tensor networks for dimensionality reduction and large-scale optimization: Part 1 low-rank tensor decompositions. Foundations and Trends® in Machine Learning 9(4-5), 249–429 (2016). https://doi.org/10.1561/2200000059
Cichocki, A., Phan, A.-H., Zhao, Q., Lee, N., Oseledets, I., Sugiyama, M., Mandic, D.P.: Tensor networks for dimensionality reduction and large-scale optimization: Part 2 applications and future perspectives. Foundations and Trends® in Machine Learning 9(6), 431–673 (2017). https://doi.org/10.1561/2200000067
Li, Y., Gu, S., Mayer, C., Van Gool, L., Timofte, R.: Group Sparsity: The Hinge Between Filter Pruning and Decomposition for Network Compression. In: 2020 IEEE/CVF Conference on Computer Vision and Pattern Recognition (CVPR), pp. 8015–8024 (2020). https://doi.org/10.1109/CVPR42600.2020.00804
Li, Y., Gu, S., Van Gool, L., Timofte, R.: Learning Filter Basis for Convolutional Neural Network Compression. In: 2019 IEEE/CVF International Conference on Computer Vision (ICCV), pp. 5622–5631 (2019). https://doi.org/10.1109/ICCV.2019.00572
Yang, J., Shen, X., Xing, J., Tian, X., Li, H., Deng, B., Huang, J., Hua, X.-s.: Quantization Networks. In: 2019 IEEE/CVF Conference on Computer Vision and Pattern Recognition (CVPR), pp. 7300–7308 (2019). https://doi.org/10.1109/CVPR.2019.00748
Hubara I, Courbariaux M, Soudry D, El-Yaniv R, Bengio Y (2017) Quantized neural networks: Training neural networks with low precision weights and activations. The Journal of Machine Learning Research 18(1):6869–6898
Deng L, Jiao P, Pei J, Wu Z, Li G (2018) GXNOR-Net: Training deep neural networks with ternary weights and activations without full-precision memory under a unified discretization framework. Neural Networks 100:49–58. https://doi.org/10.1016/j.neunet.2018.01.010
Hinton, G., Vinyals, O., Dean, J.: Distilling the Knowledge in a Neural Network. In: NIPS Deep Learning and Representation Learning Workshop (2015)
Gou J, Yu B, Maybank SJ, Tao D (2021) Knowledge distillation: A survey. International Journal of Computer Vision 129(6):1789–1819. https://doi.org/10.1007/s11263-021-01453-z
Cho, J.H., Hariharan, B.: On the Efficacy of Knowledge Distillation. In: 2019 IEEE/CVF International Conference on Computer Vision (ICCV), pp. 4793–4801 (2019). https://doi.org/10.1109/ICCV.2019.00489
Bang D, Lee J, Shim H (2021) Distilling from professors: Enhancing the knowledge distillation of teachers. Information Sciences 576:743–755. https://doi.org/10.1016/j.ins.2021.08.020
Cui C, Zhang K, Daulbaev T, Gusak J, Oseledets I, Zhang Z (2020) Active subspace of neural networks: Structural analysis and universal attacks. SIAM Journal on Mathematics of Data Science 2(4):1096–1122. https://doi.org/10.1137/19M1296070
Benner, P., Grivet-Talocia, S., Quarteroni, A., Rozza, G., Schilders, W., Silveira, L.M.: Model Order Reduction: Volume 1: System- and Data-Driven Methods and Algorithms. De Gruyter, Berlin, Boston (2021). https://doi.org/10.1515/9783110498967
Benner, P., Schilders, W., Grivet-Talocia, S., Quarteroni, A., Rozza, G., Miguel Silveira, L.: Model Order Reduction: Volume 2: Snapshot-Based Methods and Algorithms. De Gruyter, Berlin, Boston (2020). https://doi.org/10.1515/9783110671490
Benner, P., Schilders, W., Grivet-Talocia, S., Quarteroni, A., Rozza, G., Miguel Silveira, L.: Model Order Reduction: Volume 3: Applications. De Gruyter, Berlin, Boston (2020). https://doi.org/10.1515/9783110499001
Constantine, P.G.: Active Subspaces: Emerging Ideas for Dimension Reduction in Parameter Studies. SIAM Spotlights, vol. 2. SIAM, U.S. (2015). https://doi.org/10.1137/1.9781611973860
Constantine PG, Dow E, Wang Q (2014) Active Subspace Methods in Theory and Practice: Applications to Kriging Surfaces. SIAM Journal on Scientific Computing 36(4):1500–1524. https://doi.org/10.1137/130916138
Romor F, Tezzele M, Lario A, Rozza G (2022) Kernel-based active subspaces with application to computational fluid dynamics parametric problems using discontinuous Galerkin method. International Journal for Numerical Methods in Engineering 123(23):6000–6027. https://doi.org/10.1002/nme.7099
Zahm O, Constantine PG, Prieur C, Marzouk YM (2020) Gradient-based dimension reduction of multivariate vector-valued functions. SIAM Journal on Scientific Computing 42(1):534–558. https://doi.org/10.1137/18M1221837
Ahnert K, Abel M (2007) Numerical differentiation of experimental data: local versus global methods. Computer Physics Communications 177:764–774. https://doi.org/10.2514/6.2003-4213
Williams CK, Rasmussen CE (2006) Gaussian Processes for Machine Learning, vol 2. The MIT press, Cambridge, MA, USA
Mohamed, S., Rosca, M., Figurnov, M., Mnih, A.: Monte Carlo Gradient Estimation in Machine Learning. Journal of Machine Learning Research 21(1) (2020). https://doi.org/10.5555/3455716.3455848
Romor F, Tezzele M, Rozza G (2021) ATHENA: Advanced Techniques for High dimensional parameter spaces to Enhance Numerical Analysis. Software Impacts 10:100133. https://doi.org/10.1016/j.simpa.2021.100133
Hesthaven, J.S., Rozza, G., Stamm, B.: Certified Reduced Basis Methods for Parametrized Partial Differential Equations, 1st edn. Springer Briefs in Mathematics, p. 135. Springer, Switzerland (2015). https://doi.org/10.1007/978-3-319-22470-1. Springer
Bui-Thanh T, Damodaran M, Willcox K (2003) Proper orthogonal decomposition extensions for parametric applications in compressible aerodynamics. In: 21st AIAA Applied Aerodynamics Conference, p. 4213. https://doi.org/10.2514/6.2003-4213
Bui-Thanh T, Damodaran M, Willcox K (2004) Aerodynamic data reconstruction and inverse design using proper orthogonal decomposition. AIAA journal 42(8):1505–1516. https://doi.org/10.2514/1.2159
Rozza, G., Stabile, G., Ballarin, F.: Advanced Reduced Order Methods and Applications in Computational Fluid Dynamics. Society for Industrial and Applied Mathematics, Philadelphia, PA (2022). https://doi.org/10.1137/1.9781611977257
Xiu D, Karniadakis GE (2002) The Wiener-Askey polynomial chaos for stochastic differential equations. SIAM journal on scientific computing 24(2):619–644. https://doi.org/10.1137/S1064827501387826
Fine, T.L.: Feedforward Neural Network Methodology. Information Science and Statistics. Springer, New York (1999). https://doi.org/10.1007/b97705
Wiener N (1938) The Homogeneous Chaos. American Journal of Mathematics 60(4):897–936. https://doi.org/10.2307/2371268
Janya-Anurak, C.: Framework for Analysis and Identification of Nonlinear Distributed Parameter Systems Using Bayesian Uncertainty Quantification Based on Generalized Polynomial Chaos. Karlsruher Schriften zur Anthropomatik, vol. 31. KIT Scientific Publishing, Karlsruhe, Deutschland (2017). https://doi.org/10.5445/KSP/1000066940
Ghanem, R.G., Spanos, P.D.: Stochastic Finite Elements: a Spectral Approach. Springer, New York (1991). https://doi.org/10.1007/978-1-4612-3094-6
Askey, R., Wilson, J.A.: Some basic hypergeometric orthogonal polynomials that generalize Jacobi polynomials. Memoirs of the American Mathematical Society 54(319) (1985). https://doi.org/10.1090/memo/0319
Sudret B (2008) Global sensitivity analysis using polynomial chaos expansions. Reliability engineering & system safety 93(7):964–979. https://doi.org/10.1016/j.ress.2007.04.002
Cheng K, Lu Z (2018) Adaptive sparse polynomial chaos expansions for global sensitivity analysis based on support vector regression. Computers & Structures 194:86–96. https://doi.org/10.1016/j.compstruc.2017.09.002
Shaham U, Cloninger A, Coifman RR (2018) Provable approximation properties for deep neural networks. Applied and Computational Harmonic Analysis 44(3):537–557. https://doi.org/10.1016/j.acha.2016.04.003
Zaki MJ, Meira W Jr (2020) Data Mining and Machine Learning: Fundamental Concepts and Algorithms. Cambridge University Press, U.K
Alzubaidi L, Zhang J, Humaidi AJ, Al-Dujaili A, Duan Y, Al-Shamma O, Santamaría J, Fadhel MA, Al-Amidie M, Farhan L (2021) Review of deep learning: Concepts, CNN architectures, challenges, applications, future directions. Journal of big Data 8(1):1–74. https://doi.org/10.1186/s40537-021-00444-8
Rojas, R.: The Backpropagation Algorithm. In: Neural Networks, pp. 149–182. Springer, Berlin, Heidelberg (1996). https://doi.org/10.1007/978-3-642-61068-4_7
Borza, D.L., Ileni, T.A., Marinescu, A.I., Darabant, S.A.: Teacher or supervisor? effective online knowledge distillation via guided collaborative learning. Computer Vision and Image Understanding, 103632 (2023). https://doi.org/10.1109/CVPR.2016.90
LeCun Y (1989) Generalization and network design strategies. Connectionism in perspective 19(143–155):18
Gu J, Wang Z, Kuen J, Ma L, Shahroudy A, Shuai B, Liu T, Wang X, Wang G, Cai J et al (2018) Recent advances in convolutional neural networks. Pattern recognition 77:354–377. https://doi.org/10.1016/j.patcog.2017.10.013
Simonyan, K., Zisserman, A.: Very Deep Convolutional Networks for Large-Scale Image Recognition. In: Bengio, Y., LeCun, Y. (eds.) 3rd International Conference on Learning Representations, ICLR 2015, San Diego, CA, USA, May 7-9, 2015, Conference Track Proceedings (2015)
He K, Zhang X, Ren S, Sun J (2016) Deep residual learning for image recognition. In: Proceedings of the IEEE Conference on Computer Vision and Pattern Recognition, pp. 770–778. https://doi.org/10.1109/CVPR.2016.90
Krizhevsky, A., Hinton, G.: Learning multiple layers of features from tiny images. Master’s thesis, Department of Computer Science, University of Toronto (2009)
Paszke, A., Gross, S., Massa, F., Lerer, A., Bradbury, J., Chanan, G., Killeen, T., Lin, Z., Gimelshein, N., Antiga, L., Desmaison, A., Kopf, A., Yang, E., DeVito, Z., Raison, M., Tejani, A., Chilamkurthy, S., Steiner, B., Fang, L., Bai, J., Chintala, S.: PyTorch: An Imperative Style, High-Performance Deep Learning Library. In: Wallach, H., Larochelle, H., Beygelzimer, A., d’ Alché-Buc, F., Fox, E., Garnett, R. (eds.) Advances in Neural Information Processing Systems 32, pp. 8024–8035. Curran Associates, Inc., new York, United States (2019)
Virtanen, P., Gommers, R., Oliphant, T.E., Haberland, M., Reddy, T., Cournapeau, D., Burovski, E., Peterson, P., Weckesser, W., Bright, J., van der Walt, S.J., Brett, M., Wilson, J., Millman, K.J., Mayorov, N., Nelson, A.R.J., Jones, E., Kern, R., Larson, E., Carey, C.J., Polat, İ., Feng, Y., Moore, E.W., VanderPlas, J., Laxalde, D., Perktold, J., Cimrman, R., Henriksen, I., Quintero, E.A., Harris, C.R., Archibald, A.M., Ribeiro, A.H., Pedregosa, F., van Mulbregt, P., SciPy 1.0 Contributors: SciPy 1.0: Fundamental Algorithms for Scientific Computing in Python. Nature Methods 17, 261–272 (2020). https://doi.org/10.1038/s41592-019-0686-2
Ghashami M, Liberty E, Phillips JM, Woodruff DP (2016) Frequent Directions: Simple and Deterministic Matrix Sketching. SIAM Journal on Computing 45:1762–1792. https://doi.org/10.1137/15M1009718
Acknowledgements
We thank Marco Tezzele for the productive discussions and comments.
Funding
This work was partially supported by an industrial Ph.D. grant sponsored by Electrolux Professional, and was partially funded by European Union Funding for Research and Innovation — Horizon 2020 Program — in the framework of European Research Council Executive Agency: H2020 ERC CoG 2015 AROMA-CFD project 681447 “Advanced Reduced Order Methods with Applications in Computational Fluid Dynamics” P.I. Professor Gianluigi Rozza. Open access funding provided by Scuola Internazionale Superiore di Studi Avanzati - SISSA within the CRUI-CARE Agreement.
Author information
Contributions
Conceptualization: Laura Meneghetti, Nicola Demo; Methodology: Laura Meneghetti, Nicola Demo; Formal analysis and investigation: Laura Meneghetti; Writing - original draft preparation: Laura Meneghetti; Writing - review and editing: Nicola Demo, Gianluigi Rozza; Funding acquisition: Gianluigi Rozza; Supervision: Gianluigi Rozza.
Corresponding author
Ethics declarations
Ethical and informed consent for data used
Fulfilled.
Competing interests
No potential competing interest was reported by the authors.
Additional information
Publisher's Note
Springer Nature remains neutral with regard to jurisdictional claims in published maps and institutional affiliations.
Rights and permissions
Open Access This article is licensed under a Creative Commons Attribution 4.0 International License, which permits use, sharing, adaptation, distribution and reproduction in any medium or format, as long as you give appropriate credit to the original author(s) and the source, provide a link to the Creative Commons licence, and indicate if changes were made. The images or other third party material in this article are included in the article’s Creative Commons licence, unless indicated otherwise in a credit line to the material. If material is not included in the article’s Creative Commons licence and your intended use is not permitted by statutory regulation or exceeds the permitted use, you will need to obtain permission directly from the copyright holder. To view a copy of this licence, visit http://creativecommons.org/licenses/by/4.0/.
About this article
Cite this article
Meneghetti, L., Demo, N. & Rozza, G. A dimensionality reduction approach for convolutional neural networks. Appl Intell 53, 22818–22833 (2023). https://doi.org/10.1007/s10489-023-04730-1