1 Introduction
Deep neural networks (
DNNs) have come to dominate the field of Artificial Intelligence, owing to their ability to extract features and produce high-quality decisions from complex data. They have been extensively applied across various domains, including
natural language processing (
NLP) [
63,
67,
88],
computer vision (
CV) [
43,
65,
116,
129], and wireless networking [
56,
109]. The great success of DNNs in image classification, image recognition, machine translation, video gaming, and many others can be attributed to their complex and deep non-linear layer architectures. Advances in automatic learning techniques coupled with hardware [
42] and optimization-level techniques also play an important role in the success of DNNs. These models are typically designed as a sequence of interconnected neural layers stacked on top of one another. Achieving high accuracy often requires deeper models. Popular models such as AlexNet [
65], VGGNet [
116], ResNet [
43], and DenseNet [
51] achieve high performance with deep architectures that have millions of parameters. However, these models demand significant processing power and energy, introducing latency [
64]. Deploying these models on resource-constrained devices for low-latency applications can be challenging [
100]. For example, DNN-based speech or face recognition applications [
18,
47] and real-time navigation assistance [
102] provide precise predictive accuracy but demand computational resources far beyond what most mobile or IoT devices available on the market today can afford.
Researchers have proposed various innovative techniques to reduce DNN complexity and enable deployment in low-resource environments, such as mobile devices. Model compression techniques, such as model pruning [
36], quantization [
55], and
knowledge distillation (
KD) [
48], aim at compressing large models. Additionally, hardware and software optimization methods are introduced to reduce execution time. Lightweight models, such as MnasNet [
120] and MobileNetV2 [
158], require fewer computational resources but suffer from significant accuracy loss. For instance, compared to ResNet-152 [
43], MobileNetV2 [
158] reports an accuracy loss of up to 6.4% for image classification on the ImageNet dataset [
93]. The tradeoff between accuracy and computational resource requirements can be mitigated by offloading data to powerful remote cloud servers [
90]. In this approach, a mobile device gathers input (e.g., an image) and sends it to the cloud server for processing by the DNN model. Cloud servers, equipped with the necessary computational resources like GPUs, can run complex DNN models efficiently [
33]. However, this cloud-based approach introduces latency in data transfer and raises privacy concerns, which are unacceptable for most latency-sensitive mobile applications.
A promising new computing paradigm, called edge computing [
13], allocates computational resources closer to end devices. This approach has been successfully applied in various domains, such as smart home systems [
154], autonomous vehicles [
80], and healthcare applications [
144]. In edge computing, most DNN computations are performed at edge devices near the end devices, rather than at centralized cloud servers. Edge computing involves offloading data partially or completely to an edge device situated close to the end device or mobile device. While this method addresses many limitations of cloud-based solutions, the performance of edge-based approaches depends heavily on the bandwidth availability between the end device and the edge device or server [
151]. Deploying DNNs on edge devices with variable bandwidth and offloading constraints, while maintaining accuracy, is challenging [
34]. Research has extensively investigated training and executing resource-demanding DNNs on resource-limited edge devices. Recent contributions have proposed various DNN model partitioning techniques that split the training and execution of the DNN model among devices, edge servers, and cloud servers based on context parameters.
Belilovsky et al. [
6], Marquez et al. [
91], Lee et al. [
74], Wang et al. [
131], and Huang et al. [
50] demonstrated that small and lightweight neural network architectures can recognize and classify most input samples from complex datasets with acceptable accuracy. For instance, a DNN model with a single convolutional (conv) layer can accurately classify 30% of the images in the complex ImageNet dataset [
76]. The overthinking problem occurs when a DNN reaches a correct prediction at an intermediate layer, but subsequent layers turn that correct prediction into a misclassification by the final layer. This issue can be mitigated using a smaller DNN architecture [
2,
62]. Additionally, the deep and sequential nature of neural networks, combined with the gradient locking problem [
129], makes the DNN structure challenging to parallelize.
To address these problems, the early-exit DNN was introduced. This approach modifies the conventional DNN architecture by adding multiple side branches at its intermediate layers, accelerating inference since many inputs can be classified at these branches, as shown in [
6,
74,
91]. The concept of an early-exit DNN model, first proposed by Teerapittayanon et al. [
123] and named BranchyNet, modifies the conventional DNN structure by adding multiple side branches. In BranchyNet, inference results can be deduced from any side branch based on confidence criteria, unlike conventional DNN models where every input sample propagates through all layers to the output layer regardless of its characteristics. The early-exit DNN is ideal for edge computing as it enhances performance and reduces computational costs by allowing early exits at intermediate layers. It also offers flexibility in partitioning the DNN model via side branches and supports multi-tier DNN execution.
This article explores recent advances in early-exit DNNs and their applications. Systematic state-of-the-art surveys of early-exit DNNs are very few. Scardapane et al. [
112] introduced multi-exit neural networks and discussed various techniques and ideas for designing, training, and deploying them in time-constrained environments. Matsubara et al. [
93] provided a short overview of early-exit DNNs and their applications. Laskaridis et al. [
69] briefly described the architecture and execution of early-exit networks and their applicability in adaptive inferences. Han et al. [
37] reviewed dynamic neural networks and provided abstract details on early-exit DNNs. The previous surveys either mentioned early-exiting in the context of dynamic inference networks [
37], combined it with split computing and offloading [
93], or provided a very brief description of early-exit DNN, its design, training, and deployment [
69]. The survey in [112] provided a coherent introduction to early-exit DNNs but addressed only a few training and deployment strategies, and many studies have been published since that survey appeared. This survey provides a comprehensive introduction to early-exit DNNs and a systematic analysis of recent techniques and strategies for designing, training, and deploying them. This article focuses on the design constraints of early-exit DNNs, divides the architecture into different parts, and reviews recent developments in each. It also discusses the key benefits and open challenges of early-exit DNNs and their applicability across various domains.
1.1 The Main Contributions of This Work
Early-exit DNNs have recently sparked significant research interest, yet only a few surveys have reviewed recent advances in this area. Researchers have used various terms to describe similar concepts, such as cascaded networks, multi-classifiers, dynamic networks, adaptive networks, early termination neural networks, and multi-exit neural networks. Despite the progress, several research issues still need to be resolved to advance early-exit DNNs. A structured introduction and systematic survey of the state-of-the-art can increase research interest, provide insights, and attract new researchers to the field. In this context, this article makes the following contributions:
(1)
Unified Literature: This article unifies the literature that is fragmented under different names and provides a coherent, structured introduction to early-exit DNNs.
(2)
Benefits Comparison: It comprehensively compares the advantages of early-exit DNNs over conventional DNNs.
(3)
Architecture Description: It describes the early-exit DNN architecture and discusses major design constraints.
(4)
Implementation Review: It reviews several early-exit DNN implementations in-depth and discusses various training and deployment strategies.
(5)
Research Challenges and Applications: It emphasizes critical research challenges and explores various application possibilities.
The remainder of this article is organized as follows: Section
2 discusses the motivation and benefits of early-exit DNNs. Section
3 analyzes the literature that conveys concepts similar to the early-exit DNN. Section
4 describes the architecture of the early-exit DNN, while Section
5 explains the various training strategies used for training the early-exit DNN. Section
6 covers the deployment and inference processes of the early-exit DNN. Sections
7 and
8 explore various applications and research challenges of early-exit DNN, respectively. Finally, Section
9 concludes this work.
3 Related Works
Recent research has seen a proliferation of studies focusing on early-exit DNNs, often presented under diverse names. As illustrated in Table
1, these studies can be categorized into distinct groups: cascaded networks, multi-classifiers or internal classifiers, dynamic or adaptive inferences, dynamic or adaptive networks, layer skipping networks, deeply supervised models, neural trees, conditional deep learning networks, multi-branch or multi-exit networks, and early-exit or early termination neural networks. This section provides a brief overview of each category.
Cascaded networks consist of multiple DNNs, referred to as component DNNs, arranged sequentially for classification. These networks operate in a cascade fashion, stopping computation when prediction confidence exceeds a specified threshold [
91]. Similar to early-exit DNNs, cascaded networks employ thresholds and softmax outputs to determine termination points [
8]. As input complexity rises, additional component DNNs are incorporated into the cascade. Bolukbasi et al. [
9] introduced an alternative cascading architecture where computation can pause after each conv layer. This design includes conv layers branching to classifiers, with a decision function guiding the sequence of DNN components in execution. Guan et al. [
32] utilized a cascade of classifiers with a policy module to select the optimal classifier for each input instance. The policy module integrates three
fully connected (
fc) layers, two ReLU units, and a softmax function to compute the termination probability. Wang et al. [
134] introduced the “
I Don’t Know” (
IDK) cascade framework, enhancing base models with additional classifiers that generate an IDK class for uncertain predictions. Upon predicting an IDK class, the subsequent model in the cascade is activated. Inference concludes upon predicting a real class or reaching the end of the cascade. Notably, IDK classifiers operate independently of base models, evaluating their confidence scores to assess prediction uncertainty. Soldaini et al. [
117] proposed a
cascade transformer (
CT) for document ranking, leveraging classifiers at various encoding stages to prune candidates in batches.
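To make the cascade termination rule concrete, the following minimal PyTorch-style sketch runs component classifiers in order and stops once the softmax confidence of the current stage exceeds a threshold. The component models, the 0.9 threshold, and the toy input dimensions are illustrative assumptions rather than details taken from the cited systems.

```python
import torch
import torch.nn as nn
import torch.nn.functional as F

def cascade_predict(models, x, threshold=0.9):
    """Run component DNNs in sequence; stop as soon as the softmax
    confidence of the current stage exceeds the threshold (batch size 1)."""
    for stage, model in enumerate(models):
        logits = model(x)                        # (1, num_classes)
        probs = F.softmax(logits, dim=-1)
        confidence, label = probs.max(dim=-1)
        # Terminate on high confidence, or at the last component DNN.
        if confidence.item() >= threshold or stage == len(models) - 1:
            return label.item(), confidence.item(), stage

# Toy usage: three increasingly deep classifiers over 32-dimensional inputs.
models = [nn.Sequential(nn.Linear(32, 10)),
          nn.Sequential(nn.Linear(32, 64), nn.ReLU(), nn.Linear(64, 10)),
          nn.Sequential(nn.Linear(32, 128), nn.ReLU(), nn.Linear(128, 10))]
label, conf, stage = cascade_predict(models, torch.randn(1, 32))
```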
Dynamic or adaptive inference is a contemporary strategy to reduce DNN computational demands for resource-constrained devices like IoT and mobile devices. These approaches utilize multi-branch neural networks to dynamically select computational paths for each input instance [
16,
22,
23,
31,
84,
115,
127,
136,
145]. Similar to early-exit DNNs, they employ multi-classifier and internal classifier networks attached to intermediate layers of conventional DNNs to facilitate early prediction [
62,
77]. Multi-branch networks [
16,
78,
95,
111] use multiple kernels to extract features from input samples and combine the results to produce the final outcome. Dynamic DNNs, or dynamic wide networks, employ semantic specialization techniques to enhance computational efficiency during execution [
22,
89,
95,
97]. They partition classes into visually similar groups and assign them to specific branches, enabling these branches to discriminate among the classes. This reduces computational costs by focusing on a subset of the architecture for inference on each input. HydraNet [
95] employs a gating mechanism to specialize branches in feature extraction for visually similar classes, improving component selection during execution. Li et al. [
77] and Shuang Li et al. [
78] developed adaptive neural networks and
dynamic domain adaptation (
DDA) frameworks, respectively, optimizing performance and enabling knowledge transfer across domains. Fang et al. [
22] introduced FlexDNN, adapting to varying model complexities in on-device video analytics to improve deployment efficiency.
In layer-skipping networks, the inference process stops early by selectively skipping intermediate layer computations. A gating system analyzes input characteristics to decide if executing a layer is necessary. Veit et al. [
127] proposed a convolutional network with an adaptive inference graph to dynamically determine the next computing layer. SkipNet [
135] is a modified residual network featuring a gating mechanism to selectively skip layers. BlockDrop [
138] dynamically drops and selects layers in ResNet for each instance.
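As a rough illustration of the gating idea, the sketch below wraps a residual block with a small input-conditioned gate that decides whether to execute the block. The gate architecture, the 0.5 decision threshold, and the hard skip are simplifying assumptions and do not reproduce the exact SkipNet or BlockDrop formulations; a real implementation would also avoid computing the block entirely when the gate is closed.

```python
import torch
import torch.nn as nn

class GatedResidualBlock(nn.Module):
    """Residual block whose execution is controlled by an input-dependent gate."""
    def __init__(self, channels):
        super().__init__()
        self.block = nn.Sequential(
            nn.Conv2d(channels, channels, 3, padding=1), nn.ReLU(),
            nn.Conv2d(channels, channels, 3, padding=1))
        # Tiny gate: global average pool -> linear -> execution probability.
        self.gate = nn.Sequential(nn.AdaptiveAvgPool2d(1), nn.Flatten(),
                                  nn.Linear(channels, 1), nn.Sigmoid())

    def forward(self, x):
        g = self.gate(x)                              # (batch, 1), in [0, 1]
        keep = (g > 0.5).float().view(-1, 1, 1, 1)    # hard per-input skip decision
        return x + keep * self.block(x)               # residual path only when keep == 1

# Toy usage on a batch of two 16-channel feature maps.
out = GatedResidualBlock(16)(torch.randn(2, 16, 8, 8))
```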
Conditional Deep Learning (
CDL) activates layers based on input complexity [
103]. CDL networks consist of sequential stages with a linear classifier and an activation module. Each stage predicts class labels and confidence scores, with the activation module deciding whether to proceed to the next layer or terminate the process with the predicted label [
104]. Neural trees integrate DNN parameter optimization with decision tree conditional computation. Tanno et al. [
122] introduced
Adaptive Neural Trees (
ANTs), combining architecture learning, hierarchical representation, and conditional computation. ANTs use edges, leaf nodes, and rooting functions from DNN representation learning, selecting execution layers conditionally via stochastic routing. ANT’s growth depends on data complexity and availability, performing conditional computation by activating a subset of model parameters based on selected root-to-leaf paths. Yu et al. [
148] employed a Soft Conditional Gate, which uses pooling and an MLP layer for input-conditioned routing to select the optimal path during inference. The activation probabilities of dynamic nodes are compared with a predefined threshold to decide whether to terminate the inference process.
Deeply supervised nets (
DSN) attach multiple classifiers to intermediate layers, using objective functions to enhance interpretability and reduce prediction error [
74]. MSDNet [
50] is a DSN with intermediate exit classifiers in ResNet and DenseNet backbones, improving efficiency and performance in resource-aware environments. Multi-exit or early-exit networks generate predictions using intermediate results from attached side exit branches.
5 Training Strategies for Early-exit DNN
Early-exit DNNs modify conventional DNN architectures by incorporating additional side branches, necessitating effective training strategies. As illustrated in Table
5, early-exit DNN training can be categorized into six main strategies. The first strategy is joint training, where the entire network, including the backbone DNN and all side branches, is trained as a unified optimization problem [
123]. The second strategy is branch-wise training, which involves iteratively training each side branch alongside the preceding backbone layers [
46]. In the third strategy, separate training, side branches are treated as standalone classifiers and trained independently [
110]. The fourth strategy is two-stage training, where the backbone DNN is initially trained, and its parameters are frozen, followed by separate training of the side branches using these frozen weights [
128]. The fifth strategy is KD-based training, where side branches act as student models learning from the final outputs of the backbone DNN, which serves as the teacher [
48]. Finally, the sixth strategy is a hybrid approach that combines multiple training techniques to comprehensively train the early-exit DNN model [
100]. The following subsections discuss these training strategies in detail.
5.1 Joint Training Strategy
As shown in Table
5, the joint training strategy is the most commonly used method for training early-exit DNNs. It is utilized in BranchyNet [
123], SDN [
62], DDI [
136], DynExit [
132], Boomerang [
149], Edgent [
82], SPINN [
70], PABEE [
156], EdgeML [
155], AdaDet [
146], HiDEC [
53], DeeDiff [
121], OdeBERT [
87], and Leco [
150]. Figure
4 provides a pictorial representation of this strategy, and the training procedure is summarized in Algorithm
1. In the figure, L
\(_1\) to L
\(_n\) and FE are neural layers of the Backbone DNN, and E
\(_1\) to E
\(_j\) are side branches. The losses generated by the corresponding exit branches are denoted as
\(l_1\) to
\(l_j\), while the loss generated at FE is
\(l_n\). The combined loss is represented as
\(l\). In this method, training the entire DNN, including additional side branches, is treated as a single optimization problem. It can be implemented in two ways: by combining the losses of each side branch [
123], or by calculating the overall loss from the combined outputs of each side branch [
111]. The training process consists of two passes: a forward pass, where training data is propagated through the backbone DNN and side branches, during which training parameters and errors are recorded; and a backward pass, where the recorded error is propagated back through the backbone model and weight parameters are adjusted using a gradient descent algorithm. BranchyNet [
123] defined a softmax cross-entropy loss function for each side branch, as shown in Equation (
3) and formulated a single optimization problem by combining all the losses, as shown in Equation (
4).
\[ l_i(\hat{y}_i, y; \theta) = -\frac{1}{|C|}\sum _{c \in C} y_c \log \hat{y}_{i,c} \qquad (3) \]
where
\(C\) is the set of all possible labels,
\(y\) is the ground-truth label vector,
\(\hat{y}_{i,c} = softmax(z)= \frac{exp(z)}{\sum _{c \in C}exp(z_c)}\), where
\(z=f_i(x;\theta)\) is the output of the
\(i^{th}\) exit branch and
\(\theta\) denotes the weight parameters of all neural layers between an entry point and an exit point.
\[ l = \sum _{i=1}^{N} w_i\, l_i(\hat{y}_i, y; \theta) \qquad (4) \]
where
\(N\) is the total number of exit points, including the main network’s exit, and
\(w_i\) is the weighting factor assigned to the \(i^{th}\) exit point.
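The following minimal PyTorch-style sketch illustrates the joint training strategy: all exits are evaluated in a single forward pass, their per-branch cross-entropy losses (Equation (3)) are combined with per-exit weights into one objective (Equation (4)), and a single backward pass updates the whole network. The backbone/exit decomposition, the loss weights, and the optimizer are illustrative assumptions.

```python
import torch.nn.functional as F

def joint_training_step(backbone_layers, exits, exit_weights, x, y, optimizer):
    """One step of joint (end-to-end) early-exit training.
    backbone_layers: list of nn.Module executed in sequence.
    exits: dict mapping backbone layer index -> exit classifier head,
           whose last entry plays the role of the final exit (FE).
    exit_weights: loss weights w_i, ordered like the exits along the network."""
    optimizer.zero_grad()
    losses, h = [], x
    for idx, layer in enumerate(backbone_layers):
        h = layer(h)
        if idx in exits:                                        # side branch attached here
            losses.append(F.cross_entropy(exits[idx](h), y))    # l_i, Equation (3)
    total = sum(w * l for w, l in zip(exit_weights, losses))    # l, Equation (4)
    total.backward()                                            # single backward pass
    optimizer.step()
    return total.item()
```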
5.2 Branch-wise Training Strategy
In the branch-wise training strategy, each side branch is trained separately along with the preceding backbone DNN layers. This method follows the concept of forward-thinking, introduced by Hettinger et al. [
46], which trains conventional DNNs in a layer-wise fashion without backpropagation. MSDNet [
50] applied this strategy, where each iteration freezes the parameters of auxiliary classifiers and the preceding backbone DNN layers, as depicted in Figure
5. Initially, L
\(_1\), L
\(_2\), and E
\(_1\) are trained and their parameters are then frozen. Next, L
\(_1\), L
\(_2\), L
\(_3\), L
\(_4\), and E
\(_2\) are trained while leveraging the frozen parameters of L
\(_1\) and L
\(_2\). Similarly, subsequent branches are trained, leveraging previously frozen parameters. Algorithm
2 summarizes this training operation. Belilovsky et al. [
5,
6] also investigated this training strategy. Larochelle et al. [
68] explored an unsupervised layer-wise training strategy, while Bengio et al. [
7] employed a supervised layer-wise training approach. Wang et al. [
131] adopted a semi-supervised layer-wise training method. The primary advantage of this training strategy is its ability to mitigate issues like vanishing and exploding gradients, given that each branch is relatively lightweight.
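A minimal sketch of this freezing schedule is given below, assuming the network has been split into (backbone segment, exit head) pairs in order; at each stage only the new segment and its exit are optimized, while already-trained segments are kept frozen and used as fixed feature extractors. The SGD optimizer and single-epoch schedule are illustrative choices.

```python
import torch
import torch.nn as nn
import torch.nn.functional as F

def branchwise_training(stages, loader, epochs=1, lr=1e-3):
    """stages: list of (backbone_segment, exit_head) pairs in network order.
    Stage k trains its new segment and exit E_k while earlier segments stay frozen."""
    trained_segments = nn.ModuleList()
    for segment, exit_head in stages:
        params = list(segment.parameters()) + list(exit_head.parameters())
        optimizer = torch.optim.SGD(params, lr=lr)
        for _ in range(epochs):
            for x, y in loader:
                with torch.no_grad():                 # frozen earlier segments
                    for frozen in trained_segments:
                        x = frozen(x)
                loss = F.cross_entropy(exit_head(segment(x)), y)
                optimizer.zero_grad(); loss.backward(); optimizer.step()
        for p in segment.parameters():                # freeze before the next stage
            p.requires_grad_(False)
        trained_segments.append(segment)
    return trained_segments
```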
5.3 Separate Training Strategy
In the separate training strategy, an individual model is created for each side branch, starting from the input layer and including every backbone layer up to the respective branch. Each side branch undergoes independent training [
15,
20,
71,
74,
132]. Figure
6 illustrates this approach, where L
\(_1\), L
\(_2\), and E
\(_1\) form one independent model, and similarly, L
\(_1\), L
\(_2\), L
\(_3\), L
\(_4\), and E
\(_2\) form another, with each trained separately. This strategy may involve varying training objectives for each exit branch, typically focusing on minimizing standard loss functions like cross-entropy. Algorithm
3 provides a broad overview of this approach. Separate training offers substantial benefits when different exit points capture distinct features or levels of abstraction, enabling each branch to make a unique contribution to the overall architecture [
128]. In this training, each side branch can be associated with different classifiers; for instance, DSN [
74] uses SVM and softmax classifiers on its side branches.
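The sketch below illustrates separate training under the assumption that each exit's sub-model is built by copying the backbone layers from the input up to that exit and training the copy independently with a standard cross-entropy loss; the deep-copy construction, shared loader, and SGD optimizer are illustrative choices.

```python
import copy
import torch
import torch.nn.functional as F

def separate_training(backbone_layers, exits, loader, epochs=1, lr=1e-3):
    """exits: dict mapping backbone layer index -> exit head. Each exit gets an
    independent model formed by copying layers 0..idx plus its own head."""
    independent_models = {}
    for idx, head in exits.items():
        trunk = torch.nn.Sequential(*[copy.deepcopy(l) for l in backbone_layers[:idx + 1]],
                                    copy.deepcopy(head))
        optimizer = torch.optim.SGD(trunk.parameters(), lr=lr)
        for _ in range(epochs):
            for x, y in loader:
                loss = F.cross_entropy(trunk(x), y)
                optimizer.zero_grad(); loss.backward(); optimizer.step()
        independent_models[idx] = trunk
    return independent_models
```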
5.4 Two-stage Training Strategy
In this training approach, the backbone DNN is initially trained without considering the side branches, and its parameters are then frozen. These frozen parameters are subsequently used to train each side branch separately. As shown in Figure
7, the backbone layers (L
\(_1\) to L
\(_n\) and FE) are trained first. Their parameters are then used to train each branch independently: side exit branch E
\(_1\) is trained using the frozen parameters of L
\(_1\) and L
\(_2\). Similarly, side exit branch E
\(_2\) is trained using the frozen parameters of L
\(_1\), L
\(_2\), L
\(_3\), and L
\(_4\), and this process is repeated for all branches. The training process is summarized in Algorithm
4. This method is utilized in EPNet [
16], CDL [
104], PersEPhonEE [
75], DeeBERT [
140], PTEENet [
66], TEEM [
110], ENet [
79], TLEE [
133], and LGViT [
143]. This approach is particularly useful when the backbone DNN is already in place and needs to be converted into a multi-exit architecture. Additionally, it enables rapid experimentation with quickly trainable and detachable side branches. Furthermore, this strategy allows for the incorporation of non-neural classifiers, such as decision trees or support vector machines, as auxiliary classifiers for intermediate predictions.
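The second stage of this strategy can be sketched as follows, assuming the backbone has already been trained: all backbone parameters are frozen, and each side-branch head is then fitted separately on the frozen intermediate features. The layer-list/exit-dict layout and the optimizer settings are illustrative assumptions.

```python
import torch
import torch.nn.functional as F

def train_branches_on_frozen_backbone(backbone_layers, exits, loader, epochs=1, lr=1e-3):
    """Stage 2 of two-stage training: the (already trained) backbone is frozen
    and each exit head in `exits` (layer index -> head) is trained separately."""
    for layer in backbone_layers:                     # freeze the pretrained backbone
        for p in layer.parameters():
            p.requires_grad_(False)
    for idx, head in exits.items():
        optimizer = torch.optim.SGD(head.parameters(), lr=lr)
        for _ in range(epochs):
            for x, y in loader:
                with torch.no_grad():                 # frozen features up to layer idx
                    h = x
                    for layer in backbone_layers[:idx + 1]:
                        h = layer(h)
                loss = F.cross_entropy(head(h), y)
                optimizer.zero_grad(); loss.backward(); optimizer.step()
```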
5.5 KD-based Training Strategy
The core principle of KD-based training involves transferring information asymmetrically from high-accuracy exit points to lower-accuracy ones. This approach assumes that the final output layer of the backbone DNN achieves the highest accuracy, with additional branches learning from it. Hence, the backbone DNN acts as the “teacher” model, guiding the augmented exit branches, which serve as “student” models. To facilitate this knowledge transfer, KD-based methods employ a distillation loss that measures the similarity between the predictions of the teacher and student models. This encourages the student models to replicate the soft targets or probability distributions generated by the teacher model. The distillation loss is typically formulated using the
Kullback-Leibler (
KL) divergence, as described in Equation (
5):
\[ d_i^{t,s} = \sum _{c \in C} P^t_c \log \frac{P^t_c}{P^s_{i,c}} \qquad (5) \]
where
\(i\) and
\(c\) are indices representing branch and class, and the superscripts
\(t\) and
\(s\) denote the teacher and student models, respectively.
\(d_i^{t,s}\) denotes the distillation loss for branch
\(i\) with respect to
\(P^t\), the soft targets provided by the teacher, and
\(P^s_i\), the soft targets generated by student
\(i\). Temperature scaling is applied to both
\(P^t\) and
\(P^s_i\) to soften their probability distributions.
\(P^t_c\) signifies the predicted probability distribution by the teacher for class
\(c\), and
\(P^s_{i,c}\) is the corresponding distribution generated by student
\(i\) for class
\(c\). The KL divergence penalizes the student model for deviating from the teacher’s soft targets. One key advantage of the distillation loss is its applicability in scenarios lacking ground truth labels, which enables semi-supervised training of early-exit DNNs. In this method, the teacher model undergoes training first, and subsequently, its outputs are employed to train the student models. The overall loss at each branch
\(i\) integrates both the cross-entropy loss
\(l_i^s\) and the distillation loss
\(d_i^{t,s}\) weighted by a hyperparameter
\(\alpha\), as shown in Equation (
6):
\[ l_i = (1 - \alpha)\, l_i^s + \alpha\, d_i^{t,s} \qquad (6) \]
This formulation ensures a balanced training approach, leveraging the complementary benefits of both loss components. Phuong and Lampert [
106] and Li et al. [
77] utilized a KD-based approach to train early-exit DNNs. The pictorial representation of this training strategy is illustrated in Figure
8 and the training procedure is summarized in Algorithm
5. GAML-BERT [
159], FastBERT [
84], and DE
\(^3\)-BERT [
41] introduced a mutual learning BERT framework that improves BERT’s early-exit performance by using mutual KD techniques among exit points. DynamicSleepNet [
137] also employed this training approach for training a transformer-based early-exit model for sleep stage classification.
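The per-branch loss of Equations (5) and (6) can be sketched as follows: temperature-scaled soft targets from the backbone's final exit (teacher) are compared with the branch's own predictions (student) via KL divergence and mixed with the branch's hard-label cross-entropy loss. The temperature and α values are illustrative, and the convex (1 − α)/α weighting is one common formulation.

```python
import torch.nn.functional as F

def kd_branch_loss(student_logits, teacher_logits, labels, alpha=0.5, temperature=2.0):
    """Combined loss for one exit branch: (1 - alpha) * cross-entropy
    + alpha * distillation loss, following Equations (5) and (6)."""
    # Temperature-scaled soft targets (teacher) and log-probabilities (student).
    p_teacher = F.softmax(teacher_logits / temperature, dim=-1)
    log_p_student = F.log_softmax(student_logits / temperature, dim=-1)
    # KL(P^t || P^s_i), Equation (5); some implementations also rescale by temperature**2.
    distill = F.kl_div(log_p_student, p_teacher, reduction="batchmean")
    ce = F.cross_entropy(student_logits, labels)      # l_i^s, hard-label loss
    return (1 - alpha) * ce + alpha * distill         # Equation (6)
```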
5.6 Hybrid Training Strategy
This approach combines multiple training strategies to train early-exit DNNs. FreezeOut [
10] integrates joint and branch-wise training strategies, while EENet [
52] combines joint training with KD-based methods. Initially, the entire model is trained using the joint training strategy; the side branches are then fine-tuned further using branch-wise or KD-based methods. Pacheco et al. [
100] merge joint training with branch-wise training to create specialized expert side branches. They first train the entire DNN with side branches using joint training, followed by branch-wise training of each side branch with specialized datasets. BERxiT [
141] fine-tunes its parameters by combining joint and separate training strategies. DDI [
136] combines two-stage and joint training strategies. First, the backbone model is trained and its parameters are frozen. Only the gating networks are then trained to achieve a zero skipping ratio. With this initialization, the backbone DNN’s parameters are unfrozen and it is jointly trained with the gating networks to reach the targeted skipping ratio.
5.7 Comparison of Early-exit Training Strategies
Figure
9 evaluates early-exit training strategies based on computational complexity, implementation complexity, interbranch coordination, flexibility, and transfer learning potential. Table
6 outlines the strengths and weaknesses of each strategy, illustrating the tradeoffs involved in choosing a training method. Computational complexity refers to the amount of computational resources required to execute a training strategy, while implementation complexity assesses the difficulty involved in implementing the strategy. Interbranch coordination measures the effectiveness of communication and collaboration between branches during training. Flexibility indicates how well the training process can adapt to different needs, and transfer learning potential evaluates the model’s ability to transfer knowledge to new tasks or domains.
The joint training strategy excels in holistic optimization and inter-branch learning but requires high computational resources and training complexity. The branch-wise training strategy, which trains branches sequentially, balances training complexity and inter-branch coordination, allowing each branch to build upon the previous one; its medium complexity and flexibility make it a practical compromise. The separate training strategy offers maximum flexibility by training each branch independently but suffers from low inter-branch coordination, leading to potential inconsistencies. The two-stage training strategy enables rapid experimentation with quickly trainable and detachable side branches and allows for the incorporation of non-neural classifiers. KD-based training leverages pre-trained models to transfer knowledge to early-exit branches, enhancing performance through standard and distillation losses; however, it relies heavily on the backbone DNN’s quality and involves medium complexity and moderate flexibility. The hybrid training strategy combines strengths from multiple strategies, offering flexibility and customization options, but entails high complexity in design and implementation. This comparison highlights each early-exit training strategy’s strengths and weaknesses, providing insights into the tradeoffs between complexity, coordination, and flexibility that are crucial for selecting the most suitable approach for a specific application.
8 Research Challenges and Future Directions
Early-exit DNNs have been a promising research topic, with numerous articles contributing to significant advances. Despite this progress, many open research problems remain. This section discusses the challenges and potential future directions in early-exit DNNs.
8.1 Expanding into New Modalities and Applications
As illustrated in Table
9, early-exit DNN research has primarily focused on CV tasks, particularly image classification using CNNs as a base model. Domains such as video processing, medical applications, disaster management, machine translation, and text processing have received comparatively less attention. Recent efforts are beginning to explore early-exit DNNs in cross-domain applications, employing DNN variants like recurrent neural networks, generative adversarial networks, graph neural networks, and transformers. Implementing dynamic early-exit strategies in these diverse models introduces new challenges that need addressing. Moreover, there are numerous untapped domains where early-exit techniques could be beneficial. Currently, early-exit DNN models tailored for specific tasks cannot be readily applied across different applications. For example, a classification-oriented early-exit DNN may not seamlessly transition to an object detection task due to the lack of standardized criteria for evaluating input sample complexity. Developing a universal early-exit DNN model applicable across multiple domains remains a formidable challenge.
8.2 Filling Theoretical and Practical Gaps
The hardware and libraries for DNNs are currently optimized for conventional models and are not well-suited for early-exit neural networks. As a result, actual implementation lags behind theoretical efficiency estimates. Investigating the optimization of hardware in conjunction with algorithms and neural network libraries for implementing early-exit DNNs is an important research direction. This will improve the efficiency of early-exit DNNs. Designing an efficient early-exit DNN that is compatible with existing hardware and software is a key open research challenge that must be addressed.
8.3 Optimal Architectural Design
Attaching side branches to conventional DNNs can incur additional computational costs if they fail to produce correct early predictions. The structure and distribution of side branches and the adopted early-exit policy significantly impact the efficiency of early-exit DNNs. Developing an optimal configuration that balances the additional computational cost of side branches with the performance gains from early-exits is challenging. Optimal configurations, including backbone model selection, the number and structure of side branches, and their placement, have yet to be fully investigated.
8.4 Optimal Training Strategy
Effectively training early-exit DNNs is a critical task. In conventional DNNs, intermediate layers near the input extract lower-level features, while deeper layers extract more detailed features. Early-exit DNNs force intermediate layers near the input to extract coarse features via side branches. In a joint or end-to-end training strategy, this can create tension among the gradients of different side branches, reducing overall accuracy. A separate or two-stage training strategy may add extra overhead and reduce performance. Developing an optimal training strategy that combines the benefits of both approaches is an important research area to explore.
8.5 Effective Early-exit Policies
Most existing studies use a static early-exit policy with human-crafted rules or heuristics to determine exit criteria. Some studies employ learnable early-exit policies that reduce human intervention but need further research for accurate prediction. Developing a fully dynamic, learnable early-exit policy that considers the network’s nature, input variations, and an unstable environment while deciding on early termination without compromising accuracy is a promising research direction.
8.6 Advancing Towards Explainable DNNs
Deep learning models are often non-transparent, making their interpretability a crucial research area. Early-exit DNNs, resembling the human visual system, can enhance the explainability of deep learning models by making them more transparent. Their dynamic architecture allows examination of intermediate layers, activations, and outcomes for specific inputs, offering insights into which parts of the model influence predictions. Research in early-exit DNNs with a focus on explainability is an important direction.