

Resource-demand Estimation for Edge Tensor Processing Units

Published: 08 October 2022

Abstract

Machine learning has shown tremendous success in a large variety of applications. The evolution of machine-learning applications from cloud-based systems to mobile and embedded devices has shifted the focus from only quality-related aspects towards the resource demand of machine learning. For embedded systems, dedicated accelerator hardware promises the energy-efficient execution of neural network inferences. Their precise resource demand in terms of execution time and power demand, however, is undocumented. Developers, therefore, face the challenge of fine-tuning their neural networks such that their resource demand matches the available budgets.
This article presents Precious, a comprehensive approach to estimate the resource demand of an embedded neural network accelerator. We generate randomised neural networks, analyse them statically, execute them on an embedded accelerator while measuring their actual power draw and execution time, and train estimators that map the statically analysed neural network properties to the measured resource demand. In addition, this article provides an in-depth analysis of the neural networks’ resource demands and the responsible network properties. We demonstrate that the estimation error of Precious can be below 1.5% for both power draw and execution time. Furthermore, we discuss what estimator accuracy is practically achievable and how much effort is required to achieve sufficient accuracy.

1 Introduction

Machine learning has historically been associated with high resource (i.e., power and execution time) demand. Learned systems based on deep artificial neural networks have achieved human-like or even super-human performance for tasks like image recognition [2] or strategic board games [57]. In addition, learned models are becoming omnipresent: deep learning approaches are increasingly replacing any kind of decision-making algorithm that used to be explicitly programmed, for example, network congestion control [16] or task scheduling [8]. While artificial neural networks have existed for a long time, their recent success has been enabled by high-performance hardware (e.g., GPUs and TPUs [27]) running in the cloud [21, 27]. In consequence, state-of-the-art models based on deep learning have a significant energy demand [58].
Due to their tremendous success, machine learning applications have entered the domain of embedded systems, such as cars, smartphones, and wearables [2, 19, 60, 62]. To enable machine learning in embedded devices, applications have to adhere to their limited resources [23, 33, 34, 35, 54] in addition to providing a good quality of service (e.g., prediction accuracy). First, many embedded systems are energy-constrained, either by a battery or by a limited power supply. Second, embedded systems often interact with their environment and therefore also have timing requirements (i.e., responsiveness). In consequence, resource-awareness is crucial for embedded machine learning, as it enables the consideration of resource efficiency and adherence to resource utilisation bounds.
The need for embedded machine learning has led to the development of special-purpose accelerator hardware for neural networks (e.g., Intel Neural Compute Stick [25], Google Coral Edge TPU [6, 44], Nvidia Jetson Nano [7, 11], Kendryte K210 [24]) that are commercially available as a stand-alone product or also integrated on a wide variety of different platforms. These hardware accelerators satisfy the growing demand and interest for machine learning approaches [1, 43] and promise to execute the corresponding machine learning workloads more efficiently than general-purpose hardware, using the accelerator-specific integrated circuits.
To make use of dedicated resource-efficient accelerator hardware, embedded system designers still need methods and means to anticipate the resource-demand of machine learning workloads. While resource-demand is originally caused by the hardware, it is controlled by the executed software. Therefore, resource-efficient embedded machine learning requires both efficient hardware and software.
A straightforward way to determine the power and energy demand for the execution of machine learning workloads is to measure it, for example, with a power meter. However, such power measurements depend on the availability of all final components, including the hardware (i.e., accelerator modules), the measurement infrastructure (i.e., power meters), which can be costly, and the software. In particular, measurements at runtime depend on hyperparameters that are fine-tuned during an expensive—with respect to time and energy—training process [58]. Therefore, it is most desirable to estimate the energy demand before all components of the final system are available. In particular, system developers need to be able to define neural-network parameters prior to the resource-intensive training process. However, these parameters influence non-functional system properties, such as latency and energy demand. Hence, a resource-demand model enables developers to assess the influence of parameter choices and revise bad choices before needlessly spending resources on the training process.
This article presents a comprehensive approach and the corresponding tool-chain to predict the resource demand of embedded machine learning workloads prior to their execution [51, 52]. Our tool-chain estimates the resource demand (i.e., power and execution time) of neural networks based on statically available information only. Thus, embedded system designers can estimate the resource demand in an early development stage before training, without the need for the hardware and measurement infrastructure. Our system, hence, enables the resource-aware development of deep-learning applications.
This article, in particular, tackles the challenge of analysing and estimating the energy demand of machine learning workloads. We analyse the power and energy demand of a hardware accelerator for machine learning (i.e., Google Coral [44]). We present and discuss the concept and implementation of Precious, an approach for the estimation of the energy demand for machine learning workloads. Based on the conducted energy measurements at the hardware level, Precious utilises machine learning techniques itself to train models that enable resource-demand prediction. Besides the development of individual resource-demand models, this article generalises the observed behaviour. We describe which neural network properties need consideration in modelling the resource demand. In addition, we demonstrate that the device’s temperature and data transmissions effectively limit the achievable estimation accuracy of the power draw and the execution time, respectively.
The contributions of this article are the following:
We measure, analyse, and discuss the power draw and execution time of different neural network architectures on commercially available accelerator hardware.
Based on our analysis, we identify important network properties and make suggestions for efficient network architectures, which are useful for developers.
We present Precious, a comprehensive approach to model the resource demand (i.e., power and execution time) of neural networks based on statically available information.
We evaluate the models generated by Precious empirically for their estimation accuracy.
We discuss the limitations of predictability of the resource demand of neural network accelerators when considering statically available information only. We demonstrate how the temperature influences the power draw and that execution time variance is caused by the communication with the accelerator hardware.
The rest of this article is structured as follows: First, Section 2 puts our work in context and discusses several application scenarios that can benefit from resource-demand models. Second, Section 3 describes our measurement setup and analyses the execution time and power draw of neural networks on the TPU platform. Then, Section 4 presents the design and implementation of Precious. Section 5 evaluates various models generated by Precious empirically and puts their accuracy into context by analysing the limits of predictability of the accelerator. Section 6 discusses related work and Section 7 concludes this article.

2 Application Scenarios of Resource-demand Prediction Models

Resource-demand prediction models are important for both the design phase and the operation of neural networks. In general, resource awareness is a prerequisite for three optimisation goals—resource-demand reduction techniques, tradeoffs between the resource demand and application-specific utility metrics, and the optimisation of a utility metric while adhering to resource limits. For such optimisations, resource-demand models take an enabling role by providing the necessary information. For neural network accelerators, related optimisation problems occur in various scenarios.
Hyperparameter Optimisation. One important stage during the development of a neural network is the hyperparameter selection, which is decisive for the potential quality of a neural network. Finding a good hyperparameter set is complex due to the enormous number of possible hyperparameter combinations. Techniques such as random search [4] have been developed to relieve developers of manually selecting hyperparameters. Due to resource restrictions (e.g., maximum execution time, limited memory, or battery capacity), however, not all hyperparameter combinations are applicable in a specific scenario. A resource-demand prediction model can restrict the search space to applicable hyperparameter combinations only (e.g., by restricting the total number of weights) and thus helps to guide the hyperparameter-optimisation efforts and skip infeasible combinations. Furthermore, if several potential hyperparameter sets with similar quality are found, then the prediction model can be used to select the most resource-efficient set.
Neural Network Pruning and Specialisation. Pruning neural networks after training [3, 9, 36] and applying specialised data types [5, 28, 59] have the potential of significantly reducing the resource demand with no or only little quality loss. Usually, the quality loss can be easily determined by re-testing the pruned and specialised network on an evaluation dataset. However, the resource demand (e.g., energy demand) requires laborious and time-consuming measurements. A resource-demand prediction model can replace measurements and thus enable a completely automated pruning and specialisation process in software. That way, the desired tradeoff between network quality and resource efficiency can be automatically determined.
Hardware Platform Selection. For new neural network–driven applications, the required minimal network quality and corresponding neural network architecture are often known before the hardware platform is selected. Resource-demand prediction models can help to select an appropriate hardware platform, in terms of resource and (monetary) cost efficiency, without access to the hardware itself. In addition, this process can be combined with the pruning and specialisation approach described in the previous paragraph.
Energy-aware Runtime Systems. In many cases, systems can utilise heterogeneous hardware to execute neural networks, for example, CPUs, GPUs, or TPUs. Although specialised hardware such as GPUs and TPUs usually implies a lower resource demand for neural network execution, the network and data transfers from and to the CPU impose additional overhead. Thus, an energy-aware runtime system has to decide, at runtime and depending on the network, which hardware to use. Resource-demand prediction models enable such decisions by indicating whether the lower execution time outweighs the additional transfer overhead, and hence are one building block for energy-aware runtime systems, as the sketch below illustrates.
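As an illustration of such a runtime decision, consider the following minimal Python sketch. It is not part of Precious; the estimator objects (anything with a scikit-learn-style predict method) and the transfer-overhead figure are hypothetical placeholders.

```python
# Illustrative sketch only: `estimators` maps device names to trained
# resource-demand models, and `transfer_overhead_s` approximates the
# host-accelerator transfer cost. Both are hypothetical placeholders.
def pick_device(network_features, estimators, transfer_overhead_s):
    """Select the device with the lowest predicted total latency."""
    best_device, best_latency = None, float("inf")
    for device, estimator in estimators.items():
        latency = estimator.predict([network_features])[0]
        if device != "cpu":  # off-CPU execution pays a transfer cost
            latency += transfer_overhead_s
        if latency < best_latency:
            best_device, best_latency = device, latency
    return best_device
```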

3 Resource Demand of Embedded Neural Network Accelerators

This section introduces the Google Coral Edge TPU, the accelerator hardware considered in this work. We first describe the accelerator and our measurement setup. Then, we discuss the TPU’s observable behaviour based on measured power traces.

3.1 Neural Network Accelerator

The Google Coral Edge TPU [44] is a hardware accelerator for neural network inferences. It is designed for embedded systems with limited power supply (e.g., smartphones), and it promises high power efficiency. The version considered in this article is attached via USB.
This TPU is designed to compute inferences of TensorFlow Lite (TFLite) [39] compatible models. Neural networks can be developed, for example, with TensorFlow in combination with Keras [43] or similar interfaces. This step creates standard TensorFlow “models,” which can be translated to a TFLite-compatible format using the TFLite converter [41]. Afterwards, the “Edge TPU compiler” [40] prepares the TFLite-compatible neural network for the TPU. For example, the neural network is converted to an 8-bit fixed-point number format (i.e., “quantisation”). At runtime, the compiled neural network is passed to an “interpreter” object, which delegates the execution to one of the available processors. This interpreter encapsulates and thus hides communication details. For example, if the neural network is small enough to fit into the 8 MiB memory of the TPU, then it is fully loaded when computing the first inference. If, however, the neural network is too large, then the compiler and interpreter transparently use “off-chip memory” and apply a “streaming” mechanism as a work-around.
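For illustration, the following sketch follows Coral’s documented Python workflow for running an inference through the TFLite interpreter with the Edge TPU delegate; the model file name is a placeholder.

```python
# Minimal inference sketch following Coral's documented TFLite workflow;
# "model_edgetpu.tflite" is a placeholder for a compiled network.
import numpy as np
import tflite_runtime.interpreter as tflite

interpreter = tflite.Interpreter(
    model_path="model_edgetpu.tflite",
    experimental_delegates=[tflite.load_delegate("libedgetpu.so.1")])
interpreter.allocate_tensors()

inp = interpreter.get_input_details()[0]
out = interpreter.get_output_details()[0]

# Random uint8 input matching the 8-bit quantised model format.
data = np.random.randint(0, 256, size=inp["shape"], dtype=np.uint8)
interpreter.set_tensor(inp["index"], data)
interpreter.invoke()  # the first invocation also loads the model onto the TPU
result = interpreter.get_tensor(out["index"])
```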
While some TPU architectures are described in the literature [27], this specific hardware platform is poorly documented [45]. In addition, the compiler and the interpreter are mostly closed-source, and the neural network is encoded in an undocumented binary file format. Unfortunately, this lack of documentation is typical for commercially available neural network accelerators.
The absence of comprehensive public documentation makes it impossible to create timing and power models by construction. Instead, our approach creates these models by learning, that is, by monitoring the resource demand and deriving models using statistical methods.

3.2 Measurement Setup

Figure 1 visualises the structure of our energy measurement setup, and Figure 2 shows a picture of the actual hardware setup. The host system submits neural networks to the Coral Edge accelerator [44] for execution. Precious intercepts the power supply of the accelerator at its USB connection. It uses a shunt resistor together with an LTC2991 voltage and current monitoring sensor [15], which has a 14-bit ADC, to measure the power draw. A microcontroller (“sampling chip”) polls the power measurement values periodically via I2C, aggregates them, and eventually communicates them to the host system via USB.
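Conceptually, the sampling chip reduces each voltage/current sample pair to a power value and averages over the sampling window. The following is a minimal sketch of that aggregation, assuming a simple tuple format for the samples; the register-level I2C access to the LTC2991 is omitted.

```python
# Hedged sketch: the sensor reports bus voltage and shunt current;
# the power draw follows as P = V * I. The sample format is assumed.
def average_power_mw(samples):
    """samples: iterable of (bus_voltage_v, current_a) readings."""
    powers_mw = [v * i * 1000.0 for v, i in samples]
    return sum(powers_mw) / len(powers_mw)
```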
Fig. 1. The energy-demand measurement setup intercepts the power supply of the accelerator at its USB connection.
Fig. 2. Picture of the energy-demand measurement setup with ① the TPU, ② ADC and shunt, and ③ the sampling chip.

3.3 Power Traces

This setup allows Precious to create detailed power traces, as shown in Figure 3. According to our measurements, the accelerator’s idle power draw is between 280 mW and 300 mW. During the first two seconds of the measurement, before the actual inference execution begins (at around three seconds into the measurement), we observe significant fluctuations in power demand that go hand-in-hand with the creation of the TFLite interpreter object. This software object is necessary for the execution of the model [41] and switches the accelerator to a state of increased power demand. Hence, a pre-processing stage exists before the first execution of an inference.
Fig. 3. Power-draw traces of neural network inferences over time, including the TPU’s pre-processing and post-processing stages (top), and with higher temporal resolution and focussed on the inference stage (bottom).

3.4 Pre-processing Stage

The pre-processing stage contains a low-power substage and a high-power substage. Both the duration of the low-power substage and its power draw (\(\le\)300 mW) are independent of neural network properties. For the high-power substage, the power draw is likewise independent of the neural network (\(\approx\)1,000 mW), but its duration varies: it correlates positively with the network size (Spearman correlation coefficient \(\approx\)0.5). However, during this phase, the neural network is not transmitted to the TPU, as documented in Reference [42] and confirmed by our USB communication traces. The most plausible explanation is that the host system prepares the neural network for execution on the TPU during this stage, while the TPU is in an active state, ready to receive the neural network.

3.5 Inferences Stage

The TPU executes inferences only during the middle part of the power trace (in Figure 3, from around three seconds until approximately six and a half seconds). During the first inference, the neural network gets loaded onto the TPU, so this inference lasts longer than all others [42]. Therefore, its measurement results are excluded from further processing in Precious. The subsequent inferences yield repeatable measurement results from which Precious can derive its power and time models.
In Figure 3, the measured values of several inferences of the same convolutional and fully connected neural networks were superimposed to visualise the power course of an inference. The superimposed visualisation allows a detailed analysis of the power draw over time, even though a single inference executes within a few milliseconds. This is due to the intentionally deterministic execution behaviour of the TPU [27]. The measurements show clear differences between fully connected (“dense”) and convolutional networks. Convolutional networks have clearly repeating “hills” and a higher power draw of 1,200 mW to 1,400 mW. Our communication analysis (cf. Section 5.5) shows that the power drop between hills is caused by data transmissions between the TPU and host. In comparison, dense networks show only minor oscillations near 1,150 mW, as shown in Figure 3. For both network types, the observed behaviour is repeatable.
In summary, both the execution time and the power draw during the inferences stage depend strongly on the neural network. As Figures 5 and 6 show, there are correlations between properties of the neural network and the resource demand for its execution. Therefore, Sections 4 and 5 describe the design, implementation, and evaluation of resource-demand prediction models for the inference stage.

3.6 Post-processing Stage

After the inferences stage, the TPU remains in an active high-power state for approximately 3.5 seconds before returning to a low-power idle state. The power draw of this post-processing stage is similar to that of the pre-processing stage and higher than the TPU’s initial idle state. We assume that the pre- and post-processing stages correspond to different USB transmission and power states of the Cortex-M0+ microprocessor that operates on the TPU stick. The duration of the post-processing stage appears to be constant and, in particular, independent of the properties of the executed neural network. This post-processing stage was only observable in TensorFlow version 2.0; the behaviour changed when we updated our host system to version 2.1. In the newer version, the TPU remains in a high-power state (\(\approx\)1,000 mW), rather than switching to a low-power idle state (\(\le\)300 mW).

4 Design and Implementation

In this section, we present the Precious approach for the resource-demand estimation of neural networks. We outline design considerations of Precious and discuss its current implementation, which uses a Google Coral Edge Tensor Processing Unit (TPU), a commercially available off-the-shelf neural network accelerator for embedded systems that connects to its host system via USB [44]. Precious organises the process of estimating the resource demand of neural networks in five phases (see Figure 4). First, Precious generates randomised neural networks. Second, it applies a static neural network analysis. Third, it measures their resource demand. Fourth, a machine learning approach builds models that map a combination of neural network properties (e.g., number of layers) to the resource demand. In the last phase, applications can utilise these models to estimate the resource demand of their networks. The phases, the necessary steps within each phase, and our implementation for each phase are described in the following sections.
Fig. 4. Precious comprises five phases for the estimation of execution time and power draw of neural networks.

4.1 Neural Network Generation Phase

In the first phase, Precious uses the machine learning framework TensorFlow 2.1 with Keras [43] and Python 3.6 to generate randomised neural networks. It supports three different network types: convolutional (“conv2d”), fully connected (“dense”), and heterogeneous networks [30]. Generated neural networks are either homogeneous or heterogeneous. First, homogeneous networks consist of either only convolutional or only fully connected layers. Second, heterogeneous networks contain both convolutional and fully connected layers. As of now, the implementation supports “relu” as an activation function and imposes restrictions on the dimensions of convolutional networks. For each layer, the input dimension is the same as the output dimension. All generated homogeneous networks consist of layers of the same type and dimension (for convolutional layers, this implies the use of the “same” padding). Convolutional layers always have a square-dimension input with a depth of 3 and 3 filters with the dimension \((3,3)\). Our system randomly varies the number of layers (between 2 and 250) and the dimensions (between 100 and 1,024). Generated heterogeneous networks start with convolutional layers, followed by a “maxPool2d” and a “flatten” layer, and end with dense layers. The number of layers varies between 10 and 1,024, and the range of layer dimensions equals that of the homogeneous networks. For all network types, all network-internal parameters (i.e., weights) are initialised to random values. The neural networks remain untrained, as we use Precious to examine the resource demand of different network types and configurations and not the resource demand for a specifically trained network.
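A minimal sketch of the homogeneous-network generation under the stated constraints follows; the exact generator code of Precious is not reproduced here, so the function names and structure are illustrative.

```python
# Sketch of homogeneous-network generation under the stated constraints;
# weights are left at their random initialisation, as in Precious.
import random
import tensorflow as tf

def random_dense_network():
    n_layers = random.randint(2, 250)
    dim = random.randint(100, 1024)
    model = tf.keras.Sequential(
        [tf.keras.layers.InputLayer(input_shape=(dim,))])
    for _ in range(n_layers):
        model.add(tf.keras.layers.Dense(dim, activation="relu"))
    return model

def random_conv_network():
    n_layers = random.randint(2, 250)
    dim = random.randint(100, 1024)  # square input with depth 3
    model = tf.keras.Sequential(
        [tf.keras.layers.InputLayer(input_shape=(dim, dim, 3))])
    for _ in range(n_layers):
        # 3 filters of dimension (3, 3); "same" padding keeps the
        # output dimension equal to the input dimension
        model.add(tf.keras.layers.Conv2D(3, (3, 3), padding="same",
                                         activation="relu"))
    return model
```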
To execute a neural network on the accelerator, it must first be converted to a TensorFlow Lite (TFLite) model. In addition, all parameters must be in an 8-bit fixed-point number format. The conversion to this format is achieved either during training (i.e., “quantisation-aware training”) or afterwards (i.e., “post-training quantisation”). Since Precious only uses randomised instead of meaningfully trained weights, it uses the latter technique to translate the randomised parameters to 8-bit values. Furthermore, this technique also makes it possible to convert existing pre-trained neural networks and execute them on the accelerator. This neural network translation process is executed on the host system using the Edge TPU compiler [40].
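The post-training quantisation step can be sketched with the public TFLite converter API. The random representative dataset below mirrors the randomised inputs used by Precious, and exact converter flags may differ slightly between TensorFlow versions.

```python
# Post-training quantisation sketch (TFLite converter API; flag names
# may vary slightly between TensorFlow versions).
import numpy as np
import tensorflow as tf

def quantise_for_edgetpu(model, input_shape):
    def representative_data():
        for _ in range(100):
            yield [np.random.rand(1, *input_shape).astype(np.float32)]

    converter = tf.lite.TFLiteConverter.from_keras_model(model)
    converter.optimizations = [tf.lite.Optimize.DEFAULT]
    converter.representative_dataset = representative_data
    # Force full 8-bit integer quantisation, as required by the Edge TPU.
    converter.target_spec.supported_ops = [tf.lite.OpsSet.TFLITE_BUILTINS_INT8]
    converter.inference_input_type = tf.uint8
    converter.inference_output_type = tf.uint8
    return converter.convert()
```

The resulting .tflite file is then compiled for the accelerator with the edgetpu_compiler command-line tool [40].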

4.2 Neural Network Analysis Phase

In the second phase, Precious analyses several properties of the neural network. This analysis is entirely static, that is, all information is available without executing the neural network.
First, the TensorFlow format already offers information about the number of layers and the layer dimensions. For the neural networks generated by Precious, these properties are chosen randomly. For applications that utilise neural networks, these properties are selected by the application developer and potentially adjusted during or after the training process. Second, Precious derives the number of (internal) neural network parameters, such as weights and biases, from the number and dimension of the neural network layers. Furthermore, Precious computes the number of multiply-accumulate (MAC) operations needed per neural network execution. Third, the Edge TPU compiler provides the network size (i.e., memory demand) of the neural network. The memory demand is important because small neural networks fit into the 8 MiB on-chip memory, whereas larger neural networks transparently use a “streaming” mechanism where parts of the neural network are fetched on demand during inferences. Finally, the file system provides the file size of the neural network in the binary format used by the accelerator after compilation. In summary, these easily accessible properties later serve as features for the training phase, which creates a model that maps the neural network properties to their resource demand. The labels (i.e., the resource demand) needed for the training phase are collected in the neural network execution phase.
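The parameter and MAC counts follow from standard per-layer formulas; the sketch below shows them for the layer shapes generated in Section 4.1 (function names are illustrative).

```python
# Static per-layer statistics for the generated network shapes;
# standard counting formulas, one MAC per weight application.
def dense_layer_stats(in_dim, out_dim):
    params = in_dim * out_dim + out_dim              # weights + biases
    macs = in_dim * out_dim
    return params, macs

def conv2d_layer_stats(width, height, in_depth, n_filters, k=3):
    params = (k * k * in_depth + 1) * n_filters      # kernels + biases
    macs = k * k * in_depth * n_filters * width * height  # "same" padding
    return params, macs
```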
Figure 5 summarises the influence of the number of MAC operations on the execution time and power draw of fully connected and convolutional networks. Some correlations are clearly visible. Similarly, Figure 6 visualises the influence of the memory demand of neural networks on the resource demand. The figures look similar because the number of MAC operations and the memory demand correlate positively. However, neither property alone is suitable as a resource-demand predictor, since both figures show a large amount of noise. As the resource-demand measurements are repeatable, this noise cannot be run-to-run variation. Instead, the visible variance is caused by other neural network properties that are not shown, as the many-dimensional property space is reduced to two-dimensional representations. In consequence, no single property is sufficient to accurately estimate power draw and execution time. Instead, Precious has to consider multiple properties in the estimation models.
Fig. 5. Influence of the number of MAC operations on the execution time and power draw for all generated dense (left) and convolutional (right) neural networks.
Fig. 6. The execution time of fully connected neural networks strongly depends on the memory demand of the neural network. The tipping point between two visible linear relations is at 8 MiB, the on-chip memory capacity of the TPU.

4.3 Neural Network Execution Phase

In the third phase, Precious measures the energy demand of neural networks. As described in Section 3, Figure 1 visualises the structure of our energy measurement setup, and Figure 2 shows a picture of the actual hardware setup. For the power and execution time measurements, the host system submits neural networks to the Coral Edge accelerator [44] for execution. The input data for the execution is randomised, but used repeatedly for each neural network, and no batching is applied.
Our embedded TPU [44] can be configured in two variants, a standard variant (STD) and a second variant (MAX) that operates at the maximum operating frequency [42]. Both the power demand and execution time of the TPU depend on this configuration. We use the former (i.e., slower) variant for our implementation of Precious, because the latter suffers from documented thermal problems [42]. To avoid overheating, Precious nevertheless inserts idle periods during the evaluation to maintain a constant temperature. Section 5.4 looks at the self-induced warm-up in detail.
Each neural network is measured in five iterations. Each iteration repeatedly executes inferences of the neural network and lasts at least 30 seconds. The power draw and execution time are averaged over each iteration, ignoring the first inference, as it takes longer because the neural network is loaded onto the TPU. We then combine the five iterations by computing the median. In general, power draw and execution time show little run-to-run variance; for example, the detailed power trace in Figure 3 combines measurements from multiple inferences. The obtained power and time measurements serve as labels in the training phase, following the protocol sketched below.
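The following sketch captures this measurement protocol; run_inference() and read_power_mw() are hypothetical wrappers around the interpreter invocation and the sampling chip, and the per-inference power sampling is a simplification of the continuous trace recording.

```python
# Measurement-protocol sketch: five iterations of at least 30 seconds,
# first inference discarded, per-iteration averages combined via the
# median. run_inference() and read_power_mw() are hypothetical.
import statistics
import time

def measure(network, iterations=5, min_duration_s=30.0):
    per_iteration = []
    for _ in range(iterations):
        run_inference(network)  # first inference loads the model; discard
        times, powers = [], []
        start = time.monotonic()
        while time.monotonic() - start < min_duration_s:
            t0 = time.monotonic()
            run_inference(network)
            times.append(time.monotonic() - t0)
            powers.append(read_power_mw())
        per_iteration.append((statistics.mean(times),
                              statistics.mean(powers)))
    return (statistics.median(t for t, _ in per_iteration),
            statistics.median(p for _, p in per_iteration))
```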

4.4 Training Phase

The training phase applies supervised learning where the neural network properties (cf., Section 4.2) are the features and the resource demand (cf., Section 4.3) is the label.
The total dataset contains 2,992 homogeneous and 127 heterogeneous neural networks that are partitioned into training (80%), validation (10%), and evaluation (10%) sets. Each regressor trains a model using the training set only. Hyperparameters were fine-tuned using randomised parameter optimisation (RPO) with 100 iterations on the validation set. The hyperparameter optimisation was only conducted for the homogeneous neural network dataset, and the results were reused for the heterogeneous neural networks. Consequently, the validation dataset was merged into the evaluation dataset for heterogeneous networks. Importantly, the regressors never access the evaluation set—that set is only used for the evaluation in Section 5.2. We create, in summary, 10 estimators—for time and power, and for small dense (i.e., without off-chip memory), large dense (i.e., with off-chip memory), convolutional neural networks, and for small (i.e., without off-chip memory) and large (i.e., with off-chip memory) heterogeneous networks.
The models are trained by, in total, eight regressors with 18 configurations (denoted as (*)) in the non-heterogeneous case. Models are either linear [14] or ensemble [13], that is, models that internally combine different machine learning techniques. Table 1 gives an overview of the regressors and the models they create. Regressors for ensemble models are configurable by the error metric that is minimised during training, such as the Friedman mean squared error (FMSE), the mean absolute error (MAE), or the mean squared error (MSE). For comparison, an additional dummy regressor computes the mean (MEA) or the median (MED) of the labels, ignoring all features. This dummy regressor serves as a baseline for the evaluation of the linear and ensemble models. Based on the results for homogeneous networks, we used the ETR(MAE) regressor to implement the models for heterogeneous networks.
Model Type | Regressor          | Description
Dummy      | DR(*)              | Dummy regressor
           | ↳ (MEA, MED)       | Mean, median
Linear     | RR(*)              | RANSAC regressor
           | ↳ (LR)             | Linear base estimator
           | ↳ (R)              | Ridge base estimator
           | LR                 | Linear regressor
           | HR                 | Huber regressor
Ensemble   | ETR(*)             | Extra Trees regressor
           | ↳ (FMSE, MAE, MSE) | (error metrics)
           | RFR(*)             | Random Forest regressor
           | ↳ (FMSE, MAE, MSE) | (error metrics)
           | DTR(*)             | Decision Tree regressor
           | ↳ (FMSE, MAE, MSE) | (error metrics)
           | ABR(*)             | AdaBoost regressor
           | ↳ (FMSE, MAE, MSE) | (error metrics)
Table 1. Overview of Regressors for Linear and Ensemble Models Examined to Predict the Power Demand and Execution Time for Executing Homogeneous Neural Networks on the TPU
We intentionally did not train models that use deep learning or neural networks internally—neural networks are only the subject of modelling, not the implementation technique. The relatively simple regressors used in this article require significantly lower training efforts, but the evaluation shows that they provide accurate estimations.
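All regressor families in Table 1 are available in scikit-learn. The following is a hedged training sketch for the ETR(MAE) configuration with randomised parameter optimisation; the parameter grid shown is illustrative, not the exact search space used by Precious.

```python
# Training sketch with scikit-learn; the search space is illustrative.
from sklearn.ensemble import ExtraTreesRegressor
from sklearn.model_selection import RandomizedSearchCV

def train_etr_mae(train_features, train_labels):
    # criterion="absolute_error" selects the MAE split criterion
    # ("mae" in older scikit-learn releases)
    search = RandomizedSearchCV(
        ExtraTreesRegressor(criterion="absolute_error"),
        param_distributions={
            "n_estimators": range(10, 500),
            "max_depth": [None] + list(range(2, 32)),
        },
        n_iter=100)  # 100 randomised-parameter-optimisation iterations
    search.fit(train_features, train_labels)
    return search.best_estimator_
```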

4.5 Application Phase

In the final phase, the developers of deep learning applications apply the trained models to estimate the resource demand of their neural networks. The intended use-case is that phases 1 to 4 are executed once, for example, by the hardware vendor, and the resulting model is distributed. In contrast to cycle-accurate simulators, such measurement-based resource-demand models can be published without the risk of leaking intellectual property, as they are derived from user-observable properties only. Using these models, deep-learning application developers can determine the resource demand of different network architectures—without power measurement infrastructure and before training—and restrict the training process to models that satisfy all resource constraints. Further potential application scenarios are discussed in detail in Section 2.
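A hypothetical usage sketch illustrates this application phase: a developer checks a candidate architecture against resource budgets before training. All names are placeholders; the feature order follows the selection in Section 5.1.

```python
# Hypothetical application sketch; the estimators are the distributed,
# pre-trained Precious models, with the feature order of Section 5.1.
def within_budget(features, time_estimator, power_estimator,
                  latency_budget_s, power_budget_mw):
    """Reject architectures that would violate resource constraints
    before any energy is spent on training them."""
    predicted_time = time_estimator.predict([features])[0]
    predicted_power = power_estimator.predict([features])[0]
    return (predicted_time <= latency_budget_s
            and predicted_power <= power_budget_mw)
```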

5 Evaluation

This section evaluates the models trained by Precious empirically to answer the following questions:
Which analysed neural network properties are useful as features for resource-demand models?
What estimation accuracy is realistically achievable, depending on the model complexity?
What differences between homogeneous and heterogeneous neural networks can be observed?
What are the inherent limitations to the resource-demand estimation accuracy for the TPU?
What operating frequency of the TPU is optimal for which neural network?

5.1 Feature Selection

In general, the properties derived by static analysis carry information on the resource demand of the neural network. We analyse the importance of individual properties by computing the Spearman correlation coefficient between a feature and the power draw (\(r_p\)) and between the feature and the execution time (\(r_t\)). A high correlation (i.e., values near 1 or \(-1\)) indicates a strong dependency between the respective property and the resource demand. In consequence, such a property is a useful feature for a resource-demand estimator. Table 2 summarises the Spearman correlation coefficients for the homogeneous (i.e., convolutional, small dense, and large dense) and heterogeneous (i.e., small and large) neural network properties used by Precious.
Input Features (Homogeneous Networks). Each property has a high correlation in at least one case, indicating that each property contributes to the estimation accuracy of the resource-demand models. In most cases, the number of MAC operations correlates strongly with the resource demand, in particular for the execution time data visualised in Figure 5. For large dense models with parameter streaming, the execution time correlates extremely strongly with the amount of off-chip data, as indicated in Figure 6. However, Figures 5 and 6 also demonstrate that neither property alone is sufficient to estimate the resource demand. Instead, a combination of properties is needed for an accurate estimation.
(Convolutional, Small Dense, and Large Dense are homogeneous networks; Small and Large are heterogeneous. Each cell lists \(r_t\), \(r_p\).)

Feature       | Convolutional  | Small Dense    | Large Dense      | Small          | Large
size_on-chip  | 0.2471, 0.8276 | 0.7992, 0.9361 | −0.4723, −0.2951 | 0.2542, 0.1249 | −0.1777, 0.0436
size_off-chip | –, –           | –, –           | 0.9999, 0.5302   | –, –           | 0.6085, −0.6280
size_file     | 0.8822, 0.6028 | 0.8126, 0.9352 | 0.9998, 0.5281   | 0.3225, 0.1683 | 0.6397, −0.5741
input dim.    | 0.9477, 0.3739 | 0.2360, 0.7233 | 0.7596, 0.5544   | 0.9331, 0.4906 | 0.6889, 0.6546
# layers      | 0.2471, 0.8276 | 0.7140, 0.1424 | 0.3989, 0.0336   | 0.0433, 0.3577 | 0.4346, −0.2421
# parameters  | 0.2471, 0.8276 | 0.7614, 0.9411 | 0.9982, 0.5339   | 0.1196, 0.2447 | 0.6026, −0.5951
# MACs        | 0.9909, 0.6928 | 0.7605, 0.9412 | 0.9982, 0.5340   | 0.9944, 0.6726 | 0.8664, 0.5367
↳ dense       | –, –           | 0.7605, 0.9412 | 0.9982, 0.5340   | 0.2434, 0.1181 | 0.6002, −0.5968
↳ conv.       | 0.9909, 0.6928 | –, –           | –, –              | 0.9950, 0.6853 | 0.7784, 0.6245
Table 2. Spearman Correlation Coefficient between Statically Analysable Neural Network Properties and the Measured Execution Time (\(r_t\)) and Power Draw (\(r_p\))
Input Features (Heterogeneous Networks). For the resource-demand estimation of heterogeneous networks, we analysed the same input features as for homogeneous networks. Additionally, we analyse the number of MAC operations disaggregated into convolutional and dense layers. We deliberately refrain from using more complex representations of neural network topologies (e.g., the order of layer types) and continue to focus on metrics that are easy to determine statically and easy to represent, for example, the number of parameters or MAC operations. This allows us to use machine learning techniques with relatively little overhead.
The Spearman correlation coefficients show that the memory demand of small heterogeneous networks is only weakly correlated with the resource demand. Due to their heterogeneity, networks of the same size can have very different topologies and hence resource demands. The off-chip memory demand of large heterogeneous networks, however, determines the overhead for parameter streaming and is hence more strongly correlated.
In contrast to homogeneous networks, the number of layers and the number of parameters are only weakly correlated, as different layers and parameters potentially yield very different behaviour, depending on the network topology. Furthermore, to fit on the TPU, small heterogeneous networks can only have a limited number of layers. On the other hand, the number of MAC operations again proved to be strongly correlated with the resource demand and remains one of the most important input features. Furthermore, the number of MAC operations for convolutional layers significantly outweighs the number of MAC operations for dense layers.
Based on these analyses, we selected the off-chip memory demand, total memory demand, and the three metrics regarding the number of MAC operations as inputs to the resource-demand estimators for heterogeneous networks.
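The coefficients in Table 2 can be reproduced with standard tooling. A sketch using scipy follows; the array layout is an assumption, not the exact analysis code of Precious.

```python
# Feature-importance sketch: Spearman correlation of each feature
# column against the measured resource demand, as in Table 2.
import numpy as np
from scipy.stats import spearmanr

def feature_correlations(features, exec_times, power_draws, names):
    """features: (n_networks, n_features) array; returns {name: (r_t, r_p)}."""
    result = {}
    for column, name in zip(np.asarray(features).T, names):
        r_t, _ = spearmanr(column, exec_times)
        r_p, _ = spearmanr(column, power_draws)
        result[name] = (r_t, r_p)
    return result
```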

5.2 Model Accuracy (Homogeneous Neural Networks)

Figure 7 summarises the estimation accuracy of all models trained to estimate the resource demand of homogeneous networks. We measure the resource demand for the evaluation dataset and compare this “true” value with the estimation based on the trained models. We use the mean absolute percentage error (MAPE) as the error metric, because it normalises the error to the measurement, and the individual resource demand can vary significantly from network to network. The plot is cut off at 15% to maintain readability. The evaluation demonstrates the following:
Fig. 7. Linear and ensemble model evaluation of the prediction quality for (a) dense networks without and (b) with off-chip memory demand, and (c) convolutional networks.
The dummy regressor performs badly at modelling the execution time. This demonstrates that execution-time modelling is not trivial and neural network properties must be considered.
The ABR(*) regressors cannot model the execution time of neural networks accurately. For dense and convolutional networks, the estimation error is larger than for all other ensemble and linear models. This relatively bad accuracy might be caused by an insufficient number of hyperparameter-tuning iterations. The accuracy of the outlier-robust HR regressor is consistently worse than that of a simple linear regressor.
The execution time of dense networks without off-chip memory (i.e., without streaming) can generally be approximated well. Linear and ensemble models achieve similar accuracy. This means that the relation between the chosen neural network properties and the execution time is linear (as indicated by Figure 6). For the execution time of dense networks with off-chip memory (i.e., with streaming), the linear models perform worse than the ensemble models. This means that, in contrast to small networks without off-chip memory, the relation is no longer strictly linear. There is no single supreme regressor, as the ensemble models (ETR(*), RFR(*)) achieve similar accuracy. The execution time of convolutional networks is harder to predict than that of dense networks, as the remaining MAPE is higher. Similar to dense networks with off-chip memory, ensemble models achieve a higher accuracy than linear models. As an exception, the ABR regressors converged too slowly in our experiments.
The power draw of dense networks without off-chip memory is trivial to predict: linear, ensemble, and dummy models have similar accuracy. This is indicated in Figure 3, as the power draw of dense networks is almost constant, with only minor oscillations. For dense networks with off-chip memory, the power draw is more difficult to predict than for small networks without off-chip memory. Linear and ensemble models achieve similar accuracy, and the dummy regressor is slightly worse. For convolutional networks, the power draw can also be approximated well by linear models: linear and ensemble models are similarly accurate, and both are more accurate than the dummy regressor. Only the HR regressor yields poor accuracy.
In summary, regressors can achieve an error below 1% for homogeneous networks and for both power draw and execution time. This is due to the deterministic characteristics of neural network executions in general and the deterministic hardware platform in particular. As our approach is not restricted to the TPU used in our evaluation, estimators for similar accelerators (e.g., a Kendryte K210) can be trained without changes to our approach and we expect similarly low errors.
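For reference, the error metric used throughout this evaluation is the standard mean absolute percentage error over the \(n\) networks of the evaluation set, where \(y_i\) denotes the measured resource demand and \(\hat{y}_i\) the model estimate:

\[
\mathrm{MAPE} = \frac{100\%}{n} \sum_{i=1}^{n} \left| \frac{y_i - \hat{y}_i}{y_i} \right|.
\]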

5.3 Model Accuracy (Heterogeneous Neural Networks)

Based on the results for homogeneous networks, we trained four additional estimators to predict the execution time and power demand for small and large heterogeneous networks. Therefore, we utilised the ETR(MAE) regressor, as it showed consistently good behaviour for all homogeneous networks.
The mean absolute percentage error (MAPE) for estimating the power demand and execution time is 1.5% and 8.6% for small heterogeneous networks, respectively. For large heterogeneous networks, the MAPE for the power demand and execution time is 2.2% and 9.2%, respectively. Hence, the estimation error for small heterogeneous networks is slightly smaller, owing to the more predictable behaviour when no parameter streaming is required.
Even though the accuracy is lower than that of the estimators for homogeneous networks, the heterogeneous-network estimators achieve estimation errors comparable to or lower than state-of-the-art estimators that operate without access to cycle-accurate simulators (see the related work in Section 6). Furthermore, we achieve these results with statically determinable features and low-overhead prediction methods.
However, the network topology may comprise information relevant to the execution time that is not accessible to the regressors in our implementation. Using more sophisticated implementation techniques (e.g., neural networks) could allow us to use more expressive network representations and exploit such information to achieve potentially higher accuracy. However, more sophisticated modelling techniques would also come with increased overheads for training and execution. The following sections put the accuracy of linear and ensemble models into context by comparing their estimation errors to the influence of self-induced heat and data transmissions.

5.4 Temperature Dependency

The power draw of transistor-based logic circuits generally depends on the temperature, which is influenced by the ambient temperature and also by self-induced heat. The ambient temperature varies over time for most embedded systems, and the self-induced warm-up is typically not precisely controllable either. In consequence, the power draws of identical workloads differ between executions. Models trained with only statically available data, such as the models trained by Precious, are oblivious to temperature changes at runtime and therefore cannot capture this run-to-run variance. We evaluate the influence of the device temperature on its power draw; this influence constitutes a lower bound for the achievable estimation error of any power-demand model under realistic conditions.
To obtain temperature values, we utilise a BME280 temperature sensor from Bosch Sensortec. We sample temperature values at \(40 \,\mathrm{Hz}\) with 1°C accuracy and 0.01°C precision. During measurements, the temperature sensor is attached to the surface of the TPU casing.
Figure 8 shows the power demand and surface temperature of the TPU executing a convolutional network. The TPU surface temperature starts at ambient temperature (24.6°C) and rises within 20 inferences by 5.8°C. Around inference 20, the temperature stabilises at 30.4°C. The power demand shows the expected correlation with temperature: it rises by \(39.0 \,\mathrm{m}\mathrm{W}\) and stabilises at 1,332.1 mW after 20 inferences, an increase of 2.9%. In comparison, most power models discussed above achieve an accuracy better than 2.0%, which is only possible because the device temperature is controlled in our experiments. Therefore, linear and ensemble models achieve sufficient accuracy, and more elaborate modelling techniques cannot improve the accuracy further under realistic conditions.
Fig. 8. Temperature and power demand increase for repeated execution of neural networks.

5.5 Data Transmission

One important factor for the efficient usage of an external accelerator is the communication between the accelerator and host system, that is, the USB communication in our case. As already described in Section 3, the TPU has a pre-processing, inference, and post-processing stage. This section further analyses the USB traffic during the inference stage and its influence on the TPU’s efficiency.
Figure 9 shows the power demand during a convolutional network inference and the respective USB traffic from and to the host system. The power data consists of 210 superimposed inferences of the same network with the same input data to improve the level of detail. During the reception and transmission of data, the power demand of the TPU significantly decreases from approximately 1,450 mW to 1,150 mW. Hence, the data transmission phases are one reason for the very characteristic power traces of the TPU and thus influence the energy demand and efficiency of the TPU.
Fig. 9. Power demand (red) and USB traffic from (black) and to (green) the host. The power values result from 210 superimposed inferences to improve the level of detail.
USB traffic during the inference stage usually occurs due to one of two reasons:
The neural network’s weights and instructions on how to arrange and execute them are transmitted. Furthermore, if the neural network does not fit into the on-chip memory, then parts of the network and potentially instructions on how to execute the network partially are continuously streamed.
The input data is transmitted to the accelerator. After the (partial) execution, the (intermediate) results and dequantisation instructions are sent back to the host.
Due to the reduced power demand during USB transmission phases, we assume that the execution units of the TPU are not or not fully utilised while the TPU waits for additional data to arrive or to be sent. Thus, we observe that USB traffic negatively affects the TPU’s utilisation. Figure 10 shows the average utilisation (i.e., MAC operations per second) for different input-data dimensions for convolutional and fully connected (“dense”) networks, respectively. For dense networks, the network size depends on the number of layers and the layers’ dimensions. Consequently, for higher input dimensions, the network may at some point be too large to fit into the on-chip memory and must be continuously streamed during execution, which significantly decreases the utilisation and, therefore, the TPU’s efficiency (left plot in Figure 10). However, dense networks that fit into the on-chip memory achieve a higher utilisation than the convolutional networks in our dataset (right plot in Figure 10).
Fig. 10. Average utilisation for dense and convolutional networks with different sizes. The left plot shows dense networks fitting completely into the on-chip memory (light blue) and streamed networks not fitting into the on-chip memory (orange). The right plot shows convolutional networks with the same inputs and constant (yellow) and variable output sizes (pink).
Most convolutional networks in our dataset fit into the on-chip memory, as the convolutional networks’ sizes are independent of the input-data dimension and only depend on the number and dimension of the filters. Accordingly, the utilisation is relatively constant for different input dimensions (right plot in Figure 10; pink graph), although lower than that of dense networks fitting into the on-chip memory. This reduced utilisation is presumably due to the larger input-data sizes of our convolutional networks and the corresponding USB traffic. One interesting finding is that the incoming traffic decreases if the outgoing traffic is reduced (by reducing the output dimension). For a subset of the convolutional networks, we executed identical convolutional networks with the same data and only one additional small layer that harmonises the output size to a constant size of three filters with dimension 14×14 (right plot in Figure 10; yellow graph). We found that not only the output traffic but also the input traffic was reduced by the additional layer; consequently, this traffic reduction led to an increased utilisation. We assume that the reduced output complexity also reduced the complexity of the execution instructions and hence decreased the traffic from the host.
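The utilisation metric plotted in Figure 10 is straightforward to compute from the statically known MAC count and the measured inference time; a minimal sketch:

```python
# Utilisation as plotted in Figure 10: MAC operations per second.
# Communication stalls lower this value below the TPU's peak rate.
def utilisation_macs_per_s(macs_per_inference, inference_time_s):
    return macs_per_inference / inference_time_s
```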
In summary, reducing the USB traffic significantly increases the TPU’s utilisation and efficiency. In particular, streaming networks that exceed the on-chip memory on-the-fly decreases the TPU’s utilisation; the same holds for the input- and output-data transmissions. In general, the influence of the USB communication on the TPU’s utilisation demonstrates that, in many cases, the TPU’s performance is communication-bound. In this case, an accurate execution-time model has to take communication timings into account. Precious’ models, however, do not consider them, because the exact timings are not statically analysable, in particular under bus contention. Nevertheless, the USB traffic size is one example of a property that can easily be optimised by a neural network developer with a resource-prediction model at hand. These differences are further amplified by the different execution frequencies presented in the following section.

5.6 Frequency Selection

The TPU supports two different power modes, the STD and the MAX mode. This section analyses the impact of both power modes on the power demand, execution time, and energy demand. The TPU runs at full frequency in MAX mode, whereas in STD mode the TPU is throttled. However, the MAX mode may not be applicable in every deployment scenario due to its significantly increased heat dissipation [42].
Figure 11 shows the execution time and power demand differences between both power modes for dense and convolutional networks, respectively. The difference is calculated by subtracting the STD execution time/power demand values from the MAX mode values. Enabling the MAX power mode leads to decreased execution times for both network types. However, the gains for the convolutional networks are significantly higher, because the dense networks’ execution times are more affected by data transmissions (cf. Section 5.5). As expected, the power demand increases when using the MAX power mode for both network types. The power difference between STD and MAX mode is relatively constant if the TPU is fully utilised (as seen with the convolutional networks). If the network is too small to fully utilise the TPU’s on-chip memory, for example, the first dense networks with fewer than \(2 \cdot 10^7\) MAC operations in Figure 11, then the power differences rise linearly with the number of MAC operations.
Fig. 11. Execution time and power demand difference between STD and MAX power modes for fully connected (“dense”) and convolutional networks, respectively. The difference is calculated by subtracting the STD execution time/power demand values from the MAX mode values.
Figure 12 shows the energy demand difference between both power modes for the dense and convolutional networks, respectively. Again, the difference is calculated by subtracting the STD energy demand from the MAX mode energy demand. Dense networks with fewer than approximately \(2 \cdot 10^7\) MAC operations can be stored entirely in the on-chip memory and need no additional data streaming. For these networks, the energy demand is slightly lower in MAX mode than in STD mode. For larger dense networks (i.e., with more MAC operations), additional traffic is required, and the increased power demand is not compensated by the reduced execution time. Hence, for these larger networks, the energy demand is increased in MAX mode. Most convolutional networks in our dataset fit into the on-chip memory and do not require additional traffic. Consequently, for all convolutional networks, the energy demand is reduced in MAX mode.
Fig. 12. Energy demand difference between STD and MAX power modes for fully connected (“dense”) and convolutional networks, respectively. The difference is calculated by subtracting the STD energy demand from the MAX mode values.
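The plotted energy difference follows directly from the per-inference power draw and execution time of the two modes:

\[
\Delta E = E_{\mathrm{MAX}} - E_{\mathrm{STD}} = P_{\mathrm{MAX}} \cdot t_{\mathrm{MAX}} - P_{\mathrm{STD}} \cdot t_{\mathrm{STD}},
\]

i.e., MAX mode saves energy exactly when its execution-time reduction outweighs its power increase.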
In summary, for networks fitting into the on-chip memory, the energy demand is significantly reduced in MAX mode due to the decreased execution time. For all other networks, the energy demand is increased due to the increased power demand and only slightly decreased execution time. Hence, for scenarios where the additional heat dissipation can be handled and the network fits into the on-chip memory, the MAX mode can help to reduce the energy demand.

5.7 Discussion and Generalisation

In summary, the results indicate that the generated models estimate the resource demand sufficiently accurately, despite relatively simple regressors. However, the results are obtained for only one hardware platform. For an extension to further hardware platforms, Precious needs only minor adjustments. First, Precious has to replace TPU-specific aspects, such as the off-chip memory usage. However, other important neural network properties, like the number of MAC operations and the number of neural network parameters, can be reused. Second, for the execution phase, the hardware platform has to be replaced. Depending on the accelerator, it may be challenging to measure the power draw accurately if the accelerator is tightly integrated into the system-on-chip. Considering the achieved accuracy of the resource-demand models, other hardware platforms may need more complex regressors or yield worse results, as TPUs are specifically designed for deterministic execution [27]. Papers that examine alternative hardware platforms in detail are discussed in the following section.

6 Related Work

Previous work has thoroughly examined the power and performance characteristics of CPUs, GPUs, and TPUs [21, 27, 32, 37, 50, 58] in data centres. However, deep learning plays an increasingly important role in embedded systems [2, 54, 60]. Therefore, our work focuses on neural network execution on embedded platforms. In comparison to regular and large-scale systems, embedded platforms have a much lower power draw and thus very different power-to-performance characteristics. Furthermore, the main focus for embedded systems is on response time rather than throughput [10]. Table 3 shows an overview of related work with the considered hardware, scope, and metrics. The Estimation column denotes whether a work only conducts resource-demand analyses or also implements estimation models.
Paper                  | Hardware      | Scope           | Time | Power | Estimation
Hönig et al. [22]      | CPU           | embedded        | ✗    | ✓     | ✓
Sieh et al. [56]       | CPU           | embedded        | ✓    | ✓     | ✓
Ganesan et al. [17]    | CPU           | mobile          | ✓    | ✗     | ✓
Wu et al. [61]         | GPU           | general purpose | ✓    | ✓     | ✓
Lu et al. [46]         | CPU, GPU      | mobile          | ✓    | ✗     | ✓
Rodrigues et al. [53]  | CPU, GPU      | mobile          | ✗    | ✓     | ✓
Greathouse et al. [18] | CPU, GPU      | general purpose | ✓    | ✓     | ✓
Li et al. [37]         | CPU, GPU      | general purpose | ✓    | ✓     | ✗
Kljucaric et al. [31]  | CPU, GPU, TPU | mixed           | ✓    | ✓     | ✗
Gupta et al. [20]      | TPU           | embedded        | ✓    | ✗     | ✓
Kaufman et al. [29]    | TPU           | embedded        | ✓    | ✗     | ✓
Libutti et al. [38]    | TPU           | embedded        | ✓    | ✓     | ✗
Precious (this work)   | TPU           | embedded        | ✓    | ✓     | ✓
Table 3. Overview of Related Work to Model the Resource Demand
For each work, the considered hardware, respective scope, and metrics are shown. The power column subsumes both power and energy metrics. Additionally, the Estimation column denotes whether works conduct only resource-demand analyses or also implement estimation models.
Li et al. [37] compare the resource demand of training frameworks for convolutional neural networks on CPUs and GPUs. They further provide information on the effects of performance-tuning parameters, such as dynamic voltage and frequency scaling (DVFS) and hyper-threading, on the energy efficiency of the training processes. Our work presented in this article, in comparison, targets hardware accelerators on embedded platforms.
Likewise, Lu et al. [46] focus on estimating the memory demand and execution time of convolutional neural networks on CPUs and GPUs in embedded systems, with an accuracy of at least 78% for the execution time. The difference in accuracy compared to our work is presumably due to the inherently less deterministic hardware platform. They share our motivation that resource-demand estimations are highly valuable during the design phase of neural networks and also identify MACs as the most important feature for resource-demand estimations. However, they do not consider the power demand, which is a key concern on embedded platforms, and focus on CPUs and GPUs instead of hardware accelerators.
In contrast, Rodrigues et al. [53] analyse the required input-feature complexity for energy-demand predictions on CPUs and GPUs in embedded systems. Similar to this work, they identify MACs as the most suitable feature for precise energy estimations, although they focus on more powerful Linux-based hardware platforms rather than a dedicated hardware accelerator. They reach accuracies between 76% and 85%, which is presumably due to their more complex hardware platform.
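Both works above identify MACs as the key feature. For reference, the MAC count of a standard 2-D convolution follows directly from the layer dimensions (the general textbook formula, not specific to either paper):

\[ \mathrm{MACs} = H_{\mathrm{out}} \cdot W_{\mathrm{out}} \cdot K_h \cdot K_w \cdot C_{\mathrm{in}} \cdot C_{\mathrm{out}} \]

For example, a \(3 \times 3\) convolution that maps a \(112 \times 112 \times 32\) feature map to \(112 \times 112 \times 64\) performs \(112 \cdot 112 \cdot 3 \cdot 3 \cdot 32 \cdot 64 \approx 2.3 \times 10^{8}\) MACs.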
Jouppi et al. [27] describe the architecture of a TPU deployed in data centres. The work compares the performance and power demand of a TPU to CPUs and GPUs, showing that the TPU outperforms both hardware alternatives. The authors also discuss that the execution model of the TPU is more deterministic than that of CPUs and GPUs. In consequence, the resource demand (in particular, power draw and response times) is much more predictable. Our measurements, in particular Figure 3, confirm this deterministic, repeatable behaviour. However, the TPU presented in their article is only accessible via cloud services.
Based on a cycle-accurate simulator, Gupta et al. [20] built a “latency predictor” for the Google Coral Edge TPU [44]. This latency predictor is used together with the model accuracy to iteratively refine neural networks until the desired prediction quality is achieved within the available response-time budget. The authors further report that the TPU operates more power-efficiently when the model fits into the on-chip memory, which our measurements confirm. In general, cycle-accurate simulators [48, 49, 55] are capable of very precise execution-time and power-demand estimations. However, as they may leak the intellectual property of companies, they are usually not publicly available for commercial products. To the best of our knowledge, this applies to the cycle-accurate simulator for the TPU examined in this work.
Kaufman et al. [29] use a feed-forward neural network to model the execution time of tensor computations, with a maximum error of 13%. In comparison, we demonstrate that even simple machine learning techniques, such as linear models, can adequately estimate the resource demand for embedded accelerators.
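As a minimal sketch of what such a simple regressor looks like with the scikit-learn linear models [14], consider the following; the file names and the exact feature set are illustrative assumptions, not the precise Precious setup:

```python
# Hedged sketch: fit a linear model that maps static network features to a
# measured resource demand; the CSV files are hypothetical placeholders.
import numpy as np
from sklearn.linear_model import LinearRegression
from sklearn.model_selection import train_test_split

# One row per generated network (e.g., MAC operations, parameter count,
# off-chip memory usage); one label per network (measured execution time).
X = np.loadtxt("features.csv", delimiter=",")      # hypothetical file
y = np.loadtxt("exec_time_ms.csv", delimiter=",")  # hypothetical file

X_train, X_test, y_train, y_test = train_test_split(X, y, random_state=0)
estimator = LinearRegression().fit(X_train, y_train)

rel_err = np.abs(estimator.predict(X_test) - y_test) / y_test
print(f"mean relative estimation error: {rel_err.mean():.2%}")
```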
Sieh et al. [56] create execution-time and energy-demand models for an embedded microcontroller. Similar to our approach, they generate input programs automatically, measure the resource demand, and derive models. They formulate and solve an integer linear program (ILP) that yields the per-instruction resource demand of the examined hardware. Hönig et al. [22] use deep neural networks for energy models that also account for inter-instruction effects, for example, related to caches. TPUs, in comparison, avoid such inter-instruction effects [27]. Instead, tensor processing units aim at providing predictable execution times by exploiting data parallelism in hardware [12, 26]. Thus, high overall performance is achieved with a much more deterministic execution model compared to CPUs. In consequence, the resource demand prediction of a neural network accelerator can work with simpler models.
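To illustrate the underlying idea of such per-instruction cost models: Sieh et al. solve an ILP, whereas the following is a deliberately simplified least-squares analogue with invented numbers, shown only to make the principle concrete:

```python
# Simplified least-squares analogue of a per-instruction cost model; the
# instruction mix and energy values below are made up for illustration.
import numpy as np

# counts[i, j]: occurrences of instruction j in benchmark program i.
counts = np.array([[120,  40,  10],
                   [ 80, 100,  25],
                   [200,  10,  60],
                   [ 50,  70,  30]], dtype=float)
energy = np.array([3.1, 3.9, 5.6, 2.8])  # measured energy per program (uJ)

# Solve counts @ cost ~= energy for the per-instruction energy cost.
cost, *_ = np.linalg.lstsq(counts, energy, rcond=None)
for name, c in zip(["add", "mul", "load"], cost):
    print(f"{name}: {c:.4f} uJ per instruction")
```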
Wu et al. [61] also use machine learning techniques to estimate the performance and power demand of GPGPUs, with a maximum error of 15% for execution time and 10% for power, in order to select (or build) appropriate GPUs for a given application. In a subsequent work, Greathouse et al. [18] extended the approach to heterogeneous systems (i.e., mainly combinations of GPUs and CPUs) with similar accuracies.
With a strict focus on the Intel Neural Compute Stick 2 and the Google Coral TPU, Libutti et al. [38] compare functional and non-functional properties of the two accelerators. The authors perform power measurements with an INA219-based setup and compare the achieved accuracy versus the performance and energy efficiency of the two devices. For the comparison, the work utilises the MLPerf benchmarks [47]. In our work, we also focus on one of the two platforms (i.e., the Google Coral TPU) and additionally provide models on the basis of time and power measurements. To improve the reliability of the models, we further consider and discuss the impact of the ambient temperature.
Kljucaric et al. [31] conducted an architectural analysis of deep learning on edge accelerators. Their case study particularly focuses on performance analyses of three accelerators: the NVIDIA AGX, the Google Coral TPU, and the Intel Neural Compute Stick 2. The authors conclude that different workload requirements (e.g., memory size) determine which of the accelerators is most efficient. This underlines the importance of methods, such as Precious, that provide resource-demand estimations.
Ganesan et al. [17] also create resource-demand models of DNNs based on measurements. Their paper focuses on latency measurements for a small signature set of networks to generate generalised models for mobile systems. Similar to our approach, the authors concentrate on providing accurate measurement-based resource-demand models for neural networks, but they do not consider power and energy, which are important system resources on embedded platforms.

7 Conclusion

This article has presented Precious, a comprehensive approach to estimate the resource demand of neural network inferences on embedded hardware accelerators. In total, Precious comprises five stages to create and apply models based on execution-time and power measurements. We demonstrate that relatively simple models achieve accurate estimations: the regressors trained by Precious achieve an error below 1% for the power draw and below 1.5% for the execution time of homogeneous neural networks. The respective errors for heterogeneous neural networks are around 2% for the power draw and 9% for the execution time. In addition, our experiments show that the power-estimation error is similar in magnitude to the variation caused by self-induced heating, and they quantify how the USB communication affects execution time and energy efficiency.

Footnotes

1
Throughout this article, a “neural network execution” is equivalent to computing an inference with one input tensor.
2
Since the TPU behaviour changes between TensorFlow versions, we use TensorFlow 2.0 for this illustration but version 2.1 for the rest of the article.
3
We use the term “model” for a mapping of features to labels and “regressor” for a technique that creates a model using a training dataset.

References

[1]
Martín Abadi, Paul Barham, Jianmin Chen, Zhifeng Chen, Andy Davis, Jeffrey Dean, Matthieu Devin, Sanjay Ghemawat, Geoffrey Irving, Michael Isard, Manjunath Kudlur, Josh Levenberg, Rajat Monga, Sherry Moore, Derek G. Murray, Benoit Steiner, Paul Tucker, Vijay Vasudevan, Pete Warden, Martin Wicke, Yuan Yu, and Xiaoqiang Zheng. 2016. TensorFlow: A system for large-scale machine learning. In Proceedings of the 12th Symposium on Operating Systems Design and Implementation (OSDI’16). USENIX, 265–283.
[2]
Sergei Alyamkin, Matthew Ardi, Alexander Berg, Achille Brighton, Bo Chen, Yiran Chen, Hsin-Pai Cheng, Zichen Fan, Chen Feng, Bo Fu, Kent Gauen, Abhinav Goel, Alexander Goncharenko, Xuyang Guo, Soonhoi Ha, Andrew Howard, Xiao Hu, Yuanjun Huang, Donghyun Kang, Jaeyoun Kim, Jong Gook Ko, Alexander Kondratyev, Junhyeok Lee, Seungjae Lee, Suwoong Lee, Zichao Li, Zhiyu Liang, Juzheng Liu, Xin Liu, Yang Lu, Yung-Hsiang Lu, Deeptanshu Malik, Hong Hanh Nguyen, Eunbyung Park, Denis Repin, Liang Shen, Tao Sheng, Fei Sun, David Svitov, George Thiruvathukal, Baiwu Zhang, Jingchi Zhang, Xiaopeng Zhang, and Shaojie Zhuo. 2019. Low-power computer vision: Status, challenges, opportunities. IEEE J. Emerg. Select. Topics Circ. Syst. 9, 2 (2019), 411–421.
[3]
Sajid Anwar, Kyuyeon Hwang, and Wonyong Sung. 2017. Structured pruning of deep convolutional neural networks. ACM J. Emerg. Technol. Comput. Syst. 13, 3 (2017).
[4]
James Bergstra and Yoshua Bengio. 2012. Random search for hyper-parameter optimization. J. Mach. Learn. Res. 13, Feb. (2012), 281–305.
[5]
Neil Burgess, Jelena Milanovic, Nigel Stephens, Konstantinos Monachopoulos, and David Mansell. 2019. Bfloat16 processing for neural networks. In Proceedings of the 26th Symposium on Computer Arithmetic (ARITH’19). IEEE, 88–91.
[6]
Stephen Cass. 2019. Taking AI to the edge: Google’s TPU now comes in a maker-friendly package. IEEE Spectrum 56, 5 (2019), 16–17.
[7]
Stephen Cass. 2020. Nvidia makes it easy to embed AI: The Jetson Nano packs a lot of machine-learning power into DIY projects - [Hands on]. IEEE Spectrum 57, 7 (2020), 14–16.
[8]
Mingxi Cheng, Ji Li, and Shahin Nazarian. 2018. DRL-cloud: Deep reinforcement learning-based resource provisioning and task scheduling for cloud service providers. In Proceedings of the 23rd Asia and South Pacific Design Automation Conference (ASP-DAC’18). IEEE, 129–134.
[9]
Jaeyong Chung and Taehwan Shin. 2016. Simplifying deep neural networks for neuromorphic architectures. In Proceedings of the 53rd Annual Design Automation Conference (DAC’16). ACM, 1–6.
[10]
NVIDIA Corp. 2015. GPU-based Deep Learning Inference: A Performance and Power Analysis. Retrieved from https://www.nvidia.com/content/tegra/embedded-systems/pdf/jetson_tx1_whitepaper.pdf.
[11]
NVIDIA Corp. 2021. Jetson Nano Developer Kit. Retrieved from https://developer.nvidia.com/embedded/jetson-nano-developer-kit.
[12]
Jeff Dean, David Patterson, and Cliff Young. 2018. A new golden age in computer architecture: Empowering the machine-learning revolution. IEEE Micro 38, 2 (2018), 21–29.
[13]
Scikit-Learn Developers. 2021. sklearn.ensemble: Ensemble Methods. Retrieved from https://scikit-learn.org/stable/modules/classes.html#module-sklearn.ensemble.
[14]
Scikit-Learn Developers. 2021. sklearn.linear_model: Linear Models. Retrieved from https://scikit-learn.org/stable/modules/classes.html#module-sklearn.linear_model.
[15]
Analog Devices. 2020. LTC2991. Retrieved from https://www.analog.com/en/products/ltc2991.html.
[16]
Mo Dong, Tong Meng, Doron Zarchy, Engin Arslan, Yossi Gilad, Brighten Godfrey, and Michael Schapira. 2018. PCC vivace: Online-learning congestion control. In Proceedings of the 15th Symposium on Networked Systems Design and Implementation (NSDI’18). USENIX, 343–356.
[17]
Vinod Ganesan, Surya Selvam, Sanchari Sen, Pratyush Kumar, and Anand Raghunathan. 2020. A case for generalizable DNN cost models for mobile devices. In Proceedings of the 16th International Symposium on Workload Characterization (IISWC’20). IEEE, 169–180.
[18]
Joseph L. Greathouse and Gabriel H. Loh. 2018. Machine learning for performance and power modeling of heterogeneous systems (invited paper). In Proceedings of the 37th International Conference on Computer-Aided Design (ICCAD’18). ACM, 1–6.
[19]
Samuel Greengard. 2020. AI on edge. Commun. ACM 63, 9 (2020), 18–20.
[20]
Suyog Gupta and Mingxing Tan. 2019. EfficientNet-EdgeTPU: Creating Accelerator-optimized Neural Networks with AutoML. Retrieved from https://ai.googleblog.com/2019/08/efficientnet-edgetpu-creating.html.
[21]
Kim Hazelwood, Sarah Bird, David Brooks, Soumith Chintala, Utku Diril, Dmytro Dzhulgakov, Mohamed Fawzy, Bill Jia, Yangqing Jia, Aditya Kalro, James Law, Kevin Lee, Jason Lu, Pieter Noordhuis, Misha Smelyanskiy, Liang Xiong, and Xiaodong Wang. 2018. Applied machine learning at Facebook: A datacenter infrastructure perspective. In Proceedings of the 24th International Symposium on High Performance Computer Architecture (HPCA’18). IEEE, 620–629.
[22]
Timo Hönig, Benedict Herzog, and Wolfgang Schröder-Preikschat. 2019. Energy-demand estimation of embedded devices using deep artificial neural networks. In Proceedings of the 34th Symposium on Applied Computing (SAC’19). ACM, 617–624.
[23]
Andrey Ignatov, Radu Timofte, William Chou, Ke Wang, Max Wu, Tim Hartley, and Luc Van Gool. 2018. AI benchmark: Running deep neural networks on Android smartphones. In Proceedings of the Perceptual Image Restoration and Manipulation Workshop and Challenge (PIRM’18). Springer International Publishing, 288–314.
[24]
Canaan Inc. 2021. Kendryte K210. Retrieved from https://canaan.io/product/kendryteai.
[25]
Intel Inc. 2020. Intel Neural Compute Stick 2. Retrieved from https://software.intel.com/en-us/neural-compute-stick.
[26]
Norman Jouppi, Cliff Young, Nishant Patil, and David Patterson. 2018. Motivation for and evaluation of the first tensor processing unit. IEEE Micro 38, 3 (May 2018), 10–19.
[27]
Norman Jouppi, Cliff Young, Nishant Patil, David Patterson, Gaurav Agrawal, Raminder Bajwa, Sarah Bates, Suresh Bhatia, Nan Boden, Al Borchers, Rick Boyle, Pierre Luc Cantin, Clifford Chao, Chris Clark, Jeremy Coriell, Mike Daley, Matt Dau, Jeffrey Dean, Ben Gelb, Tara Vazir Ghaemmaghami, Rajendra Gottipati, William Gulland, Robert Hagmann, C. Richard Ho, Doug Hogberg, John Hu, Robert Hundt, Dan Hurt, Julian Ibarz, Aaron Jaffey, Alek Jaworski, Alexander Kaplan, Harshit Khaitan, Daniel Killebrew, Andy Koch, Naveen Kumar, Steve Lacy, James Laudon, James Law, Diemthu Le, Chris Leary, Zhuyuan Liu, Kyle Lucke, Alan Lundin, Gordon MacKean, Adriana Maggiore, Maire Mahony, Kieran Miller, Rahul Nagarajan, Ravi Narayanaswami, Ray Ni, Kathy Nix, Thomas Norrie, Mark Omernick, Narayana Penukonda, Andy Phelps, Jonathan Ross, Matt Ross, Amir Salek, Emad Samadiani, Chris Severn, Gregory Sizikov, Matthew Snelham, Jed Souter, Dan Steinberg, Andy Swing, Mercedes Tan, Gregory Thorson, Bo Tian, Horia Toma, Erick Tuttle, Vijay Vasudevan, Richard Walter, Walter Wang, Eric Wilcox, and Doe Hyun Yoon. 2017. In-datacenter performance analysis of a tensor processing unit. In Proceedings of the 44th Annual International Symposium on Computer Architecture (ISCA’17). ACM, 1–12.
[28]
Dhiraj Kalamkar, Dheevatsa Mudigere, Naveen Mellempudi, Dipankar Das, Kunal Banerjee, Sasikanth Avancha, Dharma Teja Vooturi, Nataraj Jammalamadaka, Jianyu Huang, Hector Yuen, Jiyan Yang, Jongsoo Park, Alexander Heinecke, Evangelos Georganas, Sudarshan Srinivasan, Abhisek Kundu, Misha Smelyanskiy, Bharat Kaul, and Pradeep Dubey. 2019. A study of BFLOAT16 for deep learning training. Retrieved from https://arxiv.org/abs/1905.12322.
[29]
Samuel Kaufman, Phitchaya Phothilimtha, and Mike Burrows. 2019. Learned TPU cost model for XLA tensor programs. In Proceedings of the Workshop on ML for Systems at the 33rd Conference on Neural Information Processing Systems (NeurIPS’19). 1–6.
[30]
Keras Special Interest Group. 2020. Core Layers. Retrieved from https://keras.io/api/layers/core_layers/.
[31]
Luke Kljucaric, Alex Johnson, and Alan George. 2020. Architectural analysis of deep learning on edge accelerators. In Proceedings of the 24th High Performance Extreme Computing Conference (HPEC’20). IEEE, 1–7.
[32]
Alexandre Lacoste, Alexandra Luccioni, Victor Schmidt, and Thomas Dandres. 2019. Quantifying the carbon emissions of machine learning. Retrieved from https://arxiv.org/abs/1910.09700.
[33]
Nicolas Lane, Sourav Bhattacharya, Akhil Mathur, Petko Georgiev, Claudio Forlivesi, and Fahim Kawsar. 2017. Squeezing deep learning into mobile and embedded devices. IEEE Pervas. Comput. 16, 3 (July 2017), 82–88.
[34]
Nicholas Lane and Petko Georgiev. 2015. Can deep learning revolutionize mobile sensing? In Proceedings of the 16th International Workshop on Mobile Computing Systems and Applications (HotMobile’15). ACM, 117–122.
[35]
Seyyed Salar Latifi Oskouei, Hossein Golestani, Matin Hashemi, and Soheil Ghiasi. 2016. CNNdroid: GPU-accelerated execution of trained deep convolutional neural networks on Android. In Proceedings of the 24th International Conference on Multimedia (MM’16). ACM, 1201–1205.
[36]
Kwangbae Lee, Hoseung Kim, Hayun Lee, and Dongkun Shin. 2020. Flexible group-level pruning of deep neural networks for on-device machine learning. In Proceedings of the 23rd Conference on Design, Automation and Test in Europe (DATE’20). 79–84.
[37]
Da Li, Xinbo Chen, Michela Becchi, and Ziliang Zong. 2016. Evaluating the energy efficiency of deep convolutional neural networks on CPUs and GPUs. In Proceedings of the International Conferences on Sustainable Computing and Communications (SustainCom’16). IEEE, 477–484.
[38]
Leandro Ariel Libutti, Francisco Igual, Luis Piñuel, Laura De Giusti, and Marcelo Naiouf. 2020. Benchmarking performance and power of USB accelerators for inference with MLPerf. In Proceedings of the 1st Workshop on Accelerated Machine Learning (AccML’20) at the European Network on High-performance Embedded Architecture and Compilation (HiPEAC’20).
[39]
Google LLC. 2020. Deploy Machine Learning Models on Mobile and IoT Devices. Retrieved from https://www.tensorflow.org/lite/.
[40]
Google LLC. 2020. Edge TPU Compiler. Retrieved from https://coral.ai/docs/edgetpu/compiler/.
[41]
Google LLC. 2020. Get Started with TensorFlow Lite. Retrieved from https://www.tensorflow.org/lite/guide/get_started.
[42]
Google LLC. 2020. Get Started with the USB Accelerator. Retrieved from https://coral.ai/docs/accelerator/get-started/.
[43]
[44]
Google LLC. 2020. USB Accelerator. Retrieved from https://www.coral.ai/products/accelerator.
[45]
Google LLC. 2020. USB Accelerator Datasheet. Retrieved from https://coral.ai/docs/accelerator/datasheet/.
[46]
Zongqing Lu, Swati Rallapalli, Kevin Chan, and Thomas La Porta. 2017. Modeling the resource requirements of convolutional neural networks on mobile devices. In Proceedings of the 25th International Conference on Multimedia (MM’17). ACM, 1663–1671.
[47]
Peter Mattson, Vijay Janapa Reddi, Christine Cheng, Cody Coleman, Greg Diamos, David Kanter, Paulius Micikevicius, David Patterson, Guenther Schmuelling, Hanlin Tang, Gu-Yeon Wei, and Carole-Jean Wu. 2020. MLPerf: An industry standard benchmark suite for machine learning performance. IEEE Micro 40, 2 (Mar. 2020), 8–16.
[48]
Siddharth Nilakantan, Karthik Sangaiah, Ankit More, Giordano Salvadory, Baris Taskin, and Mark Hempstead. 2015. SynchroTrace: Synchronization-aware architecture-agnostic traces for light-weight multicore simulation. In Proceedings of the International Symposium on Performance Analysis of Systems and Software (ISPASS’15). IEEE, 278–287.
[49]
Subhankar Pal, Kuba Kaszyk, Siying Feng, Björn Franke, Murray Cole, Michael O’Boyle, Trevor Mudge, and Ronald G. Dreslinski. 2020. HETSIM: Simulating large-scale heterogeneous systems using a trace-driven, synchronization and dependency-aware framework. In Proceedings of the 16th International Symposium on Workload Characterization (IISWC’20). IEEE, 13–24.
[50]
David Patterson, Joseph Gonzalez, Quoc Le, Chen Liang, Lluis-Miquel Munguia, Daniel Rothchild, David So, Maud Texier, and Jeff Dean. 2021. Carbon emissions and large neural network training. Retrieved from https://arxiv.org/abs/2104.10350.
[51]
Stefan Reif, Benedict Herzog, Judith Hemp, Timo Hönig, and Wolfgang Schröder-Preikschat. 2020. Precious: Resource-demand estimation for embedded neural network accelerators. In Proceedings of the 1st International Workshop on Benchmarking Machine Learning Workloads on Emerging Hardware (Challenge’20). 1–9.
[52]
Stefan Reif, Benedict Herzog, Judith Hemp, Wolfgang Schröder-Preikschat, and Timo Hönig. 2021. Poster: AI waste prevention: Time and power estimation for edge tensor processing units. In Proceedings of the 12th International Conference on Future Energy Systems (e-Energy’21). ACM, 300–301.
[53]
Crefeda Faviola Rodrigues, Graham Riley, and Mikel Luján. 2020. Energy predictive models for convolutional neural networks on mobile platforms. Retrieved from https://arxiv.org/abs/2004.05137.
[54]
Roy Schwartz, Jesse Dodge, Noah Smith, and Oren Etzioni. 2020. Green AI. Commun. ACM 63, 12 (2020), 54–63.
[55]
Yakun Sophia Shao, Sam Likun Xi, Vijayalakshmi Srinivasan, Gu-Yeon Wei, and David Brooks. 2016. Co-designing accelerators and SoC interfaces using gem5-Aladdin. In Proceedings of the 49th Annual IEEE/ACM International Symposium on Microarchitecture (MICRO’16). IEEE, 1–12.
[56]
Volkmar Sieh, Robert Burlacu, Timo Hönig, Heiko Janker, Phillip Raffeck, Peter Wägemann, and Wolfgang Schröder-Preikschat. 2017. An end-to-end toolchain: From automated cost modeling to static WCET and WCEC analysis. In Proceedings of the 20th International Symposium on Real-time Distributed Computing (ISORC’17). IEEE, 158–167.
[57]
David Silver, Aja Huang, Chris Maddison, Arthur Guez, Laurent Sifre, George van den Driessche, Julian Schrittwieser, Ioannis Antonoglou, Veda Panneershelvam, Marc Lanctot, Sander Dieleman, Dominik Grewe, John Nham, Nal Kalchbrenner, Ilya Sutskever, Timothy Lillicrap, Madeleine Leach, Koray Kavukcuoglu, Thore Graepel, and Demis Hassabis. 2016. Mastering the game of Go with deep neural networks and tree search. Nature 529, 7587 (2016), 484–489.
[58]
Emma Strubell, Ananya Ganesh, and Andrew McCallum. 2019. Energy and policy considerations for deep learning in NLP. In Proceedings of the 57th Annual Meeting of the Association for Computational Linguistics (ACL’19). 1–6.
[59]
Giuseppe Tagliavini, Stefan Mach, Davide Rossi, Andrea Marongiu, and Luca Benini. 2018. A transprecision floating-point platform for ultra-low power computing. In Proceedings of the 21st Conference on Design, Automation and Test in Europe (DATE’18). IEEE, 1051–1056.
[60]
Carole-Jean Wu, David Brooks, Kevin Chen, Douglas Chen, Sy Choudhury, Marat Dukhan, Kim Hazelwood, Eldad Isaac, Yangqing Jia, Bill Jia, Tommer Leyvand, Hao Lu, Yang Lu, Lin Qiao, Brandon Reagen, Joe Spisak, Fei Sun, Andrew Tulloch, Peter Vajda, Xiaodong Wang, Yanghan Wang, Bram Wasti, Yiming Wu, Ran Xian, Sungjoo Yoo, and Peizhao Zhang. 2019. Machine learning at Facebook: Understanding inference at the edge. In Proceedings of the 25th International Symposium on High Performance Computer Architecture (HPCA’19). IEEE, 331–344.
[61]
Gene Wu, Joseph L. Greathouse, Alexander Lyashevsky, Nuwan Jayasena, and Derek Chiou. 2015. GPGPU performance and power estimation using machine learning. In Proceedings of the 21st International Symposium on High Performance Computer Architecture (HPCA’15). IEEE, 564–576.
[62]
Mengwei Xu, Jiawei Liu, Yuanqiang Liu, Felix Xiaozhu Lin, Yunxin Liu, and Xuanzhe Liu. 2019. A first look at deep learning apps on smartphones. In Proceedings of the World Wide Web Conference (WWW’19). ACM, 2125–2136.
