1 Introduction
Machine learning has historically been associated with high resource (i.e., power and execution time) demand. Learned systems based on deep artificial neural networks have achieved human-like or even super-human performance for tasks like image recognition [
2] or strategic board games [
57]. In addition, learned models are becoming omnipresent—deep learning approaches are increasingly replacing decision-making algorithms that used to be explicitly programmed, for example, network congestion control [
16] or task scheduling [
8]. While artificial neural networks have existed for a long time, their recent success has been enabled by high-performance hardware (e.g., GPUs and TPUs [
27]) running in the cloud [
21,
27]. In consequence, state-of-the-art models based on deep learning have a significant energy demand [
58].
Due to their tremendous success, machine learning applications have entered the domain of embedded systems, such as cars, smartphones, and wearables [
2,
19,
60,
62]. To enable machine learning in embedded devices, applications have to adhere to their limited resources [
23,
33,
34,
35,
54] in addition to providing a good quality of service (e.g., prediction accuracy). First, many embedded systems are energy-constrained, either by a battery or by a limited power supply. Second, embedded systems often interact with their environment and therefore also have timing requirements (i.e., responsiveness). In consequence, resource-awareness is crucial for embedded machine learning, as it enables the consideration of resource efficiency and adherence to resource utilisation bounds.
The need for embedded machine learning has led to the development of special-purpose accelerator hardware for neural networks (e.g., Intel Neural Compute Stick [
25], Google Coral Edge TPU [
6,
44], Nvidia Jetson Nano [
7,
11], Kendryte K210 [
24]) that are commercially available as stand-alone products or integrated into a wide variety of platforms. These hardware accelerators satisfy the growing demand for and interest in machine learning approaches [
1,
43] and promise to execute the corresponding machine learning workloads more efficiently than general-purpose hardware, using the accelerator-specific integrated circuits.
To make use of dedicated resource-efficient accelerator hardware, embedded system designers still need methods and means to anticipate the resource demand of machine learning workloads. While the resource demand originates from the hardware, it is controlled by the executed software. Therefore, resource-efficient embedded machine learning requires both efficient hardware and software.
A straightforward way to determine the power and energy demand for the execution of machine learning workloads is to measure it, for example, with a power meter. However, such power measurements depend on the availability of all final components: the hardware (i.e., accelerator modules), potentially costly measurement infrastructure (i.e., power meters), and the software. In particular, measurements at runtime depend on hyperparameters that are fine-tuned during a training process that is expensive with respect to time and energy [
58]. Therefore, it is most desirable to estimate the energy demand
before all components of the final system are available. In particular, system developers need to be able to define neural-network parameters prior to the resource-intensive training process. However, these parameters influence non-functional system properties, such as latency and energy demand. Hence, a resource-demand model enables developers to assess the influence of parameter choices and revise bad choices before needlessly spending resources on the training process.
This article presents a comprehensive approach and the corresponding tool-chain to
predict the resource-demand of embedded machine learning workloads
a priori to their execution [
51,
52]. Our tool-chain estimates the resource demand (i.e., power and execution time) of neural networks based on statically available information only. Thus, embedded system designers can estimate the resource demand in an early development stage before training, without the need for the hardware and measurement infrastructure. Our system, hence, enables the resource-aware development of deep-learning applications.
This article, in particular, tackles the challenge of analysing and estimating the energy demand of machine learning workloads. We analyse the power and energy demand of a hardware accelerator for machine learning (i.e., Google Coral [
44]). We present and discuss the concept and implementation of
Precious, an approach for the estimation of the energy demand for machine learning workloads. Based on the conducted energy measurements at the hardware level,
Precious utilises machine learning techniques itself to train models that enable resource-demand prediction. Besides the development of individual resource-demand models, this article generalises the observed behaviour. We describe which neural network properties need consideration in modelling the resource demand. In addition, we demonstrate that the device’s temperature and data transmissions effectively limit the achievable estimation accuracy of the power draw and the execution time, respectively.
The contributions of this article are the following:
•
We measure, analyse, and discuss the power draw and execution time of different neural network architectures on commercially available accelerator hardware.
•
Based on our analysis, we identify important network properties and make suggestions for efficient network architectures, which are useful for developers.
•
We present Precious, a comprehensive approach to model the resource demand (i.e., power and execution time) of neural networks based on statically available information.
•
We evaluate the models generated by Precious empirically for their estimation accuracy.
•
We discuss the limitations of predictability of the resource demand of neural network accelerators when considering statically available information only. We demonstrate how the temperature influences the power draw and that execution time variance is caused by the communication with the accelerator hardware.
The rest of this article is structured as follows: First, Section
2 puts our work in context and discusses several application scenarios that can benefit from resource-demand models. Second, Section
3 describes our measurement setup and analyses the execution time and power draw of neural networks on the TPU platform. Then, Section
4 presents the design and implementation of
Precious. Section
5 evaluates various models generated by
Precious empirically and puts their accuracy into context by analysing the limits of predictability of the accelerator. Section
6 discusses related work and Section
7 concludes this article.
2 Application Scenarios of Resource-demand Prediction Models
Resource-demand prediction models are important for both the design phase and the operation of neural networks. In general, resource awareness is a prerequisite for three optimisation goals—resource-demand reduction techniques, tradeoffs between the resource demand and application-specific utility metrics, and the optimisation of a utility metric while adhering to resource limits. For such optimisations, resource-demand models take an enabling role by providing the necessary information—these optimisations depend upon the models. For neural network accelerators, related optimisation problems occur in various scenarios.
Hyperparameter Optimisation. One important stage during the development of a neural network is the hyperparameter selection, which is decisive for the potential quality of a neural network. Finding a good hyperparameter set is complex due to the enormous number of possible hyperparameter combinations. Techniques such as
random search [
4] have been developed to relieve developers of the manual selection of hyperparameters. Due to resource restrictions (i.e., maximum execution time, limited memory, or battery), however, not all hyperparameter combinations are applicable in a specific scenario. A resource-demand prediction model can restrict the search space to only applicable hyperparameter combinations (e.g., restricting the total number of weights) and thus helps to guide the hyperparameter-optimisation effort and to skip infeasible combinations. Furthermore, if several potential hyperparameter sets with similar quality are found, then the prediction model can be used to select the most resource-efficient set.
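For illustration, a minimal sketch of such a filtered random search is shown below. It assumes a hypothetical predict_energy_mj model trained by Precious and an illustrative energy budget; the hyperparameter ranges follow the generation ranges used later in this article.

```python
import random

ENERGY_BUDGET_MJ = 50.0  # illustrative per-inference energy budget (millijoules)

def sample_candidate():
    # Plain random search over the same ranges Precious uses for generation.
    return {
        "num_layers": random.randint(2, 250),
        "layer_dim": random.randint(100, 1024),
    }

def feasible_candidates(predict_energy_mj, n_trials=100):
    # Skip hyperparameter sets whose predicted resource demand exceeds the
    # budget, so no training effort is wasted on infeasible configurations.
    candidates = (sample_candidate() for _ in range(n_trials))
    return [c for c in candidates if predict_energy_mj(c) <= ENERGY_BUDGET_MJ]
```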
Neural Network Pruning and Specialisation. Pruning neural networks after training [
3,
9,
36] and applying specialised data types [
5,
28,
59] have the potential to significantly reduce the resource demand with no or only little quality loss. Usually, the quality loss can easily be determined by re-testing the pruned and specialised network on an evaluation dataset. However, determining the resource demand (e.g., energy demand) requires laborious and time-consuming measurements. A resource-demand prediction model can replace these measurements and thus enable a completely automated pruning and specialisation process in software. That way, the desired tradeoff between network quality and resource efficiency can be determined automatically.
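A sketch of such an automated loop is shown below, with a hypothetical prune_fn routine, an accuracy evaluation function, and a Precious-style energy predictor standing in for physical measurements.

```python
def select_pruned_model(model, prune_fn, eval_fn, predict_energy_mj,
                        sparsities=(0.0, 0.25, 0.5, 0.75),
                        min_accuracy=0.90):
    # Explore pruning levels; the energy demand of each candidate is predicted
    # instead of measured, so the whole loop runs in software.
    best = None
    for sparsity in sparsities:
        candidate = prune_fn(model, sparsity)      # hypothetical pruning step
        accuracy = eval_fn(candidate)              # re-test on evaluation data
        energy = predict_energy_mj(candidate)      # model replaces measurement
        if accuracy >= min_accuracy and (best is None or energy < best[1]):
            best = (candidate, energy, accuracy)
    return best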
Hardware Platform Selection. For new neural network–driven applications, the required minimal network quality and corresponding neural network architecture are often known before the hardware platform is selected. Resource-demand prediction models can help to select an appropriate hardware platform, in terms of resource and (monetary) cost efficiency, without access to the hardware itself. In addition, this process can be combined with the pruning and specialisation approach described in the previous paragraph.
Energy-aware Runtime Systems. In many cases, systems can utilise heterogeneous hardware to execute neural networks, for example, CPUs, GPUs, or TPUs. Although specialised hardware such as GPUs and TPUs usually implies a lower resource demand for neural network execution, the network and data transfer from and to the CPU imposes additional overhead. Thus, an energy-aware runtime system has to decide at runtime, depending on the network, which hardware to use. Resource-demand prediction models enable such decisions by indicating whether the lower execution time outweighs the additional overhead, and hence are one building block for energy-aware runtime systems.
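A minimal decision rule along these lines, assuming per-device latency predictors and a transfer-overhead estimate (all names are illustrative):

```python
def choose_device(network_props, predict_cpu_ms, predict_tpu_ms,
                  predict_transfer_ms):
    # Pick the target with the lower predicted end-to-end latency: the TPU is
    # only preferred if its speed-up outweighs the USB transfer overhead.
    cpu_latency = predict_cpu_ms(network_props)
    tpu_latency = predict_tpu_ms(network_props) + predict_transfer_ms(network_props)
    return "tpu" if tpu_latency < cpu_latency else "cpu"
```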
4 Design and Implementation
In this section, we present the
Precious approach for the resource estimation of
neural networks (NN). We outline design considerations of
Precious and discuss its current implementation, which uses a Google Coral Edge
Tensor Processing Unit (TPU), a commercially available off-the-shelf neural network accelerator for embedded systems that connects to its host system via USB [
44].
Precious organises the process of estimating the resource demand of neural networks in five phases (see Figure
4). First,
Precious generates randomised neural networks. Second, it applies a static neural network analysis. Third, it evaluates their resource demands. Fourth, a machine learning approach builds models that map a combination of neural network properties (e.g., number of layers) to its resource demand. In the last phase, applications can
utilise these models to estimate the resource demand of their networks. The phases, necessary steps within a phase, and our implementation for each phase are described in the following sections.
4.1 Neural Network Generation Phase
In the first phase,
Precious uses the machine learning framework TensorFlow 2.1 with Keras [
43] and Python 3.6 to generate randomised neural networks. It supports three different network types—convolutional (“conv2d”), fully connected (“dense”), and heterogeneous networks [
30]. Generated neural networks are either homogeneous or heterogeneous. First, homogeneous networks consist of either only convolutional or only fully connected layers. Second, heterogeneous networks contain both convolutional and fully connected layers. As of now, the implementation supports “relu” as an activation function and imposes restrictions on the dimensions of convolutional networks. For each layer, the input dimension is the same as the output dimension. All generated homogeneous networks consist of layers of the same type and dimension (for convolutional layers this implies the use of the “same” padding). Convolutional layers always have a square-dimension input with a depth of 3 and 3 filters with the dimension
\((3,3)\). Our system randomly varies the number of layers (between 2 and 250) and the layer dimensions (between 100 and 1,024). Generated heterogeneous networks start with convolutional layers, followed by a “maxPool2d” and a “flatten” layer, and end with dense layers. The number of layers varies between 10 and 1,024, and the range of layer dimensions is the same as for homogeneous networks. For all network types, all network-internal parameters (i.e., weights) are initialised to random values. The neural networks remain untrained, as we use
Precious to examine the resource demand of different network types and configurations and not the resource demand for a specifically trained network.
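For illustration, a minimal generator along these lines, in the TensorFlow 2/Keras style mentioned above, might look as follows; the exact generator code in Precious may differ.

```python
import random
import tensorflow as tf

def random_dense_network():
    num_layers = random.randint(2, 250)
    dim = random.randint(100, 1024)
    model = tf.keras.Sequential()
    model.add(tf.keras.layers.Dense(dim, activation="relu", input_shape=(dim,)))
    for _ in range(num_layers - 1):
        model.add(tf.keras.layers.Dense(dim, activation="relu"))
    return model  # weights keep their random initialisation; no training

def random_conv_network():
    num_layers = random.randint(2, 250)
    dim = random.randint(100, 1024)
    model = tf.keras.Sequential()
    # Square input with depth 3; each layer uses 3 filters of size (3, 3) and
    # "same" padding, so input and output dimensions stay identical.
    model.add(tf.keras.layers.Conv2D(3, (3, 3), padding="same",
                                     activation="relu",
                                     input_shape=(dim, dim, 3)))
    for _ in range(num_layers - 1):
        model.add(tf.keras.layers.Conv2D(3, (3, 3), padding="same",
                                         activation="relu"))
    return model
```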
To execute a neural network on the accelerator, it must first be converted to a
TensorFlow Lite (TFLite) model. In addition, all parameters must be in an 8-bit fixed-point number format. The conversion to this format is either achieved during training (i.e., “quantisation-aware training”) or afterwards (i.e., “post-training quantisation”). Since
Precious only uses randomised instead of meaningfully trained weights, it uses the latter technique to translate the randomised parameters to 8-bit values. Furthermore, this technique also makes it possible to convert existing pre-trained neural networks and execute them on the accelerator. This neural network translation process is executed on the host system using the Edge TPU compiler [
40].
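A sketch of this post-training quantisation step for a Keras model is shown below. The converter settings follow the documented full-integer quantisation flow, although exact option names vary slightly between TensorFlow versions; the final edgetpu_compiler invocation runs as a separate command on the host.

```python
import numpy as np
import tensorflow as tf

def to_edgetpu_tflite(model, input_shape, path="model.tflite"):
    def representative_dataset():
        # Random calibration data is sufficient here, since the weights are
        # random anyway and no trained accuracy has to be preserved.
        for _ in range(100):
            yield [np.random.rand(1, *input_shape).astype(np.float32)]

    converter = tf.lite.TFLiteConverter.from_keras_model(model)
    converter.optimizations = [tf.lite.Optimize.DEFAULT]
    converter.representative_dataset = representative_dataset
    # Force all parameters and activations into 8-bit integers.
    converter.target_spec.supported_ops = [tf.lite.OpsSet.TFLITE_BUILTINS_INT8]
    converter.inference_input_type = tf.uint8
    converter.inference_output_type = tf.uint8
    with open(path, "wb") as f:
        f.write(converter.convert())
    return path

# Afterwards, on the host:  edgetpu_compiler model.tflite
```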
4.2 Neural Network Analysis Phase
In the second phase, Precious analyses several properties of the neural network. This analysis is entirely static, that is, all information is available without executing the neural network.
First, the TensorFlow format already offers information about the number of layers and the layer dimensions. For the generated neural networks in Precious, these properties are randomly chosen. For applications that utilise neural networks, these properties are selected by the application developer and potentially adjusted during or after the training process. Second, Precious derives the number of (internal) neural network parameters, such as weights and biases, from the number and dimension of the neural network layers. Furthermore, the number of multiply-accumulate (MAC) operations is computed (i.e., the number of MAC operations needed per neural network execution). Third, the Edge TPU compiler provides the network size (i.e., memory demand) of the neural network. The memory demand is important, because small neural networks fit into the 8 MiB on-chip memory, but larger neural networks transparently use a “streaming” mechanism where parts of the neural network are fetched on-demand during inferences. Finally, the file system reports the file size of the neural network in the binary format used by the accelerator after compilation. In summary, these easily accessible properties later serve as features for the training phase that creates a model that maps from the neural network properties to their resource demand. The labels (i.e., the resource demand) needed for the training phase are collected in the neural network execution phase.
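For illustration, the following sketch shows how such properties can be derived from the layer configuration alone, using the standard per-layer MAC and parameter formulas for dense and convolutional layers (not the exact code used in Precious):

```python
def dense_macs(in_features, out_features):
    # One MAC per weight (bias additions ignored).
    return in_features * out_features

def conv2d_macs(out_h, out_w, in_channels, filters, kernel_h=3, kernel_w=3):
    # Each output element accumulates kernel_h * kernel_w * in_channels products.
    return out_h * out_w * filters * kernel_h * kernel_w * in_channels

def dense_params(in_features, out_features):
    return in_features * out_features + out_features          # weights + biases

def conv2d_params(in_channels, filters, kernel_h=3, kernel_w=3):
    return filters * (kernel_h * kernel_w * in_channels + 1)  # + one bias per filter
```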
Figure
5 summarises the influence of the number of MAC operations on the execution time and power draw of fully connected and convolutional networks. Some correlations are clearly visible. Similarly, Figure
6 visualises the influence of the memory demand of neural networks on the resource demand. The figures look similar, because the number of MAC operations and the memory demand correlate positively. However, both properties
individually are not suitable as resource-demand predictors, since both figures show a large amount of scatter. Although the figures look noisy, the resource-demand measurements are repeatable, so this scatter cannot be run-to-run variation. Instead, the visible variance is caused by other neural network properties that are hidden when the many-dimensional property space is reduced to a two-dimensional representation. In consequence, no single property is sufficient to accurately estimate power draw and execution time. Instead,
Precious has to consider multiple properties in the estimation models.
4.3 Neural Network Execution Phase
In the third phase,
Precious measures the energy demand of neural networks. As described in Section
3, Figure
1 visualises the structure of our energy measurement setup, and Figure
2 shows a picture of the actual hardware setup. For the power and execution time measurements, the host system submits neural networks to the Coral Edge accelerator [
44] for execution. The input data for the execution is randomised, but used repeatedly for each neural network, and no batching is applied.
Our embedded TPU [
44] can be configured in two variants, a standard variant (
STD) and a second variant (
MAX) that operates with the maximum operating frequency [
42]. Both the power demand and execution time of the TPU depend on this configuration. We use the former (i.e., slower) variant for our implementation of
Precious, because the latter suffers from documented thermal problems [
42]. To avoid overheating,
Precious nevertheless inserts idle periods during the evaluation to maintain a constant temperature. Section
5.4 looks at the self-induced warm-up in detail.
Each neural network is evaluated in five iterations. Each iteration repeatedly executes inferences of the neural network and lasts at least 30 seconds. The power draw and execution time are averaged over each iteration, ignoring the first inference, which takes longer because the neural network is loaded onto the TPU. We then combine the five iterations by computing the median. In general, power draw and execution time show little run-to-run variance—for example, the detailed power trace in Figure
3 combines measurements from multiple inferences. The obtained power and time measurements serve as
labels in the training phase.
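A simplified sketch of this execution loop using the TFLite runtime with the Edge TPU delegate is shown below; read_power_mw is a hypothetical hook into our external power-measurement setup, and the actual measurement code in Precious differs in detail.

```python
import statistics
import time
import tflite_runtime.interpreter as tflite

def measure(model_path, input_data, read_power_mw, iterations=5, min_seconds=30):
    interpreter = tflite.Interpreter(
        model_path=model_path,
        experimental_delegates=[tflite.load_delegate("libedgetpu.so.1")])
    interpreter.allocate_tensors()
    inp = interpreter.get_input_details()[0]

    iteration_times, iteration_powers = [], []
    for _ in range(iterations):
        times, powers = [], []
        start = time.perf_counter()
        first = True
        while time.perf_counter() - start < min_seconds:
            interpreter.set_tensor(inp["index"], input_data)
            t0 = time.perf_counter()
            interpreter.invoke()
            elapsed = time.perf_counter() - t0
            if first:            # first inference loads the network onto the TPU
                first = False
                continue
            times.append(elapsed)
            powers.append(read_power_mw())
        iteration_times.append(sum(times) / len(times))
        iteration_powers.append(sum(powers) / len(powers))

    # Combine the five iterations by taking the median.
    return statistics.median(iteration_times), statistics.median(iteration_powers)
```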
4.4 Training Phase
The training phase applies supervised learning where the neural network properties (cf., Section
4.2) are the features and the resource demand (cf., Section
4.3) is the label.
The total dataset contains 2,992 homogeneous and 127 heterogeneous neural networks that are partitioned into
training (80%),
validation (10%), and
evaluation (10%) sets. Each regressor trains a model using the training set, only. Hyperparameters were fine-tuned using
randomised parameter optimisation (RPO) with 100 iterations using the validation set. The hyperparameter optimisation was only conducted for the homogeneous neural network dataset and the results were reused for the heterogeneous neural networks. Consequently, the validation dataset was merged into the evaluation dataset for heterogeneous networks. Importantly, the regressors never access the evaluation set—that set is only used for the evaluation in Section
5.2. We create, in summary, 10 estimators—for time and power, and for small dense (i.e., without off-chip memory), large dense (i.e., with off-chip memory), convolutional neural networks, and for small (i.e., without off-chip memory) and large (i.e., with off-chip memory) heterogeneous networks.
The models are trained by, in total, eight regressors
3 with 18 configurations (denoted as
(*)) in the non-heterogeneous case. Models are either
linear [
14] or
ensemble [
13], that is, models that internally combine different machine learning techniques. Table
1 gives an overview of the regressors and the models they create. Regressors for ensemble models are configurable by the error metric that is minimised during training, such as the
Friedman mean squared error (
FMSE), the
mean absolute error (
MAE), or the
mean squared error (
MSE). For comparison, an additional
dummy regressor computes the
mean (MEA) or the
median (MED) of the labels, ignoring all features. This dummy regressor serves as a baseline for the evaluation of linear and ensemble models. Based on the results for homogeneous networks, we used the
ETR(MAE) regressor to implement the models for heterogeneous networks.
We intentionally did not train models that use deep learning or neural networks internally—neural networks are only the subject of modelling, not the implementation technique. The relatively simple regressors used in this article require significantly less training effort, yet the evaluation shows that they provide accurate estimations.
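For illustration, a single such estimator could be trained with scikit-learn roughly as follows, using the ETR(MAE) configuration and 100 iterations of randomised parameter optimisation. The parameter distributions are illustrative, and the sketch uses cross-validation where Precious uses the dedicated validation set.

```python
from scipy.stats import randint
from sklearn.ensemble import ExtraTreesRegressor
from sklearn.model_selection import RandomizedSearchCV

def train_estimator(X_train, y_train):
    # Features: statically derived network properties; labels: measured
    # execution time or power draw from the execution phase.
    regressor = ExtraTreesRegressor(criterion="absolute_error")  # "mae" in older scikit-learn
    search = RandomizedSearchCV(
        regressor,
        param_distributions={
            "n_estimators": randint(50, 500),
            "max_depth": [None, 10, 20, 40, 80],
            "min_samples_split": randint(2, 10),
        },
        n_iter=100,   # 100 iterations of randomised parameter optimisation
        cv=3,         # sketch: cross-validation instead of the fixed validation set
        n_jobs=-1,
    )
    search.fit(X_train, y_train)
    return search.best_estimator_
```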
4.5 Application Phase
In the final phase, the developers of deep learning applications
apply the trained models to estimate the resource demand of their neural networks. The intended use-case is that phases 1 to 4 are executed once, for example, by the hardware vendor, and the resulting model is distributed. In contrast to cycle-accurate simulators, such measurement-based resource-demand models can be published without the risk of leaking intellectual property, as they are derived from user-observable properties only. Using these models, deep-learning application developers can determine the resource demand of different network architectures—without power measurement infrastructure and before training—and restrict the training process to models that satisfy all resource constraints. Further potential application scenarios are discussed in detail in Section
2.
5 Evaluation
This section evaluates the models trained by
Precious empirically to answer the following questions:
•
Which analysed neural network properties are useful as features for resource-demand models?
•
What estimation accuracy is realistically achievable, depending on the model complexity?
•
What differences between homogeneous and heterogeneous neural networks can be observed?
•
What are the inherent limitations to the resource-demand estimation accuracy for the TPU?
•
What operating frequency of the TPU is optimal for which neural network?
5.1 Feature Selection
In general, the properties derived by static analysis carry information on the resource-demand of the neural network. We analyse the importance of individual properties by computing the Spearman correlation coefficient between a feature and the power draw (
\(r_p\)) and between the feature and the execution time (
\(r_t\)). A high correlation (i.e., values near 1 or
\(-1\)) indicates a strong dependency between the respective property and the resource demand. In consequence, such a property is a useful feature for a resource-demand estimator. Table
2 summarises the Spearman correlation coefficients for homogeneous (i.e., convolutional, small dense, and large dense) and heterogeneous (i.e., small and large) neural network properties used by
Precious.
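Such a correlation analysis can be reproduced with a few lines, for example using SciPy's spearmanr on the feature columns and the measured labels:

```python
from scipy.stats import spearmanr

def feature_correlations(features, power_mw, time_ms):
    # features: dict mapping a property name to its values across the dataset.
    coefficients = {}
    for name, values in features.items():
        r_p, _ = spearmanr(values, power_mw)   # correlation with power draw
        r_t, _ = spearmanr(values, time_ms)    # correlation with execution time
        coefficients[name] = (r_p, r_t)
    return coefficients
```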
Input Features (Homogeneous Networks). Each property has a high correlation in at least one case, indicating that each property contributes to the estimation accuracy of the resource-demand models. In most cases, the number of MAC operations correlates strongly with the resource demand, in particular, considering the execution time data visualised in Figure
5. For large dense models with parameter streaming, the execution time correlates extremely strongly with the amount of off-chip data, as indicated in Figure
6. However, Figures
5 and
6 also demonstrate that neither property alone is sufficient to estimate the resource demand. Instead, the combination of properties is needed for an accurate estimation.
Input Features (Heterogeneous Networks). For the resource-demand estimation of heterogeneous networks, we analyse the same input features as used for homogeneous networks. Additionally, we analyse the number of MAC operations disaggregated into convolutional and dense layers. We deliberately refrain from using more complex representations of neural network topologies (e.g., order of layer types), but continue to focus on metrics that are easy to determine statically and easy to represent, for example, the number of parameters or MAC operations. This allows us to still use machine learning techniques with relatively little overhead.
The Spearman correlation coefficients show that the memory demand for small heterogeneous networks is only weakly correlated with the resource demand. Due to their heterogeneity, networks of the same size can have very different topologies and hence resource demands. The off-chip memory demand for large heterogeneous networks, however, determines the overhead for parameter streaming and is hence more strongly correlated.
In contrast to homogeneous networks, the number of layers and parameters are only weakly correlated, as different layers and parameters potentially yield very different behaviour, depending on the network topology. Furthermore, to fit on the TPU, small heterogeneous networks can only have a limited number of layers. On the other hand, the number of MAC operations again proved to be strongly correlated with the resource demand and remains one of the most important input features. Furthermore, the number of MAC operations for convolutional layers significantly outweighs the number of MAC operations for dense layers.
Based on these analyses, we selected the off-chip memory demand, total memory demand, and the three metrics regarding the number of MAC operations as inputs to the resource-demand estimators for heterogeneous networks.
5.2 Model Accuracy (Homogeneous Neural Networks)
Figure
7 summarises the estimation accuracy of all models trained to estimate the resource demand for homogeneous networks. We measure the resource demand for the evaluation dataset and compare this “true” value with the estimation based on the trained models. We use the
mean absolute percentage error (MAPE) as the error metric, because it normalises the error to the measured value; the individual resource demand can vary significantly from network to network. The graphic is cut off at 15% to maintain readability. The evaluation demonstrates the following:
The dummy regressor performs badly at modelling the execution time. This demonstrates that execution-time modelling is not trivial and that neural network properties must be considered.
The ABR(*) regressors cannot model the execution time of neural networks accurately. For dense and convolutional networks, the estimation error is larger than that of all other ensemble and linear models. This relatively poor accuracy might be caused by an insufficient number of hyperparameter-tuning iterations. The accuracy of the outlier-robust HR regressor is consistently worse than that of a simple linear regressor.
The execution time of
dense networks without off-chip memory (i.e., without streaming) can be generally well approximated. Linear and ensemble models achieve similar accuracy. This means that the relation between the chosen neural network properties and the execution time is linear (as indicated by Figure
6). For the execution time of
dense networks with off-chip memory (i.e., with streaming), the linear models perform worse than ensemble models. This means that, in contrast to small networks without off-chip memory, the relation is no longer strictly linear. There is no single supreme regressor, as ensemble models (
ETR(*),
RFR(*)) achieve similar accuracy. The execution time of
convolutional networks is harder to predict than that of dense networks, as the remaining MAPE is higher. Similar to dense networks with off-chip memory, ensemble models achieve a higher accuracy than linear models. As an exception, the
ABR regressors converge too slowly during our experiments.
The power draw of
dense networks without off-chip memory is trivial to predict. Linear, ensemble, and dummy models have similar accuracy. This is indicated in Figure
3, as the power draw of dense networks is almost constant, with only minor oscillations. For
dense networks with off-chip memory, the power draw is more difficult to predict than for small networks without off-chip memory. Linear and ensemble models achieve similar accuracy, and the dummy regressor is slightly worse. For
convolutional networks, the power can be approximated by linear models as well. Linear models and ensemble models are similarly accurate, and both are more accurate than the dummy regressor. Only the
HR regressor achieves poor accuracy.
In summary, regressors can achieve an error below 1% for homogeneous networks and for both power draw and execution time. This is due to the deterministic characteristics of neural network executions in general and the deterministic hardware platform in particular. As our approach is not restricted to the TPU used in our evaluation, estimators for similar accelerators (e.g., a Kendryte K210) can be trained without changes to our approach and we expect similarly low errors.
5.3 Model Accuracy (Heterogeneous Neural Networks)
Based on the results for homogeneous networks, we trained four additional estimators to predict the execution time and power demand for small and large heterogeneous networks. Therefore, we utilised the ETR(MAE) regressor, as it showed consistently good behaviour for all homogeneous networks.
The mean absolute percentage error (MAPE) to estimate the power demand and execution time is 1.5% and 8.6% for small heterogeneous networks, respectively. For large heterogeneous networks, the MAPE for the power demand and execution time is 2.2% and 9.2%, respectively. Hence, the estimation error for small heterogeneous networks is slightly smaller, which is due to the more predictable behaviour if no parameter streaming is required.
Even though the accuracy is lower than that of the estimators for homogeneous networks, the heterogeneous-network estimators achieve estimation errors that are lower than or comparable to those of state-of-the-art estimators without access to cycle-accurate simulators (see the related work in Section
6). Furthermore, we achieve these results with statically determinable features and low-overhead prediction methods.
However, the network topology may contain information relevant to the execution time that is not accessible to the regressors in our implementation. More sophisticated modelling techniques (e.g., neural networks) could work with more expressive network representations, exploit such information, and potentially achieve higher accuracies, but they would also come with increased overheads for training and execution. The following sections put the accuracy of linear and ensemble models into context by comparing their estimation errors to the influence of self-induced heat and data transmissions.
5.4 Temperature Dependency
The power draw of transistor-based logic circuits generally depends on the temperature, which is influenced by the ambient temperature and also by self-induced heat. The ambient temperature varies over time for most embedded systems, and the self-induced warm-up is typically not precisely controllable either. In consequence, the power draw of identical workloads differs between executions. Models trained with only statically available data—such as the models trained by Precious—are oblivious to temperature changes at runtime and therefore cannot capture this run-to-run variance. We evaluate the influence of the device temperature on its power draw—it constitutes a lower bound for the achievable estimation error of any power-demand model under realistic conditions.
To obtain temperature values, we utilise a BME280 temperature sensor from Bosch Sensortec. We sample temperature values at \(40 \,\mathrm{Hz}\) with 1°C accuracy and 0.01°C precision. During measurements, the temperature sensor is attached to the surface of the TPU casing.
Figure
8 shows the power demand and surface temperature of the TPU executing a convolutional network. The TPU surface temperature starts at ambient temperature (24.6°C) and rises within 20 inferences by 5.8°C. Around inference 20 the temperature stabilises at 30.4°C. The power demand shows the expected correlation with temperature and rises by
\(39.0\,\mathrm{mW}\), stabilising at 1,332.1 mW after 20 inferences, an increase of 2.9%. In comparison, most power models discussed above achieve an estimation error below 2.0%, which is only possible because the device temperature is controlled in our experiments. Therefore, linear and ensemble models achieve
sufficient accuracy, and more elaborate modelling techniques cannot improve the accuracy further under realistic conditions.
5.5 Data Transmission
One important factor for the efficient usage of an external accelerator is the communication between the accelerator and host system, that is, the USB communication in our case. As already described in Section
3, the TPU has a pre-processing, inference, and post-processing stage. This section further analyses the USB traffic during the inference stage and its influence on the TPU’s efficiency.
Figure
9 shows the power demand during a convolutional network inference and the respective USB traffic from and to the host system. The power data consists of 210 superimposed inferences of the same network and the same input data to improve the level of detail. During the reception and transmission of data, the power demand of the TPU significantly decreases from approximately
1,450 mW to 1,150 mW. Hence, the data transmission phases are one reason for the very characteristic power traces of the TPU and thus influence the energy demand and efficiency of the TPU.
USB traffic during the inference stage usually occurs due to one of two reasons:
•
The neural network’s weights and instructions on how to arrange and execute them are transmitted. Furthermore, if the neural network does not fit into the on-chip memory, then parts of the network and potentially instructions on how to partially execute the network are continuously streamed.
•
The input data is transmitted to the accelerator. After the (partial) execution the (intermediate) results and dequantisation instructions are sent back to the host.
Due to the reduced power demand during USB transmission phases, we assume that the execution units of the TPU are not or only partially utilised while the TPU waits for additional data to arrive or to be sent. Thus, USB traffic negatively affects the TPU’s utilisation. Figure
10 shows the average utilisation (i.e., MAC operations per second) for different input-data dimensions for convolutional and fully connected (“dense”) networks, respectively. For dense networks, the network size depends on the number of layers and the layers’ dimensions. Consequently, for higher input dimensions, the network may at some point be too big to fit into the on-chip memory and must be continuously streamed during execution, which significantly decreases the utilisation and, therefore, the TPU’s efficiency (left plot in Figure
10). However, dense networks fitting into the on-chip memory achieve a higher utilisation compared to the convolutional networks in our dataset (right plot in Figure
10).
Most convolutional networks in our dataset fit into the on-chip memory, as the convolutional networks’ sizes are independent of the input-data dimension, but only depend on the number and dimension of the filters. Accordingly, the utilisation is relatively constant for different input dimensions (right plot in Figure
10; pink graph), although lower compared to dense networks fitting into the on-chip memory. This reduced utilisation is presumably due to the bigger input-data sizes for our convolutional networks and the corresponding USB traffic. One interesting finding is that the
incoming traffic decreases if the
outgoing traffic is reduced (by reducing the output dimension). For a subset of the convolutional networks, we executed identical networks with the same data and only one additional small layer that harmonises the output to a constant size of three filters with dimension 14×14 (right plot in Figure
10; yellow graph). We found that not only the output traffic, but also the input traffic was reduced by the additional layer, and consequently this traffic reduction led to an increased utilisation. We assume that the reduced output complexity also reduced the complexity of the execution instructions and hence decreased the traffic from the host.
In summary, reducing the USB traffic significantly increases the TPU’s utilisation and efficiency. In particular, streaming the network on-the-fly for networks exceeding the on-chip memory decreases the TPU’s utilisation, and the same holds for the input and output data transmissions. In general, the influence of the USB communication on the TPU’s utilisation demonstrates that, in many cases, the TPU’s performance is communication-bound. In this case, an accurate execution-time model has to take communication timings into account—however, Precious’ models do not consider them, because the exact timings are not statically analysable, in particular under bus contention. Nevertheless, the USB traffic size is one example of a property that can be easily optimised by a neural network developer with a resource-prediction model at hand. These differences are amplified further by the different execution frequencies presented in the following section.
5.6 Frequency Selection
The TPU supports two different power modes, that is, the
STD and
MAX mode. This section analyses the impact on the power demand, execution time, and energy demand of both power modes. The TPU runs with the full frequency in
MAX mode, whereas in
STD mode the TPU is throttled. However, the
MAX mode may not be applicable in every deployment scenario due to its significantly increased heat dissipation [
42].
Figure
11 shows the execution time and power demand differences between both power modes for dense and convolutional networks, respectively. The difference is calculated by subtracting the
STD execution time/power demand values from the
MAX mode values. Enabling the
MAX power mode leads to decreased execution times for both network types. However, the gains for the convolutional networks are significantly higher, because the dense networks’ execution times are more affected by data transmissions (cf. Section
5.5). As expected, the power demand increases when using the
MAX power mode for both network types. The power difference between
STD and
MAX mode is relatively constant if the TPU is fully utilised (as seen with the convolutional networks). If the network is too small to fully utilise the TPU’s on-chip memory, for example, the first dense networks with less than
\(2 \cdot 10^7\) MAC operations in Figure
11, then the power differences rise linearly with the number of MAC operations.
Figure
12 shows the energy demand difference between both power modes for the dense and convolutional networks, respectively. Again, the difference is calculated by subtracting the
STD energy demand from the
MAX mode energy demand. Dense networks with less than approximately
\(2 \cdot 10^7\) MAC operations can be stored entirely in the on-chip memory and need no additional data streaming. For these networks, the energy demand is slightly lower in
MAX mode than in
STD mode. For bigger dense networks (i.e., with more MAC operations), additional traffic is required and the increased power demand is not compensated by the reduced execution time. Hence, for these bigger networks, the energy demand is increased in
MAX mode. Most convolutional networks in our dataset fit into the on-chip memory and do not require additional traffic. Consequently, for all convolutional networks, the energy demand is reduced in
MAX mode.
In summary, for networks fitting into the on-chip memory, the energy demand is significantly reduced in MAX mode due to the decreased execution time. For all other networks, the energy demand is increased due to the increased power demand and only slightly decreased execution time. Hence, for scenarios where the additional heat dissipation can be handled and the network fits into the on-chip memory, the MAX mode can help to reduce the energy demand.
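The underlying arithmetic is a simple per-inference energy comparison, as sketched below; the thermal-headroom flag is illustrative.

```python
def mode_energy_mj(power_mw, time_ms):
    # Energy per inference: E = P * t (mW * ms yields microjoules; /1000 -> mJ).
    return power_mw * time_ms / 1000.0

def choose_power_mode(std_power_mw, std_time_ms, max_power_mw, max_time_ms,
                      thermal_headroom_ok):
    e_std = mode_energy_mj(std_power_mw, std_time_ms)
    e_max = mode_energy_mj(max_power_mw, max_time_ms)
    # MAX mode only pays off if it saves energy and the heat can be dissipated.
    return "MAX" if thermal_headroom_ok and e_max < e_std else "STD"
```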
5.7 Discussion and Generalisation
In summary, the results indicate that the generated models estimate the resource demand sufficiently accurately, despite using relatively simple regressors. However, the results are obtained for only one hardware platform. For an extension to further hardware platforms,
Precious needs only minor adjustments. First,
Precious has to replace TPU-specific aspects, such as the off-chip memory usage. However, other important neural network properties, like the number of MAC operations and the number of neural network parameters, can be reused. Second, for the execution phase the hardware platform has to be replaced. Depending on the accelerator, it may be challenging to accurately measure the power draw if it is tightly integrated in the system-on-chip. Considering the achieved accuracy of resource-demand models, other hardware platforms may need more complex regressors or provide worse results, as TPUs are specifically designed for deterministic execution [
27]. Papers that examine alternative hardware platforms in detail are discussed in the following section.
6 Related Work
Previous work has thoroughly examined the power and performance characteristics of CPUs, GPUs, and TPUs [
21,
27,
32,
37,
50,
58] in data centres. However, deep learning plays an increasingly important role in embedded systems [
2,
54,
60]. Therefore, our work focuses on neural network execution on embedded platforms. In comparison to regular and large-scale systems, embedded platforms have a much lower power draw and thus very different power-to-performance characteristics. Furthermore, the main focus for embedded systems is on response time rather than throughput [
10]. Table
3 shows an overview of related work with their considered hardware, scope, and metrics, respectively. The
Estimation column denotes works that conduct relevant analyses of the resource demand but do not implement estimation models.
Li et al. [
37] compare the resource demand of training frameworks for convolutional neural networks on CPUs and GPUs. They further provide information on the effects of performance-tuning parameters, such as
dynamic voltage and frequency scaling (DVFS) and hyper-threading, on the energy efficiency of the training processes. Our work presented in this article, in comparison, targets hardware accelerators on embedded platforms.
Likewise, Lu et al. [
46] focus on estimating the memory and execution time for convolutional neural networks on CPUs and GPUs in embedded systems with an accuracy of at least 78% for the execution time. The difference in accuracy compared to our work is presumably due to the inherently less deterministic hardware platform. They share our motivation that resource-demand estimations are highly valuable in the design phase of neural networks and also identify MACs as the most important feature for resource-demand estimations. However, they do not consider the power demand, which is a key concern on embedded platforms, in their estimations and focus on CPUs and GPUs instead of hardware accelerators.
In contrast, Rodrigues et al. [
53] analyse the required input feature complexity for energy-demand predictions on CPUs and GPUs in embedded systems. Similar to this work, they identify MACs as the most suitable feature for precise energy estimations, although focusing on more powerful Linux-based hardware platforms compared to our hardware accelerator. They reach accuracies between 76% and 85%, which is presumably due to their more complex hardware platform.
Jouppi et al. [
27] describe the architecture of a TPU deployed in data centres. The work compares the performance and power demand of a TPU to CPUs and GPUs, showing that the TPU outperforms both hardware alternatives. The authors also discuss that the execution model of the TPU is more deterministic compared to CPUs and GPUs. The consequence is that the resource demand (in particular, power draw and response times) is much more predictable. Our measurements, in particular Figure
3, confirm the deterministic, repeatable behaviour. However, access to the TPU presented in the article is only available via cloud services.
Based on a cycle-accurate simulator, Gupta et al. [
20] built a “latency predictor” for the Google Coral Edge TPU [
44]. This latency predictor is then used together with the model accuracy to iteratively refine neural networks until the network achieves the desired prediction quality within the available response-time budget. The authors further report that the TPU operates more power-efficiently when the model fits into the on-chip memory, which is confirmed by our measurements. In general, cycle-accurate simulators [
48,
49,
55] are capable of very precise execution time and power demand estimations. However, as they may leak the intellectual property of companies, they are usually not publicly available for commercial products. To the best of our knowledge, this is the case for the cycle-accurate simulator for the TPU used in this work.
Kaufman et al. [
29] model the execution time of tensor computations with a feed-forward neural network with a maximum error of 13%. In comparison, we demonstrate that even simple machine learning techniques, such as linear models, can estimate the resource demand for embedded accelerators adequately.
Sieh et al. [
56] create execution-time and energy-demand models for an embedded microcontroller. Similar to our approach, they generate input programs automatically, measure the resource demand, and derive models. They formulate and solve an
integer linear program (ILP) that yields the per-instruction resource demand of the examined hardware. Hönig et al. [
22] use deep neural networks for energy models that also account for inter-instruction effects, for example, related to caches. TPUs, in comparison, avoid such inter-instruction effects [
27]. Instead, tensor processing units aim at providing predictable execution times by exploiting data parallelism in hardware [
12,
26]. Thus, high overall performance is achieved with a much more deterministic execution model compared to CPUs. In consequence, the resource demand prediction of a neural network accelerator can work with simpler models.
Wu et al. [
61] also use machine learning techniques to estimate the performance and power demand for GPGPUs to select (or build) appropriate GPUs for a given application with a maximum error of 15% (execution time) and 10% (power). In a subsequent work, Greathouse et al. extended the approach to heterogeneous systems (i.e., mainly combinations of GPUs and CPUs) with similar accuracies [
18].
With a strict focus on the Intel Neural Compute Stick 2 and Google Coral TPUs, Libutti et al. [
38] compare functional and non-functional properties of the two TPUs. The authors perform power measurements with an INA219-based setup and compare the achieved accuracy against the performance and energy efficiency of the different TPUs. For the comparison, the work utilises the MLPerf benchmarks [
47]. In our work, we also focus on one of the two platforms (i.e., the Google Coral TPU) and we additionally provide models on the basis of time and power measurements. To improve the reliability of the models, we further consider and discuss the impact of the ambient temperature.
Kljucaric et al. [
31] conducted an architectural analysis of deep learning on edge accelerators. The case study particularly focuses on performance analyses of three accelerators: NVIDIA AGX, Google Coral TPU, and Intel Neural Compute Stick 2. The authors conclude that different workload requirements (i.e., memory size) determine which of the accelerators is most efficient. This underlines the importance of methods that provide resource-requirement estimations, as presented with our work on
Precious in this article.
The measurement-based creation of resource-demand models of DNNs is the subject of the work [
17] by Ganesan et al. The paper focuses on latency measurements for a small
signature set of networks to generate generalised models for mobile systems. Similar to our approach, the authors concentrate on providing accurate resource-demand models for neural networks based on measurements, but they do not consider power and energy, which are important system resources.