Abstract
Convolutional Neural Networks (CNNs) represent a class of Deep Neural Networks that is growing in importance due to their state-of-the-art performance in pattern recognition tasks in various domains, including image recognition, speech recognition, and natural language processing. However, CNNs are very time-consuming to train due to the computationally intensive nature of their convolution operations. Typically, a convolution operation is exposed as a library API that duplicates and reorganizes input tensors under the hood in order to leverage existing matrix-matrix multiplication (GEMM) BLAS routines. Unfortunately, this widely used approach suffers not only from memory expansion but also from memory bandwidth limitations. Moreover, although there has been a significant amount of past work on optimizing CNNs on GPUs, those approaches are not directly applicable to many-core CPU platforms such as Intel Xeon Phi.
In this paper, we show how a novel dynamic code generation approach can be used to implement convolution on Intel Knights Landing systems with AVX-512 support, so as to obtain order-of-magnitude performance improvements compared to the GEMM-based approach. Moreover, our approach gives robust performance across different convolution layers of the state-of-the-art CNNs, such as AlexNet, GoogleNetV1, Overfeat, and Vgga. The methods in this paper should be applicable to future many-core CPU platforms with vector lengths of 512 bits or larger.
Rajkishore Barik contributed to this work when he was at Intel Labs, Santa Clara CA 95054, USA.
1 Introduction
Concepts from the field of Machine Learning drive many aspects of modern society, from social networks to recommendations in e-commerce, and are powering an increasing number of consumer products, including cameras and self-driving cars. In particular, Deep Learning (DL) has become one of the most critical technologies, demonstrating human-level or better performance for tasks in domains such as object recognition, board games, and speech recognition. This became possible for two reasons: (1) Deep Neural Networks (DNNs) can learn features automatically from large datasets and represent complex functions using multiple hidden layers, and (2) recent advances in processor technologies have made it possible to satisfy the huge computational requirements associated with DL.
Although different DNNs aim at different problems, one of the most critical DL applications today is image recognition [11], and currently, the Convolutional Neural Network (CNN) is the state-of-the-art DNN for image recognition. A CNN consists of multiple hidden layers, and among these layers, the core of a CNN is the convolution layer. It is also the most computationally expensive layer of a CNN [1], performing a large number of small convolutions. As abundant data parallelism is available in the convolution computation, across the images of a mini-batch and across feature maps (channels), massively parallel architectures, GPUs in particular, have been used for CNN training and inference. As a consequence, all existing CNN frameworks [2, 5, 10, 12] have GPU backends that implement convolution layers as libraries using cuDNN [4]. However, recent advances in many-core CPUs, such as Intel Knights Landing (KNL) [16] with 68–72 cores and AVX-512 support, have made them potentially capable of delivering very high performance (6 TFLOPs for single precision). Still, many-core CPUs have not been explored much from the perspective of optimizing CNNs due to several challenges: (a) the low-end cores have high penalties for branches and memory accesses; (b) although vectorizing regular applications is simple with AVX-512 on KNL, it is extremely difficult to extract peak performance because the cores are two-issue-wide while having two VPUs, so it is practically impossible to saturate the issue slots with vector floating-point instructions alone; (c) it is harder to hide memory latency because of the inherently latency-oriented design; and (d) cache prefetching plays an important role in performance but can be challenging to get right. Overcoming these challenges and reaching near-peak performance requires very high-quality code generation.
A widely used approach to implementing convolutions in CNNs is to flatten the corresponding input data (the image-to-column or im2col operation [3]) and use standard matrix multiplications (GEMM) on the flattened data. One of the main reasons behind the popularity of this method is the ample availability of optimized libraries for GEMM operations. While it is easy to implement, this method has severe drawbacks when aiming for high performance on CPUs. The image flattening step is a data copy and redistribution operation that is purely memory bandwidth bound. Even though the GEMM computation is highly optimized, the flattening step acts as a bottleneck and creates a huge performance penalty. The direct convolution method, on the other hand, does not involve the im2col operation at all.
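To make the cost concrete, the following minimal sketch (illustrative code, not taken from any particular framework) flattens a single C x H x W image for an R x S filter with unit stride and no padding; a GEMM between the K x (C*R*S) filter matrix and the resulting (C*R*S) x (P*Q) matrix then yields the K x (P*Q) output.

```c
#include <stddef.h>

/* Minimal im2col sketch for one image: every input element covered by the
 * filter window is copied once per (r, s) filter position, so the flattened
 * buffer is up to R*S times larger than the input and producing it is pure
 * data movement.  Unit stride and no padding assumed; names are illustrative. */
void im2col(const float *in, float *col, int C, int H, int W, int R, int S)
{
    const int P = H - R + 1, Q = W - S + 1;   /* output height and width */
    for (int c = 0; c < C; c++)
        for (int r = 0; r < R; r++)
            for (int s = 0; s < S; s++)       /* one flattened row per (c, r, s) */
                for (int p = 0; p < P; p++)
                    for (int q = 0; q < Q; q++)
                        col[(size_t)((c * R + r) * S + s) * (P * Q) + p * Q + q] =
                            in[(size_t)c * H * W + (size_t)(p + r) * W + (q + s)];
}
```

Both the extra memory footprint and the bandwidth spent producing the flattened buffer are pure overhead that the direct convolution method avoids.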
Even though CNNs provide state-of-the-art accuracy, training them requires an enormous amount of time and can span several weeks. For example, it takes 21 days to train GoogleNetV1 [17] with the ImageNet dataset on a single Nvidia® K20 GPU [8]. CNN training involves forward-propagation and back-propagation phases. Although our method is applicable to convolutions in all phases, for the purpose of demonstration in this paper we choose the convolution in back-propagation as our candidate problem. We believe it is a good candidate because it involves a more complex data reuse pattern than forward propagation: the reuse occurs over the data the kernel writes to, which makes it much harder to exploit and poses a more challenging optimization problem than the forward-propagation case.
In this work, we leverage the direct convolution approach to avoid the expensive memory operations associated with the im2col operation and to optimize the convolution in back-propagation. Another critical aspect to consider when optimizing convolution in CNNs is that the input parameters for convolution vary significantly across layers of a CNN and also across different CNNs. Thus, the parameter values are only known at runtime, making it hard to achieve good performance through static compilation. In this work, we instead explore runtime code specialization for optimizing CNNs on many-core CPUs with large vector lengths.
Our main contributions in this work are:
- For optimizing convolutions in state-of-the-art CNNs, we propose a novel dynamic code generation approach targeting high performance on Intel’s Knights Landing architecture. Prior work has shown that it is a daunting task to extract peak performance on Xeon Phi processors even for a regular application such as matrix-matrix multiplication [6]. Our research novelty lies in using a low-overhead dynamic code generator to achieve close to peak performance for convolutions on Xeon Phi processors. This code generator not only performs standard compiler optimizations, including register allocation, loop unrolling, tiling, vectorization, latency hiding, software pipelining, and software prefetching, but also specializes the generated code based on the parameters of the convolution operation, which vary widely across layers and networks.
- As another research contribution, our work debunks the claim that direct convolution is not a good method when aiming for high performance [4]. Almost all existing approaches in CNNs use the GEMM formulation for convolutions, which has a performance bottleneck due to a memory-bandwidth-bound step. We show that the direct convolution method, with our runtime code specialization, can achieve an order-of-magnitude performance improvement by avoiding such overhead.
- We provide a thorough performance analysis of our implementation of direct convolution in back-propagation on KNL for several state-of-the-art CNNs. We further compare our performance with other leading approaches on KNL, such as Intel® MKL-DNN and ZNNPhi.
2 Background
As our work focuses on optimizing the costly convolution operation associated with the convolution layers, we start with a brief description of it. During a convolution operation, each output pixel is generated from the weighted sum of a spatially connected neighborhood of inputs. Specifically, each output element is computed by multiplying the input elements within a defined neighborhood by the corresponding weights from the filter data and summing the products.
In the case of CNNs, we usually perform convolution over a batch of images. This is termed batched convolution [4]. A batched convolution deals with three four-dimensional tensors: \(I \in {\varvec{R}}^{NCHW}\) as input image data, \(O \in {\varvec{R}}^{NKPQ}\) as output data, and \(F \in {\varvec{R}}^{KCRS}\) as filter data. The input data ranges over N images in a mini-batch, C input feature maps, H rows (image height), and W columns (image width). The filter data ranges over K output feature maps, C input feature maps, R rows (filter height), and S columns (filter width). A mathematical definition of the convolution operation can be found in [4] and other references.
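Concretely, following the notation above, the batched (forward) convolution defined in [4] can be written, with vertical and horizontal strides \(u\) and \(v\) and padding omitted for brevity, as

$$O(n,k,p,q) \;=\; \sum _{c=0}^{C-1}\sum _{r=0}^{R-1}\sum _{s=0}^{S-1} I(n,\,c,\,p\cdot u + r,\,q\cdot v + s)\,F(k,c,r,s).$$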
One interesting observation regarding the values of the parameters mentioned above is that they vary significantly across different convolution layers of a CNN and also across different CNNs. For example, in the case of GoogleNetV1 [17], the number of input feature maps or channels (C) ranges from 16 to 832. For the same CNN, the image height (H) and width (W) vary from 224 down to 7. Thus, even though the parallelism in a convolution operation may appear to be straightforward, efficient exploitation of this parallelism can be very challenging because of the substantial variation in loop trip counts based on the input data. Due to these widely varying parameter values, it is almost impossible to propose a single optimized solution for computing the convolution that gives excellent performance in every scenario. We describe our approach to addressing this challenge in Sect. 3.
3 Overview of Our Approach
In this section, we present our novel code generation approach to optimize direct convolution for parallel execution on KNL. As mentioned in Sects. 1 and 2, the input parameters for the convolution operation in CNNs vary widely. Moreover, the dynamic values of these parameters are only known at execution time. Further, the computational pattern of the convolution kernel depends on the input parameter values. For example, when the filter height (R) and width (S) are 5, the number of arithmetic operations per image element is almost 25 times higher than when R and S are 1. This dependence of the kernel's runtime behavior on parameter values that are only known at execution time indicates that achieving good performance through static compilation is very hard for our problem. So, as a more suitable alternative, we perform runtime code specialization and adopt a Just-In-Time (JIT) compilation approach (see footnote 1). We determine the optimization factors from the input parameter values at runtime and provide our dynamic code generator with these factors to produce highly optimized code for the kernel. We show in Sect. 6.2 that, from the performance perspective, our JIT-based method is highly adaptable to a wide range of input parameter values compared to other state-of-the-art methods.
Figure 1 gives a high-level overview of our approach. First, we manually apply standard compiler optimizations to the naive code. Then we take the optimized code and abstract several innermost loops out to the JITer; in the static code segment, the output of the JITer is accessed through a function pointer. During static compilation of the code, we leverage widely available threading libraries such as OpenMP for parallelization of the outermost loops. At runtime, we create a descriptor of optimization factors from the parameter values and pass this descriptor to the JITer to produce optimized code for the abstracted loops. The JITer creates an in-memory function and returns a pointer to it. We use this function pointer in the static code segment to execute the JITer-generated code.
4 Runtime Code Specialization
To start with, Fig. 2 shows a "C"-style pseudocode (see footnote 2) of a straightforward, but naive, implementation of the kernel. In this section, we discuss the runtime specialization of this code and the separation of code blocks between dynamic code generation and static compilation.
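Since Fig. 2 is not reproduced here, the following sketch gives a naive direct implementation of the back-propagation kernel (the gradient with respect to the input), using loop names consistent with those referenced later in the text (img, ofm, ifm, oj, oi, kj, ki) and the array-access macro convention of footnote 2. It is an illustrative reconstruction that assumes unit stride and no padding, not the exact code of the figure.

```c
#include <stddef.h>

/* dI is the input gradient (assumed zero-initialized by the caller), dO the
 * output gradient, and Wt the filter.  Unit stride and no padding assumed,
 * so P = H - R + 1 and Q = W - S + 1. */
#define dI(n, c, h, w)  dI[((((size_t)(n) * C + (c)) * H + (h)) * W + (w))]
#define dO(n, k, p, q)  dO[((((size_t)(n) * K + (k)) * P + (p)) * Q + (q))]
#define W_(k, c, r, s)  Wt[((((size_t)(k) * C + (c)) * R + (r)) * S + (s))]

void conv_bp_naive(float *dI, const float *dO, const float *Wt,
                   int N, int C, int K, int H, int W,
                   int P, int Q, int R, int S)
{
  for (int img = 0; img < N; img++)             /* images in the mini-batch */
    for (int ofm = 0; ofm < K; ofm++)           /* output feature maps      */
      for (int ifm = 0; ifm < C; ifm++)         /* input feature maps       */
        for (int oj = 0; oj < P; oj++)          /* output rows              */
          for (int oi = 0; oi < Q; oi++)        /* output columns           */
            for (int kj = 0; kj < R; kj++)      /* filter rows              */
              for (int ki = 0; ki < S; ki++)    /* filter columns           */
                dI(img, ifm, oj + kj, oi + ki) +=
                    dO(img, ofm, oj, oi) * W_(ofm, ifm, kj, ki);
}
```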
During the design of our dynamic code generator, we exploit the fact that it is targeted for domain-specific JIT code generation (that is, it targets CNN computations). This enables us to design and implement a very low overhead JITer compared to traditional dynamic compilers. At a high level, our JITer can avoid all the steps of handling generic code in a traditional JITer and directly proceed to the assembly code generation phase, because we have manually applied all the high-level optimizations beforehand and know the exact computation sequence inside the JITer. So, we hardcode the register allocation, loads/stores of data, fused multiply-add computation, tiling, unrolling, and prefetching process inside our code generator, while the associated factors still depend on the descriptor we pass to our JITer. For the conversion of the assembly code generated by our JITer to machine code, we have extended the dynamic assembler from LIBXSMM [7], which targets matrix-multiplication-style applications. Section 5 discusses the optimizations that we consider for the implementation of our JITer on the KNL architecture and how the input parameter values influence the factors associated with these optimizations.
After applying the optimizations described in Sect. 5 on the code in Fig. 2, we determine the partitioning of the kernel between the code that is statically compiled using standard compilers such as Intel® ICC and the code that we generate at runtime. Figure 3 depicts that partition. The idea here is to keep the overhead of JIT code generation as low as possible. We achieve this by leaving low-level optimizations and parallelization to a static compiler and amortizing the cost of JIT code generation over the outermost loops.
Figure 3 also depicts the interface for our dynamic code generator. First, we create a descriptor (bp_desc) of optimization factors depending on the runtime values of the input parameters. Then we pass the descriptor to our JITer (bp_jit). The JITer generates optimized code using the descriptor and returns a function pointer (conv_bp). We use this function pointer to execute the JITed code inside the innermost loop that we statically compile (in this case, the oi loop in line 9). During descriptor creation, we derive several crucial optimization factors from the input parameter values, such as the register blocking factor, the cache blocking factor, which loops to unroll, and how much to prefetch in each iteration. Additionally, we ensure that the JITed code fits in the L1 instruction cache and that its data footprint fits in the L1 data cache. This is important because the penalty for missing the L1 caches is multiplied by the combined trip count of the outermost loops and effectively becomes quite high.
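The listing below sketches the resulting structure in plain C. Only the names bp_desc, bp_jit, and conv_bp are taken from the text; the descriptor fields, the tile-offset arithmetic (which assumes the blocked layouts of Sect. 5.2 and unit stride), and the exact split of loops between static and JITed code are illustrative assumptions rather than the paper's actual interface.

```c
#include <stddef.h>

typedef struct {                    /* illustrative descriptor contents */
  int N, C, K, H, W, P, Q, R, S;    /* convolution parameters           */
  int B_I, B_O, B_Q;                /* vector, cache, register blocking */
} bp_desc;

/* The JITed kernel handles one register-blocked strip of output columns;
 * internally it loops over the ofm blocks, the filter window and B_O/B_Q. */
typedef void (*conv_bp_fn)(float *dI, const float *dO, const float *Wt);

conv_bp_fn bp_jit(const bp_desc *d);   /* provided by the code generator */

void conv_bp_layer(float *dI, const float *dO, const float *Wt, const bp_desc *d)
{
  conv_bp_fn conv_bp = bp_jit(d);      /* one-time JIT code generation   */
  const int CB = d->C / d->B_I, KB = d->K / d->B_O;

  #pragma omp parallel for collapse(2)
  for (int img = 0; img < d->N; img++)
    for (int ifmb = 0; ifmb < CB; ifmb++)           /* coarse-grain threads   */
      for (int oj = 0; oj < d->P; oj++)
        for (int oi = 0; oi < d->Q; oi += d->B_Q)   /* JITed code called here */
          conv_bp(&dI[(((size_t)img * CB + ifmb) * d->H + oj) * d->W * d->B_I
                      + (size_t)oi * d->B_I],
                  &dO[((size_t)img * KB * d->P + oj) * d->Q * d->B_O
                      + (size_t)oi * d->B_O],
                  &Wt[(size_t)ifmb * d->R * d->S * d->B_O * d->B_I]);
}
```

The cost of bp_jit is paid once per layer configuration and is amortized over all iterations of the outer loops, consistent with the overhead measurements in Sect. 6.3.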
5 Optimizations for KNL Many-Core Architecture
5.1 Key Features to Consider for Code Optimization
The processor under discussion is the second-generation Intel Xeon Phi many-core processor, codenamed Knights Landing (KNL). An architectural overview of the KNL chip can be found in [16]. The KNL chip features up to 72 out-of-order cores based on the Silvermont Atom microarchitecture, each with 4 hardware hyper-threads. A key feature of this processor’s microarchitecture is that each core includes two 512-bit vector processing units (VPUs) for increased SIMD-level parallelism, i.e., each core can start the execution of two 16-wide single-precision SIMD instructions in the same clock cycle. Another important feature of KNL is that it supports explicit instructions to prefetch data into the L1 or L2 cache (prefetcht0 and prefetcht1, respectively).
5.2 Fine-Grain Parallelism and Related Optimizations
Data Layout - Needless to say, the large number of on-chip VPUs makes vectorization a critical optimization to consider for KNL. Keeping this in mind, we design the data layouts for the tensors to favor vectorization on x86 systems. From our domain knowledge, we find that the number of input feature maps, C, and output feature maps, K, are typically multiples of the vector length on x86 architectures. So, we block these dimensions by the vector length and bring the blocking factor to the innermost dimension to have contiguous SIMD access for the tensors. The resulting data layouts are as follows: (a) \(Input:\; NCHW \longrightarrow NC_{B_I}HWB_I, \, B_I=VLEN\), (b) \(Output:\; NKPQ \longrightarrow NK_{B_O}PQB_O, \, B_O=VLEN\), and (c) \(Filter \, or \, Weight:\; KCRS \longrightarrow K_{B_O}C_{B_I}RS{B_O}{B_I}\).
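As an illustration of the blocked input layout (an assumed helper, not code from the paper), the macro below maps a logical coordinate (n, c, h, w) to its offset in \(NC_{B_I}HWB_I\); since the c % B_I index is innermost, the 16 channels of one block are contiguous and can be fetched with a single 512-bit load.

```c
#include <stddef.h>

/* Offset of logical input element (n, c, h, w) in the blocked NC_B HW B_I
 * layout, where CB = C / B_I is the number of input-feature-map blocks.
 * Illustrative helper; the output and filter layouts follow the same idea. */
#define B_I 16   /* VLEN for single precision with AVX-512 */
#define INPUT_OFF(n, c, h, w, CB, H, W) \
  (((((size_t)(n) * (CB) + (c) / B_I) * (H) + (h)) * (W) + (w)) * B_I + (c) % B_I)
```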
Vectorization - Following the notation in Fig. 2, we block the ifm loop by a factor of the vector length (\(B_I\)) and bring the blocking-factor loop to the innermost position, i.e., inside the ki loop. We then vectorize this loop and perform its computation with a single fused multiply-add (FMA) vector instruction.
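With this layout, each innermost step reduces to one 16-wide FMA, sketched below with AVX-512 intrinsics (a simplified illustration of the operation, not the JITer's emitted code): the scalar output-gradient value is broadcast, multiplied by 16 \(B_I\) lanes of the filter, and accumulated into 16 \(B_I\) lanes of the input gradient.

```c
#include <immintrin.h>

/* One vectorized accumulation step of the back-propagation kernel:
 * dI(..., ii, 0..15) += dO(..., oi, bo) * W(..., kj, ki, bo, 0..15).
 * The pointers are assumed to address the 16 contiguous B_I lanes. */
static inline void bp_fma_step(float *dI_lanes, const float *w_lanes, float dO_scalar)
{
    __m512 acc = _mm512_loadu_ps(dI_lanes);   /* 16 B_I lanes of dI      */
    __m512 w   = _mm512_loadu_ps(w_lanes);    /* 16 B_I lanes of filter  */
    __m512 o   = _mm512_set1_ps(dO_scalar);   /* broadcast dO element    */
    acc = _mm512_fmadd_ps(o, w, acc);         /* single FMA instruction  */
    _mm512_storeu_ps(dI_lanes, acc);
}
```

In the generated code the accumulator stays in a vector register across the surrounding loops instead of being loaded and stored at every step; providing that reuse is the purpose of the register blocking discussed next.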
Is Vectorization All We Need? - Let us consider where we are in the performance landscape after vectorization. To get some insight, we present performance for convolution layers from Overfeat [14] in Table 1. As we can see, we gain a significant performance improvement over the naive code in Fig. 2 through vectorization. However, the theoretical peak performance of KNL is 6 TFLOPs, and large HPC-style matrix-multiplication benchmarks (the Top500 benchmark [13]) achieve roughly 4.5 TFLOPs on KNL for single precision. This means we are still far from peak performance. So, how can we do better? The following subsections describe other, equally important optimizations for KNL that improve performance beyond vectorization, and we show in Sect. 6 that we achieve an order-of-magnitude better performance and, sometimes, performance close to peak (>4 TFLOPs).
Exploiting Instruction-Level Parallelism - KNL has 32 vector registers per core. We use these registers for register blocking to increase both instruction-level parallelism and register reuse. Our input tensor layout is \(NC_{B_I}HWB_I\), and we vectorize along the \(B_I\) dimension. Therefore, a good candidate for register blocking is the next innermost dimension, i.e., the W dimension. Correspondingly, we perform register blocking along the oi loop and bring the blocking-factor (\(B_Q\)) loop inside the ki loop.
Optimizing Vector Loads and Stores - Vector loads and stores can be quite expensive in terms of cycles, even when their data resides in the L1 cache. It is therefore important to reduce the number of loads and stores as much as possible, both to reduce their overhead and to reduce the number of stalls in the instruction pipeline. One interesting observation regarding the input access pattern is that we have significant reuse of input values over the ki loop when the stride is 1. We exploit this to gain register reuse for the input tensor. The strategy is outlined in Fig. 4: we rotate the logical register indices from one ki iteration to the next. Thus we get significant reuse of physical registers and effectively reduce the number of loads and stores on the input tensor. To hide load latencies, we use software pipelining to issue loads of the weight tensor ahead of their use. That strategy is depicted in Fig. 5.
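Since Figs. 4 and 5 are not reproduced here, the sketch below illustrates the register-rotation idea in isolation, for one (kj, \(b_O\)) slice of the computation with unit stride. The blocking factors, pointer conventions, and the use of an indexed __m512 array as "logical registers" are simplifications; the JITer emits the equivalent as straight-line code with fixed register names, and it additionally software-pipelines the weight loads (Fig. 5), which is omitted here.

```c
#include <immintrin.h>
#include <stddef.h>

enum { B_I = 16, B_O = 16, B_Q = 8 };   /* illustrative blocking factors */

/* dI_row points at dI(img, ifmb, oj + kj, oi0, 0), dO_row at the bo-th lane of
 * dO(img, kb, oj, oi0, .), and w at W(kb, ifmb, kj, 0, bo, 0); the caller must
 * guarantee that B_Q + S - 1 dI columns are addressable from dI_row. */
void bp_block_rotate(float *dI_row, const float *dO_row, const float *w, int S)
{
    __m512 reg[B_Q];
    int head = 0;                                 /* start of the circular window */

    for (int q = 0; q < B_Q; q++)                 /* initial fill: columns 0..B_Q-1 */
        reg[q] = _mm512_loadu_ps(dI_row + (size_t)q * B_I);

    for (int ki = 0; ki < S; ki++) {
        const __m512 wv = _mm512_loadu_ps(w + (size_t)ki * B_O * B_I);
        for (int q = 0; q < B_Q; q++) {           /* B_Q FMAs reuse the held registers */
            const __m512 ov = _mm512_set1_ps(dO_row[(size_t)q * B_O]);
            reg[(head + q) % B_Q] = _mm512_fmadd_ps(ov, wv, reg[(head + q) % B_Q]);
        }
        if (ki + 1 < S) {                         /* rotate: retire the oldest column, */
            _mm512_storeu_ps(dI_row + (size_t)ki * B_I, reg[head]);
            reg[head] = _mm512_loadu_ps(dI_row + (size_t)(ki + B_Q) * B_I);
            head = (head + 1) % B_Q;              /* ...and load exactly one new one   */
        }
    }
    for (int q = 0; q < B_Q; q++)                 /* write back the final window */
        _mm512_storeu_ps(dI_row + (size_t)(S - 1 + q) * B_I, reg[(head + q) % B_Q]);
}
```

Because consecutive ki iterations touch overlapping windows of dI columns, B_Q - 1 registers survive from one iteration to the next and only one new vector load and one store are issued per step.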
5.3 Thread-Level Parallelism and Optimizations
We use the standard OpenMP® threading library for multi-threading. To ensure coarse granularity of work, we use the outermost loops, i.e., the img and ifm loops, to exploit thread-level parallelism. We collapse the iteration space of these two loops and distribute it across threads using the #pragma omp for collapse(2) directive. As the work inside each ifm iteration is similar, there is no load imbalance here.
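Spelled out, the parallelization of the two outermost loops is straightforward; the schematic below shows the worksharing construct named above, with process_tile standing in (as an illustrative name) for the rest of the loop nest and the JITed kernel call.

```c
/* The img and ifm-block loops are collapsed into one iteration space and
 * divided among OpenMP threads; since every (img, ifmb) tile performs similar
 * work, the static schedule is naturally balanced. */
void parallel_outer_loops(int N, int CB, void (*process_tile)(int img, int ifmb))
{
    #pragma omp parallel
    {
        #pragma omp for collapse(2)
        for (int img = 0; img < N; img++)
            for (int ifmb = 0; ifmb < CB; ifmb++)
                process_tile(img, ifmb);
    }
}
```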
Cache Blocking - We consider improving per-thread performance through blocking for the L1 data cache. The output tensor layout is \(NK_{B_O}PQB_O\). If we block along \(B_O\), we gain spatial locality for the output tensor; furthermore, since the input tensor does not depend on \(B_O\), we retain temporal locality for the input tensor. So, we apply cache blocking along the ofm loop by a factor of \(B_O\) and bring the tiled loop inside the ki loop.
Software Prefetching - KNL supports explicit L1 and L2 cache prefetch instructions (prefetcht0 and prefetcht1, respectively). We use these instructions in our dynamic code generator to hide load latencies by bringing data into the cache before its actual use, while also ensuring that the data is not prefetched so early that it is evicted before use. Our software prefetch pipeline is presented in Fig. 6.
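As an illustration (not the JITer's emitted instruction sequence), the same prefetches can be expressed from C through the _mm_prefetch intrinsic; in the generated code, the prefetch targets and distance are among the factors derived from the descriptor, so the choice below of which tensor goes to which cache level is an assumption for the example.

```c
#include <xmmintrin.h>   /* _mm_prefetch, _MM_HINT_T0, _MM_HINT_T1 */

/* Prefetch data needed a few iterations ahead: here, the next filter lanes
 * into L1 and a further-out output-gradient tile into L2.  The addresses and
 * the prefetch distance are illustrative. */
static inline void prefetch_ahead(const float *w_next, const float *dO_next)
{
    _mm_prefetch((const char *)w_next,  _MM_HINT_T0);   /* into L1 */
    _mm_prefetch((const char *)dO_next, _MM_HINT_T1);   /* into L2 */
}
```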
6 Performance Evaluation
In this section, we evaluate the performance of a C-based implementation of our method on a single-socket Intel® Xeon Phi 7250 processor equipped with 68 cores and 16 GB of MCDRAM. We used Turbo mode, which set the processor frequency at 1.3 GHz, and configured the processor with the FLAT memory mode and the QUADRANT cluster mode. We kept all the data in MCDRAM using numactl -membind=1. For multi-threading, we set the number of threads to 64 for all experiments. We compiled our code with the Intel® C++ Compiler (ICC) 2017 using the "-O2" flag.
For evaluating our approach, we chose four state-of-the-art CNNs, namely Alexnet [11], Overfeat [14], Vgga [15], and GoogleNetV1 [17]. These CNNs contain 67 convolution layers in total. To improve readability, however, we present performance results for 12 convolution layers from these CNNs, chosen to cover as diverse a set of parameter values as possible. The parameter values of the selected convolution layers are presented in Table 2.
6.1 Comparison with GEMM-Based Method
As we can see from Fig. 7, we obtain an order-of-magnitude performance improvement over the GEMM-based method implemented with Intel® MKL 2018. Moreover, the figure also supports our hypothesis that the image flattening step (im2col) required by the GEMM-based method incurs significant overhead: the FLOPs measured for the GEMM operation alone are much higher than the effective FLOPs of the overall method. Hence, our work shows that the direct convolution method can achieve much higher performance than the GEMM-based approach by avoiding the memory-bandwidth-bound im2col operation. Note also that the GEMM operation itself does not reach very high FLOPs (i.e., >4 TFLOPs) due to the irregular sizes of the matrices.
6.2 Comparison with State-of-the-Art Libraries
To compare our method with other state-of-the-art methods, we present a performance comparison with Intel® MKL-DNN [9] and ZNNPhi [18], both of which optimize the convolution operation for KNL. Figure 8 presents the performance results. It shows that our method gives better performance for all the convolution layers except Alexnet_CONV2, where MKL-DNN gives the best performance. This demonstrates the importance of our adaptable runtime code specialization, which decides the optimization factors depending on the execution-time values of the input parameters. We see that even MKL-DNN, a highly optimized library manually tuned by experts, fails to capture specific scenarios and gives quite poor performance, for example, on Vgga_CONV2 and Googlenetv1_CONV18. ZNNPhi, on the other hand, generates several kernels with different values of the optimization parameters; we only present the best performance achieved among those kernels. In general, our method gives much better performance than ZNNPhi except for the convolution layers from Vgga, where the performance is similar. Another important advantage of our method is that we do not incur the overhead of any benchmarking or auto-tuning step involving several kernels to choose the best one.
6.3 Overhead of JIT Code Generation
Figure 9 shows an evaluation of the overhead of our dynamic code generation using the following metric: code generation time as a percentage of the total execution time for convolution over a mini-batch. In reality, the kernel is executed over many iterations during training, while JIT code generation is required only once. Hence, in practice, the cost of JIT code generation is amortized over multiple executions of the kernel with the same parameter values, which in most cases number well over 1,000. Nevertheless, even for a single execution, we see negligible overhead for many convolution layers, especially the ones with large iteration spaces. For kernels with comparatively small iteration spaces, such as Googlenetv1_CONV18 and Googlenetv1_CONV25, we see a discernible overhead (but still under 10%) because they have very short execution times (9.6 ms for Googlenetv1_CONV18). However, with amortization over the number of iterations, this small overhead becomes negligible.
7 Conclusion
Convolutional Neural Networks (CNNs) are the state-of-the-art Deep Neural Networks for image recognition applications today. The core of these CNNs is the convolution layer, which performs a large number of small convolutions with irregular dimensions. CNN training requires massive computing power, and it turns out that the convolution operation is the key performance enabler for CNNs. As the primary contribution of this work, we propose a novel low-overhead dynamic code generation approach for runtime code specialization based on the input parameter values for convolution. We demonstrate that an efficient implementation of direct convolution in back-propagation using our approach can achieve close to peak performance in many cases on the Intel Knights Landing (KNL) processor. Furthermore, we debunk the claim that the direct convolution method is not suitable for high performance: using our approach, it achieves a significant performance improvement over the GEMM-based method on KNL. Finally, we compare our performance results with other cutting-edge approaches on KNL, such as MKL-DNN and ZNNPhi, for several convolution layers of state-of-the-art CNNs. The comparison confirms the robust performance of our method over a wide range of input parameter values. We have released our implementation at https://github.com/hfp/libxsmm, which is currently used by high-level frameworks such as TensorFlow.
Notes
1. For convenience, we will use "JIT" as a shorthand root in words such as "JITer" and "JITed".
2. Array accesses appear within "( )" instead of "[ ]" due to the use of macros; e.g., A(i, j, k, l) denotes location A[i*dim2*dim3*dim4 + j*dim3*dim4 + k*dim4 + l].
References
Awan, A.A., et al.: An in-depth performance characterization of CPU- and GPU-based DNN training on modern architectures. In: Proceedings of the Machine Learning on HPC Environments. MLHPC 2017, pp. 8:1–8:8 (2017)
Bergstra, J., et al.: Theano: a CPU and GPU math compiler in Python. In: Proceedings of 9th Python in Science Conference, pp. 1–7 (2010)
Chellapilla, K., Puri, S., Simard, P.: High performance convolutional neural networks for document processing. In: Tenth International Workshop on Frontiers in Handwriting Recognition. Suvisoft (2006)
Chetlur, S., et al.: cuDNN: efficient primitives for deep learning. arXiv preprint arXiv:1410.0759 (2014)
Collobert, R., Kavukcuoglu, K., Farabet, C.: Torch7: a MATLAB-like environment for machine learning. In: BigLearn, NIPS Workshop, No. EPFL-CONF-192376 (2011)
Heinecke, A., et al.: Design and implementation of the Linpack benchmark for single and multi-node systems based on Intel® Xeon Phi coprocessor. In: Proceedings of the 2013 IEEE 27th International Symposium on Parallel and Distributed Processing, IPDPS 2013, pp. 126–137 (2013)
Heinecke, A., Pabst, H., Henry, G.: LIBXSMM: a high performance library for small matrix multiplications. In: Poster and Extended Abstract Presented at SC (2015)
Iandola, F.N., et al.: FireCaffe: near-linear acceleration of deep neural network training on compute clusters. In: Proceedings of the IEEE Conference on Computer Vision and Pattern Recognition, pp. 2592–2600 (2016)
Intel: MKL-DNN (2017). https://github.com/01org/mkl-dnn
Jia, Y., et al.: Caffe: convolutional architecture for fast feature embedding. In: Proceedings of the 22nd ACM International Conference on Multimedia, pp. 675–678 (2014)
Krizhevsky, A., et al.: ImageNet classification with deep convolutional neural networks. In: Advances in Neural Information Processing Systems, pp. 1097–1105 (2012)
Abadi, M., et al.: TensorFlow: large-scale machine learning on heterogeneous systems (2015). tensorflow.org
Meuer, H., et al.: Top500 list, June 2016. https://www.top500.org/
Sermanet, P., et al.: Overfeat: integrated recognition, localization and detection using convolutional networks. arXiv preprint arXiv:1312.6229 (2013)
Simonyan, K., Zisserman, A.: Very deep convolutional networks for large-scale image recognition. arXiv preprint arXiv:1409.1556 (2014)
Sodani, A., et al.: Knights Landing: second-generation Intel Xeon Phi product. IEEE Micro 36(2), 34–46 (2016)
Szegedy, C., et al.: Going deeper with convolutions. In: Proceedings of the IEEE Conference on Computer Vision and Pattern Recognition, pp. 1–9 (2015)
Zlateski, A., Seung, H.S.: ZNNPhi (2017). https://github.com/seung-lab/znnphi-release