O3BNN: An out-of-order architecture for high-performance binarized neural network inference with fine-grained pruning

T. Geng, T. Wang, C. Wu, C. Yang, W. Wu, A. Li… - Proceedings of the ACM International Conference on Supercomputing, 2019 - dl.acm.org
Binarized Neural Networks (BNN) have drawn tremendous attention due to their significantly reduced computational complexity and memory demand. They have shown particular promise in cost- and power-restricted domains, such as IoT and smart edge devices, where reaching a certain accuracy bar is often sufficient and real-time performance is highly desired.
In this work, we demonstrate that the highly condensed BNN model can be shrunk significantly further by dynamically pruning irregular redundant edges. Based on two new observations of BNN-specific properties, an out-of-order (OoO) architecture, O3BNN, can curtail edge evaluation in cases where the binary output of a neuron can be determined early. Similar to Instruction-Level Parallelism (ILP), these fine-grained, irregular, runtime pruning opportunities are traditionally presumed to be difficult to exploit. We evaluate our design on an FPGA platform using three well-known networks: VggNet-16 and AlexNet for ImageNet, and a VGG-like network for CIFAR-10. Results show that the out-of-order approach can prune 27%, 16%, and 42% of the operations for the three networks, respectively, without any accuracy loss, leading to at least 1.7x, 1.5x, and 2.1x speedups over state-of-the-art BNN implementations on FPGA/GPU/CPU. Since the pruning happens at inference time, no retraining or fine-tuning is needed. Although we demonstrate the design on an FPGA platform, this is only to showcase the method: the approach does not rely on any FPGA-specific features and can thus be adopted by other devices as well.
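To make the early-termination idea concrete, the sketch below shows how a binarized neuron (XNOR followed by popcount and a threshold comparison) can stop evaluating edges once its binary output is already decided. This is a minimal software illustration of the pruning condition, not the authors' OoO hardware design; the function name, the 0/1 bit encoding, and the folded batch-norm threshold are assumptions made for the example.

```python
def bnn_neuron_early_exit(w_bits, x_bits, threshold):
    """Evaluate one binarized neuron with threshold-based early exit.

    w_bits, x_bits: lists of 0/1 bits encoding the binary weights and
    activations (1 means +1, 0 means -1).
    threshold: the popcount the neuron must reach to output 1
    (assumed here to be the comparison threshold folded from batch-norm).

    Returns (output_bit, edges_evaluated). Illustrative sketch only.
    """
    n = len(w_bits)
    pop = 0  # running popcount of XNOR matches
    for i in range(n):
        pop += 1 if w_bits[i] == x_bits[i] else 0  # XNOR then accumulate
        remaining = n - (i + 1)
        # Even if every remaining edge matched, the threshold is unreachable:
        # the output is already 0, so the rest of the edges can be pruned.
        if pop + remaining < threshold:
            return 0, i + 1
        # The threshold is already met: the output is 1 regardless of the rest.
        if pop >= threshold:
            return 1, i + 1
    return int(pop >= threshold), n
```

In this view, the pruning opportunities are irregular because how early a neuron's output is decided depends on the runtime data, which is why the paper frames exploiting them as analogous to out-of-order execution rather than static pruning.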