Programming and Runtime Support to Blaze FPGA Accelerator Deployment at Datacenter Scale

Authors:

Matteo Interlandi,

Jason Cong

SoCC '16: Proceedings of the Seventh ACM Symposium on Cloud Computing

Pages 456 - 469

https://doi.org/10.1145/2987550.2987569

Published: 05 October 2016

Abstract

With the end of CPU core scaling due to dark silicon limitations, customized accelerators on FPGAs have gained increased attention in modern datacenters due to their lower power, high performance and energy efficiency. As evidenced by Microsoft's FPGA deployment in its Bing search engine and Intel's $16.7 billion acquisition of Altera, integrating FPGAs into datacenters is considered one of the most promising approaches to sustain future datacenter growth. However, it is quite challenging for existing big data computing systems, such as Apache Spark and Hadoop, to access the performance and energy benefits of FPGA accelerators.

In this paper we design and implement Blaze to provide programming and runtime support for enabling easy and efficient deployments of FPGA accelerators in datacenters. In particular, Blaze abstracts FPGA accelerators as a service (FaaS) and provides a set of clean programming APIs for big data processing applications to easily utilize those accelerators. Our Blaze runtime implements an FaaS framework to efficiently share FPGA accelerators among multiple heterogeneous threads on a single node, and extends Hadoop YARN with accelerator-centric scheduling to efficiently share them among multiple computing tasks in the cluster. Experimental results using four representative big data applications demonstrate that Blaze greatly reduces the programming effort needed to access FPGA accelerators in systems like Apache Spark and YARN, and improves the system throughput by 1.7× to 3× (and energy efficiency by 1.5× to 2.7×) compared to a conventional CPU-only cluster.
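To make the accelerator-as-a-service idea concrete, the following is a minimal Python sketch of a node-level service that several client threads share through one request queue. It is not Blaze's API: `AcceleratorService`, `register`, `submit`, and the kernel id `"vector_sum"` are hypothetical names, an ordinary Python callable stands in for an FPGA kernel, and the CPU fallback for unregistered kernels is an assumption made for illustration.

```python
import queue
import threading
from concurrent.futures import Future


class AcceleratorService:
    """Hypothetical node-level accelerator manager (illustration only)."""

    def __init__(self):
        self._kernels = {}              # kernel id -> callable standing in for an FPGA kernel
        self._lock = threading.Lock()   # guards the kernel registry
        self._requests = queue.Queue()  # requests from all client threads
        threading.Thread(target=self._serve, daemon=True).start()

    def register(self, kernel_id, kernel):
        """Make a kernel available to every client on this node."""
        with self._lock:
            self._kernels[kernel_id] = kernel

    def submit(self, kernel_id, data, cpu_fallback):
        """Enqueue a request and return a Future for its result."""
        future = Future()
        self._requests.put((kernel_id, data, cpu_fallback, future))
        return future

    def _serve(self):
        # A single worker serializes access to the (simulated) device.
        while True:
            kernel_id, data, cpu_fallback, future = self._requests.get()
            with self._lock:
                # Assumption: fall back to the CPU path if no kernel is registered.
                kernel = self._kernels.get(kernel_id, cpu_fallback)
            try:
                future.set_result(kernel(data))
            except Exception as exc:
                future.set_exception(exc)


service = AcceleratorService()
service.register("vector_sum", sum)  # stand-in for a real accelerator kernel


def client(values, results, index):
    # Each client thread submits work and blocks on its own Future.
    future = service.submit("vector_sum", values, cpu_fallback=sum)
    results[index] = future.result(timeout=5)


results = [None] * 4
threads = [
    threading.Thread(target=client, args=([i, i + 1], results, i))
    for i in range(4)
]
for t in threads:
    t.start()
for t in threads:
    t.join()
print(results)  # [1, 3, 5, 7]
```

The single worker thread is the simplest way to serialize access to one device; a scheduler that batches or reorders requests would replace the plain FIFO queue in this sketch.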

Index Terms

Programming and Runtime Support to Blaze FPGA Accelerator Deployment at Datacenter Scale
1. Computer systems organization
  1. Architectures
    1. Other architectures

Information & Contributors

Information

Published In

SoCC '16: Proceedings of the Seventh ACM Symposium on Cloud Computing

October 2016

534 pages

ISBN:9781450345255

DOI:10.1145/2987550

Copyright © 2016 ACM.

Permission to make digital or hard copies of all or part of this work for personal or classroom use is granted without fee provided that copies are not made or distributed for profit or commercial advantage and that copies bear this notice and the full citation on the first page. Copyrights for components of this work owned by others than the author(s) must be honored. Abstracting with credit is permitted. To copy otherwise, or republish, to post on servers or to redistribute to lists, requires prior specific permission and/or a fee. Request permissions from permissions@acm.org.

Publisher

Association for Computing Machinery

New York, NY, United States

Publication History

Published: 05 October 2016

Qualifiers

Research-article
Research
Refereed limited

Conference

SoCC '16

SoCC '16: ACM Symposium on Cloud Computing

October 5 - 7, 2016

Santa Clara, CA, USA

Acceptance Rates

SoCC '16 Paper Acceptance Rate 38 of 151 submissions, 25%;

Overall Acceptance Rate 169 of 722 submissions, 23%

Bibliometrics & Citations

Bibliometrics

Article Metrics

Total Citations: 76
Total Downloads: 1,447
Downloads (Last 12 months): 22
Downloads (Last 6 weeks): 2

Reflects downloads up to 12 Nov 2024

Citations

Cited By

Bai Y, Sohrabizadeh A, Qin Z, Hu Z, Sun Y, Cong J, Oh A, Naumann T, Globerson A, Saenko K, Hardt M, Levine S (2023). Towards a comprehensive benchmark for high-level synthesis targeted to FPGAs. Proceedings of the 37th International Conference on Neural Information Processing Systems, pp. 45288-45299. DOI: 10.5555/3666122.3668084. Online publication date: 10-Dec-2023.
https://dl.acm.org/doi/10.5555/3666122.3668084
Xekalaki M, Fumero J, Stratikopoulos A, Doka K, Katsakioris C, Bitsakos C, Koziris N, Kotselidis C (2023). Enabling Transparent Acceleration of Big Data Frameworks Using Heterogeneous Hardware. Proceedings of the VLDB Endowment 15:13, pp. 3869-3882. DOI: 10.14778/3565838.3565842. Online publication date: 20-Jan-2023.
https://dl.acm.org/doi/10.14778/3565838.3565842
Qing J, Leong P, Yeung R (2023). Performance Analysis and Optimal Design of BATS Code: A Hardware Perspective. IEEE Transactions on Vehicular Technology 72:8, pp. 9733-9745. DOI: 10.1109/TVT.2023.3261359. Online publication date: Aug-2023.
https://doi.org/10.1109/TVT.2023.3261359
Nasiri H, Darjani A, Kavand N, Goudarzi M (2023). H-Storm: A Hybrid CPU-FPGA Architecture to Accelerate Apache Storm. Journal of Grid Computing 21:4. DOI: 10.1007/s10723-023-09692-9. Online publication date: 7-Nov-2023.
https://doi.org/10.1007/s10723-023-09692-9
Takano K, Oda T, Ozaki R, Uejima A, Kohata M (2022). Implementation and Evaluation of Distributed Processing on a PC-FPGA Hybrid System (PC-FPGA複合システムにおける分散処理の実現と評価). IEEJ Transactions on Electronics, Information and Systems 142:11, pp. 1199-1207. DOI: 10.1541/ieejeiss.142.1199. Online publication date: 1-Nov-2022.
https://doi.org/10.1541/ieejeiss.142.1199
Al Qassem L, Stouraitis T, Damiani E, Elfadel I (2022). FPGAaaS: A Survey of Infrastructures and Systems. IEEE Transactions on Services Computing 15:2, pp. 1143-1156. DOI: 10.1109/TSC.2020.2976012. Online publication date: 1-Mar-2022.
https://doi.org/10.1109/TSC.2020.2976012
Minhas U, Woods R, Nikolopoulos D, Karakonstantis G (2022). Efficient, Dynamic Multi-Task Execution on FPGA-Based Computing Systems. IEEE Transactions on Parallel and Distributed Systems 33:3, pp. 710-722. DOI: 10.1109/TPDS.2021.3101153. Online publication date: 1-Mar-2022.
https://doi.org/10.1109/TPDS.2021.3101153
Wang X, Niu Y, Liu F, Xu Z (2022). When FPGA Meets Cloud: A First Look at Performance. IEEE Transactions on Cloud Computing 10:2, pp. 1344-1357. DOI: 10.1109/TCC.2020.2992548. Online publication date: 1-Apr-2022.
https://doi.org/10.1109/TCC.2020.2992548
Mehrabi A, Sorin D, Lee B (2022). Spatiotemporal Strategies for Long-Term FPGA Resource Management. 2022 IEEE International Symposium on Performance Analysis of Systems and Software (ISPASS), pp. 198-209. DOI: 10.1109/ISPASS55109.2022.00026. Online publication date: May-2022.
https://doi.org/10.1109/ISPASS55109.2022.00026
Bhambay S, Mukhopadhyay A (2022). Asymptotic optimality of speed-aware JSQ for heterogeneous service systems. Performance Evaluation 157:C. DOI: 10.1016/j.peva.2022.102320. Online publication date: 1-Oct-2022.
https://dl.acm.org/doi/10.1016/j.peva.2022.102320