Programming and Runtime Support to Blaze FPGA Accelerator Deployment at Datacenter Scale

Published: 05 October 2016 Publication History

Abstract

With the end of CPU core scaling due to dark silicon limitations, customized accelerators on FPGAs have gained increased attention in modern datacenters because of their low power, high performance, and energy efficiency. As evidenced by Microsoft's FPGA deployment in its Bing search engine and Intel's $16.7 billion acquisition of Altera, integrating FPGAs into datacenters is considered one of the most promising approaches to sustaining future datacenter growth. However, it is quite challenging for existing big data computing systems, such as Apache Spark and Hadoop, to access the performance and energy benefits of FPGA accelerators.
In this paper, we design and implement Blaze to provide programming and runtime support for easy and efficient deployment of FPGA accelerators in datacenters. In particular, Blaze abstracts FPGA accelerators as a service (FaaS) and provides a set of clean programming APIs that let big data processing applications easily use those accelerators. The Blaze runtime implements an FaaS framework to efficiently share FPGA accelerators among multiple heterogeneous threads on a single node, and extends Hadoop YARN with accelerator-centric scheduling to efficiently share them among multiple computing tasks in the cluster. Experimental results using four representative big data applications demonstrate that Blaze greatly reduces the programming effort needed to access FPGA accelerators in systems like Apache Spark and YARN, and improves system throughput by 1.7× to 3× (and energy efficiency by 1.5× to 2.7×) compared to a conventional CPU-only cluster.
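
The node-level sharing idea in the abstract can be illustrated with a small sketch. This is not Blaze's API: the names `AcceleratorService`, `submit`, and `run_task` are hypothetical, a plain Python callable stands in for an FPGA kernel, and the CPU fallback for unregistered accelerators is an assumption added for illustration. The sketch only shows how one service can serialize requests from many application threads onto a single shared device.

```python
import queue
import threading
from concurrent.futures import Future


class AcceleratorService:
    """Hypothetical node-level service that owns one shared accelerator.

    Each "kernel" is an ordinary callable standing in for an FPGA kernel.
    """

    def __init__(self, kernels):
        # Map of accelerator id -> callable (assumed stand-in for hardware).
        self._kernels = dict(kernels)
        self._requests = queue.Queue()
        # A single worker thread models exclusive access to the device.
        self._worker = threading.Thread(target=self._serve, daemon=True)
        self._worker.start()

    def _serve(self):
        # Process queued requests one at a time, in arrival order.
        while True:
            item = self._requests.get()
            if item is None:  # sentinel pushed by shutdown()
                break
            acc_id, data, future = item
            try:
                future.set_result(self._kernels[acc_id](data))
            except Exception as exc:
                future.set_exception(exc)

    def submit(self, acc_id, data):
        """Queue a request; return a Future, or None if acc_id is unknown."""
        if acc_id not in self._kernels:
            return None
        future = Future()
        self._requests.put((acc_id, data, future))
        return future

    def shutdown(self):
        # Stop the worker after all previously queued requests are served.
        self._requests.put(None)
        self._worker.join()


def run_task(service, acc_id, data, cpu_fallback):
    """Use the shared accelerator if registered, else run on the CPU.

    The fallback path is an assumption of this sketch.
    """
    future = service.submit(acc_id, data)
    if future is None:
        return cpu_fallback(data)
    return future.result()
```

Several application threads can call `run_task` on the same `AcceleratorService` instance concurrently; the queue decides the order in which requests reach the device, and each caller blocks only on its own result.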

    Information & Contributors

    Information

    Published In

    SoCC '16: Proceedings of the Seventh ACM Symposium on Cloud Computing
    October 2016
    534 pages
    ISBN: 9781450345255
    DOI: 10.1145/2987550
    Permission to make digital or hard copies of all or part of this work for personal or classroom use is granted without fee provided that copies are not made or distributed for profit or commercial advantage and that copies bear this notice and the full citation on the first page. Copyrights for components of this work owned by others than the author(s) must be honored. Abstracting with credit is permitted. To copy otherwise, or republish, to post on servers or to redistribute to lists, requires prior specific permission and/or a fee. Request permissions from [email protected].

    Publisher

    Association for Computing Machinery

    New York, NY, United States

    Author Tags

    1. FPGA-as-a-service
    2. heterogeneous datacenter

    Qualifiers

    • Research-article
    • Research
    • Refereed limited

    Conference

    SoCC '16: ACM Symposium on Cloud Computing
    October 5-7, 2016
    Santa Clara, CA, USA

    Acceptance Rates

    SoCC '16 Paper Acceptance Rate 38 of 151 submissions, 25%;
    Overall Acceptance Rate 169 of 722 submissions, 23%
