article

Optimizing dataflow applications on heterogeneous environments

Authors:

George Teodoro,

Timothy D. Hartley,

Umit V. Catalyurek,

Renato Ferreira

Cluster Computing, Volume 15, Issue 2

Pages 125 - 144

https://doi.org/10.1007/s10586-010-0151-6

Published: 01 June 2012

Abstract

The increases in multi-core processor parallelism and in the flexibility of many-core accelerator processors, such as GPUs, have turned traditional SMP systems into hierarchical, heterogeneous computing environments. Fully exploiting these improvements in parallel system design remains an open problem. Moreover, most of the current tools for the development of parallel applications for hierarchical systems concentrate on the use of only a single processor type (e.g., accelerators) and do not coordinate several heterogeneous processors. Here, we show that making use of all of the heterogeneous computing resources can significantly improve application performance. Our approach, which consists of optimizing applications at run-time by efficiently coordinating application task execution on all available processing units, is evaluated in the context of replicated dataflow applications. The proposed techniques were developed and implemented in an integrated run-time system targeting both intra- and inter-node parallelism. The experimental results with a real-world complex biomedical application show that our approach nearly doubles the performance of the GPU-only implementation on a distributed heterogeneous accelerator cluster.

Optimizing dataflow applications on heterogeneous environments
1. General and reference
  1. Cross-computing tools and techniques
2. Social and professional topics
  1. Professional topics
    1. Computing profession

Published In

Cluster Computing Volume 15, Issue 2

June 2012

119 pages

ISSN: 1386-7857

Copyright © 2012 Springer Science+Business Media, LLC.

Publisher

Kluwer Academic Publishers

United States

Cited By

Teodoro G, Kurc T, Andrade G, Kong J, Ferreira R, Saltz J (2017) Application performance analysis and efficient execution on systems with multi-core CPUs, GPUs and MICs. International Journal of High Performance Computing Applications 31:1 (32-51). DOI: 10.1177/1094342015594519. Online publication date: 1-Jan-2017
https://dl.acm.org/doi/10.1177/1094342015594519
Kougka G, Gounaris A, Tsichlas K (2015) Practical algorithms for execution engine selection in data flows. Future Generation Computer Systems 45:C (133-148). DOI: 10.1016/j.future.2014.11.011. Online publication date: 1-Apr-2015
https://dl.acm.org/doi/10.1016/j.future.2014.11.011
Huang Y, Chieu B (2015) Architecture for video streaming application on heterogeneous platform. Multimedia Tools and Applications 74:13 (4927-4945). DOI: 10.1007/s11042-014-1856-y. Online publication date: 1-Jun-2015
https://dl.acm.org/doi/10.1007/s11042-014-1856-y
Pereira A, Ramos L, Góes L (2015) PSkel. Concurrency and Computation: Practice & Experience 27:17 (4938-4953). DOI: 10.1002/cpe.3479. Online publication date: 10-Dec-2015
https://dl.acm.org/doi/10.1002/cpe.3479
Aji A, Teodoro G, Wang F (2014) Haggis. Proceedings of the 3rd ACM SIGSPATIAL International Workshop on Analytics for Big Geospatial Data (15-20). DOI: 10.1145/2676536.2676539. Online publication date: 4-Nov-2014
https://dl.acm.org/doi/10.1145/2676536.2676539
Teodoro G, Pan T, Kurc T, Kong J, Cooper L, Klasky S, Saltz J (2014) Region templates. Parallel Computing 40:10 (589-610). DOI: 10.1016/j.parco.2014.09.003. Online publication date: 1-Dec-2014
https://dl.acm.org/doi/10.1016/j.parco.2014.09.003
