Optimizing dataflow applications on heterogeneous environments

Published: 01 June 2012

Abstract

Growing multi-core processor parallelism and increasingly flexible many-core accelerator processors, such as GPUs, have turned traditional SMP systems into hierarchical, heterogeneous computing environments. Fully exploiting these improvements in parallel system design remains an open problem. Moreover, most current tools for developing parallel applications on hierarchical systems target a single processor type (e.g., accelerators) and do not coordinate several heterogeneous processors. Here, we show that using all of the heterogeneous computing resources can significantly improve application performance. Our approach optimizes applications at run-time by efficiently coordinating application task execution across all available processing units, and we evaluate it in the context of replicated dataflow applications. The proposed techniques were developed and implemented in an integrated run-time system targeting both intra- and inter-node parallelism. Experimental results with a complex real-world biomedical application show that our approach nearly doubles the performance of the GPU-only implementation on a distributed heterogeneous accelerator cluster.
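To make the idea of coordinating task execution across all processing units concrete, the following is a minimal sketch, not the run-time system described in the article. It assumes a simple demand-driven policy in which whichever device becomes idle first takes the next task, and it simulates that policy deterministically instead of launching real CPU or GPU work. The function name `simulate_makespan`, the device names, and the per-task costs are all hypothetical and chosen only for illustration.

```python
import heapq


def simulate_makespan(num_tasks, workers):
    """Simulate demand-driven scheduling of identical, independent tasks.

    `workers` maps a device name to its time per task (hypothetical costs).
    Whichever device becomes idle first takes the next task.
    Returns (makespan, tasks_completed_per_device).
    """
    # Min-heap of (time at which the device is next idle, device name).
    idle_at = [(0.0, name) for name in sorted(workers)]
    heapq.heapify(idle_at)
    done = {name: 0 for name in workers}
    makespan = 0.0
    for _ in range(num_tasks):
        now, name = heapq.heappop(idle_at)   # earliest idle device
        finish = now + workers[name]         # it runs one task to completion
        done[name] += 1
        makespan = max(makespan, finish)
        heapq.heappush(idle_at, (finish, name))
    return makespan, done


if __name__ == "__main__":
    # Assumed costs: one GPU at 1 time unit per task, four CPU cores at 4 each.
    gpu_only = {"gpu": 1.0}
    all_devices = {"gpu": 1.0, "cpu0": 4.0, "cpu1": 4.0, "cpu2": 4.0, "cpu3": 4.0}
    for label, devices in (("GPU only", gpu_only), ("CPU + GPU", all_devices)):
        makespan, done = simulate_makespan(100, devices)
        print(label, makespan, done)
```

With these made-up costs, the four CPU cores together match the throughput of the GPU, so the combined configuration finishes in roughly half the time of the GPU alone. The last few tasks land on the slow CPU cores and stretch the tail of the run, which hints at why a run-time system may want a smarter policy than first-idle, first-served.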

Published In

Cluster Computing, Volume 15, Issue 2
June 2012
119 pages

Publisher

Kluwer Academic Publishers

United States

Author Tags

  1. Filter-stream
  2. GPGPU
  3. Run-time optimizations

Qualifiers

  • Article
