article

Matrix-based streamization approach for improving locality and parallelism on FT64 stream processor

Authors:

Yu DengAuthors Info & Claims

The Journal of Supercomputing, Volume 47, Issue 2

Pages 171 - 197

https://doi.org/10.1007/s11227-008-0186-0

Published: 01 February 2009 Publication History

Abstract

FT64 is the first 64-bit stream processor designed for scientific computing. It is critical to exploit optimizing streamization approaches for scientific applications on FT64 due to the inefficiency of direct streamization approach. In this paper, we propose a novel matrix-based streamization approach for improving locality and parallelism of scientific applications on FT64. First, a Data&Computation Matrix is built to abstract the relationship between loops and arrays of the original programs, and it is helpful for formulating the streamization problem. Second, three key techniques for optimizing streamization approach are proposed based on the transformations of the matrix, i.e., coarse-grained program transformations, fine-grained program transformations, and stream organization optimizations. Finally, we apply our approach to ten typical scientific application kernels on FT64. The experimental results show that the matrix-based streamization approach achieves an average speedup of 2.76 over the direct streamization approach, and performs equally to or better than the corresponding Fortran programs on Itanium 2 except CG. It is certain that the matrix-based streamization approach is a promising and practical solution to efficiently exploit the tremendous potential of FT64.

References

[1]

Kapasi UJ, Rixner S, Dally WJ et al (2003) Programmable stream processors. IEEE Comput 54-62.

[2]

Khailany B (2003) The VLSI implementation and evaluation of area-and energy-efficient streaming media processors. Ph.D. thesis, Stanford University.

[3]

Taylor M, Kim J, Miller J et al (2002) The RAW microprocessor: a computational fabric for software circuits and general purpose programs. IEEE Micro 22(2):25-35.

Digital Library

[4]

Burger D, Keckler SW, McKinley KS et al (2004) Scaling to the end of silicon with EDGE architectures. Computer 37(7):44-55.

Digital Library

[5]

Gordon MI, Thies W, Amarasinghe S (2006) Exploiting coarse-grained task, data, and pipeline parallelism in stream programs. In: Proceedings of ASPLOS'06, California, USA.

[6]

Andrew AL, Thies W, Amarasinghe S (2003) Linear analysis and optimization of stream programs. In: Proceedings of the SIGPLAN'03 conference on programming language design and implementation, San Diego, CA.

[7]

Owens JD, Rixner S et al (2002) Media processing applications on the imagine stream processor. In: Proceedings of the 2002 international conference on computer design.

[8]

Yang X, Yan X, Xing Z et al (2007) A 64-bit stream processor architecture for scientific applications. In: ISCA'07: Proceedings of the 34th annual international symposium on computer architecture. ACM Press, New York, pp 210-219.

[9]

Amarasinghe S et al (2003) Stream languages and programming models. In: Proceedings of the international conference on parallel architectures and compilation techniques 2003.

[10]

Mattson P (2002) A programming system for the imagine media processor. Ph.D. thesis, Dept of Electrical Engineering, Stanford University.

[11]

Du J, Yang X et al (2007) Architecture-based optimization for mapping scientific applications to imagine. In: ISPA'07: Proceedings of the 2007 international symposium on parallel and distributed processing with applications, Ontario, Canada.

[12]

Das A, Dally WJ, Mattson P (2006) Compiling for stream processing. In: PACT'06: Proceedings of the 15th international conference on parallel architectures and compilation techniques. ACM Press, New York, pp 33-42.

[13]

Johnsson O, Stenemo M, ul-Abdin Z (2005) Programming & implementation of streaming applications. Master's thesis, Computer and Electrical Engineering Halmstad University.

[14]

Ahn JH, Dally WJ et al (2004). Evaluating the imagine stream architecture. In: Proceedings of the annual international symposium on computer architecture 2004.

Digital Library

[15]

Jayasena NS (2005) Memory hierarchy design for stream computing. Ph.D. thesis, Stanford University.

[16]

Wolf ME, Lam M (1991) A loop transformation theory and an algorithm to maximize parallelism. IEEE Trans Parallel Distrib Syst 2(4):452-471.

Digital Library

[17]

Kuck D, Kuhn R et al (1981) Dependence graphs and compiler optimizations. In: Conference record of the eighth annual ACM symposium on the principles of programming languages, Williamsburg, VA, January 1981.

[18]

Wolfe MJ (1996) High performance compilers for parallel computing. Addison-Wesley, Reading.

[19]

Du J, Yang X et al (2006) Scientific computing applications on the imagine stream processor. In: Proceedings of the 11th Asia-pacific computer systems architecture conference, Shanghai, China.

[20]

Fan Z, Qiu F et al (2004) Gpu cluster for high performance computing. In: Proceedings of supercomputing conference 2004.

[21]

Harris MJ, Baxter WV et al (2003) Simulation of cloud dynamics on graphics hardware. In: Proceedings of the ACM SIGGRAPH/EUROGRAPHICS conference on graphics hardware, Switzerland, pp 92-101.

[22]

Bolz J, Farmer I, Grinspun E, Schröder P (2003) Sparse matrix solvers on the Gpu: conjugate gradients and multigrid. ACM Trans Graph 22(3):917-924.

Digital Library

[23]

Dally WJ, Hanrahan P et al (2003) Merrimac: supercomputing with streams. In: Proceedings of supercomputing conference 2003.

[24]

Erez M, Ahn J et al (2004) Analysis and performance results of a molecular modeling application on Merrimac. In: Proceedings of supercomputing conference 2004.

[25]

Erez M (2007) Merrimac--high-performance, highly-efficient scientific computing with streams. Ph.D. thesis, Dept of Electrical Engineering, Stanford University.

[26]

Erez M, Ahn J et al (2007) Executing irregular scientific applications on stream architectures. In: (ICS'07): Proceedings of the 21th ACM international conference on supercomputing.

[27]

Griem G, Oliker L (2003) Transitive closure on the imagine stream processor. In: Proceedings of the 5th workshop on media and streaming processors, San Diego, CA.

[28]

Ahn J, Dally WJ, Erez M (2007) Tradeoff between data-, instruction-, and thread-level parallelism in stream processors. In: (ICS'07): Proceedings of the 21th ACM international conference on supercomputing.

[29]

Sermulins J, Thies W et al (2005) Cache aware optimization of stream programs. In: Proceedings of LCTES'05, Chicago, Illinois, USA.

[30]

Wolf M, Lam M (1991) A data locality optimizing algorithm. In: Proceedings of ACM SIGPLAN'91 conference on programming language design and implementation, Ontario, Canada, pp 30-44.

[31]

McKinley K, Carr S, Tseng CW (1996) Improving data locality with loop transformations. ACM Trans Program Lang Syst.

[32]

Li W (1993) Compiling for NUMA parallel machines. Ph.D. thesis, Cornell University.

[33]

Kandemir M, Choudhary A et al (1999) A linear algebra framework for automatic determination of optimal data layouts. IEEE Trans Parallel Distrib Syst 10(2):115-135.

[34]

Cierniak M, Li W (1995) Unifying Data and control transformations for distributed shared memory machines. In: ACM SIGPLAN IPDPS, pp 205-217.

[35]

Kandemir M, Choudhary A et al (1998) Improving locality using loop and data transformations in an integrated framework. In: Proceedings of international symposium on microarchitecture, pp 285-297.

[36]

Kandemir M, Banerjee P et al (2001) Static and dynamic locality optimizations using integer linear programming. IEEE Trans Parallel Distrib Syst 12(9):922-940.

Digital Library

[37]

Kandemir M et al (1999) A graph based framework to detect optimal memory layouts for improving data locality. In: Proceedings of the 13th international parallel processing symposium, San Juan, Puerto Rico, pp 738-743.

[38]

O'Boyle M, Knijnenburg P (1996) Non-singular data transformations: definition, validity, applications. In: Proceedings of 6th workshop on compilers for parallel computers, pp 287-297.

[39]

Garcia J, Ayguade E et al (1996) Dynamic data distribution with control flow analysis. In: Proceedings of supercomputing conference 1996.

Cited By

Wang LXue JYang X(2018)Optimizing modulo scheduling to achieve reuse and concurrency for stream processorsThe Journal of Supercomputing10.1007/s11227-010-0522-z59:3(1229-1251)Online publication date: 31-Dec-2018
https://dl.acm.org/doi/10.1007/s11227-010-0522-z
Bulić PDobravec T(2018)An approximate method for filtering out data dependencies with a sufficiently large distance between memory referencesThe Journal of Supercomputing10.1007/s11227-009-0364-856:2(226-244)Online publication date: 31-Dec-2018
https://dl.acm.org/doi/10.1007/s11227-009-0364-8
Kanov KPerlman EBurns RAhmad YSzalay ALathrop SCosta JKramer W(2011)I/O streaming evaluation of batch queries for data-intensive computational turbulenceProceedings of 2011 International Conference for High Performance Computing, Networking, Storage and Analysis10.1145/2063384.2063423(1-10)Online publication date: 12-Nov-2011
https://dl.acm.org/doi/10.1145/2063384.2063423

Index Terms

Recommendations

Implementation and Evaluation of Matrix-matrix Multiplication on FT64 Stream Processor
ICCIT '07: Proceedings of the 2007 International Conference on Convergence Information Technology

The FT64 stream processor is a 64-bit stream processor for scientific computing. It is necessary to research efficient implementation of scientific applications on FT64 to exploit the powerful ability. Matrix-matrix multiplication is an important kernel ...
Laplace transformation on the FT64 stream processor
ACSAC'07: Proceedings of the 12th Asia-Pacific conference on Advances in Computer Systems Architecture

The stream architecture is one of the emerging architectures that address the memory-wall problem of modern processors. While it is successful in the domain of multimedia, its efficiency to scientific applications is increasingly concerned. This paper ...
Implementation and Evaluation of Specific Data-Intensive Scientific Applications on the FT64 Stream Processor
CIT '07: Proceedings of the 7th IEEE International Conference on Computer and Information Technology

The stream architecture is one of the emerging architectures that address the memory-wall problem of modern processors. While it is successful in the domain of multimedia, its efficiency to scientific applications is increasingly concerned. This paper ...

Comments

Please enable JavaScript to view thecomments powered by Disqus.

Information & Contributors

Information

Published In

cover image The Journal of Supercomputing

The Journal of Supercomputing Volume 47, Issue 2

February 2009

142 pages

ISSN:0920-8542

Issue’s Table of Contents

Copyright © Copyright © 2009 Springer Science+Business Media, LLC.

Publisher

Kluwer Academic Publishers

United States

Publication History

Published: 01 February 2009

Author Tags

Qualifiers

Article

Contributors

Other Metrics

View Article Metrics

Bibliometrics & Citations

Bibliometrics

Article Metrics

3
Total Citations
View Citations
0
Total Downloads

Downloads (Last 12 months)0
Downloads (Last 6 weeks)0

Reflects downloads up to 12 Nov 2024

Other Metrics

View Author Metrics

Citations

Cited By

Wang LXue JYang X(2018)Optimizing modulo scheduling to achieve reuse and concurrency for stream processorsThe Journal of Supercomputing10.1007/s11227-010-0522-z59:3(1229-1251)Online publication date: 31-Dec-2018
https://dl.acm.org/doi/10.1007/s11227-010-0522-z
Bulić PDobravec T(2018)An approximate method for filtering out data dependencies with a sufficiently large distance between memory referencesThe Journal of Supercomputing10.1007/s11227-009-0364-856:2(226-244)Online publication date: 31-Dec-2018
https://dl.acm.org/doi/10.1007/s11227-009-0364-8
Kanov KPerlman EBurns RAhmad YSzalay ALathrop SCosta JKramer W(2011)I/O streaming evaluation of batch queries for data-intensive computational turbulenceProceedings of 2011 International Conference for High Performance Computing, Networking, Storage and Analysis10.1145/2063384.2063423(1-10)Online publication date: 12-Nov-2011
https://dl.acm.org/doi/10.1145/2063384.2063423

View Options

View options

Get Access

Login options

Check if you have access through your login credentials or your institution to get full access on this article.

Full Access

Get this Publication

Media

Figures

Other

Tables

View Issue’s Table of Contents