Nothing Special   »   [go: up one dir, main page]

skip to main content
article

Matrix-based streamization approach for improving locality and parallelism on FT64 stream processor

Published: 01 February 2009 Publication History

Abstract

FT64 is the first 64-bit stream processor designed for scientific computing. It is critical to exploit optimizing streamization approaches for scientific applications on FT64 due to the inefficiency of direct streamization approach. In this paper, we propose a novel matrix-based streamization approach for improving locality and parallelism of scientific applications on FT64. First, a Data&Computation Matrix is built to abstract the relationship between loops and arrays of the original programs, and it is helpful for formulating the streamization problem. Second, three key techniques for optimizing streamization approach are proposed based on the transformations of the matrix, i.e., coarse-grained program transformations, fine-grained program transformations, and stream organization optimizations. Finally, we apply our approach to ten typical scientific application kernels on FT64. The experimental results show that the matrix-based streamization approach achieves an average speedup of 2.76 over the direct streamization approach, and performs equally to or better than the corresponding Fortran programs on Itanium 2 except CG. It is certain that the matrix-based streamization approach is a promising and practical solution to efficiently exploit the tremendous potential of FT64.

References

[1]
Kapasi UJ, Rixner S, Dally WJ et al (2003) Programmable stream processors. IEEE Comput 54-62.
[2]
Khailany B (2003) The VLSI implementation and evaluation of area-and energy-efficient streaming media processors. Ph.D. thesis, Stanford University.
[3]
Taylor M, Kim J, Miller J et al (2002) The RAW microprocessor: a computational fabric for software circuits and general purpose programs. IEEE Micro 22(2):25-35.
[4]
Burger D, Keckler SW, McKinley KS et al (2004) Scaling to the end of silicon with EDGE architectures. Computer 37(7):44-55.
[5]
Gordon MI, Thies W, Amarasinghe S (2006) Exploiting coarse-grained task, data, and pipeline parallelism in stream programs. In: Proceedings of ASPLOS'06, California, USA.
[6]
Andrew AL, Thies W, Amarasinghe S (2003) Linear analysis and optimization of stream programs. In: Proceedings of the SIGPLAN'03 conference on programming language design and implementation, San Diego, CA.
[7]
Owens JD, Rixner S et al (2002) Media processing applications on the imagine stream processor. In: Proceedings of the 2002 international conference on computer design.
[8]
Yang X, Yan X, Xing Z et al (2007) A 64-bit stream processor architecture for scientific applications. In: ISCA'07: Proceedings of the 34th annual international symposium on computer architecture. ACM Press, New York, pp 210-219.
[9]
Amarasinghe S et al (2003) Stream languages and programming models. In: Proceedings of the international conference on parallel architectures and compilation techniques 2003.
[10]
Mattson P (2002) A programming system for the imagine media processor. Ph.D. thesis, Dept of Electrical Engineering, Stanford University.
[11]
Du J, Yang X et al (2007) Architecture-based optimization for mapping scientific applications to imagine. In: ISPA'07: Proceedings of the 2007 international symposium on parallel and distributed processing with applications, Ontario, Canada.
[12]
Das A, Dally WJ, Mattson P (2006) Compiling for stream processing. In: PACT'06: Proceedings of the 15th international conference on parallel architectures and compilation techniques. ACM Press, New York, pp 33-42.
[13]
Johnsson O, Stenemo M, ul-Abdin Z (2005) Programming & implementation of streaming applications. Master's thesis, Computer and Electrical Engineering Halmstad University.
[14]
Ahn JH, Dally WJ et al (2004). Evaluating the imagine stream architecture. In: Proceedings of the annual international symposium on computer architecture 2004.
[15]
Jayasena NS (2005) Memory hierarchy design for stream computing. Ph.D. thesis, Stanford University.
[16]
Wolf ME, Lam M (1991) A loop transformation theory and an algorithm to maximize parallelism. IEEE Trans Parallel Distrib Syst 2(4):452-471.
[17]
Kuck D, Kuhn R et al (1981) Dependence graphs and compiler optimizations. In: Conference record of the eighth annual ACM symposium on the principles of programming languages, Williamsburg, VA, January 1981.
[18]
Wolfe MJ (1996) High performance compilers for parallel computing. Addison-Wesley, Reading.
[19]
Du J, Yang X et al (2006) Scientific computing applications on the imagine stream processor. In: Proceedings of the 11th Asia-pacific computer systems architecture conference, Shanghai, China.
[20]
Fan Z, Qiu F et al (2004) Gpu cluster for high performance computing. In: Proceedings of supercomputing conference 2004.
[21]
Harris MJ, Baxter WV et al (2003) Simulation of cloud dynamics on graphics hardware. In: Proceedings of the ACM SIGGRAPH/EUROGRAPHICS conference on graphics hardware, Switzerland, pp 92-101.
[22]
Bolz J, Farmer I, Grinspun E, Schröder P (2003) Sparse matrix solvers on the Gpu: conjugate gradients and multigrid. ACM Trans Graph 22(3):917-924.
[23]
Dally WJ, Hanrahan P et al (2003) Merrimac: supercomputing with streams. In: Proceedings of supercomputing conference 2003.
[24]
Erez M, Ahn J et al (2004) Analysis and performance results of a molecular modeling application on Merrimac. In: Proceedings of supercomputing conference 2004.
[25]
Erez M (2007) Merrimac--high-performance, highly-efficient scientific computing with streams. Ph.D. thesis, Dept of Electrical Engineering, Stanford University.
[26]
Erez M, Ahn J et al (2007) Executing irregular scientific applications on stream architectures. In: (ICS'07): Proceedings of the 21th ACM international conference on supercomputing.
[27]
Griem G, Oliker L (2003) Transitive closure on the imagine stream processor. In: Proceedings of the 5th workshop on media and streaming processors, San Diego, CA.
[28]
Ahn J, Dally WJ, Erez M (2007) Tradeoff between data-, instruction-, and thread-level parallelism in stream processors. In: (ICS'07): Proceedings of the 21th ACM international conference on supercomputing.
[29]
Sermulins J, Thies W et al (2005) Cache aware optimization of stream programs. In: Proceedings of LCTES'05, Chicago, Illinois, USA.
[30]
Wolf M, Lam M (1991) A data locality optimizing algorithm. In: Proceedings of ACM SIGPLAN'91 conference on programming language design and implementation, Ontario, Canada, pp 30-44.
[31]
McKinley K, Carr S, Tseng CW (1996) Improving data locality with loop transformations. ACM Trans Program Lang Syst.
[32]
Li W (1993) Compiling for NUMA parallel machines. Ph.D. thesis, Cornell University.
[33]
Kandemir M, Choudhary A et al (1999) A linear algebra framework for automatic determination of optimal data layouts. IEEE Trans Parallel Distrib Syst 10(2):115-135.
[34]
Cierniak M, Li W (1995) Unifying Data and control transformations for distributed shared memory machines. In: ACM SIGPLAN IPDPS, pp 205-217.
[35]
Kandemir M, Choudhary A et al (1998) Improving locality using loop and data transformations in an integrated framework. In: Proceedings of international symposium on microarchitecture, pp 285-297.
[36]
Kandemir M, Banerjee P et al (2001) Static and dynamic locality optimizations using integer linear programming. IEEE Trans Parallel Distrib Syst 12(9):922-940.
[37]
Kandemir M et al (1999) A graph based framework to detect optimal memory layouts for improving data locality. In: Proceedings of the 13th international parallel processing symposium, San Juan, Puerto Rico, pp 738-743.
[38]
O'Boyle M, Knijnenburg P (1996) Non-singular data transformations: definition, validity, applications. In: Proceedings of 6th workshop on compilers for parallel computers, pp 287-297.
[39]
Garcia J, Ayguade E et al (1996) Dynamic data distribution with control flow analysis. In: Proceedings of supercomputing conference 1996.

Cited By

View all
  • (2018)Optimizing modulo scheduling to achieve reuse and concurrency for stream processorsThe Journal of Supercomputing10.1007/s11227-010-0522-z59:3(1229-1251)Online publication date: 31-Dec-2018
  • (2018)An approximate method for filtering out data dependencies with a sufficiently large distance between memory referencesThe Journal of Supercomputing10.1007/s11227-009-0364-856:2(226-244)Online publication date: 31-Dec-2018
  • (2011)I/O streaming evaluation of batch queries for data-intensive computational turbulenceProceedings of 2011 International Conference for High Performance Computing, Networking, Storage and Analysis10.1145/2063384.2063423(1-10)Online publication date: 12-Nov-2011

Recommendations

Comments

Please enable JavaScript to view thecomments powered by Disqus.

Information & Contributors

Information

Published In

cover image The Journal of Supercomputing
The Journal of Supercomputing  Volume 47, Issue 2
February 2009
142 pages

Publisher

Kluwer Academic Publishers

United States

Publication History

Published: 01 February 2009

Author Tags

  1. D&C Matrix
  2. FT64
  3. Program transformation
  4. Stream organization
  5. Streamization

Qualifiers

  • Article

Contributors

Other Metrics

Bibliometrics & Citations

Bibliometrics

Article Metrics

  • Downloads (Last 12 months)0
  • Downloads (Last 6 weeks)0
Reflects downloads up to 12 Nov 2024

Other Metrics

Citations

Cited By

View all
  • (2018)Optimizing modulo scheduling to achieve reuse and concurrency for stream processorsThe Journal of Supercomputing10.1007/s11227-010-0522-z59:3(1229-1251)Online publication date: 31-Dec-2018
  • (2018)An approximate method for filtering out data dependencies with a sufficiently large distance between memory referencesThe Journal of Supercomputing10.1007/s11227-009-0364-856:2(226-244)Online publication date: 31-Dec-2018
  • (2011)I/O streaming evaluation of batch queries for data-intensive computational turbulenceProceedings of 2011 International Conference for High Performance Computing, Networking, Storage and Analysis10.1145/2063384.2063423(1-10)Online publication date: 12-Nov-2011

View Options

View options

Get Access

Login options

Media

Figures

Other

Tables

Share

Share

Share this Publication link

Share on social media