Nothing Special   »   [go: up one dir, main page]

skip to main content
10.1145/781131.781165acmconferencesArticle/Chapter ViewAbstractPublication PagespldiConference Proceedingsconference-collections
Article

Region-based hierarchical operation partitioning for multicluster processors

Published: 09 May 2003 Publication History

Abstract

Clustered architectures are a solution to the bottleneck of centralized register files in superscalar and VLIW processors. The main challenge associated with clustered architectures is compiler support to effectively partition operations across the available resources on each cluster. In this work, we present a novel technique for clustering operations based on graph partitioning methods. Our approach incorporates new methods of assigning weights to nodes and edges within the dataflow graph to guide the partitioner. Nodes are assigned weights to reflect their resource usage within a cluster, while a slack distribution method intelligently assigns weights to edges to reflect the cost of inserting moves across clusters. A multilevel graph partitioning algorithm, which globally divides a dataflow graph into multiple parts in a hierarchical manner, uses these weights to efficiently generate estimates for the quality of partitions. We found that our algorithm was able to achieve an average of 20% improvement in DSP kernels and 5% improvement in SPECint2000 for a four-cluster architecture.

References

[1]
A. Aletà, J. Codina, J. Sánchez, and A. González. Graph-partitioning based instruction scheduling for clustered processors. In Proceedings of the 34th Annual International Symposium on Microarchitecture, Dec. 2001.
[2]
A. Aletà, J. Codina, J. Sánchez, A. González, and D. Kaeli. Exploiting pseudo-schedules to guide data dependence graph partitioning. In Proceedings of the 2002 International Conference on Parallel Architectures and Compilation Techniques, Sept. 2002.
[3]
A. Capitanio, N. Dutt, and A. Nicolau. Partitioned register files for VLIWs: A preliminary analysis of tradeoffs. In Proceedings of the 25th Annual International Symposium on Microarchitecture, pages 103--114, Dec. 1992.
[4]
J. Codina, J. Sánchez, and A. González. URACAM: A unified register allocation, cluster assignment and modulo scheduling approach. In Proceedings of the 34th Annual International Symposium on Microarchitecture, Dec. 2001.
[5]
G. Desoli. Instruction assignment for clustered VLIW DSP compilers: A new approach. Technical Report HPL-98-13, Hewlett-Packard Laboratories, Feb. 1998.
[6]
J. Ellis. Bulldog: A Compiler for VLIW Architectures. MIT Press, Cambridge, MA, 1985.
[7]
P. Faraboschi, G. Desoli, and J. Fisher. Clustered instruction-level parallel processors. Technical Report HPL-98-204, Hewlett-Packard Laboratories, Dec. 1998.
[8]
K. Farkas, P. Chow, N. Jouppi, and Z. Vranesic. The multicluster architecture: Reducing cycle time through partitioning. In Proceedings of the 30th Annual International Symposium on Microarchitecture, Dec. 1997.
[9]
B. Fields, R. Bodík, and M. D. Hill. Slack: Maximizing performance under technological constraints. In Proceedings of the 29th Annual International Symposium on Computer Architecture, May 2002.
[10]
J. Fisher. Very long instruction word architectures and the ELI-52. In Proceedings of the 10th Annual International Symposium on Computer Architecture, pages 140--150, June 13--17, 1983.
[11]
R. Hank, W. Hwu, and B. Rau. Region-based compilation: An introduction and motivation. In Proceedings of the 28th Annual International Symposium on Microarchitecture, pages 158--168, Nov. 1995.
[12]
B. Hendrickson and R. Leland. The Chaco User's Guide. Sandia National Laboratories, July 1995.
[13]
J. Hiser, S. Carr, and P. Sweany. Global register partitioning. In Proceedings of the 2000 International Conference on Parallel Architectures and Compilation Techniques, pages 13--23, Oct. 2000.
[14]
K. Kailas, K. Ebcioglu, and A. Agrawala. CARS: A new code generation framework for clustered ILP processors. In Proceeding of the 2001 International Conference on High Performance Computer Architecture, pages 133--142, Feb. 2001.
[15]
G. Karypis and V. Kumar. Metis: A Software Package for Partitioning Unstructured Graphs, Partitioning Meshes and Computing Fill-Reducing Orderings of Sparse Matrices. University of Minnesota, Sept. 1998.
[16]
G. Krishnamurthy, E. Granston, and E. Stotzer. Affinity-based cluster assignment for unrolled loops. In Proceedings of the 2002 International Conference on Supercomputing, pages 107--116, June 2002.
[17]
V. Lapinskii, M. Jacome, and G. de~Veciana. High-quality operation binding for clustered VLIW datapaths. In Proceedings of the 2001 Design Automation Conference, June 2001.
[18]
W. Lee, D. Puppin, S. Swenson, and S. Amarasinghe. Convergent scheduling. In Proceedings of the 35th Annual International Symposium on Microarchitecture, Nov. 2002.
[19]
R. Leupers. Instruction scheduling for clustered VLIW DSPs. In Proceedings of the 2000 International Conference on Parallel Architectures and Compilation Techniques, Oct. 2000.
[20]
J. Liou and M. Palis. A new heuristic for scheduling parallel programs on multiprocessor. In Proceedings of the 1998 International Conference on Parallel Architectures and Compilation Techniques, pages 358--365, Oct. 1998.
[21]
P. Lowney et~al. The Multiflow Trace Scheduling compiler. The Journal of Supercomputing, 7(1-2):51--142, 1993.
[22]
E. Nystrom and A. E. Eichenberger. Effective cluster assignment for modulo scheduling. In Proceedings of the 31th Annual International Symposium on Microarchitecture, pages 103--114, Nov. 1998.
[23]
E. Özer, S. Banerjia, and T. Conte. Unified assign and schedule: A new approach to scheduling for clustered register file microarchitectures. In Proceedings of the 31th Annual International Symposium on Microarchitecture, pages 308--315, Nov. 1998.
[24]
B. Rau. Iterative modulo scheduling: An algorithm for software pipelining loops. In Proceedings of the 27th Annual International Symposium on Microarchitecture, pages 63--74, Nov. 1994.
[25]
Trimaran. An infrastructure for research in ILP. http://www.trimaran.org.
[26]
T. Yang and A. Gerasoulis. DSC: Scheduling parallel tasks on an unbounded number of processors. IEEE Transactions on Parallel and Distributed Systems, 1994.

Cited By

View all
  • (2016)To waffinity and beyondProceedings of the 12th USENIX conference on Operating Systems Design and Implementation10.5555/3026877.3026910(419-434)Online publication date: 2-Nov-2016
  • (2014)A hierarchical framework to enhance scalability and performance of scheduling and mapping algorithmsProceedings of the 2014 Electronic System Level Synthesis Conference (ESLsyn)10.1109/ESLsyn.2014.6850384(1-6)Online publication date: May-2014
  • (2013)A constraint programming approach for integrated spatial and temporal scheduling for clustered architecturesACM Transactions on Embedded Computing Systems10.1145/251247013:1(1-23)Online publication date: 5-Sep-2013
  • Show More Cited By

Recommendations

Comments

Please enable JavaScript to view thecomments powered by Disqus.

Information & Contributors

Information

Published In

cover image ACM Conferences
PLDI '03: Proceedings of the ACM SIGPLAN 2003 conference on Programming language design and implementation
June 2003
360 pages
ISBN:1581136625
DOI:10.1145/781131
Permission to make digital or hard copies of all or part of this work for personal or classroom use is granted without fee provided that copies are not made or distributed for profit or commercial advantage and that copies bear this notice and the full citation on the first page. Copyrights for components of this work owned by others than ACM must be honored. Abstracting with credit is permitted. To copy otherwise, or republish, to post on servers or to redistribute to lists, requires prior specific permission and/or a fee. Request permissions from [email protected]

Sponsors

Publisher

Association for Computing Machinery

New York, NY, United States

Publication History

Published: 09 May 2003

Permissions

Request permissions for this article.

Check for updates

Author Tags

  1. clustering
  2. instruction scheduling
  3. instruction-level parallelism
  4. multicluster processor
  5. operation partitioning
  6. region-based compilation

Qualifiers

  • Article

Conference

PLDI03
Sponsor:

Acceptance Rates

PLDI '03 Paper Acceptance Rate 28 of 131 submissions, 21%;
Overall Acceptance Rate 406 of 2,067 submissions, 20%

Contributors

Other Metrics

Bibliometrics & Citations

Bibliometrics

Article Metrics

  • Downloads (Last 12 months)6
  • Downloads (Last 6 weeks)1
Reflects downloads up to 14 Nov 2024

Other Metrics

Citations

Cited By

View all
  • (2016)To waffinity and beyondProceedings of the 12th USENIX conference on Operating Systems Design and Implementation10.5555/3026877.3026910(419-434)Online publication date: 2-Nov-2016
  • (2014)A hierarchical framework to enhance scalability and performance of scheduling and mapping algorithmsProceedings of the 2014 Electronic System Level Synthesis Conference (ESLsyn)10.1109/ESLsyn.2014.6850384(1-6)Online publication date: May-2014
  • (2013)A constraint programming approach for integrated spatial and temporal scheduling for clustered architecturesACM Transactions on Embedded Computing Systems10.1145/251247013:1(1-23)Online publication date: 5-Sep-2013
  • (2013)Low cost control flow protection using abstract control signaturesACM SIGPLAN Notices10.1145/2499369.246556848:5(3-12)Online publication date: 20-Jun-2013
  • (2013)Low cost control flow protection using abstract control signaturesProceedings of the 14th ACM SIGPLAN/SIGBED conference on Languages, compilers and tools for embedded systems10.1145/2491899.2465568(3-12)Online publication date: 20-Jun-2013
  • (2013)Low cost control flow protection using abstract control signaturesProceedings of the 14th ACM SIGPLAN/SIGBED conference on Languages, compilers and tools for embedded systems10.1145/2465554.2465568(3-12)Online publication date: 20-Jun-2013
  • (2012)SIMD defragmenterACM SIGPLAN Notices10.1145/2248487.215101447:4(363-374)Online publication date: 3-Mar-2012
  • (2012)SIMD defragmenterACM SIGARCH Computer Architecture News10.1145/2189750.215101440:1(363-374)Online publication date: 3-Mar-2012
  • (2012)SIMD defragmenterProceedings of the seventeenth international conference on Architectural Support for Programming Languages and Operating Systems10.1145/2150976.2151014(363-374)Online publication date: 3-Mar-2012
  • (2012)An Address-Based Compiling Optimization for FFT on Multi-cluster DSPProceedings of the 2012 Fifth International Symposium on Parallel Architectures, Algorithms and Programming10.1109/PAAP.2012.17(60-64)Online publication date: 17-Dec-2012
  • Show More Cited By

View Options

Login options

View options

PDF

View or Download as a PDF file.

PDF

eReader

View online with eReader.

eReader

Media

Figures

Other

Tables

Share

Share

Share this Publication link

Share on social media