Moths: Mobile threads for on-chip networks

Authors:

Matthew Misler,

Natalie Enright Jerger

ACM Transactions on Embedded Computing Systems (TECS), Volume 12, Issue 1s

Article No.: 56, Pages 1 - 22

https://doi.org/10.1145/2435227.2435252

Published: 29 March 2013

Abstract

As the number of cores integrated on a single chip continues to increase, communication has the potential to become a severe bottleneck to overall system performance. Data sharing among threads and the distribution of data across on-chip cache banks can result in long-distance communication. Long-distance communication incurs substantial latency that degrades performance; it also consumes significant dynamic power, because packets are switched over many Network-on-Chip (NoC) links and routers. Thread migration can mitigate the problems created by long-distance communication. This article presents Moths, an efficient runtime algorithm that responds automatically to dynamic NoC traffic patterns, migrating threads where doing so is beneficial in order to decrease overall traffic volume and average packet latency. Moths reduces on-chip network latency by up to 28.4% (18.0% on average) and traffic volume by up to 24.9% (20.6% on average) across a variety of commercial and scientific benchmarks.
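
To make the intuition behind traffic-driven migration concrete, the sketch below estimates how many link traversals a thread's packets need from each tile of a 2D mesh and picks the tile that minimizes that count. This is an illustration only and not the Moths algorithm, whose details the abstract does not give. The mesh topology, the Manhattan-distance hop model, the per-destination packet counts, and the names `hops`, `traffic_cost`, and `best_tile` are all assumptions introduced here. The sketch also ignores migration cost and whether the target tile is already occupied.

```python
from typing import Dict, Tuple

Tile = Tuple[int, int]  # hypothetical (x, y) coordinate of a tile in a 2D mesh


def hops(a: Tile, b: Tile) -> int:
    """Manhattan distance, i.e. hop count under minimal routing on a mesh (assumed)."""
    return abs(a[0] - b[0]) + abs(a[1] - b[1])


def traffic_cost(src: Tile, packets_to: Dict[Tile, int]) -> int:
    """Total link traversals if the thread runs on `src`.

    `packets_to` maps each destination tile to an observed packet count.
    """
    return sum(count * hops(src, dst) for dst, count in packets_to.items())


def best_tile(current: Tile, packets_to: Dict[Tile, int],
              width: int, height: int) -> Tile:
    """Return the tile with the lowest traffic-weighted hop count.

    The thread stays on `current` unless another tile is strictly cheaper.
    """
    best, best_cost = current, traffic_cost(current, packets_to)
    for x in range(width):
        for y in range(height):
            cost = traffic_cost((x, y), packets_to)
            if cost < best_cost:
                best, best_cost = (x, y), cost
    return best


if __name__ == "__main__":
    # A thread on tile (0, 0) that mostly talks to the far corner of a 4x4 mesh.
    traffic = {(3, 3): 10, (3, 2): 5, (0, 0): 1}
    print(traffic_cost((0, 0), traffic))      # cost where the thread runs now
    print(best_tile((0, 0), traffic, 4, 4))   # candidate migration target
```

In this toy setting, moving the thread next to the cache banks it talks to most shortens the path of most of its packets, which is the effect the abstract attributes to thread migration: fewer link and router traversals, hence lower latency and traffic volume.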

Cited By

Ancajas, D., Chakraborty, K., and Roy, S. 2014. Fort-NoCs. In Proceedings of the 51st Annual Design Automation Conference. 1--6. Online publication date: 1-Jun-2014.
https://dl.acm.org/doi/10.1145/2593069.2593144

Fu, W., Chen, T., Wang, C., and Liu, L. 2014. Optimizing memory access traffic via runtime thread migration for on-chip distributed memory systems. The Journal of Supercomputing 69, 3, 1491--1516. Online publication date: 1-Sep-2014.
https://dl.acm.org/doi/10.1007/s11227-014-1240-8

Index Terms

Moths: Mobile threads for on-chip networks
1. Computer systems organization
  1. Architectures
    1. Parallel architectures
      1. Interconnection architectures
2. Hardware
  1. Integrated circuits
    1. Interconnect

Information & Contributors

Information

Published In

ACM Transactions on Embedded Computing Systems Volume 12, Issue 1s

Special section on ESTIMedia'12, LCTES'11, rigorous embedded systems design, and multiprocessor system-on-chip for cyber-physical systems

March 2013

701 pages

ISSN:1539-9087

EISSN:1558-3465

DOI:10.1145/2435227

Copyright © 2013 ACM.

Permission to make digital or hard copies of all or part of this work for personal or classroom use is granted without fee provided that copies are not made or distributed for profit or commercial advantage and that copies bear this notice and the full citation on the first page. Copyrights for components of this work owned by others than ACM must be honored. Abstracting with credit is permitted. To copy otherwise, or republish, to post on servers or to redistribute to lists, requires prior specific permission and/or a fee. Request permissions from [email protected]

Publisher

Association for Computing Machinery

New York, NY, United States

Journal Family

ACM Journals for the Design of Smart and Connected Systems

Publication History

Published: 29 March 2013

Accepted: 01 December 2011

Revised: 01 August 2011

Received: 01 March 2011

Published in TECS Volume 12, Issue 1s

Qualifiers

Research-article
Research
Refereed

Funding Sources

Natural Sciences and Engineering Research Council of Canada

Article Metrics

Total Citations: 2
Total Downloads: 206
Downloads (Last 12 months): 1
Downloads (Last 6 weeks): 0

Reflects downloads up to 13 Nov 2024
