Moths: Mobile threads for on-chip networks

Published: 29 March 2013

Abstract

As the number of cores integrated on a single chip continues to increase, communication can become a severe bottleneck to overall system performance. Thread sharing and the distribution of data across on-chip cache banks can result in long-distance communication. Such communication incurs substantial latency that degrades performance; it also consumes significant dynamic power when packets are switched over many Network-on-Chip (NoC) links and routers. Thread migration can mitigate the problems created by long-distance communication. This article presents Moths, an efficient runtime algorithm that responds automatically to dynamic NoC traffic patterns, migrating threads where beneficial to decrease overall traffic volume and average packet latency. Moths reduces on-chip network latency by up to 28.4% (18.0% on average) and traffic volume by up to 24.9% (20.6% on average) across a variety of commercial and scientific benchmarks.
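
To make the intuition behind the abstract concrete, the following is a minimal sketch of why moving a thread closer to the cache banks it uses can cut NoC traffic. It is not the Moths algorithm, which the abstract does not specify. The 2D mesh layout, the packets-times-hops cost model, the greedy single-move search, and all names (`hops`, `traffic_cost`, `best_single_migration`) are assumptions made only for illustration.

```python
from itertools import product

def hops(a, b):
    """Manhattan distance between two mesh tiles given as (x, y) pairs."""
    return abs(a[0] - b[0]) + abs(a[1] - b[1])

def traffic_cost(placement, traffic):
    """Sum of packets * hops over all flows (assumed cost model).

    placement: thread name -> tile the thread runs on
    traffic:   (thread name, cache bank tile) -> packet count
    """
    return sum(pkts * hops(placement[t], bank)
               for (t, bank), pkts in traffic.items())

def best_single_migration(placement, traffic, width, height):
    """Try moving each thread to each free tile (hypothetical greedy step).

    Returns (new_cost, thread, tile) for the move that lowers the cost
    the most, or None if no single move improves on the current placement.
    """
    base = traffic_cost(placement, traffic)
    occupied = set(placement.values())
    best = None
    tiles = list(product(range(width), range(height)))
    for thread, tile in product(placement, tiles):
        if tile in occupied:
            continue  # assume one thread per tile
        trial = dict(placement)
        trial[thread] = tile
        cost = traffic_cost(trial, traffic)
        if cost < base and (best is None or cost < best[0]):
            best = (cost, thread, tile)
    return best

# Illustrative 4x4 mesh: t0 sits far from the bank it talks to most.
placement = {"t0": (0, 0), "t1": (3, 3)}
traffic = {("t0", (3, 2)): 100, ("t1", (3, 3)): 50}
move = best_single_migration(placement, traffic, 4, 4)
```

In this toy setup, thread `t0` sends its packets five hops across the mesh, so relocating it to the tile of the bank it uses removes that traffic entirely. A real runtime scheme would also have to weigh migration cost and changing traffic patterns, which this sketch ignores.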

Cited By

  • (2014) Fort-NoCs. Proceedings of the 51st Annual Design Automation Conference, 1-6. DOI: 10.1145/2593069.2593144. Online publication date: 1-Jun-2014
  • (2014) Optimizing memory access traffic via runtime thread migration for on-chip distributed memory systems. The Journal of Supercomputing 69:3, 1491-1516. DOI: 10.1007/s11227-014-1240-8. Online publication date: 1-Sep-2014

Published In

ACM Transactions on Embedded Computing Systems, Volume 12, Issue 1s
Special section on ESTIMedia'12, LCTES'11, rigorous embedded systems design, and multiprocessor system-on-chip for cyber-physical systems
March 2013
701 pages
ISSN: 1539-9087
EISSN: 1558-3465
DOI: 10.1145/2435227

Publisher

Association for Computing Machinery

New York, NY, United States

Publication History

Published: 29 March 2013
Accepted: 01 December 2011
Revised: 01 August 2011
Received: 01 March 2011
Published in TECS Volume 12, Issue 1s

Author Tags

  1. Thread migration
  2. network-on-chip
  3. runtime algorithm

Qualifiers

  • Research-article
  • Research
  • Refereed
