Frequent loop detection using efficient non-intrusive on-chip hardware

Authors: Ann Gordon-Ross, Frank Vahid

CASES '03: Proceedings of the 2003 international conference on Compilers, architecture and synthesis for embedded systems

Pages 117 - 124

https://doi.org/10.1145/951710.951728

Published: 30 October 2003

Abstract

Dynamic software optimization methods are becoming increasingly popular for improving software performance and power. The first step in dynamic optimization is detecting frequently executed code, or "critical regions." Previous critical region detectors have targeted desktop processors. We introduce a critical region detector targeted to embedded processors, with the unique features of being highly size- and power-efficient and completely non-intrusive to the software's execution, features needed in timing-sensitive embedded systems. Our detector not only finds the critical regions but also determines their relative frequencies, a potentially important feature for selecting among alternative dynamic optimization methods. The detector uses a tiny cache coupled with a small amount of logic. We report results of extensive explorations across seventeen embedded system benchmarks, showing that highly accurate results can be achieved with only a 0.02% power overhead and acceptable size overhead. Our detector is currently used as part of a dynamic hardware/software partitioning approach, but is applicable to a wide variety of situations.
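
The abstract does not spell out the detector's internals, so the following is only a rough software model of what "a tiny cache coupled with a small amount of logic" might look like, not the authors' design. It assumes that loops are recognized by taken backward branches, that the cache is direct-mapped on the branch address, and that all counters are halved when one saturates so that relative frequencies are roughly preserved. The class name `LoopDetector`, the method names, the cache size, and the counter width are all hypothetical.

```python
from dataclasses import dataclass, field


@dataclass
class LoopDetector:
    """Toy model of a small direct-mapped frequent-loop cache (illustrative only)."""
    num_entries: int = 16   # assumed number of cache entries (hypothetical)
    counter_bits: int = 8   # assumed width of each frequency counter (hypothetical)
    tags: list = field(init=False)
    counts: list = field(init=False)

    def __post_init__(self):
        self.tags = [None] * self.num_entries
        self.counts = [0] * self.num_entries

    def observe_branch(self, pc, target, taken):
        """Record one executed branch; only taken backward branches are counted."""
        if not taken or target >= pc:
            return  # not a loop-closing branch under this sketch's assumption
        idx = (pc >> 2) % self.num_entries  # assumes 4-byte-aligned instructions
        if self.tags[idx] != pc:
            # Cold miss or conflict: the new branch evicts the old entry.
            self.tags[idx] = pc
            self.counts[idx] = 0
        self.counts[idx] += 1
        if self.counts[idx] >= (1 << self.counter_bits) - 1:
            # A counter saturated: halve every counter to keep ratios comparable.
            self.counts = [c >> 1 for c in self.counts]

    def relative_frequencies(self):
        """Return each cached branch address mapped to its share of all counts."""
        total = sum(self.counts)
        if total == 0:
            return {}
        return {tag: c / total
                for tag, c in zip(self.tags, self.counts)
                if tag is not None and c > 0}


# Example trace: two loops with a 3:1 iteration ratio and one forward branch.
det = LoopDetector()
for _ in range(30):
    det.observe_branch(pc=0x1040, target=0x1000, taken=True)
for _ in range(10):
    det.observe_branch(pc=0x2084, target=0x2060, taken=True)
det.observe_branch(pc=0x3000, target=0x3010, taken=True)  # forward branch, ignored
freqs = det.relative_frequencies()
print(freqs)
```

In this model the monitored program is never modified or interrupted; the detector only watches branch addresses, which is the sense in which such a scheme could be non-intrusive. A real hardware design would face trade-offs (associativity, counter width, replacement policy) that this sketch does not attempt to capture.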



Index Terms

Frequent loop detection using efficient non-intrusive on-chip hardware
1. Hardware
  1. Integrated circuits
    1. Semiconductor memory
      1. Dynamic memory

Published In

CASES '03: Proceedings of the 2003 international conference on Compilers, architecture and synthesis for embedded systems

October 2003

340 pages

ISBN: 1581136765

DOI: 10.1145/951710

General Chairs: Jaime Moreno (IBM Research), Praveen Murthy (Fujitsu Labs of America)

Program Chairs: Tom Conte (North Carolina State University), Paolo Faraboschi (HP Labs)

Copyright © 2003 ACM.

Permission to make digital or hard copies of all or part of this work for personal or classroom use is granted without fee provided that copies are not made or distributed for profit or commercial advantage and that copies bear this notice and the full citation on the first page. Copyrights for components of this work owned by others than ACM must be honored. Abstracting with credit is permitted. To copy otherwise, or republish, to post on servers or to redistribute to lists, requires prior specific permission and/or a fee. Request permissions from [email protected]

Publisher

Association for Computing Machinery

New York, NY, United States

Conference

CASES03: 2003 International Conference on Compilers, Architectures and Synthesis for Embedded Systems

October 30 - November 1, 2003

San Jose, California, USA

Acceptance Rates

CASES '03 Paper Acceptance Rate 31 of 162 submissions, 19%;

Overall Acceptance Rate 52 of 230 submissions, 23%

