Timestamp snooping: an approach for extending SMPs

Published: 01 November 2000

Abstract

Symmetric multiprocessor (SMP) servers provide superior performance for the commercial workloads that dominate the Internet. Our simulation results show that over one-third of cache misses by these applications result in cache-to-cache transfers, where the data is found in another processor's cache rather than in memory. SMPs are optimized for this case by using snooping protocols that broadcast address transactions to all processors. Conversely, directory-based shared-memory systems must indirectly locate the owner and sharers through a directory, resulting in larger average miss latencies.

This paper proposes timestamp snooping, a technique that allows SMPs to i) utilize high-speed switched interconnection networks and ii) exploit physical locality by delivering address transactions to processors and memories without regard to order. Traditional snooping requires physical ordering of transactions. Timestamp snooping works by processing address transactions in a logical order. Logical time is maintained by adding a few bits per address transaction and having network switches perform a handshake to ensure on-time delivery. Processors and memories then reorder transactions based on their timestamps to establish a total order.

We evaluate timestamp snooping with commercial workloads on a 16-processor SPARC system using the Simics full-system simulator. We simulate both an indirect (butterfly) and a direct (torus) network design. For OLTP, DSS, web serving, web searching, and one scientific application, timestamp snooping with the butterfly network runs 6-28% faster than directories, at a cost of 13-43% more link traffic. Similarly, with the torus network, timestamp snooping runs 6-29% faster for 17-37% more link traffic. Thus, timestamp snooping is worth considering when buying more interconnect bandwidth is easier than reducing interconnect latency.
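
The reordering step the abstract describes, in which processors and memories sort address transactions by timestamp to obtain a total order, can be illustrated with a small software sketch. This is not the paper's hardware design. The names `Transaction` and `Endpoint`, the tie-break on a source id, and the `safe_time` argument (a stand-in for whatever guarantee the switch handshake provides) are all assumptions made for illustration only. The sketch shows two endpoints that receive the same transactions in different physical orders yet process them in the same logical order.

```python
import heapq
from dataclasses import dataclass, field


@dataclass(order=True)
class Transaction:
    timestamp: int                       # logical time carried by the address transaction
    source: int                          # assumed tie-breaker: id of the issuing processor
    address: int = field(compare=False)  # payload, ignored when ordering


class Endpoint:
    """Hypothetical processor or memory that reorders transactions by timestamp."""

    def __init__(self):
        self._pending = []    # min-heap keyed on (timestamp, source)
        self.processed = []   # transactions in the order they were processed

    def receive(self, txn):
        # Transactions may arrive in any physical order.
        heapq.heappush(self._pending, txn)

    def advance(self, safe_time):
        # Assumption: the network guarantees that every transaction with
        # timestamp <= safe_time has already been delivered to this endpoint.
        while self._pending and self._pending[0].timestamp <= safe_time:
            self.processed.append(heapq.heappop(self._pending))


def demo():
    arrivals = [
        Transaction(3, 1, 0x40),
        Transaction(1, 2, 0x80),
        Transaction(2, 3, 0xC0),
        Transaction(2, 0, 0x40),
    ]
    a, b = Endpoint(), Endpoint()
    for txn in arrivals:            # endpoint a sees one physical order
        a.receive(txn)
    for txn in reversed(arrivals):  # endpoint b sees the opposite order
        b.receive(txn)
    a.advance(3)
    b.advance(3)
    return a.processed, b.processed
```

Calling `demo()` returns two lists that contain the transactions in the same sequence, because both endpoints drain their queues by `(timestamp, source)` rather than by arrival order. A real implementation would bound the number of timestamp bits and the queue depth; the sketch ignores both.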


Publisher

Association for Computing Machinery, New York, NY, United States

Publication History

Published in SIGPLAN Volume 35, Issue 11
Cited By

  • (2015) OSPREY. Proceedings of the 2015 International Conference on Parallel Architecture and Compilation (PACT), pages 392-405. doi:10.1109/PACT.2015.45. Online publication date: 18-Oct-2015.
  • (2013) Enabling power efficiency through dynamic rerouting on-chip. ACM Transactions on Embedded Computing Systems, 12(4):1-23. doi:10.1145/2485984.2485999. Online publication date: 3-Jul-2013.
  • (2005) Reaping the Benefit of Temporal Silence to Improve Communication Performance. Proceedings of the IEEE International Symposium on Performance Analysis of Systems and Software, 2005, pages 258-268. doi:10.1109/ISPASS.2005.1430580. Online publication date: 20-Mar-2005.
  • (2002) Temporally silent stores. ACM SIGOPS Operating Systems Review, 36(5):30-41. doi:10.1145/635508.605401. Online publication date: 1-Oct-2002.
  • (2002) Temporally silent stores. ACM SIGARCH Computer Architecture News, 30(5):30-41. doi:10.1145/635506.605401. Online publication date: 1-Oct-2002.
  • (2002) Temporally silent stores. ACM SIGPLAN Notices, 37(10):30-41. doi:10.1145/605432.605401. Online publication date: 1-Oct-2002.
  • (2002) Temporally silent stores. Proceedings of the 10th international conference on Architectural support for programming languages and operating systems, pages 30-41. doi:10.1145/605397.605401. Online publication date: 5-Oct-2002.