Nothing Special   »   [go: up one dir, main page]

skip to main content
article

CAS-DSM: a compiler assisted software distributed shared memory

Published: 01 April 2004 Publication History

Abstract

Traditional software Distributed Shared Memory (DSM) systems rely on the virtual memory management mechanisms to detect accesses to shared memory locations and maintain their consistency. The resulting involvement of the OS (kernel) and the associated overhead which is significant, can be avoided by careful compile time analysis and code instrumentation. In this paper, we propose such a Compiler Assisted Software support approach (CAS-DSM). In the CAS-DSM implementation, the involvement of the OS kernel is avoided by instrumenting the application code at the source level. The overhead caused by the execution of the instrumented code is reduced through several aggressive compile time optimizations. Finally, we also address the issue of reducing certain overheads in polling-based implementation of receiving asynchronous messages. We used SUIF, a public domain compiler tool, to implement compile time analysis, instrumentation and optimizations. We modified CVM, a publicly available software DSM to support the instrumentation inserted by the compiler. Detailed performance evaluation of CAS-DSM is reported using a set of Splash/Splash2 parallel application benchmarks on a distributed memory IBM SP-2 machine. CAS-DSM achieved moderate to good performance improvements for most of the applications compared to the original CVM implementation. Reducing the overheads in polling-based implementation improves the performance of CAS-DSM significantly resulting in an overall improvement of 12-52% over the original CVM implementation.

References

[1]
1. J. L. Hennessy and D. A. Patterson, Computer Architecture: A Quantitative Approach, 2nd Ed., Morgan Kaufmann Publishers, San Francisco, CA (1996).]]
[2]
2. D. Lenoski, J. Laudon, K. Gharachorloo, W.-D. Weber, A. Gupta, J. Hennessy, M. Horowitz, and M. Lam, The Stanford DASH Multiprocessor, Computer, 25(3):63-79 (March 1992).]]
[3]
3. D. V. James, A. T. Laundrie, S. Gjessing, and G. S. Sohi, Distributed-Directory Scheme: Scalable Coherent Interface, IEEE Comput., 74-77 (June 1990).]]
[4]
4. E. Hagersten, A. Landin, and S. Haridi, DDM--a Cache-Only Memory Architecture, Computer, 25(9):44-54 (September 1992).]]
[5]
5. S. Frank, KSR1: High Performance and Ease of Programming, No Longer an Oxymoron, Proc. of the 5th Ann. ACM Symp. on Parallel Algorithms and Architectures, Velen, Germany, p. 335 (June 30-July 2, 1993).]]
[6]
6. Sarita V. Adve and Kourosh Gharachorloo, Shared Memory Consistency Models: A Tutorial, IEEE Comput., 66-76 (December 1996).]]
[7]
7. V. S. Sunderam, PVM: A Framework for Parallel Distributed Computing, IEEE Concurrency: Practice and Experience, 2(4):315-339 (December 1990).]]
[8]
8. W. Gropp, E. Lusk, and A. Skjellum, Using MPI: Portable Parallel Programming with the Message-Passing Interface, MIT Press (1994).]]
[9]
9. K. Li, Ivy, A Shared Virtual Memory System for Parallel Computing, International Conference on Parallel Processing, pp. 94-101 (1988).]]
[10]
10. C. Amza, A. L. Cox, S. Dwarakdas, P. Keleher, H. Lu, R. Rajamony, W. Yu, and W. Zwaenepoel, Treadmarks: Shared Memory Computing on Network of Workstations, IEEE Computer, 18-28 (February 1996).]]
[11]
11. J. K. Bennett, J. B. Carter, and W. Zwaenepoel, Munin: Distributed Shared Memory Based on Type-Specific Memory Coherence, Proc. of the Second ACM SIGPLAN Symp. on Principles and Practice of Parallel Programming, Seattle, WA, pp. 168-176 (March 14-16, 1990).]]
[12]
12. B. N. Bershad and M. J. Zekauskas, Midway: Shared Memory Parallel Programming with Entry Consistency for Distributed Memory Multiprocessors, Technical report, School of Computer Science, Carnegie Mellon University, Pisstburgh, PA 15213 (1991).]]
[13]
13. M. J. Zekauskas, W. A. Sawdon, and B. N. Bershad, Software Write Detection for a Distributed Shared Memory. The Symposium on Operating Systems Design and Implementation (OSDI), pp. 87-100 (November 1994).]]
[14]
14. P. Keleher, CVM: The Coherent Virtual Machine, University of Maryland, College Park, MD (July 1997). http://www.cs.umd.edu/projects/cvm]]
[15]
15. D. J. Scales, K. Gharachorloo, and C. A. Thekkath, Shasta: A low Overhead, Software-Only Approach for Supporting Fine-Grain Shared Memory, Proceedings of Seventh International Conference on Architectural Support for Programming Languages and Operating Systems, Cambridge, MA, pp. 174-185 (October 1996).]]
[16]
16. P. Keleher, Tapeworm: High Level Abstractions of Shared Access, Proc. of the 3rd Symp. on Operating System Design and Implementation, pp. 201-214 (February 1999).]]
[17]
17. A. Itzkovitz and A. Schuster, Multiview and Millipage: Fine Grain Sharing in Page Based DSMs, Proc. of the 3rd Symp. on Operating System Design and Implementation, pp. 215-228 (February 1999).]]
[18]
18. H. Lu, S. Dwarkadas, A. L. Cox, and W. Zwaenepoel, Message Passing Versus Distributed Shared Memory on Networks of Workstations, Proc. of the Supercomputing Conference SC-95 (December 1995).]]
[19]
19. OpenMP. http://www.openmp.org.]]
[20]
20. H. Lu, C. Lu, and W. Zwaenepoel, OpenMP on networks of workstations, Proc. of the Supercomputing Conference SC-98 (November 1998).]]
[21]
21. K. Kusano, M. Sato, T. Hosomi, and Y. Seo, The Omni OpenMP Compiler on the Distributed Shared Memory of Cenju-4, Proc. of the Intl. Workshop on OpenMP Applications and Tools, WOMPAT 2001, West Lafayette, IN, USA (July 2001).]]
[22]
22. S. C. Woo, M. Ohara, E. Torrie, J. Pal Shingh, and A. Gupta, The SPLASH-2 Programs: Characterization and Methodological Considerations, Proc. of the 22nd Ann. Intl. Symp. on Computer Architecture, Santa Margherita Ligure, Italy, pp. 24-36 (June 22-24, 1995).]]
[23]
23. K. Gharachorloo, D. Lenoski, J. Laudon, P. Gibbons, A. Gupta, and J. Hennessy, Memory Consistency and Event Ordering in Scalable Shared-Memory Multiprocessors, Proc. of the 17th Ann. Intl. Symp. on Computer Architecture, Seattle, WA, pp. 15-26 (May 28-31, 1990).]]
[24]
24. J. R. Goodman, Cache Consistency and Sequential Consistency, Technical Report 1006, Dep. of Comp. Sci., University of Wisconsin, Madison (February 1991).]]
[25]
25. M. Dubois, C. Scheurich, and F. Briggs, Memory Access Buffering in Multiprocessors, Proc. of the 13th Ann. Intl. Symp. on Computer Architecture, Tokyo, Japan, pp. 434-442 (June 2-5, 1986).]]
[26]
26. P. Keleher, A. L. Cox, and W. Zwaenepoel, Lazy Release Consistency for Software Distributed Shared Memory, Proc. of the 19th Ann. Intl. Symp. on Computer Architecture, Gold Coast, Australia, pp. 13-21 (May 19-21, 1992).]]
[27]
27. K. L. Johnson, M. F. Kaashoek, and D. Wallach, CRL: High-Performance All-Software Distributed Shared Memory, Proc. of the Fifth Workshop on Scalable Shared Memory Multiprocessors (June 1995).]]
[28]
28. K. Thitikamol and P. Keleher, Multi-Threading and Remote Latency in Software dsms, International Conference on Distributed Computer Systems, Baltimore, Maryland, USA (May 1997).]]
[29]
29. D. A. Bader and J. JaJa, SIMPLE: A Methodology for Programming High Performance Algorithms on Clusters of Symmetric Multiprocessors (SMPs), Journal of Parallel and Distributed Computing, 58(1):92-108 (1999).]]
[30]
30. H. Han and C-W. Tseng, Compile-Time Synchronization Optimizations for Software DSMs, Proc. of the International Parallel Processing Symposium (1998).]]
[31]
31. T. Park and H. Y. Yeom, An Efficient Logging and Recovery Scheme for Lazy Release Consistent Distributed Shared Memory Systems, Future Generation Computer Systems, 17(3):265-278 (2000).]]
[32]
32. Stanford SUIF Compiler Group, The SUIF Parallelizing Compiler Group, Technical report, Stanford University (1994).]]
[33]
33. T. Agerwala, J. L. Martin, J. H. Mirza, D. C. Sadler, D. M. Dias, and M. Snir, Sp2 System Architecture, IBM Systems Journal, 34(2):152-184 (1995).]]
[34]
34. C. B. Stunkel, D. G. Shea, B. Abali, M. G. Atkins, C. A. Bender, D. G. Grice, P. Hochschild, D. J. Joseph, B. J. Nathanson, R. A. Swetz, R. F. Stucke, M. Tsao, and P. R. Varker, The sp2 High-Performance Switch, IBM Systems Journal, 34(2):185-204 (1995).]]
[35]
35. Stanford SUIF Compiler Group, An Overview of the Suif Compiler System, Technical report, Stanford University (1994).]]
[36]
36. H. Lu, A. L. Cox, S. Dwarkadas, R. Rajamony, and W. Zwaenepoel, Compiler and Software Distributed Shared Memory Support for Irregular Applications, Sixth Symposium on Principles and Practices of Parallel Programming, pp. 48-56 (1997).]]
[37]
37. G. E. Blelloch, C. E. Lieserson, B. M. Maggs, C. G. Plaxton, and S. J. Smith, and M. Zagha, A Comparison of Sorting Algorithms for the Connection Machine CM-2, Symposium on Parallel Algorithms and Architectures, pp. 3-16 (July 1991).]]
[38]
38. S. C. Woo, J. P. Singh, and J. L. Hennessy, The Performance Advantages of Integrating Message Passing in Cache-Coherent Multiprocessors, Technical report, Stanford University (December 1993).]]
[39]
39. D. Perkovic and P. Keleher, Responsiveness without Interrupts, Proc. of the 1999 Intl. Conf. on Supercomputing (ICS-99), pp. 101-108 (June 1999).]]
[40]
40. I. Schoinas, B. Falsafi, A. R. Lebeck, S. K. Reinhardt, J. R. Larus, and D. A. Wood, Fine-Grain Access Control for Distributed Shared Memory, Proceedings of Sixth International Conference on Architectural Support for Programming Languages and Operating Systems, San Jose, CA, pp. 297-306 (October 1996).]]
[41]
41. D. J. Scales and M. S. Lam, An Efficient Memory Layer for Distributed Memory Machines, CSL-TR-94-627, Computer Systems Laboratory, Stanford University (1994).]]
[42]
42. S. Ahuja, N. Carriero, and D. Gelernter, Linda and Friends, Computer, 19(8):26-34 (August 1986).]]
[43]
43. D. Yeung, J. Kubiatowitcz, and A. Agarwal, MGS: A Multigrain Shared Memory System, Proceedings of the Twenty-Third International Symposium on Computer Architecture, pp. 44-55 (May 1996).]]
[44]
44. J. Protić, M. Tomašević, and V. Milutinović, Distributed Shared Memory: Concepts and Systems, IEEE Parallel and Distributed Technology, 63-79 (1996).]]
[45]
45. S. Dwarkdas, A. L. Cox, and W. Zwaenepoel, An Integrated Compile-Time/Run-Time Software Distributed Shared Memory System, Proc. of Sixth Intl. Conf. on Architectural Support for Programming Languages and Operating Systems, pp. 186-197, San Jose, CA (October 1996).]]
[46]
46. T. Mowry and A. Gupta, Tolerating Latency Through Software-Controlled Prefetching in Shared-Memory Multiprocessors, Journal of Parallel and Distributed Computing12:87-106 (1991).]]

Cited By

View all

Recommendations

Reviews

Bayard Kohlhepp

Research papers can generally be assigned to one of two broad categories: surveys or achievements. Surveys describe progress in a specific research field as a whole. Achievement papers describe a single, incremental advance, obtained by one team in one set of experiments. This paper is both. The authors describe an incremental advance in distributed shared memory (DSM), but they cite so many other DSM projects (46 citations) that the paper is also valuable as a DSM survey. Why does an incremental gain in DSM performance matter__?__ In the current progression of computing, from PCs to pervasive and ubiquitous computing (the post-PC era), an important issue is the need for data and computations to follow the user from one client machine to another. This functionality will exist in all user-facing applications, and it will inevitably require some form of distributed shared memory. Improvements to DSM performance or reliability, therefore, lie squarely on the critical path to the future. The authors report a notable, repeatable DSM performance gain. Many DSM models rely on the underlying virtual memory mechanism of the operating system to signal a page fault when a shared memory page is accessed. The authors documented the elapsed time of this signal handling to be two orders of magnitude (100 times) longer than the elapsed time for a function call. Their objective, then, was to replace signal overhead with function calls. They built a test environment using Coherent Virtual Machine (CVM), a public domain DSM library, and the Stanford University Intermediate Format (SUIF) compiler system. By analyzing memory accesses at compile time using SUIF tools, and judiciously adding their own "instrumented code," the authors were able to avoid page fault signals, and achieve performance improvements of five to 15 percent over plain CVM. I would recommend this paper to three types of readers. First, DSM researchers and developers should take notice of, and consider implementing, this performance gain. Second, anyone just entering this field would benefit from the survey aspect of this paper. Finally, anyone considering compiler-assisted techniques in their own field should review this paper. The authors do a great job of explaining the details of their work, as well as the consequence of each change. Online Computing Reviews Service

Access critical reviews of Computing literature here

Become a reviewer for Computing Reviews.

Comments

Please enable JavaScript to view thecomments powered by Disqus.

Information & Contributors

Information

Published In

cover image International Journal of Parallel Programming
International Journal of Parallel Programming  Volume 32, Issue 2
April 2004
87 pages

Publisher

Kluwer Academic Publishers

United States

Publication History

Published: 01 April 2004

Author Tags

  1. Stanford university intermediate form (SUIF)
  2. coherent virtual machine (CVM)
  3. performance evaluation
  4. software distributed shared memory

Qualifiers

  • Article

Contributors

Other Metrics

Bibliometrics & Citations

Bibliometrics

Article Metrics

  • Downloads (Last 12 months)0
  • Downloads (Last 6 weeks)0
Reflects downloads up to 25 Nov 2024

Other Metrics

Citations

Cited By

View all
  • (2012)CUDA-for-clustersProceedings of the 18th international conference on Parallel Processing10.1007/978-3-642-32820-6_42(415-426)Online publication date: 27-Aug-2012
  • (2011)Optimizing a shared virtual memory system for a heterogeneous CPU-accelerator platformACM SIGOPS Operating Systems Review10.1145/1945023.194503545:1(92-100)Online publication date: 18-Feb-2011
  • (2009)DSiMClusterSimulation10.1177/003754970910448385:6(355-374)Online publication date: 1-Jun-2009
  • (2006)Barrier elimination based on access dependency analysis for OpenMPProceedings of the 4th international conference on Parallel and Distributed Processing and Applications10.1007/11946441_36(362-373)Online publication date: 4-Dec-2006
  • (2005)The data diffusion space for parallel computing in clustersProceedings of the 11th international Euro-Par conference on Parallel Processing10.1007/11549468_10(61-71)Online publication date: 30-Aug-2005

View Options

View options

Login options

Media

Figures

Other

Tables

Share

Share

Share this Publication link

Share on social media