DOI: 10.1145/2318916.2318930
research-article

Evaluating operating system vulnerability to memory errors

Published: 29 June 2012

Abstract

Reliability is of great concern to the scalability of extreme-scale systems. Of particular concern are soft errors in main memory, which are a leading cause of failures on current systems and are predicted to be the leading cause on future systems. While great effort has gone into designing algorithms and applications that can continue to make progress in the presence of these errors without restarting, the most critical software running on a node, the operating system (OS), is currently left relatively unprotected. OS resiliency is of particular importance because, though this software typically represents a small footprint of a compute node's physical memory, recent studies show more memory errors in this region of memory than the remainder of the system. In this paper, we investigate the soft error vulnerability of two operating systems used in current and future high-performance computing systems: Kitten, the lightweight kernel developed at Sandia National Laboratories, and CLE, a high-performance Linux-based operating system developed by Cray. For each of these platforms, we outline major structures and subsystems that are vulnerable to soft errors and describe methods that could be used to reconstruct damaged state. Our results show the Kitten lightweight operating system may be an easier target to harden against memory errors due to its smaller memory footprint, largely deterministic state, and simpler system structure.

Published In

ROSS '12: Proceedings of the 2nd International Workshop on Runtime and Operating Systems for Supercomputers
June 2012
82 pages
ISBN:9781450314602
DOI:10.1145/2318916

Publisher

Association for Computing Machinery

New York, NY, United States

Author Tags

  1. DRAM failures
  2. fault-tolerance
  3. operating systems

Qualifiers

  • Research-article

Conference

ICS'12
Acceptance Rates

Overall Acceptance Rate 58 of 169 submissions, 34%
