DOI: 10.1145/3458336.3465297
research-article
Open access

Cores that don't count

Published: 03 June 2021

Abstract

We are accustomed to thinking of computers as fail-stop, especially the cores that execute instructions, and most system software implicitly relies on that assumption. During most of the VLSI era, processors that passed manufacturing tests and were operated within specifications have insulated us from this fiction. As fabrication pushes towards smaller feature sizes and more elaborate computational structures, and as increasingly specialized instruction-silicon pairings are introduced to improve performance, we have observed ephemeral computational errors that were not detected during manufacturing tests. These defects cannot always be mitigated by techniques such as microcode updates, and may be correlated with specific components within the processor, allowing small code changes to effect large shifts in reliability. Worse, these failures are often "silent": the only symptom is an erroneous computation.
We refer to a core that develops such behavior as "mercurial." Mercurial cores are extremely rare, but in a large fleet of servers we can observe the disruption they cause, often enough to see them as a distinct problem - one that will require collaboration between hardware designers, processor vendors, and systems software architects.
This paper is a call to action for a new focus in systems research; we speculate about several software-based approaches to mercurial cores, ranging from better detection and isolation mechanisms to methods for tolerating the silent data corruption they cause.
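
As one illustration of what "tolerating" silent data corruption can mean in software, the sketch below runs the same computation several times and accepts a result only if a strict majority of runs agree. This is a generic redundant-execution-with-voting pattern, not a method taken from the paper. The names `vote`, `run_redundantly`, `checksum`, and `corrupted_checksum` are hypothetical, and the faulty replica is simulated. A real deployment would also need to place replicas on different cores, which this sketch does not attempt.

```python
from collections import Counter
from typing import Callable, Sequence, TypeVar

T = TypeVar("T")


class NoMajorityError(RuntimeError):
    """Raised when no single result wins a strict majority of replicas."""


def vote(results: Sequence[T]) -> T:
    """Return the value that a strict majority of replicas produced."""
    if not results:
        raise ValueError("at least one result is required")
    value, count = Counter(results).most_common(1)[0]
    if count * 2 <= len(results):
        raise NoMajorityError(f"no majority among {len(results)} results")
    return value


def run_redundantly(replicas: Sequence[Callable[[], T]]) -> T:
    """Run every replica of the same computation and vote on the outputs."""
    return vote([replica() for replica in replicas])


def checksum(data: bytes) -> int:
    """Toy computation standing in for real work."""
    return sum(data) % 65521


def corrupted_checksum(data: bytes) -> int:
    """Hypothetical stand-in for a replica that silently computes a wrong answer."""
    return checksum(data) ^ 1  # flip the low bit to simulate corruption


if __name__ == "__main__":
    data = b"cores that don't count"
    replicas = [
        lambda: checksum(data),
        lambda: corrupted_checksum(data),  # simulated faulty run
        lambda: checksum(data),
    ]
    # The two healthy runs outvote the corrupted one.
    print(run_redundantly(replicas) == checksum(data))
```

With three replicas the scheme masks one wrong answer. With only two it can detect a disagreement but cannot say which run was wrong, so `vote` raises `NoMajorityError` instead of guessing.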

Published In

HotOS '21: Proceedings of the Workshop on Hot Topics in Operating Systems
June 2021
251 pages
ISBN:9781450384384
DOI:10.1145/3458336
This work is licensed under a Creative Commons Attribution International 4.0 License.

Publisher

Association for Computing Machinery

New York, NY, United States
