
HPCA 2024


Computing Systems Resilience to Hardware Faults:
Tackling Complexity and Scale

Sunday, March 3, 2024, Edinburgh, Scotland
All-Day tutorial held in conjunction with HPCA 2024

Tutorial Summary

Hardware faults in microprocessors and memory devices are a significant concern in the era of advanced and ubiquitous computing. Faults in silicon chips exist because of imperfections in the fabrication process, incompleteness of the volume manufacturing test flow, device variability and marginalities, circuit aging, radiation and other environmental effects, and circuit design bugs. With CPU and memory chips getting constantly more complex and the scale of their deployment in all domains (HPC, cloud, edge, IoT) increasing dramatically, the rates at which computing system failures or silent computational errors happen due to any of the above threats are frightening.

Chip manufacturers, system integrators, hyperscalers and, of course, the research community are joining forces to tackle the complexity and scale of the problem. This tutorial brings together different perspectives and recent research findings from academia and industry. The critical point is the effort to measure and quantify the scale of the problem through modeling, analysis, and experimentation in simulation and on real chips and systems. The importance of early and accurate analysis is manifold. It can guide effective decisions in circuit design, system design, and software design for identifying and correcting the root causes of faults and for minimizing their impact on software execution and, through this, on the user's perception of the correct operation of computing systems.

The tutorial shall begin with a brief overview of key resilience terminology, followed by a review of techniques to model the vulnerability of a microprocessor to hardware faults. The tutorial shall then focus on state-of-the-art simulation-based methods for fast assessment of the vulnerability of the entire system stack to different types of hardware faults (transient, permanent, timing, bugs) in the silicon compute units (CPUs, GPUs, accelerators). The simulation-based strategies are built on early microarchitectural models that offer sufficient cycle-level hardware modeling accuracy and, most of all, very fast, complete execution of workloads. These are mandatory requirements for the measurement of silent data corruption (SDC) rates and other types of failure rates. They also facilitate a very broad design space exploration and assist diligent sensitivity studies for all microarchitectural parameters of modern computing engines that affect the rate of execution errors. The simulation-based framework supports CPUs of different ISAs (x86, Arm, RISC-V, which can be compared with each other), a wide range of microarchitectures with different complexities (to contribute to sensitivity analysis), as well as GPUs and other accelerators for data-intensive workloads integrated in complex SoCs. Results will be presented for the analysis of different hardware fault types and architectures/microarchitectures, as well as for the validation of the strategy's findings.
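To make the idea of measuring SDC and failure rates concrete, the following is a minimal sketch of a transient-fault injection campaign. It is not the tutorial's framework and does not model any microarchitecture: the toy workload, the single-bit-flip fault model, and all names (`flip_bit`, `run_workload`, `campaign`) are assumptions chosen for illustration. Each run flips one randomly chosen bit in one input value and classifies the outcome against a fault-free ("golden") run as masked, SDC, or crash.

```python
import math
import random
import struct
from collections import Counter


def flip_bit(x, bit):
    """Flip one bit (0..63) in the IEEE-754 double representation of x."""
    (bits,) = struct.unpack("<Q", struct.pack("<d", x))
    (y,) = struct.unpack("<d", struct.pack("<Q", bits ^ (1 << bit)))
    return y


def run_workload(data, fault=None):
    """Toy stand-in for a benchmark: sum of squares.

    fault is a hypothetical (index, bit) pair: one bit of data[index]
    is flipped as the value is read, emulating a transient fault.
    """
    total = 0.0
    for i, v in enumerate(data):
        if fault is not None and fault[0] == i:
            v = flip_bit(v, fault[1])
        total += v * v
    if not math.isfinite(total):
        # Stands in for a detected failure (crash, trap, exception).
        raise ArithmeticError("non-finite result")
    return total


def campaign(data, n_injections, seed=0):
    """Run n_injections single-bit-flip experiments and count outcomes."""
    rng = random.Random(seed)
    golden = run_workload(data)  # fault-free reference output
    outcomes = Counter()
    for _ in range(n_injections):
        fault = (rng.randrange(len(data)), rng.randrange(64))
        try:
            result = run_workload(data, fault)
        except ArithmeticError:
            outcomes["crash"] += 1
            continue
        # Same output as the golden run: the fault was masked.
        # Different output without any error indication: an SDC.
        outcomes["masked" if result == golden else "sdc"] += 1
    return outcomes


if __name__ == "__main__":
    n = 1000
    counts = campaign([1.0, 2.0, 3.0, 4.0], n)
    for kind in ("masked", "sdc", "crash"):
        print(kind, counts[kind] / n)
```

Dividing each count by the number of injections gives an estimated rate per outcome class. In this toy workload, a flip of the sign bit is masked because the value is squared, which illustrates why the share of faults that reach the output depends on the workload as well as on the hardware.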

The tutorial will then shift towards resilience insights from systems at scale. The first part will focus on DRAM. We shall present field data on DRAM reliability from several production systems, spanning multiple DRAM generations. We shall dive into the nature of faults observed in memory and present a case study of how the field data played a key role in driving resilience improvements for High-Bandwidth Memory (HBM) DRAM that are now part of the JEDEC HBM3 standard.

The last part of the tutorial will cover insights from the large-scale effort of a major hyperscaler (Meta) in the detection and diagnosis of hardware-fault-related failures in its fleet of computing machines. This part of the tutorial will focus on how Meta detects and manages hardware issues in its fleet, and will show case studies focused on Silent Data Corruptions. Attendees can learn several diagnostic and debugging techniques used in large-scale data centers, ranging from application-level diagnostics and platform monitoring to building assembly-level reproducers of hard-to-detect issues.

Takeaways

Understanding of hardware fault mechanisms that lead to system failures and silent execution errors. Methodologies for the measurement of error and failure rates using simulation-based approaches as well as in-field, large-scale experimentation. By attending this tutorial, participants will gain a holistic perspective on the vulnerability assessment of computing systems (processors and memories), which subsequently assists circuit, system, and software designers in making effective and efficient mitigation decisions across the abstraction layers.

Organizers/Presenters

Dimitris Gizopoulos, George Papadimitriou, Odysseas Chatzopoulos (University of Athens)

Sudhanva Gurumurthi, Vilas Sridharan, Majed Valad Beigi (AMD)

Harish Dattatraya Dixit, Sriram Sankar (Meta)


Tutorial Attendees

Click here to fill in the attendee form of the tutorial (name, affiliation, email).


Agenda

08:00 - 08:15 - Welcome and Tutorial Outline

  • 08:15 - 08:40 -- Resilience Primer (AMD)
  • 08:40 - 10:00 -- SDCs at Scale (Meta)

10:00 - 10:20 - Coffee Break

  • 10:20 - 11:00 -- SDCs at Scale (Meta)
  • 11:00 - 12:20 -- Microarchitectural modeling (gem5) and assessment: SDCs from the CPU (U Athens)

12:20 - 13:20 - Lunch

  • 13:20 - 14:00 -- Beyond the CPU: DRAM Field Data (AMD)
  • 14:00 - 14:30 -- Beyond the CPU: HBM3 Case Study (AMD)
  • 14:30 - 15:20 -- Beyond the CPU: SDCs from the Accelerators (U Athens)

15:20 - 15:40 - Coffee Break

  • 15:40 - 17:00 -- Open Discussion and Wrap-up (All)

Target audience

This tutorial is designed for researchers, engineers, and practitioners in the fields of computer architecture and systems reliability. Attendees should have a basic understanding of microprocessor architecture/microarchitecture, and the basics of hardware faults and errors.

Short bios

Dimitris Gizopoulos (dgizop@di.uoa.gr) is Professor at the Department of Informatics & Telecommunications of the University of Athens leading the Computer Architecture Lab. The group's research focuses on the dependability, the energy-efficiency, and the performance of computer architectures. Gizopoulos has published more than 190 papers in conferences and journals, has served and is currently serving as Associate Editor for several IEEE and ACM Transactions and Magazines and as member of Program, Organizing and Steering Committees of IEEE and ACM conferences. Gizopoulos is an IEEE Fellow, a Golden Core member of the IEEE Computer Society and a Distinguished ACM member.


George Papadimitriou (georgepap@di.uoa.gr) is a Postdoctoral Researcher in the Department of Informatics & Telecommunications of the University of Athens. His research focuses on dependable and energy-efficient computer architectures, microprocessor reliability, functional correctness of hardware designs, and design validation of microprocessors. He has published more than 40 papers in international conferences and journals. He is an IEEE member.



Odysseas Chatzopoulos (Od.Chatzopoulos@di.uoa.gr) is a PhD student in the Department of Informatics & Telecommunications of the University of Athens. His research focuses on energy-efficient microprocessor design, and dependable computing modeling and assessment.


Sudhanva Gurumurthi (Sudhanva.Gurumurthi@amd.com) is an AMD Fellow, where he leads RAS research and advanced development. Before joining AMD, he was an Associate Professor with tenure in the Computer Science Department at the University of Virginia. He is a recipient of an NSF CAREER Award, a Google Focused Research Award, and an IEEE Computer Society Distinguished Contributor recognition. His publications have received multiple recognitions, including selection to IEEE Micro Top Picks and the ISCA@50 25-Year Retrospective. He has held several leadership roles in computer architecture conferences and journals and currently serves on the Dean's Advisory Council of the College of Science and Engineering at Texas State University. He received his PhD in Computer Science and Engineering from Penn State in 2005.

Vilas Sridharan (Vilas.Sridharan@amd.com) is an AMD Senior Fellow, where he leads the RAS (Reliability, Availability and Serviceability) Architecture team. His research focuses on the modeling of hardware faults and on architectural and microarchitectural approaches to reliability and fault tolerance in high-performance microprocessors. Vilas received his Ph.D. and M.S.E. from the Department of Electrical and Computer Engineering at Northeastern University, and his B.S.E. in Computer Engineering from Princeton University in 2000. From 2000 to 2004, he worked in the SPARC server division at Sun Microsystems.


Majed Valad Beigi (majed.valadbeigi@amd.com) is a senior member of technical staff at AMD where he works in the Reliability, Availability, and Serviceability (RAS) architecture team. His current research focuses on system reliability and fault tolerance, specifically on understanding faults in processor and memory systems to develop novel resilience schemes. Majed received his Ph.D. in Computer Engineering from Northwestern University in 2019.



Harish Dixit (hdd@meta.com) is a Principal Engineer (Release to Production) at Meta. Harish and his team work on reliability, analytics, and performance evaluation for the entire deployed fleet of servers. Harish leads the efforts to deal with silent data corruptions within Meta infrastructure across CPUs, GPUs and ASICs, and has been working across different layers of the stack to mitigate the effects of silent data corruption on production applications. Harish has over 20 patent filings across system architecture and communication domains.



Sriram Sankar (sriramsankar@meta.com) is a Director of Engineering at Meta, where he leads teams that are responsible for the Hardware Health and Availability of the entire Meta Compute/Storage/AI server fleet. Sriram has Master's and PhD degrees from the University of Virginia.


Publications

HPCA 2024 - "gem5-MARVEL: Microarchitecture-Level Resilience Analysis of Heterogeneous SoC Architectures", O. Chatzopoulos, G. Papadimitriou, V. Karakostas, and D. Gizopoulos, IEEE International Symposium on High-Performance Computer Architecture (HPCA 2024), March 2024.


HPCA 2023 - "AVGI: Microarchitecture-Driven, Fast and Accurate Vulnerability Assessment", G. Papadimitriou and D. Gizopoulos, IEEE International Symposium on High-Performance Computer Architecture (HPCA 2023), February 2023.


HPCA 2023 - "A Systematic Study of DDR4 DRAM Faults in the Field", M. V. Beigi, Y. Cao, S. Gurumurthi, C. Recchia, A. Walton, and V. Sridharan, IEEE International Symposium on High-Performance Computer Architecture (HPCA 2023), February 2023.


MICRO 2023 - "Impact of Voltage Scaling on Soft Errors Susceptibility of Multicore Server CPUs", D. Agiakatsikas, G. Papadimitriou, V. Karakostas, D. Gizopoulos, M. Psarakis, C. Belanger-Champagne, and E. Blackmore, IEEE/ACM International Symposium on Microarchitecture (MICRO 2023), October 2023.


IOLTS 2023 - "Silent Data Corruptions: The Stealthy Saboteurs of Digital Integrity", G. Papadimitriou, D. Gizopoulos, H. D. Dixit, and S. Sankar, IEEE International Symposium on On-Line Testing and Robust System Design (IOLTS 2023), July 2023.


TC 2023 - "Silent Data Corruptions: Microarchitectural Perspectives", G. Papadimitriou and D. Gizopoulos, IEEE Transactions on Computers, Volume: 72, Issue: 11, pp. 3072-3085, November 2023.


TETC 2023 - "Anatomy of On-Chip Memory Hardware Fault Effects Across the Layers", G. Papadimitriou and D. Gizopoulos, IEEE Transactions on Emerging Topics in Computing, Volume: 11, Issue: 2, pp. 420-431, June 2023.


SIGARCH Blog 2023 - "Emerging Fault Modes: Challenges and Research Opportunities", Sudhanva Gurumurthi, Vilas Sridharan, and Sankar Gurumurthy, Computer Architecture Today blog, July 17, 2023.


ISPASS 2022 - "gpuFI-4: A Microarchitecture-Level Framework for Assessing the Cross-Layer Resilience of Nvidia GPUs", D. Sartzetakis, G. Papadimitriou, and D. Gizopoulos, IEEE International Symposium on Performance Analysis of Systems and Software (ISPASS 2022), May 2022.


TC 2022 - "Soft Error Effects on Arm Microprocessors: Early Estimations Versus Chip Measurements", P. Bodmann, G. Papadimitriou, R. L. Rech Jr, D. Gizopoulos, and P. Rech, IEEE Transactions on Computers, Volume: 71, Issue: 10, pp. 2358-2369, October 2022.


arXiv 2022 - "Detecting silent data corruptions in the wild", H. Dixit, L. Boyle, G. Vunnam, S. Pendharkar, M. Beadon, S. Sankar, arXiv, March 2022.


ISCA 2021 - "Demystifying the System Vulnerability Stack: Transient Fault Effects Across the Layers", G. Papadimitriou and D. Gizopoulos, ACM/IEEE International Symposium on Computer Architecture (ISCA 2021), June 2021.


CAL 2021 - "HBM3 RAS: Enhancing Resilience at Scale", S. Gurumurthi, K. Lee, M. Jang, V. Sridharan, A. Nygren, Y. Ryu, K. Sohn, T. Kim, and H. Chung, IEEE Computer Architecture Letters (IEEE CAL), 20(2), pages 158-161, October 2021.


arXiv 2021 - "Silent Data Corruptions at Scale", H. D. Dixit, S. Pendharkar, M. Beadon, C. Mason, T. Chakravarthy, B. Muthiah, S. Sankar, arXiv, February 2021.


DSN 2019 - "Demystifying Soft Error Assessment Strategies on ARM CPUs: Microarchitectural Fault Injection vs. Neutron Beam Experiments", A. Chatzidimitriou, P. Bodmann, G. Papadimitriou, D. Gizopoulos, and P. Rech, IEEE/IFIP International Conference on Dependable Systems and Networks (DSN 2019), June 2019.


DSN 2017 - "RT Level vs. Microarchitecture Level Reliability Assessment: Case Study on ARM Cortex-A9 CPU", A. Chatzidimitriou, M. Kaliorakis, D. Gizopoulos, M. Iacaruso, M. Pipponzi, R. Mariani, S. Di Carlo, IEEE/IFIP International Conference on Dependable Systems and Networks (DSN 2017), June 2017.


ASPLOS 2015 - "Memory Errors in Modern Systems: The Good, The Bad, and the Ugly", V. Sridharan, N. DeBardeleben, S. Blanchard, K. Ferreira, J. Stearley, J. Shalf, S. Gurumurthi, ACM Architectural Support for Programming Languages and Operating Systems (ASPLOS 2015), April 2015.


MICRO 2014 - "Calculating Architectural Vulnerability Factors for Spatial Multi-Bit Transient Faults", M. Wilkening, V. Sridharan, D. Kaeli, S. Li, F. Previlon, S. Gurumurthi, IEEE/ACM International Symposium on Microarchitecture (MICRO 2014), December 2014.



Creative Commons License Created by Computer Architecture Lab @ UoA
This work is licensed under a CC License
University of Athens
Dept. of Informatics and Telecommunications

Address:
Panepistimiopolis, Ilissia
Athens, Greece, GR 157 84

Phone:
+30 210 727 5145
Email:
dgizop AT di DOT uoa DOT gr