Microarchitecture-Level Reliability Assessment
for CPUs and GPUs
TUTORIAL-3A: Sunday, March 13, 2016, 9:00am – 5:30pm
Organizers:
Dimitris Gizopoulos | U. Athens |
Ramon Canal | UPC Barcelona |
Presenters:
Dimitris Gizopoulos | U. Athens |
Ramon Canal | UPC Barcelona |
Cameron McNairy | Intel |
Arijit Biswas | Intel |
Vilas Sridharan | AMD |
David Kaeli | Northeastern U. |
Summary
Early assessment of the reliability of microprocessor components can drive informed decisions about how to protect them against transient, intermittent, and permanent hardware faults. Microarchitecture-level simulators are employed for such early assessments and can deliver fast and accurate system-level reliability reports that take into account the masking effects of the full hardware and software stack. This tutorial focuses on microarchitecture-level techniques for reliability measurement on modern CPUs and GPUs. AVF (Architectural Vulnerability Factor) and FIT (Failures in Time) estimates for microprocessor hardware components and software workloads can be produced either by ACE-based (Architecturally Correct Execution) methods or by statistical fault injection. The tutorial will discuss the pros and cons of both approaches in terms of the accuracy and throughput of the reliability estimates, as well as techniques that improve both. Reliability assessment tools based on publicly available simulators (such as Gem5, MARSS, GPGPUsim, and Multi2Sim) will be presented, along with the current practice at Intel and AMD for reliability assessment and protection of CPUs, GPUs, and co-processors.
The target audience of the tutorial includes researchers and practitioners interested in microprocessor reliability assessment at the early design stages. A basic understanding of microarchitecture, reliability, and fault tolerance terminology and techniques is required.
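As a rough illustration of the statistical fault injection route to AVF and FIT mentioned above, the following Python sketch estimates an AVF from a set of injection outcomes, attaches a normal-approximation margin of error, and scales the result to a FIT value. It is a minimal sketch, not the method of any tool presented in the tutorial: the outcome labels, the campaign size, the structure size, and the raw per-bit FIT rate are all hypothetical values chosen for the example.

```python
import math


def estimate_avf(outcomes):
    """AVF estimate: fraction of injected faults that were NOT masked."""
    failures = sum(1 for outcome in outcomes if outcome != "masked")
    return failures / len(outcomes)


def margin_of_error(avf, n, z=1.96):
    """Half-width of the normal-approximation confidence interval.

    z = 1.96 corresponds to a 95% confidence level.
    """
    return z * math.sqrt(avf * (1.0 - avf) / n)


def structure_fit(avf, bits, raw_fit_per_bit):
    """FIT of a structure = AVF x number of bits x raw FIT rate per bit."""
    return avf * bits * raw_fit_per_bit


# Hypothetical campaign: 2000 single-bit injections into one structure.
# The labels ("masked", "sdc", "crash") are assumptions for this example.
outcomes = ["masked"] * 1600 + ["sdc"] * 300 + ["crash"] * 100

avf = estimate_avf(outcomes)
margin = margin_of_error(avf, len(outcomes))

# Hypothetical structure: a 32 KB array with an assumed raw rate of
# 1e-4 FIT per bit (an illustrative number, not a measured one).
bits = 32 * 1024 * 8
fit = structure_fit(avf, bits, raw_fit_per_bit=1e-4)

print(f"AVF = {avf:.3f} +/- {margin:.3f}")
print(f"FIT = {fit:.3f}")
```

The margin of error shrinks with the square root of the number of injections, which is one reason fault injection throughput matters: halving the margin requires roughly four times as many simulated runs.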
Program
09:00 – 12:30 Morning Session (10:30 – 11:00 Coffee Break)
Introduction | Gizopoulos, Canal |
Microarchitecture-Level Reliability Assessment for x86 and ARM CPUs on Gem5 and MARSS | Gizopoulos |
Intel’s Practice in Microarchitecture-Level Reliability Assessment for CPUs/GPUs and Co-processors | McNairy, Biswas |
Hierarchical Reliability Assessment: From technology and RTL modules to architectures | Canal |
14:00 – 17:30 Afternoon Session (15:30 – 16:00 Coffee Break)
GPUs Microarchitecture-Level Reliability Assessment on Multi2Sim and GPGPUsim | Kaeli, Gizopoulos |
SER Modeling, Analysis, and Remediation in AMD High-Performance Microprocessors | Sridharan |
Discussion – Closing | Gizopoulos, Canal |