Microarchitecture-Level Reliability Assessment for CPUs and GPUs

 

TUTORIAL-3A: Sunday March 13, 2016, 9:00am – 5:30pm

Organizers:

Dimitris Gizopoulos, U. Athens
Ramon Canal, UPC Barcelona

Presenters:

Dimitris Gizopoulos, U. Athens
Ramon Canal, UPC Barcelona
Cameron McNairy, Intel
Arijit Biswas, Intel
Vilas Sridharan, AMD
David Kaeli, Northeastern U.

Summary

Early assessment of the reliability of microprocessor components can drive informed decisions about protecting them against transient, intermittent, and permanent hardware faults. Microarchitecture-level simulators are employed for such early assessments and can deliver fast and accurate system-level reliability reports that take into account the masking effects of the full hardware and software layers. This tutorial focuses on microarchitecture-level techniques for reliability measurement on modern CPUs and GPUs. Architectural Vulnerability Factor (AVF) and Failures in Time (FIT) estimates for microprocessor hardware components and software workloads can be obtained through either ACE-based (Architecturally Correct Execution) analysis or statistical fault injection. The tutorial will discuss the pros and cons of both approaches in terms of the accuracy and throughput of the reliability estimates, as well as techniques that improve both. Reliability assessment tools based on publicly available simulators (such as Gem5, MARSS, GPGPUsim, and Multi2Sim) will be presented, along with the current practice at Intel and AMD for reliability assessment and protection of CPUs, GPUs, and co-processors.

The target audience of the tutorial includes researchers and practitioners interested in microprocessor reliability assessment at the early design stages. A basic understanding of microarchitecture, reliability, and fault tolerance terminology and techniques is required.

Program

 

09:00 – 12:30 Morning Session (10:30 – 11:00 Coffee Break)

Introduction (Gizopoulos, Canal)
Microarchitecture-Level Reliability Assessment for x86 and ARM CPUs on Gem5 and MARSS (Gizopoulos)
Intel’s Practice in Microarchitecture-Level Reliability Assessment for CPUs/GPUs and Co-processors (McNairy, Biswas)
Hierarchical Reliability Assessment: From Technology and RTL Modules to Architectures (Canal)

14:00 – 17:30 Afternoon Session (15:30 – 16:00 Coffee Break)

GPUs Microarchitecture-Level Reliability Assessment on Multi2Sim and GPGPUsim (Kaeli, Gizopoulos)
SER Modeling, Analysis, and Remediation in AMD High-Performance Microprocessors (Sridharan)
Discussion – Closing (Gizopoulos, Canal)