Advanced Topics in Computer Architecture
Talk Title: Toward Exascale Resilience
Speaker: Mattan Erez, Associate Professor, University of Texas at Austin, USA
The march toward supercomputers that reach exaflop performance, and toward scientific applications that scale to utilize them, continues at a steady pace. Along the way, resilience concerns are growing in importance and prominence. In this short course I will give an overview of different fault, error, and failure modes, and discuss their importance along with some common and widely diverging estimates of their expected rates. I will then discuss potential solutions, ranging from incremental to more revolutionary. I will describe the memory system in detail, as it plays a major role in current resilient platforms, and then describe various proposed techniques that span multiple system layers, from hardware to runtimes to the programmer, algorithms, and tools. This short course will be research-oriented in nature, as this is a relatively new and rapidly evolving topic.
Mattan Erez is an Associate Professor in the Department of Electrical and Computer Engineering at the University of Texas at Austin. His research focuses on improving the performance, efficiency, and scalability of computing systems through advances in hardware architecture, software systems, and programming models. His vision is to increase cooperation across system layers and to develop flexible and adaptive mechanisms for proportional resource usage. Mattan received a B.Sc. in Electrical Engineering and a B.A. in Physics from the Technion, Israel Institute of Technology, and his M.S. and Ph.D. in Electrical Engineering from Stanford University. He is a recipient of a 2012 Presidential Early Career Award for Scientists and Engineers (PECASE; awarded 2014), a 2012 Early Career Research Award from the Department of Energy, and a 2010 NSF CAREER Award.