CDS Seminar (17 December 2019): Erasure Coded Computations
Tue, 17 Dec 19 @ 11:00 AM -- 12:30 PM
Department of Computational and Data Sciences
SPEAKER: Prof. Ananth Grama
TITLE: “Erasure Coded Computations”
Venue: #102, CDS Seminar Hall
Dealing with hardware and software faults is an important problem as parallel and distributed systems scale to millions of processing cores and across wide-area networks. Traditional methods for dealing with faults include checkpoint-restart, active replicas, and deterministic replay; these techniques carry significant resource overheads and constraints.

We propose a completely new approach to fault-tolerant computation using an analog of a technique commonly used in storage: erasure coding. In storage, an encoding-decoding pair corresponds to an identity function, i.e., the decoded form of erasure-coded data must be identical to the original data even in the presence of (a bounded number of) erasures. In the computational analog, the decoded form of encoded data on which an algorithm has been executed must be identical to the result of the algorithm on the original data, even in the presence of processor failures. Stated otherwise, erasure-coded computations apply a minimally modified algorithm to a suitably augmented (coded) input to produce an augmented output. The execution of such an algorithm proceeds completely oblivious to faults in the system; in the event of one or more faults, the true solution is recovered from the augmented output using a rapid reconstruction (decoding) method.

We demonstrate this approach on two common problems: solving sparse linear systems and solving eigenvalue problems. We present input-augmentation and output-recovery techniques for both. Through detailed experiments, we show that our approach can be made oblivious to a large number of faults with very low computational overhead. Specifically, we demonstrate typical cases in which a single fault can be corrected with less than 10% overhead in time and space, and even in extreme cases (fault rates of 20%) our approach computes a solution with reasonable overhead.
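To make the encode-compute-decode idea concrete, here is a minimal NumPy sketch of a coded linear computation. It is not the speaker's actual method for sparse solvers or eigenvalue problems, only a toy illustration on a distributed matrix-vector product: the input rows are augmented with coded rows via a Vandermonde generator matrix, each worker computes obliviously on its (possibly coded) row, and the true result is recovered exactly from any surviving subset of outputs. All variable names and the specific failure pattern are illustrative assumptions.

```python
import numpy as np

rng = np.random.default_rng(0)

# Toy "computation": a distributed matrix-vector product y = A @ x,
# where each of k workers owns one row of A.
k, r, n = 4, 2, 8                  # k data workers, r redundant workers
A = rng.standard_normal((k, n))
x = rng.standard_normal(n)

# Input augmentation: stack r extra coded rows onto A. V is a
# (k+r) x k Vandermonde generator matrix; with distinct nodes,
# any k of its rows are linearly independent.
V = np.vander(np.arange(1, k + r + 1), k, increasing=True).astype(float)
A_coded = V @ A                    # augmented (coded) input, one row per worker

# Each worker runs the SAME algorithm on its row, oblivious to faults:
# worker i computes y_coded[i] = A_coded[i] @ x.
y_coded = A_coded @ x              # augmented output, length k + r

# Suppose up to r workers fail; their outputs are erased.
failed = {1, 4}
alive = [i for i in range(k + r) if i not in failed][:k]

# Output recovery (decoding): the surviving outputs satisfy
# V[alive] @ y = y_coded[alive], a small k x k solve.
y_rec = np.linalg.solve(V[alive], y_coded[alive])

assert np.allclose(y_rec, A @ x)   # recovered result is exact
```

The key property mirrors the storage analog: as long as at most r of the k + r workers fail, the decode step reconstructs exactly the result the fault-free computation would have produced, with no checkpointing or replay.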
These results represent a significant improvement over the state of the art, and they open significant new opportunities for programming models at scale that do not enforce synchrony or require periodic checkpoints/reductions to persistent storage.
This work is in collaboration with David Gleich (Purdue), Ahmed Sameh (Purdue), Yao Zhu (Bloomberg), and Xuejiao Kang (Facebook).
Ananth Grama is the Samuel Conte Professor of Computer Science at Purdue University. He works in the broad areas of parallel and distributed computing, scientific computing, and large-scale data analytics. He received his PhD from the University of Minnesota in 1996 and has been at Purdue since then. He served as Director of the Computational Science and Engineering and Computational Life Sciences programs at Purdue from 2012 to 2016, and chaired the Biodata Management and Analysis (BDMA) Study Section of the National Institutes of Health from 2012 to 2014. He is a recipient of the National Science Foundation CAREER award (1998) and the Purdue University Faculty Scholar Award (2002-07), a Fellow of the American Association for the Advancement of Science (2013), and a Distinguished Alumnus of the University of Minnesota (2016).
Host Faculty: Prof. Sashikumaar Ganesan