Ph.D. Thesis Colloquium: 102: CDS: 18 May 2026 “Leveraging Replication for Integration of Fault Tolerance and High Performance in MPI Applications”

When

18 May 2026
3:00 PM - 4:00 PM

Event Type

DEPARTMENT OF COMPUTATIONAL AND DATA SCIENCES
Ph.D. Thesis Colloquium


Speaker: Mr. Sarthak Joshi
S.R. Number: 06-18-01-10-12-21-1-19304
Title: Leveraging Replication for Integration of Fault Tolerance and High Performance in MPI Applications
Research Supervisor: Prof. Sathish Vadhiyar
Date & Time : May 18, 2026 (Monday), 03:00 PM
Venue : #102, CDS Seminar Hall


ABSTRACT
Faults in high-performance computing systems are expected to be very frequent in the current exascale era. In this thesis, we present FTHP-MPI (Fault Tolerant and High Performance MPI), a fault-tolerant MPI library that augments checkpoint/restart with replication to provide resilience against failures. Our library is designed to add fault tolerance on top of a native MPI library that does not itself support fault tolerance. This enables users to incorporate fault tolerance into an MPI application transparently and seamlessly while still using the high-performance communication implementations of the underlying native MPI library. We have implemented efficient parallel communication techniques that involve replicas. Our framework addresses the unique challenges of integrating checkpointing with replication, including handling differences in the number of processes between checkpoint creation and restart, and implementing transparent checkpointing and replication while handling differences and overlaps in virtual memory addresses across processes. We have extensively tested our library under both failure-free and failure conditions using real executions of three benchmark applications, namely HPCG, the CloverLeaf mini-application, and a Particle-In-Cell simulation, scaling up to 16K processes. We have observed that under failure-free conditions, our library incurs only a 3-8% overhead due to the presence of replicas when the number of computational processes is the same. We simulated failure conditions by killing processes either randomly, at intervals drawn from a Weibull distribution, or according to failure logs from the Tsubame-3 supercomputing system. We show that our library can outperform pure checkpointing by 13-19% with the same number of processes in both cases, even when half the processes do redundant work.
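The Weibull-based failure injection mentioned above can be illustrated with a minimal sketch. This is not the thesis's actual test harness: the function name `weibull_failure_times` and all parameter values are hypothetical, chosen only to show how failure times might be drawn so that a separate driver could kill a process at each one.

```python
import random


def weibull_failure_times(scale_s, shape, horizon_s, seed=0):
    """Return simulated failure times (seconds) within [0, horizon_s).

    Inter-failure gaps are drawn from a Weibull distribution with the
    given scale (seconds) and shape. All values are illustrative
    assumptions, not parameters taken from the thesis.
    """
    rng = random.Random(seed)  # seeded so a failure schedule is reproducible
    t = 0.0
    times = []
    while True:
        # random.weibullvariate(alpha, beta): alpha is scale, beta is shape
        t += rng.weibullvariate(scale_s, shape)
        if t >= horizon_s:
            return times
        times.append(t)


if __name__ == "__main__":
    # Hypothetical one-day run with a one-hour scale parameter.
    schedule = weibull_failure_times(scale_s=3600.0, shape=0.7, horizon_s=86400.0)
    print(len(schedule), "failures scheduled")
```

A shape parameter below 1 gives a decreasing hazard rate, so failures tend to cluster shortly after a previous failure; the value 0.7 here is only a placeholder.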

We have also developed mechanisms in our framework for adaptive partial replication, which mitigates the efficiency loss caused by dedicating additional resources to replication. Our framework uses failure prediction to protect a large number of processes with only a small number of replicas. However, imperfect failure predictors can result in application interruptions, necessitating frequent checkpointing. In the first such work, we have developed a formulation for the optimal checkpointing interval that takes into account the recall of the failure predictor and the replication degree, in order to minimize checkpointing overheads. We have tested adaptive replication using a variety of simulated failure predictors with different precision and recall values. We observe that adaptive replication outperforms checkpointing and full replication by 56.7% and 27.6%, respectively, even with relatively poor failure predictors with 50% precision and recall. We have also developed a simulation framework to project the impact of replication-based fault tolerance on modern exascale systems. We show that, at the MTBF values observed in modern exascale systems, replication without any failure prediction is not a viable approach, but adaptive replication can still outperform pure checkpointing in large-scale executions even with low recall. Furthermore, replication-based approaches become more suitable than pure checkpointing as MTBF values decrease further, and adaptive replication implementations can also benefit from the reductions in checkpoint overhead brought by continuing advances in checkpointing.
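To give a feel for how predictor recall can enter a checkpointing interval, the sketch below uses the classical first-order Young/Daly approximation, interval = sqrt(2 × checkpoint cost × MTBF), as a stand-in. It is not the formulation developed in the thesis, which also accounts for the replication degree. The function names and the simplifying assumption that every correctly predicted failure is absorbed without interrupting the application (so that only the unpredicted fraction, 1 − recall, shortens the effective MTBF) are introduced here purely for illustration.

```python
import math


def young_daly_interval(checkpoint_cost_s, mtbf_s):
    """First-order Young/Daly checkpoint interval: sqrt(2 * C * MTBF)."""
    return math.sqrt(2.0 * checkpoint_cost_s * mtbf_s)


def interval_with_recall(checkpoint_cost_s, mtbf_s, recall):
    """Illustrative interval when a predictor catches a fraction `recall`
    of failures.

    Assumption (not from the thesis): predicted failures cause no
    interruption, so only unpredicted failures count and the effective
    MTBF becomes mtbf_s / (1 - recall).
    """
    if not 0.0 <= recall <= 1.0:
        raise ValueError("recall must lie in [0, 1]")
    if recall == 1.0:
        return math.inf  # no unpredicted failures under this assumption
    effective_mtbf_s = mtbf_s / (1.0 - recall)
    return young_daly_interval(checkpoint_cost_s, effective_mtbf_s)


if __name__ == "__main__":
    # Hypothetical values: 60 s checkpoint cost, 1 h MTBF.
    for r in (0.0, 0.5, 0.9):
        print(r, round(interval_with_recall(60.0, 3600.0, r), 1))
```

Under this simplified model, higher recall lengthens the interval and so reduces the number of checkpoints taken, which is the qualitative effect the paragraph above describes.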

Finally, we have also implemented a replication-based fault tolerance framework for MPI-IO, the first of its kind. Our approach uses the replicas as additional resources that share the I/O work and improve efficiency while still providing fault tolerance. We observe that in a setup with a significant share of I/O, our fault-tolerant library with replication can reduce the execution time by 44-47%.


ALL ARE WELCOME