BEGIN:VCALENDAR
VERSION:2.0
PRODID:-//wp-events-plugin.com//7.3.5//EN
TZID:Asia/Kolkata
X-WR-TIMEZONE:Asia/Kolkata
BEGIN:VEVENT
UID:192@cds.iisc.ac.in
DTSTART;TZID=Asia/Kolkata:20260518T150000
DTEND;TZID=Asia/Kolkata:20260518T160000
DTSTAMP:20260506T151653Z
URL:https://cds.iisc.ac.in/events/ph-d-thesis-colloquium-102-cds-18-may-20
 26-leveraging-replication-for-integration-of-fault-tolerance-and-high-perf
 ormance-in-mpi-applications/
SUMMARY:Ph.D. Thesis Colloquium: 102: CDS: 18 May 2026 "Leveraging Repli
 cation for Integration of Fault Tolerance and High Performance in MPI Appl
 ications"
DESCRIPTION:DEPARTMENT OF COMPUTATIONAL AND DATA SCIENCES\nPh.D. Thesis Col
 loquium\n\n\n\nSpeaker: Mr. Sarthak Joshi\nS.R. Number: 06-18-01-10-12-21-
 1-19304\nTitle: Leveraging Replication for Integration of Fault Tolerance 
 and High Performance in MPI Applications\nResearch Supervisor: Prof. Sathi
 sh Vadhiyar\nDate & Time : May 18\, 2026 (Monday)\, 03:00 PM\nVenue :
  #102\, CDS Seminar Hall\n\n\n\nABSTRACT\nFaults in high-performance syste
 ms are expected to be very frequent in the current exascale computing era.
  In this thesis\, we have defined FTHP-MPI (Fault Tolerant and High Perfor
 mance MPI)\, a fault-tolerant MPI library that augments checkpoint/restart
  with replication to provide resilience from failures. Our library is desi
 gned to provide fault tolerance in a native MPI library that does not prov
 ide support for fault tolerance. This enables users to transparently and s
 eamlessly implement fault tolerance into an MPI application while still ut
 ilizing the underlying high performance communication implementations of t
 he native MPI library. We have implemented efficient parallel communicatio
 n techniques that involve replicas. Our framework deals with the unique ch
 allenges of integrating support for checkpointing and replication includin
 g handling the differences in the number of processes at the point of chec
 kpoint creation and restart and implementing transparent checkpointing and
  replication while handling differences and overlaps in virtual memory add
 resses across the processes. We have extensively tested our library under 
 both failure-free and failure conditions using real executions with three 
 benchmark applications\, namely\, HPCG\, CloverLeaf mini-application and t
 he Particle-In-Cell Simulation\, scaling up to 16K processes. We have obse
 rved that under failure-free conditions\, our library only incurs a 3-8% o
 verhead due to the presence of replicas when the number of computational p
 rocesses is the same. We simulated failure conditions by killing processes
  either randomly at intervals based on a Weibull distribution or using sup
 ercomputing logs from the Tsubame-3 supercomputing system. We show that
  our l
 ibrary can outperform pure checkpointing by 13-19% with the same number of
  processes across both cases\, even when half the processes do redundant w
 ork.\n\nWe have also developed mechanisms in our framework to provide adap
 tive partial replication to mitigate the efficiency loss due to employing 
 additional resources for replication. Our framework utilizes failure predi
 ction to protect a large number of processes using only a small number of 
 replicas. However\, imperfect failure predictors can result in application
  interruptions\, necessitating frequent checkpointing. In a first such wor
 k\, we have developed a formulation for the optimal checkpointing interval
 \, t
 aking into account the recall metric of the failure predictor and the repl
 ication degree\, to minimize the checkpointing overheads. We have tested a
 daptive replication using a variety of simulated failure predictors with d
 ifferent precision and recall values. We observe that adaptive replication
  outperforms checkpointing and full replication by 56.7% and 27.6%\, respe
 ctively\, even with relatively poor failure predictors with 50% precision
  and
  recall. We also developed a simulation framework to project the impact of
  replication-based fault tolerance for modern exascale systems. We show th
 at\, at the MTBF values observed in modern exascale systems\, while replic
 ation without any failure predictions is not a viable approach\, adaptive 
 replication can still outperform pure checkpointing even with a low recall
  when running large-scale executions. Furthermore\, replication-based appr
 oaches become more suitable for fault tolerance when compared to pure chec
 kpointing as MTBF values are expected to decrease further\, and reductions
  in checkpoint overheads by continuous advancements in checkpointing can a
 lso be leveraged by adaptive replication implementations.\n\nFinally\, we 
 also implemented a replication-based fault tolerance framework in MPI-IO\,
  t
 he first of its kind. Our approach utilizes the replicas as additional res
 ources that can share the work and improve the efficiency while still prov
 iding fault tolerance. We observe that in a setup with a significant share
  of I/O\, our fault-tolerant library with replication can reduce the execu
 tion time by 44-47%.\n\n\n\nALL ARE WELCOME
CATEGORIES:Events,Ph.D. Thesis Colloquium
END:VEVENT
BEGIN:VTIMEZONE
TZID:Asia/Kolkata
X-LIC-LOCATION:Asia/Kolkata
BEGIN:STANDARD
DTSTART:20250518T150000
TZOFFSETFROM:+0530
TZOFFSETTO:+0530
TZNAME:IST
END:STANDARD
END:VTIMEZONE
END:VCALENDAR