Our
lab, the Middleware and Runtime Systems (MARS) lab, focuses on
building middleware and runtime systems for parallel applications and
systems.
Runtime Systems / Application Frameworks
Our
lab works on building runtime systems for HPC applications on both
accelerator-based and general HPC systems. We primarily focus on irregular
applications including graph applications, N-Body simulations,
Molecular Dynamics (MD), and Adaptive Mesh Refinement (AMR)
applications. We have also worked with applications in climate science
and visualization in collaboration with researchers working in these
areas.
Runtime Strategies and Programming
Models on GPU systems:
This research develops runtime strategies, including hybrid asynchronous
execution of applications on both CPU and GPU cores for their effective use,
dynamic scheduling, load balancing of computations within the GPUs, and data
layout optimizations for both graph-based and scientific applications.
- We have developed bin-packing based load balancing on GPUs, a
knapsack formulation of asynchronous executions on CPUs and GPUs, and
kernel optimizations for AMR applications (a minimal bin-packing sketch
appears after this list).
- Developed dynamic load balancing strategies for
graph-based applications including BFS and SSSP.
- Developed an algorithm for hybrid executions of
betweenness centrality on both CPU and GPU cores.
- In our work on programming models, the aim is to address
challenges that arise from executing different programming models on
GPU systems. Our recent work develops user abstractions and
runtime strategies for efficient execution of asynchronous
message-passing applications written in Charm++ on GPUs. We have developed
runtime strategies for both regular applications such as matrix
computations and irregular applications such as N-Body and molecular
dynamics. This work will be extended to include other
programming models.
- Developed HyPar, a runtime framework that uses a
divide-and-conquer model for graph applications. This model has been
applied to applications including Boruvka's MST, graph coloring,
triangle counting, community detection, and connected components,
providing significant benefits over traditional BSP approaches at large scale.
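Below is a minimal sketch of the bin-packing idea referenced in the first item above: work units with estimated costs (e.g., AMR patches) are packed into per-thread-block bins using a first-fit-decreasing heuristic. The function name, cost estimates, and capacity parameter are illustrative assumptions, not the lab's actual implementation.

```python
# Minimal sketch of first-fit-decreasing bin packing for GPU load balancing.
# Each "item" is a work unit (e.g., an AMR patch or a high-degree vertex's
# edge list) with an estimated cost; each "bin" corresponds to the work
# assigned to one thread block. Names and the capacity value are
# illustrative, not the lab's actual implementation.

def pack_work_items(costs, bin_capacity):
    """Assign items (given by cost) to bins using first-fit decreasing."""
    order = sorted(range(len(costs)), key=lambda i: costs[i], reverse=True)
    bins = []  # each bin is [remaining_capacity, [item indices]]
    for i in order:
        for b in bins:
            if b[0] >= costs[i]:
                b[0] -= costs[i]
                b[1].append(i)
                break
        else:
            # open a new bin (a new thread block's work list)
            bins.append([bin_capacity - costs[i], [i]])
    return [b[1] for b in bins]

if __name__ == "__main__":
    # per-item cost estimates, e.g., number of cells per AMR patch (made up)
    costs = [700, 350, 300, 250, 200, 150, 100, 50]
    blocks = pack_work_items(costs, bin_capacity=1000)
    for k, items in enumerate(blocks):
        print(f"thread block {k}: items {items}, "
              f"load {sum(costs[i] for i in items)}")
```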
Performance Modeling, Scalability,
Mapping of Applications on Large-Scale Systems:
This research focuses on
performance modeling, scalability studies and processor allocation of
large applications on large systems, and mapping and
remapping/rescheduling strategies on HPC network topologies.
- We have developed processor allocation, mapping, and
reallocation strategies for simultaneous executions of nested
simulations in weather modeling applications that track dynamically
varying weather phenomena such as cyclones and rain clouds.
- Developed techniques that match application signatures to predict
the performance of large-scale runs from small-scale runs (a simplified
extrapolation sketch appears at the end of this subsection).
We plan to extend our performance modeling of large-scale runs to
automatically identify and correct scalability bugs.
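The sketch below illustrates, in heavily simplified form, the idea of predicting large-scale performance from small-scale runs. It fits a generic scaling model to a few made-up small-scale timings and extrapolates; the lab's actual technique is based on matching application signatures, which this stand-in does not attempt to reproduce.

```python
# A simplified illustration of extrapolating large-scale performance from
# small-scale runs: fit a generic model T(p) = a/p + b*log2(p) + c to a few
# small-scale timings and extrapolate. This is a stand-in, not the lab's
# signature-matching technique; the timings below are made up.
import numpy as np

def fit_scaling_model(procs, times):
    """Least-squares fit of T(p) = a/p + b*log2(p) + c."""
    p = np.asarray(procs, dtype=float)
    A = np.column_stack([1.0 / p, np.log2(p), np.ones_like(p)])
    coeffs, *_ = np.linalg.lstsq(A, np.asarray(times, dtype=float), rcond=None)
    return coeffs

def predict_time(coeffs, p):
    a, b, c = coeffs
    return a / p + b * np.log2(p) + c

# small-scale runs: (processor counts, measured times in seconds) -- made up
small_runs = ([16, 32, 64, 128], [130.0, 68.0, 38.5, 24.0])
coeffs = fit_scaling_model(*small_runs)
for p in (1024, 4096, 16384):
    print(f"predicted time on {p} processors: {predict_time(coeffs, p):.1f} s")
```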
Middleware
Middleware
is another primary research field in our lab. This includes middleware
for supercomputer jobs, grid middleware and fault tolerance for
parallel applications.
Middleware for Supercomputers, HPC Grids:
Batch systems and queues are used in many production and research
supercomputer systems. Our research builds middleware frameworks that
interface between the users and the batch queues and systems. The
middleware includes prediction techniques that estimate the queue waiting
times and execution times of the parallel jobs submitted to the batch
queues, and scheduling strategies that use these predictions to assign
the appropriate batch queue and number of processors for job execution,
with the aim of reducing turnaround times for users and increasing the
throughput of the system.
- We have developed techniques for predicting jobs that have
short queue waiting times (quick starters).
- Extended the work to predict queue waiting times for all
classes of jobs based on the history of job submissions.
- Also developed strategies to predict ranges of execution
times based on previous job submissions by the user and the loads on
the system.
- We developed methods that automatically use these
predictions for job molding (changing the processor request size) and
delayed submissions (see the sketch after this list).
- We have also done work on middleware for metascheduling
HPC jobs in a grid of supercomputers under dynamic electricity markets.
The middleware uses queue waiting time predictions to estimate the
execution periods of jobs on the different supercomputers of the grid,
considers electricity price variations at the supercomputer sites
during those periods, and submits/migrates jobs to the supercomputers
predicted to have the lowest electricity costs during the predicted
period and the lowest response times.
- We have also worked on strategies for automatically deciding
the best queue configuration for a system based on its usage history.
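The following is a hedged sketch of how queue waiting time and execution time predictions can drive job molding: the processor request size is chosen to minimize the predicted turnaround time. The predictor functions below are toy stand-ins and are not the lab's history-based prediction models.

```python
# A sketch of prediction-driven job molding: given (hypothetical) predictors
# for queue waiting time and execution time as functions of the requested
# processor count, choose the request size that minimizes predicted
# turnaround time. The predictors are toy models, not the lab's
# history-based predictors.

def predicted_wait(procs):
    # toy model: larger requests tend to wait longer in the queue
    return 120.0 + 2.5 * procs

def predicted_runtime(procs, work=1.0e6, serial_fraction=0.05):
    # toy Amdahl-style model of execution time
    return work * (serial_fraction + (1.0 - serial_fraction) / procs)

def mold_request(candidate_sizes):
    """Pick the processor request minimizing predicted turnaround time."""
    return min(candidate_sizes,
               key=lambda p: predicted_wait(p) + predicted_runtime(p))

if __name__ == "__main__":
    sizes = [64, 128, 256, 512, 1024]
    for p in sizes:
        total = predicted_wait(p) + predicted_runtime(p)
        print(f"{p:5d} procs: predicted turnaround {total:10.1f} s")
    print("molded request:", mold_request(sizes), "processors")
```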
Fault Tolerance:
Our lab has investigated the use
of replication for fault tolerance. The novelty is that instead of
replicating all the processes, thereby resulting in only about 50%
application efficiency in the presence of failures, our methods
replicate a small subset of processes (typically, less than 1%) based
on failure predictions. We demonstrated the effectiveness of this
strategy for current petascale and future exascale systems. Our
research has also built an MPI library that uses this partial replication
technique.
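A minimal sketch of the selection step in partial replication follows: given per-process failure probability predictions, only the small fraction of processes predicted to be most at risk is replicated. The predictor values and the 1% budget below are illustrative assumptions.

```python
# Minimal sketch of selecting processes for partial replication: replicate
# only the top fraction of processes by predicted failure probability,
# instead of replicating everything. The probabilities and the 1% budget
# are illustrative assumptions, not the lab's failure predictor.
import random

def select_replicas(failure_prob, fraction=0.01):
    """Return the ranks to replicate: the top `fraction` by predicted risk."""
    n = len(failure_prob)
    budget = max(1, int(n * fraction))
    ranked = sorted(range(n), key=lambda r: failure_prob[r], reverse=True)
    return set(ranked[:budget])

if __name__ == "__main__":
    random.seed(0)
    nprocs = 1000
    # assumed output of a failure predictor (e.g., from system health logs)
    probs = [random.betavariate(1, 50) for _ in range(nprocs)]
    replicas = select_replicas(probs, fraction=0.01)
    print(f"replicating {len(replicas)} of {nprocs} processes:",
          sorted(replicas))
```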
Acceleration of Scientific Applications and Solvers
Our
lab has long-standing research on providing high-performance solutions
for climate modeling applications.
- We started by providing
solutions for dynamic executions of climate modeling applications on a
grid of supercomputers.
- We also worked on developing adaptive frameworks for
simulation and online remote visualization of critical climate
applications in resource-constrained environments.
- We then developed solutions for accelerating specific
time-consuming phases in atmosphere and ocean models on accelerators
including GPUs and Xeon Phis. Our work on Intel Xeon Phis was part of
the first Intel Parallel Computing Centre (IPCC) in India.
- We then developed efficient I/O methodologies, including the use
of collective I/O and selective writing strategies, in ocean models.
- As part of a DST project, we are currently working
on comprehensive HPC solutions, including hybrid CPU-GPU executions and
large-scale data handling, for a fine-resolution ocean model.
Our
lab has also developed pipelined preconditioned conjugate gradient (CG)
methods for distributed memory systems. This was extended to other
iterative solvers including BiCGStab, MinRes, and CGR.
Pipelining was also developed for hybrid and asynchronous CPU-GPU
executions of the solvers.
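For illustration, below is a serial NumPy sketch of a pipelined preconditioned CG recurrence in the style of Ghysels and Vanroose; it is not the lab's distributed implementation. The benefit of pipelining is that the two dot products (global reductions) per iteration can be overlapped with the matrix-vector product and the preconditioner application; in this serial sketch they simply execute in sequence.

```python
# Serial NumPy sketch of pipelined preconditioned CG (Ghysels-Vanroose style).
# In distributed-memory versions, the two reductions (gamma, delta) overlap
# with the SpMV and preconditioner application below.
import numpy as np

def pipelined_pcg(A, b, M_inv, tol=1e-8, maxiter=500):
    n = len(b)
    x = np.zeros(n)
    r = b - A @ x
    u = M_inv @ r
    w = A @ u
    z = q = s = p = np.zeros(n)
    alpha = gamma_old = 1.0
    for i in range(maxiter):
        gamma = r @ u                    # reduction 1
        delta = w @ u                    # reduction 2
        m = M_inv @ w                    # preconditioner (overlaps reductions)
        nvec = A @ m                     # SpMV (overlaps reductions)
        beta = 0.0 if i == 0 else gamma / gamma_old
        alpha = gamma / (delta - beta * gamma / alpha) if i else gamma / delta
        z = nvec + beta * z
        q = m + beta * q
        s = w + beta * s
        p = u + beta * p
        x = x + alpha * p
        r = r - alpha * s
        u = u - alpha * q
        w = w - alpha * z
        gamma_old = gamma
        if np.linalg.norm(r) < tol * np.linalg.norm(b):
            break
    return x, i + 1

if __name__ == "__main__":
    n = 100                              # 1D Laplacian test matrix (SPD)
    A = (np.diag(2.0 * np.ones(n)) + np.diag(-np.ones(n - 1), 1)
         + np.diag(-np.ones(n - 1), -1))
    b = np.ones(n)
    M_inv = np.diag(1.0 / np.diag(A))    # Jacobi preconditioner
    x, iters = pipelined_pcg(A, b, M_inv)
    print(f"converged in {iters} iterations, residual "
          f"{np.linalg.norm(b - A @ x):.2e}")
```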
High Performance Machine Learning and
Data Science
Our
lab has been developing communication-minimizing and communication-avoiding
strategies for machine learning frameworks. We also explore novel
models of parallelism for machine learning applications.
- We developed three
methods for fast learning of Knowledge Graph Embeddings (KGEs) at
scale. These include reducing Allgather communication by sparsifying
SGM, a variable-margin approach, and a distributed implementation of the
popular Adam optimization algorithm.
- We proposed further strategies to speed up the
learning of KGEs, including dynamic selection of allreduce or allgather
communication based on the sparsity of the gradient matrix (see the
sketch after this list), selecting only a subset of gradient vectors for
reduced communication, employing gradient quantization, and partitioning
of KGE triples based on relations and selection for negative sampling.
Together these techniques resulted in about a 40x speedup over
unoptimized versions.
- We developed highly scalable k-NN search using vantage
point trees for partitioning the space and exploiting the HNSW algorithm
for local search. Our hybrid MPI-OpenMP algorithm enabled computation of
k-NN for 10,000 queries over a billion points in a 128-dimensional space
on the order of a few seconds.
- We also developed GPU solutions for accelerating GCNs
(Graph Convolutional Neural Networks) for the analysis of brain functional
networks. Our results with Alzheimer's data demonstrated a reduction in
execution times of about 60%.
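The sketch below illustrates, with mpi4py, the dynamic selection between dense allreduce and sparse allgather mentioned in the second item above: if only a small fraction of gradient rows are nonzero, ranks exchange (row index, row) pairs with allgather; otherwise a dense Allreduce is used. The threshold, matrix sizes, and row-wise sparsification are illustrative choices, not the lab's actual KGE framework.

```python
# A hedged mpi4py sketch of choosing between dense allreduce and sparse
# allgather for a gradient matrix based on its row sparsity. The 0.3
# threshold and matrix shape are illustrative assumptions.
from mpi4py import MPI
import numpy as np

def reduce_gradients(comm, grad, sparsity_threshold=0.3):
    """Sum a (num_embeddings x dim) local gradient matrix across ranks."""
    nonzero_rows = np.flatnonzero(np.any(grad != 0.0, axis=1))
    frac = len(nonzero_rows) / grad.shape[0]
    # all ranks must agree on the protocol, so agree on the max fraction
    frac = comm.allreduce(frac, op=MPI.MAX)
    if frac > sparsity_threshold:
        # dense path: one Allreduce over the full matrix
        total = np.empty_like(grad)
        comm.Allreduce(grad, total, op=MPI.SUM)
        return total
    # sparse path: exchange only (row index, row values) pairs
    local = [(int(r), grad[r]) for r in nonzero_rows]
    gathered = comm.allgather(local)
    total = np.zeros_like(grad)
    for contrib in gathered:
        for r, row in contrib:
            total[r] += row
    return total

if __name__ == "__main__":
    comm = MPI.COMM_WORLD
    rng = np.random.default_rng(comm.Get_rank())
    grad = np.zeros((1000, 64))
    touched = rng.choice(1000, size=50, replace=False)  # few rows updated
    grad[touched] = rng.standard_normal((50, 64))
    total = reduce_gradients(comm, grad)
    if comm.Get_rank() == 0:
        print("summed gradient norm:", np.linalg.norm(total))
```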
Last Modified:
February 2023