Our
lab, the Middleware and Runtime Systems (MARS) lab, focuses on
building middleware and runtime systems for parallel applications and
systems.
Runtime Systems / Application Frameworks
Our
lab works on building runtime systems for HPC applications on both
accelerator-based and general HPC systems. We primarily focus on irregular
applications including graph applications, N-Body simulations,
Molecular Dynamics (MD), and Adaptive Mesh Refinement (AMR)
applications. We have also worked with applications in climate science
and visualization in collaboration with researchers working in these
areas.
Runtime Strategies and Programming
Models on GPU systems:
This research develops runtime strategies, including hybrid asynchronous
execution of applications on both CPU and GPU cores for their effective use,
dynamic scheduling, load balancing of computations within the GPUs, and data
layout optimizations for both graph-based and scientific applications.
- We have developed bin-packing based load balancing on GPUs, a
knapsack formulation of asynchronous executions on CPUs and GPUs, and
kernel optimizations for AMR applications (a minimal bin-packing sketch
appears after this list).
- Developed dynamic load balancing strategies for
graph-based applications including BFS and SSSP.
- Developed an algorithm for hybrid executions of
betweenness centrality on both CPU and GPU cores.
- In our work on programming models, the aim is to address
challenges that arise from executing different programming models on
GPU systems. Our recent work develops user abstractions and
runtime strategies for efficient execution of asynchronous
message-passing applications written in Charm++ on GPUs. We have developed
runtime strategies for both regular applications such as matrix
computations and irregular applications such as N-Body and molecular
dynamics. This work will be extended to include other
programming models.
- Developed HyPar, a runtime framework that uses a
divide-and-conquer model for graph applications. This model has been
applied to applications including Boruvka's MST, graph coloring,
triangle counting, community detection, and connected components,
providing significant benefits over traditional BSP approaches at large scale.
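Below is a minimal sketch of the bin-packing idea referenced in the first item above: work units with estimated costs (e.g., AMR patches) are packed into per-thread-block bins using a first-fit-decreasing heuristic. The function name, cost estimates, and capacity parameter are illustrative assumptions, not the lab's actual implementation.

```python
# Minimal sketch of first-fit-decreasing bin packing for GPU load balancing.
# Each "item" is a work unit (e.g., an AMR patch or a high-degree vertex's
# edge list) with an estimated cost; each "bin" corresponds to the work
# assigned to one thread block. Names and the capacity value are
# illustrative, not the lab's actual implementation.

def pack_work_items(costs, bin_capacity):
    """Assign items (given by cost) to bins using first-fit decreasing."""
    order = sorted(range(len(costs)), key=lambda i: costs[i], reverse=True)
    bins = []  # each bin is [remaining_capacity, [item indices]]
    for i in order:
        for b in bins:
            if b[0] >= costs[i]:
                b[0] -= costs[i]
                b[1].append(i)
                break
        else:
            # open a new bin (a new thread block's work list)
            bins.append([bin_capacity - costs[i], [i]])
    return [b[1] for b in bins]

if __name__ == "__main__":
    # per-item cost estimates, e.g., number of cells per AMR patch (made up)
    costs = [700, 350, 300, 250, 200, 150, 100, 50]
    blocks = pack_work_items(costs, bin_capacity=1000)
    for k, items in enumerate(blocks):
        print(f"thread block {k}: items {items}, "
              f"load {sum(costs[i] for i in items)}")
```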
Performance Modeling, Scalability,
Mapping of Applications on Large-Scale Systems:
This research focuses on
performance modeling, scalability studies and processor allocation of
large applications on large systems, and mapping and
remapping/rescheduling strategies on HPC network topologies.
- We have developed processor allocation, mapping, and
reallocation strategies for simultaneous executions of nested
simulations in weather modeling applications that track dynamically
varying weather phenomena such as cyclones and rain clouds.
- Developed techniques that match application signatures to predict
the performance of large-scale runs from small-scale runs (a simplified
extrapolation sketch appears at the end of this subsection).
We plan to extend our performance modeling of large-scale runs to
automatically identify and correct scalability bugs.
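The sketch below illustrates, in heavily simplified form, the idea of predicting large-scale performance from small-scale runs. It fits a generic scaling model to a few made-up small-scale timings and extrapolates; the lab's actual technique is based on matching application signatures, which this stand-in does not attempt to reproduce.

```python
# A simplified illustration of extrapolating large-scale performance from
# small-scale runs: fit a generic model T(p) = a/p + b*log2(p) + c to a few
# small-scale timings and extrapolate. This is a stand-in, not the lab's
# signature-matching technique; the timings below are made up.
import numpy as np

def fit_scaling_model(procs, times):
    """Least-squares fit of T(p) = a/p + b*log2(p) + c."""
    p = np.asarray(procs, dtype=float)
    A = np.column_stack([1.0 / p, np.log2(p), np.ones_like(p)])
    coeffs, *_ = np.linalg.lstsq(A, np.asarray(times, dtype=float), rcond=None)
    return coeffs

def predict_time(coeffs, p):
    a, b, c = coeffs
    return a / p + b * np.log2(p) + c

# small-scale runs: (processor counts, measured times in seconds) -- made up
small_runs = ([16, 32, 64, 128], [130.0, 68.0, 38.5, 24.0])
coeffs = fit_scaling_model(*small_runs)
for p in (1024, 4096, 16384):
    print(f"predicted time on {p} processors: {predict_time(coeffs, p):.1f} s")
```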
Middleware
Middleware
is another primary research field in our lab. This includes middleware
for supercomputer jobs, grid middleware and fault tolerance for
parallel applications.
Middleware for Supercomputers, HPC Grids:
Batch systems and queues are used in many production and research
supercomputer systems. Our research builds middleware frameworks that
interface between the users and the batch queues and systems. The
middleware includes prediction techniques that estimate the queue waiting
times and execution times of the parallel jobs submitted to the batch
queues, and scheduling strategies that use these predictions to assign
the appropriate batch queue and number of processors for job execution,
with the aim of reducing turnaround times for users and increasing the
throughput of the system.
- We have developed techniques for predicting jobs that have
short queue waiting times (quick starters).
- Extended the work to predict queue waiting times for all
classes of jobs based on the history of job submissions.
- Also developed strategies to predict ranges of execution
times based on previous job submissions by the user and the loads on
the system.
- We developed methods that automatically use these
predictions for job molding (changing the processor request size) and
delayed submissions (see the sketch after this list).
- We have also done work on middleware for metascheduling
HPC jobs in a grid of supercomputers under dynamic electricity markets.
The middleware uses queue waiting time predictions to estimate the
execution periods of jobs on the different supercomputers of the grid,
considers electricity price variations at the supercomputer sites
during those periods, and submits/migrates jobs to the supercomputers
predicted to have the lowest electricity costs during the predicted
period and the lowest response times.
- We have also worked on strategies for automatically deciding
the best queue configuration for a system based on its usage history.
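The following is a hedged sketch of how queue waiting time and execution time predictions can drive job molding: the processor request size is chosen to minimize the predicted turnaround time. The predictor functions below are toy stand-ins and are not the lab's history-based prediction models.

```python
# A sketch of prediction-driven job molding: given (hypothetical) predictors
# for queue waiting time and execution time as functions of the requested
# processor count, choose the request size that minimizes predicted
# turnaround time. The predictors are toy models, not the lab's
# history-based predictors.

def predicted_wait(procs):
    # toy model: larger requests tend to wait longer in the queue
    return 120.0 + 2.5 * procs

def predicted_runtime(procs, work=1.0e6, serial_fraction=0.05):
    # toy Amdahl-style model of execution time
    return work * (serial_fraction + (1.0 - serial_fraction) / procs)

def mold_request(candidate_sizes):
    """Pick the processor request minimizing predicted turnaround time."""
    return min(candidate_sizes,
               key=lambda p: predicted_wait(p) + predicted_runtime(p))

if __name__ == "__main__":
    sizes = [64, 128, 256, 512, 1024]
    for p in sizes:
        total = predicted_wait(p) + predicted_runtime(p)
        print(f"{p:5d} procs: predicted turnaround {total:10.1f} s")
    print("molded request:", mold_request(sizes), "processors")
```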
Fault Tolerance:
Our lab has investigated the use
of replication for fault tolerance. The novelty is that instead of
replicating all the processes, thereby resulting in only about 50%
application efficiency in the presence of failures, our methods
replicate a small subset of processes (typically, less than 1%) based
on failure predictions. We demonstrated the effectiveness of this
strategy for current petascale and future exascale systems. Our
research has also built an MPI library that uses this partial replication
technique.
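A minimal sketch of the selection step in partial replication follows: given per-process failure probability predictions, only the small fraction of processes predicted to be most at risk is replicated. The predictor values and the 1% budget below are illustrative assumptions.

```python
# Minimal sketch of selecting processes for partial replication: replicate
# only the top fraction of processes by predicted failure probability,
# instead of replicating everything. The probabilities and the 1% budget
# are illustrative assumptions, not the lab's failure predictor.
import random

def select_replicas(failure_prob, fraction=0.01):
    """Return the ranks to replicate: the top `fraction` by predicted risk."""
    n = len(failure_prob)
    budget = max(1, int(n * fraction))
    ranked = sorted(range(n), key=lambda r: failure_prob[r], reverse=True)
    return set(ranked[:budget])

if __name__ == "__main__":
    random.seed(0)
    nprocs = 1000
    # assumed output of a failure predictor (e.g., from system health logs)
    probs = [random.betavariate(1, 50) for _ in range(nprocs)]
    replicas = select_replicas(probs, fraction=0.01)
    print(f"replicating {len(replicas)} of {nprocs} processes:",
          sorted(replicas))
```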
Acceleration of Scientific Applications and Solvers
Our
lab has long-standing research on providing high-performance solutions
for climate modeling applications.
- We started by providing
solutions for dynamic executions of climate modeling applications on a
grid of supercomputers.
- We also worked on developing adaptive frameworks for
simulation and online remote visualization of critical climate
applications in resource-constrained environments.
- We then developed solutions for accelerating specific
time-consuming phases in atmosphere and ocean models on accelerators
including GPUs and Xeon Phis. Our work on Intel Xeon Phis was part of
the first Intel Parallel Computing Centre (IPCC) in India.
- We then developed efficient I/O methodologies, including the use
of collective I/O and selective writing strategies, in ocean models.
- As part of a DST project, we are currently working
on comprehensive HPC solutions, including hybrid CPU-GPU executions and
large-scale data handling, for a fine-resolution ocean model.
Our
lab has also developed pipelined preconditioned conjugate gradient (CG)
methods for distributed memory systems. This was extended to other
iterative solvers including BiCGStab, MinRes, and CGR.
Pipelining was also developed for hybrid and asynchronous CPU-GPU
executions of the solvers.
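For illustration, below is a serial NumPy sketch of a pipelined preconditioned CG recurrence in the style of Ghysels and Vanroose; it is not the lab's distributed implementation. The benefit of pipelining is that the two dot products (global reductions) per iteration can be overlapped with the matrix-vector product and the preconditioner application; in this serial sketch they simply execute in sequence.

```python
# Serial NumPy sketch of pipelined preconditioned CG (Ghysels-Vanroose style).
# In distributed-memory versions, the two reductions (gamma, delta) overlap
# with the SpMV and preconditioner application below.
import numpy as np

def pipelined_pcg(A, b, M_inv, tol=1e-8, maxiter=500):
    n = len(b)
    x = np.zeros(n)
    r = b - A @ x
    u = M_inv @ r
    w = A @ u
    z = q = s = p = np.zeros(n)
    alpha = gamma_old = 1.0
    for i in range(maxiter):
        gamma = r @ u                    # reduction 1
        delta = w @ u                    # reduction 2
        m = M_inv @ w                    # preconditioner (overlaps reductions)
        nvec = A @ m                     # SpMV (overlaps reductions)
        beta = 0.0 if i == 0 else gamma / gamma_old
        alpha = gamma / (delta - beta * gamma / alpha) if i else gamma / delta
        z = nvec + beta * z
        q = m + beta * q
        s = w + beta * s
        p = u + beta * p
        x = x + alpha * p
        r = r - alpha * s
        u = u - alpha * q
        w = w - alpha * z
        gamma_old = gamma
        if np.linalg.norm(r) < tol * np.linalg.norm(b):
            break
    return x, i + 1

if __name__ == "__main__":
    n = 100                              # 1D Laplacian test matrix (SPD)
    A = (np.diag(2.0 * np.ones(n)) + np.diag(-np.ones(n - 1), 1)
         + np.diag(-np.ones(n - 1), -1))
    b = np.ones(n)
    M_inv = np.diag(1.0 / np.diag(A))    # Jacobi preconditioner
    x, iters = pipelined_pcg(A, b, M_inv)
    print(f"converged in {iters} iterations, residual "
          f"{np.linalg.norm(b - A @ x):.2e}")
```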
High Performance Machine Learning and
Data Science
Our
lab has been developing communication-minimizing and communication-avoiding
strategies for machine learning frameworks. We also explore novel
models of parallelism for machine learning applications.
- We developed three
methods for fast learning of Knowledge Graph Embeddings (KGEs) at
scale. These include reducing Allgather communication by sparsifying
SGM, a variable-margin approach, and a distributed implementation of the
popular Adam optimization algorithm.
- We proposed further strategies to speed up the
learning of KGEs, including dynamic selection of allreduce or allgather
communication based on the sparsity of the gradient matrix (see the
sketch after this list), selecting only a subset of gradient vectors for
reduced communication, employing gradient quantization, and partitioning
of KGE triples based on relations and selection for negative sampling.
Together these techniques resulted in about a 40x speedup over
unoptimized versions.
- We developed highly scalable k-NN search using vantage
point trees for partitioning the space and exploiting the HNSW algorithm
for local search. Our hybrid MPI-OpenMP algorithm enabled computation of
k-NN for 10,000 queries over a billion points in a 128-dimensional space
on the order of a few seconds.
- We also developed GPU solutions for accelerating GCNs
(Graph Convolutional Neural Networks) for the analysis of brain functional
networks. Our results with Alzheimer's data demonstrated a reduction in
execution times of about 60%.
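The sketch below illustrates, with mpi4py, the dynamic selection between dense allreduce and sparse allgather mentioned in the second item above: if only a small fraction of gradient rows are nonzero, ranks exchange (row index, row) pairs with allgather; otherwise a dense Allreduce is used. The threshold, matrix sizes, and row-wise sparsification are illustrative choices, not the lab's actual KGE framework.

```python
# A hedged mpi4py sketch of choosing between dense allreduce and sparse
# allgather for a gradient matrix based on its row sparsity. The 0.3
# threshold and matrix shape are illustrative assumptions.
from mpi4py import MPI
import numpy as np

def reduce_gradients(comm, grad, sparsity_threshold=0.3):
    """Sum a (num_embeddings x dim) local gradient matrix across ranks."""
    nonzero_rows = np.flatnonzero(np.any(grad != 0.0, axis=1))
    frac = len(nonzero_rows) / grad.shape[0]
    # all ranks must agree on the protocol, so agree on the max fraction
    frac = comm.allreduce(frac, op=MPI.MAX)
    if frac > sparsity_threshold:
        # dense path: one Allreduce over the full matrix
        total = np.empty_like(grad)
        comm.Allreduce(grad, total, op=MPI.SUM)
        return total
    # sparse path: exchange only (row index, row values) pairs
    local = [(int(r), grad[r]) for r in nonzero_rows]
    gathered = comm.allgather(local)
    total = np.zeros_like(grad)
    for contrib in gathered:
        for r, row in contrib:
            total[r] += row
    return total

if __name__ == "__main__":
    comm = MPI.COMM_WORLD
    rng = np.random.default_rng(comm.Get_rank())
    grad = np.zeros((1000, 64))
    touched = rng.choice(1000, size=50, replace=False)  # few rows updated
    grad[touched] = rng.standard_normal((50, 64))
    total = reduce_gradients(comm, grad)
    if comm.Get_rank() == 0:
        print("summed gradient norm:", np.linalg.norm(total))
```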
Last Modified:
February 2023