MPI Collective Communications and CUDA Odd-even Sort
Due: February 6
MPI Collective Communication
Implement Bruck's algorithm for the alltoall collective using MPI, for message sizes of 1K, 2K, 4K, 8K, 16K, 32K, 64K, 128K, 256K, and 1M bytes, and plot graphs of execution time vs. message size comparing this algorithm with the native MPI implementation (i.e., a call to MPI_Alltoall). Run the experiments on 8, 16, 32, 64, and 128 cores. For the algorithm, refer to the paper "Optimization of Collective Communication Operations in MPICH" by Thakur, Rabenseifner, and Gropp, IJHPCA 2005. A sketch of the three phases is given below.
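The following is a minimal sketch of Bruck's algorithm as described in the Thakur et al. paper, operating on raw-byte blocks for simplicity. The function name alltoall_bruck, the parameter blockbytes (the per-destination message size, i.e., the sizes listed above), and the buffer names are illustrative, not part of the assignment; your implementation may organize the packing differently.

#include <mpi.h>
#include <stdlib.h>
#include <string.h>

/* Alltoall via Bruck's algorithm. sendbuf and recvbuf each hold p blocks
 * of blockbytes bytes; block j of sendbuf is destined for rank j. */
int alltoall_bruck(const char *sendbuf, char *recvbuf,
                   int blockbytes, MPI_Comm comm)
{
    int rank, p;
    MPI_Comm_rank(comm, &rank);
    MPI_Comm_size(comm, &p);

    char *tmp    = (char *)malloc((size_t)p * blockbytes);
    char *packed = (char *)malloc((size_t)p * blockbytes);
    char *inbuf  = (char *)malloc((size_t)p * blockbytes);

    /* Phase 1: local rotation -- block (rank + j) mod p moves to slot j. */
    for (int j = 0; j < p; j++)
        memcpy(tmp + (size_t)j * blockbytes,
               sendbuf + (size_t)((rank + j) % p) * blockbytes, blockbytes);

    /* Phase 2: ceil(log2 p) steps. In the step with pof2 = 2^k, send to
     * rank + 2^k all blocks whose slot index has bit k set, and receive
     * the corresponding blocks from rank - 2^k into the same slots. */
    for (int pof2 = 1; pof2 < p; pof2 *= 2) {
        int dst = (rank + pof2) % p;
        int src = (rank - pof2 + p) % p;

        int nblocks = 0;
        for (int j = 0; j < p; j++)          /* pack outgoing blocks */
            if (j & pof2)
                memcpy(packed + (size_t)(nblocks++) * blockbytes,
                       tmp + (size_t)j * blockbytes, blockbytes);

        MPI_Sendrecv(packed, nblocks * blockbytes, MPI_BYTE, dst, 0,
                     inbuf,  nblocks * blockbytes, MPI_BYTE, src, 0,
                     comm, MPI_STATUS_IGNORE);

        nblocks = 0;
        for (int j = 0; j < p; j++)          /* unpack into the same slots */
            if (j & pof2)
                memcpy(tmp + (size_t)j * blockbytes,
                       inbuf + (size_t)(nblocks++) * blockbytes, blockbytes);
    }

    /* Phase 3: inverse rotation with reversal -- slot (rank - t) mod p
     * now holds the block that rank t sent to this process. */
    for (int t = 0; t < p; t++)
        memcpy(recvbuf + (size_t)t * blockbytes,
               tmp + (size_t)((rank - t + p) % p) * blockbytes, blockbytes);

    free(tmp); free(packed); free(inbuf);
    return MPI_SUCCESS;
}

For the plots, one straightforward approach is to time this routine and MPI_Alltoall back to back with MPI_Wtime, synchronizing with MPI_Barrier before each timed region and averaging over several iterations per message size.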
CUDA Odd-even Sort
Implement the odd-even transposition sort described on pages 395–396 and in Algorithm 9.4 of the book by Grama et al. on GPUs using CUDA. Use only a single thread block, and use the maximum array size that can fit in the shared memory of the GPU. A sketch follows.
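The following is a minimal single-block sketch. The values of N and THREADS are assumptions: pick N so that N * sizeof(int) fits in your device's per-block shared memory (48 KB gives 12288 ints on many GPUs); check the actual limit on the Turing cluster. The kernel name and the use of int keys are likewise illustrative.

#include <cstdio>
#include <cstdlib>
#include <cuda_runtime.h>

#define N       12288   /* assumed 48 KB shared-memory budget / 4 bytes */
#define THREADS 1024

__global__ void odd_even_sort(int *data, int n)
{
    extern __shared__ int s[];

    /* Stage the array into shared memory. */
    for (int i = threadIdx.x; i < n; i += blockDim.x)
        s[i] = data[i];
    __syncthreads();

    /* n phases: even phases compare-exchange pairs (0,1),(2,3),...;
     * odd phases pairs (1,2),(3,4),.... Pairs within a phase are
     * disjoint, so threads can stride over them without conflicts. */
    for (int phase = 0; phase < n; phase++) {
        int first = phase & 1;
        for (int i = first + 2 * threadIdx.x; i + 1 < n; i += 2 * blockDim.x)
            if (s[i] > s[i + 1]) {
                int t = s[i]; s[i] = s[i + 1]; s[i + 1] = t;
            }
        __syncthreads();   /* finish all pairs before the next phase */
    }

    /* Write the sorted array back to global memory. */
    for (int i = threadIdx.x; i < n; i += blockDim.x)
        data[i] = s[i];
}

int main(void)
{
    int *h = (int *)malloc(N * sizeof(int));
    for (int i = 0; i < N; i++) h[i] = rand();

    int *d;
    cudaMalloc(&d, N * sizeof(int));
    cudaMemcpy(d, h, N * sizeof(int), cudaMemcpyHostToDevice);

    /* Single thread block; dynamic shared memory sized to hold N ints. */
    odd_even_sort<<<1, THREADS, N * sizeof(int)>>>(d, N);
    cudaMemcpy(h, d, N * sizeof(int), cudaMemcpyDeviceToHost);

    for (int i = 1; i < N; i++)
        if (h[i - 1] > h[i]) { printf("not sorted at %d\n", i); return 1; }
    printf("sorted %d ints\n", N);

    cudaFree(d);
    free(h);
    return 0;
}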
Use the Turing cluster platform for the assignment. Refer to the platform notes given on the class webpage.
Profilers: Prepare a single consolidated report for both of the above assignments, giving descriptions of your algorithms/strategies, results, Jumpshot (MPI) and CUDA profiling outputs, and observations. For the Jumpshot and CUDA profiling outputs, click on the "parallel profilers" link on the course website and read the relevant descriptions.