Collective Communications and Coalesced Access
Due: February 4, 2017

Collective Communications

Implement in MPI the recursive halving algorithm for commutative operations for MPI_Reduce_scatter. For the algorithm, refer to the paper "Optimization of Collective Communication Operations in MPICH" by Thakur, Rabenseifner, and Gropp, IJHPCA 2005. Compare the performance of this algorithm with an approach that performs MPI_Reduce followed by MPI_Scatter. Compare across different message sizes (1K, 2K, 4K, 8K, 16K, 32K, 64K, 128K, 256K, and 1M bytes) and plot graphs of execution time vs. message size. Run the experiments on 8, 16, 32, 64, and 96 cores. Use the turing-cluster for this problem. See Platform Notes -> First MPI program.
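
To make the algorithm concrete, here is a minimal sketch of recursive halving in C with MPI, under simplifying assumptions: the number of processes p is a power of two, the operation is a commutative sum over doubles, every process receives the same count of elements, and error checking is omitted. The function name reduce_scatter_rec_halving is hypothetical; consult the Thakur, Rabenseifner, and Gropp paper for the authoritative description, including the non-power-of-two case.

    #include <mpi.h>
    #include <stdlib.h>
    #include <string.h>

    /* Sketch only: assumes p is a power of two, op = sum (commutative),
       and each process is to receive `count` doubles (total n = p*count). */
    void reduce_scatter_rec_halving(const double *sendbuf, double *recvbuf,
                                    int count, MPI_Comm comm)
    {
        int rank, p;
        MPI_Comm_rank(comm, &rank);
        MPI_Comm_size(comm, &p);

        int n = p * count;                 /* total elements per process */
        double *work = (double *)malloc(n * sizeof(double));
        double *tmp  = (double *)malloc(n * sizeof(double));
        memcpy(work, sendbuf, n * sizeof(double));

        int lo = 0, hi = n;                /* live window into work[] */
        for (int dist = p / 2; dist >= 1; dist /= 2) {
            int partner = rank ^ dist;     /* pair at current distance */
            int half = (hi - lo) / 2;
            int send_lo, keep_lo;
            /* Keep the half that contains our own destination block;
               send the other half to the partner. */
            if (rank < partner) { keep_lo = lo;        send_lo = lo + half; }
            else                { keep_lo = lo + half; send_lo = lo;        }

            MPI_Sendrecv(work + send_lo, half, MPI_DOUBLE, partner, 0,
                         tmp  + keep_lo, half, MPI_DOUBLE, partner, 0,
                         comm, MPI_STATUS_IGNORE);
            for (int i = 0; i < half; i++)     /* local reduction (sum) */
                work[keep_lo + i] += tmp[keep_lo + i];

            lo = keep_lo; hi = keep_lo + half; /* window halves each step */
        }
        memcpy(recvbuf, work + lo, count * sizeof(double)); /* own block */
        free(work); free(tmp);
    }

The baseline for comparison is simply MPI_Reduce to rank 0 followed by MPI_Scatter of the reduced buffer. For timing either version, bracket the call with MPI_Wtime() (after an MPI_Barrier) and average over several repetitions per message size.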

Advantage of Coalesced Access

Recall the problem related to accessing the parameters of atoms, including x, y, z position, velocity, and force. Write a GPU kernel that updates these fields such that each GPU thread performs the updates for one atom. Use the double data type for all fields. Report the execution times and compare the performance of coalesced versus non-coalesced access, i.e., a structure-of-arrays (SoA) layout versus an array-of-structures (AoS) layout (a sketch of both appears below).
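
The following CUDA sketch illustrates the two layouts. The update rule here (a simple explicit step with a hypothetical time step dt and inverse mass inv_m) and the struct/kernel names are placeholder assumptions for illustration, not the course's exact problem; only the layout contrast matters for the measurement.

    #include <cuda_runtime.h>

    struct Atom {                       /* array-of-structures element */
        double x, y, z;                 /* position */
        double vx, vy, vz;              /* velocity */
        double fx, fy, fz;              /* force    */
    };

    /* AoS: consecutive threads access addresses 72 bytes apart for each
       field, so the loads and stores are NOT coalesced. */
    __global__ void update_aos(Atom *a, int n, double dt, double inv_m)
    {
        int i = blockIdx.x * blockDim.x + threadIdx.x;
        if (i >= n) return;
        a[i].vx += a[i].fx * inv_m * dt;
        a[i].vy += a[i].fy * inv_m * dt;
        a[i].vz += a[i].fz * inv_m * dt;
        a[i].x  += a[i].vx * dt;
        a[i].y  += a[i].vy * dt;
        a[i].z  += a[i].vz * dt;
    }

    /* SoA: consecutive threads access consecutive doubles in each array,
       so every load and store is coalesced. */
    __global__ void update_soa(double *x, double *y, double *z,
                               double *vx, double *vy, double *vz,
                               const double *fx, const double *fy,
                               const double *fz, int n,
                               double dt, double inv_m)
    {
        int i = blockIdx.x * blockDim.x + threadIdx.x;
        if (i >= n) return;
        vx[i] += fx[i] * inv_m * dt;  x[i] += vx[i] * dt;
        vy[i] += fy[i] * inv_m * dt;  y[i] += vy[i] * dt;
        vz[i] += fz[i] * inv_m * dt;  z[i] += vz[i] * dt;
    }

Time the kernels with cudaEventRecord/cudaEventElapsedTime (or a profiler), excluding the host-to-device transfers, so that the measured difference reflects only the memory access pattern.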

Use one million atoms for the problem size. Use the turing GPU node for this problem. See Platform Notes -> First CUDA program.