Implement, using MPI, the recursive halving algorithm for MPI_Reduce_scatter with commutative operations. For the algorithm, refer to the paper "Optimization of Collective Communication Operations in MPICH" by Thakur, Rabenseifner, and Gropp, IJHPCA 2005. Compare the performance of this algorithm with an approach that performs MPI_Reduce followed by MPI_Scatter. Compare for different message sizes (1K, 2K, 4K, 8K, 16K, 32K, 64K, 128K, 256K, 1M Bytes) and plot graphs of execution time vs. message size. Run the experiments on 8, 16, 32, 64, and 96 cores. Use the turing-cluster for this problem. See Platform Notes -> First MPI program.
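For orientation only, below is a minimal sketch of the recursive halving step structure, assuming a commutative MPI_SUM on doubles, a power-of-two number of processes, and equal recvcounts on every rank. The function name reduce_scatter_rec_halving and these simplifications are illustrative assumptions, not part of the paper or of the MPI library; your submission should handle the general case described in the paper.

    /* Sketch: recursive halving reduce-scatter of n doubles per rank (MPI_SUM),
       assuming p is a power of two and every rank receives n elements. */
    #include <mpi.h>
    #include <stdlib.h>
    #include <string.h>

    static void reduce_scatter_rec_halving(const double *sendbuf, double *recvbuf,
                                           int n, MPI_Comm comm)
    {
        int rank, p;
        MPI_Comm_rank(comm, &rank);
        MPI_Comm_size(comm, &p);

        /* Working copy of the full vector; the live part shrinks by half each step. */
        double *work = malloc((size_t)n * p * sizeof(double));
        double *tmp  = malloc((size_t)n * p / 2 * sizeof(double));
        memcpy(work, sendbuf, (size_t)n * p * sizeof(double));

        int low = 0, high = p;            /* range of blocks this rank still owns */
        int mask = p / 2;                 /* distance to the partner in this step */
        while (mask > 0) {
            int mid      = (low + high) / 2;
            int in_lower = (rank < mid);
            int partner  = in_lower ? rank + mask : rank - mask;

            int send_off = in_lower ? mid : low;   /* first block sent to partner  */
            int keep_off = in_lower ? low : mid;   /* first block we keep          */
            int count    = (mid - low) * n;        /* both halves are equal-sized  */

            /* Send the half needed by the partner's group, receive the half
               needed by our own group, then reduce (commutative op). */
            MPI_Sendrecv(work + (size_t)send_off * n, count, MPI_DOUBLE, partner, 0,
                         tmp, count, MPI_DOUBLE, partner, 0,
                         comm, MPI_STATUS_IGNORE);
            for (int i = 0; i < count; i++)
                work[(size_t)keep_off * n + i] += tmp[i];

            if (in_lower) high = mid; else low = mid;
            mask /= 2;
        }

        /* After lg(p) steps only this rank's own block remains fully reduced. */
        memcpy(recvbuf, work + (size_t)rank * n, (size_t)n * sizeof(double));
        free(work);
        free(tmp);
    }

The baseline for comparison can simply call MPI_Reduce to rank 0 followed by MPI_Scatter of the reduced vector; time both variants with MPI_Wtime over the listed message sizes.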
Recall the problem of accessing per-atom parameters: the x, y, z components of position, velocity, and force. Write a GPU kernel that updates these fields such that each GPU thread performs the update for one atom. Use the double data type for all fields. Report the execution times and compare the performance of:
Uncoalesced access using the array-of-structures (AoS) layout where the block size is not a multiple of the warp size
Uncoalesced access using the array-of-structures (AoS) layout where the block size is a multiple of the warp size
Coalesced access using the structure-of-arrays (SoA) layout where the block size is not a multiple of the warp size
Coalesced access using the structure-of-arrays (SoA) layout where the block size is a multiple of the warp size
Use one million atoms for the problem size. Use the turing GPU node for this problem. See Platform Notes -> First CUDA program. A sketch of the two data layouts is given below.
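A minimal sketch of the two layouts and the corresponding kernels, for reference. The names Atom, AtomSoA, update_aos, and update_soa, the simple drift-style update, and the time step dt are illustrative assumptions, not part of the problem statement; any update of the nine fields will do for the timing comparison.

    /* Sketch: AoS vs. SoA layouts for the per-atom update, one thread per atom. */
    #include <cuda_runtime.h>

    struct Atom {                 /* AoS: one 72-byte struct per atom */
        double x, y, z;
        double vx, vy, vz;
        double fx, fy, fz;
    };

    struct AtomSoA {              /* SoA: one device array per field */
        double *x, *y, *z;
        double *vx, *vy, *vz;
        double *fx, *fy, *fz;
    };

    __global__ void update_aos(Atom *a, int n, double dt)
    {
        int i = blockIdx.x * blockDim.x + threadIdx.x;
        if (i >= n) return;
        /* Neighbouring threads read the same field 72 bytes apart: uncoalesced. */
        a[i].vx += a[i].fx * dt;  a[i].vy += a[i].fy * dt;  a[i].vz += a[i].fz * dt;
        a[i].x  += a[i].vx * dt;  a[i].y  += a[i].vy * dt;  a[i].z  += a[i].vz * dt;
    }

    __global__ void update_soa(AtomSoA a, int n, double dt)
    {
        int i = blockIdx.x * blockDim.x + threadIdx.x;
        if (i >= n) return;
        /* Neighbouring threads read consecutive doubles: coalesced. */
        a.vx[i] += a.fx[i] * dt;  a.vy[i] += a.fy[i] * dt;  a.vz[i] += a.fz[i] * dt;
        a.x[i]  += a.vx[i] * dt;  a.y[i]  += a.vy[i] * dt;  a.z[i]  += a.vz[i] * dt;
    }

    /* Launch with a block size that is a multiple of the 32-thread warp size
       (e.g. 256) and one that is not (e.g. 250), and time with cudaEvent_t:
         int threads = 256;                       // or 250 for the non-multiple case
         int blocks  = (n + threads - 1) / threads;
         update_aos<<<blocks, threads>>>(d_atoms, n, dt);
         update_soa<<<blocks, threads>>>(d_soa, n, dt);
    */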