CUDA Optimizations

Using a matrix-vector multiplication program, illustrate the performance benefits of the following GPU optimizations:
  1. Block sizes that are a multiple of the warp size (i.e., all warps fully populated)
  2. Coalesced memory access
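
One possible starting point is sketched below; it is not a reference solution. It assumes a square n x n matrix with one thread per output row and a warp size of 32. The names (matvec_strided, matvec_coalesced, time_ms, CHECK), the size n = 4096, and the block sizes 128 and 100 are illustrative choices. With row-major storage, neighbouring threads read addresses n floats apart at each loop step; storing the matrix column-major makes those reads contiguous across a warp. A block of 100 threads leaves its fourth warp with only 4 active threads, whereas 128 fills all four warps.

```cuda
#include <cstdio>
#include <cstdlib>
#include <vector>
#include <cuda_runtime.h>

// Hypothetical helper: abort with a message if a CUDA runtime call fails.
#define CHECK(call) do { cudaError_t e_ = (call); if (e_ != cudaSuccess) { \
    std::fprintf(stderr, "CUDA error: %s\n", cudaGetErrorString(e_)); std::exit(1); } } while (0)

// One thread per row, A row-major: threads in a warp read addresses n floats
// apart at each loop step, so the loads are not coalesced.
__global__ void matvec_strided(const float* A, const float* x, float* y, int n) {
    int row = blockIdx.x * blockDim.x + threadIdx.x;
    if (row >= n) return;
    float sum = 0.0f;
    for (int col = 0; col < n; ++col) sum += A[row * n + col] * x[col];
    y[row] = sum;
}

// Same thread mapping, A stored column-major (At[col * n + row]): threads in a
// warp read consecutive addresses at each loop step, so the loads coalesce.
__global__ void matvec_coalesced(const float* At, const float* x, float* y, int n) {
    int row = blockIdx.x * blockDim.x + threadIdx.x;
    if (row >= n) return;
    float sum = 0.0f;
    for (int col = 0; col < n; ++col) sum += At[col * n + row] * x[col];
    y[row] = sum;
}

typedef void (*MatvecKernel)(const float*, const float*, float*, int);

// Number of blocks needed to cover n rows (rounded up).
int blocks_for(int n, int block_size) { return (n + block_size - 1) / block_size; }

// Time one launch with CUDA events, after an untimed warm-up launch.
float time_ms(MatvecKernel k, const float* dA, const float* dx, float* dy, int n, int bs) {
    int grid = blocks_for(n, bs);
    cudaEvent_t start, stop;
    CHECK(cudaEventCreate(&start));
    CHECK(cudaEventCreate(&stop));
    k<<<grid, bs>>>(dA, dx, dy, n);
    CHECK(cudaEventRecord(start));
    k<<<grid, bs>>>(dA, dx, dy, n);
    CHECK(cudaEventRecord(stop));
    CHECK(cudaEventSynchronize(stop));
    CHECK(cudaGetLastError());
    float ms = 0.0f;
    CHECK(cudaEventElapsedTime(&ms, start, stop));
    CHECK(cudaEventDestroy(start));
    CHECK(cudaEventDestroy(stop));
    return ms;
}

// Copy the device result back and compare it with the CPU reference.
bool matches(const float* dy, const std::vector<float>& ref) {
    std::vector<float> y(ref.size());
    CHECK(cudaMemcpy(y.data(), dy, y.size() * sizeof(float), cudaMemcpyDeviceToHost));
    return y == ref;  // inputs are small integers, so float sums are exact
}

int main() {
    const int n = 4096;  // illustrative size
    const size_t elems = (size_t)n * n;
    std::vector<float> A(elems), At(elems), x(n), ref(n, 0.0f);
    for (int c = 0; c < n; ++c) x[c] = (float)(c % 3);
    for (int r = 0; r < n; ++r)
        for (int c = 0; c < n; ++c) {
            float v = (float)((r + c) % 7);
            A[(size_t)r * n + c] = v;   // row-major copy
            At[(size_t)c * n + r] = v;  // column-major copy
            ref[r] += v * x[c];         // CPU reference result
        }

    float *dA, *dAt, *dx, *dy;
    CHECK(cudaMalloc(&dA, elems * sizeof(float)));
    CHECK(cudaMalloc(&dAt, elems * sizeof(float)));
    CHECK(cudaMalloc(&dx, n * sizeof(float)));
    CHECK(cudaMalloc(&dy, n * sizeof(float)));
    CHECK(cudaMemcpy(dA, A.data(), elems * sizeof(float), cudaMemcpyHostToDevice));
    CHECK(cudaMemcpy(dAt, At.data(), elems * sizeof(float), cudaMemcpyHostToDevice));
    CHECK(cudaMemcpy(dx, x.data(), n * sizeof(float), cudaMemcpyHostToDevice));

    const int block_sizes[] = {128, 100};  // 4 full warps vs. 3 full warps + 4 threads
    for (int bs : block_sizes) {
        float t_strided = time_ms(matvec_strided, dA, dx, dy, n, bs);
        bool ok_strided = matches(dy, ref);
        float t_coalesced = time_ms(matvec_coalesced, dAt, dx, dy, n, bs);
        bool ok_coalesced = matches(dy, ref);
        std::printf("block=%d strided=%.3f ms (%s) coalesced=%.3f ms (%s)\n", bs,
                    t_strided, ok_strided ? "ok" : "MISMATCH",
                    t_coalesced, ok_coalesced ? "ok" : "MISMATCH");
    }

    CHECK(cudaFree(dA)); CHECK(cudaFree(dAt)); CHECK(cudaFree(dx)); CHECK(cudaFree(dy));
    return 0;
}
```

Assuming the file is saved as matvec.cu, it can be built with `nvcc -O2 matvec.cu -o matvec`. Timings depend on the GPU, and the block-size effect may be modest for a memory-bound kernel like this one. The same binary can be run under a profiler such as Nsight Systems (`nsys profile ./matvec`) or Nsight Compute (`ncu ./matvec`) to collect per-kernel output.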

Show the CUDA profiling output for each variant.