Reduction using CUDA
In
this assignment, you will be implementing the reduction program covered
in the class for an array of 20K integer elements using only one thread
block. Note that since the maximum thread block size is only 1K
threads, each thread will start the tree based reduction using 20
elements each and perform the steps until the number of elements in a
level is 1K. After then, the code shown in the class takes over.
Implement
the program with and without shared memory usage and show timing for
both the versions, and compare with the time taken for a sequential
code on the CPU in terms of speedup.