Prefix Sum using CUDA
Given an array A, the prefix sum problem is to find an output array B such that
B[i] = sum(A[0],...,A[i])
Implement a prefix program in CUDA that calculates the predix for an input array of 20K integer elements using only one thread
block.
Implement
the program with and without shared memory usage and show timing for
both the versions, and compare with the time taken for a sequential
code on the CPU in terms of speedup.