Prefix Sum using CUDA

Given an array A, the prefix sum problem is to find an output array B such that
B[i] = sum(A[0],...,A[i])

Implement a prefix program in CUDA that calculates the predix for an input array of 20K integer elements using only one thread block.

Implement the program with and without shared memory usage and show timing for both the versions, and compare with the time taken for a sequential code on the CPU in terms of speedup.