Bitonic Sort on GPUs

Implement the bitonic sort algorithm covered in class in CUDA, using a single thread block. Compare its execution time with that of the sequential algorithm running on the CPU, using the largest integer array that your implementation can fit in the GPU's shared memory. Use the turing GPU node for this assignment.
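
As a starting point for the CPU baseline, the following is a minimal C++ sketch of a sequential bitonic sort, not a reference solution. It assumes the array length is a power of two. The names bitonic_sort and largest_pow2_ints are hypothetical helpers introduced here for illustration, and the 48 KB shared-memory figure mentioned below is an assumption (a common default per-block limit) that should be checked against the actual device.

```cpp
#include <algorithm>
#include <cstddef>
#include <utility>
#include <vector>

// Sequential bitonic sort in ascending order.
// Assumption: a.size() is a power of two (or zero).
void bitonic_sort(std::vector<int>& a) {
    const std::size_t n = a.size();
    // k: length of the bitonic sequences being merged in this stage.
    for (std::size_t k = 2; k <= n; k <<= 1) {
        // j: distance between the two elements of a compare-exchange pair.
        for (std::size_t j = k >> 1; j > 0; j >>= 1) {
            // In a CUDA version, each value of i would typically map to one
            // thread, with a barrier between successive (k, j) steps.
            for (std::size_t i = 0; i < n; ++i) {
                const std::size_t partner = i ^ j;
                if (partner > i) {
                    // Blocks of length k alternate between ascending and descending.
                    const bool ascending = (i & k) == 0;
                    if ((a[i] > a[partner]) == ascending) {
                        std::swap(a[i], a[partner]);
                    }
                }
            }
        }
    }
}

// Largest power-of-two count of ints that fits in shared_bytes bytes.
// Returns 0 if not even one int fits.
std::size_t largest_pow2_ints(std::size_t shared_bytes) {
    if (shared_bytes < sizeof(int)) return 0;
    std::size_t n = 1;
    while (n * 2 * sizeof(int) <= shared_bytes) n *= 2;
    return n;
}
```

With 4-byte integers and an assumed 48 KB of shared memory per block, largest_pow2_ints(48 * 1024) gives 8192 elements. Query the device for its real limit before fixing the array size.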

Extra points: Extend your program to use multiple thread blocks and show results for the largest array that fits in GPU device memory. To merge across thread blocks, use the PRAM merge sort idea covered in class.
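
The exact PRAM merge formulation from class is not restated here, so the following C++ sketch shows one common rank-based variant as an assumption: every element finds its final position independently by adding its own index to its rank in the other sorted array, found by binary search. The function name rank_merge is hypothetical. On a GPU, each loop iteration could in principle be handled by a separate thread, since no two elements write to the same output slot.

```cpp
#include <algorithm>
#include <cstddef>
#include <vector>

// Merge two sorted arrays by ranking each element in the other array.
// Every iteration is independent of the others, which is what makes the
// scheme suitable for parallel execution.
std::vector<int> rank_merge(const std::vector<int>& a, const std::vector<int>& b) {
    std::vector<int> out(a.size() + b.size());
    for (std::size_t i = 0; i < a.size(); ++i) {
        // Rank of a[i] in b: number of elements of b strictly less than a[i].
        const std::size_t r = static_cast<std::size_t>(
            std::lower_bound(b.begin(), b.end(), a[i]) - b.begin());
        out[i + r] = a[i];
    }
    for (std::size_t j = 0; j < b.size(); ++j) {
        // Rank of b[j] in a: number of elements of a less than or equal to b[j].
        // Using <= here and < above places equal elements of a first, so
        // ties never collide on the same output index.
        const std::size_t r = static_cast<std::size_t>(
            std::upper_bound(a.begin(), a.end(), b[j]) - a.begin());
        out[j + r] = b[j];
    }
    return out;
}
```

For example, merging {1, 3, 3, 7} with {2, 3, 8, 9} yields {1, 2, 3, 3, 3, 7, 8, 9}. In the multi-block setting, a and b would be the sorted outputs of two thread blocks.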