Implement the bitonic sort algorithm covered in the class in CUDA using a single thread block.
Compare its execution time with that of the sequential algorithm on the CPU
for the largest integer array that your implementation
can accommodate within the shared memory on the GPU. Use the turing GPU
node for this assignment.
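
As a starting point for the CPU side of the comparison, the bitonic sorting network can be sketched sequentially as below. This is a minimal sketch, not the version from the class: the names bitonic_sort_cpu and time_sort_ms are hypothetical, and the code assumes the array length is a power of two. The innermost loop over i is the part that a CUDA kernel would typically distribute over threads, with a barrier such as __syncthreads() between successive values of j.

```cpp
#include <cassert>
#include <chrono>
#include <cstddef>
#include <utility>
#include <vector>

// Hypothetical helper: sorts `a` in ascending order with the bitonic network.
// Assumption: a.size() is a power of two (pad the input otherwise).
void bitonic_sort_cpu(std::vector<int>& a) {
    const std::size_t n = a.size();
    assert(n != 0 && (n & (n - 1)) == 0);  // power-of-two length only

    // k: length of the bitonic sequences merged in this stage.
    for (std::size_t k = 2; k <= n; k <<= 1) {
        // j: distance between the two elements of each compare-exchange.
        for (std::size_t j = k >> 1; j > 0; j >>= 1) {
            // Every i is independent within one (k, j) step.
            for (std::size_t i = 0; i < n; ++i) {
                const std::size_t partner = i ^ j;
                if (partner > i) {
                    const bool ascending = (i & k) == 0;
                    const bool out_of_order =
                        ascending ? (a[i] > a[partner]) : (a[i] < a[partner]);
                    if (out_of_order) std::swap(a[i], a[partner]);
                }
            }
        }
    }
}

// Hypothetical helper: wall-clock time of one CPU sort, in milliseconds.
// Takes the data by value so the caller's array stays unsorted for the GPU run.
double time_sort_ms(std::vector<int> data) {
    const auto start = std::chrono::steady_clock::now();
    bitonic_sort_cpu(data);
    const auto stop = std::chrono::steady_clock::now();
    return std::chrono::duration<double, std::milli>(stop - start).count();
}
```

Timing the same input on both sides, and checking that the GPU result matches the CPU result, keeps the comparison fair.
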
Extra points: Extend your program to use multiple thread blocks
and show results for the largest array that can fit in the GPU device
memory. For merging across thread blocks, use the idea of PRAM merge
sort covered in the class.
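
One common PRAM-style way to merge two sorted runs is merge by ranking: each element finds its final position as its own index plus its rank in the other run, found by binary search. The sketch below shows that idea sequentially. It is an assumption that this matches the variant covered in the class, and the name rank_merge is hypothetical. Every loop iteration writes to a distinct output slot, which is why each one could be assigned to its own thread.

```cpp
#include <algorithm>
#include <cassert>
#include <cstddef>
#include <vector>

// Hypothetical helper: merges two sorted runs by ranking.
// Assumption: `a` and `b` are each already sorted in ascending order.
std::vector<int> rank_merge(const std::vector<int>& a, const std::vector<int>& b) {
    assert(std::is_sorted(a.begin(), a.end()));
    assert(std::is_sorted(b.begin(), b.end()));

    std::vector<int> out(a.size() + b.size());

    // Elements of a: rank = number of elements of b strictly smaller.
    for (std::size_t i = 0; i < a.size(); ++i) {
        const std::size_t rank =
            std::lower_bound(b.begin(), b.end(), a[i]) - b.begin();
        out[i + rank] = a[i];
    }

    // Elements of b: rank = number of elements of a smaller or equal.
    // Using <= here (and < above) places ties from a first, so no two
    // elements are ever assigned the same output slot.
    for (std::size_t j = 0; j < b.size(); ++j) {
        const std::size_t rank =
            std::upper_bound(a.begin(), a.end(), b[j]) - a.begin();
        out[j + rank] = b[j];
    }
    return out;
}
```

For example, merging {1, 4, 4, 9} with {2, 4, 7, 10} yields {1, 2, 4, 4, 4, 7, 9, 10}.
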