Parallel Programming

The objective of this course is to give you some level of confidence in parallel programming techniques, algorithms and tools. At the end of the course, you would (we hope) be in a position to apply parallelization to your project areas and beyond, and to explore new avenues of research in the area of parallel programming. The course covers parallel programming tools, constructs, models, algorithms, parallel matrix computations, parallel programming optimizations, scientific applications and parallel system software. MPI, OpenMP and CUDA will be covered.

Class

Ankit Shrivatsava
Md. Imbesat Hassan Rizvi
GVK Madhav
Manglani Krishna Prem
Ponnezhil Dass MP
Prateek Kushwaha
Rajrup Ghosh
Rakesh Kumar Mallik
Snehansu Sekar Sahu
Srinivas K
Rintu Panja
Rahul Hasija

Introduction to Parallel Computing
by Ananth Grama, Anshul Gupta, George Karypis, Vipin Kumar Publisher: Addison Wesley; (2003) ISBN: 0-201-64865-2
Various publication materials and references that will be posted along with the lecture slides

Additional Reading

Petascale Computing: Algorithms and Applications
David A. Bader (Ed.), Chapman & Hall/CRC Computational Science Series. 2007.
The Sourcebook of Parallel Computing
by Jack Dongarra (Editor), Ian Foster, Geoffrey Fox (Editor), Ken Kennedy, Andy White (Editor), Linda Torczon, Wiliiam Gropp Publisher: Morgan Kaufmann; (November 2002) ISBN: 1-558-60871-0
Numerical Linear Algebra for High Performance Computers
by Jack Dongarra, Iain Duff, Danny Sorensen, Henk van der Vorst Publisher: SIAM; (1998) ISBN: 0-89871-428-1

Sessionals
- 2 Mid-Term Exams (February 11, March 17) - 30
- Assignments (2. Due: February 4, February 27) - 20
Terminal
- Assignment (1. Due: March 28) - 10
- Final project (Proposal: March 5, Final presentation: April 18, Final report: April 19) - 20
- Final Exam (April 29 forenoon) - 20

Topic	Reading Material
Prerequisites Introduction MPI OpenMP CUDA	Introduction: Grama et al. - 2.4, 3.1, 3.5, 5.1, 5.6 MPI-1: Online tutorial "MPI Complete Reference". Google for it. OpenMP - Lecture slides, and OpenMP tutorial: http://www.llnl.gov/computing/tutorials/openMP CUDA - lecture slides Besides, follow the "Platform for Assignments" link in my HPC 2015 web page.
Parallel Programming tools/languages/models PRAM algorithms MPI collective communication implementations MPI communicator groups, process topologies MPI-IO Parallel I/O Optimizations GPU programming - CUDA Optimizations ppt, Advanced CUDA Slides, CUDA occupancy calculator xl	PRAM algorithms: Book "Parallel Computing: Theory and Practice" by Michael J Quinn. Available with me. Pages 25-32, 40-42, 256 MPI-1: Online tutorial "MPI Complete Reference". Google for it. Collective Communications: Lecture slides, and Google for paper "Optimization of Collective Communication Operations in MPICH" by Thakur, Rabenseifner and Gropp, IJHPCA 2005 Mapping to network topologies. Section 5.1 in book "Parallel Computing: Theory and Practice" by Michael J Quinn. CUDA Optimizations - Chapter 5 of CUDA programming guide and slides 30-34 of advanced CUDA, CUDA Occupancy calculator MPI-2: Online tutorial: http://www-unix.mcs.anl.gov/mpi/mpi-standard/mpi-report-2.0/mpi2-report.htm Parallel I/O Optimizations: Google for paper Rajeev Thakur, William Gropp, and Ewing Lusk, “A Case for Using MPI's Derived Datatypes to. Improve I/O Performance,” in Proceedings of SC98.
Parallel Algorithms Divide and Conquer algorithms ppt Solving tridiagonal systems Prefix computations Sample sort FFT ppt Mesh-based algorithm APSP, along with MST and SSSP ppt Graph Algorithms - BFS, DFS, Partitioning, Coloring ppt	Tridiagonal systems - lecture slides Prefix computations - lecture slides Sorting Paper: On the versatility of parallel sorting by regular sampling. Li et al. Parallel Computing. 1993. (Pages 1-6) Paper: Parallel Sorting by regular sampling. Shi and Schaeffer. JPDC 1992. (Pages 2-4) FFT Chapter in "Introduction to Parallel Computing" book APSP: Book: Introduction to parallel computing by Grama et al. Sections 10.2-10.4, Sections 11.4.1-11.4.6 Graph algorithms Lecture slides Paper: A Scalable Distributed Parallel Breadth-First Search Algorithm on BlueGene/L. Yoo et al. SC 2005. (Pages 1-7) Paper: Accelerating large graph algorithms on the GPU using CUDA. Harish and Narayanan. HiPC 2007. (Page 5-8) DFS: Book: Introduction to parallel computing by Grama et al. Sections 10.2-10.4, Sections 11.4.1-11.4.6 A parallel graph partitioning algorithm for a message-passing multiprocessor - Gilbert and Zmijewski - - pages 427-433, 437-440. Coloring: Maximal Independent sets by Luby - "Introduction to Parallel Computing" Book Coloring: Paper "Scalable parallel graph coloring algorithms" by Gebremedhin, Manne.
Matrix computations Dense Linear Algebra ppt Sparse Linear Algebra ppt	Dense Linear Algebra Lecture slides Paper: Towards Dense Linear Algebra for Hybrid Accelerated Manycore Systems. Parallel Computing. 2010 (section 4.1) Sparse LA Sparse Matrix vector multiplication: Paper: Efficient Sparse matrix-vector multiplication on cache-based GPUs. Reguly, Giles. InPar 2012. For cholesky factorization and subsequent steps: Sources [You can get these papers from me and take photocopies] Parallel Algorithms for sparse linear systems - Heath, Ng and Peyton Reordering sparse matrices for parallel elimination - Liu Task scheduling for parallel sparse Cholesky factorization - Geist and Ng Lecture slides General steps of sparse matrix factorization: Heath, Ng and Peyton - pages 420-429 Parallel ordering Heath, Ng and Peyton - pages 429-435 till Kernighan Lin Liu - pages 75 and 89 (you can read other pages on reduction of elimination tree heights if interested) For mapping: Heath, Ng and Peyton - 437-439, figures 9 and 10; Geist and Ng - sections 3 and 4 For numerical factorization: Heath, Ng and Peyton - 442-450
Scientific Applications Molecular dynamics Game of Life Adaptivity and Dynamic Load Balancing DLB in DFS, GoL Mesh applications	Molecular Dynamics - Paper: ``A New Parallel Method for Molecular Dynamics Simulation of Macromolecular Systems'' by Plimpton and Hendrickson. Sections 2-5. Mesh applications Paper: "Multilevel diffusion schemes for repartitioning of adaptive meshes" by Schloegel et al. Paper: "Dynamic repartitioning of adaptively refined meshes" by Schloegel et al. Paper: "Dynamic Octree Load Balancing Using Space-filling curves" by Campbell et al. - Section 2.5 Paper: "Irregularity in Multi-dimensional space-filling curves with applications in multimedia databases" by Mokbel and Aref - Section 4
Parallel System Software Scheduling in Parallel Systems Fault tolerance for large systems	Scheduling Paper: "Backfilling with lookahead to optimize the packing of parallel jobs" by Shmueli and Feitelson. JPDC 2005. Paper: "A comparison study of eleven static mapping heuristics for a class of meta-tasks on heterogeneous computing systems" by Tracy Braun et al., HCW 1999. Fault Tolerance: Paper: "An overview of checkpointing in uniprocessor and distributed systems, focusing on implementation and performance" by James Plank

Important - Look at the Rules section that contains important information on assignment deadlines, policies on plagiarism.

Platform for Assignments

Parallel Profilers

Assignments

Assignment 1 - MPI Collectives
Assignment 2 - CUDA Sparse Matrix-Vector Multiplication
Assignment 3 - Parallel BFS

Final Project

The final project has to clearly demonstrate the uniqueness of your work over existing work and show adequate performance improvements. You can work in a team of max 2 members. It can be in

parallelizing well known algorithms or problems. Examples:
- hybrid executions on both CPU and GPU cores.
- graph algorithms - spanning trees, shortest paths, satisfiability, mesh generation or refinement (e.g., Delaunay)
- sorting, searching
- clustering algorithms
- parallelization of fundamental algorithms you would encounter in an algorithm book. e.g.,
  1. Introduction to Algorithms. Third Edition. Cormen, Leiserson, Rivest, Stein
  2. The Design and Analysis of Algorithms. Aho, Hopcroft, Ulman
  3. Data Structures and Algorithms. Aho, Hopcroft, Ulman
parallelizing numerical methods, Examples:
- Dense martrix or sparse matrix computations. e.g., Cholesky, QR, Inverse, SVD, conjugate gradient etc.
- Transforms (FFT, wavelet etc.)
- Any numerical method you would encounter in matrix/numerical methods book. e.g.,
  1. Introduction to Numerical Analysis. Second Edition. Hildebrand
  2. Elementary Numerical Analysis. Third Edition. Conte, de Boor
System software. Examples:
- Techniques for scheduling or mapping parallel tasks to processors to achieve least makespan or throughput
- Load balancing techniques for irregular computations
- Automatic techniques for splitting tasks among GPU and CPU cores.
- Automatic data management techniques for GPU cores.
- Proposing genetic programming abstractions
- Fault tolerance

Sample Projects from Previous Years

GPU implementation of implicit Runge-Kutta methods - pdf
Parallel implementation of shared nearest neighbor clustering algorithm - pdf
Hybrid implementation of alternate least square algorithm - pdf
Approximate k-nearest neighbor search - pdf
Incremental Delaunay triangulation - pdf
k-maximum subarray problem - pdf
Scale Invariant Feature Transform (SIFT) - pdf
AKS algorithm for primality proving - pdf
Betweenness centrality - pdf

Important Assignment Notes

Ethics

Please do not even exchange ideas with your friends since there is a thin line between exchanging ideas and codes looking the same.
Please do not look up web/books for solutions.
See Dr. Yogesh' nice writeup on plagiarism policies in his HPC page

Deadlines

All assignments will be evaluated for a maximum of 10. There will be a penalty of -1 for every additional day taken for submission after the assignment due date.

Thus, you will have to be judicious regarding deciding when to submit your assignments.

Example

Suppose you have completed 1/2 of the assignment by the due date.

Scenario 1:

You think that it will take another 1 day to finish 3/4 of the assignment. In this scenario, if you submit by the due date, you will get a maximum score of 5 and if you submit a day after, you will get a maximum score of 6.5 (=7.5-1, -1 for the extra day). Thus, you will get better score if you take an extra day, finish 3/4 of the assignment and then submit.

Scenario 2:

You think that it will take another 3 days to finish 3/4 of the assignment. In this scenario, if you submit by the due date, you will get a maximum score of 5 and if you submit 3 days after, you will get a maximum score of 4.5 (=7.5-3, -3 for the three extra days). Thus, you will get better score if you submit your assignment that is 1/2 complete by the due date than submit the assignment that will be 3/4 complete after 3 days.