Parallel Programming

Instructor: Sathish Vadhiyar
Course number: DS295
Credits: 3:1
Semester: Jan 2025
Lecture: Tue/Th 1130AM-1PM (First class: Jan 2, 2025, 1130AM)
Office Hours: Fridays, 5-6 PM
Fill out this survey form
Microsoft Teams Link

Overview

The objective of this course is to give you some level of confidence in parallel programming techniques, algorithms and tools. At the end of the course, you would (we hope) be in a position to apply parallelization to your project areas and beyond, and to explore new avenues of research in the area of parallel programming.

The course covers parallel architectures, parallelization principles, parallel programming standards and tools, parallel algorithms, numerical methods, scientific applications and parallel system software.

Pre-requisites

Basic knowledge of computer systems, data structures and programming, and algorithms.

Grading Scheme

Sessionals
- Mid-term [1 No.] (February 18) – 20
- Assignments [3 Nos.] – 30
  - Assignment 1 due: February 10
  - Assignment 2 due: March 20
  - Literature study report: April 10
Terminal
- Literature Study Report (Due: April 10) – 20
- Final project – 30
  - Proposal: February 28
  - Final presentation: April 28 forenoon
  - Final report: By April 29

Resources

Introduction to Parallel Computing. Ananth Grama, Anshul Gupta, George Karypis, Vipin Kumar. Publisher: Addison Wesley. ISBN: 0-201-64865-2. 2003.
Parallel Computing Architecture. A Hardware/Software Approach. David Culler, Jaswant Singh. Publisher: Morgan Kauffman. ISBN: 981-4033-103. 1999.
Parallel Computing. Theory and Practice. Michael J. Quinn. Publisher: Tata: McGraw-Hill. ISBN: 0-07-049546-7. 2002.
An Introduction to Parallel Programming. Peter S Pacheco. Publisher: Morgan Kauffman. ISBN: 978-93-80931-75-3. 2011.
Online references for OpenMP, MPI, CUDA
Various publication materials and references that will be posted along with the lecture slides.

Lecture Slides

Parallel Programming fundamentals
- Architecture – ppt
- Parallelization principles – metrics, data distributions pdf
- Performance profiling – CUDA profiling (pdf)
- PRAM programming model pdf
Parallel programming standards and tools
- OpenMP – pdf
- MPI – basic pdf, Groups and process topologies pdf, MPI I/O pdf, I/O optimizations pdf, MPI-2 pdf
- GPU architectures (pdf) and CUDA programming (pdf),
- CUDA Optimizations pdf, Advanced CUDA Slides, CUDA occupancy calculator xls
Parallel Algorithms
- Sorting – pdf
- Searching (BFS) and Graph Algorithms – pdf
- Machine and Deep Learning Algorithms – pdf, pdf
Parallel linear algebra and numerical methods
- Matrix multiplication
- Fast Fourier Transforms – pdf
- Dense Linear Algebra – pdf
- Sparse linear algebra – pdf
Parallelism in scientific applications
- Molecular Dynamics – pdf
- Mesh applications pdf
- N-Body Simulations pdf
System Software
- Scheduling pdf

Reading Materials

For those portions for which reading materials are not given below, lecture slides above are the only reading materials.

Parallel Programming fundamentals
- Parallel Architecture
  - Grama et al. – 2.4, For GPU architecture
  - Google for NVIDIA A100 white paper
- Parallelization Principles
  - Grama et al. – 3.1, 3.5, 5.1-5.6
  - Book by Culler and Singh – 2.2, 2.3
  - PRAM: Book by Michael J Quinn Pages 25-32, 40-42, 256
  - Roofline performance model – Paper by Williams et al. “Roofline: an insightful visual performance model for multicore architectures” in Communications of ACM, 2009.
Parallel Programming standards and tools
- CUDA – Google for NVIDIA CUDA programming guide.
- CUDA optimizations –
- Chapter 5 of CUDA programming guide and slides 30-34 of advanced CUDA, CUDA Occupancy calculator
- OpenMP – OpenMP tutorial: http://www.llnl.gov/computing/tutorials/openMP
- MPI-1: Google for “MPI: The complete reference”.
- MPI-2: Online tutorial: http://www-unix.mcs.anl.gov/mpi/mpi-standard/mpi-report-2.0/mpi2-report.htm
- Parallel I/O optimizations – Google for paper Rajeev Thakur, William Gropp, and Ewing Lusk, “A Case for Using MPI’s Derived Datatypes to. Improve I/O Performance”, in Proceedings of SC98.
Parallel Algorithms
- Sorting
  - Bitonic sort: In the book by Grama et al.
  - Paper: On the versatility of parallel sorting by regular sampling. Li et al. Parallel Computing. 1993. (Pages 1-6)
  - Paper: Parallel Sorting by regular sampling. Shi and Schaeffer. JPDC 1992. (Pages 2-4)
- Graph algorithms
  - Lecture slides.
  - Paper: A parallel graph partitioning algorithm for a message-passing multiprocessor – Gilbert and Zmijewski – – pages 427-433, 437-440.
  - Paper: Multi-level Graph Partitioning Schemes – Karypis and Kumar. Google for it.
- Machine Learning and Deep Learning
  - Paper: PANDA: Extreme Scale Parallel K-Nearest Neighbor on Distributed Architectures. IPDPS 2016.
FFT Chapter in “Introduction to Parallel Computing” book
Dense Linear Algebra
- Paper: “F. Song, S. Tomov, and J. Dongarra. Enabling and Scaling Matrix Computations on Heterogeneous Multi-core and Multi-GPU Systems. In In the Proceedings of IEEE/ACM Supercomputing Conference (SC), pages 365–376, 2012.”
- Slides on “Communication Avoiding QR and LU”, CS 294 lecture slides, Laura Grigori, ALPINES INRIA Rocquencourt – LJLL, UPMC – https://who.rocq.inria.fr/Laura.Grigori/TeachingDocs/CS-294_Spr2016/Slides_CS-294_Spr2016/CS294_Spr16_CALUQR.pdf
- “Communication-optimal parallel and sequential QR and LU Factorizations”, James Demmel et al., Technical Report No. UCB/EECS-2008-89, August 2008.
Sparse Linear Algebra
- Sources [You can get these papers from me and take photocopies]
  - Parallel Algorithms for sparse linear systems – Heath, Ng and Peyton
  - Reordering sparse matrices for parallel elimination – Liu
  - Task scheduling for parallel sparse Cholesky factorization – Geist and Ng
  - Lecture slides
- General steps of sparse matrix factorization: Heath, Ng and Peyton – pages 420-429
- Parallel ordering
  - Heath, Ng and Peyton – pages 429-435 till Kernighan Lin
  - Liu – pages 75 and 89 (you can read other pages on reduction of elimination tree heights if interested)
- For mapping: Heath, Ng and Peyton – 437-439, figures 9 and 10; Geist and Ng – sections 3 and 4
- For numerical factorization: Heath, Ng and Peyton – 442-450
Parallelism in Scientific Applications
- Molecular Dynamics: Paper: “A New Parallel Method for Molecular Dynamics Simulation of Macromolecular Systems” by Plimpton and Hendrickson. Sections 2-5.
- Mesh applications
  - Paper: “Multilevel diffusion schemes for repartitioning of adaptive meshes” by Schloegel et al.
  - Paper: “Dynamic repartitioning of adaptively refined meshes” by Schloegel et al.
  - Paper: “Dynamic Octree Load Balancing Using Space-filling curves” by Campbell et al. – Section 2.5
  - Paper: “Irregularity in Multi-dimensional space-filling curves with applications in multimedia databases” by Mokbel and Aref – Section 4
- N-body Simulations
  - The paper “Scalable parallel formulations of the Barnes-Hut method for n-body simulations” by Grama, Kumar and Sameh. In Parallel Computing 24(1998) 797-822.
  - The paper “Load balancing and data locality in adaptive hierarchical N-body methods: Barnes-Hut, Fast Multipole, and Dardiosity” by Singh, Holt, Totsuka, Gupta and Hennessey. In Journal of Parallel and Distributed Computing, 1994.
System Software
- Scheduling
  - Paper: “Backfilling with lookahead to optimize the packing of parallel jobs” by Shmueli and Feitelson. JPDC 2005.
  - Paper: “A comparison study of eleven static mapping heuristics for a class of meta-tasks on heterogeneous computing systems” by Tracy Braun et al., HCW 1999
- Fault tolerance
  - Paper: “An overview of checkpointing in uniprocessor and distributed systems, focusing on implementation and performance” by James Plank

Others

Topic	Reading

Parallel Algorithms Divide and Conquer Algorithms – pdf MPI collective communication algorithms pdf Mesh-based algorithm: APSP pdf	Collective Communications: Lecture slides, and Google for paper “Optimization of Collective Communication Operations in MPICH” by Thakur, Rabenseifner and Gropp, IJHPCA 2005 Mapping to network topologies. Section 5.1 in book “Parallel Computing: Theory and Practice” by Michael J Quinn Tridiagonal systems – lecture slides APSP: Book: Introduction to parallel computing by Grama et al. Sections 10.2-10.4, Sections 11.4.1-11.4.6

Scientific Applications Game of Life pdf DFS and Dynamic Load Balancing pdf Bioinformatics pdf	DFS: Book: Introduction to parallel computing by Grama et al. Sections 10.2-10.4, Sections 11.4.1-11.4.6 Bioinformatics Paper: Parallel Biological Sequence Comparison using Prefix Computations. Aluru et. al. JPDC 2003.

Assignments

Turing cluster notes on OpenMP, MPI and CUDA for the assignments
Assignment 1 –
Assignment 2 –
Assignment 3 –

Final Project

The final project has to clearly demonstrate the uniqueness of your work over existing work and show adequate performance improvements. You can work in a team of max 2 members. It can be in

parallelizing well known algorithms or problems. Examples:
- hybrid executions on both CPU and GPU cores.
- graph algorithms – spanning trees, shortest paths, satisfiability, mesh generation or refinement (e.g., Delaunay)
- sorting, searching
- clustering algorithms
- parallelization of fundamental algorithms you would encounter in an algorithm book. e.g.,
  1. Introduction to Algorithms. Third Edition. Cormen, Leiserson, Rivest, Stein
  2. The Design and Analysis of Algorithms. Aho, Hopcroft, Ulman
  3. Data Structures and Algorithms. Aho, Hopcroft, Ulman
parallelizing numerical methods, Examples:
- Dense martrix or sparse matrix computations. e.g., Cholesky, QR, Inverse, SVD, conjugate gradient etc.
- >Transforms (FFT, wavelet etc.)
- Any numerical method you would encounter in matrix/numerical methods book. e.g.,
  1. Introduction to Numerical Analysis. Second Edition. Hildebrand
  2. Elementary Numerical Analysis. Third Edition. Conte, de Boor
System software. Examples:
- Techniques for scheduling or mapping parallel tasks to processors to achieve least makespan or throughput
- Load balancing techniques for irregular computations
- Automatic techniques for splitting tasks among GPU and CPU cores.
- Automatic data management techniques for GPU cores.
- Proposing genetic programming abstractions
- Fault tolerance

Sample Projects from Previous Years

Finding maximum clique in hybrid system – pdf
Parallel hybrid implementation of SVM – pdf
GPU TSP – pdf
GPU implementation of implicit Runge-Kutta methods – pdf
Parallel implementation of shared nearest neighbor clustering algorithm – pdf
Hybrid implementation of alternate least square algorithm – pdf
Approximate k-nearest neighbor search – pdf
Incremental Delaunay triangulation – pdf
k-maximum subarray problem – pdf
Scale Invariant Feature Transform (SIFT) – pdf
AKS algorithm for primality proving – pdf
Betweenness centrality – pdf

Ethics and Rules

Ethics

Please do not even exchange ideas with your friends since there is a thin line between exchanging ideas and codes looking the same.
Please do not look up web/books for solutions.
See Dr. Yogesh’ nice writeup on plagiarism policies in his HPC page

Deadlines

All assignments will be evaluated for a maximum of 10. There will be a penalty of -1 for every additional day taken for submission after the assignment due date.

Thus, you will have to be judicious regarding deciding when to submit your assignments.

Example

Suppose you have completed 1/2 of the assignment by the due date.

Scenario 1:

You think that it will take another 1 day to finish 3/4 of the assignment. In this scenario, if you submit by the due date, you will get a maximum score of 5 and if you submit a day after, you will get a maximum score of 6.5 (=7.5-1, -1 for the extra day). Thus, you will get better score if you take an extra day, finish 3/4 of the assignment and then submit.

Scenario 2:

You think that it will take another 3 days to finish 3/4 of the assignment. In this scenario, if you submit by the due date, you will get a maximum score of 5 and if you submit 3 days after, you will get a maximum score of 4.5 (=7.5-3, -3 for the three extra days). Thus, you will get better score if you submit your assignment that is 1/2 complete by the due date than submit the assignment that will be 3/4 complete after 3 days.

DS 295: Parallel Programming

Parallel Programming

Overview

Pre-requisites

Grading Scheme

Resources

Lecture Slides

Reading Materials

Others

Assignments

Final Project

Ethics and Rules

Recent News

Contact Us

Shortcuts

Explore

Get in touch

Follow us

Locate us