Parallel Programming (3:1)

Instructor: Sathish Vadhiyar

Meeting Hours: 2:00-3:30 PM; Monday, Wednesday; Room 202, SERC

The objective of this course is to give you some level of confidence in parallel programming techniques, algorithms and tools. At the end of the course, you would (we hope) be in a position to deal with real-life parallel  problems and to explore new avenues of research in the area of parallel programming. The course covers parallel programming tools, constructs, models, algorithms, parallel matrix computations, parallel programming optimizations, scientific applications and parallel system software. MPI, OpenMP and CUDA will be covered.


  1. Neel Choudhary
  2. Tarun Sharma
  3. Arun I
  4. Nitish Divakar
  5. Jaley H Dholakiya
  6. M Hariprasad
  7. Ishwar Raut
  8. Vedsar Kushwaha
  9. Kachore Vaibhav Ankush
  10. Anubhav Das
  11. Dhuldhule Pratima Ashok
  12. Roopesh Mangal
  13. Abdus Shamim
  14. Anshul Kumar
  15. V Jagannath
  16. Tarun Bansal
  17. Ashirbad Mishra


Additional Reading

Topic Reading Material
  • Prerequisites
    • Introduction
    • MPI
    • OpenMP
    • CUDA-basics
  • Parallel Programming tools/languages/models
    • MPI collective communication implementations
    • MPI communicator groups, process topologies
    • PRAM algorithms


  • MPI-1: Online tutorial "MPI Complete Reference". Google for it.
  • Collective Communications: Lecture slides, and Google for paper "Optimization of Collective Communication Operations in MPICH" by Thakur, Rabenseifner and Gropp, IJHPCA 2005
  • PRAM algorithms: Book "Parallel Computing: Theory and Practice" by Michael J Quinn. Available with me. Pages 25-32, 40-42, 256
  • Parallel Algorithms
    • List ranking and Parallel Prefix
    • Sorting
    • Graph
    • FFT
  • List ranking
    • Paper: Optimization of Linked List Prefix Computations on Multithreaded GPUs Using CUDA (section III)
  • Sorting
    • Book: Introduction to parallel computing by Grama et al. Sections 9.2.1, 9.2.2, 9.5, 9.6.2
    • Paper: On the versatility of parallel sorting by regular sampling. Li et al. Parallel Computing. 1993. (Pages 1-6)
    • Paper: Parallel Sorting by regular sampling. Shi and Schaeffer. JPDC 1992. (Pages 2-4)
  • Graph algorithms
    • Paper: A Scalable Distributed Parallel Breadth-First Search Algorithm on BlueGene/L. Yoo et al. SC 2005. (Pages 1-7)
    • Paper:Accelerating large graph algorithms on the GPU using CUDA. Harish and Narayanan. HiPC 2007. (Page 5-8)
    • Book: Introduction to parallel computing by Grama et al. Sections 10.2-10.4, Sections 11.4.1-11.4.6
  • FFT Chapter in "Introduction to Parallel Computing" book
  • Matrix computations
    • Dense matrix computations
    • Sparse matrix computations
  • Dense Linear Algebra
    • Lecture slides
    • Paper: Towards Dense Linear Algebra for Hybrid Accelerated Manycore Systems. Parallel Computing. 2010 (section 4.1)
  • Sparse LA
    • Sparse Matrix vector multiplication: Paper: Efficient Sparse matrix-vector multiplication on cache-based GPUs. Reguly, Giles. InPar 2012.
    • For cholesky factorization and subsequent steps:
    • Sources [You can get these papers from me and take photocopies]
      • Parallel Algorithms for sparse linear systems - Heath, Ng and Peyton
      • Reordering sparse matrices for parallel elimination - Liu
      • A parallel graph partitioning algorithm for a message-passing multiprocessor - Gilbert and Zmijewski
      • Task scheduling for parallel sparse Cholesky factorization - Geist and Ng
      • Lecture slides
    • General steps of sparse matrix factorization: Heath, Ng and Peyton - pages 420-429
    • Parallel ordering
      • Heath, Ng and Peyton - pages 429-435 till Kernighan Lin
      • Liu - pages 75 and 89 (you can read other pages on reduction of elimination tree heights if interested),
      • for Kernighan Lin ordering - Gilbert and Zmijewski - pages 427-433, 437-440, your lecture slides
    • For mapping: Heath, Ng and Peyton - 437-439, figures 9 and 10; Geist and Ng - sections 3 and 4
    • For numerical factorization: Heath, Ng and Peyton - 442-450
    • Maximal Independent sets by Luby - "Introduction to Parallel Computing" Book
  • Advanced Parallel Programming Constructs, Models and Optimizations
    • MPI-IO
    • Parallel I/O Optimizations
    • GPU programming - CUDA Optimizations ppt, Advanced CUDA Slides
    • OpenCL, Charm++, Cilk - ppt
  • Scientific Applications
    • Molecular dynamics
    • Game of Life
    • Cosmology/n-nody simulations
    • Mesh applications
    • Bioinformatics
  • Molecular Dynamics - Paper: ``A New Parallel Method for Molecular Dynamics Simulation of Macromolecular Systems'' by Plimpton and Hendrickson. Sections 2-5.
  • N-body Simulations
    • The paper "Scalable parallel formulations of the Barnes-Hut method for n-body simulations" by Grama, Kumar and Sameh. In Parallel Computing 24(1998) 797-822.
    • The paper "Load balancing and data locality in adaptive hierarchical N-body methods: Barnes-Hut, Fast Multipole, and Dardiosity" by Singh, Holt, Totsuka, Gupta and Hennessey. In Journal of Parallel and Distributed Computing, 1994
  • Mesh applications
    • Paper: "Multilevel diffusion schemes for repartitioning of adaptive meshes" by Schloegel et al.
    • Paper: "Dynamic repartitioning of adaptively refined meshes" by Schloegel et al.
    • Paper: "Dynamic Octree Load Balancing Using Space-filling curves" by Campbell et al. - Section 2.5
    • Paper: "Irregularity in Multi-dimensional space-filling curves with applications in multimedia databases" by Mokbel and Aref - Section 4
  • Bioinformatics
    • Paper: "Parallel biological sequence comparisons using prefix computations" by Aluru et al. JPDC 2003.
  • Parallel System Software
    • Scheduling in Parallel Systems
    • Checkpointing and fault tolerance for large systems
  • Scheduling
    • Paper: "Backfilling with lookahead to optimize the packing of parallel jobs" by Shmueli and Feitelson. JPDC 2005.
    • Paper: "A comparison study of eleven static mapping heuristics for a class of meta-tasks on heterogeneous computing systems" by Tracy Braun et al., HCW 1999.
  • Checkpointing:
    • Paper: "An overview of checkpointing in uniprocessor and distributed systems, focusing on implementation and performance" by James Plank

Important - Look at the Rules section that contains important information on assignment deadlines, policies on plagiarism.

Platform for Assignments and Project- Tyrone, Tesla

Parallel Profilers


  1. Sparse matrix-vector multiplication
  2. Load balancing BFS
  3. Application Scheduling

Final Project

The final project has to clearly demonstrate the uniqueness of your work over existing work and show adequate performance improvements. It can be in

Sample Projects from Previous Years

Important Assignment Notes

  1. Please do not even exchange ideas with your friends since there is a thin line between exchanging ideas and codes looking the same.
  2. Please do not look up web/books for solutions.

All assignments will be evaluated for a maximum of 10. There will be a penalty of -1 for every additional day taken for submission after the assignment due date.

Thus, you will have to be judicious regarding deciding when to submit your assignments.


Suppose you have completed 1/2 of the assignment by the due date.

Scenario 1:

You think that it will take another 1 day to finish 3/4 of the assignment. In this scenario, if you submit by the due date, you will get a maximum score of 5 and if you submit a day after, you will get a maximum score of 6.5 (=7.5-1, -1 for the extra day). Thus, you will get better score if you take an extra day, finish 3/4 of the assignment and then submit.

Scenario 2:

You think that it will take another 3 days to finish 3/4 of the assignment.  In this scenario, if you submit by the due date, you will get a maximum score of 5 and if you submit 3 days after, you will get a maximum score of 4.5 (=7.5-3, -3 for the three extra days). Thus, you will get better score if you submit your assignment that is 1/2 complete by the due date than submit the assignment that will be 3/4 complete after 3 days.