Parallel Programming Practice

Parallel Programming (3:1)

The objective of this course is to give you some level of confidence in parallel programming techniques, algorithms and tools. At the end of the course, you would (we hope) be in a position to deal with real-life parallel problems and to explore new avenues of research in the area of parallel programming.

Class

Aditya Acharya
Pavan Badarla
Geetha Venkatesh
Jayasimha Talur
Aniruddha Acharya
Aakriti Gupta
Aashish Viswakarma
Pavan Akulakrishna
Jagvir Singh

Instructor

Sathish Vadhiyar

Reading Materials

Introduction to Parallel Computing
by Ananth Grama, Anshul Gupta, George Karypis, Vipin Kumar Publisher: Addison Wesley; (2003) ISBN: 0-201-64865-2
Various publication materials, lecture slides

Additional Reading

Petascale Computing: Algorithms and Applications
David A. Bader (Ed.), Chapman & Hall/CRC Computational Science Series. 2007.
The Sourcebook of Parallel Computing
by Jack Dongarra (Editor), Ian Foster, Geoffrey Fox (Editor), Ken Kennedy, Andy White (Editor), Linda Torczon, Wiliiam Gropp Publisher: Morgan Kaufmann; (November 2002) ISBN: 1-558-60871-0
Numerical Linear Algebra for High Performance Computers
by Jack Dongarra, Iain Duff, Danny Sorensen, Henk van der Vorst Publisher: SIAM; (1998) ISBN: 0-89871-428-1

Meeting Hours

11:30-13:00; Tuesday, Thursday; Room 202, SERC

Grading System

Sessionals
- 2 Mid-Term Exams (February 19, March 26) - 16
- Assignments (4) - 24
- Final project - 10
Terminal
- Final project - 10
- Final Exam (April 29) - 40

Syllabus

Prerequisites - Introduction, MPI, OpenMP, CUDA-basics
Parallel Programming tools/languages/models - 1
1. MPI - collective communication implementations, communicator groups, process topologies
Parallel Algorithms
1. List Ranking and Parallel Prefix ppt
2. Sorting ppt
3. Graph ppt
4. FFT ppt
Matrix computations
1. Dense matrix computations ppt
2. Sparse matrix computations ppt
Advanced Parallel Programming Constructs and Optimizations
1. MPI-2 ppt
2. Parallel I/O optimizations ppt
3. GPU programming - CUDA Optimizations ppt, Advanced CUDA Slides
4. OpenCL - class ppt, lecture 5 from open university kit
5. Charm++ ppt
Scientific Applications
1. Molecular Dynamics ppt
2. Population evolution or Game of Life ppt
3. Cosmology/n-nody simulations ppt
4. Mesh applications ppt
Parallel System Software
1. Scheduling in Parallel Systems ppt
2. Checkpointing and Fault tolerance for large systems ppt

Reading Material

Introduction: Grama et al. - 2.4, 3.1, 3.5, 5.1, 5.6
MPI-1: Online tutorial "MPI Complete Reference". Google for it.
Collective Communications: Lecture slides, and Google for paper "Optimization of Collective Communication Operations in MPICH" by Thakur, Rabenseifner and Gropp, IJHPCA 2005
OpenMP - Lecture slides, and OpenMP tutorial: http://www.llnl.gov/computing/tutorials/openMP
Sorting
- Paper: On the versatility of parallel sorting by regular sampling. Li et al. Parallel Computing. 1993. (Pages 1-6)
- Paper: Parallel Sorting by regular sampling. Shi and Schaeffer. JPDC 1992. (Pages 2-4)
- Paper: Highly scalable parallel sorting. Solomonic and Kale. IPDPS 2010. (Pages 1-7)
- Book: Introduction to parallel computing by Grama et al. Sections 9.2.1, 9.2.2, 9.5, 9.6.2
FFT - FFT Chapter in "Introduction to Parallel Computing" book
Graph algorithms
- Paper: A Scalable Distributed Parallel Breadth-First Search Algorithm on BlueGene/L. Yoo et al. SC 2005. (Pages 1-7)
- Paper:Accelerating large graph algorithms on the GPU using CUDA. Harish and Narayanan. HiPC 2007. (Page 5-8)
- Book: Introduction to parallel computing by Grama et al. Sections 10.2-10.4, Sections 11.4.1-11.4.6
List Ranking
- Lecture slides
- Paper: Optimization of Linked List Prefix Computations on Multithreaded GPUs Using CUDA (section III)
CUDA Optimizations - Chapter 5 of CUDA programming guide and slides 30-34 of advanced CUDA
Dense Linear Algebra
- Lecture slides
- Paper: Towards Dense Linear Algebra for Hybrid Accelerated Manycore Systems. Parallel Computing. 2010 (section 4.1).
- Paper: Enabling and Scaling Matrix Computations on Heterogeneous Multi-Core and Multi-GPU Systems. ICS 2012 (sections 2 and 3).
Sparse LA
- Sparse Matrix vector multiplication: Paper: Efficient Sparse matrix-vector multiplication on cache-based GPUs. Reguly, Giles. InPar 2012.
- For cholesky factorization and subsequent steps:
- Sources [You can get these papers from me and take photocopies]
  - Parallel Algorithms for sparse linear systems - Heath, Ng and Peyton
  - Reordering sparse matrices for parallel elimination - Liu
  - A parallel graph partitioning algorithm for a message-passing multiprocessor - Gilbert and Zmijewski
  - Task scheduling for parallel sparse Cholesky factorization - Geist and Ng
  - Parallel algorithms for forward and back substitution in direct solution of sparse linear systems - Anshul Gupta and Vipin Kumar
  - Matrix computations - Golub and van Loan
  - Lecture slides
- General steps of sparse matrix factorization: Heath, Ng and Peyton - pages 420-429
- Parallel ordering
  - Heath, Ng and Peyton - pages 429-435 till Kernighan Lin
  - Liu - pages 75 and 89 (you can read other pages on reduction of elimination tree heights if interested),
  - for Kernighan Lin ordering - Gilbert and Zmijewski - pages 427-433, 437-440, your lecture slides
- For mapping: Heath, Ng and Peyton - 437-439, figures 9 and 10; Geist and Ng - sections 3 and 4
- For numerical factorization: Heath, Ng and Peyton - 442-450
- Maximal Independent sets by Luby - "Introduction to Parallel Computing Book"
MPI-2: Online tutorial: http://www-unix.mcs.anl.gov/mpi/mpi-standard/mpi-report-2.0/mpi2-report.htm
Parallel I/O Optimizations: Google for paper Rajeev Thakur, William Gropp, and Ewing Lusk, “A Case for Using MPI's Derived Datatypes to. Improve I/O Performance,” in Proceedings of SC98.
Molecular Dynamics - lecture slides, the paper ``A New Parallel Method for Molecular Dynamics Simulation of Macromolecular Systems'' by Plimpton and Hendrickson. Sections 2-5.
Game of Life - Lecture slides
Cosmology
- Lecture slides
- The paper "Scalable parallel formulations of the Barnes-Hut method for n-body simulations" by Grama, Kumar and Sameh. In Parallel Computing 24(1998) 797-822.
- The paper "Load balancing and data locality in adaptive hierarchical N-body methods: Barnes-Hut, Fast Multipole, and Dardiosity" by Singh, Holt, Totsuka, Gupta and Hennessey. In Journal of Parallel and Distributed Computing, 1994
Mesh applications
- Lecture slides
- Paper: "Multilevel diffusion schemes for repartitioning of adaptive meshes" by Schloegel et al.
- Paper: "Dynamic repartitioning of adaptively refined meshes" by Schloegel et al.
- Paper: "Dynamic Octree Load Balancing Using Space-filling curves" by Campbell et al. - Section 2.5
- Paper: "Irregularity in Multi-dimensional space-filling curves with applications in multimedia databases" by Mokbel and Aref - Section 4
OpenCL
- Lecture slides
- Lectures 2, 5, 11 and 13 of OpenCL University Kit from AMD
- OpenCL Overview from AMD
Charm++
- Lecture slides
- Online tutorial
Scheduling
- Paper: "Backfilling with lookahead to optimize the packing of parallel jobs" by Shmueli and Feitelson. JPDC 2005.
- Paper: "A comparison study of eleven static mapping heuristics for a class of meta-tasks on heterogeneous computing systems" by Tracy Braun et al., HCW 1999.
Checkpointing:
- Lecture slides
- Paper: "An overview of checkpointing in uniprocessor and distributed systems, focusing on implementation and performance" by James Plank
- References also given in the lecture slides

Assignments

Parallel Prefix
Sparse Matrix-vector Multiplication
Loadbalancing BFS
Sparse matrix triangular solve

Final Project

The final project has to clearly demonstrate the uniqueness of your work over existing work and show adequate performance improvements. It can be in

parallelizing well known algorithms or problems,
parallelizing numerical methods
mapping (collective) communications to 3D torus of Bluegene/L
prograparallelizing numerical methods
mapping (collective) communications to 3D torus of Bluegene/L
programming interface or abstractions or generic scheduling and load balancing strategies for a class of problems

Notes - contains important information on assignment deadlines, policies on plagiarism.

Platform for Assignments and Project- Tyrone, Tesla

Parallel Profilers