**Matrix Multiplication with Block-Cyclic Distributed Matrices**

**Due: March 11**

In this assignment, you will implement a distributed-memory (MPI) matrix-matrix multiplication, AB = C, where A, B and C are square (N × N) matrices distributed block-cyclically in both dimensions across the processes. Use double-precision (double) elements for the matrices. Initially, the input matrices can be randomly generated and stored in a file. All processes then read them with the block-cyclic distribution using MPI parallel I/O.
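To make the two-dimensional block-cyclic layout concrete, the following is a minimal serial sketch of the index arithmetic. It is not MPI code. The block size `nb`, the process grid shape `pr` × `pc`, and all function names are assumptions made for illustration; the assignment does not fix any of them.

```python
def owner(g, nb, p):
    """Process coordinate (0..p-1) that owns global index g along one dimension."""
    return (g // nb) % p


def local_index(g, nb, p):
    """Position of global index g inside its owner's local storage."""
    return (g // (nb * p)) * nb + g % nb


def local_count(n, nb, p, coord):
    """Number of the n global indices that land on process coordinate coord."""
    full_blocks = n // nb
    count = (full_blocks // p) * nb   # complete rounds over all p coordinates
    extra = full_blocks % p           # full blocks left after the last round
    if coord < extra:
        count += nb                   # gets one more full block
    elif coord == extra:
        count += n % nb               # gets the trailing partial block, if any
    return count


def owner_2d(i, j, nb, pr, pc):
    """Grid coordinates (row, col) of the process owning element (i, j)."""
    return owner(i, nb, pr), owner(j, nb, pc)


if __name__ == "__main__":
    n, nb, pr, pc = 10, 2, 2, 3       # hypothetical sizes for illustration
    for r in range(pr):
        for c in range(pc):
            rows = local_count(n, nb, pr, r)
            cols = local_count(n, nb, pc, c)
            print(f"process ({r},{c}) holds a {rows} x {cols} local matrix")
```

The same arithmetic applies independently to rows and columns, which is why the 2D owner is just a pair of 1D owners. Local sizes can differ between processes when N is not a multiple of the block size times the grid dimension.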

Then perform the communications needed to carry out the multiplication on the distributed blocks.
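The assignment does not prescribe a particular algorithm. One common choice for block-cyclic data is a SUMMA-style scheme, in which each step broadcasts one panel of A along process rows and one panel of B along process columns. The sketch below is a serial simulation of that communication pattern, assuming a hypothetical `pr` × `pc` grid and block size `nb`. The broadcasts are only indicated in comments, and the function name is illustrative.

```python
def summa_block_cyclic(A, B, nb, pr, pc):
    """Serially simulate a SUMMA-style product C = A * B on a pr x pc grid."""
    n = len(A)
    # Global row/column indices owned by each process row/column (block-cyclic).
    rows = {r: [i for i in range(n) if (i // nb) % pr == r] for r in range(pr)}
    cols = {c: [j for j in range(n) if (j // nb) % pc == c] for c in range(pc)}
    C = [[0.0] * n for _ in range(n)]

    for k0 in range(0, n, nb):
        ks = range(k0, min(k0 + nb, n))      # current panel of the inner index
        src_col = (k0 // nb) % pc            # process column holding A[:, ks]
        src_row = (k0 // nb) % pr            # process row holding B[ks, :]
        for r in range(pr):
            for c in range(pc):
                # In MPI, process (r, src_col) would broadcast this piece of
                # A[:, ks] along process row r ...
                a_panel = {i: [A[i][k] for k in ks] for i in rows[r]}
                # ... and process (src_row, c) would broadcast this piece of
                # B[ks, :] along process column c.
                b_panel = {j: [B[k][j] for k in ks] for j in cols[c]}
                # Local update of the part of C owned by process (r, c).
                for i in rows[r]:
                    for j in cols[c]:
                        C[i][j] += sum(x * y for x, y in zip(a_panel[i], b_panel[j]))
    return C


if __name__ == "__main__":
    n = 5
    A = [[float(i + j) for j in range(n)] for i in range(n)]
    B = [[float(i * j + 1) for j in range(n)] for i in range(n)]
    C = summa_block_cyclic(A, B, nb=2, pr=2, pc=3)
    print(C[0])
```

In a real implementation each process would store only its local blocks, and the two panel transfers would be collective operations over row and column communicators. Other algorithms may suit the assignment equally well.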

Show weak scaling results for 1, 8, 16, 32, 64 and 96 processes. For each process count, choose the matrix dimension N as the largest value for which the per-core share of the matrices fits within the RAM available per core on the Turing cluster. Remember to reserve 10% of the RAM for additional data structures your program may need.
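The sizing rule can be worked through as a small calculation. The sketch below assumes that the three matrices A, B and C are the only large arrays, that they are split evenly across processes, and that each element is an 8-byte double. The RAM-per-core value is a placeholder, not the actual Turing figure, and the function name is illustrative.

```python
import math

BYTES_PER_DOUBLE = 8
NUM_MATRICES = 3   # A, B and C


def max_matrix_dim(ram_per_core_bytes, num_procs, reserve_percent=10):
    """Largest N such that three N x N double matrices, split evenly over
    num_procs processes, fit in the non-reserved part of each core's RAM."""
    budget_per_core = ram_per_core_bytes * (100 - reserve_percent) // 100
    total_elements = budget_per_core * num_procs // (BYTES_PER_DOUBLE * NUM_MATRICES)
    return math.isqrt(total_elements)


if __name__ == "__main__":
    ram_per_core = 2 * 1024**3   # placeholder: 2 GiB per core (assumption)
    for p in (1, 8, 16, 32, 64, 96):
        print(p, max_matrix_dim(ram_per_core, p))
```

Because memory grows as N² while the process count grows linearly, N scales roughly with the square root of the number of processes. In practice N may need to be rounded down further, since a block-cyclic layout can give some processes slightly more than an even share.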

Write a report with methodology, results and observations. For the results, show the weak scaling graph, and show the matrix sizes used in a separate table.

The relative ranks and evaluation scores for this assignment will be based on the matrix sizes used and the execution times obtained.