Analysis
Performance analysis shows that the two-dimensional decomposition is always better.
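A rough sketch of the kind of comparison such an analysis makes, written in Python. It is not the analysis itself: it assumes the alternative is a one-dimensional block-row decomposition, that p is a perfect square, and that only the number of matrix elements each process must receive matters (latency and overlap are ignored). The function names are illustrative.

```python
import math


def words_received_1d(n: int, p: int) -> float:
    """Elements each process receives with a 1-D (block-row) decomposition.

    Assumed model: A, B and C are each split into p block rows. A process
    owns n*n/p elements of B but needs all n*n of them.
    """
    return n * n * (1 - 1 / p)


def words_received_2d(n: int, p: int) -> float:
    """Elements each process receives with a 2-D (block) decomposition.

    Assumed model: processes form a q x q grid with q = sqrt(p). A process
    needs one block row of A and one block column of B (n*n/q elements each)
    and already owns one n*n/p block of each.
    """
    q = math.isqrt(p)
    if q * q != p:
        raise ValueError("this sketch assumes p is a perfect square")
    return 2 * (n * n / q) * (1 - 1 / q)


if __name__ == "__main__":
    n = 1024
    for p in (4, 16, 64, 256):
        one_d = words_received_1d(n, p)
        two_d = words_received_2d(n, p)
        print(f"p={p:4d}  1-D: {one_d:12.0f}  2-D: {two_d:12.0f}")
```

Under these assumptions the ratio of the two volumes is (q + 1) / 2, so the gap in favour of the two-dimensional decomposition widens as the process grid grows.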
So our matrix multiply needs only one algorithm.
- Might need a redistribution algorithm to be completely neutral to the data distribution
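To make the redistribution step concrete, here is a minimal Python sketch that only plans the data movement from a one-dimensional block-row layout to a two-dimensional block layout. The starting layout, the row-major rank numbering on the grid, and the function names are all assumptions for illustration. It also assumes p is a perfect square that divides n. A real redistribution would turn this plan into messages.

```python
import math


def owner_1d(i: int, n: int, p: int) -> int:
    """Rank owning row i when the n rows are split into p equal block rows."""
    return i // (n // p)


def owner_2d(i: int, j: int, n: int, q: int) -> int:
    """Rank owning element (i, j) on a q x q grid with row-major rank order."""
    b = n // q  # side length of each square block
    return (i // b) * q + (j // b)


def redistribution_plan(n: int, p: int) -> dict:
    """Count the elements each (source, destination) rank pair must move."""
    q = math.isqrt(p)
    if q * q != p or n % p != 0:
        raise ValueError("sketch assumes p is a perfect square dividing n")
    plan = {}
    for i in range(n):
        src = owner_1d(i, n, p)
        for j in range(n):
            dst = owner_2d(i, j, n, q)
            if src != dst:  # elements already in place are not sent
                plan[(src, dst)] = plan.get((src, dst), 0) + 1
    return plan


if __name__ == "__main__":
    for (src, dst), count in sorted(redistribution_plan(8, 4).items()):
        print(f"rank {src} -> rank {dst}: {count} elements")
```

The element-by-element loop is chosen for clarity, not speed. A practical version would compute the overlapping index ranges directly.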
However, even the two-dimensional algorithm is not the best algorithm.