Systolic Matrix Multiply
Replace the A row broadcast with a rotation similar to the B column rotation.
Eliminates the expensive broadcast and replaces it with nearest-neighbor communication.
Communication per step drops from a row-wide broadcast to a single neighbor exchange.
Changes the data distribution: A and B must start in a skewed (pre-rotated) layout.
- Should we include it in a library?
- Redistribution costs?
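
The rotation scheme above can be sketched as a small serial simulation. This is an illustration only, not a parallel implementation: it assumes an n x n process grid with wraparound where each process holds a single entry of A, B, and C (real codes would hold blocks), and it assumes the slide describes the pattern commonly known as Cannon's algorithm. The name systolic_matmul is hypothetical.

```python
def systolic_matmul(A, B):
    """Simulate an n x n process grid; process (i, j) holds one entry each of A, B, C."""
    n = len(A)
    # Initial skew (the changed data distribution): row i of A is shifted
    # left by i positions, column j of B is shifted up by j positions.
    a = [[A[i][(j + i) % n] for j in range(n)] for i in range(n)]
    b = [[B[(i + j) % n][j] for j in range(n)] for i in range(n)]
    c = [[0] * n for _ in range(n)]
    for _ in range(n):
        # Local multiply-accumulate on every process.
        for i in range(n):
            for j in range(n):
                c[i][j] += a[i][j] * b[i][j]
        # Nearest-neighbor rotation with wraparound: each process takes A
        # from its right neighbor (A moves left along the row) and B from
        # the neighbor below (B moves up along the column). No broadcast.
        a = [[a[i][(j + 1) % n] for j in range(n)] for i in range(n)]
        b = [[b[(i + 1) % n][j] for j in range(n)] for i in range(n)]
    return c


if __name__ == "__main__":
    print(systolic_matmul([[1, 2], [3, 4]], [[5, 6], [7, 8]]))
    # prints [[19, 22], [43, 50]]
```

The two list comprehensions before the loop are the redistribution step the questions above refer to: a library routine would either have to require callers to supply A and B already skewed, or pay for that shuffle on entry.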