Edgar Solomonik, University of California, Berkeley
2.5D algorithms parameterize the parallel decomposition of dense linear algebra problems. On one end, 2.5D algorithms generalize 2D algorithms, which partition the data without replication. On the other end, 2.5D algorithms reduce to 3D algorithms. 3D algorithms perform minimal inter-processor communication by partitioning the full computational graph and replicating data elements accordingly. 2.5D algorithms minimize communication for any amount of available memory. We present novel 2.5D algorithms and implementations for matrix multiplication (MM) and LU factorization. Our 2.5D LU algorithm is the first to minimize communication along the critical path of execution in the 3D case. We show that 2.5D algorithms can use tournament pivoting to reduce the latency cost and achieve communication lower bounds. The theoretical benefits are realized by our implementation. Our 2.5D LU and MM implementations achieve high-efficiency strong scaling (solving a fixed-size problem faster as more processors are used). The new algorithms yield significant speedups when scaling to large partitions of IBM BlueGene/P and Cray supercomputers. The configurable virtual topology of 2.5D algorithms can map directly to the 3D torus physical network topology of BlueGene/P partitions. As a result, 2.5D LU and MM can perform all communication without network contention. Communication time is further reduced by the use of specialized communication collectives. Our topology-aware mapping allows 2.5D algorithms to exploit highly efficient rectangular collectives on the BlueGene/P machine. A performance model of these collectives and our algorithms demonstrates that 2.5D algorithms can run much more efficiently on a hypothetical exascale architecture.
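The core idea of a 2.5D algorithm can be illustrated in miniature: data is replicated across c processor "layers", each layer computes partial results over its 1/c share of the inner dimension, and an inter-layer reduction combines the partials. The sketch below (an illustration only, not the paper's MPI implementation; the function name and serial simulation are my own) shows that this layer-wise decomposition of matrix multiplication is exact.

```python
import numpy as np

def matmul_2_5d_sim(A, B, c):
    """Serially simulate the c-layer decomposition used by 2.5D MM:
    layer l multiplies the l-th slab of A's columns by the l-th slab
    of B's rows, producing a partial product of C; summing the
    partials (modeling the reduction across layers) recovers A @ B."""
    n = A.shape[1]
    assert n % c == 0, "inner dimension must split evenly across layers"
    k = n // c
    # Each layer's local contribution: a rank-k update of C.
    partials = [A[:, l * k:(l + 1) * k] @ B[l * k:(l + 1) * k, :]
                for l in range(c)]
    # The final sum models the inter-layer reduction collective.
    return sum(partials)

rng = np.random.default_rng(0)
A = rng.standard_normal((8, 8))
B = rng.standard_normal((8, 8))
C = matmul_2_5d_sim(A, B, c=4)
print(np.allclose(C, A @ B))  # → True
```

With c = 1 the decomposition degenerates to a 2D (no-replication) algorithm; larger c trades extra memory for less communication per processor, which is the trade-off the 2.5D family parameterizes.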
Authors: Edgar Solomonik and James Demmel