Going wide and local: Block Krylov space solvers as a means to achieve better compute resource utilization
- Dr. Mathias WAGNER (NVIDIA)
While Lattice QCD offers obvious parallelism over lattice sites, it must still adapt to the HPC trend toward wider processors. Maximizing locality is also increasingly important, because with wider processors the available FLOPS have grown faster than the available memory bandwidth. Block Krylov space solvers, which combine the solves for multiple right-hand sides into one block problem, address both issues. The additional right-hand sides increase the available parallelism, and exploiting data locality in the Dslash and BLAS operations achieves a significantly higher fraction of the theoretical peak performance. Combined with the reduced number of iterations to solution that block solvers offer, this yields a superlinear speedup. We present results from our implementation for NVIDIA GPUs in the QUDA library and discuss the implementation and its performance.
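The locality argument can be illustrated with a toy roofline-style estimate. The sketch below is not taken from QUDA and is not its performance model: the function name `arithmetic_intensity`, the 9 nonzeros per row, and the 8-byte entries are illustrative assumptions. It assumes a generic sparse operator whose entries are streamed from memory once and reused for every right-hand side, while each vector is read once and written once.

```python
def arithmetic_intensity(nnz, n, num_rhs,
                         bytes_per_matrix_entry=8,
                         bytes_per_vector_entry=8):
    """Flops per byte for one sparse operator application to num_rhs vectors.

    Toy model (assumption): the operator data is loaded once and reused for
    all right-hand sides; each input vector is read once and each output
    vector is written once. Cache effects are ignored.
    """
    # One multiply and one add per nonzero and per right-hand side.
    flops = 2 * nnz * num_rhs
    # Operator entries are streamed once, independent of num_rhs.
    matrix_bytes = nnz * bytes_per_matrix_entry
    # Read the input block and write the output block.
    vector_bytes = 2 * n * num_rhs * bytes_per_vector_entry
    return flops / (matrix_bytes + vector_bytes)


if __name__ == "__main__":
    n = 1_000_000   # hypothetical number of rows
    nnz = 9 * n     # hypothetical stencil with 9 nonzeros per row
    for num_rhs in (1, 2, 4, 8, 16):
        print(num_rhs, arithmetic_intensity(nnz, n, num_rhs))
```

In this model the intensity rises with the number of right-hand sides and saturates once vector traffic dominates, which is the sense in which a bandwidth-bound kernel can move closer to peak when several right-hand sides are processed together.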
Preferred track (if multiple tracks have been selected)
Algorithms and Machines