Chohra Chemseddine: Reproducible, Accurately Rounded and Efficient BLAS (RARE-BLAS)

Thursday, 22 September 2016

Modern high-performance computing (HPC) performs a huge number of floating-point operations on massively multithreaded systems. These systems interleave operations and rely on both dynamic scheduling and non-deterministic reductions, which prevents numerical reproducibility, i.e. getting identical results from multiple runs, even on a single machine. The cause is that floating-point addition is non-associative, so the result depends on the order of computation. Numerical reproducibility matters for debugging, for checking the correctness of programs and for validating results. One way to guarantee it is to compute the correctly rounded value of the exact result, which by definition does not depend on the order of the operations.
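
The order dependence described above can be illustrated with a minimal sketch. It is not the RARE-BLAS algorithm: it uses Python's standard `math.fsum` as a stand-in for a correctly rounded summation, and assumes IEEE 754 double precision with round-to-nearest-even. The helper names `naive_sum` and `distinct_sums` and the input values are hypothetical, chosen only for illustration.

```python
import itertools
import math


def naive_sum(values):
    """Left-to-right floating-point summation; the result depends on order."""
    total = 0.0
    for v in values:
        total += v
    return total


def distinct_sums(values):
    """Return the sets of results obtained over every ordering of `values`.

    The first set comes from naive left-to-right summation, the second
    from math.fsum (used here as a stand-in for a correctly rounded sum).
    """
    naive = set()
    rounded = set()
    for perm in itertools.permutations(values):
        naive.add(naive_sum(perm))
        rounded.add(math.fsum(perm))
    return naive, rounded


# Illustrative data: the exact sum is 2, but 1e16 + 1.0 rounds back to 1e16
# in double precision, so small terms can be lost depending on the order.
values = [1e16, 1.0, 1.0, -1e16]
naive_results, rounded_results = distinct_sums(values)

print("naive summation  :", sorted(naive_results))
print("correctly rounded:", sorted(rounded_results))
```

The naive sum yields several different values depending on the ordering, which mimics what a non-deterministic parallel reduction can produce from run to run, whereas the correctly rounded sum is the same for every ordering.
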
This talk presents algorithms for reproducible and accurate BLAS routines. We also report performance results on shared-memory and distributed-memory parallel systems and on the Intel Xeon Phi accelerator, and show that the extra cost of our BLAS implementation is acceptable compared to a vendor BLAS implementation (the Intel MKL library).