Recovering single precision accuracy from Tensor Cores while surpassing the FP32 theoretical peak performance -- Hiroyuki Ootomo, Rio Yokota

Summary

This paper recovers full FP32 (single-precision) accuracy on Tensor Cores by improving Markidis et al.'s correction algorithm.

Background knowledge

Tensor Cores compute $$D = A \times B + C$$ Hence, before going to the Tensor Cores, A and B must be converted to FP16 or TF32. The accumulator adds C, which is an FP32 matrix, and the final result D is also an FP32 matrix.
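
For concreteness, here is a minimal sketch of that operation through the CUDA WMMA API (my own illustration; the kernel name and the single 16×16×16 tile are assumptions, and a real GEMM tiles and loops over the whole matrices):

```cuda
#include <mma.h>
#include <cuda_fp16.h>
using namespace nvcuda;

// One warp (launch with 32 threads) computes a single 16x16x16 tile:
// D = A * B + C, with A and B stored as FP16 and C/D held in FP32.
__global__ void wmma_tile(const half *a, const half *b, const float *c, float *d) {
    wmma::fragment<wmma::matrix_a, 16, 16, 16, half, wmma::row_major> a_frag;
    wmma::fragment<wmma::matrix_b, 16, 16, 16, half, wmma::col_major> b_frag;
    wmma::fragment<wmma::accumulator, 16, 16, 16, float> acc_frag;

    wmma::load_matrix_sync(a_frag, a, 16);                         // FP16 operand A
    wmma::load_matrix_sync(b_frag, b, 16);                         // FP16 operand B
    wmma::load_matrix_sync(acc_frag, c, 16, wmma::mem_row_major);  // FP32 matrix C

    wmma::mma_sync(acc_frag, a_frag, b_frag, acc_frag);            // D = A*B + C on Tensor Cores

    wmma::store_matrix_sync(d, acc_frag, 16, wmma::mem_row_major); // FP32 result D
}
```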

Note

The loss introduced by this conversion is what Markidis et al. mainly address.

Markidis et al.'s correction algorithm

  1. Algorithm Pasted image 20230208155355.png
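
As I read the figure, the idea is to split each FP32 input element-wise into a leading FP16 part plus an FP16 residual, and to recover the lost bits with extra Tensor Core products. A hedged sketch of the split (kernel and variable names are mine):

```cuda
#include <cuda_fp16.h>

// Element-wise split of an FP32 matrix: a ≈ toFP32(a_hi) + toFP32(a_lo).
// The product is then approximated as
//   A*B ≈ A_hi*B_hi + ΔA*B_hi + A_hi*ΔB
// with each term computed on Tensor Cores and accumulated in FP32
// (the lowest-order ΔA*ΔB term is dropped).
__global__ void split_fp16(const float *a, half *a_hi, half *a_lo, int n) {
    int i = blockIdx.x * blockDim.x + threadIdx.x;
    if (i < n) {
        half hi = __float2half(a[i]);                     // leading bits of a
        a_hi[i] = hi;
        a_lo[i] = __float2half(a[i] - __half2float(hi));  // residual ΔA; may underflow in FP16
    }
}
```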

Accurate?

Evaluation metric
  1. Accuracy evaluation metric: Pasted image 20230208155624.png, where $C_{\mathrm{FP64}}$ is the reference matrix computed in FP64, i.e. $$C_{\mathrm{FP64}} = \mathrm{toFP64}(A_{\mathrm{FP32}}) \times \mathrm{toFP64}(B_{\mathrm{FP32}})$$
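
The figure itself is not reproduced here; as a stand-in, this is a hedged host-side sketch of one common way to measure accuracy against an FP64 reference (the exact formula in the figure may differ):

```cuda
#include <cmath>
#include <cstddef>

// Maximum element-wise relative error of an FP32 result C against the FP64
// reference C_FP64 (a common accuracy metric; assumed here, not taken from the figure).
double max_relative_error(const float *c, const double *c_fp64, std::size_t n) {
    double max_err = 0.0;
    for (std::size_t i = 0; i < n; ++i) {
        double ref = c_fp64[i];
        if (ref == 0.0) continue;  // relative error undefined for a zero reference entry
        double err = std::fabs(static_cast<double>(c[i]) - ref) / std::fabs(ref);
        if (err > max_err) max_err = err;
    }
    return max_err;
}
```
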
Accuracy comparison

Pasted image 20231027141349.png
Paper Reading Annotate/Recovering single precision accuracy from Tensor Cores while surpassing the FP32 theoretical peak performance#^h0rsb5x90a

Causes of inaccuracy

  1. Mantissa loss?
  2. Rounding mode of the accumulator (the FP32 accumulator inside the Tensor Core rounds toward zero instead of to nearest)
  3. Underflow when computing Δ, the FP16 residual (see the bound below)
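
A short worked bound on cause 3, using standard floating-point facts rather than anything quoted from the paper: with round-to-nearest FP16 (10 explicit mantissa bits), each residual satisfies $$|a - \mathrm{toFP16}(a)| \le 2^{-11}\,|a|,$$ i.e. it sits roughly 11 binades below the original value. When $|a|$ is small, the residual drops into the FP16 subnormal range (below $2^{-14}$), where precision degrades, and far enough below that it is rounded to zero outright, so the correction term loses exactly the bits it was supposed to carry. This is what the scaling in the new algorithm below addresses.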

The new algorithm

  1. Algorithm
    Pasted image 20230209121606.png
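
As I understand the paper, the two key fixes are (i) scaling the FP16 residual by $2^{11}$ before the cast so it does not underflow, and (ii) adding the correction terms to the main product outside the Tensor Cores in ordinary FP32 arithmetic, which avoids the round-toward-zero accumulator. A hedged sketch (kernel names and the two-kernel structure are mine):

```cuda
#include <cuda_fp16.h>

// Split with the residual scaled by 2^11 so it stays in the FP16 normal range.
__global__ void split_fp16_scaled(const float *a, half *a_hi, half *a_lo, int n) {
    int i = blockIdx.x * blockDim.x + threadIdx.x;
    if (i < n) {
        half hi = __float2half(a[i]);
        a_hi[i] = hi;
        // Scale the residual up by 2^11 before the FP16 cast; the correction
        // GEMM result is divided by 2^11 again when added back in FP32 below.
        a_lo[i] = __float2half((a[i] - __half2float(hi)) * 2048.0f);
    }
}

// FP32-side accumulation of the correction, outside the Tensor Cores:
//   C = (A_hi*B_hi) + (ΔA*B_hi + A_hi*ΔB) / 2^11,
// where ΔA, ΔB are the scaled residuals produced above.
__global__ void add_correction(float *c, const float *c_main, const float *c_corr, int n) {
    int i = blockIdx.x * blockDim.x + threadIdx.x;
    if (i < n) c[i] = c_main[i] + c_corr[i] * (1.0f / 2048.0f);
}
```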

Lessons

  1. Statistical analysis of floating-point (see the sketch after this list)
    • Expectation of mantissa bits
    • Probability of underflow
  2. Power consumption is an important aspect to consider when using half-precision algorithms.

    In recent years, real quantum computers have been built in attempts to demonstrate quantum supremacy, i.e. to perform certain tasks that (classical) supercomputers cannot complete in a realistic time. Moreover, since quantum computers have low power consumption [2], energy efficiency is becoming an important metric when evaluating quantum supremacy. For instance, qFlex is a quantum computer simulator based on tensor network contraction using single-precision complex matrix-matrix multiplication, and the power consumption of each component was reported during its simulation on Summit V100 GPUs [28]. Although the authors considered using FP16 and Tensor Cores in their simulation, they decided against it, since FP16 has a smaller exponent range than FP32 and is insufficient for their use case.
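
A quick Monte-Carlo sketch for the "probability of underflow" point in lesson 1 (my own illustration, compiled with nvcc; it is not the paper's statistical derivation, which works with an assumed input distribution):

```cuda
#include <cuda_fp16.h>
#include <cmath>
#include <cstdio>
#include <cstdlib>

// For inputs drawn uniformly from (0, 1), estimate how often the FP16 residual
// a - toFP16(a) falls below the smallest normal FP16 magnitude, 2^-14.
int main() {
    const int n = 1 << 20;
    int subnormal = 0;
    for (int i = 0; i < n; ++i) {
        float a = (float)std::rand() / (float)RAND_MAX;  // crude U(0,1) sample
        float r = a - __half2float(__float2half(a));     // FP16 residual of a
        if (r != 0.0f && std::fabs(r) < std::ldexp(1.0f, -14)) ++subnormal;
    }
    std::printf("fraction of residuals below the FP16 normal range: %.3f\n",
                (double)subnormal / n);
    return 0;
}
```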