- link: https://docs.nvidia.com/deeplearning/performance/dl-performance-matrix-multiplication/index.html
Can read NVIDIA GPU Performance Background firstly
GEMM is defined as the operation
Math and Memory Bounds
Math limit? or Memory Limit?
Defs are from NVIDIA GPU Performance Background
Arithmetic Intensity =
GPU implementation of GEMM
GPUs implement GEMMs by partitioning the output matrix into tiles, which are then assigned to thread blocks.
Each thread block computes its output tile by stepping through the K dimension in tiles, loading the required values from the A and B matrices, and multiplying and accumulating them into the output.
Better Performance using tensor cores
Tensor Cores depend on NVIDIA library versions. Performance is better when equivalent matrix dimensions M, N, and K are aligned to== multiples of 16 bytes== (or 128 bytes on A100). With NVIDIA cuBLAS versions before 11.0 or NVIDIA cuDNN versions before 7.6.3, this is a requirement to use Tensor Cores;
Decide the tiling size
While multiple tiling strategies are available,larger tiles have more data reuse, allowing them to use less bandwidth and be more efficient than smaller tiles. On the other hand, for a problem of a given size, using larger tiles will generate fewer tiles to run in parallel, which can potentially lead to under-utilization of the GPU.
Tiled Matrix Multiplication -- CUDA implementation