# Matrix Multiplication Background

- link: https://docs.nvidia.com/deeplearning/performance/dl-performance-matrix-multiplication/index.html

Read NVIDIA GPU Performance Background first.

### Overview

#### Problem formulation

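GEMM (General Matrix Multiplication) is defined as the operation

$$C = \alpha AB + \beta C$$

where A (M x K) and B (K x N) are the matrix inputs, α and β are scalar inputs, and C (M x N) is a pre-existing matrix that is overwritten by the output.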
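To make the definition concrete, here is a minimal pure-Python sketch of the standard GEMM operation, alpha * A * B + beta * C. The function name `gemm` and the list-of-lists layout are choices made for this illustration only; real libraries such as cuBLAS use very different data layouts and APIs.

```python
def gemm(alpha, A, B, beta, C):
    """Return alpha * (A x B) + beta * C for matrices stored as lists of rows.

    A is M x K, B is K x N, C is M x N. Illustrative reference only.
    """
    M, K, N = len(A), len(B), len(B[0])
    assert all(len(row) == K for row in A), "A must be M x K"
    assert len(C) == M and all(len(row) == N for row in C), "C must be M x N"

    out = [[0.0] * N for _ in range(M)]
    for i in range(M):
        for j in range(N):
            acc = 0.0
            for k in range(K):  # K multiply-adds per output element
                acc += A[i][k] * B[k][j]
            out[i][j] = alpha * acc + beta * C[i][j]
    return out


A = [[1.0, 2.0], [3.0, 4.0]]  # 2 x 2
B = [[5.0, 6.0], [7.0, 8.0]]  # 2 x 2
C = [[1.0, 1.0], [1.0, 1.0]]  # 2 x 2
result = gemm(2.0, A, B, 0.5, C)
print(result)
```

Each of the M x N outputs needs K multiply-adds, which is where the 2 * M * N * K FLOP count used for arithmetic intensity comes from.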

#### Terminology

### Math and Memory Bounds

#### Computations

#### Math limited or memory limited?

Definitions are taken from NVIDIA GPU Performance Background.

##### Arithmetic Intensity

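For a GEMM with an M x K matrix A, a K x N matrix B, and an M x N matrix C, assuming FP16 data (2 bytes per element):

$$\text{Arithmetic Intensity} = \frac{\text{number of FLOPs}}{\text{number of bytes accessed}} = \frac{2 \cdot M \cdot N \cdot K}{2 \cdot (M \cdot K + N \cdot K + M \cdot N)} = \frac{M \cdot N \cdot K}{M \cdot K + N \cdot K + M \cdot N}$$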
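As a rough sketch of how this is used, the snippet below computes arithmetic intensity as FLOPs divided by bytes accessed, for an M x K times K x N GEMM. It then compares the result with a GPU's ops:byte ratio to guess whether the operation is math or memory limited. The function names are hypothetical. The peak numbers (125 TFLOPS FP16 Tensor Core throughput and 900 GB/s memory bandwidth, roughly a V100) are assumptions for illustration. The model also assumes each matrix is read or written exactly once.

```python
def arithmetic_intensity(M, N, K, bytes_per_element=2):
    """FLOPs per byte for an M x K times K x N GEMM (FP16 by default)."""
    flops = 2 * M * N * K  # one multiply and one add per (i, j, k)
    bytes_accessed = bytes_per_element * (M * K + K * N + M * N)
    return flops / bytes_accessed


def limiter(M, N, K, peak_flops, peak_bandwidth, bytes_per_element=2):
    """Return 'math' or 'memory' under the simple roofline-style comparison."""
    ops_per_byte = peak_flops / peak_bandwidth
    ai = arithmetic_intensity(M, N, K, bytes_per_element)
    return "math" if ai > ops_per_byte else "memory"


# Assumed peak numbers, roughly matching a V100 (illustrative only).
PEAK_FLOPS = 125e12      # FP16 Tensor Core FLOPS
PEAK_BANDWIDTH = 900e9   # bytes per second

for shape in [(8192, 8192, 8192), (8192, 128, 8192)]:
    ai = arithmetic_intensity(*shape)
    print(shape, round(ai, 1), limiter(*shape, PEAK_FLOPS, PEAK_BANDWIDTH))
```

Under these assumptions the large square GEMM comes out math limited, while the GEMM with N = 128 falls just below the ops:byte ratio and comes out memory limited.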

### GPU implementation of GEMM

GPUs implement GEMMs by partitioning the output matrix into tiles, which are then assigned to thread blocks.

Each thread block computes its output tile by stepping through the K dimension in tiles, loading the required values from the A and B matrices, and multiplying and accumulating them into the output.
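The tiling scheme described above can be sketched on the CPU in plain Python. This is only a model of the loop structure, not how a GPU kernel is written: the two outer loops stand in for the grid of thread blocks, and the `k0` loop is the walk through the K dimension in tiles. The function name and the tiny default tile sizes are assumptions chosen for readability.

```python
def tiled_matmul(A, B, tile_m=2, tile_n=2, tile_k=2):
    """Compute A x B by output tiles, stepping through K in tiles."""
    M, K, N = len(A), len(B), len(B[0])
    C = [[0.0] * N for _ in range(M)]

    # Each (i0, j0) pair is one output tile; on a GPU it would be
    # assigned to one thread block.
    for i0 in range(0, M, tile_m):
        for j0 in range(0, N, tile_n):
            # Step through K one tile at a time, accumulating into
            # the same output tile.
            for k0 in range(0, K, tile_k):
                for i in range(i0, min(i0 + tile_m, M)):
                    for j in range(j0, min(j0 + tile_n, N)):
                        acc = C[i][j]
                        for k in range(k0, min(k0 + tile_k, K)):
                            acc += A[i][k] * B[k][j]
                        C[i][j] = acc
    return C


A = [[float(i + k) for k in range(5)] for i in range(3)]      # 3 x 5
B = [[float(k * j + 1) for j in range(4)] for k in range(5)]  # 5 x 4
C = tiled_matmul(A, B)
print(C)
```

The `min(...)` bounds handle edge tiles when a dimension is not a multiple of the tile size. On a GPU, the values loaded for each K step are reused by every element of the output tile.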

#### Better Performance using tensor cores

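Whether Tensor Cores can be used depends on the NVIDIA library version. Performance is better when the equivalent matrix dimensions M, N, and K are aligned to ==multiples of 16 bytes== (or 128 bytes on A100). With NVIDIA cuBLAS versions before 11.0 or NVIDIA cuDNN versions before 7.6.3, this alignment is a requirement to use Tensor Cores; with later versions, Tensor Cores can be used even when the dimensions are not aligned, though performance is still better when they are.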
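Since the alignment is stated in bytes, the matching number of elements depends on the data type: 16 bytes is 8 FP16 elements or 4 FP32 elements. The helpers below are a small sketch for checking a dimension and rounding it up to the next aligned size. The function names are hypothetical, and padding is only one possible way to reach an aligned size.

```python
def is_aligned(dim, alignment_bytes=16, bytes_per_element=2):
    """True if `dim` elements span a whole number of alignment units."""
    return (dim * bytes_per_element) % alignment_bytes == 0


def pad_to_alignment(dim, alignment_bytes=16, bytes_per_element=2):
    """Round `dim` up to the next multiple that satisfies the alignment."""
    assert alignment_bytes % bytes_per_element == 0
    step = alignment_bytes // bytes_per_element  # elements per unit
    return -(-dim // step) * step                # ceiling division


# FP16 (2 bytes per element), 16-byte alignment: multiples of 8 elements.
print(is_aligned(1000), is_aligned(1001), pad_to_alignment(1001))

# FP16, 128-byte alignment (the A100 case): multiples of 64 elements.
print(is_aligned(1000, alignment_bytes=128),
      pad_to_alignment(1000, alignment_bytes=128))
```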

#### Decide the tiling size

While multiple tiling strategies are available, larger tiles have more data reuse, allowing them to use less bandwidth and be more efficient than smaller tiles. On the other hand, for a problem of a given size, using larger tiles will generate fewer tiles to run in parallel, which can potentially lead to under-utilization of the GPU.
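A back-of-the-envelope sketch of this trade-off: for each candidate tile size, count how many output tiles the problem produces and estimate the data reuse as FLOPs per element loaded. The reuse figure assumes that in one K step a tile loads `tile_m + tile_n` elements and performs `2 * tile_m * tile_n` FLOPs. The tile sizes, the helper names, and the figure of 108 SMs (as on an A100) are all assumptions for illustration.

```python
import math


def tile_count(M, N, tile_m, tile_n):
    """Number of output tiles, which is also the number of thread blocks."""
    return math.ceil(M / tile_m) * math.ceil(N / tile_n)


def flops_per_element_loaded(tile_m, tile_n):
    """Data reuse per K step: 2*tm*tn FLOPs for tm + tn elements loaded."""
    return 2 * tile_m * tile_n / (tile_m + tile_n)


NUM_SMS = 108  # assumption: an A100-class GPU
M = N = 1024

for tile_m, tile_n in [(256, 128), (128, 128), (64, 64)]:
    tiles = tile_count(M, N, tile_m, tile_n)
    reuse = flops_per_element_loaded(tile_m, tile_n)
    note = "fewer tiles than SMs" if tiles < NUM_SMS else "enough tiles"
    print(f"{tile_m}x{tile_n}: {tiles} tiles, {reuse:.1f} FLOPs/element, {note}")
```

For this 1024 x 1024 output, the 256x128 tiling has the best reuse but yields only 32 tiles, fewer than the assumed SM count. The 64x64 tiling yields 256 tiles at a much lower reuse.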