# Programming tensor cores using nvcuda::wmma

- Links: the code we are analyzing is from https://leimao.github.io/blog/NVIDIA-Tensor-Core-Programming/ and https://developer.nvidia.com/blog/programming-tensor-cores-cuda-9/

### Background

- Matrix Multiplication Background
- Tiled Matrix Multiplication -- CUDA implementation
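
For contrast with the warp-level programming discussed below, here is a minimal per-thread tiled matmul sketch (my own illustrative version, not the exact code from the links): each thread computes exactly one element of C.

```cuda
#include <cuda_runtime.h>

constexpr int TILE = 16;

// Per-thread tiled GEMM: C (MxN) = A (MxK) * B (KxN), all row-major.
// Launch: dim3 block(TILE, TILE); dim3 grid((N + TILE - 1) / TILE, (M + TILE - 1) / TILE);
__global__ void tiled_gemm(float const* A, float const* B, float* C,
                           int M, int N, int K)
{
    __shared__ float As[TILE][TILE];
    __shared__ float Bs[TILE][TILE];

    int row = blockIdx.y * TILE + threadIdx.y; // row of C this thread owns
    int col = blockIdx.x * TILE + threadIdx.x; // col of C this thread owns

    float acc = 0.0f;
    for (int t = 0; t < K; t += TILE)
    {
        // Each thread loads one element of the current A tile and B tile.
        As[threadIdx.y][threadIdx.x] =
            (row < M && t + threadIdx.x < K) ? A[row * K + t + threadIdx.x] : 0.0f;
        Bs[threadIdx.y][threadIdx.x] =
            (t + threadIdx.y < K && col < N) ? B[(t + threadIdx.y) * N + col] : 0.0f;
        __syncthreads();

        for (int k = 0; k < TILE; ++k)
            acc += As[threadIdx.y][k] * Bs[k][threadIdx.x];
        __syncthreads();
    }

    if (row < M && col < N)
        C[row * N + col] = acc; // one thread -> one output element
}
```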

### ChatGPT answer

https://chat.openai.com/share/cfd09f70-3b82-423e-a91f-3c096e658e3d

### Takeaways

##### Understand how the work is distributed!!

Unlike the Tiled Matrix Multiplication -- CUDA implementation, `wmma` operates at warp granularity. We write the `wmma` calls once in the kernel, and all 32 threads of a warp execute those same calls cooperatively, working on one tile together. In other words, the `__global__` wmma kernel is effectively written per warp. (As a comparison, the kernel in the Tiled Matrix Multiplication -- CUDA implementation is written per thread: each thread computes its own output element.)

This is super important for understanding **how blockDim and gridDim are defined, and how to index the matrix elements in each thread inside the `__global__` kernel function**.
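
A minimal sketch of what a warp-level `wmma` call sequence looks like, assuming a single 16x16x16 tile with `half` inputs and `float` accumulation (the kernel name and layouts here are my own illustration, not the blog's code):

```cuda
#include <cuda_fp16.h>
#include <mma.h>

using namespace nvcuda;

// One warp multiplies a single 16x16 tile: C = A * B.
// All 32 threads of the warp must execute every wmma call below.
__global__ void wmma_single_tile(half const* A, half const* B, float* C)
{
    // Fragments are distributed across the registers of the whole warp.
    wmma::fragment<wmma::matrix_a, 16, 16, 16, half, wmma::row_major> a_frag;
    wmma::fragment<wmma::matrix_b, 16, 16, 16, half, wmma::col_major> b_frag;
    wmma::fragment<wmma::accumulator, 16, 16, 16, float> c_frag;

    wmma::fill_fragment(c_frag, 0.0f);

    // Warp-collective loads (leading dimension 16).
    wmma::load_matrix_sync(a_frag, A, 16);
    wmma::load_matrix_sync(b_frag, B, 16);

    // The multiply-accumulate itself runs on the tensor cores.
    wmma::mma_sync(c_frag, a_frag, b_frag, c_frag);

    wmma::store_matrix_sync(C, c_frag, 16, wmma::mem_row_major);
}
```

Launching it with a single warp, e.g. `wmma_single_tile<<<1, 32>>>(dA, dB, dC);` (compiled with `-arch=sm_70` or newer), handles one tile; a full GEMM loops over K and uses many warps, as in the code analyzed below.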

### Analyze the code in https://leimao.github.io/blog/NVIDIA-Tensor-Core-Programming/

- In the example, the block layout is 128*4 threads (blockDim = (128, 4)).
- One block therefore contains 4*4 = 16 warps. Notice one warp is responsible for a 16*16 tile of the output matrix.
- In this layout, 32 consecutive threads along x (a 1*32 strip of the block) form one warp.
- To index the warp a thread belongs to, we use ((blockIdx.x*blockDim.x + threadIdx.x)/32, blockIdx.y*blockDim.y + threadIdx.y) - Notice the threadIdx.x is consecutive within a warp! (0-31 for warp 0, 32-63 for warp 1, ...)
- Every thread of a warp therefore computes the same warp index, so the wmma calls in the kernel are effectively issued once per warp rather than once per thread (see the sketch after the figure below).

[[Wmma warp programming.svg]]
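
A small sketch of the index math from the list above (variable names are my own; the formula follows the pattern used in the blog's kernel). It only computes and prints which tile of C each warp owns:

```cuda
#include <cstdio>

// blockDim = (128, 4): 128/32 = 4 warps along x and 4 along y, i.e. 16 warps
// per block, and each warp owns one 16x16 tile of C.
__global__ void warp_index_demo(int M, int N)
{
    // Every thread of a warp computes the same (warpM, warpN), because
    // threadIdx.x / warpSize is identical for all 32 threads of the warp.
    int warpM = (blockIdx.x * blockDim.x + threadIdx.x) / warpSize; // tile row index
    int warpN = blockIdx.y * blockDim.y + threadIdx.y;              // tile col index

    // Print once per warp (lane 0 only), just to visualize the mapping.
    if (warpM * 16 < M && warpN * 16 < N && threadIdx.x % warpSize == 0)
        printf("warp (%d, %d) owns C[%d..%d][%d..%d]\n",
               warpM, warpN,
               warpM * 16, warpM * 16 + 15,
               warpN * 16, warpN * 16 + 15);
}
```

With `dim3 block(128, 4);`, each block covers a 64x64 region of C, so a launch such as `warp_index_demo<<<dim3((M + 63) / 64, (N + 63) / 64), block>>>(M, N);` enumerates every tile.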

### How work is distributed within a warp

From [[Characterizing_the_Error_Resilience_of_Applications_using_Mixed_Precision_Floating_Point_Formats.pdf]], we can see which partition of the matrix each thread of the warp stores in its registers. The arithmetic algorithm is not shown there, since it is performed by the tensor core hardware.
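
One way to see this distribution empirically (a probe written for illustration here, not code from the paper): encode each element's coordinates into its value, load it into a fragment, and let every lane print what it holds.

```cuda
#include <cstdio>
#include <cuda_fp16.h>
#include <mma.h>

using namespace nvcuda;

// A holds a 16x16 half matrix where A[i][j] = i * 16 + j, filled on the host.
// After load_matrix_sync, each lane prints which elements landed in its registers.
__global__ void probe_fragment_ownership(half const* A)
{
    wmma::fragment<wmma::matrix_a, 16, 16, 16, half, wmma::row_major> frag;
    wmma::load_matrix_sync(frag, A, 16);

    int lane = threadIdx.x % warpSize;
    for (int i = 0; i < frag.num_elements; ++i)
    {
        int v = static_cast<int>(__half2float(frag.x[i]));
        printf("lane %2d, register slot %d -> A[%d][%d]\n", lane, i, v / 16, v % 16);
    }
}
```

Launched with one warp (`probe_fragment_ownership<<<1, 32>>>(dA);`), the output reveals the per-lane partition of the fragment; note that the exact mapping is architecture-dependent and not specified by the wmma API.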

TODO: the mirror of this in AMD.