Programming Tensor Cores using nvcuda::wmma
- Link: the code we are analyzing is from https://leimao.github.io/blog/NVIDIA-Tensor-Core-Programming/ and https://developer.nvidia.com/blog/programming-tensor-cores-cuda-9/
Background
Matrix Multiplication Background
Tiled Matrix Multiplication -- CUDA implementation
ChatGPT answer
https://chat.openai.com/share/cfd09f70-3b82-423e-a91f-3c096e658e3d
Takeaways
Understand how the work is distributed!!
Unlike Tiled Matrix Multiplication -- CUDA implementation, wmma operates at the warp level. We write the wmma calls as if for a single thread, but all 32 threads of the warp cooperate to execute them: the fragments are spread across the warp's registers and the multiply-accumulate is a warp-collective operation. In other words, the __global__ wmma_kernel is written once per warp, not once per output element. (As a comparison, the kernel code in Tiled Matrix Multiplication -- CUDA implementation computes one output element per thread.) This is super important for understanding how blockDim and gridDim are defined, and how each thread indexes the matrix elements inside the __global__ kernel function.
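Below is a minimal sketch of the warp-level wmma API (not the blog's exact kernel; the kernel name and the fixed 16x16x16 tile size are my assumptions). The point is that the code is written once, yet every lane of the warp executes it and the wmma calls are warp-collective:

```cpp
#include <cuda_fp16.h>
#include <mma.h>

using namespace nvcuda;

// Every thread of the warp executes these lines, but each wmma call is a
// warp-collective operation: the fragments are distributed across the 32
// lanes' registers, and one warp jointly computes D = A * B for a single
// 16x16x16 tile (half inputs, float accumulator).
__global__ void wmma_16x16x16_kernel(half const* a, half const* b, float* d)
{
    wmma::fragment<wmma::matrix_a, 16, 16, 16, half, wmma::row_major> a_frag;
    wmma::fragment<wmma::matrix_b, 16, 16, 16, half, wmma::col_major> b_frag;
    wmma::fragment<wmma::accumulator, 16, 16, 16, float> acc_frag;

    wmma::fill_fragment(acc_frag, 0.0f);

    // Warp-collective loads; 16 is the leading dimension of the tiles.
    wmma::load_matrix_sync(a_frag, a, 16);
    wmma::load_matrix_sync(b_frag, b, 16);

    // One warp-wide matrix multiply-accumulate on the Tensor Cores.
    wmma::mma_sync(acc_frag, a_frag, b_frag, acc_frag);

    // Warp-collective store of the 16x16 result tile.
    wmma::store_matrix_sync(d, acc_frag, 16, wmma::mem_row_major);
}
```

Launching it with a single warp, e.g. wmma_16x16x16_kernel<<<1, 32>>>(d_a, d_b, d_d), is enough to multiply one pair of 16x16 tiles.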
Analyze the code in https://leimao.github.io/blog/NVIDIA-Tensor-Core-Programming/
- In the example, the layout of the block is 128*4 threads.
- One block therefore contains 128/32 = 4 warps along x and 4 along y, i.e. 4*4 = 16 warps. Notice that one warp is responsible for a 16*16 tile of the output matrix.
- In this abstraction, each run of 32 consecutive threads along x of the block forms one warp.
- The warp's tile index is ((blockIdx.x * blockDim.x + threadIdx.x) / 32, blockIdx.y * blockDim.y + threadIdx.y); the division by 32 (warpSize) makes it identical for all threads of the warp.
- Notice that threadIdx.x is consecutive within a warp (0-31 for warp 0, 32-63 for warp 1, ...).
- Although every thread runs the kernel code, the per-tile wmma work is effectively expressed once per warp (you can think of it as anchored at the leading threads 0, 32, ...); see the launch/indexing sketch after the figure below.
[[Wmma warp programming.svg]]
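A sketch of how the 128*4 block layout and the per-warp tile index fit together, assuming 1024*1024 matrices and illustrative names (warp_index_demo is mine, not the blog's exact code):

```cpp
#include <cstdint>
#include <cuda_runtime.h>

// Illustrative kernel skeleton: every lane of a warp computes the same
// (warp_m, warp_n), so the wmma work that would follow is done once per
// warp, for the 16x16 output tile at (warp_m, warp_n).
__global__ void warp_index_demo(uint32_t m_tiles, uint32_t n_tiles)
{
    uint32_t const warp_m{(blockIdx.x * blockDim.x + threadIdx.x) / warpSize};
    uint32_t const warp_n{blockIdx.y * blockDim.y + threadIdx.y};
    if (warp_m >= m_tiles || warp_n >= n_tiles)
    {
        return;
    }
    // ... load fragments, mma_sync, store the tile (warp_m, warp_n) ...
}

int main()
{
    uint32_t const m{1024U};          // assumed matrix sizes
    uint32_t const n{1024U};
    uint32_t const m_tiles{m / 16U};  // number of 16x16 tiles per dimension
    uint32_t const n_tiles{n / 16U};

    // 128x4 threads per block = (128 / 32) x 4 = 4x4 = 16 warps per block,
    // i.e. each block covers a 4x4 patch of 16x16 tiles.
    dim3 const block_dim{128U, 4U, 1U};
    dim3 const grid_dim{(m_tiles + 3U) / 4U, (n_tiles + 3U) / 4U, 1U};

    warp_index_demo<<<grid_dim, block_dim>>>(m_tiles, n_tiles);
    cudaDeviceSynchronize();
    return 0;
}
```

With block_dim = (128, 4), blockDim.x / 32 = 4 warps along x and blockDim.y = 4 warps along y, which is exactly the 4*4 warp layout described above and matches the grid_dim computation.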
How work is distributed among the threads of a warp
From [[Characterizing_the_Error_Resilience_of_Applications_using_Mixed_Precision_Floating_Point_Formats.pdf]] we can see which partition of the matrix fragments each thread stores in its registers. The arithmetic itself is not shown here, since it is performed by the Tensor Core hardware.
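To make this visible, here is a small sketch of my own (not from the paper): fill a 16*16 half matrix with its element indices, load it into a matrix_a fragment, and have each lane print the values it holds in its registers; together the 32 lanes cover the whole tile.

```cpp
#include <cstdio>
#include <cuda_fp16.h>
#include <mma.h>

using namespace nvcuda;

// Each lane prints the fragment elements it privately stores in registers.
// Because the matrix is filled with its own element indices, the printout
// reveals which partition of the 16x16 tile each of the 32 lanes owns.
__global__ void fragment_ownership_demo(half const* a)
{
    wmma::fragment<wmma::matrix_a, 16, 16, 16, half, wmma::row_major> a_frag;
    wmma::load_matrix_sync(a_frag, a, 16);
    for (int i{0}; i < a_frag.num_elements; ++i)
    {
        printf("lane %2d holds element %3.0f\n",
               static_cast<int>(threadIdx.x % 32U), __half2float(a_frag.x[i]));
    }
}

int main()
{
    half h_a[16 * 16];
    for (int i{0}; i < 16 * 16; ++i)
    {
        h_a[i] = __float2half(static_cast<float>(i));
    }
    half* d_a{nullptr};
    cudaMalloc(&d_a, sizeof(h_a));
    cudaMemcpy(d_a, h_a, sizeof(h_a), cudaMemcpyHostToDevice);

    fragment_ownership_demo<<<1, 32>>>(d_a);   // launch exactly one warp
    cudaDeviceSynchronize();

    cudaFree(d_a);
    return 0;
}
```

The lane-to-element mapping observed this way is architecture-specific and undocumented, which is why the wmma API only exposes warp-collective loads, stores, and mma operations.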
mirror in AMD