Programming tensor cores using nvcuda-wmma


Matrix Multiplication Background
Tiled Matrix Multiplication -- CUDA implementation

Chatgpt answer

Takeaways

Understand how the work is distributed!!

Unlike Tiled Matrix Multiplication -- CUDA implementation, wmma operates on a whole warp. That means:

The wmma calls are warp-wide collectives: every thread of a warp executes the same wmma call with the same arguments, and the 32 threads cooperate on one tile. In other words, the code in the __global__ wmma_kernel is written from the point of view of one warp rather than one thread, even though every thread of the warp still runs it. (As a comparison, the kernel code in Tiled Matrix Multiplication -- CUDA implementation is written per thread.)

This is super important for understanding how blockDim and gridDim are defined, and how each thread of the __global__ kernel function indexes the elements of the matrix.
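The warp-level indexing described above can be sketched as follows. This is a minimal illustration, not the code from the linked pages: the kernel name wmma_kernel comes from the note, but the helper grid_for, the 128x4 block shape, the row-major layouts, and the requirement that M, N and K are multiples of 16 are all assumptions made for the example. It needs a Volta or newer GPU and something like nvcc -arch=sm_70.

```cuda
#include <cstdio>
#include <cuda_fp16.h>
#include <mma.h>
using namespace nvcuda;

// Tile shape that one warp handles per wmma operation.
constexpr int WMMA_M = 16, WMMA_N = 16, WMMA_K = 16;
constexpr int WARP_SIZE = 32;

// Hypothetical host helper: grid size such that every 16x16 tile of C gets one warp.
// Along x a warp needs 32 threads; along y each thread row is a separate warp.
dim3 grid_for(int M, int N, dim3 block) {
    int rows_per_block = WMMA_M * (block.x / WARP_SIZE);
    int cols_per_block = WMMA_N * block.y;
    return dim3((M + rows_per_block - 1) / rows_per_block,
                (N + cols_per_block - 1) / cols_per_block);
}

// C (MxN, float) = A (MxK, half) * B (KxN, half), all row-major.
// Assumes M, N, K are multiples of 16 and blockDim.x is a multiple of 32.
__global__ void wmma_kernel(const half *a, const half *b, float *c,
                            int M, int N, int K) {
    // Every thread runs this, but all 32 threads of a warp get the same indices.
    int warpM = (blockIdx.x * blockDim.x + threadIdx.x) / WARP_SIZE;
    int warpN = blockIdx.y * blockDim.y + threadIdx.y;
    int row = warpM * WMMA_M;  // top-left corner of this warp's tile of C
    int col = warpN * WMMA_N;
    if (row >= M || col >= N) return;  // uniform across the warp

    wmma::fragment<wmma::matrix_a, WMMA_M, WMMA_N, WMMA_K, half, wmma::row_major> a_frag;
    wmma::fragment<wmma::matrix_b, WMMA_M, WMMA_N, WMMA_K, half, wmma::row_major> b_frag;
    wmma::fragment<wmma::accumulator, WMMA_M, WMMA_N, WMMA_K, float> acc_frag;
    wmma::fill_fragment(acc_frag, 0.0f);

    for (int k = 0; k < K; k += WMMA_K) {
        // Pointers are per tile, not per thread; the last argument is the leading dimension.
        wmma::load_matrix_sync(a_frag, a + row * K + k, K);
        wmma::load_matrix_sync(b_frag, b + k * N + col, N);
        wmma::mma_sync(acc_frag, a_frag, b_frag, acc_frag);
    }
    wmma::store_matrix_sync(c + row * N + col, acc_frag, N, wmma::mem_row_major);
}

int main() {
    const int M = 64, N = 64, K = 64;
    half *a, *b;
    float *c;
    cudaMallocManaged(&a, M * K * sizeof(half));
    cudaMallocManaged(&b, K * N * sizeof(half));
    cudaMallocManaged(&c, M * N * sizeof(float));
    for (int i = 0; i < M * K; ++i) a[i] = __float2half(1.0f);
    for (int i = 0; i < K * N; ++i) b[i] = __float2half(1.0f);

    dim3 block(128, 4);  // 4 warps along x, 4 along y: 16 tiles of C per block
    wmma_kernel<<<grid_for(M, N, block), block>>>(a, b, c, M, N, K);
    cudaDeviceSynchronize();

    // With all-ones inputs every entry of C should equal K.
    int mismatches = 0;
    for (int i = 0; i < M * N; ++i)
        if (c[i] != (float)K) ++mismatches;
    printf("mismatches: %d\n", mismatches);

    cudaFree(a); cudaFree(b); cudaFree(c);
    return mismatches != 0;
}
```

Note that threadIdx.x is divided by 32 while threadIdx.y is not. Threads are grouped into warps along x first, so with blockDim.x a multiple of 32 each warp shares one threadIdx.y and one value of threadIdx.x / 32.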

Wmma warp programming.png
[[Wmma warp programming.svg]]

How work is distributed among the warp

From [[Characterizing_the_Error_Resilience_of_Applications_using_Mixed_Precision_Floating_Point_Formats.pdf]], one can work out which partition of the matrix each thread stores. The arithmetic itself is not demonstrated here, since it is performed on the tensor core hardware.
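One way to observe the per-thread partition directly is to load a tile whose values encode their own position and let each lane print what ended up in its fragment. This is a hedged sketch, not the method of the cited paper: the kernel name dump_fragment_layout and the helpers tile_row and tile_col are made up for the example. The fragment members x and num_elements are part of the wmma API, but the mapping they reveal is unspecified and may differ between GPU architectures, so no expected output is given.

```cuda
#include <cstdio>
#include <cuda_fp16.h>
#include <mma.h>
using namespace nvcuda;

// Hypothetical helpers: decode a linear index of a 16x16 row-major tile.
__host__ __device__ inline int tile_row(int src) { return src / 16; }
__host__ __device__ inline int tile_col(int src) { return src % 16; }

// Launched with exactly one warp. 'a' points to a 16x16 row-major tile with a[i] == i.
__global__ void dump_fragment_layout(const half *a) {
    wmma::fragment<wmma::matrix_a, 16, 16, 16, half, wmma::row_major> a_frag;
    wmma::load_matrix_sync(a_frag, a, 16);

    // Each lane owns num_elements values; each value tells us where it came from.
    for (int i = 0; i < a_frag.num_elements; ++i) {
        int src = (int)__half2float(a_frag.x[i]);
        printf("lane %2d  x[%2d] <- A[%2d][%2d]\n",
               (int)threadIdx.x, i, tile_row(src), tile_col(src));
    }
}

int main() {
    half *a;
    cudaMallocManaged(&a, 256 * sizeof(half));
    // Integers 0..255 are exactly representable in half precision.
    for (int i = 0; i < 256; ++i) a[i] = __float2half((float)i);

    dump_fragment_layout<<<1, 32>>>(a);
    cudaDeviceSynchronize();
    cudaFree(a);
    return 0;
}
```

The same trick should work for matrix_b and accumulator fragments by changing the fragment type; for the accumulator, load_matrix_sync takes an extra layout argument such as wmma::mem_row_major.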
mirror in AMD