AMD matrix cores

Manipulate the matrix cores

Summary

To program the matrix cores, you can:

Use high-level libraries such as rocBLAS or rocWMMA (the latter is similar to NVIDIA's WMMA API)
Write GPU kernels entirely in assembly (which can be challenging and impractical)
Sprinkle HIP kernels with inline assembly
Use compiler intrinsics: these represent the assembly instructions in such a way that the compiler knows about their semantics and requirements
AMD has described in detail how to use the compiler intrinsics.

Difference between using rocWMMA and compiler intrinsics

Using compiler intrinsics

Note

Note that AMD's equivalent of a warp (called a wavefront in the official documentation) has 64 threads, rather than the 32 threads of a CUDA warp.

Core function

d = __builtin_amdgcn_mfma_CDFmt_MxNxKABFmt(a, b, c, cbsz, abid, blgp)
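
To make the naming pattern concrete, here is a small Python sketch. It is a CPU-side model, not HIP code. It assembles an intrinsic name from its parts (CDFmt, M, N, K, ABFmt) and models a single block of the operation as D = A * B + C, with A of size M x K, B of size K x N, and C and D of size M x N. The helper names `mfma_intrinsic_name` and `mfma_reference` are made up for illustration, and the three modifier arguments (cbsz, abid, blgp) are not modeled at all.

```python
def mfma_intrinsic_name(cd_fmt, m, n, k, ab_fmt):
    """Assemble a name following __builtin_amdgcn_mfma_CDFmt_MxNxKABFmt."""
    return f"__builtin_amdgcn_mfma_{cd_fmt}_{m}x{n}x{k}{ab_fmt}"


def mfma_reference(a, b, c):
    """CPU model of one block: D = A * B + C.

    a is M x K, b is K x N, c is M x N (plain lists of lists).
    This only models the arithmetic, not registers or lanes.
    """
    m, k, n = len(a), len(b), len(b[0])
    return [
        [c[i][j] + sum(a[i][p] * b[p][j] for p in range(k)) for j in range(n)]
        for i in range(m)
    ]


if __name__ == "__main__":
    # FP32 inputs and outputs, 16x16 result, K = 4.
    print(mfma_intrinsic_name("f32", 16, 16, 4, "f32"))

    # Tiny 2x2 case just to show the multiply-accumulate.
    a = [[1.0, 2.0], [3.0, 4.0]]
    b = [[5.0, 6.0], [7.0, 8.0]]
    c = [[1.0, 1.0], [1.0, 1.0]]
    print(mfma_reference(a, b, c))
```

The 2x2 matrices are only there to keep the arithmetic readable; real MFMA shapes are fixed by the instruction name.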

How work is distributed

Work is distributed across the threads of a wavefront. Specifically, each MFMA compiler intrinsic fixes which thread holds which elements of A, B, and D. AMD's lab notes give some examples.
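
As an illustration of what "each thread holds fixed elements" means, the following Python sketch emulates the data movement around one 16x16x4 FP32 MFMA on the CPU. The layout used here is our reading of the 16x16x4 example in the lab notes and should be treated as an assumption: with the 64 lanes viewed as (x, y) = (lane % 16, lane / 16), lane (x, y) supplies A[x][y] and B[y][x], and receives four results, D[4*y + i][x] for i = 0..3. The function name `emulate_mfma_16x16x4` is hypothetical. Confirm the real mapping with the matrix instruction calculator before relying on it.

```python
M, N, K = 16, 16, 4
WAVEFRONT = 64  # threads per AMD wavefront


def lane_coords(lane):
    """Split a lane id into (x, y), as in a 16 x 4 thread block."""
    return lane % 16, lane // 16


def emulate_mfma_16x16x4(A, B, C):
    """Emulate per-lane loads, the multiply-accumulate, and per-lane stores.

    A is 16x4, B is 4x16, C is 16x16 (lists of lists); returns D = A*B + C.
    ASSUMED layout: lane (x, y) holds A[x][y], B[y][x], and D[4*y + i][x].
    """
    # Step 1: each lane loads exactly one element of A and one of B.
    a_reg = [0.0] * WAVEFRONT
    b_reg = [0.0] * WAVEFRONT
    for lane in range(WAVEFRONT):
        x, y = lane_coords(lane)
        a_reg[lane] = A[x][y]
        b_reg[lane] = B[y][x]

    # Step 2: the part done by the instruction itself. Every output needs
    # inputs that live in other lanes' registers.
    d_reg = [[0.0] * 4 for _ in range(WAVEFRONT)]
    for lane in range(WAVEFRONT):
        x, y = lane_coords(lane)
        for i in range(4):
            row, col = 4 * y + i, x
            acc = C[row][col]
            for k in range(K):
                # A[row][k] sits in lane row + 16*k, B[k][col] in lane col + 16*k.
                acc += a_reg[row + 16 * k] * b_reg[col + 16 * k]
            d_reg[lane][i] = acc

    # Step 3: each lane writes its four accumulator values back to D.
    D = [[0.0] * N for _ in range(M)]
    for lane in range(WAVEFRONT):
        x, y = lane_coords(lane)
        for i in range(4):
            D[4 * y + i][x] = d_reg[lane][i]
    return D


if __name__ == "__main__":
    A = [[float(i + 2 * k) for k in range(K)] for i in range(M)]
    B = [[float(k - j) for j in range(N)] for k in range(K)]
    C = [[1.0] * N for _ in range(M)]
    ref = [[C[i][j] + sum(A[i][k] * B[k][j] for k in range(K))
            for j in range(N)] for i in range(M)]
    print(emulate_mfma_16x16x4(A, B, C) == ref)
```

The sketch also makes the register budget visible: 64 lanes times one value each cover the 16x4 and 4x16 inputs exactly, and 64 lanes times four values cover the 16x16 output.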

The lab notes do not define the arithmetic operations themselves. Our guess is that these operations are performed by the matrix core hardware.
Experiments for order
mirror in NVIDIA GPU

High-level idea

Each core function has a fixed layout. It can be checked with the AMD matrix instruction calculator: https://github.com/RadeonOpenCompute/amd_matrix_instruction_calculator

Questions

The matrix cores' speedup is 2x compared to SIMD throughput. Is that much lower than the speedup NVIDIA's tensor cores provide?