AMD matrix cores
Programming the matrix cores
To program the matrix cores, you can:
- at a high level, just use libraries such as rocBLAS or rocWMMA (similar to NVIDIA's wmma API)
- write GPU kernels entirely in assembly (which can be somewhat challenging and impractical)
- sprinkle HIP kernels with inline assembly
- use compiler intrinsics: these represent the assembly instructions in such a way that the compiler knows their semantics and requirements
AMD has detailed how to use the compiler intrinsics in their lab notes.
Difference between using rocWMMA and compiler intrinsics
Using compiler intrinsics
Note that AMD's warp (called a wavefront in the official documentation) has 64 threads, instead of 32 as in CUDA.
d = __builtin_amdgcn_mfma_CDFmt_MxNxKABFmt(a, b, c, cbsz, abid, blgp)
where CDFmt is the data format of C and D, ABFmt the format of A and B, MxNxK the tile shape, and cbsz, abid, blgp are block/broadcast control modifiers.
How the work is distributed
Work is distributed among the threads of a wavefront. Specifically, each mfma compiler intrinsic has a fixed mapping that decides which thread stores which locations of D. AMD's lab notes give some examples.
The lab notes do not spell out how the arithmetic itself is carried out; presumably these operations are performed by the matrix-core hardware.
Experiments on the element ordering
Counterpart on NVIDIA GPUs
Each mfma intrinsic has a fixed data layout. Check it using https://github.com/RadeonOpenCompute/amd_matrix_instruction_calculator
- The matrix-core speedup is only about 2x over SIMD throughput, which is much smaller than what NVIDIA's tensor cores achieve?