AMD matrix cores
Programming the matrix cores
Summary
To program the matrix cores, you can:
- at a high level, just use libraries such as rocBLAS or rocWMMA (similar to NVIDIA's nvcuda::wmma)
- write GPU kernels entirely in assembly (which can be somewhat challenging and impractical)
- sprinkle HIP kernels with inline assembly
- use compiler intrinsics: these represent the assembly instructions in such a way that the compiler knows about their semantics and requirements
AMD's lab notes detail how to use the compiler intrinsics.
Difference between using rocWMMA and compiler intrinsics
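To make the contrast concrete, here is a minimal sketch of the rocWMMA path. The fragment API mirrors nvcuda::wmma; the tile size, kernel name, and launch shape here are illustrative, not taken from the lab notes:

```cpp
#include <hip/hip_runtime.h>
#include <rocwmma/rocwmma.hpp>

using rocwmma::float16_t;
using rocwmma::float32_t;

// One wavefront computes a single 16x16 tile: D = A * B.
// A is row-major 16x16 fp16, B is col-major 16x16 fp16, D is row-major fp32.
__global__ void wmma_16x16x16(const float16_t* a, const float16_t* b,
                              float32_t* d) {
  rocwmma::fragment<rocwmma::matrix_a, 16, 16, 16, float16_t,
                    rocwmma::row_major> fragA;
  rocwmma::fragment<rocwmma::matrix_b, 16, 16, 16, float16_t,
                    rocwmma::col_major> fragB;
  rocwmma::fragment<rocwmma::accumulator, 16, 16, 16, float32_t> fragAcc;

  rocwmma::fill_fragment(fragAcc, 0.0f);
  rocwmma::load_matrix_sync(fragA, a, 16);  // leading dimension = 16
  rocwmma::load_matrix_sync(fragB, b, 16);
  rocwmma::mma_sync(fragAcc, fragA, fragB, fragAcc);
  rocwmma::store_matrix_sync(d, fragAcc, 16, rocwmma::mem_row_major);
}
```

The key difference: rocWMMA hides the per-lane data layout behind fragments, while with compiler intrinsics you must place each element into the right lane and register yourself.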
Using compiler intrinsics
Note
Note that AMD's warp (called a wavefront in their official docs) has 64 threads on CDNA GPUs, instead of 32 as in CUDA.
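A quick host-side sketch to confirm the wavefront size on a given device rather than hard-coding 64:

```cpp
#include <cstdio>
#include <hip/hip_runtime.h>

int main() {
  // warpSize reports the wavefront width: 64 on CDNA (e.g. MI100/MI200),
  // while RDNA consumer GPUs default to 32.
  hipDeviceProp_t prop;
  if (hipGetDeviceProperties(&prop, 0) != hipSuccess) return 1;
  std::printf("wavefront size: %d\n", prop.warpSize);
  return 0;
}
```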
Core function
d = __builtin_amdgcn_mfma_CDFmt_MxNxKABFmt(a, b, c, cbsz, abid, blgp)
where CDFmt is the data format of C and D, ABFmt the data format of A and B, MxNxK the tile shape, and cbsz/abid/blgp are broadcast/swizzle modifiers (all 0 in the plain case).
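For example, instantiating the name for fp32 inputs and a 16x16x4 tile gives the builtin below; the sketch assumes the usual Clang vector typedef for the 4-float accumulator fragment:

```cpp
// Four consecutive VGPRs per lane hold this lane's piece of C/D.
typedef float v4f __attribute__((ext_vector_type(4)));

// CDFmt = f32, MxNxK = 16x16x4, ABFmt = f32.
// Each lane supplies one element of A, one of B, and four of C;
// cbsz/abid/blgp = 0 disables block broadcast/swizzling.
__device__ v4f mfma_16x16x4(float a, float b, v4f c) {
  return __builtin_amdgcn_mfma_f32_16x16x4f32(a, b, c, 0, 0, 0);
}
```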
How work is distributed
Work is distributed among wavefronts. Specifically, each mfma instruction is executed by one entire wavefront, with the elements of A, B, C, and D spread across the 64 lanes' registers.
The compiler intrinsics fix which thread holds which elements of A, B, and D. In their lab notes, AMD give some examples.
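One such example, reconstructed here as a sketch: the per-lane index math follows the layout that the matrix instruction calculator reports for v_mfma_f32_16x16x4f32, and should be verified against that tool:

```cpp
#include <hip/hip_runtime.h>

typedef float v4f __attribute__((ext_vector_type(4)));

// One wavefront, launched as a 16x4 block, computes D = A * B for
// M = N = 16, K = 4 (all row-major fp32) with a single mfma.
__global__ void sgemm_16x16x4(const float* A, const float* B, float* D) {
  v4f d = {0.0f, 0.0f, 0.0f, 0.0f};

  // Lane (x, y) holds A(m = x, k = y) and B(k = y, n = x).
  float a = A[threadIdx.x * 4 + threadIdx.y];   // A is 16x4
  float b = B[threadIdx.y * 16 + threadIdx.x];  // B is 4x16

  d = __builtin_amdgcn_mfma_f32_16x16x4f32(a, b, d, 0, 0, 0);

  // Lane (x, y) receives D(4*y + i, x) in d[i], for i = 0..3.
  for (int i = 0; i < 4; ++i) {
    D[(4 * threadIdx.y + i) * 16 + threadIdx.x] = d[i];
  }
}
```

Launch with a dim3 block of (16, 4) so the 64 threads form exactly one wavefront, and compile for a CDNA target (e.g. gfx90a); the mfma builtins are unavailable on architectures without matrix cores.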
They didn't define the arithmetic operations themselves; we guess these multiply-accumulates are performed by the matrix core hardware.
Experiments on ordering
Counterpart on NVIDIA GPUs
High-level idea
Each core function has a fixed data layout. Check it using https://github.com/RadeonOpenCompute/amd_matrix_instruction_calculator
Questions
- The matrix core speedup is only about 2x over SIMD throughput (for FP32), which is much smaller than the speedup of NVIDIA's tensor cores. Why?