AMD matrix cores

Programming the matrix cores


To program the matrix cores, you can:

  1. At a high level, use libraries such as rocBLAS or rocWMMA (similar to NVIDIA's wmma/mma APIs)
  2. Write GPU kernels entirely in assembly (which is somewhat challenging and impractical)
  3. Sprinkle HIP kernels with inline assembly
  4. Use compiler intrinsics: these expose the assembly instructions in a way that lets the compiler understand their semantics and requirements
    AMD's lab notes detail how to use the compiler intrinsics.

Using compiler intrinsics


Note that AMD's warp (called a wavefront in their official docs) has 64 lanes, versus 32 in CUDA.

Core function

d = __builtin_amdgcn_mfma_CDFmt_MxNxKABFmt(a, b, c, cbsz, abid, blgp)

where CDFmt is the data format of C and D, ABFmt the data format of A and B, MxNxK the tile dimensions, and cbsz, abid, and blgp are integer modifiers controlling broadcast and lane-group behavior.

How work is distributed

Work is distributed across the wavefront: each mfma intrinsic fixes which lanes hold which elements of A, B, C, and D. The lab notes give several worked examples of these layouts.

The docs don't spell out the arithmetic operations themselves; we assume these operations are performed by the matrix-core hardware.
Experiments to determine the order

Mirror on NVIDIA GPUs

High-level idea

Each core function has a fixed data layout. Check it by feeding in known values and observing which lane ends up with which element.


  1. Matrix-core speedup is only about 2x over SIMD throughput, which is much lower than NVIDIA's tensor cores. Why?