SASS Semantics -- Half instructions store pattern
Store Pattern in registers
Exploiting half precision arithmetic in Nvidia GPUs
Pattern summary from others (not necessarily correct)
HFMA2.<MRG_H0|MRG_H1|F32>.<FTZ|FMZ>.<SAT> d, a.<H0_H0|H1_H1|F32>, <->b.<H0_H0|H1_H1>, <->c.<H0_H0|H1_H1|F32>;
<HADD2|HMUL2>.<MRG_H0|MRG_H1|F32>.<FTZ|FMZ>.<SAT> d, a.<H0_H0|H1_H1|F32>, <->b.<H0_H0|H1_H1>;
Result
Opcode | Number of operands | Scenario | Destination store pattern |
---|---|---|---|
HADD2/HMUL2 (HFMA2) | 3 (4) | Ordinary half and half2 functions | After execution, the destination register stores two 16-bit numbers. In the half case both are the same number; in the half2 case they may differ. |
HADD2/HMUL2 (HFMA2) | 4 (5?) | Adding numbers with the __float2half2 function | After execution, only the first register stores the destination value, as two 16-bit numbers. |
(HADD2/HMUL2 (HFMA2)).FP32 | 3 (4) | Seems to compute via half2float and float2half conversions | The first register stores the destination value as one 32-bit number. |
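The first row of the table can be modeled on the host as packing two 16-bit patterns into one 32-bit register value. This is only a sketch: `pack_half2` and `broadcast_half` are hypothetical helper names, and the layout (H1 in the upper 16 bits, H0 in the lower 16 bits) is an assumption taken from the `>> 16` extraction used later in these notes.

```c
#include <stdint.h>

/* Assumed layout: H1 occupies bits 31..16, H0 occupies bits 15..0. */
static uint32_t pack_half2(uint16_t h1_bits, uint16_t h0_bits) {
    return ((uint32_t)h1_bits << 16) | (uint32_t)h0_bits;
}

/* Scalar half case from the table: the same 16-bit pattern in both halves. */
static uint32_t broadcast_half(uint16_t h_bits) {
    return pack_half2(h_bits, h_bits);
}
```

For instance, `broadcast_half(0x3E00)` gives a register whose two halves both hold the FP16 bit pattern of 1.5, while `pack_half2(0x4480, 0x3E00)` holds 4.5 and 1.5.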
Exploration
Learn SASS Semantics FP16 -- H0_H0 or H1_H1
Learn SASS Semantics -- (FP16) Half instructions store pattern
Issues
Lower 16-bits are zero in HADD2
four operands case
uint16_t
keeps only the lower 16 bits of a value stored in a wider format. See https://stackoverflow.com/questions/53882934/extract-upper-and-lower-word-of-an-unsigned-32-bit-integer
Resource
<HFMA2|HADD2|HMUL2>.FP32
Before executing this instruction, the stored datatype is FP16;
After executing this instruction, the stored datatype is FP32.
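When inspecting a dumped register after an .FP32 variant, the 32 bits have to be read as one float rather than as two halves. A minimal host-side sketch, assuming the register value is available as a `uint32_t` (the helper name `reg_bits_as_float` is hypothetical):

```c
#include <assert.h>
#include <stdint.h>
#include <string.h>

/* Reinterpret a raw 32-bit register value as an IEEE 754 float.
   memcpy avoids the undefined behavior of pointer casting. */
static float reg_bits_as_float(uint32_t reg) {
    float f;
    assert(sizeof f == sizeof reg);
    memcpy(&f, &reg, sizeof f);
    return f;
}
```

With this, a register holding `0x40900000` reads back as 4.5f.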
R#.<H0_H0|H1_H1>
From Learn SASS Semantics FP16 -- H0_H0 or H1_H1
Conclusion
H0_H0
means the lower 16 bits of the 32-bit register; we can just use
uint16_t val = R4_value
to extract the value;
H1_H1
means the upper 16 bits of the 32-bit register; we can just use
uint16_t val = R4_value >> 16
to extract the value;
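The two extractions can be combined with an FP16 decoder to print what each half of a register holds. The sketch below assumes the register value is available as a `uint32_t` on the host; the helper names are hypothetical, and `half_bits_to_float` is a plain software decode of the IEEE 754 binary16 layout (1 sign bit, 5 exponent bits, 10 mantissa bits), not a CUDA API.

```c
#include <stdint.h>
#include <string.h>

/* H0: lower 16 bits of the 32-bit register value. */
static uint16_t extract_h0(uint32_t reg) { return (uint16_t)(reg & 0xFFFFu); }

/* H1: upper 16 bits of the 32-bit register value. */
static uint16_t extract_h1(uint32_t reg) { return (uint16_t)(reg >> 16); }

/* Decode an IEEE 754 binary16 bit pattern into a float. */
static float half_bits_to_float(uint16_t h) {
    uint32_t sign = (uint32_t)(h & 0x8000u) << 16;
    uint32_t exp  = (h >> 10) & 0x1Fu;
    uint32_t man  = h & 0x3FFu;
    uint32_t bits;
    float f;

    if (exp == 0x1Fu) {                 /* infinity or NaN */
        bits = sign | 0x7F800000u | (man << 13);
    } else if (exp != 0) {              /* normal: rebias 15 -> 127 */
        bits = sign | ((exp + 112u) << 23) | (man << 13);
    } else if (man == 0) {              /* signed zero */
        bits = sign;
    } else {                            /* subnormal: normalize mantissa */
        uint32_t e = 113u;
        while ((man & 0x400u) == 0) { man <<= 1; e--; }
        bits = sign | (e << 23) | ((man & 0x3FFu) << 13);
    }
    memcpy(&f, &bits, sizeof f);
    return f;
}
```

As a usage example with a made-up register value, `half_bits_to_float(extract_h1(0x44803E00))` gives 4.5 and `half_bits_to_float(extract_h0(0x44803E00))` gives 1.5.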
<HADD2|HMUL2> with 4 operands
From Write and analyze a FP16 CUDA program > Use half2 and perform addition using half2 arithmetic functions, this form seems to appear when two constants are passed directly as arguments to half2 functions (e.g. in this case we have __hadd2(in_array[idx], __float2half2_rn(1.0)), where 1.0 is the constant).
These constants have the operand types operandType::IMM_DOUBLE and operandType::IMM_UINT64.
The final result is stored in the first two operands (same value) in FP16 format.
e.g.
After HADD2 R7, R7, 1, 1 ; the values are 4.500000, 1.500000, 4.500000, 1.500000, 0.000000, 0.000000, 0.000000, 0.000000