SASS Semantics -- Half instructions store pattern

HFMA2.<MRG_H0|MRG_H1|F32>.<FTZ|FMZ>.<SAT> d, a.<H0_H0|H1_H1|F32>, <->b.<H0_H0|H1_H1>, <->c.<H0_H0|H1_H1|F32>;

<HADD2|HMUL2>.<MRG_H0|MRG_H1|F32>.<FTZ|FMZ>.<SAT> d, a.<H0_H0|H1_H1|F32>, <->b.<H0_H0|H1_H1>;

Opcode	# of operands	Scenario	Destination store pattern
HADD2/HMUL2 (HFMA2)	3 (4)	Normal `half` or `half2` function	After execution, the destination register stores two 16-bit numbers. In the `half` case both are the same number; in the `half2` case they may differ.
HADD2/HMUL2 (HFMA2)	4 (5?)	Adding numbers created with the `__float2half2` function	After execution, only the first register stores the destination value. It stores two 16-bit numbers.
(HADD2/HMUL2 (HFMA2)).F32	3 (4)	Seems to compute via `half2float` and `float2half` conversions	The first register stores the destination value, which is a single 32-bit number.

For this last case (the F32 variant): before the instruction executes, the stored datatype is FP16; after it executes, the stored datatype is FP32.
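To illustrate why the stored datatype matters when reading a register back, here is a minimal host-side sketch. The helper names `reg_as_fp32` and `check_fp32_view` are hypothetical and not part of any tool mentioned in these notes; the sketch only assumes that a register is available as a raw `uint32_t`.

```cpp
#include <cassert>
#include <cstdint>
#include <cstring>

// Reinterpret the raw 32-bit register content as one FP32 number.
// (Hypothetical helper; assumes the register value is a plain uint32_t.)
float reg_as_fp32(uint32_t reg) {
    static_assert(sizeof(float) == sizeof(uint32_t), "expects a 32-bit float");
    float value;
    std::memcpy(&value, &reg, sizeof value);  // bit copy, no numeric conversion
    return value;
}

// Small self-check showing that the same bits read differently per datatype.
void check_fp32_view() {
    const uint32_t reg = 0x3F800000u;  // FP32 bit pattern of 1.0f
    assert(reg_as_fp32(reg) == 1.0f);
    // Read as packed FP16, the same bits split into two unrelated 16-bit lanes.
    assert(static_cast<uint16_t>(reg) == 0x0000);        // lower half (H0)
    assert(static_cast<uint16_t>(reg >> 16) == 0x3F80);  // upper half (H1)
}
```

The point of the sketch is only that a destination written as FP32 has to be decoded as a single 32-bit value, not as two 16-bit halves.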

H0_H0 means the lower 16 bits of the 32-bit register; we can just use

uint16_t val = R4_value & 0xFFFF;

to extract the value.

H1_H1 means the upper 16 bits of the 32-bit register; assuming R4_value is an unsigned 32-bit value, we can just use

uint16_t val = R4_value >> 16;

to extract the value.
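The two extractions above yield raw FP16 bit patterns, so they still need decoding before they can be compared with floating-point values. The following is a minimal sketch of both steps, assuming the register is available as a `uint32_t`; `extract_half` and `half_bits_to_float` are hypothetical helper names, and the decoder follows the standard IEEE 754 binary16 layout (1 sign bit, 5 exponent bits, 10 mantissa bits).

```cpp
#include <cassert>
#include <cmath>
#include <cstdint>
#include <limits>

// Extract one 16-bit lane: lane 0 is H0 (bits 0-15), lane 1 is H1 (bits 16-31).
uint16_t extract_half(uint32_t reg, int lane) {
    assert(lane == 0 || lane == 1);
    return static_cast<uint16_t>(reg >> (16 * lane));
}

// Decode an IEEE 754 binary16 bit pattern into a float.
float half_bits_to_float(uint16_t h) {
    const int sign = (h >> 15) & 0x1;
    const int exponent = (h >> 10) & 0x1F;
    const int mantissa = h & 0x3FF;
    float magnitude;
    if (exponent == 0) {
        // Zero or subnormal: mantissa * 2^-24.
        magnitude = std::ldexp(static_cast<float>(mantissa), -24);
    } else if (exponent == 31) {
        // All-ones exponent: infinity (mantissa 0) or NaN (mantissa non-zero).
        magnitude = (mantissa == 0) ? std::numeric_limits<float>::infinity()
                                    : std::numeric_limits<float>::quiet_NaN();
    } else {
        // Normal: (1 + mantissa/1024) * 2^(exponent-15) = (1024 + mantissa) * 2^(exponent-25).
        magnitude = std::ldexp(static_cast<float>(1024 + mantissa), exponent - 25);
    }
    return sign ? -magnitude : magnitude;
}
```

For example, with a register value of `0x3E004480`, `half_bits_to_float(extract_half(reg, 0))` gives 4.5 and `half_bits_to_float(extract_half(reg, 1))` gives 1.5.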

Based on "Write and analyze a FP16 CUDA program > Use half2 and perform addition using half2 arithmetic functions", the 4-operand form seems to appear when a constant is passed directly as an argument to a half2 function, so that it shows up as two constant operands (e.g. in this case we have `__hadd2(in_array[idx], __float2half2_rn(1.0))`, where 1.0 is the constant).

These constants have the operand types operandType::IMM_DOUBLE and operandType::IMM_UINT64.

The final result is stored in the first two operands (both hold the same value) in FP16 format.
e.g.

After   HADD2 R7, R7, 1, 1 ;, 4.500000, 1.500000,4.500000,1.500000, 0.000000, 0.000000, 0.000000, 0.000000
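To make the arithmetic behind such a trace concrete, the sketch below emulates a lane-wise add of two immediates on the host. Everything here is an assumption for illustration: the helper names are hypothetical, the input bit patterns (3.5 and 0.5) are chosen only so that the sums match the 4.5 and 1.5 in the trace, and the mapping of each value and immediate to H0 or H1 is a guess. The results are kept as floats; no re-encoding to FP16 or rounding is modeled.

```cpp
#include <cassert>
#include <cmath>
#include <cstdint>
#include <limits>

// Decode an IEEE 754 binary16 bit pattern into a float.
float half_bits_to_float(uint16_t h) {
    const int sign = (h >> 15) & 0x1;
    const int exponent = (h >> 10) & 0x1F;
    const int mantissa = h & 0x3FF;
    float magnitude;
    if (exponent == 0) {
        magnitude = std::ldexp(static_cast<float>(mantissa), -24);  // zero/subnormal
    } else if (exponent == 31) {
        magnitude = (mantissa == 0) ? std::numeric_limits<float>::infinity()
                                    : std::numeric_limits<float>::quiet_NaN();
    } else {
        magnitude = std::ldexp(static_cast<float>(1024 + mantissa), exponent - 25);
    }
    return sign ? -magnitude : magnitude;
}

struct Half2Result {
    float h0;  // result for the lower half
    float h1;  // result for the upper half
};

// Add one immediate to each 16-bit lane of a packed register (hypothetical model).
Half2Result emulate_hadd2_imm(uint32_t reg, float imm_h0, float imm_h1) {
    const float h0 = half_bits_to_float(static_cast<uint16_t>(reg));
    const float h1 = half_bits_to_float(static_cast<uint16_t>(reg >> 16));
    return {h0 + imm_h0, h1 + imm_h1};
}

// Self-check with assumed inputs: H0 = 3.5 (0x4300), H1 = 0.5 (0x3800).
void check_trace_values() {
    const Half2Result r = emulate_hadd2_imm(0x38004300u, 1.0f, 1.0f);
    assert(r.h0 == 4.5f);
    assert(r.h1 == 1.5f);
}
```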