## GCN ISA Instruction Timings

### Preliminary explanations

Almost all instructions (scalar and vector) are executed within 4 cycles. Hence, to
achieve maximum performance, 4 wavefronts should be executed per compute unit.

NOTE: a simple single-dword (4-byte) instruction is executed in 4 cycles (thanks to fast
dispatching from cache). However, a 2-dword (8-byte) instruction may require 4 extra cycles
to execute because of its larger size in memory and the limits of instruction dispatching.
For best performance, we recommend using single-dword instructions.

A DPFACTOR term is present in some tables; it indicates that the number of cycles depends
on the model of the GPU as follows:

 DPFACTOR     | DP speed | GPU subfamily
--------------|----------|----------------------------
 1            | 1/2      | Professional Hawaii
 2            | 1/4      | High-end Tahiti: Radeon HD7970
 4            | 1/8      | High-end Hawaii: R9 290
 8            | 1/16     | Other GPUs
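
As a rough illustration of how DPFACTOR enters the cycle counts, the sketch below turns a table entry such as `DPFACTOR*4` into a concrete number. The dictionary keys and the function names `dp_cycles` and `dp_speed` are hypothetical labels chosen for this example; only the numeric values come from the table above.

```python
# DPFACTOR values copied from the table above. The keys are hypothetical
# labels for this sketch, not official product or driver names.
DPFACTOR = {
    "professional_hawaii": 1,
    "highend_tahiti": 2,
    "highend_hawaii": 4,
    "other": 8,
}


def dp_cycles(multiplier: int, gpu: str) -> int:
    """Cycles for an instruction whose table entry is DPFACTOR*multiplier."""
    return DPFACTOR[gpu] * multiplier


def dp_speed(gpu: str) -> float:
    """DP speed column of the table: 1/2 for DPFACTOR 1, 1/4 for 2, etc."""
    return 1.0 / (2 * DPFACTOR[gpu])


if __name__ == "__main__":
    # An entry listed as DPFACTOR*4, evaluated for a Radeon HD7970.
    print(dp_cycles(4, "highend_tahiti"))
    # An entry listed as DPFACTOR*8, evaluated for the "other" subfamily.
    print(dp_cycles(8, "other"))
```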
 

### Occupancy table

Waves | SGPRs | VGPRs | LdsW/I | Issue
------|-------|-------|--------|---------
1     | 128   | 256   | 64     | 1
2     | 128   | 128   | 32     | 2
3     | 128   | 84    | 21     | 3
4     | 128   | 64    | 16     | 4
5     | 96    | 48    | 12     | 5
6     | 80    | 40    | 10     | 5
7     | 72    | 36    | 9      | 5
8     | 64    | 32    | 8      | 5
9     | 56    | 28    | 7      | 5
10    | 48    | 24    | 6      | 5

* Waves - number of concurrent waves that can be computed by a single SIMD unit
* SGPRs - maximum number of SGPRs that can be allocated at that occupancy
* VGPRs - maximum number of VGPRs that can be allocated at that occupancy
* LdsW/I - maximum amount of LDS space per vector lane per wavefront, in dwords
* Issue - maximum number of instructions per clock

Each compute unit is partitioned into four SIMD units, each holding up to 10 waves.
So, the maximum number of waves per compute unit is 40.
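
The occupancy table can be used as a lookup: given a kernel's register usage, find the highest wave count whose limits still fit. The following is a minimal sketch under that reading. The function name `max_waves_per_simd` is hypothetical, and the LDS column is deliberately ignored to keep the example short.

```python
# (waves, max SGPRs, max VGPRs) rows copied from the occupancy table.
OCCUPANCY = [
    (1, 128, 256), (2, 128, 128), (3, 128, 84), (4, 128, 64), (5, 96, 48),
    (6, 80, 40), (7, 72, 36), (8, 64, 32), (9, 56, 28), (10, 48, 24),
]
SIMDS_PER_CU = 4  # "Each compute unit is partitioned into four SIMD units"


def max_waves_per_simd(sgprs: int, vgprs: int) -> int:
    """Highest occupancy whose register limits fit the kernel (0 if none).

    Assumption of this sketch: only SGPR and VGPR usage limit occupancy;
    LDS usage is not taken into account.
    """
    best = 0
    for waves, max_sgprs, max_vgprs in OCCUPANCY:
        if sgprs <= max_sgprs and vgprs <= max_vgprs:
            best = max(best, waves)
    return best


def max_waves_per_cu(sgprs: int, vgprs: int) -> int:
    """Waves per compute unit: per-SIMD occupancy times four SIMD units."""
    return SIMDS_PER_CU * max_waves_per_simd(sgprs, vgprs)


if __name__ == "__main__":
    print(max_waves_per_simd(64, 40))  # limited by VGPRs
    print(max_waves_per_cu(48, 24))    # full occupancy
```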

### Instruction alignment

Alignment rules for 2-dword instructions (GCN 1.0/1.1):

* any penalty costs 4 cycles
* the program is divided into 32-byte blocks
* only the first 3 dwords in a 32-byte block incur no penalty. Any 2-dword
instruction outside these first 3 dwords adds a single penalty.
* if an instruction takes longer than four cycles, then the last cycles/4 dwords are free
* if a 2-dword instruction that takes 16 or more cycles is followed by a 2-dword instruction
in dword 4, then there is no penalty for the second 2-dword instruction.
* the best jump targets are the first 5 dwords of a 32-byte block. A jump to any of the remaining
dwords causes 1-3 penalties, depending on the dword number (N-4, where N is the dword number).
This rule does not apply to backward jumps (???)
* any conditional jump instruction should be in the first half of a 32-byte block; otherwise
1-4 penalties are added if the jump is not taken, depending on the dword number (N-3, where N is the dword number).
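
The basic alignment rules can be approximated with a few small functions. This is only a sketch under stated assumptions: dwords within a 32-byte block are numbered from 0 (which makes the N-4 and N-3 formulas yield 1-3 and 1-4 penalties), the position of a 2-dword instruction is taken to be the position of its first dword, and the exceptions for long-running instructions and backward jumps are not modelled. All function names are hypothetical.

```python
PENALTY_CYCLES = 4  # "any penalty costs 4 cycles"
BLOCK_BYTES = 32
DWORD_BYTES = 4


def dword_in_block(byte_offset: int) -> int:
    """0-based dword index of a byte offset within its 32-byte block."""
    return (byte_offset % BLOCK_BYTES) // DWORD_BYTES


def two_dword_penalties(byte_offset: int) -> int:
    """Penalties for a 2-dword instruction starting at byte_offset."""
    return 0 if dword_in_block(byte_offset) < 3 else 1


def jump_target_penalties(byte_offset: int) -> int:
    """Penalties for jumping to byte_offset (N-4 outside the first 5 dwords)."""
    return max(0, dword_in_block(byte_offset) - 4)


def untaken_cond_jump_penalties(byte_offset: int) -> int:
    """Penalties for a not-taken conditional jump located at byte_offset (N-3)."""
    return max(0, dword_in_block(byte_offset) - 3)


if __name__ == "__main__":
    # A jump to the last dword of a block: 3 penalties, 4 cycles each.
    print(jump_target_penalties(28) * PENALTY_CYCLES)
```

Multiplying any of the returned penalty counts by `PENALTY_CYCLES` gives the extra cycles under these assumptions.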

IMPORTANT: If the occupancy is greater than 1 wave per compute unit, then the penalties,
branches, and scalar instructions are masked when more than 4\*CUs waves are being
executed. For best results, we recommend executing many waves
(a multiple of 4\*CUs) with occupancy greater than 1.

GCN 1.2 always executes instructions at full speed if they are in the instruction cache.
GCN 1.2 can also fetch 2-dword instructions at full speed.

### Instruction scheduling

* if many wavefronts are executed in a single CU, then the scalar, vector,
data-share and memory (???) execution units can run independently in parallel,
achieving many instructions per cycle.
* between any integer V_ADD\*, V_SUB\*, V_READFIRSTLANE_B32 or V_READLANE_B32 operation
and any scalar ALU instruction there is a 16-cycle delay. Masked if there are more waves than 4\*CUs.
* any conditional jump that directly checks VCCZ or EXECZ after an instruction that changes
VCC or EXEC adds a single penalty (4 cycles). Masked if there are more waves than 4\*CUs.
* any conditional jump that directly checks SCC after an instruction that changes SCC,
EXEC or VCC adds a single penalty (4 cycles). Masked if there are more waves than 4\*CUs.

### SOP2 Instruction timings

All SOP2 instructions (S_CBRANCH_G_FORK not checked) take 4 cycles.

### SOPK Instruction timings

All SOPK instructions (S_CBRANCH_I_FORK not checked) take 4 cycles, except
S_SETREG_B32 and S_SETREG_IMM32_B32, which take 8 cycles.

### SOP1 Instruction timings

The S_*_SAVEEXEC_B64 instructions take 8 cycles. Other ALU instructions (except
S_MOV_REGRD_B32, S_CBRANCH_JOIN, S_RFE_B64) take 4 cycles.

### SOPC Instruction timings

All comparison and bit checking instructions take 4 cycles.

### SOPP Instruction timings

Jumps cost 4 cycles if the jump is not taken, or 20 cycles (???) if it is taken.

### SMRD Instruction timings

Timings of SMRD instructions include only the time to fetch and execute the instruction, without
loading data from memory. Timings of SMRD instructions are in this table:

 Instruction           | Cycles        | Instruction           | Cycles
-----------------------|---------------|-----------------------|---------------
 S_BUFFER_LOAD_DWORD   | 4             | S_LOAD_DWORD          | 4
 S_BUFFER_LOAD_DWORDX2 | 4             | S_LOAD_DWORDX2        | 4
 S_BUFFER_LOAD_DWORDX4 | 4             | S_LOAD_DWORDX4        | 4
 S_BUFFER_LOAD_DWORDX8 | 8             | S_LOAD_DWORDX8        | 8
 S_BUFFER_LOAD_DWORDX16 | 16-24        | S_LOAD_DWORDX16       | 16-24
 S_DCACHE_INV          | 4             | S_MEMTIME             | 4
 S_DCACHE_INV_VOL      | 4             |

### VOP2 Instruction timings

All VOP2 instructions take 4 cycles. Every instruction can achieve a throughput of 1 instruction
per cycle.

### VOP1 Instruction timings

Maximum throughput of these instructions can be calculated by using the expression
`(1/(CYCLES/4))` - for 4 cycles it is 1 instruction per cycle, for 8 cycles it is 1/2 instruction
per cycle, etc.
Timings of VOP1 instructions are in this table:

 Instruction           | Cycles        | Instruction           | Cycles
-----------------------|---------------|-----------------------|---------------
 V_BFREV_B32           | 4             | V_FREXP_EXP_I32_F32   | 4
 V_CEIL_F16            | 4             | V_FREXP_EXP_I32_F64   | DPFACTOR*4
 V_CEIL_F32            | 4             | V_FREXP_MANT_F16      | 4
 V_CEIL_F64            | DPFACTOR*4    | V_FREXP_MANT_F32      | 4
 V_CLREXCP             | 4             | V_FREXP_MANT_F64      | DPFACTOR*4
 V_COS_F16             | 16            | V_LOG_CLAMP_F32       | 16
 V_COS_F32             | 16            | V_LOG_F16             | 16
 V_CVT_F16_F32         | 4             | V_LOG_F32             | 16
 V_CVT_F16_I16         | 4             | V_LOG_LEGACY_F32      | 16
 V_CVT_F16_U16         | 4             | V_MBCNT_LO_U32_B32    | 4
 V_CVT_F32_F16         | 4             | V_MBCNT_HI_U32_B32    | 4
 V_CVT_F32_F64         | DPFACTOR*4    | V_MOVRELD_B32         | 4
 V_CVT_F32_I32         | 4             | V_MOVRELSD_B32        | 4
 V_CVT_F32_U32         | 4             | V_MOVRELS_B32         | 4
 V_CVT_F32_UBYTE0      | 4             | V_MOV_B32             | 4
 V_CVT_F32_UBYTE1      | 4             | V_MOV_FED_B32         | 4
 V_CVT_F32_UBYTE2      | 4             | V_NOP                 | 4
 V_CVT_F32_UBYTE3      | 4             | V_NOT_B32             | 4
 V_CVT_F64_F32         | DPFACTOR*4    | V_RCP_CLAMP_F32       | 16
 V_CVT_F64_I32         | DPFACTOR*4    | V_RCP_CLAMP_F64       | DPFACTOR*8
 V_CVT_F64_U32         | DPFACTOR*4    | V_RCP_F16             | 16
 V_CVT_FLR_I32_F32     | 4             | V_RCP_F32             | 16
 V_CVT_I16_F16         | 4             | V_RCP_F64             | DPFACTOR*8
 V_CVT_I32_F32         | 4             | V_RCP_IFLAG_F32       | 16
 V_CVT_I32_F64         | DPFACTOR*4    | V_RCP_LEGACY_F32      | 16
 V_CVT_OFF_F32_I4      | 4             | V_READFIRSTLANE_B32   | 4
 V_CVT_RPI_I32_F32     | 4             | V_RNDNE_F16           | 4
 V_CVT_U16_F16         | 4             | V_RNDNE_F32           | 4
 V_CVT_U32_F32         | 4             | V_RNDNE_F64           | DPFACTOR*4
 V_CVT_U32_F64         | DPFACTOR*4    | V_RSQ_CLAMP_F32       | 16
 V_EXP_F16             | 16            | V_RSQ_CLAMP_F64       | DPFACTOR*8
 V_EXP_F32             | 16            | V_RSQ_F16             | 16
 V_EXP_LEGACY_F32      | 16            | V_RSQ_F32             | 16
 V_FFBH_I32            | 4             | V_RSQ_F64             | DPFACTOR*8
 V_FFBH_U32            | 4             | V_RSQ_LEGACY_F32      | 16
 V_FFBL_B32            | 4             | V_SIN_F16             | 16
 V_FLOOR_F16           | 4             | V_SIN_F32             | 16
 V_FLOOR_F32           | 4             | V_SQRT_F16            | 16 
 V_FLOOR_F64           | DPFACTOR*4    | V_SQRT_F32            | 16
 V_FRACT_F16           | 4             | V_SQRT_F64            | DPFACTOR*8
 V_FRACT_F32           | 4             | V_TRUNC_F16           | 4
 V_FRACT_F64           | DPFACTOR*4    | V_TRUNC_F32           | 4
 V_FREXP_EXP_I16_F16   | 4             | V_TRUNC_F64           | DPFACTOR*4
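
The throughput expression `(1/(CYCLES/4))` can be evaluated directly from a cycle count. The sketch below uses exact fractions so that results such as 1/2 or 1/16 print unchanged; the function name `max_throughput` is hypothetical.

```python
from fractions import Fraction


def max_throughput(cycles: int) -> Fraction:
    """Maximum throughput, 1/(CYCLES/4), as defined by the expression above."""
    return 1 / Fraction(cycles, 4)


if __name__ == "__main__":
    print(max_throughput(4))   # 4-cycle instruction
    print(max_throughput(16))  # 16-cycle instruction, e.g. V_RCP_F32
    # A DPFACTOR*8 entry on a GPU with DPFACTOR 8 takes 64 cycles.
    print(max_throughput(8 * 8))
```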

### VOPC Instruction timings

Maximum throughput of these instructions can be calculated by using the expression
`(1/(CYCLES/4))` - for 4 cycles it is 1 instruction per cycle, for 8 cycles it is 1/2
instruction per cycle, etc.
All 16-bit and 32-bit comparison instructions take 4 cycles.
All 64-bit comparison instructions take DPFACTOR*4 cycles.

### VOP3 Instruction timings

Maximum throughput of these instructions can be calculated by using the expression
`(1/(CYCLES/4))` - for 4 cycles it is 1 instruction per cycle, for 8 cycles it is 1/2
instruction per cycle, etc.

Timings of VOP3 instructions are in this table:

 Instruction           | Cycles        | Instruction           | Cycles
-----------------------|---------------|-----------------------|---------------
 V_ADD_F64             | DPFACTOR*4    | V_MAD_LEGACY_F32      | 4
 V_ALIGNBIT_B32        | 4             | V_MAD_U16             | 4
 V_ALIGNBYTE_B32       | 4             | V_MAD_U32_U24         | 4
 V_ASHR_I64            | DPFACTOR*4    | V_MAD_U64_U32         | 16
 V_ASHRREV_I64         | DPFACTOR*4    | V_MAX3_F32            | 4
 V_BFE_I32             | 4             | V_MAX3_I32            | 4
 V_BFE_U32             | 4             | V_MAX3_U32            | 4
 V_BFI_B32             | 4             | V_MAX_F64             | DPFACTOR*4
 V_CUBEID_F32          | 4             | V_MED3_F32            | 4
 V_CUBEMA_F32          | 4             | V_MED3_I32            | 4
 V_CUBESC_F32          | 4             | V_MED3_U32            | 4
 V_CUBETC_F32          | 4             | V_MIN3_F32            | 4
 V_CVT_PK_U8_F32       | 4             | V_MIN3_I32            | 4
 V_DIV_FIXUP_F32       | 16            | V_MIN3_U32            | 4
 V_DIV_FIXUP_F64       | DPFACTOR*4    | V_MIN_F64             | DPFACTOR*4
 V_DIV_FMAS_F32        | 16            | V_MQSAD_PK_U16_U8     | 16
 V_DIV_FMAS_F64        | DPFACTOR*8    | V_MQSAD_U32_U8        | 16
 V_DIV_SCALE_F32       | 16            | V_MQSAD_U8            | 16
 V_DIV_SCALE_F64       | DPFACTOR*4    | V_MSAD_U8             | 4
 V_FMA_F16             | 4             | V_MULLIT_F32          | 4
 V_FMA_F32             | 4 or 16 (1)   | V_MUL_F64             | DPFACTOR*8
 V_FMA_F64             | DPFACTOR*8    | V_MUL_HI_I32          | 16
 V_LDEXP_F64           | DPFACTOR*4    | V_MUL_HI_U32          | 16
 V_LERP_U8             | 4             | V_MUL_LO_I32          | 16
 V_LSHL_B64            | DPFACTOR*4    | V_MUL_LO_U32          | 16
 V_LSHLREV_B64         | DPFACTOR*4    | V_QSAD_PK_U16_U8      | 16
 V_LSHR_B64            | DPFACTOR*4    | V_QSAD_U8             | 16
 V_LSHRREV_B64         | DPFACTOR*4    | V_SAD_HI_U8           | 4
 V_MAD_F16             | 4             | V_SAD_U16             | 4
 V_MAD_F32             | 4             | V_SAD_U32             | 4
 V_MAD_I16             | 4             | V_SAD_U8              | 4
 V_MAD_I32_I24         | 4             | V_TRIG_PREOP_F64      | DPFACTOR*8
 V_MAD_I64_I32         | 16            |

(1) - 4 cycles for devices with DP speed 1/2, 1/4 or 1/8; 16 cycles for other devices
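
The Cycles column of the VOP3 table mixes plain numbers, DPFACTOR multiples, and the footnoted V_FMA_F32 entry. One way to resolve an entry to a concrete cycle count is sketched below. The function name `vop3_cycles` is hypothetical, and the sketch assumes that DP speeds 1/2, 1/4 and 1/8 correspond to DPFACTOR values 1, 2 and 4, as in the DPFACTOR table.

```python
def vop3_cycles(entry: str, dpfactor: int) -> int:
    """Resolve a Cycles entry from the VOP3 table for a given DPFACTOR."""
    entry = entry.strip()
    if entry.startswith("DPFACTOR*"):
        # Entries such as "DPFACTOR*8" scale with the GPU's DPFACTOR.
        return dpfactor * int(entry[len("DPFACTOR*"):])
    if entry == "4 or 16 (1)":
        # Footnote (1): 4 cycles for DP speed 1/2, 1/4 or 1/8, else 16.
        return 4 if dpfactor in (1, 2, 4) else 16
    # Plain numeric entries such as "4" or "16".
    return int(entry)


if __name__ == "__main__":
    print(vop3_cycles("DPFACTOR*8", 2))   # V_FMA_F64 with DPFACTOR 2
    print(vop3_cycles("4 or 16 (1)", 8))  # V_FMA_F32 with DPFACTOR 8
```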

### DS Instruction timings

Timings of DS instructions include only execution on a single wavefront, without waiting for
the LDS/GDS memory access to complete. Throughput indicates the maximum possible
throughput, excluding any other delays and penalties.
Timings of DS instructions are in this table:

 Instruction            | Cycles | Throughput
------------------------|--------|------------
 DS_ADD_RTN_U32         | 8      | 1/4
 DS_ADD_RTN_U64         | 12     | 1/6
 DS_ADD_SRC2_U32        | 4      | 1/4
 DS_ADD_SRC2_U64        | 8      | 1/8
 DS_ADD_U32             | 8      | 1/4
 DS_ADD_U64             | 12     | 1/6
 DS_AND_B32             | 8      | 1/4
 DS_AND_B64             | 12     | 1/6
 DS_AND_RTN_B32         | 8      | 1/4
 DS_AND_RTN_B64         | 12     | 1/6
 DS_AND_SRC2_B32        | 4      | 1/4
 DS_AND_SRC2_B64        | 8      | 1/8
 DS_APPEND              | 4      | ?
 DS_CMPST_B32           | 12     | 1/6
 DS_CMPST_B64           | 20     | 1/10
 DS_CMPST_F32           | 12     | 1/6
 DS_CMPST_F64           | 20     | 1/10
 DS_CMPST_RTN_B32       | 12     | 1/6
 DS_CMPST_RTN_B64       | 20     | 1/10
 DS_CMPST_RTN_F32       | 12     | 1/6
 DS_CMPST_RTN_F64       | 20     | 1/10
 DS_CONDXCHG32_RTN_B128 | ?      | ?
 DS_CONDXCHG32_RTN_B64  | ?      | ?
 DS_CONSUME             | 4      | ?
 DS_DEC_RTN_U32         | 8      | 1/4
 DS_DEC_RTN_U64         | 12     | 1/6
 DS_DEC_SRC2_U32        | 4      | 1/4
 DS_DEC_SRC2_U64        | 8      | 1/8
 DS_DEC_U32             | 8      | 1/4
 DS_DEC_U64             | 12     | 1/6
 DS_GWS_BARRIER         | ?      | ?
 DS_GWS_INIT            | ?      | ?
 DS_GWS_SEMA_BR         | ?      | ?
 DS_GWS_SEMA_P          | ?      | ?
 DS_GWS_SEMA_RELEASE_ALL | ?     | ?
 DS_GWS_SEMA_V          | ?      | ?
 DS_INC_RTN_U32         | 8      | 1/4
 DS_INC_RTN_U64         | 12     | 1/6
 DS_INC_SRC2_U32        | 4      | 1/4
 DS_INC_SRC2_U64        | 8      | 1/8
 DS_INC_U32             | 8      | 1/4
 DS_INC_U64             | 12     | 1/6
 DS_MAX_F32             | 8      | 1/4
 DS_MAX_F64             | 12     | 1/6
 DS_MAX_I32             | 8      | 1/4
 DS_MAX_I64             | 12     | 1/6
 DS_MAX_RTN_F32         | 8      | 1/4
 DS_MAX_RTN_F64         | 12     | 1/6
 DS_MAX_RTN_I32         | 8      | 1/4
 DS_MAX_RTN_I64         | 12     | 1/6
 DS_MAX_RTN_U32         | 8      | 1/4
 DS_MAX_RTN_U64         | 12     | 1/6
 DS_MAX_SRC2_F32        | 4      | 1/4
 DS_MAX_SRC2_F64        | 8      | 1/8
 DS_MAX_SRC2_I32        | 4      | 1/4
 DS_MAX_SRC2_I64        | 8      | 1/8
 DS_MAX_SRC2_U32        | 4      | 1/4
 DS_MAX_SRC2_U64        | 8      | 1/8
 DS_MAX_U32             | 8      | 1/4
 DS_MAX_U64             | 12     | 1/6
 DS_MIN_F32             | 8      | 1/4
 DS_MIN_F64             | 12     | 1/6
 DS_MIN_I32             | 8      | 1/4
 DS_MIN_I64             | 12     | 1/6
 DS_MIN_RTN_F32         | 8      | 1/4
 DS_MIN_RTN_F64         | 12     | 1/6
 DS_MIN_RTN_I32         | 8      | 1/4
 DS_MIN_RTN_I64         | 12     | 1/6
 DS_MIN_RTN_U32         | 8      | 1/4
 DS_MIN_RTN_U64         | 12     | 1/6
 DS_MIN_SRC2_F32        | 4      | 1/4
 DS_MIN_SRC2_F64        | 8      | 1/8
 DS_MIN_SRC2_I32        | 4      | 1/4
 DS_MIN_SRC2_I64        | 8      | 1/8
 DS_MIN_SRC2_U32        | 4      | 1/4
 DS_MIN_SRC2_U64        | 8      | 1/8
 DS_MIN_U32             | 8      | 1/4
 DS_MIN_U64             | 12     | 1/6
 DS_MSKOR_B32           | 12     | 1/6
 DS_MSKOR_B64           | 20     | 1/10
 DS_MSKOR_RTN_B32       | 12     | 1/6
 DS_MSKOR_RTN_B64       | 20     | 1/10
 DS_NOP                 | 4      | ?
 DS_ORDERED_COUNT (???) | ?      | ?
 DS_OR_B32              | 8      | 1/4
 DS_OR_B64              | 12     | 1/6
 DS_OR_RTN_B32          | 8      | 1/4
 DS_OR_RTN_B64          | 12     | 1/6
 DS_OR_SRC2_B32         | 4      | 1/4
 DS_OR_SRC2_B64         | 8      | 1/8
 DS_READ2ST64_B32       | 8      | 1/4
 DS_READ2ST64_B64       | 16     | 1/8
 DS_READ2_B32           | 8      | 1/4
 DS_READ2_B64           | 16     | 1/8
 DS_READ_B128           | 16     | 1/8
 DS_READ_B32            | 4      | 1/2
 DS_READ_B64            | 8      | 1/4
 DS_READ_B96            | 16     | 1/8
 DS_READ_I16            | 4      | 1/2
 DS_READ_I8             | 4      | 1/2
 DS_READ_U16            | 4      | 1/2
 DS_READ_U8             | 4      | 1/2
 DS_RSUB_RTN_U32        | 8      | 1/4
 DS_RSUB_RTN_U64        | 12     | 1/6
 DS_RSUB_SRC2_U32       | 4      | 1/4
 DS_RSUB_SRC2_U64       | 8      | 1/8
 DS_RSUB_U32            | 8      | 1/4
 DS_RSUB_U64            | 12     | 1/6
 DS_SUB_RTN_U32         | 8      | 1/4
 DS_SUB_RTN_U64         | 12     | 1/6
 DS_SUB_SRC2_U32        | 4      | 1/4
 DS_SUB_SRC2_U64        | 8      | 1/8
 DS_SUB_U32             | 8      | 1/4
 DS_SUB_U64             | 12     | 1/6
 DS_SWIZZLE_B32         | 4      | 1/2
 DS_WRAP_RTN_B32        | ?      | ?
 DS_WRITE2ST64_B32      | 12     | 1/6
 DS_WRITE2ST64_B64      | 20     | 1/10
 DS_WRITE2_B32          | 12     | 1/6
 DS_WRITE2_B64          | 20     | 1/10
 DS_WRITE_B128          | 20     | 1/10
 DS_WRITE_B16           | 8      | 1/4
 DS_WRITE_B32           | 8      | 1/4
 DS_WRITE_B64           | 12     | 1/8
 DS_WRITE_B8            | 8      | 1/4
 DS_WRITE_B96           | 16     | 1/10
 DS_WRITE_SRC2_B32      | 12     | 1/4
 DS_WRITE_SRC2_B64      | 20     | 1/8
 DS_WRXCHG2ST64_RTN_B32 | 12     | 1/6
 DS_WRXCHG2ST64_RTN_B64 | 20     | 1/12
 DS_WRXCHG2_RTN_B32     | 12     | 1/6
 DS_WRXCHG2_RTN_B64     | 20     | 1/12
 DS_WRXCHG_RTN_B32      | 8      | 1/4
 DS_WRXCHG_RTN_B64      | 12     | 1/6
 DS_XOR_B32             | 8      | 1/4
 DS_XOR_B64             | 12     | 1/6
 DS_XOR_RTN_B32         | 8      | 1/4
 DS_XOR_RTN_B64         | 12     | 1/6
 DS_XOR_SRC2_B32        | 4      | 1/4
 DS_XOR_SRC2_B64        | 8      | 1/8

About bank conflicts: the LDS memory is partitioned into 32 banks. The bank number is in
bits 2-6 of the address. A bank conflict occurs when two addresses hit the same
bank but differ in bit 7 or higher
(the first 2 bits of the address do not matter).
Any bank conflict adds a penalty to timing and throughput. In the worst case, the throughput
is no greater than 1/32 requests per cycle.
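
The bank rule can be checked with a little bit arithmetic. The following sketch extracts the bank number from bits 2-6 of a byte address and tests two addresses for a conflict; the function names are hypothetical, and `worst_bank_load` is only an illustrative measure (the number of distinct dwords that land in the busiest bank), not a cycle-accurate model.

```python
NUM_BANKS = 32


def lds_bank(addr: int) -> int:
    """Bank number: bits 2-6 of the byte address."""
    return (addr >> 2) & (NUM_BANKS - 1)


def conflicts(a: int, b: int) -> bool:
    """True if both addresses hit the same bank but differ from bit 7 upwards."""
    return lds_bank(a) == lds_bank(b) and (a >> 7) != (b >> 7)


def worst_bank_load(addrs) -> int:
    """Largest number of distinct dwords that map to a single bank."""
    per_bank = {}
    for a in addrs:
        per_bank.setdefault(lds_bank(a), set()).add(a >> 7)
    return max(len(dwords) for dwords in per_bank.values())


if __name__ == "__main__":
    # Consecutive dwords fall into consecutive banks: no conflict.
    print(conflicts(0, 4))
    # A 128-byte stride maps every address to bank 0.
    print(worst_bank_load([i * 128 for i in range(32)]))
```

Under this model, a 128-byte stride is the pattern to avoid, since every access lands in the same bank.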
 
### MUBUF Instruction timings

Timings of MUBUF instructions include only execution on a single wavefront, without waiting for
the main memory access to complete. A GLCX term adds X cycles to the instruction
if the instruction uses the GLC modifier. Timings of MUBUF instructions are in this table:

 Instruction                | Cycles
----------------------------|-----------
 BUFFER_ATOMIC_ADD          | 16+GLC1
 BUFFER_ATOMIC_ADD_X2       | 16+GLC2
 BUFFER_ATOMIC_AND          | 16+GLC1
 BUFFER_ATOMIC_AND_X2       | 16
 BUFFER_ATOMIC_CMPSWAP      | 32
 BUFFER_ATOMIC_CMPSWAP_X2   | 32
 BUFFER_ATOMIC_DEC          | 16+GLC1
 BUFFER_ATOMIC_DEC_X2       | 16+GLC2
 BUFFER_ATOMIC_FCMPSWAP     | 32
 BUFFER_ATOMIC_FCMPSWAP_X2  | 32
 BUFFER_ATOMIC_FMAX         | 16+GLC1
 BUFFER_ATOMIC_FMAX_X2      | 16+GLC2
 BUFFER_ATOMIC_FMIN         | 16+GLC1
 BUFFER_ATOMIC_FMIN_X2      | 16+GLC2
 BUFFER_ATOMIC_INC          | 16+GLC1
 BUFFER_ATOMIC_INC_X2       | 16+GLC2
 BUFFER_ATOMIC_OR           | 16+GLC1
 BUFFER_ATOMIC_OR_X2        | 16+GLC2
 BUFFER_ATOMIC_RSUB         | 16+GLC1
 BUFFER_ATOMIC_RSUB_X2      | 16+GLC2
 BUFFER_ATOMIC_SMAX         | 16+GLC1
 BUFFER_ATOMIC_SMAX_X2      | 16+GLC2
 BUFFER_ATOMIC_SMIN         | 16+GLC1
 BUFFER_ATOMIC_SMIN_X2      | 16+GLC2
 BUFFER_ATOMIC_SUB          | 16+GLC1
 BUFFER_ATOMIC_SUB_X2       | 16+GLC2
 BUFFER_ATOMIC_SWAP         | 16+GLC1
 BUFFER_ATOMIC_SWAP_X2      | 16+GLC2
 BUFFER_ATOMIC_UMAX         | 16+GLC1
 BUFFER_ATOMIC_UMAX_X2      | 16+GLC2
 BUFFER_ATOMIC_UMIN         | 16+GLC1
 BUFFER_ATOMIC_UMIN_X2      | 16+GLC2
 BUFFER_ATOMIC_XOR          | 16+GLC1
 BUFFER_ATOMIC_XOR_X2       | 16+GLC2
 BUFFER_LOAD_DWORD          | 8
 BUFFER_LOAD_DWORDX2        | 18
 BUFFER_LOAD_DWORDX3        | 16
 BUFFER_LOAD_DWORDX4        | 16
 BUFFER_LOAD_FORMAT_X       | 8
 BUFFER_LOAD_FORMAT_XY      | 18?
 BUFFER_LOAD_FORMAT_XYZ     | 16
 BUFFER_LOAD_FORMAT_XYZW    | 16
 BUFFER_LOAD_SBYTE          | 8
 BUFFER_LOAD_SSHORT         | 8
 BUFFER_LOAD_UBYTE          | 8
 BUFFER_LOAD_USHORT         | 8
 BUFFER_STORE_BYTE          | 16
 BUFFER_STORE_DWORD         | 16
 BUFFER_STORE_DWORDX2       | 16
 BUFFER_STORE_DWORDX3       | 16
 BUFFER_STORE_DWORDX4       | 16
 BUFFER_STORE_FORMAT_X      | 16
 BUFFER_STORE_FORMAT_XY     | 16
 BUFFER_STORE_FORMAT_XYZ    | 16
 BUFFER_STORE_FORMAT_XYZW   | 16
 BUFFER_STORE_SHORT         | 16
 BUFFER_WBINVL1             | ?
 BUFFER_WBINVL1_SC          | ?
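
As an illustration of the GLCX notation, the sketch below resolves a Cycles entry from the MUBUF table for an instruction with or without the GLC modifier. The function name `mubuf_cycles` is hypothetical, and entries that are unknown or uncertain in the table (`?`, `18?`) are intentionally not handled.

```python
def mubuf_cycles(entry: str, glc: bool) -> int:
    """Resolve a MUBUF Cycles entry such as "16+GLC1" or "32".

    The X in GLCX is added only when the instruction uses the GLC modifier.
    """
    base, _, extra = entry.strip().partition("+GLC")
    cycles = int(base)
    if extra and glc:
        cycles += int(extra)
    return cycles


if __name__ == "__main__":
    print(mubuf_cycles("16+GLC1", glc=True))   # BUFFER_ATOMIC_ADD with GLC
    print(mubuf_cycles("16+GLC2", glc=False))  # BUFFER_ATOMIC_ADD_X2 without GLC
```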