## GCN ISA Instruction Timings

### Preliminary explanations

Almost all instructions (scalar and vector) are executed within 4 cycles. Hence, to achieve maximum performance, 4 wavefronts should be executed per compute unit.

NOTE: a simple single-dword (4-byte) instruction is executed in 4 cycles (thanks to fast dispatching from cache). However, a 2-dword (8-byte) instruction may require 4 extra cycles because of its larger size in memory and the limits of instruction dispatching. For best performance, we recommend using single-dword instructions.

A DPFACTOR term is present in some tables; it indicates that the number of cycles depends on the GPU model as follows:

DPFACTOR      | DP speed | GPU subfamily
--------------|----------|----------------------------
1             | 1/2      | professional Hawaii
2             | 1/4      | Highend Tahiti: Radeon HD7970
4             | 1/8      | Highend Hawaii: R9 290
8             | 1/16     | Other GPUs

### Occupancy table

Waves | SGPRs | VGPRs | LdsW/I | Issue
------|-------|-------|--------|---------
1     | 128   | 256   | 64     | 1
2     | 128   | 128   | 32     | 2
3     | 128   | 84    | 21     | 3
4     | 128   | 64    | 16     | 4
5     | 96    | 48    | 12     | 5
6     | 80    | 40    | 10     | 5
7     | 72    | 36    | 9      | 5
8     | 64    | 32    | 8      | 5
9     | 56    | 28    | 7      | 5
10    | 48    | 24    | 6      | 5

* Waves - number of concurrent waves that can be computed by a single SIMD unit
* SGPRs - maximum number of SGPRs that can be allocated at that occupancy
* VGPRs - maximum number of VGPRs that can be allocated at that occupancy
* LdsW/I - maximum amount of LDS space per vector lane per wavefront, in dwords
* Issue - maximum number of instructions per clock

Each compute unit is partitioned into four SIMD units. So, the maximum number of waves per compute unit is 40.

### Instruction alignment

Alignment rules for 2-dword instructions (GCN 1.0/1.1):

* any penalty costs 4 cycles
* the program is divided into 32-byte blocks
* only the first 3 dwords in a 32-byte block incur no penalty. Any 2-dword instruction outside these first 3 dwords adds a single penalty.
* if an instruction is longer (more than four cycles), then the last cycles/4 dwords are free
* if a 2-dword instruction that takes 16 or more cycles is followed by a 2-dword instruction in dword 4, then there is no penalty for the second 2-dword instruction.
* the best place to jump to is the first 5 dwords of a 32-byte block. A jump to any of the remaining dwords causes 1-3 penalties, depending on the dword number (N-4, where N is the dword number). This rule does not apply to backward jumps (???)
* any conditional jump instruction should be in the first half of a 32-byte block; otherwise 1-4 penalties are added if the jump is not taken, depending on the dword number (N-3, where N is the dword number).

IMPORTANT: If the occupancy is greater than 1 wave per compute unit, then the penalties, branches, and scalar instructions will be masked while executing more waves than 4\*CUs. For best results, it is recommended to execute many waves (a multiple of 4\*CUs) with occupancy greater than 1.

GCN 1.2 always executes instructions at full speed if they are in the instruction cache. GCN 1.2 can fetch 2-dword instructions at full speed.

### Instruction scheduling

* if many wavefronts are executed in a single CU, then the scalar, vector, data-share and memory (???) execution units can run independently in parallel, achieving many instructions per cycle.
* between any integer V_ADD\*, V_SUB\*, V_READFIRSTLANE_B32, V_READLANE_B32 operation and any scalar ALU instruction there is a 16-cycle delay. Masked if there are more waves than 4\*CUs.
* any conditional jump that directly checks VCCZ or EXECZ after an instruction that changes VCC or EXEC adds a single penalty (4 cycles). Masked if there are more waves than 4\*CUs.
* any conditional jump that directly checks SCC after an instruction that changes SCC, EXEC or VCC adds a single penalty (4 cycles). Masked if there are more waves than 4\*CUs.

### SOP2 Instruction timings

All SOP2 instructions (S_CBRANCH_G_FORK not checked) take 4 cycles.

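
As a worked illustration of the jump rules listed under "Instruction alignment" above, the following Python sketch counts alignment penalties for GCN 1.0/1.1. It is only a sketch under stated assumptions: dwords inside a 32-byte block are numbered from 0 (this makes N-4 give 1-3 and N-3 give 1-4, matching the stated ranges), blocks are assumed to start at addresses that are multiples of 32, and the function names are hypothetical. The uncertain exception for backward jumps is not modelled.

```python
PENALTY_CYCLES = 4      # "any penalty costs 4 cycles"
DWORDS_PER_BLOCK = 8    # a 32-byte block holds eight 4-byte dwords


def dword_in_block(byte_address: int) -> int:
    """0-based dword number inside the 32-byte block (numbering is an assumption)."""
    return (byte_address // 4) % DWORDS_PER_BLOCK


def jump_target_penalties(target_address: int) -> int:
    """Penalties for jumping to this address: none for dwords 0-4, N-4 otherwise."""
    return max(0, dword_in_block(target_address) - 4)


def untaken_cond_jump_penalties(jump_address: int) -> int:
    """Penalties for a conditional jump that is not taken: none for dwords 0-3, N-3 otherwise."""
    return max(0, dword_in_block(jump_address) - 3)


# A jump landing on dword 6 of a block costs 2 penalties, i.e. 8 extra cycles.
extra_cycles = jump_target_penalties(0x38) * PENALTY_CYCLES
print(extra_cycles)
```

Per the note above, these penalties are masked when more waves than 4\*CUs are executing, so the sketch describes the unmasked case only.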
### SOPK Instruction timings

All SOPK instructions (S_CBRANCH_I_FORK not checked) take 4 cycles, except S_SETREG_B32 and S_SETREG_IMM32_B32, which take 8 cycles.

### SOP1 Instruction timings

The S_*_SAVEEXEC_B64 instructions take 8 cycles. Other ALU instructions (except S_MOV_REGRD_B32, S_CBRANCH_JOIN, S_RFE_B64) take 4 cycles.

### SOPC Instruction timings

All comparison and bit checking instructions take 4 cycles.

### SOPP Instruction timings

Jumps cost 4 cycles (no jump) or 20 cycles (???) if the jump is performed.

### SMRD Instruction timings

Timings of SMRD instructions include only the time to fetch and execute the instruction, without loading data from memory.

Timings of SMRD instructions are in this table:

 Instruction           | Cycles        | Instruction           | Cycles
-----------------------|---------------|-----------------------|---------------
S_BUFFER_LOAD_DWORD    | 4             | S_LOAD_DWORD          | 4
S_BUFFER_LOAD_DWORDX2  | 4             | S_LOAD_DWORDX2        | 4
S_BUFFER_LOAD_DWORDX4  | 4             | S_LOAD_DWORDX4        | 4
S_BUFFER_LOAD_DWORDX8  | 8             | S_LOAD_DWORDX8        | 8
S_BUFFER_LOAD_DWORDX16 | 16-24         | S_LOAD_DWORDX16       | 16-24
S_DCACHE_INV           | 4             | S_MEMTIME             | 4
S_DCACHE_INV_VOL       | 4             |                       |

### VOP2 Instruction timings

All VOP2 instructions take 4 cycles. All instructions can achieve a throughput of 1 instruction per cycle.

### VOP1 Instruction timings

Maximum throughput of these instructions can be calculated using the expression `(1/(CYCLES/4))`: for 4 cycles it is 1 instruction per cycle, for 8 cycles it is 1/2 instruction per cycle, etc.

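
The throughput expression above can be evaluated directly. The short Python sketch below does so with exact fractions and also expands table entries written as `DPFACTOR*4` or `DPFACTOR*8`; the function names are hypothetical, and the DPFACTOR value has to be taken from the table in "Preliminary explanations" for the GPU in question.

```python
from fractions import Fraction


def max_throughput(cycles: int) -> Fraction:
    """Maximum instructions per cycle, i.e. 1/(CYCLES/4)."""
    if cycles <= 0:
        raise ValueError("cycles must be positive")
    return Fraction(4, cycles)


def dp_cycles(dpfactor: int, multiplier: int) -> int:
    """Cycle count for a table entry of the form DPFACTOR*multiplier."""
    return dpfactor * multiplier


print(max_throughput(4))                 # 4-cycle instruction
print(max_throughput(16))                # 16-cycle instruction
print(max_throughput(dp_cycles(8, 8)))   # a DPFACTOR*8 entry with DPFACTOR = 8
```

Using `Fraction` keeps results such as 1/2 or 1/16 exact instead of rounding them to floating point.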
Timings of VOP1 instructions are in this table:

 Instruction           | Cycles        | Instruction           | Cycles
-----------------------|---------------|-----------------------|---------------
V_BFREV_B32            | 4             | V_FREXP_EXP_I32_F32   | 4
V_CEIL_F16             | 4             | V_FREXP_EXP_I32_F64   | DPFACTOR*4
V_CEIL_F32             | 4             | V_FREXP_MANT_F16      | 4
V_CEIL_F64             | DPFACTOR*4    | V_FREXP_MANT_F32      | 4
V_CLREXCP              | 4             | V_FREXP_MANT_F64      | DPFACTOR*4
V_COS_F16              | 16            | V_LOG_CLAMP_F32       | 16
V_COS_F32              | 16            | V_LOG_F16             | 16
V_CVT_F16_F32          | 4             | V_LOG_F32             | 16
V_CVT_F16_I16          | 4             | V_LOG_LEGACY_F32      | 16
V_CVT_F16_U16          | 4             | V_MBCNT_LO_U32_B32    | 4
V_CVT_F32_F16          | 4             | V_MBCNT_HI_U32_B32    | 4
V_CVT_F32_F64          | DPFACTOR*4    | V_MOVRELD_B32         | 4
V_CVT_F32_I32          | 4             | V_MOVRELSD_B32        | 4
V_CVT_F32_U32          | 4             | V_MOVRELS_B32         | 4
V_CVT_F32_UBYTE0       | 4             | V_MOV_B32             | 4
V_CVT_F32_UBYTE1       | 4             | V_MOV_FED_B32         | 4
V_CVT_F32_UBYTE2       | 4             | V_NOP                 | 4
V_CVT_F32_UBYTE3       | 4             | V_NOT_B32             | 4
V_CVT_F64_F32          | DPFACTOR*4    | V_RCP_CLAMP_F32       | 16
V_CVT_F64_I32          | DPFACTOR*4    | V_RCP_CLAMP_F64       | DPFACTOR*8
V_CVT_F64_U32          | DPFACTOR*4    | V_RCP_F16             | 16
V_CVT_FLR_I32_F32      | 4             | V_RCP_F32             | 16
V_CVT_I16_F16          | 4             | V_RCP_F64             | DPFACTOR*8
V_CVT_I32_F32          | 4             | V_RCP_IFLAG_F32       | 16
V_CVT_I32_F64          | DPFACTOR*4    | V_RCP_LEGACY_F32      | 16
V_CVT_OFF_F32_I4       | 4             | V_READFIRSTLANE_B32   | 4
V_CVT_RPI_I32_F32      | 4             | V_RNDNE_F16           | 4
V_CVT_U16_F16          | 4             | V_RNDNE_F32           | 4
V_CVT_U32_F32          | 4             | V_RNDNE_F64           | DPFACTOR*4
V_CVT_U32_F64          | DPFACTOR*4    | V_RSQ_CLAMP_F32       | 16
V_EXP_F16              | 16            | V_RSQ_CLAMP_F64       | DPFACTOR*8
V_EXP_F32              | 16            | V_RSQ_F16             | 16
V_EXP_LEGACY_F32       | 16            | V_RSQ_F32             | 16
V_FFBH_I32             | 4             | V_RSQ_F64             | DPFACTOR*8
V_FFBH_U32             | 4             | V_RSQ_LEGACY_F32      | 16
V_FFBL_B32             | 4             | V_SIN_F16             | 16
V_FLOOR_F16            | 4             | V_SIN_F32             | 16
V_FLOOR_F32            | 4             | V_SQRT_F16            | 16
V_FLOOR_F64            | DPFACTOR*4    | V_SQRT_F32            | 16
V_FRACT_F16            | 4             | V_SQRT_F64            | DPFACTOR*8
V_FRACT_F32            | 4             | V_TRUNC_F16           | 4
V_FRACT_F64            | DPFACTOR*4    | V_TRUNC_F32           | 4
V_FREXP_EXP_I16_F16    | 4             | V_TRUNC_F64           | DPFACTOR*4

### VOPC Instruction timings

Maximum throughput
of these instructions can be calculated using the expression `(1/(CYCLES/4))`: for 4 cycles it is 1 instruction per cycle, for 8 cycles it is 1/2 instruction per cycle, etc.

All 16-bit and 32-bit comparison instructions take 4 cycles. All 64-bit comparison instructions take DPFACTOR*4 cycles.

### VOP3 Instruction timings

Maximum throughput of these instructions can be calculated using the expression `(1/(CYCLES/4))`: for 4 cycles it is 1 instruction per cycle, for 8 cycles it is 1/2 instruction per cycle, etc.

Timings of VOP3 instructions are in this table:

 Instruction           | Cycles        | Instruction           | Cycles
-----------------------|---------------|-----------------------|---------------
V_ADD_F64              | DPFACTOR*4    | V_MAD_LEGACY_F32      | 4
V_ALIGNBIT_B32         | 4             | V_MAD_U16             | 4
V_ALIGNBYTE_B32        | 4             | V_MAD_U32_U24         | 4
V_ASHR_I64             | DPFACTOR*4    | V_MAD_U64_U32         | 16
V_ASHRREV_I64          | DPFACTOR*4    | V_MAX3_F32            | 4
V_BFE_I32              | 4             | V_MAX3_I32            | 4
V_BFE_U32              | 4             | V_MAX3_U32            | 4
V_BFI_B32              | 4             | V_MAX_F64             | DPFACTOR*4
V_CUBEID_F32           | 4             | V_MED3_F32            | 4
V_CUBEMA_F32           | 4             | V_MED3_I32            | 4
V_CUBESC_F32           | 4             | V_MED3_U32            | 4
V_CUBETC_F32           | 4             | V_MIN3_F32            | 4
V_CVT_PK_U8_F32        | 4             | V_MIN3_I32            | 4
V_DIV_FIXUP_F32        | 16            | V_MIN3_U32            | 4
V_DIV_FIXUP_F64        | DPFACTOR*4    | V_MIN_F64             | DPFACTOR*4
V_DIV_FMAS_F32         | 16            | V_MQSAD_PK_U16_U8     | 16
V_DIV_FMAS_F64         | DPFACTOR*8    | V_MQSAD_U32_U8        | 16
V_DIV_SCALE_F32        | 16            | V_MQSAD_U8            | 16
V_DIV_SCALE_F64        | DPFACTOR*4    | V_MSAD_U8             | 4
V_MAD_F16              | 4             | V_MULLIT_F32          | 4
V_FMA_F32              | 4 or 16 (1)   | V_MUL_F64             | DPFACTOR*8
V_FMA_F64              | DPFACTOR*8    | V_MUL_HI_I32          | 16
V_LDEXP_F64            | DPFACTOR*4    | V_MUL_HI_U32          | 16
V_LERP_U8              | 4             | V_MUL_LO_I32          | 16
V_LSHL_B64             | DPFACTOR*4    | V_MUL_LO_U32          | 16
V_LSHLREV_B64          | DPFACTOR*4    | V_QSAD_PK_U16_U8      | 16
V_LSHR_B64             | DPFACTOR*4    | V_QSAD_U8             | 16
V_LSHRREV_B64          | DPFACTOR*4    | V_SAD_HI_U8           | 4
V_MAD_F16              | 4             | V_SAD_U16             | 4
V_MAD_F32              | 4             | V_SAD_U32             | 4
V_MAD_I16              | 4             | V_SAD_U8              | 4
V_MAD_I32_I24          | 4             | V_TRIG_PREOP_F64      | DPFACTOR*8
V_MAD_I64_I32          | 16            |                       |

(1) -
for devices with DP speed 1/2, 1/4 or 1/8 it is 4 cycles; for other devices it is 16 cycles

### DS Instruction timings

Timings of DS instructions include only execution on a single wavefront, without waiting for the LDS/GDS memory access to complete. Throughput indicates the maximum possible throughput, excluding any other delays and penalties.

Timings of DS instructions are in this table:

 Instruction            | Cycles | Throughput
------------------------|--------|------------
DS_ADD_RTN_U32          | 8      | 1/4
DS_ADD_RTN_U64          | 12     | 1/6
DS_ADD_SRC2_U32         | 4      | 1/4
DS_ADD_SRC2_U64         | 8      | 1/8
DS_ADD_U32              | 8      | 1/4
DS_ADD_U64              | 12     | 1/6
DS_AND_B32              | 8      | 1/4
DS_AND_B64              | 12     | 1/6
DS_AND_RTN_B32          | 8      | 1/4
DS_AND_RTN_B64          | 12     | 1/6
DS_AND_SRC2_B32         | 4      | 1/4
DS_AND_SRC2_B64         | 8      | 1/8
DS_APPEND               | 4      | ?
DS_CMPST_B32            | 12     | 1/6
DS_CMPST_B64            | 20     | 1/10
DS_CMPST_F32            | 12     | 1/6
DS_CMPST_F64            | 20     | 1/10
DS_CMPST_RTN_B32        | 12     | 1/6
DS_CMPST_RTN_B64        | 20     | 1/10
DS_CMPST_RTN_F32        | 12     | 1/6
DS_CMPST_RTN_F64        | 20     | 1/10
DS_CONDXCHG32_RTN_B128  | ?      | ?
DS_CONDXCHG32_RTN_B64   | ?      | ?
DS_CONSUME              | 4      | ?
DS_DEC_RTN_U32          | 8      | 1/4
DS_DEC_RTN_U64          | 12     | 1/6
DS_DEC_SRC2_U32         | 4      | 1/4
DS_DEC_SRC2_U64         | 8      | 1/8
DS_DEC_U32              | 8      | 1/4
DS_DEC_U64              | 12     | 1/6
DS_GWS_BARRIER          | ?      | ?
DS_GWS_INIT             | ?      | ?
DS_GWS_SEMA_BR          | ?      | ?
DS_GWS_SEMA_P           | ?      | ?
DS_GWS_SEMA_RELEASE_ALL | ?      | ?
DS_GWS_SEMA_V           | ?      | ?
DS_INC_RTN_U32          | 8      | 1/4
DS_INC_RTN_U64          | 12     | 1/6
DS_INC_SRC2_U32         | 4      | 1/4
DS_INC_SRC2_U64         | 8      | 1/8
DS_INC_U32              | 8      | 1/4
DS_INC_U64              | 12     | 1/6
DS_MAX_F32              | 8      | 1/4
DS_MAX_F64              | 12     | 1/6
DS_MAX_I32              | 8      | 1/4
DS_MAX_I64              | 12     | 1/6
DS_MAX_RTN_F32          | 8      | 1/4
DS_MAX_RTN_F64          | 12     | 1/6
DS_MAX_RTN_I32          | 8      | 1/4
DS_MAX_RTN_I64          | 12     | 1/6
DS_MAX_RTN_U32          | 8      | 1/4
DS_MAX_RTN_U64          | 12     | 1/6
DS_MAX_SRC2_F32         | 4      | 1/4
DS_MAX_SRC2_F64         | 8      | 1/8
DS_MAX_SRC2_I32         | 4      | 1/4
DS_MAX_SRC2_I64         | 8      | 1/8
DS_MAX_SRC2_U32         | 4      | 1/4
DS_MAX_SRC2_U64         | 8      | 1/8
DS_MAX_U32              | 8      | 1/4
DS_MAX_U64              | 12     | 1/6
DS_MIN_F32              | 8      | 1/4
DS_MIN_F64              | 12     | 1/6
DS_MIN_I32              | 8      | 1/4
DS_MIN_I64              | 12     | 1/6
DS_MIN_RTN_F32          | 8      | 1/4
DS_MIN_RTN_F64          | 12     | 1/6
DS_MIN_RTN_I32          | 8      | 1/4
DS_MIN_RTN_I64          | 12     | 1/6
DS_MIN_RTN_U32          | 8      | 1/4
DS_MIN_RTN_U64          | 12     | 1/6
DS_MIN_SRC2_F32         | 4      | 1/4
DS_MIN_SRC2_F64         | 8      | 1/8
DS_MIN_SRC2_I32         | 4      | 1/4
DS_MIN_SRC2_I64         | 8      | 1/8
DS_MIN_SRC2_U32         | 4      | 1/4
DS_MIN_SRC2_U64         | 8      | 1/8
DS_MIN_U32              | 8      | 1/4
DS_MIN_U64              | 12     | 1/6
DS_MSKOR_B32            | 12     | 1/6
DS_MSKOR_B64            | 20     | 1/10
DS_MSKOR_RTN_B32        | 12     | 1/6
DS_MSKOR_RTN_B64        | 20     | 1/10
DS_NOP                  | 4      | ?
DS_ORDERED_COUNT (???)  | ?      | ?
DS_OR_B32               | 8      | 1/4
DS_OR_B64               | 12     | 1/6
DS_OR_RTN_B32           | 8      | 1/4
DS_OR_RTN_B64           | 12     | 1/6
DS_OR_SRC2_B32          | 4      | 1/4
DS_OR_SRC2_B64          | 8      | 1/8
DS_READ2ST64_B32        | 8      | 1/4
DS_READ2ST64_B64        | 16     | 1/8
DS_READ2_B32            | 8      | 1/4
DS_READ2_B64            | 16     | 1/8
DS_READ_B128            | 16     | 1/8
DS_READ_B32             | 4      | 1/2
DS_READ_B64             | 8      | 1/4
DS_READ_B96             | 16     | 1/8
DS_READ_I16             | 4      | 1/2
DS_READ_I8              | 4      | 1/2
DS_READ_U16             | 4      | 1/2
DS_READ_U8              | 4      | 1/2
DS_RSUB_RTN_U32         | 8      | 1/4
DS_RSUB_RTN_U64         | 12     | 1/6
DS_RSUB_SRC2_U32        | 4      | 1/4
DS_RSUB_SRC2_U64        | 8      | 1/8
DS_RSUB_U32             | 8      | 1/4
DS_RSUB_U64             | 12     | 1/6
DS_SUB_RTN_U32          | 8      | 1/4
DS_SUB_RTN_U64          | 12     | 1/6
DS_SUB_SRC2_U32         | 4      | 1/4
DS_SUB_SRC2_U64         | 8      | 1/8
DS_SUB_U32              | 8      | 1/4
DS_SUB_U64              | 12     | 1/6
DS_SWIZZLE_B32          | 4      | 1/2
DS_WRAP_RTN_B32         | ?      | ?
DS_WRITE2ST64_B32       | 12     | 1/6
DS_WRITE2ST64_B64       | 20     | 1/10
DS_WRITE2_B32           | 12     | 1/6
DS_WRITE2_B64           | 20     | 1/10
DS_WRITE_B128           | 20     | 1/10
DS_WRITE_B16            | 8      | 1/4
DS_WRITE_B32            | 8      | 1/4
DS_WRITE_B64            | 12     | 1/8
DS_WRITE_B8             | 8      | 1/4
DS_WRITE_B96            | 16     | 1/10
DS_WRITE_SRC2_B32       | 12     | 1/4
DS_WRITE_SRC2_B64       | 20     | 1/8
DS_WRXCHG2ST64_RTN_B32  | 12     | 1/6
DS_WRXCHG2ST64_RTN_B64  | 20     | 1/12
DS_WRXCHG2_RTN_B32      | 12     | 1/6
DS_WRXCHG2_RTN_B64      | 20     | 1/12
DS_WRXCHG_RTN_B32       | 8      | 1/4
DS_WRXCHG_RTN_B64       | 12     | 1/6
DS_XOR_B32              | 8      | 1/4
DS_XOR_B64              | 12     | 1/6
DS_XOR_RTN_B32          | 8      | 1/4
DS_XOR_RTN_B64          | 12     | 1/6
DS_XOR_SRC2_B32         | 4      | 1/4
DS_XOR_SRC2_B64         | 8      | 1/8

About bank conflicts: The LDS memory is partitioned into 32 banks. The bank number is in bits 2-6 of the address. A bank conflict occurs when two addresses hit the same bank but differ from bit 7 upward (the first 2 bits of the address do not matter). Any bank conflict adds a penalty to timing and throughput. In the worst case, the throughput can be no greater than 1/32 requests per cycle.

### MUBUF Instruction timings

Timings of MUBUF instructions include only execution on a single wavefront, without waiting for the main memory access to complete. An additional GLCX term adds X cycles to the instruction if the instruction uses the GLC modifier.

Timings of MUBUF instructions are in this table:

 Instruction                | Cycles
----------------------------|-----------
BUFFER_ATOMIC_ADD           | 16+GLC1
BUFFER_ATOMIC_ADD_X2        | 16+GLC2
BUFFER_ATOMIC_AND           | 16+GLC1
BUFFER_ATOMIC_AND_X2        | 16
BUFFER_ATOMIC_CMPSWAP       | 32
BUFFER_ATOMIC_CMPSWAP_X2    | 32
BUFFER_ATOMIC_DEC           | 16+GLC1
BUFFER_ATOMIC_DEC_X2        | 16+GLC2
BUFFER_ATOMIC_FCMPSWAP      | 32
BUFFER_ATOMIC_FCMPSWAP_X2   | 32
BUFFER_ATOMIC_FMAX          | 16+GLC1
BUFFER_ATOMIC_FMAX_X2       | 16+GLC2
BUFFER_ATOMIC_FMIN          | 16+GLC1
BUFFER_ATOMIC_FMIN_X2       | 16+GLC2
BUFFER_ATOMIC_INC           | 16+GLC1
BUFFER_ATOMIC_INC_X2        | 16+GLC2
BUFFER_ATOMIC_OR            | 16+GLC1
BUFFER_ATOMIC_OR_X2         | 16+GLC2
BUFFER_ATOMIC_RSUB          | 16+GLC1
BUFFER_ATOMIC_RSUB_X2       | 16+GLC2
BUFFER_ATOMIC_SMAX          | 16+GLC1
BUFFER_ATOMIC_SMAX_X2       | 16+GLC2
BUFFER_ATOMIC_SMIN          | 16+GLC1
BUFFER_ATOMIC_SMIN_X2       | 16+GLC2
BUFFER_ATOMIC_SUB           | 16+GLC1
BUFFER_ATOMIC_SUB_X2        | 16+GLC2
BUFFER_ATOMIC_SWAP          | 16+GLC1
BUFFER_ATOMIC_SWAP_X2       | 16+GLC2
BUFFER_ATOMIC_UMAX          | 16+GLC1
BUFFER_ATOMIC_UMAX_X2       | 16+GLC2
BUFFER_ATOMIC_UMIN          | 16+GLC1
BUFFER_ATOMIC_UMIN_X2       | 16+GLC2
BUFFER_ATOMIC_XOR           | 16+GLC1
BUFFER_ATOMIC_XOR_X2        | 16+GLC2
BUFFER_LOAD_DWORD           | 8
BUFFER_LOAD_DWORDX2         | 18
BUFFER_LOAD_DWORDX3         | 16
BUFFER_LOAD_DWORDX4         | 16
BUFFER_LOAD_FORMAT_X        | 8
BUFFER_LOAD_FORMAT_XY       | 18?
BUFFER_LOAD_FORMAT_XYZ      | 16
BUFFER_LOAD_FORMAT_XYZW     | 16
BUFFER_LOAD_SBYTE           | 8
BUFFER_LOAD_SSHORT          | 8
BUFFER_LOAD_UBYTE           | 8
BUFFER_LOAD_USHORT          | 8
BUFFER_STORE_BYTE           | 16
BUFFER_STORE_DWORD          | 16
BUFFER_STORE_DWORDX2        | 16
BUFFER_STORE_DWORDX3        | 16
BUFFER_STORE_DWORDX4        | 16
BUFFER_STORE_FORMAT_X       | 16
BUFFER_STORE_FORMAT_XY      | 16
BUFFER_STORE_FORMAT_XYZ     | 16
BUFFER_STORE_FORMAT_XYZW    | 16
BUFFER_STORE_SHORT          | 16
BUFFER_WBINVL1              | ?
BUFFER_WBINVL1_SC           | ?
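
To make the `16+GLC1` / `16+GLC2` notation in the MUBUF table concrete, here is a small Python sketch that turns a Cycles entry into a cycle count with or without the GLC modifier. The function name and the string-parsing approach are illustrative assumptions, not part of any tool; entries that are unknown or uncertain in the table (`?`, `18?`) are deliberately rejected rather than guessed.

```python
import re


def mubuf_cycles(entry: str, glc: bool) -> int:
    """Cycle count for a Cycles entry such as "16", "32" or "16+GLC2".

    The GLC term is added only when the instruction uses the GLC modifier.
    """
    match = re.fullmatch(r"(\d+)(?:\+GLC(\d+))?", entry.strip())
    if match is None:
        raise ValueError(f"unsupported or unknown timing entry: {entry!r}")
    base = int(match.group(1))
    glc_extra = int(match.group(2) or 0)
    return base + (glc_extra if glc else 0)


print(mubuf_cycles("16+GLC1", glc=True))    # BUFFER_ATOMIC_ADD with GLC
print(mubuf_cycles("16+GLC2", glc=False))   # BUFFER_ATOMIC_ADD_X2 without GLC
```

As stated for this section, these counts cover execution only and exclude the time spent waiting for main memory.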