Maxwell Instruction Set Architecture

Instruction Set

Quick Links

AL2P ALD AST ATOM ATOMS B2R BAR BFE BFI BPT BRA BRK BRX CAL CCTL CCTLL CCTLT CONT CS2R CSET CSETP DADD DEPBAR DFMA DMNMX DMUL DSET DSETP EXIT F2F F2I FADD FADD32I FCHK FCMP FFMA FFMA32I FLO FMNMX FMUL FMUL32I FSET FSETP FSWZADD GETCRSPTR GETLMEMBASE HADD2 HADD2_32I HFMA2 HFMA2_32I HMUL2 HMUL2_32I HSET2 HSETP2 I2F I2I IADD IADD3 IADD32I ICMP IDE IMAD IMAD32I IMADSP IMNMX IMUL IMUL32I IPA ISBERD ISCADD ISCADD32I ISET ISETP JCAL JMP JMX KIL LD LDC LDG LDL LDS LEA LEPC LONGJMP LOP LOP3 LOP32I MEMBAR MOV MOV32I MUFU NOP OUT P2R PBK PCNT PEXIT PIXLD PLONGJMP POPC PRET PRMT PSET PSETP R2B R2P RAM RED RET RRO RTT S2R SAM SEL SETCRSPTR SETLMEMBASE SHF SHFL SHL SHR SSY ST STG STL STP STS SUATOM SULD SURED SUST SYNC TEX TEXS TLD TLD4 TLD4S TLDS TMML TXA TXD TXQ VABSDIFF VABSDIFF4 VADD VMAD VMNMX VOTE VSET VSETP VSHL VSHR XMAD

Index of Instructions

Index of Floating Point Instructions
Opcode Description
FADD FP32 Add
FADD32I FP32 Add
FCHK Single Precision FP Divide Range Check
FCMP FP32 Compare to Zero and Select Source
FFMA FP32 Fused Multiply and Add
FFMA32I FP32 Fused Multiply and Add
FMNMX FP32 Minimum/Maximum
FMUL FP32 Multiply
FMUL32I FP32 Multiply
FSET FP32 Compare And Set
FSETP FP32 Compare And Set Predicate
FSWZADD FP32 Add used for FSWZ emulation
IPA Interpolate Attribute
MUFU Multi Function Operation
RRO Range Reduction Operator FP
DADD FP64 Add
DFMA FP64 Fused Multiply Add
DMNMX FP64 Minimum/Maximum
DMUL FP64 Multiply
DSET FP64 Compare And Set
DSETP FP64 Compare And Set Predicate
HADD2 FP16 SIMD Addition
HADD2_32I FP16 SIMD Addition
HFMA2 FP16 SIMD Fused Multiply and Add
HFMA2_32I FP16 SIMD Fused Multiply and Add
HMUL2 FP16 SIMD Multiply
HMUL2_32I FP16 SIMD Multiply
HSET2 FP16 SIMD Compare and Set
HSETP2 FP16 SIMD Compare and Set Predicate

Index of Integer Instructions
Opcode Description
BFE Bit Field Extract
BFI Bit Field Insert
FLO Find Leading One
IADD Integer Addition
IADD3 3-input Integer Addition
IADD32I Integer Addition
ICMP Integer Compare to Zero and Select Source
IMAD Integer Multiply And Add
IMAD32I Integer Multiply And Add
IMADSP Extracted Integer Multiply And Add
IMNMX Integer Minimum/Maximum
IMUL Integer Multiply
IMUL32I Integer Multiply
ISCADD Scaled Integer Addition
ISCADD32I Scaled Integer Addition
ISET Integer Compare And Set
ISETP Integer Compare And Set Predicate
LEA Compute Effective Address
LOP Logic Operation
LOP3 3-input Logic Operation
LOP32I Logic Operation
POPC Population count
SHF Funnel Shift
SHL Shift Left
SHR Shift Right
XMAD Integer Short Multiply Add

Index of Video Instructions
Opcode Description
VABSDIFF Integer Byte/Short Absolute Difference
VADD Integer Byte/Short Addition
VMAD Integer Byte/Short Multiply Add
VMNMX Integer Byte/Short Minimum/Maximum
VSET Integer Byte/Short Set
VSETP Integer Byte/Short Compare And Set Predicate
VSHL Integer Byte/Short Shift Left
VSHR Integer Byte/Short Shift Right
VABSDIFF4 Integer SIMD Byte Absolute Difference

Index of Conversion Instructions
Opcode Description
F2F Floating Point To Floating Point Conversion
F2I Floating Point To Integer Conversion
I2F Integer To Floating Point Conversion
I2I Integer To Integer Conversion

Index of Movement Instructions
Opcode Description
MOV Move
MOV32I Move
PRMT Permute Register Pair
SEL Select Source with Predicate
SHFL Warp Wide Register Shuffle

Index of Predicate/CC Instructions
Opcode Description
CSET Test Condition Code And Set
CSETP Test Condition Code and Set Predicate
PSET Combine Predicates and Set
PSETP Combine Predicates and Set Predicate
P2R Move Predicate Register To Register
R2P Move Register To Predicate/CC Register

Index of Texture Instructions
Opcode Description
TEX Texture Fetch
TLD Texture Load
TLD4 Texture Load 4
TMML Texture MipMap Level
TXA Texture Virtual AA
TXD Texture Fetch With Derivatives
TXQ Texture Query
TEXS Texture Fetch with scalar/non-vec4 source/destinations
TLD4S Texture Load 4 with scalar/non-vec4 source/destinations
TLDS Texture Load with scalar/non-vec4 source/destinations
STP Set Texture Phase

Index of Graphics Load/Store Instructions
Opcode Description
AL2P Attribute Logical to physical (translate)
ALD Attribute Load
AST Attribute Store
ISBERD Read from ISBE structures used by VTG shaders
OUT Output Token
PIXLD Pixel Load

Index of Compute Load/Store Instructions
Opcode Description
LD Load from generic Memory
LDC Load Constant
LDG Load from Global Memory
LDL Load within Local Memory Window
LDS Load within Shared Memory Window
ST Store to generic Memory
STG Store to global Memory
STL Store within Local or Shared Window
STS Store within Local or Shared Window
ATOM Atomic Operation on generic Memory
ATOMS Atomic Operation on Shared Memory
RED Reduction Operation on generic Memory
CCTL Cache Control
CCTLL Cache Control
MEMBAR Memory Barrier
CCTLT Texture Cache Control
SUATOM Surface Reduction
SULD Surface Load
SURED Atomic Reduction on surface memory
SUST Surface Store

Index of Control Instructions
Opcode Description
BRA Relative Branch
BRX Relative Branch Indirect
JMP Absolute Jump
JMX Absolute Jump Indirect
SSY Set Synchronization Point
SYNC Converge threads after conditional branch
CAL Relative Call
JCAL Absolute Call
PRET Pre-Return From Subroutine
RET Return From Subroutine
BRK Break
PBK Pre-Break
CONT Continue
PCNT Pre-continue
EXIT Exit Program
PEXIT Pre-Exit
LONGJMP Long-Jump
PLONGJMP Pre-Long-Jump
KIL Kill Thread
BPT BreakPoint/Trap
IDE Interrupt disable/enable
RAM Restore Active Mask
RTT Return From Trap
SAM Set Active Mask

Index of Miscellaneous Instructions
Opcode Description
NOP No Operation
CS2R Move Special Register to Register
S2R Move Special Register to Register
LEPC Load Effective Program Counter
B2R Move Barrier To Register
BAR Barrier Synchronization
R2B Move Register to Barrier
VOTE Vote Across SIMD Thread Group
DEPBAR Dependency Barrier
GETCRSPTR Get Call Return Stack Pointer
GETLMEMBASE Get Local Memory Base Pointer
SETCRSPTR Set Call Return Stack Pointer
SETLMEMBASE Set Local Memory Base Pointer

NVN Constant Buffer Accesses

This section describes how to interpret constant buffer accesses, such as uniform references, in the SASS dump.

Assembly instructions may fetch values from a bound constant buffer in arithmetic or general load instructions. For example, the instruction:
            MOV             R4, c[0xa][0x0];                              # [000128]
moves data from a constant buffer into register R4. References to constant buffer memory in disassembled instructions have the form "c[A][B]", where "c" indicates a reference to a constant buffer in GPU hardware. The first index ("[0xa]" in the above example) selects the constant bank (buffer binding) that the instruction reads from. The second index ("[0x0]" in the above example) is the byte offset into that bank. A total of 18 constant banks are available per shader stage. Four banks are reserved by the compiler and the NVN implementation for internal data, such as driver-managed constants, shader constants, uniform data that does not belong to a uniform buffer, and other non-user data. The remaining 14 banks back user-defined uniform buffers in the shaders, starting at constant bank "c[0x3]".

The following table illustrates the constant bank layout:

HW constant bank          Purpose
c[0x0]                    Reserved for driver-managed constants
c[0x1]                    Immediate constants in shader code, extracted by the compiler
c[0x2]                    Bound resource uniforms (images and samplers) for the shader stage
c[0x3] through c[0x10]    User uniform buffer bindings 0 through 13 for the shader stage
c[0x11]                   Reserved by the driver


Note that the layout described in this section applies to binaries with GLSLC GPU code major version 1. If a new GPU code major version is introduced, these layouts might change, and the new major version would signal a backward-incompatible break.
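
As an illustration of the "c[A][B]" notation and the bank table above, the following Python sketch extracts the bank and byte offset from a disassembled line and names the bank's documented purpose. The helper names (parse_cbuf_ref, describe_bank) are hypothetical and not part of any NVN or GLSLC tool, and the regular expression assumes both indices are printed as hexadecimal literals; other operand forms are not handled.

```python
import re

# Bank roles follow the constant bank table above (GLSLC GPU code major version 1).
FIRST_USER_BANK = 0x3
LAST_USER_BANK = 0x10

# Assumption: both indices appear as hex literals, as in "c[0xa][0x0]".
_CBUF_REF = re.compile(r"c\[(0x[0-9a-fA-F]+)\]\[(0x[0-9a-fA-F]+)\]")


def parse_cbuf_ref(text):
    """Return (bank, byte_offset) for the first c[A][B] reference, or None."""
    match = _CBUF_REF.search(text)
    if match is None:
        return None
    return int(match.group(1), 16), int(match.group(2), 16)


def describe_bank(bank):
    """Map a hardware constant bank index to its documented purpose."""
    if bank == 0x0:
        return "driver-managed constants"
    if bank == 0x1:
        return "immediate constants"
    if bank == 0x2:
        return "bound resource uniforms"
    if FIRST_USER_BANK <= bank <= LAST_USER_BANK:
        return "user uniform buffer binding %d" % (bank - FIRST_USER_BANK)
    if bank == 0x11:
        return "reserved by the driver"
    raise ValueError("bank %#x is outside the 18 documented banks" % bank)


bank, offset = parse_cbuf_ref("MOV R4, c[0xa][0x0];")
print(describe_bank(bank), "at byte offset", offset)
```

For the MOV example above, the sketch reports user uniform buffer binding 7 at byte offset 0.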

Reserved Driver-Managed Constants

The NVN driver binds an internal constant buffer to hardware constant bank #0, which holds driver-managed constant data. The constant buffer holds API state that needs to be fetched by compiler-generated shader code implementing various NVN API features. Among the data stored in this constant buffer are 16-byte descriptors for shader storage blocks, which are programmed via nvn::CommandBuffer::BindStorageBuffer.

For graphics shaders, this internal constant buffer is shared by all shader stages. The shader storage block bindings for each shader stage can be found in the following locations:

SSBO bindings                              c[0x0] entries
Vertex SSBO bindings 0 through 15          c[0x0][0x110 through 0x20F]
Tess Control SSBO bindings 0 through 15    c[0x0][0x210 through 0x30F]
Tess Eval SSBO bindings 0 through 15       c[0x0][0x310 through 0x40F]
Geometry SSBO bindings 0 through 15        c[0x0][0x410 through 0x50F]
Fragment SSBO bindings 0 through 15        c[0x0][0x510 through 0x60F]
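
The ranges in this table can be turned into a small lookup. The Python sketch below assumes that the 16 descriptors of each stage are packed contiguously in binding order (16 bindings of 16 bytes each fill exactly the 0x100 bytes listed per stage); the text does not state the ordering explicitly. The function and dictionary names are hypothetical.

```python
SSBO_DESCRIPTOR_SIZE = 16     # bytes per descriptor, per the text above
SSBO_BINDINGS_PER_STAGE = 16  # bindings 0 through 15

# First byte of each stage's SSBO descriptor range in c[0x0] (graphics shaders).
GRAPHICS_SSBO_BASE = {
    "vertex": 0x110,
    "tess_control": 0x210,
    "tess_eval": 0x310,
    "geometry": 0x410,
    "fragment": 0x510,
}


def ssbo_descriptor_offset(stage, binding):
    """Byte offset in c[0x0] of the descriptor for one SSBO binding.

    Assumes descriptors are stored contiguously in binding order.
    """
    if not 0 <= binding < SSBO_BINDINGS_PER_STAGE:
        raise ValueError("SSBO binding must be in 0..15")
    return GRAPHICS_SSBO_BASE[stage] + binding * SSBO_DESCRIPTOR_SIZE


def locate_ssbo(byte_offset):
    """Inverse lookup: return (stage, binding) for a c[0x0] offset, or None."""
    span = SSBO_DESCRIPTOR_SIZE * SSBO_BINDINGS_PER_STAGE
    for stage, base in GRAPHICS_SSBO_BASE.items():
        if base <= byte_offset < base + span:
            return stage, (byte_offset - base) // SSBO_DESCRIPTOR_SIZE
    return None


print(hex(ssbo_descriptor_offset("fragment", 2)))
print(locate_ssbo(0x214))
```

The inverse lookup is the more useful direction when reading a SASS dump: a load from "c[0x0]" at an offset in one of these ranges can be traced back to a stage and binding.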


A separate internal constant buffer is used for compute shaders. In addition to storing shader storage block bindings, this constant buffer also holds 16-byte descriptors for compute shader uniform buffer bindings. Tegra X1 compute shader hardware only supports 8 total constant buffer bindings. Uniform buffer bindings #0 through #4 map directly to "c[0x3]" through "c[0x7]", while buffer bindings #5 through #13 are fetched from the internal constant buffer "c[0x0]" and emulated using global loads. If an array of uniform buffer bindings crosses the boundary between bindings #4 and #5 and an access to that array uses a non-constant buffer index, that access will use the internal constant buffer, regardless of the actual index.

Compute API bindings                                                    Constant buffer locations
Uniform buffer bindings 0 through 4 (if non-emulated)                   c[0x3] through c[0x7]
Uniform buffer bindings 0 through 13 (if emulated with global loads)    c[0x0][0x210 through 0x2F0]; 16 bytes each
SSBO bindings 0 through 15                                              c[0x0][0x310 through 0x40F]; 16 bytes each
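
The compute shader rules above can be summarized in a short Python sketch. It assumes that the emulated descriptor for binding N starts at byte 0x210 + 16 * N in "c[0x0]", which the table implies but does not state outright. The function name and the dynamic_array parameter are hypothetical, and the sketch models only the cases the text describes.

```python
FIRST_USER_BANK = 0x3
DIRECT_BINDINGS = 5      # bindings 0..4 map to c[0x3]..c[0x7]
TOTAL_BINDINGS = 14      # bindings 0..13
EMULATED_BASE = 0x210    # start of emulated UBO descriptors in c[0x0]
DESCRIPTOR_SIZE = 16     # bytes per descriptor


def compute_ubo_location(binding, dynamic_array=None):
    """Describe how a compute shader reaches one uniform buffer binding.

    dynamic_array is an optional (first, last) binding range of a uniform
    buffer array that is indexed with a non-constant index.
    Returns ("bank", bank_index) or ("emulated", byte_offset_in_c0).
    """
    if not 0 <= binding < TOTAL_BINDINGS:
        raise ValueError("uniform buffer binding must be in 0..13")

    crosses_boundary = (
        dynamic_array is not None
        and dynamic_array[0] < DIRECT_BINDINGS <= dynamic_array[1]
    )
    if binding < DIRECT_BINDINGS and not crosses_boundary:
        return "bank", FIRST_USER_BANK + binding

    # Assumption: descriptors are packed in binding order starting at 0x210.
    return "emulated", EMULATED_BASE + binding * DESCRIPTOR_SIZE


print(compute_ubo_location(4))
print(compute_ubo_location(5))
print(compute_ubo_location(3, dynamic_array=(3, 6)))
```

In the last call, binding #3 would normally map to "c[0x6]", but because the dynamically indexed array spans bindings #3 through #6 and so crosses the #4/#5 boundary, the access goes through the internal constant buffer instead.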

Bound Resource Uniforms

GLSL shaders used by NVN can include sampler or image uniforms associated with API binding points which are not stored in user-defined uniform blocks. For example:
            layout(binding=4) uniform sampler2D smp;
declares a variable _smp_ that is associated with API binding point #4 for the shader stage. Unlike OpenGL, NVN has separate API binding points for each shader stage. The handles used by these binding points are stored in a per-stage internal constant buffer bound to hardware constant bank #2 ("c[0x2]"), at a pre-defined fixed offset based on the binding assigned to each uniform in the shader. Samplers, separate textures/samplers, and images are each represented as one 8-byte entry per binding.

The layout of bank c[0x2] is defined as follows:

Byte range in c[0x2]                Usage
c[0x2][0x0 through 0x1F]            Reserved for internal use
c[0x2][0x20 through 0x11F]          API combined texture/sampler bindings 0 through 31; 8 bytes each
c[0x2][0x120 through 0x15F]         API image bindings 0 through 7; 8 bytes each
c[0x2][0x160 through 0x167]         Reserved for internal use
c[0x2][0x168 through 0x567]         API texture-only bindings 0 through 127; 8 bytes each
c[0x2][0x568 through 0x667]         API sampler-only bindings 0 through 31; 8 bytes each


Note that texture and image instructions using bound resource uniforms might appear in the assembly with immediate integer offsets, counted in 4-byte units, instead of explicit references to "c[0x2]". The hardware is already programmed to pull descriptors for these instruction types from constant bank #2, and the compiler might optimize out intermediate loads from this bank. For example, in the instruction
            TEXS.NODEP.P    R2, R0, R4, R4, 0xe, 2D, RGBA;                # [000010]
the "0xe" (14) indicates that the hardware will fetch the texture or image descriptor at an offset of 14*4 = 56 bytes (0x38) from the beginning of the bound resource uniform constant buffer. That offset falls in the combined texture/sampler range, which starts at 0x20 and uses 8 bytes per binding, so it refers to combined texture/sampler binding (0x38 - 0x20) / 8 = #3.

Also note that while samplers and images are represented as 8-byte values, generated shader code may fetch bindings using 4-byte loads.
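
The arithmetic in the TEXS example generalizes to any offset in bank "c[0x2]". The following Python sketch encodes the byte ranges from the layout table and resolves either a byte offset or an instruction immediate (in 4-byte units) to a binding. The function names are hypothetical, and the sketch assumes entries within each range are packed in binding order.

```python
ENTRY_SIZE = 8       # bytes per binding entry in c[0x2]
IMMEDIATE_UNIT = 4   # instruction immediates count 4-byte units

# (first byte, last byte, kind) for each API binding range in c[0x2].
RESOURCE_RANGES = [
    (0x020, 0x11F, "combined texture/sampler"),
    (0x120, 0x15F, "image"),
    (0x168, 0x567, "texture-only"),
    (0x568, 0x667, "sampler-only"),
]


def resource_binding_at(byte_offset):
    """Return (kind, binding) for a byte offset into c[0x2].

    Offsets outside the API ranges return ("reserved/unknown", None).
    """
    for first, last, kind in RESOURCE_RANGES:
        if first <= byte_offset <= last:
            return kind, (byte_offset - first) // ENTRY_SIZE
    return "reserved/unknown", None


def resource_binding_from_immediate(immediate):
    """Resolve a texture/image instruction immediate such as 0xe."""
    return resource_binding_at(immediate * IMMEDIATE_UNIT)


print(resource_binding_from_immediate(0xe))
```

Because generated code may fetch a binding with 4-byte loads, an offset that lands in the upper half of an 8-byte entry still resolves to the same binding; the integer division handles that case.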

Uniform Buffer Bindings

Each uniform buffer binding in a shader stage corresponds to a single constant bank. Because user bindings start at "c[0x3]", binding #N uses bank N + 3. For example, "c[0xa][0x0]" maps to the first byte of uniform buffer binding #7 in the shader. If a GLSL shader defines a uniform block with "layout(binding = 3)", SASS instructions referencing that uniform buffer contain "c[0x6]".

To determine the byte offsets of members of a uniform block, developers need to know the layout of the block's data or use GLSLC reflection information to query offsets. Uniform buffers that use the std140 or std430 layout have an explicit, pre-defined format for the uniform block data. For uniform buffers that use neither layout, the compiler assigns an implementation-dependent layout, and developers must use GLSLC's reflection section to obtain the byte offsets of the desired uniforms. Byte offsets in the reflection section for uniform buffer uniforms map directly to byte offsets into the corresponding constant bank.
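
For blocks that do use std140, member offsets can be predicted by hand. The Python sketch below applies the std140 size and alignment rules for scalars and vectors only; matrices, arrays, and nested structures follow additional rules that are not modeled here, so GLSLC reflection remains the authoritative source. The block members and the helper name are made up for illustration.

```python
# (size, base alignment) in bytes for std140 scalars and vectors.
STD140_TYPES = {
    "float": (4, 4),
    "int": (4, 4),
    "uint": (4, 4),
    "bool": (4, 4),
    "vec2": (8, 8),
    "vec3": (12, 16),
    "vec4": (16, 16),
}


def std140_offsets(members):
    """Return {name: byte offset} for (name, type) pairs in declaration order.

    Only scalar and vector members are supported in this sketch.
    """
    offsets = {}
    cursor = 0
    for name, type_name in members:
        size, align = STD140_TYPES[type_name]
        cursor = (cursor + align - 1) // align * align  # round up to alignment
        offsets[name] = cursor
        cursor += size
    return offsets


# Hypothetical block declared with layout(std140, binding = 3), so bank c[0x6].
block = [
    ("scale", "float"),
    ("normal", "vec3"),
    ("bias", "float"),
    ("uv", "vec2"),
    ("color", "vec4"),
]
for name, offset in std140_offsets(block).items():
    print("%-6s c[0x6][%#x]" % (name, offset))
```

Note how the float following the vec3 packs into the last 4 bytes of the vec3's 16-byte slot, which is a common source of surprises when matching SASS offsets to GLSL members.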

Vertex Attributes and Varyings

When passing inputs and outputs between graphics shader stages, instructions refer to an attribute by a canonical "address" of the form "a[addr]", where _addr_ is a byte offset into a logical structure containing all possible attributes. For example:
            IPA             R5, a[0x80], R4;                              # [0001b8] ATTR0
            IPA             R6, a[0x84], R4;                              # [0001c8] GENERIC_ATTRIBUTE_00_Y
            IPA             R7, a[0x88], R4;                              # [0001d0] GENERIC_ATTRIBUTE_00_Z
            IPA             R8, a[0x8c], R4;                              # [0001d8] GENERIC_ATTRIBUTE_00_W
interpolates (IPA) the four components of generic vector attribute 0, which has byte offsets in the range [0x80, 0x8F]. In GLSL shaders, a layout qualifier like:
            layout(location=4) in vec4 value;
will associate _value_ with generic vector attribute 4, which has an associated offset of 0xC0.

The interface supports 32 generic vectors with offsets in the range [0x80, 0x27F]. The interface also supports various fixed-function attributes. For instructions using attributes, as in the example above, comments in the disassembled instructions identify the attributes accessed.
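
The mapping between GLSL locations and "a[addr]" values for generic attributes follows directly from the figures above: a base of 0x80, 16 bytes per vector, and 4 bytes per component. A minimal Python sketch, with hypothetical function names and covering generic attributes only (fixed-function attributes are not decoded):

```python
GENERIC_BASE = 0x80    # address of generic vector attribute 0
GENERIC_COUNT = 32     # generic vectors 0 through 31
VECTOR_SIZE = 16       # bytes per vec4 attribute slot
COMPONENT_SIZE = 4     # bytes per component


def generic_attribute_address(location, component=0):
    """Canonical a[addr] value for a generic attribute location and component."""
    if not 0 <= location < GENERIC_COUNT:
        raise ValueError("generic attribute location must be in 0..31")
    if not 0 <= component < 4:
        raise ValueError("component must be in 0..3")
    return GENERIC_BASE + location * VECTOR_SIZE + component * COMPONENT_SIZE


def decode_attribute_address(addr):
    """Return (location, component letter) for a generic a[addr], else None."""
    if not GENERIC_BASE <= addr < GENERIC_BASE + GENERIC_COUNT * VECTOR_SIZE:
        return None
    relative = addr - GENERIC_BASE
    return relative // VECTOR_SIZE, "XYZW"[(relative % VECTOR_SIZE) // COMPONENT_SIZE]


print(hex(generic_attribute_address(4)))
print(decode_attribute_address(0x8c))
```

The first call reproduces the 0xC0 offset for "layout(location=4)", and the second decodes the a[0x8c] operand from the IPA example as the W component of generic attribute 0.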

Note that these "byte offsets" are not actually used as offsets in memory; they are rather treated as canonical attribute numbers. Tegra X1 hardware optimizes attribute passing so that unused or "dead" attributes are not passed between stages and do not consume storage or memory bandwidth.
