Maxwell Instruction Set Architecture

Instruction Set

Quick Links

AL2P ALD AST ATOM ATOMS B2R BAR BFE BFI BPT BRA BRK BRX CAL CCTL CCTLL CCTLT CONT CS2R CSET CSETP DADD DEPBAR DFMA DMNMX DMUL DSET DSETP EXIT F2F F2I FADD FADD32I FCHK FCMP FFMA FFMA32I FLO FMNMX FMUL FMUL32I FSET FSETP FSWZADD GETCRSPTR GETLMEMBASE HADD2 HADD2_32I HFMA2 HFMA2_32I HMUL2 HMUL2_32I HSET2 HSETP2 I2F I2I IADD IADD3 IADD32I ICMP IDE IMAD IMAD32I IMADSP IMNMX IMUL IMUL32I IPA ISBERD ISCADD ISCADD32I ISET ISETP JCAL JMP JMX KIL LD LDC LDG LDL LDS LEA LEPC LONGJMP LOP LOP3 LOP32I MEMBAR MOV MOV32I MUFU NOP OUT P2R PBK PCNT PEXIT PIXLD PLONGJMP POPC PRET PRMT PSET PSETP R2B R2P RAM RED RET RRO RTT S2R SAM SEL SETCRSPTR SETLMEMBASE SHF SHFL SHL SHR SSY ST STG STL STP STS SUATOM SULD SURED SUST SYNC TEX TEXS TLD TLD4 TLD4S TLDS TMML TXA TXD TXQ VABSDIFF VABSDIFF4 VADD VMAD VMNMX VOTE VSET VSETP VSHL VSHR XMAD

Index of Instructions

Index of Floating Point Instructions
Opcode	Description
FADD	FP32 Add
FADD32I	FP32 Add
FCHK	Single Precision FP Divide Range Check
FCMP	FP32 Compare to Zero and Select Source
FFMA	FP32 Fused Multiply and Add
FFMA32I	FP32 Fused Multiply and Add
FMNMX	FP32 Minimum/Maximum
FMUL	FP32 Multiply
FMUL32I	FP32 Multiply
FSET	FP32 Compare And Set
FSETP	FP32 Compare And Set Predicate
FSWZADD	FP32 Add used for FSWZ emulation

IPA	Interpolate Attribute
MUFU	Multi Function Operation
RRO	Range Reduction Operator FP

DADD	FP64 Add
DFMA	FP64 Fused Multiply Add
DMNMX	FP64 Minimum/Maximum
DMUL	FP64 Multiply
DSET	FP64 Compare And Set
DSETP	FP64 Compare And Set Predicate

HADD2	FP16 SIMD Addition
HADD2_32I	FP16 SIMD Addition
HFMA2	FP16 SIMD Fused Multiply and Add
HFMA2_32I	FP16 SIMD Fused Multiply and Add
HMUL2	FP16 SIMD Multiply
HMUL2_32I	FP16 SIMD Multiply
HSET2	FP16 SIMD Compare and Set
HSETP2	FP16 SIMD Compare and Set Predicate

Index of Integer Instructions
Opcode	Description
BFE	Bit Field Extract
BFI	Bit Field Insert
FLO	Find Leading One
IADD	Integer Addition
IADD3	3-input Integer Addition
IADD32I	Integer Addition
ICMP	Integer Compare to Zero and Select Source
IMAD	Integer Multiply And Add
IMAD32I	Integer Multiply And Add
IMADSP	Extracted Integer Multiply And Add.
IMNMX	Integer Minimum/Maximum
IMUL	Integer Multiply
IMUL32I	Integer Multiply
ISCADD	Scaled Integer Addition
ISCADD32I	Scaled Integer Addition
ISET	Integer Compare And Set
ISETP	Integer Compare And Set Predicate
LEA	Compute Effective Address
LOP	Logic Operation
LOP3	3-input Logic Operation
LOP32I	Logic Operation
POPC	Population count
SHF	Funnel Shift
SHL	Shift Left
SHR	Shift Right
XMAD	Integer Short Multiply Add

Index of Video Instructions
Opcode	Description
VABSDIFF	Integer Byte/Short Absolute Difference
VADD	Integer Byte/Short Addition
VMAD	Integer Byte/Short Multiply Add
VMNMX	Integer Byte/Short Minimum/Maximum
VSET	Integer Byte/Short Set
VSETP	Integer Byte/Short Compare And Set Predicate
VSHL	Integer Byte/Short Shift Left
VSHR	Integer Byte/Short Shift Right

VABSDIFF4	Integer SIMD Byte Absolute Difference

Index of Conversion Instructions
Opcode	Description
F2F	Floating Point To Floating Point Conversion
F2I	Floating Point To Integer Conversion
I2F	Integer To Floating Point Conversion
I2I	Integer To Integer Conversion

Index of Movement Instructions
Opcode	Description
MOV	Move
MOV32I	Move
PRMT	Permute Register Pair
SEL	Select Source with Predicate
SHFL	Warp Wide Register Shuffle

Index of Predicate/CC Instructions
Opcode	Description
CSET	Test Condition Code And Set
CSETP	Test Condition Code and Set Predicate
PSET	Combine Predicates and Set
PSETP	Combine Predicates and Set Predicate

P2R	Move Predicate Register To Register
R2P	Move Register To Predicate/CC Register

Index of Texture Instructions
Opcode	Description
TEX	Texture Fetch
TLD	Texture Load
TLD4	Texture Load 4
TMML	Texture MipMap Level
TXA	Texture Virtual AA
TXD	Texture Fetch With Derivatives
TXQ	Texture Query

TEXS	Texture Fetch with scalar/non-vec4 source/destinations
TLD4S	Texture Load 4 with scalar/non-vec4 source/destinations
TLDS	Texture Load with scalar/non-vec4 source/destinations

STP	Set Texture Phase

Index of Graphics Load/Store Instructions
Opcode	Description
AL2P	Attribute Logical to physical (translate)
ALD	Attribute Load
AST	Attribute Store
ISBERD	Read from ISBE structures used by VTG shaders
OUT	Output Token
PIXLD	Pixel Load

Index of Compute Load/Store Instructions
Opcode	Description
LD	Load from generic Memory
LDC	Load Constant
LDG	Load from Global Memory
LDL	Load within Local Memory Window
LDS	Load within Shared Memory Window

ST	Store to generic Memory
STG	Store to global Memory
STL	Store within Local or Shared Window
STS	Store within Local or Shared Window

ATOM	Atomic Operation on generic Memory
ATOMS	Atomic Operation on Shared Memory
RED	Reduction Operation on generic Memory

CCTL	Cache Control
CCTLL	Cache Control
MEMBAR	Memory Barrier

CCTLT	Texture Cache Control
SUATOM	Surface Reduction
SULD	Surface Load
SURED	Atomic Reduction on surface memory
SUST	Surface Store

Index of Control Instructions
Opcode	Description
BRA	Relative Branch
BRX	Relative Branch Indirect
JMP	Absolute Jump
JMX	Absolute Jump Indirect
SSY	Set Synchronization Point
SYNC	Converge threads after conditional branch

CAL	Relative Call
JCAL	Absolute Call
PRET	Pre-Return From Subroutine
RET	Return From Subroutine

BRK	Break
PBK	Pre-Break

CONT	Continue
PCNT	Pre-continue

EXIT	Exit Program
PEXIT	Pre-Exit

LONGJMP	Long-Jump
PLONGJMP	Pre-Long-Jump

KIL	Kill Thread

BPT	BreakPoint/Trap
IDE	Interrupt disable/enable
RAM	Restore Active Mask
RTT	Return From Trap
SAM	Set Active Mask

Index of Miscellaneous Instructions
Opcode	Description
NOP	No Operation

CS2R	Move Special Register to Register
S2R	Move Special Register to Register

LEPC	Load Effective Program Counter

B2R	Move Barrier To Register
BAR	Barrier Synchronization
R2B	Move Register to Barrier

VOTE	Vote Across SIMD Thread Group

DEPBAR	Dependency Barrier

GETCRSPTR	Get Call Return Stack Pointer
GETLMEMBASE	Get Local Memory Base Pointer
SETCRSPTR	Set Call Return Stack Pointer
SETLMEMBASE	Set Local Memory Base Pointer

NVN Constant Buffer Accesses

This section describes how to interpret constant buffer accesses, such as uniform references, in the SASS dump.

Assembly instructions may fetch values from a bound constant buffer in arithmetic or general load instructions. For example, the instruction:

            MOV             R4, c[0xa][0x0];                              # [000128]

moves data from a constant buffer into register R4. References to constant buffer memory in disassembled instructions are of the form "c[A][B]", where "c" indicates a reference to a constant buffer in GPU hardware. The first index ("[0xa]" in the above example) indicates which constant bank (buffer binding) the instruction is reading from. The second index ("[0x0]" in the above example) is the byte offset into the bank. There are a total of 18 constant banks available per shader stage. Four banks are reserved by the compiler and the NVN implementation for non-user data, such as driver-managed constants, shader constants, and uniforms that do not belong to a uniform buffer. The remaining 14 banks back user-defined uniform buffers in the shaders, starting at constant bank "c[0x3]".

The following table illustrates the constant bank layout:

HW Constant bank	Purpose
c[0x0]	Reserved for driver-managed constants
c[0x1]	Immediate constants in shader code, extracted by the compiler
c[0x2]	Bound resource uniforms (images and samplers) for the shader stage
c[0x3] through c[0x10]	User uniform buffer bindings 0 through 13 for the shader stage
c[0x11]	Reserved by the driver

Note that the layout described in this section applies to binaries of GLSLC GPU code major version 1. A new GPU code major version might change these layouts, and would indicate a backward-incompatible change.
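
As a rough illustration of the "c[A][B]" notation and the bank table above, the following Python sketch parses a constant buffer operand from a disassembled line and classifies its bank. The helper names (parse_const_ref, describe_bank) are hypothetical and not part of any NVN or GLSLC tool, and the mapping assumes the GPU code major version 1 layout for a graphics shader stage.

```python
import re

# Matches operands such as "c[0xa][0x0]" (hexadecimal bank and byte offset).
_CONST_REF = re.compile(r"c\[(0x[0-9a-fA-F]+)\]\[(0x[0-9a-fA-F]+)\]")

def parse_const_ref(text):
    """Return (bank, byte_offset) for the first c[A][B] operand, or None."""
    match = _CONST_REF.search(text)
    if match is None:
        return None
    return int(match.group(1), 16), int(match.group(2), 16)

def describe_bank(bank):
    """Map a hardware constant bank number to its purpose per the table."""
    if bank == 0x0:
        return "driver-managed constants"
    if bank == 0x1:
        return "immediate constants extracted by the compiler"
    if bank == 0x2:
        return "bound resource uniforms"
    if 0x3 <= bank <= 0x10:
        return "user uniform buffer binding %d" % (bank - 0x3)
    if bank == 0x11:
        return "reserved by the driver"
    raise ValueError("constant bank out of range: %s" % hex(bank))

bank, offset = parse_const_ref("MOV R4, c[0xa][0x0];")
print(describe_bank(bank), "at byte offset", offset)
```

For the MOV example shown earlier, this reports user uniform buffer binding 7 at byte offset 0.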

Reserved Driver-Managed Constants

The NVN driver binds an internal constant buffer to hardware constant bank #0, which holds driver-managed constant data. The constant buffer holds API state that needs to be fetched by compiler-generated shader code implementing various NVN API features. Among the data stored in this constant buffer are 16-byte descriptors for shader storage blocks, which are programmed via nvn::CommandBuffer::BindStorageBuffer.

For graphics shaders, this internal constant buffer is shared by all shader stages. The shader storage block bindings for each shader stage can be found in the following locations:

SSBO bindings	c[0x0] entries
Vertex SSBO bindings 0 through 15	c[0x0][0x110 through 0x20F]
Tess Control SSBO bindings 0 through 15	c[0x0][0x210 through 0x30F]
Tess Eval SSBO bindings 0 through 15	c[0x0][0x310 through 0x40F]
Geometry SSBO bindings 0 through 15	c[0x0][0x410 through 0x50F]
Fragment SSBO bindings 0 through 15	c[0x0][0x510 through 0x60F]
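
The ranges above can be turned into a per-binding offset with simple arithmetic. The sketch below is only an illustration: the function name and stage keys are hypothetical, and it assumes that the 16-byte descriptors are packed contiguously in binding order within each stage's range, which the table suggests but does not state outright.

```python
# Start of each graphics stage's SSBO descriptor range in c[0x0] (from the table).
SSBO_BASE = {
    "vertex": 0x110,
    "tess_control": 0x210,
    "tess_eval": 0x310,
    "geometry": 0x410,
    "fragment": 0x510,
}
DESCRIPTOR_SIZE = 16  # bytes per shader storage block descriptor
NUM_BINDINGS = 16     # SSBO bindings 0 through 15

def ssbo_descriptor_offset(stage, binding):
    """Byte offset in c[0x0] of the descriptor for one SSBO binding.

    Assumes descriptors are stored back to back in binding order.
    """
    if not 0 <= binding < NUM_BINDINGS:
        raise ValueError("SSBO binding must be in 0..15")
    return SSBO_BASE[stage] + binding * DESCRIPTOR_SIZE

print(hex(ssbo_descriptor_offset("fragment", 2)))
```

Under that assumption, fragment SSBO binding 2 would be read from "c[0x0][0x530]".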

A separate internal constant buffer is used for compute shaders. In addition to storing shader storage block bindings, this constant buffer also holds 16-byte descriptors for compute shader uniform buffer bindings. Tegra X1 compute shader hardware only supports 8 total constant buffer bindings. Uniform buffer bindings #0 through #4 map directly to "c[0x3]" through "c[0x7]", while buffer bindings #5 through #13 are fetched from the internal constant buffer "c[0x0]" and emulated using global loads. If an array of uniform buffer bindings crosses the boundary between bindings #4 and #5 and an access to that array uses a non-constant buffer index, that access will use the internal constant buffer, regardless of the actual index.

Compute API bindings	Constant buffer locations
Uniform buffer bindings 0 through 4 (if non-emulated)	c[0x3] through c[0x7]
Uniform buffer bindings 0 through 13 (if emulated with global loads)	c[0x0][0x210 through 0x2EF]; 16 bytes each
SSBO bindings 0 through 15	c[0x0][0x310 through 0x40F]; 16 bytes each
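
The compute mapping described above can be sketched as follows. The function names and the "emulated" flag are hypothetical; the flag stands in for the case where an array of bindings crosses the #4/#5 boundary and is indexed non-constantly. As before, the sketch assumes that descriptors are packed in binding order, 16 bytes apart.

```python
FIRST_USER_BANK = 0x3        # c[0x3] backs uniform buffer binding 0
LAST_DIRECT_BINDING = 4      # bindings 0..4 can map directly to a bank
UBO_DESCRIPTOR_BASE = 0x210  # emulated uniform buffer descriptors in c[0x0]
SSBO_DESCRIPTOR_BASE = 0x310 # shader storage block descriptors in c[0x0]
DESCRIPTOR_SIZE = 16

def compute_ubo_location(binding, emulated=False):
    """Where a compute shader finds uniform buffer `binding`."""
    if not 0 <= binding <= 13:
        raise ValueError("uniform buffer binding must be in 0..13")
    if emulated or binding > LAST_DIRECT_BINDING:
        # Descriptor is fetched from c[0x0]; data is read with global loads.
        offset = UBO_DESCRIPTOR_BASE + binding * DESCRIPTOR_SIZE
        return "c[0x0][%s]" % hex(offset)
    return "c[%s]" % hex(FIRST_USER_BANK + binding)

def compute_ssbo_location(binding):
    """Where a compute shader finds the descriptor for SSBO `binding`."""
    if not 0 <= binding <= 15:
        raise ValueError("SSBO binding must be in 0..15")
    return "c[0x0][%s]" % hex(SSBO_DESCRIPTOR_BASE + binding * DESCRIPTOR_SIZE)

print(compute_ubo_location(2))
print(compute_ubo_location(2, emulated=True))
print(compute_ubo_location(5))
```

Binding 2 resolves to a direct bank unless it is forced onto the emulated path, while binding 5 always resolves to a descriptor in "c[0x0]".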

Bound Resource Uniforms

GLSL shaders used by NVN can include sampler or image uniforms associated with API binding points which are not stored in user-defined uniform blocks. For example:

            layout(binding=4) uniform sampler2D smp;

declares a variable _smp_ that is associated with API binding point #4 for the shader stage. Unlike OpenGL, NVN has separate API binding points for each shader stage. The handles used by these binding points are stored in a per-stage internal constant buffer bound to hardware constant bank #2 ("c[0x2]") at a pre-defined fixed offset based on the assigned binding for each uniform in the shader. Samplers, separate textures/samplers, and images are represented as 8-byte entries per binding.

The layout of bank c[0x2] is defined as follows:

Byte range in c[0x2]	Usage
c[0x2][0x0 through 0x1F]	Reserved for internal use
c[0x2][0x20 through 0x11F]	API combined texture/sampler bindings 0 through 31; 8 bytes each
c[0x2][0x120 through 0x15F]	API Image bindings 0 through 7; 8 bytes each
c[0x2][0x160 through 0x167]	Reserved for internal use
c[0x2][0x168 through 0x567]	API texture-only bindings 0 through 127; 8 bytes each
c[0x2][0x568 through 0x667]	API sampler-only bindings 0 through 31; 8 bytes each
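
To make the table concrete, the following sketch computes the byte offset in "c[0x2]" for a given binding. The names (RESOURCE_RANGES, resource_offset) and the short kind labels are hypothetical, chosen only for this illustration; the base offsets, binding counts, and 8-byte entry size come from the table.

```python
# kind -> (first byte offset in c[0x2], number of API bindings)
RESOURCE_RANGES = {
    "combined": (0x20, 32),   # combined texture/sampler bindings
    "image": (0x120, 8),      # image bindings
    "texture": (0x168, 128),  # texture-only bindings
    "sampler": (0x568, 32),   # sampler-only bindings
}
ENTRY_SIZE = 8  # bytes per binding entry

def resource_offset(kind, binding):
    """Byte offset in c[0x2] of the handle for one API binding."""
    base, count = RESOURCE_RANGES[kind]
    if not 0 <= binding < count:
        raise ValueError("%s binding must be in 0..%d" % (kind, count - 1))
    return base + binding * ENTRY_SIZE

# layout(binding=4) uniform sampler2D smp;  -> combined binding 4
print(hex(resource_offset("combined", 4)))
```

For the _smp_ declaration shown earlier, the handle would sit at "c[0x2][0x40]".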

Note that texture and image instructions using bound resource uniforms might show up in the assembly as instructions referencing 4-byte immediate integer offsets instead of making explicit references to "c[0x2]"; the hardware is already programmed to reference constant bank #2 for pulling descriptors for these types of instructions, and the compiler might choose to optimize out intermediate loads from this constant bank. For example, in the instruction

            TEXS.NODEP.P    R2, R0, R4, R4, 0xe, 2D, RGBA;                # [000010]

the "0xe" (14) indicates that the hardware will fetch the texture or image descriptor at an offset of 14*4 = 56 bytes from the beginning of the bound resource uniform constant buffer. That would refer to combined texture/sampler binding #3.
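
The same arithmetic can be run in reverse to find out which binding an immediate refers to. The sketch below is illustrative only: decode_handle_immediate is a hypothetical helper, and it assumes, as the example above does, that the immediate counts 4-byte units from the start of bank "c[0x2]".

```python
# (first byte, last byte, binding kind or None for reserved), from the c[0x2] layout.
LAYOUT = [
    (0x000, 0x01F, None),
    (0x020, 0x11F, "combined texture/sampler"),
    (0x120, 0x15F, "image"),
    (0x160, 0x167, None),
    (0x168, 0x567, "texture-only"),
    (0x568, 0x667, "sampler-only"),
]
ENTRY_SIZE = 8  # bytes per binding entry
IMM_UNIT = 4    # the immediate counts 4-byte units

def decode_handle_immediate(imm):
    """Return (kind, binding) for a texture/image instruction immediate."""
    byte_offset = imm * IMM_UNIT
    for first, last, kind in LAYOUT:
        if first <= byte_offset <= last:
            if kind is None:
                return ("reserved", None)
            # An immediate may point at either 4-byte half of an 8-byte entry.
            return (kind, (byte_offset - first) // ENTRY_SIZE)
    raise ValueError("offset %s is outside the documented layout" % hex(byte_offset))

print(decode_handle_immediate(0xe))
```

For the TEXS example, 0xe decodes to byte offset 56, which falls in the combined texture/sampler range at binding 3.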

Also note that while samplers and images are represented as 8-byte values, generated shader code may fetch bindings using 4-byte loads.

Uniform Buffer Bindings

Each uniform buffer binding in a shader stage corresponds to a single constant bank. For example, "c[0xa][0x0]" maps to the first byte of uniform buffer binding #7 in the shader. If the user defines a uniform block in a GLSL shader with "layout(binding = 3)", SASS instructions referencing that uniform buffer contain "c[0x6]".
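
When searching a SASS dump for accesses to a particular uniform, it can help to generate the operand text to look for. The helper below is a hypothetical sketch that formats the operand in the same style as the disassembly shown in this section; it assumes a graphics shader stage, where bindings 0 through 13 map to banks "c[0x3]" through "c[0x10]".

```python
FIRST_USER_BANK = 0x3
NUM_USER_BINDINGS = 14  # uniform buffer bindings 0 through 13

def const_operand(binding, byte_offset=0):
    """Format the c[A][B] operand for a byte in a uniform buffer binding."""
    if not 0 <= binding < NUM_USER_BINDINGS:
        raise ValueError("uniform buffer binding must be in 0..13")
    if byte_offset < 0:
        raise ValueError("byte offset must be non-negative")
    return "c[0x%x][0x%x]" % (FIRST_USER_BANK + binding, byte_offset)

print(const_operand(7))      # first byte of binding 7
print(const_operand(3, 16))  # byte 16 of the block declared with binding = 3
```

The first call reproduces the "c[0xa][0x0]" operand from the example above.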

To determine the byte offsets of members of a uniform block, developers need to know the layout of the block's data or query the offsets through GLSLC reflection information. Uniform buffers that use the std140 or std430 layout have an explicit, pre-defined format for the uniform block data. For any other uniform buffer, the compiler assigns an implementation-dependent layout, so the byte offsets must be taken from GLSLC's reflection section. Byte offsets reported there for uniform buffer uniforms map directly to byte offsets in the corresponding constant bank.
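
For std140 blocks, offsets can be worked out by hand. The following is a minimal sketch of that calculation, deliberately limited to 32-bit float scalars and vectors; arrays, matrices, and nested structures have additional std140 rules that it does not handle, and the function name is hypothetical. For anything beyond this subset, the reflection data remains the safer source.

```python
# std140 (base alignment, size) in bytes for float scalars and vectors.
STD140 = {
    "float": (4, 4),
    "vec2": (8, 8),
    "vec3": (16, 12),
    "vec4": (16, 16),
}

def std140_offsets(members):
    """Return {name: byte offset} for a list of (name, type) members.

    Only scalar and vector members are supported in this sketch.
    """
    offsets = {}
    cursor = 0
    for name, type_name in members:
        align, size = STD140[type_name]
        cursor = (cursor + align - 1) // align * align  # round up to alignment
        offsets[name] = cursor
        cursor += size
    return offsets

block = [("a", "float"), ("b", "vec3"), ("c", "vec2"), ("d", "vec4")]
print(std140_offsets(block))
```

For this block the members land at byte offsets 0, 16, 32, and 48. If the block were bound with "layout(binding = 3)", member _d_ would therefore be read from "c[0x6][0x30]".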

Vertex Attributes and Varyings

When passing inputs and outputs between graphics shader stages, instructions refer to an attribute by a canonical "address" of the form "a[addr]", where _addr_ is a byte offset into a logical structure containing all possible attributes. For example:

            IPA             R5, a[0x80], R4;                              # [0001b8] ATTR0
            IPA             R6, a[0x84], R4;                              # [0001c8] GENERIC_ATTRIBUTE_00_Y
            IPA             R7, a[0x88], R4;                              # [0001d0] GENERIC_ATTRIBUTE_00_Z
            IPA             R8, a[0x8c], R4;                              # [0001d8] GENERIC_ATTRIBUTE_00_W

interpolates (IPA) the four components of generic vector attribute 0, which has byte offsets in the range [0x80, 0x8F]. In GLSL shaders, a layout qualifier like:

            layout(location=4) in vec4 value;

will associate _value_ with generic vector attribute 4, which has an associated offset of 0xC0.

The interface supports 32 generic vectors with offsets in the range [0x80, 0x27F]. The interface also supports various fixed-function attributes. For instructions using attributes, as in the example above, comments in the disassembled instructions identify the attributes accessed.
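
The generic attribute addresses follow directly from the numbers above: each vector occupies 16 bytes starting at 0x80, and the IPA example shows components 4 bytes apart. A small sketch, with a hypothetical helper name:

```python
GENERIC_BASE = 0x80   # address of generic vector attribute 0
VECTOR_SIZE = 16      # bytes per generic vector attribute
COMPONENT_SIZE = 4    # bytes per component (x, y, z, w)
NUM_GENERIC = 32      # generic vectors 0 through 31

def generic_attribute_address(location, component=0):
    """Canonical a[addr] address of one component of a generic attribute."""
    if not 0 <= location < NUM_GENERIC:
        raise ValueError("location must be in 0..31")
    if not 0 <= component < 4:
        raise ValueError("component must be in 0..3")
    return GENERIC_BASE + location * VECTOR_SIZE + component * COMPONENT_SIZE

# layout(location=4) in vec4 value;
print("a[0x%x]" % generic_attribute_address(4))
# w component of generic vector attribute 0
print("a[0x%x]" % generic_attribute_address(0, 3))
```

These print "a[0xc0]" and "a[0x8c]", matching the offsets quoted in the text and in the IPA example.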

Note that these "byte offsets" are not actually used as offsets in memory; they are rather treated as canonical attribute numbers. Tegra X1 hardware optimizes attribute passing so that unused or "dead" attributes are not passed between stages and do not consume storage or memory bandwidth.