S2R : Move Special Register to Register

Format

SPA 5.0:
{@{!}Pg} S2R Rd, SRa {&req_6} {&rdN} {&wrN} {?sched} ;
{@{!}Pg} CS2R Rd, SRa {&req_6} {?sched} ;

Description

Both CS2R and S2R move special register SRa to register Rd.

These special registers typically hold architectural state that is set up by external methods or bundles rather than produced by program execution itself.

S2R is a variable latency read of special registers.

A subset of these special registers is tightly coupled with instruction execution, i.e., they can be read and written back to the register file with a fixed latency from the issue cycle. This subset can be read via the CS2R instruction. Reading special registers outside this subset with CS2R returns 0.
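The CS2R rule can be modeled in a few lines. This is a minimal Python sketch, not a description of the hardware: the coupled set is taken from the "CS2R (coupled)" column of the table below (SR4..11 and SR72..83), and read_s2r, read_cs2r and the sr_file dictionary are hypothetical names introduced only for illustration.

```python
# Special registers marked "Y" in the CS2R (coupled) column of the table.
COUPLED = set(range(4, 12)) | set(range(72, 84))

def read_s2r(sr_file, index):
    """Model of S2R: implemented registers are readable, others read as zero."""
    return sr_file.get(index, 0) & 0xFFFFFFFF

def read_cs2r(sr_file, index):
    """Model of CS2R: only coupled registers are readable, others return 0."""
    if index not in COUPLED:
        return 0
    return sr_file.get(index, 0) & 0xFFFFFFFF

# Hypothetical register contents: SR0 = SR_LaneId, SR80 = SR_ClockLo.
sr_file = {0: 7, 80: 0x12345678}
lane_via_s2r = read_s2r(sr_file, 0)      # lane id is visible through S2R
lane_via_cs2r = read_cs2r(sr_file, 0)    # SR_LaneId is not coupled, so CS2R gives 0
clock_via_cs2r = read_cs2r(sr_file, 80)  # SR_ClockLo is coupled
```

The model also returns zero for any register absent from sr_file, which mirrors the rule that unimplemented special registers read as zero.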

Refer to special registers by name, e.g. SR_LaneId, rather than by number, e.g. SR0, to avoid problems if the numeric assignments change in the future. The numeric names for special register operands are SR0 to SR255.

Unimplemented special registers read as zero. Unused bit fields in special registers read as zero.

The following table lists the special registers that can be read with CS2R/S2R:

Special register table
SR# Name Type/Bits CS2R (coupled) Shader Types/Description
SR0 SR_LaneId [THREAD] N valid in all shader types
4:0 Virtual thread lane, 0 to SR_VirtCfg.WarpSz-1. (Same information as bits 4:0 of SR3)
31:5 zero
SR1 Reserved N
SR2 SR_VirtCfg WARP N valid in all shader types
31:0 Virtual Configuration after floorsweeping and throttle/limit methods.
.WarpSz 5:0 Warp size is number of thread lanes per warp, constant 32.
7:6 zero
.NWarp 14:8 Virtual Number of warps per SM, (1..MAX) where MAX is 48 before SPA3.0 and 64 after.
.NArrayLower 19:16 Lower 4 bits of Virtual Number of thread arrays (CTAs). (1..MAX) where MAX = 16 for SPA 5.0 and above.
.NSM 28:20 Virtual Number of SMs total, 1 to GPU max.
.NArrayUpper 30:29 Upper bits of virtual Number of thread arrays (CTAs).
SR3 SR_VirtId [THREAD] N valid in all shader types
31:0 Virtual Id after floorsweeping and throttle/limit methods.
.LaneId 4:0 Virtual thread lane, 0 to SR_VirtCfg.WarpSz-1.
7:5 zero
.WarpId 14:8 Virtual warp id, 0 to SR_VirtCfg.NWarp-1. This field was 13:8 prior to SPA 5.3
15:15 zero
.ArrayIdLower 19:16 Lower 4 bits of Virtual thread array (CTA) id, 0 to SR_VirtCfg.NArray-1.
.SMId 28:20 Virtual SM id, 0 to SR_VirtCfg.NSM-1.
.ArrayIdUpper 30:29 Upper bits of Virtual thread array (CTA) id.
31:31 zero
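As an illustration of the SR_VirtId layout, the following Python sketch splits a 32-bit value into its fields. The helper names bits and decode_virt_id are hypothetical, and combining ArrayIdUpper and ArrayIdLower as (upper << 4) | lower is an assumption based on the "lower 4 bits" wording, not something the table states explicitly.

```python
def bits(value, hi, lo):
    """Extract the inclusive bit field value[hi:lo] from a 32-bit word."""
    return (value >> lo) & ((1 << (hi - lo + 1)) - 1)

def decode_virt_id(sr3):
    """Split an SR_VirtId word into the fields listed in the table."""
    array_lower = bits(sr3, 19, 16)
    array_upper = bits(sr3, 30, 29)
    return {
        "LaneId": bits(sr3, 4, 0),
        "WarpId": bits(sr3, 14, 8),
        "SMId": bits(sr3, 28, 20),
        # Assumption: the upper bits sit directly above the lower 4 bits.
        "ArrayId": (array_upper << 4) | array_lower,
    }

# Hand-built example: lane 5, warp 3, ArrayIdLower 0xA, SM 17, ArrayIdUpper 1.
sr3 = 5 | (3 << 8) | (0xA << 16) | (17 << 20) | (1 << 29)
fields = decode_virt_id(sr3)
```

The same bits helper applies to SR_VirtCfg, whose fields occupy the same bit positions.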
SR4..7 SR_PM0..3 SM Y valid in all shader types
31:0 Per-subpartition performance counters. They are configured via statebundles from pushbuffer methods and are context switched. Each thread sees a copy of the perf counters in its own subpartition.
SR8..11 SR_PM4..7 SM Y valid in all shader types
31:0 Shared performance counters. They are configured via statebundles from pushbuffer methods and are context switched. These perf counters count events in shared units such as MIOS and PIXOUT.
SR12 - SR14 Reserved N
SR15 SR_ORDERING_TICKET valid in pixel shader only [Warp] N
8:0 Ticket dispenser ID.
15:9 Ticket increment value. Zero for all but the last warp of a TC tile. For the last warp, the increment value is 128-N, where N is the number of warps in the TC tile.
31:16 Assigned ticket value. Used to match the global ticket counter to determine whether the warp is allowed to proceed with pixel blend operations.
SR16 SR_PRIM_TYPE [Warp] N valid in all shader types (see below)
13:0
warp input primitive type and size
Return values are:
Shader SR read value
VSa: POINT,LINE,TRIANGLE,PATCH
VSb: POINT,LINE,TRIANGLE,PATCH
TI/TS: POINT,LINE,TRIANGLE,PATCH
GS: POINT,LINE,TRIANGLE,PATCH
PS: 0
Compute: 0
Where
Type Enum value
POINT 0x0001
LINE 0x0002 | (adjacency << 8)
TRIANGLE 0x0004 | (adjacency << 8)
PATCH 0x0008 | (size of patch << 8)
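The SR_PRIM_TYPE encoding can be unpacked as sketched below. This is an illustrative Python model only: decode_prim_type is a hypothetical helper, and treating bits 7:0 as the type flag and bits 13:8 as the adjacency or patch-size field is an assumption read off the enum formulas above.

```python
PRIM_TYPES = {0x01: "POINT", 0x02: "LINE", 0x04: "TRIANGLE", 0x08: "PATCH"}

def decode_prim_type(sr16):
    """Return (type name or None, extra field) for an SR_PRIM_TYPE value."""
    value = sr16 & 0x3FFF                # the register occupies bits 13:0
    kind = PRIM_TYPES.get(value & 0xFF)  # None when the register reads 0 (PS, Compute)
    extra = value >> 8                   # adjacency for LINE/TRIANGLE, size for PATCH
    return kind, extra

triangle_adj = decode_prim_type(0x0004 | (1 << 8))  # triangle with adjacency
patch_of_3 = decode_prim_type(0x0008 | (3 << 8))    # patch with 3 control points
pixel_shader = decode_prim_type(0)                  # PS and Compute read 0
```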
SR17 SR_INVOCATION_ID [THREAD] N valid in all shader types (see below)
4:0
Primitive invocation id (when hw generates multiple instances of a primitive)                 
GS Shaders:
Hw is generating multiple instances: return(invocation id) in range of [0,InvocationCount-1]
Hw not generating multiple instances: return(0)
TI/TS Shaders: return(invocation id) in range of [0,InvocationCount-1]
Other Shaders: return(0)
SR18 SR_Y_DIRECTION [WARP] N valid in all shader types except compute
31:0
A floating point number (either -1.0 or +1.0) tied to the SetWindowOrigin method.
It can be combined with an FMUL to correct the sign of the y derivatives calculated
in an OpenGL program to ensure they work for both render-to-texture and normal rendering.
SetWindowOrigin.Mode SR read value
UPPER_LEFT +1.0f
LOWER_LEFT -1.0f
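The effect of multiplying by SR_Y_DIRECTION can be shown with a short Python sketch. The helper names sr_to_float and corrected_dy are hypothetical; the two constants are the standard IEEE-754 single-precision bit patterns for +1.0 and -1.0, and the multiplication stands in for the FMUL mentioned above.

```python
import struct

def sr_to_float(word):
    """Reinterpret a 32-bit register value as an IEEE-754 single."""
    return struct.unpack("<f", struct.pack("<I", word & 0xFFFFFFFF))[0]

UPPER_LEFT = 0x3F800000  # bit pattern of +1.0f
LOWER_LEFT = 0xBF800000  # bit pattern of -1.0f

def corrected_dy(raw_dy, sr_y_direction):
    """Model of the FMUL: scale the y derivative by the +/-1.0 factor."""
    return raw_dy * sr_to_float(sr_y_direction)

dy_upper = corrected_dy(0.25, UPPER_LEFT)  # sign unchanged
dy_lower = corrected_dy(0.25, LOWER_LEFT)  # sign flipped
```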
SR19 SR_THREAD_KILL [THREAD] N PS only (reads as zero otherwise)
31:0
A boolean register value indicating if the thread has been killed.                                  
Helper pixels (pixels with no coverage used to round out quads) start execution in a killed state.
Pixel shader threads can also be killed with the KIL instruction.
A pixel thread that is killed will have its outputs automatically discarded and
is not allowed to execute global store commands.
PS Shaders: if(thread has been killed or is helper pixel): return(0xffff_ffff) else: return(0)
Other Shaders: return(0)
SR20 SM_SHADER_TYPE [WARP] N valid in all shader types
7:0
Read the shader type of the currently running thread (Enumerated as below)
ShaderType Value
Vertex A 0
Vertex B 1
Tessellation Init 2
Tessellation 3
Geometry 4
Pixel 5
Compute 6
SR21 SR_DirectCBEWriteAddressLow [Warp] N VSB & TI only (reads as zero otherwise)
31:0 Lower 32b of CBE (circular buffer entry) address in global memory for VTG shaders.
This field is 0 unless SR_DirectCBEWriteEnable==1.
SR22 SR_DirectCBEWriteAddressHigh [Warp] N VSB & TI only (reads as zero otherwise)
7:0 Upper 8b of CBE (circular buffer entry) address in global memory for VTG shaders.
This field is 0 unless SR_DirectCBEWriteEnable==1.
31:8 reserved
SR23 SR_DirectCBEWriteEnable [Warp] N VSB & TI only (reads as zero otherwise)
0:0 Maps to CopyOutOptIn bit in SPH header. Can be valid only in VS and TS shaders.
Shader can query this bit to determine if it is the last alpha stage and
expected to write output directly to CBE structure in global memory (L2).
SR24 SR_MACHINE_ID_0 SM N valid in all shader types
31:0
A completely SW defined value, set with a PRI.  
Possible uses might include
- Affinity interpretation: enum value telling how to interpret affinity
- Double Precision or not bit
- Family & Chip Codes
- ISA codes
SR25..27 SR_MACHINE_ID_1..3 SM N valid in all shader types
31:0 Reserved for future. Reads as zero.
SR28 SR_AFFINITY SM N valid in all shader types
7:0 Affinity[0] value
15:8 Affinity[1] value
23:16 Affinity[2] value
31:24 Affinity[3] value
Each of the 4 elements in the array is a separate byte value.
SW can determine that two SMs are "affine" if they have the same affinity array values.
For GF100, Affinity[0] contains the logical GPC# an SM is attached to.
SR29 SR_INVOCATION_INFO [THREAD] N VTG only (reads as zero otherwise)
.primIndex 7:0 primitive index of the thread. Note: For vertex shaders/Single instance geometry this corresponds to laneId.
.vertexPerPrim 21:16 vertices per primitive. Note: (.primIndex * .vertexPerPrim) is used as the offset for reading the vertex handle using ISBERD.
SR30 SR_WScaleFactor_XY [WARP] N VTG only (reads as zero otherwise)
31:0 fp32 representation of XY plane scale factor (1.0 or 256.0, based on SM state).
SR31 SR_WScaleFactor_Z [WARP] N VTG only (reads as zero otherwise)
31:0 fp32 representation of Z plane scale factor (1.0 or 256.0, based on SM state).
SR32 SR_Tid [THREAD] N CS only (reads as zero otherwise)
31:0 Thread Id (all component fields combined) within CTA
.x 10:0 tid.x (Tesla compatible)
15:11 zero
.y 25:16 tid.y (Tesla compatible)
.z 31:26 tid.z (Tesla compatible)
SR33 SR_Tid.X [THREAD] N CS only (reads as zero otherwise)
10:0 Thread Id X component within CTA. Ranges between 0 and SR_NTid.X - 1
SR34 SR_Tid.Y [THREAD] N CS only (reads as zero otherwise)
9:0 Thread Id Y component within CTA. Ranges between 0 and SR_NTid.Y - 1
SR35 SR_Tid.Z [THREAD] N CS only (reads as zero otherwise)
5:0 Thread Id Z component within CTA. Ranges between 0 and SR_NTid.Z - 1
SR36 Reserved N
SR37 SR_CTAid.X [CTA] N CS only (reads as zero otherwise)
31:0 CTA Id X component within grid. Ranges between 0 and SR_NCTAid.X - 1
SR38 SR_CTAid.Y [CTA] N CS only (reads as zero otherwise)
15:0 CTA Id Y component within grid. Ranges between 0 and SR_NCTAid.Y - 1
SR39 SR_CTAid.Z [CTA] N CS only (reads as zero otherwise)
15:0 CTA Id Z component within grid. Ranges between 0 and SR_NCTAid.Z - 1
SR40 SR_NTid [CTA] N CS only (reads as zero otherwise)
12:0 Total number of live threads remaining in CTA, initialized from CTA size (set by SetCtaResourceAllocation method's ThreadCount field) and every time a warp completes, 32 is subtracted. If ThreadCount is not a multiple of 32, SR_NTid may be (32 - ThreadCount mod 32) shy of the actual remaining thread count. This count changes only when a whole warp completes. If only a subset of threads in a warp complete, SR_NTid does not change.
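A worked example helps show when SR_NTid falls short of the real thread count. The Python sketch below is a simple model of the description above, with the hypothetical helper ntid_after tracking both the register value and the actual number of live threads; it is not a statement about how the hardware implements the counter.

```python
WARP_SIZE = 32

def ntid_after(thread_count, completed_warp_sizes):
    """Return (modeled SR_NTid, actual live threads) after the given warps complete."""
    sr_ntid = thread_count
    live = thread_count
    for size in completed_warp_sizes:
        sr_ntid -= WARP_SIZE  # the register always drops by a full warp
        live -= size          # the real population drops by the warp's thread count
    return sr_ntid, live

# ThreadCount = 40: one full warp (32 threads) and one partial warp (8 threads).
after_partial = ntid_after(40, [8])   # the partial warp finishes first
after_full = ntid_after(40, [32])     # the full warp finishes first
shortfall = after_partial[1] - after_partial[0]
```

With ThreadCount = 40, the register reads 8 in both cases, but 32 threads are still live when the partial warp finishes first, a shortfall of 32 - (40 mod 32) = 24.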
SR41 SR_CirQueueIncrMinusOne [CTA] N CS only (reads as zero otherwise)
.incrMinusOne 7:0 Circular queue increment (of work) associated with this CTA, minus one. Typically this value is QMD.QueueEntriesPerCtaMinusOne. However, CWD can launch work with partially occupied CTAs when QMD.CoalesceWaitingPeriod expires. This field is 0 unless SR_CirQueueIncrMinusOne.isQueue==1.
30:8 Reserved
.isQueue 31:31 Set to 1 if task is launched as GWC circular queue, 0 if launched as grid. Note that
SR42 SR_NLATC [CTA] N CS only (reads as zero otherwise)
12:0 Total number of launched and alive (non-exited) threads of a CTA, initialized to zero at CTA launch, and incremented every warp launch by 32 threads and decremented by 32 every time a warp exits. Once a CTA is fully loaded, SR42 will only be different from SR40 if ThreadCount is not a multiple of 32, in which case SR42 will be greater than SR40 by ThreadCount mod 32.
SR43..47 Reserved N
SR48 SR_SWinLo [Global] N CS only (reads as zero otherwise)
31:24 Shared Window base address in bytes, multiple of 16MB. Set by SetShaderSharedMemoryWindow method.
SR49 SR_SWINSZ [Global] N CS only (reads as zero otherwise)
24:24 Shared Window size in bytes (CONSTANT == 16MB)
SR50 SR_SMemSz [Global] N CS only (reads as zero otherwise)
23:7 Shared memory allocated size in bytes, multiple of 128B. Set by SetSharedMemorySize method.
SR51 SR_SMemBanks [Global] N CS only (reads as zero otherwise)
5:5 Number of 32-bit banks in L1/Shared RAM (CONSTANT == 32)
SR52 SR_LWinLo [Global] N valid in all shader types
31:24 Local Window base address in bytes, multiple of 16MB. Set by SetShaderLocalMemoryWindow method.
SR53 SR_LWINSZ [Global] N valid in all shader types
24:24 Local Window size in bytes (CONSTANT == 16MB)
SR54 SR_LMemLoSz [Global] N valid in all shader types
19:4 Local memory low allocated size in bytes per thread, multiple of 16B. Set by SetShaderThreadMemoryLowSize method in Compute. Set by a field in the Shader Program Header for Graphics.
SR55 SR_LMemHiOff [Global] N valid in all shader types
24:4 Local memory high allocated offset in bytes per thread, multiple of 16B. LMemHiOff = LWINSZ - LMemHiSz, where LMemHiSz is the allocated size in bytes per thread. LMemHiSz is set by SetShaderThreadMemoryHighSize method for Compute. It is set by a field in the Shader Program Header for Graphics.
SR56 SR_EqMask [Thread] N valid in all shader types
31:0 Mask of thread position in current warp, i.e., (1<<(TID&0x1f))
SR57 SR_LtMask [Thread] N valid in all shader types
31:0 Mask of all lower thread positions in current warp, i.e., (1<<(TID&0x1f))-1
SR58 SR_LeMask [Thread] N valid in all shader types
31:0 Mask of current and all lower thread positions, i.e., SR_EqMask | SR_LtMask
SR59 SR_GtMask [Thread] N valid in all shader types
31:0 Mask of all higher thread positions, i.e., ~SR_LeMask
SR60 SR_GeMask [Thread] N valid in all shader types
31:0 Mask of current and all higher thread positions, i.e., ~SR_LtMask
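The five lane masks follow directly from the formulas given for SR56..SR60. A minimal Python sketch, with the hypothetical helper lane_masks and an explicit 32-bit mask because Python integers are unbounded:

```python
MASK32 = 0xFFFFFFFF

def lane_masks(tid):
    """Compute the SR56..SR60 mask values for a thread id, per the table formulas."""
    lane = tid & 0x1F
    eq = 1 << lane
    lt = eq - 1
    le = eq | lt
    return {
        "SR_EqMask": eq,
        "SR_LtMask": lt,
        "SR_LeMask": le,
        "SR_GtMask": ~le & MASK32,  # truncate the complement to 32 bits
        "SR_GeMask": ~lt & MASK32,
    }

masks = lane_masks(5)
```

For lane 5 this yields EqMask 0x00000020, LtMask 0x0000001F, LeMask 0x0000003F, GtMask 0xFFFFFFC0 and GeMask 0xFFFFFFE0.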
SR61 SR_RegAlloc [Warp] N valid in all shader types
7:0 Set by SetPipeline[].RegisterCount method (graphics) or RegisterCount field in QMD (for compute)
SR62..63 Reserved for future N
SR64 SR_GlobalErrorStatus SM N valid in all shader types
Mode 3:0
Presented to all warps that enter the trap handler.
Results undefined if PRI is changed when SM is not paused.
SingleStepEnabled 0:0
Indicates that the SM is in single step mode.  In the absence of other bits set in the 
Error Status Register, this is the primary indication of why a warp enters a trap handler.
This is controlled by the PRI_SM_DBGR_CONTROL0.SINGLE_STEP_MODE (settings are ENABLE/DISABLE).
Preemption 2:1
Enumeration Values:
0: NORMAL - no preemption request in progress
1: PREEMPTION_SAVE - preemption save in progress
2: PREEMPTION_RESTORE - preemption restore in progress
reserved 3:3 Reserved
Global Errors 31:4
Contains state that must be visible to all warps or errors that cannot be charged to a single warp.
Presented to all warps that enter the trap handler.
Results undefined if PRI is changed when SM is not paused.
Semantics:
- Global errors caused by warps outside of the trap handler are not guaranteed by HW to be
reported before entering trap handler. SW must flush global faults (e.g., L1 store faults)
by using a CCTL.IVALL operation.
- Global errors are not disabled while in the trap handler. (There is no double buffering.)
- Global errors caused by warps inside the trap handler are not guaranteed by HW to be reported before
leaving the trap handler. SW must flush and check for global faults before RTT.
- Cleared by CPU while SM is paused via PRI.
4:4
Indicates CPU has instructed this SM to stop.  All warps should prepare
state and then execute a BPT.PAUSE to yield control to the CPU.
This is controlled by the CPU writing to PRI_SM_DBGR_CONTROL0.STOP_TRIGGER.
Note that if any warp is executing an IDE critical section (i.e., instructions between IDE.DI and IDE.EN),
then CPU_STOP is delayed until no warp is in a critical section. However, if PRI_SM_DBGR_CONTROL0.STOP_IS_NOT_MASKABLE is set,
then the check for any warp being in a critical section is not performed while processing the CPU_STOP event.
Setting PRI_SM_DBGR_CONTROL0.STOP_IS_NOT_MASKABLE is expected to be used by the CPU to debug code in an IDE critical section.
5:5
Indicates whether any warp is in an IDE critical section.
6:6 reserved
7:7
Multiple Warp Errors
Indicates that a warp error occurred while this SM's warp error register was non-zero (a warp error is still
pending) on this SM. Since the second error's information (including its warp number) cannot be captured,
the multiple errors bit is an indication that at least one error has been lost.
8:8 reserved
9:9
Single Warp Error
Indicates that a warp error has been detected on this SM and that error has not yet been cleared (by
the CPU via PRI). When warp errors are cleared via PRI, this state is cleared immediately.
10:10
Warp Trap 1 
Indicates that at least one warp on the SM executed BPT.TRAP with an immediate of 1.
11:11
Warp Trap 2+
Indicates that at least one warp on the SM executed BPT.TRAP with an immediate of 2 or greater.
12:12 Reserved for BPT.INT. Will always be read as 0 via S2R.
31:13 Reserved
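A trap handler would typically test individual bits of SR_GlobalErrorStatus. The Python sketch below decodes a value into the fields described above; decode_global_error_status and the dictionary keys are hypothetical names, with cpu_stop chosen after the CPU_STOP event mentioned for bit 4.

```python
PREEMPTION = {0: "NORMAL", 1: "PREEMPTION_SAVE", 2: "PREEMPTION_RESTORE"}

def decode_global_error_status(sr64):
    """Split an SR_GlobalErrorStatus word into the fields listed above."""
    def bit(n):
        return bool((sr64 >> n) & 1)
    return {
        "single_step_enabled": bit(0),
        "preemption": PREEMPTION.get((sr64 >> 1) & 0x3, "UNDEFINED"),
        "cpu_stop": bit(4),
        "warp_in_ide_critical_section": bit(5),
        "multiple_warp_errors": bit(7),
        "single_warp_error": bit(9),
        "warp_trap_1": bit(10),
        "warp_trap_2_plus": bit(11),
    }

# Hand-built example: preemption save in progress, CPU stop requested,
# and one warp error pending.
status = decode_global_error_status((1 << 1) | (1 << 4) | (1 << 9))
```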
SR65 Reserved For Future N
SR66 SR_WarpErrorStatus SM N valid in all shader types
Warp Errors 7:0
Contains errors known to be caused by this thread's warp.
These errors are presented only to the warp that is responsible for the error.
Semantics:
- Only the first-detected warp error from a single warp is reported.
- If more than 1 warp error is seen, then "Multiple Warp Errors" is set and subsequent errors from the
same warp or from other warps are lost. (There is no double buffering)
- All warp errors will be reported before entering the trap handler.
- All warp errors caused by any warp in the trap handler will be ignored. Innocuous ones will be allowed silently
and blatantly dangerous/corrupting ones will result in immediate and silent warp termination.
- Error status is not cleared after the warp that caused the error reads SR_WarpErrorStatus.
Enumeration Values:
0: no errors
1: STACK_ERROR
2: API_STACK_ERROR
3: Not Used
4: PC_WRAP
5: MISALIGNED_PC
6: PC_OVERFLOW
7: Not Used
8: MISALIGNED_REG
9: ILLEGAL_INSTR_ENCODING
10: Not Used
11: ILLEGAL_INSTR_PARAM
12: INVALID_CONST_ADDR
13: OOR_REG
14: OOR_ADDR
15: MISALIGNED_ADDR
16: INVALID_ADDR_SPACE
17: Not Used
18: INVALID_CONST_ADDR_LDC
19: Not Used
20: Not Used
21: Not Used
22: PHYSICAL_STACK_OVERFLOW
23: MMU_FAULT
reserved 23:8 reserved
Trap Immediate 26:24
Non-zero indicates that this warp personally executed a BPT.TRAP
instruction. The BPT.TRAP immediate is provided in this field.
BPT.TRAP 0; is illegal.
Cleared on RTT
reserved 31:27 Reserved
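The enumeration and the Trap Immediate field can be unpacked together. This Python sketch uses the error codes listed above; decode_warp_error_status is a hypothetical helper, and codes marked "Not Used" (or outside the list) are reported with that label.

```python
WARP_ERRORS = {
    0: "no errors", 1: "STACK_ERROR", 2: "API_STACK_ERROR", 4: "PC_WRAP",
    5: "MISALIGNED_PC", 6: "PC_OVERFLOW", 8: "MISALIGNED_REG",
    9: "ILLEGAL_INSTR_ENCODING", 11: "ILLEGAL_INSTR_PARAM",
    12: "INVALID_CONST_ADDR", 13: "OOR_REG", 14: "OOR_ADDR",
    15: "MISALIGNED_ADDR", 16: "INVALID_ADDR_SPACE",
    18: "INVALID_CONST_ADDR_LDC", 22: "PHYSICAL_STACK_OVERFLOW",
    23: "MMU_FAULT",
}

def decode_warp_error_status(sr66):
    """Return (error name, BPT.TRAP immediate) from an SR_WarpErrorStatus word."""
    code = sr66 & 0xFF              # Warp Errors, bits 7:0
    trap_imm = (sr66 >> 24) & 0x7   # Trap Immediate, bits 26:24
    return WARP_ERRORS.get(code, "Not Used"), trap_imm

# Hand-built example: MISALIGNED_ADDR recorded, warp executed BPT.TRAP 2.
error, trap = decode_warp_error_status((2 << 24) | 15)
```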
SR67 Reserved N
SR68 - SR71 Reserved for future N
SR72..75 SR_PM_HI0..3 SM Y valid in all shader types
7:0 Upper 8 bits of per-subpartition performance counter. Lower 32 bits are in SR_PM0..3. Note: The upper 8 bits saturate when their value becomes 0xFF. They can be reset only by a PRI write to NV_PTPC_PRI_SM_DSM_PERF_COUNTER_STATUS_S[n], where n = 0..3 specifies the subpartition the warp belongs to. Each register has four 8-bit wide fields, one corresponding to each of SR_PM_HI0..3 in the subpartition.
SR76..79 SR_PM_HI4..7 SM Y valid in all shader types
7:0 Upper 8 bits of shared performance counter. Lower 32 bits are in SR_PM4..7. Note: The upper 8 bits saturate when their value becomes 0xFF. They can be reset only by a PRI write to NV_PTPC_PRI_SM_DSM_PERF_COUNTER_STATUS1. There are four 8-bit wide fields, one for each of SR_PM_HI4..7.
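Software that wants the full counter has to join the two halves. A minimal Python sketch, assuming the value pairs were read from a matching SR_PMn / SR_PM_HIn pair; perf_counter_40 is a hypothetical helper, and the saturation flag simply reflects the 0xFF rule described above.

```python
def perf_counter_40(sr_pm, sr_pm_hi):
    """Combine SR_PMn (low 32 bits) and SR_PM_HIn (upper 8 bits) into one value."""
    hi = sr_pm_hi & 0xFF
    value = (hi << 32) | (sr_pm & 0xFFFFFFFF)
    saturated = hi == 0xFF  # upper bits stick at 0xFF until reset via PRI
    return value, saturated

counter, is_saturated = perf_counter_40(0x00000010, 0x01)
pinned = perf_counter_40(0x00000000, 0xFF)
```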
SR80 SR_ClockLo SM Y valid in all shader types
31:0 Real-time (tepid) SM clock counter, low 32-bits, wraps silently.
SR81 SR_ClockHi SM Y valid in all shader types
31:0 Real-time (tepid) SM clock counter, high 32-bits, wraps silently.
SR82 SR_GlobalTimerLo SM Y valid in all shader types
31:0 Lower 32 bits of globally synchronized (PTIMER) timestamp (currently in nanoseconds, with resolution of 32 ns).
SR83 SR_GlobalTimerHi SM Y valid in all shader types
31:0 Upper 32 bits of globally synchronized (PTIMER) timestamp (currently in nanoseconds, with resolution of 32 ns).
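Because the Lo and Hi halves of the clock and global timer are read by separate instructions, the low word can carry into the high word between the two reads. The Python sketch below shows the common high/low/high re-read technique for assembling a consistent 64-bit value. This technique is not described in this document; read_timer64 and the FakeTimer class are hypothetical and exist only to make the example runnable.

```python
def read_timer64(read_hi, read_lo):
    """Read hi, lo, hi again; retry if the high word changed in between."""
    while True:
        hi = read_hi()
        lo = read_lo()
        if read_hi() == hi:  # no carry into the high word between reads
            return (hi << 32) | lo

class FakeTimer:
    """Hypothetical stand-in for the counter; advances by `step` on every read."""
    def __init__(self, start, step):
        self.now = start
        self.step = step

    def _tick(self):
        value = self.now
        self.now = (self.now + self.step) & 0xFFFFFFFFFFFFFFFF
        return value

    def hi(self):
        return self._tick() >> 32

    def lo(self):
        return self._tick() & 0xFFFFFFFF

# Start just below a carry out of the low word so the first attempt is torn.
timer = FakeTimer(start=0xFFFFFFE0, step=32)
stamp = read_timer64(timer.hi, timer.lo)
```

In this run the first attempt sees the high word change from 0 to 1 and is discarded; a single hi-then-lo read would have produced 0 instead of a value near 0x100000000.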
SR84 - SR95 Reserved for future N
SR96 SR_HwTaskId [Warp] N CS only (reads as zero otherwise)
4:0 HW internal task id assigned by CWD. The maximum number of tasks is a per-chip constant; the architectural limit is 32 tasks.
31:5 reserved
SR97 SR_CircularQueueEntryIndex [Warp] N CS only (reads as zero otherwise)
23:0 Index to the GPU Work Creation (GWC) circular queue. Used for ticket # to an ordered queue. This field is 0 unless SR_CirQueueIncrMinusOne.isQueue==1.
31:24 reserved
SR98 SR_CircularQueueEntryAddressLow [Warp] N CS only (reads as zero otherwise)
31:0 Lower 32b of GWC circular queue entry address. This field is 0 unless SR_CirQueueIncrMinusOne.isQueue==1.
SR99 SR_CircularQueueEntryAddressHigh [Warp] N CS only (reads as zero otherwise)
7:0 Upper 8b of GWC circular queue entry address. This field is 0 unless SR_CirQueueIncrMinusOne.isQueue==1.
31:8 reserved
SR100 - SR255 Reserved for future N

Examples:

S2R R0, SR_LaneId;
S2R R0, SR2;
