SPA 5.0:
{@{!}Pg}
S2R
Rd, SRa
{&req_6}
{&rdN}
{&wrN}
{?sched}
;
{@{!}Pg}
CS2R
Rd, SRa
{&req_6}
{?sched}
;
Both CS2R and S2R move special register SRa to register Rd.
These special registers typically hold architectural state that is set up by external methods or bundles rather than produced by program execution itself. S2R is a variable-latency read of special registers.
A subset of these special registers is tightly coupled with instruction execution, i.e. they can be read and written back to the register file with fixed latency from the issue cycle. This subset can be read via the CS2R instruction. Reading special registers outside this subset with CS2R returns 0.
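The difference between the two reads can be illustrated with a small software model. This is only a sketch under assumptions: the names `COUPLED_SRS`, `s2r`, and `cs2r` and the dictionary-based register state are hypothetical, latency is not modeled, and the coupled set is copied from the "CS2R (coupled)" column of the table in this section (SR4..11 and SR72..83).

```python
# Hypothetical software model of S2R/CS2R read semantics (not hardware code).
# Coupled registers per the "CS2R (coupled)" column of the table:
# SR_PM0..7 (SR4..11), SR_PM_HI0..7 (SR72..79),
# SR_ClockLo/Hi and SR_GlobalTimerLo/Hi (SR80..83).
COUPLED_SRS = set(range(4, 12)) | set(range(72, 84))
MASK32 = 0xFFFFFFFF


def _check(sr_num):
    # Numeric special register operands are SR0 to SR255.
    if not 0 <= sr_num <= 255:
        raise ValueError("special register operands are SR0..SR255")


def s2r(sr_state, sr_num):
    """Model of S2R: read any special register.

    sr_state maps register number -> 32-bit value; registers missing from
    the map stand in for unimplemented registers, which read as zero.
    """
    _check(sr_num)
    return sr_state.get(sr_num, 0) & MASK32


def cs2r(sr_state, sr_num):
    """Model of CS2R: only the coupled subset is readable, others return 0."""
    _check(sr_num)
    if sr_num not in COUPLED_SRS:
        return 0
    return sr_state.get(sr_num, 0) & MASK32
```

With a state such as `{0: 7, 80: 1234}`, the model returns 7 for `s2r(state, 0)`, 0 for `cs2r(state, 0)` because SR_LaneId is not coupled, and 1234 for `cs2r(state, 80)`.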
It is desirable to use special registers by name, e.g. SR_LaneId, rather than by number, e.g. SR0, to avoid problems if special registers change numeric assignments in the future. The numeric names for special register operands are SR0 to SR255.
Unimplemented special registers read as zero. Unused bit fields in special registers read as zero.
The following table lists the special registers that can be read with CS2R/S2R:
SR# | Name | Type/Bits | CS2R (coupled) | Shader Types/Description | ||||||||||||||||||||||||
---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|
SR0 | SR_LaneId | [THREAD] | N | valid in all shader types | ||||||||||||||||||||||||
4:0 | Virtual thread lane, 0 to SR_VirtCfg.WarpSz-1. (Same information as bits 4:0 of SR3) | |||||||||||||||||||||||||||
31:5 | zero | |||||||||||||||||||||||||||
SR1 | Reserved | N | ||||||||||||||||||||||||||
SR2 | SR_VirtCfg | WARP | N | valid in all shader types | ||||||||||||||||||||||||
31:0 | Virtual Configuration after floorsweeping and throttle/limit methods. | |||||||||||||||||||||||||||
.WarpSz | 5:0 | Warp size is number of thread lanes per warp, constant 32. | ||||||||||||||||||||||||||
7:6 | zero | |||||||||||||||||||||||||||
.NWarp | 14:8 | Virtual Number of warps per SM, (1..MAX) where MAX is 48 before SPA3.0 and 64 after. | ||||||||||||||||||||||||||
.NArrayLower | 19:16 | Lower 4 bits of Virtual Number of thread arrays (CTAs). (1..MAX) where MAX = 16 for SPA 5.0 and above. | ||||||||||||||||||||||||||
.NSM | 28:20 | Virtual Number of SMs total, 1 to GPU max. | ||||||||||||||||||||||||||
.NArrayUpper | 30:29 | Upper bits of virtual Number of thread arrays (CTAs). | ||||||||||||||||||||||||||
SR3 | SR_VirtId | [THREAD] | N | valid in all shader types | ||||||||||||||||||||||||
31:0 | Virtual Id after floorsweeping and throttle/limit methods. | |||||||||||||||||||||||||||
.LaneId | 4:0 | Virtual thread lane, 0 to SR_VirtCfg.WarpSz-1. | ||||||||||||||||||||||||||
7:5 | zero | |||||||||||||||||||||||||||
.WarpId | 14:8 | Virtual warp id, 0 to SR_VirtCfg.NWarp-1. This field was 13:8 prior to SPA 5.3 | ||||||||||||||||||||||||||
15:15 | zero | |||||||||||||||||||||||||||
.ArrayIdLower | 19:16 | Lower 4 bits of Virtual thread array (CTA) id, 0 to SR_VirtCfg.NArray-1. | ||||||||||||||||||||||||||
.SMId | 28:20 | Virtual SM id, 0 to SR_VirtCfg.NSM-1. | ||||||||||||||||||||||||||
.ArrayIdUpper | 30:29 | Upper bits of Virtual thread array (CTA) id. | ||||||||||||||||||||||||||
31:31 | zero | |||||||||||||||||||||||||||
SR4..7 | SR_PM0..3 | SM | Y | valid in all shader types | ||||||||||||||||||||||||
31:0 | Per-subpartition performance counters. They are configured via statebundles from pushbuffer methods and are context switched. Each thread sees the copy of the perf counters in its own subpartition. | |||||||||||||||||||||||||||
SR8..11 | SR_PM4..7 | SM | Y | valid in all shader types | ||||||||||||||||||||||||
31:0 | Shared performance counters. They are configured via statebundles from pushbuffer methods and are context switched. These perf counters count events in shared units like MIOS and PIXOUT. | |||||||||||||||||||||||||||
SR12 - SR14 | Reserved | N | ||||||||||||||||||||||||||
SR15 | SR_ORDERING_TICKET | [Warp] | N | valid in pixel shader only | ||||||||||||||||||||||||
8:0 | Ticket dispenser ID. | |||||||||||||||||||||||||||
15:9 | Ticket increment value. Zero for all but the last warp of a TC tile. For the last warp, the increment value is 128-N, where N is the number of warps in the TC tile. | |||||||||||||||||||||||||||
31:16 | Assigned ticket value. Matched against the global ticket counter to determine whether the warp is allowed to proceed with pixel blend operations. | |||||||||||||||||||||||||||
SR16 | SR_PRIM_TYPE | [Warp] | N | valid in all shader types (see below) | ||||||||||||||||||||||||
13:0 | Warp input primitive type and size | |||||||||||||||||||||||||||
SR17 | SR_INVOCATION_ID | [THREAD] | N | valid in all shader types (see below) | ||||||||||||||||||||||||
4:0 | Primitive invocation id (when hw generates multiple instances of a primitive) | |||||||||||||||||||||||||||
SR18 | SR_Y_DIRECTION | [WARP] | N | valid in all shader types except compute | ||||||||||||||||||||||||
31:0 | A floating point number (either -1.0 or +1.0) tied to the SetWindowOrigin method. | |||||||||||||||||||||||||||
SR19 | SR_THREAD_KILL | [THREAD] | N | PS only (reads as zero otherwise) | ||||||||||||||||||||||||
31:0 | A boolean register value indicating if the thread has been killed. | |||||||||||||||||||||||||||
SR20 | SM_SHADER_TYPE | [WARP] | N | valid in all shader types | ||||||||||||||||||||||||
7:0 | Read the shader type of the currently running thread (Enumerated as below) | |||||||||||||||||||||||||||
SR21 | SR_DirectCBEWriteAddressLow | [Warp] | N | VSB & TI only (reads as zero otherwise) | ||||||||||||||||||||||||
31:0 | Lower 32b of the CBE (circular buffer entry) address in global memory for VTG shaders. This field is 0 unless SR_DirectCBEWriteEnable==1. | |||||||||||||||||||||||||||
SR22 | SR_DirectCBEWriteAddressHigh | [Warp] | N | VSB & TI only (reads as zero otherwise) | ||||||||||||||||||||||||
7:0 | Upper 8b of the CBE (circular buffer entry) address in global memory for VTG shaders. This field is 0 unless SR_DirectCBEWriteEnable==1. | |||||||||||||||||||||||||||
31:8 | reserved | |||||||||||||||||||||||||||
SR23 | SR_DirectCBEWriteEnable | [Warp] | N | VSB & TI only (reads as zero otherwise) | ||||||||||||||||||||||||
0:0 | Maps to the CopyOutOptIn bit in the SPH header. Can be valid only in VS and TS shaders. A shader can query this bit to determine whether it is the last alpha stage and is expected to write its output directly to the CBE structure in global memory (L2). | |||||||||||||||||||||||||||
SR24 | SR_MACHINE_ID_0 | SM | N | valid in all shader types | ||||||||||||||||||||||||
31:0 | A completely SW defined value, set with a PRI. | |||||||||||||||||||||||||||
SR25..27 | SR_MACHINE_ID_1..3 | SM | N | valid in all shader types | ||||||||||||||||||||||||
31:0 | Reserved for future. Reads as zero. | |||||||||||||||||||||||||||
SR28 | SR_AFFINITY | SM | N | valid in all shader types | ||||||||||||||||||||||||
7:0 | Affinity[0] value | |||||||||||||||||||||||||||
15:8 | Affinity[1] value | |||||||||||||||||||||||||||
23:16 | Affinity[2] value | |||||||||||||||||||||||||||
31:24 | Affinity[3] value. Each of the 4 elements in the array is a separate byte value. SW can determine that two SMs are "affine" if they have the same affinity array values. For GF100, Affinity[0] contains the logical GPC# an SM is attached to. | |||||||||||||||||||||||||||
SR29 | SR_INVOCATION_INFO | [THREAD] | N | VTG only (reads as zero otherwise) | ||||||||||||||||||||||||
.primIndex | 7:0 | primitive index of the thread. Note: For vertex shaders/Single instance geometry this corresponds to laneId. | ||||||||||||||||||||||||||
.vertexPerPrim | 21:16 | vertices per primitive. Note (.primIndex *.vertexPerPrim) is used as offset for reading vertex handle using ISBERD | ||||||||||||||||||||||||||
SR30 | SR_WScaleFactor_XY | [WARP] | N | VTG only (reads as zero otherwise) | ||||||||||||||||||||||||
31:0 | fp32 representation of the XY plane scale factor (1.0 or 256.0, based on SM state). | |||||||||||||||||||||||||||
SR31 | SR_WScaleFactor_Z | [WARP] | N | VTG only (reads as zero otherwise) | ||||||||||||||||||||||||
31:0 | fp32 representation of the Z plane scale factor (1.0 or 256.0, based on SM state). | |||||||||||||||||||||||||||
SR32 | SR_Tid | [THREAD] | N | CS only (reads as zero otherwise) | ||||||||||||||||||||||||
31:0 | Thread Id (all component fields combined) within CTA | |||||||||||||||||||||||||||
.x | 10:0 | tid.x (Tesla compatible) | ||||||||||||||||||||||||||
15:11 | zero | |||||||||||||||||||||||||||
.y | 25:16 | tid.y (Tesla compatible) | ||||||||||||||||||||||||||
.z | 31:26 | tid.z (Tesla compatible) | ||||||||||||||||||||||||||
SR33 | SR_Tid.X | [THREAD] | N | CS only (reads as zero otherwise) | ||||||||||||||||||||||||
10:0 | Thread Id X component within CTA. Ranges between 0 and SR_NTid.X - 1 | |||||||||||||||||||||||||||
SR34 | SR_Tid.Y | [THREAD] | N | CS only (reads as zero otherwise) | ||||||||||||||||||||||||
9:0 | Thread Id Y component within CTA. Ranges between 0 and SR_NTid.Y - 1 | |||||||||||||||||||||||||||
SR35 | SR_Tid.Z | [THREAD] | N | CS only (reads as zero otherwise) | ||||||||||||||||||||||||
5:0 | Thread Id Z component within CTA. Ranges between 0 and SR_NTid.Z - 1 | |||||||||||||||||||||||||||
SR36 | Reserved | N | ||||||||||||||||||||||||||
SR37 | SR_CTAid.X | [CTA] | N | CS only (reads as zero otherwise) | ||||||||||||||||||||||||
31:0 | CTA Id X component within grid. Ranges between 0 and SR_NCTAid.X - 1 | |||||||||||||||||||||||||||
SR38 | SR_CTAid.Y | [CTA] | N | CS only (reads as zero otherwise) | ||||||||||||||||||||||||
15:0 | CTA Id Y component within grid. Ranges between 0 and SR_NCTAid.Y - 1 | |||||||||||||||||||||||||||
SR39 | SR_CTAid.Z | [CTA] | N | CS only (reads as zero otherwise) | ||||||||||||||||||||||||
15:0 | CTA Id Z component within grid. Ranges between 0 and SR_NCTAid.Z - 1 | |||||||||||||||||||||||||||
SR40 | SR_NTid | [CTA] | N | CS only (reads as zero otherwise) | ||||||||||||||||||||||||
12:0 | Total number of live threads remaining in CTA, initialized from CTA size (set by SetCtaResourceAllocation method's ThreadCount field) and every time a warp completes, 32 is subtracted. If ThreadCount is not a multiple of 32, SR_NTid may be (32 - ThreadCount mod 32) shy of the actual remaining thread count. This count changes only when a whole warp completes. If only a subset of threads in a warp complete, SR_NTid does not change. | |||||||||||||||||||||||||||
SR41 | SR_CirQueueIncrMinusOne | [CTA] | N | CS only (reads as zero otherwise) | ||||||||||||||||||||||||
.incrMinusOne | 7:0 | Circular queue increment (of work) associated with this CTA, minus one. Typically this value is QMD.QueueEntriesPerCtaMinusOne. However, CWD can launch work with partially occupied CTAs when QMD.CoalesceWaitingPeriod expires. This field is 0 unless SR_CirQueueIncrMinusOne.isQueue==1. | ||||||||||||||||||||||||||
30:8 | Reserved | |||||||||||||||||||||||||||
.isQueue | 31:31 | Set to 1 if task is launched as GWC circular queue, 0 if launched as grid. Note that | ||||||||||||||||||||||||||
SR42 | SR_NLATC | [CTA] | N | CS only (reads as zero otherwise) | ||||||||||||||||||||||||
12:0 | Total number of launched and alive (non-exited) threads of a CTA, initialized to zero at CTA launch, and incremented every warp launch by 32 threads and decremented by 32 every time a warp exits. Once a CTA is fully loaded, SR42 will only be different from SR40 if ThreadCount is not a multiple of 32, in which case SR42 will be greater than SR40 by ThreadCount mod 32. | |||||||||||||||||||||||||||
SR43..47 | Reserved | N | ||||||||||||||||||||||||||
SR48 | SR_SWinLo | [Global] | N | CS only (reads as zero otherwise) | ||||||||||||||||||||||||
31:24 | Shared Window base address in bytes, multiple of 16MB. Set by the SetShaderSharedMemoryWindow method. | |||||||||||||||||||||||||||
SR49 | SR_SWINSZ | [Global] | N | CS only (reads as zero otherwise) | ||||||||||||||||||||||||
24:24 | Shared Window size in bytes (CONSTANT 16MB) | |||||||||||||||||||||||||||
SR50 | SR_SMemSz | [Global] | N | CS only (reads as zero otherwise) | ||||||||||||||||||||||||
23:7 | Shared memory allocated size in bytes, multiple of 128B. Set by the SetSharedMemorySize method. | |||||||||||||||||||||||||||
SR51 | SR_SMemBanks | [Global] | N | CS only (reads as zero otherwise) | ||||||||||||||||||||||||
5:5 | Number of 32-bit banks in L1/Shared RAM (CONSTANT == 32) | |||||||||||||||||||||||||||
SR52 | SR_LWinLo | [Global] | N | valid in all shader types | ||||||||||||||||||||||||
31:24 | Local Window base address in bytes, multiple of 16MB. Set by the SetShaderLocalMemoryWindow method. | |||||||||||||||||||||||||||
SR53 | SR_LWINSZ | [Global] | N | valid in all shader types | ||||||||||||||||||||||||
24:24 | Local Window size in bytes (CONSTANT == 16MB) | |||||||||||||||||||||||||||
SR54 | SR_LMemLoSz | [Global] | N | valid in all shader types | ||||||||||||||||||||||||
19:4 | Local memory low allocated size in bytes per thread, multiple of 16B. Set by the SetShaderThreadMemoryLowSize method in Compute; set by a field in the Shader Program Header for Graphics. | |||||||||||||||||||||||||||
SR55 | SR_LMemHiOff | [Global] | N | valid in all shader types | ||||||||||||||||||||||||
24:4 | Local memory high allocated offset in bytes per thread, multiple of 16B. LMemHiOff = LWINSZ - LMemHiSz, where LMemHiSz is the allocated size in bytes per thread. LMemHiSz is set by the SetShaderThreadMemoryHighSize method for Compute; it is set by a field in the Shader Program Header for Graphics. | |||||||||||||||||||||||||||
SR56 | SR_EqMask | [Thread] | N | valid in all shader types | ||||||||||||||||||||||||
31:0 | Mask of thread position in current warp, i.e., (1<<(TID&0x1f)) | |||||||||||||||||||||||||||
SR57 | SR_LtMask | [Thread] | N | valid in all shader types | ||||||||||||||||||||||||
31:0 | Mask of all lower thread positions in current warp, i.e., (1<<(TID&0x1f))-1 | |||||||||||||||||||||||||||
SR58 | SR_LeMask | [Thread] | N | valid in all shader types | ||||||||||||||||||||||||
31:0 | Mask of current and all lower thread positions, i.e., SR_EqMask | SR_LtMask | |||||||||||||||||||||||||||
SR59 | SR_GtMask | [Thread] | N | valid in all shader types | ||||||||||||||||||||||||
31:0 | Mask of all higher thread positions, i.e., ~SR_LeMask | |||||||||||||||||||||||||||
SR60 | SR_GeMask | [Thread] | N | valid in all shader types | ||||||||||||||||||||||||
31:0 | Mask of current and all higher thread positions, i.e., ~SR_LtMask | |||||||||||||||||||||||||||
SR61 | SR_RegAlloc | [Warp] | N | valid in all shader types | ||||||||||||||||||||||||
7:0 | Set by SetPipeline[].RegisterCount method (graphics) or RegisterCount field in QMD (for compute) | |||||||||||||||||||||||||||
SR62..63 | Reserved for future | N | ||||||||||||||||||||||||||
SR64 | SR_GlobalErrorStatus | SM | N | valid in all shader types | ||||||||||||||||||||||||
Mode | 3:0 | Presented to all warps that enter the trap handler. | ||||||||||||||||||||||||||
SingleStepEnabled | 0:0 | Indicates that the SM is in single step mode. In the absence of other bits set in the | ||||||||||||||||||||||||||
Preemption | 2:1 | Enumeration Values: | ||||||||||||||||||||||||||
reserved | 3:3 | Reserved | ||||||||||||||||||||||||||
Global Errors | 31:4 | Contains state that must be visible to all warps or errors that cannot be charged to a single warp. | ||||||||||||||||||||||||||
4:4 | Indicates CPU has instructed this SM to stop. All warps should prepare | |||||||||||||||||||||||||||
5:5 | Indication if any warp is in IDE critical section. | |||||||||||||||||||||||||||
6:6 | reserved | |||||||||||||||||||||||||||
7:7 | Multiple Warp Errors | |||||||||||||||||||||||||||
8:8 | reserved | |||||||||||||||||||||||||||
9:9 | Single Warp Error | |||||||||||||||||||||||||||
10:10 | Warp Trap 1 | |||||||||||||||||||||||||||
11:11 | Warp Trap 2+ | |||||||||||||||||||||||||||
12:12 | Reserved for BPT.INT. Will always be read as 0 via S2R. | |||||||||||||||||||||||||||
31:13 | Reserved | |||||||||||||||||||||||||||
SR65 | Reserved For Future | N | ||||||||||||||||||||||||||
SR66 | SR_WarpErrorStatus | SM | N | valid in all shader types | ||||||||||||||||||||||||
Warp Errors | 7:0 | Contains errors known to be caused by this thread's warp. | ||||||||||||||||||||||||||
reserved | 23:8 | reserved | ||||||||||||||||||||||||||
Trap Immediate | 26:24 | Non-zero indicates that this warp personally executed a BPT.TRAP | ||||||||||||||||||||||||||
reserved | 31:27 | Reserved | ||||||||||||||||||||||||||
SR67 | Reserved | N | ||||||||||||||||||||||||||
SR68 - SR71 | Reserved for future | N | ||||||||||||||||||||||||||
SR72..75 | SR_PM_HI0..3 | SM | Y | valid in all shader types | ||||||||||||||||||||||||
7:0 | Upper 8 bits of the per-subpartition performance counter. Lower 32 bits are in SR_PM0..3. Note: The upper 8 bits saturate when their value reaches 0xFF. They can be reset only by a PRI write to NV_PTPC_PRI_SM_DSM_PERF_COUNTER_STATUS_S[n], where n = 0..3 specifies the subpartition the warp belongs to. Each register has four 8-bit-wide fields, one corresponding to each of SR_PM_HI0..3 in the subpartition. | |||||||||||||||||||||||||||
SR76..79 | SR_PM_HI4..7 | SM | Y | valid in all shader types | ||||||||||||||||||||||||
7:0 | Upper 8 bits of the shared performance counter. Lower 32 bits are in SR_PM4..7. Note: The upper 8 bits saturate when their value reaches 0xFF. They can be reset only by a PRI write to NV_PTPC_PRI_SM_DSM_PERF_COUNTER_STATUS1. There are four 8-bit-wide fields, one for each of SR_PM_HI4..7. | |||||||||||||||||||||||||||
SR80 | SR_ClockLo | SM | Y | valid in all shader types | ||||||||||||||||||||||||
31:0 | Real-time (tepid) SM clock counter, low 32-bits, wraps silently. | |||||||||||||||||||||||||||
SR81 | SR_ClockHi | SM | Y | valid in all shader types | ||||||||||||||||||||||||
31:0 | Real-time (tepid) SM clock counter, high 32-bits, wraps silently. | |||||||||||||||||||||||||||
SR82 | SR_GlobalTimerLo | SM | Y | valid in all shader types | ||||||||||||||||||||||||
31:0 | Lower 32 bits of globally synchronized (PTIMER) timestamp (currently in nanoseconds, with resolution of 32 ns). | |||||||||||||||||||||||||||
SR83 | SR_GlobalTimerHi | SM | Y | valid in all shader types | ||||||||||||||||||||||||
31:0 | Upper 32 bits of globally synchronized (PTIMER) timestamp (currently in nanoseconds, with resolution of 32 ns). | |||||||||||||||||||||||||||
SR84 - SR95 | Reserved for future | N | ||||||||||||||||||||||||||
SR96 | SR_HwTaskId | [Warp] | N | CS only (reads as zero otherwise) | ||||||||||||||||||||||||
4:0 | HW internal task id assigned by CWD. Max number of tasks is per chip constant, architectural limit is 32 tasks. | |||||||||||||||||||||||||||
31:5 | reserved | |||||||||||||||||||||||||||
SR97 | SR_CircularQueueEntryIndex | [Warp] | N | CS only (reads as zero otherwise) | ||||||||||||||||||||||||
23:0 | Index to the GPU Work Creation (GWC) circular queue. Used for ticket # to an ordered queue. This field is 0 unless SR_CirQueueIncrMinusOne.isQueue==1. | |||||||||||||||||||||||||||
31:24 | reserved | |||||||||||||||||||||||||||
SR98 | SR_CircularQueueEntryAddressLow | [Warp] | N | CS only (reads as zero otherwise) | ||||||||||||||||||||||||
31:0 | Lower 32b of GWC circular queue entry address. This field is 0 unless SR_CirQueueIncrMinusOne.isQueue==1. | |||||||||||||||||||||||||||
SR99 | SR_CircularQueueEntryAddressHigh | [Warp] | N | CS only (reads as zero otherwise) | ||||||||||||||||||||||||
7:0 | Upper 8b of GWC circular queue entry address. This field is 0 unless SR_CirQueueIncrMinusOne.isQueue==1. | |||||||||||||||||||||||||||
31:8 | reserved | |||||||||||||||||||||||||||
SR100 - SR255 | Reserved for future | N |
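Several registers in the table pack multiple fields into one 32-bit value. The following sketch shows how the SR_VirtId (SR3) and SR_Tid (SR32) layouts listed above could be unpacked in host-side tooling. The helper names `bits`, `decode_virt_id`, and `decode_tid` are hypothetical, and combining ArrayIdUpper and ArrayIdLower as `(upper << 4) | lower` is an assumption based on the "Lower 4 bits" / "Upper bits" wording, not something the table states explicitly.

```python
def bits(value, hi, lo):
    """Extract the inclusive bit range hi:lo from a 32-bit value."""
    width = hi - lo + 1
    return (value >> lo) & ((1 << width) - 1)


def decode_virt_id(raw):
    """Unpack SR_VirtId (SR3) using the bit ranges from the table."""
    array_lower = bits(raw, 19, 16)   # .ArrayIdLower
    array_upper = bits(raw, 30, 29)   # .ArrayIdUpper
    return {
        "LaneId": bits(raw, 4, 0),
        "WarpId": bits(raw, 14, 8),
        "SMId": bits(raw, 28, 20),
        # Assumption: upper bits sit directly above the 4 lower bits.
        "ArrayId": (array_upper << 4) | array_lower,
    }


def decode_tid(raw):
    """Unpack the combined SR_Tid (SR32) into its x, y, z components."""
    return {
        "x": bits(raw, 10, 0),
        "y": bits(raw, 25, 16),
        "z": bits(raw, 31, 26),
    }
```

For example, a raw SR_VirtId value built as `5 | (17 << 8) | (3 << 16) | (100 << 20) | (2 << 29)` decodes to lane 5, warp 17, SM 100, and, under the stated assumption, CTA id 35.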
S2R R0, SR_LaneId;
S2R R0, SR2;
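The lane mask registers SR56..SR60 are all defined in the table in terms of the thread's position in the warp. As a cross-check of those definitions, the sketch below computes the five masks from a lane position. The function name `lane_masks` is hypothetical, and treating the SR_LaneId value as the `TID&0x1f` position used in the table formulas is an assumption.

```python
MASK32 = 0xFFFFFFFF


def lane_masks(lane_id):
    """Compute the SR_EqMask..SR_GeMask values for one lane position.

    Follows the table formulas: Eq = 1 << lane, Lt = Eq - 1,
    Le = Eq | Lt, Gt = ~Le, Ge = ~Lt, all truncated to 32 bits.
    """
    lane = lane_id & 0x1F          # position within the 32-lane warp
    eq = 1 << lane
    lt = eq - 1
    le = eq | lt
    return {
        "SR_EqMask": eq,
        "SR_LtMask": lt,
        "SR_LeMask": le,
        "SR_GtMask": ~le & MASK32,
        "SR_GeMask": ~lt & MASK32,
    }
```

For lane 5 this gives SR_EqMask = 0x20, SR_LtMask = 0x1F, SR_LeMask = 0x3F, SR_GtMask = 0xFFFFFFC0, and SR_GeMask = 0xFFFFFFE0.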