S2R : Move Special Register to Register

Format

SPA 5.0:
        {@{!}Pg}   S2R    Rd, SRa   {&req_6}   {&rdN}   {&wrN}   {?sched}   ;   
        {@{!}Pg}   CS2R   Rd, SRa   {&req_6}                     {?sched}   ;

Description

Both CS2R and S2R move special register SRa to register Rd.

These special registers typically have architectural state that is set up by external methods or bundles and not produced by program execution itself.

S2R is a variable latency read of special registers.

A subset of these special registers is tightly coupled with instruction execution, i.e. these registers can be read and written back to the register file with fixed latency from the issue cycle. This subset can be read via the CS2R instruction. Reading special registers outside this subset with CS2R returns 0.

It is desirable to use special registers by name, e.g. SR_LaneId, rather than by number, e.g. SR0, to avoid problems if special registers change numeric assignments in the future. The numeric names for special register operands are SR0 to SR255.

Unimplemented special registers read as zero. Unused bit fields in special registers read as zero.

The following table lists the special registers that can be read with CS2R/S2R:

Special register table

SR#

Name

Type/Bits

CS2R (coupled)

Shader Types/Description

SR0

SR_LaneId

[THREAD]

N

valid in all shader types

4:0

Virtual thread lane, 0 to SR_VirtCfg.WarpSz-1. (Same information as bits 4:0 of SR3)

31:5

zero

SR1

Reserved

N

SR2

SR_VirtCfg

WARP

N

valid in all shader types

31:0

Virtual Configuration after floorsweeping and throttle/limit methods.

.WarpSz

5:0

Warp size is number of thread lanes per warp, constant 32.

7:6

zero

.NWarp

14:8

Virtual Number of warps per SM, (1..MAX) where MAX is 48 before SPA3.0 and 64 after.

.NArrayLower

19:16

Lower 4 bits of Virtual Number of thread arrays (CTAs). (1..MAX) where MAX = 16 for SPA 5.0 and above.

.NSM

28:20

Virtual Number of SMs total, 1 to GPU max.

.NArrayUpper

30:29

Upper bits of virtual Number of thread arrays (CTAs).
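As an illustration of the SR_VirtCfg layout above, the following host-side Python sketch splits a raw 32-bit value into its fields. The helper names and the sample value are hypothetical, and joining .NArrayUpper and .NArrayLower as (upper << 4) | lower is an assumption based on the "lower 4 bits" / "upper bits" wording.

```python
def bits(value, hi, lo):
    """Return bits hi:lo (inclusive) of a 32-bit value."""
    return (value >> lo) & ((1 << (hi - lo + 1)) - 1)


def decode_virt_cfg(raw):
    """Split a raw SR_VirtCfg value into the fields listed in the table."""
    # Assumption: NArray is formed as (NArrayUpper << 4) | NArrayLower.
    n_array = (bits(raw, 30, 29) << 4) | bits(raw, 19, 16)
    return {
        "WarpSz": bits(raw, 5, 0),
        "NWarp": bits(raw, 14, 8),
        "NArray": n_array,
        "NSM": bits(raw, 28, 20),
    }


# Hypothetical value: WarpSz=32, NWarp=64, NArray=16 (upper=1, lower=0), NSM=4
sample = 32 | (64 << 8) | (0 << 16) | (4 << 20) | (1 << 29)
cfg = decode_virt_cfg(sample)
```

With NArray = 16 the lower four bits are zero and the value is carried entirely by .NArrayUpper, which is why both fields have to be read.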

SR3

SR_VirtId

[THREAD]

N

valid in all shader types

31:0

Virtual Id after floorsweeping and throttle/limit methods.

.LaneId

4:0

Virtual thread lane, 0 to SR_VirtCfg.WarpSz-1.

7:5

zero

.WarpId

14:8

Virtual warp id, 0 to SR_VirtCfg.NWarp-1. This field was 13:8 prior to SPA 5.3

15:15

zero

.ArrayIdLower

19:16

Lower 4 bits of Virtual thread array (CTA) id, 0 to SR_VirtCfg.NArray-1.

.SMId

28:20

Virtual SM id, 0 to SR_VirtCfg.NSM-1.

.ArrayIdUpper

30:29

Upper bits of Virtual thread array (CTA) id.

31:31

zero

SR4..7

SR_PM0..3

SM

Y

valid in all shader types

31:0

Per Subpartition Performance counter. They are configured via statebundles from pushbuffer methods and context switched. Each thread will see a copy of perf counters in its own subpartition.

SR8..11

SR_PM4..7

SM

Y

valid in all shader types

31:0

Shared Performance counter. They are configured via statebundles from pushbuffer methods and context switched. These perf counters count events in shared units like MIOS and PIXOUT.

SR12 - SR14

Reserved

N

SR15

SR_ORDERING_TICKET

[Warp]

N

valid in pixel shader only

8:0

Ticket dispenser ID.

15:9

Ticket increment value. Zero for all but the last warp of a TC tile. For the last warp, the increment value is 128-N where N is the number of warps in the TC tile.

31:16

Assigned Ticket value. To be used to match the global ticket counter to determine if the warp is allowed to proceed with pixel blend operations.
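The three SR_ORDERING_TICKET fields can be separated as in the Python sketch below. The function names are hypothetical; only the bit positions and the 128-N rule come from the description above.

```python
def decode_ordering_ticket(raw):
    """Split SR_ORDERING_TICKET into dispenser id, increment and ticket."""
    return {
        "dispenser_id": raw & 0x1FF,       # bits 8:0
        "increment": (raw >> 9) & 0x7F,    # bits 15:9
        "ticket": (raw >> 16) & 0xFFFF,    # bits 31:16
    }


def last_warp_increment(warps_in_tile):
    """Increment value carried by the last warp of a TC tile (128-N)."""
    return 128 - warps_in_tile


# Hypothetical value: dispenser 3, last warp of a 4-warp tile, ticket 0x0102
sample = 3 | (last_warp_increment(4) << 9) | (0x0102 << 16)
fields = decode_ordering_ticket(sample)
```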

SR16

SR_PRIM_TYPE

[Warp]

N

valid in all shader types (see below)

13:0

warp input primitive type and size
Return values are:

     Shader     SR read value
     VSa:       POINT,LINE,TRIANGLE,PATCH
     VSb:       POINT,LINE,TRIANGLE,PATCH
     TI/TS:     POINT,LINE,TRIANGLE,PATCH
     GS:        POINT,LINE,TRIANGLE,PATCH
     PS:        0
     Compute:   0


Where

     Type       Enum value
     POINT      0x0001
     LINE       0x0002 | (adjacency << 8)
     TRIANGLE   0x0004 | (adjacency << 8)
     PATCH      0x0008 | (size of patch << 8)
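The SR_PRIM_TYPE encoding above can be decoded as in the following Python sketch. It assumes the primitive type occupies the low byte and that everything from bit 8 up to bit 13 is the adjacency flag or patch size, which is what the enum expressions suggest; the function name and the "NONE" label for a zero read are hypothetical.

```python
PRIM_TYPES = {0x01: "POINT", 0x02: "LINE", 0x04: "TRIANGLE", 0x08: "PATCH"}


def decode_prim_type(raw):
    """Return (type name, extra) for a raw SR_PRIM_TYPE read.

    extra is the adjacency flag for LINE/TRIANGLE and the patch size
    for PATCH. PS and Compute read 0, reported here as ("NONE", 0).
    """
    raw &= 0x3FFF                      # the field is bits 13:0
    kind = PRIM_TYPES.get(raw & 0xFF, "NONE")
    extra = raw >> 8
    return kind, extra
```

For example, `decode_prim_type(0x0004 | (1 << 8))` yields a triangle with adjacency, and `decode_prim_type(0x0008 | (16 << 8))` yields a 16-vertex patch.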

SR17

SR_INVOCATION_ID

[THREAD]

N

valid in all shader types (see below)

4:0

Primitive invocation id (when hw generates multiple instances of a primitive)
GS Shaders:
  Hw is generating multiple instances: return(invocation id) in range of [0,InvocationCount-1]
  Hw not generating multiple instances: return(0)
TI/TS Shaders:
  return(invocation id); [0,InvocationCount-1]
Other Shaders:
  return(0)

SR18

SR_Y_DIRECTION

[WARP]

N

valid in all shader types except compute

31:0

A floating point number (either -1.0 or +1.0) tied to the SetWindowOrigin method.
It can be combined with an FMUL to correct the sign of the y derivatives calculated
in an OpenGL program to ensure they work for both render-to-texture and normal rendering.


     SetWindowOrigin.Mode   SR read value
     UPPER_LEFT             +1.0f
     LOWER_LEFT             -1.0f
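The sign correction described for SR_Y_DIRECTION can be modelled on the host as follows. This is a sketch only: the function names are hypothetical, and the multiplication stands in for the FMUL that a shader would issue after the S2R.

```python
import struct

Y_DIRECTION = {"UPPER_LEFT": 1.0, "LOWER_LEFT": -1.0}


def sr_y_direction_bits(mode):
    """32-bit pattern read from SR_Y_DIRECTION for a SetWindowOrigin mode."""
    return struct.unpack("<I", struct.pack("<f", Y_DIRECTION[mode]))[0]


def corrected_dy(raw_dy, sr_bits):
    """Model the FMUL: scale a y derivative by the sign held in SR18."""
    y_dir = struct.unpack("<f", struct.pack("<I", sr_bits))[0]
    return raw_dy * y_dir
```

Because the register only ever holds +1.0 or -1.0, the multiply changes the sign of the derivative and never its magnitude.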

SR19

SR_THREAD_KILL

[THREAD]

N

PS only (reads as zero otherwise)

31:0

A boolean register value indicating if the thread has been killed.
Helper pixels (pixels with no coverage used to round out quads) start execution in a killed state.
Pixel shader threads can also be killed with the KIL instruction.
A pixel thread that is killed will have its outputs automatically discarded and
is not allowed to execute global store commands.
PS Shaders:
  if(thread has been killed or is helper pixel):
      return(0xffff_ffff)
  else:
      return(0)
Other Shaders:
  return(0)

SR20

SM_SHADER_TYPE

[WARP]

N

valid in all shader types

7:0

Read the shader type of the currently running thread (Enumerated as below)


     ShaderType          Value
     Vertex A            0
     Vertex B            1
     Tessellation Init   2
     Tessellation        3
     Geometry            4
     Pixel               5
     Compute             6

SR21

SR_DirectCBEWriteAddressLow

[Warp]

N

VSB & TI only (reads as zero otherwise)

31:0

Lower 32b of CBE (circular buffer entry) address in global memory for VTG shaders.
This field is 0 unless SR_DirectCBEWriteEnable==1.

SR22

SR_DirectCBEWriteAddressHigh

[Warp]

N

VSB & TI only (reads as zero otherwise)

7:0

Upper 8b of CBE (circular buffer entry) address in global memory for VTG shaders.
This field is 0 unless SR_DirectCBEWriteEnable==1.

31:8

reserved

SR23

SR_DirectCBEWriteEnable

[Warp]

N

VSB & TI only (reads as zero otherwise)

0:0

Maps to CopyOutOptIn bit in SPH header. Can be valid only in VS and TS shaders.
Shader can query this bit to determine if it is the last alpha stage and
expected to write output directly to CBE structure in global memory (L2).

SR24

SR_MACHINE_ID_0

SM

N

valid in all shader types

31:0

A completely SW defined value, set with a PRI.  
Possible uses might include
    - Affinity interpretation: enum value telling how to interpret affinity
    - Double Precision or not bit
    - Family & Chip Codes
    - ISA codes

SR25..27

SR_MACHINE_ID_1..3

SM

N

valid in all shader types

31:0

Reserved for future. Reads as zero.

SR28

SR_AFFINITY

SM

N

valid in all shader types

7:0

Affinity[0] value

15:8

Affinity[1] value

23:16

Affinity[2] value

31:24

Affinity[3] value
Each of the 4 elements in the array is a separate byte value.
SW can determine if two SMs are "affine" if they have the same affinity array values.
For GF100, Affinity[0] contains the logical GPC# an SM is attached to.

SR29

SR_INVOCATION_INFO

[THREAD]

N

VTG only (reads as zero otherwise)

.primIndex

7:0

primitive index of the thread. Note: For vertex shaders/Single instance geometry this corresponds to laneId.

.vertexPerPrim

21:16

vertices per primitive. Note: (.primIndex * .vertexPerPrim) is used as the offset for reading the vertex handle using ISBERD.
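The offset computation mentioned for SR_INVOCATION_INFO can be written out as below. The function name is hypothetical, and the sketch only covers the field extraction and the multiply, not the ISBERD access itself.

```python
def vertex_handle_offset(raw):
    """Return .primIndex * .vertexPerPrim from a raw SR_INVOCATION_INFO read."""
    prim_index = raw & 0xFF                  # bits 7:0
    vertex_per_prim = (raw >> 16) & 0x3F     # bits 21:16
    return prim_index * vertex_per_prim


# Hypothetical value: primitive 5 of a triangle stream (3 vertices each)
offset = vertex_handle_offset(5 | (3 << 16))
```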

SR30

SR_WScaleFactor_XY

[WARP]

N

VTG only (reads as zero otherwise)

31:0

fp32 representation of XY plane scalefactor (1.0 or 256.0 based) on SM state.

SR31

SR_WScaleFactor_Z

[WARP]

N

VTG only (reads as zero otherwise)

31:0

fp32 representation of Z plane scalefactor (1.0 or 256.0 based) on SM state.

SR32

SR_Tid

[THREAD]

N

CS only (reads as zero otherwise)

31:0

Thread Id (all component fields combined) within CTA

.x

10:0

tid.x (Tesla compatible)

15:11

zero

.y

25:16

tid.y (Tesla compatible)

.z

31:26

tid.z (Tesla compatible)
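A minimal Python sketch of the packed SR_Tid layout above, using hypothetical helper names. It shows how the combined register relates to the separate SR_Tid.X, SR_Tid.Y and SR_Tid.Z values.

```python
def unpack_tid(raw):
    """Split a packed SR_Tid value into (x, y, z)."""
    x = raw & 0x7FF             # bits 10:0
    y = (raw >> 16) & 0x3FF     # bits 25:16
    z = (raw >> 26) & 0x3F      # bits 31:26
    return x, y, z


def pack_tid(x, y, z):
    """Inverse of unpack_tid; bits 15:11 stay zero."""
    return (x & 0x7FF) | ((y & 0x3FF) << 16) | ((z & 0x3F) << 26)
```

A single S2R of SR_Tid followed by shifts and masks therefore yields the same three components as three separate reads of SR33..SR35.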

SR33

SR_Tid.X

[THREAD]

N

CS only (reads as zero otherwise)

10:0

Thread Id X component within CTA. Ranges between 0 and SR_NTid.X - 1

SR34

SR_Tid.Y

[THREAD]

N

CS only (reads as zero otherwise)

9:0

Thread Id Y component within CTA. Ranges between 0 and SR_NTid.Y - 1

SR35

SR_Tid.Z

[THREAD]

N

CS only (reads as zero otherwise)

5:0

Thread Id Z component within CTA. Ranges between 0 and SR_NTid.Z - 1

SR36

Reserved

N

SR37

SR_CTAid.X

[CTA]

N

CS only (reads as zero otherwise)

31:0

CTA Id X component within grid. Ranges between 0 and SR_NCTAid.X - 1

SR38

SR_CTAid.Y

[CTA]

N

CS only (reads as zero otherwise)

15:0

CTA Id Y component within grid. Ranges between 0 and SR_NCTAid.Y - 1

SR39

SR_CTAid.Z

[CTA]

N

CS only (reads as zero otherwise)

15:0

CTA Id Z component within grid. Ranges between 0 and SR_NCTAid.Z - 1

SR40

SR_NTid

[CTA]

N

CS only (reads as zero otherwise)

12:0

Total number of live threads remaining in CTA. Initialized from CTA size (set by SetCtaResourceAllocation method's ThreadCount field); every time a warp completes, 32 is subtracted. If ThreadCount is not a multiple of 32, SR_NTid may be (32 - ThreadCount mod 32) shy of the actual remaining thread count. This count changes only when a whole warp completes. If only a subset of threads in a warp completes, SR_NTid does not change.

SR41

SR_CirQueueIncrMinusOne

[CTA]

N

CS only (reads as zero otherwise)

.incrMinusOne

7:0

Circular queue increment (of work) associated with this CTA, minus one. Typically this value is QMD.QueueEntriesPerCtaMinusOne. However, CWD can launch work with partially occupied CTAs when QMD.CoalesceWaitingPeriod expires. This field is 0 unless SR_CirQueueIncrMinusOne.isQueue==1.

30:8

Reserved

.isQueue

31:31

Set to 1 if task is launched as GWC circular queue, 0 if launched as grid. Note that

SR42

SR_NLATC

[CTA]

N

CS only (reads as zero otherwise)

12:0

Total number of launched and alive (non-exited) threads of a CTA. Initialized to zero at CTA launch, incremented by 32 every time a warp launches, and decremented by 32 every time a warp exits. Once a CTA is fully loaded, SR42 will only be different from SR40 if ThreadCount is not a multiple of 32, in which case SR42 will be greater than SR40 by (32 - ThreadCount mod 32).

SR43..47

Reserved

N

SR48

SR_SWinLo

[Global]

N

CS only (reads as zero otherwise)

31:24

Shared Window base address in bytes, multiple of 16MB. Set by SetShaderSharedMemoryWindow method.

SR49

SR_SWINSZ

[Global]

N

CS only (reads as zero otherwise)

24:24

Shared Window size in bytes (CONSTANT 16MB)

SR50

SR_SMemSz

[Global]

N

CS only (reads as zero otherwise)

23:7

Shared memory allocated size in bytes, multiple of 128B. Set by SetSharedMemorySize method.

SR51

SR_SMemBanks

[Global]

N

CS only (reads as zero otherwise)

5:5

Number of 32-bit banks in L1/Shared RAM (CONSTANT == 32)

SR52

SR_LWinLo

[Global]

N

valid in all shader types

31:24

Local Window base address in bytes, multiple of 16MB. Set by SetShaderLocalMemoryWindow method.

SR53

SR_LWINSZ

[Global]

N

valid in all shader types

24:24

Local Window size in bytes (CONSTANT == 16MB)

SR54

SR_LMemLoSz

[Global]

N

valid in all shader types

19:4

Local memory low allocated size in bytes per thread, multiple of 16B. Set by SetShaderThreadMemoryLowSize method in Compute. Set by a field in the Shader Program Header for Graphics.

SR55

SR_LMemHiOff

[Global]

N

valid in all shader types

24:4

Local memory high allocated offset in bytes per thread, multiple of 16B. LMemHiOff = LWINSZ - LMemHiSz, where LMemHiSz is the allocated size in bytes per thread. LMemHiSz is set by SetShaderThreadMemoryHighSize method for Compute. It is set by a field in the Shader Program Header for Graphics.
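The relation LMemHiOff = LWINSZ - LMemHiSz can be checked with a short Python sketch. The function name and the validation are hypothetical; the 16MB window size and the 16B granularity are taken from the SR_LWINSZ and SR_LMemHiOff entries.

```python
LWINSZ = 1 << 24  # local window size, constant 16MB (bit 24 of SR_LWINSZ)


def lmem_hi_off(lmem_hi_sz):
    """Return the per-thread local-memory-high offset for a given size."""
    if lmem_hi_sz % 16 != 0 or not 0 <= lmem_hi_sz <= LWINSZ:
        raise ValueError("LMemHiSz must be a multiple of 16B within the window")
    return LWINSZ - lmem_hi_sz


# Hypothetical allocation of 1KB of local memory high per thread
offset = lmem_hi_off(0x400)
```

An allocation of zero bytes gives an offset equal to LWINSZ, which is why the field spans bits 24:4 rather than 23:4.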

SR56

SR_EqMask

[Thread]

N

valid in all shader types

31:0

Mask of thread position in current warp, i.e., (1<<(TID&0x1f))

SR57

SR_LtMask

[Thread]

N

valid in all shader types

31:0

Mask of all lower thread positions in current warp, i.e., (1<<(TID&0x1f))-1

SR58

SR_LeMask

[Thread]

N

valid in all shader types

31:0

Mask of current and all lower thread positions, i.e., SR_EqMask | SR_LtMask

SR59

SR_GtMask

[Thread]

N

valid in all shader types

31:0

Mask of all higher thread positions, i.e., ~SR_LeMask

SR60

SR_GeMask

[Thread]

N

valid in all shader types

31:0

Mask of current and all higher thread positions, i.e., ~SR_LtMask
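The five lane masks (SR_EqMask through SR_GeMask) follow directly from the lane position. The Python sketch below reproduces the formulas from the table; the function names are hypothetical, and the rank helper is only one illustrative use of SR_LtMask, not something this document prescribes.

```python
MASK32 = 0xFFFFFFFF


def lane_masks(lane):
    """Return the Eq/Lt/Le/Gt/Ge masks for a lane position (0..31)."""
    lane &= 0x1F
    eq = 1 << lane
    lt = eq - 1
    le = eq | lt
    return {
        "Eq": eq,
        "Lt": lt,
        "Le": le,
        "Gt": ~le & MASK32,
        "Ge": ~lt & MASK32,
    }


def rank_in_active(active_mask, lane):
    """Count active lanes below `lane` (popcount of active & LtMask)."""
    return bin(active_mask & lane_masks(lane)["Lt"]).count("1")
```

For lane 3 this gives Eq = 0x8, Lt = 0x7, Le = 0xF, Gt = 0xFFFFFFF0 and Ge = 0xFFFFFFF8.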

SR61

SR_RegAlloc

[Warp]

N

valid in all shader types

7:0

Set by SetPipeline[].RegisterCount method (graphics) or RegisterCount field in QMD (for compute)

SR62..63

Reserved for future

N

SR64

SR_GlobalErrorStatus

SM

N

valid in all shader types

Mode

3:0

Presented to all warps that enter the trap handler.
Results undefined if PRI is changed when SM is not paused.

SingleStepEnabled

0:0

Indicates that the SM is in single step mode.  In the absence of other bits set in the 
Error Status Register, this is the primary indication of why a warp enters a trap handler.
This is controlled by the PRI_SM_DBGR_CONTROL0.SINGLE_STEP_MODE (settings are ENABLE/DISABLE).

Preemption

2:1

Enumeration Values:
  0: NORMAL - no preemption request in progress
  1: PREEMPTION_SAVE - preemption save in progress
  2: PREEMPTION_RESTORE - preemption restore in progress

reserved

3:3

Reserved

Global Errors

31:4

Contains state that must be visible to all warps or errors that cannot be charged to a single warp.
Presented to all warps that enter the trap handler.
Results undefined if PRI is changed when SM is not paused.
Semantics:
  - Global errors caused by warps outside of the trap handler are not guaranteed by HW to be 
    reported before entering trap handler.  SW must flush global faults (e.g., L1 store faults) 
    by using a CCTL.IVALL operation.
  - Global errors are not disabled while in the trap handler.  (There is no double buffering.)
  - Global errors caused by warps inside the trap handler are not guaranteed by HW to be reported before 
    leaving the trap handler.  SW must flush and check for global faults before RTT.
  - Cleared by CPU while SM is paused via PRI.

4:4

Indicates CPU has instructed this SM to stop.  All warps should prepare
state and then execute a BPT.PAUSE to yield control to the CPU.
This is controlled by the CPU writing to PRI_SM_DBGR_CONTROL0.STOP_TRIGGER.
Note that if any warp is executing IDE critical section (i.e instructions between IDE.DI and  IDE.EN), 
then CPU_STOP is delayed, until no warp is in critical section. However if PRI_SM_DBGR_CONTROL0.STOP_IS_NOT_MASKABLE is set, 
then check for any warp being in critical section is not performed while processing CPU_STOP event.
Setting PRI_SM_DBGR_CONTROL0.STOP_IS_NOT_MASKABLE is expected to be used by the CPU to debug code in IDE critical section.

5:5

Indication if any warp is in IDE critical section.

6:6

reserved

7:7

Multiple Warp Errors
Indicates that a warp error occurred while this SM's warp error register was non-zero (a warp error is still
pending) on this SM.  Since the second error's information (including its warp number) cannot be captured,
the multiple errors bit is an indication that at least one error has been lost.

8:8

reserved

9:9

Single Warp Error
Indicates that a warp error has been detected on this SM and that error has not yet been cleared (by 
the CPU via PRI).  When warp errors are cleared via PRI, this state is cleared immediately.

10:10

Warp Trap 1 
Indicates that at least one warp on the SM executed BPT.TRAP with an immediate of 1.

11:11

Warp Trap 2+
Indicates that at least one warp on the SM executed BPT.TRAP with an immediate of 2 or greater.

12:12

Reserved for BPT.INT. Will always be read as 0 via S2R.

31:13

Reserved
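A trap handler or debugger front end might decode SR_GlobalErrorStatus along the lines of the Python sketch below. The dictionary keys are hypothetical labels chosen for readability; only the bit positions and the preemption enumeration come from the description above.

```python
PREEMPTION = {0: "NORMAL", 1: "PREEMPTION_SAVE", 2: "PREEMPTION_RESTORE"}

# Bit position -> hypothetical label for the single-bit global error flags.
FLAGS = {
    4: "cpu_stop",
    5: "in_ide_critical_section",
    7: "multiple_warp_errors",
    9: "single_warp_error",
    10: "warp_trap_1",
    11: "warp_trap_2_plus",
}


def decode_global_error_status(raw):
    """Decode a raw SR_GlobalErrorStatus read into named fields."""
    status = {
        "single_step_enabled": bool(raw & 1),                     # bit 0
        "preemption": PREEMPTION.get((raw >> 1) & 0x3, "UNDEFINED"),  # bits 2:1
    }
    for bit, name in FLAGS.items():
        status[name] = bool((raw >> bit) & 1)
    return status
```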

SR65

Reserved For Future

N

SR66

SR_WarpErrorStatus

SM

N

valid in all shader types

Warp Errors

7:0

   Contains errors known to be caused by this thread's warp.
   These errors are presented only to the warp that is responsible for the error.
   Semantics:
     - Only the first-detected warp error from a single warp is reported.  
     - If more than 1 warp error is seen, then "Multiple Warp Errors" is set and subsequent errors from the 
       same warp or from other warps are lost. (There is no double buffering)
     - All warp errors will be reported before entering the trap handler.
     - All warp errors caused by any warp in the trap handler will be ignored.  Innocuous ones will be allowed silently 
       and blatantly dangerous/corrupting ones will result in immediate and silent warp termination.
     - Error status is not cleared after the warp that caused the error reads SR_WarpErrorStatus.
   Enumeration Values:
     0: no errors
     1: STACK_ERROR
     2: API_STACK_ERROR
     3: Not Used 
     4: PC_WRAP
     5: MISALIGNED_PC
     6: PC_OVERFLOW
     7: Not Used
     8: MISALIGNED_REG
     9: ILLEGAL_INSTR_ENCODING
    10: Not Used
    11: ILLEGAL_INSTR_PARAM
    12: INVALID_CONST_ADDR
    13: OOR_REG
    14: OOR_ADDR
    15: MISALIGNED_ADDR
    16: INVALID_ADDR_SPACE
    17: Not Used 
    18: INVALID_CONST_ADDR_LDC
    19: Not Used
    20: Not Used 
    21: Not Used 
    22: PHYSICAL_STACK_OVERFLOW
    23: MMU_FAULT

reserved

23:8

reserved

Trap Immediate

26:24

Non-zero indicates that this warp personally executed a BPT.TRAP
instruction.  The BPT.TRAP immediate is provided in this field.
BPT.TRAP 0; is illegal.
Cleared on RTT

reserved

31:27

Reserved
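The SR_WarpErrorStatus fields can be decoded with a lookup table, as in the following Python sketch. The function name is hypothetical; the enumeration is copied from the list above, and codes marked Not Used (or beyond 23) are reported as "Not Used".

```python
WARP_ERRORS = {
    0: "no errors", 1: "STACK_ERROR", 2: "API_STACK_ERROR", 4: "PC_WRAP",
    5: "MISALIGNED_PC", 6: "PC_OVERFLOW", 8: "MISALIGNED_REG",
    9: "ILLEGAL_INSTR_ENCODING", 11: "ILLEGAL_INSTR_PARAM",
    12: "INVALID_CONST_ADDR", 13: "OOR_REG", 14: "OOR_ADDR",
    15: "MISALIGNED_ADDR", 16: "INVALID_ADDR_SPACE",
    18: "INVALID_CONST_ADDR_LDC", 22: "PHYSICAL_STACK_OVERFLOW",
    23: "MMU_FAULT",
}


def decode_warp_error_status(raw):
    """Return (error name, trap immediate) for a raw SR_WarpErrorStatus read."""
    error = WARP_ERRORS.get(raw & 0xFF, "Not Used")   # bits 7:0
    trap_immediate = (raw >> 24) & 0x7                # bits 26:24
    return error, trap_immediate
```

A non-zero trap immediate with "no errors" indicates the warp entered the handler through its own BPT.TRAP rather than through a fault.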

SR67

Reserved

N

SR68 - SR71

Reserved for future

N

SR72..75

SR_PM_HI0..3

SM

Y

valid in all shader types

7:0

Upper 8 bits of per Subpartition Performance counter. Lower 32 bits are in SR_PM0..3. Note: The upper 8 bits saturate when their value becomes 0xFF. These can be reset only by a PRI write to NV_PTPC_PRI_SM_DSM_PERF_COUNTER_STATUS_S[n], where n = 0..3 specifies the subpartition the warp belongs to. Each register has four 8-bit wide fields, one corresponding to each of SR_PM_HI0..3 in the subpartition.

SR76..79

SR_PM_HI4..7

SM

Y

valid in all shader types

7:0

Upper 8 bits of Shared Performance counter. Lower 32 bits are in SR_PM4..7. Note: The upper 8 bits saturate when their value becomes 0xFF. These can be reset only by a PRI write to NV_PTPC_PRI_SM_DSM_PERF_COUNTER_STATUS1. There are four 8-bit wide fields, one for each of SR_PM_HI4..7.
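Each performance counter is split across a 32-bit SR_PMn register and an 8-bit SR_PM_HIn register. The Python sketch below joins the two halves into a 40-bit value and flags saturation of the upper byte; the function names are hypothetical.

```python
def pm_counter_40(lo, hi):
    """Join SR_PMn (low 32 bits) and SR_PM_HIn (upper 8 bits)."""
    return ((hi & 0xFF) << 32) | (lo & 0xFFFFFFFF)


def pm_hi_saturated(hi):
    """True once the upper 8 bits have reached 0xFF and stopped counting."""
    return (hi & 0xFF) == 0xFF
```

Once `pm_hi_saturated` reports true, the combined value is no longer a reliable event count until the upper byte is reset via PRI.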

SR80

SR_ClockLo

SM

Y

valid in all shader types

31:0

Real-time (tepid) SM clock counter, low 32-bits, wraps silently.

SR81

SR_ClockHi

SM

Y

valid in all shader types

31:0

Real-time (tepid) SM clock counter, high 32-bits, wraps silently.

SR82

SR_GlobalTimerLo

SM

Y

valid in all shader types

31:0

Lower 32 bits of globally synchronized (PTIMER) timestamp (currently in nanoseconds, with resolution of 32 ns).

SR83

SR_GlobalTimerHi

SM

Y

valid in all shader types

31:0

Upper 32 bits of globally synchronized (PTIMER) timestamp (currently in nanoseconds, with resolution of 32 ns).

SR84 - SR95

Reserved for future

N

SR96

SR_HwTaskId

[Warp]

N

CS only (reads as zero otherwise)

4:0

HW internal task id assigned by CWD. Max number of tasks is per chip constant, architectural limit is 32 tasks.

31:5

reserved

SR97

SR_CircularQueueEntryIndex

[Warp]

N

CS only (reads as zero otherwise)

23:0

Index to the GPU Work Creation (GWC) circular queue. Used for ticket # to an ordered queue. This field is 0 unless SR_CirQueueIncrMinusOne.isQueue==1.

31:24

reserved

SR98

SR_CircularQueueEntryAddressLow

[Warp]

N

CS only (reads as zero otherwise)

31:0

Lower 32b of GWC circular queue entry address. This field is 0 unless SR_CirQueueIncrMinusOne.isQueue==1.

SR99

SR_CircularQueueEntryAddressHigh

[Warp]

N

CS only (reads as zero otherwise)

7:0

Upper 8b of GWC circular queue entry address. This field is 0 unless SR_CirQueueIncrMinusOne.isQueue==1.

31:8

reserved

SR100 - SR255

Reserved for future

N
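The difference between S2R and CS2R reads can be summarised with a small behavioural model. This Python sketch is illustrative only: the set of coupled registers is read off the CS2R column of the table above (SR4..11 and SR72..83), and the dictionary-based register file and function names are hypothetical.

```python
# Registers marked Y in the CS2R (coupled) column of the table.
COUPLED = set(range(4, 12)) | set(range(72, 84))


def s2r(sr_file, index):
    """Variable-latency read; unimplemented registers read as zero."""
    if not 0 <= index <= 255:
        raise ValueError("special registers are SR0 to SR255")
    return sr_file.get(index, 0) & 0xFFFFFFFF


def cs2r(sr_file, index):
    """Fixed-latency read; registers outside the coupled subset return 0."""
    return s2r(sr_file, index) if index in COUPLED else 0


# Hypothetical register file: SR0 (SR_LaneId) = 7, SR80 (SR_ClockLo) = 1234
sr_file = {0: 7, 80: 1234}
```

With this register file, `cs2r(sr_file, 80)` returns the clock value while `cs2r(sr_file, 0)` returns 0, so SR_LaneId has to be read with S2R.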

Examples:

S2R R0, SR_LaneId;
S2R R0, SR2;
