SPA 5.0:
{@{!}Pg} STL{.cop}{.sz} [Ra + ImmS24], Rb {&req_6} {&rdN} {?sched} ; // Store within Local window
{@{!}Pg} STS{.sz} [Ra + ImmS24], Rb {&req_6} {&rdN} {?sched} ; // Store within Shared window
Omit register Ra to specify an unsigned absolute address within a window:
{@{!}Pg} STL{.cop}{.sz} [ImmU24], Rb {&req_6} {&rdN} {?sched} ; // Store to absolute Local address
{@{!}Pg} STS{.sz} [ImmU24], Rb {&req_6} {&rdN} {?sched} ; // Store to absolute Shared address
.cop: { .WB*, .CG, .CS, .WT }   Cache write-back*, global, streaming, write-thru
  .WB*  Cache write-back all coherent levels (default*).
  .CG   Cache at global level (cache in L2 and below; L1 cache lines marked as evict-first).
  .CS   Cache streaming, likely to be accessed once (mark for early eviction).
  .WT   Cache write-through (to system memory).
.sz: { .8, .U8, .S8, .16, .U16, .S16, .32*, .64, .128 }
If register Ra is omitted, equal to RZ, or beyond the set of registers supported for the shader, the effective address is the zero-extended absolute unsigned immediate offset. An omitted Ra register is assembled as RZ. Otherwise, the effective address is the sum of register Ra (or {Ra+1, Ra} when .E is specified) and the sign-extended signed immediate offset. A negative offset is written as [Ra - offset] or [Ra + -offset]. An omitted immediate offset is assembled as zero. All offsets are in bytes.
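The address rules above can be sketched as a small Python model. This is only an illustration: the function names, the use of index 255 for RZ, and the wrap of the sum to 32 bits are assumptions not stated in this section, and the .E register-pair form is not modeled.

```python
RZ = 255  # assumption: register index used in this sketch to stand for RZ


def sign_extend_24(imm):
    """Interpret the low 24 bits of imm as a signed two's-complement value."""
    imm &= 0xFFFFFF
    return imm - (1 << 24) if imm & 0x800000 else imm


def effective_address(regs, ra, imm24, num_regs):
    """Model the STL/STS effective address.

    regs     : list of register values, indexed by register number
    ra       : register index, or None when Ra is omitted in the source
    imm24    : raw 24-bit immediate field (an omitted offset is 0)
    num_regs : number of registers supported for the shader
    """
    if ra is None:
        ra = RZ  # an omitted Ra is assembled as RZ
    if ra == RZ or ra >= num_regs:
        # Absolute form: zero-extended unsigned immediate.
        return imm24 & 0xFFFFFF
    # Register form: Ra plus the sign-extended immediate.
    # Wrapping to 32 bits is an assumption of this sketch.
    return (regs[ra] + sign_extend_24(imm24)) & 0xFFFFFFFF
```

For instance, with R3 holding 0x100, the operand [R3 - 16] encodes the immediate as 0xFFFFF0 and the model yields 0xF0, while [0x12] with Ra omitted yields 0x12.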
Memory addresses must be naturally aligned, that is, on a byte address that is a multiple of the access size. Misaligned addresses are forced to align to the access size and can optionally raise an error. An address outside the window or outside the allocated memory within the window sets Rd to 0 and causes an error.
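As a rough illustration of natural alignment, the sketch below assumes that "forced to align" means clearing the low address bits (rounding down to a multiple of the access size); the section does not spell out the rounding direction, and the function name is hypothetical.

```python
def align_address(addr, size_bytes):
    """Return (aligned_addr, was_misaligned) for a given access size.

    size_bytes is the access size in bytes: 1, 2, 4, 8 or 16,
    matching .8, .16, .32, .64 and .128.
    Assumption: forced alignment rounds down by clearing the low bits.
    """
    if size_bytes not in (1, 2, 4, 8, 16):
        raise ValueError("unsupported access size")
    aligned = addr & ~(size_bytes - 1)
    return aligned, aligned != addr
```

Under this assumption a .32 store to 0x1236 would land at 0x1234, and the second return value indicates that an error could optionally be raised.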
When STL is used in a pixel shader, the hardware automatically predicates off helper pixels and killed pixels to prevent unwanted writes to local memory. If a pixel's raster coverage is 0 or the pixel has previously been killed using the KIL operation, its thread will not participate in any STL operations.
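The pixel-shader rule can be read as a per-thread predicate. The sketch below is a minimal model with hypothetical names; it covers only the two conditions named above and ignores the instruction's own @Pg predicate.

```python
def stl_participates(raster_coverage, killed):
    """True if a pixel-shader thread takes part in STL operations.

    raster_coverage : the pixel's coverage value (0 for a helper pixel)
    killed          : True once the pixel has been killed with KIL
    """
    return raster_coverage != 0 and not killed
```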
STL.32 [R3+0x1234], R1;   // store to Local address
STS.64 [R3 - 16], R4;     // store [R5,R4] to 64-bit Shared location
STS.32 [0x12], R1;        // store R1 to absolute location 0x12 in per-CTA Shared memory