SPA 5.0:
{@{!}Pg} ST{.E}{.cop}{.sz} [Ra + ImmS32], Rb {,Plg} {&req_6} {&rdN} {?sched} ; // Store
.E: Extended address (64 bits, requires two registers)
.cop: { .WB*, .CG, .CS, .WT } Cache write-back*, global, streaming, write-through
    .WB*  Cache write-back all coherent levels (default*)
    .CG   Cache at global level (cache in L2 and below, not L1)
    .CS   Cache streaming, likely to be accessed once (mark for early eviction)
    .WT   Cache write-through (to system memory)
.sz: { .8, .U8, .S8, .16, .U16, .S16, .32*, .64, .128 } Bit size stored in memory
{@{!}Pg} ST{.E}{.cop}{.sz} [ImmU32], Rb {,Plg} {&req_6} {&rdN} {?sched} ; // Store to absolute address by omitting register Ra
ST stores register Rb to memory at a generic thread address specified as [Ra + ImmS32] or as [ImmU32].
If register Ra is omitted, equal to RZ, or beyond the set of registers supported for the shader, the effective address is the zero-extended absolute unsigned immediate offset. An omitted Ra register is assembled as RZ. Otherwise, the effective address is the sum of register Ra (or the register pair {Ra+1, Ra} when .E is specified) and the sign-extended signed immediate offset. A negative offset is written as [Ra - offset] or [Ra + -offset]. An omitted immediate offset is assembled as zero. All offsets are in bytes.
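The address rule above can be sketched in Python as a reading aid. This is a minimal model, not the hardware definition. The names `RZ_INDEX`, `sign_extend_32`, and `effective_address` are hypothetical, as is the encoding of RZ as index 255. Wrapping the result to 32 bits (or 64 bits with .E) and reading an out-of-range upper register as zero are also assumptions that the text does not state.

```python
MASK32 = 0xFFFFFFFF
MASK64 = 0xFFFFFFFFFFFFFFFF
RZ_INDEX = 255  # assumption: hypothetical encoding of the zero register RZ


def sign_extend_32(value):
    """Interpret the low 32 bits of value as a signed two's-complement integer."""
    value &= MASK32
    return value - (1 << 32) if value & 0x80000000 else value


def effective_address(regs, ra, imm, extended=False):
    """Model the ST effective address.

    regs     : list of 32-bit register values available to the shader
    ra       : register index, or None when Ra is omitted
    imm      : immediate offset in bytes
    extended : True when .E is specified (base is the pair {Ra+1, Ra})
    """
    # Ra omitted, RZ, or beyond the supported registers: absolute address,
    # taken as the zero-extended unsigned immediate.
    if ra is None or ra == RZ_INDEX or ra >= len(regs):
        return imm & MASK32

    base = regs[ra] & MASK32
    if extended:
        # Assumption: an upper register outside the file reads as zero.
        high = regs[ra + 1] & MASK32 if ra + 1 < len(regs) else 0
        base |= high << 32

    # Assumption: the sum wraps at the address width.
    width_mask = MASK64 if extended else MASK32
    return (base + sign_extend_32(imm)) & width_mask
```

For example, with R1 holding 0x1000, `effective_address(regs, 1, 20)` gives 0x1014, while omitting Ra returns the immediate itself.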
The generic thread address space can access Global, Local, or Shared memory. Plg is an optional predicate indicating which thread addresses map to the local or global address spaces. If not specified, Plg is assumed to be PT (the true predicate), i.e. local or global memory.
Memory addresses must be naturally aligned, that is, on a byte address that is a multiple of the access size. Misaligned addresses are forced to align to the access size and can optionally raise an error. An address outside the window or outside the allocated memory within the window causes an error.
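The natural-alignment rule can be illustrated with a short Python sketch. The function name `force_align` is hypothetical. The sketch assumes that "forced to align" means rounding the address down by clearing its low bits; the text does not say in which direction the address is adjusted.

```python
def force_align(address, size_bits):
    """Return (aligned_address, was_misaligned) for an access of size_bits.

    size_bits is one of the .sz widths: 8, 16, 32, 64, or 128.
    Assumption: a misaligned address is rounded down to the nearest
    multiple of the access size.
    """
    size_bytes = size_bits // 8
    # All supported sizes are powers of two, so masking the low bits works.
    aligned = address & ~(size_bytes - 1)
    return aligned, aligned != address
```

Under this assumption a 32-bit store to 0x1003 would land at 0x1000 and be flagged as misaligned, while any byte address is already aligned for an 8-bit store.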
When used in a pixel shader, the ST operation has helper pixels and killed pixels automatically predicated off by the HW to prevent unwanted writes to global memory. If the pixel's raster coverage is 0 or it has previously been killed using the KIL operation, the threads will not participate in any ST operations. Even if the memory window detection would eventually turn the generic ST into a local store, it will be suppressed. A pixel shader that wants killed and helper pixels to still perform local memory operations must use the STL instruction.
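The pixel shader suppression described above amounts to an extra implicit predicate on each thread. A minimal Python sketch follows; the function and parameter names are hypothetical, and it models only the conditions named in the text (the guard predicate Pg, raster coverage, and a prior KIL).

```python
def st_participates(guard_predicate, raster_coverage, killed):
    """Return True if a pixel shader thread performs the generic ST.

    guard_predicate : value of the {!}Pg guard for this thread
    raster_coverage : the pixel's raster coverage (0 for a helper pixel)
    killed          : True if the pixel was previously killed with KIL
    """
    # Helper pixels and killed pixels are predicated off, even when the
    # store would have resolved to local memory.
    return bool(guard_predicate) and raster_coverage != 0 and not killed
```

In this model STL would skip the coverage and kill checks, which is why the text directs shaders that need local stores from such pixels to that instruction.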
ST.32 [R1 + 20], R3;      // store 32-bit R3 at 20 bytes offset from byte address in R1
ST.E [R2 + 0x1234], R5;   // store 32-bit R5 at 40-bit extended address in (R2,R3) plus offset 0x1234
ST.64 [R1 + 24], R4;      // store 64-bit (R4,R5) at 24 bytes offset from byte address in R1
ST.8 [R1 + 24], R4;       // store 8 bits from R4 at 24 bytes offset from byte address in R1