SPA 5.0:
{@{!}Pg} STG{.E}{.cop}{.sz} [Ra + ImmS24], Rb {&req_6} {&rdN} {?sched} ; // Store
.E: Extended address (64 bits, requires two registers)
.cop: { .WB*, .CG, .CS, .WT }  Cache write-back*, global, streaming, write-through
  .WB*  Cache write-back all coherent levels (default*)
  .CG   Cache at global level (cache in L2 and below, not L1)
  .CS   Cache streaming, likely to be accessed once (mark for early eviction)
  .WT   Cache write-through (to system memory)
.sz: { .8, .U8, .S8, .16, .U16, .S16, .32*, .64, .128 }  Bit size stored in memory
{@{!}Pg} STG{.E}{.cop}{.sz} [ImmU24], Rb {&req_6} {&rdN} {?sched} ; // Store to absolute address by omitting register Ra
STG stores register Rb to memory at a global thread address specified as [Ra + ImmS24] or as [ImmU24].
If register Ra is omitted, equal to RZ, or beyond the set of registers supported for the shader, the effective address is the zero-extended absolute unsigned immediate offset. An omitted Ra register is assembled as RZ. Otherwise, the effective address is equal to the sum of register Ra (or {Ra+1, Ra} when .E is specified) and the sign-extended signed immediate offset. A negative offset is written as [Ra - offset] or [Ra + -offset]. An omitted immediate offset is assembled as zero. All offsets are in bytes.
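The address rules above can be sketched as a small Python model. This is an illustration only, not the hardware definition: the function names, the dictionary used as a register file, and the use of None to stand for an omitted Ra or RZ are all hypothetical. The 24-bit immediate width is taken from the syntax lines, and wrapping the sum to 32 or 64 bits is an assumption.

```python
MASK32 = 0xFFFFFFFF
MASK64 = 0xFFFFFFFFFFFFFFFF


def sign_extend(value, bits):
    """Interpret the low `bits` bits of value as a two's-complement integer."""
    value &= (1 << bits) - 1
    sign = 1 << (bits - 1)
    return (value ^ sign) - sign


def stg_effective_address(regs, ra, imm, extended=False):
    """Model of the STG effective address (hypothetical helper).

    regs     -- dict mapping register index to its 32-bit value
    ra       -- register index, or None for an omitted Ra / RZ
    imm      -- immediate offset in bytes (only the low 24 bits are used)
    extended -- True when .E is specified, so the base is {Ra+1, Ra}
    """
    imm &= 0xFFFFFF
    # Omitted Ra, RZ, or a register outside the supported set:
    # the immediate is a zero-extended absolute address.
    if ra is None or ra not in regs:
        return imm
    offset = sign_extend(imm, 24)
    if extended:
        base = (regs[ra + 1] << 32) | regs[ra]
        return (base + offset) & MASK64  # assumption: wraps at 64 bits
    return (regs[ra] + offset) & MASK32  # assumption: wraps at 32 bits
```

For instance, with R1 holding 0x1000, an offset of 20 gives 0x1014 and an offset of -16 gives 0xFF0, while the same bit pattern with Ra omitted is read as the unsigned absolute address 0xFFFFF0.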
Memory addresses must be naturally aligned, that is, on a byte address that is a multiple of the access size. Misaligned addresses are forced to align to the access size and can optionally raise an error.
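A minimal sketch of the alignment rule follows. It assumes that "forced to align" means clearing the low address bits (rounding down to the nearest multiple of the access size); the text does not state the rounding direction, so treat that as an assumption. The function name and the returned flag are hypothetical.

```python
def force_align(address, access_bytes):
    """Return (aligned_address, was_misaligned) for a power-of-two access size.

    Assumption: misaligned addresses are rounded down by clearing low bits.
    """
    if access_bytes <= 0 or access_bytes & (access_bytes - 1):
        raise ValueError("access size must be a positive power of two")
    aligned = address & ~(access_bytes - 1)
    return aligned, aligned != address
```

Under this assumption, a 4-byte store to 0x1016 would land at 0x1014, and the second element of the result is the condition on which an implementation could raise the optional error.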
When used in a pixel shader, the STG operation has helper pixels and killed pixels automatically predicated off by the HW to prevent unwanted writes to global memory. If the pixel's raster coverage is 0 or the pixel has previously been killed using the KIL operation, its thread does not participate in any STG operation.
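The pixel-shader rule can be read as an extra implicit predicate on each thread. The sketch below models only that condition; the function and parameter names are hypothetical, and the guard predicate @Pg from the syntax line is deliberately left out.

```python
def stg_thread_participates(raster_coverage, killed):
    """Model of the implicit pixel-shader predicate on STG (hypothetical helper).

    raster_coverage -- the pixel's coverage mask; 0 marks a helper pixel
    killed          -- True if the pixel was previously killed with KIL
    """
    return raster_coverage != 0 and not killed
```

A thread whose pixel has zero coverage, or that has executed KIL, evaluates to False here and so performs no write.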
STG.32 [R1 + 20], R3;       // store 32-bit R3 at 20 bytes offset from byte address in R1
STG.E [R2 + 0x1234], R5;    // store 32-bit R5 at 40-bit extended address in (R2,R3) plus offset 0x1234
STG.64 [R1 + 24], R4;       // store 64-bit (R4,R5) at 24 bytes offset from byte address in R1
STG.8 [R1 + 24], R4;        // store 8 bits of R4 at 24 bytes offset from byte address in R1