SPA 5.0:
{@{!}Pg} LD{.E}{.cop}{.sz} Rd, [Ra + ImmS32] {, Plg} {&req_6} {&rdN} {&wrN} {?sched} ; // Load
.E      Extended address (64 bits, requires two registers)
.cop:   { .CA*, .CG, .CS, .LU, .CV, .CI } // Cache all*, global, streaming, last-use, volatile, inconsistent
  .CA*  Cache at all levels, likely to be accessed again (default).
  .CG   Cache at global level (cache in L2 and below, not L1).
  .CS   Maps to .CA.
  .LU   Maps to .CG.
  .CV   Cache as volatile (consider cached system memory lines stale, fetch again).
  .CI   Cache as inconsistent data (expected to be used only with invariant data).
.sz:    { .U8, .S8, .U16, .S16, .32*, .64, .128, .U.128 } // Bit size in memory, unsigned or sign-extended
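The unsigned and sign-extended .sz variants can be illustrated with a small Python model. This is a sketch only, not the hardware definition: it assumes a 32-bit destination register (consistent with the 128-bit example that fills four registers), and the function name extend_to_register is hypothetical.

```python
# Hypothetical model of sub-word .sz handling; not taken from the hardware spec.
# Assumption: the destination register Rd is 32 bits wide.
SUBWORD_SZ = {
    ".U8": (8, False),   # 8 bits in memory, zero-extended
    ".S8": (8, True),    # 8 bits in memory, sign-extended
    ".U16": (16, False), # 16 bits in memory, zero-extended
    ".S16": (16, True),  # 16 bits in memory, sign-extended
}

def extend_to_register(raw, sz):
    """Return the 32-bit register image for a sub-word value read from memory."""
    bits, signed = SUBWORD_SZ[sz]
    raw &= (1 << bits) - 1               # keep only the bits that were loaded
    if signed and raw & (1 << (bits - 1)):
        raw -= 1 << bits                 # propagate the sign bit
    return raw & 0xFFFFFFFF              # two's-complement image in 32 bits
```

For example, the byte 0xFF becomes 0x000000FF under .U8 and 0xFFFFFFFF under .S8.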
{@{!}Pg} LD{.E}{.cop}{.sz} Rd, [ImmU32] {, Plg} {&req_6} {&rdN} {&wrN} {?sched} ; // Load from absolute address by omitting Ra
LD loads register Rd from memory at a generic thread address specified as [Ra + ImmS32] or as [ImmU32].
If register Ra is omitted, equal to RZ, or beyond the set of registers supported for the shader, the effective address is the zero-extended unsigned immediate offset, treated as an absolute address. An omitted Ra register is assembled as RZ. Otherwise, the effective address is the sum of register Ra (or the register pair {Ra+1, Ra} when .E is specified) and the sign-extended signed immediate offset. A negative offset is written as [Ra - offset] or [Ra + -offset]. An omitted immediate offset is assembled as zero. All offsets are in bytes.
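The effective-address rules above can be sketched in Python as follows. This is an illustrative model under stated assumptions, not the hardware implementation: the register file is a plain list of 32-bit values, an omitted Ra or RZ is represented by None, a register index past the end of the list stands in for "beyond the set of registers supported for the shader", and wraparound at 32 bits (or 64 bits with .E) is assumed. The names effective_address and sign_extend32 are hypothetical.

```python
MASK32 = 0xFFFFFFFF
MASK64 = 0xFFFFFFFFFFFFFFFF
RZ = None  # hypothetical stand-in for RZ or an omitted Ra

def sign_extend32(value):
    """Interpret the low 32 bits of value as a signed two's-complement integer."""
    value &= MASK32
    return value - (1 << 32) if value & 0x80000000 else value

def effective_address(regs, ra, imm, extended=False):
    """Model of the LD effective address; regs is a list of 32-bit register values."""
    if ra is RZ or ra >= len(regs):
        # No usable base register: zero-extended absolute immediate.
        return imm & MASK32
    base = regs[ra]
    if extended:
        # .E: the pair {Ra+1, Ra} forms the base, Ra holding the low word.
        # Assumes Ra+1 is a valid register index.
        base |= regs[ra + 1] << 32
    # Assumption: the sum wraps at the address width.
    width_mask = MASK64 if extended else MASK32
    return (base + sign_extend32(imm)) & width_mask
```

With R1 = 0x1000, the operand [R1 + 20] yields 0x1014 and [R1 - 16] yields 0xFF0 in this model.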
The generic thread address space can access Global, Local, or Shared memory. Plg is an optional predicate indicating which thread addresses map to the local or global address spaces. If not specified, Plg is assumed to be PT (true predicate), i.e. local or global memory.
.sz = .U.128 (Uniform 128 bit load) can be used to provide a performance hint to the hardware that the access will likely use the same address for all threads. Before using it, please see the performance section of the programming guide for details on how and when to use it.
Memory addresses must be naturally aligned, that is, on a byte address that is a multiple of the access size. Misaligned addresses are forced to align to the access size and can optionally raise an error. An address outside the window or outside the allocated memory within the window sets Rd to 0 and causes an error.
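As a rough illustration of natural alignment, the Python sketch below forces an address to a multiple of the access size. It assumes that forced alignment clears the low address bits (rounds down), which the text does not state explicitly, and it reports the misalignment as a flag rather than modelling the optional error. The name force_align and the ACCESS_BYTES table are hypothetical; the sizes follow the .sz list.

```python
# Access size in bytes for each .sz option (derived from the bit sizes in .sz).
ACCESS_BYTES = {
    ".U8": 1, ".S8": 1,
    ".U16": 2, ".S16": 2,
    ".32": 4, ".64": 8,
    ".128": 16, ".U.128": 16,
}

def force_align(address, sz=".32"):
    """Return (aligned_address, was_misaligned) for the given access size.

    Assumption: forced alignment rounds down by clearing the low bits.
    """
    size = ACCESS_BYTES[sz]
    aligned = address & ~(size - 1)   # size is a power of two
    return aligned, aligned != address
```

In this model, a .32 access to 0x1003 is forced to 0x1000 and flagged as misaligned, while a .128 access to 0x1010 is already aligned.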
LD.32 R3, [R1 + 20], P0;   // load 32 bits into R3 from 20 bytes offset from byte address in R1, P0 controls global/local vs shared
LD.E R0, [R2 + 0x1234];    // load 32 bits into R0 from 40-bit extended address in {R2, R3} plus offset 0x1234
LD.U.128 R4, [R1], P1;     // load 128 bits into R4, R5, R6, R7 from byte address in R1 using uniform broadcast vector path