SPA 5.0:
{@{!}Pg} LDL{.cop}{.sz} Rd, [Ra + ImmS24] {&req_6} {&rdN} {&wrN} {?sched} ; // Load within Local window
{@{!}Pg} LDS{.U}{.sz} Rd, [Ra + ImmS24] {&req_6} {&rdN} {&wrN} {?sched} ; // Load within Shared window
Omit register Ra to specify an unsigned absolute address within a window:
{@{!}Pg} LDL{.cop}{.sz} Rd, [ImmU24] {&req_6} {&rdN} {&wrN} {?sched} ; // Load from absolute Local address
{@{!}Pg} LDS{.sz} Rd, [ImmU24] {&req_6} {&rdN} {&wrN} {?sched} ; // Load from absolute Shared address
.cop: { .CA*, .CS, .LU, .CV, .CI } // Cache all*, global, streaming, last-use, volatile, inconsistent
  .CA*  Cache at all levels, likely to be accessed again (default).
  .CS   Maps to .CA.
  .LU   Last use: if the address is Local and the line is fully covered, load, then invalidate the line; otherwise evict first.
  .CV   Cache as volatile (consider cached system memory lines stale, fetch again).
  .CI   Cache as inconsistent data (expected to be used only with invariant data).
.U: uniform hint. Indicates that the addresses are likely to be
  - the same across all active threads, or
  - the same within thread pairs (tn and tn^1) for all active threads (architecture specific), or
  - the same within thread pairs (tn and tn^2) for all active threads (architecture specific).
  This hint is used to amplify the data bandwidth for shared loads.
.sz: { .U8, .S8, .U16, .S16, .32*, .64, .128 } // Bit size in memory, unsigned or sign-extended
LDL and LDS load register Rd from memory within the Local or Shared address window, respectively.
If register Ra is omitted, equal to RZ, or beyond the set of registers supported for the shader, the effective address is the zero-extended unsigned absolute immediate offset. An omitted Ra register is assembled as RZ. Otherwise, the effective address is the sum of register Ra and the sign-extended signed immediate offset. A negative offset is written as [Ra - offset] or [Ra + -offset]. An omitted immediate offset is assembled as zero. All offsets are in bytes.
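The addressing rule above can be modeled in a few lines of Python. This is only an illustrative sketch, not the hardware definition: the register index used for RZ (255), the `num_regs` parameter, and the function names are assumptions introduced here, and the sum is left unwrapped because the text does not say how overflow is handled.

```python
RZ = 255  # ASSUMPTION: encoding of the zero register; not stated in the text


def sign_extend_24(imm: int) -> int:
    """Interpret the low 24 bits of imm as a two's-complement signed value."""
    imm &= 0xFFFFFF
    return imm - (1 << 24) if imm & 0x800000 else imm


def effective_address(ra_index, ra_value, imm24, num_regs=255):
    """Byte address within the window, following the rule described above.

    ra_index is None when Ra is omitted from the assembly text.
    num_regs is a hypothetical count of registers supported for the shader.
    """
    if ra_index is None or ra_index == RZ or ra_index >= num_regs:
        # Absolute form: the immediate is an unsigned 24-bit byte address.
        return imm24 & 0xFFFFFF
    # Register form: Ra plus the sign-extended 24-bit byte offset.
    return ra_value + sign_extend_24(imm24)
```

For example, `[R1 - 0x004]` with R1 holding 0x100 corresponds to `effective_address(1, 0x100, 0xFFFFFC)`, while `[424]` corresponds to `effective_address(None, 0, 424)`.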
Each address window is 16 MB in size; the allocated per-thread Local and per-CTA Shared sizes are set by the driver.
Memory addresses must be naturally aligned, on a byte address that is a multiple of the access size. Misaligned addresses are forced to align to access size and can optionally raise an error. An address outside the window or outside the allocated memory within the window sets Rd to 0 and causes an error.
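The alignment and range rules can be sketched as a small checker. The text says misaligned addresses are forced to align but not in which direction, so rounding down (clearing the low address bits) is an assumption here, as are the function name and the `allocated_bytes` parameter standing in for the driver-set allocation.

```python
WINDOW_BYTES = 16 * 1024 * 1024  # each address window is 16 MB


def check_load(addr, size_bytes, allocated_bytes):
    """Return (aligned_addr, in_bounds, misaligned) for one thread's load.

    size_bytes is the access size (1, 2, 4, 8, or 16).
    allocated_bytes is a hypothetical stand-in for the driver-set allocation.
    """
    misaligned = addr % size_bytes != 0
    # ASSUMPTION: "forced to align" means rounding down to a multiple of the size.
    aligned = addr - (addr % size_bytes)
    limit = min(allocated_bytes, WINDOW_BYTES)
    in_bounds = aligned >= 0 and aligned + size_bytes <= limit
    return aligned, in_bounds, misaligned
```

A caller modeling the instruction would set Rd to 0 and report an error whenever `in_bounds` is false, and could optionally report an error when `misaligned` is true.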
Within a warp of 32 parallel threads, load instructions coalesce Local accesses that fall in the same 128B cache line into one access, and serialize accesses to each different cache line. Local addresses coalesce to a single access when the threads of a warp access the same Local per-thread address.
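As a rough model of the coalescing rule, the number of serialized accesses equals the number of distinct 128B lines touched by the active threads. The text does not describe how per-thread Local addresses map to lines, so the `interleaved` helper below is a purely hypothetical layout, chosen only because it reproduces the stated behaviour that identical per-thread addresses coalesce into one access for 32-bit loads.

```python
LINE_BYTES = 128
WARP_SIZE = 32


def line_accesses(addresses):
    """Serialized accesses for one warp: one per distinct 128B line touched.

    addresses holds one byte address per lane; None marks an inactive lane.
    """
    return len({a // LINE_BYTES for a in addresses if a is not None})


def interleaved(per_thread_addr, lane):
    """HYPOTHETICAL layout: 32-bit words of the 32 lanes are interleaved,
    so word k of every lane shares one 128B line."""
    word, byte = divmod(per_thread_addr, 4)
    return (word * WARP_SIZE + lane) * 4 + byte


# Every lane reads its own per-thread address 8: one line, one access.
same = [interleaved(8, lane) for lane in range(WARP_SIZE)]
# Every lane reads a different per-thread address: one line per lane.
scattered = [interleaved(4 * lane, lane) for lane in range(WARP_SIZE)]
```

Under this assumed layout, `line_accesses(same)` is 1 and `line_accesses(scattered)` is 32.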
Shared memory is partitioned into 32 parallel banks of 32 bits. Loads from Shared memory execute in parallel unless parallel threads within a warp access conflicting addresses in a memory bank. Conflicting accesses are serialized to each different address in a memory bank.
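The bank-conflict rule can be illustrated by counting, per bank, how many different 32-bit words a warp requests; the largest count is the number of serialized passes. This sketch assumes that successive 32-bit words map to successive banks (the text gives the bank count and width but not the mapping) and models only 32-bit loads; the function name is introduced here for illustration.

```python
from collections import defaultdict

NUM_BANKS = 32
BANK_BYTES = 4  # each bank is 32 bits wide


def serialized_passes(addresses):
    """Passes needed for one warp's 32-bit Shared load.

    addresses holds one byte address per lane; None marks an inactive lane.
    ASSUMPTION: bank index = (byte address // 4) % 32.
    """
    words_per_bank = defaultdict(set)
    for addr in addresses:
        if addr is None:
            continue
        word = addr // BANK_BYTES
        words_per_bank[word % NUM_BANKS].add(word)
    # Lanes reading the same word do not conflict; different words in one bank do.
    return max((len(words) for words in words_per_bank.values()), default=0)
```

With these assumptions, a unit-stride pattern (`4 * lane`) needs one pass, a stride of 8 bytes needs two, a stride of 128 bytes needs 32, and all lanes reading address 424 need one.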
LDL.32 R0, [R1 - 0x004];
LDS.32 R0, [R1 + 424];
LDS.32 R0, [424]; // absolute address 424 within Shared window