SPA 5.0:
{@{!}Pg}
ATOMS.op{.sz}
Rd, [Ra + ImmS24], Rb
{&req_6}
{&rdN}
{&wrN}
{?sched}
;
// Atomic Operation
{@{!}Pg}
ATOMS.CAS{.sz}
Rd, [Ra + ImmS24], Rb, Rc
{&req_6}
{&rdN}
{&wrN}
{?sched}
;
// Atomic Compare and Swap
{@{!}Pg}
ATOMS.CAST{.SPIN}{.sz}
Rd, [Ra + ImmS24], Rb, Rc
{&req_6}
{&rdN}
{&wrN}
{?sched}
;
// Atomic Compare and Store
Omit register Ra to specify an absolute address:
{@{!}Pg}
ATOMS.op{.sz}
Rd, [ImmU24], Rb
{&req_6}
{&rdN}
{&wrN}
{?sched}
;
// Atomic Operation
{@{!}Pg}
ATOMS.CAS{.sz}
Rd, [ImmU24], Rb, Rc
{&req_6}
{&rdN}
{&wrN}
{?sched}
;
// Atomic Compare and Swap
{@{!}Pg}
ATOMS.CAST{.SPIN}{.sz}
Rd, [ImmU24], Rb, Rc
{&req_6}
{&rdN}
{&wrN}
{?sched}
;
// Atomic Compare and Store
.op: { .ADD, .MIN, .MAX, .INC, .DEC, .AND, .OR, .XOR, .EXCH }
     Operation
.sz: { .U32*, .S32, .U64 }
     Typed bit size of memory and Rd
     .32 is also accepted and aliases to .U32
     .64 is also accepted and aliases to .U64
--------------------------------------------------------------------------------
Supported Atomic Operations
.op     .sz                  Description, M is [Ra + Imm24]
--------------------------------------------------------------------------------
.ADD    .U32 .S32            Rd = M; M = M + Rb;
.MIN    .U32 .S32            Rd = M; M = min(M, Rb);
.MAX    .U32 .S32            Rd = M; M = max(M, Rb);
.INC    .U32                 Rd = M; M = (M >= Rb) ? 0 : (M + 1);
.DEC    .U32                 Rd = M; M = (M == 0 || M > Rb) ? Rb : M - 1;
.AND    .U32 .S32            Rd = M; M = M & Rb;
.OR     .U32 .S32            Rd = M; M = M | Rb;
.XOR    .U32 .S32            Rd = M; M = M ^ Rb;
.EXCH   .U32 .S32 .U64       Rd = M; M = Rb;
.CAS    .U32 .S32 .U64       Rd = M; if (M == Rb) M = Rc;
.CAST   .U32 .S32 .U64       if (M == Rb) { M = Rc; Rd = 1; } else { Rd = 0; }
--------------------------------------------------------------------------------
.SPIN: "Fast fail" option. Threads are put into groups based on the shared
memory bank which contains the address they are trying to update. One thread
from each group is selected to do its ATOMS.CAST (compare and store); all
other threads from that bank will immediately "fail", i.e. they will not
attempt to do their store, and will return 0. For cases where multiple
threads are expected to be contending for the same address, this can be much
faster than ATOMS.CAST (without .SPIN) or ATOMS.CAS, as the non-.SPIN cases
may end up doing up to 32 passes, once per thread, in order to let every
thread attempt its compare-and-store or compare-and-swap.

Encoding restrictions: the low 2 bits of the ImmS24 or ImmU24 must always be
zero.
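The per-operation semantics can be modeled as a single-threaded C reference (a sketch for clarity only; the function names are illustrative, and hardware atomicity across threads is not modeled):

```c
#include <stdint.h>

/* Single-threaded reference model of ATOMS .U32 semantics.
 * *m is the shared-memory word M; the return value is Rd. */
uint32_t atoms_add_u32(uint32_t *m, uint32_t rb) {
    uint32_t rd = *m;
    *m = *m + rb;
    return rd;
}

uint32_t atoms_inc_u32(uint32_t *m, uint32_t rb) {
    uint32_t rd = *m;
    *m = (*m >= rb) ? 0 : (*m + 1);          /* wraps to 0 at limit Rb */
    return rd;
}

uint32_t atoms_dec_u32(uint32_t *m, uint32_t rb) {
    uint32_t rd = *m;
    *m = (*m == 0 || *m > rb) ? rb : (*m - 1);  /* wraps to Rb below 0 */
    return rd;
}

/* ATOMS.CAS: always returns the prior value; swaps only on match. */
uint32_t atoms_cas_u32(uint32_t *m, uint32_t rb, uint32_t rc) {
    uint32_t rd = *m;
    if (*m == rb)
        *m = rc;
    return rd;
}

/* ATOMS.CAST: returns a 1/0 success flag instead of the prior value. */
uint32_t atoms_cast_u32(uint32_t *m, uint32_t rb, uint32_t rc) {
    if (*m == rb) { *m = rc; return 1; }
    return 0;
}
```

Note the key difference the table encodes: .CAS writes the prior memory value to Rd, while .CAST writes a success flag, which is what makes the .SPIN fast-fail variant possible.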
ATOMS.op performs atomic operation .op with register Rb on shared memory, and returns the prior memory value to register Rd.
The byte address is computed as the 32-bit addition of register Ra plus the immediate offset Imm24. When Ra is RZ (or omitted), Imm24 is zero-extended to 32 bits. Otherwise, Imm24 is sign-extended to 32 bits and added to Ra. The address must be within the 16 MB shared memory window. Per-CTA shared memory allocation sizes are set by the driver. ATOMS uses window-specific addressing, like LDS and STS.
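The extension rule above can be sketched in C (`atoms_address` is a hypothetical helper for illustration, not an ISA primitive):

```c
#include <stdint.h>

/* Effective shared-memory byte address for ATOMS.
 * has_ra == 0 models the [ImmU24] absolute form (Ra omitted or RZ):
 * the 24-bit immediate is zero-extended. Otherwise ImmS24 is
 * sign-extended from bit 23 and added to Ra with 32-bit wrap. */
uint32_t atoms_address(int has_ra, uint32_t ra, uint32_t imm24) {
    imm24 &= 0xFFFFFF;                          /* 24-bit field */
    if (!has_ra)
        return imm24;                           /* zero-extend */
    int32_t simm = (int32_t)(imm24 << 8) >> 8;  /* sign-extend to 32 bits */
    return ra + (uint32_t)simm;                 /* 32-bit addition */
}
```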
ATOMS.op combines register Rb with the specified shared memory location atomically, without intervening accesses to that memory location by other threads:
satomic {
    // Atomic operation on shared memory location [Ra + Imm24]
    .sz M = shmem[Ra + Imm24];  // Read memory location
    Rd = M;                     // Return prior memory location value to register Rd
    M = .op(M, Rb);             // Form atomic operation result value
    shmem[Ra + Imm24] = M;      // Write memory location
}
ATOMS.CAS performs an atomic compare-and-swap operation on shared memory. It requires one or more extra register(s) for the swap value, which are provided as Rc; Rb holds the compare value.
{Rb,Rc} are expected to be consecutive registers, naturally aligned based on .sz. Specifically, for ATOMS.CAS/.CAST.32, Rb must be R2n+0 (an even register) and cannot be RZ, and Rc must be Rb+1 or RZ. Similarly, for ATOMS.CAS/.CAST.64, Rb must be R4n+0 (and cannot be RZ) and Rc must be Rb+2 or RZ.
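These pairing constraints can be expressed as an assembler-side validity check (an illustrative sketch; modeling RZ as register number 255 is an assumption, not stated by this document):

```c
#include <stdbool.h>

#define RZ 255u  /* assumed register number for the zero register */

/* Validate the Rb/Rc operands of ATOMS.CAS / ATOMS.CAST.
 * sz64 selects the .64 rules: Rb must be R4n+0 and Rc = Rb+2;
 * otherwise (.32) Rb must be even and Rc = Rb+1. Rc may be RZ,
 * Rb may not. */
bool cas_regs_ok(unsigned rb, unsigned rc, bool sz64) {
    if (rb == RZ)
        return false;                 /* Rb cannot be RZ */
    unsigned align = sz64 ? 4 : 2;    /* natural alignment of Rb */
    unsigned step  = sz64 ? 2 : 1;    /* offset of the Rc pair */
    if (rb % align != 0)
        return false;
    return rc == rb + step || rc == RZ;
}
```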
Other atomic operations assemble the omitted Rc as RZ.
Any operations not supported natively are expected to be implemented via
ATOMS.CAS, ATOMS.CAST, or ATOMS.CAST.SPIN loops.
For example:
// uint64 atomicMin
uint64_t atomicMin(uint64_t *address, uint64_t val)
{
    uint64_t ret = *address;
    while (val < ret) {
        uint64_t old = ret;
        if ((ret = atomicCAS64(address, old, val)) == old)
            break;
    }
    return ret;
}
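The same loop can be exercised against a single-threaded model of the compare-and-swap (the `atomicCAS64` stand-in below is illustrative; on hardware it would be an ATOMS.CAS.U64):

```c
#include <stdint.h>

/* Single-threaded stand-in for ATOMS.CAS.U64: returns the prior
 * value, swapping in `val` only when the prior value equals
 * `compare` (matching the .CAS row of the operation table). */
static uint64_t atomicCAS64(uint64_t *address, uint64_t compare, uint64_t val) {
    uint64_t old = *address;
    if (old == compare)
        *address = val;
    return old;
}

/* uint64 atomicMin emulated with a compare-and-swap loop. */
uint64_t atomicMin64(uint64_t *address, uint64_t val) {
    uint64_t ret = *address;
    while (val < ret) {
        uint64_t old = ret;
        if ((ret = atomicCAS64(address, old, val)) == old)
            break;
    }
    return ret;
}
```

The loop retries only while `val` is still smaller than the freshest value read back from memory, so a concurrent thread that installs an even smaller value terminates the loop early.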
ATOMS can only be used in compute shaders, as shared memory windows are not defined for other shader types.
Memory addresses must be naturally aligned, on a byte address that is a multiple of the access size. Misaligned addresses cause a misaligned address error. An address outside an allocated memory region causes an address out-of-range error.
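Natural alignment means the byte address is a multiple of the access size, which can be checked with a mask (an illustrative sketch; `size` is 4 for .U32/.S32 and 8 for .U64):

```c
#include <stdbool.h>
#include <stdint.h>

/* A naturally aligned address is a multiple of the access size.
 * Assumes size is a power of two, as all ATOMS access sizes are. */
bool naturally_aligned(uint32_t addr, uint32_t size) {
    return (addr & (size - 1)) == 0;
}
```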
ATOMS interprets memory data in little-endian byte order: the effective address specifies the least-significant data bits.
ATOMS.ADD.S32       R0, [R1 - 400], R9;
ATOMS.MIN.U32       R0, [R4 + 8], R2;
ATOMS.ADD.U32       R9, [0x10], R8;           # absolute 24-bit address
ATOMS.CAS.U64       R0, [R4 + 8], R4, R6;     # R4 is Reg4 aligned, Rc=Rb+2
ATOMS.CAST.SPIN.U32 R0, [R4 + 0x18], R4, R6;  # R4 is Reg4 aligned, Rc=Rb+2