RED

SPA 5.0:
        {@{!}Pg}   RED{.E}.op{.sz}   [Ra + ImmS20], Rb   {&req_6}   {&rdN}   {?sched}   ;   // Reduction Operation   

  .E:     Extended address (64 bits, requires two registers)
  .op:    { ADD, MIN, MAX, INC, DEC, AND, OR, XOR }          Operation
  .sz:    { .U32*, .S32, .U64, .S64, .F32.FTZ.RN, .F16x2.FTZ.RN }
            .32 is also accepted and aliases to .U32
            .64 is also accepted and aliases to .U64
          --------------------------------------------------------------------------------------------------
                                  Reduction Operations
          .op    .sz                                                    Description,  M is [Ra + ImmS20]    
          --------------------------------------------------------------------------------------------------
          .ADD   .U32 .S32 .U64  .F32.FTZ.RN  .F16x2.RN  .F64.RN        M = M + Rb;
          .MIN   .U32 .S32 .U64 .S64   .F16x2.RN                        M = min(M, Rb);
          .MAX   .U32 .S32 .U64 .S64   .F16x2.RN                        M = max(M, Rb);
          .INC   .U32                                                   M = (M >= Rb)? 0 : (M + 1);
          .DEC   .U32                                                   M = (M == 0 || M > Rb)? Rb : M - 1;
          .AND   .U32 .S32 .U64                                         M = M & Rb;
          .OR    .U32 .S32 .U64                                         M = M | Rb;
          .XOR   .U32 .S32 .U64                                         M = M ^ Rb;
          --------------------------------------------------------------------------------------------------
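
The table rows can be read as a small reference model. The following Python sketch covers only the .U32 size; the function name red_u32 is hypothetical, and the modulo-2**32 wraparound on ADD is an assumption (the table does not state overflow behavior). It is an illustration of the table, not a description of the hardware implementation.

```python
MASK32 = 0xFFFFFFFF  # assumption: .U32 arithmetic wraps modulo 2**32


def red_u32(op: str, m: int, rb: int) -> int:
    """Return the new value of M after applying .op with Rb, for .sz = .U32."""
    m &= MASK32
    rb &= MASK32
    if op == "ADD":
        return (m + rb) & MASK32
    if op == "MIN":
        return min(m, rb)
    if op == "MAX":
        return max(m, rb)
    if op == "INC":
        # M = (M >= Rb) ? 0 : (M + 1)
        return 0 if m >= rb else m + 1
    if op == "DEC":
        # M = (M == 0 || M > Rb) ? Rb : M - 1
        return rb if (m == 0 or m > rb) else m - 1
    if op == "AND":
        return m & rb
    if op == "OR":
        return m | rb
    if op == "XOR":
        return m ^ rb
    raise ValueError(f"unknown reduction op: {op}")
```

With Rb as the wrap limit, repeated INC counts 0, 1, ..., Rb and then returns to 0, and repeated DEC counts down from Rb to 0 and then reloads Rb.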

Description

RED.op performs the reduction operation .op with register Rb on global memory at a generic thread address. Without .E, the generic byte address is the 32-bit sum of register Ra and the signed immediate offset ImmS20, zero-extended to 40 bits. With the .E extension, the generic byte address is the sum of the 64-bit value (R[a],R[a+1]) and the sign-extended immediate offset ImmS20.
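
The two address forms can be sketched in Python as follows. The helper names are hypothetical. Three points are assumptions not stated in the text: ImmS20 is treated as a 20-bit two's-complement field, R[a] is taken to hold the low 32 bits and R[a+1] the high 32 bits of the extended base, and the 64-bit sum is taken to wrap modulo 2**64.

```python
def sign_extend(value: int, bits: int) -> int:
    """Interpret the low `bits` bits of value as a two's-complement integer."""
    value &= (1 << bits) - 1
    sign = 1 << (bits - 1)
    return (value ^ sign) - sign


def red_address(ra: int, imm_s20: int, extended: bool = False, ra_hi: int = 0) -> int:
    """Compute the generic byte address for [Ra + ImmS20].

    ra      -- contents of R[a]
    ra_hi   -- contents of R[a+1], used only with .E (assumed to be the high word)
    imm_s20 -- raw immediate field, assumed 20-bit two's complement
    """
    offset = sign_extend(imm_s20, 20)
    if extended:
        base = ((ra_hi & 0xFFFFFFFF) << 32) | (ra & 0xFFFFFFFF)
        return (base + offset) & 0xFFFFFFFFFFFFFFFF  # assumed 64-bit wrap
    # 32-bit addition; zero-extending the 32-bit result to 40 bits
    # leaves its numeric value unchanged.
    return (ra + offset) & 0xFFFFFFFF
```

Note that in the non-extended form a negative offset from a small base wraps within 32 bits before zero-extension, so the result never exceeds 0xFFFFFFFF.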

RED combines register Rb with global memory location [Ra + ImmS20] atomically, without intervening accesses to that memory location by other threads:

    atomic {                    // Atomic operation on global memory location [Ra + ImmS20]
        .sz M = mem[Ra + ImmS20];       // Read global memory location
            M = .op(M, Rb);             // Form reduction value
            mem[Ra + ImmS20] = M;       // Write memory location
    }
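
As an illustration of the atomic read-modify-write above, the following Python sketch models global memory with a lock around each reduction. The class GlobalMemory and the function run_threads are hypothetical, and the lock is only a software stand-in for whatever mechanism the memory subsystem uses; only RED.ADD.U32 is modeled, with modulo-2**32 wraparound assumed.

```python
import threading


class GlobalMemory:
    """Toy model: a word-addressed store whose reductions are atomic."""

    def __init__(self):
        self._words = {}
        self._lock = threading.Lock()

    def red_add_u32(self, address: int, rb: int) -> None:
        # The lock plays the role of `atomic { ... }`: no other thread
        # can access the location between the read and the write.
        with self._lock:
            m = self._words.get(address, 0)   # Read global memory location
            m = (m + rb) & 0xFFFFFFFF         # Form reduction value
            self._words[address] = m          # Write memory location

    def load(self, address: int) -> int:
        with self._lock:
            return self._words.get(address, 0)


def run_threads(num_threads: int = 8, reds_per_thread: int = 1000) -> int:
    """Have every thread add 1 to the same location; return the final value."""
    mem = GlobalMemory()

    def worker():
        for _ in range(reds_per_thread):
            mem.red_add_u32(0x100, 1)

    threads = [threading.Thread(target=worker) for _ in range(num_threads)]
    for t in threads:
        t.start()
    for t in threads:
        t.join()
    return mem.load(0x100)
```

Because each reduction is indivisible, the final value equals the total number of reductions issued, regardless of how the threads interleave.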

The generic thread address space accesses global memory, unless the address falls in the Local or Shared address window. A RED instruction must address global memory; any other address space causes an invalid address space error.

When RED is used in a pixel shader, the hardware automatically predicates off helper pixels and killed pixels to prevent unwanted writes to global memory. If a pixel's raster coverage is 0, or the pixel has previously been killed using the KIL operation, its thread does not participate in any RED operations.

Memory addresses must be naturally aligned: the byte address must be a multiple of the access size. Misaligned addresses cause an error and do not access memory. An address outside an allocated memory region is ignored, does not access memory, and causes an error.
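
The natural-alignment rule reduces to a modulo test. A minimal Python sketch, with a hypothetical function name and a size table that assumes each .sz suffix maps to the obvious byte width (4 bytes for 32-bit types including .F16x2, 8 bytes for 64-bit types):

```python
# Assumed access sizes in bytes for each .sz suffix.
ACCESS_SIZE_BYTES = {
    ".U32": 4,
    ".S32": 4,
    ".U64": 8,
    ".S64": 8,
    ".F32.FTZ.RN": 4,
    ".F16x2.FTZ.RN": 4,  # two packed 16-bit halves
}


def is_naturally_aligned(address: int, sz: str = ".U32") -> bool:
    """True when the byte address is a multiple of the access size."""
    return address % ACCESS_SIZE_BYTES[sz] == 0
```

For example, address 0x1004 is valid for a .U32 access but misaligned for a .U64 access.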

RED interprets memory data in little-endian byte order: the effective address specifies the least-significant data bits.
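
To make the byte order concrete, the following Python sketch stores and loads a value in a byte array using little-endian order. The helper names load_le and store_le are hypothetical, and the byte array is only a stand-in for memory.

```python
def store_le(memory: bytearray, address: int, size: int, value: int) -> None:
    """Write `size` bytes of value at address, least-significant byte first."""
    mask = (1 << (8 * size)) - 1
    memory[address:address + size] = (value & mask).to_bytes(size, "little")


def load_le(memory: bytearray, address: int, size: int) -> int:
    """Read `size` bytes at address as a little-endian unsigned integer."""
    return int.from_bytes(memory[address:address + size], "little")
```

Storing the 32-bit value 0x11223344 at address A places 0x44 at byte A and 0x11 at byte A + 3.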

Additional Information:

RED does not cache data in the L1 cache; it first discards any matching (global) L1 cache lines, which could otherwise be stale due to operations of multiple SMs.

To load a RED result (from either the current thread or elsewhere), use LD.CG to bypass any stale global lines introduced by other LD instructions.

Execution Behavior:

The atomic reduction operations are implemented close to the memory subsystem to ensure the atomicity across all executing threads and to limit the duration an atomic lock is "held" on the memory location. Any data required for the RED operations that is cached in L1 is evicted.

RED instructions are pipelined and decoupled from thread execution, like ST instructions. The issuing thread continues execution. A thread may have several RED operations pending, depending on resources. Any subsequent memory operations (LD*, ST*, ATOM, or RED instructions) to the same address are kept in order and do not perform their operation until the current RED is complete.

Memory instructions (LD*, ST*, ATOM, RED) issued by a thread to the same address are kept in order. Memory instructions issued by a thread to different addresses may be reordered, as may memory instructions issued by different threads to the same address. Use MEMBAR.GL to explicitly order global memory instructions within a thread, and use BAR.SYNC to order memory accesses across concurrent threads of a CTA.

Within a warp of 32 parallel threads, RED instructions coalesce global accesses to different addresses in the same 128B cache line into one access, serialize accesses to the same conflicting address, and serialize accesses to each different cache line. The order in which conflicting RED addresses are serialized is implementation-dependent. Vector 64-bit accesses coalesce 16 parallel threads at a time, and 128-bit accesses coalesce 8 threads at a time. Global addresses coalesce into a single access when the threads of a warp address different locations within one cache line, into two accesses when they span two cache lines, and so on.
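
The cache-line grouping can be estimated with a short Python sketch. The function name red_access_summary is hypothetical. The model only counts distinct 128B lines and identifies addresses hit by more than one thread; it ignores the 16-thread and 8-thread grouping of wider accesses and says nothing about the serialization order, which the text leaves to the implementation.

```python
from collections import Counter

CACHE_LINE_BYTES = 128


def red_access_summary(addresses):
    """Summarize one warp's RED addresses.

    Returns (number of distinct cache lines touched,
             {address: thread count} for addresses used by more than one thread).
    """
    lines = {addr // CACHE_LINE_BYTES for addr in addresses}
    conflicts = {a: n for a, n in Counter(addresses).items() if n > 1}
    return len(lines), conflicts
```

For 32 threads addressing consecutive 32-bit words from a 128B-aligned base, the model reports one cache line and no conflicts; if all 32 threads address the same word, it reports one cache line and a 32-way conflict on that address.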

RED : Reduction Operation on generic Memory
