SHFL : Warp Wide Register Shuffle

Format:

SPA 5.0:
        {@{!}Pg}   SHFL.mode   Pu, Rd, Ra, Sb, Sc   {&req_6}   {&rdN}   {&wrN}   {?sched}   ;   

 .mode:      { .IDX, .UP, .DOWN, .BFLY }

  Sb may be either a register or a 5-bit immediate. 
  Sc may be either a register or a 13-bit immediate.
  Sb contains the absolute or relative address of the thread from which the current thread reads data.
  Sc contains the clamp (limit lane) and the segmentation mask used for the range check.

Description:

This instruction allows threads in a warp to exchange data. Each thread in the warp computes a source lane (j) from which to read a register Ra. The computation of source lane is a function of the lane id (i), .mode, and the operands Sb/Sc.

 If j is in range:
    The value of register Ra in lane j is read and written into Rd for the current thread (in lane i).
 If j is out of range:
    The thread's own lane i is used as the source lane, so in effect the thread's own Ra value is written into Rd.
 If the thread corresponding to lane j is inactive, the data read is UNPREDICTABLE.

In all cases, predicate register Pu indicates whether j was in range. The operand Sb controls the computation of j, and the operand Sc controls the determination of whether j is in range.

The Sb/Sc operands contain the following parameters: Sb[4:0] is the index (an absolute lane or a relative offset, depending on .mode), Sc[12:8] is the warp segmentation mask, and Sc[4:0] is the clamp value.
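
As an illustration of the field layout just described, the following Python sketch unpacks raw Sb/Sc values into their parameters. The helper names `decode_operands` and `encode_sc` are hypothetical and exist only for this example.

```python
def decode_operands(sb, sc):
    """Split raw Sb/Sc values into (index, clamp, mask)."""
    index = sb & 0x1F          # Sb[4:0]
    clamp = sc & 0x1F          # Sc[4:0]
    mask = (sc >> 8) & 0x1F    # Sc[12:8]
    return index, clamp, mask


def encode_sc(mask, clamp):
    """Pack a 5-bit segmentation mask and a 5-bit clamp into an Sc value."""
    return ((mask & 0x1F) << 8) | (clamp & 0x1F)


if __name__ == "__main__":
    index, clamp, mask = decode_operands(2, 0x1C03)
    print(index, clamp, format(mask, "05b"))
```

For instance, an Sc immediate of 0x1C03 decodes to mask 11100 and clamp 3.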

The value of j is computed from i, the index, the clamp, and the segmentation mask.

If the mask value is 00000, then the warp is segmented into one warp segment (i.e., threads 0-31). If the mask value is 10000, then the warp is segmented into two warp segments (i.e., threads 0-15 and threads 16-31). If the mask value is 11000, then the warp is segmented into four warp segments (i.e., threads 0-7,8-15,16-23, and 24-31).
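
The mapping from mask value to warp segments can be sketched in Python as follows. This is a minimal sketch that assumes the set bits of the mask are contiguous from the most significant bit, as in the three cases above; `warp_segments` is a hypothetical helper name.

```python
def warp_segments(mask):
    """Return (first_lane, last_lane) for each segment implied by a 5-bit mask."""
    mask &= 0x1F
    low_bits = ~mask & 0x1F                       # bits that vary inside a segment
    bases = sorted({lane & mask for lane in range(32)})
    return [(base, base | low_bits) for base in bases]


if __name__ == "__main__":
    for m in (0b00000, 0b10000, 0b11000):
        print(format(m, "05b"), warp_segments(m))
```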

First, minLane and maxLane are computed from the above parameters. minLane is the lowest thread id of the warp segment in which i falls, and maxLane is obtained by adding the appropriate least significant bits of the clamp value to minLane. If the warp forms a single segment, all 5 bits of the clamp value are used. If the warp is segmented into two (resp. four) warp segments, only the 4 (resp. 3) least significant bits of the clamp value are used.
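
A minimal Python sketch of the minLane/maxLane computation follows. It assumes that the "appropriate least significant bits" are exactly the bit positions not set in the mask; `lane_bounds` is a hypothetical helper name.

```python
def lane_bounds(i, clamp, mask):
    """Return (min_lane, max_lane) for lane i given a 5-bit clamp and mask."""
    mask &= 0x1F
    low_bits = ~mask & 0x1F                  # clamp bits that are actually used
    min_lane = i & mask                      # lowest lane of the segment holding i
    max_lane = min_lane + (clamp & low_bits)
    return min_lane, max_lane


if __name__ == "__main__":
    print(lane_bounds(21, 31, 0b00000))
    print(lane_bounds(21, 31, 0b10000))
    print(lane_bounds(21, 3, 0b11100))
```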

If the mode is .IDX, j is equal to minLane plus the appropriate least significant bits of the index. If the warp forms a single segment, all 5 bits of the index are used. If the warp is segmented into two (resp. four) warp segments, only the 4 (resp. 3) least significant bits of the index are used.

If the mode is .UP (resp. .DOWN), then j is i minus (resp. plus) the index.

If the mode is .BFLY, then j is i xor the index.

Pu indicates whether the source lane computed for a given lane is "in range" (valid).

For .DOWN, .IDX, and .BFLY modes, if the resulting j value exceeds maxLane, then j is set to i and the predicate Pu is set to false. Otherwise, Pu is set to true.

For .UP mode, maxLane and minLane are expected to be the same (because the clamp, Sc[4:0], is expected to be 0). For .UP mode, if the resulting j value is smaller than minLane, then j is set to i and the predicate Pu is set to false. Otherwise, Pu is set to true.
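
Putting the rules above together, the per-lane computation of j and Pu can be sketched in Python. This is an illustrative model of the description in this section, not a statement of the hardware implementation; `source_lane` and the string mode names are hypothetical.

```python
def source_lane(mode, i, index, sc):
    """Return (j, pu) for lane i; index is Sb[4:0], sc packs mask and clamp."""
    index &= 0x1F
    clamp = sc & 0x1F              # Sc[4:0]
    mask = (sc >> 8) & 0x1F        # Sc[12:8]
    low_bits = ~mask & 0x1F        # bits that select a lane inside the segment
    min_lane = i & mask
    max_lane = min_lane + (clamp & low_bits)

    if mode == "IDX":
        j = min_lane + (index & low_bits)
    elif mode == "UP":
        j = i - index
    elif mode == "DOWN":
        j = i + index
    elif mode == "BFLY":
        j = i ^ index
    else:
        raise ValueError("mode must be IDX, UP, DOWN or BFLY")

    # .UP is checked against minLane, every other mode against maxLane.
    in_range = (j >= min_lane) if mode == "UP" else (j <= max_lane)
    return (j, True) if in_range else (i, False)


if __name__ == "__main__":
    print(source_lane("UP", 0, 1, 0))         # lane 0 has no lower neighbour
    print(source_lane("DOWN", 31, 1, 31))     # lane 31 has no higher neighbour
    print(source_lane("IDX", 6, 1, 0x1C03))   # lane 1 of the quad holding lane 6
```

An out-of-range lane falls back to j = i with Pu false, which is why the scan examples can predicate the accumulate step on Pu.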

Addressing mode .UP is useful for prefix sum up, where higher numbered threads source from a lower numbered thread a fixed distance apart. Similarly, .DOWN mode is used for prefix sum down use cases, where lower numbered threads source from a higher numbered thread. .BFLY mode implements the butterfly addressing pattern used in, for example, tree reduction and broadcast.

Addressing mode .IDX (indexed) is the mode in which the source lane is specified explicitly. .IDX mode is for any addressing patterns that do not fit the above three modes. It is also handy for broadcast operations.

Modifier .mode must be specified. There is no default.

Examples:

Warp-level INCLUSIVE PLUS SCAN:

    // Assumes input in following registers:
    //     - Rx  = sequence value for this thread
    SHFL.UP     P1, Ry, Rx, 1,  0
@P1 FADD        Rx, Ry, Rx
    SHFL.UP     P1, Ry, Rx, 2,  0
@P1 FADD        Rx, Ry, Rx
    SHFL.UP     P1, Ry, Rx, 4,  0
@P1 FADD        Rx, Ry, Rx
    SHFL.UP     P1, Ry, Rx, 8,  0
@P1 FADD        Rx, Ry, Rx
    SHFL.UP     P1, Ry, Rx, 16, 0
@P1 FADD        Rx, Ry, Rx
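
The inclusive scan sequence above can be checked with a small Python emulation. This is a sketch that assumes a fully active 32-lane warp (inactive lanes, whose reads are UNPREDICTABLE, are not modeled); `shfl` and `inclusive_scan` are hypothetical helper names.

```python
def shfl(mode, ra, index, sc):
    """Emulate one SHFL over 32 active lanes; returns (pu, rd) as lists."""
    clamp, mask = sc & 0x1F, (sc >> 8) & 0x1F
    low_bits = ~mask & 0x1F
    pu, rd = [], []
    for i in range(32):
        min_lane = i & mask
        max_lane = min_lane + (clamp & low_bits)
        if mode == "IDX":
            j = min_lane + (index & low_bits)
        elif mode == "UP":
            j = i - index
        elif mode == "DOWN":
            j = i + index
        else:  # "BFLY"
            j = i ^ index
        ok = (j >= min_lane) if mode == "UP" else (j <= max_lane)
        pu.append(ok)
        rd.append(ra[j] if ok else ra[i])
    return pu, rd


def inclusive_scan(rx):
    """Mirror the five SHFL.UP / @P1 FADD pairs, with distances 1, 2, 4, 8, 16."""
    rx = list(rx)
    for distance in (1, 2, 4, 8, 16):
        p1, ry = shfl("UP", rx, distance, 0)
        # @P1 FADD Rx, Ry, Rx: only lanes with an in-range source accumulate.
        rx = [y + x if p else x for p, y, x in zip(p1, ry, rx)]
    return rx


if __name__ == "__main__":
    print(inclusive_scan([1.0] * 32))
```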

Warp-level EXCLUSIVE PLUS SCAN:

    //Perform INCLUSIVE scan as above here//
    SHFL.UP     P1, Rx, Rx, 1, 0
    SEL         Rx, Rx, 0, P1     // For other operators, replace 0 with the appropriate identity
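
The exclusive variant can be modeled the same way. The Python sketch below assumes a fully active 32-lane warp and treats SEL as "Rx if P1 else 0"; `shfl_up`, `inclusive_scan` and `exclusive_scan` are hypothetical helper names.

```python
def shfl_up(ra, index):
    """Emulate SHFL.UP with Sc = 0 over 32 active lanes; returns (pu, rd)."""
    pu, rd = [], []
    for i in range(32):
        j = i - index
        ok = j >= 0                 # minLane is 0 when the mask is 00000
        pu.append(ok)
        rd.append(ra[j] if ok else ra[i])
    return pu, rd


def inclusive_scan(rx):
    rx = list(rx)
    for distance in (1, 2, 4, 8, 16):
        p1, ry = shfl_up(rx, distance)
        rx = [y + x if p else x for p, y, x in zip(p1, ry, rx)]
    return rx


def exclusive_scan(rx, identity=0.0):
    """Shift the inclusive result up by one lane; lane 0 gets the identity."""
    p1, shifted = shfl_up(inclusive_scan(rx), 1)
    # SEL Rx, Rx, 0, P1: keep the shuffled value where P1 is true.
    return [v if p else identity for p, v in zip(p1, shifted)]


if __name__ == "__main__":
    print(exclusive_scan([1.0] * 32))
```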

Warp-level INCLUSIVE PLUS REVERSE-SCAN:

    // Assumes input in following registers:
    //     - Rx  = sequence value for this thread
    //
    SHFL.DOWN   P1, Ry, Rx, 1,  31
@P1 FADD        Rx, Ry, Rx
    SHFL.DOWN   P1, Ry, Rx, 2,  31
@P1 FADD        Rx, Ry, Rx
    SHFL.DOWN   P1, Ry, Rx, 4,  31
@P1 FADD        Rx, Ry, Rx
    SHFL.DOWN   P1, Ry, Rx, 8,  31
@P1 FADD        Rx, Ry, Rx
    SHFL.DOWN   P1, Ry, Rx, 16, 31
@P1 FADD        Rx, Ry, Rx

BUTTERFLY REDUCTION:

    // Assumes input in following registers:
    //     - Rx  = sequence value for this thread
    SHFL.BFLY   __, Ry, Rx, 16,  31   // We never use the predicate
    FADD        Rx, Ry, Rx
    SHFL.BFLY   __, Ry, Rx, 8,   31
    FADD        Rx, Ry, Rx
    SHFL.BFLY   __, Ry, Rx, 4,   31
    FADD        Rx, Ry, Rx
    SHFL.BFLY   __, Ry, Rx, 2,   31
    FADD        Rx, Ry, Rx
    SHFL.BFLY   __, Ry, Rx, 1,   31
    FADD        Rx, Ry, Rx
    // All threads now hold sum in Rx
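
To illustrate why every lane ends up with the full sum, the butterfly sequence above can be emulated in Python. This sketch assumes a fully active 32-lane warp; `shfl_bfly` and `butterfly_sum` are hypothetical helper names.

```python
def shfl_bfly(ra, index, sc):
    """Emulate SHFL.BFLY over 32 active lanes; returns (pu, rd)."""
    clamp, mask = sc & 0x1F, (sc >> 8) & 0x1F
    pu, rd = [], []
    for i in range(32):
        max_lane = (i & mask) + (clamp & ~mask & 0x1F)
        j = i ^ index
        ok = j <= max_lane
        pu.append(ok)
        rd.append(ra[j] if ok else ra[i])
    return pu, rd


def butterfly_sum(rx):
    """Mirror the five SHFL.BFLY / FADD pairs, with distances 16, 8, 4, 2, 1."""
    rx = list(rx)
    for distance in (16, 8, 4, 2, 1):
        _, ry = shfl_bfly(rx, distance, 31)   # predicate is ignored
        rx = [y + x for y, x in zip(ry, rx)]
    return rx


if __name__ == "__main__":
    print(butterfly_sum([float(k) for k in range(32)]))
```

After each step, every lane holds the sum of a group twice as large as before, so five steps cover all 32 lanes.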

FSWZ emulation modes:

    //.0000: TXD quad expansion (smear 0)
    SHFL.IDX   PT, Ry, Rx, 0,  0x1C03;  // broadcast = 0, Mask = 5'b11100, Max = 3 (within quad) 
    //.1111: TXD quad expansion (smear 1)
    SHFL.IDX   PT, Ry, Rx, 1,  0x1C03;  // broadcast = 1, Mask = 5'b11100, Max = 3 (within quad) 
    //.2222: TXD quad expansion (smear 2)
    SHFL.IDX   PT, Ry, Rx, 2,  0x1C03;  // broadcast = 2, Mask = 5'b11100, Max = 3 (within quad) 
    //.3333: TXD quad expansion (smear 3)
    SHFL.IDX   PT, Ry, Rx, 3,  0x1C03;  // broadcast = 3, Mask = 5'b11100, Max = 3 (within quad) 
    //.1032: DDX
    SHFL.BFLY  PT, Ry, Rx, 1,  0x1C03;  // exchange with tid^1, Mask = 5'b11100, Max = 3 (within quad) 
    //.2301: DDY
    SHFL.BFLY  PT, Ry, Rx, 2,  0x1C03;  // exchange with tid^2, Mask = 5'b11100, Max = 3 (within quad)
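
The quad patterns above can be reproduced with a short Python emulation. This is a sketch assuming a fully active 32-lane warp in which each lane initially holds its own lane id; `shfl` is a hypothetical helper name.

```python
def shfl(mode, ra, index, sc):
    """Emulate SHFL.IDX or SHFL.BFLY over 32 active lanes; returns rd."""
    clamp, mask = sc & 0x1F, (sc >> 8) & 0x1F
    low_bits = ~mask & 0x1F
    rd = []
    for i in range(32):
        min_lane = i & mask
        max_lane = min_lane + (clamp & low_bits)
        if mode == "IDX":
            j = min_lane + (index & low_bits)
        else:  # "BFLY"
            j = i ^ index
        rd.append(ra[j] if j <= max_lane else ra[i])
    return rd


if __name__ == "__main__":
    lanes = list(range(32))
    sc = 0x1C03                              # mask 11100, clamp 3 (within quad)
    print(shfl("IDX", lanes, 2, sc)[:8])     # smear 2
    print(shfl("BFLY", lanes, 1, sc)[:4])    # DDX
    print(shfl("BFLY", lanes, 2, sc)[:4])    # DDY
```

With lane ids as input, the first quad of each result shows the swizzle pattern directly: smear k repeats lane k of the quad, BFLY 1 gives the .1032 pattern, and BFLY 2 gives .2301.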
