SPA 5.0:
{@{!}Pg}
SHFL.mode
Pu, Rd, Ra, Sb, Sc
{&req_6}
{&rdN}
{&wrN}
{?sched}
;
.mode: { .IDX, .UP, .DOWN, .BFLY }
Sb may be either a register or a 5-bit immediate.
Sc may be either a register or a 13-bit immediate.
Sb contains the absolute or relative address of the thread from which the current thread reads data.
Sc contains the limit thread and mask value for range check.
This instruction allows threads in a warp to exchange data. Each thread in the warp computes a source lane (j) from which to read a register Ra. The computation of source lane is a function of the lane id (i), .mode, and the operands Sb/Sc.
If j is in range, the value of register Ra in lane j is read and written into Rd for the current thread (in lane i). If j is out of range, the thread's own lane i is used as the source lane, so in effect the thread's own Ra value is written into Rd. If the thread corresponding to lane j is inactive, the data read is UNPREDICTABLE.
In either case, predicate register Pu indicates whether j was in range. The operands Sb and Sc control the computation of j and the determination of whether it is in range, respectively.
The Sb/Sc operands contain the following parameters: Sb[4:0] is the index of the source thread id, Sc[12:8] is the warp segmentation mask and Sc[4:0] is the clamp value.
The value of j is computed using i, the index, the clamp and the segmentation mask. If the mask value is 00000, then the warp is segmented into one warp segment (i.e., threads 0-31). If the mask value is 10000, then the warp is segmented into two warp segments (i.e., threads 0-15 and threads 16-31). If the mask value is 11000, then the warp is segmented into four warp segments (i.e., threads 0-7, 8-15, 16-23, and 24-31).
First, minLane and maxLane are computed from the above parameters. minLane is the least thread id of the warp segment in which i falls, and maxLane is obtained by adding the appropriate least significant bits of the clamp value to minLane. If the warp is segmented into one warp segment, then all 5 bits of the clamp value are used. If the warp is segmented into two (resp. four) warp segments, then only the 4 (resp. 3) least significant bits of the clamp value are used.
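The minLane/maxLane computation can be sketched in Python as follows. This is a minimal model, not the hardware definition: the function name `segment_bounds` is hypothetical, and the bitwise formulation (keep the masked upper bits of the lane id, fill the remaining lower bits from the clamp) is an assumption that agrees with the description above for the mask values 00000, 10000 and 11000.

```python
def segment_bounds(lane, sc):
    """Return (min_lane, max_lane) for a lane id, given the Sc operand.

    Assumed layout (from the text): Sc[12:8] = segmentation mask,
    Sc[4:0] = clamp value.
    """
    mask = (sc >> 8) & 0x1F        # Sc[12:8]
    clamp = sc & 0x1F              # Sc[4:0]
    low_bits = ~mask & 0x1F        # lane-id bits that vary inside a segment
    min_lane = lane & mask         # least thread id of the lane's segment
    max_lane = min_lane | (clamp & low_bits)
    return min_lane, max_lane


# One segment (mask 00000), clamp 31: every lane sees the range 0..31.
one_segment = segment_bounds(5, 0x001F)
# Two segments (mask 10000), clamp 15: lane 21 falls in 16..31.
two_segments = segment_bounds(21, 0x100F)
# Four segments (mask 11000), clamp 7: lane 9 falls in 8..15.
four_segments = segment_bounds(9, 0x1807)
```

Because only the low bits of the clamp are merged in, the same Sc value yields a different maxLane in each segment while the width of the range stays constant.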
If the mode is .IDX, j is equal to minLane plus the appropriate least significant bits of the index. If the warp is segmented into one warp segment, then all 5 bits of the index are used. If the warp is segmented into two (resp. four) warp segments, then only the 4 (resp. 3) least significant bits of the index are used.
If the mode is .UP (resp. .DOWN), then j is i minus (resp. plus) the index.
If the mode is .BFLY, then j is i XOR the index.
Pu indicates whether the source lane for a given lane is "in range", i.e. valid. For .DOWN/.IDX/.BFLY modes, if the resulting j value exceeds maxLane, then j is set to i and the predicate Pu is set to false. Otherwise, Pu is set to true.
For .UP mode, maxLane and minLane are expected to be the same (because the clamp, Sc[4:0], is expected to be 0). For .UP mode, if the resulting j value is smaller than minLane, then j is set to i and the predicate Pu is set to false. Otherwise, Pu is set to true.
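The rules above for computing j and Pu can be collected into one small reference model. The sketch below is illustrative only: `shfl_source_lane` is a hypothetical name, the mode is passed as a string, and the bitwise minLane/maxLane formulation is an assumption that matches the text for the documented mask values. The .UP check follows the text literally (j must not fall below minLane).

```python
def shfl_source_lane(mode, lane, sb, sc):
    """Return (j, pu): the source lane and the in-range predicate.

    Assumed operand layout (from the text): Sb[4:0] = index,
    Sc[12:8] = segmentation mask, Sc[4:0] = clamp.
    """
    index = sb & 0x1F
    clamp = sc & 0x1F
    mask = (sc >> 8) & 0x1F
    low_bits = ~mask & 0x1F            # bits that vary within a segment
    min_lane = lane & mask
    max_lane = min_lane | (clamp & low_bits)

    if mode == "IDX":
        j = min_lane | (index & low_bits)
        pu = j <= max_lane
    elif mode == "UP":
        j = lane - index
        pu = j >= min_lane             # clamp is expected to be 0 here
    elif mode == "DOWN":
        j = lane + index
        pu = j <= max_lane
    elif mode == "BFLY":
        j = lane ^ index
        pu = j <= max_lane
    else:
        raise ValueError("mode must be IDX, UP, DOWN or BFLY")

    if not pu:
        j = lane                       # out of range: read own Ra
    return j, pu
```

For example, under these assumptions `shfl_source_lane("UP", 0, 1, 0)` reports lane 0 as out of range and falls back to lane 0 itself, while `shfl_source_lane("DOWN", 3, 4, 31)` selects lane 7.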
Addressing mode .UP is useful for prefix-sum-up patterns, where higher numbered threads source from a lower numbered thread a fixed distance away. Similarly, .DOWN mode is used for prefix-sum-down use cases, where lower numbered threads source from a higher numbered thread. .BFLY mode implements the butterfly addressing pattern used in tree reduction and broadcast.
Addressing mode .IDX (indexed) is the mode in which the source lane is specified explicitly. .IDX mode is for any addressing pattern that does not fit the above three modes. It is also handy for broadcast operations.
Modifier .mode must be specified. There is no default.
// Assumes input in following registers:
// - Rx = sequence value for this thread
SHFL.UP P1, Ry, Rx, 1, 0
@P1 FADD Rx, Ry, Rx
SHFL.UP P1, Ry, Rx, 2, 0
@P1 FADD Rx, Ry, Rx
SHFL.UP P1, Ry, Rx, 4, 0
@P1 FADD Rx, Ry, Rx
SHFL.UP P1, Ry, Rx, 8, 0
@P1 FADD Rx, Ry, Rx
SHFL.UP P1, Ry, Rx, 16, 0
@P1 FADD Rx, Ry, Rx
// Perform INCLUSIVE scan as above here
SHFL.UP P1, Rx, Rx, 1, 0
SEL Rx, Rx, 0, P1 // Use appropriate identity for 0 with other operators
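To see why the SHFL.UP sequences above yield an inclusive and then an exclusive scan, the following Python sketch models a full 32-lane warp with all threads active. The helper names (`shfl_up`, `inclusive_scan`, `exclusive_scan`) are hypothetical, integers stand in for the floating-point values that FADD would operate on, and Sc = 0 (one segment, minLane = 0) is assumed as in the examples.

```python
def shfl_up(values, delta):
    """Model SHFL.UP Pu, Rd, Ra, delta, 0 across a whole warp.

    Returns (pu, rd) as per-lane lists. Lanes whose source would fall
    below lane 0 read their own value and get Pu = False.
    """
    pu, rd = [], []
    for lane in range(len(values)):
        j = lane - delta
        in_range = j >= 0
        pu.append(in_range)
        rd.append(values[j] if in_range else values[lane])
    return pu, rd


def inclusive_scan(values):
    rx = list(values)
    for delta in (1, 2, 4, 8, 16):
        p1, ry = shfl_up(rx, delta)
        # @P1 FADD Rx, Ry, Rx: only in-range lanes accumulate.
        rx = [y + x if p else x for p, y, x in zip(p1, ry, rx)]
    return rx


def exclusive_scan(values, identity=0):
    rx = inclusive_scan(values)
    p1, rx = shfl_up(rx, 1)
    # SEL Rx, Rx, 0, P1: lane 0 has no predecessor, so it gets the identity.
    return [x if p else identity for p, x in zip(p1, rx)]


warp = list(range(1, 33))          # lane k holds k + 1
inclusive = inclusive_scan(warp)
exclusive = exclusive_scan(warp)
```

The predicate matters in both steps: without it, lanes near the start of the warp would add their own value to themselves in the inclusive scan, and lane 0 would keep its inclusive result in the exclusive scan.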
// Assumes input in following registers:
// - Rx = sequence value for this thread
SHFL.DOWN P1, Ry, Rx, 1, 31
@P1 FADD Rx, Ry, Rx
SHFL.DOWN P1, Ry, Rx, 2, 31
@P1 FADD Rx, Ry, Rx
SHFL.DOWN P1, Ry, Rx, 4, 31
@P1 FADD Rx, Ry, Rx
SHFL.DOWN P1, Ry, Rx, 8, 31
@P1 FADD Rx, Ry, Rx
SHFL.DOWN P1, Ry, Rx, 16, 31
@P1 FADD Rx, Ry, Rx
// Assumes input in following registers:
// - Rx = sequence value for this thread
SHFL.BFLY __, Ry, Rx, 16, 31 // We never use the predicate
FADD Rx, Ry, Rx
SHFL.BFLY __, Ry, Rx, 8, 31
FADD Rx, Ry, Rx
SHFL.BFLY __, Ry, Rx, 4, 31
FADD Rx, Ry, Rx
SHFL.BFLY __, Ry, Rx, 2, 31
FADD Rx, Ry, Rx
SHFL.BFLY __, Ry, Rx, 1, 31
FADD Rx, Ry, Rx
// All threads now hold sum in Rx
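The butterfly reduction above can be checked with a small warp-level model. This is a sketch under stated assumptions: a full 32-lane warp with all threads active, Sc = 31 (one segment, so every XOR partner is in range and the predicate can be ignored), integers in place of floating-point values, and hypothetical helper names `shfl_bfly` and `warp_allreduce_sum`.

```python
def shfl_bfly(values, index):
    """Model SHFL.BFLY __, Rd, Ra, index, 31 across a 32-lane warp.

    With one segment and clamp 31, j = lane XOR index is always in range.
    """
    return [values[lane ^ index] for lane in range(len(values))]


def warp_allreduce_sum(values):
    assert len(values) == 32, "model assumes a full 32-lane warp"
    rx = list(values)
    for index in (16, 8, 4, 2, 1):
        ry = shfl_bfly(rx, index)
        # FADD Rx, Ry, Rx: every lane adds its partner's partial sum.
        rx = [y + x for y, x in zip(ry, rx)]
    return rx


totals = warp_allreduce_sum(list(range(32)))
```

After each step, every lane holds the sum of a group twice as large as before, so five steps cover all 32 lanes and every lane ends with the same total.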
//.0000: TXD quad expansion (smear 0)
SHFL.IDX PT, Ry, Rx, 0, 0x1C03; // broadcast = 0, Mask = 5'b11100, Max = 3 (within quad)
//.1111: TXD quad expansion (smear 1)
SHFL.IDX PT, Ry, Rx, 1, 0x1C03; // broadcast = 1, Mask = 5'b11100, Max = 3 (within quad)
//.2222: TXD quad expansion (smear 2)
SHFL.IDX PT, Ry, Rx, 2, 0x1C03; // broadcast = 2, Mask = 5'b11100, Max = 3 (within quad)
//.3333: TXD quad expansion (smear 3)
SHFL.IDX PT, Ry, Rx, 3, 0x1C03; // broadcast = 3, Mask = 5'b11100, Max = 3 (within quad)
//.1032: DDX
SHFL.BFLY PT, Ry, Rx, 1, 0x1C03; // exchange with tid^1, Mask = 5'b11100, Max = 3 (within quad)
//.2301: DDY
SHFL.BFLY PT, Ry, Rx, 2, 0x1C03; // exchange with tid^2, Mask = 5'b11100, Max = 3 (within quad)
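The quad patterns above can be reproduced with a short model that applies Sc = 0x1C03 (mask 11100, clamp 3) to a 32-lane warp. As before this is only a sketch: the function `shfl` and the constant `QUAD_SC` are hypothetical names, all threads are assumed active, and the bitwise minLane/maxLane formulation is an assumption extended to the 11100 mask used here.

```python
QUAD_SC = 0x1C03   # Sc[12:8] = 0b11100 (mask), Sc[4:0] = 3 (clamp)


def shfl(mode, values, sb, sc):
    """Model SHFL.IDX / SHFL.BFLY across a 32-lane warp; returns Rd per lane."""
    index = sb & 0x1F
    clamp = sc & 0x1F
    mask = (sc >> 8) & 0x1F
    low_bits = ~mask & 0x1F
    rd = []
    for lane in range(32):
        min_lane = lane & mask
        max_lane = min_lane | (clamp & low_bits)
        if mode == "IDX":
            j = min_lane | (index & low_bits)
        elif mode == "BFLY":
            j = lane ^ index
        else:
            raise ValueError("this sketch models only IDX and BFLY")
        if j > max_lane:
            j = lane               # out of range: read own value
        rd.append(values[j])
    return rd


rx = list(range(32))               # each lane holds its own lane id
smear0 = shfl("IDX", rx, 0, QUAD_SC)    # .0000 pattern
smear3 = shfl("IDX", rx, 3, QUAD_SC)    # .3333 pattern
ddx = shfl("BFLY", rx, 1, QUAD_SC)      # .1032 pattern
ddy = shfl("BFLY", rx, 2, QUAD_SC)      # .2301 pattern
```

Seeding each lane with its own lane id makes the result read directly as the source-lane map: within the first quad, `ddx` starts 1, 0, 3, 2 and `ddy` starts 2, 3, 0, 1, matching the swizzle names in the comments.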