LEA : Compute Effective Address

Format:

SPA 5.0:
        {@{!}Pg}   LEA{.LO*}{.X}   Plg, Rd,      {-}Ra, Sb{, #ScaleU05}        {&req_6}   {?sched}   ;   
        {@{!}Pg}   LEA{.LO*}{.X}        Rd{.CC}, {-}Ra, Sb{, #ScaleU05}        {&req_6}   {?sched}   ;   
        {@{!}Pg}   LEA.HI{.X}      Plg, Rd,      {-}Ra, Sb, Rc {, #ScaleU05}   {&req_6}   {?sched}   ;   
        {@{!}Pg}   LEA.HI{.X}           Rd{.CC}, {-}Ra, Sb, Rc {, #ScaleU05}   {&req_6}   {?sched}   ;   

    .hilo   : {LO*,HI} calculate lower or upper 32 bits of 64 bit address register.
    .X      : Upper bit extended precision add, typically used with LEA.HI
    .CC     : Writes CC when specified.

    Rc cannot be specified with LEA.LO.
    #ScaleU05 defaults to zero if not specified.
    LEA can write at most one of CC and Plg; it cannot write both in a single instruction.

LEA.LO allows the following sources for Ra and Sb:

    Ra (OFFSET_LO)      Sb (BASE_LO)
    ----------------    ------------------------------------
    {-}Ra (register)    Sb (constant with immediate address)
    {-}Ra (register)    Sb (#Imm20)
    {-}Ra (register)    Rb (register)

LEA.HI allows the following sources for Ra, Sb and Rc:

    Ra (OFFSET_LO)      Sb (BASE_HI)                            Rc (OFFSET_HI)      Notes
    ----------------    ------------------------------------    ----------------    ------------------------------------------------------
    {-}Ra (register)    Sb (constant with immediate address)    {-}Rc (register)    {Rc,Ra} is negated, or not, controlled by the nA bit
    {-}Ra (register)    Rb (register)                           {-}Rc (register)    {Rc,Ra} is negated, or not, controlled by the nA bit

Description:

LEA performs scaled offset addition, typically used in address pointer arithmetic for generic and global memory loads and stores. It adds a base value (BASE, typically in Sb) to a scaled offset (OFFSET), producing EFFECTIVE_ADDR. OFFSET can optionally be negated. BASE, OFFSET, and EFFECTIVE_ADDR can each be either 32 bits (in which case only LEA.LO is needed) or 64 bits (in which case both LEA.LO and LEA.HI are needed).
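
As a rough reference model of the 32-bit case, the scaled add can be sketched in Python. This is inferred from the comments in the Examples section and is not a bit-accurate description of the hardware. The function name `lea_lo_32` and its parameters are hypothetical, and the 0..31 range check assumes that #ScaleU05 denotes a 5-bit unsigned immediate.

```python
MASK32 = 0xFFFFFFFF

def lea_lo_32(offset, base, scale=0, negate=False):
    """Model of EFFECTIVE_ADDR = (+/-OFFSET << scale) + BASE, truncated to 32 bits."""
    if not 0 <= scale <= 31:
        raise ValueError("scale is assumed to be a 5-bit unsigned value (0..31)")
    if negate:
        offset = -offset  # models the optional {-}Ra negation
    return ((offset << scale) + base) & MASK32

# Byte elements (scale 0): address of B[7] with &B[0] = 0x1000
addr_byte = lea_lo_32(offset=7, base=0x1000)

# 8-byte elements (scale 3): address of A[2] with &A[0] = 0x1000
addr_double = lea_lo_32(offset=2, base=0x1000, scale=3)
```

The scale is a left shift, so it only covers element sizes that are powers of two.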

The final 32- or 64-bit result (EFFECTIVE_ADDR) is checked against the shared memory window aperture for the warp, to determine if the address falls in that window. The predicate register Plg is set accordingly. This predicate can then be used by generic memory load/store instructions (e.g. LD and ST) to steer the memory transactions.

Typically, LEA.LO is used with LEA.HI.X in order to get a 64-bit result. Additional LEA.HI.X's can be chained after an initial LEA.LO to perform an arbitrary precision base + scaled offset computation.
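
The LEA.LO / LEA.HI.X pairing can be sketched as follows. The model follows the per-instruction comments in the Examples section (low word: `(Ra << scale) + Sb`; high word: `({Rc,Ra} >> (32 - scale)) + Sb + CC.CF`). The Python function names are hypothetical, all inputs are assumed to be unsigned, and negation is left out for brevity.

```python
MASK32 = 0xFFFFFFFF

def lea_lo(ra, sb, scale):
    """Low word: (Ra << scale) + Sb. Returns (result, carry_out)."""
    total = ((ra << scale) & MASK32) + sb
    return total & MASK32, total >> 32

def lea_hi_x(ra, sb, rc, scale, carry_in):
    """High word: ({Rc,Ra} >> (32 - scale)) + Sb + carry_in."""
    offset64 = (rc << 32) | ra
    total = ((offset64 >> (32 - scale)) & MASK32) + sb + carry_in
    return total & MASK32, total >> 32

def lea64(base, offset, scale):
    """64-bit base + (offset << scale) built from one LO and one HI.X step."""
    lo, carry = lea_lo(offset & MASK32, base & MASK32, scale)
    hi, _ = lea_hi_x(offset & MASK32, base >> 32, offset >> 32, scale, carry)
    return (hi << 32) | lo

# The low-word add carries into the high word here:
# 0x1_FFFF_FFF0 + (5 << 3) = 0x2_0000_0018
addr = lea64(base=0x1_FFFF_FFF0, offset=5, scale=3)
```

The high step recovers the bits that the low step shifted out of Ra, which is why LEA.HI takes both Ra and Rc.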

Note that LEA does not produce "traditional" values in CC. In particular, CC.OF is !(shared-memory-window-detect) instead of an overflow bit. CC.CF, CC.ZF, and CC.SF have "normal" meanings, i.e. similar to those used by (for example) IADD or ISCADD. The redefinition of CC.OF means that LEA-produced CCs cannot be usefully combined with CC tests that use CC.OF. Notably, tests for zero/non-zero are still possible with CCs produced by LEA.
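
The CC behavior described above can be illustrated with a small model of a single 32-bit LEA.LO. Everything about the window test here is an assumption made for illustration: the source does not say how the aperture is represented, so `window_base` and `window_size` are hypothetical parameters and the range comparison is only one plausible reading of "shared-memory-window-detect". The flag definitions assume "normal" means zero result, sign bit, and carry out.

```python
MASK32 = 0xFFFFFFFF

def lea_lo_cc(ra, sb, scale, window_base, window_size):
    """Return (result, cc) for a non-.X, non-negated 32-bit LEA.LO model."""
    total = ((ra << scale) & MASK32) + sb
    result = total & MASK32
    # Hypothetical window test: a simple half-open address range
    in_window = window_base <= result < window_base + window_size
    cc = {
        "ZF": result == 0,        # zero result
        "SF": bool(result >> 31), # sign bit of the 32-bit result
        "CF": bool(total >> 32),  # carry out of the 32-bit add
        "OF": not in_window,      # NOT an overflow bit
    }
    return result, cc

inside, cc_inside = lea_lo_cc(2, 0x1000, 3, window_base=0x1000, window_size=0x100)
wrapped, cc_wrapped = lea_lo_cc(1, 0xFFFFFFFF, 0, window_base=0x1000, window_size=0x100)
```

In this model a signed-overflow test on `cc["OF"]` would really be testing window membership, which is the reason OF-based CC tests are not meaningful after LEA.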

Examples:


// Examples for Extended (64 bit) address computation.

// To compute address for double element in A[i+10] where i is 64 bits as well:
// BASE           = <R5,R4>             = &A[0]
// SCALED_OFFSET  = <R3,R2>             = i
// #scaleU5 = 3
// EFFECTIVE_ADDR = <R1,R0>             = &A[i]
// LD_OFFSET      = 10 x sizeof(double) = 80
// Plg is left in P0, for use with generic ops (LD/ST).
//
LEA.LO        R0.CC, R2, R4, 3            ?WAIT6  ; // R0 = ( R2 << 3 )     + R4
LEA.HI.X  P0, R1,    R2, R5, R3, 3        ?WAIT13 ; // R1 = ({R3,R2} >> 29) + R5 + CC.CF
LD.64         Rd, [R0 + 80], P0     &wr0  ?WAIT1  ;

// To compute address for double element in A[100 - i] where i is a signed 32-bit number:
// BASE           = <R5,R4>              = &A[0]
// SCALED_OFFSET  = <R2>                 = i
// #scaleU5 = 3
// EFFECTIVE_ADDR = <R1,R0>              = &A[-i]
// LD_OFFSET      = 100 x sizeof(double) = 800
// Plg is left in P0, for use with generic ops (LD/ST).
//
// We need a 64-bit value for SCALED_OFFSET; if i was unsigned, we could use {RZ,R2}.
// Given that i is signed, we have to first sign extend R2 into a temp register.
//  
BFE.S32       R3,     R2, 0x011f           ?WAIT1  ; // sign extend R2 into R3
LEA.LO        R0.CC, -R2, R4, 3            ?WAIT6  ; // R0 = ( -R2 << 3 )     + R4
LEA.HI.X  P0, R1,    -R2, R5, R3, 3        ?WAIT13 ; // R1 = (-{R3,R2} >> 29) + R5 + CC.CF
LD.64         Rd, [R0 + 800], P0     &wr0  ?WAIT1  ;
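
The need for the sign extension can be checked with a small Python model of the sequence above. It is a sketch that follows the instruction comments and treats -{R3,R2} as a 64-bit two's complement negation; the function names are hypothetical and the model is not meant to be bit-accurate with respect to the hardware.

```python
MASK32 = 0xFFFFFFFF
MASK64 = 0xFFFFFFFFFFFFFFFF

def sign_extend_hi(r2):
    """High word for a signed 32-bit value: all ones if bit 31 is set, else zero."""
    return MASK32 if r2 & 0x80000000 else 0

def lea64_negated(base, r2, scale, r3=None):
    """Model of base + (-{R3,R2} << scale) using a LO step and a HI.X step."""
    if r3 is None:
        r3 = sign_extend_hi(r2)                # role of the BFE.S32 step
    neg = (-((r3 << 32) | r2)) & MASK64        # -{R3,R2}
    lo_total = ((neg << scale) & MASK32) + (base & MASK32)
    hi_total = ((neg >> (32 - scale)) & MASK32) + (base >> 32) + (lo_total >> 32)
    return ((hi_total & MASK32) << 32) | (lo_total & MASK32)

base = 0x2_0000_0000
addr_pos = lea64_negated(base, 5, 3)            # i = 5  -> base - 40
addr_neg = lea64_negated(base, 0xFFFFFFFD, 3)   # i = -3 -> base + 24
addr_bad = lea64_negated(base, 0xFFFFFFFD, 3, r3=0)  # {RZ,R2}: treats i as unsigned
```

With `r3=0` the negative index is read as a large unsigned value, so the resulting address is far from the intended one.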

// Assume 64-bit base address of Array B of 128 byte structures is passed to kernel via constants {Const[0][4],Const[0][0]}
// Assume a desired 32-bit field ".field" inside the 128 byte structures is at offset 20.
// To calculate <R1,R0> = &B[i].field where i is unsigned 32 bits:
// BASE           = <c[0][4],c[0][0]> = &B[0]
// SCALED_OFFSET  = <R2>              = i
// EFFECTIVE_ADDR = <R1,R0>           = &B[i].field
// #scaleU5 = 7
// LD_OFFSET      = 20                = (void*)&B[0].field - (void*)&B[0]
// Plg is left in P0, for use with generic ops (LD/ST).
//
LEA.LO        R0.CC, R2, c[0][0], 7            ?WAIT6  ; // R0 = ( R2 << 7 )     + c[0][0]
LEA.HI.X  P0, R1,    R2, c[0][4], RZ, 7        ?WAIT13 ; // R1 = ({RZ,R2} >> 25) + c[0][4] + CC.CF
LD.32         Rd, [R0 + 20], P0          &wr0  ?WAIT1  ;


// 32bit (non .E) examples:

// To compute 32-bit address for an unsigned byte element B[i-2] where i is in R12, and &B[0] is in R2:
// BASE           = <R2>  = &B[0]
// SCALED_OFFSET  = <R12> = i
// #scaleU5 = 0
// EFFECTIVE_ADDR = <R10> = &B[i]
// LD_OFFSET      = -2
// Plg is left in P1, for use with generic ops (LD/ST).
//
LEA.LO  P1, R10, R12, R2             ?WAIT13 ;  // R10 = ( R12 << 0 ) + R2
LD.U8       Rd, [R10 - 2], P1  &wr0  ?WAIT1  ;


// Extended precision scaled add examples:

// U128 = U128 + U128<<17
// BASE          = <R3,R2,R1,R0>
// SCALED_OFFSET = <R11,R10,R9,R8>
// #scaleU5      = 17
// RESULT        = <R15,R14,R13,R12>
//
LEA.LO    R12.CC, R8, R0,      17 ?WAIT6 ;
LEA.HI.X  R13.CC, R8, R1, R9,  17 ?WAIT6 ;
LEA.HI.X  R14.CC, R9, R2, R10, 17 ?WAIT6 ;
LEA.HI.X  R15,   R10, R3, R11, 17        ;
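
The chain above can be cross-checked against ordinary big-integer arithmetic with a short Python sketch. It assumes the word-level behavior stated in the earlier example comments (each step adds the shifted {current,previous} offset words, the base word, and the incoming carry). The helper names are hypothetical, and both word lists are assumed to have the same length.

```python
MASK32 = 0xFFFFFFFF

def scaled_add_words(base_words, offset_words, scale):
    """Multi-word base + (offset << scale); words are 32-bit, least significant first."""
    result, carry, prev = [], 0, 0
    for base, off in zip(base_words, offset_words):
        # Word 0 behaves like LEA.LO (prev == 0); later words behave like LEA.HI.X
        shifted = (((off << 32) | prev) >> (32 - scale)) & MASK32
        total = shifted + base + carry
        result.append(total & MASK32)
        carry, prev = total >> 32, off
    return result

def to_words(value, count=4):
    """Split an unsigned integer into `count` 32-bit words, least significant first."""
    return [(value >> (32 * k)) & MASK32 for k in range(count)]

def from_words(words):
    """Reassemble an integer from 32-bit words, least significant first."""
    return sum(w << (32 * k) for k, w in enumerate(words))

base = 0x0123456789ABCDEF_FEDCBA9876543210
offset = 0x00000000FFFFFFFF_80000001DEADBEEF
result = from_words(scaled_add_words(to_words(base), to_words(offset), 17))
```

Here `result` equals `(base + (offset << 17))` truncated to 128 bits, matching the U128 = U128 + U128<<17 intent of the sequence.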
