FSWZADD : FP32 Add used for FSWZ emulation

Format:


SPA 5.0:
        {@{!}Pg}   FSWZADD{.FTZ}{.rnd}{.NDV}   Rd{.CC}, Ra, Rb, znpControl    {&req_6}   {?sched}   ;   

 .FTZ        Denormal inputs and outputs are flushed to sign-preserving 0.0.
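The .FTZ behavior can be illustrated with a small host-side model. This is only a sketch: the helper name ftz_fp32 is hypothetical, and it models the flush of a single fp32 value, not the hardware datapath.

```python
import struct

def ftz_fp32(x: float) -> float:
    """Flush an fp32 denormal to a zero of the same sign (hypothetical helper).

    x is first rounded to fp32; normal values, zeros, Inf and NaN pass through.
    """
    bits = struct.unpack("<I", struct.pack("<f", x))[0]
    exponent = (bits >> 23) & 0xFF
    mantissa = bits & 0x7FFFFF
    if exponent == 0 and mantissa != 0:
        # Denormal: keep only the sign bit, giving +0.0 or -0.0.
        bits &= 0x80000000
    return struct.unpack("<f", struct.pack("<I", bits))[0]
```

In this model a positive denormal such as 1e-40 becomes +0.0 and a negative one becomes -0.0, so the sign of the input survives the flush.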

 .rnd        {.RN*, .RM, .RP, .RZ}
    .RN - Round to nearest, ties to even. This is the default.
    .RM - Round towards -Infinity (floor)
    .RP - Round towards +Infinity (ceiling)
    .RZ - Round towards 0 (truncate)
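The effect of the four rounding modes on an fp32 add can be sketched on the host as follows. This is an illustrative model only: add_fp32 and its helpers are hypothetical names, the inputs are assumed to be finite values already representable in fp32, the sum is assumed not to overflow, and .FTZ is not modeled.

```python
import struct
from fractions import Fraction

def _bits(x: float) -> int:
    # Round x to fp32 (nearest even) and return its bit pattern.
    return struct.unpack("<I", struct.pack("<f", x))[0]

def _from_bits(b: int) -> float:
    return struct.unpack("<f", struct.pack("<I", b))[0]

def _step(x: float, up: bool) -> float:
    """Return the fp32 neighbour of a finite nonzero x, one ulp up or down."""
    away_from_zero = (x > 0) == up
    return _from_bits(_bits(x) + 1 if away_from_zero else _bits(x) - 1)

def add_fp32(a: float, b: float, rnd: str = "RN") -> float:
    """Model an fp32 add under rounding mode RN, RM, RP or RZ (hypothetical)."""
    if rnd not in ("RN", "RM", "RP", "RZ"):
        raise ValueError("unknown rounding mode: " + rnd)
    exact = Fraction(a) + Fraction(b)      # exact rational sum
    nearest = _from_bits(_bits(a + b))     # round-to-nearest-even fp32 result
    err = Fraction(nearest) - exact
    if rnd == "RN" or err == 0:
        return nearest
    if rnd == "RP":
        want_up = True                     # result must be >= exact sum
    elif rnd == "RM":
        want_up = False                    # result must be <= exact sum
    else:
        want_up = exact < 0                # RZ: move toward zero
    if (err > 0) == want_up:
        return nearest                     # nearest already lies on the wanted side
    return _step(nearest, want_up)
```

For example, 1.0 + 2^-25 is not representable in fp32: in this model .RN, .RM and .RZ all return 1.0, while .RP returns the next value above 1.0, which is 1.0 + 2^-23.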

 .NDV        Force the quad to be treated as non-divergent.
             If .NDV is FALSE,
                  the quad is considered "divergent" if some threads in the quad are active and some are not.
                  If the quad is hw divergent, the output is forced to 0.0 or
                  +Inf (depending on State.ShaderControl.DefaultPartial).
             If .NDV is TRUE,
                  the hw quad divergence bit is ignored
                  and the quad is deemed hw non-divergent, allowing the expected fp add.
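The .NDV rule above can be sketched as a small decision function. This is a sketch under assumptions: the function names are hypothetical, and default_partial_value stands in for whichever of 0.0 or +Inf State.ShaderControl.DefaultPartial selects, since the mapping of that state to a value is not given here.

```python
def quad_is_divergent(active):
    """A quad is divergent when some, but not all, of its 4 threads are active."""
    return any(active) and not all(active)

def fswzadd_output(add_result, active, ndv, default_partial_value):
    """Select the value written for one thread (hypothetical model).

    add_result            -- the ordinary fp add result for this thread
    active                -- 4 booleans, one per thread in the quad
    ndv                   -- True when the .NDV modifier is present
    default_partial_value -- assumed stand-in: 0.0 or +Inf, as selected by
                             State.ShaderControl.DefaultPartial
    """
    if not ndv and quad_is_divergent(active):
        return default_partial_value
    return add_result
```

With one inactive thread and no .NDV, the model returns the default partial value. With .NDV set, or with all four threads active, it returns the ordinary add result.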

 .CC         Write condition code flags

znpControl :  Specifies modifiers for the Ra and Rb source registers, as 4 sets of character pairs.
              Each set is associated with a specific thread/pixel in a quad.
              The sets are ordered UL, UR, LL, LR within the pixel quad, i.e.

              | Thread0 (P0:UL)  Thread1 (P1:UR) |
              | Thread2 (P2:LL)  Thread3 (P3:LR) |

              The valid modifier control character pairs are:
              znpControl table

              char pair   Ra modifier     Rb modifier
              PP          none            none
              NP          Negate          none
              PN          none            Negate
              ZP          Force to Zero   none
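As an illustration of how a znpControl string maps onto the table, the sketch below splits the 8-character string into four per-thread pairs and applies one pair to a pair of source values. The function names are hypothetical, and Python floats are used in place of fp32, so rounding and .FTZ are not modeled.

```python
# Per-character source modifiers: P = pass through, N = negate, Z = force to zero.
MODIFIERS = {
    "P": lambda v: v,
    "N": lambda v: -v,
    "Z": lambda v: 0.0,
}

# Only the pairs listed in the znpControl table are accepted.
VALID_PAIRS = ("PP", "NP", "PN", "ZP")

def decode_znp_control(control):
    """Split an 8-character control string into pairs for UL, UR, LL, LR."""
    if len(control) != 8:
        raise ValueError("znpControl needs 4 character pairs")
    pairs = [control[i:i + 2] for i in range(0, 8, 2)]
    for pair in pairs:
        if pair not in VALID_PAIRS:
            raise ValueError("invalid znpControl pair: " + pair)
    return pairs

def apply_pair(pair, ra, rb):
    """Apply the Ra and Rb modifiers of one pair, then add."""
    return MODIFIERS[pair[0]](ra) + MODIFIERS[pair[1]](rb)
```

For the control string PNNPPNNP, the decoded pairs are PN, NP, PN, NP, so threads 0 and 2 compute Ra - Rb and threads 1 and 3 compute Rb - Ra.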

Description:

Adds the fp32 sources Ra and Rb, after applying the per-thread znpControl modifiers, and writes the result to the destination register Rd. Used as part of FSWZ emulation.

Examples:

// DDX implementation
SHFL.BFLY  PT, Ry, Rx, 1,  0x1C03;  // exchange with tid^1, Mask = 5'b11100, Max = 3 (within quad)
FSWZADD   R0, Ry, Rx, PNNPPNNP;     // P0,P2: Ry - Rx; P1,P3: Rx - Ry (right minus left in every thread)

// DDY implementation for DirectX
SHFL.BFLY  PT, Ry, Rx, 2,  0x1C03;  // exchange with tid^2, Mask = 5'b11100, Max = 3 (within quad)
FSWZADD   R0, Ry, Rx, PNPNNPNP;     // P0,P1: Ry - Rx; P2,P3: Rx - Ry (lower minus upper in every thread)

// DDY implementation for OpenGL 
SHFL.BFLY  PT, Ry, Rx, 2,  0x1C03;  // exchange with tid^2, Mask = 5'b11100, Max = 3 (within quad)
FSWZADD   R0, Ry, Rx, PNPNNPNP;     // P0,P1: Ry - Rx; P2,P3: Rx - Ry (lower minus upper in every thread)
S2R R1, SR18;                       // accounts for screen origin inversion
FMUL R0, R0, R1;
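The three sequences above can be checked with a small host-side model of one quad. This is a sketch under stated assumptions: FSWZADD is assumed to take the shuffled value as Ra and the thread's own value as Rb, the function names are hypothetical, y_direction is a hypothetical stand-in for the value read from SR18, and divergence, rounding and .FTZ are not modeled.

```python
MODIFIERS = {"P": lambda v: v, "N": lambda v: -v, "Z": lambda v: 0.0}

def shfl_bfly(values, lane_xor):
    """Model SHFL.BFLY inside one quad: thread t reads thread t ^ lane_xor."""
    return [values[t ^ lane_xor] for t in range(4)]

def fswzadd(ra, rb, control):
    """Apply the per-thread znpControl pair to Ra and Rb, then add."""
    out = []
    for t in range(4):
        mod_a = MODIFIERS[control[2 * t]]
        mod_b = MODIFIERS[control[2 * t + 1]]
        out.append(mod_a(ra[t]) + mod_b(rb[t]))
    return out

def ddx(rx):
    # Assumption: Ra = value from the horizontal neighbour, Rb = own value.
    return fswzadd(shfl_bfly(rx, 1), rx, "PNNPPNNP")

def ddy_directx(rx):
    # Assumption: Ra = value from the vertical neighbour, Rb = own value.
    return fswzadd(shfl_bfly(rx, 2), rx, "PNPNNPNP")

def ddy_opengl(rx, y_direction):
    # y_direction is a hypothetical stand-in for the factor read from SR18.
    return [d * y_direction for d in ddy_directx(rx)]
```

For a quad holding UL = 1.0, UR = 4.0, LL = 10.0, LR = 15.0, this model gives ddx = [3.0, 3.0, 5.0, 5.0] and ddy_directx = [9.0, 11.0, 9.0, 11.0]. Each row shares one horizontal difference and each column shares one vertical difference, which is the behavior expected of quad derivatives.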