SPA 5.0:
{@{!}Pg} CCTL{.E}{.cache}.cop [Ra + ImmS32] {&req_6} {&rdN} {?sched} ;    // Cache control
{@{!}Pg} CCTL{.cache}.IVALL {&req_6} {&rdN} {?sched} ;    // Cache invalidate
{@{!}Pg} CCTL.C.IVALL {&req_6} {?sched>=?WAIT5} ;
{@{!}Pg} CCTL.I.IVALL {&req_6} {?sched>=?WAIT5} ;
{@{!}Pg} CCTLL.cop [Ra + ImmS24] {&req_6} {&rdN} {?sched} ;    // Cache control with Local address
{@{!}Pg} CCTLL.IVALL {&req_6} {&rdN} {?sched} ;
Omit register Ra to specify an unsigned absolute address; see the [ImmU32] form of CCTL.
{@{!}Pg} CCTLL.CRS.WBALL {&req_6} {&rdN} {?sched>=?WAIT5} ;
.E: Extended address (64 bits, requires two registers)

.cache: { .D*, .U, .C, .I, .CRS }
  .D*   Data cache hierarchy L1, L2 with generic byte address, default*
  .U    Deprecated, implemented merely as an alias for .D
  .C    Constant cache hierarchy L1, L2 with constant (slot<<16) + byte address
  .I    Instruction cache hierarchy L1, L2 with instruction byte address
  .CRS  Call return stack cache hierarchy. Only valid with CCTLL.CRS.WBALL

.cop: { .PF1, .PF2, .WB, .IV, .IVALL, .RS }

.cache  |  .D    |  .C    |  .I    | .CRS   |
.cop    | L1 L2  | L1 L2  | L1 L2  | pre-L1 | Cache operation
.PF1    | Y  Y   | -  -   | -  -   |   -    | Pre-fetch line into cache level 1
.PF2    | -  Y   | -  -   | -  -   |   -    | Pre-fetch line into cache level 2
.WB     | Y  -   | -  -   | -  -   |   -    | Write back dirty cache line (flush to memory)
.WBALL  | -  -   | -  -   | -  -   |   Y    | Write back dirty cache line (flush to memory)
.IV     | Y  -   | -  -   | -  -   |   -    | Invalidate cache line (if dirty, first writeback)
.IVALL  | Y  -   | Y  -   | Y  -   |   -    | Invalidate all cache lines (if dirty, first writeback)
.RS     | Y  -   | -  -   | -  -   |   -    | Reset line (mark invalid, without prior writeback)

Exceptions:
.IVALL and .WBALL require Ra to be RZ and the Immediate to be zero, and cannot be used with .E.
CCTL.C.IVALL and CCTL.I.IVALL cannot be specified with a &rd scoreboard. Because CCTL.C.IVALL and CCTL.I.IVALL are executed in dispatch, they do not go into any VQ.
.QRY1 is currently unimplemented, and will signal an illegal instruction encoding.
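The modifier table and the exceptions above can be read as a small rule set. The following Python sketch is illustrative only: the names SUPPORTED, ADDRESSLESS and check_cctl are hypothetical and are not part of any assembler. It treats .U (and an omitted .cache) as .D, and it does not distinguish CCTL from CCTLL.

```python
# Supported (.cache, .cop) pairs, transcribed from the table above.
# Hypothetical helper names; not part of any real assembler or toolchain.
SUPPORTED = {
    ".D": {".PF1", ".PF2", ".WB", ".IV", ".IVALL", ".RS"},
    ".C": {".IVALL"},
    ".I": {".IVALL"},
    ".CRS": {".WBALL"},
}

# Operations that take no address: Ra must be RZ, immediate must be 0, no .E.
ADDRESSLESS = {".IVALL", ".WBALL"}


def check_cctl(cache, cop, ra_is_rz=True, imm=0, extended=False):
    """Return a list of rule violations (an empty list means none were found)."""
    # .D is the default cache; .U is a deprecated alias for .D.
    cache = ".D" if cache in (None, ".U") else cache
    problems = []
    if cop not in SUPPORTED.get(cache, set()):
        problems.append(f"{cop} is not listed for {cache}")
    if cop in ADDRESSLESS:
        if not ra_is_rz or imm != 0:
            problems.append(f"{cop} requires Ra = RZ and a zero immediate")
        if extended:
            problems.append(f"{cop} cannot be used with .E")
    return problems
```

For example, check_cctl(".D", ".PF1", ra_is_rz=False, imm=4) reports no violations, while check_cctl(".C", ".PF1") reports that the combination is not listed in the table.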
{@{!}Pg} CCTL{.E}{.cache}.cop [ImmU32] {&req_6} {&rdN} {?sched} ;    // Cache control with absolute address
CCTL and CCTLL control or query a cache line that contains a specified address.
The generic byte address is the 32-bit sum of register Ra and the 32-bit signed immediate offset ImmS32 (or ImmS24), which is then zero-extended to 40 bits. If the .E extension is specified, the generic byte address is the sum of the 64-bit value (R[a],R[a+1]) and the sign-extended immediate offset ImmS32. If register Ra is omitted, is equal to RZ, or lies beyond the set of registers supported for the shader, the effective address is the zero-extended absolute immediate byte offset. An omitted Ra register is assembled as RZ, which reads as zero. An omitted immediate offset is assembled as zero. All offsets are specified in bytes.
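The address rules above can be sketched in Python as follows. This is an illustration under stated assumptions, not a description of the hardware: the function names are hypothetical, R[a] is assumed to hold the low 32 bits of the 64-bit pair (R[a],R[a+1]), the .E sum is assumed to wrap at 64 bits, and the immediate is passed as the raw encoded field.

```python
MASK32 = 0xFFFF_FFFF
MASK64 = 0xFFFF_FFFF_FFFF_FFFF


def sign_extend(value, bits):
    """Interpret the low `bits` bits of value as a two's-complement integer."""
    value &= (1 << bits) - 1
    sign = 1 << (bits - 1)
    return (value ^ sign) - sign


def cctl_effective_address(regs, ra, imm_raw, extended=False, imm_bits=32):
    """Sketch of the CCTL/CCTLL effective byte address.

    regs     -- list of 32-bit register values available to the shader
    ra       -- register index, or None when Ra is omitted or RZ
    imm_raw  -- raw immediate field (imm_bits wide: 32 for CCTL, 24 for CCTLL)
    extended -- True when the .E modifier is present
    """
    if ra is None or ra >= len(regs):
        # Ra omitted, RZ, or out of range: zero-extended absolute offset.
        return imm_raw & ((1 << imm_bits) - 1)
    offset = sign_extend(imm_raw, imm_bits)
    if extended:
        # Assumption: R[a] is the low word, R[a+1] the high word.
        base = (regs[ra] & MASK32) | ((regs[ra + 1] & MASK32) << 32)
        return (base + offset) & MASK64
    # 32-bit addition; the result is already zero-extended as a Python int.
    return (regs[ra] + offset) & MASK32
```

With R3 = 0x1000, the operand [R3 + 4] yields 0x1004, and an immediate field of 0xFFFFFFFC (that is, -4) yields 0xFFC. The 32-bit form wraps on overflow before zero-extension, so R3 = 0xFFFFFFFC with offset 8 yields 4.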
The effective address is interpreted within the cache address space specified by CCTL.cache.
Three cached address spaces can be controlled or queried with CCTL: data addresses, constant addresses, and instruction addresses. Use a generic thread byte address for the .D cache hierarchy. Use a constant byte address, (slot<<16) + address, for the .C cache hierarchy. Use an instruction byte address for the .I cache hierarchy.
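As a small worked example of the constant address format (slot<<16) + address, the Python sketch below packs a slot number and a byte offset. The function name is hypothetical, and the 16-bit limit on the byte offset is an assumption inferred from the shift amount rather than something the text states.

```python
def constant_cache_address(slot, byte_offset):
    """Pack a constant slot and byte offset as (slot << 16) + byte_offset.

    Assumption: the byte offset must fit in the 16 bits below the slot field.
    """
    if not 0 <= byte_offset < (1 << 16):
        raise ValueError("byte offset assumed to fit in 16 bits")
    return (slot << 16) + byte_offset
```

For instance, slot 2 with byte offset 0x10 packs to 0x20010.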
Local memory CCTLL addresses are within the Local data window.
The CCTL instruction controls or queries the cache line that contains the supplied address. CCTLL evaluates the effective per-thread Local address of [Ra + ImmS24] within the Local window and performs operation .cop on the selected Local data cache line.
Cache operation CCTLL.CRS.WBALL writes back the contents of the call return stack cache hierarchy for the issuing warp. The contents of the SM's top-of-stack cache are written back to L1. Use this instruction to ensure that the CRS data in L1 reflects the data in the SM top-of-stack cache. The CRS caches for other warps may also be written back as a side effect.
CCTLL.CRS.WBALL must interlock on L1 accepting all pending CRS token writebacks, so that any subsequent CCTLL.IVALL can be used to flush all written-back CRS data out of L1.
SOFTWARE NOTE: CCTLL.CRS.WBALL is used to save the current contents of CRS to memory. It cannot be used for the reverse operation: to "backfill" data into the CRS from the local memory backing store. To restore CRS data from the backing store, first use SETCRSPTR to set the call return stack pointer to 0 (so that the stack is empty), restore the backing store data, and finally restore the stack pointer to the desired value. See SETCRSPTR for details.
CCTL.D.IVALL does not take an address; it always invalidates all global lines in the L1 LG cache. Similarly, CCTL.U.IVALL does not take an address; it always invalidates all global lines in the indexed constant cache. CCTLL.IVALL always invalidates all local lines in the L1 LG cache, potentially triggering a writeback. Unlike the other suffixes, .IVALL takes no address specification on either CCTL or CCTLL, so the Ra field in the instruction encoding is assembled as RZ and the immediate field is assembled as 0. A binary encoding that specifies any other value for the Ra or immediate field with the .IVALL suffix is an illegal encoding.
CCTL.D.PF1 [R3 + 4];