XDNA2 Kernel

While all optimization objectives stated in the XDNA1 Kernel chapter remain valid, transferring BFP16 data is more complex than transferring BF16 data. Therefore, we use a different L1 data layout to account for the pipelined BFP16 loads and stores described in the XDNA2 ISA section.

L1 Data Layout

_images/xdna2_data_and_registers.svg

Fig. 3 Data layout of the tensor contraction kernel in the L1 scratchpad memory of an executing compute tile. The kernel computes [m2,k1,m1,m0,k0],[n2,k1,n1,n0,k0]->[m1,m2,n2,n1,m0,n0], with fixed sizes |m0|=8, |k0|=8, |n0|=8, |m1|=2, and |n1|=2. The first computation of a 2×2 block of output tiles is highlighted in darker colors.

Fig. 3 illustrates the data layout of the BFP16 tensor contraction kernel with FP32 accumulation: [m2,k1,m1,m0,k0],[n2,k1,n1,n0,k0]->[m1,m2,n2,n1,m0,n0]. The tensors are tiled based on the BFP16 VMAC.F operation requirements, meaning that all three tensors use 8×8 tiles. In detail, in0 has tiles of size |m0|×|k0|, in1 of size |n0|×|k0|, and out of size |m0|×|n0|, where |m0|=8, |k0|=8, and |n0|=8. A BFP16 input tile, that is, a tile of either in0 or in1, has a size of 72 bytes. An FP32 tile of out has a size of 256 bytes.
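The tile sizes quoted above can be reproduced with a few lines of arithmetic. The following Python sketch is illustrative only; it assumes that BFP16 stores one 8-bit mantissa per value plus one shared 8-bit exponent per block of eight values, which is consistent with the 72-byte figure but is not stated in this chapter.

```python
TILE_DIM = 8                             # |m0| = |k0| = |n0| = 8
VALUES_PER_TILE = TILE_DIM * TILE_DIM    # 64 values per 8x8 tile

# Assumption: one mantissa byte per value and one shared exponent byte
# per block of eight values.
BFP16_BLOCK = 8
bfp16_tile_bytes = VALUES_PER_TILE + VALUES_PER_TILE // BFP16_BLOCK

# FP32 accumulator tile: four bytes per value.
fp32_tile_bytes = VALUES_PER_TILE * 4

print(bfp16_tile_bytes, fp32_tile_bytes)  # 72 256
```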

Given that XDNA2 has only five accumulation registers, the kernel employs a 2×2 register blocking for the output tensor. This blocking is described by the dimensions m1 and n1 with |m1|=2 and |n1|=2 in the einsum. Fig. 3 shows the accumulator blocking in gray, using two accumulation registers (DM0 and DM1), each accumulating two tiles. The kernel design section gives a detailed explanation of the mapping of four output tiles to two accumulation registers.

The L1 data layout is more complex due to potential bank conflicts. Specifically, XDNA2 has a total of four logical memory banks. We assign one of the banks to in0 and another one to in1. This leaves two banks for loading and storing the output tensor out. To fully utilize both banks, our L1 data layout evenly distributes the output tensor across the two banks. Accordingly, m1 is the outermost dimension of the output tensor, where the slice with m1=0 is placed on one bank and the slice with m1=1 on the other.

The sizes of the remaining dimensions m2, n2, and k1 are flexible but must be chosen so that the tensors fit into the L1 scratchpad.
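As a rough aid for choosing these sizes, the sketch below computes the byte footprint of the three tensors from the einsum. The function names and the 64 KiB capacity are assumptions made for illustration; the check also ignores the stack, other buffers, and the per-bank placement constraints discussed above.

```python
def l1_footprint(m2, k1, n2, m1=2, n1=2, in_tile=72, out_tile=256):
    """Return the byte sizes of in0, in1, and out for the given sizes."""
    in0 = m2 * k1 * m1 * in_tile          # [m2,k1,m1] BFP16 tiles
    in1 = n2 * k1 * n1 * in_tile          # [n2,k1,n1] BFP16 tiles
    out = m1 * m2 * n2 * n1 * out_tile    # [m1,m2,n2,n1] FP32 tiles
    return in0, in1, out


def fits(m2, k1, n2, capacity_bytes):
    """True if the three tensors together fit into capacity_bytes."""
    return sum(l1_footprint(m2, k1, n2)) <= capacity_bytes


L1_CAPACITY = 64 * 1024  # assumption: total L1 scratchpad size in bytes

print(l1_footprint(4, 12, 4))       # (6912, 6912, 16384)
print(fits(4, 12, 4, L1_CAPACITY))  # True
```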

Design

As discussed in the previous section, our L1 data layout can be expressed by the einsum [m2,k1,m1,m0,k0],[n2,k1,n1,n0,k0]->[m1,m2,n2,n1,m0,n0] with the fixed sizes |m0|=8, |k0|=8, |n0|=8, |m1|=2, and |n1|=2. Similar to the XDNA1 kernel, the VMAC.F operation consumes the dimensions m0, k0, and n0. Dimensions m1 and n1 represent our accumulator blocking. The sizes of m2, k1, and n2 remain to be selected.

Before discussing the design decisions of our XDNA2 tensor contraction kernel, we summarize key hardware properties that must be considered:

  1. Loading data into a 256-byte accumulation register requires four 64-byte loads. VLDA can load directly into the accumulation registers. However, VLDB can only load into vector registers. Each load has a latency of seven cycles. A VMOV operation copies 64 bytes from a vector register into an accumulation register and has a latency of two cycles.

  2. VMAC.F reads from the accumulation register in its fourth cycle (forwarding).

  3. Properties 1 and 2 allow us to formulate scheduling rules when a VMAC.F operation depends on loads to the accumulation register. Specifically, we can issue a VMAC.F operation at the earliest in the fifth cycle of the fourth 64-byte load. In other words, the last load and the dependent VMAC.F must be four cycles apart.

  4. Loading a 72-byte vector register requires one VLDA.POP or VLDB.POP from a pre-filled load pipeline. This load has an 8-cycle latency.

  5. A VMAC.F operation has a latency of six cycles.
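The scheduling rule in property 3 follows directly from properties 1 and 2. A minimal sketch of the arithmetic, assuming that a load issued in cycle t makes its data readable from cycle t+7 on and that a VMAC.F issued in cycle u reads its accumulator in cycle u+3 (the function name is hypothetical):

```python
LOAD_LATENCY = 7      # property 1: latency of a 64-byte load
VMAC_READ_CYCLE = 4   # property 2: VMAC.F reads the accumulator in its 4th cycle


def earliest_vmac_issue(last_load_issue):
    """Earliest issue cycle of a VMAC.F that depends on the given load."""
    data_ready = last_load_issue + LOAD_LATENCY     # first cycle the data can be read
    return data_ready - (VMAC_READ_CYCLE - 1)       # issue so that the read lands there


print(earliest_vmac_issue(10) - 10)  # 4
```

The result matches the four-cycle distance stated in property 3.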

Output Stationary

See Output Stationary in the XDNA1 kernel design.

Register Blocking

Dimensions m1 and n1 are handled by a 2×2 register blocking; therefore, |m1|=|n1|=2.

Streaming of Input Tiles

The L1 data layout stores the input tiles of in0 in k1×m1 blocks and the tiles of in1 in k1×n1 blocks. Thus, for the computation of a 2×2 block of output tiles, we can use one linear stream (using VLDA) to load the data of in0 and one (using VLDB) to load the data of in1. The highest performance is obtained when consecutively loading eight 72-byte tiles. Additionally, the loads and stores must be 64-byte aligned.

Bank-Aware Buffer Placement

We split the output tensor into two parts along the m1 dimension, allowing for simultaneous access to both halves without bank conflict. In detail, we place the output tensor in a buffer spanning two consecutive memory banks. The first half of the buffer is aligned to the end of the first memory bank, and the second half is aligned to the start of the second bank, forming one contiguous buffer. The implementation passes two pointers for out, one per half.
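The placement rule can be sketched as a small address computation. The helper below is hypothetical, and the 16 KiB bank size and the bank index used in the example are assumptions, not values taken from this chapter.

```python
def place_output(bank_size, first_bank, out_bytes):
    """Return (ptr_m1_0, ptr_m1_1) for an output buffer that straddles the
    boundary between bank `first_bank` and the following bank."""
    half, remainder = divmod(out_bytes, 2)
    if remainder or half > bank_size:
        raise ValueError("each half must fit into one bank")
    boundary = (first_bank + 1) * bank_size
    # First half ends at the boundary, second half starts at it.
    return boundary - half, boundary


BANK_SIZE = 16 * 1024   # assumption: size of one logical bank in bytes
ptr0, ptr1 = place_output(BANK_SIZE, first_bank=2, out_bytes=16384)
print(ptr0, ptr1)       # 40960 49152
```

Both pointers are 64-byte aligned in this example, which matches the alignment requirement on loads and stores.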

Linear Contraction Dimension

As in the XDNA1 kernel, the k1 dimension is handled with linear code.

Accumulation Chains

XDNA2 has only five 2048-bit accumulation registers. Therefore, we cannot use a simple double-buffering scheme for the 2×2 register blocking, which would require eight registers.

However, the 6-cycle VMAC.F operation only requires valid data in the accumulation register when it reads from the register in the fourth cycle. This means that the operation writing the required data to the accumulation register must complete in the third cycle of VMAC.F. At all other times, the register may hold unrelated data. Assuming that we write the accumulation data for the VMAC.F in the third cycle, we can:

  1. Write unrelated data to the accumulation register in cycles 1–2, and 4–5.

  2. Read unrelated data from the accumulation register in cycles 1–3, and 5–6.

Thus, VMAC.F operations accumulating in a single output tile must be scheduled at least three cycles apart:

[-- --] [-- --] [-- --] [R0 --] [-- --] [-- W0] [R1 --] [-- --]
   1       2       3       4       5       6       7       8

[-- W1] [R2 --] [-- --] [-- W2] [R3 --] [-- --] [-- W3] [R4 --]
   9       10      11      12      13      14      15      16

Here, R0 is the accumulation register read of the first VMAC.F operation, and R1–R4 are the reads of the following four operations. Correspondingly, W0–W3 are the writes.

In practice, our kernel schedules the VMAC.F accumulation chains with a distance of four cycles:

[-- --] [-- --] [-- --] [R0 --] [-- --] [-- W0] [-- --] [R1 --]
   1       2       3       4       5       6       7       8

[-- --] [-- W1] [-- --] [R2 --] [-- --] [-- W2] [-- --] [R3 --]
   9       10      11      12      13      14      15      16

The four-cycle distance allows us to interleave two accumulation chains. The second chain has a two-cycle offset:

[-- --] [-- --] [-- --] [-- --] [-- --] [Q0 --] [-- --] [-- P0]
   1       2       3       4       5       6       7       8

[-- --] [Q1 --] [-- --] [-- P1] [-- --] [Q2 --] [-- --] [-- P2]
   9       10      11      12      13      14      15      16

This allows us to use the same accumulation register for both chains:

[-- --] [-- --] [-- --] [R0 --] [-- --] [Q0 W0] [-- --] [R1 P0]
   1       2       3       4       5       6       7       8

[-- --] [Q1 W1] [-- --] [R2 P1] [-- --] [Q2 W2] [-- --] [R3 P2]
   9       10      11      12      13      14      15      16

We see that we obtain an execution throughput of 1/2 after cycle 6, that is, in every other cycle a VMAC.F operation completes. We obtain an execution throughput of 1 by scheduling two additional interleaved chains:

[-- --] [-- --] [-- --] [-- --] [S0 --] [-- --] [U0 T0] [-- --]
   1       2       3       4       5       6       7       8

[S1 V0] [-- --] [U1 T1] [-- --] [S2 V1] [-- --] [U2 T2] [-- --]
   9       10      11      12      13      14      15      16

These two chains use the second accumulation register and are offset by one cycle relative to the first pair of chains. Together, the four chains reach an execution throughput of 1 after cycle 6.

In summary, to fully utilize the vector unit, our kernel requires two accumulation registers and four accumulation chains. We use the remaining three accumulation registers for hiding the L1-register transfers.
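The interleaving argument can be checked with a small simulation. The Python sketch below is not part of the kernel. It assumes that a VMAC.F issued in cycle c reads its accumulator in cycle c+3 and writes it in cycle c+5, and that a write only becomes visible to reads in later cycles. The chain names follow the diagrams above; `simulate` and its tuple format are hypothetical.

```python
READ_CYCLE = 4    # VMAC.F reads its accumulator in its fourth cycle
WRITE_CYCLE = 6   # VMAC.F writes its result in its sixth cycle


def simulate(chains, num_ops):
    """chains: list of (name, register, first_issue_cycle, distance).
    Returns the reads that did not see the previous result of their chain."""
    events = []
    for name, reg, first, dist in chains:
        for i in range(num_ops):
            issue = first + i * dist
            events.append((issue + READ_CYCLE - 1, "read", reg, name, i))
            events.append((issue + WRITE_CYCLE - 1, "write", reg, name, i))
    # Within one cycle, reads are processed before writes (assumption).
    events.sort(key=lambda e: (e[0], e[1] == "write"))

    content = {}   # register -> (chain, op index) of the last write
    hazards = []
    for cycle, kind, reg, name, i in events:
        if kind == "write":
            content[reg] = (name, i)
        elif i > 0 and content.get(reg) != (name, i - 1):
            hazards.append((cycle, reg, name, i))
    return hazards


# Two registers, four chains, issue offsets as in the diagrams above.
KERNEL_CHAINS = [
    ("R", "dm0", 1, 4), ("Q", "dm0", 3, 4),
    ("S", "dm1", 2, 4), ("U", "dm1", 4, 4),
]
print(simulate(KERNEL_CHAINS, num_ops=12))  # []
```

Under the same model, a single chain with a distance of two cycles, or two chains on one register with a one-cycle offset, produce hazards, which agrees with the three-cycle minimum and the two-cycle offset described above.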

Single Hardware Loop

The m2 and n2 iterations are implemented using a single hardware loop. As in the XDNA1 kernel, the first and last 2×2 blocks of output tiles are computed outside of this loop, forming a warm-up and a cool-down phase.

Accumulation Register Buffering

To summarize, given the einsum [m2,k1,m1,m0,k0],[n2,k1,n1,n0,k0]->[m1,m2,n2,n1,m0,n0] and data layout shown in Fig. 3, m0, k0 and n0 are consumed by the BFP16 VMAC.F operations. m1 and n1 are used for our 2×2 accumulator blocking. Dimension k1 is consumed by linear code, meaning that we explicitly write all instructions in code. This is analogous to unrolling optimizations in a loop-based optimization context. Since k0 and k1 are the only contraction dimensions, after performing all k1 updates, we have fully computed a 2×2 block of output tiles. Next, we advance the last two remaining dimensions, n2 and m2, through a fused hardware loop, where n2 is the faster dimension.
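The loop structure described in this paragraph can be written down as a plain reference implementation. The Python sketch below uses nested lists instead of the L1 layout and a scalar stand-in for VMAC.F; the helper names are hypothetical, and the order of the m1 and n1 loops is an illustrative choice.

```python
M0 = K0 = N0 = 8   # consumed by one VMAC.F
M1 = N1 = 2        # 2x2 accumulator blocking


def full(value, *shape):
    """Nested lists of the given shape, filled with value."""
    if len(shape) == 1:
        return [value] * shape[0]
    return [full(value, *shape[1:]) for _ in range(shape[0])]


def tile_mac(acc, a, b):
    """Stand-in for VMAC.F: acc[m0][n0] += sum over k0 of a[m0][k0] * b[n0][k0]."""
    for m0 in range(M0):
        for n0 in range(N0):
            acc[m0][n0] += sum(a[m0][k0] * b[n0][k0] for k0 in range(K0))


def contract(in0, in1, out):
    """in0[m2][k1][m1][m0][k0], in1[n2][k1][n1][n0][k0],
    out[m1][m2][n2][n1][m0][n0]."""
    m2_size, k1_size, n2_size = len(in0), len(in0[0]), len(in1)
    for block in range(m2_size * n2_size):   # fused hardware loop
        m2, n2 = divmod(block, n2_size)      # n2 is the faster dimension
        for k1 in range(k1_size):            # linear code in the kernel
            for n1 in range(N1):             # 2x2 block of output tiles
                for m1 in range(M1):
                    tile_mac(out[m1][m2][n2][n1],
                             in0[m2][k1][m1], in1[n2][k1][n1])


in0 = full(1, 2, 3, M1, M0, K0)
in1 = full(2, 2, 3, N1, N0, K0)
out = full(0, M1, 2, 2, N1, M0, N0)
contract(in0, in1, out)
print(out[0][0][0][0][0][0])  # 48
```

With all-ones and all-twos inputs, every output entry equals |k1|·|k0|·2 = 48 for |k1|=3.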

When advancing from one 2×2 block of output tiles i to the next (i+1), we must write i to the L1 scratchpad and load i+1. In general, the tensor contraction kernel writes the data of block i during the computation of block i+1 and reads the accumulator input for i+1 during the computation of i. This approach allows us to hide L1-register transfers behind VMAC.F operations.

We realize this scheme by using the remaining three accumulation registers DM2–DM4 for buffering the loads and stores. In detail, we issue the loads for the four output tiles of i+1 while computing i. This data is buffered in registers DM2–DM4. DM2 is used for buffering two output tiles. Furthermore, we must use 16 VST operations to store the four output tiles of block i. For this we also use registers DM2–DM4 as buffers and execute the 16 VST operations while computing i+1. In this case DM2 is also used for buffering two output tiles.

A single iteration j of the fused hardware loop mostly contains VMAC.F operations that compute a single block i of output tiles. However, at the beginning of the iteration, two operations finish the computation of block i-1, while at the end of the iteration, two operations already start computing block i+1. In summary, the schedule for the VMAC.F operations at the iteration boundaries and the accumulation register usage in iteration j is as follows:

  (a) the first two VMAC.F operations finish computing block i-1 and write to DM4 and DM2;

  (b) the following two VMAC.F operations begin updating block i and read from DM4 and DM2, where the corresponding VLDA and VLDB operations have been issued in iteration j-1;

  (c) the fourth last and third last VMAC.F operations update block i and write to registers DM2 and DM3 so that the corresponding stores can be issued in iteration j+1;

  (d) the last two VMAC.F operations begin updating block i+1 and read from DM2 and DM3.

A VMAC.F operation reads from the source accumulation register in its fourth cycle and writes to the destination in its sixth cycle. By scheduling the two VMAC.F operations in (a) at most two cycles before their respective counterpart in (b), the reads of the operations in (b) are executed before the writes of the operations in (a). The kernel follows an analogous approach for the usage of registers DM2 and DM3 by the VMAC.F operations in (c) and (d). In this case, the writes of the VMAC.F operations in (c) are performed after the reads of the operations in (d).
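The two-cycle bound can be verified with the read and write cycles given above. The following Python sketch is an illustration with a hypothetical helper; it assumes that a read in the same cycle as a write to the same register still observes the old value.

```python
READ_CYCLE = 4    # source accumulator is read in the fourth cycle
WRITE_CYCLE = 6   # destination accumulator is written in the sixth cycle


def read_precedes_write(writer_issue, reader_issue):
    """True if the reader's accumulator read is not later than the
    writer's accumulator write (same cycle counts as 'not later')."""
    read_cycle = reader_issue + READ_CYCLE - 1
    write_cycle = writer_issue + WRITE_CYCLE - 1
    return read_cycle <= write_cycle


for gap in range(1, 4):
    print(gap, read_precedes_write(writer_issue=0, reader_issue=gap))
```

With a gap of one or two cycles, the read in the later operation still comes first; with a gap of three cycles, the write would overtake it.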

Output Tile Prioritization

As discussed in the paragraph Single Hardware Loop, the tensor contraction kernel ends with a specialized cool-down phase. During this phase, the kernel computes the last 2×2 block of output tiles. Unlike in a “regular” iteration of the hardware loop, the final store operations cannot be hidden behind a following iteration. Thus, we prioritize the computation of one output tile so that its register-to-L1 transfers are at least partially hidden behind the remaining VMAC.F operations of the cool-down phase.

Implementation

This section discusses the warm-up phase, hardware loop, and cool-down phase of a representative XDNA2 BFP16 tensor contraction kernel. As described in sections L1 Data Layout and Design, the einsum [m2,k1,m1,m0,k0],[n2,k1,n1,n0,k0]->[m1,m2,n2,n1,m0,n0] with fixed sizes |m0|=|k0|=|n0|=8, |m1|=2, and |n1|=2 describes the tensor contraction kernel. We choose |m2|=4, |k1|=12, and |n2|=4 in our example implementation.

Warm-up Phase

Listing 10 Warm-up phase (lines 7-105) of the XDNA2 kernel.
  7// k=0
  8// 0                                                                                                                             //          block   B   k
  9  nopv                            ; vlda bmll0, [p2, #0  ]            ; vldb x8,  [p3, #0  ]              ; movs p4, p2          ; movxm r4, #-576 / 8 * 12 * 2     // m1       - jumpback in IN0 (m dimension)
 10  nopv                            ; vlda bmlh0, [p2, #64 ]            ; vldb x9,  [p3, #64 ]              ; movs p5, p3          ; movxm r5, #-576 / 8 * 12 * 2 * 4 // n1 * n2  - jumpback in IN1 (complete tensor)
 11  nopv                            ; vlda bmhl0, [p2, #128]            ; vldb x10, [p3, #128]              ; padds [p4], #256     ; movx r24, #0   ; mov  r25, #0
 12  nopv                            ; vlda bmhh0, [p2, #192]            ; vldb x11, [p3, #192]              ; padds [p5], #256     ; movx r0,  #780 ; mov  r1, #0
 13
 14// 4
 15  nopv                            ; vlda bmhl1, [p4, #128]            ; vldb x8,  [p5], #64               ; nops                 ; movxm ls, #.l_start
 16  nopv                            ; vlda bmhh1, [p4, #192]            ; vldb x9,  [p5], #64               ; nops                 ; movxm le, #.l_end
 17  nopv                            ; vlda bmll1, [p4], #64             ; vldb x10, [p5], #64               ; nops                 ; movxm lc, #14                    // 4(m2) * 4(n2) - 2
 18  nopv                            ; vlda bmlh1, [p4], #192            ; vldb x11, [p5], #64               ; nops                 ; nopxm
 19
 20// 8
 21  nopv                            ; vlda.fill.512 [p0, lf0, r24]      ; vldb.fill.512 [p1, lf1, r25]      ; nops                 ; nopx           ; vmov bmll2, x8
 22  nopv                            ; vlda.pop.576 ex0,  [p0, lf0, r24] ; vldb.pop.576 ex1,  [p1, lf1, r25] ; nops                 ; nopx           ; vmov bmlh2, x9
 23  nopv                            ; vlda.pop.576 ex2,  [p0, lf0, r24] ; vldb.pop.576 ex3,  [p1, lf1, r25] ; nops                 ; nopx           ; vmov bmhl2, x10
 24  nopv                            ; vlda.pop.576 ex4,  [p0, lf0, r24] ; vldb.pop.576 ex5,  [p1, lf1, r25] ; nops                 ; nopx           ; vmov bmhh2, x11
 25  nopv                            ; vlda.pop.576 ex6,  [p0, lf0, r24] ; vldb.pop.576 ex7,  [p1, lf1, r25] ; nops                 ; nopx           ; vmov bmll3, x8
 26  nopv                            ; vlda.pop.576 ex8,  [p0, lf0, r24] ; vldb.pop.576 ex9,  [p1, lf1, r25] ; nops                 ; nopx           ; vmov bmlh3, x9
 27  nopv                            ; vlda.pop.576 ex10, [p0, lf0, r24] ; vldb.pop.576 ex11, [p1, lf1, r25] ; nops                 ; nopx           ; vmov bmhl3, x10
 28  nopv                            ; vlda.pop.576 ex0,  [p0, lf0, r24] ; vldb.pop.576 ex1,  [p1, lf1, r25] ; nops                 ; nopx           ; vmov bmhh3, x11
 29  nopv                            ; vlda.pop.576 ex2,  [p0, lf0, r24] ; vldb.pop.576 ex3,  [p1, lf1, r25] ; nops                 ; nopxm
 30
 31// 17
 32// k=0
 33  vmac.f dm0, dm0, ex0,  ex1,  r0 ; nopa                              ; nopb                              ; nops                 ; nopxm
 34  vmac.f dm1, dm1, ex2,  ex1,  r0 ; vlda bmll2, [p4], #64             ; vldb x4, [p5], #64                ; nops                 ; nopxm
 35  vmac.f dm0, dm2, ex0,  ex3,  r0 ; vlda bmlh2, [p4], #64             ; vldb x5, [p5], #64                ; nops                 ; movx r3, #1    ; nopm
 36  vmac.f dm1, dm3, ex2,  ex3,  r0 ; vlda bmll2, [p4], #64             ; vldb x6, [p5], #64                ; nops                 ; movx r2, #4    ; nopm
 37
 38// k=8
 39  vmac.f dm0, dm0, ex4,  ex5,  r0 ; vlda bmlh2, [p4], #64             ; vldb x7, [p5], #64                ; nops                 ; nopxm
 40  vmac.f dm1, dm1, ex6,  ex5,  r0 ; nopa                              ; nopb                              ; nops                 ; nopxm
 41  vmac.f dm0, dm0, ex4,  ex7,  r0 ; nopa                              ; nopb                              ; nops                 ; nopxm
 42  vmac.f dm1, dm1, ex6,  ex7,  r0 ; vlda.fill.512 [p0, lf0, r24]      ; vldb.fill.512 [p1, lf1, r25]      ; nops                 ; nopxm
 43
 44// k=16
 45  vmac.f dm0, dm0, ex8,  ex9,  r0 ; vlda.pop.576 ex0,  [p0, lf0, r24] ; vldb.pop.576 ex1,  [p1, lf1, r25] ; nops                 ; nopx           ; vmov bmll4, x4
 46  vmac.f dm1, dm1, ex10, ex9,  r0 ; vlda.pop.576 ex2,  [p0, lf0, r24] ; vldb.pop.576 ex3,  [p1, lf1, r25] ; nops                 ; nopx           ; vmov bmlh4, x5
 47  vmac.f dm0, dm0, ex8,  ex11, r0 ; vlda.pop.576 ex4,  [p0, lf0, r24] ; vldb.pop.576 ex5,  [p1, lf1, r25] ; nops                 ; nopx           ; vmov bmhl4, x6
 48  vmac.f dm1, dm1, ex10, ex11, r0 ; vlda.pop.576 ex6,  [p0, lf0, r24] ; vldb.pop.576 ex7,  [p1, lf1, r25] ; nops                 ; nopx           ; vmov bmhh4, x7
 49
 50// k=24
 51  vmac.f dm0, dm0, ex0,  ex1,  r0 ; vlda.pop.576 ex8,  [p0, lf0, r24] ; vldb.pop.576 ex9,  [p1, lf1, r25] ; nops                 ; add r1, r1, r3 ; nopm
 52  vmac.f dm1, dm1, ex2,  ex1,  r0 ; vlda.pop.576 ex10, [p0, lf0, r24] ; vldb.pop.576 ex11, [p1, lf1, r25] ; nops                 ; ltu r6, r1, r2 ; nopm
 53  vmac.f dm0, dm0, ex0,  ex3,  r0 ; vlda.pop.576 ex0,  [p0, lf0, r24] ; vldb.pop.576 ex1,  [p1, lf1, r25] ; nops                 ; mul r1, r1, r6 ; nopm
 54  vmac.f dm1, dm1, ex2,  ex3,  r0 ; vlda.pop.576 ex2,  [p0, lf0, r24] ; vldb.pop.576 ex3,  [p1, lf1, r25] ; nops                 ; sub r7, r3, r6 ; nopm
 55
 56// k=32
 57  vmac.f dm0, dm0, ex0,  ex1,  r0 ; nopa                              ; nopb                              ; nops                 ; mul r6, r6, r4 ; nopm
 58  vmac.f dm1, dm1, ex2,  ex1,  r0 ; vlda bmll3, [p4], #64             ; nopb                              ; nops                 ; mul r7, r7, r5 ; nopm
 59  vmac.f dm0, dm0, ex0,  ex3,  r0 ; vlda bmlh3, [p4], #64             ; nopb                              ; nops                 ; nopx           ; mov m0, r6
 60  vmac.f dm1, dm1, ex2,  ex3,  r0 ; vlda bmhl3, [p4], #64             ; nopb                              ; nops                 ; nopx           ; mov m1, r7
 61
 62// k=40
 63  vmac.f dm0, dm0, ex4,  ex5,  r0 ; vlda bmhh3, [p4], #64             ; nopb                              ; nops                 ; nopxm
 64  vmac.f dm1, dm1, ex6,  ex5,  r0 ; nopa                              ; nopb                              ; nops                 ; nopxm
 65  vmac.f dm0, dm0, ex4,  ex7,  r0 ; nopa                              ; nopb                              ; nops                 ; nopxm
 66  vmac.f dm1, dm1, ex6,  ex7,  r0 ; vlda.fill.512 [p0, lf0, r24]      ; vldb.fill.512 [p1, lf1, r25]      ; nops                 ; nopxm
 67
 68// k=48
 69  vmac.f dm0, dm0, ex8,  ex9,  r0 ; vlda.pop.576 ex0,  [p0, lf0, r24] ; vldb.pop.576 ex1,  [p1, lf1, r25] ; nops                 ; nopxm
 70  vmac.f dm1, dm1, ex10, ex9,  r0 ; vlda.pop.576 ex2,  [p0, lf0, r24] ; vldb.pop.576 ex3,  [p1, lf1, r25] ; nops                 ; nopxm
 71  vmac.f dm0, dm0, ex8,  ex11, r0 ; vlda.pop.576 ex4,  [p0, lf0, r24] ; vldb.pop.576 ex5,  [p1, lf1, r25] ; nops                 ; nopxm
 72  vmac.f dm1, dm1, ex10, ex11, r0 ; vlda.pop.576 ex6,  [p0, lf0, r24] ; vldb.pop.576 ex7,  [p1, lf1, r25] ; nops                 ; nopxm
 73
 74// k=56
 75  vmac.f dm0, dm0, ex0,  ex1,  r0 ; vlda.pop.576 ex8,  [p0, lf0, r24] ; vldb.pop.576 ex9,  [p1, lf1, r25] ; nops                 ; nopxm
 76  vmac.f dm1, dm1, ex2,  ex1,  r0 ; vlda.pop.576 ex10, [p0, lf0, r24] ; vldb.pop.576 ex11, [p1, lf1, r25] ; nops                 ; nopxm
 77  vmac.f dm0, dm0, ex0,  ex3,  r0 ; vlda.pop.576 ex0,  [p0, lf0, r24] ; vldb.pop.576 ex1,  [p1, lf1, r25] ; nops                 ; nopxm
 78  vmac.f dm1, dm1, ex2,  ex3,  r0 ; vlda.pop.576 ex2,  [p0, lf0, r24] ; vldb.pop.576 ex3,  [p1, lf1, r25] ; nops                 ; nopxm
 79
 80// k=64
 81  vmac.f dm0, dm0, ex0,  ex1,  r0 ; nopa                              ; nopb                              ; nops                 ; nopxm
 82  vmac.f dm1, dm1, ex2,  ex1,  r0 ; nopa                              ; vldb x6, [p5], #64                ; nops                 ; nopxm
 83  vmac.f dm0, dm0, ex0,  ex3,  r0 ; nopa                              ; vldb x7, [p5], #64                ; nops                 ; nopxm
 84  vmac.f dm1, dm1, ex2,  ex3,  r0 ; nopa                              ; nopb                              ; nops                 ; nopxm
 85
 86// k=72
 87  vmac.f dm0, dm0, ex4,  ex5,  r0 ; padda [p0], m0                    ; paddb [p1], m1                    ; nops                 ; nopxm
 88  vmac.f dm1, dm1, ex6,  ex5,  r0 ; vlda.fill.512 [p0, lf0, r24]      ; vldb.fill.512 [p1, lf1, r25]      ; nops                 ; nopxm
 89  vmac.f dm0, dm0, ex4,  ex7,  r0 ; vlda.pop.576 ex8,  [p0, lf0, r24] ; vldb.pop.576 ex9,  [p1, lf1, r25] ; nops                 ; nopxm
 90  vmac.f dm1, dm1, ex6,  ex7,  r0 ; vlda.pop.576 ex10, [p0, lf0, r24] ; vldb.pop.576 ex11, [p1, lf1, r25] ; nops                 ; nopxm
 91
 92// k=80
 93  vmac.f dm0, dm0, ex8,  ex9,  r0 ; nopa                              ; nopb                              ; nops                 ; nopxm
 94  vmac.f dm1, dm1, ex10, ex9,  r0 ; nopa                              ; nopb                              ; nops                 ; nopxm
 95  vmac.f dm0, dm0, ex8,  ex11, r0 ; vlda.pop.576 ex4,  [p0, lf0, r24] ; vldb.pop.576 ex5,  [p1, lf1, r25] ; nops                 ; nopxm
 96  vmac.f dm1, dm1, ex10, ex11, r0 ; vlda.pop.576 ex6,  [p0, lf0, r24] ; vldb.pop.576 ex7,  [p1, lf1, r25] ; nops                 ; nopxm
 97
 98// k=88_0
 99  vmac.f dm2, dm0, ex0,  ex1,  r0 ; nopa                              ; nopb                              ; nops                 ; nopxm
100  vmac.f dm3, dm1, ex2,  ex1,  r0 ; nopa                              ; nopb                              ; nops                 ; nopxm
101// k=96_0
102
103// k=0_0
104  vmac.f dm0, dm2, ex8,  ex9,  r0 ; vlda bmhl2, [p5], #64             ; nopb                              ; nops                 ; nopxm
105  vmac.f dm1, dm3, ex10, ex9,  r0 ; vlda bmhh2, [p5], #64             ; nopb                              ; nops                 ; nopxm

Listing 10 shows the kernel’s warm-up phase.

During the first eight cycles, the load operations for the first 2×2 block are issued. Load unit A can write directly into the accumulation registers. Load unit B can only write into scalar or vector registers. Thus, we first load to vector registers using load unit B and then copy the data to accumulation registers. We use accumulation registers DM0–DM3 for these “loads”. The VMAC.F operations in lines 39–96 exclusively use DM0 and DM1 as their destination registers. Thus, registers DM2–DM4 can be used to load the next 2×2 block of output tiles, required by the upcoming hardware loop. We discuss this procedure in more detail as part of the next subsection.

Each instruction in lines 7–29 contains a NOPV operation, hence the vector unit is not used. All other instructions perform VMAC.F operations. In summary, 48 of the 65 instructions contain VMAC.F operations.
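Listing 10 also contains scalar operations (lines 35–36 and 51–60, together with the PADDA and PADDB operations in line 87) that are not discussed above. One reading of these operations, offered as an interpretation of the listing rather than a documented fact, is a branch-free update of the input pointers: in0 jumps back by one block while n2 advances, and in1 jumps back over the whole tensor when n2 wraps around. The Python sketch below models this reading. It assumes that the streaming loads advance each pointer by exactly one k1×2 block of tiles per 2×2 output block; register names follow the listing, and `block_offsets` is a hypothetical helper.

```python
def block_offsets(m2_size=4, n2_size=4, k1_size=12, tile_bytes=576 // 8):
    """Byte offsets (in0, in1) at which each 2x2 output block starts reading."""
    block_bytes = tile_bytes * k1_size * 2   # one k1 x 2 block of input tiles
    r4 = -block_bytes                        # jump back in in0 (m dimension)
    r5 = -block_bytes * n2_size              # jump back over all of in1
    r3, r2 = 1, n2_size                      # increment and wrap-around limit
    r1 = 0                                   # n2 counter
    p0 = p1 = 0
    offsets = []
    for _ in range(m2_size * n2_size):
        offsets.append((p0, p1))
        p0 += block_bytes    # assumption: streaming loads consume one block
        p1 += block_bytes
        r1 = r1 + r3         # add r1, r1, r3
        r6 = int(r1 < r2)    # ltu r6, r1, r2
        r1 = r1 * r6         # mul r1, r1, r6  (reset counter on wrap)
        r7 = r3 - r6         # sub r7, r3, r6
        p0 += r6 * r4        # mul r6, r6, r4 ; mov m0, r6 ; padda [p0], m0
        p1 += r7 * r5        # mul r7, r7, r5 ; mov m1, r7 ; paddb [p1], m1
    return offsets


print(block_offsets()[:5])
# [(0, 0), (0, 1728), (0, 3456), (0, 5184), (1728, 0)]
```

Under these assumptions, block (m2, n2) starts at offset m2·1728 in in0 and n2·1728 in in1, which is consistent with n2 being the faster dimension of the fused loop.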

Hardware Loop

Listing 11 Hardware loop setup (lines 15-17: chars 132+) of the XDNA2 kernel.
15  movxm ls, #.l_start
16  movxm le, #.l_end
17  movxm lc, #14                    // 4(m2) * 4(n2) - 2

The hardware loop setup follows the procedure outlined in the XDNA1 kernel. The respective configuration operations are shown in Listing 11.
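The loop count in Listing 11 follows from the chosen sizes, since the first and last 2×2 blocks are computed outside of the loop. A minimal sketch of this arithmetic (the function name is hypothetical):

```python
def loop_parameters(m2, n2, k1, m1=2, n1=2):
    """Return (hardware loop count, VMAC.F operations per 2x2 block)."""
    blocks = m2 * n2                  # number of 2x2 blocks of output tiles
    loop_count = blocks - 2           # warm-up and cool-down each handle one block
    vmacs_per_block = k1 * m1 * n1    # one VMAC.F per (k1, m1, n1) combination
    return loop_count, vmacs_per_block


print(loop_parameters(m2=4, n2=4, k1=12))  # (14, 48)
```

For the example sizes, this yields the loop count of 14 and 48 VMAC.F operations per block.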

Listing 12 Body of the loop (lines 107-187) in the XDNA2 kernel.
107.p2align 4
108.l_start:
109// k=88_1
110  vmac.f dm4, dm0, ex0,  ex3,  r0 ; vlda.pop.576 ex0,  [p0, lf0, r24] ; vldb.pop.576 ex1,  [p1, lf1, r25] ; nops                 ; nopxm
111  vmac.f dm2, dm1, ex2,  ex3,  r0 ; vlda.pop.576 ex2,  [p0, lf0, r24] ; vldb.pop.576 ex3,  [p1, lf1, r25] ; nops                 ; nopx           ; vmov bmll2, x6
112// k=96_1
113
114// k=0_1
115  vmac.f dm0, dm4, ex8,  ex11, r0 ; vlda.pop.576 ex8,  [p0, lf0, r24] ; vldb.pop.576 ex9,  [p1, lf1, r25] ; vst bmll2, [p2], #64 ; nopx           ; vmov bmlh2, x7
116  vmac.f dm1, dm2, ex10, ex11, r0 ; vlda.pop.576 ex10, [p0, lf0, r24] ; vldb.pop.576 ex11, [p1, lf1, r25] ; vst bmlh2, [p2], #64 ; nopxm
117
118// k=8
119  vmac.f dm0, dm0, ex4,  ex5,  r0 ; nopa                              ; nopb                              ; vst bmhl2, [p2], #64 ; add r1, r1, r3 ; nopm
120  vmac.f dm1, dm1, ex6,  ex5,  r0 ; nopa                              ; nopb                              ; vst bmhh2, [p2], #64 ; ltu r6, r1, r2 ; nopm
121  vmac.f dm0, dm0, ex4,  ex7,  r0 ; nopa                              ; nopb                              ; nops                 ; mul r1, r1, r6 ; nopm
122  vmac.f dm1, dm1, ex6,  ex7,  r0 ; vlda.fill.512 [p0, lf0, r24]      ; vldb.fill.512 [p1, lf1, r25]      ; nops                 ; sub r7, r3, r6 ; nopm
123
124// k=16
125  vmac.f dm0, dm0, ex0,  ex1,  r0 ; vlda.pop.576 ex0,  [p0, lf0, r24] ; vldb.pop.576 ex1,  [p1, lf1, r25] ; vst bmll2, [p3], #64 ; mul r6, r6, r4 ; nopm
126  vmac.f dm1, dm1, ex2,  ex1,  r0 ; vlda.pop.576 ex2,  [p0, lf0, r24] ; vldb.pop.576 ex3,  [p1, lf1, r25] ; vst bmlh2, [p3], #64 ; mul r7, r7, r5 ; nopm
127  vmac.f dm0, dm0, ex0,  ex3,  r0 ; vlda.pop.576 ex4,  [p0, lf0, r24] ; vldb.pop.576 ex5,  [p1, lf1, r25] ; vst bmhl2, [p3], #64 ; nopx           ; mov m0, r6
128  vmac.f dm1, dm1, ex2,  ex3,  r0 ; vlda.pop.576 ex6,  [p0, lf0, r24] ; vldb.pop.576 ex7,  [p1, lf1, r25] ; vst bmhh2, [p3], #64 ; nopx           ; mov m1, r7
129
130// k=24
131  vmac.f dm0, dm0, ex8,  ex9,  r0 ; vlda.pop.576 ex8,  [p0, lf0, r24] ; vldb.pop.576 ex9,  [p1, lf1, r25] ; vst bmll4, [p3], #64 ; nopxm
132  vmac.f dm1, dm1, ex10, ex9,  r0 ; vlda.pop.576 ex10, [p0, lf0, r24] ; vldb.pop.576 ex11, [p1, lf1, r25] ; vst bmlh4, [p3], #64 ; nopxm
133  vmac.f dm0, dm0, ex8,  ex11, r0 ; vlda.pop.576 ex0,  [p0, lf0, r24] ; vldb.pop.576 ex1,  [p1, lf1, r25] ; vst bmhl4, [p3], #64 ; nopxm
134  vmac.f dm1, dm1, ex10, ex11, r0 ; vlda.pop.576 ex2,  [p0, lf0, r24] ; vldb.pop.576 ex3,  [p1, lf1, r25] ; vst bmhh4, [p3], #64 ; nopxm
135
136// k=32
137  vmac.f dm0, dm0, ex0,  ex1,  r0 ; nopa                              ; nopb                              ; vst bmll3, [p2], #64 ; nopxm
138  vmac.f dm1, dm1, ex2,  ex1,  r0 ; nopa                              ; nopb                              ; vst bmlh3, [p2], #64 ; nopxm
139  vmac.f dm0, dm0, ex0,  ex3,  r0 ; vlda bmll2, [p4], #64             ; vldb x4, [p5], #64                ; nops                 ; nopxm
140  vmac.f dm1, dm1, ex2,  ex3,  r0 ; vlda bmlh2, [p4], #64             ; vldb x5, [p5], #64                ; nops                 ; nopxm
141
142// k=40
143  vmac.f dm0, dm0, ex4,  ex5,  r0 ; vlda bmhl2, [p4], #64             ; vldb x6, [p5], #64                ; nops                 ; nopxm
144  vmac.f dm1, dm1, ex6,  ex5,  r0 ; vlda bmhh2, [p4], #64             ; vldb x7, [p5], #64                ; nops                 ; nopxm
145  vmac.f dm0, dm0, ex4,  ex7,  r0 ; vlda bmll3, [p4], #64             ; nopb                              ; nops                 ; nopxm
146  vmac.f dm1, dm1, ex6,  ex7,  r0 ; vlda.fill.512 [p0, lf0, r24]      ; vldb.fill.512 [p1, lf1, r25]      ; nops                 ; nopxm
147
148// k=48
149  vmac.f dm0, dm0, ex8,  ex9,  r0 ; vlda.pop.576 ex0,  [p0, lf0, r24] ; vldb.pop.576 ex1,  [p1, lf1, r25] ; vst bmhl3, [p2], #64 ; nopxm
150  vmac.f dm1, dm1, ex10, ex9,  r0 ; vlda.pop.576 ex2,  [p0, lf0, r24] ; vldb.pop.576 ex3,  [p1, lf1, r25] ; vst bmhh3, [p2], #64 ; nopxm
151  vmac.f dm0, dm0, ex8,  ex11, r0 ; vlda.pop.576 ex4,  [p0, lf0, r24] ; vldb.pop.576 ex5,  [p1, lf1, r25] ; nops                 ; nopx           ; vmov bmll4, x4
152  vmac.f dm1, dm1, ex10, ex11, r0 ; vlda.pop.576 ex6,  [p0, lf0, r24] ; vldb.pop.576 ex7,  [p1, lf1, r25] ; nops                 ; nopx           ; vmov bmlh4, x5
153
154// k=56
155  vmac.f dm0, dm0, ex0,  ex1,  r0 ; vlda.pop.576 ex8,  [p0, lf0, r24] ; vldb.pop.576 ex9,  [p1, lf1, r25] ; nops                 ; nopx           ; vmov bmhl4, x6
156  vmac.f dm1, dm1, ex2,  ex1,  r0 ; vlda.pop.576 ex10, [p0, lf0, r24] ; vldb.pop.576 ex11, [p1, lf1, r25] ; nops                 ; nopx           ; vmov bmhh4, x7
157  vmac.f dm0, dm0, ex0,  ex3,  r0 ; vlda.pop.576 ex0,  [p0, lf0, r24] ; vldb.pop.576 ex1,  [p1, lf1, r25] ; nops                 ; nopxm
158  vmac.f dm1, dm1, ex2,  ex3,  r0 ; vlda.pop.576 ex2,  [p0, lf0, r24] ; vldb.pop.576 ex3,  [p1, lf1, r25] ; nops                 ; nopxm
159
160// k=64
161  vmac.f dm0, dm0, ex0,  ex1,  r0 ; nopa                              ; nopb                              ; nops                 ; nopxm
162  vmac.f dm1, dm1, ex2,  ex1,  r0 ; vlda bmlh3, [p4], #64             ; vldb x6, [p5], #64                ; nops                 ; nopxm
163  vmac.f dm0, dm0, ex0,  ex3,  r0 ; vlda bmhl3, [p4], #64             ; vldb x7, [p5], #64                ; nops                 ; nopxm
164  vmac.f dm1, dm1, ex2,  ex3,  r0 ; vlda bmhh3, [p4], #64             ; nopb                              ; nops                 ; nopxm
165
166// k=72
167  vmac.f dm0, dm0, ex4,  ex5,  r0 ; padda [p0], m0                    ; paddb [p1], m1                    ; nops                 ; nopxm
168  vmac.f dm1, dm1, ex6,  ex5,  r0 ; vlda.fill.512 [p0, lf0, r24]      ; vldb.fill.512 [p1, lf1, r25]      ; nops                 ; nopxm
169  vmac.f dm0, dm0, ex4,  ex7,  r0 ; vlda.pop.576 ex8,  [p0, lf0, r24] ; vldb.pop.576 ex9,  [p1, lf1, r25] ; nops                 ; nopxm
170  vmac.f dm1, dm1, ex6,  ex7,  r0 ; vlda.pop.576 ex10, [p0, lf0, r24] ; vldb.pop.576 ex11, [p1, lf1, r25] ; nops                 ; nopxm
171
172// k=80
173  vmac.f dm0, dm0, ex8,  ex9,  r0 ; nopa                              ; nopb                              ; nops                 ; nopxm
174  vmac.f dm1, dm1, ex10, ex9,  r0 ; nopa                              ; nopb                              ; nops                 ; nopxm
175  vmac.f dm0, dm0, ex8,  ex11, r0 ; vlda.pop.576 ex4,  [p0, lf0, r24] ; vldb.pop.576 ex5,  [p1, lf1, r25] ; nops                 ; nopxm
176  vmac.f dm1, dm1, ex10, ex11, r0 ; vlda.pop.576 ex6,  [p0, lf0, r24] ; vldb.pop.576 ex7,  [p1, lf1, r25] ; nops                 ; nopxm
177
178// k=88_0
179  vmac.f dm2, dm0, ex0,  ex1,  r0 ; nopa                              ; nopb                              ; nops                 ; nopxm
180  vmac.f dm3, dm1, ex2,  ex1,  r0 ; nopa                              ; nopb                              ; nops                 ; nopxm
181// k=96_0
182
183// k=0_0
184  vmac.f dm0, dm2, ex8,  ex9,  r0 ; vlda bmhl2, [p5], #64             ; nopb                              ; nops                 ; nopxm
185.p2align 4
186.l_end:
187  vmac.f dm1, dm3, ex10, ex9,  r0 ; vlda bmhh2, [p5], #64             ; nopb                              ; nops                 ; nopxm

Listing 12 shows the body of the hardware loop in the XDNA2 tensor contraction kernel. Lines 110 and 111 contain the two VMAC.F operations discussed as (a) in the accumulation register buffering part of the kernel design. (b) is given by lines 115 and 116, (c) by lines 179 and 180, and (d) by lines 184 and 187.

Assuming that the hardware loop currently performs iteration j, the VLDA and VLDB operations in lines 139–145 and lines 162–164 load three output tiles of the next 2×2 block. The VLDA operations directly load into DM2 and DM3, while the VLDB operations first load into vector registers. The data in the vector registers is then copied to DM4.

The two VLDB operations in lines 162 and 163 load the first 128 bytes of the fourth 256-byte output tile to vector registers X6 and X7. This data is ultimately used by the VMAC.F operation in line 116 when performing iteration j+1. The corresponding copies from X6 and X7 to the lower half of accumulation register DM2 are done by the two VMOV operations in lines 111 and 115. The remaining upper 128 bytes of the fourth tile (DM2) are loaded by the two VLDA operations in lines 184 and 187. Note that the L1-to-DM2 transfers of the fourth tile are scheduled to fulfill two conditions:

  • First, they are scheduled so that the first output tile in DM2 of the current 2×2 block has been written to L1. The last update to the first tile is performed in the VMAC.F operation in line 179. This tile is stored in lines 115–120 of the next iteration.

  • Second, they are scheduled so that the fourth output tile of the current 2×2 block is stored in DM2 before the VMAC.F operation in line 116 reads it.

In summary, each of the 48 instructions in the loop body contains a VMAC.F operation. Thus, the vector unit is fully utilized.
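The byte arithmetic behind the output tile transfers can be checked with a short Python sketch. It assumes, as the listings suggest, that every `bm*` access with a `#64` post-increment moves 64 bytes and that the quarters are laid out in the order `bmll`, `bmlh`, `bmhl`, `bmhh`; the function names are hypothetical and only serve the illustration.

```python
# Sizes taken from the text: 8x8 FP32 output tiles, 64-byte transfers (#64).
TILE_ROWS, TILE_COLS = 8, 8
FP32_BYTES = 4
CHUNK_BYTES = 64

# Quarter names in the order the listings store them
# (assumption: this order corresponds to ascending byte offsets).
QUARTERS = ("bmll", "bmlh", "bmhl", "bmhh")


def tile_bytes() -> int:
    """Size of one FP32 output tile in bytes."""
    return TILE_ROWS * TILE_COLS * FP32_BYTES


def quarter_offsets() -> dict[str, int]:
    """Byte offset of each 64-byte quarter within one output tile."""
    assert tile_bytes() == CHUNK_BYTES * len(QUARTERS)
    return {name: i * CHUNK_BYTES for i, name in enumerate(QUARTERS)}


def transfers_per_block(m1: int = 2, n1: int = 2) -> int:
    """Number of 64-byte loads (or stores) for one m1 x n1 block of tiles."""
    return m1 * n1 * len(QUARTERS)


print(tile_bytes())           # 256
print(quarter_offsets())
print(transfers_per_block())  # 16
```

Under these assumptions, the lower half of a tile (`bmll`, `bmlh`) covers the first 128 bytes, which is the part routed through X6 and X7, and the upper half (`bmhl`, `bmhh`) covers the remaining 128 bytes.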

Cool-down Phase

Listing 13 Cool-down phase (lines 189–279) of the XDNA2 kernel.
189// k=88_1
190  vmac.f dm4, dm0, ex0,  ex3,  r0 ; vlda.pop.576 ex0,  [p0, lf0, r24] ; vldb.pop.576 ex1,  [p1, lf1, r25] ; nops                 ; nopxm
191  vmac.f dm2, dm1, ex2,  ex3,  r0 ; vlda.pop.576 ex2,  [p0, lf0, r24] ; vldb.pop.576 ex3,  [p1, lf1, r25] ; nops                 ; nopx           ; vmov bmll2, x6
192// k=96_1
193
194// k=0_1
195  vmac.f dm0, dm4, ex8,  ex11, r0 ; nopa                              ; nopb                              ; vst bmll2, [p2], #64 ; nopx           ; vmov bmlh2, x7
196  vmac.f dm1, dm2, ex10, ex11, r0 ; nopa                              ; nopb                              ; vst bmlh2, [p2], #64 ; nopxm
197
198// k=8
199  vmac.f dm0, dm0, ex4,  ex5,  r0 ; vlda.pop.576 ex0,  [p0, lf0, r24] ; vldb.pop.576 ex1,  [p1, lf1, r25] ; vst bmhl2, [p2], #64 ; nopxm
200  vmac.f dm1, dm1, ex6,  ex5,  r0 ; vlda.pop.576 ex2,  [p0, lf0, r24] ; vldb.pop.576 ex3,  [p1, lf1, r25] ; vst bmhh2, [p2], #64 ; nopxm
201  vmac.f dm0, dm0, ex4,  ex7,  r0 ; nopa                              ; nopb                              ; nops                 ; nopxm
202  vmac.f dm1, dm1, ex6,  ex7,  r0 ; vlda.fill.512 [p0, lf0, r24]      ; vldb.fill.512 [p1, lf1, r25]      ; nops                 ; nopxm
203
204// k=16
205  vmac.f dm0, dm0, ex0,  ex1,  r0 ; vlda.pop.576 ex0,  [p0, lf0, r24] ; vldb.pop.576 ex1,  [p1, lf1, r25] ; vst bmll2, [p3], #64 ; nopxm
206  vmac.f dm1, dm1, ex2,  ex1,  r0 ; vlda.pop.576 ex2,  [p0, lf0, r24] ; vldb.pop.576 ex3,  [p1, lf1, r25] ; vst bmlh2, [p3], #64 ; nopxm
207  vmac.f dm0, dm0, ex0,  ex3,  r0 ; nopa                              ; nopb                              ; vst bmhl2, [p3], #64 ; nopxm
208  vmac.f dm1, dm1, ex2,  ex3,  r0 ; nopa                              ; nopb                              ; vst bmhh2, [p3], #64 ; nopxm
209
210// k=24
211  vmac.f dm0, dm0, ex0,  ex1,  r0 ; vlda.pop.576 ex0,  [p0, lf0, r24] ; vldb.pop.576 ex1,  [p1, lf1, r25] ; vst bmll4, [p3], #64 ; nopxm
212  vmac.f dm1, dm1, ex2,  ex1,  r0 ; vlda.pop.576 ex2,  [p0, lf0, r24] ; vldb.pop.576 ex3,  [p1, lf1, r25] ; vst bmlh4, [p3], #64 ; nopxm
213  vmac.f dm0, dm0, ex0,  ex3,  r0 ; nopa                              ; nopb                              ; vst bmhl4, [p3], #64 ; nopxm
214  vmac.f dm1, dm1, ex2,  ex3,  r0 ; nopa                              ; nopb                              ; vst bmhh4, [p3], #64 ; nopxm
215
216// k=32
217  vmac.f dm0, dm0, ex0,  ex1,  r0 ; vlda.pop.576 ex0,  [p0, lf0, r24] ; vldb.pop.576 ex1,  [p1, lf1, r25] ; vst bmll3, [p2], #64 ; nopxm
218  vmac.f dm1, dm1, ex2,  ex1,  r0 ; vlda.pop.576 ex2,  [p0, lf0, r24] ; vldb.pop.576 ex3,  [p1, lf1, r25] ; vst bmlh3, [p2], #64 ; nopxm
219  vmac.f dm0, dm0, ex0,  ex3,  r0 ; nopa                              ; nopb                              ; vst bmhl3, [p2], #64 ; nopxm
220  vmac.f dm1, dm1, ex2,  ex3,  r0 ; nopa                              ; nopb                              ; vst bmhh3, [p2], #64 ; nopxm
221
222// k=40
223  vmac.f dm0, dm0, ex0,  ex1,  r0 ; vlda.pop.576 ex4,  [p0, lf0, r24] ; vldb.pop.576 ex5,  [p1, lf1, r25] ; nops                 ; nopxm
224  vmac.f dm1, dm1, ex2,  ex1,  r0 ; vlda.pop.576 ex6,  [p0, lf0, r24] ; vldb.pop.576 ex7,  [p1, lf1, r25] ; nops                 ; nopxm
225  vmac.f dm0, dm0, ex0,  ex3,  r0 ; vlda.fill.512 [p0, lf0, r24]      ; vldb.fill.512 [p1, lf1, r25]      ; nops                 ; nopxm
226  vmac.f dm1, dm1, ex2,  ex3,  r0 ; vlda.pop.576 ex8,  [p0, lf0, r24] ; vldb.pop.576 ex9,  [p1, lf1, r25] ; nops                 ; nopxm
227
228// k=48
229  vmac.f dm0, dm0, ex0,  ex1,  r0 ; vlda.pop.576 ex10, [p0, lf0, r24] ; vldb.pop.576 ex11, [p1, lf1, r25] ; nops                 ; nopxm
230  vmac.f dm1, dm1, ex2,  ex1,  r0 ; nopa                              ; nopb                              ; nops                 ; nopxm
231  vmac.f dm2, dm0, ex0,  ex3,  r0 ; vlda.pop.576 ex0,  [p0, lf0, r24] ; vldb.pop.576 ex1,  [p1, lf1, r25] ; nops                 ; nopxm
232  vmac.f dm3, dm1, ex2,  ex3,  r0 ; vlda.pop.576 ex2,  [p0, lf0, r24] ; vldb.pop.576 ex3,  [p1, lf1, r25] ; nops                 ; nopxm
233
234// k=56
235  vmac.f dm0, dm0, ex4,  ex5,  r0 ; vlda.pop.576 ex4,  [p0, lf0, r24] ; vldb.pop.576 ex5,  [p1, lf1, r25] ; nops                 ; nopxm
236  vmac.f dm1, dm1, ex4,  ex7,  r0 ; vlda.pop.576 ex6,  [p0, lf0, r24] ; vldb.pop.576 ex7,  [p1, lf1, r25] ; nops                 ; nopxm
237  vmac.f dm2, dm2, ex6,  ex5,  r0 ; nopa                              ; nopb                              ; nops                 ; nopxm
238  vmac.f dm0, dm0, ex8,  ex9,  r0 ; nopa                              ; nopb                              ; nops                 ; nopxm
239
240//
241  vmac.f dm3, dm3, ex6,  ex7,  r0 ; vlda.pop.576 ex8,  [p0, lf0, r24] ; vldb.pop.576 ex9,  [p1, lf1, r25] ; nops                 ; nopxm
242  vmac.f dm1, dm1, ex8,  ex11, r0 ; vlda.pop.576 ex10, [p0, lf0, r24] ; vldb.pop.576 ex11, [p1, lf1, r25] ; nops                 ; nopxm
243  vmac.f dm0, dm0, ex0,  ex1,  r0 ; nopa                              ; nopb                              ; nops                 ; nopxm
244  vmac.f dm2, dm2, ex10, ex9,  r0 ; nopa                              ; nopb                              ; nops                 ; nopxm
245
246//
247  vmac.f dm3, dm3, ex10, ex11, r0 ; nopa                              ; nopb                              ; nops                 ; nopxm
248  vmac.f dm0, dm0, ex4,  ex5,  r0 ; nopa                              ; nopb                              ; nops                 ; nopxm
249  vmac.f dm1, dm1, ex0,  ex3,  r0 ; nopa                              ; nopb                              ; nops                 ; nopxm
250  vmac.f dm2, dm2, ex2,  ex1,  r0 ; nopa                              ; nopb                              ; nops                 ; nopxm
251
252//
253  vmac.f dm0, dm0, ex8,  ex9,  r0 ; nopa                              ; nopb                              ; nops                 ; nopxm
254  vmac.f dm1, dm1, ex4,  ex7,  r0 ; nopa                              ; nopb                              ; nops                 ; nopxm
255  vmac.f dm3, dm3, ex2,  ex3,  r0 ; nopa                              ; nopb                              ; nops                 ; nopxm
256  vmac.f dm2, dm2, ex6,  ex5,  r0 ; nopa                              ; nopb                              ; nops                 ; nopxm
257
258//
259  vmac.f dm1, dm1, ex8,  ex11, r0 ; nopa                              ; nopb                              ; nops                 ; nopxm
260  vmac.f dm3, dm3, ex6,  ex7,  r0 ; nopa                              ; nopb                              ; nops                 ; nopxm
261  vmac.f dm2, dm2, ex10, ex9,  r0 ; nopa                              ; nopb                              ; vst bmll0, [p2], #64 ; nopxm
262  nopv                            ; nopa                              ; nopb                              ; vst bmlh0, [p2], #64 ; nopxm
263  vmac.f dm3, dm3, ex10, ex11, r0 ; nopa                              ; nopb                              ; vst bmhl0, [p2], #64 ; nopxm
264// k=96
265
266  nopv                            ; nopa                              ; nopb                              ; vst bmhh0, [p2], #64 ; nopxm
267  nopv                            ; nopa                              ; nopb                              ; vst bmll1, [p2], #64 ; nopxm
268  nopv                            ; nopa                              ; nopb                              ; vst bmlh1, [p2], #64 ; nopxm
269  nopv                            ; nopa                              ; nopb                              ; vst bmhl1, [p2], #64 ; nopxm
270  nopv                            ; nopa                              ; nopb                              ; vst bmhh1, [p2], #64 ; nopxm
271  nopv                            ; nopa                              ; nopb                              ; vst bmll2, [p3], #64 ; nopxm
272  nopv                            ; nopa                              ; nopb                              ; vst bmlh2, [p3], #64 ; nopxm
273  nopv                            ; nopa                              ; nopb                              ; vst bmhl2, [p3], #64 ; nopxm
274  nopv                            ; nopa                              ; nopb                              ; vst bmhh2, [p3], #64 ; ret lr
275  nopv                            ; nopa                              ; nopb                              ; vst bmll3, [p3], #64 ; nopxm      // Delay Slot 5
276  nopv                            ; nopa                              ; nopb                              ; vst bmlh3, [p3], #64 ; nopxm      // Delay Slot 4
277  nopv                            ; nopa                              ; nopb                              ; vst bmhl3, [p3], #64 ; nopxm      // Delay Slot 3
278  nopv                            ; nopa                              ; nopb                              ; vst bmhh3, [p3], #64 ; nopxm      // Delay Slot 2
279  nopv                            ; nopa                              ; nopb                              ; nops                 ; nopxm      // Delay Slot 1

Listing 13 shows the kernel’s cool-down phase. To realize the discussed output tile prioritization, each VMAC.F chain after line 228 uses a different accumulation register. The stores of the last 2×2 block are performed in lines 261–278.

In summary, 48 of the 63 instructions in the cool-down phase contain VMAC.F operations.
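Such instruction counts can be reproduced mechanically. The following Python sketch counts how many VLIW instructions of a listing carry a VMAC.F operation. The function name `count_vmac_bundles` is hypothetical, and the excerpt corresponds to lines 259–266 of Listing 13 with the line numbers removed and the idle load slots shortened.

```python
def count_vmac_bundles(listing: str) -> tuple[int, int]:
    """Return (instructions containing vmac.f, total instructions)."""
    with_vmac = 0
    total = 0
    for raw in listing.splitlines():
        # Drop comments, then skip blank lines, directives, and labels.
        line = raw.split("//", 1)[0].strip()
        if not line or line.startswith(".") or line.endswith(":"):
            continue
        total += 1
        # Slots of one VLIW instruction are separated by semicolons.
        slots = [slot.strip() for slot in line.split(";")]
        if any(slot.startswith("vmac.f") for slot in slots):
            with_vmac += 1
    return with_vmac, total


EXCERPT = """
  vmac.f dm1, dm1, ex8,  ex11, r0 ; nopa ; nopb ; nops                 ; nopxm
  vmac.f dm3, dm3, ex6,  ex7,  r0 ; nopa ; nopb ; nops                 ; nopxm
  vmac.f dm2, dm2, ex10, ex9,  r0 ; nopa ; nopb ; vst bmll0, [p2], #64 ; nopxm
  nopv                            ; nopa ; nopb ; vst bmlh0, [p2], #64 ; nopxm
  vmac.f dm3, dm3, ex10, ex11, r0 ; nopa ; nopb ; vst bmhl0, [p2], #64 ; nopxm
// k=96
  nopv                            ; nopa ; nopb ; vst bmhh0, [p2], #64 ; nopxm
"""

print(count_vmac_bundles(EXCERPT))  # (4, 6)
```

Applied to the complete cool-down listing with the line numbers removed, the same count should reproduce the 48 of 63 instructions stated above.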

Kernel Efficiency

The vector unit utilization of the three parts is as follows:

  • Warm-up phase: 48 out of 65 instructions contain VMAC.F operations.

  • Hardware loop: All 48 instructions in the loop body contain VMAC.F operations. The loop executes 14 times, yielding a total of 672 instructions with a VMAC.F operation.

  • Cool-down phase: 48 out of 63 instructions contain VMAC.F operations.

In summary, the kernel executes 65 + 14 × 48 + 63 = 800 instructions, 48 + 672 + 48 = 768 of which contain VMAC.F operations. This results in a theoretical utilization of 96%. In other words, a compute tile running at 1.8 GHz would execute 1.73×10⁹ BFP16 8×8×8 operations per second. Counting each of the 8 × 8 × 8 = 512 multiply-accumulates of such an operation as two floating-point operations, this equates to a theoretical throughput of 1769 BFP16 GFLOPS.
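The utilization and throughput figures can be recomputed with a few lines of Python. This is a minimal sketch of the arithmetic above; the constant and function names are hypothetical, and the factor of two floating-point operations per multiply-accumulate is the counting convention implied by the stated numbers.

```python
# (instructions with VMAC.F, total instructions) per kernel part, from the text.
WARMUP = (48, 65)
LOOP_BODY = (48, 48)
LOOP_ITERATIONS = 14
COOLDOWN = (48, 63)

CLOCK_HZ = 1.8e9                 # assumed compute tile clock of 1.8 GHz
FLOPS_PER_VMAC = 2 * 8 * 8 * 8   # one multiply and one add per m0 x n0 x k0 element


def instruction_counts() -> tuple[int, int]:
    """Return (VMAC.F instructions, total instructions) for the whole kernel."""
    vmac = WARMUP[0] + LOOP_ITERATIONS * LOOP_BODY[0] + COOLDOWN[0]
    total = WARMUP[1] + LOOP_ITERATIONS * LOOP_BODY[1] + COOLDOWN[1]
    return vmac, total


def theoretical_gflops() -> float:
    """Theoretical BFP16 throughput of one compute tile in GFLOPS."""
    vmac, total = instruction_counts()
    vmacs_per_second = vmac / total * CLOCK_HZ
    return vmacs_per_second * FLOPS_PER_VMAC / 1e9


vmac, total = instruction_counts()
print(vmac, total)                       # 768 800
print(f"{vmac / total:.0%}")             # 96%
print(f"{theoretical_gflops():.0f}")     # 1769
```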

Analogous to the XDNA1 tensor contraction kernel, we have written a benchmark that calls the tensor contraction kernel repeatedly in a loop on the NPU. We benchmarked the kernel on an XDNA2 NPU (AMD Ryzen AI Max PRO 390) and achieved a throughput of 1760 BFP16 GFLOPS, which is about 99.5% of the theoretical value. The benchmarking code is available from our xdna repository. To run the benchmark, execute the following commands:

git clone https://github.com/scalable-analyses/xdna
cd xdna
make run

Note

The installation of the MLIR-AIE compiler aiecc and Peano is documented in the mlir-aie repository. The Makefile assumes that the environment variable PEANO_INSTALL_DIR contains the path to Peano and that aiecc.py is available in the path. Use xrt-smi configure --pmode turbo to set the NPU clock to its maximum frequency.