XDNA1 Kernel

Tiled tensor contractions use kernels that operate on subtensors (tiles) as their core building block. Accelerating these kernels is crucial to the overall performance of tensor workloads. As discussed in the chapter Instruction Set Architecture, the XDNA cores’ floating-point throughput is driven by VMAC.F. On XDNA1, the highest-throughput VMAC.F operation performs a BF16 4×8×4 matrix multiplication. On XDNA2, the highest-throughput VMAC.F computes a BFP16 8×8×8 matrix multiplication.

The goal of this chapter is to develop a high-performance tensor contraction kernel for XDNA1. We pursue this goal through three sub-objectives:

  1. Maximize the rate at which VMAC.F operations are issued. In the best case, a VMAC.F operation is issued every clock cycle.

  2. Hide all other operations, e.g., loads, stores, and pointer arithmetic, behind computations.

  3. Minimize bank conflicts.

Note

We discuss the design and implementation of a representative best-case kernel. Many other variants of this kernel are required for implementing a flexible and high-performance tensor compiler. We plan to extend the Hello XDNA website with generalizations of the kernels presented in this chapter and the XDNA2 Kernel chapter in the future, e.g., through just-in-time code generation.

Data Layout

_images/bf16_vmac_data.svg

Fig. 1 Register data layout for the BF16 4×8×4 VMAC.F (<BMLd>|<BMHd>), (<BMLm>|<BMHm>), <Xr>, <Xs>, <Rn> operation computing [m,k],[k,n]->[m,n] with |m|=4, |k|=8, and |n|=4.

Fig. 1 illustrates the register data layout required for the BF16 4×8×4 VMAC.F operation. The operation multiplies a BF16 M×K matrix in one X register with a BF16 K×N matrix in another X register and adds the result to an FP32 M×N matrix in an accumulation register. All matrices are stored in row-major order. Note that this is equivalent to a column-major matrix-matrix multiplication if we exchange the two operands.

We can also write the operation as an einsum, assuming that all tensors are in row-major order: [m,k],[k,n]->[m,n]. The einsum notation becomes helpful when considering the more complex data layout of the entire kernel.
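To make the einsum concrete, the following is a minimal scalar reference of the update performed by one BF16 4×8×4 operation. It is only a sketch: the function name vmac_reference is hypothetical, the tensors are flat row-major Python lists, and Python floats stand in for the BF16 inputs and FP32 accumulators, so rounding behavior is not modeled.

```python
# Scalar reference for [m,k],[k,n]->[m,n] with |m|=4, |k|=8, |n|=4.
# Assumption: flat row-major lists; double precision replaces BF16/FP32.
M, K, N = 4, 8, 4


def vmac_reference(acc, a, b):
    """Return acc + a*b, where acc is M*N, a is M*K, and b is K*N."""
    out = list(acc)
    for m in range(M):
        for n in range(N):
            for k in range(K):
                out[m * N + n] += a[m * K + k] * b[k * N + n]
    return out


a = [float(i % 3) for i in range(M * K)]
b = [float(i % 5) for i in range(K * N)]
acc = [1.0] * (M * N)
result = vmac_reference(acc, a, b)
```

The accumulator is both an input and an output, which mirrors the fact that the operation adds the product to the values already held in the accumulation register.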

_images/xdna1_data_and_registers.svg

Fig. 2 Data layout of the tensor contraction kernel in scratchpad memory (L1) of an executing compute tile. The kernel computes [m1,k1,m0,k0],[k1,n1,k0,n0]->[m1,n1,m0,n0], where the dimension sizes |m0|=4, |k0|=8, and |n0|=4 are fixed. The first computation of a 2×4 block of output tiles is highlighted in darker colors.

Fig. 2 illustrates the data layout of the entire tensor contraction kernel. It covers the two input tensors in0 and in1, as well as the output tensor out. The three tensors are tiled based on the requirements of the BF16 VMAC.F instruction. In detail, in0 has tiles of size M₀×K₀=4×8, in1 of size K₀×N₀=8×4, and out of size M₀×N₀=4×4. The tensor contraction kernel operates on three additional dimensions of type M, K and N. The tiles are stored in row-major order; accordingly, in0 uses M₁×K₁×M₀×K₀, while in1 uses K₁×N₁×K₀×N₀, and out uses M₁×N₁×M₀×N₀.

As before, we can write the operation compactly as an einsum, assuming that all tensors are stored in row-major order: [m1,k1,m0,k0],[k1,n1,k0,n0]->[m1,n1,m0,n0]. The tile size requirement means that |m0|=4, |k0|=8, and |n0|=4. We discuss limitations on m1, k1 and n1 in the following sections.
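The tiled einsum can be sketched as a plain loop nest over flat row-major lists. This is a reference model for checking layouts, not the kernel itself: the function name contract_tiled and the small example sizes are assumptions, and Python floats replace BF16 and FP32.

```python
# Reference for [m1,k1,m0,k0],[k1,n1,k0,n0]->[m1,n1,m0,n0].
# Assumption: flat row-major lists; precision effects are not modeled.
M0, K0, N0 = 4, 8, 4  # tile sizes fixed by the BF16 4x8x4 VMAC.F


def contract_tiled(in0, in1, out, m1_size, k1_size, n1_size):
    """Accumulate the contraction into out and return it."""
    for m1 in range(m1_size):
        for n1 in range(n1_size):
            for k1 in range(k1_size):
                # Element offsets of the three tiles touched by one VMAC.F.
                a = (m1 * k1_size + k1) * M0 * K0
                b = (k1 * n1_size + n1) * K0 * N0
                c = (m1 * n1_size + n1) * M0 * N0
                for m0 in range(M0):
                    for n0 in range(N0):
                        for k0 in range(K0):
                            out[c + m0 * N0 + n0] += (
                                in0[a + m0 * K0 + k0] * in1[b + k0 * N0 + n0]
                            )
    return out


m1_size, k1_size, n1_size = 2, 3, 2
in0 = [1.0] * (m1_size * k1_size * M0 * K0)
in1 = [1.0] * (k1_size * n1_size * K0 * N0)
out = contract_tiled(
    in0, in1, [0.0] * (m1_size * n1_size * M0 * N0), m1_size, k1_size, n1_size
)
```

The three innermost loops correspond to the work of a single VMAC.F operation, while the three outer loops are the dimensions the kernel still has to schedule.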

Design Decisions

We have already discussed the data layout of our tensor contraction kernel. For this, we derived requirements on the tiling and identified an einsum that summarizes the contraction computed by the targeted kernel. The dimensions m0, k0 and n0 are consumed by the VMAC.F operation, while the handling of dimensions m1, k1 and n1 is still unspecified.

Before introducing design decisions for the kernel, we recapitulate key hardware properties that have to be considered in the kernel design:

  1. Loading a 64-byte accumulation register requires two 32-byte loads (VLDA). A load has a 7-cycle latency. An instruction can contain at most one load or store operation that accesses the accumulation registers.

  2. VMAC.F reads the accumulation register in its third cycle (forwarding).

  3. Combining properties 1 and 2, we can issue a dependent VMAC.F operation at the earliest in the sixth cycle of the second 32-byte load. In other words, the second load and the dependent VMAC.F have to be five cycles apart.

  4. A VMAC.F operation has a latency of six cycles.
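The spacing stated in property 3 follows from properties 1 and 2 with a little cycle arithmetic. The sketch below uses a simplified model in which the issue cycle counts as cycle one; the function name min_issue_distance is hypothetical.

```python
# Simplified timing model (assumption: the issue cycle is cycle 1).
LOAD_LATENCY = 7    # property 1: cycles until a VLDA result is available
ACC_READ_CYCLE = 3  # property 2: VMAC.F reads its accumulator in cycle 3


def min_issue_distance(load_latency=LOAD_LATENCY, acc_read_cycle=ACC_READ_CYCLE):
    """Smallest number of cycles between a load and a dependent VMAC.F."""
    # A VMAC.F issued d cycles after the load reads the accumulator
    # (acc_read_cycle - 1) cycles after its own issue, so we need
    # d + (acc_read_cycle - 1) >= load_latency.
    return load_latency - (acc_read_cycle - 1)


distance = min_issue_distance()
issue_cycle_of_load = distance + 1  # "sixth cycle" of the second load
```

With the values from the list above, the distance evaluates to five cycles, which is the sixth cycle counted from the issue of the second 32-byte load.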

We make the following design decisions in our kernel:

Output Stationary

Load each output tensor value to the register file exactly once. This means that an output value is kept in the respective accumulation register until all updates have been applied through VMAC.F operations. Storing intermediate values and loading them back into the registers is challenging because loads and stores must be in different instructions to avoid bank conflicts.

Register Blocking

|m1| and |n1| must be multiples of the register-blocking size. Our example kernel uses a 2×4 register blocking scheme for the output tensor. This means that we use eight accumulation registers to hold the values of the eight output tiles as shown in Fig. 2.

Due to the 2×4 blocking, each loaded in0 tile is used in four VMAC.F operations, and each in1 tile is used in two operations. This reuse is required to hide the register data transfer behind computation. For example, we could not achieve this with a 2×2 blocking.
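A rough slot count illustrates why the blocking size matters. The sketch assumes two load units that each issue at most one 32-byte load per instruction, one VMAC.F per instruction, and 64-byte input tiles that each need two 32-byte loads; the function name load_slots is hypothetical.

```python
# Load-slot budget per k1 step for a bm x bn register blocking.
# Assumptions: two load units, one 32-byte load per unit and instruction,
# one VMAC.F per instruction, two 32-byte loads per 64-byte input tile.
def load_slots(bm, bn, load_units=2, loads_per_tile=2):
    """Return (vmac_ops, input_loads, spare_slots) for one k1 step."""
    vmac_ops = bm * bn                       # one VMAC.F per output tile
    input_loads = (bm + bn) * loads_per_tile  # bm in0 tiles and bn in1 tiles
    available = vmac_ops * load_units         # slots while the VMAC.Fs issue
    return vmac_ops, input_loads, available - input_loads


blocking_2x4 = load_slots(2, 4)
blocking_2x2 = load_slots(2, 2)
```

Under these assumptions, the 2×4 blocking leaves four spare load slots per k1 step, which can be used for preloading the next block's accumulation registers. The 2×2 blocking leaves none, so the input loads alone would already occupy both load units.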

Linear Contraction Dimension

The k1 dimension is handled with linear code without loop structures. This allows for different combinations of operations per block and is necessary for register preloading.

Single Hardware Loop

The m1 and n1 dimensions are represented by a single hardware loop. The first and last 2×4 blocks are computed outside of this loop, forming a warm-up phase and cool-down phase.

Double Buffering: Accumulation Registers

A 2×4 block requires eight out of sixteen available accumulation registers. We alternate registers BML0–BML3 and BMH0–BMH3 with BML4–BML7 and BMH4–BMH7 to realize a double buffering scheme. This means that while updating the tiles in one half of the accumulation registers, we load the next 2×4 block into the other half.

Double Buffering: Vector Registers

We also use double buffering for the registers holding tiles of in0. In particular, we alternate X0 and X1 with X2 and X3.

Implementation

This section discusses the implementation of a representative XDNA1 tensor contraction kernel. The kernel computes the einsum [m1,k1,m0,k0],[k1,n1,k0,n0]->[m1,n1,m0,n0] with dimension sizes |m0|=4, |k0|=8, |n0|=4, |m1|=8, |k1|=4, and |n1|=8. It contains three parts: a warm-up phase, a hardware loop, and a cool-down phase.

Warm-Up Phase

Listing 6 Warm-up phase (lines 7-62) of the XDNA1 kernel.
 7  nopv                          ; vlda amlh0, [p2, #32] ; vldb wl4, [p1], #32 ; nops                 ; movxm m2, #4*4*4 * 8                     // 4(byte)*4(r)*4(t) * 8(n)                                             // out - 1 row-step
 8  nopv                          ; vlda amll0, [p2], m2  ; vldb wh4, [p1], #32 ; nops                 ; movx r0,  #28            ; mov p3, p2
 9  nopv                          ; vlda amhh0, [p2, #32] ; vldb wl0, [p0], #32 ; nops                 ; movxm m7, #2*4*8 * 4 - 32                // 2(byte)*4(r)*8(s) * 4(k) - 32 (half-block)                           // in0 - m-step
10  nopv                          ; vlda amhl0, [p2], #64 ; vldb wh0, [p0], m7  ; nops                 ; nopx                     ; mov p4, p2
11  nopv                          ; vlda amll1, [p3, #64] ; vldb wl1, [p0], #32 ; nops                 ; movxm m0, #32 - (2*4*8*4)                // 32(half-block) - (m7+32)                                             // in0 - k-step
12  nopv                          ; vlda amlh1, [p3, #96] ; vldb wh1, [p0], m0  ; padds [p3], #128     ; nopx                     ; mov p5, p3
13  nopv                          ; vlda amhl1, [p2], #32 ; vldb wl5, [p1], #32 ; nops                 ; movxm m1, #2*8*4 * 8 - 7 * 32            // 2(byte)*8(s)*4(t) * 8(n) - 7(blocking in n *2 - 1) * 32(half-block)  // in1 - k-step
14  nopv                          ; vlda amhh1, [p2], #32 ; vldb wh5, [p1], #32 ; nops                 ; movx r1,  #0             ; mov r2, #8/4  // 8(n)/4(blocking in n)
15  nopv                          ; vlda amll2, [p3], #32 ; vldb wl6, [p1], #32 ; nops                 ; movxm r3, #32 - 2*4*8 * 2 * 4            // 32(half-block) - 2(byte)*4(r)*8(s) * 2(blocking in m) * 4(k)         // in0 - n-step
16  nopv                          ; vlda amlh2, [p3], #32 ; vldb wh6, [p1], #32 ; nops                 ; movxm r4, #32                            // 32(half-block)                                                       // in0 - m-step // out - n-step
17  vmac.f bml0, bml0, x0, x4, r0 ; vlda amhl2, [p2], #32 ; vldb wl7, [p1], #32 ; nops                 ; movxm r5, #32 - 2*8*4 * 8 * (4-1)        // 32(half-block) - 2(byte)*8(s)*4(t) * 8(n) * (k-1)                    // in1 - n-step
18  nopv                          ; vlda amhh2, [p2], #32 ; vldb wh7, [p1], m1  ; nops                 ; movxm r6, #32 - 2*8*4 * 8 * 4            // 32(half-block) - 2(byte)*8(s)*4(t) * 8(n) * 2(4)                     // in1 - m-step
19  vmac.f bmh0, bmh0, x1, x4, r0 ; vlda amll3, [p3], #32 ; nopb                ; nops                 ; movxm r7, #32 + 4*4*4 * 8                // 32(half-block) + 4(byte)*4(r)*4(t) * 8(n)                            // out - m-step
20  nopv                          ; vlda amlh3, [p3], #32 ; nopb                ; nops                 ; add r1,  r1,  #1         ; nopm
21  vmac.f bml1, bml1, x0, x5, r0 ; vlda amhl3, [p2], #32 ; nopb                ; nops                 ; ltu r27, r1,  r2         ; nopm
22  nopv                          ; vlda amhh3, [p2], #32 ; nopb                ; nops                 ; sel.nez r28, r4, r7, r27 ; nopm
23  vmac.f bmh1, bmh1, x1, x5, r0 ; vlda wl2,   [p0], #32 ; vldb wl4, [p1], #32 ; nops                 ; sel.nez r28, r5, r6, r27 ; mov m5, r28
24  nopv                          ; vlda wh2,   [p0], m7  ; vldb wh4, [p1], #32 ; nops                 ; sel.nez r28, r3, r4, r27 ; mov m4, r28
25  vmac.f bml2, bml2, x0, x6, r0 ; vlda wl3,   [p0], #32 ; vldb wl5, [p1], #32 ; nops                 ; mul r1,  r1,  r27        ; mov m3, r28
26  nopv                          ; vlda wh3,   [p0], m0  ; vldb wh5, [p1], #32 ; nops                 ; nopx                     ; nopm
27  vmac.f bmh2, bmh2, x1, x6, r0 ; vlda amll4, [p3], #32 ; vldb wl6, [p1], #32 ; nops                 ; nopx                     ; nopm
28  nopv                          ; vlda amlh4, [p3], #32 ; vldb wh6, [p1], #32 ; nops                 ; nopx                     ; nopm
29  vmac.f bml3, bml3, x0, x7, r0 ; vlda amhl4, [p2], #32 ; vldb wl7, [p1], #32 ; nops                 ; movxm ls, #.l_start
30  nopv                          ; vlda amhh4, [p2], #32 ; vldb wh7, [p1], m1  ; nops                 ; movxm le, #.l_end
31  vmac.f bmh3, bmh3, x1, x7, r0 ; vlda wl0,   [p0], #32 ; vldb wl4, [p1], #32 ; nops                 ; movxm lc, #3                            // (8(m)/2(blocking in m) * 8(n)/4(blocking in n) - 2(warmup + cool-down)) /2(iterations in loop)
32
33// k=8
34  vmac.f bml0, bml0, x2, x4, r0 ; vlda wh0,   [p0], m7  ; vldb wh4, [p1], #32 ; nops                 ; nopx                     ; nopm
35  vmac.f bmh0, bmh0, x3, x4, r0 ; vlda wl1,   [p0], #32 ; vldb wl5, [p1], #32 ; nops                 ; nopx                     ; nopm
36  vmac.f bml1, bml1, x2, x5, r0 ; vlda wh1,   [p0], m0  ; vldb wh5, [p1], #32 ; nops                 ; nopx                     ; nopm
37  vmac.f bmh1, bmh1, x3, x5, r0 ; vlda amll5, [p3], #32 ; vldb wl6, [p1], #32 ; nops                 ; nopx                     ; nopm
38  vmac.f bml2, bml2, x2, x6, r0 ; vlda amlh5, [p3], #32 ; vldb wh6, [p1], #32 ; nops                 ; nopx                     ; nopm
39  vmac.f bmh2, bmh2, x3, x6, r0 ; vlda amhl5, [p2], #32 ; vldb wl7, [p1], #32 ; nops                 ; nopx                     ; nopm
40  vmac.f bml3, bml3, x2, x7, r0 ; vlda amhh5, [p2], #32 ; vldb wh7, [p1], m1  ; nops                 ; nopx                     ; nopm
41  vmac.f bmh3, bmh3, x3, x7, r0 ; vlda wl2,   [p0], #32 ; vldb wl4, [p1], #32 ; nops                 ; nopx                     ; nopm
42
43// k=16
44  vmac.f bml0, bml0, x0, x4, r0 ; vlda wh2,   [p0], m7  ; vldb wh4, [p1], #32 ; nops                 ; nopx                     ; nopm
45  vmac.f bmh0, bmh0, x1, x4, r0 ; vlda wl3,   [p0], #32 ; vldb wl5, [p1], #32 ; nops                 ; nopx                     ; nopm
46  vmac.f bml1, bml1, x0, x5, r0 ; vlda wh3,   [p0], m3  ; vldb wh5, [p1], #32 ; nops                 ; nopx                     ; nopm
47  vmac.f bmh1, bmh1, x1, x5, r0 ; vlda amll6, [p3], #32 ; vldb wl6, [p1], #32 ; nops                 ; nopx                     ; nopm
48  vmac.f bml2, bml2, x0, x6, r0 ; vlda amlh6, [p3], #32 ; vldb wh6, [p1], #32 ; nops                 ; nopx                     ; nopm
49  vmac.f bmh2, bmh2, x1, x6, r0 ; vlda amhl6, [p2], #32 ; vldb wl7, [p1], #32 ; nops                 ; nopx                     ; nopm
50  vmac.f bml3, bml3, x0, x7, r0 ; vlda amhh6, [p2], #32 ; vldb wh7, [p1], m4  ; nops                 ; nopx                     ; nopm
51  vmac.f bmh3, bmh3, x1, x7, r0 ; vlda wl0,   [p0], #32 ; vldb wl4, [p1], #32 ; nops                 ; nopx                     ; nopm
52
53// k=24
54  vmac.f bml0, bml0, x2, x4, r0 ; vlda wh0,   [p0], m7  ; vldb wh4, [p1], #32 ; nops                 ; add r1,  r1,  #1         ; nopm
55  vmac.f bmh0, bmh0, x3, x4, r0 ; vlda wl1,   [p0], #32 ; vldb wl5, [p1], #32 ; nops                 ; ltu r27, r1,  r2         ; nopm
56  vmac.f bml1, bml1, x2, x5, r0 ; vlda wh1,   [p0], m0  ; vldb wh5, [p1], #32 ; nops                 ; sel.nez r28, r4, r7, r27 ; mov m6, m5
57  vmac.f bmh1, bmh1, x3, x5, r0 ; vlda amll7, [p3], #32 ; vldb wl6, [p1], #32 ; nops                 ; sel.nez r28, r5, r6, r27 ; mov m5, r28
58  vmac.f bml2, bml2, x2, x6, r0 ; vlda amlh7, [p3], m5  ; vldb wh6, [p1], #32 ; nops                 ; sel.nez r28, r3, r4, r27 ; mov m4, r28
59  vmac.f bmh2, bmh2, x3, x6, r0 ; vlda amhl7, [p2], #32 ; vldb wl7, [p1], #32 ; nops                 ; mul r1,  r1,  r27        ; mov m3, r28
60  vmac.f bml3, bml3, x2, x7, r0 ; vlda amhh7, [p2], m5  ; vldb wh7, [p1], m1  ; nops                 ; nopx                     ; nopm
61  vmac.f bmh3, bmh3, x3, x7, r0 ; vlda wl2,   [p0], #32 ; vldb wl4, [p1], #32 ; vst amll0, [p5], #32 ; nopx                     ; nopm
62// k=32

Listing 6 shows the warm-up phase of the kernel. Load unit A loads data into the accumulation registers. The sixteen 32-byte VLDA operations in lines 7–22 load the first 2×4 block of output tiles into the accumulation registers BML0–BML3 and BMH0–BMH3. Additionally, in lines 7–18, load unit B is used to load the first two input tiles of in0 into vector registers X0 and X1, as well as the first four input tiles of in1 into X4–X7. The register mapping is also illustrated in Fig. 2. The first update of the 2×4 block is performed by the VMAC.F operations in lines 17–31.

In lines 15–19, the warm-up phase initializes the general-purpose registers R3–R7. These are used throughout the kernel, and their values are copied to modifier registers for subsequent updates of addresses in pointer registers.
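The constants in Listing 6 follow from the tensor layouts and can be recomputed from the dimension sizes. The sketch below mirrors the expressions given in the listing's comments; the Python variable names are chosen to match the registers, and the helper walk is hypothetical.

```python
# Byte strides recomputed from the comments in Listing 6.
BF16, FP32 = 2, 4            # bytes per element
M0, K0, N0 = 4, 8, 4         # tile sizes
K1, N1 = 4, 8                # |k1| and |n1| of the example kernel
HALF = 32                    # one 32-byte load covers half a 64-byte tile

in0_tile = BF16 * M0 * K0    # 64 bytes
in1_tile = BF16 * K0 * N0    # 64 bytes
out_tile = FP32 * M0 * N0    # 64 bytes

m2 = out_tile * N1                    # out: one row step
m7 = in0_tile * K1 - HALF             # in0: m-step
m0 = HALF - in0_tile * K1             # in0: k-step
m1 = in1_tile * N1 - 7 * HALF         # in1: k-step
r3 = HALF - in0_tile * 2 * K1         # in0: n-step
r4 = HALF                             # half-block
r5 = HALF - in1_tile * N1 * (K1 - 1)  # in1: n-step
r6 = HALF - in1_tile * N1 * K1        # in1: m-step
r7 = HALF + out_tile * N1             # out: m-step


def walk(increments, start=0):
    """Return the byte offset used by each post-incrementing load."""
    offsets, pointer = [], start
    for inc in increments:
        offsets.append(pointer)
        pointer += inc
    return offsets


# Offsets of the first five in0 loads: wl0, wh0, wl1, wh1, wl2.
in0_offsets = walk([HALF, m7, HALF, m0, HALF])  # [0, 32, 256, 288, 64]
```

The trace shows that the loads for X0 and X1 read the tiles (m1=0, k1=0) and (m1=1, k1=0), after which the pointer returns to tile (m1=0, k1=1) for X2.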

We also see that load unit A is used to load the next 2×4 block of output tiles to accumulation registers BML4–BML7 and BMH4–BMH7 (lines 27–30, 37–40, 47–50, and 57–60).

The first instruction block (lines 7–31) contains eight VMAC.F operations and 17 NOPV operations, thus leaving the vector unit partially unutilized. Every instruction in the following three eight-instruction blocks contains a VMAC.F operation, meaning that the BF16 matrix multiplication unit of the core is fully utilized. In summary, the warm-up phase has a total of 49 instructions, out of which 32 contain VMAC.F operations.

Hardware Loop

We must perform the loop setup at least 64 bytes before the loop’s start address. The first and last instructions in the loop must be 16-byte aligned. Additionally, the last instruction covered by a loop must have a size of 16 bytes. An instruction that contains operations for all functional units is 16 bytes wide. A NOP instruction is only two bytes wide.

Listing 7 Hardware loop setup (lines 29-31: chars 104+) of the XDNA1 kernel.
29  movxm ls, #.l_start
30  movxm le, #.l_end
31  movxm lc, #3                            // (8(m)/2(blocking in m) * 8(n)/4(blocking in n) - 2(warmup + cool-down)) /2(iterations in loop)

Listing 7 shows the operations configuring the hardware loop. movxm ls, #.l_start copies the address of the first loop instruction into the loop start register. The operation movxm le, #.l_end copies the address of the last loop instruction into the loop end register. movxm lc, #3 copies the value 3 into the loop counter register.
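The loop count of 3 follows from the formula in the comment of line 31. A minimal sketch of this calculation, using the hypothetical helper loop_count, looks as follows:

```python
# Loop count from the comment in line 31 of the kernel:
# (|m1|/2 * |n1|/4 - 2) / 2
def loop_count(m1, n1, block_m=2, block_n=4, blocks_per_iteration=2):
    """Number of hardware-loop iterations for a 2x4 register blocking."""
    blocks = (m1 // block_m) * (n1 // block_n)
    in_loop = blocks - 2  # warm-up and cool-down each compute one block
    if in_loop < 0 or in_loop % blocks_per_iteration != 0:
        raise ValueError("block count does not fit the loop structure")
    return in_loop // blocks_per_iteration


lc = loop_count(8, 8)
```

For |m1|=8 and |n1|=8 there are eight 2×4 blocks: one in the warm-up phase, six in three loop iterations, and one in the cool-down phase.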

Listing 8 Body of the loop (lines 64-148) in the XDNA1 kernel.
 64.p2align 4
 65.l_start:
 66// k=0
 67  vmac.f bml4, bml4, x0, x4, r0 ; vlda wh2,   [p0], m7  ; vldb wh4, [p1], #32 ; vst amlh0, [p5], #32 ; nopx                     ; nopm
 68  vmac.f bmh4, bmh4, x1, x4, r0 ; vlda wl3,   [p0], #32 ; vldb wl5, [p1], #32 ; vst amhl0, [p4], #32 ; nopx                     ; nopm
 69  vmac.f bml5, bml5, x0, x5, r0 ; vlda wh3,   [p0], m0  ; vldb wh5, [p1], #32 ; vst amhh0, [p4], #32 ; nopx                     ; nopm
 70  vmac.f bmh5, bmh5, x1, x5, r0 ; vlda amll0, [p3], #32 ; vldb wl6, [p1], #32 ; nops                 ; nopx                     ; nopm
 71  vmac.f bml6, bml6, x0, x6, r0 ; vlda amlh0, [p3], #32 ; vldb wh6, [p1], #32 ; nops                 ; nopx                     ; nopm
 72  vmac.f bmh6, bmh6, x1, x6, r0 ; vlda amhl0, [p2], #32 ; vldb wl7, [p1], #32 ; nops                 ; nopx                     ; nopm
 73  vmac.f bml7, bml7, x0, x7, r0 ; vlda amhh0, [p2], #32 ; vldb wh7, [p1], m1  ; nops                 ; nopx                     ; nopm
 74  vmac.f bmh7, bmh7, x1, x7, r0 ; vlda wl0,   [p0], #32 ; vldb wl4, [p1], #32 ; vst amll1, [p5], #32 ; nopx                     ; nopm
 75
 76// k=8
 77  vmac.f bml4, bml4, x2, x4, r0 ; vlda wh0,   [p0], m7  ; vldb wh4, [p1], #32 ; vst amlh1, [p5], #32 ; nopx                     ; nopm
 78  vmac.f bmh4, bmh4, x3, x4, r0 ; vlda wl1,   [p0], #32 ; vldb wl5, [p1], #32 ; vst amhl1, [p4], #32 ; nopx                     ; nopm
 79  vmac.f bml5, bml5, x2, x5, r0 ; vlda wh1,   [p0], m0  ; vldb wh5, [p1], #32 ; vst amhh1, [p4], #32 ; nopx                     ; nopm
 80  vmac.f bmh5, bmh5, x3, x5, r0 ; vlda amll1, [p3], #32 ; vldb wl6, [p1], #32 ; nops                 ; nopx                     ; nopm
 81  vmac.f bml6, bml6, x2, x6, r0 ; vlda amlh1, [p3], #32 ; vldb wh6, [p1], #32 ; nops                 ; nopx                     ; nopm
 82  vmac.f bmh6, bmh6, x3, x6, r0 ; vlda amhl1, [p2], #32 ; vldb wl7, [p1], #32 ; nops                 ; nopx                     ; nopm
 83  vmac.f bml7, bml7, x2, x7, r0 ; vlda amhh1, [p2], #32 ; vldb wh7, [p1], m1  ; nops                 ; nopx                     ; nopm
 84  vmac.f bmh7, bmh7, x3, x7, r0 ; vlda wl2,   [p0], #32 ; vldb wl4, [p1], #32 ; vst amll2, [p5], #32 ; nopx                     ; nopm
 85
 86// k=16
 87  vmac.f bml4, bml4, x0, x4, r0 ; vlda wh2,   [p0], m7  ; vldb wh4, [p1], #32 ; vst amlh2, [p5], #32 ; nopx                     ; nopm
 88  vmac.f bmh4, bmh4, x1, x4, r0 ; vlda wl3,   [p0], #32 ; vldb wl5, [p1], #32 ; vst amhl2, [p4], #32 ; nopx                     ; nopm
 89  vmac.f bml5, bml5, x0, x5, r0 ; vlda wh3,   [p0], m3  ; vldb wh5, [p1], #32 ; vst amhh2, [p4], #32 ; nopx                     ; nopm
 90  vmac.f bmh5, bmh5, x1, x5, r0 ; vlda amll2, [p3], #32 ; vldb wl6, [p1], #32 ; nops                 ; nopx                     ; nopm
 91  vmac.f bml6, bml6, x0, x6, r0 ; vlda amlh2, [p3], #32 ; vldb wh6, [p1], #32 ; nops                 ; nopx                     ; nopm
 92  vmac.f bmh6, bmh6, x1, x6, r0 ; vlda amhl2, [p2], #32 ; vldb wl7, [p1], #32 ; nops                 ; nopx                     ; nopm
 93  vmac.f bml7, bml7, x0, x7, r0 ; vlda amhh2, [p2], #32 ; vldb wh7, [p1], m4  ; nops                 ; nopx                     ; nopm
 94  vmac.f bmh7, bmh7, x1, x7, r0 ; vlda wl0,   [p0], #32 ; vldb wl4, [p1], #32 ; vst amll3, [p5], #32 ; nopx                     ; nopm
 95
 96// k=24
 97  vmac.f bml4, bml4, x2, x4, r0 ; vlda wh0,   [p0], m7  ; vldb wh4, [p1], #32 ; vst amlh3, [p5], m6  ; add r1,  r1,  #1         ; nopm
 98  vmac.f bmh4, bmh4, x3, x4, r0 ; vlda wl1,   [p0], #32 ; vldb wl5, [p1], #32 ; vst amhl3, [p4], #32 ; ltu r27, r1,  r2         ; nopm
 99  vmac.f bml5, bml5, x2, x5, r0 ; vlda wh1,   [p0], m0  ; vldb wh5, [p1], #32 ; vst amhh3, [p4], m6  ; sel.nez r28, r4, r7, r27 ; mov m6, m5
100  vmac.f bmh5, bmh5, x3, x5, r0 ; vlda amll3, [p3], #32 ; vldb wl6, [p1], #32 ; nops                 ; sel.nez r28, r5, r6, r27 ; mov m5, r28
101  vmac.f bml6, bml6, x2, x6, r0 ; vlda amlh3, [p3], m5  ; vldb wh6, [p1], #32 ; nops                 ; sel.nez r28, r3, r4, r27 ; mov m4, r28
102  vmac.f bmh6, bmh6, x3, x6, r0 ; vlda amhl3, [p2], #32 ; vldb wl7, [p1], #32 ; nops                 ; mul r1,  r1,  r27        ; mov m3, r28
103  vmac.f bml7, bml7, x2, x7, r0 ; vlda amhh3, [p2], m5  ; vldb wh7, [p1], m1  ; nops                 ; nopx                     ; nopm
104  vmac.f bmh7, bmh7, x3, x7, r0 ; vlda wl2,   [p0], #32 ; vldb wl4, [p1], #32 ; vst amll4, [p5], #32 ; nopx                     ; nopm
105// k=32
106
107// k=0
108  vmac.f bml0, bml0, x0, x4, r0 ; vlda wh2,   [p0], m7  ; vldb wh4, [p1], #32 ; vst amlh4, [p5], #32 ; nopx                     ; nopm
109  vmac.f bmh0, bmh0, x1, x4, r0 ; vlda wl3,   [p0], #32 ; vldb wl5, [p1], #32 ; vst amhl4, [p4], #32 ; nopx                     ; nopm
110  vmac.f bml1, bml1, x0, x5, r0 ; vlda wh3,   [p0], m0  ; vldb wh5, [p1], #32 ; vst amhh4, [p4], #32 ; nopx                     ; nopm
111  vmac.f bmh1, bmh1, x1, x5, r0 ; vlda amll4, [p3], #32 ; vldb wl6, [p1], #32 ; nops                 ; nopx                     ; nopm
112  vmac.f bml2, bml2, x0, x6, r0 ; vlda amlh4, [p3], #32 ; vldb wh6, [p1], #32 ; nops                 ; nopx                     ; nopm
113  vmac.f bmh2, bmh2, x1, x6, r0 ; vlda amhl4, [p2], #32 ; vldb wl7, [p1], #32 ; nops                 ; nopx                     ; nopm
114  vmac.f bml3, bml3, x0, x7, r0 ; vlda amhh4, [p2], #32 ; vldb wh7, [p1], m1  ; nops                 ; nopx                     ; nopm
115  vmac.f bmh3, bmh3, x1, x7, r0 ; vlda wl0,   [p0], #32 ; vldb wl4, [p1], #32 ; vst amll5, [p5], #32 ; nopx                     ; nopm
116
117// k=8
118  vmac.f bml0, bml0, x2, x4, r0 ; vlda wh0,   [p0], m7  ; vldb wh4, [p1], #32 ; vst amlh5, [p5], #32 ; nopx                     ; nopm
119  vmac.f bmh0, bmh0, x3, x4, r0 ; vlda wl1,   [p0], #32 ; vldb wl5, [p1], #32 ; vst amhl5, [p4], #32 ; nopx                     ; nopm
120  vmac.f bml1, bml1, x2, x5, r0 ; vlda wh1,   [p0], m0  ; vldb wh5, [p1], #32 ; vst amhh5, [p4], #32 ; nopx                     ; nopm
121  vmac.f bmh1, bmh1, x3, x5, r0 ; vlda amll5, [p3], #32 ; vldb wl6, [p1], #32 ; nops                 ; nopx                     ; nopm
122  vmac.f bml2, bml2, x2, x6, r0 ; vlda amlh5, [p3], #32 ; vldb wh6, [p1], #32 ; nops                 ; nopx                     ; nopm
123  vmac.f bmh2, bmh2, x3, x6, r0 ; vlda amhl5, [p2], #32 ; vldb wl7, [p1], #32 ; nops                 ; nopx                     ; nopm
124  vmac.f bml3, bml3, x2, x7, r0 ; vlda amhh5, [p2], #32 ; vldb wh7, [p1], m1  ; nops                 ; nopx                     ; nopm
125  vmac.f bmh3, bmh3, x3, x7, r0 ; vlda wl2,   [p0], #32 ; vldb wl4, [p1], #32 ; vst amll6, [p5], #32 ; nopx                     ; nopm
126
127// k=16
128  vmac.f bml0, bml0, x0, x4, r0 ; vlda wh2,   [p0], m7  ; vldb wh4, [p1], #32 ; vst amlh6, [p5], #32 ; nopx                     ; nopm
129  vmac.f bmh0, bmh0, x1, x4, r0 ; vlda wl3,   [p0], #32 ; vldb wl5, [p1], #32 ; vst amhl6, [p4], #32 ; nopx                     ; nopm
130  vmac.f bml1, bml1, x0, x5, r0 ; vlda wh3,   [p0], m3  ; vldb wh5, [p1], #32 ; vst amhh6, [p4], #32 ; nopx                     ; nopm
131  vmac.f bmh1, bmh1, x1, x5, r0 ; vlda amll6, [p3], #32 ; vldb wl6, [p1], #32 ; nops                 ; nopx                     ; nopm
132  vmac.f bml2, bml2, x0, x6, r0 ; vlda amlh6, [p3], #32 ; vldb wh6, [p1], #32 ; nops                 ; nopx                     ; nopm
133  vmac.f bmh2, bmh2, x1, x6, r0 ; vlda amhl6, [p2], #32 ; vldb wl7, [p1], #32 ; nops                 ; nopx                     ; nopm
134  vmac.f bml3, bml3, x0, x7, r0 ; vlda amhh6, [p2], #32 ; vldb wh7, [p1], m4  ; nops                 ; nopx                     ; nopm
135  vmac.f bmh3, bmh3, x1, x7, r0 ; vlda wl0,   [p0], #32 ; vldb wl4, [p1], #32 ; vst amll7, [p5], #32 ; nopx                     ; nopm
136
137// k=24
138  vmac.f bml0, bml0, x2, x4, r0 ; vlda wh0,   [p0], m7  ; vldb wh4, [p1], #32 ; vst amlh7, [p5], m6  ; add r1,  r1,  #1         ; nopm
139  vmac.f bmh0, bmh0, x3, x4, r0 ; vlda wl1,   [p0], #32 ; vldb wl5, [p1], #32 ; vst amhl7, [p4], #32 ; ltu r27, r1,  r2         ; nopm
140  vmac.f bml1, bml1, x2, x5, r0 ; vlda wh1,   [p0], m0  ; vldb wh5, [p1], #32 ; vst amhh7, [p4], m6  ; sel.nez r28, r4, r7, r27 ; mov m6, m5
141  vmac.f bmh1, bmh1, x3, x5, r0 ; vlda amll7, [p3], #32 ; vldb wl6, [p1], #32 ; nops                 ; sel.nez r28, r5, r6, r27 ; mov m5, r28
142  vmac.f bml2, bml2, x2, x6, r0 ; vlda amlh7, [p3], m5  ; vldb wh6, [p1], #32 ; nops                 ; sel.nez r28, r3, r4, r27 ; mov m4, r28
143  vmac.f bmh2, bmh2, x3, x6, r0 ; vlda amhl7, [p2], #32 ; vldb wl7, [p1], #32 ; nops                 ; mul r1,  r1,  r27        ; mov m3, r28
144  vmac.f bml3, bml3, x2, x7, r0 ; vlda amhh7, [p2], m5  ; vldb wh7, [p1], m1  ; nops                 ; nopx                     ; nopm
145.p2align 4
146.l_end:
147  vmac.f bmh3, bmh3, x3, x7, r0 ; vlda wl2,   [p0], #32 ; vldb wl4, [p1], #32 ; vst amll0, [p5], #32 ; nopx                     ; nopm
148// k=32

Listing 8 shows the body of the tensor contraction kernel’s loop. In the first half of the body (lines 67–104), the values in accumulation registers with indices 4–7 are updated by the VMAC.F operations. When entering the loop body, accumulation registers 0–3 hold the results of the preceding 2×4 block of output tiles. During execution, these are written to scratchpad memory (L1) using VST operations. Simultaneously, load unit A transfers the next 2×4 block’s tiles into registers 0–3.

The second half of the loop body (lines 108–147) computes the pre-loaded 2×4 block and updates the output tiles in registers 0–3. At the same time, the data of the now preceding 2×4 block, computed in the first half of the loop body, is written to memory, while the next block is loaded to registers 4–7.
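The alternation between the two register halves can be summarized as a schedule over the 2×4 blocks. The sketch below is an abstraction of the listings and does not model individual cycles; the function name schedule and the dictionary keys are hypothetical.

```python
# Double-buffering schedule over 2x4 blocks (abstract, not cycle-accurate).
def schedule(num_blocks):
    """Return, per block, which register half is computed, stored, loaded."""
    steps = []
    for block in range(num_blocks):
        active = "0-3" if block % 2 == 0 else "4-7"
        other = "4-7" if block % 2 == 0 else "0-3"
        steps.append({
            "block": block,
            "compute": active,
            # The previous block's results are written back from the other half.
            "store_prev": other if block > 0 else None,
            # The next block's output tiles are preloaded into the other half.
            "load_next": other if block + 1 < num_blocks else None,
        })
    return steps


steps = schedule((8 // 2) * (8 // 4))  # eight blocks for |m1|=8, |n1|=8
```

In this view, block 0 is the warm-up phase, blocks 1 to 6 are the two halves of the three loop iterations, and block 7 is the cool-down phase, which has no next block to preload.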

Considering the XDNA1 vector unit, we see that every instruction in the loop body contains a VMAC.F operation. Therefore, the unit is fully utilized and all of the 64 instructions in the loop body perform a BF16 4×8×4 matrix multiplication.

Cool-down Phase

Listing 9 Cool-down phase (lines 150-205) of the XDNA1 kernel.
150// k=0
151  vmac.f bml4, bml4, x0, x4, r0 ; vlda wh2,   [p0], m7  ; vldb wh4, [p1], #32 ; vst amlh0, [p5], #32 ; nopx                     ; nopm
152  vmac.f bmh4, bmh4, x1, x4, r0 ; vlda wl3,   [p0], #32 ; vldb wl5, [p1], #32 ; vst amhl0, [p4], #32 ; nopx                     ; nopm
153  vmac.f bml5, bml5, x0, x5, r0 ; vlda wh3,   [p0], m0  ; vldb wh5, [p1], #32 ; vst amhh0, [p4], #32 ; nopx                     ; nopm
154  vmac.f bmh5, bmh5, x1, x5, r0 ; nopa                  ; vldb wl6, [p1], #32 ; nops                 ; nopx                     ; nopm
155  vmac.f bml6, bml6, x0, x6, r0 ; nopa                  ; vldb wh6, [p1], #32 ; nops                 ; nopx                     ; nopm
156  vmac.f bmh6, bmh6, x1, x6, r0 ; nopa                  ; vldb wl7, [p1], #32 ; nops                 ; nopx                     ; nopm
157  vmac.f bml7, bml7, x0, x7, r0 ; nopa                  ; vldb wh7, [p1], m1  ; nops                 ; nopx                     ; nopm
158  vmac.f bmh7, bmh7, x1, x7, r0 ; vlda wl0,   [p0], #32 ; vldb wl4, [p1], #32 ; vst amll1, [p5], #32 ; nopx                     ; nopm
159
160// k=8
161  vmac.f bml4, bml4, x2, x4, r0 ; vlda wh0,   [p0], m7  ; vldb wh4, [p1], #32 ; vst amlh1, [p5], #32 ; nopx                     ; nopm
162  vmac.f bmh4, bmh4, x3, x4, r0 ; vlda wl1,   [p0], #32 ; vldb wl5, [p1], #32 ; vst amhl1, [p4], #32 ; nopx                     ; nopm
163  vmac.f bml5, bml5, x2, x5, r0 ; vlda wh1,   [p0], m0  ; vldb wh5, [p1], #32 ; vst amhh1, [p4], #32 ; nopx                     ; nopm
164  vmac.f bmh5, bmh5, x3, x5, r0 ; nopa                  ; vldb wl6, [p1], #32 ; nops                 ; nopx                     ; nopm
165  vmac.f bml6, bml6, x2, x6, r0 ; nopa                  ; vldb wh6, [p1], #32 ; nops                 ; nopx                     ; nopm
166  vmac.f bmh6, bmh6, x3, x6, r0 ; nopa                  ; vldb wl7, [p1], #32 ; nops                 ; nopx                     ; nopm
167  vmac.f bml7, bml7, x2, x7, r0 ; nopa                  ; vldb wh7, [p1], m1  ; nops                 ; nopx                     ; nopm
168  vmac.f bmh7, bmh7, x3, x7, r0 ; vlda wl2,   [p0], #32 ; vldb wl4, [p1], #32 ; vst amll2, [p5], #32 ; nopx                     ; nopm
169
170// k=16
171  vmac.f bml4, bml4, x0, x4, r0 ; vlda wh2,   [p0], m7  ; vldb wh4, [p1], #32 ; vst amlh2, [p5], #32 ; nopx                     ; nopm
172  vmac.f bmh4, bmh4, x1, x4, r0 ; vlda wl3,   [p0], #32 ; vldb wl5, [p1], #32 ; vst amhl2, [p4], #32 ; nopx                     ; nopm
173  vmac.f bml5, bml5, x0, x5, r0 ; vlda wh3,   [p0, #0]  ; vldb wh5, [p1], #32 ; vst amhh2, [p4], #32 ; nopx                     ; nopm
174  vmac.f bmh5, bmh5, x1, x5, r0 ; nopa                  ; vldb wl6, [p1], #32 ; nops                 ; nopx                     ; nopm
175  vmac.f bml6, bml6, x0, x6, r0 ; nopa                  ; vldb wh6, [p1], #32 ; nops                 ; nopx                     ; nopm
176  vmac.f bmh6, bmh6, x1, x6, r0 ; nopa                  ; vldb wl7, [p1], #32 ; nops                 ; nopx                     ; nopm
177  vmac.f bml7, bml7, x0, x7, r0 ; nopa                  ; vldb wh7, [p1, #0]  ; nops                 ; nopx                     ; nopm
178  vmac.f bmh7, bmh7, x1, x7, r0 ; nopa                  ; nopb                ; vst amll3, [p5], #32 ; nopx                     ; nopm
179
180// k=24
181  vmac.f bml4, bml4, x2, x4, r0 ; nopa                  ; nopb                ; vst amlh3, [p5], m6  ; nopx                     ; nopm
182  vmac.f bmh4, bmh4, x3, x4, r0 ; nopa                  ; nopb                ; vst amhl3, [p4], #32 ; nopx                     ; nopm
183  vmac.f bml5, bml5, x2, x5, r0 ; nopa                  ; nopb                ; vst amhh3, [p4], m6  ; nopx                     ; nopm
184  vmac.f bmh5, bmh5, x3, x5, r0 ; nopa                  ; nopb                ; nops                 ; nopx                     ; nopm
185  vmac.f bml6, bml6, x2, x6, r0 ; nopa                  ; nopb                ; nops                 ; nopx                     ; nopm
186  vmac.f bmh6, bmh6, x3, x6, r0 ; nopa                  ; nopb                ; nops                 ; nopx                     ; nopm
187  vmac.f bml7, bml7, x2, x7, r0 ; nopa                  ; nopb                ; vst amll4, [p5], #32 ; nopx                     ; nopm
188  vmac.f bmh7, bmh7, x3, x7, r0 ; nopa                  ; nopb                ; vst amlh4, [p5], #32 ; nopx                     ; nopm
189// k=32
190
191  nopv                          ; nopa                  ; nopb                ; vst amhl4, [p4], #32 ; nopx                     ; nopm
192  nopv                          ; nopa                  ; nopb                ; vst amhh4, [p4], #32 ; nopx                     ; nopm
193  nopv                          ; nopa                  ; nopb                ; vst amll5, [p5], #32 ; nopx                     ; nopm
194  nopv                          ; nopa                  ; nopb                ; vst amlh5, [p5], #32 ; nopx                     ; nopm
195  nopv                          ; nopa                  ; nopb                ; vst amhl5, [p4], #32 ; nopx                     ; nopm
196  nopv                          ; nopa                  ; nopb                ; vst amhh5, [p4], #32 ; nopx                     ; nopm
197  nopv                          ; nopa                  ; nopb                ; vst amll6, [p5], #32 ; nopx                     ; nopm
198  nopv                          ; nopa                  ; nopb                ; vst amlh6, [p5], #32 ; nopx                     ; nopm
199  nopv                          ; nopa                  ; nopb                ; vst amhl6, [p4], #32 ; nopx                     ; nopm
200  nopv                          ; nopa                  ; nopb                ; vst amhh6, [p4], #32 ; ret lr                   ; nopm
201  nopv                          ; nopa                  ; nopb                ; vst amll7, [p5], #32 ; nopx                     ; nopm //  Delay Slot 5
202  nopv                          ; nopa                  ; nopb                ; vst amlh7, [p5, #0]  ; nopx                     ; nopm //  Delay Slot 4
203  nopv                          ; nopa                  ; nopb                ; vst amhl7, [p4], #32 ; nopx                     ; nopm //  Delay Slot 3
204  nopv                          ; nopa                  ; nopb                ; vst amhh7, [p4, #0]  ; nopx                     ; nopm //  Delay Slot 2
205  nopv                          ; nopa                  ; nopb                ; nops                 ; nopx                     ; nopm //  Delay Slot 1

The cool-down phase is shown in Listing 9. It differs from the loop body in two key ways. First, no preloading of the next 2×4 block of output tiles is required. Second, most of the VST operations writing the last block in accumulation registers 4–7 are exposed, meaning they cannot be hidden behind VMAC.F operations. In line 200, the ret lr operation is issued. It has a six-cycle latency, so the five instructions in lines 201–205 fill its delay slots.

The cool-down phase contains a total of 47 instructions out of which 32 contain VMAC.F operations.

Kernel Efficiency

Our XDNA1 tensor contraction kernel has the following utilization of the vector unit in the three parts:

  • Warm-up phase: 32 out of 49 instructions contain VMAC.F operations.

  • Hardware loop: All 64 instructions in the loop body contain VMAC.F operations. The loop executes three times, yielding 192 instructions containing VMAC.F operations.

  • Cool-down phase: 32 out of 47 instructions contain VMAC.F operations.

In summary, the kernel executes 288 instructions, of which 256 contain VMAC.F operations. This leads to a theoretical utilization of 89%. In other words, a compute tile running at 1.8 GHz would execute 1.6×10⁹ BF16 4×8×4 operations per second. This is equivalent to a theoretical floating-point throughput of 410 BF16 GFLOPS.
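The numbers above can be reproduced with a few lines of arithmetic. The sketch counts one multiplication and one addition per multiply-accumulate, so a BF16 4×8×4 operation contributes 2·4·8·4 = 256 floating-point operations; the variable names are chosen for illustration.

```python
# Theoretical utilization and throughput from the instruction counts.
WARMUP = (49, 32)     # (instructions, instructions with VMAC.F)
LOOP_BODY = (64, 64)
COOLDOWN = (47, 32)
ITERATIONS = 3
CLOCK_HZ = 1.8e9
FLOPS_PER_VMAC = 2 * 4 * 8 * 4  # multiply and add for a 4x8x4 product

total = WARMUP[0] + ITERATIONS * LOOP_BODY[0] + COOLDOWN[0]
with_vmac = WARMUP[1] + ITERATIONS * LOOP_BODY[1] + COOLDOWN[1]

utilization = with_vmac / total                 # about 0.89
vmac_per_second = utilization * CLOCK_HZ        # about 1.6e9
gflops = vmac_per_second * FLOPS_PER_VMAC / 1e9  # about 410

print(f"{with_vmac}/{total} = {utilization:.1%}, {gflops:.1f} BF16 GFLOPS")
```

Because the warm-up and cool-down phases have a fixed cost, the utilization of this kernel structure would increase with the number of loop iterations.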

We implemented a benchmark in which the tensor contraction kernel is called repeatedly in a loop on the NPU. Benchmarking the kernel on an XDNA1 NPU (AMD Ryzen 7 8700G), we achieved 398 BF16 GFLOPS. The benchmarking code is available from our xdna repository. To run the benchmark, execute the following commands:

git clone https://github.com/scalable-analyses/xdna
cd xdna
make run

Note

The installation of the MLIR-AIE compiler aiecc and Peano is documented in the mlir-aie repository. The Makefile assumes that the environment variable PEANO_INSTALL_DIR contains the path to Peano and that aiecc.py is available in the path. Use xrt-smi configure --pmode turbo to set the NPU clock to its maximum frequency.