XDNA1 Kernel#

Tiled tensor contractions use kernels that operate on subtensors (tiles) as their core building block. Accelerating these kernels is crucial to the overall performance of tensor workloads. As discussed in the chapter Instruction Set Architecture, the XDNA cores’ floating-point throughput is driven by VMAC.F. On XDNA1, the highest-throughput VMAC.F operation performs a BF16 4×8×4 matrix multiplication. On XDNA2, the highest-throughput VMAC.F computes a BFP16 8×8×8 matrix multiplication.

The goal of this chapter is to develop a high-performance tensor contraction kernel for XDNA1. We pursue this goal through three sub-objectives:

Maximize the rate at which VMAC.F operations are issued. In the best case, a VMAC.F operation is issued every clock cycle.
Hide all other operations, e.g., loads, stores, and pointer arithmetic, behind computations.
Minimize bank conflicts.

Note

We discuss the design and implementation of a representative best-case kernel. Many other variants of this kernel are required for implementing a flexible and high-performance tensor compiler. We plan to extend the Hello XDNA website with generalizations of the kernels presented in this chapter and the XDNA2 Kernel chapter in the future, e.g., through just-in-time code generation.

Data Layout#

_images/bf16_vmac_data.svg — Fig. 1 Register data layout for the BF16 4×8×4 `VMAC.F (<BMLd>|<BMHd>), (<BMLm>|<BMHm>), <Xr>, <Xs>, <Rn>` operation computing `[m,k],[k,n]->[m,n]` with `|m|=4`, `|k|=8`, and `|n|=4`.#

Fig. 1 illustrates the register data layout required for the BF16 4×8×4 VMAC.F operation. The operation multiplies a BF16 M×K matrix in one X register with a BF16 K×N matrix in another X register and adds the result to an FP32 M×N matrix in an accumulation register. All matrices are stored in row-major order. Note that this is equivalent to a column-major matrix-matrix multiplication if we exchange the two operands.
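The equivalence of the two views can be checked numerically. The following minimal Python sketch uses plain lists as flat buffers and small integer values; the helper names `matmul_row_major` and `matmul_col_major` are illustrative and not part of any XDNA tooling.

```python
# Sketch: a row-major product C = A * B equals a column-major product
# C^T = B^T * A^T computed on the very same flat buffers.
M, K, N = 4, 8, 4

def matmul_row_major(a, b, m, k, n):
    """C[m,n] = A[m,k] * B[k,n]; all buffers are flat and row-major."""
    c = [0] * (m * n)
    for i in range(m):
        for j in range(n):
            c[i * n + j] = sum(a[i * k + p] * b[p * n + j] for p in range(k))
    return c

def matmul_col_major(a, b, m, k, n):
    """C[m,n] = A[m,k] * B[k,n]; all buffers are flat and column-major."""
    c = [0] * (m * n)
    for i in range(m):
        for j in range(n):
            c[j * m + i] = sum(a[p * m + i] * b[j * k + p] for p in range(k))
    return c

a = [(3 * i + 1) % 7 for i in range(M * K)]   # row-major M x K
b = [(5 * i + 2) % 11 for i in range(K * N)]  # row-major K x N

row = matmul_row_major(a, b, M, K, N)
# Same buffers, operands exchanged: b is read as a column-major N x K matrix,
# a as a column-major K x M matrix, and the result as column-major N x M.
col = matmul_col_major(b, a, N, K, M)
```

Both calls fill the output buffer with identical values, so a kernel written for one convention can serve the other by swapping its inputs.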

We can also write the operation as an einsum, assuming that all tensors are in row-major order: [m,k],[k,n]->[m,n]. The einsum notation becomes helpful when considering the more complex data layout of the entire kernel.

_images/xdna1_data_and_registers.svg — Fig. 2 Data layout of the tensor contraction kernel in scratchpad memory (L1) of an executing compute tile. The kernel computes `[m1,k1,m0,k0],[k1,n1,k0,n0]->[m1,n1,m0,n0]`, where the dimension sizes `|m0|=4`, `|k0|=8`, and `|n0|=4` are fixed. The first computation of a 2×4 block of output tiles is highlighted in darker colors.#

Fig. 2 illustrates the data layout of the entire tensor contraction kernel. It covers the two input tensors in0 and in1, as well as the output tensor out. The three tensors are tiled based on the requirements of the BF16 VMAC.F instruction. In detail, in0 has tiles of size M₀×K₀=4×8, in1 of size K₀×N₀=8×4, and out of size M₀×N₀=4×4. The tensor contraction kernel operates on three additional dimensions of type M, K and N. The tiles are stored in row-major order; accordingly, in0 uses M₁×K₁×M₀×K₀, while in1 uses K₁×N₁×K₀×N₀, and out uses M₁×N₁×M₀×N₀.

As before, we can write the operation compactly as an einsum, assuming that all tensors are stored in row-major order: [m1,k1,m0,k0],[k1,n1,k0,n0]->[m1,n1,m0,n0]. The tile size requirement means that |m0|=4, |k0|=8, and |n0|=4. We discuss limitations on m1, k1 and n1 in the following sections.
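To make the tiled layout concrete, the following Python sketch evaluates the einsum on flat row-major buffers. It is a plain reference under simplifying assumptions: integers stand in for BF16 and FP32 values, and the function names `contract` and `untiled` are hypothetical helpers introduced only for illustration.

```python
# Sketch of the blocked contraction
#   [m1,k1,m0,k0],[k1,n1,k0,n0]->[m1,n1,m0,n0]
# on flat row-major buffers. The innermost three loops correspond to one
# 4x8x4 VMAC.F update; integers are used so that results compare exactly.
M0, K0, N0 = 4, 8, 4  # fixed by the BF16 VMAC.F operation

def contract(in0, in1, out, m1, k1, n1):
    for bm in range(m1):
        for bn in range(n1):
            o_off = (bm * n1 + bn) * M0 * N0       # tile of out
            for bk in range(k1):
                a_off = (bm * k1 + bk) * M0 * K0   # tile of in0
                b_off = (bk * n1 + bn) * K0 * N0   # tile of in1
                for m in range(M0):
                    for n in range(N0):
                        for k in range(K0):
                            out[o_off + m * N0 + n] += (
                                in0[a_off + m * K0 + k] * in1[b_off + k * N0 + n]
                            )
    return out

def untiled(buf, d1, d2, t1, t2):
    """Convert a [d1,d2,t1,t2] tiled buffer into a list of matrix rows."""
    return [[buf[((i // t1) * d2 + j // t2) * t1 * t2 + (i % t1) * t2 + j % t2]
             for j in range(d2 * t2)] for i in range(d1 * t1)]

m1, k1, n1 = 8, 4, 8  # sizes of the example kernel in this chapter
in0 = [(7 * i + 3) % 5 for i in range(m1 * k1 * M0 * K0)]
in1 = [(11 * i + 1) % 3 for i in range(k1 * n1 * K0 * N0)]
out = contract(in0, in1, [0] * (m1 * n1 * M0 * N0), m1, k1, n1)
```

Converting all three buffers with `untiled` and comparing against an ordinary matrix product is a convenient way to validate a tiled kernel.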

Design Decisions#

We have already discussed the data layout of our tensor contraction kernel. In doing so, we identified requirements on the tiling and derived an einsum that summarizes the contraction computed by the targeted kernel. The dimensions m0, k0 and n0 are consumed by the VMAC.F operation, while the handling of dimensions m1, k1 and n1 is still unspecified.

Before introducing design decisions for the kernel, we recapitulate key hardware properties that have to be considered in the kernel design:

1. Loading a 64-byte accumulation register requires two 32-byte loads (VLDA). A load has a 7-cycle latency. An instruction can contain a single load or store operation accessing the accumulation registers.
2. VMAC.F reads the accumulation register in its third cycle (forwarding).
3. Combining properties 1 and 2, we can issue a dependent VMAC.F operation at the earliest in the sixth cycle of the second 32-byte load. In other words, the second load and the dependent VMAC.F have to be five cycles apart.
4. A VMAC.F operation has a latency of six cycles.
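The timing argument behind the five-cycle distance can be written down as a short calculation. The Python sketch below only restates the latencies listed above; the constant and function names are illustrative.

```python
# Sketch of the timing argument from the properties above. Cycle numbers
# are relative to the issue cycle of the second 32-byte VLDA (cycle 0).
LOAD_LATENCY = 7    # cycles until a VLDA result is available
VMAC_ACC_READ = 3   # VMAC.F reads its accumulator in its third cycle

def earliest_dependent_vmac(load_issue_cycle):
    """Earliest issue cycle of a VMAC.F accumulating into the loaded register."""
    data_ready = load_issue_cycle + LOAD_LATENCY
    # The accumulator is read (VMAC_ACC_READ - 1) cycles after issue.
    return data_ready - (VMAC_ACC_READ - 1)

distance = earliest_dependent_vmac(0)  # five cycles after the second load
```

An issue offset of five corresponds to the sixth cycle of the load, since the load itself occupies offset zero.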

We make the following design decisions in our kernel:

Output Stationary

Load each output tensor value to the register file exactly once. This means that an output value is kept in the respective accumulation register until all updates have been applied through VMAC.F operations. Storing intermediate values and loading them back into the registers is challenging because loads and stores must be in different instructions to avoid bank conflicts.

Register Blocking

|m1| and |n1| must be multiples of the register-blocking size. Our example kernel uses a 2×4 register blocking scheme for the output tensor. This means that we use eight accumulation registers to hold the values of the eight output tiles as shown in Fig. 2.

Due to the 2×4 blocking, each loaded in0 tile is used in four VMAC.F operations, and each in1 tile is used in two operations. This reuse is required to hide the register data transfer behind computation. For example, we could not achieve this with a 2×2 blocking.
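One way to see why the 2×4 blocking suffices while 2×2 does not is to count 32-byte loads per k1 step. The following sketch is a simplified model, not a statement about the hardware: it assumes one VMAC.F and two loads (load units A and B) per instruction, and it spreads the loads of the next output block evenly over the k1 steps of the current block. The function name `load_budget` is hypothetical.

```python
# Sketch: load operations needed per k1 step versus load slots available,
# assuming two 32-byte loads per instruction (load units A and B) and one
# VMAC.F per instruction. Accumulator loads are spread over all k1 steps
# of a block; k1 = 4 matches the example kernel in this chapter.
def load_budget(block_m, block_n, k1=4):
    vmacs = block_m * block_n               # instructions per k1 step
    slots = 2 * vmacs                       # load slots in those instructions
    in0_loads = 2 * block_m                 # two 32-byte halves per 64-byte tile
    in1_loads = 2 * block_n
    acc_loads = 2 * block_m * block_n / k1  # next output block, amortized
    return in0_loads + in1_loads + acc_loads, slots

needed_2x4, slots_2x4 = load_budget(2, 4)   # 16 loads needed, 16 slots
needed_2x2, slots_2x2 = load_budget(2, 2)   # 10 loads needed, 8 slots
```

Under these assumptions the 2×4 scheme exactly fills the available load slots, whereas a 2×2 scheme would need more loads than there are slots.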

Linear Contraction Dimension

The k1 dimension is handled with linear code without loop structures. This allows for different combinations of operations per block and is necessary for register preloading.

Single Hardware Loop

The m1 and n1 dimensions are represented by a single hardware loop. The first and last 2×4 blocks are computed outside of this loop, forming a warm-up phase and cool-down phase.

Double Buffering: Accumulation Registers

A 2×4 block requires eight out of sixteen available accumulation registers. We alternate registers BML0–BML3 and BMH0–BMH3 with BML4–BML7 and BMH4–BMH7 to realize a double buffering scheme. This means that while updating the tiles in one half of the accumulation registers, we load the next 2×4 block into the other half.

Double Buffering: Vector Registers

We also use double buffering for the registers holding tiles of in0. In particular, we alternate X0 and X1 with X2 and X3.
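The register assignment implied by the two double-buffering schemes can be generated mechanically. The Python sketch below emits the VMAC.F operations of one k1 step; `buffer_half` and `k_step` are illustrative parameter names, and the output order mirrors the order used in the listings of the Implementation section.

```python
# Sketch: operand registers of the VMAC.F operations for one k1 step,
# following the double-buffering scheme described above.
def vmac_sequence(buffer_half, k_step):
    ops = []
    for n in range(4):                               # four in1 tiles in X4-X7
        for m in range(2):                           # two in0 tiles
            acc = ("bml", "bmh")[m] + str(4 * buffer_half + n)
            x_in0 = "x" + str(2 * (k_step % 2) + m)  # X0/X1 or X2/X3
            x_in1 = "x" + str(4 + n)
            ops.append(f"vmac.f {acc}, {acc}, {x_in0}, {x_in1}, r0")
    return ops

for op in vmac_sequence(buffer_half=1, k_step=0):
    print(op)
```

Since the eight accumulation registers of a block are updated in rotation, consecutive updates of the same register are eight instructions apart, which exceeds the six-cycle VMAC.F latency.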

Implementation#

This section discusses the implementation of a representative XDNA1 tensor contraction kernel. The kernel computes the einsum [m1,k1,m0,k0],[k1,n1,k0,n0]->[m1,n1,m0,n0] with dimension sizes |m0|=4, |k0|=8, |n0|=4, |m1|=8, |k1|=4, and |n1|=8. It contains three parts: a warm-up phase, a hardware loop, and a cool-down phase.

Warm-Up Phase#

Listing 6 Warm-up phase (lines 7-62) of the XDNA1 kernel.#

  nopv                          ; vlda amlh0, [p2, #32] ; vldb wl4, [p1], #32 ; nops                 ; movxm m2, #4*4*4 * 8                     // 4(byte)*4(r)*4(t) * 8(n)                                             // out - 1 row-step
  nopv                          ; vlda amll0, [p2], m2  ; vldb wh4, [p1], #32 ; nops                 ; movx r0,  #28            ; mov p3, p2
  nopv                          ; vlda amhh0, [p2, #32] ; vldb wl0, [p0], #32 ; nops                 ; movxm m7, #2*4*8 * 4 - 32                // 2(byte)*4(r)*8(s) * 4(k) - 32 (half-block)                           // in0 - m-step
  nopv                          ; vlda amhl0, [p2], #64 ; vldb wh0, [p0], m7  ; nops                 ; nopx                     ; mov p4, p2
  nopv                          ; vlda amll1, [p3, #64] ; vldb wl1, [p0], #32 ; nops                 ; movxm m0, #32 - (2*4*8*4)                // 32(half-block) - (m7+32)                                             // in0 - k-step
  nopv                          ; vlda amlh1, [p3, #96] ; vldb wh1, [p0], m0  ; padds [p3], #128     ; nopx                     ; mov p5, p3
  nopv                          ; vlda amhl1, [p2], #32 ; vldb wl5, [p1], #32 ; nops                 ; movxm m1, #2*8*4 * 8 - 7 * 32            // 2(byte)*8(s)*4(t) * 8(n) - 7(blocking in n *2 - 1) * 32(half-block)  // in1 - k-step
  nopv                          ; vlda amhh1, [p2], #32 ; vldb wh5, [p1], #32 ; nops                 ; movx r1,  #0             ; mov r2, #8/4  // 8(n)/4(blocking in n)
  nopv                          ; vlda amll2, [p3], #32 ; vldb wl6, [p1], #32 ; nops                 ; movxm r3, #32 - 2*4*8 * 2 * 4            // 32(half-block) - 2(byte)*4(r)*8(s) * 2(blocking in m) * 4(k)         // in0 - n-step
  nopv                          ; vlda amlh2, [p3], #32 ; vldb wh6, [p1], #32 ; nops                 ; movxm r4, #32                            // 32(half-block)                                                       // in0 - m-step // out - n-step
  vmac.f bml0, bml0, x0, x4, r0 ; vlda amhl2, [p2], #32 ; vldb wl7, [p1], #32 ; nops                 ; movxm r5, #32 - 2*8*4 * 8 * (4-1)        // 32(half-block) - 2(byte)*8(s)*4(t) * 8(n) * (k-1)                    // in1 - n-step
  nopv                          ; vlda amhh2, [p2], #32 ; vldb wh7, [p1], m1  ; nops                 ; movxm r6, #32 - 2*8*4 * 8 * 4            // 32(half-block) - 2(byte)*8(s)*4(t) * 8(n) * 4(k)                     // in1 - m-step
  vmac.f bmh0, bmh0, x1, x4, r0 ; vlda amll3, [p3], #32 ; nopb                ; nops                 ; movxm r7, #32 + 4*4*4 * 8                // 32(half-block) + 4(byte)*4(r)*4(t) * 8(n)                            // out - m-step
  nopv                          ; vlda amlh3, [p3], #32 ; nopb                ; nops                 ; add r1,  r1,  #1         ; nopm
  vmac.f bml1, bml1, x0, x5, r0 ; vlda amhl3, [p2], #32 ; nopb                ; nops                 ; ltu r27, r1,  r2         ; nopm
  nopv                          ; vlda amhh3, [p2], #32 ; nopb                ; nops                 ; sel.nez r28, r4, r7, r27 ; nopm
  vmac.f bmh1, bmh1, x1, x5, r0 ; vlda wl2,   [p0], #32 ; vldb wl4, [p1], #32 ; nops                 ; sel.nez r28, r5, r6, r27 ; mov m5, r28
  nopv                          ; vlda wh2,   [p0], m7  ; vldb wh4, [p1], #32 ; nops                 ; sel.nez r28, r3, r4, r27 ; mov m4, r28
  vmac.f bml2, bml2, x0, x6, r0 ; vlda wl3,   [p0], #32 ; vldb wl5, [p1], #32 ; nops                 ; mul r1,  r1,  r27        ; mov m3, r28
  nopv                          ; vlda wh3,   [p0], m0  ; vldb wh5, [p1], #32 ; nops                 ; nopx                     ; nopm
  vmac.f bmh2, bmh2, x1, x6, r0 ; vlda amll4, [p3], #32 ; vldb wl6, [p1], #32 ; nops                 ; nopx                     ; nopm
  nopv                          ; vlda amlh4, [p3], #32 ; vldb wh6, [p1], #32 ; nops                 ; nopx                     ; nopm
  vmac.f bml3, bml3, x0, x7, r0 ; vlda amhl4, [p2], #32 ; vldb wl7, [p1], #32 ; nops                 ; movxm ls, #.l_start
  nopv                          ; vlda amhh4, [p2], #32 ; vldb wh7, [p1], m1  ; nops                 ; movxm le, #.l_end
  vmac.f bmh3, bmh3, x1, x7, r0 ; vlda wl0,   [p0], #32 ; vldb wl4, [p1], #32 ; nops                 ; movxm lc, #3                            // (8(m)/2(blocking in m) * 8(n)/4(blocking in n) - 2(warmup + cool-down)) /2(iterations in loop)

// k=8
  vmac.f bml0, bml0, x2, x4, r0 ; vlda wh0,   [p0], m7  ; vldb wh4, [p1], #32 ; nops                 ; nopx                     ; nopm
  vmac.f bmh0, bmh0, x3, x4, r0 ; vlda wl1,   [p0], #32 ; vldb wl5, [p1], #32 ; nops                 ; nopx                     ; nopm
  vmac.f bml1, bml1, x2, x5, r0 ; vlda wh1,   [p0], m0  ; vldb wh5, [p1], #32 ; nops                 ; nopx                     ; nopm
  vmac.f bmh1, bmh1, x3, x5, r0 ; vlda amll5, [p3], #32 ; vldb wl6, [p1], #32 ; nops                 ; nopx                     ; nopm
  vmac.f bml2, bml2, x2, x6, r0 ; vlda amlh5, [p3], #32 ; vldb wh6, [p1], #32 ; nops                 ; nopx                     ; nopm
  vmac.f bmh2, bmh2, x3, x6, r0 ; vlda amhl5, [p2], #32 ; vldb wl7, [p1], #32 ; nops                 ; nopx                     ; nopm
  vmac.f bml3, bml3, x2, x7, r0 ; vlda amhh5, [p2], #32 ; vldb wh7, [p1], m1  ; nops                 ; nopx                     ; nopm
  vmac.f bmh3, bmh3, x3, x7, r0 ; vlda wl2,   [p0], #32 ; vldb wl4, [p1], #32 ; nops                 ; nopx                     ; nopm

// k=16
  vmac.f bml0, bml0, x0, x4, r0 ; vlda wh2,   [p0], m7  ; vldb wh4, [p1], #32 ; nops                 ; nopx                     ; nopm
  vmac.f bmh0, bmh0, x1, x4, r0 ; vlda wl3,   [p0], #32 ; vldb wl5, [p1], #32 ; nops                 ; nopx                     ; nopm
  vmac.f bml1, bml1, x0, x5, r0 ; vlda wh3,   [p0], m3  ; vldb wh5, [p1], #32 ; nops                 ; nopx                     ; nopm
  vmac.f bmh1, bmh1, x1, x5, r0 ; vlda amll6, [p3], #32 ; vldb wl6, [p1], #32 ; nops                 ; nopx                     ; nopm
  vmac.f bml2, bml2, x0, x6, r0 ; vlda amlh6, [p3], #32 ; vldb wh6, [p1], #32 ; nops                 ; nopx                     ; nopm
  vmac.f bmh2, bmh2, x1, x6, r0 ; vlda amhl6, [p2], #32 ; vldb wl7, [p1], #32 ; nops                 ; nopx                     ; nopm
  vmac.f bml3, bml3, x0, x7, r0 ; vlda amhh6, [p2], #32 ; vldb wh7, [p1], m4  ; nops                 ; nopx                     ; nopm
  vmac.f bmh3, bmh3, x1, x7, r0 ; vlda wl0,   [p0], #32 ; vldb wl4, [p1], #32 ; nops                 ; nopx                     ; nopm

// k=24
  vmac.f bml0, bml0, x2, x4, r0 ; vlda wh0,   [p0], m7  ; vldb wh4, [p1], #32 ; nops                 ; add r1,  r1,  #1         ; nopm
  vmac.f bmh0, bmh0, x3, x4, r0 ; vlda wl1,   [p0], #32 ; vldb wl5, [p1], #32 ; nops                 ; ltu r27, r1,  r2         ; nopm
  vmac.f bml1, bml1, x2, x5, r0 ; vlda wh1,   [p0], m0  ; vldb wh5, [p1], #32 ; nops                 ; sel.nez r28, r4, r7, r27 ; mov m6, m5
  vmac.f bmh1, bmh1, x3, x5, r0 ; vlda amll7, [p3], #32 ; vldb wl6, [p1], #32 ; nops                 ; sel.nez r28, r5, r6, r27 ; mov m5, r28
  vmac.f bml2, bml2, x2, x6, r0 ; vlda amlh7, [p3], m5  ; vldb wh6, [p1], #32 ; nops                 ; sel.nez r28, r3, r4, r27 ; mov m4, r28
  vmac.f bmh2, bmh2, x3, x6, r0 ; vlda amhl7, [p2], #32 ; vldb wl7, [p1], #32 ; nops                 ; mul r1,  r1,  r27        ; mov m3, r28
  vmac.f bml3, bml3, x2, x7, r0 ; vlda amhh7, [p2], m5  ; vldb wh7, [p1], m1  ; nops                 ; nopx                     ; nopm
  vmac.f bmh3, bmh3, x3, x7, r0 ; vlda wl2,   [p0], #32 ; vldb wl4, [p1], #32 ; vst amll0, [p5], #32 ; nopx                     ; nopm
// k=32

Listing 6 shows the warm-up phase of the kernel. Load unit A loads data into the accumulation registers. The sixteen 32-byte VLDA operations in lines 7–22 load the first 2×4 block of output tiles into the accumulation registers BML0–BML3 and BMH0–BMH3. Additionally, in lines 7–18, load unit B is used to load the first two input tiles of in0 into vector registers X0 and X1, as well as the first four input tiles of in1 into X4–X7. The register mapping is also illustrated in Fig. 2. The first update of the 2×4 block is performed by the VMAC.F operations in lines 17–31.

In lines 15–19, the warm-up phase initializes the general-purpose registers R3–R7. These are used throughout the kernel, and their values are copied to modifier registers for subsequent updates of the addresses in pointer registers.
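The immediate values written by the MOVXM operations in Listing 6 follow from the dimension sizes and the register blocking. The Python sketch below recomputes them from the expressions and comments in the listing; the variable names mirror the target registers, while the helper constants (`HALF`, `BLOCK_M`, and so on) are illustrative.

```python
# Sketch: pointer increments from the warm-up phase, derived from the
# dimension sizes. Lowercase names mirror the target registers in Listing 6.
BF16, FP32 = 2, 4            # bytes per element
M0, K0, N0 = 4, 8, 4         # VMAC.F tile sizes
K1, N1 = 4, 8                # sizes used by the example kernel
BLOCK_M, BLOCK_N = 2, 4      # register blocking
HALF = 32                    # bytes moved by one VLDA, VLDB or VST

in0_tile = BF16 * M0 * K0    # 64 bytes
in1_tile = BF16 * K0 * N0    # 64 bytes
out_tile = FP32 * M0 * N0    # 64 bytes

m2 = out_tile * N1                              # out: one row of tiles
m7 = in0_tile * K1 - HALF                       # in0: m-step
m0 = HALF - in0_tile * K1                       # in0: k-step
m1 = in1_tile * N1 - (2 * BLOCK_N - 1) * HALF   # in1: k-step
r3 = HALF - in0_tile * BLOCK_M * K1             # in0: n-step
r4 = HALF                                       # half-block step
r5 = HALF - in1_tile * N1 * (K1 - 1)            # in1: n-step
r6 = HALF - in1_tile * N1 * K1                  # in1: m-step
r7 = HALF + out_tile * N1                       # out: m-step
```

Negative values rewind a pointer, for example back to the start of a row of in0 tiles before the next block is processed.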

We also see that load unit A is used to load the next 2×4 block of output tiles to accumulation registers BML4–BML7 and BMH4–BMH7 (lines 27–30, 37–40, 47–50, and 57–60).

The first instruction block (lines 7–31) contains eight VMAC.F operations and 17 NOPV operations, thus leaving the vector unit partially unutilized. Every instruction in the following three eight-instruction blocks contains a VMAC.F operation, meaning that the BF16 matrix multiplication unit of the core is fully utilized. In summary, the warm-up phase has a total of 49 instructions, out of which 32 contain VMAC.F operations.

Hardware Loop#

We must perform the loop setup at least 64 bytes before the loop’s start address. The first and last instructions in the loop must be 16-byte aligned. Additionally, the last instruction covered by a loop must have a size of 16 bytes. An instruction that contains operations for all functional units is 16 bytes wide. A NOP instruction is only two bytes wide.
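These placement rules lend themselves to an automated check. The following Python sketch is purely illustrative: `check_loop` is a hypothetical helper, the addresses are made up, and the setup distance is measured from the address of the last setup instruction, which is an assumption on our side.

```python
# Sketch: check the hardware-loop placement rules stated above for a
# hypothetical instruction stream given as (address, size) pairs in bytes.
def check_loop(setup_addr, body):
    """body: list of (address, size) for the instructions inside the loop."""
    first_addr, _ = body[0]
    last_addr, last_size = body[-1]
    return {
        "setup_distance": first_addr - setup_addr >= 64,
        "first_aligned": first_addr % 16 == 0,
        "last_aligned": last_addr % 16 == 0,
        "last_full_width": last_size == 16,
    }

# Hypothetical layout: 64 full-width instructions starting at address 1024,
# with the last setup instruction located 400 bytes earlier.
body = [(1024 + 16 * i, 16) for i in range(64)]
report = check_loop(setup_addr=624, body=body)
```

A loop body ending in a two-byte NOP would fail the last check, which is why the final instruction of a loop has to be a full-width instruction.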

Listing 7 Hardware loop setup (lines 29-31: chars 104+) of the XDNA1 kernel.#

movxm ls, #.l_start
movxm le, #.l_end
movxm lc, #3                            // (8(m)/2(blocking in m) * 8(n)/4(blocking in n) - 2(warmup + cool-down)) /2(iterations in loop)

Listing 7 shows the operations configuring the hardware loop. movxm ls, #.l_start copies the address of the first loop instruction into the loop start register. The operation movxm le, #.l_end copies the address of the last loop instruction into the loop end register. movxm lc, #3 copies the value 3 into the loop counter register.
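The loop counter value follows the formula given in the comment next to `movxm lc`. A minimal Python sketch of that formula, with illustrative parameter names, reads as follows.

```python
# Sketch: loop counter for a given problem size, following the comment
# next to `movxm lc` in Listing 7.
def loop_count(m1, n1, block_m=2, block_n=4, blocks_per_iteration=2):
    blocks = (m1 // block_m) * (n1 // block_n)
    # One block each is handled by the warm-up and cool-down phases.
    remaining = blocks - 2
    assert remaining >= 0 and remaining % blocks_per_iteration == 0
    return remaining // blocks_per_iteration

lc = loop_count(m1=8, n1=8)  # (4 * 2 - 2) / 2 = 3
```

The assertion in the function reflects a restriction of this particular kernel structure: the number of 2×4 blocks must be even and at least two.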

Listing 8 Body of the loop (lines 64-148) in the XDNA1 kernel.#

.p2align 4
.l_start:
// k=0
  vmac.f bml4, bml4, x0, x4, r0 ; vlda wh2,   [p0], m7  ; vldb wh4, [p1], #32 ; vst amlh0, [p5], #32 ; nopx                     ; nopm
  vmac.f bmh4, bmh4, x1, x4, r0 ; vlda wl3,   [p0], #32 ; vldb wl5, [p1], #32 ; vst amhl0, [p4], #32 ; nopx                     ; nopm
  vmac.f bml5, bml5, x0, x5, r0 ; vlda wh3,   [p0], m0  ; vldb wh5, [p1], #32 ; vst amhh0, [p4], #32 ; nopx                     ; nopm
  vmac.f bmh5, bmh5, x1, x5, r0 ; vlda amll0, [p3], #32 ; vldb wl6, [p1], #32 ; nops                 ; nopx                     ; nopm
  vmac.f bml6, bml6, x0, x6, r0 ; vlda amlh0, [p3], #32 ; vldb wh6, [p1], #32 ; nops                 ; nopx                     ; nopm
  vmac.f bmh6, bmh6, x1, x6, r0 ; vlda amhl0, [p2], #32 ; vldb wl7, [p1], #32 ; nops                 ; nopx                     ; nopm
  vmac.f bml7, bml7, x0, x7, r0 ; vlda amhh0, [p2], #32 ; vldb wh7, [p1], m1  ; nops                 ; nopx                     ; nopm
  vmac.f bmh7, bmh7, x1, x7, r0 ; vlda wl0,   [p0], #32 ; vldb wl4, [p1], #32 ; vst amll1, [p5], #32 ; nopx                     ; nopm

// k=8
  vmac.f bml4, bml4, x2, x4, r0 ; vlda wh0,   [p0], m7  ; vldb wh4, [p1], #32 ; vst amlh1, [p5], #32 ; nopx                     ; nopm
  vmac.f bmh4, bmh4, x3, x4, r0 ; vlda wl1,   [p0], #32 ; vldb wl5, [p1], #32 ; vst amhl1, [p4], #32 ; nopx                     ; nopm
  vmac.f bml5, bml5, x2, x5, r0 ; vlda wh1,   [p0], m0  ; vldb wh5, [p1], #32 ; vst amhh1, [p4], #32 ; nopx                     ; nopm
  vmac.f bmh5, bmh5, x3, x5, r0 ; vlda amll1, [p3], #32 ; vldb wl6, [p1], #32 ; nops                 ; nopx                     ; nopm
  vmac.f bml6, bml6, x2, x6, r0 ; vlda amlh1, [p3], #32 ; vldb wh6, [p1], #32 ; nops                 ; nopx                     ; nopm
  vmac.f bmh6, bmh6, x3, x6, r0 ; vlda amhl1, [p2], #32 ; vldb wl7, [p1], #32 ; nops                 ; nopx                     ; nopm
  vmac.f bml7, bml7, x2, x7, r0 ; vlda amhh1, [p2], #32 ; vldb wh7, [p1], m1  ; nops                 ; nopx                     ; nopm
  vmac.f bmh7, bmh7, x3, x7, r0 ; vlda wl2,   [p0], #32 ; vldb wl4, [p1], #32 ; vst amll2, [p5], #32 ; nopx                     ; nopm

// k=16
  vmac.f bml4, bml4, x0, x4, r0 ; vlda wh2,   [p0], m7  ; vldb wh4, [p1], #32 ; vst amlh2, [p5], #32 ; nopx                     ; nopm
  vmac.f bmh4, bmh4, x1, x4, r0 ; vlda wl3,   [p0], #32 ; vldb wl5, [p1], #32 ; vst amhl2, [p4], #32 ; nopx                     ; nopm
  vmac.f bml5, bml5, x0, x5, r0 ; vlda wh3,   [p0], m3  ; vldb wh5, [p1], #32 ; vst amhh2, [p4], #32 ; nopx                     ; nopm
  vmac.f bmh5, bmh5, x1, x5, r0 ; vlda amll2, [p3], #32 ; vldb wl6, [p1], #32 ; nops                 ; nopx                     ; nopm
  vmac.f bml6, bml6, x0, x6, r0 ; vlda amlh2, [p3], #32 ; vldb wh6, [p1], #32 ; nops                 ; nopx                     ; nopm
  vmac.f bmh6, bmh6, x1, x6, r0 ; vlda amhl2, [p2], #32 ; vldb wl7, [p1], #32 ; nops                 ; nopx                     ; nopm
  vmac.f bml7, bml7, x0, x7, r0 ; vlda amhh2, [p2], #32 ; vldb wh7, [p1], m4  ; nops                 ; nopx                     ; nopm
  vmac.f bmh7, bmh7, x1, x7, r0 ; vlda wl0,   [p0], #32 ; vldb wl4, [p1], #32 ; vst amll3, [p5], #32 ; nopx                     ; nopm

// k=24
  vmac.f bml4, bml4, x2, x4, r0 ; vlda wh0,   [p0], m7  ; vldb wh4, [p1], #32 ; vst amlh3, [p5], m6  ; add r1,  r1,  #1         ; nopm
  vmac.f bmh4, bmh4, x3, x4, r0 ; vlda wl1,   [p0], #32 ; vldb wl5, [p1], #32 ; vst amhl3, [p4], #32 ; ltu r27, r1,  r2         ; nopm
  vmac.f bml5, bml5, x2, x5, r0 ; vlda wh1,   [p0], m0  ; vldb wh5, [p1], #32 ; vst amhh3, [p4], m6  ; sel.nez r28, r4, r7, r27 ; mov m6, m5
  vmac.f bmh5, bmh5, x3, x5, r0 ; vlda amll3, [p3], #32 ; vldb wl6, [p1], #32 ; nops                 ; sel.nez r28, r5, r6, r27 ; mov m5, r28
  vmac.f bml6, bml6, x2, x6, r0 ; vlda amlh3, [p3], m5  ; vldb wh6, [p1], #32 ; nops                 ; sel.nez r28, r3, r4, r27 ; mov m4, r28
  vmac.f bmh6, bmh6, x3, x6, r0 ; vlda amhl3, [p2], #32 ; vldb wl7, [p1], #32 ; nops                 ; mul r1,  r1,  r27        ; mov m3, r28
  vmac.f bml7, bml7, x2, x7, r0 ; vlda amhh3, [p2], m5  ; vldb wh7, [p1], m1  ; nops                 ; nopx                     ; nopm
  vmac.f bmh7, bmh7, x3, x7, r0 ; vlda wl2,   [p0], #32 ; vldb wl4, [p1], #32 ; vst amll4, [p5], #32 ; nopx                     ; nopm
// k=32

// k=0
  vmac.f bml0, bml0, x0, x4, r0 ; vlda wh2,   [p0], m7  ; vldb wh4, [p1], #32 ; vst amlh4, [p5], #32 ; nopx                     ; nopm
  vmac.f bmh0, bmh0, x1, x4, r0 ; vlda wl3,   [p0], #32 ; vldb wl5, [p1], #32 ; vst amhl4, [p4], #32 ; nopx                     ; nopm
  vmac.f bml1, bml1, x0, x5, r0 ; vlda wh3,   [p0], m0  ; vldb wh5, [p1], #32 ; vst amhh4, [p4], #32 ; nopx                     ; nopm
  vmac.f bmh1, bmh1, x1, x5, r0 ; vlda amll4, [p3], #32 ; vldb wl6, [p1], #32 ; nops                 ; nopx                     ; nopm
  vmac.f bml2, bml2, x0, x6, r0 ; vlda amlh4, [p3], #32 ; vldb wh6, [p1], #32 ; nops                 ; nopx                     ; nopm
  vmac.f bmh2, bmh2, x1, x6, r0 ; vlda amhl4, [p2], #32 ; vldb wl7, [p1], #32 ; nops                 ; nopx                     ; nopm
  vmac.f bml3, bml3, x0, x7, r0 ; vlda amhh4, [p2], #32 ; vldb wh7, [p1], m1  ; nops                 ; nopx                     ; nopm
  vmac.f bmh3, bmh3, x1, x7, r0 ; vlda wl0,   [p0], #32 ; vldb wl4, [p1], #32 ; vst amll5, [p5], #32 ; nopx                     ; nopm

// k=8
  vmac.f bml0, bml0, x2, x4, r0 ; vlda wh0,   [p0], m7  ; vldb wh4, [p1], #32 ; vst amlh5, [p5], #32 ; nopx                     ; nopm
  vmac.f bmh0, bmh0, x3, x4, r0 ; vlda wl1,   [p0], #32 ; vldb wl5, [p1], #32 ; vst amhl5, [p4], #32 ; nopx                     ; nopm
  vmac.f bml1, bml1, x2, x5, r0 ; vlda wh1,   [p0], m0  ; vldb wh5, [p1], #32 ; vst amhh5, [p4], #32 ; nopx                     ; nopm
  vmac.f bmh1, bmh1, x3, x5, r0 ; vlda amll5, [p3], #32 ; vldb wl6, [p1], #32 ; nops                 ; nopx                     ; nopm
  vmac.f bml2, bml2, x2, x6, r0 ; vlda amlh5, [p3], #32 ; vldb wh6, [p1], #32 ; nops                 ; nopx                     ; nopm
  vmac.f bmh2, bmh2, x3, x6, r0 ; vlda amhl5, [p2], #32 ; vldb wl7, [p1], #32 ; nops                 ; nopx                     ; nopm
  vmac.f bml3, bml3, x2, x7, r0 ; vlda amhh5, [p2], #32 ; vldb wh7, [p1], m1  ; nops                 ; nopx                     ; nopm
  vmac.f bmh3, bmh3, x3, x7, r0 ; vlda wl2,   [p0], #32 ; vldb wl4, [p1], #32 ; vst amll6, [p5], #32 ; nopx                     ; nopm

// k=16
  vmac.f bml0, bml0, x0, x4, r0 ; vlda wh2,   [p0], m7  ; vldb wh4, [p1], #32 ; vst amlh6, [p5], #32 ; nopx                     ; nopm
  vmac.f bmh0, bmh0, x1, x4, r0 ; vlda wl3,   [p0], #32 ; vldb wl5, [p1], #32 ; vst amhl6, [p4], #32 ; nopx                     ; nopm
  vmac.f bml1, bml1, x0, x5, r0 ; vlda wh3,   [p0], m3  ; vldb wh5, [p1], #32 ; vst amhh6, [p4], #32 ; nopx                     ; nopm
  vmac.f bmh1, bmh1, x1, x5, r0 ; vlda amll6, [p3], #32 ; vldb wl6, [p1], #32 ; nops                 ; nopx                     ; nopm
  vmac.f bml2, bml2, x0, x6, r0 ; vlda amlh6, [p3], #32 ; vldb wh6, [p1], #32 ; nops                 ; nopx                     ; nopm
  vmac.f bmh2, bmh2, x1, x6, r0 ; vlda amhl6, [p2], #32 ; vldb wl7, [p1], #32 ; nops                 ; nopx                     ; nopm
  vmac.f bml3, bml3, x0, x7, r0 ; vlda amhh6, [p2], #32 ; vldb wh7, [p1], m4  ; nops                 ; nopx                     ; nopm
  vmac.f bmh3, bmh3, x1, x7, r0 ; vlda wl0,   [p0], #32 ; vldb wl4, [p1], #32 ; vst amll7, [p5], #32 ; nopx                     ; nopm

// k=24
  vmac.f bml0, bml0, x2, x4, r0 ; vlda wh0,   [p0], m7  ; vldb wh4, [p1], #32 ; vst amlh7, [p5], m6  ; add r1,  r1,  #1         ; nopm
  vmac.f bmh0, bmh0, x3, x4, r0 ; vlda wl1,   [p0], #32 ; vldb wl5, [p1], #32 ; vst amhl7, [p4], #32 ; ltu r27, r1,  r2         ; nopm
  vmac.f bml1, bml1, x2, x5, r0 ; vlda wh1,   [p0], m0  ; vldb wh5, [p1], #32 ; vst amhh7, [p4], m6  ; sel.nez r28, r4, r7, r27 ; mov m6, m5
  vmac.f bmh1, bmh1, x3, x5, r0 ; vlda amll7, [p3], #32 ; vldb wl6, [p1], #32 ; nops                 ; sel.nez r28, r5, r6, r27 ; mov m5, r28
  vmac.f bml2, bml2, x2, x6, r0 ; vlda amlh7, [p3], m5  ; vldb wh6, [p1], #32 ; nops                 ; sel.nez r28, r3, r4, r27 ; mov m4, r28
  vmac.f bmh2, bmh2, x3, x6, r0 ; vlda amhl7, [p2], #32 ; vldb wl7, [p1], #32 ; nops                 ; mul r1,  r1,  r27        ; mov m3, r28
  vmac.f bml3, bml3, x2, x7, r0 ; vlda amhh7, [p2], m5  ; vldb wh7, [p1], m1  ; nops                 ; nopx                     ; nopm
.p2align 4
.l_end:
  vmac.f bmh3, bmh3, x3, x7, r0 ; vlda wl2,   [p0], #32 ; vldb wl4, [p1], #32 ; vst amll0, [p5], #32 ; nopx                     ; nopm
// k=32

Listing 8 shows the body of the tensor contraction kernel’s loop. In the first half of the body (lines 67–104), the values in accumulation registers with indices 4–7 are updated by the VMAC.F operations. When entering the loop body, accumulation registers 0–3 hold the results of the preceding 2×4 block of output tiles. During execution, these are written to scratchpad memory (L1) using VST operations. Simultaneously, load unit A transfers the next 2×4 block’s tiles into registers 0–3.

The second half of the loop body (lines 108–147) computes the pre-loaded 2×4 block and updates the output tiles in registers 0–3. At the same time, the data of the block computed in the first half of the loop body is written to memory, while the next block is loaded into registers 4–7.

Considering the XDNA1 vector unit, we see that every instruction in the loop body contains a VMAC.F operation. Therefore, the unit is fully utilized and all of the 64 instructions in the loop body perform a BF16 4×8×4 matrix multiplication.
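The interplay of the three kernel parts with the accumulation-register halves can be summarized in a small schedule. The Python sketch below assumes the eight 2×4 blocks of the example kernel; `block_schedule` is a hypothetical helper, and the half index 0 denotes accumulation registers 0–3 while 1 denotes registers 4–7.

```python
# Sketch: which phase computes which 2x4 output block, and which half of
# the accumulation registers holds it (0: registers 0-3, 1: registers 4-7).
def block_schedule(num_blocks=8):
    schedule = []
    for block in range(num_blocks):
        if block == 0:
            phase = "warm-up"
        elif block == num_blocks - 1:
            phase = "cool-down"
        else:
            phase = f"loop iteration {(block - 1) // 2}"
        schedule.append((block, phase, block % 2))
    return schedule

for entry in block_schedule():
    print(entry)
```

Each loop iteration handles two blocks, one per register half, so that the loop body ends with the same register roles it started with.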

Cool-down Phase#

Listing 9 Cool-down phase (lines 150-205) of the XDNA1 kernel.#

// k=0
  vmac.f bml4, bml4, x0, x4, r0 ; vlda wh2,   [p0], m7  ; vldb wh4, [p1], #32 ; vst amlh0, [p5], #32 ; nopx                     ; nopm
  vmac.f bmh4, bmh4, x1, x4, r0 ; vlda wl3,   [p0], #32 ; vldb wl5, [p1], #32 ; vst amhl0, [p4], #32 ; nopx                     ; nopm
  vmac.f bml5, bml5, x0, x5, r0 ; vlda wh3,   [p0], m0  ; vldb wh5, [p1], #32 ; vst amhh0, [p4], #32 ; nopx                     ; nopm
  vmac.f bmh5, bmh5, x1, x5, r0 ; nopa                  ; vldb wl6, [p1], #32 ; nops                 ; nopx                     ; nopm
  vmac.f bml6, bml6, x0, x6, r0 ; nopa                  ; vldb wh6, [p1], #32 ; nops                 ; nopx                     ; nopm
  vmac.f bmh6, bmh6, x1, x6, r0 ; nopa                  ; vldb wl7, [p1], #32 ; nops                 ; nopx                     ; nopm
  vmac.f bml7, bml7, x0, x7, r0 ; nopa                  ; vldb wh7, [p1], m1  ; nops                 ; nopx                     ; nopm
  vmac.f bmh7, bmh7, x1, x7, r0 ; vlda wl0,   [p0], #32 ; vldb wl4, [p1], #32 ; vst amll1, [p5], #32 ; nopx                     ; nopm

// k=8
  vmac.f bml4, bml4, x2, x4, r0 ; vlda wh0,   [p0], m7  ; vldb wh4, [p1], #32 ; vst amlh1, [p5], #32 ; nopx                     ; nopm
  vmac.f bmh4, bmh4, x3, x4, r0 ; vlda wl1,   [p0], #32 ; vldb wl5, [p1], #32 ; vst amhl1, [p4], #32 ; nopx                     ; nopm
  vmac.f bml5, bml5, x2, x5, r0 ; vlda wh1,   [p0], m0  ; vldb wh5, [p1], #32 ; vst amhh1, [p4], #32 ; nopx                     ; nopm
  vmac.f bmh5, bmh5, x3, x5, r0 ; nopa                  ; vldb wl6, [p1], #32 ; nops                 ; nopx                     ; nopm
  vmac.f bml6, bml6, x2, x6, r0 ; nopa                  ; vldb wh6, [p1], #32 ; nops                 ; nopx                     ; nopm
  vmac.f bmh6, bmh6, x3, x6, r0 ; nopa                  ; vldb wl7, [p1], #32 ; nops                 ; nopx                     ; nopm
  vmac.f bml7, bml7, x2, x7, r0 ; nopa                  ; vldb wh7, [p1], m1  ; nops                 ; nopx                     ; nopm
  vmac.f bmh7, bmh7, x3, x7, r0 ; vlda wl2,   [p0], #32 ; vldb wl4, [p1], #32 ; vst amll2, [p5], #32 ; nopx                     ; nopm

// k=16
  vmac.f bml4, bml4, x0, x4, r0 ; vlda wh2,   [p0], m7  ; vldb wh4, [p1], #32 ; vst amlh2, [p5], #32 ; nopx                     ; nopm
  vmac.f bmh4, bmh4, x1, x4, r0 ; vlda wl3,   [p0], #32 ; vldb wl5, [p1], #32 ; vst amhl2, [p4], #32 ; nopx                     ; nopm
  vmac.f bml5, bml5, x0, x5, r0 ; vlda wh3,   [p0, #0]  ; vldb wh5, [p1], #32 ; vst amhh2, [p4], #32 ; nopx                     ; nopm
  vmac.f bmh5, bmh5, x1, x5, r0 ; nopa                  ; vldb wl6, [p1], #32 ; nops                 ; nopx                     ; nopm
  vmac.f bml6, bml6, x0, x6, r0 ; nopa                  ; vldb wh6, [p1], #32 ; nops                 ; nopx                     ; nopm
  vmac.f bmh6, bmh6, x1, x6, r0 ; nopa                  ; vldb wl7, [p1], #32 ; nops                 ; nopx                     ; nopm
  vmac.f bml7, bml7, x0, x7, r0 ; nopa                  ; vldb wh7, [p1, #0]  ; nops                 ; nopx                     ; nopm
  vmac.f bmh7, bmh7, x1, x7, r0 ; nopa                  ; nopb                ; vst amll3, [p5], #32 ; nopx                     ; nopm

// k=24
  vmac.f bml4, bml4, x2, x4, r0 ; nopa                  ; nopb                ; vst amlh3, [p5], m6  ; nopx                     ; nopm
  vmac.f bmh4, bmh4, x3, x4, r0 ; nopa                  ; nopb                ; vst amhl3, [p4], #32 ; nopx                     ; nopm
  vmac.f bml5, bml5, x2, x5, r0 ; nopa                  ; nopb                ; vst amhh3, [p4], m6  ; nopx                     ; nopm
  vmac.f bmh5, bmh5, x3, x5, r0 ; nopa                  ; nopb                ; nops                 ; nopx                     ; nopm
  vmac.f bml6, bml6, x2, x6, r0 ; nopa                  ; nopb                ; nops                 ; nopx                     ; nopm
  vmac.f bmh6, bmh6, x3, x6, r0 ; nopa                  ; nopb                ; nops                 ; nopx                     ; nopm
  vmac.f bml7, bml7, x2, x7, r0 ; nopa                  ; nopb                ; vst amll4, [p5], #32 ; nopx                     ; nopm
  vmac.f bmh7, bmh7, x3, x7, r0 ; nopa                  ; nopb                ; vst amlh4, [p5], #32 ; nopx                     ; nopm
// k=32

  nopv                          ; nopa                  ; nopb                ; vst amhl4, [p4], #32 ; nopx                     ; nopm
  nopv                          ; nopa                  ; nopb                ; vst amhh4, [p4], #32 ; nopx                     ; nopm
  nopv                          ; nopa                  ; nopb                ; vst amll5, [p5], #32 ; nopx                     ; nopm
  nopv                          ; nopa                  ; nopb                ; vst amlh5, [p5], #32 ; nopx                     ; nopm
  nopv                          ; nopa                  ; nopb                ; vst amhl5, [p4], #32 ; nopx                     ; nopm
  nopv                          ; nopa                  ; nopb                ; vst amhh5, [p4], #32 ; nopx                     ; nopm
  nopv                          ; nopa                  ; nopb                ; vst amll6, [p5], #32 ; nopx                     ; nopm
  nopv                          ; nopa                  ; nopb                ; vst amlh6, [p5], #32 ; nopx                     ; nopm
  nopv                          ; nopa                  ; nopb                ; vst amhl6, [p4], #32 ; nopx                     ; nopm
  nopv                          ; nopa                  ; nopb                ; vst amhh6, [p4], #32 ; ret lr                   ; nopm
  nopv                          ; nopa                  ; nopb                ; vst amll7, [p5], #32 ; nopx                     ; nopm //  Delay Slot 5
  nopv                          ; nopa                  ; nopb                ; vst amlh7, [p5, #0]  ; nopx                     ; nopm //  Delay Slot 4
  nopv                          ; nopa                  ; nopb                ; vst amhl7, [p4], #32 ; nopx                     ; nopm //  Delay Slot 3
  nopv                          ; nopa                  ; nopb                ; vst amhh7, [p4, #0]  ; nopx                     ; nopm //  Delay Slot 2
  nopv                          ; nopa                  ; nopb                ; nops                 ; nopx                     ; nopm //  Delay Slot 1

The cool-down phase is shown in Listing 9. It differs from the loop body in two key ways. First, no preloading of the next 2×4 block of output tiles is required. Second, most of the VST operations writing the last block in accumulation registers 4–7 are exposed, meaning they cannot be hidden behind VMAC.F operations. In line 200, the ret lr operation is issued, which has a six-cycle latency; the five instructions that follow it fill its delay slots.

The cool-down phase contains a total of 47 instructions out of which 32 contain VMAC.F operations.

Kernel Efficiency#

Our XDNA1 tensor contraction kernel has the following utilization of the vector unit in the three parts:

Warm-up phase: 32 out of 49 instructions contain VMAC.F operations.
Hardware loop: All 64 instructions in the loop body contain VMAC.F operations. The loop executes three times, yielding 192 instructions containing VMAC.F operations.
Cool-down phase: 32 out of 47 instructions contain VMAC.F operations.

In summary, the kernel executes 288 instructions (49 + 3×64 + 47), out of which 256 contain VMAC.F operations. This leads to a theoretical utilization of 89%. In other words, a compute tile running at 1.8 GHz would execute 1.6×10⁹ BF16 4×8×4 operations per second. Since each of these operations performs 2×4×8×4 = 256 floating-point operations, this is equivalent to a theoretical floating-point throughput of 410 BF16 GFLOPS.
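The figures above can be reproduced with a few lines of arithmetic. The following Python sketch uses only the instruction counts and the clock frequency stated in this section; the variable names are illustrative.

```python
# Sketch: the utilization and throughput figures from the counts above.
warm_up = (32, 49)        # (instructions with VMAC.F, total instructions)
loop = (64 * 3, 64 * 3)   # three executions of the 64-instruction body
cool_down = (32, 47)

vmacs = warm_up[0] + loop[0] + cool_down[0]         # 256
total = warm_up[1] + loop[1] + cool_down[1]         # 288
utilization = vmacs / total                         # about 0.889

clock_hz = 1.8e9
flops_per_vmac = 2 * 4 * 8 * 4                      # multiply and add per (m, k, n)
vmacs_per_second = utilization * clock_hz           # about 1.6e9
gflops = vmacs_per_second * flops_per_vmac / 1e9    # about 409.6
```

As a cross-check, the 256 VMAC.F operations equal |m1|·|n1|·|k1| = 8·8·4, i.e., one operation per combination of output tile and k1 step.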

We implemented a benchmark in which the tensor contraction kernel is called repeatedly in a loop on the NPU. Benchmarking the kernel on an XDNA1 NPU (AMD Ryzen 7 8700G), we achieved 398 BF16 GFLOPS. The benchmarking code is available from our xdna repository. To run the benchmark, execute the following commands:

git clone https://github.com/scalable-analyses/xdna
cd xdna
make run