XDNA1 Kernel#
Tiled tensor contractions use kernels that operate on subtensors (tiles) as their core building block. Accelerating these kernels is crucial to the overall performance of tensor workloads. As discussed in the chapter Instruction Set Architecture, the XDNA cores’ floating-point throughput is driven by VMAC.F. On XDNA1, the highest-throughput VMAC.F operation performs a BF16 4×8×4 matrix multiplication. On XDNA2, the highest-throughput VMAC.F computes a BFP16 8×8×8 matrix multiplication.
The goal of this chapter is to develop a high-performance tensor contraction kernel for XDNA1. We pursue this goal through three sub-objectives:
- Maximize the rate at which VMAC.F operations are issued. In the best case, a VMAC.F operation is issued every clock cycle.
- Hide all other operations, e.g., loads, stores, and pointer arithmetic, behind computation.
- Minimize bank conflicts.
Note
We discuss the design and implementation of a representative best-case kernel. Many other variants of this kernel are required for implementing a flexible and high-performance tensor compiler. We plan to extend the Hello XDNA website with generalizations of the kernels presented in this chapter and the XDNA2 Kernel chapter in the future, e.g., through just-in-time code generation.
Data Layout#
Fig. 1 Register data layout for the BF16 4×8×4 VMAC.F (<BMLd>|<BMHd>), (<BMLm>|<BMHm>), <Xr>, <Xs>, <Rn> operation computing [m,k],[k,n]->[m,n] with |m|=4, |k|=8, and |n|=4.#
Fig. 1 illustrates the register data layout required for the BF16 4×8×4 VMAC.F operation.
The operation multiplies a BF16 M×K matrix in one X register with a BF16 K×N matrix in another X register and adds the result to an FP32 M×N matrix in an accumulation register.
All matrices are stored in row-major order.
Note that this is equivalent to a column-major matrix-matrix multiplication if we exchange the two operands.
We can also write the operation as an einsum, assuming that all tensors are in row-major order: [m,k],[k,n]->[m,n].
The einsum notation becomes helpful when considering the more complex data layout of the entire kernel.
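The semantics of the operation can be sketched in plain Python. This is only a reference model: Python floats stand in for the BF16 inputs and FP32 accumulators, and the function names `vmac_f_reference` and `colmajor_matmul` are illustrative, not part of any toolchain. The second function checks the remark above that the same buffers, read as column-major matrices with the operands exchanged, yield the same result buffer.

```python
M, K, N = 4, 8, 4  # tile sizes fixed by the BF16 VMAC.F operation


def vmac_f_reference(acc, a, b):
    """Return acc + a @ b for flat row-major lists ([m,k],[k,n]->[m,n])."""
    out = list(acc)
    for m in range(M):
        for n in range(N):
            for k in range(K):
                out[m * N + n] += a[m * K + k] * b[k * N + n]
    return out


def colmajor_matmul(x, y, rows, inner, cols):
    """Column-major product of x (rows x inner) and y (inner x cols)."""
    out = [0.0] * (rows * cols)
    for c in range(cols):
        for r in range(rows):
            for i in range(inner):
                out[c * rows + r] += x[i * rows + r] * y[c * inner + i]
    return out


# Small integer-valued test data so that all sums are exact.
a = [float(i % 5) for i in range(M * K)]
b = [float(i % 3) for i in range(K * N)]

row_major = vmac_f_reference([0.0] * (M * N), a, b)
# Same buffers, interpreted as column-major, with the operands exchanged.
col_major = colmajor_matmul(b, a, N, K, M)
print(row_major == col_major)
```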
Fig. 2 Data layout of the tensor contraction kernel in scratchpad memory (L1) of an executing compute tile. The kernel computes [m1,k1,m0,k0],[k1,n1,k0,n0]->[m1,n1,m0,n0], where the dimension sizes |m0|=4, |k0|=8, and |n0|=4 are fixed. The first computation of a 2×4 block of output tiles is highlighted in darker colors.#
Fig. 2 illustrates the data layout of the entire tensor contraction kernel.
It covers the two input tensors in0 and in1, as well as the output tensor out.
The three tensors are tiled based on the requirements of the BF16 VMAC.F instruction.
In detail, in0 has tiles of size M₀×K₀=4×8, in1 of size K₀×N₀=8×4, and out of size M₀×N₀=4×4.
Beyond the tile dimensions m0, k0, and n0, the tensor contraction kernel operates on three additional outer dimensions m1, k1, and n1 of type M, K, and N, respectively.
The tiles are stored in row-major order; accordingly, in0 uses M₁×K₁×M₀×K₀, while in1 uses K₁×N₁×K₀×N₀, and out uses M₁×N₁×M₀×N₀.
As before, we can write the operation compactly as an einsum, assuming that all tensors are stored in row-major order: [m1,k1,m0,k0],[k1,n1,k0,n0]->[m1,n1,m0,n0].
The tile size requirement means that |m0|=4, |k0|=8, and |n0|=4.
We discuss limitations on m1, k1 and n1 in the following sections.
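The tiled layout can be made concrete with a small reference implementation. The sketch below is plain Python on flat row-major lists and is meant only to illustrate the indexing implied by the einsum; the helper names (`in0_index`, `in1_index`, `out_index`, `contract_tiled`) are hypothetical, and Python floats stand in for BF16 and FP32 values.

```python
M0, K0, N0 = 4, 8, 4  # tile sizes fixed by the VMAC.F operation


def in0_index(bm, bk, m, k, k1):
    """Flat offset into in0 with layout [m1,k1,m0,k0]."""
    return ((bm * k1 + bk) * M0 + m) * K0 + k


def in1_index(bk, bn, k, n, n1):
    """Flat offset into in1 with layout [k1,n1,k0,n0]."""
    return ((bk * n1 + bn) * K0 + k) * N0 + n


def out_index(bm, bn, m, n, n1):
    """Flat offset into out with layout [m1,n1,m0,n0]."""
    return ((bm * n1 + bn) * M0 + m) * N0 + n


def contract_tiled(in0, in1, m1, k1, n1):
    """Reference for [m1,k1,m0,k0],[k1,n1,k0,n0]->[m1,n1,m0,n0]."""
    out = [0.0] * (m1 * n1 * M0 * N0)
    for bm in range(m1):
        for bn in range(n1):
            for bk in range(k1):
                # The three inner loops correspond to one 4x8x4 VMAC.F.
                for m in range(M0):
                    for n in range(N0):
                        for k in range(K0):
                            out[out_index(bm, bn, m, n, n1)] += (
                                in0[in0_index(bm, bk, m, k, k1)]
                                * in1[in1_index(bk, bn, k, n, n1)]
                            )
    return out


# Sizes of the example kernel discussed in this chapter.
M1, K1, N1 = 8, 4, 8
in0 = [1.0] * (M1 * K1 * M0 * K0)
in1 = [1.0] * (K1 * N1 * K0 * N0)
out = contract_tiled(in0, in1, M1, K1, N1)
# With all-ones inputs every output value equals the contraction length K1 * K0.
print(len(out), out[0])
```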
Design Decisions#
We have already discussed the data layout of our tensor contraction kernel.
To this end, we derived the tiling requirements and an einsum that summarizes the contraction computed by the targeted kernel.
The dimensions m0, k0 and n0 are consumed by the VMAC.F operation, while the handling of dimensions m1, k1 and n1 is still unspecified.
Before introducing design decisions for the kernel, we recapitulate key hardware properties that have to be considered in the kernel design:
1. Loading a 64-byte accumulation register requires two 32-byte loads (VLDA). A load has a 7-cycle latency. An instruction can contain a single load or store operation accessing the accumulation registers.
2. VMAC.F reads the accumulation register in its third cycle (forwarding).
3. Combining properties 1 and 2, we can issue a dependent VMAC.F operation at the earliest in the sixth cycle of the second 32-byte load. In other words, the second load and the dependent VMAC.F have to be five cycles apart.
4. A VMAC.F operation has a latency of six cycles.
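The five-cycle distance follows from simple arithmetic on the latencies listed above. The sketch below spells it out, assuming one instruction issues per cycle without stalls; the constant and function names are illustrative. The two line numbers refer to the warm-up listing later in this chapter, where the second load of an accumulation register and its first dependent VMAC.F are nine instructions apart.

```python
LOAD_LATENCY = 7   # cycles until the data of a 32-byte load is available
VMAC_ACC_READ = 3  # VMAC.F reads its accumulation register in this cycle
VMAC_LATENCY = 6   # cycles until a VMAC.F result is available


def min_load_to_vmac_distance(load_latency=LOAD_LATENCY, read_cycle=VMAC_ACC_READ):
    """Fewest instructions between the second load and a dependent VMAC.F."""
    # A VMAC.F issued d instructions after the load reads the accumulator
    # (read_cycle - 1) cycles after its own issue, so we need
    # d + (read_cycle - 1) >= load_latency.
    return load_latency - (read_cycle - 1)


distance = min_load_to_vmac_distance()
print(distance)

# Warm-up listing: line 22 loads the upper half of BMH3, line 31 updates BMH3.
warmup_gap = 31 - 22
print(warmup_gap >= distance)

# In the listings each accumulation register is updated every eighth
# instruction, which is at least the six-cycle VMAC.F latency.
UPDATE_INTERVAL = 8
print(UPDATE_INTERVAL >= VMAC_LATENCY)
```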
We make the following design decisions in our kernel:
- Output Stationary
Load each output tensor value to the register file exactly once. This means that an output value is kept in the respective accumulation register until all updates have been applied through VMAC.F operations. Storing intermediate values and loading them back into the registers is challenging because loads and stores must be in different instructions to avoid bank conflicts.
- Register Blocking
|m1| and |n1| must be multiples of the register-blocking size. Our example kernel uses a 2×4 register blocking scheme for the output tensor. This means that we use eight accumulation registers to hold the values of the eight output tiles as shown in Fig. 2. Due to the 2×4 blocking, each loaded in0 tile is used in four VMAC.F operations, and each in1 tile is used in two operations. This reuse is required to hide the register data transfer behind computation. For example, we could not achieve this with a 2×2 blocking.
- Linear Contraction Dimension
The k1 dimension is handled with linear code without loop structures. This allows for different combinations of operations per block and is necessary for register preloading.
- Single Hardware Loop
The m1 and n1 dimensions are represented by a single hardware loop. The first and last 2×4 blocks are computed outside of this loop, forming a warm-up phase and cool-down phase.
- Double Buffering: Accumulation Registers
A 2×4 block requires eight out of sixteen available accumulation registers. We alternate registers BML0–BML3 and BMH0–BMH3 with BML4–BML7 and BMH4–BMH7 to realize a double buffering scheme. This means that while updating the tiles in one half of the accumulation registers, we load the next 2×4 block into the other half.
- Double Buffering: Vector Registers
We also use double buffering for the registers holding tiles of in0. In particular, we alternate X0 and X1 with X2 and X3.
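One way to see why the 2×4 blocking works while a 2×2 blocking does not is to count 32-byte load slots. The following back-of-the-envelope sketch is our own illustration rather than a statement about the hardware: it assumes |k1|=4 as in the example kernel, one VMAC.F per instruction, two load units that each issue one 32-byte load per instruction, and it ignores the restriction on accumulator loads and stores. The function name `load_budget` is hypothetical.

```python
LOADS_PER_REGISTER = 2  # a 64-byte register is filled by two 32-byte loads
LOAD_UNITS = 2          # assumption: units A and B each issue one load per instruction


def load_budget(bm, bn, k1):
    """Return (available, required) 32-byte load slots for one bm x bn block."""
    vmacs = bm * bn * k1                           # one VMAC.F per output tile and k1 step
    available = vmacs * LOAD_UNITS                 # best case: one instruction per VMAC.F
    inputs = (bm + bn) * k1 * LOADS_PER_REGISTER   # in0 and in1 tiles
    outputs = bm * bn * LOADS_PER_REGISTER         # each output tile is loaded once
    return available, inputs + outputs


for bm, bn in ((2, 4), (2, 2)):
    available, required = load_budget(bm, bn, k1=4)
    verdict = "fits" if required <= available else "does not fit"
    print(f"{bm}x{bn}: {required} loads, {available} slots -> {verdict}")
```

Under these assumptions the 2×4 blocking needs exactly as many loads as there are slots, whereas the 2×2 blocking would need more loads than the VMAC.F-only instruction stream can carry.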
Implementation#
This section discusses the implementation of a representative XDNA1 tensor contraction kernel.
The kernel computes the einsum [m1,k1,m0,k0],[k1,n1,k0,n0]->[m1,n1,m0,n0] with dimension sizes |m0|=4, |k0|=8, |n0|=4, |m1|=8, |k1|=4, and |n1|=8.
It contains three parts: a warm-up phase, a hardware loop, and a cool-down phase.
Warm-Up Phase#
7 nopv ; vlda amlh0, [p2, #32] ; vldb wl4, [p1], #32 ; nops ; movxm m2, #4*4*4 * 8 // 4(byte)*4(r)*4(t) * 8(n) // out - 1 row-step
8 nopv ; vlda amll0, [p2], m2 ; vldb wh4, [p1], #32 ; nops ; movx r0, #28 ; mov p3, p2
9 nopv ; vlda amhh0, [p2, #32] ; vldb wl0, [p0], #32 ; nops ; movxm m7, #2*4*8 * 4 - 32 // 2(byte)*4(r)*8(s) * 4(k) - 32 (half-block) // in0 - m-step
10 nopv ; vlda amhl0, [p2], #64 ; vldb wh0, [p0], m7 ; nops ; nopx ; mov p4, p2
11 nopv ; vlda amll1, [p3, #64] ; vldb wl1, [p0], #32 ; nops ; movxm m0, #32 - (2*4*8*4) // 32(half-block) - (m7+32) // in0 - k-step
12 nopv ; vlda amlh1, [p3, #96] ; vldb wh1, [p0], m0 ; padds [p3], #128 ; nopx ; mov p5, p3
13 nopv ; vlda amhl1, [p2], #32 ; vldb wl5, [p1], #32 ; nops ; movxm m1, #2*8*4 * 8 - 7 * 32 // 2(byte)*8(s)*4(t) * 8(n) - 7(blocking in n *2 - 1) * 32(half-block) // in1 - k-step
14 nopv ; vlda amhh1, [p2], #32 ; vldb wh5, [p1], #32 ; nops ; movx r1, #0 ; mov r2, #8/4 // 8(n)/4(blocking in n)
15 nopv ; vlda amll2, [p3], #32 ; vldb wl6, [p1], #32 ; nops ; movxm r3, #32 - 2*4*8 * 2 * 4 // 32(half-block) - 2(byte)*4(r)*8(s) * 2(blocking in m) * 4(k) // in0 - n-step
16 nopv ; vlda amlh2, [p3], #32 ; vldb wh6, [p1], #32 ; nops ; movxm r4, #32 // 32(half-block) // in0 - m-step // out - n-step
17 vmac.f bml0, bml0, x0, x4, r0 ; vlda amhl2, [p2], #32 ; vldb wl7, [p1], #32 ; nops ; movxm r5, #32 - 2*8*4 * 8 * (4-1) // 32(half-block) - 2(byte)*8(s)*4(t) * 8(n) * (k-1) // in1 - n-step
18 nopv ; vlda amhh2, [p2], #32 ; vldb wh7, [p1], m1 ; nops ; movxm r6, #32 - 2*8*4 * 8 * 4 // 32(half-block) - 2(byte)*8(s)*4(t) * 8(n) * 2(4) // in1 - m-step
19 vmac.f bmh0, bmh0, x1, x4, r0 ; vlda amll3, [p3], #32 ; nopb ; nops ; movxm r7, #32 + 4*4*4 * 8 // 32(half-block) + 4(byte)*4(r)*4(t) * 8(n) // out - m-step
20 nopv ; vlda amlh3, [p3], #32 ; nopb ; nops ; add r1, r1, #1 ; nopm
21 vmac.f bml1, bml1, x0, x5, r0 ; vlda amhl3, [p2], #32 ; nopb ; nops ; ltu r27, r1, r2 ; nopm
22 nopv ; vlda amhh3, [p2], #32 ; nopb ; nops ; sel.nez r28, r4, r7, r27 ; nopm
23 vmac.f bmh1, bmh1, x1, x5, r0 ; vlda wl2, [p0], #32 ; vldb wl4, [p1], #32 ; nops ; sel.nez r28, r5, r6, r27 ; mov m5, r28
24 nopv ; vlda wh2, [p0], m7 ; vldb wh4, [p1], #32 ; nops ; sel.nez r28, r3, r4, r27 ; mov m4, r28
25 vmac.f bml2, bml2, x0, x6, r0 ; vlda wl3, [p0], #32 ; vldb wl5, [p1], #32 ; nops ; mul r1, r1, r27 ; mov m3, r28
26 nopv ; vlda wh3, [p0], m0 ; vldb wh5, [p1], #32 ; nops ; nopx ; nopm
27 vmac.f bmh2, bmh2, x1, x6, r0 ; vlda amll4, [p3], #32 ; vldb wl6, [p1], #32 ; nops ; nopx ; nopm
28 nopv ; vlda amlh4, [p3], #32 ; vldb wh6, [p1], #32 ; nops ; nopx ; nopm
29 vmac.f bml3, bml3, x0, x7, r0 ; vlda amhl4, [p2], #32 ; vldb wl7, [p1], #32 ; nops ; movxm ls, #.l_start
30 nopv ; vlda amhh4, [p2], #32 ; vldb wh7, [p1], m1 ; nops ; movxm le, #.l_end
31 vmac.f bmh3, bmh3, x1, x7, r0 ; vlda wl0, [p0], #32 ; vldb wl4, [p1], #32 ; nops ; movxm lc, #3 // (8(m)/2(blocking in m) * 8(n)/4(blocking in n) - 2(warmup + cool-down)) /2(iterations in loop)
32
33// k=8
34 vmac.f bml0, bml0, x2, x4, r0 ; vlda wh0, [p0], m7 ; vldb wh4, [p1], #32 ; nops ; nopx ; nopm
35 vmac.f bmh0, bmh0, x3, x4, r0 ; vlda wl1, [p0], #32 ; vldb wl5, [p1], #32 ; nops ; nopx ; nopm
36 vmac.f bml1, bml1, x2, x5, r0 ; vlda wh1, [p0], m0 ; vldb wh5, [p1], #32 ; nops ; nopx ; nopm
37 vmac.f bmh1, bmh1, x3, x5, r0 ; vlda amll5, [p3], #32 ; vldb wl6, [p1], #32 ; nops ; nopx ; nopm
38 vmac.f bml2, bml2, x2, x6, r0 ; vlda amlh5, [p3], #32 ; vldb wh6, [p1], #32 ; nops ; nopx ; nopm
39 vmac.f bmh2, bmh2, x3, x6, r0 ; vlda amhl5, [p2], #32 ; vldb wl7, [p1], #32 ; nops ; nopx ; nopm
40 vmac.f bml3, bml3, x2, x7, r0 ; vlda amhh5, [p2], #32 ; vldb wh7, [p1], m1 ; nops ; nopx ; nopm
41 vmac.f bmh3, bmh3, x3, x7, r0 ; vlda wl2, [p0], #32 ; vldb wl4, [p1], #32 ; nops ; nopx ; nopm
42
43// k=16
44 vmac.f bml0, bml0, x0, x4, r0 ; vlda wh2, [p0], m7 ; vldb wh4, [p1], #32 ; nops ; nopx ; nopm
45 vmac.f bmh0, bmh0, x1, x4, r0 ; vlda wl3, [p0], #32 ; vldb wl5, [p1], #32 ; nops ; nopx ; nopm
46 vmac.f bml1, bml1, x0, x5, r0 ; vlda wh3, [p0], m3 ; vldb wh5, [p1], #32 ; nops ; nopx ; nopm
47 vmac.f bmh1, bmh1, x1, x5, r0 ; vlda amll6, [p3], #32 ; vldb wl6, [p1], #32 ; nops ; nopx ; nopm
48 vmac.f bml2, bml2, x0, x6, r0 ; vlda amlh6, [p3], #32 ; vldb wh6, [p1], #32 ; nops ; nopx ; nopm
49 vmac.f bmh2, bmh2, x1, x6, r0 ; vlda amhl6, [p2], #32 ; vldb wl7, [p1], #32 ; nops ; nopx ; nopm
50 vmac.f bml3, bml3, x0, x7, r0 ; vlda amhh6, [p2], #32 ; vldb wh7, [p1], m4 ; nops ; nopx ; nopm
51 vmac.f bmh3, bmh3, x1, x7, r0 ; vlda wl0, [p0], #32 ; vldb wl4, [p1], #32 ; nops ; nopx ; nopm
52
53// k=24
54 vmac.f bml0, bml0, x2, x4, r0 ; vlda wh0, [p0], m7 ; vldb wh4, [p1], #32 ; nops ; add r1, r1, #1 ; nopm
55 vmac.f bmh0, bmh0, x3, x4, r0 ; vlda wl1, [p0], #32 ; vldb wl5, [p1], #32 ; nops ; ltu r27, r1, r2 ; nopm
56 vmac.f bml1, bml1, x2, x5, r0 ; vlda wh1, [p0], m0 ; vldb wh5, [p1], #32 ; nops ; sel.nez r28, r4, r7, r27 ; mov m6, m5
57 vmac.f bmh1, bmh1, x3, x5, r0 ; vlda amll7, [p3], #32 ; vldb wl6, [p1], #32 ; nops ; sel.nez r28, r5, r6, r27 ; mov m5, r28
58 vmac.f bml2, bml2, x2, x6, r0 ; vlda amlh7, [p3], m5 ; vldb wh6, [p1], #32 ; nops ; sel.nez r28, r3, r4, r27 ; mov m4, r28
59 vmac.f bmh2, bmh2, x3, x6, r0 ; vlda amhl7, [p2], #32 ; vldb wl7, [p1], #32 ; nops ; mul r1, r1, r27 ; mov m3, r28
60 vmac.f bml3, bml3, x2, x7, r0 ; vlda amhh7, [p2], m5 ; vldb wh7, [p1], m1 ; nops ; nopx ; nopm
61 vmac.f bmh3, bmh3, x3, x7, r0 ; vlda wl2, [p0], #32 ; vldb wl4, [p1], #32 ; vst amll0, [p5], #32 ; nopx ; nopm
62// k=32
Listing 6 shows the warm-up phase of the kernel.
Load unit A loads data into the accumulation registers.
The sixteen 32-byte VLDA operations in lines 7–22 load the first 2×4 block of output tiles into the accumulation registers BML0–BML3 and BMH0–BMH3.
Additionally, in lines 7–18, load unit B is used to load the first two input tiles of in0 into vector registers X0 and X1, as well as the first four input tiles of in1 into X4–X7.
The register mapping is also illustrated in Fig. 2.
The first update of the 2×4 block is performed by the VMAC.F operations in lines 17–31.
In lines 15–19, the warm-up phase initializes the general-purpose registers R3–R7.
These registers are used throughout the kernel; their values are copied to modifier registers, which in turn update the addresses held in the pointer registers.
We also see that load unit A is used to load the next 2×4 block of output tiles to accumulation registers BML4–BML7 and BMH4–BMH7 (lines 27–30, 37–40, 47–50, and 57–60).
The first instruction block (lines 7–31) contains eight VMAC.F operations and 17 NOPV operations, thus leaving the vector unit partially unutilized. Every instruction in the following three eight-instruction blocks contains a VMAC.F operation, meaning that the BF16 matrix multiplication unit of the core is fully utilized. In summary, the warm-up phase has a total of 49 instructions, out of which 32 contain VMAC.F operations.
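The address-step constants that the warm-up phase writes to modifier and general-purpose registers can be re-derived from the tensor sizes, following the comments in the listing. The sketch below only reproduces that arithmetic in Python; the variable names are ours, and the step descriptions are taken from the listing's comments.

```python
BF16, FP32 = 2, 4      # bytes per element
M0, K0, N0 = 4, 8, 4   # tile sizes
K1, N1 = 4, 8          # sizes of the example kernel
BM, BN = 2, 4          # register blocking in m and n
HALF = 32              # bytes moved by one 32-byte load or store

in0_tile = BF16 * M0 * K0  # 64 bytes
in1_tile = BF16 * K0 * N0  # 64 bytes
out_tile = FP32 * M0 * N0  # 64 bytes

steps = {
    "m2": out_tile * N1,                        # out: one row of tiles
    "m7": in0_tile * K1 - HALF,                 # in0: m-step
    "m0": HALF - in0_tile * K1,                 # in0: k-step
    "m1": in1_tile * N1 - (2 * BN - 1) * HALF,  # in1: k-step
    "r3": HALF - in0_tile * BM * K1,            # in0: n-step
    "r4": HALF,                                 # in0: m-step, out: n-step
    "r5": HALF - in1_tile * N1 * (K1 - 1),      # in1: n-step
    "r6": HALF - in1_tile * N1 * K1,            # in1: m-step
    "r7": HALF + out_tile * N1,                 # out: m-step
}

for name, value in steps.items():
    print(f"{name}: {value:+d} bytes")
```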
Hardware Loop#
We must perform the loop setup at least 64 bytes before the loop’s start address. The first and last instructions in the loop must be 16-byte aligned. Additionally, the last instruction covered by a loop must have a size of 16 bytes. An instruction that contains operations for all functional units is 16 bytes wide. A NOP instruction is only two bytes wide.
29 movxm ls, #.l_start
30 movxm le, #.l_end
31 movxm lc, #3 // (8(m)/2(blocking in m) * 8(n)/4(blocking in n) - 2(warmup + cool-down)) /2(iterations in loop)
Listing 7 shows the operations configuring the hardware loop.
movxm ls, #.l_start copies the address of the first loop instruction into the loop start register.
The operation movxm le, #.l_end copies the address of the last loop instruction into the loop end register.
movxm lc, #3 copies the value 3 into the loop counter register.
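The loop count of 3 follows from the formula in the listing's comment. As a minimal sketch, the helper below (a hypothetical name, not part of the kernel) evaluates that formula and rejects sizes that do not fit the structure of one warm-up block, two blocks per loop iteration, and one cool-down block.

```python
def loop_count(m1, n1, bm=2, bn=4, blocks_per_iteration=2):
    """Number of hardware-loop iterations for a kernel with |m1| x |n1| tiles."""
    if m1 % bm or n1 % bn:
        raise ValueError("|m1| and |n1| must be multiples of the blocking")
    blocks = (m1 // bm) * (n1 // bn)
    in_loop = blocks - 2  # first block: warm-up phase, last block: cool-down phase
    if in_loop < 0 or in_loop % blocks_per_iteration:
        raise ValueError("block count does not fit this loop structure")
    return in_loop // blocks_per_iteration


print(loop_count(8, 8))
```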
64 .p2align 4
65 .l_start:
66// k=0
67 vmac.f bml4, bml4, x0, x4, r0 ; vlda wh2, [p0], m7 ; vldb wh4, [p1], #32 ; vst amlh0, [p5], #32 ; nopx ; nopm
68 vmac.f bmh4, bmh4, x1, x4, r0 ; vlda wl3, [p0], #32 ; vldb wl5, [p1], #32 ; vst amhl0, [p4], #32 ; nopx ; nopm
69 vmac.f bml5, bml5, x0, x5, r0 ; vlda wh3, [p0], m0 ; vldb wh5, [p1], #32 ; vst amhh0, [p4], #32 ; nopx ; nopm
70 vmac.f bmh5, bmh5, x1, x5, r0 ; vlda amll0, [p3], #32 ; vldb wl6, [p1], #32 ; nops ; nopx ; nopm
71 vmac.f bml6, bml6, x0, x6, r0 ; vlda amlh0, [p3], #32 ; vldb wh6, [p1], #32 ; nops ; nopx ; nopm
72 vmac.f bmh6, bmh6, x1, x6, r0 ; vlda amhl0, [p2], #32 ; vldb wl7, [p1], #32 ; nops ; nopx ; nopm
73 vmac.f bml7, bml7, x0, x7, r0 ; vlda amhh0, [p2], #32 ; vldb wh7, [p1], m1 ; nops ; nopx ; nopm
74 vmac.f bmh7, bmh7, x1, x7, r0 ; vlda wl0, [p0], #32 ; vldb wl4, [p1], #32 ; vst amll1, [p5], #32 ; nopx ; nopm
75
76// k=8
77 vmac.f bml4, bml4, x2, x4, r0 ; vlda wh0, [p0], m7 ; vldb wh4, [p1], #32 ; vst amlh1, [p5], #32 ; nopx ; nopm
78 vmac.f bmh4, bmh4, x3, x4, r0 ; vlda wl1, [p0], #32 ; vldb wl5, [p1], #32 ; vst amhl1, [p4], #32 ; nopx ; nopm
79 vmac.f bml5, bml5, x2, x5, r0 ; vlda wh1, [p0], m0 ; vldb wh5, [p1], #32 ; vst amhh1, [p4], #32 ; nopx ; nopm
80 vmac.f bmh5, bmh5, x3, x5, r0 ; vlda amll1, [p3], #32 ; vldb wl6, [p1], #32 ; nops ; nopx ; nopm
81 vmac.f bml6, bml6, x2, x6, r0 ; vlda amlh1, [p3], #32 ; vldb wh6, [p1], #32 ; nops ; nopx ; nopm
82 vmac.f bmh6, bmh6, x3, x6, r0 ; vlda amhl1, [p2], #32 ; vldb wl7, [p1], #32 ; nops ; nopx ; nopm
83 vmac.f bml7, bml7, x2, x7, r0 ; vlda amhh1, [p2], #32 ; vldb wh7, [p1], m1 ; nops ; nopx ; nopm
84 vmac.f bmh7, bmh7, x3, x7, r0 ; vlda wl2, [p0], #32 ; vldb wl4, [p1], #32 ; vst amll2, [p5], #32 ; nopx ; nopm
85
86// k=16
87 vmac.f bml4, bml4, x0, x4, r0 ; vlda wh2, [p0], m7 ; vldb wh4, [p1], #32 ; vst amlh2, [p5], #32 ; nopx ; nopm
88 vmac.f bmh4, bmh4, x1, x4, r0 ; vlda wl3, [p0], #32 ; vldb wl5, [p1], #32 ; vst amhl2, [p4], #32 ; nopx ; nopm
89 vmac.f bml5, bml5, x0, x5, r0 ; vlda wh3, [p0], m3 ; vldb wh5, [p1], #32 ; vst amhh2, [p4], #32 ; nopx ; nopm
90 vmac.f bmh5, bmh5, x1, x5, r0 ; vlda amll2, [p3], #32 ; vldb wl6, [p1], #32 ; nops ; nopx ; nopm
91 vmac.f bml6, bml6, x0, x6, r0 ; vlda amlh2, [p3], #32 ; vldb wh6, [p1], #32 ; nops ; nopx ; nopm
92 vmac.f bmh6, bmh6, x1, x6, r0 ; vlda amhl2, [p2], #32 ; vldb wl7, [p1], #32 ; nops ; nopx ; nopm
93 vmac.f bml7, bml7, x0, x7, r0 ; vlda amhh2, [p2], #32 ; vldb wh7, [p1], m4 ; nops ; nopx ; nopm
94 vmac.f bmh7, bmh7, x1, x7, r0 ; vlda wl0, [p0], #32 ; vldb wl4, [p1], #32 ; vst amll3, [p5], #32 ; nopx ; nopm
95
96// k=24
97 vmac.f bml4, bml4, x2, x4, r0 ; vlda wh0, [p0], m7 ; vldb wh4, [p1], #32 ; vst amlh3, [p5], m6 ; add r1, r1, #1 ; nopm
98 vmac.f bmh4, bmh4, x3, x4, r0 ; vlda wl1, [p0], #32 ; vldb wl5, [p1], #32 ; vst amhl3, [p4], #32 ; ltu r27, r1, r2 ; nopm
99 vmac.f bml5, bml5, x2, x5, r0 ; vlda wh1, [p0], m0 ; vldb wh5, [p1], #32 ; vst amhh3, [p4], m6 ; sel.nez r28, r4, r7, r27 ; mov m6, m5
100 vmac.f bmh5, bmh5, x3, x5, r0 ; vlda amll3, [p3], #32 ; vldb wl6, [p1], #32 ; nops ; sel.nez r28, r5, r6, r27 ; mov m5, r28
101 vmac.f bml6, bml6, x2, x6, r0 ; vlda amlh3, [p3], m5 ; vldb wh6, [p1], #32 ; nops ; sel.nez r28, r3, r4, r27 ; mov m4, r28
102 vmac.f bmh6, bmh6, x3, x6, r0 ; vlda amhl3, [p2], #32 ; vldb wl7, [p1], #32 ; nops ; mul r1, r1, r27 ; mov m3, r28
103 vmac.f bml7, bml7, x2, x7, r0 ; vlda amhh3, [p2], m5 ; vldb wh7, [p1], m1 ; nops ; nopx ; nopm
104 vmac.f bmh7, bmh7, x3, x7, r0 ; vlda wl2, [p0], #32 ; vldb wl4, [p1], #32 ; vst amll4, [p5], #32 ; nopx ; nopm
105// k=32
106
107// k=0
108 vmac.f bml0, bml0, x0, x4, r0 ; vlda wh2, [p0], m7 ; vldb wh4, [p1], #32 ; vst amlh4, [p5], #32 ; nopx ; nopm
109 vmac.f bmh0, bmh0, x1, x4, r0 ; vlda wl3, [p0], #32 ; vldb wl5, [p1], #32 ; vst amhl4, [p4], #32 ; nopx ; nopm
110 vmac.f bml1, bml1, x0, x5, r0 ; vlda wh3, [p0], m0 ; vldb wh5, [p1], #32 ; vst amhh4, [p4], #32 ; nopx ; nopm
111 vmac.f bmh1, bmh1, x1, x5, r0 ; vlda amll4, [p3], #32 ; vldb wl6, [p1], #32 ; nops ; nopx ; nopm
112 vmac.f bml2, bml2, x0, x6, r0 ; vlda amlh4, [p3], #32 ; vldb wh6, [p1], #32 ; nops ; nopx ; nopm
113 vmac.f bmh2, bmh2, x1, x6, r0 ; vlda amhl4, [p2], #32 ; vldb wl7, [p1], #32 ; nops ; nopx ; nopm
114 vmac.f bml3, bml3, x0, x7, r0 ; vlda amhh4, [p2], #32 ; vldb wh7, [p1], m1 ; nops ; nopx ; nopm
115 vmac.f bmh3, bmh3, x1, x7, r0 ; vlda wl0, [p0], #32 ; vldb wl4, [p1], #32 ; vst amll5, [p5], #32 ; nopx ; nopm
116
117// k=8
118 vmac.f bml0, bml0, x2, x4, r0 ; vlda wh0, [p0], m7 ; vldb wh4, [p1], #32 ; vst amlh5, [p5], #32 ; nopx ; nopm
119 vmac.f bmh0, bmh0, x3, x4, r0 ; vlda wl1, [p0], #32 ; vldb wl5, [p1], #32 ; vst amhl5, [p4], #32 ; nopx ; nopm
120 vmac.f bml1, bml1, x2, x5, r0 ; vlda wh1, [p0], m0 ; vldb wh5, [p1], #32 ; vst amhh5, [p4], #32 ; nopx ; nopm
121 vmac.f bmh1, bmh1, x3, x5, r0 ; vlda amll5, [p3], #32 ; vldb wl6, [p1], #32 ; nops ; nopx ; nopm
122 vmac.f bml2, bml2, x2, x6, r0 ; vlda amlh5, [p3], #32 ; vldb wh6, [p1], #32 ; nops ; nopx ; nopm
123 vmac.f bmh2, bmh2, x3, x6, r0 ; vlda amhl5, [p2], #32 ; vldb wl7, [p1], #32 ; nops ; nopx ; nopm
124 vmac.f bml3, bml3, x2, x7, r0 ; vlda amhh5, [p2], #32 ; vldb wh7, [p1], m1 ; nops ; nopx ; nopm
125 vmac.f bmh3, bmh3, x3, x7, r0 ; vlda wl2, [p0], #32 ; vldb wl4, [p1], #32 ; vst amll6, [p5], #32 ; nopx ; nopm
126
127// k=16
128 vmac.f bml0, bml0, x0, x4, r0 ; vlda wh2, [p0], m7 ; vldb wh4, [p1], #32 ; vst amlh6, [p5], #32 ; nopx ; nopm
129 vmac.f bmh0, bmh0, x1, x4, r0 ; vlda wl3, [p0], #32 ; vldb wl5, [p1], #32 ; vst amhl6, [p4], #32 ; nopx ; nopm
130 vmac.f bml1, bml1, x0, x5, r0 ; vlda wh3, [p0], m3 ; vldb wh5, [p1], #32 ; vst amhh6, [p4], #32 ; nopx ; nopm
131 vmac.f bmh1, bmh1, x1, x5, r0 ; vlda amll6, [p3], #32 ; vldb wl6, [p1], #32 ; nops ; nopx ; nopm
132 vmac.f bml2, bml2, x0, x6, r0 ; vlda amlh6, [p3], #32 ; vldb wh6, [p1], #32 ; nops ; nopx ; nopm
133 vmac.f bmh2, bmh2, x1, x6, r0 ; vlda amhl6, [p2], #32 ; vldb wl7, [p1], #32 ; nops ; nopx ; nopm
134 vmac.f bml3, bml3, x0, x7, r0 ; vlda amhh6, [p2], #32 ; vldb wh7, [p1], m4 ; nops ; nopx ; nopm
135 vmac.f bmh3, bmh3, x1, x7, r0 ; vlda wl0, [p0], #32 ; vldb wl4, [p1], #32 ; vst amll7, [p5], #32 ; nopx ; nopm
136
137// k=24
138 vmac.f bml0, bml0, x2, x4, r0 ; vlda wh0, [p0], m7 ; vldb wh4, [p1], #32 ; vst amlh7, [p5], m6 ; add r1, r1, #1 ; nopm
139 vmac.f bmh0, bmh0, x3, x4, r0 ; vlda wl1, [p0], #32 ; vldb wl5, [p1], #32 ; vst amhl7, [p4], #32 ; ltu r27, r1, r2 ; nopm
140 vmac.f bml1, bml1, x2, x5, r0 ; vlda wh1, [p0], m0 ; vldb wh5, [p1], #32 ; vst amhh7, [p4], m6 ; sel.nez r28, r4, r7, r27 ; mov m6, m5
141 vmac.f bmh1, bmh1, x3, x5, r0 ; vlda amll7, [p3], #32 ; vldb wl6, [p1], #32 ; nops ; sel.nez r28, r5, r6, r27 ; mov m5, r28
142 vmac.f bml2, bml2, x2, x6, r0 ; vlda amlh7, [p3], m5 ; vldb wh6, [p1], #32 ; nops ; sel.nez r28, r3, r4, r27 ; mov m4, r28
143 vmac.f bmh2, bmh2, x3, x6, r0 ; vlda amhl7, [p2], #32 ; vldb wl7, [p1], #32 ; nops ; mul r1, r1, r27 ; mov m3, r28
144 vmac.f bml3, bml3, x2, x7, r0 ; vlda amhh7, [p2], m5 ; vldb wh7, [p1], m1 ; nops ; nopx ; nopm
145 .p2align 4
146 .l_end:
147 vmac.f bmh3, bmh3, x3, x7, r0 ; vlda wl2, [p0], #32 ; vldb wl4, [p1], #32 ; vst amll0, [p5], #32 ; nopx ; nopm
148// k=32
Listing 8 shows the body of the tensor contraction kernel’s loop. In the first half of the body (lines 67–104), the values in accumulation registers with indices 4–7 are updated by the VMAC.F operations. When entering the loop body, accumulation registers 0–3 hold the results of the preceding 2×4 block of output tiles. During execution, these are written to scratchpad memory (L1) using VST operations. Simultaneously, load unit A transfers the next 2×4 block’s tiles into registers 0–3.
The second half of the loop body (lines 108–147) computes the pre-loaded 2×4 block and updates the output tiles in registers 0–3. At the same time, the data of the now preceding 2×4 block, computed in the first half of the loop body, is written to memory, while the next block is loaded into registers 4–7.
Considering the XDNA1 vector unit, we see that every instruction in the loop body contains a VMAC.F operation. Therefore, the unit is fully utilized and all of the 64 instructions in the loop body perform a BF16 4×8×4 matrix multiplication.
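The alternation between the two halves of the accumulation registers across the three kernel parts can be summarized as a schedule. The sketch below is a simplified model, with the hypothetical helper `block_schedule`, that lists for every 2×4 block which half is updated by VMAC.F operations and which half is preloaded at the same time; it assumes an even number of blocks, as in the example kernel with eight blocks.

```python
def block_schedule(num_blocks):
    """Return (block, phase, compute_half, preload_half) for each 2x4 block."""
    halves = ("0-3", "4-7")  # indices of the accumulation registers
    schedule = []
    for blk in range(num_blocks):
        if blk == 0:
            phase = "warm-up"
        elif blk == num_blocks - 1:
            phase = "cool-down"
        else:
            phase = "loop"
        compute = halves[blk % 2]
        # The last block has no successor, so nothing is preloaded.
        preload = halves[(blk + 1) % 2] if blk + 1 < num_blocks else None
        schedule.append((blk, phase, compute, preload))
    return schedule


for entry in block_schedule(8):
    print(entry)
```

Each loop iteration covers two consecutive entries of this schedule, which is why the loop body contains two 32-instruction halves.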
Cool-down Phase#
150// k=0
151 vmac.f bml4, bml4, x0, x4, r0 ; vlda wh2, [p0], m7 ; vldb wh4, [p1], #32 ; vst amlh0, [p5], #32 ; nopx ; nopm
152 vmac.f bmh4, bmh4, x1, x4, r0 ; vlda wl3, [p0], #32 ; vldb wl5, [p1], #32 ; vst amhl0, [p4], #32 ; nopx ; nopm
153 vmac.f bml5, bml5, x0, x5, r0 ; vlda wh3, [p0], m0 ; vldb wh5, [p1], #32 ; vst amhh0, [p4], #32 ; nopx ; nopm
154 vmac.f bmh5, bmh5, x1, x5, r0 ; nopa ; vldb wl6, [p1], #32 ; nops ; nopx ; nopm
155 vmac.f bml6, bml6, x0, x6, r0 ; nopa ; vldb wh6, [p1], #32 ; nops ; nopx ; nopm
156 vmac.f bmh6, bmh6, x1, x6, r0 ; nopa ; vldb wl7, [p1], #32 ; nops ; nopx ; nopm
157 vmac.f bml7, bml7, x0, x7, r0 ; nopa ; vldb wh7, [p1], m1 ; nops ; nopx ; nopm
158 vmac.f bmh7, bmh7, x1, x7, r0 ; vlda wl0, [p0], #32 ; vldb wl4, [p1], #32 ; vst amll1, [p5], #32 ; nopx ; nopm
159
160// k=8
161 vmac.f bml4, bml4, x2, x4, r0 ; vlda wh0, [p0], m7 ; vldb wh4, [p1], #32 ; vst amlh1, [p5], #32 ; nopx ; nopm
162 vmac.f bmh4, bmh4, x3, x4, r0 ; vlda wl1, [p0], #32 ; vldb wl5, [p1], #32 ; vst amhl1, [p4], #32 ; nopx ; nopm
163 vmac.f bml5, bml5, x2, x5, r0 ; vlda wh1, [p0], m0 ; vldb wh5, [p1], #32 ; vst amhh1, [p4], #32 ; nopx ; nopm
164 vmac.f bmh5, bmh5, x3, x5, r0 ; nopa ; vldb wl6, [p1], #32 ; nops ; nopx ; nopm
165 vmac.f bml6, bml6, x2, x6, r0 ; nopa ; vldb wh6, [p1], #32 ; nops ; nopx ; nopm
166 vmac.f bmh6, bmh6, x3, x6, r0 ; nopa ; vldb wl7, [p1], #32 ; nops ; nopx ; nopm
167 vmac.f bml7, bml7, x2, x7, r0 ; nopa ; vldb wh7, [p1], m1 ; nops ; nopx ; nopm
168 vmac.f bmh7, bmh7, x3, x7, r0 ; vlda wl2, [p0], #32 ; vldb wl4, [p1], #32 ; vst amll2, [p5], #32 ; nopx ; nopm
169
170// k=16
171 vmac.f bml4, bml4, x0, x4, r0 ; vlda wh2, [p0], m7 ; vldb wh4, [p1], #32 ; vst amlh2, [p5], #32 ; nopx ; nopm
172 vmac.f bmh4, bmh4, x1, x4, r0 ; vlda wl3, [p0], #32 ; vldb wl5, [p1], #32 ; vst amhl2, [p4], #32 ; nopx ; nopm
173 vmac.f bml5, bml5, x0, x5, r0 ; vlda wh3, [p0, #0] ; vldb wh5, [p1], #32 ; vst amhh2, [p4], #32 ; nopx ; nopm
174 vmac.f bmh5, bmh5, x1, x5, r0 ; nopa ; vldb wl6, [p1], #32 ; nops ; nopx ; nopm
175 vmac.f bml6, bml6, x0, x6, r0 ; nopa ; vldb wh6, [p1], #32 ; nops ; nopx ; nopm
176 vmac.f bmh6, bmh6, x1, x6, r0 ; nopa ; vldb wl7, [p1], #32 ; nops ; nopx ; nopm
177 vmac.f bml7, bml7, x0, x7, r0 ; nopa ; vldb wh7, [p1, #0] ; nops ; nopx ; nopm
178 vmac.f bmh7, bmh7, x1, x7, r0 ; nopa ; nopb ; vst amll3, [p5], #32 ; nopx ; nopm
179
180// k=24
181 vmac.f bml4, bml4, x2, x4, r0 ; nopa ; nopb ; vst amlh3, [p5], m6 ; nopx ; nopm
182 vmac.f bmh4, bmh4, x3, x4, r0 ; nopa ; nopb ; vst amhl3, [p4], #32 ; nopx ; nopm
183 vmac.f bml5, bml5, x2, x5, r0 ; nopa ; nopb ; vst amhh3, [p4], m6 ; nopx ; nopm
184 vmac.f bmh5, bmh5, x3, x5, r0 ; nopa ; nopb ; nops ; nopx ; nopm
185 vmac.f bml6, bml6, x2, x6, r0 ; nopa ; nopb ; nops ; nopx ; nopm
186 vmac.f bmh6, bmh6, x3, x6, r0 ; nopa ; nopb ; nops ; nopx ; nopm
187 vmac.f bml7, bml7, x2, x7, r0 ; nopa ; nopb ; vst amll4, [p5], #32 ; nopx ; nopm
188 vmac.f bmh7, bmh7, x3, x7, r0 ; nopa ; nopb ; vst amlh4, [p5], #32 ; nopx ; nopm
189// k=32
190
191 nopv ; nopa ; nopb ; vst amhl4, [p4], #32 ; nopx ; nopm
192 nopv ; nopa ; nopb ; vst amhh4, [p4], #32 ; nopx ; nopm
193 nopv ; nopa ; nopb ; vst amll5, [p5], #32 ; nopx ; nopm
194 nopv ; nopa ; nopb ; vst amlh5, [p5], #32 ; nopx ; nopm
195 nopv ; nopa ; nopb ; vst amhl5, [p4], #32 ; nopx ; nopm
196 nopv ; nopa ; nopb ; vst amhh5, [p4], #32 ; nopx ; nopm
197 nopv ; nopa ; nopb ; vst amll6, [p5], #32 ; nopx ; nopm
198 nopv ; nopa ; nopb ; vst amlh6, [p5], #32 ; nopx ; nopm
199 nopv ; nopa ; nopb ; vst amhl6, [p4], #32 ; nopx ; nopm
200 nopv ; nopa ; nopb ; vst amhh6, [p4], #32 ; ret lr ; nopm
201 nopv ; nopa ; nopb ; vst amll7, [p5], #32 ; nopx ; nopm // Delay Slot 5
202 nopv ; nopa ; nopb ; vst amlh7, [p5, #0] ; nopx ; nopm // Delay Slot 4
203 nopv ; nopa ; nopb ; vst amhl7, [p4], #32 ; nopx ; nopm // Delay Slot 3
204 nopv ; nopa ; nopb ; vst amhh7, [p4, #0] ; nopx ; nopm // Delay Slot 2
205 nopv ; nopa ; nopb ; nops ; nopx ; nopm // Delay Slot 1
The cool-down phase is shown in Listing 9.
It differs from the loop body in two key ways.
First, no preloading of the next 2×4 block of output tiles is required.
Second, most of the VST operations writing the last block in accumulation registers 4–7 are exposed, meaning they cannot be hidden behind VMAC.F operations.
In line 200, the ret lr operation is issued, which has a six-cycle latency; the five instructions that follow (lines 201–205) fill its delay slots.
The cool-down phase contains a total of 47 instructions out of which 32 contain VMAC.F operations.
Kernel Efficiency#
Our XDNA1 tensor contraction kernel has the following utilization of the vector unit in the three parts:
- Warm-up phase: 32 out of 49 instructions contain VMAC.F operations.
- Hardware loop: All 64 instructions in the loop body contain VMAC.F operations. The loop executes three times, yielding 192 instructions containing VMAC.F operations.
- Cool-down phase: 32 out of 47 instructions contain VMAC.F operations.
In summary, the kernel executes 288 instructions, out of which 256 contain VMAC.F operations. This leads to a theoretical utilization of 89%. In other words, a compute tile running at 1.8 GHz would execute 1.6×10⁹ BF16 4×8×4 operations per second. This is equivalent to a theoretical floating-point throughput of 410 BF16 GFLOPS.
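The utilization and throughput figures can be reproduced with a few lines of arithmetic. The sketch below uses only the instruction counts and the 1.8 GHz clock stated above; it assumes that one 4×8×4 operation counts as 2·4·8·4 floating-point operations (one multiply and one add per term), which is consistent with the quoted 410 GFLOPS.

```python
# (instructions, instructions containing VMAC.F) per kernel part
warmup = (49, 32)
loop_body = (64, 64)
cooldown = (47, 32)
iterations = 3

total = warmup[0] + iterations * loop_body[0] + cooldown[0]
with_vmac = warmup[1] + iterations * loop_body[1] + cooldown[1]
utilization = with_vmac / total

clock_hz = 1.8e9
flops_per_vmac = 2 * 4 * 8 * 4  # assumption: multiply and add counted separately
vmacs_per_second = utilization * clock_hz
gflops = vmacs_per_second * flops_per_vmac / 1e9

print(f"{with_vmac}/{total} instructions, utilization {utilization:.1%}")
print(f"{vmacs_per_second:.2e} VMAC.F/s, {gflops:.1f} BF16 GFLOPS")
```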
We implemented a benchmark in which the tensor contraction kernel is called repeatedly in a loop on the NPU. Benchmarking the kernel on an XDNA1 NPU (AMD Ryzen 7 8700G), we achieved 398 BF16 GFLOPS, roughly 97% of the theoretical 410 BF16 GFLOPS. The benchmarking code is available from our xdna repository. To run the benchmark, execute the following commands:
git clone https://github.com/scalable-analyses/xdna
cd xdna
make run
Note
The installation of the MLIR-AIE compiler aiecc and Peano is documented in the mlir-aie repository.
The Makefile assumes that the environment variable PEANO_INSTALL_DIR contains the path to Peano and that aiecc.py is available in the path.
Use xrt-smi configure --pmode turbo to set the NPU clock to its maximum frequency.