XDNA2 Kernel#
While all optimization objectives stated in the XDNA1 Kernel chapter remain valid, transferring BFP16 data is more complex than transferring BF16 data. Therefore, we use a different L1 data layout to account for the pipelined BFP16 loads and stores described in the XDNA2 ISA section.
L1 Data Layout#
Fig. 3 Data layout of the tensor contraction kernel in the L1 scratchpad memory of an executing compute tile.
The kernel computes [m2,k1,m1,m0,k0],[n2,k1,n1,n0,k0]->[m1,m2,n2,n1,m0,n0], with fixed sizes |m0|=8, |k0|=8, |n0|=8, |m1|=2, and |n1|=2.
The first computation of a 2×2 block of output tiles is highlighted in darker colors.#
Fig. 3 illustrates the data layout of the BFP16 tensor contraction kernel with FP32 accumulation: [m2,k1,m1,m0,k0],[n2,k1,n1,n0,k0]->[m1,m2,n2,n1,m0,n0].
The tensors are tiled based on the BFP16 VMAC.F operation requirements, meaning that all three tensors use 8×8 tiles.
In detail, in0 has tiles of size |m0|×|k0|, in1 of size |n0|×|k0|, and out of size |m0|×|n0|, where |m0|=8, |k0|=8, and |n0|=8.
A BFP16 input tile, that is, a tile of either in0 or in1, has a size of 72 bytes.
An FP32 tile of out has a size of 256 bytes.
Given that XDNA2 has only five accumulation registers, the kernel employs a 2×2 register blocking for the output tensor.
This blocking is described by the dimensions m1 and n1 with |m1|=2 and |n1|=2 in the einsum.
Fig. 3 shows the accumulator blocking in gray, using two accumulation registers (DM0 and DM1), each accumulating two tiles.
The kernel design section gives a detailed explanation of the mapping of four output tiles to two accumulation registers.
The L1 data layout is more complex due to potential bank conflicts.
Specifically, XDNA2 has a total of four logical memory banks.
We assign one of the banks to in0 and another one to in1.
This leaves two banks for loading and storing the output tensor out.
To fully utilize both banks, our L1 data layout evenly distributes the output tensor across the two banks.
Accordingly, m1 is the outermost dimension of the output tensor, where the slice with m1=0 is placed on one bank and the slice with m1=1 on the other.
The sizes of the remaining dimensions m2, n2, and k1 are flexible but must be chosen so that the tensors fit into the L1 scratchpad.
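As a plain reference for the computation and the tile footprints, the following NumPy sketch evaluates the einsum with the fixed tile and blocking sizes. The sizes of m2, k1, and n2 are free; the values below match the example implementation discussed later. FP32 stands in for the BFP16 inputs, since NumPy has no block-floating-point type.

import numpy as np

# Tile sizes fixed by the BFP16 VMAC.F operation and the 2x2 register blocking.
M0, K0, N0 = 8, 8, 8
M1, N1 = 2, 2
# Free sizes; the example implementation later chooses these values.
M2, K1, N2 = 4, 12, 4

# Reference contraction [m2,k1,m1,m0,k0],[n2,k1,n1,n0,k0]->[m1,m2,n2,n1,m0,n0].
in0 = np.random.rand(M2, K1, M1, M0, K0).astype(np.float32)
in1 = np.random.rand(N2, K1, N1, N0, K0).astype(np.float32)
out = np.einsum("abcde,fbghe->cafgdh", in0, in1)
assert out.shape == (M1, M2, N2, N1, M0, N0)

# Tile footprints: an 8x8 BFP16 tile holds eight blocks of eight 8-bit mantissas
# plus one shared 8-bit exponent per block; an 8x8 FP32 tile holds 64 4-byte values.
bfp16_tile_bytes = M0 * K0 + M0   # 72 bytes
fp32_tile_bytes = M0 * N0 * 4     # 256 bytes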
Design#
As discussed in the previous section, our L1 data layout can be expressed by the einsum [m2,k1,m1,m0,k0],[n2,k1,n1,n0,k0]->[m1,m2,n2,n1,m0,n0] with the fixed sizes |m0|=8, |k0|=8, |n0|=8, |m1|=2, and |n1|=2.
Similar to the XDNA1 kernel, the VMAC.F operation consumes the dimensions m0, k0, and n0.
Dimensions m1 and n1 represent our accumulator blocking.
The sizes of m2, k1, and n2 remain to be selected.
Before discussing the design decisions of our XDNA2 tensor contraction kernel, we summarize key hardware properties that must be considered:
1. Loading data into a 256-byte accumulation register requires four 64-byte loads. VLDA can load directly into the accumulation registers. However, VLDB can only load into vector registers. Each load has a latency of seven cycles. A VMOV operation copies 64 bytes from a vector register into an accumulation register and has a latency of two cycles.
2. VMAC.F reads from the accumulation register in its fourth cycle (forwarding).
3. Properties 1 and 2 allow us to formulate scheduling rules when a VMAC.F operation depends on loads to the accumulation register. Specifically, we can issue a VMAC.F operation at the earliest in the fifth cycle of the fourth 64-byte load. In other words, the last load and the dependent VMAC.F must be four cycles apart.
4. Loading a 72-byte vector register requires one VLDA.POP or VLDB.POP from a pre-filled load pipeline. This load has an 8-cycle latency.
5. A VMAC.F operation has a latency of six cycles.
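The scheduling rule derived from properties 1 and 2 boils down to a small piece of timing arithmetic. The sketch below is only an illustration of the stated latencies; it assumes that a load's data becomes readable once its seven latency cycles have elapsed.

def earliest_vmac_issue(last_load_issue, load_latency=7, vmac_read_cycle=4):
    # The last 64-byte load delivers its data after `load_latency` cycles have
    # elapsed; the dependent VMAC.F reads the accumulation register in its
    # `vmac_read_cycle`-th cycle (forwarding).
    data_ready = last_load_issue + load_latency
    return data_ready - (vmac_read_cycle - 1)

# Last load issues in cycle 1 -> the dependent VMAC.F may issue in cycle 5,
# i.e., in the fifth cycle of the last load, four cycles after it.
assert earliest_vmac_issue(1) == 5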
- Output Stationary
See Output Stationary in the XDNA1 kernel design.
- Register Blocking
Dimensions m1 and n1 are handled by a 2×2 register blocking; therefore, |m1|=|n1|=2.
- Streaming of Input Tiles
The L1 data layout stores the input tiles of in0 in k1×m1 blocks and the tiles of in1 in k1×n1 blocks. Thus, for the computation of a 2×2 block of output tiles, we can use one linear stream (using VLDA) to load the data of in0 and one (using VLDB) to load the data of in1. The highest performance is obtained when consecutively loading eight 72-byte tiles. Additionally, the loads and stores must be 64-byte aligned.
- Bank-Aware Buffer Placement
We split the output tensor into two parts along the m1 dimension, allowing for simultaneous access to both halves without bank conflict. In detail, we place the output tensor in a buffer spanning two consecutive memory banks. The first half of the buffer is aligned to the end of the first memory bank, and the second half is aligned to the start of the second bank, forming one contiguous buffer. The implementation passes two pointers for out, one per half.
- Linear Contraction Dimension
As in the XDNA1 kernel, the k1 dimension is handled with linear code.
- Accumulation Chains
XDNA2 has only five 2048-bit accumulation registers. Therefore, we cannot use a simple double-buffering scheme for the 2×2 register blocking, which would require eight registers.
However, the 6-cycle VMAC.F operation only requires valid data in the accumulation register when it reads from the register in the fourth cycle. This means that the operation writing the required data to the accumulation register must complete in the third cycle of VMAC.F. At all other times, the register may hold unrelated data. Assuming that we write the accumulation data for the VMAC.F in the third cycle, we can:
- Write unrelated data to the accumulation register in cycles 1–2 and 4–5.
- Read unrelated data from the accumulation register in cycles 1–3 and 5–6.
Thus, VMAC.F operations accumulating in a single output tile must be scheduled at least three cycles apart:
[-- --] [-- --] [-- --] [R0 --] [-- --] [-- W0] [R1 --] [-- --]
   1       2       3       4       5       6       7       8
[-- W1] [R2 --] [-- --] [-- W2] [R3 --] [-- --] [-- W3] [R4 --]
   9      10      11      12      13      14      15      16
Here, R0 is the accumulation register read of the first VMAC.F operation, and R1–R4 are the reads of the following four operations. Correspondingly, W0–W3 are the writes.
In practice, our kernel schedules the VMAC.F accumulation chains with a distance of four cycles:
[-- --] [-- --] [-- --] [R0 --] [-- --] [-- W0] [-- --] [R1 --]
   1       2       3       4       5       6       7       8
[-- --] [-- W1] [-- --] [R2 --] [-- --] [-- W2] [-- --] [R3 --]
   9      10      11      12      13      14      15      16
The four-cycle distance allows us to interleave two accumulation chains. The second chain has a two-cycle offset:
[-- --] [-- --] [-- --] [-- --] [-- --] [Q0 --] [-- --] [-- P0]
   1       2       3       4       5       6       7       8
[-- --] [Q1 --] [-- --] [-- P1] [-- --] [Q2 --] [-- --] [-- P2]
   9      10      11      12      13      14      15      16
This allows us to use the same accumulation register for both chains:
[-- --] [-- --] [-- --] [R0 --] [-- --] [Q0 W0] [-- --] [R1 P0]
   1       2       3       4       5       6       7       8
[-- --] [Q1 W1] [-- --] [R2 P1] [-- --] [Q2 W2] [-- --] [R3 P2]
   9      10      11      12      13      14      15      16
We see that we obtain an execution throughput of 1/2 after cycle 6, that is, in every other cycle a VMAC.F operation completes. We obtain an execution throughput of 1 by scheduling two additional interleaved chains:
[-- --] [-- --] [-- --] [-- --] [S0 --] [-- --] [U0 T0] [-- --]
   1       2       3       4       5       6       7       8
[S1 V0] [-- --] [U1 T1] [-- --] [S2 V1] [-- --] [U2 T2] [-- --]
   9      10      11      12      13      14      15      16
These two interleaved chains have a one-cycle offset compared to the previous two, meaning that we effectively obtain an execution throughput of 1 after cycle 6.
In summary, to fully utilize the vector unit, our kernel requires two accumulation registers and four accumulation chains. We use the remaining three accumulation registers for hiding the L1-register transfers.
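A short simulation of this interleaved schedule is sketched in the code example after this list.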
- Single Hardware Loop
The m2 and n2 iterations are implemented using a single hardware loop. As in the XDNA1 kernel, the first and last 2×2 blocks of output tiles are computed outside of this loop, forming a warm-up and a cool-down phase.
- Accumulation Register Buffering
To summarize, given the einsum [m2,k1,m1,m0,k0],[n2,k1,n1,n0,k0]->[m1,m2,n2,n1,m0,n0] and the data layout shown in Fig. 3, m0, k0 and n0 are consumed by the BFP16 VMAC.F operations. m1 and n1 are used for our 2×2 accumulator blocking. Dimension k1 is consumed by linear code, meaning that we explicitly write all instructions in code. This is analogous to unrolling optimizations in a loop-based optimization context. Since k0 and k1 are the only contraction dimensions, after performing all k1 updates, we have fully computed a 2×2 block of output tiles. Next, we advance the last two remaining dimensions, n2 and m2, through a fused hardware loop, where n2 is the faster dimension.
When advancing from one 2×2 block of output tiles i to the next (i+1), we must write i to the L1 scratchpad and load i+1. In general, the tensor contraction kernel writes the data of block i during the computation of block i+1 and reads the accumulator input for i+1 during the computation of i. This approach allows us to hide L1-register transfers behind VMAC.F operations.
We realize this scheme by using the remaining three accumulation registers DM2–DM4 for buffering the loads and stores. In detail, we issue the loads for the four output tiles of i+1 while computing i. This data is buffered in registers DM2–DM4. DM2 is used for buffering two output tiles. Furthermore, we must use 16 VST operations to store the four output tiles of block i. For this we also use registers DM2–DM4 as buffers and execute the 16 VST operations while computing i+1. In this case DM2 is also used for buffering two output tiles.
A single iteration j of the fused hardware loop mostly contains VMAC.F operations that compute a single block i of output tiles. However, at the beginning of the iteration two operations finish the computation of block i-1, while at the end of the iteration two operations already start computing block i+1. In summary, the schedule for these four VMAC.F operations and the accumulation register usage in iteration j is as follows:
(a) the first two VMAC.F operations finish computing block i-1 and write to DM4 and DM2;
(b) the following two VMAC.F operations begin updating block i and read from DM4 and DM2, where the corresponding VLDA and VLDB operations have been issued in iteration j-1;
(c) the fourth-last and third-last VMAC.F operations update block i and write to registers DM2 and DM3 so that the corresponding stores can be issued in iteration j+1;
(d) the last two VMAC.F operations begin updating block i+1 and read from DM2 and DM3.
A VMAC.F operation reads from the source accumulation register in its fourth cycle and writes to the destination in its sixth cycle. By scheduling the two VMAC.F operations in (a) at most two cycles before their respective counterparts in (b), the reads of the operations in (b) are executed before the writes of the operations in (a). The kernel follows an analogous approach for the usage of registers DM2 and DM3 by the VMAC.F operations in (c) and (d). In this case the writes of the VMAC.F operations in (c) are performed after the reads of the operations in (d).
- Output Tile Prioritization
As discussed in the paragraph Single Hardware Loop, the tensor contraction kernel ends with a specialized cool-down phase. During this phase, the kernel computes the last 2×2 block of output tiles. Compared to a “regular” iteration of the hardware loop, we cannot hide the final store operations behind subsequent loop iterations. Thus, we prioritize the computation of one output tile so that the register-L1 transfers are at least partially hidden behind the VMAC.F operations of the cool-down phase.
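Returning to the accumulation chains: the interleaved schedule from the cycle diagrams can be checked mechanically. The following sketch encodes the issue cycles read off the diagrams (four chains, four operations each, four-cycle issue distance, chain starts at cycles 1, 3, 2, and 4) and verifies that the chains together issue one VMAC.F per cycle and that no chain's partial sum is overwritten by the partner chain sharing its register before it is read again. The chain numbering and register names are illustrative only.

READ_CYCLE, WRITE_CYCLE = 4, 6  # a VMAC.F reads in its 4th and writes in its 6th cycle
OPS_PER_CHAIN, ISSUE_DISTANCE = 4, 4

# chain id -> (shared accumulation register, issue cycle of the chain's first VMAC.F)
chains = {0: ("DM0", 1), 1: ("DM0", 3), 2: ("DM1", 2), 3: ("DM1", 4)}

issues = [(first + k * ISSUE_DISTANCE, cid, reg)
          for cid, (reg, first) in chains.items()
          for k in range(OPS_PER_CHAIN)]

# One VMAC.F is issued in every cycle once all four chains are running.
cycles = sorted(c for c, _, _ in issues)
assert cycles == list(range(cycles[0], cycles[0] + len(cycles)))

# Within a chain, operation k+1 reads its register no earlier than operation k
# writes it, and the partner chain never writes that register strictly in between.
for cid, (reg, first) in chains.items():
    partner_writes = [c + WRITE_CYCLE - 1 for c, ocid, oreg in issues
                      if oreg == reg and ocid != cid]
    for k in range(OPS_PER_CHAIN - 1):
        write = first + k * ISSUE_DISTANCE + WRITE_CYCLE - 1
        read = first + (k + 1) * ISSUE_DISTANCE + READ_CYCLE - 1
        assert write <= read
        assert not any(write < w < read for w in partner_writes)
print("interleaved schedule is hazard-free")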
Implementation#
This section discusses the warm-up phase, hardware loop, and cool-down phase of a representative XDNA2 BFP16 tensor contraction kernel.
As described in sections L1 Data Layout and Design, the einsum [m2,k1,m1,m0,k0],[n2,k1,n1,n0,k0]->[m1,m2,n2,n1,m0,n0] with fixed sizes |m0|=|k0|=|n0|=8, |m1|=2, and |n1|=2 describes the tensor contraction kernel.
We choose |m2|=4, |k1|=12, and |n2|=4 in our example implementation.
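A quick plausibility check of these sizes: the resulting buffer footprints must fit into the compute tile's L1 scratchpad, and the hardware loop count must match the value set in the kernel's loop setup. The 64 KiB L1 capacity used below is an assumption for illustration.

M0 = K0 = N0 = 8
M1 = N1 = 2
M2, K1, N2 = 4, 12, 4
BFP16_TILE, FP32_TILE = 72, 256  # bytes per 8x8 BFP16 and FP32 tile

in0_bytes = M2 * K1 * M1 * BFP16_TILE      # 6,912 bytes
in1_bytes = N2 * K1 * N1 * BFP16_TILE      # 6,912 bytes
out_bytes = M1 * M2 * N2 * N1 * FP32_TILE  # 16,384 bytes
assert in0_bytes + in1_bytes + out_bytes <= 64 * 1024  # assumed L1 capacity

# The fused hardware loop covers all m2 x n2 blocks except the first and last
# one, which are handled by the warm-up and cool-down phases ("movxm lc, #14").
assert M2 * N2 - 2 == 14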
Warm-up Phase#
7// k=0
8// 0 // block B k
9 nopv ; vlda bmll0, [p2, #0 ] ; vldb x8, [p3, #0 ] ; movs p4, p2 ; movxm r4, #-576 / 8 * 12 * 2 // m1 - jumpback in IN0 (m dimension)
10 nopv ; vlda bmlh0, [p2, #64 ] ; vldb x9, [p3, #64 ] ; movs p5, p3 ; movxm r5, #-576 / 8 * 12 * 2 * 4 // n1 * n2 - jumpback in IN1 (complete tensor)
11 nopv ; vlda bmhl0, [p2, #128] ; vldb x10, [p3, #128] ; padds [p4], #256 ; movx r24, #0 ; mov r25, #0
12 nopv ; vlda bmhh0, [p2, #192] ; vldb x11, [p3, #192] ; padds [p5], #256 ; movx r0, #780 ; mov r1, #0
13
14// 4
15 nopv ; vlda bmhl1, [p4, #128] ; vldb x8, [p5], #64 ; nops ; movxm ls, #.l_start
16 nopv ; vlda bmhh1, [p4, #192] ; vldb x9, [p5], #64 ; nops ; movxm le, #.l_end
17 nopv ; vlda bmll1, [p4], #64 ; vldb x10, [p5], #64 ; nops ; movxm lc, #14 // 4(m2) * 4(n2) - 2
18 nopv ; vlda bmlh1, [p4], #192 ; vldb x11, [p5], #64 ; nops ; nopxm
19
20// 8
21 nopv ; vlda.fill.512 [p0, lf0, r24] ; vldb.fill.512 [p1, lf1, r25] ; nops ; nopx ; vmov bmll2, x8
22 nopv ; vlda.pop.576 ex0, [p0, lf0, r24] ; vldb.pop.576 ex1, [p1, lf1, r25] ; nops ; nopx ; vmov bmlh2, x9
23 nopv ; vlda.pop.576 ex2, [p0, lf0, r24] ; vldb.pop.576 ex3, [p1, lf1, r25] ; nops ; nopx ; vmov bmhl2, x10
24 nopv ; vlda.pop.576 ex4, [p0, lf0, r24] ; vldb.pop.576 ex5, [p1, lf1, r25] ; nops ; nopx ; vmov bmhh2, x11
25 nopv ; vlda.pop.576 ex6, [p0, lf0, r24] ; vldb.pop.576 ex7, [p1, lf1, r25] ; nops ; nopx ; vmov bmll3, x8
26 nopv ; vlda.pop.576 ex8, [p0, lf0, r24] ; vldb.pop.576 ex9, [p1, lf1, r25] ; nops ; nopx ; vmov bmlh3, x9
27 nopv ; vlda.pop.576 ex10, [p0, lf0, r24] ; vldb.pop.576 ex11, [p1, lf1, r25] ; nops ; nopx ; vmov bmhl3, x10
28 nopv ; vlda.pop.576 ex0, [p0, lf0, r24] ; vldb.pop.576 ex1, [p1, lf1, r25] ; nops ; nopx ; vmov bmhh3, x11
29 nopv ; vlda.pop.576 ex2, [p0, lf0, r24] ; vldb.pop.576 ex3, [p1, lf1, r25] ; nops ; nopxm
30
31// 17
32// k=0
33 vmac.f dm0, dm0, ex0, ex1, r0 ; nopa ; nopb ; nops ; nopxm
34 vmac.f dm1, dm1, ex2, ex1, r0 ; vlda bmll2, [p4], #64 ; vldb x4, [p5], #64 ; nops ; nopxm
35 vmac.f dm0, dm2, ex0, ex3, r0 ; vlda bmlh2, [p4], #64 ; vldb x5, [p5], #64 ; nops ; movx r3, #1 ; nopm
36 vmac.f dm1, dm3, ex2, ex3, r0 ; vlda bmll2, [p4], #64 ; vldb x6, [p5], #64 ; nops ; movx r2, #4 ; nopm
37
38// k=8
39 vmac.f dm0, dm0, ex4, ex5, r0 ; vlda bmlh2, [p4], #64 ; vldb x7, [p5], #64 ; nops ; nopxm
40 vmac.f dm1, dm1, ex6, ex5, r0 ; nopa ; nopb ; nops ; nopxm
41 vmac.f dm0, dm0, ex4, ex7, r0 ; nopa ; nopb ; nops ; nopxm
42 vmac.f dm1, dm1, ex6, ex7, r0 ; vlda.fill.512 [p0, lf0, r24] ; vldb.fill.512 [p1, lf1, r25] ; nops ; nopxm
43
44// k=16
45 vmac.f dm0, dm0, ex8, ex9, r0 ; vlda.pop.576 ex0, [p0, lf0, r24] ; vldb.pop.576 ex1, [p1, lf1, r25] ; nops ; nopx ; vmov bmll4, x4
46 vmac.f dm1, dm1, ex10, ex9, r0 ; vlda.pop.576 ex2, [p0, lf0, r24] ; vldb.pop.576 ex3, [p1, lf1, r25] ; nops ; nopx ; vmov bmlh4, x5
47 vmac.f dm0, dm0, ex8, ex11, r0 ; vlda.pop.576 ex4, [p0, lf0, r24] ; vldb.pop.576 ex5, [p1, lf1, r25] ; nops ; nopx ; vmov bmhl4, x6
48 vmac.f dm1, dm1, ex10, ex11, r0 ; vlda.pop.576 ex6, [p0, lf0, r24] ; vldb.pop.576 ex7, [p1, lf1, r25] ; nops ; nopx ; vmov bmhh4, x7
49
50// k=24
51 vmac.f dm0, dm0, ex0, ex1, r0 ; vlda.pop.576 ex8, [p0, lf0, r24] ; vldb.pop.576 ex9, [p1, lf1, r25] ; nops ; add r1, r1, r3 ; nopm
52 vmac.f dm1, dm1, ex2, ex1, r0 ; vlda.pop.576 ex10, [p0, lf0, r24] ; vldb.pop.576 ex11, [p1, lf1, r25] ; nops ; ltu r6, r1, r2 ; nopm
53 vmac.f dm0, dm0, ex0, ex3, r0 ; vlda.pop.576 ex0, [p0, lf0, r24] ; vldb.pop.576 ex1, [p1, lf1, r25] ; nops ; mul r1, r1, r6 ; nopm
54 vmac.f dm1, dm1, ex2, ex3, r0 ; vlda.pop.576 ex2, [p0, lf0, r24] ; vldb.pop.576 ex3, [p1, lf1, r25] ; nops ; sub r7, r3, r6 ; nopm
55
56// k=32
57 vmac.f dm0, dm0, ex0, ex1, r0 ; nopa ; nopb ; nops ; mul r6, r6, r4 ; nopm
58 vmac.f dm1, dm1, ex2, ex1, r0 ; vlda bmll3, [p4], #64 ; nopb ; nops ; mul r7, r7, r5 ; nopm
59 vmac.f dm0, dm0, ex0, ex3, r0 ; vlda bmlh3, [p4], #64 ; nopb ; nops ; nopx ; mov m0, r6
60 vmac.f dm1, dm1, ex2, ex3, r0 ; vlda bmhl3, [p4], #64 ; nopb ; nops ; nopx ; mov m1, r7
61
62// k=40
63 vmac.f dm0, dm0, ex4, ex5, r0 ; vlda bmhh3, [p4], #64 ; nopb ; nops ; nopxm
64 vmac.f dm1, dm1, ex6, ex5, r0 ; nopa ; nopb ; nops ; nopxm
65 vmac.f dm0, dm0, ex4, ex7, r0 ; nopa ; nopb ; nops ; nopxm
66 vmac.f dm1, dm1, ex6, ex7, r0 ; vlda.fill.512 [p0, lf0, r24] ; vldb.fill.512 [p1, lf1, r25] ; nops ; nopxm
67
68// k=48
69 vmac.f dm0, dm0, ex8, ex9, r0 ; vlda.pop.576 ex0, [p0, lf0, r24] ; vldb.pop.576 ex1, [p1, lf1, r25] ; nops ; nopxm
70 vmac.f dm1, dm1, ex10, ex9, r0 ; vlda.pop.576 ex2, [p0, lf0, r24] ; vldb.pop.576 ex3, [p1, lf1, r25] ; nops ; nopxm
71 vmac.f dm0, dm0, ex8, ex11, r0 ; vlda.pop.576 ex4, [p0, lf0, r24] ; vldb.pop.576 ex5, [p1, lf1, r25] ; nops ; nopxm
72 vmac.f dm1, dm1, ex10, ex11, r0 ; vlda.pop.576 ex6, [p0, lf0, r24] ; vldb.pop.576 ex7, [p1, lf1, r25] ; nops ; nopxm
73
74// k=56
75 vmac.f dm0, dm0, ex0, ex1, r0 ; vlda.pop.576 ex8, [p0, lf0, r24] ; vldb.pop.576 ex9, [p1, lf1, r25] ; nops ; nopxm
76 vmac.f dm1, dm1, ex2, ex1, r0 ; vlda.pop.576 ex10, [p0, lf0, r24] ; vldb.pop.576 ex11, [p1, lf1, r25] ; nops ; nopxm
77 vmac.f dm0, dm0, ex0, ex3, r0 ; vlda.pop.576 ex0, [p0, lf0, r24] ; vldb.pop.576 ex1, [p1, lf1, r25] ; nops ; nopxm
78 vmac.f dm1, dm1, ex2, ex3, r0 ; vlda.pop.576 ex2, [p0, lf0, r24] ; vldb.pop.576 ex3, [p1, lf1, r25] ; nops ; nopxm
79
80// k=64
81 vmac.f dm0, dm0, ex0, ex1, r0 ; nopa ; nopb ; nops ; nopxm
82 vmac.f dm1, dm1, ex2, ex1, r0 ; nopa ; vldb x6, [p5], #64 ; nops ; nopxm
83 vmac.f dm0, dm0, ex0, ex3, r0 ; nopa ; vldb x7, [p5], #64 ; nops ; nopxm
84 vmac.f dm1, dm1, ex2, ex3, r0 ; nopa ; nopb ; nops ; nopxm
85
86// k=72
87 vmac.f dm0, dm0, ex4, ex5, r0 ; padda [p0], m0 ; paddb [p1], m1 ; nops ; nopxm
88 vmac.f dm1, dm1, ex6, ex5, r0 ; vlda.fill.512 [p0, lf0, r24] ; vldb.fill.512 [p1, lf1, r25] ; nops ; nopxm
89 vmac.f dm0, dm0, ex4, ex7, r0 ; vlda.pop.576 ex8, [p0, lf0, r24] ; vldb.pop.576 ex9, [p1, lf1, r25] ; nops ; nopxm
90 vmac.f dm1, dm1, ex6, ex7, r0 ; vlda.pop.576 ex10, [p0, lf0, r24] ; vldb.pop.576 ex11, [p1, lf1, r25] ; nops ; nopxm
91
92// k=80
93 vmac.f dm0, dm0, ex8, ex9, r0 ; nopa ; nopb ; nops ; nopxm
94 vmac.f dm1, dm1, ex10, ex9, r0 ; nopa ; nopb ; nops ; nopxm
95 vmac.f dm0, dm0, ex8, ex11, r0 ; vlda.pop.576 ex4, [p0, lf0, r24] ; vldb.pop.576 ex5, [p1, lf1, r25] ; nops ; nopxm
96 vmac.f dm1, dm1, ex10, ex11, r0 ; vlda.pop.576 ex6, [p0, lf0, r24] ; vldb.pop.576 ex7, [p1, lf1, r25] ; nops ; nopxm
97
98// k=88_0
99 vmac.f dm2, dm0, ex0, ex1, r0 ; nopa ; nopb ; nops ; nopxm
100 vmac.f dm3, dm1, ex2, ex1, r0 ; nopa ; nopb ; nops ; nopxm
101// k=96_0
102
103// k=0_0
104 vmac.f dm0, dm2, ex8, ex9, r0 ; vlda bmhl2, [p5], #64 ; nopb ; nops ; nopxm
105 vmac.f dm1, dm3, ex10, ex9, r0 ; vlda bmhh2, [p5], #64 ; nopb ; nops ; nopxm
Listing 10 shows the kernel’s warm-up phase.
During the first eight cycles, the load operations for the first 2×2 block are issued.
Load unit A can write directly into the accumulation registers.
Load unit B can only write into scalar or vector registers.
Thus, we first load to vector registers using load unit B and then copy the data to accumulation registers.
We use accumulation registers DM0–DM3 for these “loads”.
The VMAC.F operations in lines 39–96 exclusively use DM0 and DM1 as their destination registers.
Thus, registers DM2–DM4 can be used to load the next 2×2 block of output tiles, required by the upcoming hardware loop.
We discuss this procedure in more detail as part of the next subsection.
Each instruction in lines 7–29 contains a NOPV operation, hence the vector unit is not used. All other instructions perform VMAC.F operations. In summary, 48 of the 65 instructions contain VMAC.F operations.
Hardware Loop#
15 movxm ls, #.l_start
16 movxm le, #.l_end
17 movxm lc, #14 // 4(m2) * 4(n2) - 2
The hardware loop setup follows the procedure outlined in the XDNA1 kernel. The respective configuration operations are shown in Listing 11.
107.p2align 4
108.l_start:
109// k=88_1
110 vmac.f dm4, dm0, ex0, ex3, r0 ; vlda.pop.576 ex0, [p0, lf0, r24] ; vldb.pop.576 ex1, [p1, lf1, r25] ; nops ; nopxm
111 vmac.f dm2, dm1, ex2, ex3, r0 ; vlda.pop.576 ex2, [p0, lf0, r24] ; vldb.pop.576 ex3, [p1, lf1, r25] ; nops ; nopx ; vmov bmll2, x6
112// k=96_1
113
114// k=0_1
115 vmac.f dm0, dm4, ex8, ex11, r0 ; vlda.pop.576 ex8, [p0, lf0, r24] ; vldb.pop.576 ex9, [p1, lf1, r25] ; vst bmll2, [p2], #64 ; nopx ; vmov bmlh2, x7
116 vmac.f dm1, dm2, ex10, ex11, r0 ; vlda.pop.576 ex10, [p0, lf0, r24] ; vldb.pop.576 ex11, [p1, lf1, r25] ; vst bmlh2, [p2], #64 ; nopxm
117
118// k=8
119 vmac.f dm0, dm0, ex4, ex5, r0 ; nopa ; nopb ; vst bmhl2, [p2], #64 ; add r1, r1, r3 ; nopm
120 vmac.f dm1, dm1, ex6, ex5, r0 ; nopa ; nopb ; vst bmhh2, [p2], #64 ; ltu r6, r1, r2 ; nopm
121 vmac.f dm0, dm0, ex4, ex7, r0 ; nopa ; nopb ; nops ; mul r1, r1, r6 ; nopm
122 vmac.f dm1, dm1, ex6, ex7, r0 ; vlda.fill.512 [p0, lf0, r24] ; vldb.fill.512 [p1, lf1, r25] ; nops ; sub r7, r3, r6 ; nopm
123
124// k=16
125 vmac.f dm0, dm0, ex0, ex1, r0 ; vlda.pop.576 ex0, [p0, lf0, r24] ; vldb.pop.576 ex1, [p1, lf1, r25] ; vst bmll2, [p3], #64 ; mul r6, r6, r4 ; nopm
126 vmac.f dm1, dm1, ex2, ex1, r0 ; vlda.pop.576 ex2, [p0, lf0, r24] ; vldb.pop.576 ex3, [p1, lf1, r25] ; vst bmlh2, [p3], #64 ; mul r7, r7, r5 ; nopm
127 vmac.f dm0, dm0, ex0, ex3, r0 ; vlda.pop.576 ex4, [p0, lf0, r24] ; vldb.pop.576 ex5, [p1, lf1, r25] ; vst bmhl2, [p3], #64 ; nopx ; mov m0, r6
128 vmac.f dm1, dm1, ex2, ex3, r0 ; vlda.pop.576 ex6, [p0, lf0, r24] ; vldb.pop.576 ex7, [p1, lf1, r25] ; vst bmhh2, [p3], #64 ; nopx ; mov m1, r7
129
130// k=24
131 vmac.f dm0, dm0, ex8, ex9, r0 ; vlda.pop.576 ex8, [p0, lf0, r24] ; vldb.pop.576 ex9, [p1, lf1, r25] ; vst bmll4, [p3], #64 ; nopxm
132 vmac.f dm1, dm1, ex10, ex9, r0 ; vlda.pop.576 ex10, [p0, lf0, r24] ; vldb.pop.576 ex11, [p1, lf1, r25] ; vst bmlh4, [p3], #64 ; nopxm
133 vmac.f dm0, dm0, ex8, ex11, r0 ; vlda.pop.576 ex0, [p0, lf0, r24] ; vldb.pop.576 ex1, [p1, lf1, r25] ; vst bmhl4, [p3], #64 ; nopxm
134 vmac.f dm1, dm1, ex10, ex11, r0 ; vlda.pop.576 ex2, [p0, lf0, r24] ; vldb.pop.576 ex3, [p1, lf1, r25] ; vst bmhh4, [p3], #64 ; nopxm
135
136// k=32
137 vmac.f dm0, dm0, ex0, ex1, r0 ; nopa ; nopb ; vst bmll3, [p2], #64 ; nopxm
138 vmac.f dm1, dm1, ex2, ex1, r0 ; nopa ; nopb ; vst bmlh3, [p2], #64 ; nopxm
139 vmac.f dm0, dm0, ex0, ex3, r0 ; vlda bmll2, [p4], #64 ; vldb x4, [p5], #64 ; nops ; nopxm
140 vmac.f dm1, dm1, ex2, ex3, r0 ; vlda bmlh2, [p4], #64 ; vldb x5, [p5], #64 ; nops ; nopxm
141
142// k=40
143 vmac.f dm0, dm0, ex4, ex5, r0 ; vlda bmhl2, [p4], #64 ; vldb x6, [p5], #64 ; nops ; nopxm
144 vmac.f dm1, dm1, ex6, ex5, r0 ; vlda bmhh2, [p4], #64 ; vldb x7, [p5], #64 ; nops ; nopxm
145 vmac.f dm0, dm0, ex4, ex7, r0 ; vlda bmll3, [p4], #64 ; nopb ; nops ; nopxm
146 vmac.f dm1, dm1, ex6, ex7, r0 ; vlda.fill.512 [p0, lf0, r24] ; vldb.fill.512 [p1, lf1, r25] ; nops ; nopxm
147
148// k=48
149 vmac.f dm0, dm0, ex8, ex9, r0 ; vlda.pop.576 ex0, [p0, lf0, r24] ; vldb.pop.576 ex1, [p1, lf1, r25] ; vst bmhl3, [p2], #64 ; nopxm
150 vmac.f dm1, dm1, ex10, ex9, r0 ; vlda.pop.576 ex2, [p0, lf0, r24] ; vldb.pop.576 ex3, [p1, lf1, r25] ; vst bmhh3, [p2], #64 ; nopxm
151 vmac.f dm0, dm0, ex8, ex11, r0 ; vlda.pop.576 ex4, [p0, lf0, r24] ; vldb.pop.576 ex5, [p1, lf1, r25] ; nops ; nopx ; vmov bmll4, x4
152 vmac.f dm1, dm1, ex10, ex11, r0 ; vlda.pop.576 ex6, [p0, lf0, r24] ; vldb.pop.576 ex7, [p1, lf1, r25] ; nops ; nopx ; vmov bmlh4, x5
153
154// k=56
155 vmac.f dm0, dm0, ex0, ex1, r0 ; vlda.pop.576 ex8, [p0, lf0, r24] ; vldb.pop.576 ex9, [p1, lf1, r25] ; nops ; nopx ; vmov bmhl4, x6
156 vmac.f dm1, dm1, ex2, ex1, r0 ; vlda.pop.576 ex10, [p0, lf0, r24] ; vldb.pop.576 ex11, [p1, lf1, r25] ; nops ; nopx ; vmov bmhh4, x7
157 vmac.f dm0, dm0, ex0, ex3, r0 ; vlda.pop.576 ex0, [p0, lf0, r24] ; vldb.pop.576 ex1, [p1, lf1, r25] ; nops ; nopxm
158 vmac.f dm1, dm1, ex2, ex3, r0 ; vlda.pop.576 ex2, [p0, lf0, r24] ; vldb.pop.576 ex3, [p1, lf1, r25] ; nops ; nopxm
159
160// k=64
161 vmac.f dm0, dm0, ex0, ex1, r0 ; nopa ; nopb ; nops ; nopxm
162 vmac.f dm1, dm1, ex2, ex1, r0 ; vlda bmlh3, [p4], #64 ; vldb x6, [p5], #64 ; nops ; nopxm
163 vmac.f dm0, dm0, ex0, ex3, r0 ; vlda bmhl3, [p4], #64 ; vldb x7, [p5], #64 ; nops ; nopxm
164 vmac.f dm1, dm1, ex2, ex3, r0 ; vlda bmhh3, [p4], #64 ; nopb ; nops ; nopxm
165
166// k=72
167 vmac.f dm0, dm0, ex4, ex5, r0 ; padda [p0], m0 ; paddb [p1], m1 ; nops ; nopxm
168 vmac.f dm1, dm1, ex6, ex5, r0 ; vlda.fill.512 [p0, lf0, r24] ; vldb.fill.512 [p1, lf1, r25] ; nops ; nopxm
169 vmac.f dm0, dm0, ex4, ex7, r0 ; vlda.pop.576 ex8, [p0, lf0, r24] ; vldb.pop.576 ex9, [p1, lf1, r25] ; nops ; nopxm
170 vmac.f dm1, dm1, ex6, ex7, r0 ; vlda.pop.576 ex10, [p0, lf0, r24] ; vldb.pop.576 ex11, [p1, lf1, r25] ; nops ; nopxm
171
172// k=80
173 vmac.f dm0, dm0, ex8, ex9, r0 ; nopa ; nopb ; nops ; nopxm
174 vmac.f dm1, dm1, ex10, ex9, r0 ; nopa ; nopb ; nops ; nopxm
175 vmac.f dm0, dm0, ex8, ex11, r0 ; vlda.pop.576 ex4, [p0, lf0, r24] ; vldb.pop.576 ex5, [p1, lf1, r25] ; nops ; nopxm
176 vmac.f dm1, dm1, ex10, ex11, r0 ; vlda.pop.576 ex6, [p0, lf0, r24] ; vldb.pop.576 ex7, [p1, lf1, r25] ; nops ; nopxm
177
178// k=88_0
179 vmac.f dm2, dm0, ex0, ex1, r0 ; nopa ; nopb ; nops ; nopxm
180 vmac.f dm3, dm1, ex2, ex1, r0 ; nopa ; nopb ; nops ; nopxm
181// k=96_0
182
183// k=0_0
184 vmac.f dm0, dm2, ex8, ex9, r0 ; vlda bmhl2, [p5], #64 ; nopb ; nops ; nopxm
185.p2align 4
186.l_end:
187 vmac.f dm1, dm3, ex10, ex9, r0 ; vlda bmhh2, [p5], #64 ; nopb ; nops ; nopxm
Listing 12 shows the body of the hardware loop in the XDNA2 tensor contraction kernel. Lines 110 and 111 contain the two VMAC.F operations discussed as (a) in the accumulation register buffering part of the kernel design. (b) is given by lines 115 and 116, (c) by lines 179 and 180, and (d) by lines 184 and 187.
Assuming that the hardware loop currently performs iteration j, the VLDA and VLDB operations in lines 139–145 and 162–164 load three output tiles of the next 2×2 block.
The VLDA operations directly load into DM2 and DM3, while the VLDB operations first load into vector registers.
The data in the vector registers is then copied to DM4.
The two VLDB operations in lines 162 and 163 load the first 128 bytes of the fourth 256-byte output tile to vector registers X6 and X7.
This data is ultimately used by the VMAC.F operation in line 116 when performing iteration j+1.
The corresponding copies from X6 and X7 to the lower half of accumulation register DM2 are done by the two VMOV operations in lines 111 and 115.
The remaining upper 128 bytes of the fourth tile (DM2) are loaded by the two VLDA operations in lines 184 and 187.
Note that the L1-to-DM2 transfers of the fourth tile are scheduled to fulfill two conditions:
First, they are scheduled so that the first output tile in DM2 of the current 2×2 block has been written to L1. The last update to the first tile is performed in the VMAC.F operation in line 179. This tile is stored in lines 115–120 of the next iteration.
Second, they are scheduled so that the fourth output tile of the current 2×2 block is stored in DM2 before the VMAC.F operation in line 116 reads it.
In summary, each of the 48 instructions in the loop body contains a VMAC.F operation. Thus, the vector unit is fully utilized.
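Stripped of the instruction-level scheduling, a loop iteration follows a simple block-level pipeline: store the tiles of the previous block from the buffer registers, accumulate the current block in DM0/DM1, and issue the loads of the next block into DM2–DM4. The sketch below only illustrates this ordering; it is not a model of the actual VLIW bundles.

def loop_schedule(num_blocks):
    # Block-level view of the pipelined hardware loop (illustrative only):
    # the first and last 2x2 block are handled by warm-up and cool-down.
    steps = []
    for i in range(1, num_blocks - 1):
        steps.append(f"store block {i - 1} from DM2-DM4 (16 VST operations)")
        steps.append(f"accumulate block {i} in DM0/DM1 (k1 VMAC.F chain)")
        steps.append(f"load block {i + 1} into DM2-DM4 (VLDA/VLDB + VMOV)")
    return steps

# 4 (m2) * 4 (n2) blocks -> 14 loop iterations, as configured in Listing 11.
assert len(loop_schedule(16)) == 3 * 14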
Cool-down Phase#
189// k=88_1
190 vmac.f dm4, dm0, ex0, ex3, r0 ; vlda.pop.576 ex0, [p0, lf0, r24] ; vldb.pop.576 ex1, [p1, lf1, r25] ; nops ; nopxm
191 vmac.f dm2, dm1, ex2, ex3, r0 ; vlda.pop.576 ex2, [p0, lf0, r24] ; vldb.pop.576 ex3, [p1, lf1, r25] ; nops ; nopx ; vmov bmll2, x6
192// k=96_1
193
194// k=0_1
195 vmac.f dm0, dm4, ex8, ex11, r0 ; nopa ; nopb ; vst bmll2, [p2], #64 ; nopx ; vmov bmlh2, x7
196 vmac.f dm1, dm2, ex10, ex11, r0 ; nopa ; nopb ; vst bmlh2, [p2], #64 ; nopxm
197
198// k=8
199 vmac.f dm0, dm0, ex4, ex5, r0 ; vlda.pop.576 ex0, [p0, lf0, r24] ; vldb.pop.576 ex1, [p1, lf1, r25] ; vst bmhl2, [p2], #64 ; nopxm
200 vmac.f dm1, dm1, ex6, ex5, r0 ; vlda.pop.576 ex2, [p0, lf0, r24] ; vldb.pop.576 ex3, [p1, lf1, r25] ; vst bmhh2, [p2], #64 ; nopxm
201 vmac.f dm0, dm0, ex4, ex7, r0 ; nopa ; nopb ; nops ; nopxm
202 vmac.f dm1, dm1, ex6, ex7, r0 ; vlda.fill.512 [p0, lf0, r24] ; vldb.fill.512 [p1, lf1, r25] ; nops ; nopxm
203
204// k=16
205 vmac.f dm0, dm0, ex0, ex1, r0 ; vlda.pop.576 ex0, [p0, lf0, r24] ; vldb.pop.576 ex1, [p1, lf1, r25] ; vst bmll2, [p3], #64 ; nopxm
206 vmac.f dm1, dm1, ex2, ex1, r0 ; vlda.pop.576 ex2, [p0, lf0, r24] ; vldb.pop.576 ex3, [p1, lf1, r25] ; vst bmlh2, [p3], #64 ; nopxm
207 vmac.f dm0, dm0, ex0, ex3, r0 ; nopa ; nopb ; vst bmhl2, [p3], #64 ; nopxm
208 vmac.f dm1, dm1, ex2, ex3, r0 ; nopa ; nopb ; vst bmhh2, [p3], #64 ; nopxm
209
210// k=24
211 vmac.f dm0, dm0, ex0, ex1, r0 ; vlda.pop.576 ex0, [p0, lf0, r24] ; vldb.pop.576 ex1, [p1, lf1, r25] ; vst bmll4, [p3], #64 ; nopxm
212 vmac.f dm1, dm1, ex2, ex1, r0 ; vlda.pop.576 ex2, [p0, lf0, r24] ; vldb.pop.576 ex3, [p1, lf1, r25] ; vst bmlh4, [p3], #64 ; nopxm
213 vmac.f dm0, dm0, ex0, ex3, r0 ; nopa ; nopb ; vst bmhl4, [p3], #64 ; nopxm
214 vmac.f dm1, dm1, ex2, ex3, r0 ; nopa ; nopb ; vst bmhh4, [p3], #64 ; nopxm
215
216// k=32
217 vmac.f dm0, dm0, ex0, ex1, r0 ; vlda.pop.576 ex0, [p0, lf0, r24] ; vldb.pop.576 ex1, [p1, lf1, r25] ; vst bmll3, [p2], #64 ; nopxm
218 vmac.f dm1, dm1, ex2, ex1, r0 ; vlda.pop.576 ex2, [p0, lf0, r24] ; vldb.pop.576 ex3, [p1, lf1, r25] ; vst bmlh3, [p2], #64 ; nopxm
219 vmac.f dm0, dm0, ex0, ex3, r0 ; nopa ; nopb ; vst bmhl3, [p2], #64 ; nopxm
220 vmac.f dm1, dm1, ex2, ex3, r0 ; nopa ; nopb ; vst bmhh3, [p2], #64 ; nopxm
221
222// k=40
223 vmac.f dm0, dm0, ex0, ex1, r0 ; vlda.pop.576 ex4, [p0, lf0, r24] ; vldb.pop.576 ex5, [p1, lf1, r25] ; nops ; nopxm
224 vmac.f dm1, dm1, ex2, ex1, r0 ; vlda.pop.576 ex6, [p0, lf0, r24] ; vldb.pop.576 ex7, [p1, lf1, r25] ; nops ; nopxm
225 vmac.f dm0, dm0, ex0, ex3, r0 ; vlda.fill.512 [p0, lf0, r24] ; vldb.fill.512 [p1, lf1, r25] ; nops ; nopxm
226 vmac.f dm1, dm1, ex2, ex3, r0 ; vlda.pop.576 ex8, [p0, lf0, r24] ; vldb.pop.576 ex9, [p1, lf1, r25] ; nops ; nopxm
227
228// k=48
229 vmac.f dm0, dm0, ex0, ex1, r0 ; vlda.pop.576 ex10, [p0, lf0, r24] ; vldb.pop.576 ex11, [p1, lf1, r25] ; nops ; nopxm
230 vmac.f dm1, dm1, ex2, ex1, r0 ; nopa ; nopb ; nops ; nopxm
231 vmac.f dm2, dm0, ex0, ex3, r0 ; vlda.pop.576 ex0, [p0, lf0, r24] ; vldb.pop.576 ex1, [p1, lf1, r25] ; nops ; nopxm
232 vmac.f dm3, dm1, ex2, ex3, r0 ; vlda.pop.576 ex2, [p0, lf0, r24] ; vldb.pop.576 ex3, [p1, lf1, r25] ; nops ; nopxm
233
234// k=56
235 vmac.f dm0, dm0, ex4, ex5, r0 ; vlda.pop.576 ex4, [p0, lf0, r24] ; vldb.pop.576 ex5, [p1, lf1, r25] ; nops ; nopxm
236 vmac.f dm1, dm1, ex4, ex7, r0 ; vlda.pop.576 ex6, [p0, lf0, r24] ; vldb.pop.576 ex7, [p1, lf1, r25] ; nops ; nopxm
237 vmac.f dm2, dm2, ex6, ex5, r0 ; nopa ; nopb ; nops ; nopxm
238 vmac.f dm0, dm0, ex8, ex9, r0 ; nopa ; nopb ; nops ; nopxm
239
240//
241 vmac.f dm3, dm3, ex6, ex7, r0 ; vlda.pop.576 ex8, [p0, lf0, r24] ; vldb.pop.576 ex9, [p1, lf1, r25] ; nops ; nopxm
242 vmac.f dm1, dm1, ex8, ex11, r0 ; vlda.pop.576 ex10, [p0, lf0, r24] ; vldb.pop.576 ex11, [p1, lf1, r25] ; nops ; nopxm
243 vmac.f dm0, dm0, ex0, ex1, r0 ; nopa ; nopb ; nops ; nopxm
244 vmac.f dm2, dm2, ex10, ex9, r0 ; nopa ; nopb ; nops ; nopxm
245
246//
247 vmac.f dm3, dm3, ex10, ex11, r0 ; nopa ; nopb ; nops ; nopxm
248 vmac.f dm0, dm0, ex4, ex5, r0 ; nopa ; nopb ; nops ; nopxm
249 vmac.f dm1, dm1, ex0, ex3, r0 ; nopa ; nopb ; nops ; nopxm
250 vmac.f dm2, dm2, ex2, ex1, r0 ; nopa ; nopb ; nops ; nopxm
251
252//
253 vmac.f dm0, dm0, ex8, ex9, r0 ; nopa ; nopb ; nops ; nopxm
254 vmac.f dm1, dm1, ex4, ex7, r0 ; nopa ; nopb ; nops ; nopxm
255 vmac.f dm3, dm3, ex2, ex3, r0 ; nopa ; nopb ; nops ; nopxm
256 vmac.f dm2, dm2, ex6, ex5, r0 ; nopa ; nopb ; nops ; nopxm
257
258//
259 vmac.f dm1, dm1, ex8, ex11, r0 ; nopa ; nopb ; nops ; nopxm
260 vmac.f dm3, dm3, ex6, ex7, r0 ; nopa ; nopb ; nops ; nopxm
261 vmac.f dm2, dm2, ex10, ex9, r0 ; nopa ; nopb ; vst bmll0, [p2], #64 ; nopxm
262 nopv ; nopa ; nopb ; vst bmlh0, [p2], #64 ; nopxm
263 vmac.f dm3, dm3, ex10, ex11, r0 ; nopa ; nopb ; vst bmhl0, [p2], #64 ; nopxm
264// k=96
265
266 nopv ; nopa ; nopb ; vst bmhh0, [p2], #64 ; nopxm
267 nopv ; nopa ; nopb ; vst bmll1, [p2], #64 ; nopxm
268 nopv ; nopa ; nopb ; vst bmlh1, [p2], #64 ; nopxm
269 nopv ; nopa ; nopb ; vst bmhl1, [p2], #64 ; nopxm
270 nopv ; nopa ; nopb ; vst bmhh1, [p2], #64 ; nopxm
271 nopv ; nopa ; nopb ; vst bmll2, [p3], #64 ; nopxm
272 nopv ; nopa ; nopb ; vst bmlh2, [p3], #64 ; nopxm
273 nopv ; nopa ; nopb ; vst bmhl2, [p3], #64 ; nopxm
274 nopv ; nopa ; nopb ; vst bmhh2, [p3], #64 ; ret lr
275 nopv ; nopa ; nopb ; vst bmll3, [p3], #64 ; nopxm // Delay Slot 5
276 nopv ; nopa ; nopb ; vst bmlh3, [p3], #64 ; nopxm // Delay Slot 4
277 nopv ; nopa ; nopb ; vst bmhl3, [p3], #64 ; nopxm // Delay Slot 3
278 nopv ; nopa ; nopb ; vst bmhh3, [p3], #64 ; nopxm // Delay Slot 2
279 nopv ; nopa ; nopb ; nops ; nopxm // Delay Slot 1
Listing 13 shows the kernel’s cool-down phase. To realize the discussed output tile prioritization, each VMAC.F chain after line 228 uses a different accumulation register. The stores of the last 2×2 block are performed in lines 261–278.
In summary, 48 out of 63 instructions contain VMAC.F operations.
Kernel Efficiency#
The vector unit utilization of the three parts is as follows:
Warm-up phase: 48 out of 65 instructions contain VMAC.F operations.
Hardware loop: All 48 instructions in the loop body contain VMAC.F operations. The loop executes 14 times, yielding a total of 672 instructions with a VMAC.F operation.
Cool-down phase: 48 out of 63 instructions contain VMAC.F operations.
In summary, the kernel executes 800 instructions, 768 of which contain VMAC.F operations. This results in a theoretical utilization of 96%. In other words, a compute tile running at 1.8 GHz would execute 1.73×10⁹ BFP16 8×8×8 operations per second. This equates to a theoretical throughput of 1769 BFP16 GFLOPS.
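The utilization and throughput numbers follow directly from the instruction counts above; the short computation below reproduces them, counting 2·8·8·8 floating-point operations per 8×8×8 VMAC.F.

warmup_total, warmup_vmac = 65, 48
loop_body_vmac, loop_iters = 48, 14
cooldown_total, cooldown_vmac = 63, 48

total_instructions = warmup_total + loop_body_vmac * loop_iters + cooldown_total  # 800
vmac_instructions = warmup_vmac + loop_body_vmac * loop_iters + cooldown_vmac     # 768
utilization = vmac_instructions / total_instructions                              # 0.96

clock_hz = 1.8e9
vmac_per_second = clock_hz * utilization          # ~1.73e9 BFP16 8x8x8 operations/s
gflops = vmac_per_second * 2 * 8 * 8 * 8 / 1e9    # ~1769 BFP16 GFLOPS
print(f"{utilization:.0%}  {vmac_per_second:.2e}  {gflops:.0f}")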
Analogous to the XDNA1 tensor contraction kernel, we have written a benchmark that calls the tensor contraction kernel repeatedly in a loop on the NPU. We have benchmarked the kernel on an XDNA2 NPU (AMD Ryzen AI Max PRO 390) and achieved a throughput of 1760 BFP16 GFLOPS. The benchmarking code is available from our xdna repository. To run the benchmark, execute the following commands:
git clone https://github.com/scalable-analyses/xdna
cd xdna
make run
Note
The installation of the MLIR-AIE compiler aiecc and Peano is documented in the mlir-aie repository.
The Makefile assumes that the environment variable PEANO_INSTALL_DIR contains the path to Peano and that aiecc.py is available on the PATH.
Use xrt-smi configure --pmode turbo to set the NPU clock to its maximum frequency.