Instruction Set Architecture#
As of 2026, the official XDNA documentation focuses mainly on the higher-level software stack and provides limited low-level technical detail. However, by combining information from several publicly available sources with microbenchmarks, we can construct an accurate picture of the hardware.
XDNA1 and XDNA2 are AMD’s consumer variants of the AIE-ML (AIE2) and AIE-ML v2 (AIE2p) architectures. Public documentation for the AIE-ML and AIE-ML v2 architectures is available as part of AMD’s Versal AI Edge Series. Architecturally, XDNA1 closely resembles AIE-ML, whereas XDNA2 diverges more from AIE-ML v2. Nonetheless, the AIE-ML and AIE-ML v2 architecture manuals are helpful for understanding the features of the microarchitectures.
As stated in an AMD paper, XDNA cores use a very long instruction word (VLIW) instruction set architecture (ISA) that supports single-instruction multiple-data (SIMD) operations for fixed- and floating-point arithmetic. Although the ISA itself is not publicly documented, the open-source, LLVM-based compiler framework Peano implements it. Specifically, this framework can compile the AIE-API, a header-only library of intrinsic functions for programming the compute-tile cores. Peano emits assembly code for intrinsic kernels compiled from C++, which allows us to infer the ISA from the generated code. Furthermore, microbenchmarking the operations within instruction words lets us determine their hardware execution properties. Operation latencies are particularly important because the in-order cores rely on the compiler to schedule VLIW code statically.
Floating-Point Matrix Operations#
BF16 is a popular ML data format introduced by Google in 2018. A BF16 number has a sign bit, eight exponent bits, and seven mantissa bits; since FP32 also has eight exponent bits, BF16 retains the dynamic range of FP32. BFP16, in contrast, is a block floating-point format in which multiple values share the same exponent. Specifically, BFP16 uses eight bits for the shared exponent and groups eight values into a block. Each entry in the block has a sign bit and seven mantissa bits. This results in a total of 8 + 8 × (1 + 7) = 72 bits, or 9 bytes, for all eight values. Details on the format are available in the AIE-API, where it has the name bfp16ebs8.
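To make the layout concrete, a bfp16ebs8 block can be sketched as a plain C++ struct; the struct and its field names are illustrative and not part of the AIE-API:

#include <cstdint>

// Illustrative layout of one bfp16ebs8 block (9 bytes in total):
// an 8-bit exponent shared by the block, followed by eight entries
// that each hold a sign bit and seven mantissa bits.
struct bfp16ebs8_block {
  uint8_t shared_exponent; // 8-bit block exponent
  uint8_t entries[8];      // 1 sign bit + 7 mantissa bits per entry
};
static_assert( sizeof(bfp16ebs8_block) == 9, "8 + 8 * (1 + 7) bits = 9 bytes" );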
The AIE-API lists the supported matrix-multiplication modes. Table 1 provides an excerpt of the modes relevant for our targeted tensor workloads. We observe that the BF16 4×8×4 mode runs natively on XDNA1, while the BFP16 8×8×8 mode runs natively on XDNA2. All other modes are emulated in software.

Table 1

| Arch. | bfloat16 × bfloat16 | float × float (d) | bfp16 × bfp16 |
|---|---|---|---|
| AIE-ML/XDNA1 | 4×8×4, 8×8×4 (a), 4×16×4 (a,b), 8×8×8 (a,b) | 4×8×4, 4×1×4 (b), 4×1×8 (a,b) | |
| XDNA2 | 8×8×4 (a,b), 4×8×8 (a,b,c), 4×8×4 (a,b), 8×8×8 (e), 8×1×8 (b) | 4×8×4 (a,b) | 8×8×8, 8×8×16 (a,b) |

(e) Emulated through BFP16 operations when defining AIE_API_EMULATE_BFLOAT16_MMUL_WITH_BFP16 at compile time.
XDNA1#
Inferring the ISA#
This section illustrates the process of inferring the XDNA1 ISA using AIE-API intrinsics and Peano. We use the BF16 4×8×4 matrix-matrix multiplication mode, which runs natively on XDNA1, as an example. The notation specifies an M×K×N matrix-matrix multiplication C += AB, where all matrices are in row-major order:
- M: appears in A (rows) and C (rows). In the example, M = 4.
- K: appears in A (columns) and B (rows); K is the contraction dimension. In the example, K = 8.
- N: appears in B (columns) and C (columns). In the example, N = 4.
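As a scalar reference, the mode C += AB can be written as the following triple loop over M, N, and K (illustrative C++, not an AIE kernel):

// Scalar reference for the MxKxN mode: C += A * B with
// row-major A (MxK), B (KxN) and C (MxN).
template <unsigned M, unsigned K, unsigned N>
void mmul_reference( float const * a, float const * b, float * c ) {
  for( unsigned m = 0; m < M; m++ )
    for( unsigned n = 0; n < N; n++ )
      for( unsigned k = 0; k < K; k++ )
        c[m*N + n] += a[m*K + k] * b[k*N + n];
}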
 1  #include <aie_api/aie.hpp>
 2
 3  template <typename T_in, typename T_out,
 4            unsigned r, unsigned s, unsigned t>
 5  inline static void bf16_mac_template( T_in const * __restrict ptr_in0,
 6                                        T_in const * __restrict ptr_in1,
 7                                        T_out * __restrict ptr_out ) {
 8    // define matrix multiplication operation
 9    using MMUL = aie::mmul<r, s, t, T_in, T_in, accfloat>;
10
11    // define vectors
12    aie::vector<T_in, MMUL::size_A> mat_in0;
13    aie::vector<T_in, MMUL::size_B> mat_in1;
14    aie::vector<T_out, MMUL::size_C> mat_out;
15
16    // load data
17    mat_in0 = aie::load_v<MMUL::size_A>(ptr_in0);
18    mat_in1 = aie::load_v<MMUL::size_B>(ptr_in1);
19    mat_out = aie::load_v<MMUL::size_C>(ptr_out);
20
21    // declare accumulator
22    MMUL mm_out(mat_out);
23
24    // perform matrix multiplication
25    mm_out.mac(mat_in0, mat_in1);
26
27    // store accumulator
28    aie::store_v(ptr_out, mm_out.template to_vector<T_out>());
29
30    return;
31  }
32
33  // instantiate template
34  extern "C" {
35    void bf16_mac( bfloat16 const * ptr_in0,
36                   bfloat16 const * ptr_in1,
37                   float * ptr_out ) {
38      bf16_mac_template<bfloat16, float, 4, 8, 4>(ptr_in0, ptr_in1, ptr_out);
39    }
40  }
Listing 1 shows a simple AIE-API kernel that performs a BF16 4×8×4 operation in line 25 when instantiated with the template parameters from line 38.
The target matrix multiplication mode is set up in line 9.
It operates on BF16 inputs and accumulates into FP32 as specified by accfloat.
Lines 12–14 declare the 4×8 BF16 matrix mat_in0, the 8×4 BF16 matrix mat_in1, and the 4×4 FP32 matrix mat_out.
Next, the load_v intrinsics in lines 17–19 load data from memory into registers.
Line 22 declares that mat_out should be used for accumulation of the matrix multiplication in line 25.
The last intrinsic function in line 28 stores the data in mat_out to memory at address ptr_out.
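Other modes from Table 1 are obtained by changing only the template arguments. For example, a hypothetical instantiation of the software-emulated BF16 8×8×4 mode would look as follows:

// Hypothetical instantiation of the same template for the
// BF16 8x8x4 mode, which is emulated in software on XDNA1.
extern "C" {
  void bf16_mac_8x8x4( bfloat16 const * ptr_in0,
                       bfloat16 const * ptr_in1,
                       float * ptr_out ) {
    bf16_mac_template<bfloat16, float, 8, 8, 4>(ptr_in0, ptr_in1, ptr_out);
  }
}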
Using the Peano compiler, we can compile the AIE-API kernel and obtain the generated assembly code:
clang++ -O2 -std=c++20 --target=aie2-none-unknown-elf \
-I aie_api/include -S bf16_mac.cpp -o bf16_mac.s
Note
The installation of Peano is documented in the mlir-aie repository.
Additionally, the AIE-API header-only library is required to compile the kernel (-I aie_api/include).
 6  bf16_mac: // @bf16_mac
 7  // %bb.0: // %entry
 8      nopb ; nopa ; nops ; nopx ; mov p3, p0; nopv
 9      vldb wl0, [p0, #0]; mov p0, p1
10      vlda wl2, [p1, #0]; paddb [p0], #32; padds [p3], #32
11      vlda wh0, [p3, #0]; vldb wh2, [p0, #0]
12      vlda amhh0, [p2, #32]
13      vlda amhl0, [p2, #0]
14      nop
15      nop
16      nop
17      mova r0, #28
18      vmac.f bmh0, bmh0, x0, x2, r0
19      nop
20      nop
21      ret lr
22      nop // Delay Slot 5
23      nop // Delay Slot 4
24      vst amhh0, [p2, #32] // Delay Slot 3
25      vst amhl0, [p2, #0] // Delay Slot 2
26      nop // Delay Slot 1
Listing 2 shows the relevant part of the assembly code generated from the AIE-API kernel in Listing 1.
The label bf16_mac in line 6 marks the entry point of the function.
Each of lines 8–26 represents a VLIW instruction, potentially consisting of multiple operations separated by semicolons.
For example, the instruction vlda wl2, [p1, #0]; paddb [p0], #32; padds [p3], #32 in line 10 consists of three operations that are not nop.
The function parameters ptr_in0, ptr_in1, and ptr_out are passed via the pointer registers P0, P1, and P2.
We give high-level descriptions of the operations in each instruction:
Line 8 – nopb ; nopa ; nops ; nopx ; mov p3, p0; nopv

- mov p3, p0: Copy the value in pointer register P0 to P3.

Line 9 – vldb wl0, [p0, #0]; mov p0, p1

- vldb wl0, [p0, #0]: Load 32 bytes (16 BF16 values) from the address in pointer register P0 into the vector register WL0. The lower 32 bytes of the 64-byte register X0 overlap with WL0.
- mov p0, p1: Copy the value in pointer register P1 to P0.

Line 10 – vlda wl2, [p1, #0]; paddb [p0], #32; padds [p3], #32

- vlda wl2, [p1, #0]: Load 32 bytes (16 BF16 values) from the address in pointer register P1 into the vector register WL2.
- paddb [p0], #32: Increment the value in pointer register P0 by 32.
- padds [p3], #32: Increment the value in pointer register P3 by 32.

Line 11 – vlda wh0, [p3, #0]; vldb wh2, [p0, #0]

- vlda wh0, [p3, #0]: Load 32 bytes (16 BF16 values) from the address in pointer register P3 into the vector register WH0.
- vldb wh2, [p0, #0]: Load 32 bytes (16 BF16 values) from the address in pointer register P0 into the vector register WH2.

Line 12 – vlda amhh0, [p2, #32]

- vlda amhh0, [p2, #32]: Load 32 bytes (8 FP32 values) from the address in pointer register P2 with an offset of 32 into accumulation register AMHH0. The upper 32 bytes of the accumulation register BMH0 overlap with AMHH0.

Line 13 – vlda amhl0, [p2, #0]

- vlda amhl0, [p2, #0]: Load 32 bytes (8 FP32 values) from the address in pointer register P2 into accumulation register AMHL0. The lower 32 bytes of the accumulation register BMH0 overlap with AMHL0.

Line 17 – mova r0, #28

- mova r0, #28: Write the value 28 into the 4-byte general-purpose register R0.

Line 18 – vmac.f bmh0, bmh0, x0, x2, r0

- vmac.f bmh0, bmh0, x0, x2, r0: Perform a matrix-matrix multiplication of the matrices in the vector registers X0 and X2. The FP32 values of the matrix in BMH0 (second occurrence in the operation) are added to the multiplication result. After adding, the final result is stored in BMH0 (first occurrence in the operation). Register R0 is used as a configuration register, where the value 28 corresponds to the BF16 4×8×4 matrix-matrix multiplication.

Line 21 – ret lr

- ret lr: Write the address in the link register to the program counter. This operation has a 6-cycle latency, meaning the next 5 instructions are executed during the delay slots.

Line 24 – vst amhh0, [p2, #32]

- vst amhh0, [p2, #32]: Store the 8 FP32 values from AMHH0 to the address in pointer register P2 with an offset of 32.

Line 25 – vst amhl0, [p2, #0]

- vst amhl0, [p2, #0]: Store the 8 FP32 values from AMHL0 to the address in pointer register P2.
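The register overlaps used above can be pictured with a small model; the byte offsets follow the descriptions of lines 9–13, while the struct itself is purely illustrative:

#include <cstdint>

// Model of the 64-byte vector register X0: WL0 aliases the lower
// and WH0 the upper 32 bytes. The accumulation register BMH0 is
// split analogously into AMHL0 (lower) and AMHH0 (upper) halves.
struct x_reg {
  uint8_t bytes[64];
  uint8_t * wl() { return bytes; }      // WL0: bytes 0..31
  uint8_t * wh() { return bytes + 32; } // WH0: bytes 32..63
};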
Some instruction properties can be read directly from the assembly code. The XDNA cores lack stall logic and execute instructions in order, which means the compiler must insert no-operation (nop) instructions to resolve dependencies. Consider, for example, the instruction in line 18, which exclusively contains the vmac.f matrix-multiplication operation. The two stores in lines 24 and 25 write the result of the multiplication to memory and therefore depend on the completion of vmac.f. We see that the compiler schedules the first store in line 24 six cycles after the vmac.f, which indicates that the operation has a latency of 6 cycles.
The details of Peano's scheduling rules for the vmac.f bmh0, bmh0, x0, x2, r0 operation are defined in the AIEngine Scheduling Definitions for AIE2:
InstrItinData<II_VMACf, [InstrStage<1, [R_RV_PORT]>, EmptyCycles<4>, InstrStage<1, [CM_WA_PORT]>],
[6,3,1,1,1,/*srFPFlags*/7, /*crFPMask*/7]>,
We see that the compiler assumes a 6-cycle latency for the instruction. More precisely, the write-back to the accumulation register bmh0 completes after six cycles, whereas the accumulator operand is read in the third cycle. This means that our kernels can exploit forwarding and only have to ensure that preceding write-backs complete before the third cycle of the operation. For example, two vmac.f operations that accumulate into the same register have to be scheduled at least 6 - 3 = 3 cycles apart.
The latencies of the BFP16 8×8×8 operation mentioned in Floating-Point Matrix Operations are part of the AIE2p scheduling definitions.
In general, we can follow a similar analysis for all operations. If we are uncertain about a particular latency, we can increase or decrease the distances in the assembly code and run a small test to check whether the modified schedule still produces correct results.
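Such a test only needs to compare the kernel's output against a scalar reference. A minimal check might look as follows (the relative tolerance is an assumption suited to BF16 inputs; launching the kernel on the NPU is omitted):

#include <cmath>

// Returns true if the kernel output matches the reference within a
// relative tolerance; a mis-scheduled store typically produces stale
// or partially written values that fail this check.
bool matches_reference( float const * c_kernel,
                        float const * c_reference,
                        unsigned size,
                        float rel_tol = 1e-2f ) {
  for( unsigned i = 0; i < size; i++ ) {
    float diff = std::fabs( c_kernel[i] - c_reference[i] );
    float scale = std::fmax( std::fabs( c_reference[i] ), 1.0f );
    if( diff > rel_tol * scale ) return false;
  }
  return true;
}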
Summary#
We observed no differences between the register files of XDNA1 and AIE-ML. The process of determining the available operations together with their respective latencies is tedious. We have done this for the XDNA1 operations that are required for our target tensor contraction workload. The obtained ISA is provided in Table 2.
Table 2

| Operation | Lat. | Notes |
|---|---|---|
| nop - no-operation | | |
| | - | do nothing |
| | - | do nothing in unit |
| mov - move | | |
| | 1 | |
| | 1 | |
| | 1 | |
| | 1 | |
| | 1 | |
| ld - load - 4-byte load | | |
| | 6 | post-index |
| | 6 | post-index |
| | 6 | offset |
| | 6 | offset |
| vld - vector load - 32-byte loads | | |
| | 7 | post-index (a) |
| | 7 | post-index (a) |
| | 7 | post-index (a) |
| st - store - 4-byte store | | |
| | 6 | post-index (a) |
| vst - vector store - 32-byte stores | | |
| | 2 | post-index (a) |
| | 2 | post-index (a) |
| | 2 | post-index (a) |
| vshuffle - vector shuffle | | |
| | 2 | conf. register (b) |
| add - addition | | |
| | 1 | exp. leading bit (c) |
| | 1 | |
| padd - pointer addition | | |
| | 1 | |
| | 1 | |
| | 1 | |
| mul - multiplication | | |
| | 2 | |
| vmac - vector multiply accumulate | | |
| | 6 | lfw, conf. reg. (d) |
| comparisons | | |
| | 1 | |
| | 1 | |
| j - jump | | |
| | 6 | |
| | 6 | |
| | 6 | |

(b) <Rn> as configuration register (value 28 for 4x8 -> 8x4, value 29 for 8x4 -> 4x8).
(d) (<BMLm>|<BMHm>) are read in the third cycle; <Rn> as configuration register (value 28 for 4x8x4-bfloat16).

XDNA2#
Inferring the ISA#
The AIE-API can emulate BF16 8×8×8 matrix multiplications through BFP16 8×8×8 operations but does not expose a direct BFP16 8×8×8 intrinsic.
 1  #include <aie_api/aie.hpp>
 2
 3  template <typename T_in, typename T_out,
 4            unsigned r, unsigned s, unsigned t>
 5  inline static void bf16_mac_template( T_in const * __restrict ptr_in0,
 6                                        T_in const * __restrict ptr_in1,
 7                                        T_out * __restrict ptr_out ) {
 8    // define matrix multiplication operation
 9    using MMUL = aie::mmul<r, s, t, T_in, T_in, accfloat>;
10
11    // define vectors
12    aie::vector<T_in, MMUL::size_A> mat_in0;
13    aie::vector<T_in, MMUL::size_B> mat_in1;
14    aie::vector<T_out, MMUL::size_C> mat_out;
15
16    // load data
17    mat_in0 = aie::load_v<MMUL::size_A>(ptr_in0);
18    mat_in1 = aie::load_v<MMUL::size_B>(ptr_in1);
19    mat_out = aie::load_v<MMUL::size_C>(ptr_out);
20
21    // declare accumulator
22    MMUL mm_out(mat_out);
23
24    // perform matrix multiplication
25    mm_out.mac(mat_in0, mat_in1);
26
27    // store accumulator
28    aie::store_v(ptr_out, mm_out.template to_vector<T_out>());
29
30    return;
31  }
32
33  // instantiate template
34  extern "C" {
35    void bfp16_emu_mac( bfloat16 const * ptr_in0,
36                        bfloat16 const * ptr_in1,
37                        float * ptr_out ) {
38      bf16_mac_template<bfloat16, float, 8, 8, 8>(ptr_in0, ptr_in1, ptr_out);
39    }
40  }
Listing 3 shows an AIE-API kernel that performs an 8×8×8 BF16 matrix multiplication. The template is identical to the XDNA1 version in Listing 1 but initialized with the appropriate dimension sizes in line 38.
Defining the directive AIE_API_EMULATE_BFLOAT16_MMUL_WITH_BFP16 instructs the AIE-API to emulate the matrix multiplication in the kernel with BFP16 operations:
clang++ -O2 -std=c++20 --target=aie2p-none-unknown-elf \
-DAIE_API_EMULATE_BFLOAT16_MMUL_WITH_BFP16 \
-I aie_api/include -S bfp16_emu_mac.cpp -o bfp16_emu_mac.s
Listing 4: Assembly code obtained from the AIE-API kernel when compiled with the directive AIE_API_EMULATE_BFLOAT16_MMUL_WITH_BFP16 defined.

 6  bfp16_emu_mac: // @bfp16_emu_mac
 7  // %bb.0: // %entry
 8      nopa ; vldb x2, [p1, #0]; nopxm ; nops
 9      vldb x4, [p1, #64]
10      nop
11      nop
12      nop
13      movxm r0, #16256
14      vlda.conv.fp32.bf16 cml0, [p0, #0]; vbcst.16 x0, r0
15      mova r0, #53; vmov x1, x0
16      mova r1, #52; vshuffle x7, x2, x4, r0
17      vlda.conv.fp32.bf16 cmh0, [p0, #64]; vshuffle x6, x2, x4, r1
18      mova r0, #60
19      vmul.f dm1, y3, y0, r0
20      nop
21      nop
22      vlda bmll0, [p2, #0]
23      vlda bmlh0, [p2, #64]
24      vlda bmhl0, [p2, #128]; vconv.bfp16ebs8.fp32 ex0, dm0
25      vlda bmhh0, [p2, #192]; vconv.bfp16ebs8.fp32 ex2, dm1
26      nop
27      nop
28      mova r0, #780
29      vmac.f dm0, dm0, ex0, ex2, r0
30      nop
31      nop
32      nop
33      nop
34      ret lr
35      vst bmll0, [p2, #0] // Delay Slot 5
36      vst bmlh0, [p2, #64] // Delay Slot 4
37      vst bmhl0, [p2, #128] // Delay Slot 3
38      vst bmhh0, [p2, #192] // Delay Slot 2
39      nop // Delay Slot 1
The relevant part of the generated assembly code is shown in Listing 4.
Line 29 contains the instruction vmac.f dm0, dm0, ex0, ex2, r0, which performs the BFP16 matrix multiplication.
Since the intrinsics assume pointers to BF16 inputs and an FP32 output, the compiler issues the corresponding conversion operations to BFP16.
We were not able to write an AIE-API kernel that reads BFP16 values directly from L1 scratchpad memory.
Instead, we examined the Peano test code to identify the necessary operations for loading BFP16 data into the EX registers.
49  movx r24, #0
50  vlda.fill.512 [p0, lf0, r24]
51  // no instruction in line
52  vlda.pop.576 ex5, [p0, lf0, r24]
53
54
55
56
57
58
59
60
61  // no instruction in line
62  // no instruction in line
63  vmac.f dm2, dm2, ex11, ex5, r0
The relevant part is shown in Listing 5.
The required 576-bit load for an EX register is split into multiple operations.
Line 49 initializes R24 to 0.
R24 is used as an offset and to track the pipeline fill stage.
vlda.fill.512 [p0, lf0, r24] loads 512 bits into load file 0 (LF0) and increments the value in R24 by 64 (the number of loaded bytes).
vlda.pop.576 ex5, [p0, lf0, r24] also loads 512 bits, using the value in R24 as an address offset.
Depending on the value in R24, the pipeline consumes the 64 bytes already in LF0 and 8 additional bytes from the subsequent fetch to assemble the 576-bit EX5 register.
The operation increments P0 by 72 and decrements R24 by 8 (the number of bytes drained from the fill stage).
The EX5 register is used after eight instructions, indicating an eight-cycle latency.
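A minimal sketch of the fill/pop bookkeeping, assuming exactly the semantics described above (the struct models only the register updates, not the data movement):

#include <cassert>
#include <cstdint>

// Toy model of the vlda.fill.512 / vlda.pop.576 counter updates:
// fill fetches 64 bytes and advances R24, pop consumes 72 bytes
// (64 from LF0 plus 8 from the next fetch) and advances P0.
struct bfp_load_pipe {
  uint64_t p0 = 0; // pointer register P0 (byte address)
  int64_t r24 = 0; // fill-stage tracking register R24

  void fill() { r24 += 64; }          // vlda.fill.512
  void pop() { p0 += 72; r24 -= 8; }  // vlda.pop.576
};

int main() {
  bfp_load_pipe pipe;
  pipe.fill(); // prime the pipeline with 64 bytes
  pipe.pop();  // assemble one 576-bit (72-byte) EX register
  assert( pipe.p0 == 72 && pipe.r24 == 56 );
  return 0;
}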
Summary#
XDNA2 has five 2048-bit accumulator registers (DM0–DM4), together with their 1024-bit and 512-bit views in the ISA.
Compared to AIE-ML v2, which exposes eight such accumulators, this is a 1.6× reduction in the architecturally visible accumulator count.
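Based on the register names and memory offsets in Listing 4, the accumulator views can be sketched as follows (the layout is inferred from the generated code, not officially documented):

#include <cstdint>

// Inferred view structure of one 2048-bit DM accumulator:
// CML/CMH are its 1024-bit halves (cf. the vlda.conv loads in
// Listing 4), BMLL/BMLH/BMHL/BMHH its 512-bit quarters (cf. the
// vlda loads at offsets #0, #64, #128 and #192).
struct dm_reg {
  uint8_t bytes[256];                      // 2048 bits
  uint8_t * cml() { return bytes; }        // lower 1024 bits
  uint8_t * cmh() { return bytes + 128; }  // upper 1024 bits
  uint8_t * bmll() { return bytes; }       // bits 0..511
  uint8_t * bmlh() { return bytes + 64; }  // bits 512..1023
  uint8_t * bmhl() { return bytes + 128; } // bits 1024..1535
  uint8_t * bmhh() { return bytes + 192; } // bits 1536..2047
};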
Table 3, Table 4, Table 5, and Table 6 summarize additional information on the registers that are relevant for BFP16 operations. The inferred ISA is provided in Table 7.

Table 3

| 32-bit | 64-bit |
|---|---|
| … | … |

Table 4

| 8-byte + 64-byte | 72-byte |
|---|---|
| … | … |

Table 5

| 1-byte + 8-byte | 9-byte |
|---|---|
| … | … |

Table 6

| 512-bit | 1024-bit |
|---|---|
| … | … |
Table 7

| Operation | Lat. | Notes |
|---|---|---|
| nop - no-operation | | |
| | - | do nothing |
| | - | do nothing in unit |
| mov - move | | |
| | 1 | |
| | 1 | |
| | 1 | |
| | 1 | |
| | 1 | |
| | 2 | |
| | 2 | |
| ld - load - 4-byte load | | |
| | 6 | post-index |
| | 6 | post-index |
| | 6 | offset |
| | 6 | offset |
| vld - vector load - 64-byte loads | | |
| | 7 | post-index (a) |
| | 7 | post-index (a) |
| | 7 | post-index (a) |
| | - | |
| | - | |
| | 8 | no std. vld follow (b) |
| | 8 | no std. vld follow (b) |
| st - store - 4-byte store | | |
| | 6 | post-index (a) |
| vst - vector store - 64-byte stores | | |
| | 2 | post-index (a) |
| | 2 | post-index (a) |
| | 2 | post-index (a) |
| | - | |
| | 2 | |
| add - addition | | |
| | 1 | exp. leading bit (c) |
| | 1 | |
| padd - pointer addition | | |
| | 1 | |
| | 1 | |
| | 1 | |
| mul - multiplication | | |
| | 2 | |
| vmac - vector multiply accumulate | | |
| | 6 | lfw, conf. reg. (d) |
| comparisons | | |
| | 1 | |
| | 1 | |
| j - jump | | |
| | 6 | |
| | 6 | |
| | 6 | |

(d) <DMm> is read in the fourth cycle; <Rn> as configuration register (value 780 for 8x8x8-bfp16).