Instruction Set Architecture#

As of 2026, the official XDNA documentation focuses mainly on the higher-level software stack and provides only limited technical detail. However, by combining information from several publicly available sources with microbenchmarks, we can construct an accurate picture of the hardware.

XDNA1 and XDNA2 are AMD’s consumer variants of the AIE-ML (AIE2) and AIE-ML v2 (AIE2p) architectures. Public documentation for the AIE-ML and AIE-ML v2 architectures is available as part of AMD’s Versal AI Edge Series. Architecturally, XDNA1 closely resembles AIE-ML, whereas XDNA2 diverges more from AIE-ML v2. Nonetheless, the AIE-ML and AIE-ML v2 architecture manuals are helpful for understanding the features of the microarchitectures.

As stated in an AMD paper, XDNA cores use a very-long instruction word (VLIW) instruction set architecture (ISA) that supports single-instruction multiple-data (SIMD) operations for fixed- and floating-point arithmetic. Although the ISA itself is not publicly documented, the open-source, LLVM-based compiler framework Peano implements it. Specifically, this framework can compile the AIE-API, which contains intrinsic functions for programming the compute-tile cores. Peano emits assembly code for intrinsic kernels compiled from C++. This allows us to infer the ISA from the generated code. Furthermore, microbenchmarking operations within instruction words allows us to determine their hardware execution properties. Operation latencies are particularly important because they are crucial for scheduling operations in VLIW code.

Floating-Point Matrix Operations#

BF16 is a popular ML data format introduced by Google in 2018. A BF16 number has a sign bit, eight exponent bits, and seven mantissa bits. Thus, we obtain the same dynamic range as FP32, which has eight exponent bits as well. BFP16 is a block-floating-point format, which means that multiple values share the same exponent. Specifically, BFP16 uses eight bits for the exponent and groups eight values into a block. Each of the entries in the block has a sign and seven mantissa bits. This results in a total of 8 + 8 * (1 + 7) = 72 bits, or 9 bytes, for all eight values. Details on the format are available in the AIE-API, where it has the name bfp16ebs8.
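To make the storage format concrete, the following struct is a minimal sketch of a single bfp16ebs8 block. It is not an AIE-API type, only the byte layout implied by the numbers above.

#include <cstdint>

// One bfp16ebs8 block: eight values that share a single 8-bit exponent.
// Each value contributes one sign bit and seven mantissa bits.
struct bfp16ebs8_block {
  uint8_t shared_exponent;   // 8-bit exponent shared by all eight values
  uint8_t sign_mantissa[8];  // per value: 1 sign bit + 7 mantissa bits
};

// 8 + 8 * (1 + 7) bits = 72 bits, i.e., 9 bytes per block.
static_assert(sizeof(bfp16ebs8_block) == 9, "one block occupies 9 bytes");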

Table 1 Excerpt of the floating-point matrix-multiplication modes listed in the AIE-API.#

AIE-ML/XDNA1
  bfloat16 × bfloat16: 4×8×4, 8×8×4 (a), 4×16×4 (a,b), 8×8×8 (a,b)
  float × float (d): 4×8×4, 4×1×4 (b), 4×1×8 (a,b)
  bfp16 × bfp16: -

XDNA2
  bfloat16 × bfloat16: 8×8×4 (a,b), 4×8×8 (a,b,c), 4×8×4 (a,b), 8×8×8 (e)
  float × float (d): 8×1×8 (b), 4×8×4 (a,b)
  bfp16 × bfp16: 8×8×8, 8×8×16 (a,b)
(a) Emulated using multiple intrinsic calls.
(b) Requires additional data manipulation.
(c) 32b * 16b multiplications are emulated on AIE-ML/XDNA1, XDNA2, and AIE-ML v2.
(d) float multiplications are emulated on AIE-ML/XDNA1, XDNA2, and AIE-ML v2 using native bfloat16 multiplications.
(e) Mode available through block-floating-point emulation to increase throughput at the cost of accuracy. Enabled by defining AIE_API_EMULATE_BFLOAT16_MMUL_WITH_BFP16 at compile time.

The AIE-API lists supported matrix-multiplication modes. Table 1 provides an excerpt of the modes relevant for our targeted tensor workloads. We observe that the BF16 4×8×4 mode runs natively on XDNA1, while the BFP16 8×8×8 mode runs natively on XDNA2. All other modes are emulated in software.

XDNA1#

Inferring the ISA#

This section illustrates the process of inferring the XDNA1 ISA using AIE-API intrinsics and Peano. We use the BF16 4×8×4 matrix-matrix multiplication mode, which runs natively on XDNA1, as an example. The notation specifies an M×K×N matrix-matrix multiplication C += AB, where all matrices are stored in row-major order (a scalar reference loop is sketched after the list):

M: appears in A (rows) and C (rows). In the example, M=4.

K: appears in A (columns) and B (rows). K is the contraction dimension. In the example, K=8.

N: appears in B (columns) and C (columns). In the example, N=4.
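For reference, the following scalar loop nest is a minimal sketch of the 4×8×4 update C += AB on row-major data. It uses float throughout for readability; the hardware mode operates on BF16 inputs and accumulates into FP32.

// Scalar reference for the M=4, K=8, N=4 update C += A * B (row-major matrices).
void mmul_4x8x4_reference( float const * A, float const * B, float * C ) {
  constexpr int M = 4, K = 8, N = 4;
  for (int m = 0; m < M; ++m)
    for (int n = 0; n < N; ++n)
      for (int k = 0; k < K; ++k)
        C[m * N + n] += A[m * K + k] * B[k * N + n];
}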

Listing 1 AIE-API kernel that uses the BF16 4×8×4 matrix-multiplication mode.#
 1#include <aie_api/aie.hpp>
 2
 3template <typename T_in, typename T_out,
 4          unsigned r, unsigned s, unsigned t>
 5inline static void bf16_mac_template( T_in const * __restrict ptr_in0,
 6                                      T_in const * __restrict ptr_in1,
 7                                      T_out      * __restrict ptr_out ) {
 8  // define matrix multiplication operation
 9  using MMUL = aie::mmul<r, s, t, T_in, T_in, accfloat>;
10
11  // define vectors
12  aie::vector<T_in,  MMUL::size_A> mat_in0;
13  aie::vector<T_in,  MMUL::size_B> mat_in1;
14  aie::vector<T_out, MMUL::size_C> mat_out;
15
16  // load data
17  mat_in0 = aie::load_v<MMUL::size_A>(ptr_in0);
18  mat_in1 = aie::load_v<MMUL::size_B>(ptr_in1);
19  mat_out = aie::load_v<MMUL::size_C>(ptr_out);
20
21  // declare accumulator
22  MMUL mm_out(mat_out);
23  
24  // perform matrix multiplication
25  mm_out.mac(mat_in0, mat_in1);
26
27  // store accumulator
28  aie::store_v(ptr_out, mm_out.template to_vector<T_out>());
29
30  return;
31}
32
33// instantiate template
34extern "C" {
35  void bf16_mac( bfloat16 const * ptr_in0,
36                 bfloat16 const * ptr_in1,
37                 float          * ptr_out ) {
38    bf16_mac_template<bfloat16, float, 4, 8, 4>(ptr_in0, ptr_in1, ptr_out);
39  }
40}

Listing 1 shows a simple AIE-API kernel that performs a BF16 4×8×4 operation in line 25 when instantiated with the template parameters from line 38. The target matrix multiplication mode is set up in line 9. It operates on BF16 inputs and accumulates into FP32 as specified by accfloat. Lines 12–14 declare the BF16 4×8 matrix mat_in0, the 8×4 BF16 matrix mat_in1, and the 4×4 FP32 matrix mat_out. Next, the load_v intrinsics in lines 17–19 load data from memory into registers. Line 22 declares that mat_out should be used for accumulation of the matrix multiplication in line 25. The last intrinsic function in line 28 stores the data in mat_out to memory at address ptr_out.
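The vector sizes used in lines 12-14 are the element counts of the tiles: size_A = M*K, size_B = K*N, and size_C = M*N. A minimal compile-time sketch (assuming the same header and target as Listing 1) makes this explicit; the resulting 64 bytes per matrix also match the two 32-byte vector loads per matrix in the assembly below.

#include <aie_api/aie.hpp>

// Element counts of the BF16 4x8x4 mode: A holds 4*8 = 32 BF16 values (64 bytes),
// B holds 8*4 = 32 BF16 values (64 bytes), and C holds 4*4 = 16 FP32 values (64 bytes).
using MMUL_4x8x4 = aie::mmul<4, 8, 4, bfloat16, bfloat16, accfloat>;
static_assert(MMUL_4x8x4::size_A == 4 * 8, "A: M*K elements");
static_assert(MMUL_4x8x4::size_B == 8 * 4, "B: K*N elements");
static_assert(MMUL_4x8x4::size_C == 4 * 4, "C: M*N elements");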

Using the Peano compiler, we can compile the AIE-API kernel and obtain the generated assembly code:

clang++ -O2 -std=c++20 --target=aie2-none-unknown-elf \
        -I aie_api/include -S bf16_mac.cpp -o bf16_mac.s

Note

The installation of Peano is documented in the mlir-aie repository. Additionally, the AIE-API header-only library is required to compile the kernel (-I aie_api/include).

Listing 2 Assembly code obtained from the compiled AIE-API kernel.#
 6bf16_mac:                               // @bf16_mac
 7// %bb.0:                               // %entry
 8	nopb	;		nopa	;		nops	;		nopx	;		mov	p3, p0;		nopv	
 9	vldb	wl0, [p0, #0];		mov	p0, p1
10	vlda	wl2, [p1, #0];		paddb	[p0], #32;		padds	[p3], #32
11	vlda	wh0, [p3, #0];		vldb	wh2, [p0, #0]
12	vlda	amhh0, [p2, #32]
13	vlda	amhl0, [p2, #0]
14	nop	
15	nop	
16	nop	
17	mova	r0, #28
18	vmac.f	bmh0, bmh0, x0, x2, r0
19	nop	
20	nop	
21	ret lr	
22	nop	                                //  Delay Slot 5
23	nop	                                //  Delay Slot 4
24	vst	amhh0, [p2, #32]                //  Delay Slot 3
25	vst	amhl0, [p2, #0]                 //  Delay Slot 2
26	nop	                                //  Delay Slot 1

Listing 2 shows the relevant part of the assembly code generated from the AIE-API kernel in Listing 1. The label bf16_mac in line 6 marks the entry point of the function. Each of lines 8–26 represents a VLIW instruction, potentially consisting of multiple operations separated by semicolons. For example, the instruction vlda wl2, [p1, #0]; paddb [p0], #32; padds [p3], #32 in line 10 consists of three operations that are not nop. The function parameters ptr_in0, ptr_in1, and ptr_out are passed via the pointer registers P0, P1, and P2.

We give high-level descriptions of the operations in each instruction:

Line 8: nopb; nopa; nops; nopx; mov p3, p0; nopv

  • mov p3, p0: Copy the value in pointer register P0 to P3.

Line 9: vldb wl0, [p0, #0]; mov p0, p1

  • vldb wl0, [p0, #0]: Load 32 bytes (16 BF16 values) from the address in pointer register P0 into the vector register WL0. The lower 32 bytes of the 64-byte register X0 overlap with WL0.

  • mov p0, p1: Copy the value in pointer register P1 to P0.

Line 10: vlda wl2, [p1, #0]; paddb [p0], #32; padds [p3], #32

  • vlda wl2, [p1, #0]: Load 32 bytes (16 BF16 values) from the address in pointer register P1 into the vector register WL2.

  • paddb [p0], #32: Increment the value in pointer register P0 by 32.

  • padds [p3], #32: Increment the value in pointer register P3 by 32.

Line 11: vlda wh0, [p3, #0]; vldb wh2, [p0, #0]

  • vlda wh0, [p3, #0]: Load 32 bytes (16 BF16 values) from the address in pointer register P3 into the vector register WH0.

  • vldb wh2, [p0, #0]: Load 32 bytes (16 BF16 values) from the address in pointer register P0 into the vector register WH2.

Line 12: vlda amhh0, [p2, #32]

  • vlda amhh0, [p2, #32]: Load 32 bytes (8 FP32 values) from the address in pointer register P2 with an offset of 32 into accumulation register AMHH0. The upper 32 bytes of the accumulation register BMH0 overlap with AMHH0.

Line 13: vlda amhl0, [p2, #0]

  • vlda amhl0, [p2, #0]: Load 32 bytes (8 FP32 values) from the address in pointer register P2 into accumulation register AMHL0. The lower 32 bytes of the accumulation register BMH0 overlap with AMHL0.

Line 17: mova r0, #28

  • mova r0, #28: Write the value 28 into the 4-byte general purpose register R0.

Line 18: vmac.f bmh0, bmh0, x0, x2, r0

  • vmac.f bmh0, bmh0, x0, x2, r0: Perform a matrix-matrix multiplication where the matrices in the vector register X0 and X2 are multiplied. The FP32 values of the matrix in BMH0 (second occurrence in the operation) are added to the multiplication result. After adding, the final result is stored in BMH0 (first occurrence in the operation).

    Register R0 is used as a configuration register, where the value 28 corresponds to the BF16 4×8×4 matrix-matrix multiplication.

Line 21: ret lr

  • ret lr: Write the address in the link register to the program counter. This operation has a 6-cycle latency, meaning the next 5 instructions are executed during the delay slots.

Line 24: vst amhh0, [p2, #32]

  • vst amhh0, [p2, #32]: Store the 8 FP32 values from AMHH0 to the address in register P2 with an offset of 32.

Line 25: vst amhl0, [p2, #0]

  • vst amhl0, [p2, #0]: Store the 8 FP32 values from AMHL0 to the address in register P2.

Some instruction properties can be read directly from the assembly code. In particular, the XDNA cores execute instructions in order and lack stall logic, which means that the compiler must insert no-operation (nop) instructions to resolve dependencies.

Consider, for example, the instruction in line 18, which contains only the vmac.f matrix-multiplication operation. The two stores in lines 24 and 25 write the result of the multiplication to memory and therefore depend on the completion of vmac.f. We see that the compiler schedules the first store, in line 24, six cycles after the vmac.f. This indicates that the vmac.f operation has a latency of 6 cycles.

The details of Peano's scheduling rules for the vmac.f bmh0, bmh0, x0, x2, r0 operation used above are defined in the AIEngine Scheduling Definitions for AIE2:

InstrItinData<II_VMACf, [InstrStage<1, [R_RV_PORT]>, EmptyCycles<4>, InstrStage<1, [CM_WA_PORT]>],
              [6,3,1,1,1,/*srFPFlags*/7, /*crFPMask*/7]>,

We see that the compiler assumes a 6-cycle latency for the instruction. More precisely, the write-back to the accumulation register bmh0 completes after 6 cycles, whereas the accumulator is read in the third cycle. This means that our kernels can exploit forwarding and only have to ensure that preceding write-backs are completed before the third cycle of the operation. The latencies of the BFP16 8×8×8 operation mentioned in Floating-Point Matrix Operations are part of the AIE2p scheduling definitions.

In general, we can follow a similar analysis for all operations. If we are uncertain about a particular latency, we can increase or decrease the distances in the assembly code and run a small test to check if the change was valid.

Summary#

We observed no differences between the register files of XDNA1 and AIE-ML. The process of determining the available operations together with their respective latencies is tedious. We have done this for the XDNA1 operations that are required for our target tensor contraction workload. The obtained ISA is provided in Table 2.

Table 2 XDNA1 operations that are required for our target tensor contraction workload. The operations are sorted by groups and provided together with their latencies.#

nop - no-operation
  NOP: latency -, do nothing
  NOP(V|A|B|S|X|M|XM): latency -, do nothing in unit

mov - move
  MOV <Rd>, #<imm10>: latency 1
  MOV <Rd>, <Rm>: latency 1
  MOV(A|X) <Rd>, #<imm8>: latency 1
  MOV(A|X) <Rd>, <Rm>: latency 1
  MOVXM <Rd>, #<imm32>: latency 1

ld - load - 4 byte load
  LD(A|B) <Rd>, [<Pn>], #<imm8>: latency 6, post-index
  LD(A|B) <Rd>, [<Pn>], <Mm>: latency 6, post-index
  LD(A|B) <Rd>, [<Pn>, #<imm8>]: latency 6, offset
  LD(A|B) <Rd>, [<Pn>, <DJm>]: latency 6, offset

vld - vector load - 32 byte loads
  VLD(A|B) (<WLd>|<WHd>), [<Pn>], #<imm8>: latency 7, post-index (a)
  VLDA (<AMLLd>|<AMLHd>|<AMHLd>|<AMHHd>), [<Pn>], #<imm8>: latency 7, post-index (a)
  VLDA.CONV.FP32.BF16 (<BMLd>|<BMHd>), [<Pn>], #<imm8>: latency 7, post-index (a)

st - store - 4 byte store
  ST <Rd>, [<Pn>], #<imm8>: latency 6, post-index (a)

vst - vector store - 32 byte store
  VST (<WLd>|<WHd>), [<Pn>], #<imm8>: latency 2, post-index (a)
  VST (<AMLLd>|<AMLHd>|<AMHLd>|<AMHHd>), [<Pn>], #<imm8>: latency 2, post-index (a)
  VST.CONV.BF16.FP32 (<BMLd>|<BMHd>), [<Pn>], #<imm8>: latency 2, post-index (a)

vshuffle - vector shuffle
  VSHUFFLE (<Xd>|<BMLd>|<BMHd>), <Xr>, <Xs>, <Rn>: latency 2, conf. register (b)

add - addition
  ADD <Rd>, <Rm>, #<imm7>: latency 1, exp. leading bit (c)
  ADD <Rd>, <Rm>, <Rn>: latency 1

padd - pointer addition
  PADD(A|B|S) [<Pd>], <Mn>: latency 1
  PADD(A|S) [<Pd>], #<imm11>: latency 1
  PADDB [<Pd>], #<imm10>: latency 1

mul - multiplication
  MUL <Rd>, <Rm>, <Rn>: latency 2

vmac - vector multiply accumulate
  VMAC.F (<BMLd>|<BMHd>), (<BMLm>|<BMHm>), <Xr>, <Xs>, <Rn>: latency 6, lfw, conf. reg. (d)

comparisons
  (GT|LT|GE|LE){U} <Rd>, <Rm>, <Rn>: latency 1, U for unsigned
  SEL.(EQZ|NEZ) <Rd>, <Rm>, <Rn>, <R27>: latency 1

j - jump
  J #<label>: latency 6
  J(Z|NZ) <Rd>, #<label>: latency 6
  RET <LR>: latency 6

(a) Same addressing modes as ld.
(b) <Rn> as configuration register (value 28 for 4x8 -> 8x4, value 29 for 8x4 -> 4x8).
(c) Leading bit is expanded.
(d) (<BMLm>|<BMHm>) is read in the third cycle. <Rn> as configuration register (value 28 for 4x8x4-bfloat16).

XDNA2#

Inferring the ISA#

The AIE-API can emulate BF16 8×8×8 matrix multiplications through BFP16 8×8×8 operations but does not expose a direct BFP16 8×8×8 intrinsic.

Listing 3 AIE-API kernel that can be compiled to use BFP16 8×8×8 matrix-matrix multiplications.#
 1#include <aie_api/aie.hpp>
 2
 3template <typename T_in, typename T_out,
 4          unsigned r, unsigned s, unsigned t>
 5inline static void bf16_mac_template( T_in const * __restrict ptr_in0,
 6                                      T_in const * __restrict ptr_in1,
 7                                      T_out      * __restrict ptr_out ) {
 8  // define matrix multiplication operation
 9  using MMUL = aie::mmul<r, s, t, T_in, T_in, accfloat>;
10
11  // define vectors
12  aie::vector<T_in,  MMUL::size_A> mat_in0;
13  aie::vector<T_in,  MMUL::size_B> mat_in1;
14  aie::vector<T_out, MMUL::size_C> mat_out;
15
16  // load data
17  mat_in0 = aie::load_v<MMUL::size_A>(ptr_in0);
18  mat_in1 = aie::load_v<MMUL::size_B>(ptr_in1);
19  mat_out = aie::load_v<MMUL::size_C>(ptr_out);
20
21  // declare accumulator
22  MMUL mm_out(mat_out);
23
24  // perform matrix multiplication
25  mm_out.mac(mat_in0, mat_in1);
26
27  // store accumulator
28  aie::store_v(ptr_out, mm_out.template to_vector<T_out>());
29
30  return;
31}
32
33// instantiate template
34extern "C" {
35  void bfp16_emu_mac( bfloat16 const * ptr_in0,
36                      bfloat16 const * ptr_in1,
37                      float          * ptr_out ) {
38    bf16_mac_template<bfloat16, float, 8, 8, 8>(ptr_in0, ptr_in1, ptr_out);
39  }
40}

Listing 3 shows an AIE-API kernel that performs an 8×8×8 BF16 matrix multiplication. The template is identical to the XDNA1 version in Listing 1 but is instantiated with the appropriate dimension sizes in line 38.

Defining the directive AIE_API_EMULATE_BFLOAT16_MMUL_WITH_BFP16 instructs the AIE-API to emulate the matrix multiplication in the kernel with BFP16 operations:

clang++ -O2 -std=c++20 --target=aie2p-none-unknown-elf \
        -DAIE_API_EMULATE_BFLOAT16_MMUL_WITH_BFP16 \
        -I aie_api/include -S bfp16_emu_mac.cpp -o bfp16_emu_mac.s
Listing 4 Assembly code obtained from the AIE-API kernel when compiled with the directive AIE_API_EMULATE_BFLOAT16_MMUL_WITH_BFP16 defined.#
 6bfp16_emu_mac:                              // @bfp16_emu_mac
 7// %bb.0:                               // %entry
 8	nopa	;		vldb	 x2, [p1, #0];		nopxm	;		nops	
 9	vldb	 x4, [p1, #64]
10	nop	
11	nop	
12	nop	
13	movxm	r0, #16256
14	vlda.conv.fp32.bf16	 cml0, [p0, #0];		vbcst.16	 x0, r0
15	mova	r0, #53;		vmov	x1, x0
16	mova	r1, #52;		vshuffle	x7, x2, x4, r0
17	vlda.conv.fp32.bf16	 cmh0, [p0, #64];		vshuffle	x6, x2, x4, r1
18	mova	r0, #60
19	vmul.f	dm1, y3, y0, r0
20	nop	
21	nop	
22	vlda	 bmll0, [p2, #0]
23	vlda	 bmlh0, [p2, #64]
24	vlda	 bmhl0, [p2, #128];		vconv.bfp16ebs8.fp32	 ex0, dm0
25	vlda	 bmhh0, [p2, #192];		vconv.bfp16ebs8.fp32	 ex2, dm1
26	nop	
27	nop	
28	mova	r0, #780
29	vmac.f	dm0, dm0, ex0, ex2, r0
30	nop	
31	nop	
32	nop	
33	nop	
34	ret	lr
35	vst	 bmll0, [p2, #0]                //  Delay Slot 5
36	vst	 bmlh0, [p2, #64]               //  Delay Slot 4
37	vst	 bmhl0, [p2, #128]              //  Delay Slot 3
38	vst	 bmhh0, [p2, #192]              //  Delay Slot 2
39	nop	                                //  Delay Slot 1

The relevant part of the generated assembly code is shown in Listing 4. Line 29 contains the instruction vmac.f dm0, dm0, ex0, ex2, r0, which performs the BFP16 matrix multiplication.

Since the intrinsics assume pointers to BF16 inputs and an FP32 output, the compiler issues the corresponding conversion operations to BFP16. We were not able to write an AIE-API kernel that reads BFP16 values directly from L1 scratchpad memory. Instead, we examined the Peano test code to identify the operations required for loading BFP16 data into the EX registers.

Listing 5 Relevant operations for loading an EX register in Peano test code.#
49movx r24, #0
50vlda.fill.512 [p0, lf0, r24]
51// no instruction in line
52vlda.pop.576 ex5, [p0, lf0, r24]
53
54
55
56
57
58
59
60
61// no instruction in line
62// no instruction in line
63vmac.f dm2, dm2, ex11, ex5, r0

The relevant part is shown in Listing 5. The required 576-bit load for an EX register is split into multiple operations. Line 49 initializes R24 to 0. R24 is used as an offset and to track the pipeline fill stage. vlda.fill.512 [p0, lf0, r24] loads 512 bits into load file 0 and increments the value in R24 by 64 (the number of loaded bytes). vlda.pop.576 ex5, [p0, lf0, r24] loads 512 bits using the value in R24 as an address offset. Depending on the value in R24, the pipeline consumes the 64 bytes already in LF0 and 8 additional bytes from the subsequent fetch to assemble the 576-bit EX5 register. The operation increments P0 by 72 and decrements R24 by 8 (number of bytes drained from the fill stage). The EX5 register is used after eight instructions, indicating an eight-cycle latency.
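The counter updates described above can be summarized in a small model. This is a schematic sketch of the stated pointer and offset arithmetic only, not an emulation of the load pipeline.

// Schematic model of the vlda.fill.512 / vlda.pop.576 bookkeeping described above.
// p0 is the load pointer; r24 is the offset that tracks the pipeline fill stage.
struct ex_load_pipeline {
  unsigned p0  = 0;  // byte address used by the pipeline loads
  int      r24 = 0;  // offset / fill-stage counter

  // vlda.fill.512 [p0, lf0, r24]: fetch 64 bytes into LF0, increment r24 by 64.
  void fill() { r24 += 64; }

  // vlda.pop.576 ex, [p0, lf0, r24]: fetch 64 bytes at offset r24, assemble a
  // 72-byte EX register from the buffered data, advance p0 by 72, drain 8 bytes.
  void pop() { p0 += 72; r24 -= 8; }
};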

Summary#

XDNA2 has five 2048-bit accumulator registers (DM0-DM4) and their 1024-bit/512-bit views in the ISA. Compared to AIE-ML v2, this represents a 1.6× reduction in the architecturally visible accumulator count.
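The store offsets in Listing 4 (0, 64, 128, and 192 relative to P2 for BMLL0, BMLH0, BMHL0, and BMHH0) suggest how the 512-bit views tile one 2048-bit accumulator. The struct below is a schematic sketch under that assumption, not an architectural definition.

#include <cstdint>

// Assumed layout of one 2048-bit accumulator DM0 as four 512-bit views,
// ordered from low to high as the store offsets in Listing 4 suggest.
struct dm0_views {
  uint8_t bmll0[64];  // assumed lowest 512 bits
  uint8_t bmlh0[64];
  uint8_t bmhl0[64];
  uint8_t bmhh0[64];  // assumed highest 512 bits
};
static_assert(sizeof(dm0_views) == 256, "2048 bits = 256 bytes");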

Table 3, Table 4, Table 5, and Table 6 summarize additional information on registers that are relevant for BFP16 operations. The inferred ISA is provided in Table 7.

Table 3 Exponent Registers.#

32-bit        64-bit
EL0, EH0      E0
EL11, EH11    E11

Table 4 Vector Registers with Exponent Registers.#

8-byte + 64-byte    72-byte
E0 + X0             EX0
E11 + X11           EX11

Table 5 The data layout of EX0 is as follows: EX0 overlaps with E0 and X0 in an interleaved manner. The first byte of EX0 overlaps with the first byte of E0. The next eight bytes of EX0 overlap with the first eight bytes of X0. This pattern repeats.#

1-byte + 8-byte      9-byte
E0[0] + X0[ 0: 7]    EX0[ 0: 8]
E0[1] + X0[ 8:15]    EX0[ 9:17]
E0[2] + X0[16:23]    EX0[18:26]
E0[3] + X0[24:31]    EX0[27:35]
E0[4] + X0[32:39]    EX0[36:44]
E0[5] + X0[40:47]    EX0[45:53]
E0[6] + X0[48:55]    EX0[54:62]
E0[7] + X0[56:63]    EX0[63:71]
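The interleaving in Table 5 can also be written as an index mapping. The helper below is a hypothetical illustration (not an AIE-API function) of where block i of E0 and X0 lands inside EX0.

// Byte positions of block i (0 <= i < 8) inside the 72-byte EX0 register:
// the exponent byte E0[i] is followed by the eight data bytes X0[8*i .. 8*i+7].
struct ex0_block_range {
  int exponent_byte;    // position of E0[i] within EX0
  int first_data_byte;  // position of X0[8*i] within EX0
  int last_data_byte;   // position of X0[8*i+7] within EX0
};

constexpr ex0_block_range ex0_block(int i) {
  return ex0_block_range{ 9 * i, 9 * i + 1, 9 * i + 8 };
}

// Block 7 occupies EX0[63:71], matching the last row of Table 5.
static_assert(ex0_block(7).exponent_byte == 63 && ex0_block(7).last_data_byte == 71);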

Table 6 Load and store files.#

512-bit       1024-bit
LFL0, LFH0    LF0
LFL1, LFH1    LF1
STL, STH      ST

Table 7 XDNA2 operations that are required for our target tensor contraction workload. The operations are sorted by groups and provided together with their latencies.#

nop - no-operation
  NOP: latency -, do nothing
  NOP(V|A|B|S|X|M|XM): latency -, do nothing in unit

mov - move
  MOV <Rd>, #<imm10>: latency 1
  MOV <Rd>, <Rm>: latency 1
  MOV(A|X) <Rd>, #<imm8>: latency 1
  MOV(A|X) <Rd>, <Rm>: latency 1
  MOVXM <Rd>, #<imm32>: latency 1
  VMOV <Xd>, <Xm>: latency 2
  VMOV (<BMLLd>|<BMLHd>|<BMHLd>|<BMHHd>), <Xm>: latency 2

ld - load - 4 byte load
  LD(A|B) <Rd>, [<Pn>], #<imm8>: latency 6, post-index
  LD(A|B) <Rd>, [<Pn>], <Mm>: latency 6, post-index
  LD(A|B) <Rd>, [<Pn>, #<imm8>]: latency 6, offset
  LD(A|B) <Rd>, [<Pn>, <DJm>]: latency 6, offset

vld - vector load - 64 byte loads
  VLD(A|B) <Xd>, [<Pn>], #<imm8>: latency 7, post-index (a)
  VLDA (<BMLLd>|<BMLHd>|<BMHLd>|<BMHHd>), [<Pn>], #<imm8>: latency 7, post-index (a)
  VLDA.CONV.FP32.BF16 (<CMLd>|<CMHd>), [<Pn>], #<imm8>: latency 7, post-index (a)
  VLDA.FILL.512 [P0, LF0, R24]: latency -
  VLDB.FILL.512 [P1, LF1, R25]: latency -
  VLDA.POP.576 <EXd>, [P0, LF0, R24]: latency 8, no std. vld follow (b)
  VLDB.POP.576 <EXd>, [P1, LF1, R25]: latency 8, no std. vld follow (b)

st - store - 4 byte store
  ST <Rd>, [<Pn>], #<imm8>: latency 6, post-index (a)

vst - vector store - 64 byte store
  VST (<WLd>|<WHd>), [<Pn>], #<imm8>: latency 2, post-index (a)
  VST (<BMLLd>|<BMLHd>|<BMHLd>|<BMHHd>), [<Pn>], #<imm8>: latency 2, post-index (a)
  VST.CONV.BF16.FP32 (<CMLd>|<CMHd>), [<Pn>], #<imm8>: latency 2, post-index (a)
  VST.PUSH.576.CONV.BFP16EBS8.FP32 <DMd>, [P2, SF, R26]: latency -
  VST.FLUSH.512.CONV [P2, SF, R26]: latency 2

add - addition
  ADD <Rd>, <Rm>, #<imm7>: latency 1, exp. leading bit (c)
  ADD <Rd>, <Rm>, <Rn>: latency 1

padd - pointer addition
  PADD(A|B|S) [<Pd>], <Mn>: latency 1
  PADD(A|S) [<Pd>], #<imm11>: latency 1
  PADDB [<Pd>], #<imm10>: latency 1

mul - multiplication
  MUL <Rd>, <Rm>, <Rn>: latency 2

vmac - vector multiply accumulate
  VMAC.F <DMd>, <DMm>, <EXr>, <EXs>, <Rn>: latency 6, lfw, conf. reg. (d)

comparisons
  (GT|LT|GE|LE){U} <Rd>, <Rm>, <Rn>: latency 1, U for unsigned
  SEL.(EQZ|NEZ) <Rd>, <Rm>, <Rn>, <R27>: latency 1

j - jump
  J #<label>: latency 6
  J(Z|NZ) <Rd>, #<label>: latency 6
  RET <LR>: latency 6

(a) Same addressing modes as ld.
(b) Cannot be followed by a standard (non-pipeline) load operation.
(c) Leading bit is expanded.
(d) <DMm> is read in the fourth cycle. <Rn> as configuration register (value 780 for 8x8x8-bfp16).