4. Neon#

Instruction set architectures have dedicated registers and instructions for vector and matrix processing. In the case of AArch64, Neon, also called Advanced SIMD (ASIMD), is required in all AArch64-compliant processors. It supports scalar floating-point operations and vector operations on vector registers with up to 128 bits. Depending on the microarchitecture (see Section 1.2), we can also use the Scalable Vector Extension (SVE) for vector processing or the Scalable Matrix Extension (SME) for matrix processing. The vector and matrix processing instructions are used in conjunction with the base instructions introduced in Section 3: the base instructions handle address calculation, branching, and conditional code execution, while the vector and matrix instructions do the actual heavy lifting in terms of computation. This chapter discusses Neon’s vector instructions. Section 5 then discusses SVE, and Section 6 moves on to SME.

Note

Neon was extended with matrix processing capabilities in 2019. In this book, we do not discuss these instructions and instead limit our matrix processing discussions to SME. However, if you are interested in Neon’s matrix instructions, the BFMMLA instruction is a good place to start.

Theoretically, we could build the entire tensor compiler using only Neon. However, in most cases, using the more advanced SVE and SME, if available, will improve performance. There are two ways to proceed:

  1. Study Neon only, i.e. skip everything about SVE and SME and continue to the end of the book. Then, when everything works, go back and add SVE and SME support to the compiler.

  2. Study Neon, SVE, and SME before continuing. Then write a unified tensor compiler that can generate code for all three options.

If you have the patience to stick with the ISA a bit longer, and have access to SVE and SME hardware, the second option is recommended. The rationale behind this recommendation is that being aware of the more powerful instructions leads to better design decisions in the code generator.

This section follows the structure of Section 2 and Section 3. That is, we introduce the SIMD and floating-point registers, discuss load and store instructions, and finally introduce data processing instructions.

4.1. Registers#

Neon has thirty-two 128-bit SIMD and floating-point registers that are visible to the A64 instruction set. The registers are architecturally named V0 to V31.


Fig. 4.1.1 Illustration of the thirty-two 128-bit Neon registers V0-V31 visible to the A64 instruction set, the floating-point control register (FPCR), and the floating-point status register (FPSR).#

As shown in Fig. 4.1.1, the registers can be accessed as:

  • 8-bit registers: B0 to B31.

  • 16-bit registers: H0 to H31.

  • 32-bit registers: S0 to S31.

  • 64-bit registers: D0 to D31.

  • 128-bit registers: Q0 to Q31.

In addition, we can use the registers as 128-bit or 64-bit vectors of elements. We will discuss this view of the registers in Section 4.2. Neon also has special-purpose registers, two of which are also shown in Fig. 4.1.1:

Floating-point Control Register (FPCR)

Controls floating-point behavior. For example, it sets the rounding mode, selects default NaN propagation, and enables or disables flushing of denormalized numbers to zero.

Floating-point Status Register (FPSR)

Provides floating-point status information. For example, cumulative exception bits in the register are set when a division by zero or a saturation occurs.

4.2. Arrangement Specifiers#

Many Neon loads and stores, as well as all data-processing instructions, use arrangement specifiers. An arrangement specifier is a suffix in the form .<N><T> used when referring to a register. This suffix encodes the number of lanes and the lane width each instruction operates on. Thus, arrangement specifiers determine how to partition a register’s 64- or 128-bit view into lanes.

Table 4.2.1 Neon arrangement specifiers.#

Specifier | Vector Width (bits) | Number of Lanes | Lane Width (bits)
----------|---------------------|-----------------|------------------
.2D       | 128                 | 2               | 64
.4S       | 128                 | 4               | 32
.8H       | 128                 | 8               | 16
.16B      | 128                 | 16              | 8
.1D       | 64                  | 1               | 64
.2S       | 64                  | 2               | 32
.4H       | 64                  | 4               | 16
.8B       | 64                  | 8               | 8

Table 4.2.1 shows the arrangement specifiers available in Neon. In an instruction, we apply these specifiers to the vector registers introduced in Section 4.1. For example, V17.4S means that the instruction treats register Q17 as a vector containing four 32-bit values.
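To make the lane view concrete, the following C sketch models a 128-bit Neon register as a union of the byte, half-word, word, and double-word views. The type and function names are illustrative only, and the sketch assumes little-endian byte order, as used by AArch64.

```c
#include <stdint.h>
#include <string.h>

/* Illustrative model of a 128-bit Neon register: the same 16 bytes can
 * be viewed through different arrangement specifiers. Assumes
 * little-endian byte order, as on AArch64. */
typedef union {
    uint8_t  b[16]; /* .16B: sixteen 8-bit lanes */
    uint16_t h[8];  /* .8H:  eight 16-bit lanes  */
    uint32_t s[4];  /* .4S:  four 32-bit lanes   */
    uint64_t d[2];  /* .2D:  two 64-bit lanes    */
} vreg128;

/* Return lane i of the .4S view of a register loaded from bytes. */
uint32_t lane_4s(const uint8_t bytes[16], int i) {
    vreg128 v;
    memcpy(v.b, bytes, 16);
    return v.s[i];
}
```

For example, under the .4S view the first four bytes of the register form lane 0, the next four bytes lane 1, and so on.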

4.3. Procedure Call Standard#

The procedure call standard defines the role of the Neon registers in function calls. V0-V7 are used to pass values into a function and to return results. Of the remaining registers, V8-V15 are callee-saved and V16-V31 are caller-saved. Unlike the GPRs, we do not have to preserve the entire contents of the callee-saved Neon registers: only the lower 64 bits of V8-V15 need to be preserved, i.e. the values in D8-D15.

Listing 4.3.1 Example assembly program that sets the frame pointer register and temporarily stores registers X19-X30 and D8-D15 on the stack.#
   .text
   .type pcs, %function
   .global pcs
pcs:
   // save frame pointer and link register
   stp fp, lr, [sp, #-16]!
   // update frame pointer to current stack pointer
   mov fp, sp

   // save callee-saved registers
   stp x19, x20, [sp, #-16]!
   stp x21, x22, [sp, #-16]!
   stp x23, x24, [sp, #-16]!
   stp x25, x26, [sp, #-16]!
   stp x27, x28, [sp, #-16]!

   stp  d8,  d9, [sp, #-16]!
   stp d10, d11, [sp, #-16]!
   stp d12, d13, [sp, #-16]!
   stp d14, d15, [sp, #-16]!

   // use registers as needed

   // restore callee-saved registers
   ldp d14, d15, [sp], #16
   ldp d12, d13, [sp], #16
   ldp d10, d11, [sp], #16
   ldp  d8,  d9, [sp], #16

   ldp x27, x28, [sp], #16
   ldp x25, x26, [sp], #16
   ldp x23, x24, [sp], #16
   ldp x21, x22, [sp], #16
   ldp x19, x20, [sp], #16

   // restore frame pointer and link register
   ldp fp, lr, [sp], #16

   ret

Listing 4.3.1 shows an updated version of the template we originally introduced in Section 2.6. Now, in addition to X19-X30, we temporarily store the contents of D8-D15 on the stack. Of course, we can eliminate the corresponding stack transfers if the lower 64 bits of a register in V8-V15 are not modified.

4.4. Loads and Stores#

As with the base instructions, a group of instructions allows us to transfer data between memory and the SIMD&FP registers.

The LDR (immediate, SIMD&FP) and STR (immediate, SIMD&FP) instructions work similarly to the base instructions LDR (immediate) and STR (immediate) discussed earlier. However, we can now use the B, H, S, D, or Q view on the SIMD&FP registers. Similarly, LDP (SIMD&FP) and STP (SIMD&FP) allow us to transfer data between memory and two SIMD&FP registers. We give high-level descriptions for some example instructions:

ldr d5, [x0]

Load 64 bits (double word) from memory into register D5. In memory, the data is located at the 64-bit address held in register X0.

ldr q1, [x3]

Load 128 bits (quad word) from memory into register Q1. In memory, the data is located at the 64-bit address held in register X3.

str h1, [x3, #32]

Store 16 bits (half word) from register H1 into memory. The memory address is calculated by adding offset 32 to the value in register X3.

ldp q3, q8, [x2]

Load 2x128 bits from memory into registers Q3 and Q8. In memory, the data is at the 64-bit address held in register X2.
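As an illustration of the immediate-offset addressing used by str h1, [x3, #32], the following C sketch models storing the 16-bit H view of a register at a base address plus a byte offset. The function name is hypothetical.

```c
#include <stdint.h>
#include <string.h>

/* Illustrative model of str h1, [x3, #32]: store a 16-bit value at
 * base + offset, where offset is an immediate byte offset. */
void str_h_imm(uint8_t *base, long offset, uint16_t h) {
    memcpy(base + offset, &h, sizeof h);
}
```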

A particularly interesting pair of load and store instructions in Neon are LD1 (multiple structures) and ST1 (multiple structures). LD1 (multiple structures) allows us to load data from memory into up to four consecutive SIMD&FP registers, while ST1 (multiple structures) allows us to store data from up to four consecutive registers into memory. The term “consecutive” means that if the first register has the ID Vt, then the following registers must have the IDs (Vt+1)%32, (Vt+2)%32, and (Vt+3)%32. Again, we provide high-level descriptions for some examples:

ld1 {v0.4s, v1.4s, v2.4s, v3.4s}, [x0]

Load 4x4x32 bits (512 bits total) from memory into registers V0, V1, V2 and V3. In memory, the data is located at the 64-bit address held in register X0.

st1 {v31.2d, v0.2d, v1.2d, v2.2d}, [x3], #64

Store 4x2x64 bits (512 bits total) from registers V31, V0, V1, and V2 into memory. The memory address is held in register X3. In addition, the value of register X3 is incremented by 64.
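A scalar C model can illustrate what LD1 (multiple structures) does in the first example above: it simply copies 512 contiguous bits from memory into four four-lane FP32 vectors. The array regs stands in for the registers V0-V3; all names are illustrative.

```c
#include <string.h>

/* Illustrative model of ld1 {v0.4s, v1.4s, v2.4s, v3.4s}, [x0]:
 * copy 512 contiguous bits from memory into four 4-lane FP32 vectors. */
void ld1_4x4s(float regs[4][4], const float *mem) {
    memcpy(regs, mem, 4 * 4 * sizeof(float)); /* 64 bytes = 512 bits */
}
```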

4.5. Data Processing Instructions#

This section covers two groups of SIMD&FP data-processing instructions heavily used in our tensor kernels. Rearrangement instructions read source vector registers, rearrange their elements, and write the result to a destination register. Floating-point multiply-accumulate instructions multiply floating-point operands and accumulate the result in a destination register.

4.5.1. Rearrangement Instructions#

The rearrangement instructions used in our kernels are TRN1, TRN2, ZIP1 and ZIP2. TRN1 and TRN2 build the result from the even- and odd-indexed lanes of the two source vectors, respectively, while ZIP1 and ZIP2 interleave the lanes of the lower and upper halves of the two source vectors.
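To illustrate the lane semantics of these rearrangement instructions, the following C sketch models ZIP1 and TRN1 for the .4S arrangement; ZIP2 and TRN2 operate analogously on the upper halves and odd-indexed lanes, respectively. Function names are illustrative only.

```c
/* Illustrative scalar models of two Neon rearrangement instructions
 * for the .4S arrangement (four 32-bit lanes). */

/* ZIP1 Vd.4S, Vn.4S, Vm.4S: interleave the lower halves of n and m. */
void zip1_4s(float d[4], const float n[4], const float m[4]) {
    d[0] = n[0]; d[1] = m[0];
    d[2] = n[1]; d[3] = m[1];
}

/* TRN1 Vd.4S, Vn.4S, Vm.4S: pick the even-indexed lanes of n and m. */
void trn1_4s(float d[4], const float n[4], const float m[4]) {
    d[0] = n[0]; d[1] = m[0];
    d[2] = n[2]; d[3] = m[2];
}
```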

4.5.2. FMADD#

FMADD is a scalar floating-point fused multiply-add (FMA) instruction and exists in FP16, FP32, and FP64 variants. The three variants have the following encodings:

FP16 variant

FMADD <Hd>, <Hn>, <Hm>, <Ha>

FP32 variant

FMADD <Sd>, <Sn>, <Sm>, <Sa>

FP64 variant

FMADD <Dd>, <Dn>, <Dm>, <Da>

The variant of the instruction is encoded in the two-bit field ftype, while the IDs of the registers are encoded in the five-bit fields Rn, Rm, Ra and Rd. The Rn field encodes the ID of the source register holding the multiplicand, the Rm field the ID of the register holding the multiplier, and the Ra field the ID of the register holding the addend. The Rd field encodes the ID of the destination register. FMADD multiplies the two values in the registers with IDs Rn and Rm, adds the product to the value in Ra, and writes the result to Rd.

We discuss two FMADD examples:

fmadd h0, h1, h2, h3

Multiply the two FP16 values in H1 and H2, add the product to the FP16 value in H3 and write the result to H0.

fmadd s2, s4, s7, s3

Multiply the two FP32 values in S4 and S7, add the product to the value in S3 and write the result to S2.

4.5.3. FMLA (vector)#

FMLA (vector) is our second FMA instruction and our first vector instruction. Instead of operating on scalars, it operates on vectors with up to eight elements. FMLA (vector) exists in FP16, FP32, and FP64 variants and has two encodings: one for FP16 and one shared by the size-specific FP32 and FP64 variants. The mnemonic for all variants is FMLA <Vd>.<T>, <Vn>.<T>, <Vm>.<T>, where the register IDs are encoded in the Rn, Rm and Rd fields and <T> is an arrangement specifier as introduced in Section 4.2. FMLA (vector) multiplies the two source vectors element-wise, adds the products to the vector in the destination register, and writes the result back to the destination register. This makes it a destructive instruction: it first reads the addend vector from the destination register and then overwrites it with the FMA result.

We discuss two FP32 examples using different arrangement specifiers:

fmla v2.4s, v4.4s, v7.4s

Multiply the two four-element FP32 vectors in V4 and V7, add the product to the vector in V2 and write the result to V2. This instruction operates on the full 128 bits of the involved SIMD&FP registers. In total, the instruction performs eight floating-point operations.

fmla v2.2s, v4.2s, v7.2s

Multiply the two two-element FP32 vectors in V4 and V7, add the product to the vector in V2 and write the result to V2. This instruction operates on the lower 64 bits of the involved SIMD&FP registers. In total, the instruction performs four floating-point operations.

4.5.4. FMLA (by element)#

Our last FMA instruction is FMLA (by element). It has both scalar and vector forms. The vector forms are similar to FMLA (vector), but the multiplier is a single scalar taken from one lane of a SIMD&FP register and applied to every lane of the first source vector.

The following examples illustrate the vector variants of this instruction:

fmla v17.8h, v2.8h, v8.h[6]

Multiply the eight-element FP16 vector in V2 with the seventh FP16 element of register V8, add the product to the vector in V17 and write the result to V17. In total, the instruction performs sixteen floating-point operations.

fmla v12.4s, v30.4s, v5.s[0]

Multiply the four-element FP32 vector in V30 with the first FP32 element of register V5, add the product to the vector in V12 and write the result to V12. In total, the instruction performs eight floating-point operations.

fmla v0.2d, v0.2d, v0.d[1]

Multiply the two-element FP64 vector in V0 with the second FP64 element of register V0, add the product to the vector in V0 and write the result to V0. In total, the instruction performs four floating-point operations.