4. Neon

Many instruction set architectures provide dedicated registers and instructions for vector and matrix processing. In AArch64, Neon, also called Advanced SIMD (ASIMD), is required in all AArch64-compliant processors. It supports scalar floating-point operations and vector operations on vector registers of up to 128 bits. Depending on the microarchitecture (see Section 1.2), we can also use the Scalable Vector Extension (SVE) for vector processing or the Scalable Matrix Extension (SME) for matrix processing. The vector and matrix processing instructions are used in conjunction with the base instructions introduced in Section 3: the base instructions handle address calculation, branching, and conditional code execution, while the vector and matrix instructions do the heavy lifting in terms of computation. This chapter discusses Neon’s vector instructions. Section 5 then discusses SVE, and Section 6 moves on to SME.

Note

Neon was extended with some matrix processing capabilities in 2019. In this book, we will not discuss these instructions and will instead limit our matrix processing discussions to SME. However, if you are interested in Neon’s matrix instructions, the BFMMLA instruction is a good place to start.

Theoretically, we could build the entire tensor compiler using only Neon. However, in most cases, using the more advanced SVE and SME, if available, will improve performance. There are two ways to proceed:

Study Neon only, i.e. skip everything about SVE and SME and continue to the end of the book. Then, when everything works, go back and add SVE and SME support to the compiler.
Study Neon, SVE, and SME before continuing. Then write a unified tensor compiler that can generate code for all three options.

If you have the patience to stick with the ISA for a bit longer and have access to SVE and SME hardware, the second option is recommended. The rationale is that you will be more aware of the more powerful instructions when making design decisions in the code generator.

This section follows the structure of Section 2 and Section 3. That is, we introduce the SIMD and floating-point registers, discuss load and store instructions, and finally introduce data processing instructions.

4.1. Registers

Neon has thirty-two 128-bit SIMD and floating-point registers that are visible to the A64 instruction set. The registers are architecturally named V0 to V31.

Fig. 4.1.1 Illustration of the thirty-two 128-bit Neon registers V0-V31 visible to the A64 instruction set, the floating-point control register (FPCR), and the floating-point status register (FPSR).

As shown in Fig. 4.1.1, each register can be accessed as:

8-bit registers: B0 to B31.
16-bit registers: H0 to H31.
32-bit registers: S0 to S31.
64-bit registers: D0 to D31.
128-bit registers: Q0 to Q31.
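To make the aliasing of these views concrete, the following C sketch models one 128-bit register as two 64-bit halves and derives the B, H, S, and D views by masking the least-significant bits. The type `vreg128` and the `view_*` helpers are names made up for this illustration; they are not part of any Arm header or of the book's code.

```c
#include <stdint.h>

/* Hypothetical model of one 128-bit SIMD&FP register (illustration only). */
typedef struct {
    uint64_t lo; /* bits 0..63   */
    uint64_t hi; /* bits 64..127 */
} vreg128;

/* Each narrower view aliases the least-significant bits of the register. */
uint8_t view_b(vreg128 v) { return (uint8_t)(v.lo & 0xFFu); }
uint16_t view_h(vreg128 v) { return (uint16_t)(v.lo & 0xFFFFu); }
uint32_t view_s(vreg128 v) { return (uint32_t)(v.lo & 0xFFFFFFFFu); }
uint64_t view_d(vreg128 v) { return v.lo; }
/* The Q view is the whole register, i.e. both `lo` and `hi`. */
```

In this model, B5, H5, S5, D5, and Q5 would all be views of the same `vreg128` value, which is why writing one view affects what the others read.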

In addition, we can use the registers as 128-bit vectors of elements or 64-bit vectors of elements. We will discuss this view of the registers in Section 4.2. Neon also has special-purpose registers, two of which are also shown in Fig. 4.1.1:

Floating-point Control Register (FPCR): Controls floating-point behavior. For example, we can enable or disable default NaN mode, set the rounding mode, or enable or disable flushing of denormalized numbers to zero.
Floating-point Status Register (FPSR): Provides floating-point status information. For example, exception bits in the register are set when a division by zero or a saturation has occurred.

4.2. Arrangement Specifiers

Many Neon loads and stores, as well as many data-processing instructions, use arrangement specifiers. An arrangement specifier is a suffix of the form .<N><T> used when referring to a register. This suffix encodes the number of lanes and the lane width the instruction operates on. Thus, arrangement specifiers determine how a register’s 64- or 128-bit view is partitioned into lanes.

Table 4.2.1 Neon arrangement specifiers.
Specifier	Vector Width (bits)	Number of Lanes	Lane Width (bits)
`.2D`	128	2	64
`.4S`	128	4	32
`.8H`	128	8	16
`.16B`	128	16	8
`.1D`	64	1	64
`.2S`	64	2	32
`.4H`	64	4	16
`.8B`	64	8	8

Table 4.2.1 shows the arrangement specifiers available in Neon. In an instruction, we apply these specifiers to the vector registers introduced in Section 4.1. For example, V17.4S means that the instruction treats register V17 as a vector containing four 32-bit values.
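A code generator typically needs the lane count and lane width of a specifier at hand. The following C sketch encodes Table 4.2.1 as a lookup table; the names `arrangement`, `find_arrangement`, and `vector_bits` are hypothetical and only serve this illustration.

```c
#include <stddef.h>
#include <string.h>

/* One row of Table 4.2.1 (hypothetical helper type). */
typedef struct {
    const char *name;   /* specifier without the leading dot */
    unsigned lanes;     /* number of lanes */
    unsigned lane_bits; /* lane width in bits */
} arrangement;

static const arrangement k_arrangements[] = {
    {"2D", 2, 64}, {"4S", 4, 32}, {"8H", 8, 16}, {"16B", 16, 8},
    {"1D", 1, 64}, {"2S", 2, 32}, {"4H", 4, 16}, {"8B", 8, 8},
};

/* Returns the table row for a specifier such as "4S", or NULL if unknown. */
const arrangement *find_arrangement(const char *name) {
    size_t n = sizeof(k_arrangements) / sizeof(k_arrangements[0]);
    for (size_t i = 0; i < n; ++i) {
        if (strcmp(k_arrangements[i].name, name) == 0) {
            return &k_arrangements[i];
        }
    }
    return NULL;
}

/* Vector width in bits: number of lanes times lane width. */
unsigned vector_bits(const arrangement *a) {
    return a->lanes * a->lane_bits;
}
```

For every row of the table, the product of lane count and lane width is either 64 or 128, which matches the two vector widths listed above.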

4.3. Procedure Call Standard

The procedure call standard defines the role of the Neon registers in function calls. V0-V7 are used to pass values into a function and to return values. Registers V8-V31 are scratch registers, where V8-V15 are callee-saved and V16-V31 are caller-saved. Unlike the callee-saved GPRs, the callee-saved Neon registers do not have to be preserved in full: only the lower 64 bits of V8-V15, i.e. the values in D8-D15, need to be preserved.

Listing 4.3.1 Example assembly program that sets the frame pointer register and temporarily stores registers X19-X30 and D8-D15 on the stack.

    .text
    .type pcs, %function
    .global pcs
pcs:
    // save frame pointer and link register
    stp fp, lr, [sp, #-16]!
    // update frame pointer to current stack pointer
    mov fp, sp

    // save callee-saved registers
    stp x19, x20, [sp, #-16]!
    stp x21, x22, [sp, #-16]!
    stp x23, x24, [sp, #-16]!
    stp x25, x26, [sp, #-16]!
    stp x27, x28, [sp, #-16]!

    stp  d8,  d9, [sp, #-16]!
    stp d10, d11, [sp, #-16]!
    stp d12, d13, [sp, #-16]!
    stp d14, d15, [sp, #-16]!

    // use registers as needed

    // restore callee-saved registers
    ldp d14, d15, [sp], #16
    ldp d12, d13, [sp], #16
    ldp d10, d11, [sp], #16
    ldp  d8,  d9, [sp], #16

    ldp x27, x28, [sp], #16
    ldp x25, x26, [sp], #16
    ldp x23, x24, [sp], #16
    ldp x21, x22, [sp], #16
    ldp x19, x20, [sp], #16

    // restore frame pointer and link register
    ldp fp, lr, [sp], #16

    ret

Listing 4.3.1 shows an updated version of the template we originally introduced in Section 2.6. Now, in addition to X19-X30, we temporarily store the contents of D8-D15 on the stack. Of course, we can eliminate the corresponding stack transfers if the lower 64 bits of a register in V8-V15 are not modified.

4.4. Loads and Stores

As with the base instructions, a group of instructions allows us to transfer data between memory and the SIMD&FP registers.

The LDR (immediate, SIMD&FP) and STR (immediate, SIMD&FP) instructions work similarly to the LDR (immediate) and STR (immediate) base instructions discussed earlier. However, we can now use the B, H, S, D, or Q views of the SIMD&FP registers. Similarly, LDP (SIMD&FP) and STP (SIMD&FP) allow us to transfer data between memory and two SIMD&FP registers. We give high-level descriptions for some example instructions:

ldr d5, [x0]: Load 64 bits (double word) from memory into register D5. In memory, the data is located at the 64-bit address held in register X0.
ldr q1, [x3]: Load 128 bits (quad word) from memory into register Q1. In memory, the data is located at the 64-bit address held in register X3.
str h1, [x3, #32]: Store 16 bits (half word) from register H1 into memory. The memory address is calculated by adding offset 32 to the value in register X3.
ldp q3, q8, [x2]: Load 2x128 bits from memory into registers Q3 and Q8. In memory, the data is at the 64-bit address held in register X2.

A particularly interesting pair of load and store instructions in Neon are LD1 (multiple structures) and ST1 (multiple structures). LD1 (multiple structures) allows us to load data from memory into up to four consecutive SIMD&FP registers, while ST1 (multiple structures) allows us to store data from up to four consecutive registers into memory. The term “consecutive” means that if the first register has the ID Vt, then the following registers must have the IDs (Vt+1)%32, (Vt+2)%32, and (Vt+3)%32. Again, we provide high-level descriptions for some examples:

ld1 {v0.4s, v1.4s, v2.4s, v3.4s}, [x0]: Load 4x4x32 bits (512 bits total) from memory into registers V0, V1, V2 and V3. In memory, the data is located at the 64-bit address held in register X0.
st1 {v31.2d, v0.2d, v1.2d, v2.2d}, [x3], #64: Store 4x2x64 bits (512 bits total) from registers V31, V0, V1, and V2 into memory. The memory address is held in register X3. In addition, the value of register X3 is incremented by 64.
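The register wrap-around and the post-increment of the address register can be sketched in C as follows. This is a simplified model for 128-bit arrangements only, and the names `vreg_file` and `st1_multiple` are assumptions made for this illustration.

```c
#include <stdint.h>
#include <string.h>

#define NUM_VREGS 32
#define VREG_BYTES 16

/* Hypothetical register file: 32 registers of 16 bytes each. */
typedef struct {
    uint8_t v[NUM_VREGS][VREG_BYTES];
} vreg_file;

/*
 * Models "st1 {Vt, ...}, [Xn], #imm" for 128-bit arrangements:
 * stores `count` (1..4) registers back to back, starting at register `t`
 * and wrapping from V31 to V0. Returns the post-incremented address.
 */
uint8_t *st1_multiple(const vreg_file *rf, unsigned t, unsigned count,
                      uint8_t *addr) {
    for (unsigned i = 0; i < count; ++i) {
        unsigned reg = (t + i) % NUM_VREGS; /* consecutive IDs modulo 32 */
        memcpy(addr + i * VREG_BYTES, rf->v[reg], VREG_BYTES);
    }
    return addr + count * VREG_BYTES; /* +64 for four 128-bit registers */
}
```

Calling `st1_multiple(&rf, 31, 4, mem)` mirrors the second example above: the contents of V31, V0, V1, and V2 land at byte offsets 0, 16, 32, and 48, and the returned pointer is 64 bytes past the start.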

4.5. Data Processing Instructions

This section covers two groups of SIMD&FP data-processing instructions that are heavily used in our tensor kernels. Rearrangement instructions read source vector registers, rearrange their elements, and write the result to a destination register. Floating-point multiply-add instructions multiply floating-point operands and add the result to a destination register. The source code for the discussed examples is available in the archives rearrangement.tar.xz and fma.tar.xz.

4.5.1. TRN1 and TRN2

The rearrangement instructions TRN1 and TRN2 operate on two source registers Vn and Vm and the destination register Vd.

Fig. 4.5.1 `TRN1` interleaves even-indexed elements from two source vectors.

The assembly syntax of the TRN1 instruction is TRN1 <Vd>.<T>, <Vn>.<T>, <Vm>.<T>, where <T> is the arrangement specifier introduced in Section 4.2. As shown in Fig. 4.5.1, TRN1 interleaves the even-indexed elements from the two source registers and writes them to the destination register. In the example, the arrangement specifier 4S is used; therefore, the two even-indexed 32-bit elements of Vn are interleaved with those of Vm.

Listing 4.5.1 TRN1 example using 4S arrangement specifier.

    .text
    .type trn1, %function
    .global trn1
    /*
     * Interleave even-indexed 32-bit elements from two vectors (TRN1).
     * @param X0 pointer to first source vector (4x float)
     * @param X1 pointer to second source vector (4x float)
     * @param X2 pointer to destination vector (4x float)
     */
trn1:
    ldr q0, [x0]
    ldr q1, [x1]
    trn1 v2.4s, v0.4s, v1.4s
    str q2, [x2]
    ret
    .size trn1, (. - trn1)

Listing 4.5.1 shows a function that loads the input data from the addresses in X0 and X1, interleaves the even-indexed elements using TRN1, and writes the result to memory at the address in X2.

Fig. 4.5.2 `TRN2` interleaves odd-indexed elements from two source vectors.

Similarly, TRN2 interleaves the odd-indexed elements. Fig. 4.5.2 illustrates this for the 4S arrangement specifier, while Listing 4.5.2 shows an example function using the instruction.

Listing 4.5.2 TRN2 example using 4S arrangement specifier.

    .text
    .type trn2, %function
    .global trn2
    /*
     * Interleave odd-indexed 32-bit elements from two vectors (TRN2).
     * @param X0 pointer to first source vector (4x float)
     * @param X1 pointer to second source vector (4x float)
     * @param X2 pointer to destination vector (4x float)
     */
trn2:
    ldr q0, [x0]
    ldr q1, [x1]
    trn2 v2.4s, v0.4s, v1.4s
    str q2, [x2]
    ret
    .size trn2, (. - trn2)
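A scalar reference model can help when checking the output of such kernels. The following C sketch mirrors the element selection of TRN1 and TRN2 for FP32 lanes; the function name `trn_f32` and its `part` parameter are assumptions for this illustration, with `part = 0` standing for TRN1 and `part = 1` for TRN2.

```c
#include <stddef.h>

/*
 * Scalar model of TRN1 (part = 0) and TRN2 (part = 1) on FP32 lanes.
 * `lanes` must be even; d must not alias n or m.
 */
void trn_f32(float *d, const float *n, const float *m, size_t lanes,
             size_t part) {
    for (size_t p = 0; p < lanes / 2; ++p) {
        d[2 * p + 0] = n[2 * p + part]; /* element from the first source  */
        d[2 * p + 1] = m[2 * p + part]; /* element from the second source */
    }
}
```

With lanes listed from index 0 upward, sources n = (0, 1, 2, 3) and m = (4, 5, 6, 7) give (0, 4, 2, 6) for `part = 0` and (1, 5, 3, 7) for `part = 1`.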

4.5.2. ZIP1 and ZIP2

The rearrangement instruction ZIP1 interleaves pairs of adjacent elements from the lower halves of the source registers, while ZIP2 interleaves pairs of adjacent elements from the upper halves. Fig. 4.5.3 and Fig. 4.5.4 illustrate the behavior of the two instructions when using the 4S arrangement specifier. Corresponding example functions are given in Listing 4.5.3 and Listing 4.5.4.

Fig. 4.5.3 `ZIP1` interleaves the lower half of elements from two source vectors.

Listing 4.5.3 ZIP1 example using 4S arrangement specifier.

    .text
    .type zip1, %function
    .global zip1
    /*
     * Interleave lower-half 32-bit elements from two vectors (ZIP1).
     * @param X0 pointer to first source vector (4x float)
     * @param X1 pointer to second source vector (4x float)
     * @param X2 pointer to destination vector (4x float)
     */
zip1:
    ldr q0, [x0]
    ldr q1, [x1]
    zip1 v2.4s, v0.4s, v1.4s
    str q2, [x2]
    ret
    .size zip1, (. - zip1)

Fig. 4.5.4 `ZIP2` interleaves the upper half of elements from two source vectors.

Listing 4.5.4 ZIP2 example using 4S arrangement specifier.

    .text
    .type zip2, %function
    .global zip2
    /*
     * Interleave upper-half 32-bit elements from two vectors (ZIP2).
     * @param X0 pointer to first source vector (4x float)
     * @param X1 pointer to second source vector (4x float)
     * @param X2 pointer to destination vector (4x float)
     */
zip2:
    ldr q0, [x0]
    ldr q1, [x1]
    zip2 v2.4s, v0.4s, v1.4s
    str q2, [x2]
    ret
    .size zip2, (. - zip2)
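As with the transpose instructions, a scalar model makes the element selection explicit. The C sketch below is an illustration under the assumption of FP32 lanes; `zip_f32` is a made-up name, with `part = 0` standing for ZIP1 and `part = 1` for ZIP2.

```c
#include <stddef.h>

/*
 * Scalar model of ZIP1 (part = 0) and ZIP2 (part = 1) on FP32 lanes.
 * `lanes` must be even; d must not alias n or m.
 */
void zip_f32(float *d, const float *n, const float *m, size_t lanes,
             size_t part) {
    size_t pairs = lanes / 2;
    size_t base = part * pairs; /* 0: lower half, lanes / 2: upper half */
    for (size_t p = 0; p < pairs; ++p) {
        d[2 * p + 0] = n[base + p]; /* element from the first source  */
        d[2 * p + 1] = m[base + p]; /* element from the second source */
    }
}
```

With lanes listed from index 0 upward, sources n = (0, 1, 2, 3) and m = (4, 5, 6, 7) give (0, 4, 1, 5) for `part = 0` and (2, 6, 3, 7) for `part = 1`.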

4.5.3. FMADD

FMADD is a scalar floating-point fused multiply-add (FMA) instruction and exists in FP16, FP32, and FP64 variants.

Note

The FP16 variant is available with FEAT_FP16, whereas the FP32 and FP64 variants are part of the base floating-point functionality.

The three variants have the following assembly syntax:

FP16 variant: FMADD <Hd>, <Hn>, <Hm>, <Ha>
FP32 variant: FMADD <Sd>, <Sn>, <Sm>, <Sa>
FP64 variant: FMADD <Dd>, <Dn>, <Dm>, <Da>

The variant of the instruction is encoded in the 2-bit field ftype, while the register IDs are encoded in the five-bit fields Rn, Rm, Ra, and Rd. The ID of the source register holding the multiplicand is encoded in the Rn field, the ID of the register holding the multiplier in the Rm field, and the ID of the register holding the addend in the Ra field. The ID of the destination register is encoded in the Rd field. FMADD multiplies the two values in registers identified by Rn and Rm, adds the product to the value in Ra and writes the result to Rd.

Fig. 4.5.5 `FMADD` scalar FP32 multiply-add.

The behavior of the FP32 variant is illustrated in Fig. 4.5.5. In the example, the instruction multiplies 0.0f with 4.0f, adds the result to 8.0f, and writes the result to the 32-bit destination register Sd.

We discuss two specific FMADD examples:

fmadd h2, h17, h8, h3: Multiply the two FP16 values in H17 and H8, add the product to the FP16 value in H3 and write the result to H2.
fmadd s3, s0, s1, s2: Multiply the two FP32 values in S0 and S1, add the product to the value in S2 and write the result to S3. Listing 4.5.5 uses this instruction in a function with four pointers as parameters.

Listing 4.5.5 FMADD example using FP32 scalar registers.

    .text
    .type fmadd, %function
    .global fmadd
    /*
     * Scalar floating-point multiply-add: d = a * b + c.
     * @param X0 pointer to a (4x float)
     * @param X1 pointer to b (4x float)
     * @param X2 pointer to c (4x float)
     * @param X3 pointer to d (4x float)
     */
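fmadd:
    // load 128 bits per operand; only lane 0 (the S view) is used below
    ldr q0, [x0]
    ldr q1, [x1]
    ldr q2, [x2]
    ldr q3, [x3]
    // scalar FMA on lane 0: s3 = s0 * s1 + s2
    // the scalar write also clears the upper 96 bits of V3
    fmadd s3, s0, s1, s2
    // store all 128 bits: lane 0 holds the result, lanes 1-3 are zero
    str q3, [x3]
    ret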
    .size fmadd, (. - fmadd)
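The effect of Listing 4.5.5 on the four floats at `d` can be sketched in C. The name `fmadd_model` is made up for this illustration. Note two simplifications: the sketch uses an ordinary multiply followed by an add, so for general inputs its rounding may differ from the fused instruction in the last bit, and it relies on the architectural rule that a scalar write to an S register zeroes the remaining bits of the 128-bit register.

```c
/*
 * Scalar model of Listing 4.5.5: only lane 0 takes part in the FMA.
 * The 128-bit store then writes the result plus three zeroed lanes.
 */
void fmadd_model(const float *a, const float *b, const float *c, float *d) {
    d[0] = a[0] * b[0] + c[0]; /* fmadd s3, s0, s1, s2 */
    d[1] = 0.0f;               /* upper lanes of V3 were cleared */
    d[2] = 0.0f;
    d[3] = 0.0f;
}
```

For the values used in Fig. 4.5.5, lane 0 evaluates to 0.0f * 4.0f + 8.0f = 8.0f, and whatever was stored in the other three floats at `d` is replaced by zeros.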

4.5.4. FMLA (vector)

FMLA (vector) is our second FMA instruction and our first vector multiply-add instruction. Instead of operating on scalars, the instruction operates on vectors with up to eight elements. FMLA (vector) has FP16, FP32, and FP64 variants. The instruction has two encodings: one for FP16 and one shared by the FP32 and FP64 variants. The assembly syntax for all variants is FMLA <Vd>.<T>, <Vn>.<T>, <Vm>.<T>, where the register IDs are encoded in the Rn, Rm, and Rd fields.

FMLA (vector) multiplies the elements from the two source registers element-wise, and then adds the result to the vector in the destination register, writing the final result back to it. This makes it a destructive instruction: it reads the addend vector from the destination register first, then overwrites it with the FMA result.

Fig. 4.5.6 `FMLA (vector)` element-wise FP32 multiply-add on four-element vectors.

Fig. 4.5.6 illustrates the behavior of the instruction when using the arrangement specifier 4S. In the illustrated case, the instruction multiplies the two FP32 vectors (3.0f,2.0f,1.0f,0.0f) and (7.0f,6.0f,5.0f,4.0f), adds the product to (11.0f,10.0f,9.0f,8.0f), and writes the result (32.0f,22.0f,14.0f,8.0f) to register Vd.4S.

We discuss two specific FP32 examples using the 4S and 2S arrangement specifiers:

fmla v2.4s, v0.4s, v1.4s: Multiply the two four-element FP32 vectors in V0 and V1, add the product to the vector in V2 and write the result to V2. This instruction operates on the full 128 bits of the involved SIMD&FP registers. In total, the instruction performs eight floating-point operations. Listing 4.5.6 shows an example function using this instruction.
fmla v2.2s, v4.2s, v7.2s: Multiply the two two-element FP32 vectors in V4 and V7, add the product to the vector in V2 and write the result to V2. This instruction operates on the lower 64 bits of the involved SIMD&FP registers. In total, the instruction performs four floating-point operations.

Listing 4.5.6 FMLA (vector) example using 4S arrangement specifier.

    .text
    .type fmla_vector, %function
    .global fmla_vector
    /*
     * Vector floating-point multiply-add: c += a * b (element-wise).
     * @param X0 pointer to a (4x float)
     * @param X1 pointer to b (4x float)
     * @param X2 pointer to c (4x float)
     */
fmla_vector:
    ldr q0, [x0]
    ldr q1, [x1]
    ldr q2, [x2]
    fmla v2.4s, v0.4s, v1.4s
    str q2, [x2]
    ret
    .size fmla_vector, (. - fmla_vector)
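A scalar C model of the same computation is a convenient reference when testing the kernel. In the sketch below, `fmla_vector_f32` is a hypothetical name, and the multiply and add are separate C operations, so rounding may differ from the fused instruction in the last bit for general inputs. The return value is the operation count discussed above, i.e. one multiply and one add per lane.

```c
#include <stddef.h>

/*
 * Scalar model of FMLA (vector) on FP32 lanes: c[i] += a[i] * b[i].
 * Returns the number of floating-point operations (2 per lane).
 */
size_t fmla_vector_f32(float *c, const float *a, const float *b,
                       size_t lanes) {
    for (size_t i = 0; i < lanes; ++i) {
        c[i] += a[i] * b[i]; /* destructive: c is both addend and result */
    }
    return 2 * lanes;
}
```

With lanes listed from index 0 upward, the values of Fig. 4.5.6 are a = (0, 1, 2, 3), b = (4, 5, 6, 7), and c = (8, 9, 10, 11); the model updates c to (8, 14, 22, 32) and reports eight operations for four lanes.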

4.5.5. FMLA (by element)

Our last FMA instruction is FMLA (by element). FMLA (by element) has both scalar and vector forms. The vector forms are similar to FMLA (vector), but they use a scalar multiplier taken from a lane of a SIMD&FP register.

The following examples illustrate the vector variants of this instruction:

fmla v17.8h, v2.8h, v8.h[6]: Multiply the eight-element FP16 vector in V2 by element 6 (the seventh element) of register V8, add the product to the vector in V17 and write the result to V17. In total, the instruction performs sixteen floating-point operations.
fmla v12.2d, v30.2d, v5.d[1]: Multiply the two-element FP64 vector in V30 with the second FP64 element of register V5, add the product to the vector in V12 and write the result to V12. In total, the instruction performs four floating-point operations.
fmla v2.4s, v0.4s, v1.s[2]: Multiply the four-element FP32 vector in V0 by the third FP32 element of register V1, add the product to the vector in V2 and write the result to V2. In total, the instruction performs eight floating-point operations. Fig. 4.5.7 illustrates the behavior of the instruction. In particular, vector (3.0f,2.0f,1.0f,0.0f) is multiplied with scalar 6.0f, the product added to (11.0f,10.0f,9.0f,8.0f), and the result written to the destination register. An example use of the instruction is given in Listing 4.5.7.

Fig. 4.5.7 `FMLA (by element)` FP32 multiply-add broadcasting a scalar from a vector lane.

Listing 4.5.7 FMLA (by element) example using 4S arrangement specifier with element index.

    .text
    .type fmla_by_element, %function
    .global fmla_by_element
    /*
     * Vector floating-point multiply-add by element: c += a * b[2].
     * @param X0 pointer to a (4x float)
     * @param X1 pointer to b (4x float, element 2 broadcast)
     * @param X2 pointer to c (4x float)
     */
fmla_by_element:
    ldr q0, [x0]
    ldr q1, [x1]
    ldr q2, [x2]
    fmla v2.4s, v0.4s, v1.s[2]
    str q2, [x2]
    ret
    .size fmla_by_element, (. - fmla_by_element)
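The broadcast of a single lane can again be written down as a scalar C model. The name `fmla_by_element_f32` is an assumption for this illustration, and, as a simplification, the multiply and add are separate C operations rather than a fused one, so the last bit may differ from the hardware result for general inputs.

```c
#include <stddef.h>

/*
 * Scalar model of the vector form of FMLA (by element) on FP32 lanes:
 * c[i] += a[i] * b[index]. Returns the operation count (2 per lane).
 */
size_t fmla_by_element_f32(float *c, const float *a, const float *b,
                           size_t lanes, size_t index) {
    float scalar = b[index]; /* read once, then reused for every lane */
    for (size_t i = 0; i < lanes; ++i) {
        c[i] += a[i] * scalar;
    }
    return 2 * lanes;
}
```

With lanes listed from index 0 upward, the values of Fig. 4.5.7 are a = (0, 1, 2, 3), b = (4, 5, 6, 7), and c = (8, 9, 10, 11). Index 2 selects 6.0f, so the model updates c to (8, 15, 22, 29).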
