.. _ch:neon:

****
Neon
****
Instruction set architectures have dedicated registers and instructions for vector and matrix processing.
In the case of AArch64, `Neon <https://developer.arm.com/Architectures/Neon>`__ is required in all AArch64-compliant processors and is also called Advanced SIMD (ASIMD).
It supports scalar floating-point operations and vector operations on vector registers with up to 128 bits.
Depending on the microarchitecture (see :numref:`sec:intro_hardware`), we can also use the Scalable Vector Extension (SVE) for vector processing or the Scalable Matrix Extensions (SME) for matrix processing.
The vector and matrix processing instructions are used in conjunction with the base instructions introduced in :numref:`ch:base_instructions`.
Here, the base instructions are used for address calculation, branching, and conditional code execution, and the vector and matrix instructions are used for the actual heavy lifting in terms of computation.
This chapter discusses Neon's vector instructions.
:numref:`ch:sve` then discusses SVE, and :numref:`ch:sme` moves on to SME.

.. note::

   Neon was `extended <https://web.archive.org/web/20250221175249/https://community.arm.com/arm-community-blogs/b/ai-blog/posts/bfloat16-processing-for-neural-networks-on-armv8_2d00_a>`__ with some matrix processing capabilities in 2019. In this book, we will not discuss these instructions and will instead limit our matrix processing discussions to SME. However, if you are interested in Neon's matrix instructions, the :a64_isa:`BFMMLA <SIMD-FP-Instructions/BFMMLA--BFloat16-floating-point-matrix-multiply-accumulate-into-2x2-matrix->` instruction is a good place to start.

Theoretically, we could build the entire tensor compiler using only Neon.
However, in most cases, using the more advanced SVE and SME, if available, will improve performance.
There are two ways to proceed:

1. Study Neon only, i.e. skip everything about SVE and SME and continue to the end of the book. Then, when everything works, go back and add SVE and SME support to the compiler.
2. Study Neon, SVE, and SME before continuing. Then write a unified tensor compiler that can generate code for all three options.

If you have the patience to stick with the ISA for a bit longer, and have access to SVE and SME hardware, the second option is recommended.
The rationale behind this recommendation is that you will have a better awareness of more powerful instructions when making design decisions in the code generator.

This section follows the structure of :numref:`ch:assembly_language` and :numref:`ch:base_instructions`.
That is, we introduce the SIMD and floating-point registers, discuss load and store instructions, and finally introduce data processing instructions.

.. _sec:neon_registers:

Registers
---------
Neon has thirty-two 128-bit *SIMD and floating-point registers* that are visible to the A64 instruction set.
The registers are architecturally named V0 to V31.

.. _fig:neon_registers:

.. figure:: ../data_neon/registers_neon.svg
   :width: 70%

   Illustration of the thirty-two 128-bit Neon registers V0-V31 visible to the A64 instruction set, the floating-point control register (FPCR), and the floating-point status register (FPSR).

As shown in :numref:`fig:neon_registers`, the registers can be accessed as:

* 8-bit registers: B0 to B31.
* 16-bit registers: H0 to H31.
* 32-bit registers: S0 to S31.
* 64-bit registers: D0 to D31.
* 128-bit registers: Q0 to Q31.

In addition, we can use the registers as vectors of elements, operating on either the full 128 bits or the lower 64 bits.
We will discuss this view of the registers in :numref:`sec:neon_arr_spec`.
Neon also has special-purpose registers, two of which are also shown in :numref:`fig:neon_registers`:

Floating-point Control Register (FPCR)
  Controls floating-point behavior. For example, we can set the rounding mode, enable default-NaN behavior, or enable/disable flushing of denormalized numbers to zero.
Floating-point Status Register (FPSR)
  Provides floating-point status information. For example, exception bits in the register are set when a division by zero or a saturation occurs.
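
As a brief sketch of how these registers are accessed, the following hedged example reads and modifies the FPCR through the system-register instructions ``MRS`` and ``MSR``. It assumes the RMode field occupies bits 23:22 of the FPCR, with the value 0b11 selecting round toward zero; the choice of ``X0`` as a scratch register is illustrative.

.. code-block:: gas

   // Read-modify-write of FPCR: set the rounding mode to round toward zero.
   // RMode is bits 23:22 of FPCR; 0b11 selects round toward zero.
   mrs x0, fpcr            // read the current FPCR into X0
   orr x0, x0, #0xC00000   // set bits 23:22 (RMode = 0b11)
   msr fpcr, x0            // write the modified value back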

.. _sec:neon_arr_spec:

Arrangement Specifiers
----------------------
Many Neon loads and stores, as well as the vector data-processing instructions, use arrangement specifiers.
An *arrangement specifier* is a suffix in the form ``.<N><T>`` used when referring to a register.
This suffix encodes the number of lanes and the lane width each instruction operates on.
Thus, arrangement specifiers determine how to partition a register's 64- or 128-bit view into lanes.

.. _tab:neon_arr_spec:

.. list-table:: Neon arrangement specifiers.
   :header-rows: 1

   * - Specifier
     - Vector Width (bits)
     - Number of Lanes
     - Lane Width (bits)
   * - ``.2D``
     - 128
     - 2
     - 64
   * - ``.4S``
     - 128
     - 4
     - 32
   * - ``.8H``
     - 128
     - 8
     - 16
   * - ``.16B``
     - 128
     - 16
     - 8
   * - ``.1D``
     - 64
     - 1
     - 64
   * - ``.2S``
     - 64
     - 2
     - 32
   * - ``.4H``
     - 64
     - 4
     - 16
   * - ``.8B``
     - 64
     - 8
     - 8

:numref:`tab:neon_arr_spec` shows the arrangement specifiers available in Neon.
In an instruction, we apply these specifiers to the vector registers introduced in :numref:`sec:neon_registers`.
For example, ``V17.4S`` means that the instruction treats register ``Q17`` as a vector containing four 32-bit values.
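
To make the effect of the specifiers concrete, the following hedged snippet applies three different arrangements to the same physical registers. The register choices are illustrative only.

.. code-block:: gas

   fadd v2.4s, v0.4s, v1.4s   // four 32-bit FP additions over the full 128 bits
   fadd v2.2s, v0.2s, v1.2s   // two 32-bit FP additions over the lower 64 bits
   fadd v2.8h, v0.8h, v1.8h   // eight 16-bit FP additions (requires FP16 support)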

Procedure Call Standard
-----------------------
The procedure call standard `defines <https://github.com/ARM-software/abi-aa/blob/main/aapcs64/aapcs64.rst#id52>`__ the role of the Neon registers in function calls. V0-V7 are used to pass arguments into a function and to return values.
Registers V8-V15 are callee-saved, while V16-V31 are caller-saved scratch registers.
Unlike the GPRs, we do not have to preserve the entire contents of the callee-saved Neon registers.
Instead, only the lower 64 bits for V8-V15 need to be preserved, i.e. the values in D8-D15.

.. _lst:neon_pcs:

.. literalinclude:: ../data_neon/pcs.s
  :language: gas
  :linenos:
  :caption: Example assembly program that sets the frame pointer register and temporarily stores registers X19-X30 and D8-D15 on the stack.

:numref:`lst:neon_pcs` shows an updated version of the template we originally introduced in :numref:`sec:assembly_pcs`.
Now, in addition to X19-X30, we temporarily store the contents of D8-D15 on the stack.
Of course, we can eliminate the corresponding stack transfers if the lower 64 bits of a register in V8-V15 are not modified.
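
Since the file behind the listing is not reproduced here, the following is a minimal sketch (not the exact listing) of how a function might save and restore D8-D15 using ``STP``/``LDP`` with pre- and post-indexed addressing. The label ``my_function`` is hypothetical.

.. code-block:: gas

   my_function:
       // prologue: save the callee-saved SIMD&FP registers
       // (preserving the lower 64 bits, i.e. D8-D15, suffices)
       stp d8,  d9,  [sp, #-64]!
       stp d10, d11, [sp, #16]
       stp d12, d13, [sp, #32]
       stp d14, d15, [sp, #48]

       // ... the function body may now freely use V8-V15 ...

       // epilogue: restore D8-D15 and return
       ldp d10, d11, [sp, #16]
       ldp d12, d13, [sp, #32]
       ldp d14, d15, [sp, #48]
       ldp d8,  d9,  [sp], #64
       ret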

Loads and Stores
----------------
As with the base instructions, a group of instructions allows us to transfer data between memory and the SIMD&FP registers.

The :a64_isa:`LDR (immediate, SIMD&FP) <SIMD-FP-Instructions/LDR--immediate--SIMD-FP---Load-SIMD-FP-register--immediate-offset-->` and :a64_isa:`STR (immediate, SIMD&FP) <SIMD-FP-Instructions/STR--immediate--SIMD-FP---Store-SIMD-FP-register--immediate-offset-->` instructions work similarly to LDR (immediate) and STR (immediate) of the base instructions discussed.
However, we can now use the B, H, S, D, or Q view on the SIMD&FP registers.
Similarly, :a64_isa:`LDP (SIMD&FP) <SIMD-FP-Instructions/LDP--SIMD-FP---Load-pair-of-SIMD-FP-registers->` and :a64_isa:`STP (SIMD&FP) <SIMD-FP-Instructions/STP--SIMD-FP---Store-pair-of-SIMD-FP-registers->` allow us to transfer data between memory and two SIMD&FP registers.
We give high-level descriptions for some example instructions:

``ldr d5, [x0]``
  Load 64 bits (double word) from memory into register ``D5``. In memory, the data is located at the 64-bit address held in register ``X0``.
``ldr q1, [x3]``
  Load 128 bits (quad word) from memory into register ``Q1``. In memory, the data is located at the 64-bit address held in register ``X3``.
``str h1, [x3, #32]``
  Store 16 bits (half word) from register ``H1`` into memory. The memory address is calculated by adding offset 32 to the value in register ``X3``.
``ldp q3, q8, [x2]``
  Load 2x128 bits from memory into registers ``Q3`` and ``Q8``. In memory, the data is at the 64-bit address held in register ``X2``.

A particularly interesting pair of load and store instructions in Neon are :a64_isa:`LD1 (multiple structures) <SIMD-FP-Instructions/LD1--multiple-structures---Load-multiple-single-element-structures-to-one--two--three--or-four-registers->` and :a64_isa:`ST1 (multiple structures) <SIMD-FP-Instructions/ST1--multiple-structures---Store-multiple-single-element-structures-from-one--two--three--or-four-registers->`.
LD1 (multiple structures) allows us to load data from memory into up to four consecutive SIMD&FP registers, while ST1 (multiple structures) allows us to store data from up to four consecutive registers into memory.
The term "consecutive" means that if the first register has the ID ``Vt``, then the following registers must have the IDs ``(Vt+1)%32``, ``(Vt+2)%32``, and ``(Vt+3)%32``.
Again, we provide high-level descriptions for some examples:

``ld1 {v0.4s, v1.4s, v2.4s, v3.4s}, [x0]``
  Load 4x4x32 bits (512 bits total) from memory into registers ``V0``, ``V1``, ``V2`` and ``V3``.
  In memory, the data is located at the 64-bit address held in register ``X0``.
``st1 {v31.2d, v0.2d, v1.2d, v2.2d}, [x3], #64``
  Store 4x2x64 bits (512 bits total) from registers ``V31``, ``V0``, ``V1``, and ``V2`` into memory. The memory address is held in register ``X3``. In addition, the value of register ``X3`` is incremented by 64, the number of bytes stored.
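
Combined with post-indexed addressing, LD1 and ST1 enable compact copy loops. The following hedged sketch copies blocks of 64 bytes from the address in ``X0`` to the address in ``X1``, assuming ``X2`` holds the number of blocks; the label and all register choices are illustrative.

.. code-block:: gas

   copy_blocks:
       ld1 {v0.4s, v1.4s, v2.4s, v3.4s}, [x0], #64  // load 64 bytes, advance X0
       st1 {v0.4s, v1.4s, v2.4s, v3.4s}, [x1], #64  // store 64 bytes, advance X1
       subs x2, x2, #1                              // one block done
       b.gt copy_blocks                             // repeat while blocks remain
       ret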

.. _sec:neon_data_proc_inst:

Data Processing Instructions
----------------------------
This section covers two groups of SIMD&FP data-processing instructions heavily used in our tensor kernels.
*Rearrangement instructions* read source vector registers, rearrange their elements, and write the result to a destination register.
*Floating-point multiply-accumulate instructions* multiply floating-point operands and accumulate the result in a destination register.

Rearrangement Instructions
^^^^^^^^^^^^^^^^^^^^^^^^^^
The following four instructions are frequently used to transpose data held in vector registers:

:a64_isa:`TRN1 <SIMD-FP-Instructions/TRN1--Transpose-vectors--primary-->`
  Interleaves the even-indexed elements of the two source vectors.
:a64_isa:`TRN2 <SIMD-FP-Instructions/TRN2--Transpose-vectors--secondary-->`
  Interleaves the odd-indexed elements of the two source vectors.
:a64_isa:`ZIP1 <SIMD-FP-Instructions/ZIP1--Zip-vectors--primary-->`
  Interleaves the elements of the lower halves of the two source vectors.
:a64_isa:`ZIP2 <SIMD-FP-Instructions/ZIP2--Zip-vectors--secondary-->`
  Interleaves the elements of the upper halves of the two source vectors.
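
As a hedged illustration of these instructions, the following sequence operates on two four-element FP32 vectors; the register contents shown in the comments are illustrative, with element 0 denoting the lowest lane.

.. code-block:: gas

   // Assume (illustratively) V0 = [a0, a1, a2, a3] and V1 = [b0, b1, b2, b3].
   trn1 v2.4s, v0.4s, v1.4s   // V2 = [a0, b0, a2, b2] (even-indexed elements)
   trn2 v3.4s, v0.4s, v1.4s   // V3 = [a1, b1, a3, b3] (odd-indexed elements)
   zip1 v4.4s, v0.4s, v1.4s   // V4 = [a0, b0, a1, b1] (lower halves interleaved)
   zip2 v5.4s, v0.4s, v1.4s   // V5 = [a2, b2, a3, b3] (upper halves interleaved)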

FMADD
^^^^^
:a64_isa:`FMADD <SIMD-FP-Instructions/FMADD--Floating-point-fused-multiply-add--scalar-->` is a scalar floating-point fused multiply-add (FMA) instruction and exists in `FP16 <https://en.wikipedia.org/wiki/Half-precision_floating-point_format>`__, `FP32 <https://en.wikipedia.org/wiki/Single-precision_floating-point_format>`__, and `FP64 <https://en.wikipedia.org/wiki/Double-precision_floating-point_format>`__ variants.
The three variants have the following encodings:

FP16 variant
  ``FMADD <Hd>, <Hn>, <Hm>, <Ha>``
FP32 variant
  ``FMADD <Sd>, <Sn>, <Sm>, <Sa>``
FP64 variant
  ``FMADD <Dd>, <Dn>, <Dm>, <Da>``

The variant of the instruction is encoded in the two-bit ``ftype`` field, while the IDs of the registers are encoded in the five-bit fields ``Rn``, ``Rm``, ``Ra``, and ``Rd``.
The ID of the source register holding the multiplicand is encoded in the ``Rn`` field, the ID of the register holding the multiplier in the ``Rm`` field, and the ID of the register holding the addend in the ``Ra`` field.
The ID of the destination register is encoded in the ``Rd`` field.
``FMADD`` multiplies the two values in registers with IDs ``Rn`` and ``Rm``, adds the result to the value in ``Ra`` and writes the result to ``Rd``.

We discuss two ``FMADD`` examples:

``fmadd h0, h1, h2, h3``
  Multiply the two FP16 values in ``H1`` and ``H2``, add the product to the FP16 value in ``H3`` and write the result to ``H0``.

``fmadd s2, s4, s7, s3``
  Multiply the two FP32 values in ``S4`` and ``S7``, add the product to the value in ``S3`` and write the result to ``S2``.

FMLA (vector)
^^^^^^^^^^^^^
:a64_isa:`FMLA (vector) <SIMD-FP-Instructions/FMLA--vector---Floating-point-fused-multiply-add-to-accumulator--vector-->` is our second FMA instruction and first vector instruction.
Instead of operating on scalars, the instruction operates on vectors with up to eight elements.
``FMLA (vector)`` exists in FP16, FP32, and FP64 variants.
The instruction has two encodings: one for FP16 and one shared by the size-specific FP32 and FP64 variants.
The mnemonic for all variants is ``FMLA <Vd>.<T>, <Vn>.<T>, <Vm>.<T>``, where the register IDs are encoded in the ``Rn``, ``Rm`` and ``Rd`` fields.
``<T>`` is an arrangement specifier, introduced in :numref:`sec:neon_arr_spec`.
``FMLA (vector)`` multiplies the elements from the two source registers element-wise, and then adds the result to the vector in the destination register and writes the final vector back to the destination register.
This makes it a *destructive instruction*: it reads the addend vector from the destination register first, then overwrites it with the FMA result.

We discuss two FP32 examples using different arrangement specifiers:

``fmla v2.4s, v4.4s, v7.4s``
  Multiply the two four-element FP32 vectors in ``V4`` and ``V7``, add the product to the vector in ``V2`` and write the result to ``V2``. This instruction operates on the full 128 bits of the involved SIMD&FP registers. In total, the instruction performs eight floating-point operations.
``fmla v2.2s, v4.2s, v7.2s``
  Multiply the two two-element FP32 vectors in ``V4`` and ``V7``, add the product to the vector in ``V2`` and write the result to ``V2``. This instruction operates on the lower 64 bits of the involved SIMD&FP registers. In total, the instruction performs four floating-point operations.

FMLA (by element)
^^^^^^^^^^^^^^^^^
Our last FMA instruction is :a64_isa:`FMLA (by element) <SIMD-FP-Instructions/FMLA--by-element---Floating-point-fused-multiply-add-to-accumulator--by-element-->`.
It has both scalar and vector forms.
The vector forms are similar to ``FMLA (vector)``, but they use a scalar multiplier taken from a lane of a SIMD&FP register.

The following examples illustrate the vector variants of this instruction:

``fmla v17.8h, v2.8h, v8.h[6]``
  Multiply the eight-element FP16 vector in ``V2`` with the seventh FP16 element of register ``V8``, add the product to the vector in ``V17`` and write the result to ``V17``. In total, the instruction performs sixteen floating-point operations.
``fmla v12.4s, v30.4s, v5.s[0]``
  Multiply the four-element FP32 vector in ``V30`` with the first FP32 element of register ``V5``, add the product to the vector in ``V12`` and write the result to ``V12``. In total, the instruction performs eight floating-point operations.
``fmla v0.2d, v0.2d, v0.d[1]``
  Multiply the two-element FP64 vector in ``V0`` with the second FP64 element of register ``V0``, add the product to the vector in ``V0`` and write the result to ``V0``. In total, the instruction performs four floating-point operations.
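
The by-element form is convenient in small matrix kernels, where a column of one operand multiplies a broadcast scalar of the other. The following hedged sketch accumulates one 4x4 FP32 block: it assumes ``V0`` holds four elements of a column of A, ``V4`` holds four elements of a row of B, and ``V16``-``V19`` accumulate the columns of C; all register assignments are illustrative.

.. code-block:: gas

   // C[:,j] += A[:,k] * B[k,j] for one k and j = 0..3
   fmla v16.4s, v0.4s, v4.s[0]   // column 0 of the 4x4 accumulator
   fmla v17.4s, v0.4s, v4.s[1]   // column 1
   fmla v18.4s, v0.4s, v4.s[2]   // column 2
   fmla v19.4s, v0.4s, v4.s[3]   // column 3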