.. _ch:base_instructions:

*****************
Base Instructions
*****************

The *A64 base instructions* form the basis of the `A64 instruction set <https://developer.arm.com/documentation/ddi0487/>`__.
They are further subdivided into the following instruction groups:

* Loads and stores associated with general-purpose registers.
* Data processing (immediate).
* Data processing (register).
* Branches, exception generation, and system instructions.

We cover each of the four instruction groups in a separate section by discussing some important examples and related concepts.
The details of all base instructions are available in the :a64_isa:`Arm A-profile A64 Instruction Set Architecture <>`.

.. _sec:base_loads_stores:

Loads and Stores
================
The first group of instructions transfers data between memory and the general-purpose registers.
An instruction that transfers data from memory to a register is called a *load*.
Conversely, an instruction that transfers data from a register to memory is called a *store*.

LDR (immediate)
---------------
We start by discussing the :a64_isa:`LDR (immediate) <Base-Instructions/LDR--immediate---Load-register--immediate-->` instruction in detail.
It transfers data from memory into a register.
As its name suggests, we can use the instruction in assembly code through the mnemonic ``ldr``.
The instruction has encodings for different addressing modes ("post-index", "pre-index", and "unsigned offset").
Specifically, the :a64_isa:`unsigned offset <Base-Instructions/LDR--immediate---Load-register--immediate--?lang=en#iclass_unsigned_offset>` class has a 32-bit variant with the encoding ``LDR <Wt>, [<Xn|SP>{, #<pimm>}]`` and a 64-bit variant with the encoding ``LDR <Xt>, [<Xn|SP>{, #<pimm>}]``.

The A64 ISA describes the assembler symbols as follows:

  ``<Wt>``
    Is the 32-bit name of the general-purpose register to be transferred, encoded in the "Rt" field.
  ``<Xn|SP>``
    Is the 64-bit name of the general-purpose base register or stack pointer, encoded in the "Rn" field.
  ``<Xt>``
    Is the 64-bit name of the general-purpose register to be transferred, encoded in the "Rt" field.
  ``<pimm>``
    For the "32-bit" variant: is the optional positive immediate byte offset, a multiple of 4 in the range 0 to 16380, defaulting to 0 and encoded in the "imm12" field as <pimm>/4.
    For the "64-bit" variant: is the optional positive immediate byte offset, a multiple of 8 in the range 0 to 32760, defaulting to 0 and encoded in the "imm12" field as <pimm>/8.

  -- Arm Limited: :a64_isa:`Arm A-profile A64 Instruction Set Architecture <Base-Instructions/LDR--immediate---Load-register--immediate--?lang=en#iclass_unsigned_offset>`.

This information can be difficult to parse for those new to assembly language programming, but it becomes clearer when examining examples.
Suppose we use the following LDR (immediate) instructions in an assembly program; each of them can be given a high-level description:

``ldr w5, [x0]``
  Load 32 bits (word) from memory into register ``W5``. In memory, the data is located at the 64-bit address held in register ``X0``.

``ldr x1, [x3]``
  Load 64 bits (double word) from memory into register ``X1``. In memory, the data is located at the 64-bit address held in register ``X3``.

``ldr x1, [x3, #32]``
  Load 64 bits (double word) from memory into register ``X1``. In memory, the data is located at the 64-bit address obtained by adding ``32`` to the value in register ``X3``.

.. admonition:: Examples

  Knowing the procedure call standard (see :numref:`sec:assembly_pcs`) and load instructions, we can write our first simple functions in assembly language.

  .. _lst:base_load_store_0_h:

  .. literalinclude:: ../data_base/load_store.cpp
    :language: c++
    :lines: 5-5
    :dedent:
    :caption: Syntax of the function ``load_store_0``.

  Suppose we want to implement a function ``load_store_0`` with the syntax given in :numref:`lst:base_load_store_0_h`.
  The function should take a single address as a parameter and load 32 bits of data from that address.
  The return value should be the 32 bits loaded.


  .. _lst:base_load_store_0_s:

  .. literalinclude:: ../data_base/load_store_0.s
    :language: gas
    :caption: Implementation of the function ``load_store_0``, which loads 32 bits.

  An example implementation of the ``load_store_0`` function is shown in :numref:`lst:base_load_store_0_s`.
  According to the PCS, the 64-bit address is passed to the function through register ``X0``.
  The address is then used in the ``ldr w0, [x0]`` instruction, which loads 32 bits from this address into ``W0``.
  Since ``W0`` is simply a view of ``R0``, the data in ``X0`` is overwritten.
  That is, the instruction replaces the lower 32 bits with the loaded data and sets the upper 32 bits to zero.
  The ``ret`` instruction jumps back to the address in the link register.
  We will discuss the details of RET and other branch instructions in :numref:`sec:base_branches`.
  According to the PCS, the return value of the function is expected by the caller to be in register ``W0``.

  .. _lst:base_load_store_1_h:

  .. literalinclude:: ../data_base/load_store.cpp
    :language: c++
    :lines: 6-6
    :dedent:
    :caption: Syntax of the function ``load_store_1``.

  Now suppose we want to implement the function ``load_store_1`` with the syntax shown in :numref:`lst:base_load_store_1_h`.
  While the syntax is the same as ``load_store_0``, the goal is to load data from the address with an additional offset of 16 bytes.

  .. _lst:base_load_store_1_s:

  .. literalinclude:: ../data_base/load_store_1.s
    :language: gas
    :caption: Implementation of the function ``load_store_1``, which loads 32 bits.

  :numref:`lst:base_load_store_1_s` shows an example implementation of the ``load_store_1`` function.
  This time, the instruction ``ldr w0, [x0, #16]`` loads data from the address obtained by adding ``16`` to the value in ``X0``.
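
The behavior of the two loads can also be mimicked in C++.
The following sketch is only a model under stated assumptions: the helper names ``model_ldr_w`` and ``model_ldr_x`` are hypothetical and do not appear in the included files, and memory is accessed through an ordinary pointer plus a byte offset.

.. code-block:: c++

  #include <cstdint>
  #include <cstring>

  // Hypothetical model of `ldr wT, [xN, #offset]`: read 32 bits at base + offset.
  // The returned 64-bit value mirrors the X view of the target register:
  // the loaded word sits in the lower half, the upper 32 bits are zero.
  inline std::uint64_t model_ldr_w(void const* base, std::uint64_t offset) {
    std::uint32_t word = 0;
    std::memcpy(&word, static_cast<unsigned char const*>(base) + offset, sizeof(word));
    return word;  // implicit zero extension to 64 bits
  }

  // Hypothetical model of `ldr xT, [xN, #offset]`: read 64 bits at base + offset.
  inline std::uint64_t model_ldr_x(void const* base, std::uint64_t offset) {
    std::uint64_t dword = 0;
    std::memcpy(&dword, static_cast<unsigned char const*>(base) + offset, sizeof(dword));
    return dword;
  }

Given an array ``std::uint32_t words[8]``, the call ``model_ldr_w(words, 16)`` returns the fifth element, which mirrors the 16-byte offset used by ``ldr w0, [x0, #16]``.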

STR (immediate)
---------------
Store instructions transfer data from registers to memory.
:a64_isa:`STR (immediate) <Base-Instructions/STR--immediate---Store-register--immediate-->` is an example of a store instruction.
Analogous to LDR (immediate), it has encodings for different addressing modes.
The :a64_isa:`unsigned offset <Base-Instructions/STR--immediate---Store-register--immediate--?lang=en#iclass_unsigned_offset>` class of STR (immediate) also has a 32-bit variant and a 64-bit variant.
The encodings of the two variants are given as ``STR <Wt>, [<Xn|SP>{, #<pimm>}]`` and ``STR <Xt>, [<Xn|SP>{, #<pimm>}]``.
The assembler symbols are analogous to those of LDR (immediate).

Again, we formulate high-level descriptions for some examples:

``str w5, [x0]``
  Store 32 bits (word) from register ``W5`` into memory. The data is written into memory at the 64-bit address held in register ``X0``.

``str x1, [x3]``
  Store 64 bits (double word) from register ``X1`` into memory. The data is written into memory at the 64-bit address held in register ``X3``.

``str x1, [x3, #32]``
  Store 64 bits (double word) from register ``X1`` into memory. The memory address is calculated by adding offset ``32`` to the value in register ``X3``.

.. admonition:: Example

  We can write an example function that uses STR.
  This time we use both LDR and STR to copy 64 bits of data.

  .. _lst:base_load_store_2_h:

  .. literalinclude:: ../data_base/load_store.cpp
    :language: c++
    :lines: 7-8
    :dedent:
    :caption: Syntax of the function ``load_store_2``.

  The syntax of the example function ``load_store_2`` is shown in :numref:`lst:base_load_store_2_h`.
  The function takes two 64-bit pointers to two 64-bit unsigned integers as parameters and has no return value.

  .. _lst:base_load_store_2_s:

  .. literalinclude:: ../data_base/load_store_2.s
    :language: gas
    :caption: Implementation of the ``load_store_2`` function, which copies 64 bits from one memory location to another.

  :numref:`lst:base_load_store_2_s` shows an example implementation of the function.
  According to the PCS, the two 64-bit parameters are passed through the general-purpose registers ``X0`` and ``X1``.
  The first instruction ``ldr x0, [x0]`` loads 64 bits of data from the address held in register ``X0`` and overwrites the register with the loaded data.
  The second instruction ``str x0, [x1]`` stores the data in ``X0`` into memory at the address held in register ``X1``.
  Both instructions together copy 64 bits of data from one memory location to another.
  The last instruction ``ret`` jumps back to the address held in the link register (``X30``).
  

Load/Store Pair
---------------
There are also some instructions that can load data from memory to multiple registers at once, or store data from multiple registers into memory.
:a64_isa:`LDP <Base-Instructions/LDP--Load-pair-of-registers->` and :a64_isa:`STP <Base-Instructions/STP--Store-pair-of-registers->` are two such instructions; each operates on a pair of general-purpose registers.
The 64-bit LDP variant in signed offset addressing mode has the encoding ``LDP <Xt1>, <Xt2>, [<Xn|SP>{, #<imm>}]``, while the 64-bit variant of STP in signed offset addressing mode has the encoding ``STP <Xt1>, <Xt2>, [<Xn|SP>{, #<imm>}]``.
In their 64-bit variants, both instructions load or store a total of 128 bits of data.

As before, we formulate high-level descriptions of some examples and then show an example application:

``ldp x5, x7, [x2]``
  Load 2x64 bits from memory into registers ``X5`` and ``X7``. In memory, the data is located at the 64-bit address held in register ``X2``.

``stp w2, w5, [x3]``
  Store 2x32 bits from registers ``W2`` and ``W5`` into memory. The data is written into memory at the 64-bit address held in register ``X3``.


.. admonition:: Example

  In this example, we use LDP and STP to copy two 64-bit integers from one memory location to another.

  .. _lst:base_load_store_3_h:

  .. literalinclude:: ../data_base/load_store.cpp
    :language: c++
    :lines: 9-10
    :dedent:
    :caption: Syntax of the function ``load_store_3``.

  The syntax of such a function could be similar to ``load_store_3`` shown in :numref:`lst:base_load_store_3_h`.
  The function takes two 64-bit pointers to two arrays of signed 64-bit integers as parameters and has no return value.

  .. _lst:base_load_store_3_s:

  .. literalinclude:: ../data_base/load_store_3.s
    :language: gas
    :caption: Implementation of the ``load_store_3`` function, which copies 128 bits from one memory location to another.

  An example implementation is shown in :numref:`lst:base_load_store_3_s`.
  The implementation consists of three instructions.
  First, ``ldp x2, x3, [x0]`` loads 2x64 bits from memory into registers ``X2`` and ``X3`` using the address in parameter register ``X0``.
  Second, ``stp x2, x3, [x1]`` stores 2x64 bits from registers ``X2`` and ``X3`` to memory using the address in parameter register ``X1``.
  Finally, ``ret`` branches to the program address in the link register (``X30``).

.. _sec:base_machine_code:

Machine Code
============
In :numref:`sec:assembly_assembler` we have already seen that we can use the assembler to translate assembly code into instruction words.
We use the term *instruction word* to refer to the full 32-bit encoding of a single A64 instruction (sometimes also called the *instruction encoding*).
However, we have only briefly looked at the instruction ``stp x29, x30, [sp, -16]!`` and its instruction word ``0b10101001101111110111101111111101``.
Now, with our knowledge of the structure of LDR (immediate), we can easily see the structure of the instruction word.

The following table shows the assembly code and corresponding instruction words of LDR (immediate), unsigned offset instructions:

.. list-table::
  :header-rows: 3
  :stub-columns: 1

  * - Bit IDs
    - 31-22
    - 21-10
    - 9-5
    - 4-0
  * - Field
    -
    - imm12
    - Rn
    - Rt
  * - Pattern
    - ``1s11100101``
    - ``iiiiiiiiiiii``
    - ``nnnnn``
    - ``ttttt``
  * - ``ldr w0, [x0]``
    - ``1011100101``
    - ``000000000000``
    - ``00000``
    - ``00000``
  * - ``ldr w1, [x0]``
    - ``1011100101``
    - ``000000000000``
    - ``00000``
    - ``00001``
  * - ``ldr w5, [x0]``
    - ``1011100101``
    - ``000000000000``
    - ``00000``
    - ``00101``
  * - ``ldr x1, [x3]``
    - ``1111100101``
    - ``000000000000``
    - ``00011``
    - ``00001``
  * - ``ldr x1, [x3, #32]``
    - ``1111100101``
    - ``000000000100``
    - ``00011``
    - ``00001``
  * - ``ldr x1, [x3, #4088]``
    - ``1111100101``
    - ``000111111111``
    - ``00011``
    - ``00001``
  * - ``ldr w1, [x3, #2044]``
    - ``1011100101``
    - ``000111111111``
    - ``00011``
    - ``00001``
  * - ``ldr x0, [x0, #32760]``
    - ``1111100101``
    - ``111111111111``
    - ``00000``
    - ``00000``
  * - ``.word 0xb9400000``
    - ``1011100101``
    - ``000000000000``
    - ``00000``
    - ``00000``
  * - ``.word 0xb9400001``
    - ``1011100101``
    - ``000000000000``
    - ``00000``
    - ``00001``
  * - ``.word 0xf97ffc00``
    - ``1111100101``
    - ``111111111111``
    - ``00000``
    - ``00000``

The first row shows the IDs of the bits, where the most significant instruction bit is given by the bit with ID 31 and the least significant one by the bit with ID 0.
The second row shows the location of the three fields ``imm12``, ``Rn``, and ``Rt`` as described in the A64 ISA.
The immediate ``imm12`` has a total of 12 bits and encodes the offset.
Bit 30 (the *size* bit) determines whether this is the 32-bit (``0``) or 64-bit (``1``) variant.
The base register and destination register IDs are encoded in fields ``Rn`` (bits 9-5) and ``Rt`` (bits 4-0) respectively.
The third row shows a short form of the instruction word pattern.
The nine bits with IDs 22-29 and ID 31 are fixed; all other bits depend on how the instruction is used.
We go through the examples one by one:

``ldr w0, [x0]``
  General-purpose register ``W0`` has ID 0, encoded in field ``Rt``, so we have the value ``00000`` in bits 4-0.
  Register ``X0`` also has the ID 0, encoded in the ``Rn`` field.
  We have not specified an offset. According to the A64 ISA, the offset is set to 0 by default and we get the value ``000000000000`` in bits 21-10.
  We are loading the data into a W register, so we are using the 32-bit variant of the instruction.
  This information is stored in the size bit ``s``, which has bit ID 30 and is ``0`` in this case.
``ldr w1, [x0]``
  Compared to the previous example, only the ID of the 32-bit W register has changed.
  Therefore, we now have the value ``00001`` in bits 4-0.
``ldr w5, [x0]``
  Again, only the ID of the W register has changed and we get ``00101`` for bits 4-0.
``ldr x1, [x3]``
  The IDs of both registers changed. Thus, we get ``00001`` for ``Rt`` and ``00011`` for ``Rn``. Additionally, we are now loading 64 bits into register ``X1``. This means that the ``s`` bit with ID 30 is set to ``1``.
``ldr x1, [x3, #32]``
  Compared to the previous instruction, we now load from a different location in memory. Specifically, the address is given by the value in register ``X3`` with an added offset of 32 bytes. The offset is encoded in the field ``imm12`` as ``000000000100``. Note that for the 64-bit variant, the offset is specified in eight-byte increments. In other words, :math:`100_2 = 4_{10}`, i.e., we obtain the effective offset as :math:`4 \cdot 8=32`.
``ldr x1, [x3, #4088]``
  Only the offset has changed, which is now encoded as ``000111111111`` in the ``imm12`` field: :math:`1 1111 1111_2 = 511_{10}` and :math:`511 \cdot 8=4088`.
``ldr w1, [x3, #2044]``
  This is an interesting example because, at first glance, two properties change. First, we are now loading 32 bits instead of 64 bits. Second, we have an offset of 2044 bytes instead of 4088 bytes. However, the offset of the 32-bit variant of LDR (immediate) is specified in 4-byte increments. This means that the numeric value in the field ``imm12`` is identical to the one before: ``000111111111``. Comparing the instruction word of this instruction to the previous one, only the ``s`` bit with ID 30 changes from ``1`` to ``0``.
``ldr x0, [x0, #32760]``
  In this case, we load 64 bits from memory into register ``X0``. The data is loaded from the address that we get by adding the offset 32760 to the value in register ``X0``. The immediate is now ``111111111111``, i.e., 32760 is the largest offset that can be encoded in the ``imm12`` field of the 64-bit variant of the instruction.
``.word 0xb9400000``
  We can also write the instruction word of an instruction directly. In this case, ``0xb9400000`` is simply the hexadecimal representation of the 32 instruction bits of ``ldr w0, [x0]``.
``.word 0xb9400001``
  Hexadecimal representation of the instruction ``ldr w1, [x0]``.
``.word 0xf97ffc00``
  Hexadecimal representation of the instruction ``ldr x0, [x0, #32760]``.
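
The bit layout in the table can be turned into a small encoder.
The following C++ sketch assembles the instruction word of LDR (immediate), unsigned offset, from its fields; the function name ``encode_ldr_unsigned_offset`` and its parameters are our own illustrative choices, not part of an existing library.

.. code-block:: c++

  #include <cassert>
  #include <cstdint>

  // Sketch of an encoder for LDR (immediate), unsigned offset.
  // Bit layout as in the table above: 1 s 11100101 | imm12 | Rn | Rt.
  inline std::uint32_t encode_ldr_unsigned_offset(bool is_64bit,
                                                  std::uint32_t rt,
                                                  std::uint32_t rn,
                                                  std::uint32_t byte_offset) {
    std::uint32_t const scale = is_64bit ? 8 : 4;  // bytes per imm12 step
    assert(rt < 32 && rn < 32);                    // 5-bit register IDs
    assert(byte_offset % scale == 0);              // multiple of the access size
    assert(byte_offset / scale < 4096);            // imm12 holds 12 bits

    std::uint32_t word = 0xb9400000u;              // fixed bits, everything else zero
    word |= (is_64bit ? 1u : 0u) << 30;            // size bit s
    word |= (byte_offset / scale) << 10;           // imm12 in bits 21-10
    word |= rn << 5;                               // Rn in bits 9-5
    word |= rt;                                    // Rt in bits 4-0
    return word;
  }

For instance, ``encode_ldr_unsigned_offset(true, 1, 3, 32)`` yields ``0xf9401061``, which is the instruction word of ``ldr x1, [x3, #32]`` from the table written in hexadecimal.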

.. note::
   Since we have powerful assemblers that can translate assembly code into machine code, knowledge of the structure and use of instruction words may seem unnecessary.  However, there are situations where this knowledge comes in handy, for example:

   * Assemblers may not support all available instructions. Apple's proprietary `AMX instructions <https://web.archive.org/web/20250221093414/https://gist.github.com/dougallj/7a75a3be1ec69ca550e7c36dc75e0d6f>`__ are such a case, where the matrix accelerator on the M1, M2, and M3 series is only accessible by writing machine code directly. Only with the introduction of the M4 in 2024 did Apple `start <https://web.archive.org/web/20240710081224/https://github.com/llvm/llvm-project/pull/95478/commits/1461be872bf26e2e0f2572f688a45af795421432>`__ to support the standardized Scalable Matrix Extension (`SME <https://web.archive.org/web/20240710075527/https://community.arm.com/arm-community-blogs/b/architectures-and-processors-blog/posts/scalable-matrix-extension-armv9-a-architecture>`__).

   * One might want to bypass the overhead of translating ASCII assembly code into machine code. This is the case in situations where machine code is generated at runtime and written directly to memory to be executed. We will discuss just-in-time machine code generation as a core building block for our tensor compiler in :numref:`ch:code_generation`.

Addressing Modes
================
The A64 instruction set has several *addressing modes* for loading and storing data.
In general, load and store instructions use a 64-bit base address along with an optional offset.
The base address is held in one of the 31 general-purpose registers, the stack pointer, or the program counter.
The offset can be encoded directly as an immediate in the 32 instruction bits or held in an offset register.

Following C1.3.3 of the :a_ref_man:`Arm Architecture Reference Manual for A-profile architecture <>`, the most important addressing modes can be summarized as follows:

Base register only: ``[base(, #0)]``
  In the simplest case, we use only the base register to determine the memory address.

  .. _lst:base_addressing_modes_0_s:

  .. literalinclude:: ../data_base/addressing_modes_0.s
    :language: gas
    :caption: Implementation of the ``addressing_modes_0`` function, which copies 32 bits from one memory location to another.

  The example function in :numref:`lst:base_addressing_modes_0_s` shows the base register only addressing mode.
  Specifically, the function copies 32 bits from the memory address in ``X0`` to the memory address in ``X1``.
  The immediate offsets of load ``ldr w2, [x0, #0]`` and store ``str w2, [x1, #0]`` are explicitly set to zero.
  Alternatively, we could write ``ldr w2, [x0]`` for the load and ``str w2, [x1]`` for the store.
  Note that both forms result in the same instruction word because the offset is set to zero by default in the unsigned offset class of :a64_isa:`LDR (immediate) <Base-Instructions/LDR--immediate---Load-register--immediate--?lang=en#iclass_unsigned_offset>` and :a64_isa:`STR (immediate) <Base-Instructions/STR--immediate---Store-register--immediate--?lang=en#iclass_unsigned_offset>`.

Base plus immediate offset: ``[base(, #imm)]``
  Another addressing mode is given by a base address together with an immediate offset.
  In this case, the effective memory address is calculated by adding the immediate offset to the base address.
  Since the immediate is limited in size, the number of possible immediate offsets is also limited.
  In the case of the unsigned offset encoding of :a64_isa:`LDR (immediate) <Base-Instructions/LDR--immediate---Load-register--immediate-->`, the immediate field has a size of 12 bits.
  For the 32-bit variant, the immediate is scaled by 4, allowing offsets from 0 to 16380 in multiples of 4.

  .. _lst:base_addressing_modes_1_s:

  .. literalinclude:: ../data_base/addressing_modes_1.s
    :language: gas
    :caption: Implementation of the ``addressing_modes_1`` function, which copies 32 bits from one memory location to another.

  The example function in :numref:`lst:base_addressing_modes_1_s` shows the base plus immediate offset addressing mode.
  Unlike before, the effective address for the 32-bit load ``ldr w2, [x0, #8]`` is determined by adding 8 to the value in register ``X0``.
  The effective address for the 32-bit store ``str w2, [x1, #12]`` is determined by adding 12 to the value in register ``X1``.

Base plus register offset: ``[base, Xm(, LSL #imm)]``
  In addition to immediate offsets, we can also provide the offset through an additional register.
  In this case, the offset can be any 64-bit value.
  We can also encode an additional left shift of the offset as part of the instruction.

  .. _lst:base_addressing_modes_2_s:

  .. literalinclude:: ../data_base/addressing_modes_2.s
    :language: gas
    :caption: Implementation of the ``addressing_modes_2`` function, which copies 32 bits from one memory location to another.

  The example function in :numref:`lst:base_addressing_modes_2_s` illustrates the base plus register offset addressing mode.
  The ``ldr w4, [x0, x1]`` instruction uses :a64_isa:`LDR (register) <Base-Instructions/LDR--register---Load-register--register-->`.
  This means that the effective memory address is calculated by adding the 64-bit value in ``X0`` to the value in ``X1``.
  The effective address for the store ``str w4, [x2, x3]`` is given by adding the values in ``X2`` and ``X3``.

Pre-indexed: ``[base, #imm]!``
  So far we have only discussed addressing modes that leave the register holding the base address untouched.
  In contrast, pre-indexed instructions calculate the effective address by adding the immediate offset to the base address and also write this address back to the register that held the base address.

  .. _lst:base_addressing_modes_3_s:

  .. literalinclude:: ../data_base/addressing_modes_3.s
    :language: gas
    :caption: Implementation of the ``addressing_modes_3`` function, which copies 2x32 bits from one memory location to another.

  :numref:`lst:base_addressing_modes_3_s` illustrates pre-indexed loads and stores.
  The pre-indexed load ``ldr w2, [x0, #4]!`` calculates the effective address by adding 4 to the value in register ``X0``.
  The address is not only used for the memory access, but is also written back to register ``X0``.
  So, if we interpret the original value in ``X0`` as a C pointer to an array of 32-bit data, the instruction first increments the pointer by one, which is a four-byte increment in pointer arithmetic, and then loads the value it now points to, i.e., the second value of the array.
  Next, the instruction ``str w2, [x1, #4]!`` stores the 32 bits in ``W2`` at the address obtained by adding 4 to the value in ``X1``.
  It also writes the value incremented by 4 back to ``X1``.
  The following two instructions ``ldr w2, [x0, #4]!`` and ``str w2, [x1, #4]!`` do the same with incremented values in ``X0`` and ``X1``.

Post-indexed: ``[base], #imm``
  Post-indexed instructions use the base address for the memory access, but also add an immediate offset to the value in the register that held the base address.

  .. _lst:base_addressing_modes_4_s:

  .. literalinclude:: ../data_base/addressing_modes_4.s
    :language: gas
    :caption: Implementation of the ``addressing_modes_4`` function, which copies 2x32 bits from one memory location to another.

  Post-indexed loads and stores are shown in :numref:`lst:base_addressing_modes_4_s`.
  The example is similar to the pre-indexed example in :numref:`lst:base_addressing_modes_3_s`.
  However, this time post-indexed loads and stores are used instead of pre-indexed ones.
  In the case of ``ldr w2, [x0], #4``, this means that the instruction first loads the 32 bits from memory using the address in ``X0``, and then increments the value in ``X0`` by 4.

Literal (PC-relative): ``label``
  We can also use the value in the program counter as the base address, along with an immediate offset.
  For example, in the case of :a64_isa:`LDR (literal) <Base-Instructions/LDR--literal---Load-register--literal-->`, the positive or negative immediate offset has a size of 19 bits and encodes multiples of 4.
  This results in a range of roughly ±1 MiB (±1,048,576 bytes) relative to the program counter, calculated as :math:`\pm 2^{18} \times 4` bytes.
  More precisely, the signed immediate covers offsets from :math:`-2^{18} \times 4` to :math:`(2^{18}-1) \times 4` bytes.
  In assembly code, we can use labels for the immediate offset so that the assembler will insert the correct numeric value for us.

  .. _lst:base_addressing_modes_5_s:

  .. literalinclude:: ../data_base/addressing_modes_5.s
    :language: gas
    :caption: Implementation of the function ``addressing_modes_5``, which loads 32 bits from a PC-relative location and then writes them to memory.

  The use of a PC-relative load is illustrated in :numref:`lst:base_addressing_modes_5_s`.
  In the case of ``ldr w1, my_data``, the assembler would determine the immediate offset by comparing the location of the label ``my_data`` with that of the instruction.
  At runtime, the instruction then loads the 32 bits at the effective address obtained by adding the immediate offset to the value in the program counter.
  In this example, the instruction loads the value 128 to register ``W1``.
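
The addressing modes above differ in two respects: which address is used for the memory access, and which value the base register holds afterwards.
The following C++ sketch models both aspects; the struct and function names are hypothetical and only serve as illustration.

.. code-block:: c++

  #include <cstdint>

  // Outcome of the address computation of a load or store.
  struct AddressResult {
    std::uint64_t effective_address;  // address used for the memory access
    std::uint64_t base_after;         // value of the base register afterwards
  };

  // Base plus immediate offset, e.g. `ldr w2, [x0, #8]`: no write-back.
  constexpr AddressResult offset_mode(std::uint64_t base, std::int64_t imm) {
    return {base + static_cast<std::uint64_t>(imm), base};
  }

  // Base plus register offset, e.g. `ldr w4, [x0, x1, lsl #2]`: no write-back.
  constexpr AddressResult register_offset_mode(std::uint64_t base,
                                               std::uint64_t offset,
                                               unsigned shift) {
    return {base + (offset << shift), base};
  }

  // Pre-indexed, e.g. `ldr w2, [x0, #4]!`: access and write-back use base + imm.
  constexpr AddressResult pre_indexed_mode(std::uint64_t base, std::int64_t imm) {
    return {base + static_cast<std::uint64_t>(imm),
            base + static_cast<std::uint64_t>(imm)};
  }

  // Post-indexed, e.g. `ldr w2, [x0], #4`: access uses base, write-back base + imm.
  constexpr AddressResult post_indexed_mode(std::uint64_t base, std::int64_t imm) {
    return {base, base + static_cast<std::uint64_t>(imm)};
  }

  // Literal, e.g. `ldr w1, my_data`: PC plus the signed imm19 value times four.
  constexpr AddressResult literal_mode(std::uint64_t pc, std::int32_t imm19) {
    return {pc + static_cast<std::uint64_t>(static_cast<std::int64_t>(imm19) * 4), pc};
  }

With a base of ``0x1000`` and an immediate of 4, the pre-indexed model accesses ``0x1004`` while the post-indexed model accesses ``0x1000``; in both cases the base register holds ``0x1004`` afterwards.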

.. _sec:base_data_proc:

Data Processing
===============
In addition to load and store instructions, another large group is the *data-processing instructions*.
These instructions read operands from up to three source registers, perform the computation, and write the result back to a destination register.
The registers that an instruction reads from are called *source registers*.
The registers an instruction writes to are called *destination registers*.
When an instruction reads from and writes to a register, that register is also referred to as a destination register in the corresponding A64 ISA instruction description.
In addition to register data, some instructions use immediate values.
Since the ISA contains many data-processing instructions, we formulate high-level descriptions for a few examples:

ADD (immediate): ``add x3, x8, #16``
  Add the unsigned immediate value of ``16`` to the value in source register ``X8`` and write the result to destination register ``X3``.
MOV (register): ``mov w0, w1``
  Copy the 32-bit value from source register ``W1`` to destination register ``W0``.
AND (shifted register): ``and x17, x8, x2, lsl #2``
  Perform a bitwise AND of the 64-bit value in source register ``X8`` with the 64-bit value in source register ``X2`` after it has been logically shifted to the left by 2.
  Write the result to destination register ``X17``.
EOR (shifted register): ``eor w16, w16, w16``
  Perform a bitwise exclusive-OR of the 32-bit value in source register ``W16`` with itself and write the result to destination register ``W16``. This clears the lower 32 bits to zero. Since this is a 32-bit operation, the upper 32 bits of X16 are also implicitly set to zero, effectively zeroing the entire 64-bit register.
MADD: ``madd x0, x1, x2, x3``
  Multiply the 64-bit values in source registers ``X1`` and ``X2``, add the intermediate result to the 64-bit value in source register ``X3``, and write the final 64-bit result to destination register ``X0``.
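
The arithmetic behind these descriptions can be written down in a few lines of C++.
The sketch below is a model only: the function names are hypothetical, registers are represented by plain 64-bit integers, and unsigned wrap-around stands in for the modulo arithmetic of the hardware.

.. code-block:: c++

  #include <cstdint>

  // Model of `add x3, x8, #16`: 64-bit addition of an unsigned immediate.
  constexpr std::uint64_t model_add_imm(std::uint64_t xn, std::uint64_t imm) {
    return xn + imm;
  }

  // Model of `and x17, x8, x2, lsl #2`: shift the second operand, then AND.
  constexpr std::uint64_t model_and_lsl(std::uint64_t xn, std::uint64_t xm,
                                        unsigned shift) {
    return xn & (xm << shift);
  }

  // Model of `eor w16, w16, w16`: the operation only sees the lower 32 bits,
  // and the 32-bit result is zero-extended into the 64-bit register.
  constexpr std::uint64_t model_eor_w(std::uint64_t xn, std::uint64_t xm) {
    return static_cast<std::uint32_t>(xn) ^ static_cast<std::uint32_t>(xm);
  }

  // Model of `madd x0, x1, x2, x3`: x0 = x3 + x1 * x2 (modulo 2^64).
  constexpr std::uint64_t model_madd(std::uint64_t xn, std::uint64_t xm,
                                     std::uint64_t xa) {
    return xa + xn * xm;
  }

Calling ``model_eor_w`` with the same value for both operands returns zero for any input, including inputs whose upper 32 bits are set.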

A special class of data-processing instructions sets the NZCV condition flags based on the result of the instruction.
We call these instructions *flag-setting* instructions.
The NZCV condition flags are defined as follows:

  N
    Negative condition flag. Set to 1 if the result of the last flag-setting instruction was negative.
  Z
    Zero condition flag. Set to 1 if the result of the last flag-setting instruction was zero, and to 0 otherwise. A result of zero often indicates an equal result from a comparison.
  C
    Carry condition flag. Set to 1 if the last flag-setting instruction resulted in a carry condition, for example an unsigned overflow on an addition.
  V
    Overflow condition flag. Set to 1 if the last flag-setting instruction resulted in an overflow condition, for example a signed overflow on an addition.

  -- Arm Limited: C5.2.11 of the :a_ref_man:`Arm Architecture Reference Manual for A-profile architecture <>`

Many flag-setting instructions can be distinguished from their "standard" counterparts by the ``s`` suffix, for example: :a64_isa:`ADDS (shifted register) <Base-Instructions/ADDS--shifted-register---Add-optionally-shifted-register--setting-flags->`, :a64_isa:`ADDS (immediate) <Base-Instructions/ADDS--immediate---Add-immediate-value--setting-flags->`, :a64_isa:`SUBS (shifted register) <Base-Instructions/SUBS--shifted-register---Subtract-optionally-shifted-register--setting-flags->`, :a64_isa:`SUBS (immediate) <Base-Instructions/SUBS--immediate---Subtract-immediate-value--setting-flags->`.

An often-used special case is ``CMP``, which exists in :a64_isa:`extended register <Base-Instructions/CMP--extended-register---Compare--extended-register---an-alias-of-SUBS--extended-register-->`, :a64_isa:`immediate <Base-Instructions/CMP--immediate---Compare--immediate---an-alias-of-SUBS--immediate-->`, and :a64_isa:`shifted register <Base-Instructions/CMP--shifted-register---Compare--shifted-register---an-alias-of-SUBS--shifted-register-->` variants.
These instructions are aliases of their ``SUBS`` counterparts, where only the condition flags are set, but the actual result of the subtraction is discarded.
For example, ``cmp x5, #16`` is equivalent to ``subs xzr, x5, #16``, where xzr is the zero register.
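
To make the flag definitions concrete, the following C++ sketch computes the NZCV flags of a 64-bit ``SUBS`` (and thus of ``CMP``).
The names ``Nzcv``, ``flags_of_subs``, ``holds_eq``, and ``holds_lt`` are hypothetical.
The sketch relies on the fact that, for a subtraction, the carry flag is set when no borrow occurs; the two predicates correspond to the ``EQ`` and ``LT`` conditions used by conditional branches.

.. code-block:: c++

  #include <cstdint>

  // The four condition flags.
  struct Nzcv {
    bool n, z, c, v;
  };

  // Flags produced by a 64-bit `subs`/`cmp`, i.e. by computing a - b.
  inline Nzcv flags_of_subs(std::uint64_t a, std::uint64_t b) {
    std::uint64_t const result = a - b;  // wraps modulo 2^64
    Nzcv flags{};
    flags.n = (result >> 63) != 0;       // sign bit of the result
    flags.z = result == 0;               // operands were equal
    flags.c = a >= b;                    // carry set means "no borrow"
    // Signed overflow: the operands have different signs and the sign of
    // the result differs from the sign of the first operand.
    flags.v = (((a ^ b) & (a ^ result)) >> 63) != 0;
    return flags;
  }

  // EQ holds if Z is set.
  inline bool holds_eq(Nzcv const& flags) { return flags.z; }

  // LT (signed less than) holds if N differs from V.
  inline bool holds_lt(Nzcv const& flags) { return flags.n != flags.v; }

For ``cmp x5, #16`` with ``X5`` holding the value 5, the sketch yields N=1, Z=0, C=0, and V=0, so the ``LT`` condition holds and the ``EQ`` condition does not.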

.. _sec:base_branches:

Branches
========
The last group of base instructions is branches, exception generation, and system instructions.
Branching instructions are the most important category of these and the topic of this section.
Up to this point, we have discussed load and store instructions, as well as data-processing instructions.
Several of these instructions in a code block are simply executed one after another (see also :numref:`sec:assembly_aarch64`), leading to a linear program flow.
Branching instructions allow us to deviate from the linear program flow and jump to different positions in our program.
We already used the two branching instructions :a64_isa:`BL <Base-Instructions/BL--Branch-with-link->` and :a64_isa:`RET <Base-Instructions/RET--Return-from-subroutine->` to jump to the start of a function and back to the calling context.

Before discussing example applications, we briefly introduce the most important branching instructions :a64_isa:`BL <Base-Instructions/BL--Branch-with-link->`, :a64_isa:`RET <Base-Instructions/RET--Return-from-subroutine->`, :a64_isa:`B <Base-Instructions/B--Branch->`, :a64_isa:`B.cond <Base-Instructions/B-cond--Branch-conditionally->`, :a64_isa:`CBZ <Base-Instructions/CBZ--Compare-and-branch-on-zero->`, :a64_isa:`CBNZ <Base-Instructions/CBNZ--Compare-and-branch-on-nonzero->`:

Branch with Link (BL)
  Branch to an offset relative to the program counter and set the link register (``X30``) to ``PC+4``.
  The 26-bit immediate in the field ``imm26`` encodes a signed integer that is multiplied by four to obtain the offset.
  Thus, we obtain an offset range of ±128 MiB, calculated as :math:`\pm 2^{25} \times 4` bytes.

  In assembly code we can use a label instead of a numeric value.
  If the offset to the label is outside of the ±128 MiB range, the linker inserts a `veneer <https://github.com/ARM-software/abi-aa/blob/36280d2821beafe49dfde53c0eaffbae4134a3b1/aapcs64/aapcs64.rst#use-of-ip0-and-ip1-by-the-linker>`__ to reach the destination.
Return from subroutine (RET)
  Branch unconditionally to the address in a register.
  The register defaults to ``LR`` (``X30``).
Branch (B)
  Branch unconditionally to an offset relative to the program counter.
  The offset is stored in a 26-bit immediate, leading to a ±128 MiB range.
Branch conditionally (B.cond)
  Branch conditionally to an offset relative to the program counter.
  The 19-bit immediate in the field ``imm19`` encodes the offset with range ±1 MiB.
  The branching condition is :a64_isa:`encoded <Base-Instructions/B-cond--Branch-conditionally-?lang=en#cond_option>` in the field ``cond``.
  For example, ``0000`` refers to the equal (``EQ``) condition, while ``1011`` refers to less than (``LT``).
  The condition :a64_isa:`specifies <Shared-Pseudocode/shared-functions-system#impl-shared.ConditionHolds.1>` a bit pattern for a subset of the NZCV condition flags that has to be fulfilled for the branch to be executed.
  In the case of the ``EQ`` condition, the zero condition flag (``Z``) has to be ``1`` for the branch to execute.
  For the ``LT`` condition, the negative condition flag (``N``) has to be unequal to the overflow condition flag (``V``).
Compare and branch on zero (CBZ)
  Branch to an offset relative to the program counter if a register is zero.
  The 5-bit register ID is encoded in the field ``Rt``.
  The offset has a range of ±1 MiB and is encoded in the 19-bit immediate in field ``imm19``.
Compare and branch on nonzero (CBNZ)
  Branch to an offset relative to the program counter if a register is nonzero.
  The field ``Rt`` encodes the 5-bit register ID, and the field ``imm19`` the 19-bit immediate, leading to a ±1 MiB offset range.
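
The offset ranges listed above follow from sign-extending the immediate field and multiplying it by four.
A minimal C sketch of this decoding, assuming the raw immediate has already been extracted from the instruction word, could look as follows.
The function name ``branch_offset`` and its parameters are hypothetical and chosen for illustration only.

.. code-block:: c

  #include <stdint.h>

  // Decodes a PC-relative byte offset from an immediate field that is
  // `bits` wide: 26 for B and BL, 19 for B.cond, CBZ, and CBNZ.
  // Assumes 1 <= bits <= 31.
  int64_t branch_offset(uint32_t imm, unsigned bits) {
    uint32_t mask = (1u << bits) - 1u; // keeps the lower `bits` bits
    uint32_t sign = 1u << (bits - 1);  // position of the sign bit
    int64_t value = (int64_t)(imm & mask);
    if (imm & sign) {
      value -= (int64_t)1 << bits; // sign-extend negative values
    }
    return value * 4; // instructions are four bytes wide
  }

For example, ``branch_offset(2, 26)`` yields ``8``, and the most negative 19-bit immediate ``0x40000`` yields ``-1048576``, that is, -1 MiB.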

.. admonition:: Examples

  The following examples are available in the archive :download:`branches.tar.xz <../data_base/gen/branches.tar.xz>` together with a driver showcasing them.

  .. _lst:base_branches_0_s:

  .. literalinclude:: ../data_base/branches_0.s
    :language: gas
    :linenos:
    :caption: Implementation of the function ``branches_0``.

  Our first example has the function signature ``int32_t branches_0()`` and is shown in :numref:`lst:base_branches_0_s`.
  In line 5 the function sets the value of the return register ``W0`` to ``1``.
  Next, the unconditional branch in line 6 is executed and jumps to ``my_label``.
  Thus, we jump 8 bytes forward relative to the current program counter.
  Effectively, we jump over ``mov w0, #2`` and execute ``ret`` next.

  In summary, calling ``branches_0`` will always return ``1``.

  .. _lst:base_branches_1_s:

  .. literalinclude:: ../data_base/branches_1.s
    :language: gas
    :linenos:
    :caption: Implementation of the function ``branches_1``.

  :numref:`lst:base_branches_1_s` shows the implementation of the function ``int32_t branches_1(int32_t)``.
  The instruction ``cmp w0, #25`` sets the NZCV condition flags based on the result of the subtraction of ``25`` from the value in the first parameter register ``W0``.
  Specifically, the ``Z`` condition flag is set to ``1`` if ``W0`` holds the value ``25``, and set to ``0`` in all other cases.

  Next, the instruction in line 6 sets the value of ``W0`` to ``1``.
  Line 7 contains the conditional branch ``b.eq my_label``.
  The branch jumps to ``my_label`` if the ``Z`` condition flag is ``1``, as implied by the ``EQ`` condition.
  If not, the function continues linearly and executes ``mov w0, #2`` which sets the value of ``W0`` to ``2``.
  ``ret`` is the last instruction executed in either case and jumps to the address in the link register (``LR``).

  In summary, calling ``branches_1`` returns ``1`` if the parameter ``25`` is passed and ``2`` in all other cases.
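
  For comparison, the control flow of ``branches_1`` can be mirrored in C.
  The sketch below is reconstructed from the description above and is not part of the archive.
  The name ``branches_1_ref`` is hypothetical; the ``Z`` flag is modeled explicitly and ``goto`` takes the place of the conditional branch.

  .. code-block:: c

    #include <stdbool.h>
    #include <stdint.h>

    // Hypothetical C counterpart of branches_1.
    int32_t branches_1_ref(int32_t value) {
      int32_t w0 = value;
      bool z = (w0 == 25);  // cmp w0, #25: Z is 1 if W0 holds 25
      w0 = 1;               // mov w0, #1 (does not change the flags)
      if (z) goto my_label; // b.eq my_label
      w0 = 2;               // mov w0, #2
    my_label:
      return w0;            // ret
    }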

  .. _lst:base_branches_2_s:

  .. literalinclude:: ../data_base/branches_2.s
    :language: gas
    :linenos:
    :caption: Implementation of the function ``branches_2``.

  :numref:`lst:base_branches_2_s` shows an example function using CBZ.
  In the given case, ``cbz w0, my_label`` jumps only if parameter register ``W0`` is zero.

  Thus, the function returns ``1`` if called with parameter ``0``.
  In all other cases, the function does not modify ``W0``, meaning that the value of the parameter is also the function's return value.

  .. _lst:base_branches_3_s:

  .. literalinclude:: ../data_base/branches_3.s
    :language: gas
    :linenos:
    :caption: Implementation of the function ``branches_3``.

  The function ``branches_3`` in :numref:`lst:base_branches_3_s` is similar to ``branches_2``.
  However, in this case the CBNZ instruction in line 5 branches if the value in ``W0`` is nonzero.
  Therefore, the function returns ``0`` if called with the parameter ``0``: ``branches_3(0)``.
  In all other cases, the function returns ``1`` since the instruction ``mov w0, #1`` is executed.
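
  The described behavior of ``branches_2`` and ``branches_3`` can be summarized by two small C functions.
  These are hypothetical reference implementations (the names ``branches_2_ref`` and ``branches_3_ref`` are not part of the archive) that could be used to cross-check the assembly versions in a driver.

  .. code-block:: c

    #include <stdint.h>

    // Mirrors branches_2: returns 1 for the argument 0
    // and the unmodified argument in all other cases.
    int32_t branches_2_ref(int32_t value) {
      return (value == 0) ? 1 : value;
    }

    // Mirrors branches_3: returns 0 for the argument 0
    // and 1 in all other cases.
    int32_t branches_3_ref(int32_t value) {
      return (value != 0) ? 1 : 0;
    }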

  .. _lst:base_branches_4_s:

  .. literalinclude:: ../data_base/branches_4.s
    :language: gas
    :linenos:
    :caption: Implementation of the function ``branches_4``.

  The example in :numref:`lst:base_branches_4_s` implements the function ``uint64_t branches_4(uint64_t, uint64_t)``.
  Effectively, ``branches_4`` computes the n-th power of a number.
  For example, the function returns :math:`2^8=256` when passing the parameters ``2`` as base and ``8`` as exponent: ``branches_4(2, 8)``.

  ``mov x2, x0`` copies the base from register ``X0`` to ``X2``.
  ``cbnz x1, my_label`` in line 7 jumps to ``my_label`` if the exponent is nonzero.

  If the exponent is zero, program execution continues linearly and the instructions in lines 8 and 9 are executed next.
  ``mov x0, #1`` sets return register ``X0`` to ``1`` and ``b my_end`` jumps to the end of the function.

  If the exponent is nonzero, the instructions in lines 12 and 13 are executed next.
  ``subs x1, x1, #1`` decrements the value of ``X1`` and sets the NZCV condition flags.
  Next, the conditional branch in line 13 jumps to the end of the function if the ``Z`` condition flag is set to ``1``.
  This is the case if the result of the previous instruction was zero, which holds if the passed exponent was ``1``.
  In that case the value in register ``X0`` is returned unmodified, meaning that we simply return the base if the exponent is ``1``.

  The block in lines 15-18 covers the general case where the exponent is greater than ``1``.
  In line 16 the value of ``X1`` is decremented and the NZCV condition flags are set.
  ``mul x0, x0, x2`` multiplies the current value in ``X0`` with the base in register ``X2``, and writes the result to ``X0``.
  The conditional branch ``b.ne my_loop`` only jumps to label ``my_loop`` if the ``Z`` condition flag is not set.
  This means that the jump is only executed if the result of the SUBS instruction in line 16 was nonzero.
  In summary, we have written a loop that multiplies ``X0`` by the base a total of (exponent - 1) times, effectively computing the exponentiation.
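
  The structure of ``branches_4`` can also be expressed in C.
  The following sketch follows the three cases discussed above (exponent zero, exponent one, and the general loop); the name ``branches_4_ref`` is hypothetical and the code is not part of the archive.
  As with ``mul``, the unsigned multiplication wraps modulo :math:`2^{64}` if the result does not fit into 64 bits.

  .. code-block:: c

    #include <stdint.h>

    // Hypothetical C counterpart of branches_4.
    uint64_t branches_4_ref(uint64_t base, uint64_t exponent) {
      uint64_t result = base;   // X0 keeps the running product
      if (exponent == 0) {      // cbnz x1, my_label not taken
        return 1;               // mov x0, #1; b my_end
      }
      exponent--;               // subs x1, x1, #1
      if (exponent == 0) {      // b.eq my_end
        return result;          // exponent was 1: return the base
      }
      do {
        exponent--;             // subs x1, x1, #1
        result *= base;         // mul x0, x0, x2
      } while (exponent != 0);  // b.ne my_loop
      return result;
    }

  Calling ``branches_4_ref(2, 8)`` runs the loop body seven times and returns ``256``.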