.. _ch:code_generation:

Code Generation
===============

At this point, we know how to write efficient AArch64 kernels that instantiate primitives.
A limiting factor of our current approach is that each of our assembly kernels is still limited to instantiating a set of hardcoded parameters.
Consider the BRGEMM primitive as an example.
In this case, the sizes of the matrix dimensions and the datatype are hardcoded.
This means that until now, we had to write a new kernel whenever we needed a kernel with new hardcoded parameters.

In this chapter, we discuss *just-in-time (JIT) code generation* as an approach to solving this problem.
JIT code generation allows us to continue hardcoding performance-critical parameters into our kernels, but leaves the actual instantiation as a runtime decision.
Unlike prewritten kernels, the JIT approach allows us to delay kernel optimization until we know the full problem specifications in our tensor compiler.

.. _tab:code_gen_overview:

.. list-table:: Overview of our approach to just-in-time code generation for AArch64 and the Linux operating system.
   :header-rows: 0
   :widths: 5 95

   * - **1.**
     - Write instruction words to a temporary buffer.
   * - **2.**
     - Allocate writable pages with ``mmap``.
   * - **3.**
     - Copy data from the buffer to those pages.
   * - **4.**
     - Flush the instruction cache.
   * - **5.**
     - Set the pages to read+execute with ``mprotect``.
   * - **6.**
     - Cast the page pointer to a function pointer and call it.

:numref:`tab:code_gen_overview` provides a high-level overview of the six JIT steps covered in this chapter.
More specifically, :numref:`sec:code_gen_exe_mem` discusses steps 2-5 (setting up the executable memory), while :numref:`sec:code_gen_kernels` shows examples of steps 1 and 6 (building the buffer & calling kernels).
Finally, in :numref:`sec:code_gen_inst_enc` discusses a structured approach to generating instruction words as part of our JIT toolchain.

.. _sec:code_gen_exe_mem:

Executable Memory
-----------------
At its core, the idea of just-in-time code generation is remarkably simple.
Instead of loading instruction words from a file into memory and then executing the code from memory, we generate instruction words, write them directly into memory, and then execute them.
We discussed the structure of instruction words in :numref:`sec:base_machine_code`.
Therefore, generating the instruction words that represent our kernel and writing them to memory is an application of that knowledge.
The only missing piece is a way to get an executable region of memory that contains our instruction words.

Most operating systems enforce the write-xor-execute (W^X) security policy.
W^X means that a page of memory is either writable or executable, but not both.
In short, we will use `mmap <https://www.man7.org/linux/man-pages/man2/mmap.2.html>`__ to request page-aligned, writable, and zero-initialized memory from our Linux operating system.
In practice, ``mmap`` rounds the requested number of bytes up to full pages.
We can then write our kernel to the memory returned by ``mmap``.
Finally, we use `mprotect <https://www.man7.org/linux/man-pages/man2/mprotect.2.html>`__ to make the pages read-only and executable.

.. _lst:code_gen_alloc_mmap:

.. literalinclude:: ../data_code_gen/Kernel.cpp
  :language: c++
  :lines: 21-35
  :dedent:
  :lineno-match:
  :caption: ``alloc_mmap`` function, which uses the POSIX.1 ``mmap`` function to allocate memory.

:numref:`lst:code_gen_alloc_mmap` shows the ``alloc_mmap`` function, which wraps ``mmap`` and returns a pointer to a writable memory region of the desired size.
The first ``mmap`` parameter in line 22 is set to ``0``, resulting in page-aligned memory, while the ``MAP_ANONYMOUS`` bit in line 25 results in zero-initialized memory.
It is used in the minimal example code :download:`mini_jit <../data_code_gen/mini_jit.tar.xz>` to allocate memory for the instruction words of the kernels.

.. _lst:code_gen_set_exec:

.. literalinclude:: ../data_code_gen/Kernel.cpp
  :language: c++
  :lines: 47-57
  :dedent:
  :lineno-match:
  :caption: Function ``set_exec`` using the POSIX.1 ``mprotect`` function.

Once the instruction words are copied to the memory pages allocated by ``mmap``, the ``mini_jit`` code uses the ``set_exec`` function shown in :numref:`lst:code_gen_set_exec` to make them executable.
``set_exec`` uses ``mprotect`` to make the pages read-only and executable, thus following the W^X security model.

.. _lst:code_gen_set_kernel:

.. literalinclude:: ../data_code_gen/Kernel.cpp
  :language: c++
  :lines: 59-89
  :dedent:
  :lineno-match:
  :caption: Function ``set_kernel``, using the two functions ``alloc_mmap`` and ``set_exec``.

The code in :numref:`lst:code_gen_set_kernel` shows how we can use the two functions together.
The ``mini_jit::Kernel`` class has an internal buffer, ``m_buffer``, to which we can add the instruction words one at a time.
Once all the instruction words are added, we call the ``set_kernel`` function, which allocates memory for the kernel by calling ``alloc_mmap`` in line 69, copies the instruction words into the allocated memory in lines 77-79, and makes the memory read-only and executable by calling the ``set_exec`` function in lines 87-88.
The built-in function `__builtin___clear_cache <https://gcc.gnu.org/onlinedocs/gcc/Other-Builtins.html#index-_005f_005fbuiltin_005f_005f_005fclear_005fcache>`__ is called to clear the instruction cache and ensure that the generated code is visible to the core that generated it.

.. _sec:code_gen_kernels:

Kernels
-------
We will now discuss some example kernels that illustrate the use of the ``mini_jit::Kernel`` class.
These kernels are very simple by design, relying mostly on hardcoded instructions.
In real just-in-time code generation software, additional wrappers are used to avoid code bloat and improve readability.
:numref:`sec:code_gen_inst_enc` discusses a better way to generate instruction words. 
All examples can also be found directly in the ``kernel_examples.cpp`` file of  :download:`mini_jit <../data_code_gen/mini_jit.tar.xz>`.

.. _lst:code_gen_kernel_0:

.. literalinclude:: ../data_code_gen/kernel_examples.cpp
  :language: c++
  :lines: 7-19
  :dedent:
  :lineno-match:
  :caption: Example generation of a kernel with two instructions.

:numref:`lst:code_gen_kernel_0` shows our first example kernel.
We see that the two instruction words ``0xd28000a0`` and ``0xd65f03c0`` are passed to ``l_kernel`` by calling the ``add_instr`` instruction in lines 11 and 12.
Internally, the class adds the passed instruction words to the buffer ``m_buffer``.
After the two instruction words are added, the internal buffer has a size of 8 bytes.

Next, line 13 calls the ``write`` function, which writes the 8 instruction bytes to the binary file ``example_0.bin``.
The ``write`` function itself is fairly standard, and details on its implementation can be found in the ``Kernel.cpp`` file.
We can use this file output along with standard command line tools for debugging.
One option is to use ``objdump`` to disassemble the code:

.. code-block:: bash

   objdump -m aarch64 -b binary -D example_0.bin

For the code in :numref:`lst:code_gen_kernel_0`, we get the following output:

.. literalinclude:: ../data_code_gen/gen/example_0.dis
   :language: none
   :lines: 2-

We can see that the two instructions can be written in assembly code as ``mov x0, #5`` and ``ret``, so the kernel simply writes the value ``5`` to the ``X0`` register before branching back to the calling scope with ``ret``.
Since the file output is for debugging, we simply disable it once our just-in-time code generation works as expected.

Continuing with the code in :numref:`lst:code_gen_kernel_0`, we next call the ``set_kernel`` function in line 14.
As discussed earlier, ``set_kernel`` copies the instruction words into page-aligned memory and makes it executable.
In line 17, we use the ``get_kernel`` function to get a ``void const`` pointer to this memory region.
The C-style cast ``(int64_t (*)())`` converts this pointer to a function pointer.

.. note::

  Converting a ``void *`` object pointer to a function pointer is implementation-defined in ISO C/C++. On Linux with GCC or Clang it works reliably, but in other cases it may not.

The function takes no arguments and returns an ``int64_t`` value.
This function pointer is stored in the variable ``l_func``.
Finally, in line 18, the function ``l_func`` is called, which means that our kernel is executed.

.. _lst:code_gen_kernel_1:

.. literalinclude:: ../data_code_gen/kernel_examples.cpp
  :language: c++
  :lines: 26-40
  :dedent:
  :lineno-match:
  :caption: Example generation of a kernel with two instructions. The 16-bit immediate of the ``mov`` instruction is set dynamically.

The example in :numref:`lst:code_gen_kernel_1` is very similar to the previous one.
However, this time the 16-bit immediate of the :a64_isa:`MOV (wide immediate) <Base-Instructions/MOV--wide-immediate---Move-wide-immediate-value--an-alias-of-MOVZ-?lang=en>` instruction is set to ``imm16`` at runtime.
The immediate is encoded in the instruction bits with IDs 5-20.
We perform three steps to generate the desired instruction word:

1. Initialize the instruction word with ``0xd2800000`` and write the result to ``l_ins`` (line 30). The 16 immediate bits are set to zero, that is we use the instruction word for ``mov x0, #0`` as the initial value.
2. Shift ``imm16`` by 5 to the left (line 31). So if we had the 16-bit value ``0b0000'0000'0000'0011`` before, we get the 32-bit value ``0b0000'0000'0000'0000'0000'0000'0110'0000`` after the shift.
3. Do a bitwise OR of ``l_ins`` and the shifted ``imm16`` (line 31). This effectively sets the 16 immediate bits to the desired value.

The rest follows the steps of the first example.
Now, in line 39, the immediate value is returned by the function call ``l_func()``.
This example already outlines the general approach for encoding instructions.
First, we initialize all free bits to zero.
Second, we set the free bits based on runtime parameters.
More detailed examples are given in :numref:`sec:code_gen_inst_enc`.

.. _lst:code_gen_kernel_2:

.. literalinclude:: ../data_code_gen/kernel_examples.cpp
  :language: c++
  :lines: 45-57
  :dedent:
  :lineno-match:
  :caption: Example generation of a kernel with two instructions. The kernel has a single function parameter of type ``int64_t``.

Our next example is shown in :numref:`lst:code_gen_kernel_2`.
Again, we generate a kernel consisting of two instruction words.
This time, however, the kernel takes a 64-bit integer as its only input parameter.
As a result, the ``l_func`` function in lines 54-55 has a different signature.
In line 56, the value ``7`` is passed to the kernel through the ``X0`` parameter register.
The kernel adds the immediate ``#5`` to ``X0`` and branches back to the surrounding scope.
Therefore, the return value of ``l_func( 7 )`` is ``12``.

.. _lst:code_gen_kernel_3:

.. literalinclude:: ../data_code_gen/kernel_examples.cpp
  :language: c++
  :lines: 62-79
  :dedent:
  :lineno-match:
  :caption: Example generation of a kernel with seven instructions. The kernel performs 512 loop iterations.

Our final example in :numref:`lst:code_gen_kernel_3` shows just-in-time code generation of a loop.
The loop consists of a loop counter in register ``X0`` and a branch instruction.
The instruction word ``0xd2804000`` in line 66 initializes the loop counter to ``512``, while ``0xd1000400`` in line 68 decrements the loop counter.
The branch instruction is given by the instruction word ``0xb5ffffc0`` and jumps ``#-8`` bytes relative to the program counter, that is two instructions back.

.. _sec:code_gen_inst_enc:

Instruction Word Generators
----------------------------
This section discusses instruction word generators that set the free bits of an instruction based on user input.
We have already seen a simple example of this in :numref:`sec:code_gen_kernels`, where we dynamically set the 16-bit immediate of MOV (wide immediate) using a shift operation together with a bitwise OR.

Our goal in this section is to systematically raise the level of abstraction so that users of our just-in-time code generator can interface with instruction word generation instead of having to provide the instruction words manually.
In summary, using instruction word generators will feel similar to writing assembly code.

.. _tab:code_gen_cbnz:

.. list-table:: Structure of the CBNZ instruction.
  :header-rows: 3
  :stub-columns: 1

  * - Bit IDs
    - 31
    - 30-24
    - 23-5
    - 4-0
  * - Field
    - sf
    -
    - imm19
    - Rt
  * - Pattern
    - ``s``
    - ``0110101``
    - ``iiiiiiiiiiiiiiiiiii``
    - ``ttttt``
  * - ``cbnz w0, #0``
    - ``0``
    - ``0110101``
    - ``0000000000000000000``
    - ``00000``

We will discuss the implementation of generators for two instructions.
As before, all functions discussed are part of the  :download:`mini_jit <../data_code_gen/mini_jit.tar.xz>` example code.
The first instruction is :a64_isa:`CBNZ <Base-Instructions/CBNZ--Compare-and-branch-on-nonzero->`, and its general structure is shown in :numref:`tab:code_gen_cbnz`.
Bits 30-24 are fixed bits, while all other bits are free bits.
In particular, the ``sf`` field determines whether we are using the 32-bit or 64-bit variant of the instruction, that is whether the instruction compares a ``W`` register or an ``X`` register to zero.
The 19 immediate bits are stored in the field ``imm19`` and determine how far the instruction branches relative to the program counter.
``imm19`` is interpreted as a signed value in two's complement form, and the jump in bytes is obtained by multiplying ``imm19`` by 4.
In other words, ``imm19`` stores how many instructions backward (negative value) or forward (positive value) we branch.
The remaining free bits are given by the field ``Rt``, which encodes the ID of the register, that is it is compared to zero.

.. _lst:code_gen_cbnz:

.. literalinclude:: ../data_code_gen/InstGen.cpp
  :language: c++
  :lines: 23-40
  :dedent:
  :lineno-match:
  :caption: Instruction word generator for the base instruction CBNZ.

:numref:`lst:code_gen_cbnz` shows the ``base_br_cbnz`` function, which generates CBNZ instruction words.
The function takes two parameters, ``reg`` and ``imm19``.
The first parameter ``reg`` contains the ID of the general-purpose register used and the view of that register.
Both are encoded in the enumerator ``gpr_t``.
Specifically, we assign the enumerator values 0-30 to the registers W0-W30 and the values 32-62 to the registers X0-X30.
The second parameter ``imm19`` contains the 19-bit immediate of the instruction.

Going through the code line by line, we see that the local variable ``l_ins`` is initialized with the instruction word ``0x35000000``, that is all free bits are set to zero, as shown in :numref:`tab:code_gen_cbnz`.
Next, lines 28 and 29 set the instruction bits of the ``Rt`` field.
Specifically, we first mask the input parameter ``reg`` with ``0x1f``, which is ``0b11111`` in binary.
The mask sets all but the last five bits to zero.
Thus, regardless of the register type, ``X`` or ``W``, we only set the ``Rt`` bits by the bitwise OR in line 29.

Lines 32 and 33 set the size field ``sf``, which is stored in bit 31 of a CBNZ instruction word.
Our instructions must have a value of ``0`` in ``sf`` when working with a ``W`` register and a value of ``1`` when working with an ``X`` register.
So, in line 32 we apply the mask ``0x20`` to isolate the size bit in our ``gpr_t`` enumerator type.
Then, in line 33, we shift the size bit, now at local ID 5 of ``l_reg_size``, by 26 to the left so that the following bitwise OR modifies instruction bit 31 for an ``X`` register.

The last two bit manipulations in lines 36 and 37 set the 19-bit immediate of the instruction.
As before, we first apply the mask ``0x7ffff`` to eliminate all but the lower 19 bits in ``imm19``.
The masked value in ``l_imm`` is then written to bits 23-5 of ``l_ins`` by left shifting and bitwise ORing.

.. _tab:code_gen_fmla_vector:

.. list-table:: Structure of the FP32 and FP64 versions of the FMLA (vector) instruction.
  :header-rows: 3
  :stub-columns: 1

  * - Bit IDs
    - 31
    - 30
    - 29-23
    - 22
    - 21
    - 20-16
    - 15-10
    - 9-5
    - 4-0
  * - Field
    -
    - Q
    -
    - sz
    -
    - Rm
    -
    - Rn
    - Rd
  * - Pattern
    - ``0``
    - ``q``
    - ``0011100``
    - ``s``
    - ``1``
    - ``mmmmm``
    - ``110011``
    - ``nnnnn``
    - ``ddddd``
  * - ``fmla v0.2s, v0.2s, v0.2s``
    - ``0``
    - ``0``
    - ``0011100``
    - ``0``
    - ``1``
    - ``00000``
    - ``110011``
    - ``00000``
    - ``00000``
  * - ``fmla v0.4s, v0.4s, v0.4s``
    - ``0``
    - ``1``
    - ``0011100``
    - ``0``
    - ``1``
    - ``00000``
    - ``110011``
    - ``00000``
    - ``00000``
  * - ``fmla v0.2d, v0.2d, v0.2d``
    - ``0``
    - ``1``
    - ``0011100``
    - ``1``
    - ``1``
    - ``00000``
    - ``110011``
    - ``00000``
    - ``00000``

Our second example is the FP32 and FP64 versions of :a64_isa:`FMLA (vector) <SIMD-FP-Instructions/FMLA--vector---Floating-point-fused-multiply-add-to-accumulator--vector-->`.
The overall structure of the instruction is shown in :numref:`tab:code_gen_fmla_vector`.
We see that we need to set the five fields ``Q``, ``sz``, ``Rm``, ``Rn``, and ``Rd``.
``Rn`` contains the ID of the first SIMD&FP source register, ``Rm`` contains the ID of the second SIMD&FP source register, and ``Rd`` contains the ID of the SIMD&FP destination register.
The two fields ``Q`` and ``sz`` together encode the arrangement specifier of the instruction.
Specifically, both are zero for ``2S``, only ``Q`` is ``1``  for ``4S``, and both are ``1`` for ``2D``.

.. _lst:code_gen_fmla_vector:

.. literalinclude:: ../data_code_gen/InstGen.cpp
  :language: c++
  :lines: 42-65
  :dedent:
  :lineno-match:
  :caption: Instruction generator for the FP32 and FP64 versions of the FMLA (vector) instruction.

:numref:`lst:code_gen_fmla_vector` shows the ``neon_dp_fmla_vector`` function, which can generate FP32 and FP64 FMLA (vector) instruction words.
The register IDs are passed by the parameters ``reg_dest``, ``reg_src1``, and ``reg_src2``.
The ``simd_fp_t`` enumeration simply uses the values 0-31 for the SIMD&FP registers V0-V31.
The arrangement specifier is passed through the ``arr_spec`` parameter, where the ``arr_spec_t`` enumerator is a bit special.
It enumerates the value ``0x0`` for ``2S``, the value ``0x40000000`` for ``4S``, and the value ``0x40400000`` for ``2D``.
This means that for ``2S`` all bits are set to zero, for ``4S`` only the bit with ID 30 is set to ``1``, and for ``2D`` only the two bits with ID 30 and 22 are set to ``1``.

The following code sets the ``Rd``, ``Rn`` and ``Rm`` fields in lines 49-58.
Because of the ``arr_spec_t`` values used, the generator can simply do a bitwise OR to set the ``Q`` and ``sz`` fields in line 62 after masking.

In summary, we have seen that the implementation of instruction word generators is straightforward.
The two CBNZ and FMLA (vector) generators discussed use only left shifts, bitwise ANDs, and bitwise ORs.
This reliance on simple bit operations results in fast just-in-time code generation.
This can be an advantage in situations where code generation time is critical.