.. _ch:micro_bench:

***************
Microbenchmarks
***************

Modern computer cores are based on inherently parallel microarchitectures.
This means that we already have a high degree of parallelism within a single core.
In this chapter, we examine microbenchmarks that allow us to measure core-level parallelism.
A detailed understanding of the requirements for core-level parallelism will guide our implementation of low-level kernels in later chapters.


.. _sec:micro_bench_ins_thr:

Execution Throughput
====================

A typical microarchitecture decomposes instructions into microoperations.
The microoperations are issued to pipelines that execute them.
Depending on the microoperation, it may be executed by one of several pipelines.
Typically, an execution pipeline can accept one new microoperation per cycle.

.. _tab:micro_bench_thr_neon_specs:

.. list-table:: Theoretical execution throughput of various floating-point instructions on a core of NVIDIA's Grace CPU assuming a 3.3 GHz clock frequency.
   :header-rows: 1

   * - Instruction
     - Inst / Cycle
     - Ops / Inst
     - Ops / Cycle
     - GFLOPS
   * - :a64_isa:`FMLA (vector) <SIMD-FP-Instructions/FMLA--vector---Floating-point-fused-multiply-add-to-accumulator--vector-->`, ``4S``
     - 4
     - 8
     - 32
     - 105.6
   * - :a64_isa:`FMLA (vector) <SIMD-FP-Instructions/FMLA--vector---Floating-point-fused-multiply-add-to-accumulator--vector-->`, ``2S``
     - 4
     - 4
     - 16
     - 52.8
   * - :a64_isa:`FMUL (scalar) <FMUL--scalar---Floating-point-multiply--scalar-->`
     - 4
     - 1
     - 4
     - 13.2

*Execution throughput* is the maximum number of instructions in a given instruction group that a microarchitecture can execute per cycle.
This assumes ideal conditions where there are no resource conflicts, data dependencies, or memory delays.
For example, assume that a microarchitecture has an execution throughput of 4 for the Neon :a64_isa:`FMLA (vector) <SIMD-FP-Instructions/FMLA--vector---Floating-point-fused-multiply-add-to-accumulator--vector-->` instruction.
This means that under ideal conditions, we obtain the results of four FMLA (vector) instructions per cycle.

:numref:`tab:micro_bench_thr_neon_specs` shows the theoretical floating-point throughput of an example core.
The 3.3 GHz clock is the nominal frequency of the NVIDIA Grace Superchip system microbenchmarked below; it is slightly higher than the 3.1 GHz listed in the corresponding `datasheet <https://resources.nvidia.com/en-us-data-center-overview/hpc-datasheet-grace-cpu-superchip>`__.
The execution throughput of the three example instructions is expressed in instructions per cycle and matches the microarchitecture specifications in Table 3-16 of the `Neoverse V2 Software Optimization Guide <https://developer.arm.com/documentation/109898/0300>`__.

The number of operations (ops) for a given instruction that a microarchitecture can execute per cycle is given as:

.. math::

   \frac{\text{ops}}{\text{cycle}} = \frac{\text{instructions}}{\text{cycle}} \cdot \frac{\text{ops}}{\text{instruction}}

In the case of FMLA (vector), we may be interested in FP32 arithmetic instructions with arrangement specifier ``4S``.
In this case, each instruction performs a fused multiply-add on four-element vectors, resulting in 8 floating-point operations per instruction.
This gives a Neoverse V2 throughput of :math:`4 \cdot 8 = 32` operations per cycle.
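As a quick check of this arithmetic, the ops/cycle column of :numref:`tab:micro_bench_thr_neon_specs` can be recomputed from the instructions-per-cycle and ops-per-instruction figures. A minimal sketch; the input values are taken directly from the table:

```python
# Recompute the ops/cycle column of the throughput table.
# Values: (instructions per cycle, floating-point ops per instruction),
# taken from the Neoverse V2 figures quoted in the text.
instructions = {
    "FMLA (vector), 4S": (4, 8),  # 4 lanes * 2 flops (multiply + add)
    "FMLA (vector), 2S": (4, 4),  # 2 lanes * 2 flops
    "FMUL (scalar)":     (4, 1),  # 1 flop
}

for name, (inst_per_cycle, ops_per_inst) in instructions.items():
    ops_per_cycle = inst_per_cycle * ops_per_inst
    print(f"{name}: {ops_per_cycle} ops/cycle")
```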

We get the operations per second by also considering the clock frequency:

.. math::

   \frac{\text{ops}}{\text{second}} = \frac{\text{ops}}{\text{cycle}} \cdot \frac{\text{cycles}}{\text{second}}

For example, in the case of a Grace core and FMLA (vector) with arrangement specifier ``4S``, we get :math:`105.6\,\text{GOPS} = 3.3\,\text{GHz} \cdot 32\,\text{ops/cycle}`.
This means that a core can perform 105.6 FP32 giga (:math:`10^9`) operations per second using FMLA (vector) instructions under ideal conditions.
Of all the instructions available, this is also the highest theoretical throughput for FP32 arithmetic.
We refer to this number as the core's theoretical FP32 *peak performance*.
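The peak-performance computation above can be expressed as a one-line sketch, using the 3.3 GHz nominal frequency assumed in this chapter:

```python
# Theoretical FP32 peak performance of a single Grace core.
clock_ghz = 3.3        # assumed nominal frequency (cycles per nanosecond)
ops_per_cycle = 4 * 8  # 4 FMLA (vector), 4S instructions/cycle * 8 flops each

peak_gflops = clock_ghz * ops_per_cycle
print(f"{peak_gflops:.1f} GFLOPS")  # 105.6
```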

.. note::

   On Neoverse V2, the 128-bit SVE FMLA instruction also has a throughput of four instructions per cycle. So, in theory, we can achieve FP32 peak performance using different instructions. This may seem intuitive since the Neoverse V2 core uses the same pipelines to execute Neon and SVE instructions. However, there are microarchitectures where the situation is different. An example is the `A64FX <https://www.fujitsu.com/downloads/SUPER/a64fx/a64fx_datasheet_en.pdf>`__, where we need to use 512-bit SVE instructions for the highest performance.

So far we have discussed instruction throughput from a theoretical point of view.
We will now discuss how to write microbenchmarks that allow us to measure execution throughput.
To maximize instruction throughput, we must 1) utilize all available pipelines and 2) keep the pipelines full.
Accordingly, we design our throughput microbenchmarks to execute many independent instructions.

.. _lst:micro_bench_thr_neon:

.. literalinclude:: ../data_micro_bench/thr_neon_fmla_4s.s
  :language: gas
  :lines: 62-108
  :dedent:
  :linenos:
  :caption: FMLA (vector), ``4S`` microbenchmark that measures execution throughput.

:numref:`lst:micro_bench_thr_neon` shows the loop of a :download:`microbenchmark <../data_micro_bench/thr_neon_fmla_4s.s>` that measures the execution throughput of FMLA (vector) using the ``4S`` arrangement specifier.
The benchmark consists of two parts: a loop and a block of code that is executed at each iteration of the loop.
The loop decrements the value of ``X0`` at each iteration (line 2) and continues iterating only if the value in ``X0`` is not zero (line 46).
The value in ``X0`` is the only function parameter of the microbenchmark, allowing the surrounding scope to determine how many times the code block is executed.
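The surrounding scope can convert a wall-clock measurement of the microbenchmark into instructions per cycle and GFLOPS. A minimal post-processing sketch; the iteration count and elapsed time below are illustrative values, not measurements:

```python
# Hypothetical post-processing of a throughput run.
iterations     = 1_000_000  # value passed to the benchmark in X0
inst_per_it    = 3200       # FMLA instructions per loop iteration (as in the listing)
flops_per_inst = 8          # FMLA (vector), 4S
clock_hz       = 3.3e9      # assumed nominal frequency

elapsed_s = 0.2424          # illustrative elapsed time, not a measurement

cycles = elapsed_s * clock_hz
inst_per_cycle = iterations * inst_per_it / cycles
gflops = iterations * inst_per_it * flops_per_inst / elapsed_s / 1e9
print(f"{inst_per_cycle:.2f} inst/cycle, {gflops:.1f} GFLOPS")
```

With the illustrative numbers above, the result lands near the 4 inst/cycle and 105.6 GFLOPS expected from the theoretical considerations.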

The code block itself consists of 3200 instructions, realized by a subblock of 32 instructions (lines 5-43), repeated 100 times by the `.rept <https://sourceware.org/binutils/docs/as/Rept.html>`__ and ``.endr`` directives in lines 4 and 44.
Looking at the subblock, we see that the first eight instructions in lines 5-13 are completely independent.
The following eight instructions in lines 15-23 have write-after-read dependencies with respect to the previous eight instructions in lines 5-13.
For example, the ninth instruction ``fmla v8.4s, v16.4s, v24.4s`` in line 15 writes to register ``V8``, which is read by the instruction ``fmla v0.4s, v8.4s, v16.4s`` in line 5.
Write-after-read dependencies are typically resolved by the renaming capabilities of the microarchitecture.

The first true dependencies (read-after-write) occur in the second half of the 32-instruction subblock.
For example, the ``fmla v16.4s, v24.4s, v0.4s`` instruction in line 25 reads from source register ``V0``, which is written by the ``fmla v0.4s, v8.4s, v16.4s`` instruction in line 5.
Read-after-write dependencies cannot be eliminated by register renaming, meaning that the corresponding result must be available before the dependent instruction can be executed.
Thus, this situation can become a bottleneck if we run out of independent instructions to keep the pipelines busy.
The number of cycles required before the result of an instruction is available to other instructions is called latency and is discussed in :numref:`sec:micro_bench_inst_lat`.
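A rough occupancy model shows why the subblock needs so many independent instructions. Assuming the 4-cycle FMLA latency and 4-per-cycle throughput given in the Neoverse V2 Software Optimization Guide, roughly latency times throughput independent instructions must be in flight to avoid stalls:

```python
# Back-of-the-envelope occupancy model, not a simulation.
# Figures are the Neoverse V2 FMLA (vector) numbers used in this chapter.
latency_cycles = 4  # cycles until an FMLA result is available
throughput_ipc = 4  # FMLA instructions issued per cycle

# Independent instructions needed in flight to keep all pipelines busy:
in_flight = latency_cycles * throughput_ipc
print(in_flight)  # 16

# Conversely, a single serial dependency chain is limited to:
serial_ipc = 1 / latency_cycles
print(serial_ipc)  # 0.25
```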

.. _tab:micro_bench_thr_neon_measured:

.. list-table:: Microbenchmarked throughput of various floating-point instructions on a core of NVIDIA's Grace CPU.
   :header-rows: 1

   * - Instruction
     - Inst / Cycle
     - GFLOPS
   * - :a64_isa:`FMLA (vector) <SIMD-FP-Instructions/FMLA--vector---Floating-point-fused-multiply-add-to-accumulator--vector-->`, ``4S``
     - 4.0
     - 105.7
   * - :a64_isa:`FMLA (vector) <SIMD-FP-Instructions/FMLA--vector---Floating-point-fused-multiply-add-to-accumulator--vector-->`, ``2S``
     - 4.0
     - 52.9
   * - :a64_isa:`FMUL (scalar) <FMUL--scalar---Floating-point-multiply--scalar-->`
     - TODO
     - TODO

Using the microbenchmark structure in :numref:`lst:micro_bench_thr_neon` to measure the execution throughput of FMLA (vector) and FMUL (scalar) on Grace (Neoverse V2), we obtain the performance shown in :numref:`tab:micro_bench_thr_neon_measured`.
In summary, our microbenchmarks are able to achieve the execution throughput we expected from theoretical considerations.

.. note::

   Another approach to writing the throughput benchmark could be to use the same source registers for the multiplicands in each instruction. For example, we can use ``V30`` and ``V31`` as source registers and accumulate into ``V0``-``V29``:
   
   .. code-block:: gas
   
      fmla  v0.4s, v30.4s, v31.4s
      fmla  v1.4s, v30.4s, v31.4s
      // [...]
      fmla v29.4s, v30.4s, v31.4s
   
   However, on Neoverse V2 (Grace) this leads to an unstable setup: every few runs, the performance is 2x lower than expected.
   Possibly only half of the available SIMD pipelines are used in the underperforming runs.
   Details are unclear at the time of writing, so the structure in :numref:`lst:micro_bench_thr_neon` is recommended.

.. _sec:micro_bench_inst_lat:

Execution Latency
=================

An instruction's *execution latency* is the minimum number of processor cycles between the time an instruction's microoperations are issued into the execution pipeline and the time its result is available -- via write-back or forwarding -- to a dependent instruction.
Again, as with execution throughput, we assume ideal conditions.

As discussed in :numref:`sec:micro_bench_ins_thr`, we need independent instructions to fully utilize all available execution pipelines.
On the other hand, if the instructions are interdependent, our workload can become latency-bound.
Suppose we have :math:`n` instructions :math:`I_1` to :math:`I_n` and each instruction depends on the previous one.
This means that we have the dependency chain :math:`I_1 \rightarrow I_2 \rightarrow \ldots \rightarrow I_{n-1} \rightarrow I_n`.
In this case, we lose core-level parallelism in two ways.
First, we can only use one execution pipeline at a time.
Second, pipeline parallelism is either completely inhibited or significantly reduced.
More specifically, if the instruction :math:`I_k` requires :math:`I_{k-1}` to write the result back to the register file, we cannot use pipeline parallelism.
However, if :math:`I_{k-1}` can share the required results with :math:`I_k` via late forwarding, then at least a limited amount of pipeline parallelism is possible.
*Late forwarding* lets an earlier microoperation forward its result directly to a dependent microoperation without waiting for register-file write-back.
We can now write microbenchmarks for both cases.

.. _lst:micro_bench_lat_neon_fmla_4s_wb:

.. literalinclude:: ../data_micro_bench/lat_neon_fmla_4s_wb.s
  :language: gas
  :lines: 63-108
  :dedent:
  :linenos:
  :caption: FMLA (vector), ``4S`` microbenchmark with dependencies on the first source register and destination register.

:numref:`lst:micro_bench_lat_neon_fmla_4s_wb` shows a microbenchmark loop for the write-back case using FMLA (vector) with arrangement specifier ``4S`` as an example.
The dependency chain is given by means of Neon register ``V0``.
All instructions use ``V0`` as their first source and destination register. 
In the case of Neoverse V2, Table 3-16 of the `Software Optimization Guide <https://developer.arm.com/documentation/109898/0300>`__ specifies an execution latency of 4 cycles for this case.
This means that we can expect to execute 0.25 instructions per cycle for the :download:`microbenchmark <../data_micro_bench/lat_neon_fmla_4s_wb.s>`.

.. _lst:micro_bench_lat_neon_fmla_4s_lf:

.. literalinclude:: ../data_micro_bench/lat_neon_fmla_4s_lf.s
  :language: gas
  :lines: 63-108
  :dedent:
  :linenos:
  :caption: FMLA (vector), ``4S`` microbenchmark with destination register dependency.

:numref:`lst:micro_bench_lat_neon_fmla_4s_lf` shows a microbenchmark loop for the forwarding case.
The dependency chain now exists only on the destination register, not on any of the source registers.
In the case of Neoverse V2, the microarchitecture supports late-forwarding of accumulate operands, and an execution latency of 2 cycles is specified in this case.
This means that we can expect to execute 0.5 instructions per cycle for the :download:`microbenchmark <../data_micro_bench/lat_neon_fmla_4s_lf.s>`.
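The expected latency-bound performance for both cases follows directly from the specified latencies. A short sketch, using the 3.3 GHz frequency assumed earlier:

```python
# Expected performance when a single FMLA dependency chain is the bottleneck.
clock_ghz      = 3.3  # assumed nominal frequency
flops_per_inst = 8    # FMLA (vector), 4S

# (case, execution latency in cycles), per the Neoverse V2 guide
for case, latency in [("write-back", 4), ("late forwarding", 2)]:
    inst_per_cycle = 1 / latency
    gflops = inst_per_cycle * flops_per_inst * clock_ghz
    print(f"{case}: {inst_per_cycle} inst/cycle, {gflops:.1f} GFLOPS")
```

This reproduces the 0.25 and 0.5 instructions per cycle expected for the two microbenchmarks.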

.. _tab:micro_bench_lat_neon_measured:

.. list-table:: Measured execution latencies of various floating-point instructions on a core of NVIDIA's Grace CPU.
   :header-rows: 1

   * - Instruction
     - Dependencies
     - Inst / Cycle
     - GFLOPS
   * - :a64_isa:`FMLA (vector) <SIMD-FP-Instructions/FMLA--vector---Floating-point-fused-multiply-add-to-accumulator--vector-->`, ``4S``
     - destination & first source
     - 0.25
     - 6.6
   * - :a64_isa:`FMLA (vector) <SIMD-FP-Instructions/FMLA--vector---Floating-point-fused-multiply-add-to-accumulator--vector-->`, ``4S``
     - destination
     - 0.5
     - 13.2

We can confirm the theoretical Neoverse V2 execution latencies by running the microbenchmarks.
:numref:`tab:micro_bench_lat_neon_measured` shows the measured latencies for the two cases.
We see that we can reproduce the theoretical execution latencies given in the Software Optimization Guide.

In particular, the 6.6 GFLOPS measured with a dependency chain through the destination and first source register is a 16-fold slowdown compared to the 105.6 GFLOPS obtained when all pipelines and pipeline parallelism are fully utilized.
In summary, we need to design our compute kernels very carefully to fully exploit the available parallelism of the target microarchitecture.