.. _ch:assembly_language:

*****************
Assembly Language
*****************

Assembly language is a low-level language that is tied to specific classes of hardware.
In contrast, C and C++ are high-level languages.
Any system with an appropriate compiler can compile C/C++ programs into assembly code.
This is why C/C++ code is portable, while assembly code is not.
If we want to target a different core, say an x86-64 core instead of an AArch64 core, we simply recompile the C/C++ code to generate new assembly code.
Had we written the assembly code ourselves, we would have to start over.
So why do we care about assembly code? The answer is simple: performance.

This chapter covers the fundamentals of assembly language for the Arm architecture.
:numref:`ch:base_instructions` introduces base instructions, followed by :numref:`ch:neon`, which introduces vector instructions for the heavy lifting in tensor workloads.
We'll then use these basics to write a set of performance-critical primitives for our tensor compiler in :numref:`ch:primitives`.

.. _sec:assembly_aarch64:

AArch64
=======

.. _fig:assembly_ls_arch:

.. figure:: ../data_assembly/arch.svg
   :width: 25%

   Illustration of a load-store architecture.


Before jumping into assembly language, let us define a very simple model of a CPU core to help us understand how programs are executed in hardware.
Our core, shown in :numref:`fig:assembly_ls_arch`, has the following components:

Memory
  Large but slow storage for instructions and data.
Registers
  Fast but small storage for data.
Program Counter (PC)
  Location of the current instruction in memory.
Arithmetic Logic Unit (ALU)
  Circuit that performs arithmetic and logical operations on binary numbers.

We also assume that we are programming a *load-store architecture*.
This means that the CPU core strictly separates data movement instructions from data processing instructions.
Data movement instructions are responsible for copying data from memory to registers (load) and from registers to memory (store).
Data processing instructions only modify data in registers, never data in memory.
In this book, we will learn how to program AArch64, the 64-bit execution state of the Arm architecture. 
AArch64 is a load-store architecture and each instruction has a fixed length of 32 bits or 4 bytes.

In a very simplified way, the core executes instructions using the following procedure:

1. Fetch the instruction from memory at the address held in the PC.
2. Increment the PC by 4, so that it now holds the address of the next instruction.
3. Execute the instruction (possibly changing the PC).
4. Repeat.

.. admonition:: Example

   Suppose we want to write a simple function that adds two eight-bit binary numbers.
   The numbers are initially located in memory, and the result should also be located in memory after our function has finished.
   We could write an assembly program that contains the following instructions:

   1. Load the first number from memory into a first register.
   2. Load the second number from memory into a second register.
   3. Add the data in the first and second registers and write the result into a third register.
   4. Store the data in the third register to memory.
  
   Assume that the values of the two numbers in memory are ``01010111`` and ``00010101`` and that our simple program is stored in memory starting at address ``my_prog``.
   Since memory is byte-addressable and each instruction is 4 bytes long, we would find the four instructions at addresses ``my_prog+0``, ``my_prog+4``, ``my_prog+8``, and ``my_prog+12``.
   By initializing the program counter to ``my_prog+0``, we can execute the program by repeating the above procedure four times:

   .. list-table::
     :header-rows: 1

     * - Action
       - PC
       - 1st Register
       - 2nd Register
       - 3rd Register
     * - Fetch first load instruction from memory at address ``my_prog+0``
       - ``my_prog+0``
       -
       -
       -
     * - Increment PC by 4
       - ``my_prog+4``
       -
       -
       -
     * - Execute load instruction
       - ``my_prog+4``
       - ``01010111``
       -
       -
     * - Fetch second load instruction from memory at address ``my_prog+4``
       - ``my_prog+4``
       - ``01010111``
       -
       -
     * - Increment PC by 4
       - ``my_prog+8``
       - ``01010111``
       -
       -
     * - Execute load instruction
       - ``my_prog+8``
       - ``01010111``
       - ``00010101``
       -
     * - Fetch add instruction from memory at address ``my_prog+8``
       - ``my_prog+8``
       - ``01010111``
       - ``00010101``
       -
     * - Increment PC by 4
       - ``my_prog+12``
       - ``01010111``
       - ``00010101``
       -
     * - Execute add instruction
       - ``my_prog+12``
       - ``01010111``
       - ``00010101``
       - ``01101100``
     * - Fetch store instruction from memory at address ``my_prog+12``
       - ``my_prog+12``
       - ``01010111``
       - ``00010101``
       - ``01101100``
     * - Increment PC by 4
       - ``my_prog+16``
       - ``01010111``
       - ``00010101``
       - ``01101100``
     * - Execute store instruction
       - ``my_prog+16``
       - ``01010111``
       - ``00010101``
       - ``01101100``

   At this point, we have achieved our goal, as ``01101100``, the result of adding ``01010111`` and ``00010101``, is now located in memory.
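
The fetch-increment-execute procedure can be mimicked in a few lines of C.
The following is only a sketch of our simplified core model, not of real AArch64 hardware: the instruction format ``struct toy_instr``, the opcode names, and the three-byte data memory are assumptions made for illustration.
The PC is counted in bytes relative to ``my_prog``, and each toy instruction is assumed to occupy 4 bytes.

.. code-block:: c

   #include <stdint.h>

   /* Hypothetical opcodes of a toy load-store core (not AArch64 encodings). */
   enum toy_op { TOY_LOAD, TOY_ADD, TOY_STORE };

   /* One toy instruction: operation, registers, and a data-memory address. */
   struct toy_instr {
     enum toy_op op;
     int rd;     /* register written by LOAD/ADD, register read by STORE */
     int rn, rm; /* source registers of ADD */
     int addr;   /* data-memory address used by LOAD/STORE */
   };

   /* Runs n_instr instructions on the data memory mem and returns the
      final value of the PC (in bytes, relative to my_prog). */
   uint64_t toy_run(const struct toy_instr *prog, int n_instr, uint8_t *mem) {
     uint8_t regs[3] = {0, 0, 0};
     uint64_t pc = 0;
     while (pc < 4u * (uint64_t)n_instr) {
       struct toy_instr ins = prog[pc / 4]; /* 1. fetch the instruction at the PC */
       pc += 4;                             /* 2. increment the PC by 4 */
       switch (ins.op) {                    /* 3. execute the instruction */
         case TOY_LOAD:  regs[ins.rd] = mem[ins.addr]; break;
         case TOY_ADD:   regs[ins.rd] = (uint8_t)(regs[ins.rn] + regs[ins.rm]); break;
         case TOY_STORE: mem[ins.addr] = regs[ins.rd]; break;
       }
     }                                      /* 4. repeat */
     return pc;
   }

   /* The four-instruction example: load, load, add, store. */
   uint8_t toy_example(void) {
     uint8_t mem[3] = {0x57 /* 01010111 */, 0x15 /* 00010101 */, 0};
     const struct toy_instr prog[4] = {
       {TOY_LOAD,  0, 0, 0, 0}, /* 1st register <- mem[0] */
       {TOY_LOAD,  1, 0, 0, 1}, /* 2nd register <- mem[1] */
       {TOY_ADD,   2, 0, 1, 0}, /* 3rd register <- 1st + 2nd */
       {TOY_STORE, 2, 0, 0, 2}, /* mem[2] <- 3rd register */
     };
     toy_run(prog, 4, mem);
     return mem[2];
   }

Calling ``toy_example()`` returns ``0x6c``, which is ``01101100`` in binary, and ``toy_run`` returns 16 for this program, matching the final PC value ``my_prog+16`` in the table of the example.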

GNU Assembly Syntax
===================
.. _lst_hello_world_c:

.. literalinclude:: ../data_assembly/hello_world.c
  :language: C
  :caption: Simple C program that prints "Hello World!". The source code is stored in the file ``hello_world.c``.

We start to learn about assembly language by looking at some compiler-generated code.
Let us take the simple example program in :numref:`lst_hello_world_c`.
We can compile this program into assembly code using the `GNU Compiler Collection (GCC) <https://gcc.gnu.org>`__ with the following command:

.. code-block:: bash

  gcc -S hello_world.c

Alternatively, we can do the same with the `Clang Compiler <https://clang.llvm.org>`__ using the command:

.. code-block:: bash

  clang -S hello_world.c

Both compilers will produce an assembly file called ``hello_world.s``.
Since compilation depends on many parameters, we will typically get different results on different systems and with different compilers.

We can also write assembly code directly.
Assembly language is the human-readable language that is closest to hardware.
It has different flavors and syntaxes.
We use the `GNU Assembly Syntax <https://sourceware.org/binutils/docs/as/>`__ (GAS), which is also the default when the GCC or Clang compiler generates assembly code.

.. _lst_hello_world_gas:

.. literalinclude:: ../data_assembly/hello_world.s
  :language: gas
  :caption: Example assembly program stored in the file ``hello_world.s``.

On an AArch64 Linux system, we would write assembly code similar to that shown in :numref:`lst_hello_world_gas` for our "Hello World!" example.
In summary, the assembly program uses four key building blocks:

`Assembler Directives <https://sourceware.org/binutils/docs/as/Pseudo-Ops.html>`__
  * Names begin with a period ``.``.
  * Define symbols, allocate memory, control the assembly process.
  * There are many directives available, but we only need a few.
  * Directives in the example: `.section <https://sourceware.org/binutils/docs/as/Section.html>`__ (used with the section name ``.rodata``), `.asciz <https://sourceware.org/binutils/docs/as/Asciz.html>`__, `.text <https://sourceware.org/binutils/docs/as/Text.html>`__, `.global <https://sourceware.org/binutils/docs/as/Global.html>`__.

`Labels <https://sourceware.org/binutils/docs/as/Labels.html>`__
  * Label definitions are immediately followed by a colon ``:``.
  * Allow us to abstractly refer to locations in an assembly program, such as the location of a function.
  * Labels in the example: ``my_msg``, ``main``.

:a64_isa:`Assembly Instructions <>`
  * An instruction performs a specific operation.
  * May be broken down by the hardware into multiple micro-ops (µOPs).
  * Instructions in the example:

    1. :a64_isa:`STP: Store pair of registers <Base-Instructions/STP--Store-pair-of-registers->`.
    2. :a64_isa:`ADR: Form PC-relative address <Base-Instructions/ADR--Form-PC-relative-address->`.
    3. :a64_isa:`BL: Branch with link <Base-Instructions/BL--Branch-with-link->`.
    4. :a64_isa:`MOV (wide immediate): Move wide immediate value <Base-Instructions/MOV--wide-immediate---Move-wide-immediate-value--an-alias-of-MOVZ->`.
    5. :a64_isa:`LDP: Load pair of registers <Base-Instructions/LDP--Load-pair-of-registers->`.
    6. :a64_isa:`RET: Return from subroutine <Base-Instructions/RET--Return-from-subroutine->`.

`Comments <https://sourceware.org/binutils/docs/as/Comments.html>`__
  * Help us understand the assembly code.
  * Ignored by the assembler.
  * Follow C-style syntax.
  * Use them extensively; assembly code is not self-explanatory.

.. _sec:assembly_assembler:

Assembler
=========
The previous section discussed the basic structure of assembly code.
Now we will assemble the program in :numref:`lst_hello_world_gas`, that is, translate it so that an AArch64 machine can run it.

We can invoke the GNU assembler with the following command:

.. code-block:: bash

   as hello_world.s -o hello_world.o

The resulting file ``hello_world.o`` contains binary data.
We can display its contents as hexadecimal text by using `od <https://www.gnu.org/software/coreutils/manual/html_node/od-invocation.html>`__:

.. code-block:: bash

   od -A x -t x1 hello_world.o > hello_world.hex

where ``-A x`` specifies hexadecimal addresses and ``-t x1`` specifies the output format of the data as hexadecimal with one byte per column.

.. _lst_hello_world_hex:

.. literalinclude:: ../data_assembly/gen/hello_world.hex
  :language: hexdump
  :lines: 1-10
  :caption: Hexadecimal dump ``hello_world.hex`` of the file ``hello_world.o`` generated by the tool ``od``.

The first 10 lines of the resulting ASCII file will be similar to those shown in :numref:`lst_hello_world_hex`.
At first glance, this may not appear helpful.
We make sense of the bytes by examining the sections in the `ELF <https://man7.org/linux/man-pages/man5/elf.5.html>`__ file:

.. code-block:: bash

   readelf -S hello_world.o > hello_world.relf

The result shows us the location of each section in the assembled file.

.. _lst_hello_world_relf:

.. literalinclude:: ../data_assembly/gen/hello_world.relf
   :language: none
   :caption: Output of the tool ``readelf`` in the file ``hello_world.relf``.

For our "Hello World!" example, the output will be similar to that shown in :numref:`lst_hello_world_relf`.
The offset values in the last column indicate the location in the file, given in bytes.
Returning to our example program in :numref:`lst_hello_world_gas`, we are interested in the two sections ``.rodata`` and ``.text``.
Starting with ``.rodata``, we can see that the section starts at offset ``0x5c`` and has a size of ``0xd`` bytes.
Looking at the hexadecimal dump in :numref:`lst_hello_world_hex`, we find the following 13 bytes, which we can decode to `ASCII <https://en.cppreference.com/w/c/language/ascii>`__:

.. list-table::
  :widths: 20 20 25
  :header-rows: 1

  * - File Offset
    - Byte
    - ASCII Character
  * - 0x5c
    - 48
    - H
  * - 0x5d
    - 65
    - e
  * - 0x5e
    - 6c
    - l
  * - 0x5f
    - 6c
    - l
  * - 0x60
    - 6f
    - o
  * - 0x61
    - 20
    - (space)
  * - 0x62
    - 57
    - W
  * - 0x63
    - 6f
    - o
  * - 0x64
    - 72
    - r
  * - 0x65
    - 6c
    - l
  * - 0x66
    - 64
    - d
  * - 0x67
    - 21
    - !
  * - 0x68
    - 00
    - NUL (null)

We found the null-terminated string ``Hello World!`` in the ``.rodata`` section.
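
The manual decoding in the table can be reproduced with a short C sketch.
The byte values are copied from the table above; the array name ``rodata_bytes`` and the two helper functions are hypothetical names chosen for this illustration, and the sketch assumes that the C implementation uses ASCII for its character set.

.. code-block:: c

   #include <string.h>

   /* The 13 bytes of the .rodata section, copied from the table above. */
   static const unsigned char rodata_bytes[13] = {
     0x48, 0x65, 0x6c, 0x6c, 0x6f, 0x20,
     0x57, 0x6f, 0x72, 0x6c, 0x64, 0x21, 0x00
   };

   /* Interprets the section contents as a NUL-terminated C string. */
   const char *rodata_string(void) {
     return (const char *)rodata_bytes;
   }

   /* Number of characters before the terminating NUL byte. */
   size_t rodata_strlen(void) {
     return strlen(rodata_string());
   }

Because the last byte is ``00``, the array can be handed directly to any function that expects a C string: ``rodata_string()`` yields ``Hello World!`` and ``rodata_strlen()`` yields 12, one less than the section size of ``0xd`` bytes.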

Next we look at the ``.text`` section.
This section starts at offset ``0x40`` and has a size of ``0x1c`` bytes.
From :numref:`sec:assembly_aarch64` we already know that each instruction is 32 bits or 4 bytes long.
So the ``0x1c`` bytes correspond to 7 instructions, which is what we expect.
The assembler has translated the human-readable mnemonics into machine code.
The instructions are `encoded <https://developer.arm.com/documentation/102376/0200/Alignment-and-endianness/Endianness>`__ as little-endian, which means that the least significant byte is stored first.
The encoding of instructions is :a64_isa:`defined <Base-Instructions/STP--Store-pair-of-registers->` in the :a64_isa:`instruction set architecture <>` and will be discussed in the following chapters.

.. admonition:: Example

   The first instruction in the sample code in :numref:`lst_hello_world_gas` is ``stp x29, x30, [sp, -16]!``, which is ``fd 7b bf a9`` in the hexdump shown in :numref:`lst_hello_world_hex`.
   Being aware of the little-endian encoding of the instructions, we can convert the hexadecimal value to binary, which in the case of ``0xa9bf7bfd`` is ``0b1010'1001'1011'1111'0111'1011'1111'1101``.
   This is the 64-bit variant of the :a64_isa:`pre-index encoding of STP <Base-Instructions/STP--Store-pair-of-registers-?lang=en#iclass_pre_index>` in the instruction set architecture, encoded as: ``1010 1001 10 imm7 Rt2 Rn Rt`` where ``imm7`` is a 7-bit immediate value and ``Rt2``, ``Rn``, and ``Rt`` are register IDs.
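
The decoding steps of the example can also be written down in C.
The sketch below first assembles the 32-bit word from the four little-endian bytes and then extracts the fields of the pattern ``1010 1001 10 imm7 Rt2 Rn Rt``.
The struct and function names are hypothetical; the sketch additionally assumes that, in the 64-bit variant, the signed 7-bit immediate is scaled by 8 bytes.

.. code-block:: c

   #include <stdint.h>

   /* Fields of the 64-bit pre-index STP pattern: 1010 1001 10 imm7 Rt2 Rn Rt. */
   struct stp_pre_fields {
     uint32_t fixed;  /* bits 31..22, expected to be 0b1010100110 = 0x2a6 */
     int32_t offset;  /* sign-extended imm7, scaled by 8 bytes */
     uint32_t rt2;    /* bits 14..10 */
     uint32_t rn;     /* bits 9..5, register ID 31 denotes SP here */
     uint32_t rt;     /* bits 4..0 */
   };

   /* Assembles a 32-bit word from four bytes stored in little-endian order. */
   uint32_t word_from_le_bytes(const uint8_t b[4]) {
     return (uint32_t)b[0] | ((uint32_t)b[1] << 8) |
            ((uint32_t)b[2] << 16) | ((uint32_t)b[3] << 24);
   }

   /* Splits an instruction word into the fields of the pattern above. */
   struct stp_pre_fields decode_stp_pre(uint32_t word) {
     struct stp_pre_fields f;
     int32_t imm7 = (int32_t)((word >> 15) & 0x7f);
     if (imm7 & 0x40) {
       imm7 -= 128; /* sign-extend the 7-bit immediate */
     }
     f.fixed = word >> 22;
     f.offset = imm7 * 8;
     f.rt2 = (word >> 10) & 0x1f;
     f.rn = (word >> 5) & 0x1f;
     f.rt = word & 0x1f;
     return f;
   }

For the bytes ``fd 7b bf a9``, ``word_from_le_bytes`` returns ``0xa9bf7bfd``, and ``decode_stp_pre`` reports an offset of -16 with register IDs 30 (``Rt2``), 31 (``Rn``), and 29 (``Rt``), which agrees with ``stp x29, x30, [sp, -16]!``.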

Manually looking up the machine code of instructions in an ELF file is tedious.
We can use the `objdump <https://man7.org/linux/man-pages/man1/objdump.1.html>`__ tool to automate this process and *disassemble* the binary file:

.. code-block:: bash

   objdump --syms -S -d hello_world.o > hello_world.dis

This will create a ``hello_world.dis`` file containing the disassembled code.

.. _lst_hello_world_dis:

.. literalinclude:: ../data_assembly/gen/hello_world.dis
   :language: none
   :lines: 2-
   :caption: Output of the tool ``objdump`` when applied to ``hello_world.o``.

Sample output for our "Hello World!" example is shown in :numref:`lst_hello_world_dis`.
As we can see, ``objdump`` shows the mnemonics for the machine code.

.. note::

   Each instruction in assembly code has exactly one representation as 32 machine code bits. However, in some cases, different mnemonics can result in the same machine code. In such cases ``objdump`` shows the *preferred disassembly* when disassembling the machine code. One such case is :a64_isa:`CMP (extended register) <Base-Instructions/CMP--extended-register---Compare--extended-register---an-alias-of-SUBS--extended-register-->`, which is an alias of :a64_isa:`SUBS (extended register) <Base-Instructions/SUBS--extended-register---Subtract-extended-and-scaled-register--setting-flags->`, where CMP is the preferred disassembly.

However, the object file cannot be executed yet because it contains undefined symbols.
For example, the ``puts`` function is marked as ``*UND*`` in the symbol table.
This is because `puts <https://www.man7.org/linux/man-pages/man3/puts.3.html>`__ is provided in libc, the standard library for the C language.
We still need to use a linker, the topic of the next section, to make it available.

Linker
======
The goal of this section is to create an executable from the object file ``hello_world.o`` that we created in :numref:`sec:assembly_assembler`.
To do this, we need to perform a step called *linking*.
Linking includes, for example, combining object files, resolving dependencies, and assigning addresses.
We can choose from several linkers to perform this step, including the GNU linker `ld <https://www.man7.org/linux/man-pages/man1/ld.1.html>`__ or the LLVM linker `lld <https://lld.llvm.org>`__.
However, manually linking with ``ld`` or ``lld`` requires us to perform some extra steps, such as writing startup and exit code that communicates with Linux through system calls.
Instead, we will use a wrapper that performs the extra steps automatically.

Using the GNU Compiler Collection, we can create our executable by simply using ``gcc``:

.. code-block:: bash

   gcc -o hello_world hello_world.o

Similarly, using the compiler and tooling infrastructure Clang, we can do the following:

.. code-block:: bash

   clang -o hello_world hello_world.o

Both will create the executable ``hello_world`` that we can run:

.. code-block:: bash

   ./hello_world

We see that the string "Hello World!" is printed on the command line.
We have just successfully executed our first assembly program.

If we look at the symbol table of the executable using ``objdump --syms hello_world``, we see many additional entries. Before, we had the ``*UND*`` entry for ``puts``.
Now, a glibc version is mentioned in the name, e.g., ``puts@GLIBC_2.17``, but the section is still given as ``*UND*``. This is because dynamic linking is the default behavior for external libraries in most cases.
The respective symbols are resolved by the dynamic linker at runtime of the program, which also loads the required dynamic libraries.
If we want the implementation of ``puts`` to be part of our executable, we can opt for *static linking*.
Using GCC, this can be achieved with the ``-static`` flag:

.. code-block:: bash

   gcc -static -o hello_world hello_world.o

Looking at the symbol table again, we now see that the implementation of ``puts`` is included in our executable:

.. literalinclude:: ../data_assembly/gen/hello_world_link_static.dis
  :language: none
  :lines: 1200-1200

Registers
=========
The `AArch64 Application Level Programmers' Model <https://developer.arm.com/documentation/ddi0487/>`__ has thirty-one *general-purpose registers* (GPRs) that are visible to the A64 instruction set: ``R0``-``R30``.
The GPRs are also called *integer registers*.
They can be accessed as 64-bit X registers or as 32-bit W registers.

.. _fig:assembly_gprs:

.. figure:: ../data_assembly/gprs.svg
   :width: 40%

   Illustration of the thirty-one general-purpose registers ``R0``-``R30`` visible to the A64 instruction set, some special-purpose registers and the PSTATE fields.

As shown in :numref:`fig:assembly_gprs`, the lower 32 bits of the X registers overlap with the W registers.
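
The relationship between the two views can be modeled with plain integer arithmetic.
The following sketch uses hypothetical helper names and treats a register as a 64-bit value.
Beyond what the figure shows, it relies on one additional rule of the architecture: a write to a W register zeroes the upper 32 bits of the corresponding X register.

.. code-block:: c

   #include <stdint.h>

   /* Value read through the W view of a register whose X view holds x. */
   uint32_t w_view(uint64_t x) {
     return (uint32_t)(x & 0xffffffffu); /* lower 32 bits */
   }

   /* X view after writing w through the W view.
      The old contents x_old do not survive: bits 63..32 become zero. */
   uint64_t x_after_w_write(uint64_t x_old, uint32_t w) {
     (void)x_old; /* intentionally unused, kept to make the overwrite explicit */
     return (uint64_t)w;
   }

For example, if the X view holds ``0x1122334455667788``, then ``w_view`` returns ``0x55667788``, and writing ``0xdeadbeef`` through the W view leaves ``0x00000000deadbeef`` in the X view.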

In addition to the general-purpose registers, AArch64 defines a number of *special-purpose registers* and *vector registers*.
We will discuss the vector registers as part of the :ref:`ch:neon` chapter.
The most important abstractions are the following:

Link Register (LR)
  The link register is simply another name for the general-purpose register ``X30``.
  It stores the return address when a function is called.
  This means that after a function call, we return to the calling scope by jumping to the address in the link register.
Zero Register (ZR)
  The zero register can be used in some instructions.
  It is always read as zero, and writes to the register are simply ignored.
  Note that the AArch64 programmers' model assumes that there is a zero register.
  However, this does not imply that the register exists in hardware as a physical register; typically it will not.
Stack Pointer (SP)
  Dedicated 64-bit stack pointer register that holds the address where the stack currently ends.
  Note that the stack grows with decreasing virtual addresses, i.e., we have to decrement the stack pointer to allocate memory.
Program Counter (PC)
  64-bit program counter that holds the address of the current instruction.
  We cannot write directly to the program counter, but some instructions modify it.
  We will use the program counter for branching, for example, when implementing loops or conditional code.
Process State (PSTATE)
  The process state (PSTATE) is an abstraction that holds process state information.
  Of particular interest to us are the NZCV condition flags that are part of this state.
  Simply put, we will use them to store the result of comparisons in conditional code execution, and then jump conditionally based on these flags (see :ref:`sec:base_branches` section).

.. _sec:assembly_pcs:

Procedure Call Standard
=======================
The Procedure Call Standard (PCS) used by the Application Binary Interface (ABI) `defines <https://github.com/ARM-software/abi-aa/blob/main/aapcs64/aapcs64.rst#general-purpose-registers>`__ the role of the GPRs in function calls.
GPRs ``R0``-``R7`` are used to pass arguments to a function and to return values.
Registers ``R8``-``R17`` and ``R19``-``R28`` can be used as scratch registers, although ``R19``-``R28`` must be preserved for the caller as described below.
``R18`` and ``R29`` can be used as temporary registers in some cases, but it is advisable to avoid using them altogether.
Apple platforms are one example with strict requirements:

  Apple platforms adhere to the following choices:
    * The platforms reserve register x18. Don't use this register.
    * The frame pointer register (x29) must always address a valid frame record.

  -- `Apple: Writing ARM64 code for Apple platforms <https://developer.apple.com/documentation/xcode/writing-arm64-code-for-apple-platforms>`__

The PCS provides the rules for calling functions of others and for writing functions that are called by others.

In the first case, we are the *caller* of the other function.
According to the PCS, it is our responsibility to save any intermediate data held in *caller-saved registers* before calling the function.
In other words: The called function may overwrite the data in any caller-saved registers.
Then, after the function returns to our scope, we must restore the data.
In the PCS, registers ``R0``-``R18`` and ``R30`` are caller-saved.

In the second case, our function is the *callee*.
Again, according to the PCS, we must preserve the data in a set of registers.
These are called *callee-saved registers*.
So, if we plan to overwrite the data in a callee-saved register, we must save the intermediate data before modifying the register.
Then, before jumping back to the caller's scope, we must restore the data, since the caller may rely on it being preserved.
According to the PCS, registers ``R19``-``R29`` must be saved by the callee.
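
The register roles stated in this section fit into a small lookup function.
The sketch below only encodes the ranges given in the text; the enum and function names are hypothetical, and the special treatment of ``R18`` and ``R29`` on some platforms is deliberately left out.

.. code-block:: c

   /* Roles of the general-purpose registers as summarized in this section. */
   enum pcs_role { PCS_CALLER_SAVED, PCS_CALLEE_SAVED, PCS_INVALID };

   /* Classifies register R<r> for r in 0..30. */
   enum pcs_role pcs_role_of(int r) {
     if (r < 0 || r > 30) {
       return PCS_INVALID;
     }
     if (r >= 19 && r <= 29) {
       return PCS_CALLEE_SAVED; /* R19-R29 */
     }
     return PCS_CALLER_SAVED;   /* R0-R18 and R30 */
   }

   /* R0-R7 pass arguments and return values. */
   int pcs_is_argument_register(int r) {
     return r >= 0 && r <= 7;
   }

A function that wants to keep a value in ``R9`` across a call therefore has to save it itself, while a value in ``R19`` may be assumed to survive the call.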

In practice, when writing assembly code, we can use the stack to temporarily store the contents of registers.
So, when implementing a function, we would first identify all the registers that our function modifies and that we need to preserve.
Then, at the beginning of the function, we save the contents of all these registers on the stack.
Now we can proceed with implementing the function and use these registers freely without worrying about the PCS.
When we are finished, just before jumping back to the scope of the caller, we simply restore the contents of the previously saved registers by loading them from the stack.

.. _lst_assembly_pcs_gprs:

.. literalinclude:: ../data_assembly/pcs_gprs.s
  :language: gas
  :linenos:
  :caption: Example assembly function that sets the frame pointer register and temporarily stores registers ``X19``-``X30`` on the stack.

:numref:`lst_assembly_pcs_gprs` shows an example implementation that adheres to the PCS.
Specifically, we use the :a64_isa:`pre-index encoding <Base-Instructions/STP--Store-pair-of-registers-?lang=en#iclass_pre_index>` of STP to copy the contents of registers ``X19``-``X30`` to the stack.
The STP instructions first decrement the stack pointer by 16, thus allocating 16 bytes on the stack, before storing the 16 bytes of data distributed across two X registers to the stack.

.. note::

   We will discuss base instructions, including STP and LDP, in more detail in :numref:`ch:base_instructions`.

LDP's :a64_isa:`post-index encoding <Base-Instructions/LDP--Load-pair-of-registers-?lang=en#iclass_post_index>` has the opposite effect.
First, each instruction loads 16 bytes from the stack into two X registers.
Then the stack pointer is incremented by 16, effectively freeing the memory.
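
The interplay of the two addressing modes can be modeled in C with a byte array that plays the role of the stack.
This is a sketch under simplifying assumptions: the type ``struct toy_stack``, its 256-byte size, and all function names are hypothetical, the stack pointer is an offset into the array rather than a virtual address, and there are no bounds checks.

.. code-block:: c

   #include <stdint.h>
   #include <string.h>

   /* Hypothetical toy stack: a byte array plus a stack pointer offset. */
   struct toy_stack {
     uint8_t mem[256];
     uint64_t sp; /* offset into mem; the stack grows toward lower offsets */
   };

   void toy_stack_init(struct toy_stack *s) {
     memset(s->mem, 0, sizeof s->mem);
     s->sp = sizeof s->mem;
   }

   /* Models "stp a, b, [sp, -16]!": decrement SP by 16, then store 16 bytes. */
   void toy_stp_pre(struct toy_stack *s, uint64_t a, uint64_t b) {
     s->sp -= 16;
     memcpy(&s->mem[s->sp], &a, 8);
     memcpy(&s->mem[s->sp + 8], &b, 8);
   }

   /* Models "ldp a, b, [sp], 16": load 16 bytes, then increment SP by 16. */
   void toy_ldp_post(struct toy_stack *s, uint64_t *a, uint64_t *b) {
     memcpy(a, &s->mem[s->sp], 8);
     memcpy(b, &s->mem[s->sp + 8], 8);
     s->sp += 16;
   }

   /* Saves two values, overwrites them, restores them, and reports whether
      the values and the stack pointer are back in their initial state. */
   int toy_save_restore_ok(void) {
     struct toy_stack s;
     uint64_t x19 = 19, x20 = 20;
     uint64_t sp_before;
     int allocated;
     toy_stack_init(&s);
     sp_before = s.sp;
     toy_stp_pre(&s, x19, x20);
     allocated = (s.sp == sp_before - 16);
     x19 = x20 = 0; /* the function body may overwrite the registers */
     toy_ldp_post(&s, &x19, &x20);
     return allocated && x19 == 19 && x20 == 20 && s.sp == sp_before;
   }

``toy_save_restore_ok()`` returns 1: the store allocates 16 bytes by lowering the stack pointer, and the matching load restores both values and releases the memory again.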

The two instructions in lines 6 and 8 create the stack frame. First, in line 6, the data of the frame pointer register (``X29``) and the link register (``X30``) are stored to the stack.
Then, the :a64_isa:`MOV (to/from SP) <Base-Instructions/MOV--to-from-SP---Move-register-value-to-or-from-SP--an-alias-of-ADD--immediate-->` instruction in line 8 copies the data from the stack pointer to the frame pointer register (``X29``), as required on Apple platforms, for example.

Note that we also temporarily store the data of the frame pointer register (``X29``) and the link register (``X30``) on the stack.
Although ``FP`` (``X29``) and ``LR`` (``X30``) are typically not used directly in function implementations, their values must be preserved because they may be modified during nested function calls.
In this case, we must restore them after the function calls.

.. _sec:assembly_cpp_drivers:

C/C++ Interoperability
======================

The following chapters discuss efficient assembly functions for performance-critical parts of our tensor compiler.
We will write all other parts of the compiler in high-level languages, with the layers directly on top of the assembly code realized in C/C++.
Fortunately, calling a function written in assembly language from C or C++ is simple.

.. _lst_assembly_inter_my_func:

.. literalinclude:: ../data_assembly/my_func.s
  :language: gas
  :caption: Example assembly function that increments the value in register ``X0``.

:numref:`lst_assembly_inter_my_func` shows an assembly implementation of the function ``int64_t my_func(int64_t)``.
In chapter :ref:`ch:base_instructions`, we will see that the function simply increments the passed 64-bit value and returns it.
To call ``my_func``, we have to make the compiler aware of its signature.
Following the PCS, the compiler can then infer how to call the function.

.. _lst_assembly_inter_driver_c:

.. literalinclude:: ../data_assembly/driver.c
  :language: C
  :linenos:
  :caption: Example C driver program that calls the function ``my_func``.

:numref:`lst_assembly_inter_driver_c` illustrates this procedure for an example C driver.
Line 4 makes the compiler aware of the signature of ``my_func`` that is called in line 8.

We can invoke the assembler to create ``my_func.o`` from ``my_func.s``, compile ``driver.c`` to obtain ``driver.o``, and generate the executable program ``my_prog.exe`` from the two object files:

.. code-block:: bash

   as my_func.s -o my_func.o
   gcc -c driver.c -o driver.o
   gcc driver.o my_func.o -o my_prog.exe

Executing the program leads to the expected result:

.. code-block:: bash

   ./my_prog.exe
   Result: 6

Calling our assembly function from C++ is also straightforward.
However, in this case, we must be aware of `name mangling in C++ <https://en.wikipedia.org/wiki/Name_mangling#C++>`__.

.. _lst_assembly_inter_driver_name_mangling_cpp:

.. literalinclude:: ../data_assembly/driver_name_mangling.cpp
  :language: C++
  :caption: Example C++ driver program that is subject to name mangling by the compiler. The program is stored in the file ``driver_name_mangling.cpp``.

:numref:`lst_assembly_inter_driver_name_mangling_cpp` provides a simple C++ driver that we can compile using ``g++ -S driver_name_mangling.cpp``.

.. _lst_assembly_inter_driver_name_mangling_cpp_s:

.. literalinclude:: ../data_assembly/gen/driver_name_mangling_cpp.s
  :language: gas
  :linenos:
  :lineno-match:
  :lines: 8-23
  :caption: Excerpt of the assembly code in ``driver_name_mangling.s`` generated by g++ when compiling ``driver_name_mangling.cpp``.

The generated code in line 21 of :numref:`lst_assembly_inter_driver_name_mangling_cpp_s` shows that the compiler calls the function using the name-mangled name ``_Z7my_funcl``.
Our assembly function in :numref:`lst_assembly_inter_my_func`, however, uses the unmangled name ``my_func``, so the linker cannot match the two symbols.

.. _lst_assembly_inter_driver_cpp:

.. literalinclude:: ../data_assembly/driver.cpp
  :language: C++
  :linenos:
  :caption: Example C++ driver ``driver.cpp`` that instructs the compiler to assume C linkage for ``my_func``.

The `extern "C" <https://learn.microsoft.com/en-us/cpp/cpp/extern-cpp#extern-c-and-extern-c-function-declarations>`__ syntax in line 4 of :numref:`lst_assembly_inter_driver_cpp` solves the problem by specifying that the function has C linkage.
We compile the driver using ``g++ -S driver.cpp`` and inspect the generated assembly code.

.. _lst_assembly_inter_driver_cpp_s:

.. literalinclude:: ../data_assembly/gen/driver_cpp.s
  :language: gas
  :linenos:
  :lineno-match:
  :lines: 8-23
  :caption: Excerpt of the assembly code in ``driver.s`` generated by g++ when compiling ``driver.cpp``.

As shown in line 21 of :numref:`lst_assembly_inter_driver_cpp_s`, the compiler now calls the function ``int64_t my_func(int64_t)`` using the name ``my_func`` matching our assembly code in :numref:`lst_assembly_inter_my_func`.
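
If the same declaration is to be shared by C and C++ drivers, a common idiom is to guard the ``extern "C"`` block with the ``__cplusplus`` macro, which only C++ compilers define.
The following sketch assumes a hypothetical shared header, here imagined as ``my_func.h``, which is not part of the example files.
To keep the sketch compilable on its own, it also contains a stand-in for ``my_func`` written in C; when linking against ``my_func.o``, this stand-in has to be removed.

.. code-block:: c

   #include <stdint.h>

   /* Hypothetical contents of a shared header, e.g., my_func.h. */
   #ifdef __cplusplus
   extern "C" {
   #endif

   /* Declaration with C linkage, visible to both C and C++ drivers. */
   int64_t my_func(int64_t value);

   #ifdef __cplusplus
   }
   #endif

   /* Stand-in written in C so that the sketch links without my_func.o.
      It mirrors the described behavior of the assembly version:
      increment the passed value and return it. */
   int64_t my_func(int64_t value) {
     return value + 1;
   }

A C compiler skips the guarded lines and sees a plain declaration, whereas a C++ compiler sees the declaration inside ``extern "C"`` and therefore refers to the symbol ``my_func`` without name mangling.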

In summary, we can test our assembly function using the C++ driver in :numref:`lst_assembly_inter_driver_cpp` by generating and executing the program ``my_prog.exe``:

.. code-block:: bash

   as my_func.s -o my_func.o
   g++ -c driver.cpp -o driver.o
   g++ driver.o my_func.o -o my_prog.exe
   ./my_prog.exe