.. _ch:intro:

Introduction
============

.. _sec:intro_structure:

Structure
---------

Reading this book requires familiarity with C and C++, basic computer architecture concepts, and the Linux command line.
The chapters are organized bottom-up and can be approached in different ways.
Chapters :ref:`ch:assembly_language` and :ref:`ch:base_instructions` should be read first since they cover the basics of programming in assembly language for AArch64 systems.

The following chapters, :ref:`ch:neon`, :ref:`ch:sve`, and :ref:`ch:sme`, cover vector and matrix instructions.
These instructions perform the heavy lifting in a high-performance tensor program.
At this point, two reading paths are possible:

* The *minimal path* covers the vector instructions in :ref:`ch:neon` and then jumps directly to :ref:`ch:micro_bench`, skipping the chapters on the Scalable Vector Extension (SVE) and the Scalable Matrix Extension (SME).
  This path covers the full compiler stack first and then returns to the more advanced instruction set extensions SVE and SME.
* The *full ISA path* reads the three chapters in order. This enables a broader foundation for the downstream compiler stack at the cost of slower progress.

Chapter :ref:`ch:micro_bench` analyzes AArch64 hardware and guides the design of the high-performance kernel operations discussed in :ref:`ch:primitives`.
Next, :ref:`ch:code_generation` automates the generation of high-performance kernels from primitive specifications, providing a versatile building block for the tensor compiler stack.
The following chapter, :ref:`ch:teir`, describes an intermediate representation for tensor operations that is used in the chapter on einsum trees to execute sequences of tensor operations.

.. _sec:intro_hardware:

Hardware
--------

The tensor compiler developed in this book relies on a small set of primitive operations that a code generator translates into highly optimized kernels.
The first chapters cover these code generators for AArch64, the 64-bit execution state of the Arm architecture.
This means that we can run our code on many recent smartphone, notebook, desktop, and server chips.

In general, the code generation concepts discussed here also apply to other types of computer architectures, such as x86 processors.
However, these architectures are outside the scope of this book, so it is recommended to follow along on an AArch64 system.
In addition, most descriptions assume the Linux operating system, which is therefore recommended for development.

.. _tab:intro_hardware:

.. list-table:: Recommended AArch64 development platforms.
  :header-rows: 1

  * - Vendor
    - Device
    - SoC/CPU
    - #Cores
    - Microarchitecture
  * - Amazon
    - c8g
    - AWS Graviton4
    - up to 96
    - Arm Neoverse V2
  * - Google
    - GCE C4A
    - Google Axion
    - up to 72
    - Arm Neoverse V2
  * - Microsoft
    - Azure Dpsv6
    - Azure Cobalt 100
    - up to 96
    - Arm Neoverse N2
  * - NVIDIA
    - --
    - Grace CPU
    - 72
    - Arm Neoverse V2
  * - Apple
    - Mac
    - M4 / M5
    - 4+6
    - Custom
  * - Qualcomm
    - --
    - X2 Plus / Elite / Elite Extreme
    - 6--18
    - 3rd Gen Qualcomm Oryon
  * - Raspberry Pi Ltd
    - Raspberry Pi 5
    - Broadcom BCM2712
    - 4
    - Arm Cortex-A76

:numref:`tab:intro_hardware` lists hardware platforms that are suitable for development.
All listed platforms support Neon for the vector processing described in :ref:`ch:neon`.
However, only the Apple M4 and M5 chips support the Scalable Matrix Extension (SME) described in :ref:`ch:sme`.
Also note that running natively under macOS requires some changes to the assembly code structure and to the just-in-time code generation described in this book.
Therefore, the easiest way to get started under macOS is to use virtualization for development, for example via `Podman <https://podman.io>`__ containers.
