1. Introduction#
1.1. Structure#
Reading this book requires familiarity with C and C++, basic computer architecture concepts, and the Linux command line. The chapters are organized bottom-up and can be approached in different ways. Chapters Assembly Language and Base Instructions should be read first since they cover the basics of programming in assembly language for AArch64 systems.
The following chapters, Neon, Scalable Vector Extension, and Scalable Matrix Extension, cover vector and matrix instructions. These instructions perform the heavy lifting in a high-performance tensor program. At this point, two reading paths are possible:
- The minimal path covers the vector instructions in Neon and then jumps directly to Microbenchmarks, skipping the chapters on the Scalable Vector Extension (SVE) and the Scalable Matrix Extension (SME). This path covers the full compiler stack first and then returns to the more advanced instruction set extensions SVE and SME.
- The full ISA path reads the three chapters in order. This provides a broader foundation for the downstream compiler stack at the cost of slower progress.
Chapter Microbenchmarks analyzes AArch64 hardware and guides the design of the high-performance kernel operations discussed in Primitives. Next, Code Generation automates the generation of high-performance kernels from primitive specifications, providing a versatile building block for the tensor compiler stack. The following chapter, Tiled Execution IR, describes an intermediate representation for tensor operations that is used in the chapter on einsum trees to execute sequences of tensor operations.
1.2. Hardware#
The tensor compiler developed in this book relies on a few primitive operations that are translated by a code generator into highly optimized kernels. In fact, the first few chapters cover the code generators themselves for AArch64, the 64-bit execution state of the Arm architecture. This means that we can run our code on many recent smartphone, notebook, desktop, and server chips.
In general, the code generation concepts discussed here also apply to other types of computer architectures, such as x86 processors. However, these architectures are outside the scope of this book, so it is recommended to follow along on an AArch64 system. In addition, most descriptions assume the Linux operating system, which is therefore recommended for development.
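Whether a development machine meets these assumptions can be checked from the command line. The following is a minimal sketch using the standard `uname` utility; the exact strings reported vary by system:

```shell
# Query the machine architecture and the operating system.
# A suitable development system reports aarch64 and Linux;
# Apple Silicon running natively under macOS reports arm64 and Darwin instead.
echo "Architecture: $(uname -m)"
echo "OS: $(uname -s)"
```

On an AArch64 Linux machine this prints `Architecture: aarch64` and `OS: Linux`.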
| Vendor | Device | SoC/CPU | #Cores | Microarchitecture |
|---|---|---|---|---|
| Amazon | c8g | AWS Graviton4 | up to 96 | Arm Neoverse V2 |
| Google | GCE C4A | Google Axion | up to 72 | Arm Neoverse V2 |
| Microsoft | Azure Dpsv6 | Azure Cobalt 100 | up to 96 | Arm Neoverse N2 |
| NVIDIA | – | Grace CPU | 72 | Arm Neoverse V2 |
| Apple | Mac | M4 / M5 | 4+6 | Custom |
| Qualcomm | – | X2 Plus / Elite / Elite Extreme | 6–18 | 3rd Gen Qualcomm Oryon |
| Raspberry Pi Ltd | Raspberry Pi 5 | Broadcom BCM2712 | 4 | Arm Cortex-A76 |
Table 1.2.1 lists hardware platforms that are suitable for development. All of these platforms support Neon for the vector processing described in Section 4. However, only the Apple M4 and M5 chips support the Scalable Matrix Extension (SME) described in Section 6. Also note that running natively under macOS requires some changes to the assembly code structure and to the just-in-time code generation described in later chapters. Therefore, the easiest way to get started under macOS is to use virtualization for development, for example with Podman containers running Linux.
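On Linux, support for the vector and matrix extensions can be verified before diving in: the kernel exposes the CPU's feature flags in `/proc/cpuinfo`. A quick sketch, assuming the standard Linux hwcap flag names `sve`, `sve2`, and `sme`:

```shell
# Check the kernel-reported CPU feature flags for SVE and SME support.
# On AArch64 Linux these appear in the "Features" lines of /proc/cpuinfo;
# on a machine without the extension the flag is simply absent.
for ext in sve sve2 sme; do
  if grep -qw "$ext" /proc/cpuinfo; then
    echo "$ext: supported"
  else
    echo "$ext: not supported"
  fi
done
```

On a Raspberry Pi 5, for example, all three flags are expected to be absent, whereas a Graviton4 instance reports `sve` and `sve2` but not `sme`.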