1. Introduction#
1.1. Structure#
Reading this book requires familiarity with C and C++, basic computer architecture concepts, and the Linux command line. The chapters are organized bottom-up and can be approached in different ways. Chapters Assembly Language and Base Instructions should be read first since they cover the basics of programming in assembly language for AArch64 systems.
The following chapters, Neon, Scalable Vector Extension, and Scalable Matrix Extension, cover vector and matrix instructions. These instructions perform the heavy lifting in a high-performance tensor program. At this point, two reading paths are possible:
- The minimal path covers the vector instructions in Neon and then jumps directly to Microbenchmarks, skipping the chapters on the Scalable Vector Extension (SVE) and the Scalable Matrix Extension (SME). This path covers the full compiler stack first and then returns to the more advanced instruction set extensions SVE and SME.
- The full ISA path reads the three chapters in order. This provides a broader foundation for the downstream compiler stack at the cost of slower progress.
Chapter Microbenchmarks analyzes AArch64 hardware and guides the design of the high-performance kernel operations discussed in Primitives. Next, Code Generation automates the generation of high-performance kernels from primitive specifications, providing a versatile building block for the tensor compiler stack. The following chapter, Tiled Execution IR, describes an intermediate representation for tensor operations that is used in the chapter on einsum trees to execute sequences of tensor operations.
1.2. Hardware#
The tensor compiler developed in this book relies on a few primitive operations that are translated by a code generator into highly optimized kernels. In fact, the first few chapters cover the code generators themselves for AArch64, the 64-bit execution state of the Arm architecture. This means that we can run our code on many recent smartphone, notebook, desktop, and server chips.
In general, the code generation concepts discussed here also apply to other types of computer architectures, such as x86 processors. However, these architectures are outside the scope of this book, so it is recommended to follow along on an AArch64 system. In addition, most descriptions assume the Linux operating system, which is therefore recommended for development.
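Whether a development machine meets these assumptions can be checked from the command line. The following is a minimal sketch using the standard `uname` utility; the exact strings reported vary by system:

```shell
# Query the machine architecture and the operating system.
# A suitable development system reports aarch64 and Linux;
# Apple Silicon running natively under macOS reports arm64 and Darwin instead.
echo "Architecture: $(uname -m)"
echo "OS: $(uname -s)"
```

On an AArch64 Linux machine this prints `Architecture: aarch64` and `OS: Linux`.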
| Vendor | Device | SoC/CPU | #Cores | Microarchitecture |
|---|---|---|---|---|
| Amazon | c8g | AWS Graviton4 | up to 96 | Arm Neoverse V2 |
| Google | GCE C4A | Google Axion | up to 72 | Arm Neoverse V2 |
| Microsoft | Azure Dpsv6 | Azure Cobalt 100 | up to 96 | Arm Neoverse N2 |
| NVIDIA | – | Grace CPU | 72 | Arm Neoverse V2 |
| Apple | Mac | M4 / M5 | 4+6 | Custom |
| Qualcomm | – | X2 Plus / Elite / Elite Extreme | 6–18 | 3rd Gen Qualcomm Oryon |
| Raspberry Pi Ltd | Raspberry Pi 5 | Broadcom BCM2712 | 4 | Arm Cortex-A76 |
Table 1.2.1 lists hardware platforms that are suitable for development. All of these platforms support Neon for the vector processing described in Section 4. However, only the Apple M4 and M5 chips support the Scalable Matrix Extension (SME) described in Section 6. Also note that running natively under macOS requires some changes to the assembly code structure and to the just-in-time code generation described in later chapters. Therefore, the easiest way to get started under macOS is to use virtualization for development, for example with Podman containers running Linux.
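On Linux, support for the vector and matrix extensions can be verified before diving in: the kernel exposes the CPU's feature flags in `/proc/cpuinfo`. A quick sketch, assuming the standard Linux hwcap flag names `sve`, `sve2`, and `sme`:

```shell
# Check the kernel-reported CPU feature flags for SVE and SME support.
# On AArch64 Linux these appear in the "Features" lines of /proc/cpuinfo;
# on a machine without the extension the flag is simply absent.
for ext in sve sve2 sme; do
  if grep -qw "$ext" /proc/cpuinfo; then
    echo "$ext: supported"
  else
    echo "$ext: not supported"
  fi
done
```

On a Raspberry Pi 5, for example, all three flags are expected to be absent, whereas a Graviton4 instance reports `sve` and `sve2` but not `sme`.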