1. Introduction#

1.1. Structure#

Reading this book requires familiarity with C and C++, basic computer architecture concepts, and the Linux command line. The chapters are organized bottom-up and can be approached in different ways. Chapters Assembly Language and Base Instructions should be read first since they cover the basics of programming in assembly language for AArch64 systems.

The following chapters, Neon, Scalable Vector Extension, and Scalable Matrix Extension, cover vector and matrix instructions. These instructions perform the heavy lifting in a high-performance tensor program. At this point, two reading paths are possible:

  • The minimal path covers the vector instructions in Neon and then jumps directly to Microbenchmarks, skipping the chapters on the Scalable Vector Extension (SVE) and the Scalable Matrix Extension (SME). This path covers the full compiler stack first and then goes back to the more advanced instruction set extensions SVE and SME.

  • The full ISA path reads the three chapters in order. This path builds a broader foundation for the downstream compiler stack at the cost of slower progress.

Chapter Microbenchmarks analyzes AArch64 hardware and guides the design of the high-performance kernel operations discussed in Primitives. Next, Code Generation automates the generation of high-performance kernels from primitive specifications, providing a versatile building block for the tensor compiler stack. The following chapter, Tiled Execution IR, describes an intermediate representation for tensor operations that is used in the chapter Einsum Trees to execute sequences of tensor operations.

1.2. Hardware#

The tensor compiler developed in this book relies on a few primitive operations that are translated by a code generator into highly optimized kernels. In fact, the first few chapters cover the code generators themselves for AArch64, the 64-bit execution state of the Arm architecture. This means that we can run our code on many recent smartphone, notebook, desktop, and server chips.

In general, the code generation concepts discussed here also apply to other types of computer architectures, such as x86 processors. However, these architectures are outside the scope of this book, so it is recommended to follow along on an AArch64 system. In addition, most descriptions assume the Linux operating system, which is therefore recommended for development.
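To check whether a given machine fits these recommendations, the architecture and operating system can be queried from the command line. On an AArch64 Linux system, the two commands below report aarch64 and Linux, respectively:

```shell
# Print the machine's instruction set architecture;
# an AArch64 system reports "aarch64".
uname -m

# Print the kernel name; a Linux system reports "Linux".
uname -s
```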

Table 1.2.1 Recommended AArch64 development platforms.#

| Vendor           | Device                          | SoC/CPU          | #Cores   | Microarchitecture       |
|------------------|---------------------------------|------------------|----------|-------------------------|
| Amazon           | c8g                             | AWS Graviton4    | up to 96 | Arm Neoverse V2         |
| Google           | GCE C4A                         | Google Axion     | up to 72 | Arm Neoverse V2         |
| Microsoft        | Azure Dpsv6                     | Azure Cobalt 100 | up to 96 | Arm Neoverse N2         |
| NVIDIA           | Grace CPU                       |                  | 72       | Arm Neoverse V2         |
| Apple            | Mac                             | M4 / M5          | 4+6      | Custom                  |
| Qualcomm         | X2 Plus / Elite / Elite Extreme |                  | 6–18     | 3rd Gen Qualcomm Oryon  |
| Raspberry Pi Ltd | Raspberry Pi 5                  | Broadcom BCM2712 | 4        | Arm Cortex-A76          |

Table 1.2.1 lists hardware platforms that are suitable for development. All of these platforms support Neon for the vector processing described in Section 4. However, only the Apple M4 and M5 chips support the Scalable Matrix Extension (SME) described in Section 6. Also note that running natively under macOS requires some changes to the assembly code structure and the just-in-time code generation presented in this book. Therefore, the easiest way to get started under macOS is to use virtualization for development, for example by using Podman containers.
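On Linux, the available instruction set extensions can be verified by inspecting the CPU features reported by the kernel. The following sketch assumes an AArch64 Linux system, where /proc/cpuinfo lists extensions such as Neon (reported as asimd), sve, and sme in its Features line:

```shell
# On AArch64 Linux, the kernel lists supported ISA extensions in the
# "Features" line of /proc/cpuinfo; Neon is reported as "asimd".
grep -m1 'Features' /proc/cpuinfo

# Summarize which of the extensions used in this book are present;
# prints "none detected" if neither Neon, SVE, nor SME is listed.
features="$(grep -m1 'Features' /proc/cpuinfo | grep -o -E 'asimd|sve|sme' | sort -u)"
echo "${features:-none detected}"
```

The same commands can be run inside an AArch64 Linux container or virtual machine on a Mac; note, however, that the features visible to the guest depend on what the hypervisor exposes.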