.. _ch:teir:

Tiled Execution IR
==================
We have assembled the necessary tools to execute tensor operations efficiently.
This chapter introduces the *Tiled Execution Intermediate Representation (TEIR)*, which describes the execution of tensor operations through primitives that operate on subtensors, known as *tiles*.
The IR guides the backend implementation and controls when and where primitives are executed.
Conceptually, the IR is closely related to tile-based programming models such as `Triton <https://triton-lang.org/>`__, `CUDA Tile IR <https://docs.nvidia.com/cuda/tile-ir/latest/>`__, `Pallas <https://docs.jax.dev/en/latest/pallas/index.html>`__, and `TileIR <https://github.com/microsoft/TileIR>`__.

.. _sec:teir_spec:

Specification
-------------
.. rubric:: Domains and Notation

.. math::

  \begin{aligned}
    T &\in \{2,3\} \\[2pt]
    D &\in \mathbb{N}^+ \\[2pt]
    \mathbf{dim\_types} &= (t_0,\ldots,t_{D-1}),\quad t_i \in \{\mathrm{C},\mathrm{M},\mathrm{N},\mathrm{K}\} \\
    \mathbf{exec\_types} &= (e_0,\ldots,e_{D-1}),\quad e_i \in \{\mathrm{seq},\,\mathrm{parallel},\,\mathrm{prim}\} \\
    \mathbf{dim\_sizes} &\in (\mathbb{N}^+)^{D} \\
    \mathbf{strides} &\in \mathbb{N}^{T \times D} \\[4pt]
    \mathrm{data\_type} &\in \{\mathrm{FP16},\mathrm{BF16},\mathrm{TF32},\mathrm{FP32},\mathrm{FP64}\} \\
    \mathrm{prim\_first} &\in \{\mathrm{None},\mathrm{Zero},\mathrm{ReLU}\} \\
    \mathrm{prim\_main} &\in \{\mathrm{None},\mathrm{Copy}, \mathrm{ReLU}, \mathrm{GEMM},\mathrm{BRGEMM}\} \\
    \mathrm{prim\_last} &\in \{\mathrm{None},\mathrm{ReLU}\}
  \end{aligned}

.. rubric:: Operation-Type Constraints

The operation type is determined by :math:`\mathrm{prim\_main}`.
Unary operations use :math:`\mathrm{prim\_main}\in\{\mathrm{Copy},\mathrm{ReLU}\}` and require :math:`T=2` (two tensors).
Binary contractions use :math:`\mathrm{prim\_main}\in\{\mathrm{GEMM},\mathrm{BRGEMM}\}` and require :math:`T=3` (three tensors).

**Unary operations**:

- :math:`T=2` (tensors :math:`\mathrm{in0}`, :math:`\mathrm{out}`).
- All axes MUST have type :math:`\mathrm{C}`: :math:`t_i=\mathrm{C}\; \forall i`.
- :math:`\mathrm{prim\_first}=\mathrm{None}`.
- :math:`\mathrm{prim\_main}\in\{\mathrm{Copy},\mathrm{ReLU}\}`.
- :math:`\mathrm{prim\_last}=\mathrm{None}`.

**Binary contractions**:

- :math:`T=3` (tensors :math:`\mathrm{in0}`, :math:`\mathrm{in1}`, :math:`\mathrm{out}`).
- :math:`t_i \in \{\mathrm{C},\mathrm{M},\mathrm{N},\mathrm{K}\}`.
- :math:`\mathrm{prim\_first}\in\{\mathrm{None},\mathrm{Zero}\}`.
- :math:`\mathrm{prim\_main}\in\{\mathrm{GEMM},\mathrm{BRGEMM}\}`.
- :math:`\mathrm{prim\_last}\in\{\mathrm{None},\mathrm{ReLU}\}`.

.. rubric:: Records

.. math::

  \begin{aligned}
    \mathrm{TEIR\mbox{-}Schedule} &=  \langle \mathbf{dim\_types}, \mathbf{exec\_types}, \mathbf{dim\_sizes},\, \mathbf{strides} \rangle \\[4pt]
    \mathrm{TEIR\mbox{-}Primitives} &= \langle \mathrm{data\_type}, \mathrm{prim\_first}, \mathrm{prim\_main}, \mathrm{prim\_last} \rangle
  \end{aligned}

.. rubric:: Axis Roles

The tensor index :math:`j \in \{0,\ldots,T-1\}` indexes the tensors in a configuration.
The axis role mapping :math:`\mathbf m` determines which tensors participate in a given axis: :math:`\mathbf m(t)[j]=1` if tensor :math:`j` participates in an axis of type :math:`t`, and :math:`0` otherwise.
For unary operations (:math:`T=2`), :math:`\mathrm{C}` is the only permitted dimension type; every axis touches both tensors, so :math:`\mathbf m(\mathrm{C})=(1,1)`.
For binary contraction operations (:math:`T=3`), the full mapping is:

.. math::

  \mathbf m:\{\mathrm{C},\mathrm{M},\mathrm{N},\mathrm{K}\}\to\{0,1\}^3,\qquad
  \mathbf m(t)=
  \begin{cases}
  (1,1,1) & t=\mathrm{C},\\
  (1,0,1) & t=\mathrm{M},\\
  (0,1,1) & t=\mathrm{N},\\
  (1,1,0) & t=\mathrm{K},
  \end{cases}
  \quad\text{(order: }\mathrm{in0}, \mathrm{in1}, \mathrm{out}\text{).}

.. rubric:: Well-Formedness

- All schedule vectors have length :math:`D`; the stride tensor has shape :math:`T \times D`:

.. math::

  |\mathbf{dim\_types}|=|\mathbf{exec\_types}|=|\mathbf{dim\_sizes}|=D, \qquad
  \mathbf{strides}\in\mathbb{N}^{T\times D}.

- :math:`\mathbf{strides}[j][i]` MUST be :math:`0` whenever tensor :math:`j` does not participate in axis :math:`i`, that is, whenever :math:`\mathbf m(t_i)[j] = 0`.
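
The two rules above can be checked mechanically.
The following is a minimal Python sketch, not part of the specification; the helper name ``is_well_formed`` and the concrete sizes are assumptions chosen for illustration.

.. code-block:: python

   # Axis role mapping m(t) for binary contractions, order (in0, in1, out).
   ROLES_BINARY = {"C": (1, 1, 1), "M": (1, 0, 1), "N": (0, 1, 1), "K": (1, 1, 0)}
   # Unary operations only permit C, order (in0, out).
   ROLES_UNARY = {"C": (1, 1)}


   def is_well_formed(dim_types, exec_types, dim_sizes, strides):
       """Check the two well-formedness rules (hypothetical helper)."""
       D = len(dim_types)
       T = len(strides)
       if T not in (2, 3):
           return False
       roles = ROLES_UNARY if T == 2 else ROLES_BINARY
       # Rule 1: all schedule vectors have length D, strides has shape T x D.
       if len(exec_types) != D or len(dim_sizes) != D:
           return False
       if any(len(row) != D for row in strides):
           return False
       # Rule 2: stride 0 wherever tensor j does not participate in axis i.
       for i, t in enumerate(dim_types):
           if t not in roles:
               return False
           for j in range(T):
               if roles[t][j] == 0 and strides[j][i] != 0:
                   return False
       return True


   # Batched matrix multiplication with axis sizes (2, 3, 4, 5) and
   # dim_types (K, M, N, C); the rows of strides are in0, in1, out.
   dim_types = ("K", "M", "N", "C")
   exec_types = ("seq",) * 4
   dim_sizes = (2, 3, 4, 5)
   strides = [[1, 2, 0, 6], [4, 0, 1, 8], [0, 4, 1, 12]]
   ok = is_well_formed(dim_types, exec_types, dim_sizes, strides)

   # in0 does not participate in the N axis, so a nonzero stride is rejected.
   broken = [[1, 2, 1, 6], [4, 0, 1, 8], [0, 4, 1, 12]]
   bad = is_well_formed(dim_types, exec_types, dim_sizes, broken)

In this sketch ``ok`` evaluates to ``True`` and ``bad`` to ``False``.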

.. rubric:: Primitive-Specific Requirements

.. math::

  \begin{aligned}
    &P_{\mathrm{prim}}=\{\, i \mid e_i=\mathrm{prim}\,\}, \\
    &P_{\mathrm{C}}=\{\, i \in P_{\mathrm{prim}} \mid t_i=\mathrm{C}\,\}, \quad
     P_{\mathrm{M}}=\{\, i \in P_{\mathrm{prim}} \mid t_i=\mathrm{M}\,\}, \\
    &P_{\mathrm{N}}=\{\, i \in P_{\mathrm{prim}} \mid t_i=\mathrm{N}\,\}, \quad
     P_{\mathrm{K}}=\{\, i \in P_{\mathrm{prim}} \mid t_i=\mathrm{K}\,\}.
  \end{aligned}

.. math::

  \begin{aligned}
    \textbf{R1:}\;&
    \bigl(\mathrm{prim\_main}=\mathrm{Copy}\ \lor\ \mathrm{prim\_first}\in\{\mathrm{Zero},\mathrm{ReLU}\}\ \lor\ \mathrm{prim\_last}\in\{\mathrm{ReLU}\}\bigr) \\
    & \Rightarrow\ |P_{\mathrm{C}}| + |P_{\mathrm{M}}| + |P_{\mathrm{N}}| \ge 1.\\[4pt]
    \textbf{R2:}\;&
    \mathrm{prim\_main}=\mathrm{GEMM} \\
    & \Rightarrow\ |P_{\mathrm{C}}| = 0 \ \land\ |P_{\mathrm{M}}| = 1 \ \land\ |P_{\mathrm{N}}| = 1 \ \land\ |P_{\mathrm{K}}| = 1.\\[4pt]
    \textbf{R3:}\;&
    \mathrm{prim\_main}=\mathrm{BRGEMM} \\
    & \Rightarrow\ |P_{\mathrm{C}}| = 0 \ \land\ |P_{\mathrm{M}}| = 1 \ \land\ |P_{\mathrm{N}}| = 1 \ \land\ |P_{\mathrm{K}}| = 2.
  \end{aligned}
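
Requirements R1 to R3 only count the ``prim`` axes per dimension type, so they are straightforward to test.
The sketch below is one possible Python formulation; the function name ``check_primitive_requirements`` and the use of plain strings for the enumeration values are assumptions for illustration.

.. code-block:: python

   from collections import Counter


   def check_primitive_requirements(dim_types, exec_types,
                                    prim_first, prim_main, prim_last):
       """Return the names of the violated requirements among R1, R2, R3."""
       # |P_t|: number of prim axes of each dimension type t.
       counts = Counter(t for t, e in zip(dim_types, exec_types) if e == "prim")
       p_c, p_m, p_n, p_k = (counts[t] for t in "CMNK")
       violated = []
       r1_applies = (prim_main == "Copy"
                     or prim_first in ("Zero", "ReLU")
                     or prim_last == "ReLU")
       if r1_applies and p_c + p_m + p_n < 1:
           violated.append("R1")
       if prim_main == "GEMM" and (p_c, p_m, p_n, p_k) != (0, 1, 1, 1):
           violated.append("R2")
       if prim_main == "BRGEMM" and (p_c, p_m, p_n, p_k) != (0, 1, 1, 2):
           violated.append("R3")
       return violated


   # Six axes with types (N, M, K, M, N, K).
   types = ("N", "M", "K", "M", "N", "K")
   gemm_exec = ("seq", "seq", "seq", "prim", "prim", "prim")
   brgemm_exec = ("parallel", "parallel", "prim", "prim", "prim", "prim")

   r_gemm = check_primitive_requirements(types, gemm_exec, "Zero", "GEMM", "None")
   r_mismatch = check_primitive_requirements(types, gemm_exec, "Zero", "BRGEMM", "None")
   r_brgemm = check_primitive_requirements(types, brgemm_exec, "Zero", "BRGEMM", "ReLU")

Here ``r_gemm`` and ``r_brgemm`` are empty, while ``r_mismatch`` contains ``"R3"`` because only one ``K`` axis is marked ``prim``.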

.. rubric:: Execution Semantics

Execution Types
  If :math:`e_i=\mathrm{prim}`, axis :math:`i` is consumed inside the primitive(s).
  Values :math:`\mathrm{seq}` and :math:`\mathrm{parallel}` denote sequential and parallel traversal of axis :math:`i` in the schedule.
  The overall schedule order is determined by the order of all axes with :math:`e_i \neq \mathrm{prim}` as they appear in the TEIR-Schedule.
  Traversal proceeds from the first such axis (outermost) to the last (innermost).
  No ordering guarantees are imposed between multiple axes marked as :math:`\mathrm{parallel}`.

First/Last-Access Primitives
  Primitives :math:`\mathrm{prim\_first}` and :math:`\mathrm{prim\_last}` define initialization and finalization steps applied to output tiles:

  * :math:`\mathrm{prim\_first}` is applied the first time an output tile is accessed in a given schedule.
  * :math:`\mathrm{prim\_last}` is applied the last time an output tile is accessed.
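
To make the first- and last-access rule concrete, the following Python sketch enumerates the non-``prim`` iterations of a schedule and marks where ``prim_first`` and ``prim_last`` would fire.
It is an illustration under two assumptions: all outer axes are traversed sequentially, and an output tile is identified by its offset into the output tensor.
The helper name ``trace_accesses`` and the sizes are hypothetical.

.. code-block:: python

   from itertools import product


   def trace_accesses(dim_sizes, exec_types, out_strides):
       """Return (loop indices, output offset, is_first, is_last) per iteration."""
       outer = [i for i, e in enumerate(exec_types) if e != "prim"]
       ranges = [range(dim_sizes[i]) for i in outer]

       def offset(idx):
           return sum(n * out_strides[i] for n, i in zip(idx, outer))

       # Pre-pass: count how often each output tile is visited.
       total = {}
       for idx in product(*ranges):
           total[offset(idx)] = total.get(offset(idx), 0) + 1

       seen = {}
       trace = []
       # product() varies the last axis fastest, so it is the innermost loop.
       for idx in product(*ranges):
           off = offset(idx)
           seen[off] = seen.get(off, 0) + 1
           trace.append((idx, off, seen[off] == 1, seen[off] == total[off]))
       return trace


   # Axes (p, r, t, s, q, u) with types (N, M, K, M, N, K); s, q, u are prim.
   dim_sizes = (2, 2, 3, 4, 5, 6)
   exec_types = ("seq", "seq", "seq", "prim", "prim", "prim")
   out_strides = (40, 4, 0, 1, 8, 0)   # row-major out[p][q][r][s]
   trace = trace_accesses(dim_sizes, exec_types, out_strides)

Because the output has stride 0 along the ``K`` axis ``t``, each output tile is visited three times in this example: ``prim_first`` applies at ``t = 0`` and ``prim_last`` at ``t = 2``.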

.. _sec:teir_toc:

Tensor Operation Configuration
------------------------------
:numref:`sec:teir_spec` contains the formal specification of the Tiled Execution Intermediate Representation (TEIR).
TEIR comprises two records: TEIR-Primitives, which specifies the primitives to be executed, and TEIR-Schedule, which defines how these primitives are applied to tiles of the tensors.
This section describes the IR from the perspective of a user who configures a tensor operation using TEIR.

.. _tab:teir_schedule:

.. list-table:: TEIR Schedule — Axis Mapping
   :header-rows: 1
   :widths: 20 40 40

   * - Field
     - Meaning
     - Domain & constraints
   * - ``dim_types``
     - Axis roles across tensors (D axes)
     - {``C``, ``M``, ``N``, ``K``} per axis
   * - ``exec_types``
     - Execution policy per axis (D axes)
     - {``seq``, ``parallel``, ``prim``} per axis
   * - ``dim_sizes``
     - Positive extent per axis
     - ``array[D]`` of ℕ⁺
   * - ``strides``
     - Per-tensor strides (T×D stride tensor)
     - ``array[T][D]`` of ℕ, where ``strides[j][i] = 0`` when tensor ``j`` does not participate in axis ``i``

:numref:`tab:teir_schedule` provides a concise informal form of TEIR-Schedule.
The field ``dim_types`` describes whether an axis spans both inputs and the output (``C``), only the first input and the output (``M``), only the second input and the output (``N``), or only the two inputs (``K``).
The field ``exec_types`` specifies the execution type of each axis.
An axis marked ``seq`` is traversed sequentially, an axis marked ``parallel`` is traversed in parallel, and an axis marked ``prim`` is consumed inside the primitives.
The remaining fields describe the sizes of the axes in field ``dim_sizes`` and the data layout of each tensor in the stride tensor ``strides``, where ``strides[j][i]`` gives the stride of tensor ``j`` along axis ``i``.
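
As an illustration of how ``strides`` encodes the data layout, the sketch below derives the stride table for row-major tensors from their axis orders.
The helper name ``schedule_strides`` and the example sizes are assumptions for illustration only.

.. code-block:: python

   def schedule_strides(axis_order, tensor_layouts, sizes):
       """Build strides[j][i] for row-major tensors (hypothetical helper).

       axis_order: axis names in schedule order, e.g. "abcd".
       tensor_layouts: one string per tensor, axes from slowest to fastest.
       sizes: mapping from axis name to extent.
       """
       strides = []
       for layout in tensor_layouts:
           per_axis = {}
           step = 1
           for name in reversed(layout):   # start with the fastest axis
               per_axis[name] = step
               step *= sizes[name]
           # Axes that the tensor does not contain get stride 0.
           strides.append([per_axis.get(name, 0) for name in axis_order])
       return strides


   # Batched matrix multiplication out[d][b][c] += in0[d][b][a] * in1[d][a][c].
   sizes = {"a": 2, "b": 3, "c": 4, "d": 5}
   strides = schedule_strides("abcd", ["dba", "dac", "dbc"], sizes)
   # strides == [[1, 2, 0, 6], [4, 0, 1, 8], [0, 4, 1, 12]]

Each row corresponds to one tensor (``in0``, ``in1``, ``out``) and each column to one axis in schedule order.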

.. _tab:teir_primitives:

.. list-table:: TEIR Primitives — Primitive Specification
   :header-rows: 1
   :widths: 20 40 40

   * - Field
     - Meaning
     - Allowed values
   * - ``data_type``
     - Data type of inputs and output
     - {``FP16``, ``BF16``, ``TF32``, ``FP32``, ``FP64``}
   * - ``prim_first``
     - First-access primitive
     - {``None``, ``Zero``, ``ReLU``}
   * - ``prim_main``
     - Main primitive
     - {``None``, ``Copy``, ``ReLU``, ``GEMM``, ``BRGEMM``}
   * - ``prim_last``
     - Last-access primitive
     - {``None``, ``ReLU``}

:numref:`tab:teir_primitives` provides a short form of TEIR-Primitives.
TEIR-Primitives contains four fields.
``data_type`` determines the data type of the input and output tensors.
In addition, up to three primitives can be used in TEIR.
The first-access primitive (``prim_first``) is applied to a tile of the output tensor when it is accessed for the first time.
Similarly, the last-access primitive (``prim_last``) is applied to a tile of the output tensor when it is accessed for the last time.
The types available to ``prim_first`` and ``prim_last`` are listed below; :numref:`tab:teir_primitives` gives the values allowed for each field:

``None``
  No primitive is executed.
``Zero``
  Zero the output tile.
``ReLU``
  Apply ReLU to the output tile's values.

By contrast, the main primitive (``prim_main``) is executed for every valid combination of input and output tiles in the TEIR schedule.
The main primitive can have one of the following types:

``None``
  No primitive is executed.
``Copy``
  Copy the input tile's values to the output tile.
``ReLU``
  Apply ReLU to the input tile's values and write the result to the output tile.
``GEMM``
  General Matrix Multiply (GEMM).
  Multiply two 2D input tiles and accumulate the result into the 2D output tile.
``BRGEMM``
  Batch-Reduce GEMM (BRGEMM).
  Perform a BRGEMM operation on two 3D input tiles and accumulate the result into the 2D output tile.
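
The accumulation performed by ``GEMM`` and ``BRGEMM`` can be sketched in plain Python on nested lists.
This is only a reference for the arithmetic, assuming that a 3D input tile is treated as a batch of 2D tiles; it says nothing about how a backend implements the primitives.

.. code-block:: python

   def gemm(A, B, C):
       """C[m][n] += sum over k of A[m][k] * B[k][n] for 2D tiles."""
       for m in range(len(C)):
           for n in range(len(C[0])):
               for k in range(len(B)):
                   C[m][n] += A[m][k] * B[k][n]


   def brgemm(A_batch, B_batch, C):
       """Accumulate every product A_batch[i] * B_batch[i] into the same C."""
       for A, B in zip(A_batch, B_batch):
           gemm(A, B, C)


   A_batch = [[[1, 2], [3, 4]], [[1, 0], [0, 1]]]
   B_batch = [[[5, 6], [7, 8]], [[1, 1], [1, 1]]]
   C = [[0, 0], [0, 0]]
   brgemm(A_batch, B_batch, C)
   # C == [[20, 23], [44, 51]]

Both functions accumulate into ``C`` rather than overwrite it, which is why a contraction typically pairs them with ``prim_first = Zero``.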


.. _sec:teir_example_configs:

Example Configurations
----------------------

This section illustrates TEIR through example configurations for two types of tensor operations:

Permutation
  Permute the axes of the input tensor to obtain the output tensor: :math:`T_\text{out} = \operatorname{permute} \left( T_\text{in0} \right)`.
Binary Tensor Contraction
  Contract two input tensors to obtain an output tensor: :math:`T_\text{out} = \operatorname{contract} \left( T_\text{in0}, T_\text{in1} \right)`.

Each example provides a TEIR schedule table specifying the dimension types, execution types, sizes, and per-tensor strides, followed by pseudocode showing the corresponding execution.

Scalar Permutation
^^^^^^^^^^^^^^^^^^

A sequential element-wise permutation of a 4D tensor: the input has shape :math:`|a| \times |b| \times |c| \times |d|` stored in row-major order, and the output stores the permuted result ``abcd -> cdab`` in row-major order.
All axes are sequential and have type ``C`` (unary operation).
The main primitive is ``Copy``; first- and last-access primitives are ``None``.

.. _tab:ex_perm1_schedule:

.. list-table:: TEIR schedule for the scalar permutation ``abcd -> cdab``.
   :header-rows: 1
   :stub-columns: 1
   :widths: 30 15 15 15 15

   * - Dimension ID
     - a
     - b
     - c
     - d
   * - dim_sizes
     - \|a\|
     - \|b\|
     - \|c\|
     - \|d\|
   * - dim_types
     - C
     - C
     - C
     - C
   * - exec_types
     - seq
     - seq
     - seq
     - seq
   * - strides[in0]
     - \|b\| × \|c\| × \|d\|
     - \|c\| × \|d\|
     - \|d\|
     - 1
   * - strides[out]
     - \|b\|
     - 1
     - \|d\| × \|a\| × \|b\|
     - \|a\| × \|b\|

.. code-block:: text
   :caption: Scalar execution of the permutation ``abcd -> cdab``.
   :name: code:ex_perm1_pseudo

   for a in |a|
     for b in |b|
       for c in |c|
         for d in |d|
           out[c][d][a][b] = in0[a][b][c][d]
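
The same loop nest can be written against flat buffers, using only ``dim_sizes`` and the two stride rows of the schedule.
The following Python sketch assumes small example sizes and is meant as an illustration rather than a reference implementation.

.. code-block:: python

   from itertools import product

   A, B, C, D = 2, 3, 4, 5                  # |a|, |b|, |c|, |d|
   dim_sizes = [A, B, C, D]                 # schedule order a, b, c, d
   strides_in0 = [B * C * D, C * D, D, 1]   # row-major in0[a][b][c][d]
   strides_out = [B, 1, D * A * B, A * B]   # row-major out[c][d][a][b]

   in0 = list(range(A * B * C * D))
   out = [None] * len(in0)

   # product() varies the last axis fastest, matching the loop nest a, b, c, d.
   for idx in product(*(range(n) for n in dim_sizes)):
       off_in = sum(i * s for i, s in zip(idx, strides_in0))
       off_out = sum(i * s for i, s in zip(idx, strides_out))
       out[off_out] = in0[off_in]

Each iteration moves a single element; the strides alone determine where it is read and written.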

Tiled Permutation
^^^^^^^^^^^^^^^^^

Same permutation as the :ref:`scalar permutation <tab:ex_perm1_schedule>`, but axes ``d`` and ``b`` are consumed by a ``Copy`` primitive that copies a 2D tile at once.
The outer axes ``a`` and ``c`` are traversed sequentially, and each iteration copies a tile of shape :math:`|d| \times |b|`.
Because ``d`` has unit stride in the input and ``b`` has unit stride in the output, the copy transposes the tile in memory.

.. _tab:ex_perm2_schedule:

.. list-table:: TEIR schedule for the tiled permutation ``abcd -> cdab``.
   :header-rows: 1
   :stub-columns: 1
   :widths: 30 15 15 15 15

   * - Dimension ID
     - a
     - c
     - d
     - b
   * - dim_sizes
     - \|a\|
     - \|c\|
     - \|d\|
     - \|b\|
   * - dim_types
     - C
     - C
     - C
     - C
   * - exec_types
     - seq
     - seq
     - prim
     - prim
   * - strides[in0]
     - \|b\| × \|c\| × \|d\|
     - \|d\|
     - 1
     - \|c\| × \|d\|
   * - strides[out]
     - \|b\|
     - \|d\| × \|a\| × \|b\|
     - \|a\| × \|b\|
     - 1

.. code-block:: text
   :caption: Tiled execution of the permutation ``abcd -> cdab``.
   :name: code:ex_perm2_pseudo

   for a in |a|
     for c in |c|
       Copy( in  = in0[a][0][c],
             out = out[c][0][a],
             m   = |d|,
             n   = |b|,
             ldI = |c| * |d|,
             ldO = |a| * |b| )
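
A possible Python rendering of this tiled schedule is sketched below.
The function ``copy_tile`` is a hypothetical stand-in for the ``Copy`` primitive: it assumes that the ``m`` axis is contiguous in the source and the ``n`` axis is contiguous in the destination, as the strides in the schedule suggest.

.. code-block:: python

   A, B, C, D = 2, 3, 4, 5            # |a|, |b|, |c|, |d|
   in0 = list(range(A * B * C * D))   # flat row-major in0[a][b][c][d]
   out = [None] * len(in0)            # flat row-major out[c][d][a][b]


   def copy_tile(src, src_off, dst, dst_off, m, n, ld_in, ld_out):
       """Copy an m x n tile; m is contiguous in src, n is contiguous in dst."""
       for j in range(n):
           for i in range(m):
               dst[dst_off + i * ld_out + j] = src[src_off + i + j * ld_in]


   for a in range(A):
       for c in range(C):
           copy_tile(in0, a * B * C * D + c * D,   # start of in0[a][0][c]
                     out, c * D * A * B + a * B,   # start of out[c][0][a]
                     m=D, n=B, ld_in=C * D, ld_out=A * B)

Compared with the scalar schedule, only the two outer loops remain in the schedule; the inner two are handled by the primitive.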

Scalar Batched GEMM
^^^^^^^^^^^^^^^^^^^

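A batched matrix multiplication: ``d`` is the batch axis (``C``), ``b`` and ``c`` are the ``M`` and ``N`` axes, and ``a`` is the contraction axis (``K``).
All axes are sequential, so each iteration performs a single scalar multiply-accumulate.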

.. _tab:ex_cont1_schedule:

.. list-table:: TEIR schedule for the scalar batched GEMM.
   :header-rows: 1
   :stub-columns: 1
   :widths: 30 15 15 15 15

   * - Dimension ID
     - a
     - b
     - c
     - d
   * - dim_sizes
     - \|a\|
     - \|b\|
     - \|c\|
     - \|d\|
   * - dim_types
     - K
     - M
     - N
     - C
   * - exec_types
     - seq
     - seq
     - seq
     - seq
   * - strides[in0]
     - 1
     - \|a\|
     - 0
     - \|b\| × \|a\|
   * - strides[in1]
     - \|c\|
     - 0
     - 1
     - \|a\| × \|c\|
   * - strides[out]
     - 0
     - \|c\|
     - 1
     - \|b\| × \|c\|

.. code-block:: text
   :caption: Scalar execution of the batched GEMM.
   :name: code:ex_cont1_pseudo

   for a in |a|
     for b in |b|
       for c in |c|
         for d in |d|
           out[d][b][c] += in0[d][b][a] * in1[d][a][c]

Scalar Batched GEMM — Reordered
^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^

Same operation as the :ref:`scalar batched GEMM <tab:ex_cont1_schedule>`, but with the contraction axis ``a`` (type ``K``) moved to the innermost position.
This changes the traversal order without affecting the result.

.. _tab:ex_cont2_schedule:

.. list-table:: TEIR schedule for the reordered scalar batched GEMM.
   :header-rows: 1
   :stub-columns: 1
   :widths: 30 15 15 15 15

   * - Dimension ID
     - b
     - c
     - d
     - a
   * - dim_sizes
     - \|b\|
     - \|c\|
     - \|d\|
     - \|a\|
   * - dim_types
     - M
     - N
     - C
     - K
   * - exec_types
     - seq
     - seq
     - seq
     - seq
   * - strides[in0]
     - \|a\|
     - 0
     - \|b\| × \|a\|
     - 1
   * - strides[in1]
     - 0
     - 1
     - \|a\| × \|c\|
     - \|c\|
   * - strides[out]
     - \|c\|
     - 1
     - \|b\| × \|c\|
     - 0

.. code-block:: text
   :caption: Scalar execution of the reordered batched GEMM.
   :name: code:ex_cont2_pseudo

   for b in |b|
     for c in |c|
       for d in |d|
         for a in |a|
           out[d][b][c] += in0[d][b][a] * in1[d][a][c]
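
The effect of reordering can be checked with a small Python sketch that runs the scalar schedule for an arbitrary axis order.
The sizes and input values are assumptions; integer data is used so that the comparison is exact, whereas floating-point results may differ in rounding when the summation order changes.

.. code-block:: python

   from itertools import product

   sizes = {"a": 2, "b": 3, "c": 4, "d": 2}
   # Strides per axis as (in0, in1, out), taken from the schedule table.
   strides = {
       "a": (1, sizes["c"], 0),
       "b": (sizes["a"], 0, sizes["c"]),
       "c": (0, 1, 1),
       "d": (sizes["b"] * sizes["a"], sizes["a"] * sizes["c"],
             sizes["b"] * sizes["c"]),
   }
   in0 = list(range(1, 13))   # flat in0[d][b][a]
   in1 = list(range(1, 17))   # flat in1[d][a][c]


   def run(order):
       """Execute the scalar schedule with the axes nested in the given order."""
       out = [0] * (sizes["d"] * sizes["b"] * sizes["c"])
       for idx in product(*(range(sizes[n]) for n in order)):
           off = [sum(i * strides[n][j] for i, n in zip(idx, order))
                  for j in range(3)]
           out[off[2]] += in0[off[0]] * in1[off[1]]
       return out


   same = run("abcd") == run("bcda")

With these inputs ``same`` is ``True``: both axis orders produce identical output buffers.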


Scalar Tensor Contraction
^^^^^^^^^^^^^^^^^^^^^^^^^

A general tensor contraction ``trus,pqtu->pqrs`` with six sequential axes and scalar execution.
Each of the ``M``, ``N``, and ``K`` roles spans two axes.

.. _tab:ex_cont3_schedule:

.. list-table:: TEIR schedule for the scalar tensor contraction ``trus,pqtu->pqrs``.
   :header-rows: 1
   :stub-columns: 1
   :widths: 20 15 15 15 15 15 15

   * - Dimension ID
     - p
     - q
     - r
     - s
     - t
     - u
   * - dim_sizes
     - \|p\|
     - \|q\|
     - \|r\|
     - \|s\|
     - \|t\|
     - \|u\|
   * - dim_types
     - N
     - N
     - M
     - M
     - K
     - K
   * - exec_types
     - seq
     - seq
     - seq
     - seq
     - seq
     - seq
   * - strides[in0]
     - 0
     - 0
     - \|u\| × \|s\|
     - 1
     - \|r\| × \|u\| × \|s\|
     - \|s\|
   * - strides[in1]
     - \|q\| × \|t\| × \|u\|
     - \|t\| × \|u\|
     - 0
     - 0
     - \|u\|
     - 1
   * - strides[out]
     - \|q\| × \|r\| × \|s\|
     - \|r\| × \|s\|
     - \|s\|
     - 1
     - 0
     - 0

.. code-block:: text
   :caption: Scalar execution of the tensor contraction ``trus,pqtu->pqrs``.
   :name: code:ex_cont3_pseudo

   for p in |p|
     for q in |q|
       for r in |r|
         for s in |s|
           for t in |t|
             for u in |u|
               out[p][q][r][s] += in0[t][r][u][s] * in1[p][q][t][u]

Tensor Contraction with GEMM
^^^^^^^^^^^^^^^^^^^^^^^^^^^^

Same contraction as the :ref:`scalar tensor contraction <tab:ex_cont3_schedule>`, but axes ``s``, ``q``, ``u`` are consumed by a ``GEMM`` primitive.
The remaining axes ``p``, ``r``, ``t`` are traversed sequentially.

.. _tab:ex_cont4_schedule:

.. list-table:: TEIR schedule for the tensor contraction ``trus,pqtu->pqrs`` with GEMM primitive.
   :header-rows: 1
   :stub-columns: 1
   :widths: 20 15 15 15 15 15 15

   * - Dimension ID
     - p
     - r
     - t
     - s
     - q
     - u
   * - dim_sizes
     - \|p\|
     - \|r\|
     - \|t\|
     - \|s\|
     - \|q\|
     - \|u\|
   * - dim_types
     - N
     - M
     - K
     - M
     - N
     - K
   * - exec_types
     - seq
     - seq
     - seq
     - prim
     - prim
     - prim
   * - strides[in0]
     - 0
     - \|u\| × \|s\|
     - \|r\| × \|u\| × \|s\|
     - 1
     - 0
     - \|s\|
   * - strides[in1]
     - \|q\| × \|t\| × \|u\|
     - 0
     - \|u\|
     - 0
     - \|t\| × \|u\|
     - 1
   * - strides[out]
     - \|q\| × \|r\| × \|s\|
     - \|s\|
     - 0
     - 1
     - \|r\| × \|s\|
     - 0

.. code-block:: text
   :caption: GEMM-based execution of the tensor contraction ``trus,pqtu->pqrs``.
   :name: code:ex_cont4_pseudo

   for p in |p|
     for r in |r|
       for t in |t|
         GEMM( A   = in0[t][r],
               B   = in1[p][0][t],
               C   = out[p][0][r],
               m   = |s|,
               n   = |q|,
               k   = |u|,
               ldA = |s|,
               ldB = |t| * |u|,
               ldC = |r| * |s| )
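
The pseudocode can be turned into a small runnable Python sketch.
The ``gemm`` function below is a hypothetical stand-in for the primitive and assumes a column-major, BLAS-like convention for the leading dimensions; sizes and input values are chosen arbitrarily.

.. code-block:: python

   from itertools import product

   P, Q, R, S, T, U = 2, 3, 2, 4, 3, 2
   in0 = [(3 * i) % 7 for i in range(T * R * U * S)]    # flat in0[t][r][u][s]
   in1 = [(5 * i) % 11 for i in range(P * Q * T * U)]   # flat in1[p][q][t][u]
   out = [0] * (P * Q * R * S)                          # flat out[p][q][r][s]


   def gemm(a, a_off, b, b_off, c, c_off, m, n, k, lda, ldb, ldc):
       """Column-major GEMM on flat lists: C += A * B."""
       for j in range(n):
           for kk in range(k):
               for i in range(m):
                   c[c_off + i + j * ldc] += (a[a_off + i + kk * lda]
                                              * b[b_off + kk + j * ldb])


   for p in range(P):
       for r in range(R):
           for t in range(T):
               gemm(in0, t * R * U * S + r * U * S,   # start of in0[t][r]
                    in1, p * Q * T * U + t * U,       # start of in1[p][0][t]
                    out, p * Q * R * S + r * S,       # start of out[p][0][r]
                    m=S, n=Q, k=U, lda=S, ldb=T * U, ldc=R * S)

   # Reference: the scalar six-loop formulation of the same contraction.
   ref = [0] * len(out)
   for p, q, r, s, t, u in product(range(P), range(Q), range(R),
                                   range(S), range(T), range(U)):
       ref[((p * Q + q) * R + r) * S + s] += (
           in0[((t * R + r) * U + u) * S + s]
           * in1[((p * Q + q) * T + t) * U + u])

In this sketch ``out`` and ``ref`` hold the same values, so the GEMM-based schedule reproduces the scalar contraction.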

:numref:`fig:bc_dims` and :numref:`fig:bc_order` illustrate this configuration for a concrete instance with ``|r|=3``, ``|t|=2``, and ``|p|=4``.
Each tensor is drawn as a grid of tiles.
The ``seq`` axes (``p``, ``r``, ``t``) index tiles across each grid; the ``prim`` axes span the interior of each tile according to the axis roles:
``s`` and ``u`` for ``in0``, ``u`` and ``q`` for ``in1``, and ``s`` and ``q`` for ``out``.
Spatial alignment of the grids reflects the dimension roles:
``r`` aligns the rows of ``in0`` and ``out`` (``M``),
``p`` aligns the columns of ``in1`` and ``out`` (``N``), and ``t`` connects the columns of ``in0`` to the rows of ``in1`` (``K``, contracted).
:numref:`fig:bc_order` shows the memory layout within and across tiles.

.. _fig:bc_dims:

.. figure:: ../data_teir/bc_dims.svg
   :width: 70%

   Illustration of the dimensions for the binary tensor contraction ``trus,pqtu->pqrs``.

.. _fig:bc_order:

.. figure:: ../data_teir/bc_order.svg
   :width: 70%

   Illustration of the memory layout for the binary tensor contraction ``trus,pqtu->pqrs``.

Parallel Tensor Contraction with BRGEMM
^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^

Same contraction as the :ref:`scalar <tab:ex_cont3_schedule>` and :ref:`GEMM-based <tab:ex_cont4_schedule>` examples.
The contraction axis ``t`` is absorbed into a ``BRGEMM`` primitive alongside ``s``, ``q``, ``u``.
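The outer axes ``p`` and ``r`` are traversed in parallel; because no ``K`` axis remains outside the primitive, each (``p``, ``r``) iteration accumulates into its own output tile.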

.. _tab:ex_cont5_schedule:

.. list-table:: TEIR schedule for the parallel tensor contraction ``trus,pqtu->pqrs`` with BRGEMM primitive.
   :header-rows: 1
   :stub-columns: 1
   :widths: 20 15 15 15 15 15 15

   * - Dimension ID
     - p
     - r
     - t
     - s
     - q
     - u
   * - dim_sizes
     - \|p\|
     - \|r\|
     - \|t\|
     - \|s\|
     - \|q\|
     - \|u\|
   * - dim_types
     - N
     - M
     - K
     - M
     - N
     - K
   * - exec_types
     - parallel
     - parallel
     - prim
     - prim
     - prim
     - prim
   * - strides[in0]
     - 0
     - \|u\| × \|s\|
     - \|r\| × \|u\| × \|s\|
     - 1
     - 0
     - \|s\|
   * - strides[in1]
     - \|q\| × \|t\| × \|u\|
     - 0
     - \|u\|
     - 0
     - \|t\| × \|u\|
     - 1
   * - strides[out]
     - \|q\| × \|r\| × \|s\|
     - \|s\|
     - 0
     - 1
     - \|r\| × \|s\|
     - 0

.. code-block:: text
   :caption: Parallel BRGEMM-based execution of the tensor contraction ``trus,pqtu->pqrs``.
   :name: code:ex_cont5_pseudo

   #pragma omp parallel for collapse(2)
   for p in |p|
     for r in |r|
       BRGEMM( A      = in0[0][r],
               B      = in1[p][0],
               C      = out[p][0][r],
               m      = |s|,
               n      = |q|,
               k      = |u|,
               ldA    = |s|,
               ldB    = |t| * |u|,
               ldC    = |r| * |s|,
               brSize = |t|,
               brStrA = |r| * |u| * |s|,
               brStrB = |u| )
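
A runnable Python sketch of this schedule is given below.
The ``brgemm`` function is a hypothetical stand-in for the primitive with an assumed column-major convention, and the parallel loops are replaced by a sequential loop in reversed order to show that the iterations do not depend on each other.

.. code-block:: python

   from itertools import product

   P, Q, R, S, T, U = 2, 3, 2, 4, 3, 2
   in0 = [(3 * i) % 7 for i in range(T * R * U * S)]    # flat in0[t][r][u][s]
   in1 = [(5 * i) % 11 for i in range(P * Q * T * U)]   # flat in1[p][q][t][u]
   out = [0] * (P * Q * R * S)                          # flat out[p][q][r][s]


   def brgemm(a, a_off, b, b_off, c, c_off, m, n, k, lda, ldb, ldc,
              br_size, br_str_a, br_str_b):
       """Batch-reduce GEMM: C += sum over br of A_br * B_br (column-major)."""
       for br in range(br_size):
           a_i = a_off + br * br_str_a
           b_i = b_off + br * br_str_b
           for j in range(n):
               for kk in range(k):
                   for i in range(m):
                       c[c_off + i + j * ldc] += (a[a_i + i + kk * lda]
                                                  * b[b_i + kk + j * ldb])


   # Any visiting order of the (p, r) pairs is valid; reversed order is used here.
   for p, r in reversed(list(product(range(P), range(R)))):
       brgemm(in0, r * U * S,               # start of in0[0][r]
              in1, p * Q * T * U,           # start of in1[p][0]
              out, p * Q * R * S + r * S,   # start of out[p][0][r]
              m=S, n=Q, k=U, lda=S, ldb=T * U, ldc=R * S,
              br_size=T, br_str_a=R * U * S, br_str_b=U)

   # Reference: the scalar six-loop formulation of the same contraction.
   ref = [0] * len(out)
   for p, q, r, s, t, u in product(range(P), range(Q), range(R),
                                   range(S), range(T), range(U)):
       ref[((p * Q + q) * R + r) * S + s] += (
           in0[((t * R + r) * U + u) * S + s]
           * in1[((p * Q + q) * T + t) * U + u])

Here ``out`` equals ``ref`` even though the outer iterations run in a different order than in the pseudocode.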

Optimization Passes
-------------------