Introduction

The following document provides an overview of the TT-MLIR project, with a focus on the technical specifications of an MLIR-based compiler stack. So what exactly is an MLIR-based compiler stack? MLIR (Multi-Level Intermediate Representation) is a subproject coming out of the LLVM Project. It introduces extensibility and sustainable code design to a modular compiler framework. In essence, it takes a much larger, more involved compiler (like LLVM) and splits it into sub-compilers that each produce their own Intermediate Representation (IR) of what you've fed the compiler.

Disclaimer: This is intended to be a working document; if you find something incorrect or incomplete, please feel free to create a PR.

Motivations

The idea of having a multi-level IR might not seem so far-fetched; in fact, it resembles some of our current software stacks. Going from a high-level TVM graph → lowered PyBUDA graph → netlist, with each layer having its own level of optimizations, is a familiar concept. However, there are problems with the reusability and integration of optimizations in the current software compiler stack. Currently, users are almost forced to choose between top-down and bottom-up optimization, with both requiring expert-level knowledge to reach the desired performance. Developing two entirely different projects is taxing, and it's hard to translate the benefits of BUDA over to metal (or the other way around). One of the primary goals of tt-mlir is to enable a consistent programming model between software stacks: concepts for improving optimizations in the compiler stack should carry over 1:1 to hand-written TTNN.

The benefits grow even further when one considers all the possible entry points that multiple IRs present. Existing MLIR-based projects like OpenXLA and torch-mlir can natively output MLIR in a dialect that can be translated to the TTIR dialect as well!

What is MLIR and why use it?

MLIR is a compiler infrastructure that is designed to be modular and extensible. The main benefits the tt-mlir project hopes to gain by using MLIR include:

  • Industry Standard Compiler Framework
    • Lots of boilerplate algorithms, data structures, and useful software that is common to compiler development
  • Ecosystem
    • Hook into existing front-end MLIR projects
  • Testing framework
    • A battle-tested test infrastructure that will enable us to write fine grained tests and rely less on end-to-end testing
    • Common IR Serialization Format that's easy to test, debug, and edit

Additional documentation to highlight the benefits of MLIR can be found here:

MLIR: Overview

MLIR is, at its root, an infrastructure that can parse "readable" text in some .mlir format. The unique properties lie in the modularity of the parsing itself. MLIR is built upon a collection of Dialects, and each of these Dialects defines a collection of Operations, Types, and Attributes. These dialects follow their own syntax, and they can encode any amount of information. The benefit is that MLIR provides bindings and hooks such that a user can directly translate these IRs into usable artifacts for that layer of complexity. An example of this would be the relatively high-level TOSA Dialect, which is used to represent computation over tensors, and then lowering that to a more hardware-specific dialect that closely models the programming model of the hardware or underlying backend. It is the dialect system itself that powers the multi-level functionality of MLIR: with different dialects, a user can essentially "lower" through their software stack by transforming between the dialects for each layer. Dialects range from purely mathematical dialects, to a LinAlg Dialect, to a TensorFlow Dialect defined for ML graphs. Each dialect encodes its own information, and its operations can use the Types/Attributes of other dialects as parameters. Multiple dialects can coexist in one module, and this is encouraged to highlight the optimizations of different dialects. In our use case for the TT stack, MLIR acts as a "mid-level" compiler which makes the task of joining together various entry points and backends much simpler.
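
To make this concrete, here is a minimal sketch (illustrative only, not taken from the tt-mlir codebase) of how a .mlir file can be loaded programmatically through MLIR's C++ API; the file name and the single registered dialect (the builtin func dialect) are assumptions for the example:

// Illustrative sketch; assumes an MLIR development installation.
#include "mlir/Dialect/Func/IR/FuncOps.h"
#include "mlir/IR/BuiltinOps.h"
#include "mlir/IR/MLIRContext.h"
#include "mlir/Parser/Parser.h"

int main() {
  mlir::MLIRContext context;
  // Register the dialects we expect to see in the input text.
  context.getOrLoadDialect<mlir::func::FuncDialect>();

  // Parse the textual IR into an in-memory module we can inspect or transform.
  mlir::OwningOpRef<mlir::ModuleOp> module =
      mlir::parseSourceFile<mlir::ModuleOp>("input.mlir", &context);
  if (!module)
    return 1; // Parsing failed, e.g. unknown dialect or malformed IR.

  module->dump(); // Print the parsed module back out.
  return 0;
}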

MLIR Primitives

So what does MLIR look like, how does it work and get parsed? The hierarchy of an MLIR Module is as shown:

#any_device = #tt.operand_constraint<dram|l1|scalar|tile|any_device|any_device_tile>
module attributes {tt.system_desc = #tt.system_desc<[<#tt.arch<wormhole_b0>, #tt.grid<8x8>>], [0], [<pcie|host_mmio>], [<0, 0, 0, 0>]>} {
  func.func @forward(%arg0: tensor<64x128xf32>, %arg1: tensor<64x128xf32>) -> tensor<64x128xf32> {
    %0 = tensor.empty() : tensor<64x128xf32>
    %1 = "ttir.multiply"(%arg0, %arg1, %0) <{operandSegmentSizes = array<i32: 2, 1>, operand_constraints = [#any_device, #any_device, #any_device]}> : (tensor<64x128xf32>, tensor<64x128xf32>, tensor<64x128xf32>) -> tensor<64x128xf32>
    return %1 : tensor<64x128xf32>
  }
}
  • Attributes (defined using #)

    • The syntax of actually creating an attribute is modular, and custom assembly instructions for different attributes can be applied.
  • Operations

    • Operations are prefixed with their dialect name followed by a ., so you'll see examples like func.func or tensor.empty. Each operation also provides its own assembly instructions and often strictly defines the type of its result.

    • Quotes are added around ttir.multiply since it's part of a custom dialect, and more custom assembly instructions are applied to specify operand_constraints.

    • Operations typically have operands (arguments) and results, which are prefixed with %; these results and operands help to show the relationships between operations.

  • Types

    • Types are shown as data formats throughout this compiled MLIR module; tensor and array are some examples.

    • They help to demonstrate the transformation of information and its representation as it is processed across this module.

MLIR Workflow

The overall MLIR workflow doesn't involve writing .mlir files by hand, and not necessarily even modifying them. The Intermediate Representations are truly just representations: we can parse them to demonstrate what the graph looks like at that stage of optimization, or run a pass over them to optimize certain functions. The overall framework is designed with the following architecture in mind:

  1. Graph Information exists

  2. Graph information is transformed (by whatever method) into a high-level MLIR representation

  3. Passes are run on the high-level implementation to lower into TTIR, a common IR that can be lowered into multiple backends

  4. Depending on the usecase more passes are run to lower to whatever backend the user would like (ex: TTNN Backend)

What are Passes?

Transformations in MLIR are represented as passes that run over an MLIR module. These passes can be executed when parsing or generating MLIR modules. They can serve a myriad of purposes, and how they modify the module is completely user defined. Some passes are for lowering purposes, as mentioned before: a dialect is parsed and each operation is transformed into a lowered dialect following some set of user-defined rules. In the context of this project, passes are also used for optimizations and backend code transformation. They're a powerful tool that provides most of the functionality for transforming between layers of dialects, and they offer a simple platform for modifying an MLIR module.
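
As a rough illustration, here is a minimal sketch of a pass using the standard MLIR C++ API; the pass name and behaviour are hypothetical and this is not an actual tt-mlir pass. A pass is typically a class whose runOnOperation() walks or rewrites the IR it is anchored on:

// Illustrative sketch; not part of tt-mlir.
#include "mlir/IR/BuiltinOps.h"
#include "mlir/Pass/Pass.h"

namespace {
struct ExamplePass
    : public mlir::PassWrapper<ExamplePass,
                               mlir::OperationPass<mlir::ModuleOp>> {
  void runOnOperation() override {
    mlir::ModuleOp module = getOperation();
    // Walk every operation in the module; a real lowering pass would
    // replace ops from one dialect with equivalent ops from another.
    module.walk([](mlir::Operation *op) {
      // Inspect or rewrite `op` here.
    });
  }
};
} // namespace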

Why not make our own?

Now that I've described the functionality of the MLIR framework, it might seem like making an in-house multi-level Intermediate Representation system would be pretty similar, so why are we going through the effort of adopting this framework?

One of the biggest reasons is the active developer community surrounding the project: being a part of the LLVM Project means that there is solid developer support, and the framework is designed to be a tool for many different paradigms of compute. This scalability and strong mission statement make MLIR a solid platform to use as a middle layer in our compiler stack. Furthermore, as a functional benefit of being part of a larger open source project, MLIR has a whole library of tests and infrastructure that we can leverage for solid code health while starting a new project.

Automation

It's not only about developer support; another key benefit of MLIR is that it's built with autogeneration in mind. Through TableGen, a lot of the boilerplate of creating this multi-level IR is abstracted away, letting us truly focus on implementation and execution (see the TableGen op definition in the "Adding an Op" section for an example). This automation is built on top of a pre-existing robust framework with many implementations and support from other large players in the ML scene. By integrating with these automation pipelines, we give external developers a much simpler entry point into our software stack!

TT-MLIR: Bringing MLIR to the TT Stack

Now that we have defined this pretty cool project, let's look at the implementation details of bringing MLIR (and related optimizations) into the TT stack. Since it acts as a mid-level compiler, we can start by defining the "bottom" and "top" layers of the compiler. BUDA already has a well-defined set of frontend optimizations over some TVM-defined graph and is knowledgeable of the hardware that these models want to run on. We want to interrupt the BUDA stack to give us only the frontend-compiled graph, before any hardware-specific lowering occurs. What this produces is information that is agnostic to the different backends and their execution on TT hardware, but is still valid information to optimize at different levels for later compilation. The "bottom" of our graph is now defined as the backend that will produce the machine-specific code to be executed. While MLIR could allow for any level of complexity downwards for the bottom, we will define a very aggressive TTNN backend for the MVP.

Desired Optimization List:

  • Forge-FE (frontend)

    • Graph Optimizations, Constant Folding, Operation Fusion
  • TT-MLIR (mid-level)

    • Data Storage, Memory Configuration, Grid Configuration
  • TT-NN (backend)

    • Kernel Configuration*, Network Optimization

*Subject to Change / Be Moved to TT-MLIR

TT-MLIR Dialects

Now that we have defined the series of optimizations that we would like to see implemented in TT-MLIR, we can begin to define the dialects that support these different levels of optimization. For more detail on each of these dialects, please refer to the GitHub Wiki and TableGen descriptors. I think that Nick does a great job of documenting the key functionality.

TT Dialect

The TT Dialect contains only the common Types and Attributes used throughout the many levels of the mid-level compiler.

TTIR Dialect

The TTIR Dialect is defined as the common dialect for TT-MLIR; as such, it doesn't define anything hardware- or backend-specific. It lists out general actions that would take place on TT hardware, such as dispatch, layout, and kernel operations.

Generic Operation

This is one of two operations that are crucial for understanding the intended optimization characteristics of the TTIR Dialect. The generic operation dictates the actions taken to dispatch some instruction to TT hardware for execution. Parametrically, the operation consumes inputs, outputs, maps for reading the tensors, and access types for the memory. These parameters highlight the optimizations that can be performed at this level: changing the location of the memory, transposing using different access maps, or even changing the grid upon which the computation takes place. The operation also contains a block in which the exact behaviour of that operation is stored.

Layout Operation

The layout operation is key in describing the storage of memory throughout the execution graph. Layout determines the sharding spec, location of the memory, data types, and tile sizes of some tensor. While generic describes the dispatch for some data-wise transformation to take place, the data itself is laid out across the chip through the layout operation.

Both of these operations describe the key functionality of the TTIR dialect and the optimization space that it provides.

Built-in MLIR Dialects

The functionality of the TT-MLIR dialects also depends on / is inspired by the functionality of built-in MLIR dialects like Affine and LinAlg. Below are summaries of some of the key members of these dialects.

Affine Dialect

[Reference] Affine maps help to describe transformations on coordinate systems. If that doesn't immediately make sense, imagine trying to index a rank-2 tensor. By getting t[x, y] I can access the element in the Xth row and Yth column, but if I wanted to transpose the tensor I might have to re-lay out the entire tensor so that the data would be accessible using t[x, y] to get the element in the Yth row and Xth column. This transpose can instead be represented using an affine map that transforms (x, y) -> (y, x), which lets the tensor data remain in place while the access method is modified. This extends even further to more complex transformations, so that stride lengths or unique indexing methods can be implemented without complicated data manipulation.
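
As a small sketch (illustrative only, using the builtin MLIR C++ API rather than anything tt-mlir specific), the transpose access pattern above can be built as a permutation affine map:

// Illustrative sketch; assumes an MLIR development installation.
#include "llvm/ADT/SmallVector.h"
#include "mlir/IR/AffineMap.h"
#include "mlir/IR/MLIRContext.h"

int main() {
  mlir::MLIRContext context;
  // Permutation {1, 0} yields the map (d0, d1) -> (d1, d0), i.e. a transpose
  // of the access pattern without moving any tensor data.
  llvm::SmallVector<unsigned> permutation = {1, 0};
  mlir::AffineMap transpose =
      mlir::AffineMap::getPermutationMap(permutation, &context);
  transpose.dump(); // Prints: (d0, d1) -> (d1, d0)
  return 0;
}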

Tensor Dialect

[Reference] The tensor dialect defines the functionality and Type of the fundamental Tensor. This dialect contains members that represent the manipulation and representation of tensors as multi-dimensional data with shapes and datatypes. Not much else is different about this dialect; the reference covers the key topics if implementation details are needed.

Func Dialect

[Reference]

TOSA Dialect

[Reference]

SCF Dialect

[Reference]

EmitC Dialect

[Reference]

TT-Explorer - Performance Optimization Tool

A unique project related to TT-MLIR is the integration of performance optimization tools, so that users can easily visualize and readily tune their models without needing an expert-level understanding of the tech stack. TT-Explorer is built with Google AI's Model Explorer as the base for the visualization tool, plus a custom adapter to parse TT-MLIR projects. This allows users to readily tune their models and optimize at the TTIR layer (e.g., they can change certain memory to be laid out in L1 instead of DRAM, or change the grid layout of an operation to be larger than what was previously assigned). After compilation with these overrides, the runtime information can then be fed directly into a Tracy performance analysis for the user to visualize the impacts of their tuning, seeing which operations were least performant and continuing in a gamified design loop of iterative performance tuning!

Building

These are the steps required to get the TT-MLIR project running on your machine.

Please refer to the Dependencies section before building the project.

Environment setup

You only need to build this once; it builds LLVM, flatbuffers, and a Python virtual environment.

cmake -B env/build env
cmake --build env/build
  • It is recommended to use the system installation of python3 for the virtual environment. Please ensure that you do not already have a venv activated before running the above command.
  • Please ensure the directory /opt/ttmlir-toolchain exists and its owner is the current user, i.e. the one that executes the above cmake commands. The commands to create it and assign the proper ownership are:
    sudo mkdir -p /opt/ttmlir-toolchain
    sudo chown -R $USER /opt/ttmlir-toolchain
    

Build

source env/activate
cmake -G Ninja -B build -DCMAKE_BUILD_TYPE=Release -DCMAKE_C_COMPILER=clang-17 -DCMAKE_CXX_COMPILER=clang++-17
cmake --build build
  • To enable the ttnn/metal runtime add -DTTMLIR_ENABLE_RUNTIME=ON
  • Clang 17 is the minimum required version when enabling the runtime.
  • To enable the ttnn/metal perf runtime add -DTT_RUNTIME_ENABLE_PERF_TRACE=ON
  • To accelerate the builds with ccache use -DCMAKE_CXX_COMPILER_LAUNCHER=ccache
  • To accelerate builds further, if python bindings aren't needed, add -DTTMLIR_ENABLE_BINDINGS_PYTHON=OFF. For some reason the python bindings link step is very slow.
  • TTNN build is automatically integrated / handled by tt-mlir cmake build system. For debugging and further information regarding the TTNN backend build step, please refer to TTNN Documentation.
  • The runtime build step depends on the ARCH_NAME environment variable, which is set in the env/activate script. If you want to build the runtime for a different architecture, please set ARCH_NAME to the desired value (one of grayskull, wormhole_b0, or blackhole). Please note that the runtime is built only if TTMLIR_ENABLE_RUNTIME=ON.
  • In addition to ARCH_NAME, the runtime build depends on the TT_METAL_HOME variable, which is also set in the env/activate script. For more information, please refer to the TT-NN and TT-Metalium installation documentation.
OS | Offline Compiler Only | Runtime Enabled Build | Runtime + Perf Enabled Build
Ubuntu 22.04
Ubuntu 20.04
MacOS

Test

source env/activate
cmake --build build -- check-ttmlir

llvm-lit

Under the hood the check-ttmlir cmake target is running llvm-lit. With it you can:

# Query which tests are available
llvm-lit -sv ./build/test --show-tests

# Run an individual test:
llvm-lit -sv ./build/test/ttmlir/Dialect/TTIR/test_allocate.mlir

# Run a sub-suite:
llvm-lit -sv ./build/test/ttmlir/Dialect/TTIR

See the full llvm-lit documentation for more information.

Lint

source env/activate
cmake --build build -- clang-tidy

Pre-Commit

Pre-Commit applies a git hook to the local repository such that linting is checked and applied on every git commit action. Install from the root of the repository using:

source env/activate
pre-commit install

If you have already committed before installing the pre-commit hooks, you can run on all files to "catch up":

pre-commit run --all-files

For more information visit pre-commit

Docs

source env/activate
cmake --build build -- docs
mdbook serve build/docs
  • mdbook can be installed with the system's package manager.
  • mdbook serve will by default create a local server at http://localhost:3000.

Note: If you want to build the docs on MacOS, there are two extra dependencies: doxygen and graphviz.

Both can be installed using Homebrew by running the following commands:

brew install doxygen
brew install graphviz

Dependencies

Ubuntu Common

Make sure to have Git LFS installed. You can install it with the following command:

sudo apt-get install git-lfs

Ubuntu 22.04

On Ubuntu 22.04 we need to install clang and ninja, and to update the version of cmake, because 3.20 is the minimum version required for this project.

sudo apt update
sudo apt upgrade

sudo apt install clang-17
sudo apt install ninja-build

sudo apt remove cmake -y
pip3 install cmake --upgrade
hash -r

Ensure cmake can be found in the path pip installed it to, e.g. PATH=$PATH:$HOME/.local/bin

Then run the following command to see the cmake version, which should be newer than 3.20:

cmake --version


MacOS

On MacOS we need to install the latest versions of cmake and ninja, which can be done using Homebrew (docs for installing Homebrew: https://brew.sh):

brew install cmake
brew install ninja

Common Build Errors

TTMLIRPythonCAPI target requires changing an RPATH

CMake Error at /opt/ttmlir-toolchain/lib/cmake/llvm/AddLLVM.cmake:594 (add_library):
  The install of the TTMLIRPythonCAPI target requires changing an RPATH from
  the build tree, but this is not supported with the Ninja generator unless
  on an ELF-based or XCOFF-based platform.  The
  CMAKE_BUILD_WITH_INSTALL_RPATH variable may be set to avoid this relinking
  step.

If you get the above error, it means you tried to build with an old version of cmake or ninja and there is a stale file. To fix this, rm -rf your build directory, install a newer version of cmake/ninja, and then rebuild. If you installed ninja via sudo apt install ninja-build, it might still not be up to date (v1.10.0). You may use the ninja from the python virtual environment, or install it via pip3 install -U ninja; either way, version 1.11.1.git.kitware.jobserver-1 should work.

clang++ is not a full path and was not found in the PATH

CMake Error at CMakeLists.txt:2 (project):
  The CMAKE_CXX_COMPILER:
    clang++
  is not a full path and was not found in the PATH.
  Tell CMake where to find the compiler by setting either the environment
  variable "CXX" or the CMake cache entry CMAKE_CXX_COMPILER to the full path
  to the compiler, or to the compiler name if it is in the PATH.
CMake Error at CMakeLists.txt:2 (project):
  The CMAKE_C_COMPILER:
    clang
  is not a full path and was not found in the PATH.
  Tell CMake where to find the compiler by setting either the environment
  variable "CC" or the CMake cache entry CMAKE_C_COMPILER to the full path to
  the compiler, or to the compiler name if it is in the PATH.

If you get the above error, it means you need to install clang, which you can do with sudo apt install clang on Ubuntu.

sfpi, trisc, ncrisc build failure

tt-forge-fe/third_party/tt-mlir/third_party/tt-metal/src/tt-metal/tt_metal/third_party/sfpi/compiler/bin/riscv32-unknown-elf-g++: 1: version: not found
tt-forge-fe/third_party/tt-mlir/third_party/tt-metal/src/tt-metal/tt_metal/third_party/sfpi/compiler/bin/riscv32-unknown-elf-g++: 2: oid: not found
size: '1961632': No such file
size: '1961632': No such file
size: '1961632': No such file
Always | FATAL | ncrisc build failed

If you got the above error, it means that the SFPI build (or a similar component build) failed. First, make sure you have Git LFS set up (e.g. sudo apt-get install git-lfs). Then, try to pull the SFPI submodule manually:

cd third_party/tt-metal/src/tt-metal
git submodule foreach 'git lfs fetch --all && git lfs pull'

Then, try to build again.

Common Runtime Errors

Debugging python on macOS

When debugging python on macOS via lldb you may see an error like:

(lldb) r
error: process exited with status -1 (attach failed (Not allowed to attach to process.  Look in the console messages (Console.app), near the debugserver entries, when the attach failed.  The subsystem that denied t
he attach permission will likely have logged an informative message about why it was denied.))

For preinstalled macOS binaries you must manually codesign with debug entitlements.

Create file debuggee-entitlement.xml:

<?xml version="1.0" encoding="UTF-8"?>
<!DOCTYPE plist PUBLIC "-//Apple//DTD PLIST 1.0//EN" "http://www.apple.com/DTDs/PropertyList-1.0.dtd">
<plist version="1.0">
<dict>
        <key>com.apple.security.cs.disable-library-validation</key>
        <true/>
        <key>com.apple.security.get-task-allow</key>
        <true/>
</dict>
</plist>

Sign the binary:

sudo codesign -f -s - --entitlements debuggee-entitlement.xml /opt/ttmlir-toolchain/venv/bin/python

Internal Build Notes / IRD

  • When building the runtime we must use Ubuntu 22.04 docker image
    • When making an IRD reservation use --docker-image yyz-gitlab.local.tenstorrent.com:5005/tenstorrent/infra/ird-ubuntu-22-04-amd64:latest
  • You'll have to manually install a newer version of cmake, at least 3.22; the easiest way to do this is to pip install cmake and make sure this one is in your path
  • You'll want the LLVM installation to persist across IRD reservations; you can achieve this by:
    • mkdir /localdev/$USER/ttmlir-toolchain
    • When requesting an IRD use --volumes /localdev/$USER/ttmlir-toolchain:/opt/ttmlir-toolchain

Working with Docker Images

Components:

  • Dockerfile
  • Workflow for building Docker image
  • Project build using Docker image

Overview

We use Docker images to prepare the project environment and install dependencies, tooling, and the prebuilt toolchain. The project builds four Docker images:

  • Base image: tt-mlir-base-ubuntu-22-04 (Dockerfile.base)
  • CI image: tt-mlir-ci-ubuntu-22-04 (Dockerfile.ci)
  • Base IRD image: tt-mlir-base-ird-ubuntu-22-04 (Dockerfile.ird)
  • IRD image: tt-mlir-ird-ubuntu-22-04 (Dockerfile.ird)

The base image starts from a supported base image (Ubuntu 22.04) and installs the dependencies for the project build. From there we build the CI image, which contains the prebuilt toolchain and is used in CI to shorten the build time. The IRD images contain dev tools like GDB, vim, etc. and ssh, and are used in IRD environments.

During the CI Docker build, the project is built and tests are run to ensure that everything is set up correctly. If any dependencies are missing, the Docker build will fail.

Building the Docker Image using GitHub Actions

The GitHub Actions workflow Build and Publish Docker Image builds the Docker images and uploads them to GitHub Packages at https://github.com/orgs/tenstorrent/packages?repo_name=tt-mlir. We use the git SHA we build from as the tag.

Building the Docker Image Locally

To test the changes and build the image locally, use the following command:

docker build -f .github/Dockerfile.base -t ghcr.io/tenstorrent/tt-mlir/tt-mlir-base-ubuntu-22-04:latest .
docker build -f .github/Dockerfile.ci -t ghcr.io/tenstorrent/tt-mlir/tt-mlir-ci-ubuntu-22-04:latest .
docker build -f .github/Dockerfile.ird --build-arg FROM_IMAGE=base -t ghcr.io/tenstorrent/tt-mlir/tt-mlir-ird-base-ubuntu-22-04:latest .
docker build -f .github/Dockerfile.ird --build-arg FROM_IMAGE=ci -t ghcr.io/tenstorrent/tt-mlir/tt-mlir-ird-ubuntu-22-04:latest .

Using the Image in GitHub Actions Jobs

The GitHub Actions workflow Build in Docker uses a Docker container for building:

    container:
      image: ghcr.io/${{ github.repository }}/tt-mlir-ci-ubuntu-22-04:latest
      options: --user root

Tools

Currently, there are a few primary tools that are part of the ttmlir project:

  • ttmlir-opt: The ttmlir optimizer driver. This tool is used to run the ttmlir compiler passes on .mlir source files and is central to developing and testing the compiler.
  • ttrt: This tool is intended to be a swiss army knife for working with flatbuffers generated by the compiler. Its primary role is to inspect and run flatbuffer files.
  • tt-explorer: Visualizer tool for ttmlir-powered compiler results. It visualizes emitted .mlir files to display the compiled model, attributes, and performance results, and provides a platform for human-driven overrides to gamify model tuning.

ttmlir-opt

The ttmlir optimizer driver. This tool is used to run the ttmlir compiler passes on .mlir source files and is central to developing and testing the compiler.

Simple Test

./build/bin/ttmlir-opt --ttir-to-ttnn-backend-pipeline test/ttmlir/Dialect/TTNN/simple_multiply.mlir
# Or
./build/bin/ttmlir-opt --ttir-to-ttmetal-backend-pipeline test/ttmlir/Dialect/TTNN/simple_multiply.mlir

ttmlir-translate

The ttmlir-translate translation utility. Unlike the ttmlir-opt tool, which is used to run passes within the MLIR world, ttmlir-translate allows us to ingest something (e.g. code) into the MLIR world, and also to produce something (e.g. an executable binary, or even code again) from MLIR.

Generate C++ code from MLIR

# First, let's run `ttmlir-opt` to convert to proper dialect
./build/bin/ttmlir-opt --ttir-load-system-desc --ttir-layout --convert-ttir-to-ttnn --convert-ttnn-to-emitc test/ttmlir/Dialect/TTNN/simple_multiply.mlir -o c.mlir

# Now run `ttmlir-translate` to produce C++ code
./build/bin/ttmlir-translate -mlir-to-cpp c.mlir -allow-unregistered-dialect

Bonus: these two commands can be piped to avoid writing an .mlir file to disk, like so:

./build/bin/ttmlir-opt --ttir-load-system-desc --ttir-layout --convert-ttir-to-ttnn --convert-ttnn-to-emitc test/ttmlir/Dialect/TTNN/simple_multiply.mlir | ./build/bin/ttmlir-translate -mlir-to-cpp -allow-unregistered-dialect

Generate flatbuffer file from MLIR

# First run `ttmlir-opt` to convert to proper dialect
./build/bin/ttmlir-opt --ttir-to-ttnn-backend-pipeline test/ttmlir/Dialect/TTNN/simple_multiply.mlir -o ttnn.mlir

# Now run `ttmlir-translate` to produce flatbuffer file
./build/bin/ttmlir-translate --ttnn-to-flatbuffer ttnn.mlir -o out.ttnn

ttrt

This tool is intended to be a swiss army knife for working with flatbuffers generated by the compiler. Its primary role is to inspect and run flatbuffer files. It enables the running of flatbuffer files without a front-end runtime.

Building

source env/activate
cmake --build build -- ttrt
ttrt --help

Installing ttrt as python whls (coming soon)

  1. Download whls
  2. Create a python venv
python -m venv ttrt_env
source ttrt_env/bin/activate
  3. Install whls
pip install *.whl

Building runtime mode

Add the following flags when building the compiler

-DTTMLIR_ENABLE_RUNTIME=ON

If you are building with runtime mode on (-DTTMLIR_ENABLE_RUNTIME=ON), you will have to install the following packages when using ttrt:

pip install torch

Building perf mode

Add the following flags when building the compiler

-DTTMLIR_ENABLE_RUNTIME=ON
-DTT_RUNTIME_ENABLE_PERF_TRACE=ON

If you are building with perf mode on (-DTT_RUNTIME_ENABLE_PERF_TRACE=ON), you will have to install the following packages when using ttrt:

pip install torch
pip install loguru
pip install pandas
pip install seaborn
pip install graphviz
pip install pyyaml
pip install click

Generate a flatbuffer file from compiler

The compiler supports a pass to load a system descriptor to compile against. You can run this pass via ttmlir-opt.

  1. Build ttmlir
  2. Build ttrt (see building section on this page)
  3. Generate ttsys file from the system you want to compile for using ttrt. This will create a system_desc.ttsys file under ttrt-artifacts folder.
ttrt query --save-artifacts
  4. Use the ttmlir-opt tool in the compiler to feed in the system descriptor. See the ttmlir-opt documentation for more information on how to generate .mlir files.
./build/bin/ttmlir-opt --ttir-load-system-desc="path=/path/to/system_desc.ttsys" --ttir-to-ttnn-backend-pipeline test/ttmlir/Dialect/TTNN/simple_subtract.mlir -o ttnn.mlir
or (pass the path directly into ttir-to-ttnn-backend-pipeline)
./build/bin/ttmlir-opt --ttir-to-ttnn-backend-pipeline="system-desc-path=/path/to/system_desc.ttsys" test/ttmlir/Dialect/TTNN/simple_subtract.mlir -o ttnn.mlir
  5. Use the ttmlir-translate tool in the compiler to generate the flatbuffer executable. See the ttmlir-translate documentation for more information on how to generate flatbuffer files.
./build/bin/ttmlir-translate --ttnn-to-flatbuffer ttnn.mlir -o out.ttnn
  6. Run your test cases using ttrt
ttrt run /path/to/out.ttnn

Generate flatbuffer files using llvm-lit

There are already existing .mlir test cases under test/ttmlir/Silicon. You can use the llvm-lit tool to generate the corresponding ttnn and ttm files.

  1. Build ttmlir
  2. Build ttrt (see building section on this page)
  3. Generate ttsys file from the system you want to compile for using ttrt. This will create a system_desc.ttsys file under ttrt-artifacts folder.
ttrt query --save-artifacts
  4. Export this file in your environment using export SYSTEM_DESC_PATH=/path/to/system_desc.ttsys. When llvm-lit is run, it will query this variable and generate the ttnn and ttm files using this system. Optionally, you can also provide this manually when running llvm-lit.
  5. Generate your test cases. This will generate all your ttnn and ttm files under build/test/ttmlir/Silicon. ttnn files have a .ttnn file extension and ttmetal files have a .ttm extension.
cmake --build build -- check-ttmlir
  6. (Optional) If you have a single .mlir file (or a directory of custom .mlir files) that you created using the compiler, and you want to generate the corresponding ttnn and ttm files for it, you can run llvm-lit standalone on the path of your .mlir file or directory of .mlir files to generate the flatbuffer executables. You will have to make sure you add the correct llvm-lit configs into your .mlir file; see the section on adding llvm-lit config options inside a .mlir file to create flatbuffer binaries for more info. You must also make sure your .mlir test is found within the test/ttmlir/Silicon folder (and point lit to the build folder)!
llvm-lit -v ./build/test/ttmlir/Silicon
or
SYSTEM_DESC_PATH=/path/to/system_desc.ttsys llvm-lit -v ./build/test/ttmlir/Silicon
  7. Run your test cases using ttrt
ttrt run /path/to/test.ttnn
ttrt run /path/to/dir/of/flatbuffers

Adding llvm-lit config options inside a .mlir file to create flatbuffer binaries

Inside of your .mlir file, you can add certain config options that llvm-lit will use when running against that test case. For the purpose of generating flatbuffer executables, you can add --ttir-load-system-desc="path=%system_desc_path%", which tells llvm-lit to use the system desc found via the environment flag set by export SYSTEM_DESC_PATH=/path/to/system_desc.ttsys. You can also provide a custom path to a system desc file instead.

// RUN: ttmlir-opt --ttir-load-system-desc="path=%system_desc_path%" --ttir-implicit-device --ttir-layout --convert-ttir-to-ttnn %s  > %t.mlir
// RUN: FileCheck %s --input-file=%t.mlir
// RUN: ttmlir-translate --ttnn-to-flatbuffer %t.mlir > %t.ttnn

Adding new mlir test cases

You can copy your .mlir test file (with the appropriate llvm-lit config options for generating flatbuffer binaries) into test/ttmlir/Silicon. Then, follow the "Generate flatbuffer files using llvm-lit" section to generate the executables to run!

Versioning

ttrt and flatbuffers have a strict versioning check. When running a flatbuffer against ttrt, you have to make sure the flatbuffer was generated using the same version as ttrt (or vice versa). Major and minor versions are manually set using GitHub tags when releases are made. The patch version is the number of commits since the last major/minor tag.

vmajor.minor.patch

APIs

ttrt --help
ttrt read
ttrt run
ttrt query
ttrt perf
ttrt check

Command Line

There are different ways you can use the APIs under ttrt. The first is via the command line, as follows. All artifacts are saved under the ttrt-artifacts folder under the TT_MLIR_HOME environment variable. By default, all logging is printed to the terminal. You can specify a log file to dump output to.

read

Read sections of a binary file

ttrt read --help
ttrt read --section mlir out.ttnn
ttrt read --section cpp out.ttnn
ttrt read --section version out.ttnn
ttrt read --section system_desc out.ttnn
ttrt read --section inputs out.ttnn
ttrt read --section outputs out.ttnn
ttrt read --section all out.ttnn
ttrt read --section all out.ttnn --clean-artifacts
ttrt read --section all out.ttnn --save-artifacts
ttrt read --section all /dir/of/flatbuffers
ttrt read system_desc.ttsys
ttrt read --section system_desc system_desc.ttsys
ttrt read system_desc.ttsys --log-file ttrt.log

run

Run a binary file or a directory of binary files. Note: this requires a system with silicon and a runtime-enabled build (-DTTMLIR_ENABLE_RUNTIME=ON).

ttrt run --help
ttrt run out.ttnn
ttrt run out.ttnn --seed 0
ttrt run out.ttnn --init arange
ttrt run out.ttnn --identity
ttrt run out.ttnn --identity --rtol 1 --atol 1
ttrt run out.ttnn --clean-artifacts
ttrt run out.ttnn --save-artifacts
ttrt run out.ttnn --loops 10
ttrt run --program-index all out.ttnn
ttrt run --program-index 0 out.ttnn
ttrt run /dir/of/flatbuffers
ttrt run /dir/of/flatbuffers --loops 10
ttrt run /dir/of/flatbuffers --log-file ttrt.log

query

Query the system to obtain the system desc file (optionally store it to disk). Note: this requires a system with silicon and a runtime-enabled build (-DTTMLIR_ENABLE_RUNTIME=ON).

ttrt query --help
ttrt query --save-artifacts
ttrt query --clean-artifacts
ttrt query --save-artifacts --log-file ttrt.log

perf

Run performance mode on a binary file or a directory of binary files. Note: this requires a system with silicon and a runtime-enabled build (-DTTMLIR_ENABLE_RUNTIME=ON), plus a perf-enabled build (-DTT_RUNTIME_ENABLE_PERF_TRACE=ON). Note: you can collect host-only performance data via the --host-only flag; by default, both host- and device-side performance data are collected. Restriction: /dir/of/flatbuffers can only be used when collecting --host-only data (performance data is collected upon closing the device, so if we run a directory of flatbuffers we cannot get accurate device performance data since the device is only closed at the end of execution). Restriction: perf mode can (for now) only run on .mlir files that have a single function (func.func).

ttrt perf --help
ttrt perf out.ttnn
ttrt perf out.ttnn --clean-artifacts
ttrt perf out.ttnn --save-artifacts
ttrt perf out.ttnn --loops 10
ttrt perf --program-index all out.ttnn
ttrt perf --program-index 0 out.ttnn
ttrt perf --host-only out.ttnn
ttrt perf /dir/of/flatbuffers --host-only
ttrt perf /dir/of/flatbuffers --loops 10 --host-only
ttrt perf /dir/of/flatbuffers --log-file ttrt.log --host-only

check

Check a binary file or a directory of binary files against a system desc (by default, the host machine's). Note: this requires a system with silicon and a runtime-enabled build (-DTTMLIR_ENABLE_RUNTIME=ON).

ttrt check --help
ttrt check out.ttnn
ttrt check out.ttnn --system-desc /path/to/system_desc.ttsys
ttrt check out.ttnn --clean-artifacts
ttrt check out.ttnn --save-artifacts
ttrt check out.ttnn --log-file ttrt.log
ttrt check /dir/of/flatbuffers --system-desc /dir/of/system_desc

ttrt as a python package

The other way to use the APIs under ttrt is to import it as a library. This allows the user to use it in custom scripts.

Import ttrt as a python package

from ttrt.common.api import API

Setup API and register all features

API.initialize_apis()

Setup arguments

You can specify certain arguments to pass to each API, or use the default arguments provided.

args

This can be a dictionary of values to set inside your API instance. These are the same options as found via the command line. You can get the full list of supported arguments via each API's registered_args dictionary (e.g., API.Query.registered_args). Any argument not provided will be set to the default.

custom_args = API.Query.registered_args
custom_args["clean-artifacts"] = True
query_instance = API.Query(args=custom_args)

custom_args = { "clean-artifacts": True }
query_instance = API.Query(args=custom_args)

logging

You can specify a logging module to set inside your API instance. The rationale behind this is to support different instances of different APIs, each of which can be logged to a different file.

from ttrt.common.util import Logger

log_file_name = "some_file_name.log"
custom_logger = Logger(log_file_name)
read_instance = API.Read(logging=custom_logger)

artifacts

You can specify an artifacts directory to store all the generated metadata during the execution of any API run. This allows you to specify different artifact directories for different instances of APIs if you wish.

from ttrt.common.util import Artifacts

log_file_name = "some_file_name.log"
artifacts_folder_path = "/opt/folder"
custom_logger = Logger(log_file_name)
custom_artifacts = Artifacts(logging=custom_logger, artifacts_folder_path=artifacts_folder_path)
run_instance = API.Run(artifacts=custom_artifacts)

Execute API

Once all the arguments are set up, you can run your API instance with all your provided arguments. Note: APIs are stateless, so subsequent calls to the same API instance will not preserve previous call artifacts. For example, you can generate a new artifacts directory for subsequent runs if you wish to call the APIs multiple times.

query_instance()
read_instance()
run_instance()

Putting it all together

You can do interesting things by combining all the above features in your python script:

from ttrt.common.api import API
from ttrt.common.util import Logger
from ttrt.common.util import Artifacts

API.initialize_apis()

custom_args = API.Run.registered_args
custom_args["clean-artifacts"] = True
custom_args["save-artifacts"] = True
custom_args["loops"] = 10
custom_args["init"] = "randn"
custom_args["binary"] = "/path/to/subtract.ttnn"

log_file_name = "some_file_name.log"
custom_logger = Logger(log_file_name)

artifacts_folder_path = "/opt/folder"
custom_artifacts = Artifacts(logging=custom_logger, artifacts_folder_path=artifacts_folder_path)

run_instance = API.Run(args=custom_args, logging=custom_logger, artifacts=custom_artifacts)
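
# Finally, execute the configured instance (see the "Execute API" section above).
run_instance()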

FAQ

Flatbuffer version does not match ttrt version!

  • ttrt and flatbuffers have strict versioning that is checked during ttrt execution. You will have to generate a flatbuffer using the same version of ttrt (or vice versa). This means you might have to build on the same branch on which the flatbuffer was generated, or regenerate the flatbuffer using your current build.

System desc does not match flatbuffer!

  • Flatbuffers are compiled using a specific system desc (or default values if no system desc is provided). During runtime, the flatbuffer's system desc is checked against the current system to ensure the system being run on supports the flatbuffer that was compiled. If you get this error, you will have to regenerate the flatbuffer using the system you want to run on. See the "Generate a flatbuffer file from compiler" section for how to do this.

I just want to test and push my commit! What do I do!

  • Follow these steps (on both n150 and n300):
1. Build ttmlir (sample instructions - subject to change)
source env/activate
cmake -G Ninja -B build -DCMAKE_BUILD_TYPE=Release -DCMAKE_C_COMPILER=clang-17 -DCMAKE_CXX_COMPILER=clang++-17 -DCMAKE_CXX_COMPILER_LAUNCHER=ccache -DTTMLIR_ENABLE_RUNTIME=ON -DTT_RUNTIME_ENABLE_PERF_TRACE=ON
cmake --build build

2. Build ttrt (sample instructions - subject to change)
cmake --build build -- ttrt

3. Query system
ttrt query --save-artifacts

4. Export system desc file
export SYSTEM_DESC_PATH=/path/to/system_desc.ttsys (path dumped in previous command)

5. Generate test cases
cmake --build build -- check-ttmlir

6. Run test cases
ttrt run build/test/ttmlir/Silicon

7. (Optional) Run perf test cases
ttrt perf build/test/ttmlir/Silicon

tt-explorer

Visualizer tool for ttmlir-powered compiler results. It visualizes emitted .mlir files to display the compiled model, attributes, and performance results, and provides a platform for human-driven overrides to gamify model tuning.

For more information and installation details please check out the repository README at https://github.com/vprajapati-tt/tt-explorer?tab=readme-ov-file#tt-explorer

Flatbuffers

Flatbuffers are the binary serialization format used by TTMLIR and they currently come in a few flavors (designated by the file extension):

  • .ttsys: A system description file that is the mechanism for supplying target information to the compiler. These can be collected on a target machine and downloaded to a development machine to enable cross-compilation.
  • .ttnn: A compiled binary file intended to be loaded and executed by the TTNN backend runtime.
  • .ttb: A compiled binary file intended to be loaded and executed by the TTMetal backend runtime (Unsupported).

CI

Our CI infrastructure is currently hosted in the cloud. Cloud machines are used and linked as GitHub runners.

Key Words

Target Silicon (coming soon)

- 1:1 mapping to unique system-desc (this is because an n150 card can have different harvested rows)

Target Family

- product type (n150, n300)

Target Capabilities (coming soon)

- describes testable traits of Target Family
n150: {
    test params to use if running on n150
}
n300: {
    test params to use if running on n300
}

Test Capabilities (coming soon)

- set of target capabilities defined in the test
- test will populate certain parameters depending on the Target Family/Target Silicon it is running on

GitHub Runner CI Tags

Runner Use

There are 2 types of runner machines: builders build the project offline, and runners are silicon machines.

- builder
- runner

Runner Type

There are 2 runner types. Bare-metal runners are standalone machines, and virtual machines are Kubernetes pods.

- bare-metal
- virtual-machine

Architecture

Supported architectures

- wormhole_b0
- blackhole (coming soon)

Pipeline Type

Supported pipelines

- perf
- functional

Active

Defines whether a runner is in service or taken out of service for maintenance

- in-service
- out-of-service

Target Family

Supported configurations of machines

- n150
- n300
- t3000 (coming soon)
- tg (coming soon)
- tgg (coming soon)

Target Silicon (coming soon)

- silicon-n150-0 (0th row harvested)
- silicon-n150-1 (1st row harvested)
- silicon-n300-0-0 (0th row harvested on both chips)

Pipeline durations

- push: every push to main
- pr: every PR

CI Test Flow

1. GitHub runner
- build tt-mlir
- build ttrt
- upload artifacts

2. Silicon runner
- download tt-mlir / ttrt artifacts
- ttrt generate system desc
- llvm-lit runs all unit tests, including silicon ones, to generate flatbuffers (it will only generate the ones supported for that test file)
- ttrt runs generated flatbuffers

Adding a test

When adding a test, you can specify when the test should run and what values it should inherit. The test defines how it should run, not the infrastructure; the infrastructure will execute what the test defines. For now, if you specify nothing, it will run on all default parameters. Note: if you provide a target family, it will by default run on any target silicon machine. If you need a specific target silicon machine (e.g. one with the 1st row harvested), specify it in Target Silicon. Note: if you specify the perf pipeline, it will automatically run on a bare metal machine. Default parameters:

[Architecture]: [wormhole_b0]
[Pipeline]: [functional, perf]
[Target Family]: [n150, n300]
[Target Silicon]: []
[Duration]: [push]
Location: test/ttmlir/Silicon
File Type: .mlir
REQUIRES: [Architecture] [Pipeline] [Target Family] [Target Silicon] [Duration] (coming soon)
UNSUPPORTED: [Target Family] [Target Silicon] (coming soon)

Additional Reading

This section contains pointers to reading material that may be useful for understanding the project.

MLIR

  • https://llvm.org/docs/tutorial/MyFirstLanguageFrontend/index.html
  • https://mlir.llvm.org/docs/Tutorials/Toy/
  • https://www.jeremykun.com/2023/08/10/mlir-getting-started/
  • https://arxiv.org/pdf/2002.11054
  • https://ieeexplore.ieee.org/abstract/document/9370308

Dialects

Tablegen

LLVM Testing Framework Tools

Jax
Flatbuffer
Openxla Website
openxla
StableHLO

Contributor Covenant Code of Conduct

Our Pledge

We as members, contributors, and leaders pledge to make participation in our community a harassment-free experience for everyone, regardless of age, body size, visible or invisible disability, ethnicity, sex characteristics, gender identity and expression, level of experience, education, socio-economic status, nationality, personal appearance, race, religion, or sexual identity and orientation.

We pledge to act and interact in ways that contribute to an open, welcoming, diverse, inclusive, and healthy community.

Our Standards

Examples of behavior that contributes to a positive environment for our community include:

  • Demonstrating empathy and kindness toward other people
  • Being respectful of differing opinions, viewpoints, and experiences
  • Giving and gracefully accepting constructive feedback
  • Accepting responsibility and apologizing to those affected by our mistakes, and learning from the experience
  • Focusing on what is best not just for us as individuals, but for the overall community

Examples of unacceptable behavior include:

  • The use of sexualized language or imagery, and sexual attention or advances of any kind
  • Trolling, insulting or derogatory comments, and personal or political attacks
  • Public or private harassment
  • Publishing others' private information, such as a physical or email address, without their explicit permission
  • Other conduct which could reasonably be considered inappropriate in a professional setting

Enforcement Responsibilities

Community leaders are responsible for clarifying and enforcing our standards of acceptable behavior and will take appropriate and fair corrective action in response to any behavior that they deem inappropriate, threatening, offensive, or harmful.

Community leaders have the right and responsibility to remove, edit, or reject comments, commits, code, wiki edits, issues, and other contributions that are not aligned to this Code of Conduct, and will communicate reasons for moderation decisions when appropriate.

Scope

This Code of Conduct applies within all community spaces, and also applies when an individual is officially representing the community in public spaces. Examples of representing our community include using an official e-mail address, posting via an official social media account, or acting as an appointed representative at an online or offline event.

Enforcement

Instances of abusive, harassing, or otherwise unacceptable behavior may be reported to the community leaders responsible for enforcement at nsmith@tenstorrent.com or staylor@tenstorrent.com. All complaints will be reviewed and investigated promptly and fairly.

All community leaders are obligated to respect the privacy and security of the reporter of any incident.

Enforcement Guidelines

Community leaders will follow these Community Impact Guidelines in determining the consequences for any action they deem in violation of this Code of Conduct:

1. Correction

Community Impact: Use of inappropriate language or other behavior deemed unprofessional or unwelcome in the community.

Consequence: A private, written warning from community leaders, providing clarity around the nature of the violation and an explanation of why the behavior was inappropriate. A public apology may be requested.

2. Warning

Community Impact: A violation through a single incident or series of actions.

Consequence: A warning with consequences for continued behavior. No interaction with the people involved, including unsolicited interaction with those enforcing the Code of Conduct, for a specified period of time. This includes avoiding interactions in community spaces as well as external channels like social media. Violating these terms may lead to a temporary or permanent ban.

3. Temporary Ban

Community Impact: A serious violation of community standards, including sustained inappropriate behavior.

Consequence: A temporary ban from any sort of interaction or public communication with the community for a specified period of time. No public or private interaction with the people involved, including unsolicited interaction with those enforcing the Code of Conduct, is allowed during this period. Violating these terms may lead to a permanent ban.

4. Permanent Ban

Community Impact: Demonstrating a pattern of violation of community standards, including sustained inappropriate behavior, harassment of an individual, or aggression toward or disparagement of classes of individuals.

Consequence: A permanent ban from any sort of public interaction within the community.

Attribution

This Code of Conduct is adapted from the Contributor Covenant, version 2.0, available at https://www.contributor-covenant.org/version/2/0/code_of_conduct.html.

Community Impact Guidelines were inspired by Mozilla's code of conduct enforcement ladder.

For answers to common questions about this code of conduct, see the FAQ at https://www.contributor-covenant.org/faq. Translations are available at https://www.contributor-covenant.org/translations.

Project Structure

  • env: Contains the environment setup for building project dependencies, such as LLVM and Flatbuffers
  • include/ttmlir: Public headers for the TTMLIR library
    • Dialect: MLIR dialect interfaces and definitions, dialects typically follow a common directory tree structure:
      • IR: MLIR operation/type/attribute interfaces and definitions
      • Passes.[h|td]: MLIR pass interfaces and definitions
      • Transforms: Common MLIR transformations, typically invoked by passes
    • Target: Flatbuffer schema definitions. This defines the binary interface between the compiler and the runtime
  • lib: TTMLIR library implementation
    • CAPI: C API for interfacing with the TTMLIR library, note this is needed for implementing the python bindings. Read more about it here: https://mlir.llvm.org/docs/Bindings/Python/#use-the-c-api
    • Dialect: MLIR dialect implementations
  • runtime: Device runtime implementation
    • include/tt/runtime: Public headers for the runtime interface
    • lib: Runtime implementation
    • tools/python: Python bindings for the runtime, currently this is where ttrt is implemented
  • test: Test suite
  • tools/ttmlir-opt: TTMLIR optimizer driver

Namespaces

  • mlir: On the compiler side, we use the MLIR namespace for all MLIR types and operations and subnamespace for our dialects.
    • mlir::tt: Everything ttmlir related is underneath this namespace. Since we need to subnamespace under mlir, just mlir::tt seemed better than mlir::ttmlir which feels redundant.
      • mlir::tt::ttir: The TTIR dialect namespace
      • mlir::tt::ttnn: The TTNN dialect namespace
      • mlir::tt::ttmetal: The TTMetal dialect namespace
      • mlir::tt::ttkernel: The TTKernel dialect namespace
  • tt::runtime: On the runtime side, we use the tt::runtime namespace for all runtime types and operations.
    • tt::runtime::ttnn: The TTNN runtime namespace
    • tt::runtime::ttmetal: The TTMetal runtime namespace (not implemented)

Dialects Overview

Here is a brief overview of the dialects in the project; please refer to the individual dialect documentation for more details:

  • tt: Common types such as tt.tile, tt.layout, tt.grid, etc. and enums such as data formats, memory spaces, iterator types, etc.
  • ttir: A high level dialect that models the tensor compute graph on tenstorrent devices. Accepts tosa and linalg input.
    • ttir.generic: Generically describe compute work.
    • ttir.to_layout: Convert between different tensor memory layouts and transfer between different memory spaces.
    • tensor.pad: Pad a tensor with a value (i.e. for convs)
    • ttir.yield: return result memref of computation in dispatch region body, lowers to ttkernel.yield
    • ttir.kernel: lowers to some backend kernel
  • ttnn: A TTNN dialect that models ttnn API.
  • ttkernel: Tenstorrent kernel library operations.
    • ttkernel.noc_async_read
    • ttkernel.noc_async_write
    • ttkernel.cb_push_back
    • ttkernel.[matmul|add|multiply]: Computations on tiles in source register space, store the result in dest register space.
    • ttkernel.sfpu_*: Computations on tiles in dest register space using sfpu coprocessor.
  • ttmetal: Operations that dispatch work from host to device.
    • ttmetal.dispatch: Dispatch a grid of compute work.

Adding an Op

This guide will walk you through the process of adding a new Op end to end in tt-mlir; in this case we will be adding a matmul operation. Note that the matmul op was added as part of the same changeset as this guide, so it could be useful to reference the diff alongside this guide to see the changes in full.

This guide will cover the following steps:

1. Define the Op in the TTIR frontend dialect

We will start by defining the Op in the TTIR dialect. The TTIR Ops are defined in a tablegen file located at include/ttmlir/Dialect/TTIR/IR/TTIROps.td.

Tablegen is a domain-specific language for defining ops/types/attributes in MLIR and LLVM; these definitions constitute the dialect's Operation Definition Specification (ODS).

Here is an example of defining matmul in the TTIR dialect:

def TTIR_MatmulOp : TTIR_DPSOp<"matmul"> {
    let summary = "Matrix multiply operation.";
    let description = [{
      Matrix multiply operation.
    }];

    let arguments = (ins AnyRankedTensor:$a,
                         AnyRankedTensor:$b,
                         AnyRankedTensor:$output,
                         TT_OperandConstraintArrayAttr:$operand_constraints);

    let results = (outs AnyRankedTensor:$result);

    let extraClassDeclaration = [{
      MutableOperandRange getDpsInitsMutable() { return getOutputMutable(); }
    }];

    let hasVerifier = 1;
}

There are many things to break down here, starting from the top:

  • def in tablegen is used to define a concrete type; this will have a 1-1 mapping to a C++ generated class, and for this particular case the build will end up generating the file build/include/ttmlir/Dialect/TTIR/IR/TTIROps.h.inc.
  • It inherits from class TTIR_DPSOp; classes in tablegen don't define a concrete type, but rather an interface that augments or constrains inherited defs. TTIR_DPSOp is a class that defines the common attributes for all TTIR Ops that implement Destination Passing Style (DPS) semantics. DPS just means that the result tensor is passed as an argument to the operation, which will be critical for modeling buffer allocation / lifetimes. Note the 3rd argument AnyRankedTensor:$output.
  • Next we have a list of arguments. These arguments consist of a mixture of Types (i.e. AnyRankedTensor) and Attributes (i.e. TT_OperandConstraintArrayAttr). Read more about Types & Attributes here.
    • AnyRankedTensor is part of a tablegen standand library which type aliases to MLIR's builtin Tensor type, with the added constraint that the tensor has a static rank. As much as possible we want to use the builtin types and infrastructure provided by MLIR.
    • TT_OperandConstraintArrayAttr is a custom attribute that we have defined in the TT dialect. This attribute is used to specify constraints on the operands of the operation. For example, the TTIR_MatmulOp requires that the input tensors be in tile layout, this attribute captures this constraint.
  • Next we have a list of results in this case just 1, which aliases the output tensor. One drawback of DPS is that the result tensor and the output tensor will appear to have different SSA names in the IR, but they really alias the same object. This can make writing some passes more cumbersome.
  • Next we have extraClassDeclaration, which enables us to inject member functions, written directly in C++, into the generated class. We are doing this for this particular case in order to satisfy the DPS interface which requires an implementation for getting the mutatated output tensor.
  • Finally, we have hasVerifier = 1, this tells MLIR that we have a verifier function that will be called to validate the operation. This is a good practice to ensure that the IR is well formed.

We can now try building and opening the TTIROps.h.inc file to see the generated C++ code. We will actually get a linker error because hasVerifier = 1 automatically declared a verifier function, but we still need to implement it.

Let's head over to lib/Dialect/TTIR/IR/TTIROps.cpp and implement the verifier.

::mlir::LogicalResult mlir::tt::ttir::MatmulOp::verify() {
  ::mlir::RankedTensorType inputAType = getA().getType();
  ::mlir::RankedTensorType inputBType = getB().getType();
  ::mlir::RankedTensorType outputType = getOutput().getType();
  auto inputAShape = inputAType.getShape();
  auto inputBShape = inputBType.getShape();
  auto outputShape = outputType.getShape();
  if (inputAShape.size() < 2) {
    return emitOpError("Input A must be at least a 2D tensor");
  }
  if (inputBShape.size() < 2) {
    return emitOpError("Input B must be at least a 2D tensor");
  }
  if (inputAShape.size() != inputBShape.size()) {
    return emitOpError("Input A and B must have the same rank");
  }
  if (inputAShape.size() != outputShape.size()) {
    return emitOpError("Input A and B must have the same rank as the output");
  }
  if (inputAShape[inputAShape.size() - 1] !=
      inputBShape[inputBShape.size() - 2]) {
    return emitOpError("Input A and B must have matching inner dimensions");
  }
  if (outputShape[outputShape.size() - 2] !=
      inputAShape[inputAShape.size() - 2]) {
    return emitOpError("Output must have the same number of rows as input A");
  }
  if (outputShape[outputShape.size() - 1] !=
      inputBShape[inputBShape.size() - 1]) {
    return emitOpError(
        "Output must have the same number of columns as input B");
  }
  return success();
}

2. Define the Op in the TTNN backend dialect

Next we will define the Op in the TTNN dialect. TTNN Ops are defined in the same way, but in their respective set of dialect files. Refer to the previous section for details, the process is the same.

TTNNOps.td

def TTNN_MatmulOp : TTNN_NamedDPSOp<"matmul"> {
    let arguments = (ins AnyRankedTensor:$a,
                         AnyRankedTensor:$b,
                         AnyRankedTensor:$output);
    let results = (outs AnyRankedTensor:$result);

    let extraClassDeclaration = [{
      MutableOperandRange getDpsInitsMutable() { return getOutputMutable(); }
    }];

    let hasVerifier = 1;
}

TTNNOps.cpp

::mlir::LogicalResult mlir::tt::ttnn::MatmulOp::verify() {
  ::mlir::RankedTensorType inputAType = getA().getType();
  ::mlir::RankedTensorType inputBType = getB().getType();
  ::mlir::RankedTensorType outputType = getOutput().getType();
  auto inputAShape = inputAType.getShape();
  auto inputBShape = inputBType.getShape();
  auto outputShape = outputType.getShape();
  if (inputAShape.size() < 2) {
    return emitOpError("Input A must be at least a 2D tensor");
  }
  if (inputBShape.size() < 2) {
    return emitOpError("Input B must be at least a 2D tensor");
  }
  if (inputAShape.size() != inputBShape.size()) {
    return emitOpError("Input A and B must have the same rank");
  }
  if (inputAShape.size() != outputShape.size()) {
    return emitOpError("Input A and B must have the same rank as the output");
  }
  if (inputAShape[inputAShape.size() - 1] !=
      inputBShape[inputBShape.size() - 2]) {
    return emitOpError("Input A and B must have matching inner dimensions");
  }
  if (outputShape[outputShape.size() - 2] !=
      inputAShape[inputAShape.size() - 2]) {
    return emitOpError("Output must have the same number of rows as input A");
  }
  if (outputShape[outputShape.size() - 1] !=
      inputBShape[inputBShape.size() - 1]) {
    return emitOpError(
        "Output must have the same number of columns as input B");
  }
  return success();
}

3. Convert / Implement the Op in the TTNN passes

Next we will implement the conversion from the TTIR matmul Op to the TTNN matmul Op. This is a trivial conversion, as the Ops are identical in their semantics, so the changeset isn't going to be very instructive, but will at least point to the files involved. The conversion is implemented in the ConvertTTIRToTTNNPass pass in file lib/Conversion/TTIRToTTNN/TTIRToTTNNPass.cpp.

Zooming into class ConvertTTIRToTTNNPass we can see that we implement the pass interface via the member function void runOnOperation() final. This function will be called for every operation matching the type specified in the pass tablegen file. Taking a quick look at include/ttmlir/Conversion/Passes.td we can see:

def ConvertTTIRToTTNN: Pass<"convert-ttir-to-ttnn", "::mlir::ModuleOp"> {

This means that runOnOperation will be called for every ModuleOp in the graph; usually there is only one ModuleOp, which serves as the root of the graph.

Inside runOnOperation is usually where we define a rewrite pattern set that can match much more complicated patterns (nested inside of the ModuleOp's regions) than just a single operation. In the runOnOperation method you will see the call to populateTTIRToTTNNPatterns(...), which actually generates the rewrite patterns. populateTTIRToTTNNPatterns(...) is defined in lib/Conversion/TTIRToTTNN/TTIRToTTNN.cpp.

  patterns
      .add<TensorEmptyConversionPattern,
           ToLayoutOpConversionPattern,
           ElementwiseOpConversionPattern<ttir::AbsOp, ttnn::AbsOp>,
           ElementwiseOpConversionPattern<ttir::AddOp, ttnn::AddOp>,
           ElementwiseOpConversionPattern<ttir::SubtractOp, ttnn::SubtractOp>,
           ElementwiseOpConversionPattern<ttir::MultiplyOp, ttnn::MultiplyOp>,
           ElementwiseOpConversionPattern<ttir::GreaterEqualOp, ttnn::GreaterEqualOp>,
           ElementwiseOpConversionPattern<ttir::MaximumOp, ttnn::MaximumOp>,
           ElementwiseOpConversionPattern<ttir::NegOp, ttnn::NegOp>,
           ElementwiseOpConversionPattern<ttir::ReluOp, ttnn::ReluOp>,
           ElementwiseOpConversionPattern<ttir::SqrtOp, ttnn::SqrtOp>,
           ElementwiseOpConversionPattern<ttir::SigmoidOp, ttnn::SigmoidOp>,
           ElementwiseOpConversionPattern<ttir::ReciprocalOp, ttnn::ReciprocalOp>,
           ElementwiseOpConversionPattern<ttir::ExpOp, ttnn::ExpOp>,
           ElementwiseOpConversionPattern<ttir::DivOp, ttnn::DivOp>,
           ReductionOpConversionPattern<ttir::SumOp, ttnn::SumOp>,
           ReductionOpConversionPattern<ttir::MeanOp, ttnn::MeanOp>,
           ReductionOpConversionPattern<ttir::MaxOp, ttnn::MaxOp>,
           BroadcastOpConversionPattern,
           EmbeddingOpConversionPattern,
           SoftmaxOpConversionPattern,
           TransposeOpConversionPattern,
           ConcatOpConversionPattern,
           ReshapeOpConversionPattern,
           SqueezeOpConversionPattern,
           UnsqueezeOpConversionPattern,
           ConstantOpConversionPattern,
           MatmulOpConversionPattern,
           Conv2dOpConversionPattern,
           MaxPool2dOpConversionPattern
           >(typeConverter, ctx);

More information on rewrite patterns and their capabilities can be found in the MLIR documentation here and here.

For matmul, we defined a new conversion pattern that's generic to all binary ops with arguments named a and b:

class MatmulOpConversionPattern : public OpConversionPattern<ttir::MatmulOp> {
public:
  using OpConversionPattern<ttir::MatmulOp>::OpConversionPattern;

  LogicalResult
  matchAndRewrite(ttir::MatmulOp op, OpAdaptor adaptor,
                  ConversionPatternRewriter &rewriter) const override {
    rewriter.replaceOpWithNewOp<ttnn::MatmulOp>(
        op, this->getTypeConverter()->convertType(op.getType()), adaptor.getA(),
        adaptor.getB(), adaptor.getOutput());
    return success();
  }
};

Invoked as part of the rewrite set:

MatmulOpConversionPattern

Note:

We also need to add this op to the C++ emitter, lib/Conversion/TTNNToEmitC/TTNNToEmitC.cpp; see populateTTNNToEmitCPatterns(...).
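For reference, a minimal sketch of what that registration might look like. The DefaultOpConversionPattern helper and the exact signature of populateTTNNToEmitCPatterns are assumptions for illustration; consult TTNNToEmitC.cpp for the real pattern class and arguments.

// Hypothetical excerpt from lib/Conversion/TTNNToEmitC/TTNNToEmitC.cpp.
// DefaultOpConversionPattern and the function signature are illustrative only.
void populateTTNNToEmitCPatterns(mlir::MLIRContext *ctx,
                                 mlir::RewritePatternSet &patterns,
                                 mlir::TypeConverter &typeConverter) {
  // ... existing op patterns ...
  patterns.add<DefaultOpConversionPattern<mlir::tt::ttnn::MatmulOp>>(
      typeConverter, ctx);
}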

4. Add a unit test for the Op

So far we have defined the Op in the TTIR and TTNN dialects, implemented verifiers, and have conversion passes. Now we need to add a unit test to ensure that the pass is working correctly. The unit tests are located in test/ttmlir/Dialect area. In this case we'll add a test under the TTNN subdirectory since we are testing the ConvertTTIRToTTNNPass.

test/ttmlir/Dialect/TTNN/simple_matmul.mlir

// RUN: ttmlir-opt --ttir-load-system-desc --ttir-implicit-device --ttir-layout --convert-ttir-to-ttnn %s | FileCheck %s
#any_device_tile = #tt.operand_constraint<dram|l1|tile|any_device_tile>
// CHECK: #[[TILED_LAYOUT:.*]] = #tt.layout<(d0, d1) -> (d0, d1), undef, <1x1>, memref<2x4x!tt.tile<32x32, bf16>, #dram>, interleaved>
module attributes {} {
  func.func @forward(%arg0: tensor<64x128xbf16>, %arg1: tensor<128x96xbf16>) -> tensor<64x96xbf16> {
    %0 = tensor.empty() : tensor<64x96xbf16>
    // CHECK: %[[C:.*]] = "ttnn.matmul"[[C:.*]]
    %1 = "ttir.matmul"(%arg0, %arg1, %0) <{operand_constraints = [#any_device_tile, #any_device_tile, #any_device_tile]}> : (tensor<64x128xbf16>, tensor<128x96xbf16>, tensor<64x96xbf16>) -> tensor<64x96xbf16>
    return %1 : tensor<64x96xbf16>
  }
}

Unit tests in MLIR are typically written using a tool called FileCheck, please refer to the llvm FileCheck documentation for a tutorial and more information about the RUN and CHECK directives.

A few things to point out specifically regarding tt-mlir dialects:

  • tt.system_desc: This is a 1-1 mapping to the SystemDesc flatbuffer schema that is used to describe the system configuration. This is a required attribute tagged on the top level module for all tt-mlir dialects.
  • Pass --ttir-layout is a prerequisite before running convert-ttir-to-ttnn. This pass is responsible for converting the input tensors to device memory space and tile layout before lowering to TTNN.
  • This test is asserting that ttir.matmul converts to ttnn.matmul.

To run the test, you can use the following command:

cmake --build build -- check-ttmlir

You can also manually run ttmlir-opt on the test file to see the resulting output:

./build/bin/ttmlir-opt --ttir-layout --convert-ttir-to-ttnn test/ttmlir/Dialect/TTNN/simple_matmul.mlir

5. Define flatbuffer schema for the Op

Next we will define the flatbuffer schema for the Op. The schema must capture all tensor inputs, outputs, and attributes of the Op, i.e. everything the runtime needs to execute the Op.

include/ttmlir/Target/TTNN/program.fbs

table MatmulOp {
  in0: tt.target.TensorRef;
  in1: tt.target.TensorRef;
  out: tt.target.TensorRef;
}

Note the type TensorRef: flatbuffer tables with the suffix Ref are used to represent live values during the runtime, decoupled from the underlying Desc-suffixed tables, which carry the type and attribute information for the object.

We also add this new op to the union OpType, which is the variant type for all ops.

More information about writing flatbuffer schemas can be found in the flatbuffers documentation.

6. Serialize the Op in the flatbuffer format

In the previous section we defined the flatbuffer schema for the matmul Op, now let's put our new schema definition to use. The schema is used as input to a program called flatc which generates C++ code (or any language for that matter) for serializing and deserializing the schema. This generated code can be found in build/include/ttmlir/Target/TTNN/program_generated.h.

Let's head over to lib/Target/TTNN/TTNNToFlatbuffer.cpp to define a createOp overloaded function that does the conversion from MLIR to flatbuffer:

::flatbuffers::Offset<::tt::target::ttnn::MatmulOp>
createOp(FlatbufferObjectCache &cache, MatmulOp op) {
  auto in0 =
      cache.at<::tt::target::TensorRef>(getOperandThroughDPSOps(op.getA()));
  auto in1 =
      cache.at<::tt::target::TensorRef>(getOperandThroughDPSOps(op.getB()));
  auto output = cache.at<::tt::target::TensorRef>(
      getOperandThroughDPSOps(op.getResult()));
  return ::tt::target::ttnn::CreateMatmulOp(*cache.fbb, in0, in1, output);
}

Lots of things are happening here, let's break it down:

  • FlatbufferObjectCache: This is a helper class that is used to cache objects in the flatbuffer that are created during the serialization process. This is necessary for managing value lifetimes and identifiers, at the same time it is an optimization to avoid having multiple copies of the same object. For example, a TensorRef with multiple uses could naively be recreated, one for each use, but with the cache we can ensure that the object is only created once and all uses point to the same flatbuffer offset. The cache is passed around to all serialization functions and should be used whenever creating a new object.
  • getOperandThroughDPSOps: In section 1. we discussed DPS semantics and the drawback of having the result alias the output tensor. This is one of those cases where we need to use a helper function to trace through the output operands to find the original SSA name in order to associate it with the original TensorRef.
  • CreateMatmulOp: The autogenerated function from the flatbuffer schema that actually serializes the data into the flatbuffer format.

We can finally generate a binary with our new Op! We can use the following command:

./build/bin/ttmlir-opt --ttir-to-ttnn-backend-pipeline test/ttmlir/Dialect/TTNN/simple_matmul.mlir | ./build/bin/ttmlir-translate --ttnn-to-flatbuffer -o out.ttnn

And we can inspect the binary with ttrt:

ttrt read out.ttnn

7. Add runtime support for the Op

The final step is to add runtime support for the Op by parsing the flatbuffer and invoking the TTNN API.

runtime/lib/ttnn/program.cpp
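A minimal sketch of the runtime handler for MatmulOp, assuming a liveTensors map keyed by global_id (as described in the notes below); the tensorPool and the exact ::ttnn::matmul entry point are assumptions for illustration.

// Minimal sketch of a runtime handler for MatmulOp. The liveTensors map and
// the flatbuffer accessors (op->in0()->global_id(), etc.) mirror the notes
// below; the tensorPool and the ::ttnn::matmul call are illustrative only.
#include <cstdint>
#include <list>
#include <unordered_map>

static void run(::tt::target::ttnn::MatmulOp const *op,
                std::unordered_map<std::uint32_t, ::ttnn::Tensor *> &liveTensors,
                std::list<::ttnn::Tensor> &tensorPool) {
  // Inputs are looked up by the global_id assigned by the FlatbufferObjectCache
  // during serialization.
  ::ttnn::Tensor &lhs = *liveTensors.at(op->in0()->global_id());
  ::ttnn::Tensor &rhs = *liveTensors.at(op->in1()->global_id());
  // Invoke the TTNN API (illustrative call) and register the result so that
  // downstream ops can find it by the output's global_id.
  tensorPool.push_back(::ttnn::matmul(lhs, rhs));
  liveTensors.insert_or_assign(op->out()->global_id(), &tensorPool.back());
}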


A couple things to note from above:

  • Most runtime op functions will follow a similar pattern; they will take in some additional data structures for managing live tensors.
  • liveTensors.at(op->in0()->global_id()): global_id is a unique identifier for the tensor that was generated and managed by the FlatbufferObjectCache. This is how it's intended to be used by the runtime.

We can test our changes with ttrt (don't forget to rebuild ttrt):

ttrt run out.ttnn

Doxygen

This is a link to a doxygen autogenerated code reference. Doxygen

Build Instructions

To build the Doxygen documentation, use the doxygen target in CMake:

cmake -B build
cmake --build build -- doxygen

Specifications

Specifications are documents that define the requirements for features or concepts that are particularly cross-cutting, complex, or require a high degree of coordination and planning. They are intended to be living documents that evolve as the feature is developed and should be maintained as the go-to reference documentation for the feature or concept.

Specifications are written in markdown and are stored in the docs/src/specs directory of the repository. Below is a template that should be used when creating a new specification.

Specification Template

# [Title]

A brief description of the feature or concept that this specification is
defining.

## Motivation

A description of why this feature or concept is needed and what problem it is
solving. This section is best written by providing concrete examples and use
cases.

## Proposed Changes

A list of the components that will be impacted by this spec and a detailed
description of the changes that will be made to each respective component.

It should also call out any interactions between components and how they might
share an interface or communicate with each other.

## Test Plan

A brief description of how the feature or concept will be tested.

## Concerns

A list of concerns that have been identified during the design of this feature.

Runtime Stitching

Runtime stitching adds the ability for the runtime to stitch together multiple, independently compiled programs at runtime, i.e. without compiler knowledge of how the binary programs will be composed.

Motivation

In order to flexibly support arbitrary training schedules / composing multiple models together we want to have the ability for the runtime to stitch graphs together. To achieve this we need to define an ABI kind of interface between the compiler and the runtime.

Simple Example

mod_a = forge.compile(PyTorch_module_a)
mod_b = forge.compile(PyTorch_module_b)

for i in range(10):
    outs_a = mod_a(ins_a)
    outs_b = mod_b(outs_a)

mod_a and mod_b are 2 independent compile steps. During the compile step for mod_a it should be completely unaware that mod_b will take place, and vice-versa. In order to achieve this we propose a new runtime concept called stitching:

  • forge invokes the compile step for mod_a; the tt-mlir compiler determines where the inputs (ins_a) should live: host, device DRAM, or device L1. tt-mlir returns metadata to forge describing where it wants the tensors to reside before invoking flatbuffer submission.
  • forge invokes the compile step for mod_b; the same happens as in bullet 1.
  • mod_a is invoked at runtime; the forge runtime needs to inspect the compiler metadata to determine where the tensors should live. The runtime manually invokes a new data copy command to get the tensors to the correct memory space / correct memory address.
  • forge runtime invokes the mod_a program submit.
  • mod_b is invoked at runtime; this time it might be that the compiler left the tensor outputs in L1, so no data copy is needed to start running mod_b since the inputs are already in the correct location.

A more concrete usecase would be a training loop where there are often multiple graphs composed together (#82), or when we eventually support torch 2.0, where the torch runtime can arbitrarily break the graph anywhere.

Proposed Changes

Compiler Metadata

The compiler will encode the input tensor layout information directly into the flatbuffer tensor desc. The flatbuffer schema already exists to express this; we just need to start populating it instead of assuming a canonical host layout.

The compiler will decide where the tensors should live: host, device DRAM, or device L1.

Runtime

  • Runtime will inspect the tensor desc metadata to determine where the tensors need to end up / what layout they should be in before invoking the program.
  • New runtime API Tensor toLayout(Tensor tensor, ::tt::target::TensorDesc* tensorDesc);
  • Runtime will need to invoke toLayout on all input tensors before invoking the program (a sketch of this flow follows this list).
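A minimal sketch of how a host runtime might use the proposed API when stitching two programs. Only the toLayout signature comes from the proposal above; Binary, getInputDesc, and submit are placeholder names, not real runtime calls.

// Illustrative host-side stitching flow. Only toLayout is from the proposed
// API above; Binary, getInputDesc, and submit are hypothetical helpers.
std::vector<Tensor> runStitched(Binary &modA, Binary &modB,
                                std::vector<Tensor> insA) {
  // Move mod_a's inputs to wherever the compiler metadata says they must live.
  for (size_t i = 0; i < insA.size(); ++i)
    insA[i] = toLayout(insA[i], modA.getInputDesc(i));
  std::vector<Tensor> outsA = submit(modA, insA);

  // mod_b may have been compiled to expect its inputs where mod_a left its
  // outputs (e.g. L1), in which case toLayout degenerates to a no-op.
  for (size_t i = 0; i < outsA.size(); ++i)
    outsA[i] = toLayout(outsA[i], modB.getInputDesc(i));
  return submit(modB, outsA);
}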

Test Plan

  • Add a new test to the runtime gtest suite that verifies the runtime can correctly stitch together 2 independently compiled programs.

Concerns

  • Tensors that pass through device memory spaces (DRAM, L1) will have a dynamic address; some arbitrary run order of flatbuffers could cause tensors to end up in non-ideal locations in memory. Specifically in L1, a poorly placed tensor might not be able to be moved to a better location without a bounce through DRAM.

Tensor Layout

The tensor layout attribute captures how tensor data is sharded across a grid of devices, cores, and is laid out in memory.

Motivation / High level goals

  • Logical shapes: Keep the original tensor shape and rank intact and agnostic to underlying storage layout. Keeping the logical shapes not only makes some graph transformations vastly simpler, in particular convs, but it makes the lowered IR much easier to read and reason about. The original tensor shapes leave breadcrumbs that make it much easier to map back to the input representation.
  • Flexible sharding: Enable flexibility in choosing grid shape, to get better parallelization and avoid resharding. This is particularly important in cases where tensor shapes are not clean powers of two and would otherwise force our hand in choosing non-optimal grid shapes.
  • Logical-Physical Isomorphism: Encode this information with just a few attributes to enable derived conversions from logical to physical layout and back.
  • Explicit: A single source of truth.
  • Enable a direct way to query padded regions.

An Example / Walkthrough

Let's consider a snippet of MLIR:

tensor<2x3x64x128xf32>

Here we've defined a 4 dimensional tensor using MLIR's builtin tensor type. This tensor type has an optional attribute called an Encoding, this attribute has been used by the TT dialect to encode the tensor's layout. This looks like:

tensor<2x3x64x128xf32,
  #tt.layout<
    (d0, d1, d2, d3) -> (d0 * 192 + d1 * 64 + d2, d3),
    undef,
    <1x1>,
    memref<384x128xf32, #tt.memory_space<l1>>
  >
>

At the time of this writing there are 4 properties that make up a tensor layout:

  • linear: An affine map that defines how the logical tensor dimensions map to a grid shape. Note that the number of dims in the affine map must match exactly the rank of the original tensor, and the number of results must match exactly the rank of the grid shape.
  • oob_val: A tracked out of bounds value that fills padding space.
  • grid: The grid shape that this tensor is divided onto.
  • memref: A memref that describes the physical footprint allocation of the shard. It must also have a shape with rank equal to grid.

This example isn't particularly complicated because it's only sharded to a 1x1 grid. The rest of the document will go into more detail on the following topics:

  • Dimension Collapsing
  • Multi-core
  • Tilized
  • Padding
  • Memory Spaces
  • Multi-device

Before we jump into more advanced topics there are two resources that could be useful to have at hand:

  • test/python/tensor_layout.py: Python test with many convenience functions for creating and experimenting with tensor layouts.
  • TTNN Interactive Visualizer: An interactive visualization tool that demonstrates the transformation. Note that this tool was created for TTNN tensor layout, but many of the same concepts transfer over.

Dimension Collapsing

Probably the most important concept in tt.layout is dimension collapsing. This is captured by the affine map linear property which provides a mapping from tensor dim space to a reduced physical dimensional space. This single-handedly touches on most of the tensor layout goals mentioned at the beginning of the doc:

  • Leaves tensor shapes intact
  • Logical-Physical mapping, how the tensor is laid out in memory over a grid
  • Enables more flexible sharding
  • Explicit padding

To see how these goals are achieved we'll continue working on an explicit example, same one as above:

(d0, d1, d2, d3) -> (d0 * 192 + d1 * 64 + d2, d3)

To recap, we have our example 4d tensor (2, 3, 64, 128), which maps directly to the LHS (d0, d1, d2, d3). We have our 2d grid shape (1, 1); notice the affine-map RHS is also 2d, and this describes how tensor dims map to a lower dimensional physical memory, overlaid on a grid. We'll see how this gets divided onto the grid later, but first let's look at how this forms an affine-map iteration space. If we index our tensor at say [1, 1, 6, 100], we can simply plug in those numbers to get our remapped offset:

(1 * 192 + 1 * 64 + 6, 100) = (262, 100)

This remapped offset (262, 100) corresponds to the row and column index of the collapsed physical memory.
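The same index arithmetic, written out as a small standalone sketch; the remap function simply mirrors the affine map above.

// Mirrors (d0, d1, d2, d3) -> (d0 * 192 + d1 * 64 + d2, d3) for the
// 2x3x64x128 example tensor.
#include <array>
#include <cstdint>
#include <iostream>

std::array<int64_t, 2> remap(std::array<int64_t, 4> idx) {
  return {idx[0] * 192 + idx[1] * 64 + idx[2], idx[3]};
}

int main() {
  auto offset = remap({1, 1, 6, 100});
  std::cout << offset[0] << ", " << offset[1] << "\n"; // prints 262, 100
}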

By default, the dim range [0, -1) is collapsed, but the tt.layout constructor can actually take a programmable range called collapseIntervals. collapseIntervals is a list of pairs, where each pair is a dim range interval, left inclusive, right exclusive. Let's consider a few examples:

Instead of multiplying out real shapes, we will use <> to represent a dimension join operator.

  • 3D tensor onto a 2D grid and default collapseIntervals=[(0, -1)]:
(d0, d1, d2) -> (d0 <> d1, d2)
  • 4D tensor onto a 3D grid and collapseIntervals=[(1, -1)]:
(d0, d1, d2, d3) -> (d0, d1 <> d2, d3)
  • 4D tensor onto a 3D grid and collapseIntervals=[(0, 2)]:
(d0, d1, d2, d3) -> (d0 <> d1, d2, d3)
  • 7D tensor onto a 4D grid and collapseIntervals=[(0, 3), (-3, -1)]:
(d0, d1, d2, d3, d4, d5, d6) -> (d0 <> d1 <> d2, d3, d4 <> d5, d6)

Multi-core

Let's consider the original example again, but on a larger grid than 1x1, say 2x4:

tensor<2x3x64x128xf32,
  #tt.layout<
    (d0, d1, d2, d3) -> (d0 * 192 + d1 * 64 + d2, d3),
    undef,
    <2x4>,
    memref<192x32xf32, #tt.memory_space<l1>>
  >
>

The number of affine map results, grid shape, and memref shape all must have the same rank. We can see in this example that by changing the grid shape we also changed the memref shape; we can always calculate the memref shape by plugging the full tensor dims into our affine map and then dividing by the grid shape.

(d0, d1, d2, d3) -> (d0 * 192 + d1 * 64 + d2, d3),
(2 - 1, 3 - 1, 64 - 1, 128 - 1) = (1 * 192 + 2 * 64 + 63, 127) = (383, 127)

Above we actually subtracted 1 in order to get the index of the last element of the tensor. Now we can simply add back 1 to get the size:

(383 + 1, 127 + 1) = (384, 128)

Finally, we divide the dims by the respective grid dims:

(384 / 2, 128 / 4) = (192, 32)
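The same derivation as a small sketch: plug the last index into the affine map, add one to get the collapsed physical shape, then ceil-divide by the grid shape (ceil division matters once shapes stop dividing evenly, as in the padding section below).

// Derive the per-core memref (shard) shape for the 2x3x64x128 tensor on a
// 2x4 grid, following the steps above.
#include <array>
#include <cstdint>
#include <iostream>

constexpr int64_t ceilDiv(int64_t a, int64_t b) { return (a + b - 1) / b; }

int main() {
  // Last element index through the affine map, plus one, is the physical shape.
  std::array<int64_t, 2> physical = {1 * 192 + 2 * 64 + 63 + 1, 127 + 1}; // {384, 128}
  std::array<int64_t, 2> grid = {2, 4};
  std::array<int64_t, 2> shard = {ceilDiv(physical[0], grid[0]),
                                  ceilDiv(physical[1], grid[1])};
  std::cout << shard[0] << "x" << shard[1] << "\n"; // prints 192x32
}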

Here are a few more example MLIR snippets:

tensor<8x300xf32,
  #tt.layout<(d0, d1) -> (d0, d1),
    undef,
    <1x2>,
    memref<8x150xf32, #tt.memory_space<l1>>
  >
>

tensor<8x96x32xf32,
  #tt.layout<(d0, d1, d2) -> (d0 * 96 + d1, d2),
    undef,
    <2x1>,
    memref<384x32xf32, #tt.memory_space<l1>>
  >
>

tensor<8x96x32xf32,
  #tt.layout<(d0, d1, d2) -> (d0 * 96 + d1, d1, d2),
    undef,
    <2x1x2>,
    memref<384x96x16xf32, #tt.memory_space<l1>>
  >
>

tensor<5x3x2x2x7x32x32xf32,
  #tt.layout<
    (d0, d1, d2, d3, d4, d5, d6)
      -> (d0 * 2688 + d1 * 896 + d2 * 448 + d3 * 224 + d4 * 32 + d5, d4, d5, d6),
    undef,
    <3x2x2x2>,
    memref<4480x4x16x16xf32, #tt.memory_space<l1>>
  >
>

A couple of final notes regarding grid shape:

  • Grid shapes of rank > 2 are perfectly legal. Not only is this useful for describing multi-device grid topologies, but it is often convenient to have higher ranked grids to better describe how a high rank tensor should be divided. The grid shape here is a virtual grid shape; the tt.device attribute will hold an additional affine map that defines how this virtual grid shape maps to a physical one.
  • Grid shapes where either the number of columns or rows is greater than the physical device grid are also legal. Since this is only a virtual grid shape we could have some grid 1x64 that maps to a physical 8x8 device grid (this particular example is called width sharding in TTNN).

Tilized

A tilized tensor is one with a memref that has a tile element type.

Given some tensor with scalar layout:

tensor<3x64x128xf32,
  #tt.layout<
    (d0, d1, d2) -> (d0 * 64 + d1, d2),
    undef,
    <3x2>,
    memref<64x64xf32, #tt.memory_space<l1>>
  >
>

After tilizing we'll have:

tensor<3x64x128xf32,
  #tt.layout<
    (d0, d1, d2) -> (d0 * 64 + d1, d2),
    undef,
    <3x2>,
    memref<2x2x!tt.tile<32 x 32, bfp_bf8>, #tt.memory_space<l1>>
  >
>

Notice the memref dims were ceilDiv'd by the tile shape and the element type becomes a tt.tile type. Also notice that the tensor shape and element type remain intact.

Padding

Padding can be a bit of an overloaded term, but in this context it refers to an out of bounds area in the physical memory allocation that has no real tensor data in it. The contents of this area are tracked by oob_val, and the padding area can be automatically derived from the attributes of tt.layout.

Padding is a necessary evil that arises when a tensor is not evenly divisible by a grid shape or tile shape. It can also arise due to minimum Noc addressing requirements.

Example of non-divisible grid:

tensor<53x63xf32,
  #tt.layout<
    (d0, d1) -> (d0, d1),
    undef,
    <3x2>,
    memref<18x32xf32, #tt.memory_space<l1>>
  >
>

The grid dims always ceilDiv the affine map results, real tensor data will entirely fill initial shards and the last shard in each dimension will be partially filled.

In this particular example, we have 1 scalar row of padding on the last row of cores and 1 scalar column of padding on the last column of cores.

Taking the above example a step further, we could tilize it:

tensor<53x63xf32,
  #tt.layout<
    (d0, d1) -> (d0, d1),
    undef,
    <3x2>,
    memref<1x1x!tt.tile<32 x 32, bfp_bf8>, #tt.memory_space<l1>>
  >
>

Tile dims also always ceilDiv the resulting memref shape. Notice now that the padding is slightly more complicated. Our scalar shard shape was 18x32, but this was further padded to 32x32 meaning that every core now has 14 rows of padding except for the last row of cores which has 15 rows of padding.
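A short sketch of the padding arithmetic for this 53x63 example, deriving the per-core padding rows quoted above.

// Padding rows for tensor<53x63> on a 3x2 grid with 32x32 tiles.
#include <cstdint>
#include <iostream>

constexpr int64_t ceilDiv(int64_t a, int64_t b) { return (a + b - 1) / b; }

int main() {
  int64_t rows = 53, gridRows = 3, tileRows = 32;
  int64_t shardRows = ceilDiv(rows, gridRows);                  // 18 scalar rows per core
  int64_t paddedRows = ceilDiv(shardRows, tileRows) * tileRows; // padded up to 32
  int64_t lastShardRows = rows - shardRows * (gridRows - 1);    // 17 real rows on the last core row
  std::cout << paddedRows - shardRows << "\n";     // 14 rows of padding per core
  std::cout << paddedRows - lastShardRows << "\n"; // 15 rows on the last row of cores
}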

Also note that there is an order of operations here, grid divides the scalar shape first and then we tilize. This is important because it can enable use cases that frequently arise in conv networks that would otherwise result in reshards in between every layer.

With affine map we can be even more flexible in how we pad, we can bump our stride between dimensions. Consider tensor (w/ batch dim 2):

tensor<2x8x32xf32,
  #tt.layout<
    (d0, d1, d2) -> (d0 * 8 + d1, d2),
    undef,
    <1x2>,
    memref<16x16xf32, #tt.memory_space<l1>>
  >
>

If we tilized the above tensor we'd end up with a memref shape of 1x1x!tt.tile<32x32>, that is, all batches are tightly packed within a single tile. Let's say that for some reason, we do not want the batches (2) to be tightly packed within a tile, perhaps the mathematical operation we're doing requires the batch to be independently evaluated and thus the (S)FPU needs them in separate tiles. We can adjust this by adjusting the stride of the affine map:

(d0, d1, d2) -> (d0 * 32 + d1, d2),

Instead of striding by the number of logical rows, 8, we bump the stride up to 32 effectively pushing a gap between the collapsed rows and enabling each batch to fall on a tile boundary.
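A quick way to see the effect of the stride bump is to evaluate both maps at the start of the second batch (d0 = 1, d1 = 0): the original stride lands it mid-tile, the bumped stride lands it on a tile boundary.

// Compare the physical row of the first element of batch 1 under both maps.
#include <cstdint>
#include <iostream>

int main() {
  int64_t d0 = 1, d1 = 0;
  std::cout << d0 * 8 + d1 << "\n";  // 8: mid-tile with the original stride
  std::cout << d0 * 32 + d1 << "\n"; // 32: tile-aligned with the bumped stride
}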

Memory Spaces

At the time of writing this document there are 4 memory spaces:

  1. System: Host memory space that is not device visible.
  2. SystemMMIO: Host memory space that is device visible.
  3. DeviceDRAM: DRAM local to the device.
  4. DeviceL1: SRAM on each core.

Something worth noting here is that a tensor must belong exclusively to only one of these memory spaces at a time. For example, in order to stream tensor data from DeviceDRAM to DeviceL1 you would need to either manually slice the tensor into smaller tensors that do fit in L1 or have native support in the op's kernel for double buffering a block (most TTNN ops already support this).

Multi-device

Multi-device can be naturally represented via a combination of two concepts already touched on above, higher ranked grids and collapseIntervals. Let's consider the following example with a 3d grid and collapseIntervals=[(1, -1)].

tensor<2x3x64x128xf32,
  #tt.layout<(d0, d1, d2, d3) -> (d0, d1 * 64 + d2, d3),
    undef,
    <2x2x4>,
    memref<1x3x1x!tt.tile<32 x 32, bfp_bf8>, #tt.memory_space<l1>>
  >
>

Here we've left the batch dim intact and started collapsing at d1. This enables us to define a 3d grid where the outermost grid dim divides the batch directly. This could map to a 2 device system where the batch dim is evenly divided between 2 devices. Within each device this op runs on a 2x4 grid.

The high level takeaway here is that how a tensor is logically divided up is decoupled from its mapping to physical compute resources. This has a nice property that data parallel extends to any tensor dimension and is captured under the same grid primitive that also divides tensor rows and columns.

Test Plan

  • test/python/tensor_layout.py: Assertions for LayoutAttr to make sure it's spec compliant.
  • Sweep tests:
    • Grid dim sweeps
    • Tilize / untilize sweeps
    • Padding sweeps
  • Multi-device tests

Concerns

  • tt.layout is deliberately flexible and tries to capture as many of the problematic use-cases we've run into in the past as possible in a single, succinct representation. This flexibility will need to be further constrained by backends to avoid unsupported programming of this attribute.
  • Optimization solution space is potentially large with all of this flexibility. Two things that I hope can help protect us here:
    • By and large the heuristic we'll be following is just max the grid at all costs. This should really narrow down the solution space to only a handful of options and we only keep exploring if producers/consumers end up with nasty reblocking.
    • We can constrain the optimizer heuristics as aggressively as possible in the beginning and just advertise the full flexible options to the UI model explorer. Hopefully this enables us to experiment with crazier grid layouts and prove it's worthwhile before writing an algorithm.







TTNN Tensor Layout

The above section of this document covers how the compiler models tensor layout. There are some slight differences in TTNN, but the high level idea of collapsing dims is still used.

Terms

  • shape: Always logical shape, n-dimensional
  • stride: Same as pytorch stride, but this is crucial for describing how n-dimensional data gets packed into a 2D physical layout. This 2D physical layout is always the inner dim (-1) wide and dims [0, N-1] are collapsed into rows derived from stride (see the sketch after this list)
  • shard_shape: Also a logical shape, describes a 2d region that chunks physical_shape. Note this does not need to be a tile multiple
  • physical_shard_shape: The shard_shape padded out to tile_shape
  • tile_shape: A programmable tile shape, though constraints must check that it's compatible with an op's usage, i.e. FPU/Noc compatible
  • grid_shape: [divup(stride[0] // stride[-2], shard_shape[0]), divup(stride[-2], shard_shape[0])]
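A minimal sketch of the row-major stride computation the stride term above refers to (standard pytorch-style strides), shown for the running 2x3x64x128 shape.

// Row-major strides: strides[i] is the product of all dims to the right of i.
#include <cstdint>
#include <iostream>
#include <vector>

std::vector<int64_t> rowMajorStrides(const std::vector<int64_t> &shape) {
  std::vector<int64_t> strides(shape.size(), 1);
  for (int i = static_cast<int>(shape.size()) - 2; i >= 0; --i)
    strides[i] = strides[i + 1] * shape[i + 1];
  return strides;
}

int main() {
  for (int64_t s : rowMajorStrides({2, 3, 64, 128}))
    std::cout << s << " "; // prints 24576 8192 128 1
}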

Mapping from the compiler

The compiler uses an affine map to explicitly track which dimensions are folded together, but TTNN does not have affine maps so the representation is a bit more implicit. TTNN captures the dimension collapsing in the stride attribute where dimensions [0, N-1] are always collapsed. This is less flexible so the compiler will have to enforce only collapsing supported dimensions when targeting TTNN, or handle lowering in a different way. For example, in the compiler we might want to represent data parallel over the tensor batch dim by leaving d0 and collapsing d1 - d[-1]. TTNN doesn't support this in its tensor layout representation, but this could be lowered to a TTNN mesh tensor where the mesh could be sliced on the batch and each per-device tensor has d0 fully collapsed.

TTNN Example

(Figure: example TTNN tensor layout transformation.)

TTNN Interactive Visualizer

Device

Device in tt-mlir is somewhat of an overloaded term and can refer to different things depending on the context. This document will only speak to the compiler's abstract representation of a device captured by attribute #tt.device.

Terms

There are many overloaded terms when talking about devices and grids, this document will use the following definitions:

  • Physical Grid: A 2D array of tensix cores on a chip.
  • Chip: A single physical chip with a Physical Grid of cores.
  • Card: A PCIE or Ethernet card that may contain multiple Chips.
  • System: A collection of Cards that are usually connected together on the same host via PCIE or networked via ethernet. A system is represented by SystemDesc in the compiler.
  • Device: Device is always presented as a single entity to the enclosing scope, but it may be virtualized to abstract a multi-card System and part of its encoding carries a Logical Grid. Another way to think of device is a view over the system.
  • Logical Grid or just Grid: Is a logical shape that abstracts one or more Physical Grids.
  • Mesh Shape: Describes the virtual layout of the chips with respect to each other. In practice the mesh shape is used to derive the logical grid.

Motivation

The device attribute strives to achieve the following goals:

  • Provide a convenient representation of a physical grid that decouples the logical division of tensors from the physical layout of the hardware. This not only simplifies reasoning about how tensors get divided into shards, but can also enable reinterpretations of the device grid for data layout optimization decoupled from the existing encoding of the tensor layouts.
  • Following the first point, the device attribute should be able to represent many different forms of logical grids, from simple 2D grids, to more complex topologies like extra-wide grids or higher dimensional grids.
  • The device attribute captures both single chip and multi-chip systems under a single, virtualized representation.
  • Enable many forms of data parallel execution strategies for single and multi chip systems under a single representation.

Scope

This document will cover how the device attribute is encoded and how it can be lowered to backend dialects. The document will not cover the algorithm for choosing the best, or even legal, device configurations for a given physical system.

Examples

All of the following examples will assume the physical hardware has an 8x8 physical grid of cores. We will use notation [N, 8x8] to represent a N chip system, each with an 8x8 physical grid.

#tt.device in its simplest, single chip form [1, 8x8] just maps directly 1-1 to the underlying physical hardware device.

#tt.device<
  workerGrid = #tt.grid<8x8, (d0, d1) -> (0, d0, d1)>,
  meshShape = 1,
  chipIds = [0]
>

Let's break down what each of these attributes mean:

  • workerGrid = #tt.grid<8x8, (d0, d1) -> (0, d0, d1)>: This is a 2D logical grid with dim 8x8. It's followed by an affine map (d0, d1) -> (0, d0, d1) that provides a mapping from the logical grid to the physical grid. In this case, the logical grid is the same as the physical grid, so the mapping is the identity function. The logical grid can have any rank, but the physical mapping is always 3D, with the first being the chip index, followed by the 2D physical core index within the chip.
  • meshShape = 1: A shape provided as part of the DeviceAttr constructor that describes the virtual layout of the chips with respect to each other. Note that in a multi-chip system, this grid encapsulates the entire system's grid shape, e.g. 8x16 grid could be made up of a 1x2 mesh of chips side-by-side. The mesh attribute configures how the above grid/map attributes are created such that they implement this mesh topology.
  • chipIds = [0]: This is a list of chip indices. These chip indices directly reference the same chip indices in the system descriptor. The SystemDesc attribute that this is in reference to is tagged on the top level ModuleOp.

Specific examples that this document will cover:

  • Data Parallel Over Batch
  • Data Parallel Over 2d
  • Data Parallel Over 2d and Batch
  • Pipeline Parallel
  • Reinterpreted Grids

Before we move on to more complex examples, it's worth having on hand:

  • The python test test/python/device_attr.py which shows how all of these examples can actually be programmed for the device attribute.
  • The Tensor Layout spec as the following examples will demonstrate how tensor layout interacts with the logical device grid.

Note on Data Parallel: There is existing literature that explicitly distinguishes between data parallel and tensor parallel, oftentimes describing data parallel as duplicating the model across multiple devices and trivially dividing up the batch whereas tensor parallel refers to tensor data being distributed and potentially communicated between devices during execution. While this is true for multi-GPU/CPU systems, it is somewhat of an implementation detail and given the flexibility of tenstorrent hardware there is an opportunity to generalize this concept. In this document we will use the term data parallel to refer to any form of parallelism that divides any dimension of the tensor across multiple cores/chips.

Note on Constraints: Many of the examples below require careful virtualization of the underlying physical system, i.e. some device configurations might only work if the chips are connected via ethernet and with a particular topology, but these constraints are outside the scope of the examples and will be discussed further in the Backend Lowering and Constraints section.

Data Parallel Over Batch

Given a 2 chip system, [2, 8x8], we can represent a simple data parallel logical grid that divides the batch dimension in half across the two chips. This is denoted by meshShape = 2x1x1 which means the logical grid is 3D.

#tt.device<
  workerGrid = #tt.grid<2x8x8, (d0, d1, d2) -> (d0, d1, d2)>,
  meshShape = 2x1x1,
  chipIds = [0, 1]
>

The affine map here is just identity, so dims d1 and d2 directly index the physical grid and d0 indexes the chip.

Now we can consider some tensor that, importantly, has a grid of the same rank as the logical device grid:

tensor<16x3x64x128xf32,
  #tt.layout<(d0, d1, d2, d3) -> (d0, d1 * 64 + d2, d3),
    undef,
    <2x2x4>,
    memref<8x3x1x!tt.tile<32 x 32, bfp_bf8>, #tt.memory_space<l1>>
  >
>

If we map this tensor onto the above device, it will span across both chips, half of the batch dimension on each chip. Within each chip the tensor occupies a 2x4 grid out of the 8x8 physical grid available.

Data Parallel Over 2d

In this example we will consider a 2 chip system, [2, 8x8], and view it as though the two chips are concatenated together side by side to form a single 8x16 grid. This is denoted by meshShape = 1x2 which means to concatenate the chips in the second dimension.

#tt.device<
  workerGrid = #tt.grid<8x16, (d0, d1) -> ((d0 floordiv 8) * 2 + d1 floordiv 8, d0, d1 mod 8)>,
  meshShape = 1x2,
  chipIds = [0, 1]
>

Here we can see that the affine map encodes an indexing pattern such that when we extend past 8 cores in the second dimension, we wrap around to the next chip.
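To make the wrap-around concrete, here is a small sketch that evaluates this map for one logical core: logical core (3, 12) lands on chip 1 at physical core (3, 4).

// Evaluate (d0, d1) -> ((d0 floordiv 8) * 2 + d1 floordiv 8, d0, d1 mod 8).
#include <array>
#include <cstdint>
#include <iostream>

std::array<int64_t, 3> toPhysical(int64_t d0, int64_t d1) {
  return {(d0 / 8) * 2 + d1 / 8, d0, d1 % 8};
}

int main() {
  auto [chip, y, x] = toPhysical(3, 12);
  std::cout << chip << " " << y << " " << x << "\n"; // prints 1 3 4
}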

Now we can consider some tensor that, importantly, has a grid of the same rank as the logical device grid:

tensor<256x1024xf32,
  #tt.layout<(d0, d1) -> (d0, d1),
    undef,
    <4x16>,
    memref<2x2x!tt.tile<32 x 32, bfp_bf8>, #tt.memory_space<l1>>
  >
>

This single tensor maps trivially onto the logical grid, spanning the upper half. Decoupled from the tensor's layout, under the hood the tensor is actually physically spanning across two chips.

Data Parallel Over 2d and Batch

The previous 2 examples can be composed together to form a logical grid that divides tensor across multiple dimensions. Here we will consider a 4 chip system [4, 8x8] and view it as a 2x8x16 grid. Note that the meshShape is 2x1x2 which means to concatenate the chips in the first and third dimensions.

#tt.device<
  workerGrid = #tt.grid<2x8x16, (d0, d1, d2) -> (d0 * 2 + (d1 floordiv 8) * 2 + d2 floordiv 8, d1, d2 mod 8)>,
  meshShape = 2x1x2,
  chipIds = [0, 1, 2, 3]
>

We can evaluate the affine map to see that the chips are interpreted in chunks of two, where groups [0, 1] and [2, 3] each form 8x16 grids and these 2 groups concatenate to form a 2x8x16 grid.

We can consider the following tensor to map onto this grid:

tensor<64x256x1024xf32,
  #tt.layout<(d0, d1, d2) -> (d0, d1, d2),
    undef,
    <2x4x16>,
    memref<32x2x2x!tt.tile<32 x 32, bfp_bf8>, #tt.memory_space<l1>>
  >
>

Pipeline Parallel

Pipeline parallel in the scope of this spec isn't particularly interesting; it is intended to be used in conjunction with the ttir.pipeline operation, which will group sections of the module's operations into groups to form pipeline regions and will be covered in a separate spec.

What we can demonstrate here is how we can take multiple non-overlapping views of the system descriptor to form distinct virtual devices.

Given an 8 chip system [8, 8x8], we can form two virtual devices that each take 4 chips and interpret them differently (though they could take the same logical grid).

#tt.device<
  workerGrid = #tt.grid<2x8x16, (d0, d1, d2) -> (d0 * 2 + (d1 floordiv 8) * 2 + d2 floordiv 8, d1, d2 mod 8)>,
  meshShape = 2x1x2,
  chipIds = [0, 1, 2, 3]
>
#tt.device<
  workerGrid = #tt.grid<16x16, (d0, d1) -> ((d0 floordiv 8) * 2 + d1 floordiv 8, d0 mod 8, d1 mod 8)>,
  meshShape = 2x2,
  chipIds = [4, 5, 6, 7]
>

Reinterpreted Grids (Transpose)

One particularly interesting usecase that logical grids could enable is to reinterpret the grid as a form of data layout optimization. For example, if we wanted to transpose a tensor, instead of having to move the data around to implement transpose, we could instead reinterpret the grid as being transposed, leveraging the fact that the relevant data is already located on the correct cores/chips.

To keep things simple, let's consider a 1 chip system [1, 8x8], but it's not too big a leap to see how this could map to multi-chip where the cost of moving data is even higher.

Let's also consider a simple (totally contrived) eltwise unary graph:

a = exp(a)
aT = transpose(a)
relu(aT)
  1. We'll establish a regular, single chip, identity logical grid:
#tt.device<
  workerGrid = #tt.grid<8x8, (d0, d1) -> (0, d0, d1)>,
  meshShape = 1,
  chipIds = [0]
>
  2. Execute exp.
  3. We'll reinterpret the grid as transposed:
#tt.device<
  workerGrid = #tt.grid<8x8, (d0, d1) -> (0, d1, d0)>,
  meshShape = 1,
  chipIds = [0]
>
  4. Execute transpose. Note that each core only needs to transpose their data locally. Eventually this could be implemented as a no-op by reindexing the tile visitation order of the successive operation.
  5. Execute relu.

It's important to note that we effectively implemented transpose without moving data anywhere.

Reinterpreted Grids (Extra)

For the sake of examples, here's a few more ways of reinterpreting the logical grid.

Extra Wide Grid

#tt.device<
  workerGrid = #tt.grid<1x64, (d0, d1) -> (0, d0 * 8 + d1 floordiv 8, d1 mod 8)>,
  meshShape = 1,
  chipIds = [0]
>

Extra Tall + Transposed Grid

#tt.device<
  workerGrid = #tt.grid<64x1, (d0, d1) -> (0, d1 * 8 + d0 floordiv 8, d0 mod 8)>,
  meshShape = 1,
  chipIds = [0]
>

Staircase

#tt.device<
  workerGrid = #tt.grid<8x8, (d0, d1) -> (0, d0, (d0 + d1) mod 8)>,
  meshShape = 1,
  chipIds = [0]
>

This could be an interesting starting position for data in implementing matmul as a systolic array in a ring topology.

Lowering to TTNN

While the above device attribute encoding is quite flexible, this does not necessarily mean the target backend can actually support all of these interpretations. TTNN backend will be constrained to support only the specialized grid topologies that are supported by the API.

Grid/Shard Orientation

TODO

Multi-device

Please refer to TTNN Mesh Programming Docs for more information on how to program multi-device systems with TTNN API.

Multi-device TTNN dialect will try and stay as close to the TTNN API as possible. Let's consider what this looks like from the compiler and runtime perspectives:

Compiler

  • Device Creation: The TTNN device in the compiler is exactly the same attribute from the ttir dialect. It will encode the meshShape into the flatbuffer which can be directly used to program ::ttnn::MeshShape.
  • Tensor Layout: Again, the tensor layout is inherited in the TTNN dialect from the ttir dialect. The grid attribute in the tensor layout can be trivially divided by meshShape to determine the shape of the tensor slice on each device (see the sketch after this list). Broadcasting rules can be applied to determine which Distribution Strategy to use:
    • Mesh Sharded: If the tensor grid is > 1 along the meshShape dimensions, the tensor will be sharded across the mesh devices.
    • Replication: If the tensor needs to be broadcasted for this op, by extension the tensor layout will be replicated across the mesh devices.
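A minimal sketch of the grid division mentioned above, using the 2x2x4 tensor grid and 2x1x1 mesh from the batch-parallel example earlier in this document; it is only an illustration of the arithmetic, not an actual compiler API.

// Per-device tensor grid = tensor grid / meshShape (elementwise).
#include <array>
#include <cstdint>
#include <iostream>

int main() {
  std::array<int64_t, 3> tensorGrid = {2, 2, 4};
  std::array<int64_t, 3> meshShape = {2, 1, 1};
  for (size_t i = 0; i < tensorGrid.size(); ++i)
    std::cout << tensorGrid[i] / meshShape[i] << " "; // prints 1 2 4
}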

Runtime

  • Device Creation: The ttnn runtime will wholesale switch to working with mesh devices via the api ttnn::multi_device::open_mesh_device; this is possible because a 1x1 mesh device is a valid single device. The mesh shape during device open will always be 1xN where N is the number of deviceIds in the array. Note that this shape can be reinterpreted by flatbuffer programs on the fly with the SubMesh API.
  • Tensor Creation: Tensor creation in a multi-device system is a bit more involved. In order to upload a multi-device tensor to the mesh, the host tensor must first be created with MultiDeviceHostStorage. The ttnn runtime can automatically do this during handleToHostMemoryConfigOp:
    • Regular host tensor will bounce through new tensor with MultiDeviceHostStorage type.
    • tensor.to(mesh_device) will allocate/move the tensor to the mesh device.

Lowering to TTMetal

In TTMetal dialect we are only constrained by what we've implemented in the tt-mlir compiler, this means it is much more flexible and can theoretically support any of the grid interpretations above.

Test Plan

  • test/python/device_attr.py covers all of the examples above and asserts the IR is correctly generated.
  • Additional functional unit tests will be added as op and runtime support is added.

Concerns

  • tt.device is very flexible, but with this flexibility comes the potential for misuse. It's important that the compiler is able to validate the legal configurations of this attribute for the target backend.

'tt' Dialect

This dialect defines types and attributes common to all TT dialects.


ArchAttr

TT Arch

Syntax:

#tt.arch<
  ::mlir::tt::Arch   # value
>

Enum cases:

  • grayskull (Grayskull)
  • wormhole_b0 (WormholeB0)
  • blackhole (Blackhole)

Parameters:

Parameter | C++ type | Description
value | ::mlir::tt::Arch | an enum of type Arch

ArgumentAllocationAttr

Argument allocation attribute in TT dialect

Syntax:

#tt.arg_alloc<
  uint64_t,   # address
  uint64_t,   # size
  MemorySpace   # memorySpace
>

Holds the metadata for the allocation of a function argument, i.e. for graph inputs.

Parameters:

Parameter | C++ type | Description
address | uint64_t
size | uint64_t
memorySpace | MemorySpace

BufferAccessAttr

TT Buffer Access

Syntax:

#tt.buffer_access<
  ::mlir::tt::BufferAccess   # value
>

Enum cases:

  • alias (Alias)
  • stream (Stream)

Parameters:

Parameter | C++ type | Description
value | ::mlir::tt::BufferAccess | an enum of type BufferAccess

BufferAttr

Buffer attribute in TT dialect

Syntax:

#tt.buffer<
  MemRefType,   # memref
  BufferAccess   # buffer_access
>

Describes the physical footprint and layout of a buffer in L1. Its memref must also have a shape with rank equal to DeviceAttr grid. It also carries a buffer access attribute which can be one of:

  • Alias: This buffer aliases a persistent Tensor L1 allocation directly. Implies that no datamovement occurs and the compute kernel just accesses the local allocation directly.
  • Stream: This buffer is a temporary destination as a means to get remote data for local computation. Remote data is most likely a tensor that is allocated in DRAM, but could also be data from a remote core.

Parameters:

Parameter | C++ type | Description
memref | MemRefType | A memref that describes the physical footprint and layout of the buffer. It must also have a shape with rank equal to DeviceAttr grid.
buffer_access | BufferAccess | How data is accessed through this buffer, alias or stream.

ChipChannelAttr

TT chip_channel attribute

Syntax:

#tt.chip_channel<
  unsigned,   # deviceId0
  ::llvm::ArrayRef<int64_t>,   # ethernetCoreCoord0
  unsigned,   # deviceId1
  ::llvm::ArrayRef<int64_t>   # ethernetCoreCoord1
>

TT chip_channel attribute

Parameters:

Parameter | C++ type | Description
deviceId0 | unsigned
ethernetCoreCoord0 | ::llvm::ArrayRef<int64_t>
deviceId1 | unsigned
ethernetCoreCoord1 | ::llvm::ArrayRef<int64_t>

ChipCoordAttr

TT chip_coord attribute

Syntax:

#tt.chip_coord<
  unsigned,   # rack
  unsigned,   # shelf
  unsigned,   # y
  unsigned   # x
>

TT chip_coord attribute

Parameters:

Parameter | C++ type | Description
rack | unsigned
shelf | unsigned
y | unsigned
x | unsigned

ChipDescAttr

TT chip_desc attribute

Syntax:

#tt.chip_desc<
  ArchAttr,   # arch
  ::llvm::ArrayRef<int64_t>,   # grid
  unsigned,   # l1Size
  unsigned,   # numDramChannels
  unsigned,   # dramChannelSize
  unsigned,   # nocL1AddressAlignBytes
  unsigned,   # pcieAddressAlignBytes
  unsigned,   # nocDRAMAddressAlignBytes
  unsigned,   # l1UnreservedBase
  unsigned,   # eriscL1UnreservedBase
  unsigned,   # dramUnreservedBase
  unsigned,   # dramUnreservedEnd
  ChipPhysicalCoresAttr,   # chipPhysicalCores
  ::llvm::ArrayRef<DataTypeAttr>,   # supportedDataTypes
  ::llvm::ArrayRef<TileSizeAttr>   # supportedTileSizes
>

TT chip_desc attribute

Parameters:

Parameter | C++ type | Description
arch | ArchAttr
grid | ::llvm::ArrayRef<int64_t>
l1Size | unsigned
numDramChannels | unsigned
dramChannelSize | unsigned
nocL1AddressAlignBytes | unsigned
pcieAddressAlignBytes | unsigned
nocDRAMAddressAlignBytes | unsigned
l1UnreservedBase | unsigned
eriscL1UnreservedBase | unsigned
dramUnreservedBase | unsigned
dramUnreservedEnd | unsigned
chipPhysicalCores | ChipPhysicalCoresAttr
supportedDataTypes | ::llvm::ArrayRef<DataTypeAttr>
supportedTileSizes | ::llvm::ArrayRef<TileSizeAttr>

ChipPhysicalCoresAttr

TT chip_physical_cores attribute

Syntax:

#tt.chip_physical_cores<
  ::llvm::ArrayRef<CoreCoordAttr>,   # worker
  ::llvm::ArrayRef<CoreCoordAttr>,   # dram
  ::llvm::ArrayRef<CoreCoordAttr>,   # eth
  ::llvm::ArrayRef<CoreCoordAttr>   # eth_inactive
>

TT chip_physical_cores attribute containing arrays of physical cores by core type in order of logical cores.

Parameters:

Parameter | C++ type | Description
worker | ::llvm::ArrayRef<CoreCoordAttr>
dram | ::llvm::ArrayRef<CoreCoordAttr>
eth | ::llvm::ArrayRef<CoreCoordAttr>
eth_inactive | ::llvm::ArrayRef<CoreCoordAttr>

CoreCoordAttr

TT core_coord attribute

Syntax:

#tt.core_coord<
  int64_t,   # y
  int64_t   # x
>

TT core_coord attribute containing a single physical core coordinate.

Parameters:

Parameter | C++ type | Description
y | int64_t
x | int64_t

DataTypeAttr

TT DataTypes

Syntax:

#tt.supportedDataTypes<
  ::mlir::tt::DataType   # value
>

Enum cases:

  • f32 (Float32)
  • f16 (Float16)
  • bf16 (BFloat16)
  • bfp_f8 (BFP_Float8)
  • bfp_bf8 (BFP_BFloat8)
  • bfp_f4 (BFP_Float4)
  • bfp_bf4 (BFP_BFloat4)
  • bfp_f2 (BFP_Float2)
  • bfp_bf2 (BFP_BFloat2)
  • u32 (UInt32)
  • u16 (UInt16)
  • u8 (UInt8)

Parameters:

Parameter | C++ type | Description
value | ::mlir::tt::DataType | an enum of type DataType

DeviceAttr

Device attribute in TT dialect.

Syntax:

#tt.device<
  ::mlir::tt::GridAttr,   # workerGrid
  AffineMap,   # l1Map
  AffineMap,   # dramMap
  ::llvm::ArrayRef<int64_t>,   # meshShape
  ::llvm::ArrayRef<unsigned>   # chipIds
>

Describes the physical layout of a device in the system and is made up of a few components:

  • A grid attribute that describes the device's compute grid shape. It not only describes the shape of the compute grid, but also carries an affine map that describes how the logical grid maps to the physical grid.
  • Two affine maps that describe how a tensor layout's linear attribute maps to the L1 and DRAM memory spaces.
  • A mesh shape that describes the virtual layout of the chips with respect to each other. Note that in a multi-chip system, this grid encapsulates the entire system's grid shape, e.g. 8x16 grid could be made up of a 1x2 mesh of chips side-by-side. The mesh attribute configures how the above grid/map attributes are created such that they implement this mesh topology.
  • An array of chip ids that this device is made up of. This array's length must match the volume of the mesh shape and should be interpreted in row-major order.

Parameters:

Parameter | C++ type | Description
workerGrid | ::mlir::tt::GridAttr | TT grid attribute
l1Map | AffineMap
dramMap | AffineMap
meshShape | ::llvm::ArrayRef<int64_t>
chipIds | ::llvm::ArrayRef<unsigned>

GridAttr

TT grid attribute

Syntax:

#tt.grid<
  ::llvm::ArrayRef<int64_t>,   # shape
  AffineMap   # mapping
>

TT grid attribute

Parameters:

Parameter | C++ type | Description
shape | ::llvm::ArrayRef<int64_t>
mapping | AffineMap

IteratorTypeAttr

TT IteratorType

Syntax:

#tt.iterator_type<
  ::mlir::tt::IteratorType   # value
>

Enum cases:

  • parallel (Parallel)
  • systolic (Systolic)
  • broadcast (Broadcast)

Parameters:

Parameter | C++ type | Description
value | ::mlir::tt::IteratorType | an enum of type IteratorType

LayoutAttr

Tensor layout attribute

Syntax:

#tt.layout<
  AffineMap,   # linear
  OOBVal,   # oob_val
  GridAttr,   # grid
  MemRefType,   # memref
  TensorMemoryLayout   # mem_layout
>

The tensor layout attribute captures how tensor data is sharded across a grid of devices and cores, and how it is laid out in memory.

Some high level goals

  • Logical shapes: Keep the original tensor shape and rank intact and agnostic to underlying storage layout. Keeping the logical shapes not only makes some graph transformations vastly simpler, in particular convs, but it makes the lowered IR much easier to read and reason about. The original tensor shapes leave breadcrumbs that make it much easier to map back to the input representation.
  • Flexible sharding: Enable flexibility in choosing grid shape, to get better parallelization and avoid resharding. This is particularly important in cases where tensor shapes are not clean powers of two and would otherwise force our hand in choosing non-optimal grid shapes.
  • Logical-Physical Isomorphism: Encode this information with just a few attributes to enable derived conversions from logical to physical layout and back.
  • Explicit: A single source of truth.
  • Enable a direct way to query padded regions.

Please refer to the Tensor Layout Spec for more in depth documentation.

Examples:

tensor<8x300xf32,
  #tt.layout<(d0, d1) -> (d0, d1),
    undef,
    <1x2>,
    memref<8x150xf32, #tt.memory_space<l1>>
  >
>

tensor<8x96x32xf32,
  #tt.layout<(d0, d1, d2) -> (d0 * 96 + d1, d2),
    undef,
    <2x1>,
    memref<384x32xf32, #tt.memory_space<l1>>
  >
>

tensor<8x96x32xf32,
  #tt.layout<(d0, d1, d2) -> (d0 * 96 + d1, d1, d2),
    undef,
    <2x1x2>,
    memref<384x96x16xf32, #tt.memory_space<l1>>
  >
>

tensor<5x3x2x2x7x32x32xf32,
  #tt.layout<
    (d0, d1, d2, d3, d4, d5, d6)
      -> (d0 * 2688 + d1 * 896 + d2 * 448 + d3 * 224 + d4 * 32 + d5, d4, d5, d6),
    undef,
    <3x2x2x2>,
    memref<4480x4x16x16xf32, #tt.memory_space<l1>>
  >
>

Parameters:

| Parameter | C++ type | Description |
| --- | --- | --- |
| linear | AffineMap | An affine map that defines how the logical tensor dimensions map to a grid shape. |
| oob_val | OOBVal | A tracked out of bounds value that fills padding space. |
| grid | GridAttr | The grid shape that this tensor is divided onto. |
| memref | MemRefType | A memref that describes the physical footprint allocation of the shard. It must also have a shape with rank equal to grid. |
| mem_layout | TensorMemoryLayout | The layout of the tensor in memory. |

MemorySpaceAttr

TT MemorySpace

Syntax:

#tt.memory_space<
  ::mlir::tt::MemorySpace   # value
>

Enum cases:

  • system (System)
  • mmio (SystemMMIO)
  • dram (DeviceDRAM)
  • l1 (DeviceL1)

Parameters:

| Parameter | C++ type | Description |
| --- | --- | --- |
| value | ::mlir::tt::MemorySpace | an enum of type MemorySpace |

OOBValAttr

TT OOBVal

Syntax:

#tt.oob_val<
  ::mlir::tt::OOBVal   # value
>

Enum cases:

  • undef (Undef)
  • zero (Zero)
  • one (One)
  • inf (Inf)
  • neginf (NegInf)

Parameters:

| Parameter | C++ type | Description |
| --- | --- | --- |
| value | ::mlir::tt::OOBVal | an enum of type OOBVal |

OperandConstraintAttr

TT Operand Constraints

Syntax:

#tt.operand_constraint<
  ::mlir::tt::OperandConstraint   # value
>

Enum cases:

  • system (System)
  • dram (DRAM)
  • l1 (L1)
  • scalar (Scalar)
  • tile (Tile)
  • none (None)
  • interleaved (Interleaved)
  • single_bank (SingleBank)
  • height_sharded (HeightSharded)
  • width_sharded (WidthSharded)
  • block_sharded (BlockSharded)
  • any_layout (AnyLayout)
  • any (Any)
  • any_device (AnyDevice)
  • any_device_tile (AnyDeviceTile)
  • l1_block_sharded (L1BlockSharded)

Parameters:

| Parameter | C++ type | Description |
| --- | --- | --- |
| value | ::mlir::tt::OperandConstraint | an enum of type OperandConstraint |

SystemDescAttr

TT system_desc attribute

Syntax:

#tt.system_desc<
  ::llvm::ArrayRef<ChipDescAttr>,   # chipDescs
  ::llvm::ArrayRef<unsigned>,   # chipDescIndices
  ::llvm::ArrayRef<ChipCapabilityAttr>,   # chipCapabilities
  ::llvm::ArrayRef<ChipCoordAttr>,   # chipCoords
  ::llvm::ArrayRef<ChipChannelAttr>   # chipChannels
>

TT system_desc attribute

Parameters:

| Parameter | C++ type | Description |
| --- | --- | --- |
| chipDescs | ::llvm::ArrayRef<ChipDescAttr> | |
| chipDescIndices | ::llvm::ArrayRef<unsigned> | |
| chipCapabilities | ::llvm::ArrayRef<ChipCapabilityAttr> | |
| chipCoords | ::llvm::ArrayRef<ChipCoordAttr> | |
| chipChannels | ::llvm::ArrayRef<ChipChannelAttr> | |

TensorMemoryLayoutAttr

TT TensorMemoryLayout

Syntax:

#tt.tensor_memory_layout<
  ::mlir::tt::TensorMemoryLayout   # value
>

Enum cases:

  • none (None)
  • interleaved (Interleaved)
  • single_bank (SingleBank)
  • height_sharded (HeightSharded)
  • width_sharded (WidthSharded)
  • block_sharded (BlockSharded)

Parameters:

| Parameter | C++ type | Description |
| --- | --- | --- |
| value | ::mlir::tt::TensorMemoryLayout | an enum of type TensorMemoryLayout |

TileSizeAttr

TT tile_size attribute

Syntax:

#tt.tile_size<
  int64_t,   # y
  int64_t   # x
>

TT tile_size attribute containing a supported Tensix tile shape.

Parameters:

| Parameter | C++ type | Description |
| --- | --- | --- |
| y | int64_t | |
| x | int64_t | |

DeviceType

TT device

Syntax:

!tt.device<
  ::mlir::tt::DeviceAttr   # desc
>

Device type in TT dialect

Parameters:

| Parameter | C++ type | Description |
| --- | --- | --- |
| desc | ::mlir::tt::DeviceAttr | Device attribute in TT dialect. |

TileType

TT tile

Syntax:

!tt.tile<
  ::llvm::ArrayRef<int64_t>,   # shape
  DataType   # dataType
>

Tile type in TT dialect

Parameters:

| Parameter | C++ type | Description |
| --- | --- | --- |
| shape | ::llvm::ArrayRef<int64_t> | |
| dataType | DataType | |
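
For illustration, a tile type typically appears as the element type of a shard memref, mirroring the memref examples shown for LayoutAttr above. A minimal sketch, assuming a 32x32 bf16 tile stored in L1 (the surrounding memref shape is arbitrary):

memref<2x4x!tt.tile<32x32, bf16>, #tt.memory_space<l1>>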

'ttir' Dialect

The TTIR dialect provides high-level semantics for dispatching work to TT hardware. It defines a set of declarative, high-level operations that are used to describe the dispatch, but is largely agnostic to the set of operations or dialects that are actually supported by a consuming backend.

[TOC]
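
As a quick orientation before the per-op reference, the sketch below shows roughly what a hand-written TTIR-level function looks like prior to any lowering. It is illustrative only: the #any operand-constraint alias, the use of tensor.empty for the destination-style output, and the exact attribute spellings are assumptions rather than compiler output.

#any = #tt.operand_constraint<any>

func.func @add(%arg0: tensor<64x128xf32>, %arg1: tensor<64x128xf32>) -> tensor<64x128xf32> {
  %0 = tensor.empty() : tensor<64x128xf32>
  %1 = "ttir.add"(%arg0, %arg1, %0) <{operandSegmentSizes = array<i32: 2, 1>,
                                       operand_constraints = [#any, #any, #any]}>
      : (tensor<64x128xf32>, tensor<64x128xf32>, tensor<64x128xf32>) -> tensor<64x128xf32>
  return %1 : tensor<64x128xf32>
}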

ttir.abs (tt::ttir::AbsOp)

Eltwise absolute op.

Eltwise absolute operation.

Traits: AttrSizedOperandSegments

Interfaces: DestinationStyleOpInterface, TTIROpInterface, TTIR_ElementwiseOpInterface

Attributes:

AttributeMLIR TypeDescription
operand_constraints::mlir::ArrayAttr

Operands:

OperandDescription
inputsvariadic of ranked tensor of any type values
outputsvariadic of ranked tensor of any type values

Results:

ResultDescription
resultsvariadic of ranked tensor of any type values

ttir.add (tt::ttir::AddOp)

Eltwise add.

Eltwise add operation.

Traits: AttrSizedOperandSegments

Interfaces: DestinationStyleOpInterface, TTIROpInterface, TTIR_ElementwiseOpInterface, TTIR_GenericRegionOpInterface

Attributes:

AttributeMLIR TypeDescription
operand_constraints::mlir::ArrayAttr

Operands:

OperandDescription
inputsvariadic of ranked tensor of any type values
outputsvariadic of ranked tensor of any type values

Results:

ResultDescription
resultsvariadic of ranked tensor of any type values

ttir.alloc (tt::ttir::AllocOp)

Alloc op.

Tensor Alloc operation

Attributes:

| Attribute | MLIR Type | Description |
| --- | --- | --- |
| address | ::mlir::IntegerAttr | 64-bit signless integer attribute |
| size | ::mlir::IntegerAttr | 64-bit signless integer attribute |
| memory_space | ::mlir::tt::MemorySpaceAttr | TT MemorySpace (enum cases: system, mmio, dram, l1) |

Results:

ResultDescription
resultranked tensor of any type values

ttir.broadcast (tt::ttir::BroadcastOp)

Broadcast operation.

Broadcast op.

Interfaces: DestinationStyleOpInterface, TTIROpInterface

Attributes:

AttributeMLIR TypeDescription
dimension::mlir::ArrayAttr64-bit integer array attribute
operand_constraints::mlir::ArrayAttr

Operands:

OperandDescription
inputranked tensor of any type values
outputranked tensor of any type values

Results:

ResultDescription
resultranked tensor of any type values

ttir.concat (tt::ttir::ConcatOp)

Concat op.

Concat tensors along a given dimension.

Interfaces: DestinationStyleOpInterface, TTIROpInterface

Attributes:

AttributeMLIR TypeDescription
dim::mlir::IntegerAttr32-bit signed integer attribute
operand_constraints::mlir::ArrayAttr

Operands:

OperandDescription
inputsvariadic of ranked tensor of any type values
outputranked tensor of any type values

Results:

ResultDescription
resultranked tensor of any type values

ttir.constant (tt::ttir::ConstantOp)

Constant op.

Produces tensor filled with given constant value.

Examples:

  %0 = "ttir.constant"() {value = dense<0> : tensor<2x3xi32>} : () -> tensor<2x3xi32>
  // %0: [[0, 0, 0], [0, 0, 0]]
  %1 = "ttir.constant"() {value = dense<[0.2, 1.3]> : tensor<2xf32>} : () -> tensor<2xf32>
  // %1: [0.2, 1.3]

Traits: ConstantLike

Attributes:

AttributeMLIR TypeDescription
value::mlir::ElementsAttrconstant vector/tensor attribute

Results:

ResultDescription
resultranked tensor of any type values

ttir.conv2d (tt::ttir::Conv2dOp)

Conv2d operation.

Applies a 2D convolution over an input image composed of several input planes.

Interfaces: DestinationStyleOpInterface, TTIROpInterface

Attributes:

AttributeMLIR TypeDescription
stride_height::mlir::IntegerAttr32-bit signed integer attribute
stride_width::mlir::IntegerAttr32-bit signed integer attribute
dilation_height::mlir::IntegerAttr32-bit signed integer attribute
dilation_width::mlir::IntegerAttr32-bit signed integer attribute
groups::mlir::IntegerAttr32-bit signed integer attribute
padding_left::mlir::IntegerAttr32-bit signed integer attribute
padding_right::mlir::IntegerAttr32-bit signed integer attribute
padding_top::mlir::IntegerAttr32-bit signed integer attribute
padding_bottom::mlir::IntegerAttr32-bit signed integer attribute
operand_constraints::mlir::ArrayAttr

Operands:

OperandDescription
inputranked tensor of any type values
weightranked tensor of any type values
biasranked tensor of any type values
outputranked tensor of any type values

Results:

ResultDescription
resultranked tensor of any type values

ttir.dealloc (tt::ttir::DeallocOp)

Dealloc op.

Tensor Dealloc operation

Operands:

OperandDescription
resultranked tensor of any type values

ttir.div (tt::ttir::DivOp)

Eltwise divide.

Eltwise divide operation.

Traits: AttrSizedOperandSegments

Interfaces: DestinationStyleOpInterface, TTIROpInterface, TTIR_ElementwiseOpInterface

Attributes:

AttributeMLIR TypeDescription
operand_constraints::mlir::ArrayAttr

Operands:

OperandDescription
inputsvariadic of ranked tensor of any type values
outputsvariadic of ranked tensor of any type values

Results:

ResultDescription
resultsvariadic of ranked tensor of any type values

ttir.embedding (tt::ttir::EmbeddingOp)

Embedding op.

Embedding operation.

Interfaces: DestinationStyleOpInterface, TTIROpInterface

Attributes:

AttributeMLIR TypeDescription
operand_constraints::mlir::ArrayAttr

Operands:

OperandDescription
inputranked tensor of any type values
weightranked tensor of any type values
outputranked tensor of any type values

Results:

ResultDescription
resultranked tensor of any type values

ttir.exp (tt::ttir::ExpOp)

Eltwise exponential op.

Eltwise exponential operation.

Traits: AttrSizedOperandSegments

Interfaces: DestinationStyleOpInterface, TTIROpInterface, TTIR_ElementwiseOpInterface

Attributes:

AttributeMLIR TypeDescription
operand_constraints::mlir::ArrayAttr

Operands:

OperandDescription
inputsvariadic of ranked tensor of any type values
outputsvariadic of ranked tensor of any type values

Results:

ResultDescription
resultsvariadic of ranked tensor of any type values

ttir.generic (tt::ttir::GenericOp)

Generically dispatch work to a grid of cores.

This generic op carries a region that represents the work each core does. The region is expected to have the same signature as the op itself. The op is expected to be lowered to a backend specific form by a consuming backend. This op is heavily inspired by the linalg.generic op so it can be useful to refer to linalg.generic documentation for more details.

%5 = "ttir.generic"(%1, %3, %4) <{
  grid = #tt.grid<1x1>,                     // The grid range of cores to dispatch work to.
  indexing_maps = [#map, #map, #map],       // Affine maps for indexing into the input/output tensors. See linalg.generic
  iterator_types = [#parallel, #parallel],  // Iterator types for the input/output tensors. See linalg.generic
  operandSegmentSizes = array<i32: 2, 1>,   // Sizes of the operand segments, i.e. 2 inputs and 1 output.
}> ({
^bb0(%arg2: memref<64x128xf32, #l1_>, %arg3: memref<64x128xf32, #l1_>, %arg4: memref<64x128xf32, #l1_>):
    // Region body, would contain some computation that represents the work each core does.
}) : (tensor<64x128xf32, #layout1>, tensor<64x128xf32, #layout1>, tensor<64x128xf32, #layout1>) -> tensor<64x128xf32, #layout1>

Traits: AttrSizedOperandSegments

Interfaces: DestinationStyleOpInterface, TTIROpInterface

Attributes:

| Attribute | MLIR Type | Description |
| --- | --- | --- |
| grid | ::mlir::tt::GridAttr | TT grid attribute |
| indexing_maps | ::mlir::ArrayAttr | AffineMap array attribute |
| iterator_types | ::mlir::ArrayAttr | |
| operand_constraints | ::mlir::ArrayAttr | |

Operands:

OperandDescription
inputsvariadic of ranked tensor of any type values
outputsvariadic of ranked tensor of any type values

Results:

ResultDescription
resultsvariadic of ranked tensor of any type values

ttir.ge (tt::ttir::GreaterEqualOp)

Eltwise greater than or equal to.

Eltwise greater than or equal to operation.

Traits: AttrSizedOperandSegments

Interfaces: DestinationStyleOpInterface, TTIROpInterface, TTIR_ElementwiseOpInterface

Attributes:

AttributeMLIR TypeDescription
operand_constraints::mlir::ArrayAttr

Operands:

OperandDescription
inputsvariadic of ranked tensor of any type values
outputsvariadic of ranked tensor of any type values

Results:

ResultDescription
resultsvariadic of ranked tensor of any type values

ttir.kernel (tt::ttir::KernelOp)

Kernel call.

A generic kernel call operation. This operation is used to pattern match by some consuming backend.

Traits: AttrSizedOperandSegments

Interfaces: DestinationStyleOpInterface, TTIROpInterface

Attributes:

AttributeMLIR TypeDescription
op::mlir::FlatSymbolRefAttrflat symbol reference attribute
kind::mlir::FlatSymbolRefAttrflat symbol reference attribute
operand_constraints::mlir::ArrayAttr

Operands:

OperandDescription
inputsvariadic of ranked tensor of any type values or non-0-ranked.memref of any type values
outputsvariadic of ranked tensor of any type values or non-0-ranked.memref of any type values

Results:

ResultDescription
resultsvariadic of ranked tensor of any type values or non-0-ranked.memref of any type values

ttir.matmul (tt::ttir::MatmulOp)

Matrix multiply operation.

Matrix multiply operation.

Interfaces: DestinationStyleOpInterface, TTIROpInterface

Attributes:

AttributeMLIR TypeDescription
operand_constraints::mlir::ArrayAttr

Operands:

OperandDescription
aranked tensor of any type values
branked tensor of any type values
outputranked tensor of any type values

Results:

ResultDescription
resultranked tensor of any type values

ttir.max (tt::ttir::MaxOp)

Max reduction op.

Max reduction op.

Interfaces: DestinationStyleOpInterface, TTIROpInterface

Attributes:

AttributeMLIR TypeDescription
keep_dim::mlir::BoolAttrbool attribute
dim_arg::mlir::ArrayAttr32-bit integer array attribute
operand_constraints::mlir::ArrayAttr

Operands:

OperandDescription
inputranked tensor of any type values
outputranked tensor of any type values

Results:

ResultDescription
resultranked tensor of any type values

ttir.max_pool2d (tt::ttir::MaxPool2dOp)

Applies a 2D max pooling over an input signal composed of several input planes.

Applies a 2D max pooling over an input signal composed of several input planes.

Interfaces: DestinationStyleOpInterface, TTIROpInterface

Attributes:

AttributeMLIR TypeDescription
kernel_height::mlir::IntegerAttr32-bit signed integer attribute
kernel_width::mlir::IntegerAttr32-bit signed integer attribute
stride_height::mlir::IntegerAttr32-bit signed integer attribute
stride_width::mlir::IntegerAttr32-bit signed integer attribute
dilation_height::mlir::IntegerAttr32-bit signed integer attribute
dilation_width::mlir::IntegerAttr32-bit signed integer attribute
ceil_mode::mlir::BoolAttrbool attribute
padding_left::mlir::IntegerAttr32-bit signed integer attribute
padding_right::mlir::IntegerAttr32-bit signed integer attribute
padding_top::mlir::IntegerAttr32-bit signed integer attribute
padding_bottom::mlir::IntegerAttr32-bit signed integer attribute
operand_constraints::mlir::ArrayAttr
original_height::mlir::IntegerAttr32-bit signed integer attribute
original_width::mlir::IntegerAttr32-bit signed integer attribute

Operands:

OperandDescription
inputranked tensor of any type values
outputranked tensor of any type values

Results:

ResultDescription
resultranked tensor of any type values

ttir.maximum (tt::ttir::MaximumOp)

Eltwise maximum OP.

Calculates maximum of input tensors' values element-wise and stores result in output tensor.

Example:

  %lhs: [[3, 2, 7], [1, 4, 4]]
  %rhs: [[1, 4, 2], [1, 2, 3]]
  "ttir.maximum"(%lhs, %rhs, %out) -> %out: [[3, 4, 7], [1, 4, 4]]

Traits: AttrSizedOperandSegments

Interfaces: DestinationStyleOpInterface, TTIROpInterface, TTIR_ElementwiseOpInterface

Attributes:

AttributeMLIR TypeDescription
operand_constraints::mlir::ArrayAttr

Operands:

OperandDescription
inputsvariadic of ranked tensor of any type values
outputsvariadic of ranked tensor of any type values

Results:

ResultDescription
resultsvariadic of ranked tensor of any type values

ttir.mean (tt::ttir::MeanOp)

Mean reduction op.

Mean reduction op.

Interfaces: DestinationStyleOpInterface, TTIROpInterface

Attributes:

AttributeMLIR TypeDescription
keep_dim::mlir::BoolAttrbool attribute
dim_arg::mlir::ArrayAttr32-bit integer array attribute
operand_constraints::mlir::ArrayAttr

Operands:

OperandDescription
inputranked tensor of any type values
outputranked tensor of any type values

Results:

ResultDescription
resultranked tensor of any type values

ttir.multiply (tt::ttir::MultiplyOp)

Eltwise multiply.

Eltwise multiply operation.

Traits: AttrSizedOperandSegments

Interfaces: DestinationStyleOpInterface, TTIROpInterface, TTIR_ElementwiseOpInterface, TTIR_GenericRegionOpInterface

Attributes:

AttributeMLIR TypeDescription
operand_constraints::mlir::ArrayAttr

Operands:

OperandDescription
inputsvariadic of ranked tensor of any type values
outputsvariadic of ranked tensor of any type values

Results:

ResultDescription
resultsvariadic of ranked tensor of any type values

ttir.neg (tt::ttir::NegOp)

Eltwise negate op.

Eltwise negate operation.

Traits: AttrSizedOperandSegments

Interfaces: DestinationStyleOpInterface, TTIROpInterface, TTIR_ElementwiseOpInterface

Attributes:

AttributeMLIR TypeDescription
operand_constraints::mlir::ArrayAttr

Operands:

OperandDescription
inputsvariadic of ranked tensor of any type values
outputsvariadic of ranked tensor of any type values

Results:

ResultDescription
resultsvariadic of ranked tensor of any type values

ttir.reciprocal (tt::ttir::ReciprocalOp)

Eltwise reciprocal.

Eltwise reciprocal operation.

Traits: AttrSizedOperandSegments

Interfaces: DestinationStyleOpInterface, TTIROpInterface, TTIR_ElementwiseOpInterface

Attributes:

AttributeMLIR TypeDescription
operand_constraints::mlir::ArrayAttr

Operands:

OperandDescription
inputsvariadic of ranked tensor of any type values
outputsvariadic of ranked tensor of any type values

Results:

ResultDescription
resultsvariadic of ranked tensor of any type values

ttir.relu (tt::ttir::ReluOp)

Eltwise ReLU.

Eltwise ReLU operation.

Traits: AttrSizedOperandSegments

Interfaces: DestinationStyleOpInterface, TTIROpInterface, TTIR_ElementwiseOpInterface

Attributes:

AttributeMLIR TypeDescription
operand_constraints::mlir::ArrayAttr

Operands:

OperandDescription
inputsvariadic of ranked tensor of any type values
outputsvariadic of ranked tensor of any type values

Results:

ResultDescription
resultsvariadic of ranked tensor of any type values

ttir.reshape (tt::ttir::ReshapeOp)

Reshape op.

Reshape tensor.

Interfaces: DestinationStyleOpInterface, TTIROpInterface

Attributes:

AttributeMLIR TypeDescription
shape::mlir::ArrayAttr32-bit integer array attribute
operand_constraints::mlir::ArrayAttr

Operands:

OperandDescription
inputranked tensor of any type values
outputranked tensor of any type values

Results:

ResultDescription
resultranked tensor of any type values

ttir.sigmoid (tt::ttir::SigmoidOp)

Eltwise sigmoid.

Eltwise sigmoid operation.

Traits: AttrSizedOperandSegments

Interfaces: DestinationStyleOpInterface, TTIROpInterface, TTIR_ElementwiseOpInterface

Attributes:

AttributeMLIR TypeDescription
operand_constraints::mlir::ArrayAttr

Operands:

OperandDescription
inputsvariadic of ranked tensor of any type values
outputsvariadic of ranked tensor of any type values

Results:

ResultDescription
resultsvariadic of ranked tensor of any type values

ttir.softmax (tt::ttir::SoftmaxOp)

Softmax operation.

Softmax operation.

Interfaces: DestinationStyleOpInterface, TTIROpInterface

Attributes:

AttributeMLIR TypeDescription
dimension::mlir::IntegerAttr32-bit signed integer attribute
operand_constraints::mlir::ArrayAttr

Operands:

OperandDescription
inputranked tensor of any type values
outputranked tensor of any type values

Results:

ResultDescription
resultranked tensor of any type values

ttir.sqrt (tt::ttir::SqrtOp)

Eltwise square root.

Eltwise square root operation.

Traits: AttrSizedOperandSegments

Interfaces: DestinationStyleOpInterface, TTIROpInterface, TTIR_ElementwiseOpInterface

Attributes:

AttributeMLIR TypeDescription
operand_constraints::mlir::ArrayAttr

Operands:

OperandDescription
inputsvariadic of ranked tensor of any type values
outputsvariadic of ranked tensor of any type values

Results:

ResultDescription
resultsvariadic of ranked tensor of any type values

ttir.squeeze (tt::ttir::SqueezeOp)

Squeeze op.

Squeeze tensor.

Interfaces: DestinationStyleOpInterface, TTIROpInterface

Attributes:

AttributeMLIR TypeDescription
dim::mlir::IntegerAttr32-bit signed integer attribute
operand_constraints::mlir::ArrayAttr

Operands:

OperandDescription
inputranked tensor of any type values
outputranked tensor of any type values

Results:

ResultDescription
resultranked tensor of any type values

ttir.subtract (tt::ttir::SubtractOp)

Eltwise subtract.

Eltwise subtract operation.

Traits: AttrSizedOperandSegments

Interfaces: DestinationStyleOpInterface, TTIROpInterface, TTIR_ElementwiseOpInterface

Attributes:

AttributeMLIR TypeDescription
operand_constraints::mlir::ArrayAttr

Operands:

OperandDescription
inputsvariadic of ranked tensor of any type values
outputsvariadic of ranked tensor of any type values

Results:

ResultDescription
resultsvariadic of ranked tensor of any type values

ttir.sum (tt::ttir::SumOp)

Sum reduction op.

Sum reduction op.

Interfaces: DestinationStyleOpInterface, TTIROpInterface

Attributes:

AttributeMLIR TypeDescription
keep_dim::mlir::BoolAttrbool attribute
dim_arg::mlir::ArrayAttr32-bit integer array attribute
operand_constraints::mlir::ArrayAttr

Operands:

OperandDescription
inputranked tensor of any type values
outputranked tensor of any type values

Results:

ResultDescription
resultranked tensor of any type values

ttir.to_layout (tt::ttir::ToLayoutOp)

Layout op.

ToLayout operation: transitions tensors from one layout to another. Some examples include:

  • Transitioning between different memory spaces, e.g. DRAM to L1.
  • Transitioning between different data types, e.g. f32 to f16.
  • Transitioning between different tile sizes, e.g. 1x16 to 32x32
  • Transitioning between different tensor shardings
  • Some combination of the above
#layout = #tt.layout<8192x128x1, undef, <1x1>, memref<64x128xf32, #system>>
#layout1 = #tt.layout<8192x128x1, undef, <1x1>, memref<64x128xf32, #l1_>>
%1 = "ttir.to_layout"(%arg0, %0) : (tensor<64x128xf32, #layout>, tensor<64x128xf32, #layout1>) -> tensor<64x128xf32, #layout1>

Interfaces: DestinationStyleOpInterface, TTIROpInterface

Operands:

OperandDescription
inputranked tensor of any type values
outputranked tensor of any type values

Results:

ResultDescription
resultranked tensor of any type values

ttir.transpose (tt::ttir::TransposeOp)

Transpose op.

Transpose tensor along two given dimensions.

Interfaces: DestinationStyleOpInterface, TTIROpInterface

Attributes:

AttributeMLIR TypeDescription
dim0::mlir::IntegerAttr32-bit signed integer attribute
dim1::mlir::IntegerAttr32-bit signed integer attribute
operand_constraints::mlir::ArrayAttr

Operands:

OperandDescription
inputranked tensor of any type values
outputranked tensor of any type values

Results:

ResultDescription
resultranked tensor of any type values

ttir.unsqueeze (tt::ttir::UnsqueezeOp)

Unsqueeze op.

Unsqueeze tensor.

Interfaces: DestinationStyleOpInterface, TTIROpInterface

Attributes:

AttributeMLIR TypeDescription
dim::mlir::IntegerAttr32-bit signed integer attribute
operand_constraints::mlir::ArrayAttr

Operands:

OperandDescription
inputranked tensor of any type values
outputranked tensor of any type values

Results:

ResultDescription
resultranked tensor of any type values

ttir.yield (tt::ttir::YieldOp)

Yield op.

Yield operation; this is required by MLIR to mark the end of a dispatch region.

Traits: AlwaysSpeculatableImplTrait, ReturnLike, Terminator

Interfaces: ConditionallySpeculatable, NoMemoryEffect (MemoryEffectOpInterface), RegionBranchTerminatorOpInterface

Effects: MemoryEffects::Effect{}

Operands:

OperandDescription
valuesvariadic of ranked tensor of any type values or non-0-ranked.memref of any type values

'ttkernel' Dialect

A TTKernel out-of-tree MLIR dialect. This dialect is an example of an out-of-tree MLIR dialect designed to illustrate the basic setup required to develop MLIR-based tools without working inside of the LLVM source tree.

[TOC]

EthernetConfigAttr

TT EthernetConfig attribute

Syntax:

#ttkernel.ethernet_config<
  EthType,   # eth_type
  NocIndex   # noc_index
>

TT ethernet_config attribute

Parameters:

| Parameter | C++ type | Description |
| --- | --- | --- |
| eth_type | EthType | |
| noc_index | NocIndex | |

NocConfigAttr

TT NocConfig attribute

Syntax:

#ttkernel.noc_config<
  NocIndex   # noc_index
>

TT noc_config attribute

Parameters:

| Parameter | C++ type | Description |
| --- | --- | --- |
| noc_index | NocIndex | |

TensixConfigAttr

TT TensixConfig attribute

Syntax:

#ttkernel.tensix_config<
  MathFidelity,   # math_fidelity
  bool,   # fp32_dest_acc_en
  bool,   # preserve_fp32_precision
  bool   # math_approx_mode
>

TT compute_desc attribute

Parameters:

| Parameter | C++ type | Description |
| --- | --- | --- |
| math_fidelity | MathFidelity | |
| fp32_dest_acc_en | bool | |
| preserve_fp32_precision | bool | |
| math_approx_mode | bool | |

ThreadTypeAttr

TTKernel ThreadTypes

Syntax:

#ttkernel.thread<
  ::mlir::tt::ttkernel::ThreadType   # value
>

Enum cases:

  • noc (Noc)
  • tensix (Tensix)
  • ethernet (Ethernet)

Parameters:

| Parameter | C++ type | Description |
| --- | --- | --- |
| value | ::mlir::tt::ttkernel::ThreadType | an enum of type ThreadType |

ttkernel.acquire_dst (tt::ttkernel::AcquireDstOp)

Acquire dest call.

Acquire dest operation

ttkernel.add (tt::ttkernel::AddOp)

Add operation

Add operation

Operands:

OperandDescription
dst_index32-bit signless integer

ttkernel.add_tiles_init (tt::ttkernel::AddTilesInitOp)

Short init function

Must be run before add_tiles.

Operands:

OperandDescription
in0_cbTTKernel cb
in1_cbTTKernel cb

ttkernel.add_tiles (tt::ttkernel::AddTilesOp)

Add operation

Performs element-wise addition C=A+B of tiles in two CBs at given indices and writes the result to the DST register at index dst_tile_index. The DST register buffer must be in acquired state via acquire_dst call. This call is blocking and is only available on the compute engine.

Operands:

OperandDescription
in0_cbTTKernel cb
in1_cbTTKernel cb
in0_tile_index32-bit signless integer
in1_tile_index32-bit signless integer
dst_index32-bit signless integer
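
For context, add_tiles usually appears inside a compute-thread sequence together with the circular-buffer and DST-register ops documented in this section. The ordering below is an illustrative sketch only: type signatures are omitted for brevity, and %c0/%c1 are assumed to be i32 constants 0 and 1.

// Illustrative compute-thread ordering; type signatures omitted for brevity.
"ttkernel.binary_op_init_common"(%in0_cb, %in1_cb, %out_cb)
"ttkernel.add_tiles_init"(%in0_cb, %in1_cb)
"ttkernel.cb_wait_front"(%in0_cb, %c1)
"ttkernel.cb_wait_front"(%in1_cb, %c1)
"ttkernel.tile_regs_acquire"()
"ttkernel.add_tiles"(%in0_cb, %in1_cb, %c0, %c0, %c0)   // C = A + B into DST[0]
"ttkernel.tile_regs_commit"()
"ttkernel.tile_regs_wait"()
"ttkernel.cb_reserve_back"(%out_cb, %c1)
"ttkernel.pack_tile"(%c0, %out_cb, %c0)                  // DST[0] -> out_cb
"ttkernel.cb_push_back"(%out_cb, %c1)
"ttkernel.tile_regs_release"()
"ttkernel.cb_pop_front"(%in0_cb, %c1)
"ttkernel.cb_pop_front"(%in1_cb, %c1)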

ttkernel.binary_op_init_common (tt::ttkernel::BinaryOpInitCommonOp)

Init function for all binary ops

Followed by the specific init required with an opcode (binary_op_specific_init).

Operands:

OperandDescription
in0_cbTTKernel cb
in1_cbTTKernel cb
out_cbTTKernel cb

ttkernel.builtin (tt::ttkernel::BuiltinOp)

Builtin call.

Kernel operation

Attributes:

AttributeMLIR TypeDescription
op::mlir::FlatSymbolRefAttrflat symbol reference attribute
kind::mlir::FlatSymbolRefAttrflat symbol reference attribute

Operands:

OperandDescription
argsvariadic of non-0-ranked.memref of any type values or TTKernel cb

ttkernel.cb_pop_front (tt::ttkernel::CBPopFrontOp)

CBPopFront call.

CBPopFront operation

Operands:

OperandDescription
cbTTKernel cb
numPages32-bit signless integer

ttkernel.cb_push_back (tt::ttkernel::CBPushBackOp)

CBPushBack call.

CBPushBack operation

Operands:

OperandDescription
cbTTKernel cb
numPages32-bit signless integer

ttkernel.cb_reserve_back (tt::ttkernel::CBReserveBackOp)

CBReserveBack call.

CBReserveBack operation

Operands:

OperandDescription
cbTTKernel cb
numPages32-bit signless integer

ttkernel.cb_wait_front (tt::ttkernel::CBWaitFrontOp)

CBWaitFront call.

CBWaitFront operation

Operands:

OperandDescription
cbTTKernel cb
numPages32-bit signless integer

ttkernel.get_noc_addr (tt::ttkernel::GetNocAddrOp)

GetNocAddr

GetNocAddr

Interfaces: InferTypeOpInterface

Operands:

OperandDescription
x32-bit signless integer
y32-bit signless integer
l1Address32-bit signless integer

Results:

ResultDescription
nocAddrTTKernel noc address

ttkernel.matmul (tt::ttkernel::MatmulOp)

Matmul operation

Matmul operation

Operands:

OperandDescription
dst_index32-bit signless integer

ttkernel.mul (tt::ttkernel::MulOp)

Mul operation

Mul operation

Operands:

OperandDescription
dst_index32-bit signless integer

ttkernel.mul_tiles_init (tt::ttkernel::MulTilesInitOp)

Short init function

Must be run before mul_tiles.

Operands:

OperandDescription
in0_cbTTKernel cb
in1_cbTTKernel cb

ttkernel.mul_tiles (tt::ttkernel::MulTilesOp)

Mul operation

Performs element-wise multiplication C=A*B of tiles in two CBs at given indices and writes the result to the DST register at index dst_tile_index. The DST register buffer must be in acquired state via acquire_dst call. This call is blocking and is only available on the compute engine.

Operands:

OperandDescription
in0_cbTTKernel cb
in1_cbTTKernel cb
in0_tile_index32-bit signless integer
in1_tile_index32-bit signless integer
dst_index32-bit signless integer

ttkernel.noc_async_read_barrier (tt::ttkernel::NocAsyncReadBarrierOp)

NocAsyncReadBarrier

NocAsyncReadBarrier

ttkernel.noc_async_read (tt::ttkernel::NocAsyncReadOp)

NocAsyncRead

NocAsyncRead

Operands:

OperandDescription
srcNocAddrTTKernel noc address
dstLocalL1Addr32-bit signless integer
size32-bit signless integer

ttkernel.noc_async_write_barrier (tt::ttkernel::NocAsyncWriteBarrierOp)

NocAsyncWriteBarrier

NocAsyncWriteBarrier

ttkernel.noc_async_write (tt::ttkernel::NocAsyncWriteOp)

NocAsyncWrite

NocAsyncWrite

Operands:

OperandDescription
srcLocalL1Addr32-bit signless integer
dstNocAddrTTKernel noc address
size32-bit signless integer

ttkernel.pack (tt::ttkernel::PackOp)

Pack op.

Pack operation

Operands:

OperandDescription
dst_index32-bit signless integer
out_cbTTKernel cb
out_index32-bit signless integer

ttkernel.pack_set_data_type (tt::ttkernel::PackSetDataTypeOp)

Pack set DataType op.

Pack set DataType operation

Attributes:

| Attribute | MLIR Type | Description |
| --- | --- | --- |
| data_type | ::mlir::IntegerAttr | TT DataTypes (enum cases: f32, f16, bf16, bfp_f8, bfp_bf8, bfp_f4, bfp_bf4, bfp_f2, bfp_bf2, u32, u16, u8) |

ttkernel.pack_tile (tt::ttkernel::PackTileOp)

PackTile op.

Copies a single tile from the DST register buffer at a specified index to a specified CB at a given index. For the out_tile_index to be valid for this call, cb_reserve_back(n) has to be called first to reserve at least some number n > 0 of tiles in the output CB. out_tile_index = 0 then references the first tile in the reserved section of the CB, up to index n - 1, which will then be visible to the consumer in the same order after a cb_push_back call. The DST register buffer must be in acquired state via acquire_dst call. This call is blocking and is only available on the compute engine.

Each subsequent pack call will increment the write pointer in the CB by a single tile size. The pointer is then again set to a valid position with space for n reserved tiles by another cb_reserve_back call.

Operates in tandem with functions cb_reserve_back and cb_push_back.

A typical use case: the producer first ensures that a number of tiles is available in the buffer via cb_reserve_back, then uses the pack_tile call to copy a tile from one of the DST slots to a slot in the reserved space, and finally calls cb_push_back to announce the visibility of the reserved section of the circular buffer to the consumer.

Operands:

OperandDescription
dst_index32-bit signless integer
out_cbTTKernel cb
out_index32-bit signless integer

ttkernel.release_dst (tt::ttkernel::ReleaseDstOp)

Release dest call.

Release dest operation

ttkernel.return (tt::ttkernel::ReturnOp)

Return op.

Return operation

Traits: AlwaysSpeculatableImplTrait, ReturnLike, Terminator

Interfaces: ConditionallySpeculatable, NoMemoryEffect (MemoryEffectOpInterface), RegionBranchTerminatorOpInterface

Effects: MemoryEffects::Effect{}

ttkernel.sub (tt::ttkernel::SubOp)

Sub operation

Sub operation

Operands:

OperandDescription
dst_index32-bit signless integer

ttkernel.tile_regs_acquire (tt::ttkernel::TileRegsAcquireOp)

Tile_regs_acquire

Acquire an exclusive lock on the DST register for the MATH thread. This register is an array of 16 tiles of 32x32 elements each. This is a blocking function, i.e. this function will wait until the lock is acquired.

ttkernel.tile_regs_commit (tt::ttkernel::TileRegsCommitOp)

Tile_regs_commit

Release lock on DST register by MATH thread. The lock had to be previously acquired with tile_regs_acquire.

ttkernel.tile_regs_release (tt::ttkernel::TileRegsReleaseOp)

Tile_regs_release

Release lock on DST register by PACK thread. The lock had to be previously acquired with tile_regs_wait.

ttkernel.tile_regs_wait (tt::ttkernel::TileRegsWaitOp)

Tile_regs_wait

Acquire an exclusive lock on the DST register for the PACK thread. It waits for the MATH thread to commit the DST register. This is a blocking function, i.e. this function will wait until the lock is acquired.

ttkernel.tilize_block (tt::ttkernel::TilizeBlockOp)

TilizeBlockOp call.

TilizeBlockOp operation

Operands:

OperandDescription
cbInTTKernel cb
numTiles32-bit signless integer
cbOutTTKernel cb

ttkernel.tilize_init (tt::ttkernel::TilizeInitOp)

TilizeInitOp call.

TilizeInitOp operation

Operands:

OperandDescription
cbInTTKernel cb
numTiles32-bit signless integer
cbOutTTKernel cb

ttkernel.unpack_ab (tt::ttkernel::UnpackABOp)

UnpackAB op.

UnpackAB operation

Operands:

OperandDescription
cb_aTTKernel cb
src_a_index32-bit signless integer
cb_bTTKernel cb
src_b_index32-bit signless integer

ttkernel.unpack_a (tt::ttkernel::UnpackAOp)

UnpackA op.

UnpackA operation

Operands:

OperandDescription
cbTTKernel cb
src_index32-bit signless integer

ttkernel.unpack_set_data_type (tt::ttkernel::UnpackSetDataTypeOp)

Unpack set DataType op.

Unpack set DataType operation

Attributes:

| Attribute | MLIR Type | Description |
| --- | --- | --- |
| data_type_a | ::mlir::IntegerAttr | TT DataTypes (enum cases: f32, f16, bf16, bfp_f8, bfp_bf8, bfp_f4, bfp_bf4, bfp_f2, bfp_bf2, u32, u16, u8) |
| data_type_b | ::mlir::IntegerAttr | TT DataTypes (enum cases: f32, f16, bf16, bfp_f8, bfp_bf8, bfp_f4, bfp_bf4, bfp_f2, bfp_bf2, u32, u16, u8) |

ttkernel.unreachable (tt::ttkernel::UnreachableOp)

Unreachable op.

Unreachable operation

Traits: AlwaysSpeculatableImplTrait, ReturnLike, Terminator

Interfaces: ConditionallySpeculatable, NoMemoryEffect (MemoryEffectOpInterface), RegionBranchTerminatorOpInterface

Effects: MemoryEffects::Effect{}

ttkernel.untilize_block (tt::ttkernel::UntilizeBlockOp)

UntilizeBlockOp call.

UntilizeBlockOp operation

Operands:

OperandDescription
cbInTTKernel cb
numTiles32-bit signless integer
cbOutTTKernel cb

ttkernel.untilize_init (tt::ttkernel::UntilizeInitOp)

UntilizeInitOp call.

UntilizeInitOp operation

Operands:

OperandDescription
cbInTTKernel cb
cbOutTTKernel cb

CBType

TTKernel cb

Syntax:

!ttkernel.cb<
  CBPort,   # port
  uint64_t,   # address
  MemRefType,   # memref
  uint64_t,   # page_size
  uint64_t   # num_buffers
>

Circular buffer type in TTKernel dialect

Parameters:

| Parameter | C++ type | Description |
| --- | --- | --- |
| port | CBPort | |
| address | uint64_t | |
| memref | MemRefType | |
| page_size | uint64_t | |
| num_buffers | uint64_t | |

NocAddrType

TTKernel noc address

Syntax: !ttkernel.noc_addr

Noc address type in TTKernel dialect

'ttmetal' Dialect

A TTMetal out-of-tree MLIR dialect. This dialect is an example of an out-of-tree MLIR dialect designed to illustrate the basic setup required to develop MLIR-based tools without working inside of the LLVM source tree.

[TOC]

CoreRangeAttr

TTMetal grid attribute

Syntax:

#ttmetal.core_range<
  ::llvm::ArrayRef<int64_t>,   # offset
  ::llvm::ArrayRef<int64_t>   # size
>

TTMetal grid attribute

Parameters:

| Parameter | C++ type | Description |
| --- | --- | --- |
| offset | ::llvm::ArrayRef<int64_t> | |
| size | ::llvm::ArrayRef<int64_t> | |

ttmetal.alloc (tt::ttmetal::AllocOp)

Alloc op.

Tensor Alloc operation

Attributes:

| Attribute | MLIR Type | Description |
| --- | --- | --- |
| address | ::mlir::IntegerAttr | 64-bit signless integer attribute |
| size | ::mlir::IntegerAttr | 64-bit signless integer attribute |
| memory_space | ::mlir::tt::MemorySpaceAttr | TT MemorySpace (enum cases: system, mmio, dram, l1) |

Results:

ResultDescription
resultranked tensor of any type values

ttmetal.dealloc (tt::ttmetal::DeallocOp)

Dealloc op.

Tensor Dealloc operation

Operands:

OperandDescription
inputranked tensor of any type values

ttmetal.dispatch (tt::ttmetal::DispatchOp)

Dispatch op.

Dispatch operation

Traits: AttrSizedOperandSegments

Interfaces: DestinationStyleOpInterface

Attributes:

AttributeMLIR TypeDescription
core_ranges::mlir::ArrayAttr
kernelConfigs::mlir::ArrayAttr

Operands:

OperandDescription
inputsvariadic of ranked tensor of any type values
outputsvariadic of ranked tensor of any type values

Results:

ResultDescription
resultsvariadic of ranked tensor of any type values

ttmetal.host_read (tt::ttmetal::HostReadOp)

Host read op.

Host read operation

Interfaces: DestinationStyleOpInterface

Operands:

OperandDescription
inputranked tensor of any type values
outputranked tensor of any type values

Results:

ResultDescription
resultranked tensor of any type values

ttmetal.host_write (tt::ttmetal::HostWriteOp)

Host write op.

Host write operation

Interfaces: DestinationStyleOpInterface

Operands:

OperandDescription
inputranked tensor of any type values
outputranked tensor of any type values

Results:

ResultDescription
resultranked tensor of any type values

'ttnn' Dialect

A TTNN out-of-tree MLIR dialect. This dialect is an example of an out-of-tree MLIR dialect designed to illustrate the basic setup required to develop MLIR-based tools without working inside of the LLVM source tree.

[TOC]

ttnn.abs (tt::ttnn::AbsOp)

Eltwise absolute.

Eltwise absolute operation.

Traits: AttrSizedOperandSegments

Interfaces: DestinationStyleOpInterface

Operands:

OperandDescription
inputsvariadic of ranked tensor of any type values
outputsvariadic of ranked tensor of any type values

Results:

ResultDescription
resultsvariadic of ranked tensor of any type values

ttnn.add (tt::ttnn::AddOp)

Eltwise add.

Eltwise add operation.

Traits: AttrSizedOperandSegments

Interfaces: DestinationStyleOpInterface

Operands:

OperandDescription
inputsvariadic of ranked tensor of any type values
outputsvariadic of ranked tensor of any type values

Results:

ResultDescription
resultsvariadic of ranked tensor of any type values

ttnn.alloc (tt::ttnn::AllocOp)

Alloc op.

Tensor Alloc operation

Attributes:

| Attribute | MLIR Type | Description |
| --- | --- | --- |
| address | ::mlir::IntegerAttr | 64-bit signless integer attribute |
| size | ::mlir::IntegerAttr | 64-bit signless integer attribute |
| memory_space | ::mlir::tt::MemorySpaceAttr | TT MemorySpace (enum cases: system, mmio, dram, l1) |

Results:

ResultDescription
resultranked tensor of any type values

ttnn.concat (tt::ttnn::ConcatOp)

Concat op.

Concat tensors along a given dimension.

Interfaces: DestinationStyleOpInterface

Attributes:

AttributeMLIR TypeDescription
dim::mlir::IntegerAttr32-bit signed integer attribute

Operands:

OperandDescription
inputsvariadic of ranked tensor of any type values
outputranked tensor of any type values

Results:

ResultDescription
resultranked tensor of any type values

ttnn.conv2d (tt::ttnn::Conv2dOp)

Conv2d operation.

Applies a 2D convolution over an input image composed of several input planes.

Interfaces: DestinationStyleOpInterface

Attributes:

AttributeMLIR TypeDescription
in_channels::mlir::IntegerAttr32-bit signless integer attribute
out_channels::mlir::IntegerAttr32-bit signless integer attribute
batch_size::mlir::IntegerAttr32-bit signless integer attribute
input_height::mlir::IntegerAttr32-bit signless integer attribute
input_width::mlir::IntegerAttr32-bit signless integer attribute
kernel_height::mlir::IntegerAttr32-bit signless integer attribute
kernel_width::mlir::IntegerAttr32-bit signless integer attribute
stride_height::mlir::IntegerAttr32-bit signless integer attribute
stride_width::mlir::IntegerAttr32-bit signless integer attribute
padding_height::mlir::IntegerAttr32-bit signless integer attribute
padding_width::mlir::IntegerAttr32-bit signless integer attribute
dilation_height::mlir::IntegerAttr32-bit signless integer attribute
dilation_width::mlir::IntegerAttr32-bit signless integer attribute
groups::mlir::IntegerAttr32-bit signless integer attribute

Operands:

OperandDescription
inputranked tensor of any type values
weightranked tensor of any type values
biasranked tensor of any type values
outputranked tensor of any type values
deviceTT device

Results:

ResultDescription
resultranked tensor of any type values

ttnn.dealloc (tt::ttnn::DeallocOp)

Dealloc op.

Tensor Dealloc operation

Operands:

OperandDescription
inputranked tensor of any type values

ttnn.div (tt::ttnn::DivOp)

Eltwise divide.

Eltwise divide operation.

Traits: AttrSizedOperandSegments

Interfaces: DestinationStyleOpInterface

Operands:

OperandDescription
inputsvariadic of ranked tensor of any type values
outputsvariadic of ranked tensor of any type values

Results:

ResultDescription
resultsvariadic of ranked tensor of any type values

ttnn.embedding (tt::ttnn::EmbeddingOp)

Embedding op.

Embedding operation.

Interfaces: DestinationStyleOpInterface

Operands:

OperandDescription
inputranked tensor of any type values
weightranked tensor of any type values
outputranked tensor of any type values

Results:

ResultDescription
resultranked tensor of any type values

ttnn.empty (tt::ttnn::EmptyOp)

Empty op.

Tensor empty operation

Attributes:

| Attribute | MLIR Type | Description |
| --- | --- | --- |
| shape | ::mlir::tt::ttnn::ShapeAttr | TTNN shape attribute |
| dtype | ::mlir::tt::DataTypeAttr | TT DataTypes (enum cases: f32, f16, bf16, bfp_f8, bfp_bf8, bfp_f4, bfp_bf4, bfp_f2, bfp_bf2, u32, u16, u8) |
| layout | ::mlir::tt::ttnn::LayoutAttr | TTNN Layout (enum cases: row_major, tile, invalid) |
| memory_config | ::mlir::tt::ttnn::MemoryConfigAttr | TTNN memory config attribute |

Operands:

OperandDescription
deviceTT device

Results:

ResultDescription
resultranked tensor of any type values

ttnn.exp (tt::ttnn::ExpOp)

Eltwise exponential.

Eltwise exponential operation.

Traits: AttrSizedOperandSegments

Interfaces: DestinationStyleOpInterface

Operands:

OperandDescription
inputsvariadic of ranked tensor of any type values
outputsvariadic of ranked tensor of any type values

Results:

ResultDescription
resultsvariadic of ranked tensor of any type values

ttnn.full (tt::ttnn::FullOp)

Full op.

Tensor full operation

Attributes:

AttributeMLIR TypeDescription
fillValue::mlir::FloatAttr32-bit float attribute

Operands:

OperandDescription
deviceTT device

Results:

ResultDescription
resultranked tensor of any type values

ttnn.get_device (tt::ttnn::GetDeviceOp)

Get Device op.

This op returns the current runtime device.

Results:

ResultDescription
deviceTT device
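
Taken together, get_device, to_device, and to_layout form the typical preamble for running TTNN ops on a device. The sketch below is illustrative only: the #device alias, the #ttnn.layout<tile> spelling, and the omission of the required memory_config attributes are assumptions made for brevity, not compiler output.

%dev = "ttnn.get_device"() : () -> !tt.device<#device>
// Move a host tensor onto the device (memory_config attribute elided here).
%x = "ttnn.to_device"(%arg0, %dev) : (tensor<64x128xf32>, !tt.device<#device>) -> tensor<64x128xf32>
// Convert to the tile layout expected by most compute ops.
%y = "ttnn.to_layout"(%x, %dev) <{layout = #ttnn.layout<tile>}> : (tensor<64x128xf32>, !tt.device<#device>) -> tensor<64x128xf32>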

ttnn.ge (tt::ttnn::GreaterEqualOp)

Eltwise greater than or equal to.

Eltwise greater than or equal to operation.

Traits: AttrSizedOperandSegments

Interfaces: DestinationStyleOpInterface

Operands:

OperandDescription
inputsvariadic of ranked tensor of any type values
outputsvariadic of ranked tensor of any type values

Results:

ResultDescription
resultsvariadic of ranked tensor of any type values

ttnn.matmul (tt::ttnn::MatmulOp)

Interfaces: DestinationStyleOpInterface

Operands:

OperandDescription
aranked tensor of any type values
branked tensor of any type values
outputranked tensor of any type values

Results:

ResultDescription
resultranked tensor of any type values

ttnn.max (tt::ttnn::MaxOp)

Max reduction op.

Max reduction op.

Interfaces: DestinationStyleOpInterface

Attributes:

AttributeMLIR TypeDescription
keep_dim::mlir::BoolAttrbool attribute
dim_arg::mlir::ArrayAttr32-bit integer array attribute

Operands:

OperandDescription
inputranked tensor of any type values
outputranked tensor of any type values

Results:

ResultDescription
resultranked tensor of any type values

ttnn.max_pool2d (tt::ttnn::MaxPool2dOp)

Applies a 2D max pooling over an input signal composed of several input planes.

Applies a 2D max pooling over an input signal composed of several input planes.

Interfaces: DestinationStyleOpInterface

Attributes:

AttributeMLIR TypeDescription
batch_size::mlir::IntegerAttr32-bit signed integer attribute
input_height::mlir::IntegerAttr32-bit signed integer attribute
input_width::mlir::IntegerAttr32-bit signed integer attribute
channels::mlir::IntegerAttr32-bit signed integer attribute
kernel_height::mlir::IntegerAttr32-bit signed integer attribute
kernel_width::mlir::IntegerAttr32-bit signed integer attribute
stride_height::mlir::IntegerAttr32-bit signed integer attribute
stride_width::mlir::IntegerAttr32-bit signed integer attribute
dilation_height::mlir::IntegerAttr32-bit signed integer attribute
dilation_width::mlir::IntegerAttr32-bit signed integer attribute
ceil_mode::mlir::BoolAttrbool attribute
padding_height::mlir::IntegerAttr32-bit signed integer attribute
padding_width::mlir::IntegerAttr32-bit signed integer attribute

Operands:

OperandDescription
inputranked tensor of any type values
outputranked tensor of any type values
deviceTT device

Results:

ResultDescription
resultranked tensor of any type values

ttnn.maximum (tt::ttnn::MaximumOp)

Eltwise maximum OP.

Calculates maximum of input tensors' values element-wise and stores result in output tensor.

Example:

  %lhs: [[3, 2, 7], [1, 4, 4]]
  %rhs: [[1, 4, 2], [1, 2, 3]]
  "ttnn.maximum"(%lhs, %rhs, %out) -> %out: [[3, 4, 7], [1, 4, 4]]

Traits: AttrSizedOperandSegments

Interfaces: DestinationStyleOpInterface

Operands:

OperandDescription
inputsvariadic of ranked tensor of any type values
outputsvariadic of ranked tensor of any type values

Results:

ResultDescription
resultsvariadic of ranked tensor of any type values

ttnn.mean (tt::ttnn::MeanOp)

Mean reduction op.

Mean reduction op.

Interfaces: DestinationStyleOpInterface

Attributes:

AttributeMLIR TypeDescription
keep_dim::mlir::BoolAttrbool attribute
dim_arg::mlir::ArrayAttr32-bit integer array attribute

Operands:

OperandDescription
inputranked tensor of any type values
outputranked tensor of any type values

Results:

ResultDescription
resultranked tensor of any type values

ttnn.multiply (tt::ttnn::MultiplyOp)

Eltwise multiply.

Eltwise multiply operation.

Traits: AttrSizedOperandSegments

Interfaces: DestinationStyleOpInterface

Operands:

OperandDescription
inputsvariadic of ranked tensor of any type values
outputsvariadic of ranked tensor of any type values

Results:

ResultDescription
resultsvariadic of ranked tensor of any type values

ttnn.neg (tt::ttnn::NegOp)

Eltwise negate.

Eltwise negate operation.

Traits: AttrSizedOperandSegments

Interfaces: DestinationStyleOpInterface

Operands:

OperandDescription
inputsvariadic of ranked tensor of any type values
outputsvariadic of ranked tensor of any type values

Results:

ResultDescription
resultsvariadic of ranked tensor of any type values

ttnn.reciprocal (tt::ttnn::ReciprocalOp)

Eltwise reciprocal.

Eltwise reciprocal operation.

Traits: AttrSizedOperandSegments

Interfaces: DestinationStyleOpInterface

Operands:

OperandDescription
inputsvariadic of ranked tensor of any type values
outputsvariadic of ranked tensor of any type values

Results:

ResultDescription
resultsvariadic of ranked tensor of any type values

ttnn.relu (tt::ttnn::ReluOp)

Eltwise ReLU.

Eltwise ReLU operation.

Traits: AttrSizedOperandSegments

Interfaces: DestinationStyleOpInterface

Operands:

OperandDescription
inputsvariadic of ranked tensor of any type values
outputsvariadic of ranked tensor of any type values

Results:

ResultDescription
resultsvariadic of ranked tensor of any type values

ttnn.reshape (tt::ttnn::ReshapeOp)

Reshape op.

Reshape tensor.

Interfaces: DestinationStyleOpInterface

Attributes:

AttributeMLIR TypeDescription
shape::mlir::ArrayAttr32-bit integer array attribute

Operands:

OperandDescription
inputranked tensor of any type values
outputranked tensor of any type values

Results:

ResultDescription
resultranked tensor of any type values

ttnn.sigmoid (tt::ttnn::SigmoidOp)

Eltwise sigmoid.

Eltwise sigmoid operation.

Traits: AttrSizedOperandSegments

Interfaces: DestinationStyleOpInterface

Operands:

OperandDescription
inputsvariadic of ranked tensor of any type values
outputsvariadic of ranked tensor of any type values

Results:

ResultDescription
resultsvariadic of ranked tensor of any type values

ttnn.softmax (tt::ttnn::SoftmaxOp)

Softmax op.

Softmax operation.

Interfaces: DestinationStyleOpInterface

Attributes:

AttributeMLIR TypeDescription
dimension::mlir::IntegerAttr32-bit signed integer attribute

Operands:

OperandDescription
inputranked tensor of any type values
outputranked tensor of any type values

Results:

ResultDescription
resultranked tensor of any type values

ttnn.sqrt (tt::ttnn::SqrtOp)

Eltwise sqrt.

Eltwise sqrt operation.

Traits: AttrSizedOperandSegments

Interfaces: DestinationStyleOpInterface

Operands:

OperandDescription
inputsvariadic of ranked tensor of any type values
outputsvariadic of ranked tensor of any type values

Results:

ResultDescription
resultsvariadic of ranked tensor of any type values

ttnn.subtract (tt::ttnn::SubtractOp)

Eltwise subtract.

Eltwise subtract operation.

Traits: AttrSizedOperandSegments

Interfaces: DestinationStyleOpInterface

Operands:

OperandDescription
inputsvariadic of ranked tensor of any type values
outputsvariadic of ranked tensor of any type values

Results:

ResultDescription
resultsvariadic of ranked tensor of any type values

ttnn.sum (tt::ttnn::SumOp)

Sum reduction op.

Sum reduction op.

Interfaces: DestinationStyleOpInterface

Attributes:

AttributeMLIR TypeDescription
keep_dim::mlir::BoolAttrbool attribute
dim_arg::mlir::ArrayAttr32-bit integer array attribute

Operands:

OperandDescription
inputranked tensor of any type values
outputranked tensor of any type values

Results:

ResultDescription
resultranked tensor of any type values

ttnn.to_device (tt::ttnn::ToDeviceOp)

ToDevice op.

Attributes:

| Attribute | MLIR Type | Description |
| --- | --- | --- |
| memory_config | ::mlir::tt::ttnn::MemoryConfigAttr | TTNN memory config attribute |

Operands:

OperandDescription
inputranked tensor of any type values
deviceTT device

Results:

ResultDescription
resultranked tensor of any type values

ttnn.to_layout (tt::ttnn::ToLayoutOp)

ToLayout op.

Attributes:

| Attribute | MLIR Type | Description |
| --- | --- | --- |
| layout | ::mlir::tt::ttnn::LayoutAttr | TTNN Layout (enum cases: row_major, tile, invalid) |

Operands:

OperandDescription
inputranked tensor of any type values
deviceTT device

Results:

ResultDescription
resultranked tensor of any type values

ttnn.to_memory_config (tt::ttnn::ToMemoryConfigOp)

ToMemoryConfig op.

Operands:

OperandDescription
inputranked tensor of any type values
deviceTT device

Results:

ResultDescription
resultranked tensor of any type values

ttnn.transpose (tt::ttnn::TransposeOp)

Transpose op.

Transpose tensor along two given dimensions.

Interfaces: DestinationStyleOpInterface

Attributes:

AttributeMLIR TypeDescription
dim0::mlir::IntegerAttr32-bit signed integer attribute
dim1::mlir::IntegerAttr32-bit signed integer attribute

Operands:

OperandDescription
inputranked tensor of any type values
outputranked tensor of any type values

Results:

ResultDescription
resultranked tensor of any type values