Introduction

The following document provides an overview of the TT-MLIR project, with a focus on the technical specifications of an MLIR-based compiler stack. So what exactly is an MLIR-based compiler stack? MLIR (Multi-Level Intermediate Representation) is a subproject of the LLVM Project that aims to bring extensibility and sustainable code design to a modular compiler framework. In practice, this means taking a much larger, more involved compiler (like LLVM) and splitting it into sub-compilers that each produce their own Intermediate Representation (IR) of what you've fed the compiler.

Disclaimer: This is intended to be a working document; if you find something incorrect or incomplete, please feel free to create a PR.

Motivations

The idea of having a multi-level IR might not seem so far-fetched; in fact, it resembles some of our current software stacks. The idea of going from a High Level TVM Graph → Lowered PyBUDA Graph → Netlist, with each layer having its own level of optimizations, is quite a familiar concept. However, there are problems with the reusability and integration of optimizations in the current software compiler stack. Currently, users are almost forced to choose between top-down or bottom-up optimization, with both requiring expert-level knowledge to achieve the desired performance. Developing two entirely different projects is taxing, and it's hard to translate the benefits of BUDA over to metal (or the other way around). One of the primary goals of tt-mlir is to enable a consistent programming model between software stacks: concepts for improving optimizations in the compiler stack should carry over 1:1 to hand-written TTNN.

The benefits grow even further when one considers all the possible entry points that multiple IRs present. Existing MLIR-based projects like OpenXLA and torch-mlir can natively output MLIR in a dialect that can be transcribed to the TTIR dialect as well!

What is MLIR and why use it?

MLIR is a compiler infrastructure that is designed to be modular and extensible. The main benefits the tt-mlir project hopes to gain by using MLIR include:

  • Industry Standard Compiler Framework
    • Lots of boilerplate algorithms, data structures, and useful software that is common to compiler development
  • Ecosystem
    • Hook into existing front-end MLIR projects
  • Testing framework
    • A battle-tested test infrastructure that will enable us to write fine grained tests and rely less on end-to-end testing
    • Common IR Serialization Format that's easy to test, debug, and edit

Additional documentation highlighting the benefits of MLIR can be found here:

MLIR: Overview

MLIR is, at its root, a framework that can parse and print "readable" text in the .mlir format. The unique properties lie in the modularity of the parsing itself. MLIR is built upon a collection of Dialects; each of these Dialects defines a collection of Operations, Types, and Attributes. These dialects follow their own syntax, and they can encode any amount of information. The benefit is that MLIR provides bindings and hooks such that a user can directly translate these IRs into usable artifacts for that layer of complexity. An example would be the relatively high-level TOSA Dialect, which is used to represent computation over tensors, being lowered to a more hardware-specific dialect that closely models the programming model of the hardware or underlying backend.

It is the dialect system itself which powers the multi-level functionality of MLIR: with different dialects, a user can essentially "lower" through their software stack by transforming between the dialects for their layers. Dialects span a broad range, from purely mathematical dialects, to a LinAlg Dialect, to a TensorFlow Dialect defined for ML graphs. Each dialect encodes its own information, and their operations can use the Types/Attributes of other dialects as parameters. Multiple dialects can coexist in one module, and this is encouraged to highlight the optimizations of different dialects. In our use case for the TT stack, MLIR acts as a "mid-level" compiler, which makes the task of joining together various entry points and backends much simpler.

MLIR Primitives

So what does MLIR look like, and how does it work and get parsed? The hierarchy of an MLIR module is shown below:

#any_device = #tt.operand_constraint<dram|l1|scalar|tile|any_device|any_device_tile>
module attributes {tt.system_desc = #tt.system_desc<[<#tt.arch<wormhole_b0>, #tt.grid<8x8>>], [0], [<pcie|host_mmio>], [<0, 0, 0, 0>]>} {
  func.func @forward(%arg0: tensor<64x128xf32>, %arg1: tensor<64x128xf32>) -> tensor<64x128xf32> {
    %0 = tensor.empty() : tensor<64x128xf32>
    %1 = "ttir.multiply"(%arg0, %arg1, %0) <{operandSegmentSizes = array<i32: 2, 1>, operand_constraints = [#any_device, #any_device, #any_device]}> : (tensor<64x128xf32>, tensor<64x128xf32>, tensor<64x128xf32>) -> tensor<64x128xf32>
    return %1 : tensor<64x128xf32>
  }
}
  • Attributes (defined using #)

    • The syntax of actually creating an attribute is modular, and custom assembly instructions for different attributes can be applied.
  • Operations

    • Operations are namespaced by their dialect with a . separator, so you'll see examples like func.func or tensor.empty. Each operation also provides its own assembly instructions, but often strictly defines the type of the result.

    • Quotes are added around ttir.multiply since it's part of a custom dialect, and more custom assembly instructions are applied to specify operand_constraints.

    • Operations typically have operands (arguments) and results which are highlighted with %, these results and operands help to show the relationship between operations

  • Types

    • Types are shown as data formats throughout this compiled MLIR module; tensor and array are some examples.

    • They help to demonstrate the transformation of information and its representation as it is processed across this module.

MLIR Workflow

The overall MLIR workflow doesn't involve writing .mlir files by hand, and not necessarily even modifying them. The Intermediate Representations are truly just representations: we can parse them to show what the graph looks like at a given stage of optimization, or run a pass over them to optimize certain functions. The overall framework is designed with the following architecture in mind:

  1. Graph Information exists

  2. Graph Information is transformed (through whichever method) into a high-level MLIR representation

  3. Passes are run on the high-level implementation to lower into TTIR, a common IR that can be lowered into multiple backends

  4. Depending on the usecase more passes are run to lower to whatever backend the user would like (ex: TTNN Backend)

What are Passes?

Transformations in MLIR are represented as passes that run over a parsed module. These passes can be executed when parsing or generating MLIR modules. These transformations can have a myriad of purposes, and how they modify the module is completely user defined. Some passes exist for lowering purposes, as mentioned before, where a dialect is parsed and each operation is transformed into a lowered dialect following some set of user-defined rules. In the context of this project, passes are also used for optimizations and backend code transformation. They're a powerful tool: they provide most of the functionality for transforming between layers of dialects, and a simple platform for modifying an MLIR module.
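To make this concrete, here is a minimal sketch of the pass workflow using the upstream MLIR Python bindings (an assumption for illustration; tt-mlir ships its own passes and pipelines, which are typically run through ttmlir-opt as shown later in this document). The built-in canonicalize pass stands in for any user-defined transformation:

from mlir.ir import Context, Module
from mlir.passmanager import PassManager

ASM = """
module {
  func.func @forward(%arg0: f32) -> f32 {
    return %arg0 : f32
  }
}
"""

with Context():
    module = Module.parse(ASM)  # parse textual IR into an in-memory module
    # Build a pass pipeline anchored on the top-level module.
    pm = PassManager.parse("builtin.module(canonicalize)")
    # Run the pipeline, rewriting the module in place (older bindings take
    # the module itself: pm.run(module)).
    pm.run(module.operation)
    print(module)  # print the transformed IR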

Why not make our own?

Now that I've described the functionality of the MLIR framework, it seems like making an in-house multi-level Intermediate Representation system would be pretty similar, so why are we going through the effort of adopting this framework?

One of the biggest reasons is the active developer community surrounding the project: being a part of the LLVM Project means that there is solid developer support, and the framework is designed to be a tool for many different paradigms of compute. This scalability and strong mission statement make MLIR a solid platform to use as the middle layer in our compiler stack. Furthermore, as a functional benefit of being part of a larger open source project, MLIR has a whole library of tests and infrastructure that we can leverage for solid code health while starting a new project.

Automation

It's not only about developer support; another key benefit of MLIR is that it's built with autogeneration in mind. Through TableGen, much of the boilerplate of creating this multi-level IR is abstracted away, letting us focus on implementation and execution. This automation is built on top of a pre-existing robust framework with many implementations and support from other large players in the ML scene. By integrating with these automation pipelines, we give external developers a much simpler entry point into our software stack!

TT-MLIR: Bringing MLIR to the TT Stack

Now that we have defined this pretty cool project, let's look at the implementation details of bringing MLIR (and related optimizations) into the TT stack. Since it acts as a mid-level compiler, we can start by defining the "bottom" and "top" layers of the compiler. BUDA already has a well-defined set of frontend optimizations over some TVM-defined graph and is aware of the hardware that these models will run on. We want to interrupt the BUDA stack so that it only gives us the frontend-compiled graph, before any hardware-specific lowering occurs. This produces information that is agnostic to the different backends and their execution on TT hardware, but is still valid information to optimize at different levels for later compilation. The "bottom" of our graph is now defined as the backend that will produce the machine-specific code to be executed. While MLIR could allow for any level of complexity downwards for the bottom, we will define a very aggressive TTNN backend for the MVP.

Desired Optimization List:

  • Forge-FE (frontend)

    • Graph Optimizations, Constant Folding, Operation Fusion
  • TT-MLIR (mid-level)

    • Data Storage, Memory Configuration, Grid Configuration
  • TT-NN (backend)

    • Kernel Configuration*, Network Optimization

*Subject to Change / Be Moved to TT-MLIR

TT-MLIR Dialects

Now that we have defined the series of optimizations that we would like to see implemented in TT-MLIR, we can begin to help define the dialects that would help to support these different levels of optimizations. For more detail on each of these dialects, please refer to the GitHub Wiki and TableGen descriptors. I think that Nick does a great job of documenting the key functionality.

TT Dialect

The TT Dialect holds only the common Types and Attributes used throughout the many levels of the mid-level compiler.

TTIR Dialect

The TTIR Dialect is defined as the common dialect for TT-MLIR; as such, it doesn't define anything hardware/backend specific. It lists out general actions that would take place on TT hardware, such as dispatch, layout, and kernel operations.

Generic Operation

This is one of two operations that are crucial to understanding the intended optimization characteristics of the TTIR Dialect. The generic operation dictates the actions taken to dispatch some instruction to TT hardware for execution. Parametrically, the operation consumes inputs, outputs, maps to read the tensors, and access-types to the memory. These parameters highlight the optimizations that can be performed at this level: changing the location of the memory, transposing using different access maps, or even changing the grid upon which the computation takes place. The operation also contains a block that stores the exact behaviour to be performed.

Layout Operation

The layout operation is key in describing the storage of memory throughout the execution graph. Layout determines the sharding spec, the location of the memory, the data types, and the tile sizes of some tensor. While generic describes the dispatch for some data-wise transformation to take place, the data itself is laid out across the chip through the layout operation.

Both of these operations describe the key functionality of the TTIR dialect and the optimization space that it provides.

Built-in MLIR Dialects

The functionality of the TT-MLIR dialects also depends on / is inspired by the functionality of built-in MLIR dialects like Affine and LinAlg. Below are summaries of some of the key members of these dialects.

Affine Dialect

[Reference] Affine maps help to describe transformations on coordinate systems. While this may sound abstract, imagine trying to index a rank-2 tensor. By getting t[x, y] I can access the element in the Xth row and Yth column, but if I wanted to transpose the tensor I might have to re-layout the entire tensor such that the data would be accessible using t[x, y] to get the element in the Yth row and Xth column. This transpose can instead be represented using an affine map to transform (x, y) -> (y, x), which lets the tensor data remain in place while the access method is modified. This extends even further to more complex transformations, so that stride lengths or unique indexing methods can be implemented without complicated data manipulation.
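To make the idea concrete, here is a small plain-Python sketch of the concept (illustrative only, not MLIR syntax): the data stays in place and only the index computation changes.

# A 2x3 "tensor" stored row-major; the data itself never moves.
data = [[1, 2, 3],
        [4, 5, 6]]

identity = lambda x, y: (x, y)    # t[x, y] reads row x, column y
transpose = lambda x, y: (y, x)   # the affine map (x, y) -> (y, x)

def read(tensor, index_map, x, y):
    i, j = index_map(x, y)  # remap the requested coordinates
    return tensor[i][j]

print(read(data, identity, 0, 2))   # 3: row 0, column 2 of the original
print(read(data, transpose, 2, 0))  # 3: transposed view, with no data movement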

Tensor Dialect

[Reference] The tensor dialect defines the functionality and Type of the fundamental Tensor. This dialect contains members that represent the manipulation and representation of tensors as multi-dimensional data with shapes and datatypes. Not much else is different about this dialect; the reference covers the key topics if implementation details are needed.

Func Dialect

[Reference]

TOSA Dialect

[Reference]

SCF Dialect

[Reference]

EmitC Dialect

[Reference]

TT-Explorer - Performance Optimization Tool

A unique project related to TT-MLIR is the integration of performance optimization tools such that users can easily visualize and tune their models without needing an expert-level understanding of the tech stack. TT-Explorer is built with Google AI's Model Explorer as the base for the visualization tool, plus a custom adapter to parse TT-MLIR projects. This allows users to readily tune their models and optimize at the TTIR layer (e.g. they can change certain memory to be laid out in L1 instead of DRAM, or change the grid layout of an operation to be larger than what was previously assigned). After compilation with these overrides, the runtime information can then be fed directly into a Tracy performance analysis for the user to visualize the impact of their tuning, seeing which operations were least performant and continuing in a gamified design loop for iterative performance tuning!

Building

These are the steps required to get the TT-MLIR project running on your machine.

Please refer to the Dependencies section before building the project.

Environment setup

You only need to build this once; it builds LLVM, flatbuffers, and a Python virtual environment.

cmake -B env/build env
cmake --build env/build
  • It is recommended to use the system installation of python3 for the virtual environment. Please ensure that you do not already have a venv activated before running the above command.
  • Please ensure the directory /opt/ttmlir-toolchain exists and its owner is the current user, i.e. the one that executes the above cmake commands. The commands to create it and assign the proper ownership are:
    sudo mkdir -p /opt/ttmlir-toolchain
    sudo chown -R $USER /opt/ttmlir-toolchain
    

Build

source env/activate
cmake -G Ninja -B build -DCMAKE_BUILD_TYPE=Release -DCMAKE_C_COMPILER=clang-17 -DCMAKE_CXX_COMPILER=clang++-17
cmake --build build
  • To enable the ttnn/metal runtime add -DTTMLIR_ENABLE_RUNTIME=ON
  • Clang 17 is the minimum required version when enabling the runtime.
  • To enable the ttnn/metal perf runtime add -DTT_RUNTIME_ENABLE_PERF_TRACE=ON
  • To accelerate the builds with ccache use -DCMAKE_CXX_COMPILER_LAUNCHER=ccache
  • To accelerate builds further, if python bindings aren't needed, pass -DTTMLIR_ENABLE_BINDINGS_PYTHON=OFF. For some reason the python bindings link step is very slow.
  • TTNN build is automatically integrated / handled by tt-mlir cmake build system. For debugging and further information regarding the TTNN backend build step, please refer to TTNN Documentation.
  • The runtime build step depends on the ARCH_NAME environment variable, which is set in the env/activate script. If you want to build the runtime for a different architecture, please set ARCH_NAME to the desired value (one of grayskull, wormhole_b0, or blackhole). Please note that the runtime is built only if TTMLIR_ENABLE_RUNTIME=ON.
  • In addition to ARCH_NAME, the runtime build depends on the TT_METAL_HOME variable, which is also set in the env/activate script. For more information, please refer to the TT-NN and TT-Metalium installation documentation.
OS           | Offline Compiler Only | Runtime Enabled Build | Runtime + Perf Enabled Build
Ubuntu 22.04 | ✅                    | ✅                    | ✅
Ubuntu 20.04 | ✅                    | ❌                    | ❌
MacOS        | ✅                    | ❌                    | ❌

Test

source env/activate
cmake --build build -- check-ttmlir

llvm-lit

Under the hood the check-ttmlir cmake target is running llvm-lit. With it you can:

# Query which tests are available
llvm-lit -sv ./build/test --show-tests

# Run an individual test:
llvm-lit -sv ./build/test/ttmlir/Dialect/TTIR/test_allocate.mlir

# Run a sub-suite:
llvm-lit -sv ./build/test/ttmlir/Dialect/TTIR

See the full llvm-lit documentation for more information.

Lint

source env/activate
cmake --build build -- clang-tidy

Pre-Commit

Pre-Commit applies a git hook to the local repository such that linting is checked and applied on every git commit action. Install from the root of the repository using:

source env/activate
pre-commit install

If you have already committed before installing the pre-commit hooks, you can run on all files to "catch up":

pre-commit run --all-files

For more information visit pre-commit

Docs

source env/activate
cmake --build build -- docs
mdbook serve build/docs
  • mdbook can be installed with the system's package manager.
  • mdbook serve will by default create a local server at http://localhost:3000.

Note: If you want to build the docs on MacOS, there are two extra dependencies: doxygen and graphviz.

Both can be installed using Homebrew by running the following commands:

brew install doxygen
brew install graphviz

Dependencies

Ubuntu Common

Make sure to have Git LFS installed. You can install it with the following command:

sudo apt-get install git-lfs

Ubuntu 22.04

On Ubuntu 22.04 we need to install clang and ninja, and to update the version of cmake, because 3.20 is the minimum required for this project.

sudo apt update
sudo apt upgrade

sudo apt install clang-17
sudo apt install ninja-build

sudo apt remove cmake -y
pip3 install cmake --upgrade
hash -r

Ensure cmake can be found in the path pip installed it to, e.g. PATH=$PATH:$HOME/.local/bin.

Then run the following command to see the cmake version, which should be later than 3.20:

cmake --version


MacOS

On MacOS we need to install the latest versions of cmake and ninja, which can be done using Homebrew (docs for installing Homebrew: https://brew.sh).

brew install cmake
brew install ninja

Common Build Errors

TTMLIRPythonCAPI target requires changing an RPATH

CMake Error at /opt/ttmlir-toolchain/lib/cmake/llvm/AddLLVM.cmake:594 (add_library):
  The install of the TTMLIRPythonCAPI target requires changing an RPATH from
  the build tree, but this is not supported with the Ninja generator unless
  on an ELF-based or XCOFF-based platform.  The
  CMAKE_BUILD_WITH_INSTALL_RPATH variable may be set to avoid this relinking
  step.

If you get the above error, it means you tried to build with an old version of cmake or ninja and there is a stale file. To fix this, rm -rf your build directory, install a newer version of cmake/ninja, and then rebuild. If you installed ninja via sudo apt install ninja-build, it might still not be up-to-date (v1.10.0). You may use the ninja in the python virtual environment, or install it via pip3 install -U ninja; either way, version 1.11.1.git.kitware.jobserver-1 should work.

clang++ is not a full path and was not found in the PATH

CMake Error at CMakeLists.txt:2 (project):
  The CMAKE_CXX_COMPILER:
    clang++
  is not a full path and was not found in the PATH.
  Tell CMake where to find the compiler by setting either the environment
  variable "CXX" or the CMake cache entry CMAKE_CXX_COMPILER to the full path
  to the compiler, or to the compiler name if it is in the PATH.
CMake Error at CMakeLists.txt:2 (project):
  The CMAKE_C_COMPILER:
    clang
  is not a full path and was not found in the PATH.
  Tell CMake where to find the compiler by setting either the environment
  variable "CC" or the CMake cache entry CMAKE_C_COMPILER to the full path to
  the compiler, or to the compiler name if it is in the PATH.

If you get the above error, it means you need to install clang, which you can do with sudo apt install clang on Ubuntu.

sfpi, trisc, ncrisc build failure

tt-forge-fe/third_party/tt-mlir/third_party/tt-metal/src/tt-metal/tt_metal/third_party/sfpi/compiler/bin/riscv32-unknown-elf-g++: 1: version: not found
tt-forge-fe/third_party/tt-mlir/third_party/tt-metal/src/tt-metal/tt_metal/third_party/sfpi/compiler/bin/riscv32-unknown-elf-g++: 2: oid: not found
size: '1961632': No such file
size: '1961632': No such file
size: '1961632': No such file
Always | FATAL | ncrisc build failed

If you got the above error, it means that the SFPI (or a similar component) build failed. First, make sure you have Git LFS set up (e.g. sudo apt-get install git-lfs). Then, try to pull the SFPI submodule manually:

cd third_party/tt-metal/src/tt-metal
git submodule foreach 'git lfs fetch --all && git lfs pull'

Then, try to build again.

Common Runtime Errors

Debugging python on macOS

When debugging python on macOS via lldb you may see an error like:

(lldb) r
error: process exited with status -1 (attach failed (Not allowed to attach to process.  Look in the console messages (Console.app), near the debugserver entries, when the attach failed.  The subsystem that denied the attach permission will likely have logged an informative message about why it was denied.))

For preinstalled macOS binaries you must manually codesign with debug entitlements.

Create file debuggee-entitlement.xml:

<?xml version="1.0" encoding="UTF-8"?>
<!DOCTYPE plist PUBLIC "-//Apple//DTD PLIST 1.0//EN" "http://www.apple.com/DTDs/PropertyList-1.0.dtd">
<plist version="1.0">
<dict>
        <key>com.apple.security.cs.disable-library-validation</key>
        <true/>
        <key>com.apple.security.get-task-allow</key>
        <true/>
</dict>
</plist>

Sign the binary:

sudo codesign -f -s - --entitlements debuggee-entitlement.xml /opt/ttmlir-toolchain/venv/bin/python

Internal Build Notes / IRD

  • When building the runtime we must use Ubuntu 22.04 docker image
    • When making an IRD reservation use --docker-image yyz-gitlab.local.tenstorrent.com:5005/tenstorrent/infra/ird-ubuntu-22-04-amd64:latest
  • You'll have to manually install a newer version of cmake, at least 3.22; the easiest way to do this is to pip install cmake and make sure this one is in your path
  • You'll want the LLVM installation to persist across IRD reservations; you can achieve this by:
    • mkdir /localdev/$USER/ttmlir-toolchain
    • When requesting an IRD use --volumes /localdev/$USER/ttmlir-toolchain:/opt/ttmlir-toolchain

Working with Docker Images

Components:

  • Dockerfile
  • Workflow for building Docker image
  • Project build using Docker image

Overview

We use Docker images to prepare the project environment, install dependencies and tooling, and prebuild the toolchain. The project builds four Docker images:

Image          | Name                          | Dockerfile
Base image     | tt-mlir-base-ubuntu-22-04     | Dockerfile.base
CI image       | tt-mlir-ci-ubuntu-22-04       | Dockerfile.ci
Base IRD image | tt-mlir-base-ird-ubuntu-22-04 | Dockerfile.ird
IRD image      | tt-mlir-ird-ubuntu-22-04      | Dockerfile.ird

The base image starts from a supported base image (Ubuntu 22.04) and installs the dependencies for the project build. From there we build the CI image, which contains the prebuilt toolchain and is used in CI to shorten build times. The IRD images contain dev tools like GDB, vim, etc. and ssh, and are used in IRD environments.

During the CI Docker build, the project is built and tests are run to ensure that everything is set up correctly. If any dependencies are missing, the Docker build will fail.

Building the Docker Image using GitHub Actions

The GitHub Actions workflow Build and Publish Docker Image builds the Docker images and uploads them to GitHub Packages at https://github.com/orgs/tenstorrent/packages?repo_name=tt-mlir. We use the git SHA we build from as the tag.

Building the Docker Image Locally

To test the changes and build the image locally, use the following command:

docker build -f .github/Dockerfile.base -t ghcr.io/tenstorrent/tt-mlir/tt-mlir-base-ubuntu-22-04:latest .
docker build -f .github/Dockerfile.ci -t ghcr.io/tenstorrent/tt-mlir/tt-mlir-ci-ubuntu-22-04:latest .
docker build -f .github/Dockerfile.ird --build-arg FROM_IMAGE=base -t ghcr.io/tenstorrent/tt-mlir/tt-mlir-ird-base-ubuntu-22-04:latest .
docker build -f .github/Dockerfile.ird --build-arg FROM_IMAGE=ci -t ghcr.io/tenstorrent/tt-mlir/tt-mlir-ird-ubuntu-22-04:latest .

Using the Image in GitHub Actions Jobs

The GitHub Actions workflow Build in Docker uses a Docker container for building:

    container:
      image: ghcr.io/${{ github.repository }}/tt-mlir-ci-ubuntu-22-04:latest
      options: --user root

Tools

Currently, there are a few primary tools that are part of the ttmlir project:

  • ttmlir-opt: The ttmlir optimizer driver. This tool is used to run the ttmlir compiler passes on .mlir source files and is central to developing and testing the compiler.
  • ttrt: This tool is intended to be a swiss army knife for working with flatbuffers generated by the compiler. Its primary role is to inspect and run flatbuffer files.
  • tt-explorer: Visualizer tool for ttmlir-powered compiler results. It visualizes emitted .mlir files to display the compiled model, attributes, and performance results, and provides a platform for human-driven overrides to gamify model tuning.

ttmlir-opt

The ttmlir optimizer driver. This tool is used to run the ttmlir compiler passes on .mlir source files and is central to developing and testing the compiler.

Simple Test

./build/bin/ttmlir-opt --ttir-to-ttnn-backend-pipeline test/ttmlir/Dialect/TTNN/simple_multiply.mlir
# Or
./build/bin/ttmlir-opt --ttir-to-ttmetal-backend-pipeline test/ttmlir/Dialect/TTNN/simple_multiply.mlir

ttmlir-translate

The ttmlir-translate translation utility. Unlike the ttmlir-opt tool, which is used to run passes within the MLIR world, ttmlir-translate allows us to ingest something (e.g. code) into the MLIR world, and also to produce something (e.g. an executable binary, or even code again) from MLIR.

Generate C++ code from MLIR

# First, let's run `ttmlir-opt` to convert to proper dialect
./build/bin/ttmlir-opt --ttir-load-system-desc --ttir-layout --convert-ttir-to-ttnn --convert-ttnn-to-emitc test/ttmlir/Dialect/TTNN/simple_multiply.mlir -o c.mlir

# Now run `ttmlir-translate` to produce C++ code
./build/bin/ttmlir-translate -mlir-to-cpp c.mlir -allow-unregistered-dialect

Bonus: These two commands can be piped to avoid writing a .mlir file to disk, like so:

./build/bin/ttmlir-opt --ttir-load-system-desc --ttir-layout --convert-ttir-to-ttnn --convert-ttnn-to-emitc test/ttmlir/Dialect/TTNN/simple_multiply.mlir | ./build/bin/ttmlir-translate -mlir-to-cpp -allow-unregistered-dialect

Generate flatbuffer file from MLIR

# First run `ttmlir-opt` to convert to proper dialect
./build/bin/ttmlir-opt --ttir-to-ttnn-backend-pipeline test/ttmlir/Dialect/TTNN/simple_multiply.mlir -o ttnn.mlir

# Now run `ttmlir-translate` to produce flatbuffer file
./build/bin/ttmlir-translate --ttnn-to-flatbuffer ttnn.mlir -o out.ttnn

ttrt

This tool is intended to be a swiss army knife for working with flatbuffers generated by the compiler. Its primary role is to inspect and run flatbuffer files. It enables the running of flatbuffer files without a front-end runtime.

Building

source env/activate
cmake --build build -- ttrt
ttrt --help

Building runtime mode

Add the following flags when building the compiler

-DTTMLIR_ENABLE_RUNTIME=ON

Building perf mode

Add the following flags when building the compiler

-DTTMLIR_ENABLE_RUNTIME=ON
-DTT_RUNTIME_ENABLE_PERF_TRACE=ON

LOGGER Levels

ttrt supports logging at different logger levels. You will need to set the env var TTRT_LOGGER_LEVEL. By default, it will print all log messages.

TTRT_LOGGER_LEVEL=INFO
TTRT_LOGGER_LEVEL=CRITICAL
TTRT_LOGGER_LEVEL=ERROR
TTRT_LOGGER_LEVEL=WARNING
TTRT_LOGGER_LEVEL=DEBUG

Installing ttrt as python whls

Every time you build ttrt, it will create a wheel (.whl) file in build/runtime/tools/python/build, e.g. ttrt-0.0.235-cp310-cp310-linux_x86_64.whl. You can take this wheel file and install it in any docker container and in any venv outside of ttmlir. After that, you can use all the following functionality in the same way.

  1. Download the wheel file
  2. Create a python venv
python -m venv ttrt_env
source ttrt_env/bin/activate
  3. Install the wheel (replace with your version of the wheel)
pip install ttrt-0.0.235-cp310-cp310-linux_x86_64.whl

Generate a flatbuffer file from compiler

The compiler supports a pass to load a system descriptor to compile against. You can feed this pass into ttmlir-opt.

  1. Build ttmlir
  2. Build ttrt (see the building section on this page)
  3. Generate a ttsys file from the system you want to compile for using ttrt. This will create a system_desc.ttsys file under the ttrt-artifacts folder.
ttrt query --save-artifacts
  4. Use the ttmlir-opt tool in the compiler to feed in the system descriptor. See the ttmlir-opt documentation for more information on how to generate .mlir files.
./build/bin/ttmlir-opt --ttir-load-system-desc="path=/path/to/system_desc.ttsys" --ttir-to-ttnn-backend-pipeline test/ttmlir/Dialect/TTNN/simple_subtract.mlir -o ttnn.mlir
or (pass the path directly into ttir-to-ttnn-backend-pipeline)
./build/bin/ttmlir-opt --ttir-to-ttnn-backend-pipeline="system-desc-path=/path/to/system_desc.ttsys" test/ttmlir/Dialect/TTNN/simple_subtract.mlir -o ttnn.mlir
  5. Use the ttmlir-translate tool in the compiler to generate the flatbuffer executable. See the ttmlir-translate documentation for more information on how to generate flatbuffer files.
./build/bin/ttmlir-translate --ttnn-to-flatbuffer ttnn.mlir -o out.ttnn
  6. Run your test cases using ttrt
ttrt run /path/to/out.ttnn

Generate flatbuffer files using llvm-lit

There are already existing .mlir test cases under test/ttmlir/Silicon. You can use the llvm-lit tool to generate the corresponding ttnn and ttm files.

  1. Build ttmlir
  2. Build ttrt (see the building section on this page)
  3. Generate a ttsys file from the system you want to compile for using ttrt. This will create a system_desc.ttsys file under the ttrt-artifacts folder.
ttrt query --save-artifacts
  4. Export this file in your environment using export SYSTEM_DESC_PATH=/path/to/system_desc.ttsys. When llvm-lit is run, it will query this variable and generate the ttnn and ttm files using this system. Optionally, you can also provide this manually when running llvm-lit.
  5. Generate your test cases. This will generate all your ttnn and ttm files under build/test/ttmlir/Silicon. ttnn files have a .ttnn file extension and ttmetal files have a .ttm extension.
cmake --build build -- check-ttmlir
  6. (Optional) If you have a single .mlir file (or a directory of custom .mlir files) that you created using the compiler, and you want to generate the corresponding ttnn and ttm files for it, you can run llvm-lit standalone on the path of your .mlir file or directory to generate the flatbuffer executables. You will have to make sure you add the correct llvm-lit configs into your .mlir file; see the section on adding llvm-lit config options inside a .mlir file to create flatbuffer binaries for more info. You must also make sure your .mlir test is found within the test/ttmlir/Silicon folder (and point lit to the build folder)!
llvm-lit -v ./build/test/ttmlir/Silicon
or
SYSTEM_DESC_PATH=/path/to/system_desc.ttsys llvm-lit -v ./build/test/ttmlir/Silicon
  7. Run your test cases using ttrt
ttrt run /path/to/test.ttnn
ttrt run /path/to/dir/of/flatbuffers

Adding llvm-lit config options inside a .mlir file to create flatbuffer binaries

Inside of your .mlir file, you can add certain config options that llvm-lit will use when running against that test case. For the purpose of generating flatbuffer executables, you can add --ttir-load-system-desc="path=%system_desc_path%", which tells llvm-lit to use the system desc found via the environment variable set by export SYSTEM_DESC_PATH=/path/to/system_desc.ttsys. You can also paste a custom path to a system desc file instead.

// RUN: ttmlir-opt --ttir-load-system-desc="path=%system_desc_path%" --ttir-implicit-device --ttir-layout --convert-ttir-to-ttnn %s  > %t.mlir
// RUN: FileCheck %s --input-file=%t.mlir
// RUN: ttmlir-translate --ttnn-to-flatbuffer %t.mlir > %t.ttnn

Adding new mlir test cases

You can copy your .mlir test file (with the appropriate llvm-lit config options for generating flatbuffer binaries) into test/ttmlir/Silicon. Then, follow generating flatbuffer files using llvm-lit to generate the executables to run!

Versioning

ttrt and flatbuffers have a strict versioning check. When running a flatbuffer against ttrt, you have to make sure the flatbuffer was generated using the same version as ttrt (or vice versa). Major and minor versions are manually set using GitHub tags when releases are made. The patch version is the number of commits from the last major/minor tag.

vmajor.minor.patch
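As an illustration, the check amounts to comparing parsed version triples. The following is a hedged sketch of the idea only; ttrt's actual implementation and error text differ:

def parse_version(tag: str):
    # "v0.0.235" -> (0, 0, 235)
    major, minor, patch = tag.lstrip("v").split(".")
    return int(major), int(minor), int(patch)

ttrt_version = parse_version("v0.0.235")
flatbuffer_version = parse_version("v0.0.230")

if ttrt_version != flatbuffer_version:
    raise RuntimeError("Flatbuffer version does not match ttrt version! "
                       "Regenerate the flatbuffer (or rebuild ttrt) so the versions match.")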

APIs

ttrt --help
ttrt read
ttrt run
ttrt query
ttrt perf
ttrt check

Command Line

There are different ways you can use the APIs under ttrt. The first is via the command line, as follows. All artifacts are saved under the ttrt-artifacts folder under the TT_MLIR_HOME environment variable. By default, all logging is printed to the terminal. You can specify a log file to dump output to.

read

Read sections of a binary file

ttrt read --help
ttrt read --section mlir out.ttnn
ttrt read --section cpp out.ttnn
ttrt read --section version out.ttnn
ttrt read --section system_desc out.ttnn
ttrt read --section inputs out.ttnn
ttrt read --section outputs out.ttnn
ttrt read --section all out.ttnn
ttrt read --section all out.ttnn --clean-artifacts
ttrt read --section all out.ttnn --save-artifacts
ttrt read --section all /dir/of/flatbuffers
ttrt read system_desc.ttsys
ttrt read --section system_desc system_desc.ttsys
ttrt read system_desc.ttsys --log-file ttrt.log
ttrt read out.ttnn --save-artifacts --artifact-dir /path/to/some/dir
ttrt read out.ttnn --result-file result.json

run

Run a binary file or a directory of binary files. Note: It's required to be on a system with silicon and to have a runtime enabled build -DTTMLIR_ENABLE_RUNTIME=ON.

ttrt run --help
ttrt run out.ttnn
ttrt run out.ttnn --seed 0
ttrt run out.ttnn --init arange
ttrt run out.ttnn --identity
ttrt run out.ttnn --identity --rtol 1 --atol 1
ttrt run out.ttnn --clean-artifacts
ttrt run out.ttnn --save-artifacts
ttrt run out.ttnn --loops 10
ttrt run --program-index all out.ttnn
ttrt run --program-index 0 out.ttnn
ttrt run /dir/of/flatbuffers
ttrt run /dir/of/flatbuffers --loops 10
ttrt run /dir/of/flatbuffers --log-file ttrt.log
ttrt run out.ttnn --save-artifacts --artifact-dir /path/to/some/dir
ttrt run out.ttnn --load-kernels-from-disk
ttrt run out.ttnn --enable-async-ttnn
ttrt run out.ttnn --result-file result.json

query

Query the system to obtain the system desc file (optionally store it to disk). Note: It's required to be on a system with silicon and to have a runtime enabled build -DTTMLIR_ENABLE_RUNTIME=ON.

ttrt query --help
ttrt query
ttrt query --quiet
ttrt query --save-artifacts
ttrt query --clean-artifacts
ttrt query --save-artifacts --log-file ttrt.log
ttrt query --save-artifacts --artifact-dir /path/to/some/dir
ttrt query --result-file result.json

perf

Run performance mode of a binary file or a directory of binary files. Note: It's required to be on a system with silicon and to have a runtime enabled build -DTTMLIR_ENABLE_RUNTIME=ON. You also need a perf enabled build -DTT_RUNTIME_ENABLE_PERF_TRACE=ON. Note: You can collect host-only performance data via the --host-only flag; by default, both host and device side performance data are collected. If the save artifacts flag is provided, perf mode will dump the following files in the artifacts directory:

ops_perf_results.csv : compiled op performance results
profile_log_device.csv : dump of all device side profiled results
tracy_ops_data.csv : op data results dumped in a readable format
tracy_ops_times.csv : op time results dumped in a readable format
tracy_profile_log_host.tracy : tracy profiled results file, this file can be fed into the tracy GUI
ttrt perf --help
ttrt perf out.ttnn
ttrt perf out.ttnn --clean-artifacts
ttrt perf out.ttnn --save-artifacts
ttrt perf out.ttnn --loops 10
ttrt perf --program-index all out.ttnn
ttrt perf --program-index 0 out.ttnn
ttrt perf --host-only out.ttnn
ttrt perf /dir/of/flatbuffers --host-only
ttrt perf /dir/of/flatbuffers --loops 10 --host-only
ttrt perf /dir/of/flatbuffers --log-file ttrt.log --host-only
ttrt perf --save-artifacts --artifact-dir /path/to/some/dir
ttrt perf out.ttnn --result-file result.json

To use the Tracy GUI, run the following instructions on your MacBook. You can upload your .tracy file into the GUI to view the profiled dumps.

git clone https://github.com/tenstorrent-metal/tracy.git
cd tracy/profiler/build/unix
make all
./Tracy-release

check

Check a binary file or a directory of binary files against a system desc (by default, the host machine's). Note: It's required to be on a system with silicon and to have a runtime enabled build -DTTMLIR_ENABLE_RUNTIME=ON.

ttrt check --help
ttrt check out.ttnn
ttrt check out.ttnn --system-desc /path/to/system_desc.ttsys
ttrt check out.ttnn --clean-artifacts
ttrt check out.ttnn --save-artifacts
ttrt check out.ttnn --log-file ttrt.log
ttrt check /dir/of/flatbuffers --system-desc /dir/of/system_desc
ttrt check --save-artifacts --artifact-dir /path/to/some/dir out.ttnn
ttrt check out.ttnn --result-file result.json

ttrt as a python package

The other way to use the APIs under ttrt is to import it as a library. This allows you to use it in custom scripts.

Import ttrt as a python package

from ttrt.common.api import API

Setup API and register all features

API.initialize_apis()

Setup arguments

You can specify certain arguments to pass to each API, or use the default arguments provided

args

This can be a dictionary of values to set inside your API instance. These are the same options as found via the command line. You can get the full list of supported arguments via the ttrt --help command line. Any argument not provided will be set to the default.

custom_args = {}
custom_args["--clean-artifacts"] = True
query_instance = API.Query(args=custom_args)

logging

You can specify a specific logging module you want to set inside your API instance. The rationale behind this is to support different instances of different APIs, each being able to log to a different file.

from ttrt.common.util import Logger

log_file_name = "some_file_name.log"
custom_logger = Logger(log_file_name)
read_instance = API.Read(logger=custom_logger)

artifacts

You can specify a specific artifacts directory to store all the generated metadata during the execution of any API run. This allows you to specify different artifact directories for different instances of APIs, if you wish.

from ttrt.common.util import Artifacts

log_file_name = "some_file_name.log"
artifacts_folder_path = "/opt/folder"
custom_logger = Logger(log_file_name)
custom_artifacts = Artifacts(logger=custom_logger, artifacts_folder_path=artifacts_folder_path)
run_instance = API.Run(artifacts=custom_artifacts)

Execute API

Once all the arguments are set up, you can run your API instance with all your provided arguments. Note: APIs are stateless; subsequent calls to the same API instance will not preserve previous call artifacts. You can generate a new artifacts directory for subsequent runs if you wish to call the APIs multiple times.

result_code, results = query_instance()
result_code, results = read_instance()
result_code, results = run_instance()

Putting it all together

You can do interesting things when combining all the above features into your python script:

from ttrt.common.api import API
from ttrt.common.util import Logger
from ttrt.common.util import Artifacts

API.initialize_apis()

custom_args = {}
custom_args["--clean-artifacts"] = True
custom_args["--save-artifacts"] = True
custom_args["--loops"] = 10
custom_args["--init"] = "randn"
custom_args["binary"] = "/path/to/subtract.ttnn"

log_file_name = "some_file_name.log"
custom_logger = Logger(log_file_name)

artifacts_folder_path = "/opt/folder"
custom_artifacts = Artifacts(logger=custom_logger, artifacts_folder_path=artifacts_folder_path)

run_instance = API.Run(args=custom_args, logger=custom_logger, artifacts=custom_artifacts)
result_code, results = run_instance()

FAQ

Flatbuffer version does not match ttrt version!

  • ttrt and flatbuffers have strict versioning that is checked during ttrt execution. You will have to generate a flatbuffer using the same version of ttrt (or vice versa). This means you might have to build on the same branch on which the flatbuffer was generated, or regenerate the flatbuffer using your current build.

System desc does not match flatbuffer!

  • Flatbuffers are compiled using a specific system desc (or default values if no system desc is provided). During runtime, the flatbuffer's system desc is checked against the current system to ensure the system being run on supports the flatbuffer that was compiled. If you get this error, you will have to regenerate the flatbuffer using the system you want to run on. See the generate a flatbuffer file from compiler section on how to do this.

I just want to test and push my commit! What do I do!

  • follow these steps (on both n150 and n300)
1. Build ttmlir (sample instructions - subject to change)
source env/activate
cmake -G Ninja -B build -DCMAKE_BUILD_TYPE=Release -DCMAKE_C_COMPILER=clang-17 -DCMAKE_CXX_COMPILER=clang++-17 -DCMAKE_CXX_COMPILER_LAUNCHER=ccache -DTTMLIR_ENABLE_RUNTIME=ON -DTT_RUNTIME_ENABLE_PERF_TRACE=ON
cmake --build build

2. Build ttrt (sample instructions - subject to change)
cmake --build build -- ttrt

3. Query system
ttrt query --save-artifacts

4. Export system desc file
export SYSTEM_DESC_PATH=/path/to/system_desc.ttsys (path dumped in previous command)

5. Generate test cases
cmake --build build -- check-ttmlir

6. Run test cases
ttrt run build/test/ttmlir/Silicon

7. (Optional) Run perf test cases
ttrt perf build/test/ttmlir/Silicon

tt-explorer

Welcome to the tt-explorer wiki! The Wiki will serve as a source for documentation, examples, and general knowledge related to the TT-MLIR visualization project. The sidebar will provide navigation to relevant pages. If this is your first time hearing about the project, take a look at Project Architecture for an in-depth introduction to the tool and motivations behind it :)

Quick Start

TT-Explorer is made to be as painless as possible, so the installation on top of the pre-existing tt-mlir project is as minimal as possible.

  1. Build tt-mlir
  2. Run source env/activate to be in tt-mlir virtualenv for the following steps
  3. Install tt-adapter using pip install -e . in the tt-adapter root directory.
  4. Install tt-explorer using pip install -e . in the tt-explorer root directory.
  5. Run tt-explorer in a terminal to start a tt-explorer instance. (Refer to the CLI section in the API for specifics.)
  6. Ensure the server has started in the tt-explorer shell instance (check for the message below):
Starting Model Explorer server at:
http://localhost:8080


TT-Explorer - Project Architecture

TT-Explorer is a tool made to ease the pain of tuning a model and developing on Tenstorrent hardware. It provides a “Human-In-Loop” interface such that the compiler results can be actively tuned and understood by the person compiling the model. To complete this goal, the tool has to be designed such that users of any level of experience are all able to glean useful information from the visualization of the model, and be able to explore what the model does.

Software Architecture

The software will be built around the TT-Forge compiler to provide most of the functionality. Model Explorer will be used for the visualization functionality and as the main platform upon which TT-Explorer is built.

Since Model-Explorer is built using Python, the majority of TT-Explorer will be structured in Python, with frequent use of the bindings to C++ provided by TT-MLIR.

The following components will be put together:

[Figure: TT-Explorer architecture diagram]

TT-Forge-FE (Front End)

TT-Forge FE is currently the primary frontend which uses TVM to transform conventional AI models into the MLIR in the TTIR Dialect.

Ingests: AI Model defined in PyTorch, TF, etc…
Emits: Rudimentary TTIR Module consisting of Ops from AI Model.

TT-MLIR

TT-MLIR currently defines the out-of-tree MLIR compiler created by Tenstorrent to specifically target TT Hardware as a backend. It comprises a platform of several dialects (TTIR, TTNN, TTMetal) and the passes and transformations to compile a model into an executable that can run on TT hardware. In the scope of TT-Explorer the python bindings will be leveraged.

Ingests: TTIR Module, Overrides JSON
Emits: Python Bindings to interface with TTIR Module, Overridden TTIR Modules, Flatbuffers

TT-Adapter

Model Explorer provides an extension interface where custom adapters can be implemented to visualize from different formats. TT-Adapter is the adapter created for TT-Explorer that parses TTIR Modules using the Python Bindings provided by TT-MLIR to create a graph legible by model-explorer. It also has an extensible REST endpoint that is leveraged to implement functionality; this endpoint acts as the main bridge between the Client and Host side processes.

Ingests: TTIR Modules, TT-MLIR Python Bindings, REST API Calls
Emits: Model-Explorer Graph, REST API Results

TTRT

TT-RT is the runtime library for TT-Forge, which provides an API to run Flatbuffers generated from TT-MLIR. These flatbuffers contain the compiled results of the TTIR module, and TTRT allows us to query and execute them. Particularly, a performance trace can be generated using Tracy, which is fed into model-explorer to visualize the performance of operations in the graph.

Ingests: Flatbuffers
Emits: Performance Trace, Model Results

Model-Explorer

Model Explorer is the backbone of the client and visualization of these models. It is deceptively placed in the “Client” portion of the diagram, but realistically TT-Explorer will be run on the host, and so will the model-explorer instance. The frontend will be a client of the REST API created by TT-Adapter and will use URLs from the model-explorer server to visualize the models.

Ingests: Model Explorer Graph, User-Provided Overrides (UI), Performance Trace
Emits: Overrides JSON, Model Visualization

These components all work together to provide the TT-Explorer platform.

Client-Host Design Paradigm

Since performance traces and execution rely on silicon machines, there is a push to decouple the execution and MLIR-environment-heavy aspects of TT-Explorer onto some host device, and to have a lightweight client API that uses the REST endpoint provided by TT-Adapter to leverage the host device without having to constantly be on said host. This is very useful for cloud development (as is common at Tenstorrent). In doing so, TT-Explorer becomes a project that can be spun up either in a tt-mlir environment or without one. The lightweight Python version of TT-Explorer provides a set of utilities that call upon and visualize models from the host; the host creates the server and serves the API to be consumed.

TT-Explorer

The following is a reference for the API used in TT-Explorer; also check the TT-Adapter API reference below.

TTExplorer

Overview

The TTExplorer class is responsible for interacting with the model_explorer server, including uploading models, initializing settings, and executing models.

Initialization

__init__(self, port=8080, url="http://localhost", server=False, config=None)

Initializes the TTExplorer instance.

  • Parameters:
    • port (int): The port number for the model_explorer server. Default is 8080.
    • url (str): The base URL for the model_explorer server. Default is "http://localhost".
    • server (bool): Flag to indicate if the server should be created. If this is set to True, ensure an environment where the ttrt and ttmlir python bindings are available. Default is False.
    • config (dict): Configuration for the model_explorer server. Default is None.

Methods

get_model_path(self, file) -> str

Uploads a model file to the model_explorer server and returns the temporary path provided by the server.

  • Parameters:
    • file (file-like object): The model file to be uploaded.
  • Returns:
    • str: The temporary path of the uploaded model file.

initialize(self, settings={})

Initializes the server-side TT-Explorer by assigning a System Descriptor for future operations; this is needed to execute models.

  • Parameters:
    • settings (dict): Settings for initialization, currently none. Default is an empty dictionary.
  • Returns:
    • dict: a dict with a system_desc_path key pointing to the server path of the System Descriptor

execute_model(self, model_path: str, settings={})

Executes a model on the model_explorer server with the provided settings.

  • Parameters:
    • model_path (str): Server path to the ttir module to be executed; ensure that the module has been uploaded first.
    • settings (dict): Settings for execution. Default is an empty dictionary.
      • "ttir_to_ttnn_options": List[str] Pipeline options to be fed into ttir-to-ttnn-backend-pipeline's String Parser
      • "artifact_dir": str(Path) A valid Server-Path to store artifacts from execution, if this flag is set then artifacts are not automatically deleted after execution is complete.
  • Returns:
    • dict: Relevant emitted files from Execution
      • "log_file": str: Log-File from ttrt perf call
      • "stdout": str: STDOUT from ttrt perf call, utf-8 decoded.
      • "perf_trace": str: CSV Performance Trace from module run.

Example Usage

# Initialize TTExplorer
explorer = TTExplorer(server=True)
# Explorer instance now running on thread on http://localhost:8080
# Make sure you wait until the thread has started the Flask server, you can check by viewing STDOUT.

# Upload a model file
file = open('my-module.mlir', 'r')
model_path = explorer.get_model_path(file)
# Since local==server, the model_path is moved to a tempfile on the same machine

# Initialize the SystemDesc on Machine for execution purposes
explorer.initialize()

# Execute the model, store artifacts permanently in home directory.
resp = explorer.execute_model(model_path, settings={'artifact_dir': '/home/<user>/ttrt-artifacts'})

csv = resp['perf_trace'] # Do with the CSV trace as you would like to view the performance results!

TT-Adapter

The following is a reference for the "REST" API provided by TT-Adapter. First, a short info-dump on how an extensible API can be built on top of Model Explorer.

Building an API using Model Explorer

The /apipost/v1/send_command endpoint provides an extensible platform through which commands are sent to be executed directly by the specified adapter. This becomes the main endpoint through which communication is facilitated between the server and client; the commands respond with an "adapter response".

Sending Commands

The body of the command must be JSON, and only the following fields are fed into the adapter functions:

cmd = {
  "extensionId": "tt_adapter", // tt_adapter to invoke functions from TT-Adapter
  "cmdId": "<name of function>", // Name of function to be run, "convert" is built into all adapters to convert some model to graph
  "modelPath": "<local_path to file>", // Path to model on server to be fed into function
  "deleteAfterConversion": False, // True if file at modelPath is to be deleted after function run
  "settings": {...}, // Object holding custom settings to be fed into function
}

More often than not, functions do not need all of these fields, but they must all be present to properly process the command sent into the function. Speaking of functions, the signature that all command handlers have to follow is as such:

class TTAdapter(Adapter):
  ...
  def my_adapter_fn(self, model_path: str, settings: dict):
    pass # Parse model_path and settings objects as they are fed from send_command endpoint.

This function is invoked on a new instance every time; this is important to understand for the idea of persisting information on the server. The onus is often on the end-user to store and preserve important information, such as the path of a model they've uploaded or the paths of important artifacts that the server has produced. TTExplorer aims to make this as easy as possible.

Information can be processed in this function however the user would like to define, and often settings becomes a versatile field to provide more information and context for the execution of some function. As an example, refer to TTAdapter:initialize: this function, which loads a SystemDesc into the environment, has little to do with modelPath or deleteAfterConversion, so those variables are not processed at all, and the function executes a static initialization process regardless of the parameters passed into the command.

Adapter Response

Model Explorer was probably not made to allow for such an extensible framework to be tacked onto it. As such, the adapter response is processed in a very particular way before it is sent back to the user. In particular, refer to model_explorer.utils.convert_adapter_response, which is run on the output of every function. This means that responses can only be in JSON format and are constructed as:

{
  "graphs": [
    {/* response goes here */},
  ]
}

for custom adapter responses. This limits the transfer of raw byte data through different MIME types, and requires tt_adapter.utils.to_adapter_format, which turns any dict object into a Model Explorer adapter-compatible response. While this framework works well for graphs, it makes an "extensible" API difficult to implement.
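As an illustrative sketch (assumed behaviour, not the actual implementation), a helper like to_adapter_format only needs to wrap an arbitrary dict in the graphs envelope shown above:

def to_adapter_format(payload: dict) -> dict:
    # Wrap any response dict in the "graphs" list that Model Explorer expects.
    return {"graphs": [payload]}

print(to_adapter_format({"system_desc_path": "/path/to/system_desc.ttsys"}))
# {'graphs': [{'system_desc_path': '/path/to/system_desc.ttsys'}]}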

Current API Reference:

Initialize

Called from TTExplorer.initialize, used to Load SystemDesc into environment.

cmd = {
  "extensionId": "tt_adapter",
  "cmdId": "initialize",
  "modelPath": "", // Irrelevant
  "deleteAfterConversion": False,
  "settings": {}, // None at the moment
}

// RESPONSE

{"system_desc_path": "<path to system_desc.ttsys>"}

Execute

Called from TTExplorer.execute_model, executes a model.

cmd = {
  "extensionId": "tt_adapter",
  "cmdId": "execute",
  "modelPath": "<server-path-to-model>",
  "deleteAfterConversion": False, // Can be set to True if TTIR module is to be deleted after run
  "settings": {
    "ttir_to_ttnn_options": List[str], // Pipeline Options to feed into ttir_to_ttnn_backend_pipeline
    "artifact_dir": str, // Path on server to store TTRT Artifacts to, artifacts are not deleted after perf if set.
  },
}

// RESPONSE

{
  "stdout": "<raw text output to STDOUT from TTRT Perf Run>",
  "log_file": "<raw logging output>",
  "perf_trace": "<raw CSV perf trace collected from TTRT Perf Run, Not present if TTRT Perf failed>",
}

Convert

Standard built-in conversion function, converts TTIR Module into Model Explorer Graph. Also provides settings as a platform for overrides to be applied to the graph.

cmd = {
  "extensionId": "tt_adapter",
  "cmdId": "convert",
  "modelPath": "<server-path-to-ttir-module>",
  "deleteAfterConversion": True/False,
  "settings": {/* Overrides */}, // Undefined at the moment
}

// RESPONSE

<model-explorer-graph JSON Object>

Milestone 1 (v0.1)

Main Goal - Visualize & Execute

This will highlight half of the essential work that this tool should be able to do in both visualizing a model and executing it using the current TT-Forge stack. The frontend transformation of a model -> TTIR will be done outside of the scope of TT-Explorer at the moment. For this milestone TT-Explorer will be able to spin up a host-side and a client-side instance. The tool will be able to ingest TTIR modules to produce a visual result, and be able to execute this module. Ambitiously, the performance traces should be collected back into TT-Explorer to be displayed.

Tasks:

  • Load TTIR Modules and Visualize TTIR-Ops in Model Explorer
  • Create Extensible Notebook UX allowing for visualization and scripting capabilities
  • Add functionality to Model Explorer to load from re-compiled TTIR Modules (might be from JSON)
  • Add functionality to TT-MLIR to execute from Python Bindings
  • Create REST API skeleton in TT-Adapter
  • From REST API Call, Invoke python bindings to execute TTIR module using TT-Adapter
  • (If possible) Parse Perf Trace Artifact and visualize performance in Model-Explorer (as Node Data)

Milestone 2 (v0.2)

Main Goal - Model Editor

The primary function of TT-Explorer is to visualize and edit a model according to the overrides the user defines over the automatically generated compiler results. This milestone highlights that functionality in TT-Explorer, focusing on UI, TT-MLIR, and TT-Explorer features that enable the user to edit and tune a model “in-loop” with the TT-Forge compiler.

Tasks:

  • Flesh out and test location IDs so that operations can be tracked through the compiler stack.
  • Use Loc IDs to bind TTIR Ops with Tracy Perf Trace Artifact, and send to Model-Explorer to visualize.
  • Implement Overrides Functionality into TT-MLIR, tracking based on Loc IDs.
  • Overhaul UI to enable editing node attributes, use these updated fields to send information back to TT-Explorer via REST API (in the form of an Overrides JSON)
  • Parse the Overrides JSON and apply overrides over a REST API call, then visualize the re-compiled graph.
  • Provide REST API endpoint to track “legal” configurations and provide “legal” options attached to Graph JSON.

Milestone 3 (v0.3+)

Main Goal - Matured Tool and Extensibility

The focus of this milestone is to transition TT-Explorer from a prototype into a mature visualization and editing tool for “Human-In-Loop” compilation. The tool is planned to be made extensible for other dialects and entry points forecast for TT-MLIR (Jax, StableHLO, etc.), and development of the tool's visualization components should provide feedback to upstream repos like model-explorer. Here the focus is on providing extensible interfaces for new UI elements (supporting multi-chip and beyond), the REST API, and Overrides.

Tasks:

  • Begin researching autogenerated Python bindings for pipelines and transformations defined in C++.
  • Create modular frontend capabilities out of Flask app in Model-Explorer
  • Create a “mono-adapter” which holds the paths to invoke dialect-specific adapters for each dialect to be supported by TT-Explorer
  • Begin adding new dialects like .ttm, .ttnn to Model Explorer so that compiled results can be inspected and analyzed to optimize at different steps of the compiler.
  • To be defined later, depending on the growth of the MLIR Project

Flatbuffers

Flatbuffers are the binary serialization format used by TTMLIR and they currently come in a few flavors (designated by the file extension):

  • .ttsys: A system description file that is the mechanism for supplying target information to the compiler. These can be collected on a target machine and downloaded to a development machine to enable cross-compilation.
  • .ttnn: A compiled binary file intended to be loaded and executed by the TTNN backend runtime.
  • .ttb: A compiled binary file intended to be loaded and executed by the TTMetal backend runtime (Unsupported).
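Since flatbuffer binaries conventionally carry a 4-byte file identifier at offset 4, a quick way to peek at which flavor a given binary is might look like the sketch below; this assumes the TTMLIR flatbuffers set a file identifier, which flatbuffers supports but this document does not confirm.

def flatbuffer_identifier(path: str) -> str:
    # Bytes 4..8 of a flatbuffer hold the optional 4-character file identifier.
    with open(path, "rb") as f:
        header = f.read(8)
    return header[4:8].decode("ascii", errors="replace")

print(flatbuffer_identifier("out.ttnn"))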

ci

Our CI infrastructure is currently hosted in the cloud. Cloud machines are used and linked as GitHub runners.

Key Words

Target Silicon (coming soon)

- 1:1 mapping to unique system-desc (this is because an n150 card can have different harvested rows)

Target Family

- product type (n150, n300)

Target Capabilities (coming soon)

- describes testable traits of Target Family
n150: {
    test params to use if running on n150
}
n300: {
    test params to use if running on n300
}

Test Capabilities (coming soon)

- set of target capabilities defined in the test
- test will populate certain parameters depending on the Target Family/Target Silicon it is running on

GitHub Runner CI Tags

Runner Use

There are 2 types of runner machines: builders, which build offline, and runners, which are silicon machines.

- builder
- runner

Runner Type

There are 2 runner types: bare-metal runners are standalone machines, and virtual machines are Kubernetes pods.

- bare-metal
- virtual-machine

Architecture

Supported architectures

- wormhole_b0
- blackhole (coming soon)

Pipeline Type

Supported pipelines

- perf
- functional

Active

Defines whether a runner is in service or taken out of service for maintenance

- in-service
- out-of-service

Target Family

Supported configurations of machines

- n150
- n300
- t3000 (coming soon)
- tg (coming soon)
- tgg (coming soon)

Target Silicon (coming soon)

- silicon-n150-0 (0th row harvested)
- silicon-n150-1 (1st row harvested)
- silicon-n300-0-0 (0th row harvested on both chips)

Pipeline durations

- push: every push to main
- pr: every PR

CI Test Flow

1. GitHub runner
- build tt-mlir
- build ttrt
- upload artifacts

2. Silicon runner
- download tt-mlir / ttrt artifacts
- ttrt generate system desc
- llvm-lit runs all unit tests, including silicon ones, to generate flatbuffers (only flatbuffers supported for that test file will be generated)
- ttrt runs generated flatbuffers

Adding a test

When adding a test, you can specify when the test should run and what values it should inherit. The test defines how it should run, not the infrastructure; the infrastructure will execute what the test defines. For now, if you specify nothing, the test will run with all default parameters. Note: if you provide a target family, the test will by default run on any target silicon machine of that family. If you need a specific target silicon machine (e.g. one with the 1st row harvested), specify it in Target Silicon. Note: if you specify the perf pipeline, the test will automatically run on a bare-metal machine.

Default parameters:

[Architecture]: [wormhole_b0]
[Pipeline]: [functional, perf]
[Target Family]: [n150, n300]
[Target Silicon]: []
[Duration]: [push]
Location: test/ttmlir/Silicon
File Type: .mlir
REQUIRES: [Architecture] [Pipeline] [Target Family] [Target Silicon] [Duration] (coming soon)
UNSUPPORTED: [Target Family] [Target Silicon] (coming soon)

Additional Reading

This section contains pointers to reading material that may be useful for understanding the project.

MLIR

  • https://llvm.org/docs/tutorial/MyFirstLanguageFrontend/index.html
  • https://mlir.llvm.org/docs/Tutorials/Toy/
  • https://www.jeremykun.com/2023/08/10/mlir-getting-started/
  • https://arxiv.org/pdf/2002.11054
  • https://ieeexplore.ieee.org/abstract/document/9370308

Dialects

Tablegen

LLVM Testing Framework Tools

  • Jax
  • Flatbuffer
  • Openxla Website
  • openxla
  • StableHLO

Contributor Covenant Code of Conduct

Our Pledge

We as members, contributors, and leaders pledge to make participation in our community a harassment-free experience for everyone, regardless of age, body size, visible or invisible disability, ethnicity, sex characteristics, gender identity and expression, level of experience, education, socio-economic status, nationality, personal appearance, race, religion, or sexual identity and orientation.

We pledge to act and interact in ways that contribute to an open, welcoming, diverse, inclusive, and healthy community.

Our Standards

Examples of behavior that contributes to a positive environment for our community include:

  • Demonstrating empathy and kindness toward other people
  • Being respectful of differing opinions, viewpoints, and experiences
  • Giving and gracefully accepting constructive feedback
  • Accepting responsibility and apologizing to those affected by our mistakes, and learning from the experience
  • Focusing on what is best not just for us as individuals, but for the overall community

Examples of unacceptable behavior include:

  • The use of sexualized language or imagery, and sexual attention or advances of any kind
  • Trolling, insulting or derogatory comments, and personal or political attacks
  • Public or private harassment
  • Publishing others' private information, such as a physical or email address, without their explicit permission
  • Other conduct which could reasonably be considered inappropriate in a professional setting

Enforcement Responsibilities

Community leaders are responsible for clarifying and enforcing our standards of acceptable behavior and will take appropriate and fair corrective action in response to any behavior that they deem inappropriate, threatening, offensive, or harmful.

Community leaders have the right and responsibility to remove, edit, or reject comments, commits, code, wiki edits, issues, and other contributions that are not aligned to this Code of Conduct, and will communicate reasons for moderation decisions when appropriate.

Scope

This Code of Conduct applies within all community spaces, and also applies when an individual is officially representing the community in public spaces. Examples of representing our community include using an official e-mail address, posting via an official social media account, or acting as an appointed representative at an online or offline event.

Enforcement

Instances of abusive, harassing, or otherwise unacceptable behavior may be reported to the community leaders responsible for enforcement at nsmith@tenstorrent.com or staylor@tenstorrent.com. All complaints will be reviewed and investigated promptly and fairly.

All community leaders are obligated to respect the privacy and security of the reporter of any incident.

Enforcement Guidelines

Community leaders will follow these Community Impact Guidelines in determining the consequences for any action they deem in violation of this Code of Conduct:

1. Correction

Community Impact: Use of inappropriate language or other behavior deemed unprofessional or unwelcome in the community.

Consequence: A private, written warning from community leaders, providing clarity around the nature of the violation and an explanation of why the behavior was inappropriate. A public apology may be requested.

2. Warning

Community Impact: A violation through a single incident or series of actions.

Consequence: A warning with consequences for continued behavior. No interaction with the people involved, including unsolicited interaction with those enforcing the Code of Conduct, for a specified period of time. This includes avoiding interactions in community spaces as well as external channels like social media. Violating these terms may lead to a temporary or permanent ban.

3. Temporary Ban

Community Impact: A serious violation of community standards, including sustained inappropriate behavior.

Consequence: A temporary ban from any sort of interaction or public communication with the community for a specified period of time. No public or private interaction with the people involved, including unsolicited interaction with those enforcing the Code of Conduct, is allowed during this period. Violating these terms may lead to a permanent ban.

4. Permanent Ban

Community Impact: Demonstrating a pattern of violation of community standards, including sustained inappropriate behavior, harassment of an individual, or aggression toward or disparagement of classes of individuals.

Consequence: A permanent ban from any sort of public interaction within the community.

Attribution

This Code of Conduct is adapted from the Contributor Covenant, version 2.0, available at https://www.contributor-covenant.org/version/2/0/code_of_conduct.html.

Community Impact Guidelines were inspired by Mozilla's code of conduct enforcement ladder.

For answers to common questions about this code of conduct, see the FAQ at https://www.contributor-covenant.org/faq. Translations are available at https://www.contributor-covenant.org/translations.

Project Structure

  • env: Contains the environment setup for building project dependencies, such as LLVM and Flatbuffers
  • include/ttmlir: Public headers for the TTMLIR library
    • Dialect: MLIR dialect interfaces and definitions, dialects typically follow a common directory tree structure:
      • IR: MLIR operation/type/attribute interfaces and definitions
      • Passes.[h|td]: MLIR pass interfaces and definitions
      • Transforms: Common MLIR transformations, typically invoked by passes
    • Target: Flatbuffer schema definitions. This defines the binary interface between the compiler and the runtime
  • lib: TTMLIR library implementation
    • CAPI: C API for interfacing with the TTMLIR library, note this is needed for implementing the python bindings. Read more about it here: https://mlir.llvm.org/docs/Bindings/Python/#use-the-c-api
    • Dialect: MLIR dialect implementations
  • runtime: Device runtime implementation
    • include/tt/runtime: Public headers for the runtime interface
    • lib: Runtime implementation
    • tools/python: Python bindings for the runtime, currently this is where ttrt is implemented
  • test: Test suite
  • tools/ttmlir-opt: TTMLIR optimizer driver

Namespaces

  • mlir: On the compiler side, we use the MLIR namespace for all MLIR types and operations and subnamespaces for our dialects.
    • mlir::tt: Everything ttmlir related is underneath this namespace. Since we need to subnamespace under mlir, just mlir::tt seemed better than mlir::ttmlir which feels redundant.
      • mlir::tt::ttir: The TTIR dialect namespace
      • mlir::tt::ttnn: The TTNN dialect namespace
      • mlir::tt::ttmetal: The TTMetal dialect namespace
      • mlir::tt::ttkernel: The TTKernel dialect namespace
  • tt::runtime: On the runtime side, we use the tt::runtime namespace for all runtime types and operations.
    • tt::runtime::ttnn: The TTNN runtime namespace
    • tt::runtime::ttmetal: The TTMetal runtime namespace (not implemented)

Dialects Overview

Here is a brief overview of the dialects in the project; please refer to the individual dialect documentation for more details:

  • tt: Common types (such as tt.tile, tt.layout, tt.grid, etc.) and enums (such as data formats, memory spaces, iterator types, etc.)
  • ttir: A high level dialect that models the tensor compute graph on tenstorrent devices. Accepts tosa and linalg input.
    • ttir.generic: Generically describe compute work.
    • ttir.to_layout: Convert between different tensor memory layouts and transfer between different memory spaces.
    • tensor.pad: Pad a tensor with a value (i.e. for convs)
    • ttir.yield: return result memref of computation in dispatch region body, lowers to ttkernel.yield
    • ttir.kernel: lowers to some backend kernel
  • ttnn: A TTNN dialect that models the ttnn API.
  • ttkernel: Tenstorrent kernel library operations.
    • ttkernel.noc_async_read
    • ttkernel.noc_async_write
    • ttkernel.cb_push_back
    • ttkernel.[matmul|add|multiply]: Computations on tiles in source register space, store the result in dest register space.
    • ttkernel.sfpu_*: Computations on tiles in dest register space using sfpu coprocessor.
  • ttmetal: Operations that dispatch work from host to device.
    • ttmetal.dispatch: Dispatch a grid of compute work.

Adding an Op

This guide will walk you through the process of adding a new Op end to end in tt-mlir, in this case we will be adding a matmul operation. Note that the matmul op was added as part of the same changeset as this guide, it could be useful to reference the diff alongside this guide to see the changes in full.

This guide will cover the following steps:

1. Define the Op in the TTIR frontend dialect

We will start by defining the Op in the TTIR dialect. The TTIR Ops are defined in a tablegen file located at include/ttmlir/Dialect/TTIR/IR/TTIROps.td.

Tablegen is a domain-specific language for defining ops/types/attributes in MLIR and LLVM, these definitions constitute the dialect's Operation Definition Specification (ODS).

Here is an example of defining matmul in the TTIR dialect:

def TTIR_MatmulOp : TTIR_DPSOp<"matmul"> {
    let summary = "Matrix multiply operation.";
    let description = [{
      Matrix multiply operation.
    }];

    let arguments = (ins AnyRankedTensor:$a,
                         AnyRankedTensor:$b,
                         AnyRankedTensor:$output,
                         TT_OperandConstraintArrayAttr:$operand_constraints);

    let results = (outs AnyRankedTensor:$result);

    let extraClassDeclaration = [{
      MutableOperandRange getDpsInitsMutable() { return getOutputMutable(); }
    }];

    let hasVerifier = 1;
}

There are many things to break down here, starting from the top:

  • def in tablegen is used to define a concrete type, this will have a 1-1 mapping to a C++ generated class, and for this particular case the build will end up generating file build/include/ttmlir/Dialect/TTIR/IR/TTIROps.h.inc.
  • It inherits from class TTIR_DPSOp, classes in tablegen don't define a concrete type, but rather an interface that augment or constrain inherited defs. TTIR_DPSOp is a class that defines the common attributes for all TTIR Ops that implement Destination Passing Style (DPS) semantics. DPS just means that the result tensor is passed as an argument to the operation which will be critical for modeling buffer allocation / lifetimes. Note the 3rd argument AnyRankedTensor:$output.
  • Next we have a list of arguments. These arguments consist of a mixture of Types (i.e. AnyRankedTensor) and Attributes (i.e. TT_OperandConstraintArrayAttr). Read more about Types & Attributes here.
    • AnyRankedTensor is part of a tablegen standard library which type aliases to MLIR's builtin Tensor type, with the added constraint that the tensor has a static rank. As much as possible we want to use the builtin types and infrastructure provided by MLIR.
    • TT_OperandConstraintArrayAttr is a custom attribute that we have defined in the TT dialect. This attribute is used to specify constraints on the operands of the operation. For example, the TTIR_MatmulOp requires that the input tensors be in tile layout, this attribute captures this constraint.
  • Next we have a list of results in this case just 1, which aliases the output tensor. One drawback of DPS is that the result tensor and the output tensor will appear to have different SSA names in the IR, but they really alias the same object. This can make writing some passes more cumbersome.
  • Next we have extraClassDeclaration, which enables us to inject member functions, written directly in C++, into the generated class. We are doing this for this particular case in order to satisfy the DPS interface which requires an implementation for getting the mutated output tensor.
  • Finally, we have hasVerifier = 1, this tells MLIR that we have a verifier function that will be called to validate the operation. This is a good practice to ensure that the IR is well formed.

We can now try building and opening the TTIROps.h.inc file to see the generated C++ code. We will actually get a linker error, because hasVerifier = 1 automatically declared a verifier function that we still need to implement.

Let's head over to lib/Dialect/TTIR/IR/TTIROps.cpp and implement the verifier.

// MatmulOp verification
::mlir::LogicalResult mlir::tt::ttir::MatmulOp::verify() {
  ::mlir::RankedTensorType inputAType = getA().getType();
  ::mlir::RankedTensorType inputBType = getB().getType();
  ::mlir::RankedTensorType outputType = getOutput().getType();

  llvm::ArrayRef<int64_t> outputShape = outputType.getShape();
  llvm::SmallVector<int64_t> inputAShape(inputAType.getShape());
  llvm::SmallVector<int64_t> inputBShape(inputBType.getShape());

  // Verify that the input A is at least 1D tensor
  if (inputAType.getRank() < 1) {
    return emitOpError("Input A must be at least a 1D tensor");
  }

  // Verify that the input B is at least 1D tensor
  if (inputBType.getRank() < 1) {
    return emitOpError("Input B must be at least a 1D tensor");
  }

  // If input A is a vector (1D tensor), 1 is prepended to its dimension for the
  // purpose of the matrix multiply. After the matrix multiply, the prepended
  // dimension is removed.
  if (inputAType.getRank() == 1) {
    inputAShape.insert(inputAShape.begin(), 1);
  }

  // If input B is a vector (1D tensor), a 1 is appended to its dimension for
  // the purpose of the matrix-vector product and removed after.
  if (inputBType.getRank() == 1) {
    inputBShape.push_back(1);
  }

  // Verify that the input A and input B have matching inner dimensions
  if (inputAShape[inputAShape.size() - 1] !=
      inputBShape[inputBShape.size() - 2]) {
    return emitOpError(
        "Input A[-1](" + std::to_string(inputAShape[inputAShape.size() - 1]) +
        ") and B[-2](" + std::to_string(inputBShape[inputBShape.size() - 2]) +
        ") must have matching inner dimensions");
  }

  llvm::SmallVector<int64_t> expectedOutputShape;
  // Verify that the batch dimensions are broadcast compatible and construct the
  // expected output shape
  if (inputAShape.size() > 2 || inputBShape.size() > 2) {
    llvm::SmallVector<int64_t> inputABatchDims, inputBBatchDims;

    if (inputAShape.size() > 2) {
      inputABatchDims.insert(inputABatchDims.begin(), inputAShape.begin(),
                             inputAShape.end() - 2);
    }

    if (inputBShape.size() > 2) {
      inputBBatchDims.insert(inputBBatchDims.begin(), inputBShape.begin(),
                             inputBShape.end() - 2);
    }

    // Verify that the batch dimensions of input A and B are broadcast
    // compatible
    llvm::SmallVector<int64_t, 4> broadcastedShape;
    if (!OpTrait::util::getBroadcastedShape(inputABatchDims, inputBBatchDims,
                                            broadcastedShape)) {

      return emitOpError("Batch dimensions of input A(" +
                         ttmlir::utils::join(inputABatchDims, ",") +
                         ") and B(" +
                         ttmlir::utils::join(inputBBatchDims, ",") +
                         ") are not broadcast compatible");
    }

    // Insert the broadcasted batch dimensions in the expected output shape
    expectedOutputShape.insert(expectedOutputShape.begin(),
                               broadcastedShape.begin(),
                               broadcastedShape.end());
  }

  // Insert the input A and B inner dimensions in expected output shape
  // Consider the case where input A and B are vectors. In that case,
  // the dimension 1 is omitted from the output shape.
  if (inputAType.getRank() > 1) {
    expectedOutputShape.push_back(inputAShape[inputAShape.size() - 2]);
  }

  if (inputBType.getRank() > 1) {
    expectedOutputShape.push_back(inputBShape[inputBShape.size() - 1]);
  }

  // Check the case of a vector-vector product. At this moment we don't support
  // scalars in IR, hence check that the output is at least 1D tensor of size 1.
  if (expectedOutputShape.size() == 0) {
    if (outputType.getRank() < 1) {
      return emitOpError("Scalar output is not supported, output must be at "
                         "least a 1D tensor");
    }

    if (outputType.getRank() > 1 || outputType.getShape()[0] != 1) {
      return emitOpError("Scalar output must be a 1D tensor of size 1");
    }

    return llvm::success();
  }

  // Verify that the output shape is correct
  if (outputShape.size() != expectedOutputShape.size()) {
    return emitOpError("Output shape rank(" +
                       std::to_string(outputShape.size()) +
                       ") must match the expected output shape rank(" +
                       std::to_string(expectedOutputShape.size()) + ")");
  }

  // Verify each dim of the output shape
  for (size_t i = 0; i < outputShape.size(); i++) {
    if (outputShape[i] != expectedOutputShape[i]) {
      return emitOpError(
          "Output shape dimension[" + std::to_string(i) + "](" +
          std::to_string(outputShape[i]) +
          ") doesn't match the expected output shape dimension[" +
          std::to_string(i) + "](" + std::to_string(expectedOutputShape[i]) +
          ")");
    }
  }

  return success();
}

2. Define the Op in the TTNN backend dialect

Next we will define the Op in the TTNN dialect. TTNN Ops are defined in the same way, but in their respective set of dialect files. Refer to the previous section for details; the process is the same.

TTNNOps.td

def TTNN_MatmulOp : TTNN_NamedDPSOp<"matmul"> {
    let arguments = (ins AnyRankedTensor:$a,
                         AnyRankedTensor:$b,
                         AnyRankedTensor:$output);
    let results = (outs AnyRankedTensor:$result);

    let extraClassDeclaration = [{
      MutableOperandRange getDpsInitsMutable() { return getOutputMutable(); }
    }];

    let hasVerifier = 1;
}

TTNNOps.cpp

// MatmulOp verification
::mlir::LogicalResult mlir::tt::ttnn::MatmulOp::verify() {
  ::mlir::RankedTensorType inputAType = getA().getType();
  ::mlir::RankedTensorType inputBType = getB().getType();
  ::mlir::RankedTensorType outputType = getOutput().getType();

  llvm::ArrayRef<int64_t> outputShape = outputType.getShape();
  llvm::SmallVector<int64_t> inputAShape(inputAType.getShape());
  llvm::SmallVector<int64_t> inputBShape(inputBType.getShape());

  // Verify that the input A is at least 1D tensor
  if (inputAType.getRank() < 1) {
    return emitOpError("Input A must be at least a 1D tensor");
  }

  // Verify that the input B is at least 1D tensor
  if (inputBType.getRank() < 1) {
    return emitOpError("Input B must be at least a 1D tensor");
  }

  // If input A is a vector (1D tensor), 1 is prepended to its dimension for the
  // purpose of the matrix multiply. After the matrix multiply, the prepended
  // dimension is removed.
  if (inputAType.getRank() == 1) {
    inputAShape.insert(inputAShape.begin(), 1);
  }

  // If input B is a vector (1D tensor), a 1 is appended to its dimension for
  // the purpose of the matrix-vector product and removed after.
  if (inputBType.getRank() == 1) {
    inputBShape.push_back(1);
  }

  // Verify that the input A and input B have matching inner dimensions
  if (inputAShape[inputAShape.size() - 1] !=
      inputBShape[inputBShape.size() - 2]) {
    return emitOpError(
        "Input A[-1](" + std::to_string(inputAShape[inputAShape.size() - 1]) +
        ") and B[-2](" + std::to_string(inputBShape[inputBShape.size() - 2]) +
        ") must have matching inner dimensions");
  }

  llvm::SmallVector<int64_t> expectedOutputShape;
  // Verify that the batch dimensions are broadcast compatible and construct the
  // expected output shape
  if (inputAShape.size() > 2 || inputBShape.size() > 2) {
    llvm::SmallVector<int64_t> inputABatchDims, inputBBatchDims;

    if (inputAShape.size() > 2) {
      inputABatchDims.insert(inputABatchDims.begin(), inputAShape.begin(),
                             inputAShape.end() - 2);
    }

    if (inputBShape.size() > 2) {
      inputBBatchDims.insert(inputBBatchDims.begin(), inputBShape.begin(),
                             inputBShape.end() - 2);
    }

    // Verify that the batch dimensions of input A and B are broadcast
    // compatible
    llvm::SmallVector<int64_t, 4> broadcastedShape;
    if (!OpTrait::util::getBroadcastedShape(inputABatchDims, inputBBatchDims,
                                            broadcastedShape)) {

      return emitOpError("Batch dimensions of input A(" +
                         ttmlir::utils::join(inputABatchDims, ",") +
                         ") and B(" +
                         ttmlir::utils::join(inputBBatchDims, ",") +
                         ") are not broadcast compatible");
    }

    // Insert the broadcasted batch dimensions in the expected output shape
    expectedOutputShape.insert(expectedOutputShape.begin(),
                               broadcastedShape.begin(),
                               broadcastedShape.end());
  }

  // Insert the input A and B inner dimensions in expected output shape
  // Consider the case where input A and B are vectors. In that case,
  // the dimension 1 is omitted from the output shape.
  if (inputAType.getRank() > 1) {
    expectedOutputShape.push_back(inputAShape[inputAShape.size() - 2]);
  }

  if (inputBType.getRank() > 1) {
    expectedOutputShape.push_back(inputBShape[inputBShape.size() - 1]);
  }

  // Check the case of a vector-vector product. At this moment we don't support
  // scalars in IR, hence check that the output is at least 1D tensor of size 1.
  if (expectedOutputShape.size() == 0) {
    if (outputType.getRank() < 1) {
      return emitOpError("Scalar output is not supported, output must be at "
                         "least a 1D tensor");
    }

    if (outputType.getRank() > 1 || outputType.getShape()[0] != 1) {
      return emitOpError("Scalar output must be a 1D tensor of size 1");
    }

    return llvm::success();
  }

  // Verify that the output shape is correct
  if (outputShape.size() != expectedOutputShape.size()) {
    return emitOpError("Output shape rank(" +
                       std::to_string(outputShape.size()) +
                       ") must match the expected output shape rank(" +
                       std::to_string(expectedOutputShape.size()) + ")");
  }

  // Verify each dim of the output shape
  for (size_t i = 0; i < outputShape.size(); i++) {
    if (outputShape[i] != expectedOutputShape[i]) {
      return emitOpError(
          "Output shape dimension[" + std::to_string(i) + "](" +
          std::to_string(outputShape[i]) +
          ") doesn't match the expected output shape dimension[" +
          std::to_string(i) + "](" + std::to_string(expectedOutputShape[i]) +
          ")");
    }
  }

  return success();
}

3. Convert / Implement the Op in the TTNN passes

Next we will implement the conversion from the TTIR matmul Op to the TTNN matmul Op. This is a trivial conversion, as the Ops are identical in their semantics, so the changeset isn't very instructive, but it will at least point to the files involved. The conversion is implemented in the ConvertTTIRToTTNNPass pass in file lib/Conversion/TTIRToTTNN/TTIRToTTNNPass.cpp.

Zooming into class ConvertTTIRToTTNNPass, we can see that we implement the pass interface via the member function void runOnOperation() final. This function will be called for every operation matching the type specified in the pass tablegen file. A quick look at include/ttmlir/Conversion/Passes.td shows:

def ConvertTTIRToTTNN: Pass<"convert-ttir-to-ttnn", "::mlir::ModuleOp"> {

This means that runOnOperation will be called for every ModuleOp in the graph; usually there is only one ModuleOp, which serves as the root of the graph.

Inside runOnOperation is usually where we define a rewrite pattern set that can match much more complicated patterns (nested inside of the ModuleOp's regions) than just a single operation. In the runOnOperation method you will see a call to populateTTIRToTTNNPatterns(...), which actually generates the rewrite patterns. populateTTIRToTTNNPatterns(...) is defined in lib/Conversion/TTIRToTTNN/TTIRToTTNN.cpp.

  patterns
      .add<TensorEmptyConversionPattern,
           ToLayoutOpConversionPattern,
           ElementwiseOpConversionPattern<ttir::AbsOp, ttnn::AbsOp>,
           ElementwiseOpConversionPattern<ttir::AddOp, ttnn::AddOp>,
           ElementwiseOpConversionPattern<ttir::CbrtOp, ttnn::CbrtOp>,
           ElementwiseOpConversionPattern<ttir::FloorOp, ttnn::FloorOp>,
           ElementwiseOpConversionPattern<ttir::IsFiniteOp, ttnn::IsFiniteOp>,
           ElementwiseOpConversionPattern<ttir::LogicalAndOp, ttnn::LogicalAndOp>,
           ElementwiseOpConversionPattern<ttir::LogicalOrOp, ttnn::LogicalOrOp>,
           ElementwiseOpConversionPattern<ttir::LogicalNotOp, ttnn::LogicalNotOp>,
           ElementwiseOpConversionPattern<ttir::LogicalXorOp, ttnn::LogicalXorOp>,
           ElementwiseOpConversionPattern<ttir::MultiplyOp, ttnn::MultiplyOp>,
           ElementwiseOpConversionPattern<ttir::EqualOp, ttnn::EqualOp>,
           ElementwiseOpConversionPattern<ttir::NotEqualOp, ttnn::NotEqualOp>,
           ElementwiseOpConversionPattern<ttir::GreaterEqualOp, ttnn::GreaterEqualOp>,
           ElementwiseOpConversionPattern<ttir::GreaterThanOp, ttnn::GreaterThanOp>,
           ElementwiseOpConversionPattern<ttir::LessEqualOp, ttnn::LessEqualOp>,
           ElementwiseOpConversionPattern<ttir::LessThanOp, ttnn::LessThanOp>,
           ElementwiseOpConversionPattern<ttir::MaximumOp, ttnn::MaximumOp>,
           ElementwiseOpConversionPattern<ttir::MinimumOp, ttnn::MinimumOp>,
           ElementwiseOpConversionPattern<ttir::NegOp, ttnn::NegOp>,
           ElementwiseOpConversionPattern<ttir::ReluOp, ttnn::ReluOp>,
           ElementwiseOpConversionPattern<ttir::GeluOp, ttnn::GeluOp>,
           ElementwiseOpConversionPattern<ttir::SqrtOp, ttnn::SqrtOp>,
           ElementwiseOpConversionPattern<ttir::RsqrtOp, ttnn::RsqrtOp>,
           ElementwiseOpConversionPattern<ttir::SignOp, ttnn::SignOp>,
           ElementwiseOpConversionPattern<ttir::SigmoidOp, ttnn::SigmoidOp>,
           ElementwiseOpConversionPattern<ttir::Log1pOp, ttnn::Log1pOp>,
           ElementwiseOpConversionPattern<ttir::ReciprocalOp, ttnn::ReciprocalOp>,
           ElementwiseOpConversionPattern<ttir::ExpOp, ttnn::ExpOp>,
           ElementwiseOpConversionPattern<ttir::LogOp, ttnn::LogOp>,
           ElementwiseOpConversionPattern<ttir::DivOp, ttnn::DivOp>,
           ElementwiseOpConversionPattern<ttir::CeilOp, ttnn::CeilOp>,
           ElementwiseOpConversionPattern<ttir::SinOp, ttnn::SinOp>,
           ElementwiseOpConversionPattern<ttir::CosOp, ttnn::CosOp>,
           ElementwiseOpConversionPattern<ttir::Expm1Op, ttnn::Expm1Op>,
           ElementwiseOpConversionPattern<ttir::RemainderOp, ttnn::RemainderOp>,
           ElementwiseOpConversionPattern<ttir::WhereOp, ttnn::WhereOp>,
           ElementwiseUnaryWithFloatParameterOpConversionPattern<ttir::LeakyReluOp, ttnn::LeakyReluOp>,
           ReductionOpConversionPattern<ttir::SumOp, ttnn::SumOp>,
           ReductionOpConversionPattern<ttir::MeanOp, ttnn::MeanOp>,
           ReductionOpConversionPattern<ttir::MaxOp, ttnn::MaxOp>,
           BroadcastOpConversionPattern,
           EmbeddingOpConversionPattern,
           SoftmaxOpConversionPattern,
           TransposeOpConversionPattern,
           TypecastOpConversionPattern,
           ClampOpConversionPattern,
           ConcatOpConversionPattern,
           ReshapeOpConversionPattern,
           SliceOpConversionPattern,
           SqueezeOpConversionPattern,
           UnsqueezeOpConversionPattern,
           ConstantOpConversionPattern,
           MatmulOpConversionPattern,
           Conv2dOpConversionPattern,
           MaxPool2dOpConversionPattern,
           SubtractOpConversionPattern,
           AllGatherOpConversionPattern
           >(typeConverter, ctx);

More information on rewrite patterns and their capabilities can be found in the MLIR documentation here and here.

For matmul, we defined a new conversion pattern; its structure is generic to any binary op with arguments named a and b:

class MatmulOpConversionPattern : public OpConversionPattern<ttir::MatmulOp> {
public:
  using OpConversionPattern<ttir::MatmulOp>::OpConversionPattern;

  LogicalResult
  matchAndRewrite(ttir::MatmulOp op, OpAdaptor adaptor,
                  ConversionPatternRewriter &rewriter) const override {
    rewriter.replaceOpWithNewOp<ttnn::MatmulOp>(
        op, this->getTypeConverter()->convertType(op.getType()), adaptor.getA(),
        adaptor.getB(), adaptor.getOutput());
    return success();
  }
};

It is invoked as part of the rewrite set via the MatmulOpConversionPattern entry shown in the pattern list above.

Note:

We also need to add this op to the C++ emitter in lib/Conversion/TTNNToEmitC/TTNNToEmitC.cpp; see populateTTNNToEmitCPatterns(...).

4. Add a compiler unit test for the Op

So far we have defined the Op in the TTIR and TTNN dialects, implemented verifiers, and added conversion passes. Now we need to add a unit test to ensure that the pass is working correctly. The compiler unit tests are located under test/ttmlir/Dialect. In this case we'll add a test under the TTNN subdirectory since we are testing the ConvertTTIRToTTNNPass.

test/ttmlir/Dialect/TTNN/matmul/simple_matmul.mlir

// RUN: ttmlir-opt --ttir-to-ttnn-backend-pipeline %s | FileCheck %s
#any_device_tile = #tt.operand_constraint<dram|l1|tile|any_device_tile>
// CHECK: #[[TILED_LAYOUT:.*]] = #ttnn.ttnn_layout<(d0, d1) -> (d0, d1), <1x1>, memref<2x4x!tt.tile<32x32, bf16>, #dram>, interleaved>
module attributes {} {
  func.func @forward(%arg0: tensor<64x128xbf16>, %arg1: tensor<128x96xbf16>) -> tensor<64x96xbf16> {
    %0 = tensor.empty() : tensor<64x96xbf16>
    // CHECK: %[[C:.*]] = "ttnn.matmul"[[C:.*]]
    %1 = "ttir.matmul"(%arg0, %arg1, %0) <{operand_constraints = [#any_device_tile, #any_device_tile, #any_device_tile]}> : (tensor<64x128xbf16>, tensor<128x96xbf16>, tensor<64x96xbf16>) -> tensor<64x96xbf16>
    return %1 : tensor<64x96xbf16>
  }
}

Unit tests in MLIR are typically written using a tool called FileCheck, please refer to the llvm FileCheck documentation for a tutorial and more information about the RUN and CHECK directives.

A few things to point out specifically regarding tt-mlir dialects:

  • tt.system_desc: This is a 1-1 mapping to the SystemDesc flatbuffer schema that is used to describe the system configuration. This is a required attribute tagged on the top level module for all tt-mlir dialects.
  • Pass --ttir-layout is a prerequisite before running convert-ttir-to-ttnn. This pass is responsible for converting the input tensors to device memory space and tile layout before lowering to TTNN.
  • This test is asserting that ttir.matmul converts to ttnn.matmul.

To run the test, you can use the following command:

cmake --build build -- check-ttmlir

You can also manually run ttmlir-opt on the test file to see the resulting output:

./build/bin/ttmlir-opt --ttir-load-system-desc="path=<PATH_TO_SYSTEM_DESC>" --ttir-to-ttnn-backend-pipeline test/ttmlir/Dialect/TTNN/matmul/simple_matmul.mlir

5. Define flatbuffer schema for the Op

Next we will define the flatbuffer schema for the Op. The schema must capture all tensor inputs, outputs, and attributes of the Op, i.e. everything the runtime needs to execute the Op.

include/ttmlir/Target/TTNN/program.fbs

table MatmulOp {
  in0: tt.target.TensorRef;
  in1: tt.target.TensorRef;
  out: tt.target.TensorRef;
}

Flatbuffer tables with the suffix Ref, such as TensorRef, are used to represent live values during runtime, decoupled from the underlying Desc-suffixed tables which carry the type and attribute information for the object.

We also add this new op to the union OpType, which is the variant type for all ops.

More information about writing flatbuffer schemas can be found in the flatbuffers documentation.

6. Serialize the Op in the flatbuffer format

In the previous section we defined the flatbuffer schema for the matmul Op; now let's put our new schema definition to use. The schema is used as input to a program called flatc which generates C++ code (or any language for that matter) for serializing and deserializing the schema. This generated code can be found in build/include/ttmlir/Target/TTNN/program_generated.h.

Let's head over to lib/Target/TTNN/TTNNToFlatbuffer.cpp to define a createOp overloaded function that does the conversion from MLIR to flatbuffer:

::flatbuffers::Offset<::tt::target::ttnn::MatmulOp>
createOp(FlatbufferObjectCache &cache, MatmulOp op) {
  auto in0 =
      cache.at<::tt::target::TensorRef>(getOperandThroughDPSOps(op.getA()));
  auto in1 =
      cache.at<::tt::target::TensorRef>(getOperandThroughDPSOps(op.getB()));
  auto output = cache.at<::tt::target::TensorRef>(
      getOperandThroughDPSOps(op.getResult()));
  return ::tt::target::ttnn::CreateMatmulOp(*cache.fbb, in0, in1, output);
}

Lots of things are happening here, let's break it down:

  • FlatbufferObjectCache: This is a helper class that is used to cache objects in the flatbuffer that are created during the serialization process. This is necessary for managing value lifetimes and identifiers, at the same time it is an optimization to avoid having multiple copies of the same object. For example, a TensorRef with multiple uses could naively be recreated, one for each use, but with the cache we can ensure that the object is only created once and all uses point to the same flatbuffer offset. The cache is passed around to all serialization functions and should be used whenever creating a new object.
  • getOperandThroughDPSOps: In section 1. we discussed DPS semantics and the drawback of having the result alias the output tensor. This is one of those cases where we need to use a helper function to trace through the output operands to find the original SSA name in order to associate it with the original TensorRef.
  • CreateMatmulOp: The autogenerated function from the flatbuffer schema that actually serializes the data into the flatbuffer format.

We can finally generate a binary with our new Op! We can use the following command:

./build/bin/ttmlir-opt --ttir-load-system-desc="path=<PATH_TO_SYSTEM_DESC>" --ttir-to-ttnn-backend-pipeline test/ttmlir/Dialect/TTNN/matmul/simple_matmul.mlir | ./build/bin/ttmlir-translate --ttnn-to-flatbuffer -o out.ttnn

And we can inspect the binary with ttrt:

ttrt read out.ttnn

7. Add runtime support for the Op

Next, we want to add runtime support for the Op by parsing the flatbuffer and invoking the TTNN API.

runtime/lib/ttnn/operations/matmul/matmul.cpp

namespace tt::runtime::ttnn::operations::matmul {
void run(const ::tt::target::ttnn::MatmulOp *op, ProgramContext &context) {
  ProgramTensorPool &tensorPool = context.getTensorPool();
  const ::ttnn::Tensor &lhs = tensorPool.at(op->in0()->global_id());
  const ::ttnn::Tensor &rhs = tensorPool.at(op->in1()->global_id());
  DEBUG_ASSERT(lhs.is_allocated());
  DEBUG_ASSERT(rhs.is_allocated());
  ::ttnn::DataType outputDataType = utils::getDataType(op->out());
  ::tt::tt_metal::MemoryConfig outputMemoryConfig =
      utils::createMemoryConfig(op->out());

  std::optional<
      ::ttnn::operations::matmul::MatmulMultiCoreReuseMultiCast1DProgramConfig>
      programConfig = std::nullopt;

  const std::optional<const ::tt::tt_metal::MemoryConfig> memoryConfig =
      std::make_optional(outputMemoryConfig);

  const std::optional<const ::ttnn::DataType> dtype =
      std::make_optional(outputDataType);

  ::ttnn::Tensor out = ::ttnn::matmul(
      lhs, rhs, /*transposeA*/ false, /*transposeB*/ false, memoryConfig, dtype,
      /*programConfig*/ std::nullopt, /*activation*/ std::nullopt,
      /*computeKernelConfig*/ std::nullopt, /*coreGrid*/ std::nullopt);

  tensorPool.insert_or_assign(op->out()->global_id(), out);
}
} // namespace tt::runtime::ttnn::operations::matmul

A couple things to note from above:

  • Most runtime op functions will follow a similar pattern: they take in some additional data structures for managing the program context.
    • Program context tracks the state of the current program. It stores intermediate tensors and devices.
  • tensorPool.at(op->in0()->global_id()): global_id is a unique identifier for the tensor that was generated and managed by the FlatbufferObjectCache. This is how it's intended to be used by the runtime.
  • Some operations may belong to a larger set of operations. For example, any eltwise unary operations can be added in runtime/lib/ttnn/operations/eltwise/unary.cpp directly without needing to create a new file.

If a new file is created for the op, we need to add a new source to runtime/lib/ttnn/operations/CMakeLists.txt and a new case to runtime/lib/ttnn/program.cpp.

To update runtime/lib/ttnn/operations/CMakeLists.txt, include the path to the source file in TTNN_OPS_SRCS:

runtime/lib/ttnn/operations/CMakeLists.txt

  ${CMAKE_CURRENT_SOURCE_DIR}/matmul/matmul.cpp

To update runtime/lib/ttnn/program.cpp, add a new case to the runOperation method of ProgramExecutor:

runtime/lib/ttnn/program.cpp

  case ::tt::target::ttnn::OpType::MatmulOp: {
    return operations::matmul::run(op->type_as_MatmulOp(), context);
  }

We can test our changes with ttrt (don't forget to rebuild ttrt):

ttrt run out.ttnn

8. Add a silicon unit test for the Op

After adding runtime support, we're ready to test our Op on silicon. All silicon tests are located under test/ttmlir/Silicon. The process is similar to adding a compiler unit test.

In our specific case, we create a unit test here: test/ttmlir/Silicon/TTNN/simple_matmul.mlir:

test/ttmlir/Silicon/TTNN/simple_matmul.mlir

// RUN: ttmlir-opt --ttir-to-ttnn-backend-pipeline="system-desc-path=%system_desc_path%" %s > %t.mlir
// RUN: FileCheck %s --input-file=%t.mlir
// RUN: ttmlir-translate --ttnn-to-flatbuffer %t.mlir > %t.ttnn
#any_device_tile = #tt.operand_constraint<dram|l1|tile|any_device_tile>
// CHECK: #[[TILED_LAYOUT:.*]] = #ttnn.ttnn_layout<(d0, d1) -> (d0, d1), <1x1>, memref<2x4x!tt.tile<32x32, bf16>, #dram>, interleaved>
module attributes {} {
  func.func @forward(%arg0: tensor<64x128xbf16>, %arg1: tensor<128x96xbf16>) -> tensor<64x96xbf16> {
    %0 = tensor.empty() : tensor<64x96xbf16>
    // CHECK: %[[C:.*]] = "ttnn.matmul"[[C:.*]]
    %1 = "ttir.matmul"(%arg0, %arg1, %0) <{operand_constraints = [#any_device_tile, #any_device_tile, #any_device_tile]}> : (tensor<64x128xbf16>, tensor<128x96xbf16>, tensor<64x96xbf16>) -> tensor<64x96xbf16>
    return %1 : tensor<64x96xbf16>
  }
}

A couple of things to point out about this process:

  • Tests placed under test/ttmlir/Dialect will only test the compiler's capability of compiling the module. If you want the module to run on silicon in CI, the test must be placed under test/ttmlir/Silicon.
  • Notice the differences between the compilation headers of test/ttmlir/Silicon/TTNN/simple_matmul.mlir and test/ttmlir/Dialect/TTNN/matmul/simple_matmul.mlir
    • --ttir-to-ttnn-backend-pipeline="system-desc-path=%system_desc_path%": The system-desc-path option specifies the location of the system descriptor required for compiling the module. This is crucial for silicon tests, as modules compiled with different system descriptors may vary in silicon compatibility. Ensuring the system descriptor accurately reflects the target hardware is essential for running the module correctly.
    • // RUN: ttmlir-translate --ttnn-to-flatbuffer %t.mlir > %t.ttnn: This runs ttmlir-translate that serializes the output mlir module to a flatbuffer binary. We added the logic for this serialization in the Serialize the Op in the flatbuffer format section.

Decomposing an Op in TTIR

This guide explains how to add and decompose a new operation in the TTIR dialect. We’ll focus on adding an Index operation, which will be decomposed into the Slice operation. The decomposition is implemented as a conversion pass in MLIR, since that allows us to mark operations or dialects as legal or illegal, perform type conversion, and so on.

This guide will cover the following steps:

1. Define the Op in the TTIR frontend dialect

More information regarding this step can be found here: Define the Op in the TTIR frontend dialect

We updated the TTIROps.td as follows:

def TTIR_IndexOp: TTIR_DPSOp<"index"> {
    let summary = "Index op.";
    let description = [{
      Extract a sub-tensor (slice) from the input tensor along a specified dimension.
      The `begin`, `end`, and `step` attributes define the start, stop, and step indices for the
      selected dimension (`dim`) of the tensor.
    }];

    let arguments = (ins AnyRankedTensor:$input,
                         AnyRankedTensor:$output,
                         I32Attr:$dim,
                         I32Attr:$begin,
                         I32Attr:$end,
                         I32Attr:$step,
                         TT_OperandConstraintArrayAttr:$operand_constraints);

    let results = (outs AnyRankedTensor:$result);

    let extraClassDeclaration = [{
      MutableOperandRange getDpsInitsMutable() { return getOutputMutable(); }
    }];

    let hasVerifier = 1;
}

The verification function has been added as well:

// IndexOp verification
::mlir::LogicalResult mlir::tt::ttir::IndexOp::verify() {
  ::mlir::RankedTensorType inputType = getInput().getType();
  ::llvm::ArrayRef<int64_t> inputShape = inputType.getShape();
  ::mlir::RankedTensorType outputType = getOutput().getType();
  int32_t dim = getDim();
  int32_t begin = getBegin();
  int32_t end = getEnd();
  int32_t step = getStep();

  // Verify that the input is at least 1D tensor
  if (inputType.getRank() < 1) {
    return emitOpError("Input must be at least a 1D tensor");
  }

  // Validate that the output tensor has the same element type as the input
  // tensor
  if (inputType.getElementType() != outputType.getElementType()) {
    return emitOpError(
        "Output tensor must have the same element type as the input tensor");
  }

  // Verify the output tensor rank
  if (inputType.getRank() != outputType.getRank()) {
    return emitOpError(
        "Output tensor must have the same rank as the input tensor");
  }

  // Verify that the dim attribute is within the bounds of the input tensor
  if (dim < 0 || dim >= inputType.getRank()) {
    return emitOpError() << "Invalid dimension index " << dim
                         << ". Input tensor rank is " << inputType.getRank();
  }

  // Verify begin, end, step and the output tensor dimensions
  int64_t dimSize = inputShape[dim];

  // Adjust negative begin and end
  int32_t adjustedBegin = (begin < 0) ? (begin + dimSize) : begin;
  int32_t adjustedEnd = (end < 0) ? (end + dimSize) : end;

  std::ostringstream inputShapeStream;
  inputShapeStream << "(";
  for (size_t i = 0; i < inputShape.size(); ++i) {
    inputShapeStream << inputShape[i];
    if (i != inputShape.size() - 1) {
      inputShapeStream << ", ";
    }
  }
  inputShapeStream << ")";
  std::string inputShapeStr = inputShapeStream.str();

  if (adjustedBegin < 0 || adjustedBegin >= dimSize) {
    return emitOpError() << "Invalid begin index for dimension "
                         << std::to_string(dim) << ". Expected value in range ["
                         << std::to_string(-dimSize) << ", " << dimSize
                         << "), got " << begin
                         << ". Input shape: " << inputShapeStr;
  }
  if (adjustedEnd < 0 || adjustedEnd > dimSize) {
    return emitOpError() << "Invalid end index for dimension "
                         << std::to_string(dim) << ". Expected value in range ["
                         << std::to_string(-dimSize) << ", " << dimSize
                         << "], got " << end
                         << ". Input shape: " << inputShapeStr;
  }

  auto formatValueMessage = [](int value, int adjustedValue) {
    return value < 0 ? std::to_string(adjustedValue) + " (" +
                           std::to_string(value) + ")"
                     : std::to_string(value);
  };
  std::string beginValueMessage = formatValueMessage(begin, adjustedBegin);
  std::string endValueMessage = formatValueMessage(end, adjustedEnd);

  if (step == 0) {
    return emitOpError("Step value for dimension " + std::to_string(dim) +
                       " cannot be zero");
  }

  if (step > 0 && adjustedBegin > adjustedEnd) {
    return emitOpError() << "For positive step, begin index must be less "
                            "than or equal to end index for dimension "
                         << dim << ". Got begin: " << beginValueMessage
                         << ", end: " << endValueMessage << ", step: " << step
                         << ", input shape: " << inputShapeStr;
  }

  if (step < 0 && adjustedBegin < adjustedEnd) {
    return emitOpError() << "For negative step, begin index must be greater "
                            "than or equal to end index for dimension "
                         << dim << ". Got begin: " << beginValueMessage
                         << ", end: " << endValueMessage << ", step: " << step
                         << ", input shape: " << inputShapeStr;
  }

  // Calculate the expected size of the output dimension
  int32_t expectedDimSize =
      (std::abs(adjustedEnd - adjustedBegin) + std::abs(step) - 1) /
      std::abs(step);
  if (outputType.getDimSize(dim) != expectedDimSize) {
    return emitOpError() << "Mismatch in dimension " << std::to_string(dim)
                         << " of the output tensor: expected size "
                         << expectedDimSize << ", but got "
                         << outputType.getDimSize(dim);
  }

  return success();
}

2. Create a conversion pattern

A conversion pattern defines how MLIR should rewrite the Op. It can be implemented in either C++ or TableGen. Currently, we only have the C++ implementation; TableGen format will be added in the future.

C++ conversion pattern

For the Index operation, we use the C++ conversion pattern because it involves changing the Op’s input types from integers to arrays, which TableGen lacks flexibility for.

// This transformation adjusts IndexOp attributes so that `begin`, `end`, and
// `step` become arrays, where each array element corresponds to a dimension of
// the input tensor. For dimensions other than the sliced dimension, default
// values are used.
//
struct IndexToSliceConversionPattern
    : public OpConversionPattern<ttir::IndexOp> {
  using OpConversionPattern<ttir::IndexOp>::OpConversionPattern;

  LogicalResult
  matchAndRewrite(ttir::IndexOp op, OpAdaptor adaptor,
                  ConversionPatternRewriter &rewriter) const override {
    auto inputType =
        ::mlir::dyn_cast<mlir::RankedTensorType>(adaptor.getInput().getType());
    if (!inputType || !inputType.hasRank()) {
      return failure();
    }

    int64_t rank = inputType.getRank();
    llvm::SmallVector<mlir::Attribute, 4> begins, ends, steps;

    for (int64_t i = 0; i < rank; ++i) {
      if (i == op.getDim()) {
        begins.push_back(rewriter.getI32IntegerAttr(adaptor.getBegin()));
        ends.push_back(rewriter.getI32IntegerAttr(adaptor.getEnd()));
        steps.push_back(rewriter.getI32IntegerAttr(adaptor.getStep()));
      } else {
        begins.push_back(rewriter.getI32IntegerAttr(0));
        ends.push_back(rewriter.getI32IntegerAttr(inputType.getDimSize(i)));
        steps.push_back(rewriter.getI32IntegerAttr(1));
      }
    }

    auto newOp = rewriter.create<ttir::SliceOp>(
        op.getLoc(), op.getType(), adaptor.getInput(), adaptor.getOutput(),
        rewriter.getArrayAttr(begins), rewriter.getArrayAttr(ends),
        rewriter.getArrayAttr(steps), adaptor.getOperandConstraints());

    rewriter.replaceOp(op, newOp.getResult());
    return success();
  }
};

The matchAndRewrite method from OpConversionPattern is implemented to replace the matched Op with the newly created Op. Since decomposition is implemented as a conversion pass, OpAdaptor is used to access the attributes of the original Op in their converted types. Finally, we instantiate the new Op and call the replaceOp method on ConversionPatternRewriter to replace the original Op.

Tablegen conversion pattern

TODO

3. Register the created conversion pattern

To register the new pattern, go to the populateTTIRToTTIRDecompositionPatterns function in TTIRToTTIRDecomposition.cpp and add it to the RewritePatternSet using the add method. After that is done, you should mark the decomposed op as illegal in the runOnOperation method of TTIRToTTIRDecompositionPass in TTIRToTTIRDecompositionPass.cpp.

You should also add a silicon test as described here: Add a silicon unit test for the Op. This is how the silicon test for the Index operation looks:

// RUN: ttmlir-opt --ttir-to-ttnn-backend-pipeline="system-desc-path=%system_desc_path%" %s > %t.mlir
// RUN: FileCheck %s --input-file=%t.mlir
// RUN: ttmlir-translate --ttnn-to-flatbuffer %t.mlir > %t.ttnn
#any_device_tile = #tt.operand_constraint<dram|l1|interleaved>
module attributes {} {
  func.func @forward(%arg0: tensor<4x32x32xbf16>) -> tensor<4x32x16xbf16> {
    %0 = tensor.empty() : tensor<4x32x16xbf16>
    // CHECK: %[[C:.*]] = "ttnn.slice"[[C:.*]]
    %1 = "ttir.index"(%arg0, %0) <{dim = 2: i32, begin = 0: i32, end = 32: i32, step = 2: i32, operand_constraints = [#any_device_tile, #any_device_tile]}> : (tensor<4x32x32xbf16>, tensor<4x32x16xbf16>) -> tensor<4x32x16xbf16>
    return %1 : tensor<4x32x16xbf16>
  }
}

Doxygen

A Doxygen autogenerated code reference is available here: Doxygen

Build Instructions

To build the Doxygen documentation, use the doxygen target in CMake:

cmake -B build
cmake --build build -- doxygen

Specifications

Specifications are documents that define the requirements for features or concepts that are particularly cross-cutting, complex, or require a high degree of coordination and planning. They are intended to be a living document that evolves as the feature is developed and should be maintained as the goto reference documentation for the feature or concept.

Specifications are written in markdown and are stored in the docs/src/specs directory of the repository. Below is a template that should be used when creating a new specification.

Specification Template

# [Title]

A brief description of the feature or concept that this specification is
defining.

## Motivation

A description of why this feature or concept is needed and what problem it is
solving. This section is best written by providing concrete examples and use
cases.

## Proposed Changes

A list of the components that will be impacted by this spec and a detailed
description of the changes that will be made to each respective component.

It should also call out any interactions between components and how they might
share an interface or communicate with each other.

## Test Plan

A brief description of how the feature or concept will be tested.

## Concerns

A list of concerns that have been identified during the design of this feature.

Runtime Stitching

Runtime stitching adds the ability for the runtime to stitch together multiple, independently compiled programs at runtime, i.e. without compiler knowledge of how the binary programs will be composed.

Motivation

In order to flexibly support arbitrary training schedules and composing multiple models together, we want the runtime to be able to stitch graphs together. To achieve this we need to define an ABI-like interface between the compiler and the runtime.

Simple Example

mod_a = forge.compile(PyTorch_module_a)
mod_b = forge.compile(PyTorch_module_b)

for i in range(10):
    outs_a = mod_a(ins_a)
    outs_b = mod_b(outs_a)

mod_a and mod_b are two independent compile steps; during the compile step for mod_a, the compiler should be completely unaware that mod_b will take place, and vice-versa. In order to achieve this we propose a new runtime concept called stitching:

  • forge invokes the compile step for mod_a; the tt-mlir compiler determines where the inputs (ins_a) should live: host, device DRAM, or device L1. tt-mlir returns metadata to forge describing where it wants the tensors to reside before invoking flatbuffer submission.
  • forge invokes the compile step for mod_b; the same happens as in the first bullet.
  • mod_a is invoked at runtime; the forge runtime needs to inspect the compiler metadata to determine where the tensors should live. The runtime manually invokes a new data copy command to get the tensors to the correct memory space / correct memory address.
  • forge runtime invokes the mod_a program submit.
  • mod_b is invoked at runtime; this time it might be that the compiler left the tensor outputs in L1, so no data copy is needed to start running mod_b since the inputs are already in the correct location.

A more concrete usecase would be a training loop where there are often multiple graphs composed together. #82 Or when we eventually support torch 2.0, the torch runtime can arbitrarily break the graph anywhere.

Proposed Changes

Compiler Metadata

The compiler will encode the input tensor layout information directly into the flatbuffer tensor desc. The flatbuffer schema already exists to express this; we just need to start populating it instead of assuming a canonical host layout.

The compiler will decide where the tensors should live: host, device DRAM, or device L1.

Runtime

  • Runtime will inspect the tensor desc metadata to determine where the tensors need to end up / what layout they should be in before invoking the program.
  • New runtime API Tensor toLayout(Tensor tensor, ::tt::target::TensorDesc* tensorDesc);
  • Runtime will need to invoke toLayout on all input tensors before invoking the program.
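To make the flow concrete, here is a minimal sketch of what the runtime side could look like. Only toLayout is the proposed API from the bullets above; Binary, getInputDesc, and submit are illustrative placeholder names, not an actual runtime interface.

// Sketch: bring every input into the layout/memory space the compiler chose,
// then submit the program. Binary/getInputDesc/submit are placeholder names.
void runProgram(Binary &program, std::vector<Tensor> &inputs) {
  for (size_t i = 0; i < inputs.size(); ++i) {
    ::tt::target::TensorDesc *desc = program.getInputDesc(i); // assumed accessor
    inputs[i] = toLayout(inputs[i], desc); // proposed runtime API from above
  }
  submit(program, inputs); // placeholder for flatbuffer program submission
}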

Test Plan

  • Add a new test to the runtime gtest suite that verifies the runtime can correctly stitch together 2 independently compiled programs.

Concerns

  • Tensors that pass through device memory spaces (DRAM, L1) will have dynamic addresses; some arbitrary run order of flatbuffers could cause tensors to end up in non-ideal locations in memory. Specifically in L1, a poorly placed tensor might not be able to be moved to a better location without a bounce through DRAM.

Tensor Layout

The tensor layout attribute captures how tensor data is sharded across a grid of devices, cores, and is laid out in memory.

Motivation / High level goals

  • Logical shapes: Keep the original tensor shape and rank intact and agnostic to underlying storage layout. Keeping the logical shapes not only makes some graph transformations vastly simpler, in particular convs, but it makes the lowered IR much easier to read and reason about. The original tensor shapes leave breadcrumbs that make it much easier to map back to the input representation.
  • Flexible sharding: Enable flexibility in choosing grid shape, to get better parallelization and avoid resharding. This is particularly important in cases where tensor shapes are not clean powers of two and would otherwise force our hand in choosing non-optimal grid shapes.
  • Logical-Physical Isomorphism: Encode this information with just a few attributes to enable derived conversions from logical to physical layout and back.
  • Explicit: A single source of truth.
  • Enable a direct way to query padded regions.

An Example / Walkthrough

Let's consider a snippet of MLIR:

tensor<2x3x64x128xf32>

Here we've defined a 4 dimensional tensor using MLIR's builtin tensor type. This tensor type has an optional attribute called an Encoding, this attribute has been used by the TT dialect to encode the tensor's layout. This looks like:

tensor<2x3x64x128xf32,
  #tt.layout<
    (d0, d1, d2, d3) -> (d0 * 192 + d1 * 64 + d2, d3),
    undef,
    <1x1>,
    memref<384x128xf32, #tt.memory_space<l1>>
  >
>

At the time of this writing there are 4 properties that make up a tensor layout:

  • linear: An affine map that defines how the logical tensor dimensions map to a grid shape. Note that the number of dims in the affine map must match exactly the rank of the original tensor, and the number of results must match exactly the rank of the grid shape.
  • oob_val: A tracked out of bounds value that fills padding space.
  • grid: The grid shape that this tensor is divided onto.
  • memref: A memref that describes the physical footprint allocation of the shard. It must also have a shape with rank equal to grid.

This example isn't particularly complicated because it's only sharded to a 1x1 grid. The rest of the document will go into more detail on the following topics: dimension collapsing, multi-core sharding, tilization, padding, memory spaces, and multi-device layouts.

Before we jump into more advanced topics there are two resources that could be useful to have at hand:

  • test/python/tensor_layout.py: Python test with many convenience functions for creating and experimenting with tensor layouts.
  • TTNN Interactive Visualizer: An interactive visualization tool that demonstrates the transformation. Note that this tool was created for the TTNN tensor layout, but many of the same concepts transfer over.

Dimension Collapsing

Probably the most important concept in tt.layout is dimension collapsing. This is captured by the affine map linear property which provides a mapping from tensor dim space to a reduced physical dimensional space. This single-handedly touches on most of the tensor layout goals mentioned at the beginning of the doc:

  • Leaves tensor shapes intact
  • Logical-Physical mapping, how the tensor is laid out in memory over a grid
  • Enables more flexible sharding
  • Explicit padding

To see how these goals are achieved we'll continue working on an explicit example, same one as above:

(d0, d1, d2, d3) -> (d0 * 192 + d1 * 64 + d2, d3)

To recap, we have our example 4d tensor (2, 3, 64, 128), which maps directly to the LHS (d0, d1, d2, d3). We have our 2d grid shape (1, 1); notice the affine-map RHS is also 2d, and this describes how tensor dims map to a lower dimensional physical memory, overlaid on a grid. We'll see how this gets divided onto the grid later, but first let's look at how this forms an affine-map iteration space. If we index our tensor at, say, [1, 1, 6, 100], we can simply plug those numbers in to get our remapped offset:

(1 * 192 + 1 * 64 + 6, 100) = (262, 100)

This remapped offset (262, 100) corresponds to the row and column index of the collapsed physical memory.
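The remapping is plain arithmetic. A minimal sketch, hard-coding the example map above (an illustration, not the actual tt.layout implementation):

#include <array>
#include <cstdint>
#include <iostream>

// The example linear map: (d0, d1, d2, d3) -> (d0 * 192 + d1 * 64 + d2, d3).
static std::array<int64_t, 2> linear(const std::array<int64_t, 4> &i) {
  return {i[0] * 192 + i[1] * 64 + i[2], i[3]};
}

int main() {
  auto offset = linear({1, 1, 6, 100});
  std::cout << "(" << offset[0] << ", " << offset[1] << ")\n"; // (262, 100)
}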

By default, the dim range [0, -1) is collapsed, but the tt.layout constructor can actually take a programmable range called collapseIntervals. collapseIntervals is a list of pairs, where each pair is a dim range interval, left inclusive, right exclusive. Let's consider a few examples (a code sketch of the collapsing rule follows the examples):

Instead of multiplying out real shapes, we will use <> to represent a dimension join operator.

  • 3D tensor onto a 2D grid and default collapseIntervals=[(0, -1)]:
(d0, d1, d2) -> (d0 <> d1, d2)
  • 4D tensor onto a 3D grid and collapseIntervals=[(1, -1)]:
(d0, d1, d2, d3) -> (d0, d1 <> d2, d3)
  • 4D tensor onto a 3D grid and collapseIntervals=[(0, 2)]:
(d0, d1, d2, d3) -> (d0 <> d1, d2, d3)
  • 7D tensor onto a 4D grid and collapseIntervals=[(0, 3), (-3, -1)]:
(d0, d1, d2, d3, d4, d5, d6) -> (d0 <> d1 <> d2, d3, d4 <> d5, d6)
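Here is a small sketch of the collapsing rule over shapes, assuming the interval semantics described above (left inclusive, right exclusive, negative bounds counting from the end); it is an illustration, not the actual tt.layout constructor:

#include <cstdint>
#include <utility>
#include <vector>

// Collapse dims of `shape` per `intervals` of [start, end) pairs.
static std::vector<int64_t>
collapseShape(const std::vector<int64_t> &shape,
              const std::vector<std::pair<int64_t, int64_t>> &intervals) {
  int64_t rank = static_cast<int64_t>(shape.size());
  auto wrap = [rank](int64_t d) { return d < 0 ? d + rank : d; };
  std::vector<int64_t> out;
  int64_t i = 0;
  for (auto [lo, hi] : intervals) {
    lo = wrap(lo);
    hi = wrap(hi);
    while (i < lo) // dims before the interval pass through
      out.push_back(shape[i++]);
    int64_t joined = 1;
    for (; i < hi; ++i) // dims inside the interval are joined
      joined *= shape[i];
    out.push_back(joined);
  }
  while (i < rank) // trailing dims pass through
    out.push_back(shape[i++]);
  return out;
}
// collapseShape({2, 3, 64, 128}, {{0, -1}}) returns {384, 128}.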

Multi-core

Let's consider the original example again, but on a larger grid than 1x1, say 2x4:

tensor<2x3x64x128xf32,
  #tt.layout<
    (d0, d1, d2, d3) -> (d0 * 192 + d1 * 64 + d2, d3),
    undef,
    <2x4>,
    memref<192x32xf32, #tt.memory_space<l1>>
  >
>

The number of affine map results, the grid shape, and the memref shape all must have the same rank. We can see in this example that by changing the grid shape we also changed the memref shape. We can always calculate the memref shape by plugging the full tensor dims into our affine map and then dividing by the grid shape:

(d0, d1, d2, d3) -> (d0 * 192 + d1 * 64 + d2, d3),
(2 - 1, 3 - 1, 64 - 1, 128 - 1) = (1 * 192 + 2 * 64 + 63, 127) = (383, 127)

Above we actually subtracted 1 in order to get the index of the last element of the tensor. Now we can simply add back 1 to get the size:

(383 + 1, 127 + 1) = (384, 128)

Finally, we divide the dims by the respective grid dims:

(384 / 2, 128 / 4) = (192, 32)
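The same calculation in code form, hard-coding the example's map, tensor shape, and grid (a sketch for illustration only):

#include <array>
#include <cstdint>
#include <iostream>

int main() {
  // Plug the last valid index of tensor<2x3x64x128> into the linear map,
  // add 1 back to recover sizes, then divide by the 2x4 grid.
  int64_t rows = ((2 - 1) * 192 + (3 - 1) * 64 + (64 - 1)) + 1; // 384
  int64_t cols = (128 - 1) + 1;                                 // 128
  std::array<int64_t, 2> grid = {2, 4};
  std::cout << rows / grid[0] << "x" << cols / grid[1] << "\n"; // 192x32
}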

Here are a few more example MLIR snippets:

tensor<8x300xf32,
  #tt.layout<(d0, d1) -> (d0, d1),
    undef,
    <1x2>,
    memref<8x150xf32, #tt.memory_space<l1>>
  >
>

tensor<8x96x32xf32,
  #tt.layout<(d0, d1, d2) -> (d0 * 96 + d1, d2),
    undef,
    <2x1>,
    memref<384x32xf32, #tt.memory_space<l1>>
  >
>

tensor<8x96x32xf32,
  #tt.layout<(d0, d1, d2) -> (d0 * 96 + d1, d1, d2),
    undef,
    <2x1x2>,
    memref<384x96x16xf32, #tt.memory_space<l1>>
  >
>

tensor<5x3x2x2x7x32x32xf32,
  #tt.layout<
    (d0, d1, d2, d3, d4, d5, d6)
      -> (d0 * 2688 + d1 * 896 + d2 * 448 + d3 * 224 + d4 * 32 + d5, d4, d5, d6),
    undef,
    <3x2x2x2>,
    memref<4480x4x16x16xf32, #tt.memory_space<l1>>
  >
>

A couple of final notes regarding grid shape:

  • Grid shapes of rank > 2 are perfectly legal. Not only is this useful for describing multi-device grid topologies, but it is often convenient to have higher ranked grids to better describe how a high rank tensor should be divided. The grid shape here is a virtual grid shape; the tt.device attribute will hold an additional affine map that defines how this virtual grid shape maps to a physical one.
  • Grid shapes where either columns or rows are > the physical device grid are also legal. Since this is only a virtual grid shape, we could have some grid 1x64 that maps to a physical 8x8 device grid (this particular example is called width sharding in TTNN).

Tilized

A tilized tensor is one with a memref that has a tile element type.

Given some tensor with scalar layout:

tensor<3x64x128xf32,
  #tt.layout<
    (d0, d1, d2) -> (d0 * 64 + d1, d2),
    undef,
    <3x2>,
    memref<64x64xf32, #tt.memory_space<l1>>
  >
>

After tilizing we'll have:

tensor<3x64x128xf32,
  #tt.layout<
    (d0, d1, d2) -> (d0 * 64 + d1, d2),
    undef,
    <3x2>,
    memref<2x2x!tt.tile<32 x 32, bfp_bf8>, #tt.memory_space<l1>>
  >
>

Notice the memref dims were ceilDiv'd by the tile shape and the element type becomes a tt.tile type. Also notice that the tensor shape and element type remain intact.
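The shard-to-tile conversion is just a ceiling division per dim; a tiny sketch using the shapes above:

#include <cstdint>
#include <iostream>

static int64_t ceilDiv(int64_t a, int64_t b) { return (a + b - 1) / b; }

int main() {
  // A 64x64 scalar shard with a 32x32 tile shape becomes 2x2 tiles per core.
  std::cout << ceilDiv(64, 32) << "x" << ceilDiv(64, 32) << "\n"; // 2x2
}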

Padding

Padding can be a bit of an overloaded term, but in this context it refers to an out of bounds area in the physical memory allocation that has no real tensor data in it. The contents of this area are tracked by oob_val, and the padding area can be automatically derived from the attributes of tt.layout.

Padding is a necessary evil that arises when a tensor is not evenly divisible by a grid shape or tile shape. It can also arise due to minimum NoC addressing requirements.

Example of non-divisible grid:

tensor<53x63xf32,
  #tt.layout<
    (d0, d1) -> (d0, d1),
    undef,
    <3x2>,
    memref<18x32xf32, #tt.memory_space<l1>>
  >
>

The grid dims always ceilDiv the affine map results, real tensor data will entirely fill initial shards and the last shard in each dimension will be partially filled.

In this particular example, we have 1 scalar row of padding on the last row of cores and 1 scalar column of padding on the last column of cores.

Taking the above example a step further, we could tilize it:

tensor<53x63xf32,
  #tt.layout<
    (d0, d1) -> (d0, d1),
    undef,
    <3x2>,
    memref<1x1x!tt.tile<32 x 32, bfp_bf8>, #tt.memory_space<l1>>
  >
>

Tile dims also always ceilDiv the resulting memref shape. Notice now that the padding is slightly more complicated. Our scalar shard shape was 18x32, but this was further padded to 32x32 meaning that every core now has 14 rows of padding except for the last row of cores which has 15 rows of padding.
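Those padding numbers can be derived mechanically; a sketch of the arithmetic (not compiler code):

#include <cstdint>
#include <iostream>

static int64_t ceilDiv(int64_t a, int64_t b) { return (a + b - 1) / b; }

int main() {
  int64_t rows = 53, gridRows = 3, tileRows = 32;
  int64_t shardRows = ceilDiv(rows, gridRows);                  // 18
  int64_t paddedRows = ceilDiv(shardRows, tileRows) * tileRows; // 32
  int64_t lastShardRows = rows - (gridRows - 1) * shardRows;    // 17
  std::cout << paddedRows - shardRows << "\n";     // 14 rows of padding
  std::cout << paddedRows - lastShardRows << "\n"; // 15 on the last row
}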

Also note that there is an order of operations here, grid divides the scalar shape first and then we tilize. This is important because it can enable use cases that frequently arise in conv networks that would otherwise result in reshards in between every layer.

With affine map we can be even more flexible in how we pad, we can bump our stride between dimensions. Consider tensor (w/ batch dim 2):

tensor<2x8x32xf32,
  #tt.layout<
    (d0, d1, d2) -> (d0 * 8 + d1, d2),
    undef,
    <1x2>,
    memref<16x16xf32, #tt.memory_space<l1>>
  >
>

If we tilized the above tensor we'd end up with a memref shape of 1x1x!tt.tile<32x32>, that is, all batches are tightly packed within a single tile. Let's say that for some reason, we do not want the batches (2) to be tightly packed within a tile, perhaps the mathematical operation we're doing requires the batch to be independently evaluated and thus the (S)FPU needs them in separate tiles. We can adjust this by adjusting the stride of the affine map:

(d0, d1, d2) -> (d0 * 32 + d1, d2),

Instead of striding by the number of logical rows, 8, we bump the stride up to 32 effectively pushing a gap between the collapsed rows and enabling each batch to fall on a tile boundary.

Memory Spaces

At the time of writing this document there are 4 memory spaces:

  1. System: Host memory space that is not device visible.
  2. SystemMMIO: Host memory space that is device visible.
  3. DeviceDRAM: DRAM local to the device.
  4. DeviceL1: SRAM on each core.

Something worth noting here is that a tensor must belong exclusively to only one of these memory spaces at a time. For example, in order to stream tensor data from DeviceDRAM to DeviceL1 you would need to either manually slice the tensor into smaller tensors that do fit in L1 or have native support in the op's kernel for double buffering a block (most TTNN ops already support this).

Multi-device

Multi-device can be naturally represented via a combination of two concepts already touched on above, higher ranked grids and collapseIntervals. Let's consider the following example with a 3d grid and collapseIntervals=[(1, -1)].

tensor<2x3x64x128xf32,
  #tt.layout<(d0, d1, d2, d3) -> (d0, d1 * 64 + d2, d3),
    undef,
    <2x2x4>,
    memref<1x3x1x!tt.tile<32 x 32, bfp_bf8>, #tt.memory_space<l1>>
  >
>

Here we've left the batch dim intact and started collapsing at d1. This enables us to define a 3d grid where the outermost grid dim divides the batch directly. This could map to a 2 device system where the batch dim is evenly divided between 2 devices. Within each device this op runs on a 2x4 grid.

The high level takeaway here is that how a tensor is logically divided up is decoupled from its mapping to physical compute resources. This has a nice property that data parallel extends to any tensor dimension and is captured under the same grid primitive that also divides tensor rows and columns.

Test Plan

  • test/python/tensor_layout.py: Assertions for LayoutAttr to make sure it's spec compliant.
  • Sweep tests:
    • Grid dim sweeps
    • Tilize / untilize sweeps
    • Padding sweeps
  • Multi-device tests

Concerns

  • tt.layout is deliberately flexible and tries to capture as many of the problematic use-cases we've run into in the past as possible, in a single, succinct representation. This flexibility will need to be further constrained by backends to avoid unsupported programming of this attribute.
  • The optimization solution space is potentially large with all of this flexibility. Two things that I hope can help protect us here:
    • By and large, the heuristic we'll be following is to just max out the grid at all costs. This should really narrow down the solution space to only a handful of options, and we only keep exploring if producers/consumers end up with nasty reblocking.
    • We can constrain the optimizer heuristics as aggressively as possible in the beginning and just advertise the full flexible options to the UI model explorer. Hopefully this enables us to experiment with crazier grid layouts and prove it's worthwhile before writing an algorithm.

TTNN Tensor Layout

The above section of this document covers how the compiler models tensor layout. There are some slight differences in TTNN, but the high level idea of collapsing dims is still used.

Terms

  • shape: Always logical shape, n-dimensional
  • stride: Same as PyTorch stride, but this is crucial for describing how n-dimensional data gets packed into a 2D physical layout. This 2D physical layout is always the inner dim (-1) wide, and dims [0, N-1] are collapsed into rows derived from stride
  • shard_shape: Also a logical shape, describes a 2D region that chunks physical_shape. Note this does not need to be a tile multiple
  • physical_shard_shape: The shard_shape padded out to tile_shape
  • tile_shape: A programmable tile shape, though constraints must check that it's compatible with an op's usage, i.e. FPU/NoC compatible
  • grid_shape: [divup(stride[0] // stride[-2], shard_shape[0]), divup(stride[-2], shard_shape[1])]

Mapping from the compiler

The compiler uses an affine map to explicitly track which dimensions are folded together, but TTNN does not have affine maps so the representation is a bit more implicit. TTNN captures the dimension collapsing in the stride attribute where dimensions [0, N-1] are always collapsed. This is less flexible so the compiler will have to enforce only collapsing supported dimensions when targeting TTNN, or handle lowering in a different way. For example, in the compiler we might want to represent data parallel over the tensor batch dim by leaving d0 and collapsing d1 - d[-1]. TTNN doesn't support this in its tensor layout representation, but this could be lowered to a TTNN mesh tensor where the mesh could be sliced on the batch and each per-device tensor has d0 fully collapsed.

TTNN Example

(Figure: TTNN tensor layout example.)

TTNN Interactive Visualizer

Device

Device in tt-mlir is somewhat of an overloaded term and can refer to different things depending on the context. This document will only speak to the compiler's abstract representation of a device captured by attribute #tt.device.

Terms

There are many overloaded terms when talking about devices and grids, this document will use the following definitions:

  • Physical Grid: A 2D array of tensix cores on a chip.
  • Chip: A single physical chip with a Physical Grid of cores.
  • Card: A PCIE or Ethernet card that may contain multiple Chips.
  • System: A collection of Cards that are usually connected together on the same host via PCIE or networked via ethernet. A system is represented by SystemDesc in the compiler.
  • Device: Device is always presented as a single entity to the enclosing scope, but it may be virtualized to abstract a multi-card System and part of its encoding carries a Logical Grid. Another way to think of device is a view over the system.
  • Logical Grid or just Grid: Is a logical shape that abstracts one or more Physical Grids.
  • Mesh Shape: Describes the virtual layout of the chips with respect to each other. In practice the mesh shape is used to derive the logical grid.

Motivation

The device attribute strives to achieve the following goals:

  • Provide a convenient representation of a physical grid that decouples the logical division of tensors from the physical layout of the hardware. This not only simplifies reasoning about how tensors get divided into shards, but can also enable reinterpretations of the device grid for data layout optimization decoupled from the existing encoding of the tensor layouts.
  • Following the first point, the device attribute should be able to represent many different forms of logical grids, from simple 2D grids, to more complex topologies like extra-wide grids or higher dimensional grids.
  • The device attribute encodes both single chip and multi-chip systems under a single, virtualized representation.
  • Enable many forms of data parallel execution strategies for single and multi chip systems under a single representation.

Scope

This document will cover how the device attribute is encoded and how it can be lowered to backend dialects. The document will not cover the algorithm for choosing the best, or even legal, device configurations for a given physical system.

Examples

All of the following examples will assume the physical hardware has an 8x8 physical grid of cores. We will use notation [N, 8x8] to represent a N chip system, each with an 8x8 physical grid.

#tt.device in its simplest, single chip form [1, 8x8] just maps directly 1-1 to the underlying physical hardware device.

#tt.device<
  workerGrid = #tt.grid<8x8, (d0, d1) -> (0, d0, d1)>,
  meshShape = 1,
  chipIds = [0]
>

Let's break down what each of these attributes mean:

  • workerGrid = #tt.grid<8x8, (d0, d1) -> (0, d0, d1)>: This is a 2D logical grid with dim 8x8. It's followed by an affine map (d0, d1) -> (0, d0, d1) that provides a mapping from the logical grid to the physical grid. In this case, the logical grid is the same as the physical grid, so the mapping is the identity function. The logical grid can have any rank, but the physical mapping is always 3D, with the first being the chip index, followed by the 2D physical core index within the chip.
  • meshShape = 1: A shape provided as part of the DeviceAttr constructor that describes the virtual layout of the chips with respect to each other. Note that in a multi-chip system, this grid encapsulates the entire system's grid shape, e.g. 8x16 grid could be made up of a 1x2 mesh of chips side-by-side. The mesh attribute configures how the above grid/map attributes are created such that they implement this mesh topology.
  • chipIds = [0]: This is a list of chip indices. These chip indices directly reference the same chip indices in the system descriptor. The SystemDesc attribute that this is in reference to is tagged on the top level ModuleOp.

Specific examples that this document will cover:

  • Data Parallel Over Batch
  • Data Parallel Over 2d
  • Data Parallel Over 2d and Batch
  • Pipeline Parallel
  • Reinterpreted Grids

Before we move on to more complex examples, it's worth having on hand:

  • The python test test/python/device_attr.py which shows how all of these examples can actually be programmed for the device attribute.
  • The Tensor Layout spec as the following examples will demonstrate how tensor layout interacts with the logical device grid.

Note on Data Parallel: There is existing literature that explicitly distinguishes between data parallel and tensor parallel, oftentimes describing data parallel as duplicating the model across multiple devices and trivially dividing up the batch, whereas tensor parallel refers to tensor data being distributed and potentially communicated between devices during execution. While this is true for multi-GPU/CPU systems, it is somewhat of an implementation detail, and given the flexibility of Tenstorrent hardware there is an opportunity to generalize this concept. In this document we will use the term data parallel to refer to any form of parallelism that divides any dimension of the tensor across multiple cores/chips.

Note on Constraints: Many of the examples below require careful virtualization of the underlying physical system, i.e. some device configurations might only work if the chips are connected via ethernet and with a particular topology, but these constraints are outside the scope of the examples and will be discussed further in the Backend Lowering and Constraints section.

Data Parallel Over Batch

Given a 2 chip system, [2, 8x8], we can represent a simple data parallel logical grid that divides the batch dimension in half across the two chips. This is denoted by meshShape = 2x1x1 which means the logical grid is 3D.

#tt.device<
  workerGrid = #tt.grid<2x8x8, (d0, d1, d2) -> (d0, d1, d2)>,
  meshShape = 2x1x1,
  chipIds = [0, 1]
>

The affine map here is just identity, so dims d1 and d2 directly index the physical grid and d0 indexes the chip.

Now we can consider some tensor that, importantly, has a grid of the same rank as the logical device grid:

tensor<16x3x64x128xf32,
  #tt.layout<(d0, d1, d2, d3) -> (d0, d1 * 64 + d2, d3),
    undef,
    <2x2x4>,
    memref<8x3x1x!tt.tile<32 x 32, bfp_bf8>, #tt.memory_space<l1>>
  >
>

If we map this tensor onto the above device, it will span across both chips, half of the batch dimension on each chip. Within each chip the tensor occupies a 2x4 grid out of the 8x8 physical grid available.

Data Parallel Over 2d

In this example we will consider a 2 chip system, [2, 8x8], and view it as though the two chips are concatenated together side by side to form a single 8x16 grid. This is denoted by meshShape = 1x2 which means to concatenate the chips in the second dimension.

#tt.device<
  workerGrid = #tt.grid<8x16, (d0, d1) -> ((d0 floordiv 8) * 2 + d1 floordiv 8, d0, d1 mod 8)>,
  meshShape = 1x2,
  chipIds = [0, 1]
>

Here we can see that the affine map encodes an indexing pattern such that when we extend past 8 cores in the second dimension, we wrap around to the next chip.
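Evaluating the map at a couple of logical coordinates shows the wrap-around; a minimal sketch of the map above:

#include <array>
#include <cstdint>
#include <iostream>

// (d0, d1) -> ((d0 floordiv 8) * 2 + d1 floordiv 8, d0, d1 mod 8)
static std::array<int64_t, 3> toPhysical(int64_t d0, int64_t d1) {
  return {(d0 / 8) * 2 + d1 / 8, d0, d1 % 8};
}

int main() {
  auto a = toPhysical(3, 5);  // {0, 3, 5}: chip 0, core (3, 5)
  auto b = toPhysical(3, 13); // {1, 3, 5}: wraps onto chip 1, core (3, 5)
  std::cout << a[0] << " " << b[0] << "\n"; // 0 1
}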

Now we can consider some tensor that, importantly, has a grid of the same rank as the logical device grid:

tensor<256x1024xf32,
  #tt.layout<(d0, d1) -> (d0, d1),
    undef,
    <4x16>,
    memref<2x2x!tt.tile<32 x 32, bfp_bf8>, #tt.memory_space<l1>>
  >
>

This single tensor maps trivially onto the logical grid, spanning the upper half. Decoupled from the tensor's layout, under the hood the tensor is actually physically spanning across two chips.

Data Parallel Over 2d and Batch

The previous 2 examples can be composed together to form a logical grid that divides tensor across multiple dimensions. Here we will consider a 4 chip system [4, 8x8] and view it as a 2x8x16 grid. Note that the meshShape is 2x1x2 which means to concatenate the chips in the first and third dimensions.

#tt.device<
  workerGrid = #tt.grid<2x8x16, (d0, d1, d2) -> (d0 * 2 + (d1 floordiv 8) * 2 + d2 floordiv 8, d1, d2 mod 8)>,
  meshShape = 2x1x2,
  chipIds = [0, 1, 2, 3]
>

We can evaluate the affine map to see that the chips are interpreted in chunks of two, where groups [0, 1] and [2, 3] each form 8x16 grids and these 2 groups concatenate to form a 2x8x16 grid.

We can consider the following tensor to map onto this grid:

tensor<64x256x1024xf32,
  #tt.layout<(d0, d1, d2) -> (d0, d1, d2),
    undef,
    <2x4x16>,
    memref<32x2x2x!tt.tile<32 x 32, bfp_bf8>, #tt.memory_space<l1>>
  >
>

Pipeline Parallel

Pipeline parallel in the scope of this spec isn't particularly interesting; it is intended to be used in conjunction with the ttir.pipeline operation, which will group sections of the module's operations into pipeline regions. It will be covered in a separate spec.

What we can demonstrate here is how we can take multiple non-overlapping views of the system descriptor to form distinct virtual devices.

Given an 8 chip system [8, 8x8], we can form two virtual devices that each take 4 chips and interpret them differently (though they could take the same logical grid).

#tt.device<
  workerGrid = #tt.grid<2x8x16, (d0, d1, d2) -> (d0 * 2 + (d1 floordiv 8) * 2 + d2 floordiv 8, d1, d2 mod 8)>,
  meshShape = 2x1x2,
  chipIds = [0, 1, 2, 3]
>
#tt.device<
  workerGrid = #tt.grid<16x16, (d0, d1) -> ((d0 floordiv 8) * 2 + d1 floordiv 8, d0 mod 8, d1 mod 8)>,
  meshShape = 2x2,
  chipIds = [4, 5, 6, 7]
>

Reinterpreted Grids (Transpose)

One particularly interesting usecase that logical grids could enable is to reinterpret the grid as a form of data layout optimization. For example, if we wanted to transpose a tensor, instead of having to move the data around to implement transpose, we could reinterpret the grid as being transposed, leveraging the fact that the relevant data is already located on the correct cores/chips.

To keep things simple, let's consider a 1 chip system [1, 8x8], but it's not too big a leap to see how this could map to multi-chip where the cost of moving data is even higher.

Let's also consider a simple (totally contrived) eltwise unary graph:

a = exp(a)
aT = transpose(a)
relu(aT)

  1. We'll establish a regular, single chip, identity logical grid:
#tt.device<
  workerGrid = #tt.grid<8x8, (d0, d1) -> (0, d0, d1)>,
  meshShape = 1,
  chipIds = [0]
>
  2. Execute exp.
  3. We'll reinterpret the grid as transposed:
#tt.device<
  workerGrid = #tt.grid<8x8, (d0, d1) -> (0, d1, d0)>,
  meshShape = 1,
  chipIds = [0]
>
  4. Execute transpose. Note that each core only needs to transpose its data locally. Eventually this could be implemented as a no-op by reindexing the tile visitation order of the successive operation.
  5. Execute relu.

It's important to note that we effectively implemented transpose without moving data anywhere.

Reinterpreted Grids (Extra)

For the sake of examples, here are a few more ways of reinterpreting the logical grid.

Extra Wide Grid

#tt.device<
  workerGrid = #tt.grid<1x64, (d0, d1) -> (0, d0 * 8 + d1 floordiv 8, d1 mod 8)>,
  meshShape = 1,
  chipIds = [0]
>

Extra Tall + Transposed Grid

#tt.device<
  workerGrid = #tt.grid<64x1, (d0, d1) -> (0, d1 * 8 + d0 floordiv 8, d0 mod 8)>,
  meshShape = 1,
  chipIds = [0]
>

Staircase

#tt.device<
  workerGrid = #tt.grid<8x8, (d0, d1) -> (0, d0, (d0 + d1) mod 8)>,
  meshShape = 1,
  chipIds = [0]
>

This could be an interesting starting position for data in implementing matmul as a systolic array in a ring topology.

Lowering to TTNN

While the above device attribute encoding is quite flexible, this does not necessarily mean the target backend can actually support all of these interpretations. The TTNN backend will be constrained to support only the specialized grid topologies that are supported by the API.

Grid/Shard Orientation

TODO

Multi-device

Please refer to TTNN Mesh Programming Docs for more information on how to program multi-device systems with TTNN API.

The multi-device TTNN dialect will try to stay as close to the TTNN API as possible. Let's consider what this looks like from the compiler and runtime perspectives:

Compiler

  • Device Creation: The TTNN device in the compiler is exactly the same attribute from the ttir dialect. It will encode the meshShape into the flatbuffer which can be directly used to program ::ttnn::MeshShape.
  • Tensor Layout: Again, the tensor layout is inherited in TTNN dialect from the ttir dialect. The grid attribute in the tensor layout can be trivially divided by meshShape to determine the shape of the tensor slice on each device. Broadcasting rules can be applied to determine which Distribution Strategy to use:
    • Mesh Sharded: If the tensor grid is > 1 along the meshShape dimensions, the tensor will be sharded across the mesh devices.
    • Replication: If the tensor needs to be broadcasted for this op, by extension the tensor layout will be replicated across the mesh devices.
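As a sketch of the Tensor Layout bullet above, the per-device slice grid is just the tensor grid divided elementwise by meshShape; the values below come from the earlier data-parallel-over-batch example:

#include <array>
#include <cstdint>
#include <iostream>

int main() {
  std::array<int64_t, 3> tensorGrid = {2, 2, 4}; // grid of tensor<16x3x64x128>
  std::array<int64_t, 3> meshShape = {2, 1, 1};  // 2 chips dividing the batch
  for (size_t i = 0; i < 3; ++i)
    std::cout << tensorGrid[i] / meshShape[i] << (i < 2 ? "x" : "\n"); // 1x2x4
}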

Runtime

  • Device Creation: The ttnn runtime will wholesale switch to working with mesh devices via the API ttnn::multi_device::open_mesh_device; this is possible because a 1x1 mesh device is a valid single device. The mesh shape during device open will always be 1xN where N is the number of deviceIds in the array. Note that this shape can be reinterpreted by flatbuffer programs on the fly with the SubMesh API.
  • Tensor Creation: Tensor creation in a multi-device system is a bit more involved. In order to upload a multi-device tensor to the mesh, the host tensor must first be created with MultiDeviceHostStorage. The ttnn runtime can automatically do this during handleToHostMemoryConfigOp:
    • A regular host tensor will bounce through a new tensor with MultiDeviceHostStorage type.
    • tensor.to(mesh_device) will allocate/move the tensor to the mesh device.

Lowering to TTMetal

In TTMetal dialect we are only constrained by what we've implemented in the tt-mlir compiler, this means it is much more flexible and can theoretically support any of the grid interpretations above.

Test Plan

  • test/python/device_attr.py covers all of the examples above and asserts the IR is correctly generated.
  • Additional functional unit tests will be added as op and runtime support is added.

Concerns

  • tt.device is very flexible, but with this flexibility comes the potential for misuse. It's important that the compiler is able to validate the legal configurations of this attribute for the target backend.

'tt' Dialect

TT types and attributes common to all TT dialects.

[TOC]

ArchAttr

TT Arch

Syntax:

#tt.arch<
  ::mlir::tt::Arch   # value
>

Enum cases:

  • grayskull (Grayskull)
  • wormhole_b0 (WormholeB0)
  • blackhole (Blackhole)

Parameters:

Parameter | C++ type | Description
value | ::mlir::tt::Arch | an enum of type Arch

ArgumentAllocationAttr

Argument allocation attribute in TT dialect

Syntax:

#tt.arg_alloc<
  uint64_t,   # address
  uint64_t,   # size
  MemorySpace   # memorySpace
>

Holds the metadata for the allocation of a function argument, i.e. for graph inputs.

Parameters:

Parameter | C++ type | Description
address | uint64_t
size | uint64_t
memorySpace | MemorySpace

BufferAccessAttr

TT Buffer Access

Syntax:

#tt.buffer_access<
  ::mlir::tt::BufferAccess   # value
>

Enum cases:

  • alias (Alias)
  • stream (Stream)

Parameters:

Parameter | C++ type | Description
value | ::mlir::tt::BufferAccess | an enum of type BufferAccess

BufferAttr

Buffer attribute in TT dialect

Syntax:

#tt.buffer<
  MemRefType,   # memref
  BufferAccess   # buffer_access
>

Describes the physical footprint and layout of a buffer in L1. Its memref must also have a shape with rank equal to DeviceAttr grid. It also carries a buffer access attribute which can be one of:

  • Alias: This buffer aliases a persistent Tensor L1 allocation directly. Implies that no datamovement occurs and the compute kernel just accesses the local allocation directly.
  • Stream: This buffer is a temporary destination as a means to get remote data for local computation. Remote data is most likely a tensor that is allocated in DRAM, but could also be data from a remote core.

Parameters:

Parameter | C++ type | Description
memref | MemRefType | A memref that describes the physical footprint and layout of the buffer. It must also have a shape with rank equal to DeviceAttr grid.
buffer_access | BufferAccess | How data is accessed through this buffer, alias or stream.

CPUDescAttr

TT cpu_desc attribute

Syntax:

#tt.cpu_desc<
  CPURole,   # role
  StringAttr   # target_triple
>

TT cpu_desc attribute

Parameters:

Parameter | C++ type | Description
role | CPURole
target_triple | StringAttr

CPURoleAttr

TT CPU Role

Syntax:

#tt.cpu_role<
  ::mlir::tt::CPURole   # value
>

Enum cases:

  • host (Host)
  • device (Device)

Parameters:

Parameter | C++ type | Description
value | ::mlir::tt::CPURole | an enum of type CPURole

ChipChannelAttr

TT chip_channel attribute

Syntax:

#tt.chip_channel<
  unsigned,   # deviceId0
  ::llvm::ArrayRef<int64_t>,   # ethernetCoreCoord0
  unsigned,   # deviceId1
  ::llvm::ArrayRef<int64_t>   # ethernetCoreCoord1
>

TT chip_channel attribute

Parameters:

Parameter | C++ type | Description
deviceId0 | unsigned
ethernetCoreCoord0 | ::llvm::ArrayRef<int64_t>
deviceId1 | unsigned
ethernetCoreCoord1 | ::llvm::ArrayRef<int64_t>

ChipCoordAttr

TT chip_coord attribute

Syntax:

#tt.chip_coord<
  unsigned,   # rack
  unsigned,   # shelf
  unsigned,   # y
  unsigned   # x
>

TT chip_coord attribute

Parameters:

Parameter | C++ type | Description
rack | unsigned
shelf | unsigned
y | unsigned
x | unsigned

ChipDescAttr

TT chip_desc attribute

Syntax:

#tt.chip_desc<
  ArchAttr,   # arch
  ::llvm::ArrayRef<int64_t>,   # grid
  unsigned,   # l1Size
  unsigned,   # numDramChannels
  unsigned,   # dramChannelSize
  unsigned,   # nocL1AddressAlignBytes
  unsigned,   # pcieAddressAlignBytes
  unsigned,   # nocDRAMAddressAlignBytes
  unsigned,   # l1UnreservedBase
  unsigned,   # eriscL1UnreservedBase
  unsigned,   # dramUnreservedBase
  unsigned,   # dramUnreservedEnd
  ChipPhysicalCoresAttr,   # chipPhysicalCores
  ::llvm::ArrayRef<DataTypeAttr>,   # supportedDataTypes
  ::llvm::ArrayRef<TileSizeAttr>,   # supportedTileSizes
  unsigned   # numCBs
>

TT chip_desc attribute

Parameters:

Parameter | C++ type | Description
arch | ArchAttr
grid | ::llvm::ArrayRef<int64_t>
l1Size | unsigned
numDramChannels | unsigned
dramChannelSize | unsigned
nocL1AddressAlignBytes | unsigned
pcieAddressAlignBytes | unsigned
nocDRAMAddressAlignBytes | unsigned
l1UnreservedBase | unsigned
eriscL1UnreservedBase | unsigned
dramUnreservedBase | unsigned
dramUnreservedEnd | unsigned
chipPhysicalCores | ChipPhysicalCoresAttr
supportedDataTypes | ::llvm::ArrayRef<DataTypeAttr>
supportedTileSizes | ::llvm::ArrayRef<TileSizeAttr>
numCBs | unsigned

ChipPhysicalCoresAttr

TT chip_physical_cores attribute

Syntax:

#tt.chip_physical_cores<
  ::llvm::ArrayRef<CoreCoordAttr>,   # worker
  ::llvm::ArrayRef<CoreCoordAttr>,   # dram
  ::llvm::ArrayRef<CoreCoordAttr>,   # eth
  ::llvm::ArrayRef<CoreCoordAttr>   # eth_inactive
>

TT chip_physical_cores attribute containing arrays of physical cores by core type in order of logical cores.

Parameters:

Parameter | C++ type | Description
worker | ::llvm::ArrayRef<CoreCoordAttr>
dram | ::llvm::ArrayRef<CoreCoordAttr>
eth | ::llvm::ArrayRef<CoreCoordAttr>
eth_inactive | ::llvm::ArrayRef<CoreCoordAttr>

CoreCoordAttr

TT core_coord attribute

Syntax:

#tt.core_coord<
  int64_t,   # y
  int64_t   # x
>

TT core_coord attribute containing a single physical core coordinate.

Parameters:

Parameter | C++ type | Description
y | int64_t
x | int64_t

DataTypeAttr

TT DataTypes

Syntax:

#tt.supportedDataTypes<
  ::mlir::tt::DataType   # value
>

Enum cases:

  • f32 (Float32)
  • f16 (Float16)
  • bf16 (BFloat16)
  • bfp_f8 (BFP_Float8)
  • bfp_bf8 (BFP_BFloat8)
  • bfp_f4 (BFP_Float4)
  • bfp_bf4 (BFP_BFloat4)
  • bfp_f2 (BFP_Float2)
  • bfp_bf2 (BFP_BFloat2)
  • u32 (UInt32)
  • u16 (UInt16)
  • u8 (UInt8)

Parameters:

Parameter | C++ type | Description
value | ::mlir::tt::DataType | an enum of type DataType

DeviceAttr

Device attribute in TT dialect.

Syntax:

#tt.device<
  ::mlir::tt::GridAttr,   # workerGrid
  AffineMap,   # l1Map
  AffineMap,   # dramMap
  ::llvm::ArrayRef<int64_t>,   # meshShape
  ::llvm::ArrayRef<unsigned>   # chipIds
>

Describes the physical layout of a device in the system and is made up of a few components:

  • A grid attribute that describes the device's compute grid shape. It not only describes the shape of the compute grid, but also carries an affine map that describes how the logical grid maps to the physical grid.
  • Two affine maps that describe how a tensor layout's linear attribute maps to the L1 and DRAM memory spaces.
  • A mesh shape that describes the virtual layout of the chips with respect to each other. Note that in a multi-chip system, this grid encapsulates the entire system's grid shape, e.g. 8x16 grid could be made up of a 1x2 mesh of chips side-by-side. The mesh attribute configures how the above grid/map attributes are created such that they implement this mesh topology.
  • An array of chip ids that this device is made up of. This array's length must match the volume of the mesh shape and should be interpreted in row-major order.

Parameters:

Parameter | C++ type | Description
workerGrid | ::mlir::tt::GridAttr | TT grid attribute
l1Map | AffineMap
dramMap | AffineMap
meshShape | ::llvm::ArrayRef<int64_t>
chipIds | ::llvm::ArrayRef<unsigned>

GridAttr

TT grid attribute

Syntax:

#tt.grid<
  ::llvm::ArrayRef<int64_t>,   # shape
  AffineMap   # mapping
>

TT grid attribute

Parameters:

Parameter | C++ type | Description
shape | ::llvm::ArrayRef<int64_t>
mapping | AffineMap

IteratorTypeAttr

TT IteratorType

Syntax:

#tt.iterator_type<
  ::mlir::tt::IteratorType   # value
>

Enum cases:

  • parallel (Parallel)
  • systolic (Systolic)
  • broadcast (Broadcast)
  • reduction (Reduction)

Parameters:

Parameter | C++ type | Description
value | ::mlir::tt::IteratorType | an enum of type IteratorType

LayoutAttr

Tensor layout attribute

Syntax:

#tt.layout<
  AffineMap,   # linear
  OOBVal,   # oob_val
  GridAttr,   # grid
  MemRefType,   # memref
  TensorMemoryLayout   # mem_layout
>

The tensor layout attribute captures how tensor data is sharded across a grid of devices, cores, and is laid out in memory.

Some high level goals

  • Logical shapes: Keep the original tensor shape and rank intact and agnostic to underlying storage layout. Keeping the logical shapes not only makes some graph transformations vastly simpler, in particular convs, but it makes the lowered IR much easier to read and reason about. The original tensor shapes leave breadcrumbs that make it much easier to map back to the input representation.
  • Flexible sharding: Enable flexibility in choosing grid shape, to get better parallelization and avoid resharding. This is particularly important in cases where tensor shapes are not clean powers of two and would otherwise force our hand in choosing non-optimal grid shapes.
  • Logical-Physical Isomorphism: Encode this information with just a few attributes to enable derived conversions from logical to physical layout and back.
  • Explicit: A single source of truth.
  • Enable a direct way to query padded regions.

Please refer to the Tensor Layout Spec for more in depth documentation.

Examples:

tensor<8x300xf32,
  #tt.layout<(d0, d1) -> (d0, d1),
    undef,
    <1x2>,
    memref<8x150xf32, #tt.memory_space<l1>>
  >
>

tensor<8x96x32xf32,
  #tt.layout<(d0, d1, d2) -> (d0 * 96 + d1, d2),
    undef,
    <2x1>,
    memref<384x32xf32, #tt.memory_space<l1>>
  >
>

tensor<8x96x32xf32,
  #tt.layout<(d0, d1, d2) -> (d0 * 96 + d1, d1, d2),
    undef,
    <2x1x2>,
    memref<384x96x16xf32, #tt.memory_space<l1>>
  >
>

tensor<5x3x2x2x7x32x32xf32,
  #tt.layout<
    (d0, d1, d2, d3, d4, d5, d6)
      -> (d0 * 2688 + d1 * 896 + d2 * 448 + d3 * 224 + d4 * 32 + d5, d4, d5, d6),
    undef,
    <3x2x2x2>,
    memref<4480x4x16x16xf32, #tt.memory_space<l1>>
  >
>

Parameters:

Parameter | C++ type | Description
linear | AffineMap | An affine map that defines how the logical tensor dimensions map to a grid shape.
oob_val | OOBVal | A tracked out of bounds value that fills padding space.
grid | GridAttr | The grid shape that this tensor is divided onto.
memref | MemRefType | A memref that describes the physical footprint allocation of the shard. It must also have a shape with rank equal to grid.
mem_layout | TensorMemoryLayout | The layout of the tensor in memory.

MemorySpaceAttr

TT MemorySpace

Syntax:

#tt.memory_space<
  ::mlir::tt::MemorySpace   # value
>

Enum cases:

  • system (System)
  • mmio (SystemMMIO)
  • dram (DeviceDRAM)
  • l1 (DeviceL1)

Parameters:

Parameter | C++ type | Description
value | ::mlir::tt::MemorySpace | an enum of type MemorySpace

OOBValAttr

TT OOBVal

Syntax:

#tt.oob_val<
  ::mlir::tt::OOBVal   # value
>

Enum cases:

  • undef (Undef)
  • zero (Zero)
  • one (One)
  • inf (Inf)
  • neginf (NegInf)

Parameters:

Parameter | C++ type | Description
value | ::mlir::tt::OOBVal | an enum of type OOBVal

OperandConstraintAttr

TT Operand Constraints

Syntax:

#tt.operand_constraint<
  ::mlir::tt::OperandConstraint   # value
>

Enum cases:

  • system (System)
  • dram (DRAM)
  • l1 (L1)
  • scalar (Scalar)
  • tile (Tile)
  • none (None)
  • interleaved (Interleaved)
  • single_bank (SingleBank)
  • height_sharded (HeightSharded)
  • width_sharded (WidthSharded)
  • block_sharded (BlockSharded)
  • any_layout (AnyLayout)
  • any (Any)
  • any_device (AnyDevice)
  • any_device_tile (AnyDeviceTile)
  • l1_block_sharded (L1BlockSharded)

Parameters:

Parameter | C++ type | Description
value | ::mlir::tt::OperandConstraint | an enum of type OperandConstraint

SystemDescAttr

TT system_desc attribute

Syntax:

#tt.system_desc<
  ::llvm::ArrayRef<CPUDescAttr>,   # cpuDescs
  ::llvm::ArrayRef<ChipDescAttr>,   # chipDescs
  ::llvm::ArrayRef<unsigned>,   # chipDescIndices
  ::llvm::ArrayRef<ChipCapabilityAttr>,   # chipCapabilities
  ::llvm::ArrayRef<ChipCoordAttr>,   # chipCoords
  ::llvm::ArrayRef<ChipChannelAttr>   # chipChannels
>

TT system_desc attribute

Parameters:

Parameter | C++ type | Description
cpuDescs | ::llvm::ArrayRef<CPUDescAttr>
chipDescs | ::llvm::ArrayRef<ChipDescAttr>
chipDescIndices | ::llvm::ArrayRef<unsigned>
chipCapabilities | ::llvm::ArrayRef<ChipCapabilityAttr>
chipCoords | ::llvm::ArrayRef<ChipCoordAttr>
chipChannels | ::llvm::ArrayRef<ChipChannelAttr>

TensorMemoryLayoutAttr

TT TensorMemoryLayout

Syntax:

#tt.tensor_memory_layout<
  ::mlir::tt::TensorMemoryLayout   # value
>

Enum cases:

  • none (None)
  • interleaved (Interleaved)
  • single_bank (SingleBank)
  • height_sharded (HeightSharded)
  • width_sharded (WidthSharded)
  • block_sharded (BlockSharded)

Parameters:

Parameter | C++ type | Description
value | ::mlir::tt::TensorMemoryLayout | an enum of type TensorMemoryLayout

TileSizeAttr

TT tile_size attribute

Syntax:

#tt.tile_size<
  int64_t,   # y
  int64_t   # x
>

TT tile_size attribute containing a supported Tensix tile shape.

Parameters:

Parameter | C++ type | Description
y | int64_t
x | int64_t

DeviceType

TT device

Syntax:

!tt.device<
  ::mlir::tt::DeviceAttr   # desc
>

Device type in TT dialect

Parameters:

Parameter | C++ type | Description
desc | ::mlir::tt::DeviceAttr | Device attribute in TT dialect.

TileType

TT tile

Syntax:

!tt.tile<
  ::llvm::ArrayRef<int64_t>,   # shape
  DataType   # dataType
>

Tile type in TT dialect

Parameters:

Parameter | C++ type | Description
shape | ::llvm::ArrayRef<int64_t>
dataType | DataType

'ttir' Dialect

TTIR dialect provides high level semantics for dispatching work to TT HW. It defines a set of declarative/high level operations that are used to describe the dispatch, but is largely agnostic to the set of operations or dialects that are actually supported by a consuming backend.

[TOC]

ttir.abs (tt::ttir::AbsOp)

Eltwise absolute op.

Eltwise absolute operation.

Traits: AttrSizedOperandSegments

Interfaces: DestinationStyleOpInterface, TTIROpInterface, TTIR_ElementwiseOpInterface

Attributes:

Attribute | MLIR Type | Description
operand_constraints | ::mlir::ArrayAttr

Operands:

Operand | Description
inputs | variadic of ranked tensor of any type values
outputs | variadic of ranked tensor of any type values

Results:

Result | Description
results | variadic of ranked tensor of any type values

ttir.add (tt::ttir::AddOp)

Eltwise add.

Eltwise add operation.

Traits: AttrSizedOperandSegments

Interfaces: DestinationStyleOpInterface, TTIROpInterface, TTIR_ElementwiseOpInterface, TTIR_GenericRegionOpInterface

Attributes:

Attribute | MLIR Type | Description
operand_constraints | ::mlir::ArrayAttr

Operands:

Operand | Description
inputs | variadic of ranked tensor of any type values
outputs | variadic of ranked tensor of any type values

Results:

Result | Description
results | variadic of ranked tensor of any type values

ttir.all_gather (tt::ttir::AllGatherOp)

All gather operation.

All gather op.

Interfaces: DestinationStyleOpInterface, TTIROpInterface

Attributes:

Attribute | MLIR Type | Description
dim | ::mlir::IntegerAttr | 32-bit signed integer attribute
operand_constraints | ::mlir::ArrayAttr

Operands:

Operand | Description
input | ranked tensor of any type values
output | ranked tensor of any type values

Results:

Result | Description
result | ranked tensor of any type values

ttir.alloc (tt::ttir::AllocOp)

Alloc op.

Tensor Alloc operation

Attributes:

Attribute | MLIR Type | Description
address | ::mlir::IntegerAttr | 64-bit signless integer attribute
size | ::mlir::IntegerAttr | 64-bit signless integer attribute
memory_space | ::mlir::tt::MemorySpaceAttr | TT MemorySpace. Enum cases: system (System), mmio (SystemMMIO), dram (DeviceDRAM), l1 (DeviceL1)

Results:

Result | Description
result | ranked tensor of any type values

ttir.broadcast (tt::ttir::BroadcastOp)

Broadcast operation.

Broadcast op.

Interfaces: DestinationStyleOpInterface, TTIROpInterface

Attributes:

Attribute | MLIR Type | Description
dimension | ::mlir::ArrayAttr | 64-bit integer array attribute
operand_constraints | ::mlir::ArrayAttr

Operands:

Operand | Description
input | ranked tensor of any type values
output | ranked tensor of any type values

Results:

Result | Description
result | ranked tensor of any type values

ttir.cbrt (tt::ttir::CbrtOp)

Eltwise cubic root op.

Eltwise cubic root operation.

Traits: AttrSizedOperandSegments

Interfaces: DestinationStyleOpInterface, TTIROpInterface, TTIR_ElementwiseOpInterface

Attributes:

Attribute | MLIR Type | Description
operand_constraints | ::mlir::ArrayAttr

Operands:

Operand | Description
inputs | variadic of ranked tensor of any type values
outputs | variadic of ranked tensor of any type values

Results:

Result | Description
results | variadic of ranked tensor of any type values

ttir.ceil (tt::ttir::CeilOp)

Eltwise ceil op.

Eltwise ceil operation.

Traits: AttrSizedOperandSegments

Interfaces: DestinationStyleOpInterface, TTIROpInterface, TTIR_ElementwiseOpInterface

Attributes:

Attribute | MLIR Type | Description
operand_constraints | ::mlir::ArrayAttr

Operands:

Operand | Description
inputs | variadic of ranked tensor of any type values
outputs | variadic of ranked tensor of any type values

Results:

Result | Description
results | variadic of ranked tensor of any type values

ttir.clamp (tt::ttir::ClampOp)

Clamp op.

Clamp tensor values to a specified range.

Example: input: [[0, 1, 2, 3, 4, 5, 6, 7]], min: 2.000000e+00, max: 5.000000e+00

"ttir.clamp"(%arg0) <{min = 2.000000e+00 : f32, max = 5.000000e+00 : f32}> -> %out = [[2, 2, 2, 3, 4, 5, 5, 5]]

Interfaces: DestinationStyleOpInterface, TTIROpInterface

Attributes:

Attribute | MLIR Type | Description
min | ::mlir::FloatAttr | 32-bit float attribute
max | ::mlir::FloatAttr | 32-bit float attribute
operand_constraints | ::mlir::ArrayAttr

Operands:

Operand | Description
input | ranked tensor of any type values
output | ranked tensor of any type values

Results:

Result | Description
result | ranked tensor of any type values

ttir.concat (tt::ttir::ConcatOp)

Concat op.

Concat tensors along a given dimension.

Interfaces: DestinationStyleOpInterface, TTIROpInterface

Attributes:

Attribute | MLIR Type | Description
dim | ::mlir::IntegerAttr | 32-bit signed integer attribute
operand_constraints | ::mlir::ArrayAttr

Operands:

Operand | Description
inputs | variadic of ranked tensor of any type values
output | ranked tensor of any type values

Results:

Result | Description
result | ranked tensor of any type values

ttir.constant (tt::ttir::ConstantOp)

Constant op.

Produces tensor filled with given constant value.

Examples: %0 = "ttir.constant"() {value = dense<0> : tensor<2x3xi32>} : () -> tensor<2x3xi32> // %0: [[0, 0, 0], [0, 0, 0]] %1 = "ttir.constant"() {value = dense<[0.2, 1.3]> : tensor<2xf32>} : () -> tensor<2xf32> // %1: [0.2, 1.3]

Traits: ConstantLike

Attributes:

Attribute | MLIR Type | Description
value | ::mlir::ElementsAttr | constant vector/tensor attribute

Results:

Result | Description
result | ranked tensor of any type values

ttir.conv2d (tt::ttir::Conv2dOp)

Conv2d operation.

Applies a 2D convolution over an input image composed of several input planes.

Interfaces: DestinationStyleOpInterface, TTIROpInterface

Attributes:

Attribute | MLIR Type | Description
stride_height | ::mlir::IntegerAttr | 32-bit signed integer attribute
stride_width | ::mlir::IntegerAttr | 32-bit signed integer attribute
dilation_height | ::mlir::IntegerAttr | 32-bit signed integer attribute
dilation_width | ::mlir::IntegerAttr | 32-bit signed integer attribute
groups | ::mlir::IntegerAttr | 32-bit signed integer attribute
padding_left | ::mlir::IntegerAttr | 32-bit signed integer attribute
padding_right | ::mlir::IntegerAttr | 32-bit signed integer attribute
padding_top | ::mlir::IntegerAttr | 32-bit signed integer attribute
padding_bottom | ::mlir::IntegerAttr | 32-bit signed integer attribute
operand_constraints | ::mlir::ArrayAttr

Operands:

Operand | Description
input | ranked tensor of any type values
weight | ranked tensor of any type values
bias | ranked tensor of any type values
output | ranked tensor of any type values

Results:

Result | Description
result | ranked tensor of any type values

ttir.convolution (tt::ttir::ConvolutionOp)

Generalized convolution op.

Applies a convolution of the rhs with the lhs.

This operation captures convolutions of all dimensionality as well as deconvolution/conv transpose.

Interfaces: DestinationStyleOpInterface, TTIROpInterface

Attributes:

Attribute | MLIR Type | Description
window_strides | ::mlir::DenseI64ArrayAttr | i64 dense array attribute
padding | ::mlir::DenseI64ArrayAttr | i64 dense array attribute
input_dilation | ::mlir::DenseI64ArrayAttr | i64 dense array attribute
weight_dilation | ::mlir::DenseI64ArrayAttr | i64 dense array attribute
window_reversal | ::mlir::DenseBoolArrayAttr | i1 dense array attribute
convolution_layout | ::mlir::tt::ttir::ConvolutionLayoutAttr | Structure of dimension information for the convolution op. Holds the layout information for the input activation, weights, and output.
feature_group_count | ::mlir::IntegerAttr | 64-bit signless integer attribute whose value is positive
batch_group_count | ::mlir::IntegerAttr | 64-bit signless integer attribute whose value is positive
operand_constraints | ::mlir::ArrayAttr

Operands:

Operand | Description
input | ranked tensor of any type values
weight | ranked tensor of any type values
bias | ranked tensor of any type values
output | ranked tensor of any type values

Results:

Result | Description
«unnamed» | ranked tensor of any type values

ttir.cos (tt::ttir::CosOp)

Eltwise cosine op.

Eltwise cosine operation.

Traits: AttrSizedOperandSegments

Interfaces: DestinationStyleOpInterface, TTIROpInterface, TTIR_ElementwiseOpInterface

Attributes:

Attribute | MLIR Type | Description
operand_constraints | ::mlir::ArrayAttr

Operands:

Operand | Description
inputs | variadic of ranked tensor of any type values
outputs | variadic of ranked tensor of any type values

Results:

Result | Description
results | variadic of ranked tensor of any type values

ttir.dealloc (tt::ttir::DeallocOp)

Dealloc op.

Tensor Dealloc operation

Operands:

Operand | Description
result | ranked tensor of any type values

ttir.div (tt::ttir::DivOp)

Eltwise divide.

Eltwise divide operation.

Traits: AttrSizedOperandSegments

Interfaces: DestinationStyleOpInterface, TTIROpInterface, TTIR_ElementwiseOpInterface, TTIR_GenericRegionOpInterface

Attributes:

Attribute | MLIR Type | Description
operand_constraints | ::mlir::ArrayAttr

Operands:

Operand | Description
inputs | variadic of ranked tensor of any type values
outputs | variadic of ranked tensor of any type values

Results:

Result | Description
results | variadic of ranked tensor of any type values

ttir.embedding (tt::ttir::EmbeddingOp)

Embedding op.

Embedding operation.

Interfaces: DestinationStyleOpInterface, TTIROpInterface

Attributes:

Attribute | MLIR Type | Description
operand_constraints | ::mlir::ArrayAttr

Operands:

Operand | Description
input | ranked tensor of any type values
weight | ranked tensor of any type values
output | ranked tensor of any type values

Results:

Result | Description
result | ranked tensor of any type values

ttir.eq (tt::ttir::EqualOp)

Eltwise equal to.

Eltwise equal to operation.

Traits: AttrSizedOperandSegments

Interfaces: DestinationStyleOpInterface, TTIROpInterface, TTIR_ElementwiseOpInterface

Attributes:

Attribute | MLIR Type | Description
operand_constraints | ::mlir::ArrayAttr

Operands:

Operand | Description
inputs | variadic of ranked tensor of any type values
outputs | variadic of ranked tensor of any type values

Results:

Result | Description
results | variadic of ranked tensor of any type values

ttir.exp (tt::ttir::ExpOp)

Eltwise exponential op.

Eltwise exponential operation. Calculates e^x for all elements x in input tensor.

Traits: AttrSizedOperandSegments

Interfaces: DestinationStyleOpInterface, TTIROpInterface, TTIR_ElementwiseOpInterface, TTIR_GenericRegionOpInterface

Attributes:

AttributeMLIR TypeDescription
operand_constraints::mlir::ArrayAttr

Operands:

OperandDescription
inputsvariadic of ranked tensor of any type values
outputsvariadic of ranked tensor of any type values

Results:

ResultDescription
resultsvariadic of ranked tensor of any type values

ttir.expm1 (tt::ttir::Expm1Op)

Eltwise unary op.

Performs element-wise exponential minus one operation on operand tensor and stores the result in the output tensor.

Example: %a: [[0, 1], [0, 0]] "ttir.expm1"(%a, %out) -> %out: [[0, 1.71828], [0, 0]]

Traits: AttrSizedOperandSegments

Interfaces: DestinationStyleOpInterface, TTIROpInterface, TTIR_ElementwiseOpInterface

Attributes:

AttributeMLIR TypeDescription
operand_constraints::mlir::ArrayAttr

Operands:

OperandDescription
inputsvariadic of ranked tensor of any type values
outputsvariadic of ranked tensor of any type values

Results:

ResultDescription
resultsvariadic of ranked tensor of any type values

ttir.fill (tt::ttir::FillOp)

Fill operation.

Produces tensor filled with given fill value.

Examples:
%0 = tensor.empty() : () -> tensor<2x3xi32>
%1 = "ttir.fill"(%0) {value = dense<0> : tensor<2x3xi32>} : () -> tensor<2x3xi32>
%2 = tensor.empty() : () -> tensor<2xf32>
%3 = "ttir.fill"(%2) {value = dense<[0.2, 1.3]> : tensor<2xf32>} : () -> tensor<2xf32>

Interfaces: DestinationStyleOpInterface, TTIROpInterface

Attributes:

AttributeMLIR TypeDescription
value::mlir::ElementsAttrconstant vector/tensor attribute
operand_constraints::mlir::ArrayAttr

Operands:

OperandDescription
outputranked tensor of any type values

Results:

ResultDescription
resultranked tensor of any type values

ttir.floor (tt::ttir::FloorOp)

Eltwise floor op.

Eltwise floor operation.

Traits: AttrSizedOperandSegments

Interfaces: DestinationStyleOpInterface, TTIROpInterface, TTIR_ElementwiseOpInterface

Attributes:

AttributeMLIR TypeDescription
operand_constraints::mlir::ArrayAttr

Operands:

OperandDescription
inputsvariadic of ranked tensor of any type values
outputsvariadic of ranked tensor of any type values

Results:

ResultDescription
resultsvariadic of ranked tensor of any type values

ttir.gather (tt::ttir::GatherOp)

Gather operation.

Gathers slices from the operand tensor at offsets specified in start_indices and produces a result tensor. Semantics follow the StableHLO gather op: https://openxla.org/stablehlo/spec#gather

Interfaces: DestinationStyleOpInterface, TTIROpInterface

Attributes:

AttributeMLIR TypeDescription
offset_dims::mlir::DenseI64ArrayAttri64 dense array attribute
collapsed_slice_dims::mlir::DenseI64ArrayAttri64 dense array attribute
operand_batching_dims::mlir::DenseI64ArrayAttri64 dense array attribute
start_indices_batching_dims::mlir::DenseI64ArrayAttri64 dense array attribute
start_index_map::mlir::DenseI64ArrayAttri64 dense array attribute
index_vector_dim::mlir::IntegerAttr64-bit signed integer attribute
slice_sizes::mlir::DenseI64ArrayAttri64 dense array attribute
indices_are_sorted::mlir::BoolAttrbool attribute
operand_constraints::mlir::ArrayAttr

Operands:

OperandDescription
inputranked tensor of any type values
start_indicesranked tensor of any type values
outputranked tensor of any type values

Results:

ResultDescription
resultranked tensor of any type values

ttir.gelu (tt::ttir::GeluOp)

Eltwise GELU op.

Eltwise GELU operation.

Traits: AttrSizedOperandSegments

Interfaces: DestinationStyleOpInterface, TTIROpInterface, TTIR_ElementwiseOpInterface

Attributes:

AttributeMLIR TypeDescription
operand_constraints::mlir::ArrayAttr

Operands:

OperandDescription
inputsvariadic of ranked tensor of any type values
outputsvariadic of ranked tensor of any type values

Results:

ResultDescription
resultsvariadic of ranked tensor of any type values

ttir.generic (tt::ttir::GenericOp)

Generically dispatch work to a grid of cores.

This generic op carries a region that represents the work each core does. The region is expected to have the same signature as the op itself with respect to input and output operands. The op is expected to be lowered to a backend specific form by a consuming backend. This op is heavily inspired by the linalg.generic op so it can be useful to refer to linalg.generic documentation for more details.

%5 = "ttir.generic"(%1, %2, %3, %4) <{
  grid = #tt.grid<1x1>,                        // The grid range of cores to dispatch work to.
  indexing_maps = [#map, #map, #map],          // Affine maps for indexing into the input/output tensors. See linalg.generic
  iterator_types = [#parallel, #parallel],     // Iterator types for the input/output tensors. See linalg.generic
  operandSegmentSizes = array<i32: 2, 1, 1>,   // Sizes of the operand segments, i.e. 2 inputs, 1 cb and 1 output.
  operand_cb_mapping = array<i64: -1, 0, -1>,  // Mapping of input & output operands to cbs. -1 means no mapping.
                                               // Mapped operands correspond to buffers in streaming mode.
                                               // Non-mapped operands correspond to buffers in alias mode.
}> ({
^bb0(%arg2: tensor<64x128xf32, #tt.buffer<memref<64x128xf32, #l1_>, alias>>,
     %arg3: tensor<64x128xf32, #tt.buffer<memref<64x128xf32, #l1_>, stream>>,
     %arg4: tensor<64x128xf32, #tt.buffer<memref<64x128xf32, #l1_>, alias>>):
    // Region body, would contain some computation that represents the work each core does.
}) : (tensor<64x128xf32, #layout1>, tensor<64x128xf32, #layout1>, tensor<64x128xf32, #layout1>, tensor<64x128xf32, #layout1>) -> tensor<64x128xf32, #layout1>

Traits: AttrSizedOperandSegments

Interfaces: DestinationStyleOpInterface, TTIROpInterface

Attributes:

AttributeMLIR TypeDescription
grid::mlir::tt::GridAttr
TT grid attribute
indexing_maps::mlir::ArrayAttrAffineMap array attribute
iterator_types::mlir::ArrayAttr
operand_constraints::mlir::ArrayAttr
operand_cb_mapping::mlir::DenseI64ArrayAttri64 dense array attribute

Operands:

OperandDescription
inputsvariadic of ranked tensor of any type values
cbsvariadic of ranked tensor of any type values
outputsvariadic of ranked tensor of any type values

Results:

ResultDescription
resultsvariadic of ranked tensor of any type values

ttir.get_dimension_size (tt::ttir::GetDimensionSizeOp)

GetDimensionSize op.

Produces the size of the given dimension of the operand.

Example: %operand: [[3, 2, 7], [1, 4, 4]] "ttir.get_dimension_size"(%operand) <{dimension = 1 : i32}> -> %out: [[3]]

Attributes:

AttributeMLIR TypeDescription
dimension::mlir::IntegerAttr32-bit signless integer attribute

Operands:

OperandDescription
operandranked tensor of any type values

Results:

ResultDescription
resultranked tensor of any type values

ttir.ge (tt::ttir::GreaterEqualOp)

Eltwise greater than or equal to.

Eltwise greater than or equal to operation.

Traits: AttrSizedOperandSegments

Interfaces: DestinationStyleOpInterface, TTIROpInterface, TTIR_ElementwiseOpInterface

Attributes:

AttributeMLIR TypeDescription
operand_constraints::mlir::ArrayAttr

Operands:

OperandDescription
inputsvariadic of ranked tensor of any type values
outputsvariadic of ranked tensor of any type values

Results:

ResultDescription
resultsvariadic of ranked tensor of any type values

ttir.gt (tt::ttir::GreaterThanOp)

Eltwise greater than.

Eltwise greater than operation.

Traits: AttrSizedOperandSegments

Interfaces: DestinationStyleOpInterface, TTIROpInterface, TTIR_ElementwiseOpInterface

Attributes:

AttributeMLIR TypeDescription
operand_constraints::mlir::ArrayAttr

Operands:

OperandDescription
inputsvariadic of ranked tensor of any type values
outputsvariadic of ranked tensor of any type values

Results:

ResultDescription
resultsvariadic of ranked tensor of any type values

ttir.index (tt::ttir::IndexOp)

Index op.

Extract a sub-tensor (slice) from the input tensor along a specified dimension. The begin, end, and step attributes define the start, stop, and step indices for the selected dimension (dim) of the tensor.
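
For example, a hedged sketch selecting every other column of a 4x4 tensor (shapes illustrative; operand_constraints omitted):

%0 = tensor.empty() : tensor<4x2xf32>
%1 = "ttir.index"(%arg0, %0) <{dim = 1 : i32, begin = 0 : i32, end = 4 : i32, step = 2 : i32}> : (tensor<4x4xf32>, tensor<4x2xf32>) -> tensor<4x2xf32>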

Interfaces: DestinationStyleOpInterface, TTIROpInterface

Attributes:

AttributeMLIR TypeDescription
dim::mlir::IntegerAttr32-bit signless integer attribute
begin::mlir::IntegerAttr32-bit signless integer attribute
end::mlir::IntegerAttr32-bit signless integer attribute
step::mlir::IntegerAttr32-bit signless integer attribute
operand_constraints::mlir::ArrayAttr

Operands:

OperandDescription
inputranked tensor of any type values
outputranked tensor of any type values

Results:

ResultDescription
resultranked tensor of any type values

ttir.isfinite (tt::ttir::IsFiniteOp)

Eltwise isfinite op.

Eltwise isfinite operation.

Traits: AttrSizedOperandSegments

Interfaces: DestinationStyleOpInterface, TTIROpInterface, TTIR_ElementwiseOpInterface

Attributes:

AttributeMLIR TypeDescription
operand_constraints::mlir::ArrayAttr

Operands:

OperandDescription
inputsvariadic of ranked tensor of any type values
outputsvariadic of ranked tensor of any type values

Results:

ResultDescription
resultsvariadic of ranked tensor of any type values

ttir.kernel (tt::ttir::KernelOp)

Kernel call.

A generic kernel call operation. This operation is used as a pattern-matching target by some consuming backend.

Traits: AttrSizedOperandSegments

Interfaces: DestinationStyleOpInterface, TTIROpInterface

Attributes:

AttributeMLIR TypeDescription
op::mlir::FlatSymbolRefAttrflat symbol reference attribute
kind::mlir::FlatSymbolRefAttrflat symbol reference attribute
operand_constraints::mlir::ArrayAttr

Operands:

OperandDescription
inputsvariadic of ranked tensor of any type values or non-0-ranked.memref of any type values
outputsvariadic of ranked tensor of any type values or non-0-ranked.memref of any type values

Results:

ResultDescription
resultsvariadic of ranked tensor of any type values or non-0-ranked.memref of any type values

ttir.leaky_relu (tt::ttir::LeakyReluOp)

Eltwise leaky relu operation.

The Leaky ReLU (Rectified Linear Unit) operation computes an element-wise activation function over its input tensor. It is defined as:

y = x              if x > 0
y = parameter * x  if x <= 0

where parameter is a small, user-defined constant that determines the slope for negative inputs.
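
A small worked example (illustrative values): with parameter = 0.01, an input of [-2.0, 0.0, 3.0] yields [-0.02, 0.0, 3.0]; positive elements pass through unchanged, while each negative element is scaled by the parameter.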

Attributes:

  • parameter (float): The slope for negative values.

Inputs:

  • input (Tensor): The input tensor to be activated.

Outputs:

  • output (Tensor): The tensor after applying the Leaky ReLU activation.

Traits: AttrSizedOperandSegments

Interfaces: DestinationStyleOpInterface, TTIROpInterface, TTIR_ElementwiseOpInterface

Attributes:

AttributeMLIR TypeDescription
parameter::mlir::FloatAttr32-bit float attribute
operand_constraints::mlir::ArrayAttr

Operands:

OperandDescription
inputsvariadic of ranked tensor of any type values
outputsvariadic of ranked tensor of any type values

Results:

ResultDescription
resultsvariadic of ranked tensor of any type values

ttir.le (tt::ttir::LessEqualOp)

Eltwise less than or equal to.

Eltwise less than or equal to operation.

Traits: AttrSizedOperandSegments

Interfaces: DestinationStyleOpInterface, TTIROpInterface, TTIR_ElementwiseOpInterface

Attributes:

AttributeMLIR TypeDescription
operand_constraints::mlir::ArrayAttr

Operands:

OperandDescription
inputsvariadic of ranked tensor of any type values
outputsvariadic of ranked tensor of any type values

Results:

ResultDescription
resultsvariadic of ranked tensor of any type values

ttir.lt (tt::ttir::LessThanOp)

Eltwise less than.

Eltwise less than operation.

Traits: AttrSizedOperandSegments

Interfaces: DestinationStyleOpInterface, TTIROpInterface, TTIR_ElementwiseOpInterface

Attributes:

AttributeMLIR TypeDescription
operand_constraints::mlir::ArrayAttr

Operands:

OperandDescription
inputsvariadic of ranked tensor of any type values
outputsvariadic of ranked tensor of any type values

Results:

ResultDescription
resultsvariadic of ranked tensor of any type values

ttir.log1p (tt::ttir::Log1pOp)

Eltwise log1p operation.

Performs element-wise logarithm plus one operation on operand tensor and puts the result in the output tensor.

Example: %a: [0.0, -0.999, 7.0, 6.38905621, 15.0] "ttir.log1p"(%a, %out) -> %out: [0.0, -6.90776825, 2.07944155, 2.0, 2.77258873]

Traits: AttrSizedOperandSegments

Interfaces: DestinationStyleOpInterface, TTIROpInterface, TTIR_ElementwiseOpInterface

Attributes:

AttributeMLIR TypeDescription
operand_constraints::mlir::ArrayAttr

Operands:

OperandDescription
inputsvariadic of ranked tensor of any type values
outputsvariadic of ranked tensor of any type values

Results:

ResultDescription
resultsvariadic of ranked tensor of any type values

ttir.log (tt::ttir::LogOp)

Eltwise logarithm op.

Eltwise logarithm operation. Calculates log(x) for all elements x in input tensor.

Traits: AttrSizedOperandSegments

Interfaces: DestinationStyleOpInterface, TTIROpInterface, TTIR_ElementwiseOpInterface

Attributes:

AttributeMLIR TypeDescription
operand_constraints::mlir::ArrayAttr

Operands:

OperandDescription
inputsvariadic of ranked tensor of any type values
outputsvariadic of ranked tensor of any type values

Results:

ResultDescription
resultsvariadic of ranked tensor of any type values

ttir.logical_and (tt::ttir::LogicalAndOp)

Eltwise logical and.

Eltwise logical and operation.

Traits: AttrSizedOperandSegments

Interfaces: DestinationStyleOpInterface, TTIROpInterface, TTIR_ElementwiseOpInterface

Attributes:

AttributeMLIR TypeDescription
operand_constraints::mlir::ArrayAttr

Operands:

OperandDescription
inputsvariadic of ranked tensor of any type values
outputsvariadic of ranked tensor of any type values

Results:

ResultDescription
resultsvariadic of ranked tensor of any type values

ttir.logical_not (tt::ttir::LogicalNotOp)

Eltwise logical not op.

Eltwise logical not operation.

Traits: AttrSizedOperandSegments

Interfaces: DestinationStyleOpInterface, TTIROpInterface, TTIR_ElementwiseOpInterface

Attributes:

AttributeMLIR TypeDescription
operand_constraints::mlir::ArrayAttr

Operands:

OperandDescription
inputsvariadic of ranked tensor of any type values
outputsvariadic of ranked tensor of any type values

Results:

ResultDescription
resultsvariadic of ranked tensor of any type values

ttir.logical_or (tt::ttir::LogicalOrOp)

Eltwise logical or.

Eltwise logical or operation.

Traits: AttrSizedOperandSegments

Interfaces: DestinationStyleOpInterface, TTIROpInterface, TTIR_ElementwiseOpInterface

Attributes:

AttributeMLIR TypeDescription
operand_constraints::mlir::ArrayAttr

Operands:

OperandDescription
inputsvariadic of ranked tensor of any type values
outputsvariadic of ranked tensor of any type values

Results:

ResultDescription
resultsvariadic of ranked tensor of any type values

ttir.logical_xor (tt::ttir::LogicalXorOp)

Eltwise logical xor.

Eltwise logical xor operation.

Traits: AttrSizedOperandSegments

Interfaces: DestinationStyleOpInterface, TTIROpInterface, TTIR_ElementwiseOpInterface

Attributes:

AttributeMLIR TypeDescription
operand_constraints::mlir::ArrayAttr

Operands:

OperandDescription
inputsvariadic of ranked tensor of any type values
outputsvariadic of ranked tensor of any type values

Results:

ResultDescription
resultsvariadic of ranked tensor of any type values

ttir.matmul (tt::ttir::MatmulOp)

Matrix multiply operation.

Matrix multiply operation.
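
A minimal sketch in generic form (shapes illustrative; operand_constraints omitted). The inner dimensions of a and b must agree:

%0 = tensor.empty() : tensor<64x32xf32>
%1 = "ttir.matmul"(%a, %b, %0) : (tensor<64x128xf32>, tensor<128x32xf32>, tensor<64x32xf32>) -> tensor<64x32xf32>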

Interfaces: DestinationStyleOpInterface, TTIROpInterface

Attributes:

AttributeMLIR TypeDescription
operand_constraints::mlir::ArrayAttr

Operands:

OperandDescription
aranked tensor of any type values
branked tensor of any type values
outputranked tensor of any type values

Results:

ResultDescription
resultranked tensor of any type values

ttir.max (tt::ttir::MaxOp)

Max reduction op.

Max reduction op.

Interfaces: DestinationStyleOpInterface, TTIROpInterface, TTIR_GenericRegionOpInterface

Attributes:

AttributeMLIR TypeDescription
keep_dim::mlir::BoolAttrbool attribute
dim_arg::mlir::ArrayAttr32-bit integer array attribute
operand_constraints::mlir::ArrayAttr

Operands:

OperandDescription
inputranked tensor of any type values
outputranked tensor of any type values

Results:

ResultDescription
resultranked tensor of any type values

ttir.max_pool2d (tt::ttir::MaxPool2dOp)

Applies a 2D max pooling over an input signal composed of several input planes.

Applies a 2D max pooling over an input signal composed of several input planes.

Interfaces: DestinationStyleOpInterface, TTIROpInterface

Attributes:

AttributeMLIR TypeDescription
kernel_height::mlir::IntegerAttr32-bit signed integer attribute
kernel_width::mlir::IntegerAttr32-bit signed integer attribute
stride_height::mlir::IntegerAttr32-bit signed integer attribute
stride_width::mlir::IntegerAttr32-bit signed integer attribute
dilation_height::mlir::IntegerAttr32-bit signed integer attribute
dilation_width::mlir::IntegerAttr32-bit signed integer attribute
ceil_mode::mlir::BoolAttrbool attribute
padding_left::mlir::IntegerAttr32-bit signed integer attribute
padding_right::mlir::IntegerAttr32-bit signed integer attribute
padding_top::mlir::IntegerAttr32-bit signed integer attribute
padding_bottom::mlir::IntegerAttr32-bit signed integer attribute
operand_constraints::mlir::ArrayAttr

Operands:

OperandDescription
inputranked tensor of any type values
outputranked tensor of any type values

Results:

ResultDescription
resultranked tensor of any type values

ttir.maximum (tt::ttir::MaximumOp)

Eltwise maximum OP.

Calculates maximum of input tensors' values element-wise and stores result in output tensor.

Example: %lhs: [[3, 2, 7], [1, 4, 4]] %rhs: [[1, 4, 2], [1, 2, 3]] "ttir.maximum"(%lhs, %rhs, %out) -> %out: [[3, 4, 7], [1, 4, 4]]

Traits: AttrSizedOperandSegments

Interfaces: DestinationStyleOpInterface, TTIROpInterface, TTIR_ElementwiseOpInterface

Attributes:

AttributeMLIR TypeDescription
operand_constraints::mlir::ArrayAttr

Operands:

OperandDescription
inputsvariadic of ranked tensor of any type values
outputsvariadic of ranked tensor of any type values

Results:

ResultDescription
resultsvariadic of ranked tensor of any type values

ttir.mean (tt::ttir::MeanOp)

Mean reduction op.

Mean reduction op.

Interfaces: DestinationStyleOpInterface, TTIROpInterface, TTIR_GenericRegionOpInterface

Attributes:

AttributeMLIR TypeDescription
keep_dim::mlir::BoolAttrbool attribute
dim_arg::mlir::ArrayAttr32-bit integer array attribute
operand_constraints::mlir::ArrayAttr

Operands:

OperandDescription
inputranked tensor of any type values
outputranked tensor of any type values

Results:

ResultDescription
resultranked tensor of any type values

ttir.minimum (tt::ttir::MinimumOp)

Eltwise minimum OP.

Calculates minimum of input tensors' values element-wise and stores result in output tensor.

Example: %lhs: [[3, 2, 7], [1, 4, 4]] %rhs: [[1, 4, 2], [1, 2, 3]] "ttir.minimum"(%lhs, %rhs, %out) -> %out: [[1, 2, 2], [1, 2, 3]]

Traits: AttrSizedOperandSegments

Interfaces: DestinationStyleOpInterface, TTIROpInterface, TTIR_ElementwiseOpInterface

Attributes:

AttributeMLIR TypeDescription
operand_constraints::mlir::ArrayAttr

Operands:

OperandDescription
inputsvariadic of ranked tensor of any type values
outputsvariadic of ranked tensor of any type values

Results:

ResultDescription
resultsvariadic of ranked tensor of any type values

ttir.multiply (tt::ttir::MultiplyOp)

Eltwise multiply.

Eltwise multiply operation.

Traits: AttrSizedOperandSegments

Interfaces: DestinationStyleOpInterface, TTIROpInterface, TTIR_ElementwiseOpInterface, TTIR_GenericRegionOpInterface

Attributes:

AttributeMLIR TypeDescription
operand_constraints::mlir::ArrayAttr

Operands:

OperandDescription
inputsvariadic of ranked tensor of any type values
outputsvariadic of ranked tensor of any type values

Results:

ResultDescription
resultsvariadic of ranked tensor of any type values

ttir.neg (tt::ttir::NegOp)

Eltwise negate op.

Eltwise negate operation.

Traits: AttrSizedOperandSegments

Interfaces: DestinationStyleOpInterface, TTIROpInterface, TTIR_ElementwiseOpInterface

Attributes:

AttributeMLIR TypeDescription
operand_constraints::mlir::ArrayAttr

Operands:

OperandDescription
inputsvariadic of ranked tensor of any type values
outputsvariadic of ranked tensor of any type values

Results:

ResultDescription
resultsvariadic of ranked tensor of any type values

ttir.ne (tt::ttir::NotEqualOp)

Eltwise not equal to.

Eltwise not equal to operation.

Traits: AttrSizedOperandSegments

Interfaces: DestinationStyleOpInterface, TTIROpInterface, TTIR_ElementwiseOpInterface

Attributes:

AttributeMLIR TypeDescription
operand_constraints::mlir::ArrayAttr

Operands:

OperandDescription
inputsvariadic of ranked tensor of any type values
outputsvariadic of ranked tensor of any type values

Results:

ResultDescription
resultsvariadic of ranked tensor of any type values

ttir.pooling (tt::ttir::PoolingOp)

General pooling op

General pooling op

Traits: AttrSizedOperandSegments

Interfaces: DestinationStyleOpInterface, TTIROpInterface

Attributes:

AttributeMLIR TypeDescription
pooling_method::mlir::tt::ttir::PoolingMethodAttr
TTIR PoolingMethod. Enum cases: Average (`Average`), Max (`Max`).
window_dimensions::mlir::DenseI64ArrayAttri64 dense array attribute
window_strides::mlir::DenseI64ArrayAttri64 dense array attribute
base_dilations::mlir::DenseI64ArrayAttri64 dense array attribute
window_dilations::mlir::DenseI64ArrayAttri64 dense array attribute
padding::mlir::DenseI64ArrayAttri64 dense array attribute
operand_constraints::mlir::ArrayAttr

Operands:

OperandDescription
inputsvariadic of ranked tensor of any type values
outputsvariadic of ranked tensor of any type values

Results:

ResultDescription
«unnamed»variadic of ranked tensor of any type values

ttir.reciprocal (tt::ttir::ReciprocalOp)

Eltwise reciprocal.

Eltwise reciprocal operation.

Traits: AttrSizedOperandSegments

Interfaces: DestinationStyleOpInterface, TTIROpInterface, TTIR_ElementwiseOpInterface

Attributes:

AttributeMLIR TypeDescription
operand_constraints::mlir::ArrayAttr

Operands:

OperandDescription
inputsvariadic of ranked tensor of any type values
outputsvariadic of ranked tensor of any type values

Results:

ResultDescription
resultsvariadic of ranked tensor of any type values

ttir.relu (tt::ttir::ReluOp)

Eltwise ReLU.

Eltwise ReLU operation.

Traits: AttrSizedOperandSegments

Interfaces: DestinationStyleOpInterface, TTIROpInterface, TTIR_ElementwiseOpInterface

Attributes:

AttributeMLIR TypeDescription
operand_constraints::mlir::ArrayAttr

Operands:

OperandDescription
inputsvariadic of ranked tensor of any type values
outputsvariadic of ranked tensor of any type values

Results:

ResultDescription
resultsvariadic of ranked tensor of any type values

ttir.remainder (tt::ttir::RemainderOp)

Eltwise remainder.

Performs element-wise remainder of dividend lhs and divisor rhs tensors and produces a result tensor.

Example:

// %lhs: [17, -17, 17, -17]
// %rhs: [3, 3, -3, -3]
%result = "ttir.remainder"(%lhs, %rhs) : (tensor<4xi64>, tensor<4xi64>) -> tensor<4xi64>
// %result: [2, -2, 2, -2]

Traits: AttrSizedOperandSegments

Interfaces: DestinationStyleOpInterface, TTIROpInterface, TTIR_ElementwiseOpInterface

Attributes:

AttributeMLIR TypeDescription
operand_constraints::mlir::ArrayAttr

Operands:

OperandDescription
inputsvariadic of ranked tensor of any type values
outputsvariadic of ranked tensor of any type values

Results:

ResultDescription
resultsvariadic of ranked tensor of any type values

ttir.reshape (tt::ttir::ReshapeOp)

Reshape op.

Reshape tensor.
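
For example, a hedged sketch flattening a 2x3 tensor into 6 elements (operand_constraints omitted); the element count of the target shape must match that of the input:

%0 = tensor.empty() : tensor<6xf32>
%1 = "ttir.reshape"(%arg0, %0) <{shape = [6 : i32]}> : (tensor<2x3xf32>, tensor<6xf32>) -> tensor<6xf32>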

Interfaces: DestinationStyleOpInterface, TTIROpInterface

Attributes:

AttributeMLIR TypeDescription
shape::mlir::ArrayAttr32-bit integer array attribute
operand_constraints::mlir::ArrayAttr

Operands:

OperandDescription
inputranked tensor of any type values
outputranked tensor of any type values

Results:

ResultDescription
resultranked tensor of any type values

ttir.rsqrt (tt::ttir::RsqrtOp)

Eltwise reciprocal square root.

Eltwise reciprocal square root operation.

Traits: AttrSizedOperandSegments

Interfaces: DestinationStyleOpInterface, TTIROpInterface, TTIR_ElementwiseOpInterface

Attributes:

AttributeMLIR TypeDescription
operand_constraints::mlir::ArrayAttr

Operands:

OperandDescription
inputsvariadic of ranked tensor of any type values
outputsvariadic of ranked tensor of any type values

Results:

ResultDescription
resultsvariadic of ranked tensor of any type values

ttir.sigmoid (tt::ttir::SigmoidOp)

Eltwise sigmoid.

Eltwise sigmoid operation.

Traits: AttrSizedOperandSegments

Interfaces: DestinationStyleOpInterface, TTIROpInterface, TTIR_ElementwiseOpInterface

Attributes:

AttributeMLIR TypeDescription
operand_constraints::mlir::ArrayAttr

Operands:

OperandDescription
inputsvariadic of ranked tensor of any type values
outputsvariadic of ranked tensor of any type values

Results:

ResultDescription
resultsvariadic of ranked tensor of any type values

ttir.sign (tt::ttir::SignOp)

Eltwise sign operation.

Returns the sign of the operand element-wise and produces a result tensor.

Example: %a: [[3, -2, 0], [1, -4, 4]] "ttir.sign"(%a, %out) -> %out: [[1, -1, 0], [1, -1, 1]]

Traits: AttrSizedOperandSegments

Interfaces: DestinationStyleOpInterface, TTIROpInterface, TTIR_ElementwiseOpInterface

Attributes:

AttributeMLIR TypeDescription
operand_constraints::mlir::ArrayAttr

Operands:

OperandDescription
inputsvariadic of ranked tensor of any type values
outputsvariadic of ranked tensor of any type values

Results:

ResultDescription
resultsvariadic of ranked tensor of any type values

ttir.sin (tt::ttir::SinOp)

Eltwise sine.

Eltwise sine operation.

Traits: AttrSizedOperandSegments

Interfaces: DestinationStyleOpInterface, TTIROpInterface, TTIR_ElementwiseOpInterface

Attributes:

AttributeMLIR TypeDescription
operand_constraints::mlir::ArrayAttr

Operands:

OperandDescription
inputsvariadic of ranked tensor of any type values
outputsvariadic of ranked tensor of any type values

Results:

ResultDescription
resultsvariadic of ranked tensor of any type values

ttir.slice (tt::ttir::SliceOp)

Slice op.

Extract a sub-tensor (slice) from the input tensor across one or more dimensions. The begins, ends, and step attributes specify the start, stop, and step indices for each dimension of the tensor.
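
A hedged sketch taking the first two rows and every other column of a 4x4 tensor (shapes illustrative; operand_constraints omitted):

%0 = tensor.empty() : tensor<2x2xf32>
%1 = "ttir.slice"(%arg0, %0) <{begins = [0 : i32, 0 : i32], ends = [2 : i32, 4 : i32], step = [1 : i32, 2 : i32]}> : (tensor<4x4xf32>, tensor<2x2xf32>) -> tensor<2x2xf32>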

Interfaces: DestinationStyleOpInterface, TTIROpInterface

Attributes:

AttributeMLIR TypeDescription
begins::mlir::ArrayAttr32-bit integer array attribute
ends::mlir::ArrayAttr32-bit integer array attribute
step::mlir::ArrayAttr32-bit integer array attribute
operand_constraints::mlir::ArrayAttr

Operands:

OperandDescription
inputranked tensor of any type values
outputranked tensor of any type values

Results:

ResultDescription
resultranked tensor of any type values

ttir.softmax (tt::ttir::SoftmaxOp)

Softmax operation.

Softmax operation.
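
For each 1-D slice taken along the dimension attribute, the result is softmax(x)_i = exp(x_i) / sum_j exp(x_j), so the values along that dimension are non-negative and sum to 1.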

Interfaces: DestinationStyleOpInterface, TTIROpInterface

Attributes:

AttributeMLIR TypeDescription
dimension::mlir::IntegerAttr32-bit signed integer attribute
operand_constraints::mlir::ArrayAttr

Operands:

OperandDescription
inputranked tensor of any type values
outputranked tensor of any type values

Results:

ResultDescription
resultranked tensor of any type values

ttir.sqrt (tt::ttir::SqrtOp)

Eltwise square root.

Eltwise square root operation.

Traits: AttrSizedOperandSegments

Interfaces: DestinationStyleOpInterface, TTIROpInterface, TTIR_ElementwiseOpInterface

Attributes:

AttributeMLIR TypeDescription
operand_constraints::mlir::ArrayAttr

Operands:

OperandDescription
inputsvariadic of ranked tensor of any type values
outputsvariadic of ranked tensor of any type values

Results:

ResultDescription
resultsvariadic of ranked tensor of any type values

ttir.squeeze (tt::ttir::SqueezeOp)

Squeeze op.

Squeeze tensor.
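
For example, a hedged sketch removing a leading singleton dimension (shapes illustrative; operand_constraints omitted):

%0 = tensor.empty() : tensor<64x128xf32>
%1 = "ttir.squeeze"(%arg0, %0) <{dim = 0 : i32}> : (tensor<1x64x128xf32>, tensor<64x128xf32>) -> tensor<64x128xf32>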

Interfaces: DestinationStyleOpInterface, TTIROpInterface

Attributes:

AttributeMLIR TypeDescription
dim::mlir::IntegerAttr32-bit signed integer attribute
operand_constraints::mlir::ArrayAttr

Operands:

OperandDescription
inputranked tensor of any type values
outputranked tensor of any type values

Results:

ResultDescription
resultranked tensor of any type values

ttir.subtract (tt::ttir::SubtractOp)

Eltwise subtract.

Eltwise subtract operation.

Traits: AttrSizedOperandSegments

Interfaces: DestinationStyleOpInterface, TTIROpInterface, TTIR_ElementwiseOpInterface

Attributes:

AttributeMLIR TypeDescription
operand_constraints::mlir::ArrayAttr

Operands:

OperandDescription
inputsvariadic of ranked tensor of any type values
outputsvariadic of ranked tensor of any type values

Results:

ResultDescription
resultsvariadic of ranked tensor of any type values

ttir.sum (tt::ttir::SumOp)

Sum reduction op.

Sum reduction op.
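
A hedged sketch reducing over the last dimension with keep_dim = false (shapes illustrative; operand_constraints omitted):

%0 = tensor.empty() : tensor<64xf32>
%1 = "ttir.sum"(%arg0, %0) <{dim_arg = [1 : i32], keep_dim = false}> : (tensor<64x128xf32>, tensor<64xf32>) -> tensor<64xf32>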

Interfaces: DestinationStyleOpInterface, TTIROpInterface, TTIR_GenericRegionOpInterface

Attributes:

AttributeMLIR TypeDescription
keep_dim::mlir::BoolAttrbool attribute
dim_arg::mlir::ArrayAttr32-bit integer array attribute
operand_constraints::mlir::ArrayAttr

Operands:

OperandDescription
inputranked tensor of any type values
outputranked tensor of any type values

Results:

ResultDescription
resultranked tensor of any type values

ttir.to_layout (tt::ttir::ToLayoutOp)

Layout op.

ToLayout operation: transitions tensors from one layout to another. Some examples include:

  • Transitioning between different memory spaces, e.g. DRAM to L1.
  • Transitioning between different data types, e.g. f32 to f16.
  • Transitioning between different tile sizes, e.g. 1x16 to 32x32.
  • Transitioning between different tensor shardings.
  • Some combination of the above.
#layout = #tt.layout<8192x128x1, undef, <1x1>, memref<64x128xf32, #system>>
#layout1 = #tt.layout<8192x128x1, undef, <1x1>, memref<64x128xf32, #l1_>>
%1 = "ttir.to_layout"(%arg0, %0) : (tensor<64x128xf32, #layout>, tensor<64x128xf32, #layout1>) -> tensor<64x128xf32, #layout1>

Interfaces: DestinationStyleOpInterface, TTIROpInterface

Operands:

OperandDescription
inputranked tensor of any type values
outputranked tensor of any type values

Results:

ResultDescription
resultranked tensor of any type values

ttir.transpose (tt::ttir::TransposeOp)

Transpose op.

Transpose tensor along two given dimensions.
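
For example, a hedged sketch swapping the two dimensions of a 2x3 tensor (operand_constraints omitted):

%0 = tensor.empty() : tensor<3x2xf32>
%1 = "ttir.transpose"(%arg0, %0) <{dim0 = 0 : i32, dim1 = 1 : i32}> : (tensor<2x3xf32>, tensor<3x2xf32>) -> tensor<3x2xf32>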

Interfaces: DestinationStyleOpInterface, TTIROpInterface

Attributes:

AttributeMLIR TypeDescription
dim0::mlir::IntegerAttr32-bit signed integer attribute
dim1::mlir::IntegerAttr32-bit signed integer attribute
operand_constraints::mlir::ArrayAttr

Operands:

OperandDescription
inputranked tensor of any type values
outputranked tensor of any type values

Results:

ResultDescription
resultranked tensor of any type values

ttir.typecast (tt::ttir::TypecastOp)

Eltwise cast op.

Eltwise cast operation.

Traits: AttrSizedOperandSegments

Interfaces: DestinationStyleOpInterface, TTIROpInterface, TTIR_ElementwiseOpInterface

Attributes:

AttributeMLIR TypeDescription
operand_constraints::mlir::ArrayAttr

Operands:

OperandDescription
inputsvariadic of ranked tensor of any type values
outputsvariadic of ranked tensor of any type values

Results:

ResultDescription
resultsvariadic of ranked tensor of any type values

ttir.unsqueeze (tt::ttir::UnsqueezeOp)

Unsqueeze op.

Unsqueeze tensor.

Interfaces: DestinationStyleOpInterface, TTIROpInterface

Attributes:

AttributeMLIR TypeDescription
dim::mlir::IntegerAttr32-bit signed integer attribute
operand_constraints::mlir::ArrayAttr

Operands:

OperandDescription
inputranked tensor of any type values
outputranked tensor of any type values

Results:

ResultDescription
resultranked tensor of any type values

ttir.where (tt::ttir::WhereOp)

Eltwise where op.

Eltwise where operation.

Traits: AttrSizedOperandSegments

Interfaces: DestinationStyleOpInterface, TTIROpInterface, TTIR_ElementwiseOpInterface

Attributes:

AttributeMLIR TypeDescription
operand_constraints::mlir::ArrayAttr

Operands:

OperandDescription
inputsvariadic of ranked tensor of any type values
outputsvariadic of ranked tensor of any type values

Results:

ResultDescription
resultsvariadic of ranked tensor of any type values

ttir.yield (tt::ttir::YieldOp)

Yield op.

Yield operation, this is required by MLIR to mark the end of a dispatch region.

Traits: AlwaysSpeculatableImplTrait, ReturnLike, Terminator

Interfaces: ConditionallySpeculatable, NoMemoryEffect (MemoryEffectOpInterface), RegionBranchTerminatorOpInterface

Effects: MemoryEffects::Effect{}

Operands:

OperandDescription
valuesvariadic of ranked tensor of any type values or non-0-ranked.memref of any type values

'ttkernel' Dialect

A TTKernel out-of-tree MLIR dialect. This dialect is an example of an out-of-tree MLIR dialect designed to illustrate the basic setup required to develop MLIR-based tools without working inside of the LLVM source tree.

[TOC]

EthernetConfigAttr

TT EthernetConfig attribute

Syntax:

#ttkernel.ethernet_config<
  EthType,   # eth_type
  NocIndex   # noc_index
>

TT ethernet_config attribute

Parameters:

ParameterC++ typeDescription
eth_typeEthType
noc_indexNocIndex

NocConfigAttr

TT NocConfig attribute

Syntax:

#ttkernel.noc_config<
  NocIndex   # noc_index
>

TT noc_config attribute

Parameters:

ParameterC++ typeDescription
noc_indexNocIndex

ReduceDimAttr

TTKernel Reduce Dimensions

Syntax:

#ttkernel.reduce_dim<
  ::mlir::tt::ttkernel::ReduceDim   # value
>

Enum cases:

  • reduce_dim_row (Row)
  • reduce_dim_col (Col)
  • reduce_dim_scalar (Scalar)
  • reduce_dim_none (None)

Parameters:

ParameterC++ typeDescription
value::mlir::tt::ttkernel::ReduceDiman enum of type ReduceDim

ReduceTypeAttr

TTKernel Reduce Types

Syntax:

#ttkernel.reduce_type<
  ::mlir::tt::ttkernel::ReduceType   # value
>

Enum cases:

  • reduce_sum (Sum)
  • reduce_max (Max)

Parameters:

ParameterC++ typeDescription
value::mlir::tt::ttkernel::ReduceTypean enum of type ReduceType

TensixConfigAttr

TT TensixConfig attribute

Syntax:

#ttkernel.tensix_config<
  MathFidelity,   # math_fidelity
  bool,   # fp32_dest_acc_en
  bool,   # math_approx_mode
  ::llvm::ArrayRef<UnpackToDestMode>   # unpack_to_dest_mode
>

TT compute_desc attribute

Parameters:

ParameterC++ typeDescription
math_fidelityMathFidelity
fp32_dest_acc_enbool
math_approx_modebool
unpack_to_dest_mode::llvm::ArrayRef<UnpackToDestMode>

ThreadTypeAttr

TTKernel ThreadTypes

Syntax:

#ttkernel.thread<
  ::mlir::tt::ttkernel::ThreadType   # value
>

Enum cases:

  • noc (Noc)
  • tensix (Tensix)
  • ethernet (Ethernet)

Parameters:

ParameterC++ typeDescription
value::mlir::tt::ttkernel::ThreadTypean enum of type ThreadType

ttkernel.add (tt::ttkernel::AddOp)

Add operation

Add operation

Operands:

OperandDescription
dst_index32-bit signless integer

ttkernel.add_tiles_init (tt::ttkernel::AddTilesInitOp)

Short init function

Must be run before add_tiles.

Operands:

OperandDescription
in0_cbTTKernel cb
in1_cbTTKernel cb

ttkernel.add_tiles (tt::ttkernel::AddTilesOp)

Add operation

Performs element-wise addition C=A+B of tiles in two CBs at given indices and writes the result to the DST register at index dst_tile_index. The DST register buffer must be in acquired state via tile_regs_acquire call. This call is blocking and is only available on the compute engine.
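
Putting the surrounding calls together, a hedged sketch of a single-tile add on the compute engine. The CB values %in0_cb and %in1_cb and the i32 constant %c0 are assumed to be defined elsewhere, and !cb below stands in for a fully spelled-out !ttkernel.cb<...> type:

"ttkernel.add_tiles_init"(%in0_cb, %in1_cb) : (!cb, !cb) -> ()   // configure compute for add
"ttkernel.tile_regs_acquire"() : () -> ()                        // lock DST for the MATH thread
"ttkernel.add_tiles"(%in0_cb, %in1_cb, %c0, %c0, %c0) : (!cb, !cb, i32, i32, i32) -> ()
"ttkernel.tile_regs_commit"() : () -> ()                         // hand DST over to the PACK thread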

Operands:

OperandDescription
in0_cbTTKernel cb
in1_cbTTKernel cb
in0_tile_index32-bit signless integer
in1_tile_index32-bit signless integer
dst_index32-bit signless integer

ttkernel.binary_op_init_common (tt::ttkernel::BinaryOpInitCommonOp)

Init function for all binary ops

Followed by the specific init required with an opcode (binary_op_specific_init).

Operands:

OperandDescription
in0_cbTTKernel cb
in1_cbTTKernel cb
out_cbTTKernel cb

ttkernel.builtin (tt::ttkernel::BuiltinOp)

Builtin call.

Kernel operation

Attributes:

AttributeMLIR TypeDescription
op::mlir::FlatSymbolRefAttrflat symbol reference attribute
kind::mlir::FlatSymbolRefAttrflat symbol reference attribute

Operands:

OperandDescription
argsvariadic of non-0-ranked.memref of any type values or TTKernel cb

ttkernel.cb_pop_front (tt::ttkernel::CBPopFrontOp)

CBPopFront call.

CBPopFront operation

Operands:

OperandDescription
cbTTKernel cb
numPages32-bit signless integer

ttkernel.cb_push_back (tt::ttkernel::CBPushBackOp)

CBPushBack call.

CBPushBack operation

Operands:

OperandDescription
cbTTKernel cb
numPages32-bit signless integer

ttkernel.cb_reserve_back (tt::ttkernel::CBReserveBackOp)

CBReserveBack call.

CBReserveBack operation

Operands:

OperandDescription
cbTTKernel cb
numPages32-bit signless integer

ttkernel.cb_wait_front (tt::ttkernel::CBWaitFrontOp)

CBWaitFront call.

CBWaitFront operation

Operands:

OperandDescription
cbTTKernel cb
numPages32-bit signless integer

ttkernel.reinterpret_cast<volatile tt_l1_ptr uint32_t*> (tt::ttkernel::CastToL1PtrOp)

CastToL1Ptr

Cast specified addr to L1 pointer.

Interfaces: InferTypeOpInterface

Operands:

OperandDescription
addr32-bit signless integer or TTKernel l1 address

Results:

ResultDescription
l1_ptrTTKernel l1 address pointer

ttkernel.copy_tile_init (tt::ttkernel::CopyTileInitOp)

Performs the init for copy_tile. This does not reconfigure the unpacker data types.

Must be called before copy_tile.

ttkernel.copy_tile (tt::ttkernel::CopyTileOp)

Copy tile from specified CB to DST.

Copies a single tile from the specified input CB and writes the result to DST at the specified index. The function uses the unpacker to first unpack the tile into SRC registers and then moves it into DST registers at the specified index. For the in_tile_index to be valid for this call, cb_wait_front(n) must previously have been called to ensure that at least some number n > 0 of tiles are available in the input CB. CB index 0 then references the first tile in the received section of the CB, up to index n-1 (in FIFO order). The DST register buffer must be in the acquired state via a tile_regs_acquire call. This call is blocking and is only available on the compute engine.

Operands:

OperandDescription
cb0TTKernel cb
tile_index_cb32-bit signless integer
tile_index_dst32-bit signless integer

ttkernel.exp_tile_init (tt::ttkernel::ExpTileInitOp)

Short init function that configures the compute unit for execution of exp_tile.

Must be run before exp_tile.

ttkernel.exp_tile (tt::ttkernel::ExpTileOp)

Exp operation

Performs element-wise computation of exponential on each element of a tile in DST register at index tile_index. The DST register buffer must be in acquired state via tile_regs_acquire call. This call is blocking and is only available on the compute engine.

Operands:

OperandDescription
tile_index32-bit signless integer

ttkernel.get_noc_addr (tt::ttkernel::GetNocAddrOp)

GetNocAddr

GetNocAddr

Interfaces: InferTypeOpInterface

Operands:

OperandDescription
l1Address32-bit signless integer

Results:

ResultDescription
nocAddrTTKernel noc address

ttkernel.get_noc_addr_xy (tt::ttkernel::GetNocAddrXYOp)

GetNocAddrXY

GetNocAddr api including core coordinates

Interfaces: InferTypeOpInterface

Operands:

OperandDescription
x32-bit signless integer
y32-bit signless integer
l1Address32-bit signless integer

Results:

ResultDescription
nocAddrTTKernel noc address

ttkernel.get_write_ptr (tt::ttkernel::GetWritePtrOp)

GetWritePtr

GetWritePtr operation

Interfaces: InferTypeOpInterface

Operands:

OperandDescription
cbTTKernel cb

Results:

ResultDescription
writePtr32-bit signless integer

ttkernel.matmul (tt::ttkernel::MatmulOp)

Matmul operation

Matmul operation

Operands:

OperandDescription
dst_index32-bit signless integer

ttkernel.mem_zeros_base (tt::ttkernel::MemZerosBaseOp)

Op corresponding to MEM_ZEROS_BASE macro in kernels.

Op corresponding to MEM_ZEROS_BASE macro in kernels.

Interfaces: InferTypeOpInterface

Results:

ResultDescription
result32-bit signless integer

ttkernel.mem_zeros_size (tt::ttkernel::MemZerosSizeOp)

Op corresponding to MEM_ZEROS_SIZE macro in kernels.

Op corresponding to MEM_ZEROS_SIZE macro in kernels.

Interfaces: InferTypeOpInterface

Results:

ResultDescription
result32-bit signless integer

ttkernel.mul (tt::ttkernel::MulOp)

Mul operation

Mul operation

Operands:

OperandDescription
dst_index32-bit signless integer

ttkernel.mul_tiles_init_f (tt::ttkernel::MulTilesInitFOp)

Short init function. Init for math only.

Must be run before mul_tiles.

ttkernel.mul_tiles_init (tt::ttkernel::MulTilesInitOp)

Short init function

Must be run before mul_tiles.

Operands:

OperandDescription
in0_cbTTKernel cb
in1_cbTTKernel cb

ttkernel.mul_tiles (tt::ttkernel::MulTilesOp)

Mul operation

Performs element-wise multiplication C=A*B of tiles in two CBs at given indices and writes the result to the DST register at index dst_tile_index. The DST register buffer must be in acquired state via tile_regs_acquire call. This call is blocking and is only available on the compute engine.

Operands:

OperandDescription
in0_cbTTKernel cb
in1_cbTTKernel cb
in0_tile_index32-bit signless integer
in1_tile_index32-bit signless integer
dst_index32-bit signless integer

ttkernel.noc_async_read_barrier (tt::ttkernel::NocAsyncReadBarrierOp)

NocAsyncReadBarrier

NocAsyncReadBarrier

ttkernel.noc_async_read_one_packet_set_state (tt::ttkernel::NocAsyncReadOnePacketSetStateOp)

NocAsyncReadOnePacketSetState

NocAsyncReadOnePacketSetState

Operands:

OperandDescription
srcNocAddrTTKernel noc address
size32-bit signless integer

ttkernel.noc_async_read_one_packet_with_state (tt::ttkernel::NocAsyncReadOnePacketWithStateOp)

NocAsyncReadOnePacketWithState

NocAsyncReadOnePacketWithState

Operands:

OperandDescription
srcNocAddrTTKernel noc address
dstLocalL1Addr32-bit signless integer or TTKernel l1 address

ttkernel.noc_async_read (tt::ttkernel::NocAsyncReadOp)

NocAsyncRead

NocAsyncRead

Operands:

OperandDescription
srcNocAddrTTKernel noc address
dstLocalL1Addr32-bit signless integer
size32-bit signless integer

ttkernel.noc_async_write_barrier (tt::ttkernel::NocAsyncWriteBarrierOp)

NocAsyncWriteBarrier

NocAsyncWriteBarrier

ttkernel.noc_async_write (tt::ttkernel::NocAsyncWriteOp)

NocAsyncWrite

NocAsyncWrite

Operands:

OperandDescription
srcLocalL1Addr32-bit signless integer
dstNocAddrTTKernel noc address
size32-bit signless integer

ttkernel.pack (tt::ttkernel::PackOp)

Pack op.

Pack operation

Operands:

OperandDescription
dst_index32-bit signless integer
out_cbTTKernel cb
out_index32-bit signless integer

ttkernel.pack_set_data_type (tt::ttkernel::PackSetDataTypeOp)

Pack set DataType op.

Pack set DataType operation

Attributes:

AttributeMLIR TypeDescription
data_type::mlir::IntegerAttr
TT DataTypes. Enum cases: f32 (`Float32`), f16 (`Float16`), bf16 (`BFloat16`), bfp_f8 (`BFP_Float8`), bfp_bf8 (`BFP_BFloat8`), bfp_f4 (`BFP_Float4`), bfp_bf4 (`BFP_BFloat4`), bfp_f2 (`BFP_Float2`), bfp_bf2 (`BFP_BFloat2`), u32 (`UInt32`), u16 (`UInt16`), u8 (`UInt8`).

ttkernel.pack_tile (tt::ttkernel::PackTileOp)

PackTile op.

Copies a single tile from the DST register buffer at a specified index to a specified CB at a given index. For the out_tile_index to be valid for this call, cb_reserve_back(n) has to be called first to reserve at least some number n > 0 of tiles in the output CB. out_tile_index = 0 then references the first tile in the reserved section of the CB, up to index n - 1, which will then be visible to the consumer in the same order after a cb_push_back call. The DST register buffer must be in acquired state via tile_regs_acquire call. This call is blocking and is only available on the compute engine.

Each subsequent pack call will increment the write pointer in the CB by a single tile size. The pointer is then again set to a valid position with space for n reserved tiles by another cb_reserve_back call.

Operates in tandem with functions cb_reserve_back and cb_push_back.

A typical use case is first the producer ensures that there is a number of tiles available in the buffer via cb_reserve_back, then the producer uses the pack_tile call to copy a tile from one of DST slots to a slot in reserved space and finally cb_push_back is called to announce visibility of the reserved section of the circular buffer to the consumer.

Operands:

OperandDescription
dst_index32-bit signless integer
out_cbTTKernel cb
out_index32-bit signless integer

ttkernel.recip_tile_init (tt::ttkernel::RecipTileInitOp)

Init function for recip_tile operation. Refer to documentation for any init function.

Must be called before recip_tile function.

ttkernel.recip_tile (tt::ttkernel::RecipTileOp)

Recip tile in the DST at specified index.

Performs element-wise computation of the reciprocal on each element of a tile in the DST register at index tile_index. The DST register buffer must be in the acquired state via a tile_regs_acquire call. This call is blocking and is only available on the compute engine. Full accuracy is only supported for the Float32, Float16_b, and Bfp8_b data formats.

Operands:

OperandDescription
tile_index32-bit signless integer

ttkernel.reduce_init (tt::ttkernel::ReduceInitOp)

Init function

Must be run before reduce_tile.

Attributes:

AttributeMLIR TypeDescription
reduce_type::mlir::tt::ttkernel::ReduceTypeAttr
TTKernel Reduce Types. Enum cases: reduce_sum (`Sum`), reduce_max (`Max`).
reduce_dim::mlir::tt::ttkernel::ReduceDimAttr
TTKernel Reduce Dimensions. Enum cases: reduce_dim_row (`Row`), reduce_dim_col (`Col`), reduce_dim_scalar (`Scalar`), reduce_dim_none (`None`).

Operands:

OperandDescription
in_cbTTKernel cb
scaling_cbTTKernel cb
out_cbTTKernel cb

ttkernel.reduce_tile (tt::ttkernel::ReduceTileOp)

Reduce operation

Performs a reduction operation B = reduce(A) using reduce_func for dimension reduction on a tile in the CB at a given index and writes the result to the DST register at index dst_tile_index. The reduction can be of type Reduce::R, Reduce::C, or Reduce::RC, identifying the dimension(s) to be reduced in size to 1. The DST register buffer must be in the acquired state via a tile_regs_acquire call. The template takes reduce_type, which can be ReduceFunc::Sum or ReduceFunc::Max, and reduce_dim, which can be Reduce::R, Reduce::C, or Reduce::RC; they can also be specified via the defines REDUCE_OP and REDUCE_DIM. This call is blocking and is only available on the compute engine.

Attributes:

AttributeMLIR TypeDescription
reduce_type::mlir::tt::ttkernel::ReduceTypeAttr
TTKernel Reduce Types{{% markdown %}}Enum cases: * reduce_sum (`Sum`) * reduce_max (`Max`){{% /markdown %}}
reduce_dim::mlir::tt::ttkernel::ReduceDimAttr
TTKernel Reduce Dimensions{{% markdown %}}Enum cases: * reduce_dim_row (`Row`) * reduce_dim_col (`Col`) * reduce_dim_scalar (`Scalar`) * reduce_dim_none (`None`){{% /markdown %}}

Operands:

OperandDescription
in_cbTTKernel cb
scaling_cbTTKernel cb
in_tile_index32-bit signless integer
scaling_tile_index32-bit signless integer
dst_index32-bit signless integer

ttkernel.return (tt::ttkernel::ReturnOp)

Return op.

Return operation

Traits: AlwaysSpeculatableImplTrait, ReturnLike, Terminator

Interfaces: ConditionallySpeculatable, NoMemoryEffect (MemoryEffectOpInterface), RegionBranchTerminatorOpInterface

Effects: MemoryEffects::Effect{}

ttkernel.store_to_l1 (tt::ttkernel::StoreToL1Op)

StoreToL1

Store value to L1.

Operands:

OperandDescription
value32-bit signless integer
l1_ptrTTKernel l1 address pointer
offset32-bit signless integer

ttkernel.sub (tt::ttkernel::SubOp)

Sub operation

Sub operation

Operands:

OperandDescription
dst_index32-bit signless integer

ttkernel.tile_regs_acquire (tt::ttkernel::TileRegsAcquireOp)

Tile_regs_acquire

Acquire an exclusive lock on the DST register for the MATH thread. This register is an array of 16 tiles of 32x32 elements each. This is a blocking function, i.e. this function will wait until the lock is acquired.

ttkernel.tile_regs_commit (tt::ttkernel::TileRegsCommitOp)

Tile_regs_commit

Release lock on DST register by MATH thread. The lock had to be previously acquired with tile_regs_acquire.

ttkernel.tile_regs_release (tt::ttkernel::TileRegsReleaseOp)

Tile_regs_release

Release lock on DST register by PACK thread. The lock had to be previously acquired with tile_regs_wait.

ttkernel.tile_regs_wait (tt::ttkernel::TileRegsWaitOp)

Tile_regs_wait

Acquire an exclusive lock on the DST register for the PACK thread. It waits for the MATH thread to commit the DST register. This is a blocking function, i.e. this function will wait until the lock is acquired.
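
Taken together, the four tile_regs ops form a two-thread handshake over the DST register; a minimal sketch of the ordering, with the compute and pack bodies elided:

"ttkernel.tile_regs_acquire"() : () -> ()   // MATH thread locks DST
// ... math ops (e.g. add_tiles) write tiles into DST ...
"ttkernel.tile_regs_commit"() : () -> ()    // MATH thread releases its lock
"ttkernel.tile_regs_wait"() : () -> ()      // PACK thread waits for the commit, then locks DST
// ... pack_tile copies DST tiles into an output CB ...
"ttkernel.tile_regs_release"() : () -> ()   // PACK thread releases its lock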

ttkernel.tilize_block (tt::ttkernel::TilizeBlockOp)

TilizeBlockOp call.

TilizeBlockOp operation

Operands:

OperandDescription
cbInTTKernel cb
numTiles32-bit signless integer
cbOutTTKernel cb

ttkernel.tilize_init (tt::ttkernel::TilizeInitOp)

TilizeInitOp call.

TilizeInitOp operation

Operands:

OperandDescription
cbInTTKernel cb
numTiles32-bit signless integer
cbOutTTKernel cb

ttkernel.unary_op_init_common (tt::ttkernel::UnaryOpInitCommonOp)

Initialization function for unary operations.

This operation initializes all necessary components for unary operations, including unpacking, packing, and math configurations.

Operands:

OperandDescription
icbTTKernel cb
ocbTTKernel cb

ttkernel.unpack_ab (tt::ttkernel::UnpackABOp)

UnpackAB op.

UnpackAB operation

Operands:

OperandDescription
cb_aTTKernel cb
src_a_index32-bit signless integer
cb_bTTKernel cb
src_b_index32-bit signless integer

ttkernel.unpack_a (tt::ttkernel::UnpackAOp)

UnpackA op.

UnpackA operation

Operands:

OperandDescription
cbTTKernel cb
src_index32-bit signless integer

ttkernel.unpack_set_data_type (tt::ttkernel::UnpackSetDataTypeOp)

Unpack set DataType op.

Unpack set DataType operation

Attributes:

AttributeMLIR TypeDescription
data_type_a::mlir::IntegerAttr
TT DataTypes. Enum cases: f32 (`Float32`), f16 (`Float16`), bf16 (`BFloat16`), bfp_f8 (`BFP_Float8`), bfp_bf8 (`BFP_BFloat8`), bfp_f4 (`BFP_Float4`), bfp_bf4 (`BFP_BFloat4`), bfp_f2 (`BFP_Float2`), bfp_bf2 (`BFP_BFloat2`), u32 (`UInt32`), u16 (`UInt16`), u8 (`UInt8`).
data_type_b::mlir::IntegerAttr
TT DataTypes. Enum cases: f32 (`Float32`), f16 (`Float16`), bf16 (`BFloat16`), bfp_f8 (`BFP_Float8`), bfp_bf8 (`BFP_BFloat8`), bfp_f4 (`BFP_Float4`), bfp_bf4 (`BFP_BFloat4`), bfp_f2 (`BFP_Float2`), bfp_bf2 (`BFP_BFloat2`), u32 (`UInt32`), u16 (`UInt16`), u8 (`UInt8`).

ttkernel.unreachable (tt::ttkernel::UnreachableOp)

Unreachable op.

Unreachable operation

Traits: AlwaysSpeculatableImplTrait, ReturnLike, Terminator

Interfaces: ConditionallySpeculatable, NoMemoryEffect (MemoryEffectOpInterface), RegionBranchTerminatorOpInterface

Effects: MemoryEffects::Effect{}

ttkernel.untilize_block (tt::ttkernel::UntilizeBlockOp)

UntilizeBlockOp call.

UntilizeBlockOp operation

Operands:

OperandDescription
cbInTTKernel cb
numTiles32-bit signless integer
cbOutTTKernel cb

ttkernel.untilize_init (tt::ttkernel::UntilizeInitOp)

UntilizeInitOp call.

UntilizeInitOp operation

Operands:

OperandDescription
cbInTTKernel cb
cbOutTTKernel cb

CBType

TTKernel cb

Syntax:

!ttkernel.cb<
  CBPort,   # port
  uint64_t,   # address
  MemRefType,   # memref
  uint64_t,   # page_size
  uint64_t   # num_buffers
>

Circular buffer type in TTKernel dialect

Parameters:

ParameterC++ typeDescription
portCBPort
addressuint64_t
memrefMemRefType
page_sizeuint64_t
num_buffersuint64_t

L1AddrType

TTKernel l1 address

Syntax: !ttkernel.l1_addr

L1 address type in TTKernel dialect

L1AddrPtrType

TTKernel l1 address pointer

Syntax: !ttkernel.l1_addr_ptr

L1 pointer address type in TTKernel dialect

NocAddrType

TTKernel noc address

Syntax: !ttkernel.noc_addr

Noc address type in TTKernel dialect

'ttmetal' Dialect

A TTMetal out-of-tree MLIR dialect. This dialect is an example of an out-of-tree MLIR dialect designed to illustrate the basic setup required to develop MLIR-based tools without working inside of the LLVM source tree.

[TOC]

CoreRangeAttr

TTMetal grid attribute

Syntax:

#ttmetal.core_range<
  ::llvm::ArrayRef<int64_t>,   # offset
  ::llvm::ArrayRef<int64_t>   # size
>

TTMetal grid attribute

Parameters:

ParameterC++ typeDescription
offset::llvm::ArrayRef<int64_t>
size::llvm::ArrayRef<int64_t>

ttmetal.alloc (tt::ttmetal::AllocOp)

Alloc op.

Tensor Alloc operation

Attributes:

AttributeMLIR TypeDescription
address::mlir::IntegerAttr64-bit signless integer attribute
size::mlir::IntegerAttr64-bit signless integer attribute
memory_space::mlir::tt::MemorySpaceAttr
TT MemorySpace. Enum cases: system (`System`), mmio (`SystemMMIO`), dram (`DeviceDRAM`), l1 (`DeviceL1`).

Results:

ResultDescription
resultranked tensor of any type values

ttmetal.dealloc (tt::ttmetal::DeallocOp)

Dealloc op.

Tensor Dealloc operation

Operands:

OperandDescription
inputranked tensor of any type values

ttmetal.dispatch (tt::ttmetal::DispatchOp)

Dispatch op.

Dispatch operation

Traits: AttrSizedOperandSegments

Interfaces: DestinationStyleOpInterface

Attributes:

AttributeMLIR TypeDescription
core_ranges::mlir::ArrayAttr
kernelConfigs::mlir::ArrayAttr

Operands:

OperandDescription
inputsvariadic of ranked tensor of any type values
outputsvariadic of ranked tensor of any type values

Results:

ResultDescription
resultsvariadic of ranked tensor of any type values

ttmetal.host_read (tt::ttmetal::HostReadOp)

Host read op.

Host read operation

Interfaces: DestinationStyleOpInterface

Operands:

OperandDescription
inputranked tensor of any type values
outputranked tensor of any type values

Results:

ResultDescription
resultranked tensor of any type values

ttmetal.host_write (tt::ttmetal::HostWriteOp)

Host write op.

Host write operation

Interfaces: DestinationStyleOpInterface

Attributes:

AttributeMLIR TypeDescription
value::mlir::ElementsAttrconstant vector/tensor attribute

Operands:

OperandDescription
outputranked tensor of any type values

Results:

ResultDescription
resultranked tensor of any type values

'ttnn' Dialect

A TTNN out-of-tree MLIR dialect. This dialect is an example of an out-of-tree MLIR dialect designed to illustrate the basic setup required to develop MLIR-based tools without working inside of the LLVM source tree.

[TOC]

ttnn.leaky_relu (tt::ttnn::LeakyReluOp)

Eltwise leaky relu operation.

The Leaky ReLU (Rectified Linear Unit) operation computes an element-wise activation function over its input tensor. It is defined as:

y = x              if x > 0
y = parameter * x  if x <= 0

where parameter is a small, user-defined constant that determines the slope for negative inputs.

Attributes:

  • parameter (float): The slope for negative values.

Inputs:

  • input (Tensor): The input tensor to be activated.

Outputs:

  • output (Tensor): The tensor after applying the Leaky ReLU activation.

Traits: AttrSizedOperandSegments

Interfaces: DestinationStyleOpInterface, TTNN_OpModelInterface

Attributes:

AttributeMLIR TypeDescription
parameter::mlir::FloatAttr32-bit float attribute

Operands:

OperandDescription
inputsvariadic of ranked tensor of any type values
outputsvariadic of ranked tensor of any type values

Results:

ResultDescription
resultsvariadic of ranked tensor of any type values

ttnn.abs (tt::ttnn::AbsOp)

Eltwise absolute.

Eltwise absolute operation.

Traits: AttrSizedOperandSegments

Interfaces: DestinationStyleOpInterface, TTNN_OpModelInterface

Operands:

OperandDescription
inputsvariadic of ranked tensor of any type values
outputsvariadic of ranked tensor of any type values

Results:

ResultDescription
resultsvariadic of ranked tensor of any type values

ttnn.add (tt::ttnn::AddOp)

Eltwise add.

Eltwise add operation.

Traits: AttrSizedOperandSegments

Interfaces: DestinationStyleOpInterface, TTNN_OpModelInterface

Operands:

OperandDescription
inputsvariadic of ranked tensor of any type values
outputsvariadic of ranked tensor of any type values

Results:

ResultDescription
resultsvariadic of ranked tensor of any type values

ttnn.all_gather (tt::ttnn::AllGatherOp)

All gather op.

Tensor All Gather operation

Interfaces: TTNN_OpModelInterface

Attributes:

AttributeMLIR TypeDescription
dim::mlir::IntegerAttr32-bit signed integer attribute
num_links::mlir::IntegerAttr32-bit signed integer attribute

Operands:

OperandDescription
inputranked tensor of any type values

Results:

ResultDescription
resultranked tensor of any type values

ttnn.alloc (tt::ttnn::AllocOp)

Alloc op.

Tensor Alloc operation

Interfaces: TTNN_OpModelInterface

Attributes:

AttributeMLIR TypeDescription
address::mlir::IntegerAttr64-bit signless integer attribute
size::mlir::IntegerAttr64-bit signless integer attribute
buffer_type::mlir::tt::ttnn::BufferTypeAttr
TTNN Buffer Type. Enum cases: dram (`DRAM`), l1 (`L1`), system_memory (`SystemMemory`), l1_small (`L1Small`), trace (`Trace`).

Results:

ResultDescription
resultranked tensor of any type values

ttnn.cbrt (tt::ttnn::CbrtOp)

Eltwise cubic root.

Eltwise cubic root operation.

Traits: AttrSizedOperandSegments

Interfaces: DestinationStyleOpInterface, TTNN_OpModelInterface

Operands:

| Operand | Description |
| ------- | ----------- |
| `inputs` | variadic of ranked tensor of any type values |
| `outputs` | variadic of ranked tensor of any type values |

Results:

| Result | Description |
| ------ | ----------- |
| `results` | variadic of ranked tensor of any type values |

ttnn.ceil (tt::ttnn::CeilOp)

Eltwise ceil.

Eltwise ceil operation.

Traits: AttrSizedOperandSegments

Interfaces: DestinationStyleOpInterface, TTNN_OpModelInterface

Operands:

| Operand | Description |
| ------- | ----------- |
| `inputs` | variadic of ranked tensor of any type values |
| `outputs` | variadic of ranked tensor of any type values |

Results:

| Result | Description |
| ------ | ----------- |
| `results` | variadic of ranked tensor of any type values |

ttnn.clamp (tt::ttnn::ClampOp)

Clamp op.

Clamp tensor values to a specified range.

Example: input: [[0, 1, 2, 3, 4, 5, 6, 7]], min: 2.000000e+00, max: 5.000000e+00

"ttnn.clamp"(%arg0) <{min = 2.000000e+00 : f32, max = 5.000000e+00 : f32}> -> %out = [[2, 2, 2, 3, 4, 5, 5, 5]]

Interfaces: TTNN_OpModelInterface

Attributes:

| Attribute | MLIR Type | Description |
| --------- | --------- | ----------- |
| `min` | ::mlir::FloatAttr | 32-bit float attribute |
| `max` | ::mlir::FloatAttr | 32-bit float attribute |

Operands:

| Operand | Description |
| ------- | ----------- |
| `inputs` | variadic of ranked tensor of any type values |

Results:

| Result | Description |
| ------ | ----------- |
| `result` | variadic of ranked tensor of any type values |

ttnn.concat (tt::ttnn::ConcatOp)

Concat op.

Concat tensors along a given dimension.
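
As a hypothetical illustration (shapes and values chosen for demonstration):

Example: %a: [[1, 2]] %b: [[3, 4]] "ttnn.concat"(%a, %b, %out) <{dim = 0 : si32}> -> %out: [[1, 2], [3, 4]]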

Interfaces: DestinationStyleOpInterface, TTNN_OpModelInterface

Attributes:

| Attribute | MLIR Type | Description |
| --------- | --------- | ----------- |
| `dim` | ::mlir::IntegerAttr | 32-bit signed integer attribute |

Operands:

| Operand | Description |
| ------- | ----------- |
| `inputs` | variadic of ranked tensor of any type values |
| `output` | ranked tensor of any type values |

Results:

| Result | Description |
| ------ | ----------- |
| `result` | ranked tensor of any type values |

ttnn.conv2d (tt::ttnn::Conv2dOp)

Conv2d operation.

Applies a 2D convolution over an input image composed of several input planes.
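
As a hedged sketch only (attribute values are illustrative and the attribute list is abbreviated with "..."; see the attribute table below for the full required set):

Example: "ttnn.conv2d"(%input, %weight, %bias, %out, %device) <{batch_size = 1 : i32, in_channels = 64 : i32, out_channels = 64 : i32, kernel_height = 3 : i32, kernel_width = 3 : i32, stride_height = 1 : i32, stride_width = 1 : i32, ...}> -> %out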

Interfaces: DestinationStyleOpInterface, TTNN_OpModelInterface

Attributes:

| Attribute | MLIR Type | Description |
| --------- | --------- | ----------- |
| `in_channels` | ::mlir::IntegerAttr | 32-bit signless integer attribute |
| `out_channels` | ::mlir::IntegerAttr | 32-bit signless integer attribute |
| `batch_size` | ::mlir::IntegerAttr | 32-bit signless integer attribute |
| `input_height` | ::mlir::IntegerAttr | 32-bit signless integer attribute |
| `input_width` | ::mlir::IntegerAttr | 32-bit signless integer attribute |
| `kernel_height` | ::mlir::IntegerAttr | 32-bit signless integer attribute |
| `kernel_width` | ::mlir::IntegerAttr | 32-bit signless integer attribute |
| `stride_height` | ::mlir::IntegerAttr | 32-bit signless integer attribute |
| `stride_width` | ::mlir::IntegerAttr | 32-bit signless integer attribute |
| `padding_height` | ::mlir::IntegerAttr | 32-bit signless integer attribute |
| `padding_width` | ::mlir::IntegerAttr | 32-bit signless integer attribute |
| `dilation_height` | ::mlir::IntegerAttr | 32-bit signless integer attribute |
| `dilation_width` | ::mlir::IntegerAttr | 32-bit signless integer attribute |
| `groups` | ::mlir::IntegerAttr | 32-bit signless integer attribute |

Operands:

| Operand | Description |
| ------- | ----------- |
| `input` | ranked tensor of any type values |
| `weight` | ranked tensor of any type values |
| `bias` | ranked tensor of any type values |
| `output` | ranked tensor of any type values |
| `device` | TT device |

Results:

| Result | Description |
| ------ | ----------- |
| `result` | ranked tensor of any type values |

ttnn.cos (tt::ttnn::CosOp)

Eltwise cosine.

Eltwise cosine operation.

Traits: AttrSizedOperandSegments

Interfaces: DestinationStyleOpInterface, TTNN_OpModelInterface

Operands:

| Operand | Description |
| ------- | ----------- |
| `inputs` | variadic of ranked tensor of any type values |
| `outputs` | variadic of ranked tensor of any type values |

Results:

| Result | Description |
| ------ | ----------- |
| `results` | variadic of ranked tensor of any type values |

ttnn.deallocate (tt::ttnn::DeallocateOp)

Deallocate op.

Tensor Deallocate operation

Interfaces: TTNN_OpModelInterface

Attributes:

| Attribute | MLIR Type | Description |
| --------- | --------- | ----------- |
| `force` | ::mlir::BoolAttr | bool attribute |

Operands:

| Operand | Description |
| ------- | ----------- |
| `input` | ranked tensor of any type values |

ttnn.div (tt::ttnn::DivOp)

Eltwise divide.

Eltwise divide operation.

Traits: AttrSizedOperandSegments

Interfaces: DestinationStyleOpInterface, TTNN_OpModelInterface

Operands:

| Operand | Description |
| ------- | ----------- |
| `inputs` | variadic of ranked tensor of any type values |
| `outputs` | variadic of ranked tensor of any type values |

Results:

| Result | Description |
| ------ | ----------- |
| `results` | variadic of ranked tensor of any type values |

ttnn.embedding (tt::ttnn::EmbeddingOp)

Embedding op.

Embedding operation.

Interfaces: DestinationStyleOpInterface, TTNN_OpModelInterface

Operands:

| Operand | Description |
| ------- | ----------- |
| `input` | ranked tensor of any type values |
| `output` | ranked tensor of any type values |
| `weight` | ranked tensor of any type values |

Results:

| Result | Description |
| ------ | ----------- |
| `result` | ranked tensor of any type values |

ttnn.empty (tt::ttnn::EmptyOp)

Empty op.

Tensor empty operation

Interfaces: NoMemoryEffect (MemoryEffectOpInterface), TTNN_OpModelInterface

Effects: MemoryEffects::Effect{}

Attributes:

| Attribute | MLIR Type | Description |
| --------- | --------- | ----------- |
| `shape` | ::mlir::tt::ttnn::ShapeAttr | TTNN Shape attribute |
| `dtype` | ::mlir::tt::DataTypeAttr | TT DataTypes. Enum cases: f32 (`Float32`), f16 (`Float16`), bf16 (`BFloat16`), bfp_f8 (`BFP_Float8`), bfp_bf8 (`BFP_BFloat8`), bfp_f4 (`BFP_Float4`), bfp_bf4 (`BFP_BFloat4`), bfp_f2 (`BFP_Float2`), bfp_bf2 (`BFP_BFloat2`), u32 (`UInt32`), u16 (`UInt16`), u8 (`UInt8`) |
| `layout` | ::mlir::tt::ttnn::LayoutAttr | TTNN Layout. Enum cases: row_major (`RowMajor`), tile (`Tile`), invalid (`Invalid`) |
| `memory_config` | ::mlir::tt::ttnn::MemoryConfigAttr | TTNN MemoryConfig attribute |

Operands:

| Operand | Description |
| ------- | ----------- |
| `device` | TT device |

Results:

| Result | Description |
| ------ | ----------- |
| `result` | ranked tensor of any type values |

ttnn.eq (tt::ttnn::EqualOp)

Eltwise equal to.

Eltwise equal to operation.

Traits: AttrSizedOperandSegments

Interfaces: DestinationStyleOpInterface, TTNN_OpModelInterface

Operands:

| Operand | Description |
| ------- | ----------- |
| `inputs` | variadic of ranked tensor of any type values |
| `outputs` | variadic of ranked tensor of any type values |

Results:

| Result | Description |
| ------ | ----------- |
| `results` | variadic of ranked tensor of any type values |

ttnn.exp (tt::ttnn::ExpOp)

Eltwise exponential.

Eltwise exponential operation.

Traits: AttrSizedOperandSegments

Interfaces: DestinationStyleOpInterface, TTNN_OpModelInterface

Operands:

| Operand | Description |
| ------- | ----------- |
| `inputs` | variadic of ranked tensor of any type values |
| `outputs` | variadic of ranked tensor of any type values |

Results:

| Result | Description |
| ------ | ----------- |
| `results` | variadic of ranked tensor of any type values |

ttnn.expm1 (tt::ttnn::Expm1Op)

Eltwise unary op.

Performs element-wise exponential minus one operation on operand tensor and stores the result in the output tensor.

Example: %a: [[0, 1], [0, 0]] "ttnn.expm1"(%a, %out) -> %out: [[0, 1.71828], [0, 0]]

Traits: AttrSizedOperandSegments

Interfaces: DestinationStyleOpInterface, TTNN_OpModelInterface

Operands:

| Operand | Description |
| ------- | ----------- |
| `inputs` | variadic of ranked tensor of any type values |
| `outputs` | variadic of ranked tensor of any type values |

Results:

| Result | Description |
| ------ | ----------- |
| `results` | variadic of ranked tensor of any type values |

ttnn.floor (tt::ttnn::FloorOp)

Eltwise floor op.

Eltwise floor operation.

Traits: AttrSizedOperandSegments

Interfaces: DestinationStyleOpInterface, TTNN_OpModelInterface

Operands:

| Operand | Description |
| ------- | ----------- |
| `inputs` | variadic of ranked tensor of any type values |
| `outputs` | variadic of ranked tensor of any type values |

Results:

| Result | Description |
| ------ | ----------- |
| `results` | variadic of ranked tensor of any type values |

ttnn.from_device (tt::ttnn::FromDeviceOp)

FromDevice op.

This op retrieves the input tensor from the given device.

Interfaces: TTNN_OpModelInterface

Operands:

| Operand | Description |
| ------- | ----------- |
| `input` | ranked tensor of any type values |

Results:

| Result | Description |
| ------ | ----------- |
| `result` | ranked tensor of any type values |

ttnn.full (tt::ttnn::FullOp)

Full op.

Tensor full operation
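
A hypothetical example in the style used elsewhere in this reference (fill value and shape chosen for demonstration):

Example: "ttnn.full"(%device) <{fillValue = 1.000000e+00 : f32}> -> %result: [[1, 1], [1, 1]]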

Interfaces: TTNN_OpModelInterface

Attributes:

| Attribute | MLIR Type | Description |
| --------- | --------- | ----------- |
| `fillValue` | ::mlir::FloatAttr | 32-bit float attribute |

Operands:

| Operand | Description |
| ------- | ----------- |
| `device` | TT device |

Results:

| Result | Description |
| ------ | ----------- |
| `result` | ranked tensor of any type values |

ttnn.gelu (tt::ttnn::GeluOp)

Eltwise GELU.

Eltwise GELU operation.

Traits: AttrSizedOperandSegments

Interfaces: DestinationStyleOpInterface, TTNN_OpModelInterface

Operands:

| Operand | Description |
| ------- | ----------- |
| `inputs` | variadic of ranked tensor of any type values |
| `outputs` | variadic of ranked tensor of any type values |

Results:

| Result | Description |
| ------ | ----------- |
| `results` | variadic of ranked tensor of any type values |

ttnn.get_device (tt::ttnn::GetDeviceOp)

Get Device op.

This op returns the current runtime device.

Interfaces: TTNN_OpModelInterface

Attributes:

| Attribute | MLIR Type | Description |
| --------- | --------- | ----------- |
| `mesh_shape` | ::mlir::tt::ttnn::MeshShapeAttr | TTNN Mesh Shape |

Results:

| Result | Description |
| ------ | ----------- |
| `device` | TT device |

ttnn.ge (tt::ttnn::GreaterEqualOp)

Eltwise greater than or equal to.

Eltwise greater than or equal to operation.

Traits: AttrSizedOperandSegments

Interfaces: DestinationStyleOpInterface, TTNN_OpModelInterface

Operands:

| Operand | Description |
| ------- | ----------- |
| `inputs` | variadic of ranked tensor of any type values |
| `outputs` | variadic of ranked tensor of any type values |

Results:

| Result | Description |
| ------ | ----------- |
| `results` | variadic of ranked tensor of any type values |

ttnn.gt (tt::ttnn::GreaterThanOp)

Eltwise greater than.

Eltwise greater than operation.

Traits: AttrSizedOperandSegments

Interfaces: DestinationStyleOpInterface, TTNN_OpModelInterface

Operands:

| Operand | Description |
| ------- | ----------- |
| `inputs` | variadic of ranked tensor of any type values |
| `outputs` | variadic of ranked tensor of any type values |

Results:

| Result | Description |
| ------ | ----------- |
| `results` | variadic of ranked tensor of any type values |

ttnn.isfinite (tt::ttnn::IsFiniteOp)

Eltwise isfinite op.

Eltwise isfinite operation.

Traits: AttrSizedOperandSegments

Interfaces: DestinationStyleOpInterface, TTNN_OpModelInterface

Operands:

| Operand | Description |
| ------- | ----------- |
| `inputs` | variadic of ranked tensor of any type values |
| `outputs` | variadic of ranked tensor of any type values |

Results:

| Result | Description |
| ------ | ----------- |
| `results` | variadic of ranked tensor of any type values |

ttnn.le (tt::ttnn::LessEqualOp)

Eltwise less than or equal to.

Eltwise less than or equal to operation.

Traits: AttrSizedOperandSegments

Interfaces: DestinationStyleOpInterface, TTNN_OpModelInterface

Operands:

| Operand | Description |
| ------- | ----------- |
| `inputs` | variadic of ranked tensor of any type values |
| `outputs` | variadic of ranked tensor of any type values |

Results:

| Result | Description |
| ------ | ----------- |
| `results` | variadic of ranked tensor of any type values |

ttnn.lt (tt::ttnn::LessThanOp)

Eltwise less than.

Eltwise less than operation.

Traits: AttrSizedOperandSegments

Interfaces: DestinationStyleOpInterface, TTNN_OpModelInterface

Operands:

| Operand | Description |
| ------- | ----------- |
| `inputs` | variadic of ranked tensor of any type values |
| `outputs` | variadic of ranked tensor of any type values |

Results:

| Result | Description |
| ------ | ----------- |
| `results` | variadic of ranked tensor of any type values |

ttnn.log1p (tt::ttnn::Log1pOp)

Eltwise log1p operation.

Performs element-wise logarithm plus one operation on operand tensor and puts the result in the output tensor.

Example: %a: [0.0, -0.999, 7.0, 6.38905621, 15.0] "ttnn.log1p"(%a, %out) -> %out: [0.0, -6.90776825, 2.07944155, 2.0, 2.77258873]

Traits: AttrSizedOperandSegments

Interfaces: DestinationStyleOpInterface, TTNN_OpModelInterface

Operands:

| Operand | Description |
| ------- | ----------- |
| `inputs` | variadic of ranked tensor of any type values |
| `outputs` | variadic of ranked tensor of any type values |

Results:

| Result | Description |
| ------ | ----------- |
| `results` | variadic of ranked tensor of any type values |

ttnn.log (tt::ttnn::LogOp)

Eltwise logarithm.

Eltwise logarithm operation.

Traits: AttrSizedOperandSegments

Interfaces: DestinationStyleOpInterface, TTNN_OpModelInterface

Operands:

| Operand | Description |
| ------- | ----------- |
| `inputs` | variadic of ranked tensor of any type values |
| `outputs` | variadic of ranked tensor of any type values |

Results:

| Result | Description |
| ------ | ----------- |
| `results` | variadic of ranked tensor of any type values |

ttnn.logical_and (tt::ttnn::LogicalAndOp)

Eltwise logical and.

Eltwise logical and operation.

Traits: AttrSizedOperandSegments

Interfaces: DestinationStyleOpInterface, TTNN_OpModelInterface

Operands:

| Operand | Description |
| ------- | ----------- |
| `inputs` | variadic of ranked tensor of any type values |
| `outputs` | variadic of ranked tensor of any type values |

Results:

| Result | Description |
| ------ | ----------- |
| `results` | variadic of ranked tensor of any type values |

ttnn.logical_not (tt::ttnn::LogicalNotOp)

Eltwise logical not op.

Eltwise logical not operation.

Traits: AttrSizedOperandSegments

Interfaces: DestinationStyleOpInterface, TTNN_OpModelInterface

Operands:

| Operand | Description |
| ------- | ----------- |
| `inputs` | variadic of ranked tensor of any type values |
| `outputs` | variadic of ranked tensor of any type values |

Results:

| Result | Description |
| ------ | ----------- |
| `results` | variadic of ranked tensor of any type values |

ttnn.logical_or (tt::ttnn::LogicalOrOp)

Eltwise logical or.

Eltwise logical or operation.

Traits: AttrSizedOperandSegments

Interfaces: DestinationStyleOpInterface, TTNN_OpModelInterface

Operands:

| Operand | Description |
| ------- | ----------- |
| `inputs` | variadic of ranked tensor of any type values |
| `outputs` | variadic of ranked tensor of any type values |

Results:

| Result | Description |
| ------ | ----------- |
| `results` | variadic of ranked tensor of any type values |

ttnn.logical_xor (tt::ttnn::LogicalXorOp)

Eltwise logical xor.

Eltwise logical xor operation.

Traits: AttrSizedOperandSegments

Interfaces: DestinationStyleOpInterface, TTNN_OpModelInterface

Operands:

| Operand | Description |
| ------- | ----------- |
| `inputs` | variadic of ranked tensor of any type values |
| `outputs` | variadic of ranked tensor of any type values |

Results:

| Result | Description |
| ------ | ----------- |
| `results` | variadic of ranked tensor of any type values |

ttnn.matmul (tt::ttnn::MatmulOp)
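
Matmul op.

This entry lacks a description in the generated reference; judging from its operands, it presumably computes the matrix product of a and b into the destination-style output tensor. As a hedged sketch (shapes illustrative):

Example: %a: tensor<2x3xf32> %b: tensor<3x4xf32> "ttnn.matmul"(%a, %b, %out) -> %out: tensor<2x4xf32>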

Interfaces: DestinationStyleOpInterface, TTNN_OpModelInterface

Operands:

| Operand | Description |
| ------- | ----------- |
| `a` | ranked tensor of any type values |
| `b` | ranked tensor of any type values |
| `output` | ranked tensor of any type values |

Results:

| Result | Description |
| ------ | ----------- |
| `result` | ranked tensor of any type values |

ttnn.max (tt::ttnn::MaxOp)

Max reduction op.

Max reduction op.

Interfaces: TTNN_OpModelInterface

Attributes:

| Attribute | MLIR Type | Description |
| --------- | --------- | ----------- |
| `keep_dim` | ::mlir::BoolAttr | bool attribute |
| `dim_arg` | ::mlir::ArrayAttr | 32-bit integer array attribute |

Operands:

| Operand | Description |
| ------- | ----------- |
| `input` | ranked tensor of any type values |

Results:

| Result | Description |
| ------ | ----------- |
| `result` | ranked tensor of any type values |

ttnn.max_pool2d (tt::ttnn::MaxPool2dOp)

Applies a 2D max pooling over an input signal composed of several input planes.

Applies a 2D max pooling over an input signal composed of several input planes.

Interfaces: DestinationStyleOpInterface, TTNN_OpModelInterface

Attributes:

| Attribute | MLIR Type | Description |
| --------- | --------- | ----------- |
| `batch_size` | ::mlir::IntegerAttr | 32-bit signed integer attribute |
| `input_height` | ::mlir::IntegerAttr | 32-bit signed integer attribute |
| `input_width` | ::mlir::IntegerAttr | 32-bit signed integer attribute |
| `channels` | ::mlir::IntegerAttr | 32-bit signed integer attribute |
| `kernel_height` | ::mlir::IntegerAttr | 32-bit signed integer attribute |
| `kernel_width` | ::mlir::IntegerAttr | 32-bit signed integer attribute |
| `stride_height` | ::mlir::IntegerAttr | 32-bit signed integer attribute |
| `stride_width` | ::mlir::IntegerAttr | 32-bit signed integer attribute |
| `dilation_height` | ::mlir::IntegerAttr | 32-bit signed integer attribute |
| `dilation_width` | ::mlir::IntegerAttr | 32-bit signed integer attribute |
| `ceil_mode` | ::mlir::BoolAttr | bool attribute |
| `padding_height` | ::mlir::IntegerAttr | 32-bit signed integer attribute |
| `padding_width` | ::mlir::IntegerAttr | 32-bit signed integer attribute |

Operands:

| Operand | Description |
| ------- | ----------- |
| `input` | ranked tensor of any type values |
| `output` | ranked tensor of any type values |
| `device` | TT device |

Results:

| Result | Description |
| ------ | ----------- |
| `result` | ranked tensor of any type values |

ttnn.maximum (tt::ttnn::MaximumOp)

Eltwise maximum OP.

Calculates maximum of input tensors' values element-wise and stores result in output tensor.

Example: %lhs: [[3, 2, 7], [1, 4, 4]] %rhs: [[1, 4, 2], [1, 2, 3]] "ttnn.maximum"(%lhs, %rhs, %out) -> %out: [[3, 4, 7], [1, 4, 4]]

Traits: AttrSizedOperandSegments

Interfaces: DestinationStyleOpInterface, TTNN_OpModelInterface

Operands:

| Operand | Description |
| ------- | ----------- |
| `inputs` | variadic of ranked tensor of any type values |
| `outputs` | variadic of ranked tensor of any type values |

Results:

| Result | Description |
| ------ | ----------- |
| `results` | variadic of ranked tensor of any type values |

ttnn.mean (tt::ttnn::MeanOp)

Mean reduction op.

Mean reduction op.

Interfaces: TTNN_OpModelInterface

Attributes:

| Attribute | MLIR Type | Description |
| --------- | --------- | ----------- |
| `keep_dim` | ::mlir::BoolAttr | bool attribute |
| `dim_arg` | ::mlir::ArrayAttr | 32-bit integer array attribute |

Operands:

| Operand | Description |
| ------- | ----------- |
| `input` | ranked tensor of any type values |

Results:

| Result | Description |
| ------ | ----------- |
| `result` | ranked tensor of any type values |

ttnn.minimum (tt::ttnn::MinimumOp)

Eltwise minimum OP.

Calculates minimum of input tensors' values element-wise and stores result in output tensor.

Example: %lhs: [[3, 2, 7], [1, 4, 4]] %rhs: [[1, 4, 2], [1, 2, 3]] "ttnn.minimum"(%lhs, %rhs, %out) -> %out: [[1, 2, 2], [1, 2, 3]]

Traits: AttrSizedOperandSegments

Interfaces: DestinationStyleOpInterface, TTNN_OpModelInterface

Operands:

| Operand | Description |
| ------- | ----------- |
| `inputs` | variadic of ranked tensor of any type values |
| `outputs` | variadic of ranked tensor of any type values |

Results:

| Result | Description |
| ------ | ----------- |
| `results` | variadic of ranked tensor of any type values |

ttnn.multiply (tt::ttnn::MultiplyOp)

Eltwise multiply.

Eltwise multiply operation.

Traits: AttrSizedOperandSegments

Interfaces: DestinationStyleOpInterface, TTNN_OpModelInterface

Operands:

| Operand | Description |
| ------- | ----------- |
| `inputs` | variadic of ranked tensor of any type values |
| `outputs` | variadic of ranked tensor of any type values |

Results:

| Result | Description |
| ------ | ----------- |
| `results` | variadic of ranked tensor of any type values |

ttnn.neg (tt::ttnn::NegOp)

Eltwise negate.

Eltwise negate operation.

Traits: AttrSizedOperandSegments

Interfaces: DestinationStyleOpInterface, TTNN_OpModelInterface

Operands:

| Operand | Description |
| ------- | ----------- |
| `inputs` | variadic of ranked tensor of any type values |
| `outputs` | variadic of ranked tensor of any type values |

Results:

| Result | Description |
| ------ | ----------- |
| `results` | variadic of ranked tensor of any type values |

ttnn.ne (tt::ttnn::NotEqualOp)

Eltwise not equal to.

Eltwise not equal to operation.

Traits: AttrSizedOperandSegments

Interfaces: DestinationStyleOpInterface, TTNN_OpModelInterface

Operands:

| Operand | Description |
| ------- | ----------- |
| `inputs` | variadic of ranked tensor of any type values |
| `outputs` | variadic of ranked tensor of any type values |

Results:

| Result | Description |
| ------ | ----------- |
| `results` | variadic of ranked tensor of any type values |

ttnn.reciprocal (tt::ttnn::ReciprocalOp)

Eltwise reciprocal.

Eltwise reciprocal operation.

Traits: AttrSizedOperandSegments

Interfaces: DestinationStyleOpInterface, TTNN_OpModelInterface

Operands:

| Operand | Description |
| ------- | ----------- |
| `inputs` | variadic of ranked tensor of any type values |
| `outputs` | variadic of ranked tensor of any type values |

Results:

| Result | Description |
| ------ | ----------- |
| `results` | variadic of ranked tensor of any type values |

ttnn.reduce_scatter (tt::ttnn::ReduceScatterOp)

Reduce scatter op.

Tensor Reduce Scatter operation

Interfaces: TTNN_OpModelInterface

Attributes:

| Attribute | MLIR Type | Description |
| --------- | --------- | ----------- |
| `scatter_split_dim` | ::mlir::IntegerAttr | 32-bit signed integer attribute |
| `math_op` | ::mlir::IntegerAttr | TTNN Reduce Operation Type. Enum cases: sum (`Sum`), mean (`Mean`), max (`Max`), min (`Min`), std (`Std`), var (`Var`) |
| `num_links` | ::mlir::IntegerAttr | 32-bit signed integer attribute |

Operands:

| Operand | Description |
| ------- | ----------- |
| `input` | ranked tensor of any type values |

Results:

| Result | Description |
| ------ | ----------- |
| `result` | ranked tensor of any type values |

ttnn.relu (tt::ttnn::ReluOp)

Eltwise ReLU.

Eltwise ReLU operation.

Traits: AttrSizedOperandSegments

Interfaces: DestinationStyleOpInterface, OpModel, TTNN_OpModelInterface

Operands:

| Operand | Description |
| ------- | ----------- |
| `inputs` | variadic of ranked tensor of any type values |
| `outputs` | variadic of ranked tensor of any type values |

Results:

| Result | Description |
| ------ | ----------- |
| `results` | variadic of ranked tensor of any type values |

ttnn.remainder (tt::ttnn::RemainderOp)

Eltwise remainder.

Performs element-wise remainder of dividend lhs and divisor rhs tensors and produces a result tensor.

Example:

// %lhs: [17, -17, 17, -17]
// %rhs: [3, 3, -3, -3]
%result = "ttnn.remainder"(%lhs, %rhs) : (tensor<4xi64>, tensor<4xi64>) -> tensor<4xi64>
// %result: [2, -2, 2, -2]

Traits: AttrSizedOperandSegments

Interfaces: DestinationStyleOpInterface, TTNN_OpModelInterface

Operands:

| Operand | Description |
| ------- | ----------- |
| `inputs` | variadic of ranked tensor of any type values |
| `outputs` | variadic of ranked tensor of any type values |

Results:

| Result | Description |
| ------ | ----------- |
| `results` | variadic of ranked tensor of any type values |

ttnn.reshape (tt::ttnn::ReshapeOp)

Reshape op.

Reshape tensor.
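
A hypothetical example (shapes illustrative; the shape attribute follows the 32-bit integer array form listed below):

Example: %input: tensor<2x4xf32> "ttnn.reshape"(%input) <{shape = [8 : i32]}> -> %result: tensor<8xf32>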

Interfaces: TTNN_OpModelInterface

Attributes:

| Attribute | MLIR Type | Description |
| --------- | --------- | ----------- |
| `shape` | ::mlir::ArrayAttr | 32-bit integer array attribute |

Operands:

| Operand | Description |
| ------- | ----------- |
| `input` | ranked tensor of any type values |

Results:

| Result | Description |
| ------ | ----------- |
| `result` | ranked tensor of any type values |

ttnn.rsqrt (tt::ttnn::RsqrtOp)

Eltwise rsqrt.

Eltwise rsqrt operation.

Traits: AttrSizedOperandSegments

Interfaces: DestinationStyleOpInterface, TTNN_OpModelInterface

Operands:

| Operand | Description |
| ------- | ----------- |
| `inputs` | variadic of ranked tensor of any type values |
| `outputs` | variadic of ranked tensor of any type values |

Results:

| Result | Description |
| ------ | ----------- |
| `results` | variadic of ranked tensor of any type values |

ttnn.sigmoid (tt::ttnn::SigmoidOp)

Eltwise sigmoid.

Eltwise sigmoid operation.

Traits: AttrSizedOperandSegments

Interfaces: DestinationStyleOpInterface, TTNN_OpModelInterface

Operands:

| Operand | Description |
| ------- | ----------- |
| `inputs` | variadic of ranked tensor of any type values |
| `outputs` | variadic of ranked tensor of any type values |

Results:

| Result | Description |
| ------ | ----------- |
| `results` | variadic of ranked tensor of any type values |

ttnn.sign (tt::ttnn::SignOp)

Eltwise sign operation.

Returns the sign of the operand element-wise and produces a result tensor.

Example: %a: [[3, -2, 0], [1, -4, 4]] "ttnn.sign"(%a, %out) -> %out: [[1, -1, 0], [1, -1, 1]]

Traits: AttrSizedOperandSegments

Interfaces: DestinationStyleOpInterface, TTNN_OpModelInterface

Operands:

| Operand | Description |
| ------- | ----------- |
| `inputs` | variadic of ranked tensor of any type values |
| `outputs` | variadic of ranked tensor of any type values |

Results:

| Result | Description |
| ------ | ----------- |
| `results` | variadic of ranked tensor of any type values |

ttnn.sin (tt::ttnn::SinOp)

Eltwise sine.

Eltwise sine operation.

Traits: AttrSizedOperandSegments

Interfaces: DestinationStyleOpInterface, TTNN_OpModelInterface

Operands:

| Operand | Description |
| ------- | ----------- |
| `inputs` | variadic of ranked tensor of any type values |
| `outputs` | variadic of ranked tensor of any type values |

Results:

| Result | Description |
| ------ | ----------- |
| `results` | variadic of ranked tensor of any type values |

ttnn.slice (tt::ttnn::SliceOp)

Slice op.

Extract a portion of a tensor based on the specified start (begins), stop (ends), and step indices for each dimension.
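
A hypothetical 1-D illustration (assuming ends is exclusive, in line with typical slice semantics):

Example: %input: [0, 1, 2, 3, 4, 5, 6, 7] "ttnn.slice"(%input, %out) <{begins = [2 : i32], ends = [6 : i32], step = [2 : i32]}> -> %out: [2, 4]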

Interfaces: DestinationStyleOpInterface, TTNN_OpModelInterface

Attributes:

| Attribute | MLIR Type | Description |
| --------- | --------- | ----------- |
| `begins` | ::mlir::ArrayAttr | 32-bit integer array attribute |
| `ends` | ::mlir::ArrayAttr | 32-bit integer array attribute |
| `step` | ::mlir::ArrayAttr | 32-bit integer array attribute |

Operands:

| Operand | Description |
| ------- | ----------- |
| `input` | ranked tensor of any type values |
| `output` | ranked tensor of any type values |

Results:

| Result | Description |
| ------ | ----------- |
| `result` | ranked tensor of any type values |

ttnn.softmax (tt::ttnn::SoftmaxOp)

Softmax op.

Softmax operation.
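
A hypothetical example (assuming the dimension attribute selects the axis over which the softmax is normalized):

Example: %input: [[1.0, 1.0, 1.0, 1.0]] "ttnn.softmax"(%input) <{dimension = 1 : si32}> -> %result: [[0.25, 0.25, 0.25, 0.25]]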

Interfaces: TTNN_OpModelInterface

Attributes:

| Attribute | MLIR Type | Description |
| --------- | --------- | ----------- |
| `dimension` | ::mlir::IntegerAttr | 32-bit signed integer attribute |

Operands:

| Operand | Description |
| ------- | ----------- |
| `input` | ranked tensor of any type values |

Results:

| Result | Description |
| ------ | ----------- |
| `result` | ranked tensor of any type values |

ttnn.sqrt (tt::ttnn::SqrtOp)

Eltwise sqrt.

Eltwise sqrt operation.

Traits: AttrSizedOperandSegments

Interfaces: DestinationStyleOpInterface, TTNN_OpModelInterface

Operands:

| Operand | Description |
| ------- | ----------- |
| `inputs` | variadic of ranked tensor of any type values |
| `outputs` | variadic of ranked tensor of any type values |

Results:

| Result | Description |
| ------ | ----------- |
| `results` | variadic of ranked tensor of any type values |

ttnn.subtract (tt::ttnn::SubtractOp)

Eltwise subtract.

Eltwise subtract operation.

Traits: AttrSizedOperandSegments

Interfaces: DestinationStyleOpInterface, TTNN_OpModelInterface

Operands:

| Operand | Description |
| ------- | ----------- |
| `inputs` | variadic of ranked tensor of any type values |
| `outputs` | variadic of ranked tensor of any type values |

Results:

| Result | Description |
| ------ | ----------- |
| `results` | variadic of ranked tensor of any type values |

ttnn.sum (tt::ttnn::SumOp)

Sum reduction op.

Sum reduction op.
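
A hypothetical example (assuming dim_arg lists the dimensions to reduce and keep_dim = false drops the reduced dimension):

Example: %input: [[1, 2], [3, 4]] "ttnn.sum"(%input) <{dim_arg = [1 : i32], keep_dim = false}> -> %result: [3, 7]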

Interfaces: TTNN_OpModelInterface

Attributes:

| Attribute | MLIR Type | Description |
| --------- | --------- | ----------- |
| `keep_dim` | ::mlir::BoolAttr | bool attribute |
| `dim_arg` | ::mlir::ArrayAttr | 32-bit integer array attribute |

Operands:

| Operand | Description |
| ------- | ----------- |
| `input` | ranked tensor of any type values |

Results:

| Result | Description |
| ------ | ----------- |
| `result` | ranked tensor of any type values |

ttnn.to_device (tt::ttnn::ToDeviceOp)

ToDevice op.

This op sends the input tensor to the given device with the given memory config.

Interfaces: TTNN_OpModelInterface

Attributes:

| Attribute | MLIR Type | Description |
| --------- | --------- | ----------- |
| `memory_config` | ::mlir::tt::ttnn::MemoryConfigAttr | TTNN MemoryConfig attribute |

Operands:

| Operand | Description |
| ------- | ----------- |
| `input` | ranked tensor of any type values |
| `device` | TT device |

Results:

| Result | Description |
| ------ | ----------- |
| `result` | ranked tensor of any type values |

ttnn.to_layout (tt::ttnn::ToLayoutOp)

ToLayout op.

This op wraps all layout information gathered from ttir.toLayout. It is used/updated by the optimizer to perform optimizations, and later broken down into specific memory/layout operations (toDevice, toMemoryConfig etc.). Currently in the TTNN backend, we use this op solely for tilize/untilize, therefore marking all other attrs as optional. Once ttnn::to_layout supports other attrs, we can remove the optional tag.
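
As a hedged sketch only (the attribute assembly syntax shown here is illustrative, not normative):

Example: "ttnn.to_layout"(%input, %device) <{layout = tile}> -> %result, converting a row_major tensor to the tile layout.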

Interfaces: TTNN_OpModelInterface

Attributes:

| Attribute | MLIR Type | Description |
| --------- | --------- | ----------- |
| `layout` | ::mlir::tt::ttnn::LayoutAttr | TTNN Layout. Enum cases: row_major (`RowMajor`), tile (`Tile`), invalid (`Invalid`) |
| `dtype` | ::mlir::tt::DataTypeAttr | TT DataTypes. Enum cases: f32 (`Float32`), f16 (`Float16`), bf16 (`BFloat16`), bfp_f8 (`BFP_Float8`), bfp_bf8 (`BFP_BFloat8`), bfp_f4 (`BFP_Float4`), bfp_bf4 (`BFP_BFloat4`), bfp_f2 (`BFP_Float2`), bfp_bf2 (`BFP_BFloat2`), u32 (`UInt32`), u16 (`UInt16`), u8 (`UInt8`) |
| `memory_config` | ::mlir::tt::ttnn::MemoryConfigAttr | TTNN MemoryConfig attribute |

Operands:

| Operand | Description |
| ------- | ----------- |
| `input` | ranked tensor of any type values |
| `device` | TT device |

Results:

| Result | Description |
| ------ | ----------- |
| `result` | ranked tensor of any type values |

ttnn.to_memory_config (tt::ttnn::ToMemoryConfigOp)

ToMemoryConfig op.

This op converts the memory config of the input tensor based on the given memory config. It handles:

  • DRAM to L1
  • L1 to DRAM
  • Interleaved to sharded
  • Sharded to interleaved
  • Sharded to sharded (reshard)

Interfaces: TTNN_OpModelInterface

Attributes:

| Attribute | MLIR Type | Description |
| --------- | --------- | ----------- |
| `memory_config` | ::mlir::tt::ttnn::MemoryConfigAttr | TTNN MemoryConfig attribute |

Operands:

| Operand | Description |
| ------- | ----------- |
| `input` | ranked tensor of any type values |

Results:

| Result | Description |
| ------ | ----------- |
| `result` | ranked tensor of any type values |

ttnn.transpose (tt::ttnn::TransposeOp)

Transpose op.

Transpose tensor along two given dimensions.
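
A hypothetical example (values chosen for demonstration):

Example: %input: [[1, 2, 3], [4, 5, 6]] "ttnn.transpose"(%input) <{dim0 = 0 : si32, dim1 = 1 : si32}> -> %result: [[1, 4], [2, 5], [3, 6]]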

Interfaces: TTNN_OpModelInterface

Attributes:

| Attribute | MLIR Type | Description |
| --------- | --------- | ----------- |
| `dim0` | ::mlir::IntegerAttr | 32-bit signed integer attribute |
| `dim1` | ::mlir::IntegerAttr | 32-bit signed integer attribute |

Operands:

| Operand | Description |
| ------- | ----------- |
| `input` | ranked tensor of any type values |

Results:

| Result | Description |
| ------ | ----------- |
| `result` | ranked tensor of any type values |

ttnn.typecast (tt::ttnn::TypecastOp)

Typecast op.

This op converts the data type of the input tensor to the data type given by the dtype attribute, as sketched below.
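
As a hedged sketch (the dtype spelling is illustrative; see the enum cases in the attribute table below):

Example: "ttnn.typecast"(%input) <{dtype = bf16}> -> %result, casting a Float32 tensor to BFloat16.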

Interfaces: TTNN_OpModelInterface

Attributes:

| Attribute | MLIR Type | Description |
| --------- | --------- | ----------- |
| `dtype` | ::mlir::tt::DataTypeAttr | TT DataTypes. Enum cases: f32 (`Float32`), f16 (`Float16`), bf16 (`BFloat16`), bfp_f8 (`BFP_Float8`), bfp_bf8 (`BFP_BFloat8`), bfp_f4 (`BFP_Float4`), bfp_bf4 (`BFP_BFloat4`), bfp_f2 (`BFP_Float2`), bfp_bf2 (`BFP_BFloat2`), u32 (`UInt32`), u16 (`UInt16`), u8 (`UInt8`) |

Operands:

| Operand | Description |
| ------- | ----------- |
| `input` | ranked tensor of any type values |

Results:

| Result | Description |
| ------ | ----------- |
| `result` | ranked tensor of any type values |

ttnn.where (tt::ttnn::WhereOp)

Eltwise where.

Eltwise where operation.

Traits: AttrSizedOperandSegments

Interfaces: DestinationStyleOpInterface, TTNN_OpModelInterface

Operands:

| Operand | Description |
| ------- | ----------- |
| `inputs` | variadic of ranked tensor of any type values |
| `outputs` | variadic of ranked tensor of any type values |

Results:

| Result | Description |
| ------ | ----------- |
| `results` | variadic of ranked tensor of any type values |